Major Index Update: change default index to base_uae_mem #4241

Merged: 14 commits, May 17, 2024

@shifucun shifucun commented May 15, 2024

This is a major release of the NL feature that replaces the all-mini-lm-l6 model with uae-large-v1, which is hosted on Vertex AI.

It also makes fundamental improvements to the stat var descriptions: each stat var now has a single accurate description, which gets rid of the need for alternatives.

It also adds some small debug info UI improvements.

@shifucun shifucun requested a review from pradh May 15, 2024 03:59
"Count_HousingUnit_HomeValue150000To174999USDollar",
"Count_HousingUnit_HomeValue2000000OrMoreUSDollar",
"Count_HousingUnit_HomeValueUpto10000USDollar"
"dc/topic/HousesByOwnershipCost"
Contributor

diff is OK, but just need to check why the topic order flipped compared to dev.

Contributor Author

From debug page, the variables look very different from the golden here.

Query "What is the relationship between housing size and home prices in California".

Should the chart_config.json be exactly the same as what shows up on the page?
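One way to answer this is to diff the stat var keys in the golden against those captured from the debug page. The following is a minimal sketch, assuming the config uses the `tiles` / `statVarKey` layout visible in the hunks of this PR; `stat_var_keys` and `compare_keys` are hypothetical helpers, and the inline JSON is sample data rather than real golden content.

```python
import json


def stat_var_keys(chart_config):
    """Collect statVarKey entries, in order, from every dict that has a 'tiles' list."""
    keys = []

    def walk(node):
        if isinstance(node, dict):
            for tile in node.get("tiles", []):
                if isinstance(tile, dict):
                    keys.extend(tile.get("statVarKey", []))
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(chart_config)
    return keys


def compare_keys(golden, actual):
    """Report whether two configs agree on stat vars, with and without ordering."""
    g, a = stat_var_keys(golden), stat_var_keys(actual)
    return {
        "same_order": g == a,
        "same_set": set(g) == set(a),
        "only_in_golden": sorted(set(g) - set(a)),
        "only_in_actual": sorted(set(a) - set(g)),
    }


# Sample data only (assumption): in practice these would be loaded from the
# golden file and from whatever the page actually rendered.
golden = json.loads(
    '{"tiles": [{"statVarKey": ["Median_Income_Household_HouseholderRaceAsianAlone"]}]}'
)
actual = json.loads(
    '{"tiles": [{"statVarKey": ["Count_Person_15OrMoreYears_WithIncome10"]}]}'
)
report = compare_keys(golden, actual)
```

Separating `same_order` from `same_set` makes it easier to tell a pure ordering flip apart from a real change in the matched variables.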

@shifucun shifucun requested a review from pradh May 16, 2024 07:41
"geoId/06077"
],
"statVarKey": [
"Count_Person_15OrMoreYears_WithIncome10"
Contributor

This feels like a loss. Currently we have "Household Median Income", which is reasonable.

Contributor Author

This is fixed now

"geoId/06013",
"geoId/06061"
],
"description": "Non-Institutionalized Civilian Adults With No Health Insurance, Earnings 4,999 USD or Less in Los Angeles County",
Contributor

Here too, household median income on LHS seems more reasonable than this one?

Contributor Author

The query string is "income 50000". With the new stat var descriptions, both $50000 and $5000 now appear in the descriptions.

If you change the query to "Counties in California where income is over 60000", the result is different. So I think the issue here is how to handle the numbers in the query?
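One possible direction, sketched here purely as an illustration, is to pull the numbers out of the query and check them against the numeric range named in a candidate description before trusting the embedding match. `extract_numbers` and `description_allows` are hypothetical helpers, not part of the existing NL code, and the phrase rules ("or Less", "or More", two numbers meaning a range) are assumptions about how the descriptions are worded.

```python
import re

# Matches digit runs with optional thousands separators, e.g. "50000" or "4,999".
_NUMBER = re.compile(r"\d[\d,]*")


def extract_numbers(text):
    """Return the integers found in text, ignoring thousands separators."""
    return [int(m.replace(",", "")) for m in _NUMBER.findall(text)]


def description_allows(description, value):
    """Check whether value is consistent with the numeric range in description.

    Assumed wording rules (not taken from the real descriptions pipeline):
    "N or Less" means value <= N, "N or More" means value >= N, and two
    numbers mean an inclusive range.
    """
    numbers = extract_numbers(description)
    if not numbers:
        return True  # no numeric constraint to violate
    lowered = description.lower()
    if "or less" in lowered:
        return value <= numbers[0]
    if "or more" in lowered:
        return value >= numbers[0]
    if len(numbers) >= 2:
        return numbers[0] <= value <= numbers[1]
    return value == numbers[0]


description = (
    "Non-Institutionalized Civilian Adults With No Health Insurance, "
    "Earnings 4,999 USD or Less in Los Angeles County"
)
query_numbers = extract_numbers("income 50000")
allowed = description_allows(description, query_numbers[0])  # False
```

A check like this would reject the "4,999 USD or Less" description for the query "income 50000", although deciding what to fall back to is a separate question.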

@pradh pradh left a comment

LGTM - pending the 4 diff Qs.

server/integration_tests/explore_test.py
@@ -10,27 +10,28 @@
"tiles": [
{
"statVarKey": [
"Median_Income_Household_HouseholderRaceAsianAlone"
Contributor

I forget what we said about this, LHS seems to have income and race, but RHS only race?

Contributor Author

The query string is "asian population income". There is no exact match for this one, but there are:

  • asian population
  • income of asian householder household
  • earnings of household with asian born people

Looks like the model prefers to match the superset first, and only then match the deviating constraints.
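To inspect this kind of ranking offline, the candidates can be scored against the query and sorted. The sketch below is only a stand-in: `toy_embed` is a bag-of-words counter, not uae-large-v1, so its scores say nothing about what the real model does. It is meant to show the shape of the comparison, and a real embedding function could be passed in through the hypothetical `embed` parameter as long as it returns a token-to-weight mapping.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in embedding: lowercase token counts (NOT the real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse token-to-weight mappings."""
    dot = sum(weight * b.get(token, 0) for token, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, candidates, embed=toy_embed):
    """Return (candidate, score) pairs sorted from best to worst match."""
    q = embed(query)
    scored = [(c, cosine(q, embed(c))) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


candidates = [
    "asian population",
    "income of asian householder household",
    "earnings of household with asian born people",
]
ranking = rank("asian population income", candidates)
for description, score in ranking:
    print(f"{score:.3f}  {description}")
```

Running the same loop with the production embeddings would show directly how close the three candidates are to each other for this query.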

@@ -10,27 +10,28 @@
"tiles": [
{
"statVarKey": [
"Median_Income_Household_HouseholderRaceAsianAlone"
Contributor

Same as the other "tell me asian california population with low income" issue

@pradh pradh left a comment

Super excited to get this in!!!

@shifucun shifucun changed the title update default index to base_uae_mem Major Index Update: change default index to base_uae_mem May 16, 2024
@shifucun
Contributor Author

Thanks for the review, debugging, and all the help! Going to merge now.

@shifucun shifucun enabled auto-merge (squash) May 16, 2024 21:28
@@ -188,10 +189,7 @@ function update_integration_test_golden {
export LLM_API_KEY=
export ENABLE_EVAL_TOOL=true

export ENV_PREFIX=Autopush
python3 -m pytest -vv server/integration_tests/topic_cache
Contributor

Don't remove the test?

Contributor Author

This is already covered in the test below?

Contributor

oops, yes

@shifucun shifucun merged commit c993fdb into datacommonsorg:master May 17, 2024
9 checks passed