Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Precompute or optimize queries against firefox_desktop.pageload #4957

Closed
data-sync-user opened this issue Feb 5, 2024 · 10 comments
Closed

Comments

@data-sync-user
Copy link
Collaborator

data-sync-user commented Feb 5, 2024

The performance team makes heavy use of the firefox_desktop.pageload table to analyze data from the pageload event.

Some examples:

Colab query on experiment results.

Redash query to compare experiment results:

Redash query intended to be used on dashboards, but hits data limits

However as we are making queries against a huge dataset the performance can be poor and the usage costs are likely quite high. Furthermore, we run into data query limits via redash.

Creating this ticket as [~accountid:70121:7c899675-2b52-4a02-b363-378de64acfe3] suggested that modelling to pre-compute results could be helpful here.

Discussion in #data-help: https://mozilla.slack.com/archives/C4D5ZA91B/p1706214488829459

┆Issue is synchronized with this Jira Story

@data-sync-user
Copy link
Collaborator Author

➤ Andrew Creskey commented:

cc: Denis Palmeiro who also makes use of queries against firefox_desktop.pageload.

@data-sync-user
Copy link
Collaborator Author

➤ Shell Escalante commented:

we don’t have a focus area. George Kaberere should we tag this in any way before moving to DENG? there wasn’t a data modelling area.

@data-sync-user
Copy link
Collaborator Author

➤ Denis Palmeiro commented:

After using this table quite extensively recently, I think what would help us substantially is to have at least these subsets in separate tables to make lookups faster:

firefoxdesktop.pageloadnightly (nightly only)
firefox.desktop.pageload_1pct (1% of firefox_desktop.pageload)
firefoxdesktop.pageloadexperiments (pings that have a non-null value for “ping_info.experiments”)

Andrew Creskey or Bas Schouten can maybe think of some other useful subsets. I don’t use the beta population often, but maybe that could also be useful to have.

@data-sync-user
Copy link
Collaborator Author

➤ Winnie Chan commented:

Denis Palmeiro Andrew Creskey
I have started a pull request to create the three tables: https://github.com/mozilla/bigquery-etl/pull/5359/files ( https://github.com/mozilla/bigquery-etl/pull/5359/files|smart-link )

There are some questions I hope you can help answer:

  • Can you take a look at the query.sql files to ensure it captures what you need?
  • how far back do these tables need data for?
  • most 1pct tables are created by filtering sample_id. However, majority of these pings do not have client_id and therefore no sample_id (that is generated by client_id). As a result, I used document_id instead. It is unclear to me if this is appropriate so it would be ideal if you can help verify this.

cc George Kaberere

@data-sync-user
Copy link
Collaborator Author

➤ Denis Palmeiro commented:

  1. The query.sql files look good to me.
  2. Preferably a year.
  3. I’m not familiar with the document_id but from https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/obsolete/fhr/identifiers.html ( https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/obsolete/fhr/identifiers.html|smart-link ) , it sounds like it’s pretty much random so I think generating the 1pct tables from document_id should be fine.

Thanks!

@data-sync-user
Copy link
Collaborator Author

➤ Winnie Chan commented:

Denis Palmeiro

I have created the three views with data for April 2024 only at the moment. Could you take a look to make sure it fits your needs before I backfill the tables for more data (as you mentioned perhaps from a year ago around 2023-04-01).

  • moz-fx-data-shared-prod.firefox_desktop.pageload_nightly
  • moz-fx-data-shared-prod.firefox_desktop.pageload_1pct
  • moz-fx-data-shared-prod.firefox_desktop.pageload_experiments

Note that the experiments table is still big in size (currently at 35TB with 1 month of data. With another 11 months of data it may not be that much smaller than the original table of 200TB in size). Let me know what you think.

Thanks.

@data-sync-user
Copy link
Collaborator Author

➤ Denis Palmeiro commented:

Winnie Chan Thanks, those tables look great. Since the experiments table is still so large, let’s just get rid of it and do the other 2 instead. Thanks!

@data-sync-user
Copy link
Collaborator Author

➤ Winnie Chan commented:

Denis Palmeiro I have backfilled the following tables from 2023-05-01. Let me know if you need more data.

  • moz-fx-data-shared-prod.firefox_desktop.pageload_nightly
  • moz-fx-data-shared-prod.firefox_desktop.pageload_1pct

I can go ahead and delete moz-fx-data-shared-prod.firefox_desktop.pageload_experiments.

However, I wonder if the two tables above would be sufficient for your use cases in querying for experiments? The ticket included some sample redash queries (96384 ( https://sql.telemetry.mozilla.org/queries/96384 ), 92832 ( https://sql.telemetry.mozilla.org/queries/92832 )) that may look at experiments in other channels?

@data-sync-user
Copy link
Collaborator Author

➤ Denis Palmeiro commented:

Thanks Winnie Chan!

The experiment subset is the biggest use case for us, and we are mostly just interested in performance experiments but unfortunately there is no good way to just isolate those. However, the nightly and 1% should help us when we're doing quick lookups of data.

@data-sync-user
Copy link
Collaborator Author

➤ Winnie Chan commented:

Thanks Denis Palmeiro .

In that case I will close this ticket and delete the experiments table moz-fx-data-shared-prod.firefox_desktop.pageload_experiments.

You can start using the new tables where applicable, particularly in any scheduled redash queries or dashboards. I will continue to monitor usage/costs of the the pageload tables for the next little while and see if there are more needs to optimize.

Feel free to reach out if you have any questions!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant