Precompute or optimize queries against firefox_desktop.pageload #4957

data-sync-user · 2024-02-05T16:04:56Z

The performance team makes heavy use of the firefox_desktop.pageload table to analyze data from the pageload event.

Some examples:

Colab query on experiment results.

Redash query to compare experiment results:

Redash query intended to be used on dashboards, but hits data limits

However as we are making queries against a huge dataset the performance can be poor and the usage costs are likely quite high. Furthermore, we run into data query limits via redash.

Creating this ticket as [~accountid:70121:7c899675-2b52-4a02-b363-378de64acfe3] suggested that modelling to pre-compute results could be helpful here.

Discussion in #data-help: https://mozilla.slack.com/archives/C4D5ZA91B/p1706214488829459

┆Issue is synchronized with this Jira Story

The text was updated successfully, but these errors were encountered:

data-sync-user · 2024-02-05T16:05:00Z

➤ Andrew Creskey commented:

cc: Denis Palmeiro who also makes use of queries against firefox_desktop.pageload.

data-sync-user · 2024-02-05T16:05:02Z

➤ Shell Escalante commented:

we don’t have a focus area. George Kaberere should we tag this in any way before moving to DENG? there wasn’t a data modelling area.

data-sync-user · 2024-02-05T16:05:04Z

➤ Denis Palmeiro commented:

After using this table quite extensively recently, I think what would help us substantially is to have at least these subsets in separate tables to make lookups faster:

firefoxdesktop.pageloadnightly (nightly only)
firefox.desktop.pageload_1pct (1% of firefox_desktop.pageload)
firefoxdesktop.pageloadexperiments (pings that have a non-null value for “ping_info.experiments”)

Andrew Creskey or Bas Schouten can maybe think of some other useful subsets. I don’t use the beta population often, but maybe that could also be useful to have.

data-sync-user · 2024-04-18T22:39:13Z

➤ Winnie Chan commented:

Denis Palmeiro Andrew Creskey
I have started a pull request to create the three tables: https://github.com/mozilla/bigquery-etl/pull/5359/files ( https://github.com/mozilla/bigquery-etl/pull/5359/files|smart-link )

There are some questions I hope you can help answer:

Can you take a look at the query.sql files to ensure it captures what you need?
how far back do these tables need data for?
most 1pct tables are created by filtering sample_id. However, majority of these pings do not have client_id and therefore no sample_id (that is generated by client_id). As a result, I used document_id instead. It is unclear to me if this is appropriate so it would be ideal if you can help verify this.

cc George Kaberere

data-sync-user · 2024-04-19T15:45:47Z

➤ Denis Palmeiro commented:

The query.sql files look good to me.
Preferably a year.
I’m not familiar with the document_id but from https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/obsolete/fhr/identifiers.html ( https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/obsolete/fhr/identifiers.html|smart-link ) , it sounds like it’s pretty much random so I think generating the 1pct tables from document_id should be fine.

Thanks!

data-sync-user · 2024-04-25T23:27:51Z

➤ Winnie Chan commented:

Denis Palmeiro

I have created the three views with data for April 2024 only at the moment. Could you take a look to make sure it fits your needs before I backfill the tables for more data (as you mentioned perhaps from a year ago around 2023-04-01).

moz-fx-data-shared-prod.firefox_desktop.pageload_nightly
moz-fx-data-shared-prod.firefox_desktop.pageload_1pct
moz-fx-data-shared-prod.firefox_desktop.pageload_experiments

Note that the experiments table is still big in size (currently at 35TB with 1 month of data. With another 11 months of data it may not be that much smaller than the original table of 200TB in size). Let me know what you think.

Thanks.

data-sync-user · 2024-05-02T01:10:38Z

➤ Denis Palmeiro commented:

Winnie Chan Thanks, those tables look great. Since the experiments table is still so large, let’s just get rid of it and do the other 2 instead. Thanks!

data-sync-user · 2024-05-02T21:17:36Z

➤ Winnie Chan commented:

Denis Palmeiro I have backfilled the following tables from 2023-05-01. Let me know if you need more data.

moz-fx-data-shared-prod.firefox_desktop.pageload_nightly
moz-fx-data-shared-prod.firefox_desktop.pageload_1pct

I can go ahead and delete moz-fx-data-shared-prod.firefox_desktop.pageload_experiments.

However, I wonder if the two tables above would be sufficient for your use cases in querying for experiments? The ticket included some sample redash queries (96384 ( https://sql.telemetry.mozilla.org/queries/96384 ), 92832 ( https://sql.telemetry.mozilla.org/queries/92832 )) that may look at experiments in other channels?

data-sync-user · 2024-05-13T14:53:56Z

➤ Denis Palmeiro commented:

Thanks Winnie Chan!

The experiment subset is the biggest use case for us, and we are mostly just interested in performance experiments but unfortunately there is no good way to just isolate those. However, the nightly and 1% should help us when we're doing quick lookups of data.

data-sync-user · 2024-05-22T15:21:37Z

➤ Winnie Chan commented:

Thanks Denis Palmeiro .

In that case I will close this ticket and delete the experiments table moz-fx-data-shared-prod.firefox_desktop.pageload_experiments.

You can start using the new tables where applicable, particularly in any scheduled redash queries or dashboards. I will continue to monitor usage/costs of the the pageload tables for the next little while and see if there are more needs to optimize.

Feel free to reach out if you have any questions!

data-sync-user closed this as completed May 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Precompute or optimize queries against firefox_desktop.pageload #4957

Precompute or optimize queries against firefox_desktop.pageload #4957

data-sync-user commented Feb 5, 2024 •

edited

data-sync-user commented Feb 5, 2024

data-sync-user commented Feb 5, 2024

data-sync-user commented Feb 5, 2024

data-sync-user commented Apr 18, 2024

data-sync-user commented Apr 19, 2024

data-sync-user commented Apr 25, 2024

data-sync-user commented May 2, 2024

data-sync-user commented May 2, 2024

data-sync-user commented May 13, 2024

data-sync-user commented May 22, 2024

Precompute or optimize queries against firefox_desktop.pageload #4957

Precompute or optimize queries against firefox_desktop.pageload #4957

Comments

data-sync-user commented Feb 5, 2024 • edited

data-sync-user commented Feb 5, 2024

data-sync-user commented Feb 5, 2024

data-sync-user commented Feb 5, 2024

data-sync-user commented Apr 18, 2024

data-sync-user commented Apr 19, 2024

data-sync-user commented Apr 25, 2024

data-sync-user commented May 2, 2024

data-sync-user commented May 2, 2024

data-sync-user commented May 13, 2024

data-sync-user commented May 22, 2024

data-sync-user commented Feb 5, 2024 •

edited