-
Notifications
You must be signed in to change notification settings - Fork 98
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix bugs in derived_view_schemas
SQL generator
#5592
Conversation
This comment has been minimized.
This comment has been minimized.
for path in dataset_path.iterdir() | ||
if path.is_dir() | ||
and (path / VIEW_FILE).exists() | ||
and not (path / SCHEMA_FILE).exists() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think view deploys update the schema from dry run if a schema.yaml exists, unlike table deploys. So this might cause view deployment to publish outdated schemas and fail. IMO we can update the schema if it already exists, but if not then any schemas that need to be updated should be included in this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if there's an explicit schema.yaml
file for a view it's reasonable to expect that to be kept up to date and be accurate (e.g. your code for View.view_schema
seems to make that assumption).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a safe assumption because those schemas get updated by this generator but that wouldn't be the case with this change. Running this generator locally, at least funnel_retention_clients
and funnel_retention_week_4
have columns added so those would fail to publish here
bigquery-etl/bigquery_etl/view/__init__.py
Lines 371 to 376 in 23fa414
try: | |
schema_path = Path(self.path).parent / "schema.yaml" | |
if schema_path.is_file(): | |
self.view_schema.deploy(target_view) | |
except Exception as e: | |
print(f"Could not update field descriptions for {target_view}: {e}") |
Those should be updated in this PR. I would still lean towards updating existing schemas since we can't always rely on schemas being up to date which is the case for tables
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK:
- I've changed the
View.view_schema
logic to always try dry-running first, then augment with theschema.yaml
if one exists. - I've changed the
derived_view_schemas
SQL generator to useView.view_schema
to get schemas, rather than getting the deployed view schema withSchema.for_table()
(which I suspect was probably preventing the logic you implemented in Bug 1868244 Publish views if dry run schema differs from deployed #5057 from working properly). - I've changed the
derived_view_schemas
SQL generator to not skip views that already haveschema.yaml
files.
It now gets the view schemas from their latest `schema.yaml` file or dry-running their latest SQL. Getting the schema from the currently deployed view wasn't appropriate because it wouldn't reflect the latest view code.
This comment has been minimized.
This comment has been minimized.
…he latest schema.
This comment has been minimized.
This comment has been minimized.
bigquery_etl/view/__init__.py
Outdated
"""Derive view schema from a dry run result and/or a schema file.""" | ||
schema = None | ||
|
||
# check schema based on dry run results |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think dry running everything makes sense to make sure everything is updated but it will probably add a few minutes to the publish views task. Views should usually be up to date since the generator already ran. I don't have any issues with this but worth checking later on to see how much time this adds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reversed course on this and went back to having it return the schema from schema.yaml
if one exits by default, because it turns out dry-runs for some views which select from very large tables can take well over a minute.
The derived_view_schemas
SQL generator logic will still always try to dry-run views to get their up-to-date schema.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did manage to get the dry-runs on some of the views that select from large tables to run quicker by adding partition column filters, but I'm still going to leave this with the original behavior of using the schema from schema.yaml
by default, with the derived_view_schemas
SQL generator using custom logic to always try to dry-run the views.
0450a63
to
b094dcd
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
bigquery_etl/view/__init__.py
Outdated
# We have to remove `CREATE OR REPLACE VIEW ... AS` from the query to avoid | ||
# view creation permission denied errors, and we have to apply a `WHERE FALSE` | ||
# filter to avoid partition column filter missing errors. | ||
query=dedent( | ||
f""" | ||
WITH view_query AS ( | ||
{CREATE_VIEW_PATTERN.sub("", self.content)} | ||
) | ||
SELECT * | ||
FROM view_query | ||
WHERE FALSE | ||
""" | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was pleasantly surprised that dry-running the full CREATE VIEW ...
query returns the view's schema, nicely side-stepping the issue of potentially having to add query filters on any partition columns from the underlying tables. However, it turns out that only works if you actually have permission to create the view in the target dataset, which we DEs generally do but CI does not, so the previous dry-run-to-get-view-schema logic would have always failed in a CI context.
CI doesn't have permission to do the dry-runs directly.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
44aa8b9
to
c25347b
Compare
…the latest schema. Dry-runs for some views which select from very large tables can take well over a minute.
c25347b
to
6e4c94e
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
…ml` files if those files actually exist.
This comment has been minimized.
This comment has been minimized.
eb6e1e6
to
7ef67d9
Compare
To speed up the dry-run queries.
7ef67d9
to
40d59a5
Compare
This comment has been minimized.
This comment has been minimized.
…1` table. So the `derived_view_schemas` SQL generator can detect the partition column to use and successfully dry-run the view to determine its schema.
This comment has been minimized.
This comment has been minimized.
In case other SQL generators create derived views.
bigquery_etl/cli/generate.py
Outdated
# Run `stable_views` after `glean_usage` because both update `dataset_metadata.yaml` files | ||
# and we want the `stable_views` updates to take precedence. | ||
case "glean_usage": | ||
return (1, command.name) | ||
case "stable_views": | ||
return (2, command.name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is based on comments in #2594.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@scholtzan the sql.diff
shows this changing some dataset_metadata.yaml
files. Since I based the implementation on your comments, can you please confirm that stable_views
should still run after glean_usage
, and not vice versa?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe for _, cmd in reversed(generate.commands.items()):
causes stable_views
to be run before glean_usage
. I'm not sure whether it makes a difference here though? I believe glean_app_ping_views
might depended on the stable views at some point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I've reversed the order in this PR so that stable_views
runs before glean_usage
, and now no dataset_metadata.yaml
changes appear in the sql.diff
.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
case _: | ||
return (3, command.name) | ||
|
||
for cmd in sorted(generate.commands.values(), key=generator_command_sort_key): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm hoping running the SQL generators in a deterministic order will also reduce the seemingly random unrelated diffs that sometimes get reported in sql.diff
. Currently the SQL generators are run in the reverse order that Path.iterdir()
returns the sql_generators/*
subdirectories, and according to the docs that order is "arbitrary".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes make sense to me but definitely something to keep an eye on to make sure there aren't unintended side effects
@@ -247,7 +247,7 @@ def _traverse( | |||
f"for {prefix}.{field_path} are incompatible" | |||
) | |||
|
|||
if dtype == "RECORD": | |||
if dtype == "RECORD" and nodes[node_name]["type"] == "RECORD": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did an issue with this pop up while you were testing? I'm ignore_incompatible_fields
would need to be true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did an issue with this pop up while you were testing?
Yes, it was causing the derived_view_schemas
generator to fail for a couple views that transform an underlying struct column into something else:
telemetry.clients_daily_joined
telemetry.clients_daily_v6
I'm
ignore_incompatible_fields
would need to be true
The derived_view_schemas
generator calls Schema.merge()
with attributes=["description"]
, which kind of has a similar effect since the field incompatibility logic is applied at the attribute level.
This comment has been minimized.
This comment has been minimized.
Integration report for "Have
|
The
derived_view_schemas
SQL generator was running against all types of directories (not just view directories), and it was reading the view schemas from the currently deployed views (not based on the latest view code).Changes:
Limit
derived_view_schemas
SQL generator to actual view directories.Fix the
derived_view_schemas
SQL generator to get the view schemas by dry-running their latest SQL and/or from their latestschema.yaml
file. Getting the schema from the currently deployed view wasn't appropriate because it wouldn't reflect the latest view code.Rename
View.view_schema
toView.schema
.Change
View
so its schema dry-runs use the cloud function (CI doesn't have permission to run dry-run queries directly).Apply partition column filters in view dry-run queries when possible for speed/efficiency.
Don't allow missing fields to prevent view schema enrichment.
Only copy column descriptions during view schema enrichment.
Only try enriching view schemas from their reference table
schema.yaml
files if those files actually exist.Change
main_1pct
view to select directly from themain_remainder_1pct_v1
table, so thederived_view_schemas
SQL generator can detect the partition column to use and successfully dry-run the view to determine its schema.Formalize the order
bqetl generate all
runs the SQL generators in.Have
bqetl generate all
runderived_view_schemas
last, in case other SQL generators create derived views.Fix
Schema._traverse()
to only recurse if both fields are records.Checklist for reviewer:
<username>:<branch>
of the fork as parameter. The parameter will also show upin the logs of the
manual-trigger-required-for-fork
CI task together with more detailed instructions.For modifications to schemas in restricted namespaces (see
CODEOWNERS
):┆Issue is synchronized with this Jira Task