-
Notifications
You must be signed in to change notification settings - Fork 2.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(ingestion/lookml): liquid template resolution and use datahub sqlparser for column level lineage #10542
base: master
Are you sure you want to change the base?
fix(ingestion/lookml): liquid template resolution and use datahub sqlparser for column level lineage #10542
Conversation
sid-acryl
commented
May 20, 2024
•
edited
edited
- Update code to use DataHub SqlParser for SQL parsing
- And fixes issues in CLL generation when view definition language is SQL
- Added support for liquid template resolution for lookml views
syntax_func_map: Dict[str, Callable] = { | ||
"syntax1": partial(_update_fields, self.fields, spr), | ||
"syntax2": partial(_create_fields, spr), | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks pretty weird - can we just use an if statement instead of this whole "syntax1/2" thing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated this section
@@ -1295,64 +1299,28 @@ def _extract_metadata_from_derived_table_sql( | |||
# Skip queries that contain liquid variables. We currently don't parse them correctly. | |||
# Docs: https://cloud.google.com/looker/docs/liquid-variable-reference. | |||
# TODO: also support ${EXTENDS} and ${TABLE} | |||
if "{%" in sql_query: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where does all the jinja-related logic go now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this commit I updated the old code to as per syntax, will add complete code in next commit.
…hub-fork into master+ing-510-lookml-cll
@@ -1673,7 +1595,30 @@ def _get_upstream_lineage( | |||
make_schema_field_urn( | |||
upstream_dataset_urn, upstream_field | |||
) | |||
for upstream_field in field.upstream_fields | |||
for upstream_field in cast( | |||
List[str], field.upstream_fields |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this cast required?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the ViewField.upstream_fields from List[str] to upstream_fields: Union[List[str], List[ColumnRef]]
. Because it is either a column name in-case of non-sql lineage or ColumnRef for sql query parsed result
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why can't it always be List[ColumnRef]
? we can convert the strings to ColumnRef so that it's all uniform
|
||
return fields | ||
|
||
|
||
def get_qt_name(urn: str) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is qt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is get_qualified_table_name.
Undefined.__str__ = lambda instance: "NULL" # type: ignore | ||
|
||
# Resolve liquid template | ||
return Template(sql_query).render(self.liquid_context) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the user should also be able to pass in a variable list into the ingestion config
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the point I asked you on slack , do we really need variable list if NULL can work? , let me check other scenario as discussed over slack.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the code to pass liquid_variable as config
upstreams=[ | ||
make_schema_field_urn(cll_ref.table, cll_ref.column) | ||
for cll_ref in cast( | ||
List[ColumnRef], field.upstream_fields |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this cast required?
in general, anytime you're using cast, there's a decent chance you're doing something wrong or need an if thing is None
check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check ViewField.upstream_fields type
…hub-fork into master+ing-510-lookml-cll
…hub-fork into master+ing-510-lookml-cll
@@ -7,7 +7,11 @@ sql: | |||
FROM | |||
order | |||
WHERE | |||
{% condition order_region %} order.region {% endcondition %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why was this changed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
python-liquid template was raising error for condition
as it is not liquid tag, it is specific to looker and later as it was present in sql, the parser was raising parsing error for this.
I updated this section to use liquid conditional logic, see the updated file lkml_samples_hive/liquid.view.lkml
|
||
spr: SqlParsingResult = create_lineage_sql_parsed_result( | ||
query=query, | ||
default_schema=connection.default_schema, | ||
default_db=connection.default_db, | ||
platform=connection.platform, | ||
platform_instance=connection.platform_instance, | ||
env=env, | ||
graph=ctx.graph, | ||
env=cast(str, connection.platform_env), # It's never going to be None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if this is true, let's just change the type of connection.platform_env to not be optional
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will break existing recipe, please check this L108 and as per below comments it is never going to be None as we are setting it to env of looker
# LookerConnectionDefinition.platform_env | ||
if connection_def: | ||
if connection_def.platform_env is None: | ||
connection_def.platform_env = self.source_config.env |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this doesn't seem right
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pleaser read this field description #L178 and no where we were making sure that it should happened. It was crashing the sql parser, not able to recall exact issue, but it setting env to None
@@ -1673,7 +1595,30 @@ def _get_upstream_lineage( | |||
make_schema_field_urn( | |||
upstream_dataset_urn, upstream_field | |||
) | |||
for upstream_field in field.upstream_fields | |||
for upstream_field in cast( | |||
List[str], field.upstream_fields |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why can't it always be List[ColumnRef]
? we can convert the strings to ColumnRef so that it's all uniform
@@ -2120,6 +1271,14 @@ def get_internal_workunits(self) -> Iterable[MetadataWorkUnit]: # noqa: C901 | |||
|
|||
project_name = self.get_project_name(model_name) | |||
|
|||
looker_view_id_cache: LookerViewIdCache = LookerViewIdCache( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this used for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is for the case where view is referred in SQL statement as ${view-name.SQL_TABLE_NAME}.
Check this view top_10_employee_income_source.view.lkml.
In this derived sql case at L401 we are getting urn with .sql_table_name
suffix and platform, database and schema as per setting available in connection, however these things are not true for derived view
At this line L417, we are overwriting such urn by actual view-urn.
This cache is helping to find out LookerViewId so that we can generate the view urn and also keeping the LookerViewId in cache
…hub-fork into master+ing-510-lookml-cll