Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(ingestion/lookml): liquid template resolution and use datahub sqlparser for column level lineage #10542

Open
wants to merge 29 commits into
base: master
Choose a base branch
from

Conversation

sid-acryl
Copy link
Contributor

@sid-acryl sid-acryl commented May 20, 2024

  • Update code to use DataHub SqlParser for SQL parsing
  • And fixes issues in CLL generation when view definition language is SQL
  • Added support for liquid template resolution for lookml views

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label May 20, 2024
syntax_func_map: Dict[str, Callable] = {
"syntax1": partial(_update_fields, self.fields, spr),
"syntax2": partial(_create_fields, spr),
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks pretty weird - can we just use an if statement instead of this whole "syntax1/2" thing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated this section

@@ -1295,64 +1299,28 @@ def _extract_metadata_from_derived_table_sql(
# Skip queries that contain liquid variables. We currently don't parse them correctly.
# Docs: https://cloud.google.com/looker/docs/liquid-variable-reference.
# TODO: also support ${EXTENDS} and ${TABLE}
if "{%" in sql_query:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where does all the jinja-related logic go now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this commit I updated the old code to as per syntax, will add complete code in next commit.

@@ -1673,7 +1595,30 @@ def _get_upstream_lineage(
make_schema_field_urn(
upstream_dataset_urn, upstream_field
)
for upstream_field in field.upstream_fields
for upstream_field in cast(
List[str], field.upstream_fields
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this cast required?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the ViewField.upstream_fields from List[str] to upstream_fields: Union[List[str], List[ColumnRef]]. Because it is either a column name in-case of non-sql lineage or ColumnRef for sql query parsed result

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why can't it always be List[ColumnRef]? we can convert the strings to ColumnRef so that it's all uniform


return fields


def get_qt_name(urn: str) -> str:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is qt

Copy link
Contributor Author

@sid-acryl sid-acryl May 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is get_qualified_table_name.

Undefined.__str__ = lambda instance: "NULL" # type: ignore

# Resolve liquid template
return Template(sql_query).render(self.liquid_context)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the user should also be able to pass in a variable list into the ingestion config

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the point I asked you on slack , do we really need variable list if NULL can work? , let me check other scenario as discussed over slack.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the code to pass liquid_variable as config

upstreams=[
make_schema_field_urn(cll_ref.table, cll_ref.column)
for cll_ref in cast(
List[ColumnRef], field.upstream_fields
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this cast required?

in general, anytime you're using cast, there's a decent chance you're doing something wrong or need an if thing is None check

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check ViewField.upstream_fields type

@sid-acryl sid-acryl changed the title fix(ingestion/looker): use datahub sqlparser for column level lineage fix(ingestion/looker): use datahub sqlparser for column level lineage and liquid template resolution Jun 11, 2024
@sid-acryl sid-acryl changed the title fix(ingestion/looker): use datahub sqlparser for column level lineage and liquid template resolution fix(ingestion/looker): liquid template resolution and use datahub sqlparser for column level lineage Jun 11, 2024
@sid-acryl sid-acryl changed the title fix(ingestion/looker): liquid template resolution and use datahub sqlparser for column level lineage fix(ingestion/lookml): liquid template resolution and use datahub sqlparser for column level lineage Jun 11, 2024
@@ -7,7 +7,11 @@ sql:
FROM
order
WHERE
{% condition order_region %} order.region {% endcondition %}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why was this changed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

python-liquid template was raising error for condition as it is not liquid tag, it is specific to looker and later as it was present in sql, the parser was raising parsing error for this.

I updated this section to use liquid conditional logic, see the updated file lkml_samples_hive/liquid.view.lkml


spr: SqlParsingResult = create_lineage_sql_parsed_result(
query=query,
default_schema=connection.default_schema,
default_db=connection.default_db,
platform=connection.platform,
platform_instance=connection.platform_instance,
env=env,
graph=ctx.graph,
env=cast(str, connection.platform_env), # It's never going to be None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this is true, let's just change the type of connection.platform_env to not be optional

Copy link
Contributor Author

@sid-acryl sid-acryl Jun 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will break existing recipe, please check this L108 and as per below comments it is never going to be None as we are setting it to env of looker

# LookerConnectionDefinition.platform_env
if connection_def:
if connection_def.platform_env is None:
connection_def.platform_env = self.source_config.env
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't seem right

Copy link
Contributor Author

@sid-acryl sid-acryl Jun 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pleaser read this field description #L178 and no where we were making sure that it should happened. It was crashing the sql parser, not able to recall exact issue, but it setting env to None

@@ -1673,7 +1595,30 @@ def _get_upstream_lineage(
make_schema_field_urn(
upstream_dataset_urn, upstream_field
)
for upstream_field in field.upstream_fields
for upstream_field in cast(
List[str], field.upstream_fields
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why can't it always be List[ColumnRef]? we can convert the strings to ColumnRef so that it's all uniform

@@ -2120,6 +1271,14 @@ def get_internal_workunits(self) -> Iterable[MetadataWorkUnit]: # noqa: C901

project_name = self.get_project_name(model_name)

looker_view_id_cache: LookerViewIdCache = LookerViewIdCache(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this used for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is for the case where view is referred in SQL statement as ${view-name.SQL_TABLE_NAME}.
Check this view top_10_employee_income_source.view.lkml.

In this derived sql case at L401 we are getting urn with .sql_table_name suffix and platform, database and schema as per setting available in connection, however these things are not true for derived view

At this line L417, we are overwriting such urn by actual view-urn.

This cache is helping to find out LookerViewId so that we can generate the view urn and also keeping the LookerViewId in cache

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ingestion PR or Issue related to the ingestion of metadata
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants