Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bq Avro format issue for nested records #5611

Merged
merged 3 commits into from
Feb 19, 2025
Merged

Conversation

clairemcginty
Copy link
Contributor

A proper fix for #5597.

tl;dr for Avro-formatted BQ loads, an Avro schema should always set the type of a nested record type to match its field name, since BQ schemas do not support naming nested schemas. The behavior of Beam's BigQueryUtils.toGenericAvroSchema utility is correct in this regard (minus inconsistent namespace assignment); we just needed to properly set the right field names in the BigQueryType macro.

@clairemcginty clairemcginty marked this pull request as ready for review February 18, 2025 20:53

// Due to Beam bug with automatic schema detection, can't parse nested record types as GenericRecords yet
// Todo remove assertThrows after fixing in Beam
assertThrows[UnresolvedUnionException] {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

basically Beam attempts to infer an Avro schema the selected BQ table here, and it doesn't work correctly if you have a nested record (capitalization is wrong and record name is just "root" for some reason?). There's also no builder method available for us to manually set an Avro schema, either, which I think we'll have to add to Beam.

Base automatically changed from revert-0.14.12 to main February 19, 2025 17:46
@@ -1050,6 +1050,7 @@ lazy val `scio-google-cloud-platform` = project
moduleFilter("org.apache.arrow", "arrow-vector"),
moduleFilter("com.fasterxml.jackson.datatype", "jackson-datatype-jsr310")
).reduce(_ | _),
Test / javaOptions += "-Doverride.type.provider=com.spotify.scio.bigquery.validation.SampleOverrideTypeProvider",
Copy link
Contributor Author

@clairemcginty clairemcginty Feb 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For context, BigQueryValidationTest was failing because the system property override.type.provider that was being set in beforeAll seemed to be getting unset before the test actually run. This happened because I added the fromAvro/toAro tests to ConverterProviderTest, which invoke a new OverrideTypeProvider instance without the system properties set, and override the existing one. Let's just set it in the build options here and keep things simple.

Copy link

codecov bot commented Feb 19, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 61.37%. Comparing base (596bdc2) to head (536a745).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5611      +/-   ##
==========================================
+ Coverage   61.32%   61.37%   +0.04%     
==========================================
  Files         314      314              
  Lines       11236    11243       +7     
  Branches      778      788      +10     
==========================================
+ Hits         6891     6900       +9     
+ Misses       4345     4343       -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@clairemcginty clairemcginty merged commit 34729e0 into main Feb 19, 2025
12 checks passed
@clairemcginty clairemcginty deleted the fix-bq-avro-load branch February 19, 2025 19:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants