-
Notifications
You must be signed in to change notification settings - Fork 515
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix bq Avro format issue for nested records #5611
Conversation
|
||
// Due to Beam bug with automatic schema detection, can't parse nested record types as GenericRecords yet | ||
// Todo remove assertThrows after fixing in Beam | ||
assertThrows[UnresolvedUnionException] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
basically Beam attempts to infer an Avro schema the selected BQ table here, and it doesn't work correctly if you have a nested record (capitalization is wrong and record name is just "root" for some reason?). There's also no builder method available for us to manually set an Avro schema, either, which I think we'll have to add to Beam.
e1ae8aa
to
588e9bc
Compare
@@ -1050,6 +1050,7 @@ lazy val `scio-google-cloud-platform` = project | |||
moduleFilter("org.apache.arrow", "arrow-vector"), | |||
moduleFilter("com.fasterxml.jackson.datatype", "jackson-datatype-jsr310") | |||
).reduce(_ | _), | |||
Test / javaOptions += "-Doverride.type.provider=com.spotify.scio.bigquery.validation.SampleOverrideTypeProvider", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For context, BigQueryValidationTest was failing because the system property override.type.provider
that was being set in beforeAll seemed to be getting unset before the test actually run. This happened because I added the fromAvro/toAro tests to ConverterProviderTest
, which invoke a new OverrideTypeProvider instance without the system properties set, and override the existing one. Let's just set it in the build options here and keep things simple.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #5611 +/- ##
==========================================
+ Coverage 61.32% 61.37% +0.04%
==========================================
Files 314 314
Lines 11236 11243 +7
Branches 778 788 +10
==========================================
+ Hits 6891 6900 +9
+ Misses 4345 4343 -2 ☔ View full report in Codecov by Sentry. |
A proper fix for #5597.
tl;dr for Avro-formatted BQ loads, an Avro schema should always set the type of a nested record type to match its field name, since BQ schemas do not support naming nested schemas. The behavior of Beam's
BigQueryUtils.toGenericAvroSchema
utility is correct in this regard (minus inconsistent namespace assignment); we just needed to properly set the right field names in the BigQueryType macro.