My guess is that the splittability of the BinaryIO reader implementation is misconfigured, or that there is some way to direct `ReadAllViaFileBasedSource` not to split the source into offset ranges. We should compare against a known non-splittable Beam source (I think TFRecordIO is one such example?)
Upon following my own advice, I tried copying TFRecordIO's technique of setting `desiredBundleSizeBytes` to `Long.MAX_VALUE` to avoid splitting, and it solved the problem 👍
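To illustrate why the `Long.MAX_VALUE` setting avoids splitting, here is a minimal sketch of how cutting a file into offset ranges depends on the bundle size. It is a simplified model, not Beam's or Dataflow's actual splitting code: the class `OffsetSplitSketch`, the helper `splitIntoOffsetRanges`, and the 200MB file size are all hypothetical names and values chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified illustration only; this is NOT Beam's actual splitting logic.
final class OffsetSplitSketch {

    // Hypothetical helper: cuts [0, fileSizeBytes) into consecutive [start, end)
    // ranges, each at most desiredBundleSizeBytes long.
    static List<long[]> splitIntoOffsetRanges(long fileSizeBytes, long desiredBundleSizeBytes) {
        if (fileSizeBytes < 0 || desiredBundleSizeBytes <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative / positive");
        }
        List<long[]> ranges = new ArrayList<>();
        long start = 0;
        while (start < fileSizeBytes) {
            long remaining = fileSizeBytes - start;
            // Compare against the remaining bytes first so that
            // start + desiredBundleSizeBytes can never overflow.
            long end = remaining > desiredBundleSizeBytes
                    ? start + desiredBundleSizeBytes
                    : fileSizeBytes;
            ranges.add(new long[] {start, end});
            start = end;
        }
        return ranges;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileSize = 200 * mb; // assumed file size, for illustration

        // A 64MB bundle size cuts the file into several offset ranges.
        System.out.println(splitIntoOffsetRanges(fileSize, 64 * mb).size());

        // Long.MAX_VALUE exceeds any file size, so a single range covers the file.
        System.out.println(splitIntoOffsetRanges(fileSize, Long.MAX_VALUE).size());
    }
}
```

Under this model, a 200MB file with a 64MB bundle size yields four ranges, while `Long.MAX_VALUE` always yields one range spanning the whole file, which is the behavior a non-splittable binary format needs.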
Dataflow will try to break the file into offset splits of `desiredBundleSizeBytes`, which we've set to 64MB, although binary files should not be split.
Sample repro:
which, on read, throws this error in DF: