
Create ArrowReaderMetadata from externalized metadata #5582

Closed
kylebarron opened this issue Apr 2, 2024 · 1 comment · Fixed by #5583
Labels
enhancement Any new improvement worthy of an entry in the changelog · parquet Changes to the parquet crate

Comments

@kylebarron (Contributor)

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

In some multi-file Parquet dataset layouts, there is a sidecar metadata file, canonically named _metadata, which holds only the metadata for each row group in the dataset. See https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-metadata-files:

Some processing frameworks such as Spark or Dask (optionally) use _metadata and _common_metadata files with partitioned datasets.

Those files include information about the schema of the full dataset (for _common_metadata) and potentially all row group metadata of all files in the partitioned dataset as well (for _metadata). The actual files are metadata-only Parquet files. Note this is not a Parquet standard, but a convention set in practice by those frameworks.

Using those files can give a more efficient creation of a parquet Dataset, since it can use the stored schema and file paths of all row groups, instead of inferring the schema and crawling the directories for all Parquet files (this is especially the case for filesystems where accessing files is expensive).

I'd like to be able to use such metadata files to accelerate reading of Parquet datasets in geoarrow-rs. Mimicking pyarrow's API, I currently have a ParquetFile struct, which is backed by a single R: AsyncFileReader, as well as a ParquetDataset struct, which is backed by Vec<ParquetFile<R>>, where R: AsyncFileReader. This allows concurrent async reads across multiple files.

I'd like to have a ParquetDataset::from_metadata method, which constructs itself from a _metadata file. But to do that I need to be able to construct ArrowReaderMetadata for each underlying file. This is entirely possible with existing APIs, except that ArrowReaderMetadata::try_new has visibility pub(crate).

Describe the solution you'd like

Give ArrowReaderMetadata::try_new full public visibility.
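With public visibility, a `ParquetDataset::from_metadata` could build per-file reader metadata directly from the sidecar instead of fetching every file's footer. A minimal sketch, assuming the `parquet` crate's `try_new(Arc<ParquetMetaData>, ArrowReaderOptions)` signature; deserializing the `_metadata` file into per-file `ParquetMetaData` values is left out and is the caller's responsibility here:

```rust
use std::sync::Arc;

use parquet::arrow::arrow_reader::{ArrowReaderMetadata, ArrowReaderOptions};
use parquet::errors::Result;
use parquet::file::metadata::ParquetMetaData;

/// Build ArrowReaderMetadata for one file of the dataset from metadata that
/// was obtained out-of-band (e.g. parsed from a `_metadata` sidecar file),
/// so no footer read against the file itself is needed.
fn reader_metadata_from_sidecar(
    file_metadata: Arc<ParquetMetaData>,
) -> Result<ArrowReaderMetadata> {
    ArrowReaderMetadata::try_new(file_metadata, ArrowReaderOptions::new())
}
```

Each resulting `ArrowReaderMetadata` could then seed a builder via the existing `ParquetRecordBatchStreamBuilder::new_with_metadata`, giving concurrent async reads across files without any extra metadata round trips.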

Describe alternatives you've considered

Unsure of alternatives.

Additional context

@kylebarron kylebarron added the enhancement (Any new improvement worthy of an entry in the changelog) label Apr 2, 2024
@tustvold tustvold added the parquet Changes to the parquet crate label Apr 17, 2024
@tustvold (Contributor)

label_issue.py automatically added labels {'parquet'} from #5583
