
Create ArrowReaderMetadata from externalized metadata #5582

Closed
kylebarron opened this issue Apr 2, 2024 · 1 comment · Fixed by #5583
Labels
enhancement Any new improvement worthy of an entry in the changelog · parquet Changes to the parquet crate

Comments

@kylebarron (Contributor)

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

In some multi-file Parquet dataset layouts, there is a sidecar metadata file, canonically named _metadata, which holds only the metadata for each row group in the dataset. See https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-metadata-files:

Some processing frameworks such as Spark or Dask (optionally) use _metadata and _common_metadata files with partitioned datasets.

Those files include information about the schema of the full dataset (for _common_metadata) and potentially all row group metadata of all files in the partitioned dataset as well (for _metadata). The actual files are metadata-only Parquet files. Note this is not a Parquet standard, but a convention set in practice by those frameworks.

Using those files can give a more efficient creation of a parquet Dataset, since it can use the stored schema and file paths of all row groups, instead of inferring the schema and crawling the directories for all Parquet files (this is especially the case for filesystems where accessing files is expensive).

I'd like to be able to use such metadata files to accelerate reading of Parquet datasets in geoarrow-rs. Mimicking pyarrow's API, I currently have a ParquetFile struct, which is backed by a single R: AsyncFileReader, as well as a ParquetDataset struct, which is backed by Vec<ParquetFile<R>>, where R: AsyncFileReader. This allows concurrent async reads across multiple files.

I'd like to have a ParquetDataset::from_metadata method, which constructs itself from a _metadata file. But to do that I need to be able to construct ArrowReaderMetadata for each underlying file. This is entirely possible with existing APIs, except that ArrowReaderMetadata::try_new has visibility pub(crate).

Describe the solution you'd like

Give ArrowReaderMetadata::try_new full public visibility.
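With public visibility, a `ParquetDataset::from_metadata` could build per-file reader metadata directly from the sidecar instead of fetching every file's footer. A minimal sketch, assuming the `parquet` crate's `try_new(Arc<ParquetMetaData>, ArrowReaderOptions)` signature; deserializing the `_metadata` file into per-file `ParquetMetaData` values is left out and is the caller's responsibility here:

```rust
use std::sync::Arc;

use parquet::arrow::arrow_reader::{ArrowReaderMetadata, ArrowReaderOptions};
use parquet::errors::Result;
use parquet::file::metadata::ParquetMetaData;

/// Build ArrowReaderMetadata for one file of the dataset from metadata that
/// was obtained out-of-band (e.g. parsed from a `_metadata` sidecar file),
/// so no footer read against the file itself is needed.
fn reader_metadata_from_sidecar(
    file_metadata: Arc<ParquetMetaData>,
) -> Result<ArrowReaderMetadata> {
    ArrowReaderMetadata::try_new(file_metadata, ArrowReaderOptions::new())
}
```

Each resulting `ArrowReaderMetadata` could then seed a builder via the existing `ParquetRecordBatchStreamBuilder::new_with_metadata`, giving concurrent async reads across files without any extra metadata round trips.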

Describe alternatives you've considered

Unsure of alternatives.

Additional context

@kylebarron kylebarron added the enhancement (Any new improvement worthy of an entry in the changelog) label Apr 2, 2024
@tustvold tustvold added the parquet Changes to the parquet crate label Apr 17, 2024
@tustvold (Contributor)

label_issue.py automatically added labels {'parquet'} from #5583
