-
-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: adding parquet.votable #16375
base: main
Are you sure you want to change the base?
ENH: adding parquet.votable #16375
Conversation
Thank you for your contribution to Astropy! 🌌 This checklist is meant to remind the package maintainers who will review this pull request of some common things to look for.
|
👋 Thank you for your draft pull request! Do you know that you can use |
Fixed up the codestyle advisories, but two remained and are related to XML. How do we ended up standing with defusedxml? Should I just ignore these for now? |
|
||
# Convert the VOTable object into a Byte string to create an | ||
# XML that we can add to the Parquet metadata | ||
xml_bstr = io.BytesIO() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be a byte string? The votable content will already be utf-8. Are the bytes needed for parquet?
Data table that is to be written to output. | ||
output : str or path-like | ||
The filename to write the table to. | ||
metadata : dict |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the user expected to generate this metadata object? How does it relate to the .meta
info that may already be in the table
or its columns?
It looks like this metadata
is intended to supplement what the votable writer supplied from the table
. I'm concerned that, especially if table
was created from reading an actual votable, that metadata
could conflict with the existing metadata in table
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have issues with this kwarg, too, and wrote up a few test cases. E.g. I think we should reuse the column metadata if the Table already has it (e.g. the table could have been read in from a votable/be a result of an astroquery query). So I would instead add another kwarg for metadata_overwrite
, and then use the metadata
kward to overwrite some of the column metadata (but still, it would be nice to allow partial overwrite).
As far as I see, we do nothing with the table metadata, but that is already the case when reading in from a votable? (E.g. with this gaia file, it's not clear how to get the e.g. QUERY
info out of it, etc.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in votable.parquet
we call this column_metadata
, so I suppose we need to do the same. And it's similarly a mandatory kwarg, which IMO needs to be changed and allowed it to use existing metadata if/when it's sufficiently provided.
mass = np.random.uniform(low=1e8, high=1e10, size=number_of_objects) | ||
sfr = np.random.uniform(low=1, high=100, size=number_of_objects) | ||
|
||
input_table = Table( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be interesting to have test cases where the input astropy table was created by reading an actual votable with its own metadata.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added a test example for this
…opulated metadata Query to reproduce: "SELECT TOP 1 source_id, ra, dec, parallax from gaiadr3.gaia_source"
@pllim - WIW, this again run into some coverage shenanigans, something else than the previous upload fail. |
You have to rebase anyway, I think, to pick up fix for RTD. Also maybe address the pre-commit checks? I don't know why there are two different |
pyarrow_table = pyarrow_table.replace_schema_metadata(updated_metadata) | ||
|
||
# write the parquet file with required Type 1 metadata | ||
pyarrow.parquet.write_table(pyarrow_table, output) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably need to use the existing writer, otherwise we need to do the same type of dtype wizzardy. What do you think @gpdf?
Thank you for your work, this PR would be really useful for hipscat. The standard describes column data types, would this PR or future work cover that? |
This PR adds a new format
parquet.votable
to be able to read parquet files that include a votable metadata.I open this as a draft for the first round of comments, and while adding some narrative documentation.
cc @tomdonaldson @gpdf @afaisst