Enable Compartment Joins for Unified Parquet Output #22
decouple from concat, preparing for use within merge ops without prior concat
standardize language surrounding join operations (pyarrow only uses join wording).
also include validation for CITATION.cff
Nice work! I made a few comments. Generally, I'd say that if you end up excluding some functions from the public interface, it's ok not to document them from an end user's perspective, but if they are intended for end users I'd definitely include descriptions of what the end user should pass for each parameter. For example, it'd be good to see a sample input dataset in the documentation, a list of all the files in that dataset, and how you'd use the library to process it, e.g., which specific file paths in the dataset are given to which methods.
Thank you again @gwaybio and @falquaddoomi for your reviews! I've addressed or created new issues for the comments you provided.
Description
The changes in this PR seek to enable compartment data joins as a single parquet file. Along the way, many other related changes were added. Thank you in advance for your comments and suggestions!
Outline of changes:
tests/data CITATION.cff file has been added to help document testing dataset sources and other contributions.
exceptions.py) or increasing readability through line-length reduction (record-based tasks moved from convert.py to records.py), abiding by pylint too-many-lines (C0302).
Design considerations:
SQL-based DuckDB joins:
Initial work towards this PR included pyarrow.Table.join() and related configuration presets for controlling join behavior with a Python dictionary (I'll refer to this as a "join dictionary"). The data structure of this join dictionary mimicked that of SQL-based options (for example, LEFT JOIN table ON x=y as {"left":"table","join_keys":["x"]}, etc.). Given this, these points urged me to change the code before this PR.
Citation references:
I wanted to make sure we acknowledged the great work of those who generated datasets used for testing within this repo. Using a CITATION.cff file for this felt right, but I was unsure of the best way to both link to existing CITATION.cff files and also intermix references to individuals not found within a CITATION.cff file. I'd welcome any thoughts on the best or most preferred way to do this so we make sure those involved are credited.
What is the nature of your change?
Checklist
Please ensure that all boxes are checked before indicating that a pull request is ready for review.
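A note on the "join dictionary" mentioned under the design considerations: the sketch below is only an illustration of why that structure ended up mirroring SQL, not the code in this PR. The helper `join_dict_to_sql`, the table names `image` and `cells`, and the column names are all hypothetical, and the sketch assumes the join key has the same name in both tables. The standard-library `sqlite3` module stands in for DuckDB here purely so the example runs without extra dependencies.

```python
import sqlite3

# Hypothetical join dictionary in the shape quoted in the description:
# the key names the join type, its value names the table to join.
# Assumption: the join key has the same column name in both tables.
JOIN_SPEC = {"left": "cells", "join_keys": ["ImageNumber"]}


def join_dict_to_sql(base_table, spec):
    """Render a join dictionary as an equivalent SQL LEFT JOIN (hypothetical helper)."""
    keys = ", ".join(spec["join_keys"])
    return f"SELECT * FROM {base_table} LEFT JOIN {spec['left']} USING ({keys})"


def run_example():
    """Build two tiny in-memory tables and run the rendered join against them."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE image (ImageNumber INTEGER, Plate TEXT)")
    conn.execute("CREATE TABLE cells (ImageNumber INTEGER, Area REAL)")
    conn.executemany("INSERT INTO image VALUES (?, ?)", [(1, "A01"), (2, "A02")])
    conn.execute("INSERT INTO cells VALUES (?, ?)", (1, 10.5))
    rows = conn.execute(join_dict_to_sql("image", JOIN_SPEC)).fetchall()
    conn.close()
    # Sort so the result does not depend on the engine's row order.
    return sorted(rows)


if __name__ == "__main__":
    print(join_dict_to_sql("image", JOIN_SPEC))
    print(run_example())
```

Since the dictionary carries the same information as the SQL clause it renders to, writing the SQL directly removes a layer of translation, which is roughly the reasoning behind the switch described above.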