
Enable Compartment Joins for Unified Parquet Output #22

Merged · 82 commits · Jan 23, 2023

Conversation

@d33bs (Member) commented Dec 7, 2022

Description

The changes in this PR enable joining compartment data into a single Parquet file. Along the way, many related changes were added. Thank you in advance for your comments and suggestions!

Outline of changes:

Design considerations:

SQL-based DuckDB joins:

Initial work toward this PR used PyArrow.Table.join() together with configuration presets that controlled join behavior via a Python dictionary (referred to below as a "join dictionary"). The structure of this join dictionary mimicked SQL-based options (for example, LEFT JOIN table ON x=y became {"left":"table","join_keys":["x"]}). The following points led me to change the code before opening this PR:

  • Using the join dictionary required a complex for-loop to join the entirety of the data, whereas SQL allows multiple joins to be stated in a single statement.
  • I couldn't justify reinventing SQL join syntax from a human-understandability standpoint; using SQL should help ensure the code is better understood and maintained over time. It also offers a cross-language way to manage this data that doesn't rely on Pythonic syntax or data structures (should a move from Python to another language be desired).
  • Using DuckDB enabled SQLite-based data ingest in addition to the join work itself (earlier in these changes I had relied on connector-x for SQLite data ingest due to its high-performance capabilities).
  • There may be additional benefits to using DuckDB and/or SQL in this way which have not yet been implemented (record batching, performance, etc.).
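To illustrate the first two points, here is a minimal sketch of expressing all compartment joins in one SQL statement. It uses Python's stdlib sqlite3 rather than DuckDB, and the table and column names (cytoplasm, cells, *_Area) are hypothetical, not the project's actual schema:

```python
import sqlite3

# Illustrative only: the PR itself uses DuckDB over CellProfiler-style
# SQLite output; stdlib sqlite3 shows the same idea with made-up tables.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE cytoplasm (ImageNumber INT, ObjectNumber INT, Cytoplasm_Area REAL)"
)
con.execute(
    "CREATE TABLE cells (ImageNumber INT, ObjectNumber INT, Cells_Area REAL)"
)
con.executemany(
    "INSERT INTO cytoplasm VALUES (?, ?, ?)", [(1, 1, 100.0), (1, 2, 120.0)]
)
con.executemany(
    "INSERT INTO cells VALUES (?, ?, ?)", [(1, 1, 90.0), (1, 2, 110.0)]
)

# Every compartment join is stated in a single SQL statement; contrast
# with a "join dictionary" such as
# {"left": "cells", "join_keys": ["ImageNumber", "ObjectNumber"]}
# driving a for-loop of pairwise PyArrow Table.join() calls.
rows = con.execute(
    """
    SELECT cyto.ImageNumber, cyto.ObjectNumber,
           cyto.Cytoplasm_Area, cells.Cells_Area
    FROM cytoplasm AS cyto
    LEFT JOIN cells
      ON cells.ImageNumber = cyto.ImageNumber
     AND cells.ObjectNumber = cyto.ObjectNumber
    """
).fetchall()
print(rows)  # [(1, 1, 100.0, 90.0), (1, 2, 120.0, 110.0)]
```

Adding a third or fourth compartment is another LEFT JOIN clause in the same statement, rather than another pass through a join loop.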

Citation references:

I wanted to make sure we acknowledge the great work of those who generated the datasets used for testing within this repo. Using a CITATION.cff file for this felt right, but I was unsure of the best way to both link to existing CITATION.cff files and intermix references to individuals not found within a CITATION.cff file. I'd welcome any thoughts on the best or most preferred way to do this so that those involved are credited.
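One possible direction, sketched below under the assumption that the CFF 1.2.0 `references` key fits this use case; all names and titles here are placeholders, not the project's actual citation data:

```yaml
# Hypothetical CITATION.cff sketch (CFF 1.2.0); entries are placeholders.
cff-version: 1.2.0
message: "If you use this software, please cite it using these metadata."
title: "pycytominer-transform"
authors:
  - family-names: "Example"
    given-names: "Author"
# Dataset creators can be credited via the `references` key, even when
# their work ships no CITATION.cff of its own.
references:
  - type: dataset
    title: "Example test dataset"
    authors:
      - family-names: "Example"
        given-names: "Data Creator"
```

For upstream projects that do publish a CITATION.cff, a `references` entry could mirror their metadata, while individuals without one can be listed directly as shown.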

What is the nature of your change?

  • Bug fix (fixes an issue).
  • Enhancement (adds functionality).
  • Breaking change (fix or feature that would cause existing functionality to not work as expected).
  • This change requires a documentation update.

Checklist

Please ensure that all boxes are checked before indicating that a pull request is ready for review.

  • I have read the CONTRIBUTING.md guidelines.
  • My code follows the style guidelines of this project.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • New and existing unit tests pass locally with my changes.
  • I have added tests that prove my fix is effective or that my feature works.
  • I have deleted all non-relevant text in this pull request template.

decouple from concat, preparing for use within merge ops without prior concat
standardize language surrounding join operations (pyarrow only uses join wording).
also include validation for CITATION.cff
@falquaddoomi (Collaborator) left a comment

Nice work! I made a few comments. Generally, if you end up excluding some functions from the public interface, it's fine not to document them from an end user's perspective; but if they are intended for end users, I'd definitely include descriptions of what should be passed for each parameter. For example, it'd be good to see a sample input dataset in the documentation, an identification of all the files in the dataset, and how you'd use the library to process it, e.g., which specific file paths in the dataset are given to which methods.

Review threads were opened on the following files:

  • tests/utils/pylint_checker_no_ospath.py
  • docs/source/overview.md
  • docs/source/tutorial.md
  • pycytominer_transform/convert.py (multiple threads)
@d33bs (Member, Author) commented Jan 23, 2023

Thank you again @gwaybio and @falquaddoomi for your reviews! I've addressed or created new issues for the comments you provided.

@d33bs d33bs merged commit f359b68 into cytomining:main Jan 23, 2023
@d33bs d33bs deleted the enable-compartment-merge branch January 23, 2023 22:04
3 participants