
Enable Compartment Joins for Unified Parquet Output #22

Merged · 82 commits · Jan 23, 2023

Conversation

@d33bs (Member) commented Dec 7, 2022

Description

The changes in this PR enable joining compartment data into a single Parquet file. Along the way, many related changes were added. Thank you in advance for your comments and suggestions!

Outline of changes:

Design considerations:

SQL-based DuckDB joins:

Initial work toward this PR used PyArrow.Table.join() together with configuration presets that controlled join behavior via a Python dictionary (referred to below as a "join dictionary"). The structure of this join dictionary mimicked SQL-based options (for example, LEFT JOIN table ON x=y became {"left":"table","join_keys":["x"]}). The following points led me to change the code before opening this PR:

  • Using the join dictionary required a complex for-loop to join the entirety of the data, whereas SQL allows multiple joins to be stated in a single statement.
  • I couldn't justify reinventing SQL join syntax from a human-understandability standpoint; using SQL should help ensure the code is better understood and maintained over time. It also offers a cross-language way to manage this data that doesn't rely on Pythonic syntax or data structures (should a move from Python to another language be desired).
  • Using DuckDB enabled SQLite-based data ingest in addition to the join work itself (earlier in these changes I had relied on connector-x for SQLite data ingest due to its high-performance capabilities).
  • There may be additional benefits to using DuckDB and/or SQL in this way which have not yet been implemented (record batching, performance, etc.).
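To illustrate the first two points, here is a minimal sketch of expressing all compartment joins in one SQL statement. It uses Python's stdlib sqlite3 rather than DuckDB, and the table and column names (cytoplasm, cells, *_Area) are hypothetical, not the project's actual schema:

```python
import sqlite3

# Illustrative only: the PR itself uses DuckDB over CellProfiler-style
# SQLite output; stdlib sqlite3 shows the same idea with made-up tables.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE cytoplasm (ImageNumber INT, ObjectNumber INT, Cytoplasm_Area REAL)"
)
con.execute(
    "CREATE TABLE cells (ImageNumber INT, ObjectNumber INT, Cells_Area REAL)"
)
con.executemany(
    "INSERT INTO cytoplasm VALUES (?, ?, ?)", [(1, 1, 100.0), (1, 2, 120.0)]
)
con.executemany(
    "INSERT INTO cells VALUES (?, ?, ?)", [(1, 1, 90.0), (1, 2, 110.0)]
)

# Every compartment join is stated in a single SQL statement; contrast
# with a "join dictionary" such as
# {"left": "cells", "join_keys": ["ImageNumber", "ObjectNumber"]}
# driving a for-loop of pairwise PyArrow Table.join() calls.
rows = con.execute(
    """
    SELECT cyto.ImageNumber, cyto.ObjectNumber,
           cyto.Cytoplasm_Area, cells.Cells_Area
    FROM cytoplasm AS cyto
    LEFT JOIN cells
      ON cells.ImageNumber = cyto.ImageNumber
     AND cells.ObjectNumber = cyto.ObjectNumber
    """
).fetchall()
print(rows)  # [(1, 1, 100.0, 90.0), (1, 2, 120.0, 110.0)]
```

Adding a third or fourth compartment is another LEFT JOIN clause in the same statement, rather than another pass through a join loop.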

Citation references:

I wanted to make sure we acknowledge the great work of those who generated the datasets used for testing within this repo. Using a CITATION.cff file for this felt right, but I was unsure of the best way to both link to existing CITATION.cff files and intermix references to individuals not found within a CITATION.cff file. I'd welcome any thoughts on the best or most preferred way to do this so that those involved are credited.
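One possible direction, sketched below under the assumption that the CFF 1.2.0 `references` key fits this use case; all names and titles here are placeholders, not the project's actual citation data:

```yaml
# Hypothetical CITATION.cff sketch (CFF 1.2.0); entries are placeholders.
cff-version: 1.2.0
message: "If you use this software, please cite it using these metadata."
title: "pycytominer-transform"
authors:
  - family-names: "Example"
    given-names: "Author"
# Dataset creators can be credited via the `references` key, even when
# their work ships no CITATION.cff of its own.
references:
  - type: dataset
    title: "Example test dataset"
    authors:
      - family-names: "Example"
        given-names: "Data Creator"
```

For upstream projects that do publish a CITATION.cff, a `references` entry could mirror their metadata, while individuals without one can be listed directly as shown.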

What is the nature of your change?

  • Bug fix (fixes an issue).
  • Enhancement (adds functionality).
  • Breaking change (fix or feature that would cause existing functionality to not work as expected).
  • This change requires a documentation update.

Checklist

Please ensure that all boxes are checked before indicating that a pull request is ready for review.

  • I have read the CONTRIBUTING.md guidelines.
  • My code follows the style guidelines of this project.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • New and existing unit tests pass locally with my changes.
  • I have added tests that prove my fix is effective or that my feature works.
  • I have deleted all non-relevant text in this pull request template.

decouple from concat, preparing for use within merge ops without prior concat
standardize language surrounding join operations (pyarrow only uses join wording).
also include validation for CITATION.cff
@falquaddoomi (Collaborator) left a comment

Nice work! I made a few comments. Generally, if you end up excluding some functions from the public interface, it's fine not to document them from an end user's perspective; but if they are intended for end users, I'd definitely include descriptions of what should be passed for each parameter. For example, it'd be good to see a sample input dataset in the documentation, an identification of all the files in the dataset, and how you'd use the library to process it, e.g., which specific file paths in the dataset are given to which methods.

Review threads were opened on the following files:

  • tests/utils/pylint_checker_no_ospath.py
  • docs/source/overview.md
  • docs/source/tutorial.md
  • pycytominer_transform/convert.py (multiple threads)
@d33bs (Member, Author) commented Jan 23, 2023

Thank you again @gwaybio and @falquaddoomi for your reviews! I've addressed or created new issues for the comments you provided.

@d33bs d33bs merged commit f359b68 into cytomining:main Jan 23, 2023
@d33bs d33bs deleted the enable-compartment-merge branch January 23, 2023 22:04
3 participants