
Add golden table tests from delta spark/java kernel #295

Merged: zachschuermann merged 23 commits into delta-io:main on Aug 6, 2024

Conversation

@zachschuermann (Collaborator) commented Jul 30, 2024

Adding delta's existing golden table tests to the kernel integration test suite on a best-effort basis. Each golden table is a .tar.zst compressed folder of the table (test input) and expected data (to assert proper reads) - all are in kernel/tests/golden_data/.

Each golden-table.tar.zst is organized like:

example-test/
├── expected
│   └── part-00000-3b15cd4d-5002-4a49-8031-211fbcf09c15-c000.snappy.parquet
└── delta
    ├── _delta_log
    │   └── 00000000000000000000.json
    ├── part-00000-2abbde89-2d0f-465e-a2f0-3e84f1b84654-c000.snappy.parquet
    └── part-00001-5419c9a2-bb44-454f-a109-6e6c6f000a24-c000.snappy.parquet

The golden tests persisted in the delta repo are the 'input' of the test cases - i.e. the delta tables we would like to read (under the delta/ subdirectory).

Expected outputs were generated by reading the latest snapshot of each table from above with PySpark then persisted in the expected/ subdirectory.

A new integration suite, kernel/tests/golden_tables.rs, runs each of the new tests via a set of macros: (1) golden_test!: run a test function against the golden table; (2) negative_test!: run the test against the latest snapshot and expect it to fail; (3) skip_test!: skip the test, with a reason.

Run the new tests with `cargo t -p delta_kernel --test golden_tables`.
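A rough sketch of how macro-generated tests like these can work in Rust. This is not the actual suite: the macro bodies, the `full_scan` helper, and its signature are all hypothetical simplifications (the real macros in kernel/tests/golden_tables.rs generate `#[test]` functions and unpack the .tar.zst fixtures first).

```rust
// Hypothetical, simplified versions of the three macros described above.
macro_rules! golden_test {
    ($table:literal, $test_fn:expr) => {
        // run the supplied test function against the named golden table
        ($test_fn)($table).expect("golden test failed")
    };
}

macro_rules! negative_test {
    ($table:literal, $test_fn:expr) => {
        // reading the latest snapshot is expected to fail
        assert!(($test_fn)($table).is_err(), "expected failure for {}", $table)
    };
}

macro_rules! skip_test {
    ($table:literal : $reason:literal) => {
        eprintln!("skipping {}: {}", $table, $reason)
    };
}

// Placeholder for a real full-table-scan check against the expected/ parquet
// data; here it only validates its input so the sketch is runnable.
fn full_scan(table: &str) -> Result<(), String> {
    if table.is_empty() {
        return Err("empty table name".to_string());
    }
    Ok(())
}

fn main() {
    golden_test!("multi-part-checkpoint", full_scan);
    negative_test!("", full_scan);
    skip_test!("data-reader-timestamp_ntz-id-mode": "id column mapping mode not supported");
}
```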

test-log = { version = "0.2", default-features = false, features = ["trace"] }
tempfile = "3"
test-case = { version = "3.1.0" }
@zachschuermann (Collaborator, Author) commented:

looks like it wasn't used?

@zachschuermann zachschuermann marked this pull request as ready for review July 30, 2024 21:41
fmt
@zachschuermann zachschuermann requested review from nicklan and roeap July 30, 2024 22:09
fmt
Review thread on kernel/tests/read.rs (outdated, resolved).
@@ -387,14 +381,6 @@ macro_rules! assert_batches_sorted_eq {
};
}

fn to_arrow(data: Box<dyn EngineData>) -> DeltaResult<RecordBatch> {
@zachschuermann (Collaborator, Author) commented:

moved to common

@nicklan (Collaborator) left a comment:

nice, this is much better!

Left some comments, but mostly looking good, thanks.

Review threads on kernel/tests/golden_tables.rs (outdated, resolved).
full_scan_test!("multi-part-checkpoint");
full_scan_test!("only-checkpoint-files");

// TODO some of the parquet tests use projections
Collaborator commented:

we support projections, so why can't we do these?

@zachschuermann (Collaborator, Author) replied:

yea I think with your comment below I'll just try to unify all the tests under a golden_table! macro and we can always pass in a 'test function' - I'll make one for full table scan, scan with predicate, etc. (or maybe a function that has that stuff as params).

@zachschuermann (Collaborator, Author) added:

taking as follow-up. I can make issues for some of these TODOs?

skip_test!("data-reader-timestamp_ntz-id-mode": "id column mapping mode not supported");
full_scan_test!("data-reader-timestamp_ntz-name-mode");

// TODO test with predicate
Collaborator commented:

can you elaborate a bit on what exactly we need to do here?

@zachschuermann (Collaborator, Author) replied:

Same as the comment above - I think I'll be able to implement these, but will size how much work is needed.

@zachschuermann (Collaborator, Author) added:

actually going to just punt on these for a separate PR just so we can get this merged and iterate

Review threads on kernel/tests/golden_tables.rs (resolved).
Ok(Some(all_data))
}

// TODO: change to do something similar to dat tests instead of string comparison
Collaborator commented:

nit: explain what dat does. i.e. "change to use arrow's column Eq like dat does, instead of string comparison. Should print out string when test fails to make debugging easier"

This could be a good first issue.

@zachschuermann (Collaborator, Author) replied:

Good idea - I can go through some of these TODOs and make explicit issues?

Review threads on kernel/tests/golden_tables.rs (outdated, resolved).
@zachschuermann (Collaborator, Author) commented:

todo find/profile the long test

fmt
@zachschuermann zachschuermann requested a review from nicklan August 1, 2024 20:57
@zachschuermann (Collaborator, Author) commented:

todo find/profile the long test

golden_parquet_decimal_dictionaries* tests

@zachschuermann (Collaborator, Author) commented:

The slowness was due to printing tables for comparison - in particular the large tables in the decimal tests. Fixed now to just clear metadata from the record batches and then compare sorted record batches with the existing Eq implementation.
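The fix described above can be illustrated with a simplified, dependency-free sketch. The real code operates on arrow RecordBatches and their Eq implementation; here a "batch" is just schema-level metadata plus string rows, and the `Batch` type and `batches_equal` helper are hypothetical stand-ins.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for a record batch: schema metadata plus rows.
struct Batch {
    metadata: HashMap<String, String>,
    rows: Vec<Vec<String>>,
}

// Clear metadata, sort rows, then use structural equality - instead of
// rendering both tables to strings and diffing them (the slow path).
fn batches_equal(mut actual: Batch, mut expected: Batch) -> bool {
    actual.metadata.clear();
    expected.metadata.clear();
    actual.rows.sort();
    expected.rows.sort();
    actual.rows == expected.rows
}

fn main() {
    // same rows in a different order, with differing metadata, still match
    let actual = Batch {
        metadata: HashMap::from([("created_by".to_string(), "kernel".to_string())]),
        rows: vec![vec!["2".to_string()], vec!["1".to_string()]],
    };
    let expected = Batch {
        metadata: HashMap::new(),
        rows: vec![vec!["1".to_string()], vec!["2".to_string()]],
    };
    assert!(batches_equal(actual, expected));
}
```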

@nicklan (Collaborator) left a comment:

looking pretty good, just a few last things

Review thread on kernel/tests/golden_tables.rs (outdated, resolved).
.map(normalize_col)
.collect::<Vec<_>>();

let left: RecordBatch =
Collaborator commented:

We're creating a record batch here out of the normalized cols, then immediately deconstructing it into columns again to sort them.

I think you should be able to just pass the columns as a slice to sort_record_batch (and probably rename it sort_columns or similar).

Alternately, do it more like in DAT, where you first sort the cols, then iterate over each column, normalize it, and then check that it's equal. That lets you be a bit nicer with the error message:

fn assert_columns_match(actual: &[Arc<dyn Array>], expected: &[Arc<dyn Array>]) {
    for (actual, expected) in actual.iter().zip(expected) {
        let actual = normalize_col(actual.clone());
        let expected = normalize_col(expected.clone());
        // note that array equality includes data_type equality
        // See: https://arrow.apache.org/rust/arrow_data/equal/fn.equal.html
        assert_eq!(
            &actual, &expected,
            "Column data didn't match. Got {actual:?}, expected {expected:?}"
        );
    }
}

@zachschuermann (Collaborator, Author) replied:

I'm doing this to rely on the schema + cols check for the whole record batch. Seemed like DAT was doing something less clean, checking schema and data equality separately? But yea, wonder if I should deconstruct, normalize, sort, then put back together?

Review thread on kernel/tests/golden_tables.rs (outdated, resolved).
@zachschuermann zachschuermann requested a review from nicklan August 3, 2024 01:46
@roeap (Collaborator) left a comment:

LGTM! We do need to exclude some more tests (fixing them, I assume, is a bit more work), but otherwise this should give us a great deal more confidence that things work as they should!

fix
@nicklan (Collaborator) left a comment:

LGTM! Thanks for iterating on this, and for the couple of critical bugfixes!

@zachschuermann zachschuermann merged commit 0a3834b into delta-io:main Aug 6, 2024
9 checks passed
@zachschuermann zachschuermann deleted the golden-table-tests-3 branch August 6, 2024 19:49