
Add golden table tests from delta spark/java kernel #295

Merged: zachschuermann merged 23 commits into delta-io:main on Aug 6, 2024

Conversation

@zachschuermann (Collaborator) commented Jul 30, 2024

Adding delta's existing golden table tests to the kernel integration test suite on a best-effort basis. Each golden table is a .tar.zst compressed folder of the table (test input) and expected data (to assert proper reads) - all are in kernel/tests/golden_data/.

Each golden-table.tar.zst is organized like:

example-test/
├── expected
│   └── part-00000-3b15cd4d-5002-4a49-8031-211fbcf09c15-c000.snappy.parquet
└── delta
    ├── _delta_log
    │   └── 00000000000000000000.json
    ├── part-00000-2abbde89-2d0f-465e-a2f0-3e84f1b84654-c000.snappy.parquet
    └── part-00001-5419c9a2-bb44-454f-a109-6e6c6f000a24-c000.snappy.parquet

The golden tests persisted in the delta repo are the 'input' of the test cases - i.e. the delta tables we would like to read (under the delta/ subdirectory).

Expected outputs were generated by reading the latest snapshot of each table from above with PySpark then persisted in the expected/ subdirectory.

A new integration suite, kernel/tests/golden_tables.rs, runs each of the new tests via a set of macros: (1) golden_test!: run a test function against the golden table; (2) negative_test!: run the test against the latest snapshot and expect it to fail; (3) skip_test!: skip the test, with a reason.

Run the new tests with `cargo t -p delta_kernel --test golden_tables`.
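A rough sketch of how macro-generated tests like these can work in Rust. This is not the actual suite: the macro bodies, the `full_scan` helper, and its signature are all hypothetical simplifications (the real macros in kernel/tests/golden_tables.rs generate `#[test]` functions and unpack the .tar.zst fixtures first).

```rust
// Hypothetical, simplified versions of the three macros described above.
macro_rules! golden_test {
    ($table:literal, $test_fn:expr) => {
        // run the supplied test function against the named golden table
        ($test_fn)($table).expect("golden test failed")
    };
}

macro_rules! negative_test {
    ($table:literal, $test_fn:expr) => {
        // reading the latest snapshot is expected to fail
        assert!(($test_fn)($table).is_err(), "expected failure for {}", $table)
    };
}

macro_rules! skip_test {
    ($table:literal : $reason:literal) => {
        eprintln!("skipping {}: {}", $table, $reason)
    };
}

// Placeholder for a real full-table-scan check against the expected/ parquet
// data; here it only validates its input so the sketch is runnable.
fn full_scan(table: &str) -> Result<(), String> {
    if table.is_empty() {
        return Err("empty table name".to_string());
    }
    Ok(())
}

fn main() {
    golden_test!("multi-part-checkpoint", full_scan);
    negative_test!("", full_scan);
    skip_test!("data-reader-timestamp_ntz-id-mode": "id column mapping mode not supported");
}
```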

test-log = { version = "0.2", default-features = false, features = ["trace"] }
tempfile = "3"
test-case = { version = "3.1.0" }
@zachschuermann (Collaborator, Author) commented:

looks like it wasn't used?

@zachschuermann zachschuermann marked this pull request as ready for review July 30, 2024 21:41
fmt
@zachschuermann zachschuermann requested review from nicklan and roeap July 30, 2024 22:09
fmt
Review thread on kernel/tests/read.rs (outdated, resolved).
@@ -387,14 +381,6 @@ macro_rules! assert_batches_sorted_eq {
};
}

fn to_arrow(data: Box<dyn EngineData>) -> DeltaResult<RecordBatch> {
@zachschuermann (Collaborator, Author) commented:

moved to common

@nicklan (Collaborator) left a comment:

nice, this is much better!

Left some comments, but mostly looking good, thanks.

Review threads on kernel/tests/golden_tables.rs (outdated, resolved).
full_scan_test!("multi-part-checkpoint");
full_scan_test!("only-checkpoint-files");

// TODO some of the parquet tests use projections
Collaborator commented:

we support projections, so why can't we do these?

@zachschuermann (Collaborator, Author) replied:

yea I think with your comment below I'll just try to unify all the tests under a golden_table! macro and we can always pass in a 'test function' - I'll make one for full table scan, scan with predicate, etc. (or maybe a function that has that stuff as params).

@zachschuermann (Collaborator, Author) added:

taking as follow-up. I can make issues for some of these TODOs?

skip_test!("data-reader-timestamp_ntz-id-mode": "id column mapping mode not supported");
full_scan_test!("data-reader-timestamp_ntz-name-mode");

// TODO test with predicate
Collaborator commented:

can you elaborate a bit on what exactly we need to do here?

@zachschuermann (Collaborator, Author) replied:

Same as the comment above - I think I'll be able to implement these, but will size how much work is needed.

@zachschuermann (Collaborator, Author) added:

actually going to just punt on these for a separate PR just so we can get this merged and iterate

Review threads on kernel/tests/golden_tables.rs (resolved).
Ok(Some(all_data))
}

// TODO: change to do something similar to dat tests instead of string comparison
Collaborator commented:

nit: explain what dat does. i.e. "change to use arrow's column Eq like dat does, instead of string comparison. Should print out string when test fails to make debugging easier"

This could be a good first issue.

@zachschuermann (Collaborator, Author) replied:

Good idea - I can go through some of these TODOs and make explicit issues?

Review threads on kernel/tests/golden_tables.rs (outdated, resolved).
@zachschuermann (Collaborator, Author) commented:

todo find/profile the long test

fmt
@zachschuermann zachschuermann requested a review from nicklan August 1, 2024 20:57
@zachschuermann (Collaborator, Author) commented:

todo find/profile the long test

golden_parquet_decimal_dictionaries* tests

@zachschuermann (Collaborator, Author) commented:

The slowness was due to printing tables for comparison - in particular the large tables in the decimal tests. Fixed now to just clear metadata from the record batches and then compare sorted record batches with the existing Eq implementation.
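The fix described above can be illustrated with a simplified, dependency-free sketch. The real code operates on arrow RecordBatches and their Eq implementation; here a "batch" is just schema-level metadata plus string rows, and the `Batch` type and `batches_equal` helper are hypothetical stand-ins.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for a record batch: schema metadata plus rows.
struct Batch {
    metadata: HashMap<String, String>,
    rows: Vec<Vec<String>>,
}

// Clear metadata, sort rows, then use structural equality - instead of
// rendering both tables to strings and diffing them (the slow path).
fn batches_equal(mut actual: Batch, mut expected: Batch) -> bool {
    actual.metadata.clear();
    expected.metadata.clear();
    actual.rows.sort();
    expected.rows.sort();
    actual.rows == expected.rows
}

fn main() {
    // same rows in a different order, with differing metadata, still match
    let actual = Batch {
        metadata: HashMap::from([("created_by".to_string(), "kernel".to_string())]),
        rows: vec![vec!["2".to_string()], vec!["1".to_string()]],
    };
    let expected = Batch {
        metadata: HashMap::new(),
        rows: vec![vec!["1".to_string()], vec!["2".to_string()]],
    };
    assert!(batches_equal(actual, expected));
}
```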

@nicklan (Collaborator) left a comment:

looking pretty good, just a few last things

Review thread on kernel/tests/golden_tables.rs (outdated, resolved).
.map(normalize_col)
.collect::<Vec<_>>();

let left: RecordBatch =
Collaborator commented:

We're creating a record batch here out of the normalized cols, then immediately deconstructing it into columns again to sort them.

I think you should be able to just pass the columns as a slice to sort_record_batch (and probably rename it sort_columns or similar).

Alternately, do it more like in DAT, where you first sort the cols, then iterate over each column, normalize it, and then check that it's equal. That lets you be a bit nicer with the error message:

fn assert_columns_match(actual: &[Arc<dyn Array>], expected: &[Arc<dyn Array>]) {
    for (actual, expected) in actual.iter().zip(expected) {
        let actual = normalize_col(actual.clone());
        let expected = normalize_col(expected.clone());
        // note that array equality includes data_type equality
        // See: https://arrow.apache.org/rust/arrow_data/equal/fn.equal.html
        assert_eq!(
            &actual, &expected,
            "Column data didn't match. Got {actual:?}, expected {expected:?}"
        );
    }
}

@zachschuermann (Collaborator, Author) replied:

I'm doing this to rely on the schema + cols check for the whole record batch. Seemed like DAT was doing something less clean, checking schema and data equality separately? But yea, wonder if I should deconstruct, normalize, sort, then put back together?

Review thread on kernel/tests/golden_tables.rs (outdated, resolved).
@zachschuermann zachschuermann requested a review from nicklan August 3, 2024 01:46
@roeap (Collaborator) left a comment:

LGTM! We do need to exclude some more tests (fixing them, I assume, is a bit more work), but otherwise this should give us a great deal more confidence that things work as they should!

fix
@nicklan (Collaborator) left a comment:

LGTM! Thanks for iterating on this, and for the couple of critical bugfixes!

@zachschuermann zachschuermann merged commit 0a3834b into delta-io:main Aug 6, 2024
9 checks passed
@zachschuermann zachschuermann deleted the golden-table-tests-3 branch August 6, 2024 19:49