[Kernel][Writes] Add support of inserting data into tables #3030

vkorukanti · 2024-05-02T18:02:22Z

Description

(Split from #2944)

Adds support for inserting data into the table.

How was this patch tested?

Tests for inserting into partitioned and unpartitioned tables with various combinations of data types and partition values. The tests also verify whether the table is ready for a checkpoint after a commit.

allisonport-db · 2024-05-03T01:52:05Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/util/InternalUtils.java

+            .collect(Collectors.toMap(e -> e.getKey().toLowerCase(), Map.Entry::getValue));
+    }
+
+    public static int findColIndex(StructType schema, String colNameLowerCase) {


Do we need this to avoid case sensitivity? Maybe add a comment saying so, for future reference?

Done, and moved this to SchemaUtils.
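For readers following the case-sensitivity question, here is a minimal sketch of a case-insensitive column lookup. It is not the SchemaUtils code from this PR: it takes a plain `List<String>` of column names instead of Kernel's `StructType`, and the class name `ColIndexSketch` is made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not the Kernel implementation.
final class ColIndexSketch {
    // Returns the index of the first column whose name matches ignoring case, or -1.
    static int findColIndex(List<String> columnNames, String colName) {
        String target = colName.toLowerCase(Locale.ROOT);
        for (int i = 0; i < columnNames.size(); i++) {
            if (columnNames.get(i).toLowerCase(Locale.ROOT).equals(target)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> cols = Arrays.asList("id", "Part1", "PART2");
        System.out.println(findColIndex(cols, "part1"));
        System.out.println(findColIndex(cols, "missing"));
    }
}
```

Lower-casing with `Locale.ROOT` keeps the comparison independent of the JVM's default locale.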

allisonport-db · 2024-05-03T01:56:59Z

kernel/kernel-api/src/main/java/io/delta/kernel/Transaction.java

+
+        String targetDirectory = getTargetDirectory(
+                getTableRoot(transactionState),
+                toLowerCaseList(getPartitionColumnsList(transactionState)),


Are partition column names always lowercase in the file path? If so, this should be documented in the function as the expected input.

Refactored and fixed it to always preserve the case as given by the connector when the table is created.
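A rough sketch of what "preserve the case as given when the table is created" could look like when building a partition directory. Everything here is an assumption for illustration: the class and method names are hypothetical, values are plain strings, and path escaping and null partition values are deliberately not handled.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not the Kernel implementation.
final class TargetDirectorySketch {
    static String targetDirectory(
            String tableRoot,
            List<String> partitionColumns,
            Map<String, String> partitionValues) {
        // Index the supplied values by lower-cased name so the lookup is case-insensitive.
        Map<String, String> byLowerCaseName = new HashMap<>();
        for (Map.Entry<String, String> e : partitionValues.entrySet()) {
            byLowerCaseName.put(e.getKey().toLowerCase(Locale.ROOT), e.getValue());
        }
        StringBuilder path = new StringBuilder(tableRoot);
        for (String col : partitionColumns) {
            String value = byLowerCaseName.get(col.toLowerCase(Locale.ROOT));
            // The directory name uses the column name exactly as the table declares it.
            path.append('/').append(col).append('=').append(value);
        }
        return path.toString();
    }
}
```

The point of the sketch is that lower-casing is confined to the lookup; the name written into the path comes from the table's own partition column list.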

allisonport-db · 2024-05-03T01:58:43Z

kernel/kernel-api/src/main/java/io/delta/kernel/Transaction.java

+                partitionValuesLowerCaseName);
+        return new DataWriteContext(
+                targetDirectory,
+                partitionValuesLowerCaseName,


Also, shouldn't the lowercase nature of this be documented somewhere? The keys here will differ from, e.g., transaction.getPartitionColumns(), right?

This is no longer exposed to the connector.

allisonport-db · 2024-05-03T02:01:01Z

kernel/kernel-api/src/main/java/io/delta/kernel/Transaction.java

+                    Row addFileRow = AddFile.convertDataFileStatus(
+                            tableRoot,
+                            dataFileStatus,
+                            dataWriteContext.getPartitionValues(),


Aren't these keys now lowercase? Isn't that incorrect, or is that how they are stored?

I simplified this logic. Now it always preserves the case as given by the connector when the table is created.

kernel/kernel-api/src/main/java/io/delta/kernel/internal/actions/AddFile.java

allisonport-db · 2024-05-03T02:23:00Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/util/InternalUtils.java

+     *             child paths.
+     * @return
+     */
+    public static Path relativizePath(Path child, URI root) {


Do we have a test that uses this? I didn't see one; can we add one?

I will add it in a follow-up PR. Given that this code calls another well-tested API, it should be OK.
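The reply does not say which API `relativizePath` delegates to. Assuming it is something like `java.net.URI.relativize`, a follow-up test could check behavior along these lines. The class `RelativizeSketch` and its signature are hypothetical and work on `URI` values only, not on Kernel's `Path` type.

```java
import java.net.URI;

// Hypothetical illustration only; assumes java.net.URI.relativize is the underlying API.
final class RelativizeSketch {
    // Returns the child's path relative to root, or the child's own path
    // when the child is not located under root.
    static String relativize(URI root, URI child) {
        return root.relativize(child).getPath();
    }

    public static void main(String[] args) {
        URI root = URI.create("file:/tmp/tbl/");
        System.out.println(relativize(root, URI.create("file:/tmp/tbl/part=1/f.parquet")));
        System.out.println(relativize(root, URI.create("file:/other/f.parquet")));
    }
}
```

`URI.relativize` returns the child unchanged when it does not sit under the root, which is the case a dedicated test would most usefully pin down.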

allisonport-db · 2024-05-03T02:26:55Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/util/PartitionUtils.java

+                    if (partValue == null) {
+                        return new Tuple2<>(partColName, (String) null);
+                    } else {
+                        return new Tuple2<>(partColName, serializePartitionValue(partValue));
+                    }


Can we make sure we have tests with null partition values?
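As a sketch of the null handling under discussion: the helper below turns partition values into strings while keeping null entries. It is not the PartitionUtils code; `String.valueOf` stands in for the real `serializePartitionValue`, and the class name is made up. One detail worth knowing when testing this area is that `Collectors.toMap` throws a `NullPointerException` for null values, so the sketch uses a plain loop.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not the Kernel implementation.
final class PartitionValueSketch {
    static Map<String, String> serialize(Map<String, Object> partitionValues) {
        Map<String, String> result = new HashMap<>();
        for (Map.Entry<String, Object> e : partitionValues.entrySet()) {
            Object value = e.getValue();
            // A null partition value stays null instead of becoming the string "null".
            result.put(e.getKey(), value == null ? null : String.valueOf(value));
        }
        return result;
    }
}
```

A test for null partition values would then assert both that the key is present and that its value is null.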

vkorukanti · 2024-05-03T22:27:04Z

kernel/kernel-api/src/main/java/io/delta/kernel/Transaction.java

+                    Row addFileRow = AddFile.convertDataFileStatus(
+                            tableRoot,
+                            dataFileStatus,
+                            dataWriteContext.getPartitionValues(),


I simplified this logic. Now it always preserves the case as given by the connector when the table is created.

vkorukanti · 2024-05-03T22:28:47Z

kernel/kernel-api/src/main/java/io/delta/kernel/Transaction.java

+
+        String targetDirectory = getTargetDirectory(
+                getTableRoot(transactionState),
+                toLowerCaseList(getPartitionColumnsList(transactionState)),


Refactored and fixed it to always preserve the case as given by the connector when the table is created.

vkorukanti · 2024-05-03T22:29:00Z

kernel/kernel-api/src/main/java/io/delta/kernel/Transaction.java

+                partitionValuesLowerCaseName);
+        return new DataWriteContext(
+                targetDirectory,
+                partitionValuesLowerCaseName,


This is no longer exposed to the connector.

kernel/kernel-api/src/main/java/io/delta/kernel/internal/actions/AddFile.java

vkorukanti · 2024-05-03T22:33:10Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/util/InternalUtils.java

+     *             child paths.
+     * @return
+     */
+    public static Path relativizePath(Path child, URI root) {


I will add it in a follow-up PR. Given that this code calls another well-tested API, it should be OK.

vkorukanti · 2024-05-03T22:36:21Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/util/InternalUtils.java

+            .collect(Collectors.toMap(e -> e.getKey().toLowerCase(), Map.Entry::getValue));
+    }
+
+    public static int findColIndex(StructType schema, String colNameLowerCase) {


Done, and moved this to SchemaUtils.

kernel/kernel-defaults/src/test/scala/io/delta/kernel/defaults/CreateCheckpointSuite.scala

allisonport-db

Minor comments. The partition name stuff is a lot cleaner, thanks :)

allisonport-db · 2024-05-05T17:43:09Z

kernel/kernel-api/src/main/java/io/delta/kernel/Transaction.java

@@ -104,7 +118,37 @@ static CloseableIterator<FilteredColumnarBatch> transformLogicalData(
            Row transactionState,
            CloseableIterator<FilteredColumnarBatch> dataIter,
            Map<String, Literal> partitionValues) {
-        throw new UnsupportedOperationException("Not implemented yet");
+        List<String> partitionColNames = getPartitionColumnsList(transactionState);
+        validatePartitionValues(partitionColNames, partitionValues);


Should we validate the types here too?

Actually, what do we even use the partitionValues for here? Isn't this param unused?

This is one of the discussions where we concluded that taking the partition values as input forces the connector to not pass data from multiple partitions.

Type validation makes sense. Adding those.

Can you maybe add a comment about why we have partitionValues there then? For future reference
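To make the validation discussed in this thread concrete, here is one possible shape for it: the caller must supply exactly one value per partition column, matched case-insensitively, and each non-null value must have the expected type. This is a sketch under assumptions, not the `validatePartitionValues` added in the PR; it uses Java classes in place of Kernel's data types and `Literal`, and all names are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not the Kernel implementation.
final class PartitionValidationSketch {
    static void validate(
            Map<String, Class<?>> partitionColTypes,
            Map<String, Object> partitionValues) {
        Map<String, Object> byLowerCaseName = new HashMap<>();
        for (Map.Entry<String, Object> e : partitionValues.entrySet()) {
            byLowerCaseName.put(e.getKey().toLowerCase(Locale.ROOT), e.getValue());
        }
        // Exactly one value per partition column: no extras, none missing.
        if (byLowerCaseName.size() != partitionColTypes.size()) {
            throw new IllegalArgumentException(
                "Expected values for exactly " + partitionColTypes.keySet());
        }
        for (Map.Entry<String, Class<?>> col : partitionColTypes.entrySet()) {
            String key = col.getKey().toLowerCase(Locale.ROOT);
            if (!byLowerCaseName.containsKey(key)) {
                throw new IllegalArgumentException(
                    "Missing value for partition column " + col.getKey());
            }
            Object value = byLowerCaseName.get(key);
            // Null is allowed; any other value must match the column's type.
            if (value != null && !col.getValue().isInstance(value)) {
                throw new IllegalArgumentException(
                    "Wrong type for partition column " + col.getKey());
            }
        }
    }
}
```

Because a single map of values is passed per call, each call can only describe one partition, which matches the rationale given above for keeping the parameter.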

allisonport-db · 2024-05-05T18:05:35Z

kernel/kernel-api/src/main/java/io/delta/kernel/engine/JsonHandler.java

-     *     <li>{@code map}: only a {@code map} with {@code string} key type is supported</li>
+     *     <li>{@code map}: only a {@code map} with {@code string} key type is supported. If an
+     *     entry value is {@code null}, it should be written to the file.</li>


Why do we need this change? How are partitionValues serialized in Delta Spark? Does this fit with the other JSON serialization rules?

It turns out Spark writes the null values in partitionValues to the Delta log. Yeah, this is one of those undocumented details, which was found through testing.

How is this done though? Since I thought we saw that null values weren't written for maps some other way?

val addFile = AddFile(
  path = "sdfsdf.parquet",
  partitionValues = Map("a" -> "b", "c" -> null),
  size = 12345,
  modificationTime = 54321,
  dataChange = true,
  stats = null
)
val json = addFile.json
assert(json == "sdfsd")

returns {"add":{"path":"sdfsdf.parquet","partitionValues":{"a":"b","c":null},"size":12345,"modificationTime":54321,"dataChange":true}}

Basically, when the ObjectMapper is configured with .setSerializationInclusion(Include.NON_ABSENT), it doesn't write any property whose value is null (stats in the above example), but null values in a map are always written.

Applying that to Kernel: struct fields that are null are not written, while null values in a map are written out.
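The rule can be illustrated with a small hand-rolled serializer. This is only a sketch of the rule itself, written against the standard library; it is neither Jackson nor the Kernel JsonHandler, it performs no string escaping, and the class name `NullRuleSketch` is made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical illustration only; no escaping, not a general JSON writer.
final class NullRuleSketch {
    // "Struct" rule: a field whose value is null is omitted entirely.
    static String structToJson(Map<String, Object> fields) {
        StringJoiner json = new StringJoiner(",", "{", "}");
        for (Map.Entry<String, Object> field : fields.entrySet()) {
            if (field.getValue() == null) {
                continue;
            }
            json.add("\"" + field.getKey() + "\":" + valueToJson(field.getValue()));
        }
        return json.toString();
    }

    // "Map" rule: an entry whose value is null is still written, as JSON null.
    static String mapToJson(Map<?, ?> entries) {
        StringJoiner json = new StringJoiner(",", "{", "}");
        for (Map.Entry<?, ?> entry : entries.entrySet()) {
            Object value = entry.getValue();
            json.add("\"" + entry.getKey() + "\":"
                + (value == null ? "null" : valueToJson(value)));
        }
        return json.toString();
    }

    private static String valueToJson(Object value) {
        if (value instanceof Map) {
            return mapToJson((Map<?, ?>) value);
        }
        if (value instanceof Number || value instanceof Boolean) {
            return value.toString();
        }
        return "\"" + value + "\"";
    }

    public static void main(String[] args) {
        Map<String, String> partitionValues = new LinkedHashMap<>();
        partitionValues.put("a", "b");
        partitionValues.put("c", null);
        Map<String, Object> add = new LinkedHashMap<>();
        add.put("path", "f.parquet");
        add.put("partitionValues", partitionValues);
        add.put("stats", null);
        System.out.println(structToJson(add));
    }
}
```

In the `main` example the null `stats` field disappears from the output while the null entry `c` inside `partitionValues` is kept, mirroring the Spark output quoted above.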

kernel/kernel-api/src/main/java/io/delta/kernel/Transaction.java

kernel/kernel-defaults/src/main/java/io/delta/kernel/defaults/engine/DefaultParquetHandler.java

kernel/kernel-defaults/src/test/scala/io/delta/kernel/defaults/DeltaTableWriteSuiteBase.scala

allisonport-db · 2024-05-05T18:54:36Z

kernel/kernel-defaults/src/test/scala/io/delta/kernel/defaults/DeltaTableWritesSuite.scala

+
+        verifyCommitResult(commitResult0, expVersion = 0, expIsReadyForCheckpoint = false)
+        verifyCommitInfo(tblPath, version = 0, expPartCols, operation = WRITE)
+        verifyWrittenContent(tblPath, schema, expV0Data)


Is this enough to be sure the add.partitionValues has the right case sensitivity?

kernel/kernel-api/src/main/java/io/delta/kernel/internal/util/PartitionUtils.java

kernel/kernel-defaults/src/test/scala/io/delta/kernel/defaults/DeltaTableWritesSuite.scala

allisonport-db

LGTM but could you follow up on the remaining question I had? Just for my understanding

vkorukanti added the kernel label May 2, 2024

vkorukanti requested a review from allisonport-db May 2, 2024 18:02

vkorukanti force-pushed the insertData branch 2 times, most recently from 04235fd to 9a2e59c Compare May 3, 2024 00:00

allisonport-db reviewed May 3, 2024

vkorukanti force-pushed the insertData branch from 9a2e59c to 17fe088 Compare May 3, 2024 22:54

vkorukanti requested a review from allisonport-db May 3, 2024 22:55

vkorukanti force-pushed the insertData branch 3 times, most recently from 6cc1a69 to ffda586 Compare May 4, 2024 02:17

[Kernel][Writes] Add support of inserting data into tables

44275c8

vkorukanti force-pushed the insertData branch 2 times, most recently from 863215c to f89a592 Compare May 5, 2024 18:47

more tests

98815ff

vkorukanti force-pushed the insertData branch from f89a592 to 98815ff Compare May 5, 2024 19:03

vkorukanti commented May 5, 2024

allisonport-db reviewed May 5, 2024

vkorukanti added 2 commits May 5, 2024 12:56

review

5263991

style

15dfcf2

allisonport-db approved these changes May 5, 2024

review

9a29f84

vkorukanti merged commit 7f199fe into delta-io:master May 5, 2024
10 checks passed

vkorukanti deleted the insertData branch May 9, 2024 02:44
