Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Kernel][Writes] Add support of inserting data into tables #3030

Merged
merged 5 commits into from
May 5, 2024

Conversation

vkorukanti
Copy link
Collaborator

@vkorukanti vkorukanti commented May 2, 2024

Description

(Split from #2944)

Adds support for inserting data into the table.

How was this patch tested?

Tests for inserting into partitioned and unpartitioned tables with various combinations of the types, partition values etc. Also tests the checkpoint is ready to create.

@vkorukanti vkorukanti requested a review from allisonport-db May 2, 2024 18:02
@vkorukanti vkorukanti force-pushed the insertData branch 2 times, most recently from 04235fd to 9a2e59c Compare May 3, 2024 00:00
.collect(Collectors.toMap(e -> e.getKey().toLowerCase(), Map.Entry::getValue));
}

public static int findColIndex(StructType schema, String colNameLowerCase) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this to avoid case sensitivity? Maybe comment that for the future?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done and moved this to SchemaUtils


String targetDirectory = getTargetDirectory(
getTableRoot(transactionState),
toLowerCaseList(getPartitionColumnsList(transactionState)),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are partition column names always lowercase in the file path? If this should be documented in the function for the expected input

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored and fixed it to always preserve the case as given by the connector when the table is created.

partitionValuesLowerCaseName);
return new DataWriteContext(
targetDirectory,
partitionValuesLowerCaseName,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also the lower-case nature of this should be documented somewhere? Because the keys here will be different from like transaction.getPartitionColumns() right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is no longer exposed to the connect.

Row addFileRow = AddFile.convertDataFileStatus(
tableRoot,
dataFileStatus,
dataWriteContext.getPartitionValues(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't these now lower case keys? Isn't that incorrect or is that how they are stored?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I simplified this logic. Now it always preserves the case as given by the connector when the table is created.

* child paths.
* @return
*/
public static Path relativizePath(Path child, URI root) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a test that uses this? I didn't see one can we add one?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add it in a follow up PR. Given this code calling another well-tested API, it should be ok.

Comment on lines 131 to 137
if (partValue == null) {
return new Tuple2<>(partColName, (String) null);
} else {
return new Tuple2<>(partColName, serializePartitionValue(partValue));
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make sure we have tests with null partition values?

@vkorukanti vkorukanti requested a review from allisonport-db May 3, 2024 22:55
@vkorukanti vkorukanti force-pushed the insertData branch 3 times, most recently from 6cc1a69 to ffda586 Compare May 4, 2024 02:17

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
@vkorukanti vkorukanti force-pushed the insertData branch 2 times, most recently from 863215c to f89a592 Compare May 5, 2024 18:47
Row addFileRow = AddFile.convertDataFileStatus(
tableRoot,
dataFileStatus,
dataWriteContext.getPartitionValues(),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I simplified this logic. Now it always preserves the case as given by the connector when the table is created.


String targetDirectory = getTargetDirectory(
getTableRoot(transactionState),
toLowerCaseList(getPartitionColumnsList(transactionState)),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored and fixed it to always preserve the case as given by the connector when the table is created.

partitionValuesLowerCaseName);
return new DataWriteContext(
targetDirectory,
partitionValuesLowerCaseName,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is no longer exposed to the connect.

* child paths.
* @return
*/
public static Path relativizePath(Path child, URI root) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add it in a follow up PR. Given this code calling another well-tested API, it should be ok.

.collect(Collectors.toMap(e -> e.getKey().toLowerCase(), Map.Entry::getValue));
}

public static int findColIndex(StructType schema, String colNameLowerCase) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done and moved this to SchemaUtils

Copy link
Collaborator

@allisonport-db allisonport-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments. The partition name stuff is a lot cleaner thanks :)

@@ -104,7 +118,37 @@ static CloseableIterator<FilteredColumnarBatch> transformLogicalData(
Row transactionState,
CloseableIterator<FilteredColumnarBatch> dataIter,
Map<String, Literal> partitionValues) {
throw new UnsupportedOperationException("Not implemented yet");
List<String> partitionColNames = getPartitionColumnsList(transactionState);
validatePartitionValues(partitionColNames, partitionValues);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we validate the types here too?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, what do we even use the partitionValues for here? Isn't this param unused?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is one of the discussions where we concluded that taking the partition values as input forces the connector to not pass data from multiple partitions.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type validation makes sense. Adding those.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you maybe add a comment about why we have partitionValues there then? For future reference

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added.

Comment on lines -113 to +114
* <li>{@code map}: only a {@code map} with {@code string} key type is supported</li>
* <li>{@code map}: only a {@code map} with {@code string} key type is supported. If an
* entry value is {@code null}, it should be written to the file.</li>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this change? How are partitionValues serialized in delta spark? Is this with the other JSON serialization rules..

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns out Spark writes the null values in partitionValues to Delta Log. Yeah, this is one of those undocumented detail, which was found through testing.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this done though? Since I thought we saw that null values weren't written for maps some other way?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

    val addFile = AddFile(
      path = "sdfsdf.parquet",
      partitionValues = Map("a" -> "b", "c" -> null),
      size = 12345,
      modificationTime = 54321,
      dataChange = true,
      stats = null
    )
    val json = addFile.json
    assert(json == "sdfsd")

returns {"add":{"path":"sdfsdf.parquet","partitionValues":{"a":"b","c":null},"size":12345,"modificationTime":54321,"dataChange":true}}

Basically, the ObjectMapper, when .setSerializationInclusion(Include.NON_ABSENT) doesn't write any property (in the above example stats) whose value is null, but nulls in the map are always written.

Applying that to Kernel: any struct fields that are null - not written. For any nulls in Map, written out.


verifyCommitResult(commitResult0, expVersion = 0, expIsReadyForCheckpoint = false)
verifyCommitInfo(tblPath, version = 0, expPartCols, operation = WRITE)
verifyWrittenContent(tblPath, schema, expV0Data)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this enough to be sure the add.partitionValues has the right case sensitivity?

vkorukanti added 2 commits May 5, 2024 12:56
Copy link
Collaborator

@allisonport-db allisonport-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but could you follow up on the remaining question I had? Just for my understanding

@vkorukanti vkorukanti merged commit 7f199fe into delta-io:master May 5, 2024
10 checks passed
vkorukanti added a commit to vkorukanti/delta that referenced this pull request May 5, 2024
…a-io#3030)

(Split from delta-io#2944)

Adds support for inserting data into the table.

Tests for inserting into partitioned and unpartitioned tables with
various combinations of the types, partition values etc. Also tests the
checkpoint is ready to create.
@vkorukanti vkorukanti deleted the insertData branch May 9, 2024 02:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants