Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-28866: [JAVA] Java Dataset API ScanOptions expansion #41646

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

jinchengchenghh
Copy link
Contributor

@jinchengchenghh jinchengchenghh commented May 14, 2024

Rationale for this change

What changes are included in this PR?

Support to add ArrowSchema to specify C++ CsvFragmentScanOptions.convert_options.column_types
And use Map to set the config, serialize in java and deserialize in C++ for CsvFragmentScanOptions

Are these changes tested?

new added UT.

Are there any user-facing changes?

No.

@jinchengchenghh jinchengchenghh marked this pull request as draft May 14, 2024 08:15
Copy link

⚠️ GitHub issue #28866 has been automatically assigned in GitHub to PR creator.

@jinchengchenghh
Copy link
Contributor Author

Can you help review this PR? If the framework is OK, I will add more common config in this PR. Thanks! @westonpace

*/
public ByteBuffer serialize() {
Map<String, String> options = Stream.concat(Stream.concat(readOptions.entrySet().stream(),
parseOptions.entrySet().stream()),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Insert all the options to a map because it is a easy implement, and now we don't have same option name in CPP parse_options and read_options, but to further extend, we may need to serialize more accurately. I'm open to here if you think we should serialize each option

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe having a better serialize option for each would be better. But I see your point, maybe we could do it in a follow up PR.

cc @lidavidm @westonpace

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels May 16, 2024
@lidavidm
Copy link
Member

CC @vibhatha

@@ -108,7 +108,7 @@ ARROW_SUBSTRAIT_BUILD_SHA256_CHECKSUM=f989a862f694e7dbb695925ddb7c4ce06aa6c51aca
ARROW_S2N_TLS_BUILD_VERSION=v1.3.35
ARROW_S2N_TLS_BUILD_SHA256_CHECKSUM=9d32b26e6bfcc058d98248bf8fc231537e347395dd89cf62bb432b55c5da990d
ARROW_THRIFT_BUILD_VERSION=0.16.0
ARROW_THRIFT_BUILD_SHA256_CHECKSUM=f460b5c1ca30d8918ff95ea3eb6291b3951cf518553566088f3f2be8981f6209
ARROW_THRIFT_BUILD_SHA256_CHECKSUM=df2931de646a366c2e5962af679018bca2395d586e00ba82d09c0379f14f8e7b
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Occasional change, for my local environment, will remove it

column_types[field->name()] = field->type();
}
} else {
return Status::Invalid("Not support this config " + it.first);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe:

Suggested change
return Status::Invalid("Not support this config " + it.first);
return Status::Invalid("Config " + it.first + " is not supported.");

}

if (!literal.has_map()) {
return Status::Invalid("Literal does not have map");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
return Status::Invalid("Literal does not have map");
return Status::Invalid("Literal does not have a map");

#endif
default:
std::string error_message =
"illegal file format id: " + std::to_string(file_format_id);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
"illegal file format id: " + std::to_string(file_format_id);
"Illegal file format id: " + std::to_string(file_format_id);

* @param config config map
* @return bufer to jni call argument, should be DirectByteBuffer
*/
default ByteBuffer serializeMap(Map<String, String> config) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this function just written to pass a Java Map to C++ via JNI?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

} else if (key == "quoting") {
options->parse_options.quoting = parseBool(value);
} else if (key == "column_type") {
int64_t schema_address = std::stol(value);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we check for possible -1 ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it in Java side to not add invalid schema address


import io.substrait.proto.Expression;

public class StringMapNode implements Serializable {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just looking at the functionality, I think what we have here is a util class which converts a particular map config to a particular Substrait protobuf message. Since this can be used in other cases, it could come under substrait.util package. And the toProtobuf could be mapToExpressionLiteral() ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also have doubts about having a separate class for this purpose though.

Copy link
Collaborator

@vibhatha vibhatha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only added a few comments. But I am going to go through the content once more.

@github-actions github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting committer review Awaiting committer review awaiting changes Awaiting changes labels May 22, 2024
@jinchengchenghh
Copy link
Contributor Author

Do you have further comments? @vibhatha @lidavidm

@vibhatha
Copy link
Collaborator

@jinchengchenghh will re-review this later today.

return arrow::dataset::CsvFragmentScanOptions::From(configs);
#endif
default:
return arrow::Status::Invalid("Illegal file format id: " ,file_format_id);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think we don't have proper C++ linting @lidavidm ?

Suggested change
return arrow::Status::Invalid("Illegal file format id: " ,file_format_id);
return arrow::Status::Invalid("Illegal file format id: ", file_format_id);

@@ -43,7 +45,8 @@ private JniWrapper() {
* @return the native pointer of the arrow::dataset::FileSystemDatasetFactory instance.
* @see FileFormat
*/
public native long makeFileSystemDatasetFactory(String uri, int fileFormat);
public native long makeFileSystemDatasetFactory(String uri, int fileFormat,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update Java docs?

@@ -54,7 +57,8 @@ private JniWrapper() {
* @return the native pointer of the arrow::dataset::FileSystemDatasetFactory instance.
* @see FileFormat
*/
public native long makeFileSystemDatasetFactory(String[] uris, int fileFormat);
public native long makeFileSystemDatasetFactoryWithFiles(String[] uris, int fileFormat,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update Java docs

@@ -80,7 +80,8 @@ private JniWrapper() {
* @return the native pointer of the arrow::dataset::Scanner instance.
*/
public native long createScanner(long datasetId, String[] columns, ByteBuffer substraitProjection,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update Java docs?

ByteBuffer serialize();

/**
* serialize the map.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
* serialize the map.
* Serialize the map.

assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
int rowCount = 0;
while (reader.loadNextBatch()) {
assertEquals("[1, 2, 3]", reader.getVectorSchemaRoot().getVector("Id").toString());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should we check all columns?

Copy link
Collaborator

@vibhatha vibhatha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jinchengchenghh Added a few more comments.

@vibhatha
Copy link
Collaborator

@jinchengchenghh will re-review this later today.

@jinchengchenghh sorry about the unexpected delay. I have added a few comments today.

@vibhatha vibhatha requested a review from lidavidm May 31, 2024 23:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants