-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-28866: [JAVA] Java Dataset API ScanOptions expansion #41646
base: main
Are you sure you want to change the base?
Conversation
|
Can you help review this PR? If the framework is OK, I will add more common config in this PR. Thanks! @westonpace |
*/ | ||
public ByteBuffer serialize() { | ||
Map<String, String> options = Stream.concat(Stream.concat(readOptions.entrySet().stream(), | ||
parseOptions.entrySet().stream()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Insert all the options to a map because it is a easy implement, and now we don't have same option name in CPP parse_options and read_options, but to further extend, we may need to serialize more accurately. I'm open to here if you think we should serialize each option
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe having a better serialize option for each would be better. But I see your point, maybe we could do it in a follow up PR.
CC @vibhatha |
cpp/thirdparty/versions.txt
Outdated
@@ -108,7 +108,7 @@ ARROW_SUBSTRAIT_BUILD_SHA256_CHECKSUM=f989a862f694e7dbb695925ddb7c4ce06aa6c51aca | |||
ARROW_S2N_TLS_BUILD_VERSION=v1.3.35 | |||
ARROW_S2N_TLS_BUILD_SHA256_CHECKSUM=9d32b26e6bfcc058d98248bf8fc231537e347395dd89cf62bb432b55c5da990d | |||
ARROW_THRIFT_BUILD_VERSION=0.16.0 | |||
ARROW_THRIFT_BUILD_SHA256_CHECKSUM=f460b5c1ca30d8918ff95ea3eb6291b3951cf518553566088f3f2be8981f6209 | |||
ARROW_THRIFT_BUILD_SHA256_CHECKSUM=df2931de646a366c2e5962af679018bca2395d586e00ba82d09c0379f14f8e7b |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Occasional change, for my local environment, will remove it
cpp/src/arrow/dataset/file_csv.cc
Outdated
column_types[field->name()] = field->type(); | ||
} | ||
} else { | ||
return Status::Invalid("Not support this config " + it.first); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe:
return Status::Invalid("Not support this config " + it.first); | |
return Status::Invalid("Config " + it.first + " is not supported."); |
} | ||
|
||
if (!literal.has_map()) { | ||
return Status::Invalid("Literal does not have map"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
return Status::Invalid("Literal does not have map"); | |
return Status::Invalid("Literal does not have a map"); |
#endif | ||
default: | ||
std::string error_message = | ||
"illegal file format id: " + std::to_string(file_format_id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
"illegal file format id: " + std::to_string(file_format_id); | |
"Illegal file format id: " + std::to_string(file_format_id); |
* @param config config map | ||
* @return bufer to jni call argument, should be DirectByteBuffer | ||
*/ | ||
default ByteBuffer serializeMap(Map<String, String> config) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this function just written to pass a Java Map to C++ via JNI?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
} else if (key == "quoting") { | ||
options->parse_options.quoting = parseBool(value); | ||
} else if (key == "column_type") { | ||
int64_t schema_address = std::stol(value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we check for possible -1
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed it in Java side to not add invalid schema address
|
||
import io.substrait.proto.Expression; | ||
|
||
public class StringMapNode implements Serializable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just looking at the functionality, I think what we have here is a util class which converts a particular map config to a particular Substrait protobuf message. Since this can be used in other cases, it could come under substrait.util
package. And the toProtobuf
could be mapToExpressionLiteral()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also have doubts about having a separate class for this purpose though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only added a few comments. But I am going to go through the content once more.
@jinchengchenghh will re-review this later today. |
return arrow::dataset::CsvFragmentScanOptions::From(configs); | ||
#endif | ||
default: | ||
return arrow::Status::Invalid("Illegal file format id: " ,file_format_id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think we don't have proper C++ linting @lidavidm ?
return arrow::Status::Invalid("Illegal file format id: " ,file_format_id); | |
return arrow::Status::Invalid("Illegal file format id: ", file_format_id); |
@@ -43,7 +45,8 @@ private JniWrapper() { | |||
* @return the native pointer of the arrow::dataset::FileSystemDatasetFactory instance. | |||
* @see FileFormat | |||
*/ | |||
public native long makeFileSystemDatasetFactory(String uri, int fileFormat); | |||
public native long makeFileSystemDatasetFactory(String uri, int fileFormat, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update Java docs?
@@ -54,7 +57,8 @@ private JniWrapper() { | |||
* @return the native pointer of the arrow::dataset::FileSystemDatasetFactory instance. | |||
* @see FileFormat | |||
*/ | |||
public native long makeFileSystemDatasetFactory(String[] uris, int fileFormat); | |||
public native long makeFileSystemDatasetFactoryWithFiles(String[] uris, int fileFormat, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update Java docs
@@ -80,7 +80,8 @@ private JniWrapper() { | |||
* @return the native pointer of the arrow::dataset::Scanner instance. | |||
*/ | |||
public native long createScanner(long datasetId, String[] columns, ByteBuffer substraitProjection, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update Java docs?
ByteBuffer serialize(); | ||
|
||
/** | ||
* serialize the map. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
* serialize the map. | |
* Serialize the map. |
assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields()); | ||
int rowCount = 0; | ||
while (reader.loadNextBatch()) { | ||
assertEquals("[1, 2, 3]", reader.getVectorSchemaRoot().getVector("Id").toString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: should we check all columns?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jinchengchenghh Added a few more comments.
@jinchengchenghh sorry about the unexpected delay. I have added a few comments today. |
Rationale for this change
What changes are included in this PR?
Support to add ArrowSchema to specify C++ CsvFragmentScanOptions.convert_options.column_types
And use Map to set the config, serialize in java and deserialize in C++ for CsvFragmentScanOptions
Are these changes tested?
new added UT.
Are there any user-facing changes?
No.