Azure: Add FileIO that supports ADLSv2 storage #8303
Conversation
I am wondering how this differs from #4465, which has been lying dormant (and as far as I can see is just waiting for someone to hit the "merge" button) for over a year now?
The main difference is that this PR uses the data lake client API instead of the blob client API. File operations work the same way, but bulk and prefix operations will differ.
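For contrast, here is a minimal, hypothetical sketch of opening a file through the data lake client API; the account, container, and path below are placeholders, not values from this PR:

```java
import com.azure.identity.DefaultAzureCredentialBuilder;
import com.azure.storage.file.datalake.DataLakeFileClient;
import com.azure.storage.file.datalake.DataLakeFileSystemClient;
import com.azure.storage.file.datalake.DataLakeFileSystemClientBuilder;

public class DataLakeClientSketch {
  public static void main(String[] args) {
    // Data Lake client API: file-system/path oriented, with directory
    // semantics on hierarchical-namespace (ADLSv2) accounts, unlike the
    // flat blob client API
    DataLakeFileSystemClient fileSystem =
        new DataLakeFileSystemClientBuilder()
            .endpoint("https://myaccount.dfs.core.windows.net") // placeholder account
            .fileSystemName("mycontainer") // placeholder container
            .credential(new DefaultAzureCredentialBuilder().build())
            .buildClient();

    DataLakeFileClient file = fileSystem.getFileClient("warehouse/db/table/metadata.json");
    System.out.println(file.exists());
  }
}
```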
```gradle
dependencies {
  implementation platform(libs.azuresdk.bom)
  implementation "com.azure:azure-storage-file-datalake"
  implementation "com.azure:azure-identity"
}
```
For other integrations, we don't bundle the dependencies and only ship the Iceberg side. That keeps our bundle small and doesn't force any particular version on downstream consumers. It also avoids needing to do a lot of license and notice documentation work. Is that possible here? Is there a dependency bundle that we can use at runtime?
The `azure-bundle` project build was set up to bundle all of the necessary Azure dependencies in one shadow jar as an (optional) convenience to users, similar to the `aws-bundle` and `gcp-bundle` projects, which include only the necessary runtime libraries at the same version used for the Iceberg build and shade conflicting libraries. A user can opt to include their own Azure dependencies if desired and not use this at all. For example, all you would need to run with Spark is the Spark runtime and the Azure/AWS/GCP bundle. Neither Microsoft nor Google provides such a bundle. Amazon has one for AWS, but it is very large, which causes issues with some systems.

This is separate from the `azure` project build, which declares the Azure dependencies as `compileOnly` so they are not included with any runtime.
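As a hypothetical illustration of the two consumption options just described, a downstream build could do either of the following; the artifact coordinates follow the existing aws/gcp bundle naming pattern and the version strings are placeholders, not confirmed values:

```gradle
dependencies {
  // Option 1: the shaded convenience bundle with the required Azure runtime deps
  runtimeOnly "org.apache.iceberg:iceberg-azure-bundle:<iceberg-version>"

  // Option 2: bring your own Azure SDK versions alongside the thin module,
  // since the azure project declares its Azure dependencies as compileOnly
  // runtimeOnly "org.apache.iceberg:iceberg-azure:<iceberg-version>"
  // runtimeOnly "com.azure:azure-storage-file-datalake:<azure-version>"
  // runtimeOnly "com.azure:azure-identity:<azure-version>"
}
```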
Okay, I see. I didn't know about the other bundle projects. Looks like the LICENSE file is updated for those, but not the NOTICE. Did you check whether each bundled project has a NOTICE that we need to include?
I did not; I'll do that now for all three.
I added this. I'll open a separate PR for the AWS and GCP bundles.
The PR for the AWS and GCP bundles is here: #8323
Thanks! It's amazing that we can automate this now. It was such a giant pain to do this in the past!
```java
Preconditions.checkState(!closed, "Cannot seek: already closed");
Preconditions.checkArgument(newPos >= 0, "Cannot seek: position %s is negative", newPos);

// this allows a seek beyond the end of the stream but the next read will fail
```
Why allow seek beyond the end of the stream?
This was done to keep the behavior consistent with `S3InputStream`.
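For context, here is a minimal sketch of that lazy-seek contract, modeled on the `S3InputStream` behavior rather than quoting the actual ADLS code:

```java
import org.apache.iceberg.relocated.com.google.common.base.Preconditions;

// Illustrative only: validation is eager, but an out-of-range position
// only fails once a read is attempted.
class LazySeekSketch {
  private boolean closed = false;
  private long next = 0;

  public void seek(long newPos) {
    Preconditions.checkState(!closed, "Cannot seek: already closed");
    Preconditions.checkArgument(newPos >= 0, "Cannot seek: position %s is negative", newPos);

    // record the target position only; the stream is repositioned lazily on
    // the next read, so a seek beyond EOF surfaces as a read failure later
    this.next = newPos;
  }
}
```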
```java
 * Support</a>
 */
class ADLSLocation {
  private static final Pattern URI_PATTERN = Pattern.compile("^abfss?://(.+?)([/?#].*)?$");
```
Wouldn't it be safer to use `[^/?#]+` for the first group instead of using non-greedy matching?
Yes, thanks, I made this change.
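A self-contained sketch of the suggested pattern; the URI below is illustrative:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AdlsUriPatternSketch {
  // the character class [^/?#]+ stops the authority group at the first
  // delimiter, avoiding reliance on non-greedy backtracking
  private static final Pattern URI_PATTERN =
      Pattern.compile("^abfss?://([^/?#]+)([/?#].*)?$");

  public static void main(String[] args) {
    Matcher m = URI_PATTERN.matcher(
        "abfss://container@account.dfs.core.windows.net/path/to/file");
    if (m.matches()) {
      System.out.println(m.group(1)); // container@account.dfs.core.windows.net
      System.out.println(m.group(2)); // /path/to/file
    }
  }
}
```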
```java
String uriPath = matcher.group(2);
uriPath = uriPath == null ? "" : uriPath.startsWith("/") ? uriPath.substring(1) : uriPath;
this.path = uriPath.split("\\?", -1)[0].split("#", -1)[0];
```
If `uriPath` is null, then `path` is going to be an empty string? I generally try to avoid empty string as a default.
This was done primarily for the Azure API, which expects an empty string instead of null for the root.
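A hypothetical helper showing that convention with the data lake client (the `directoryFor` method name is made up for illustration):

```java
import com.azure.storage.file.datalake.DataLakeDirectoryClient;
import com.azure.storage.file.datalake.DataLakeFileSystemClient;

// Normalize a possibly-null path to "" so the client resolves the
// file-system root, per the convention described above.
class RootPathSketch {
  static DataLakeDirectoryClient directoryFor(DataLakeFileSystemClient fs, String path) {
    return fs.getDirectoryClient(path == null ? "" : path);
  }
}
```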
Thanks @bryanck for working on this, and @danielcweeks, @rdblue & @nastra for the review 🙌🏻
This PR adds a FileIO implementation for Azure Data Lake Storage Gen2. The URI format was kept consistent with Hadoop's Azure URI format to make any transition easier; however, TLS is always used with either the `abfs` or `abfss` scheme. Range reads were also implemented. The new FileIO was added as a delegate type in ResolvingFileIO. Both the prefix and bulk operation mixin interfaces were implemented as well.

To limit the scope of this PR, authorization was limited to using the default Azure credential chain, SAS token, or connection string. Enhancements can be addressed in a follow-up PR.
The project was added as a dependency to the Spark and Flink runtimes, similar to AWS and GCP. An Azure bundle project was also added for building a jar with the necessary Azure dependencies when running with Spark or Flink, similar to the existing AWS and GCP bundles.
Azurite doesn't yet support ADLSv2 directory operations, so mocks were used for prefix-related tests. Manual testing was performed against a real Azure account.