-
Notifications
You must be signed in to change notification settings - Fork 28k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-48314][SS] Don't double cache files for FileStreamSource using Trigger.AvailableNow #46627
Conversation
@HeartSaVioR since you added the file caching originally back in the day |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good in overall. Left a few comments for better testing.
f | ||
} | ||
|
||
source.latestOffset(FileStreamSourceOffset(-1L), ReadLimit.maxFiles(5)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we check the result just for completeness sake?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to check files returned, and used the new setting from #45362 to verify correct number of files based on not caching
|
||
// Reading again leverages the files already tracked in allFilesForTriggerAvailableNow, | ||
// so no more listings need to happen | ||
source.latestOffset(FileStreamSourceOffset(-1L), ReadLimit.maxFiles(5)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 pending CI.
Thanks! Merging to master. |
What changes were proposed in this pull request?
Files don't need to be cached for reuse in
FileStreamSource
when usingTrigger.AvailableNow
because all files are already cached for the lifetime of the query inallFilesForTriggerAvailableNow
.Why are the changes needed?
As reported in https://issues.apache.org/jira/browse/SPARK-44924 (with a PR to address #45362), the hard coded cap of 10k files being cached can cause problems when using a maxFilesPerTrigger > 10k. It causes every other batch to be 10k files, which can greatly limit the throughput of a new streaming trying to catch up.
Does this PR introduce any user-facing change?
Every other streaming batch won't be 10k files if using Trigger.AvailableNow and maxFilesPerTrigger greater than 10k.
How was this patch tested?
New UT
Was this patch authored or co-authored using generative AI tooling?
No