[GCS FT] Mark job as finished for dead node (#40431) #40742
Merged
+189
−3
This blocks several customers, and many KubeRay users have reported the issue.
The risk of the fix is pretty low, but we should still rerun the release tests.
The issue is that ray list nodes times out before it can return the list of all nodes. The timeout occurs when GcsJobManager tries to get pending tasks from the driver on a node that is already dead (a killed head node): the call waits out a 2-minute timeout before the node is confirmed dead. The proposed solution is that when a node dies, we mark all jobs submitted to that head node as "finished", so the RPC calls for pending tasks are sent only to drivers on the current head node, which is most likely alive, and no timeout occurs.
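The idea can be illustrated with a small toy model. This is only a sketch in Python and not the actual GcsJobManager code from this PR; the class `ToyJobManager`, its methods, and the `rpc_targets` list (a stand-in for the real pending-task RPC) are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    driver_node_id: str
    finished: bool = False


class ToyJobManager:
    """Hypothetical stand-in for a job manager; not Ray's real API."""

    def __init__(self):
        self.jobs = {}
        # Records which drivers would be contacted; replaces a real RPC.
        self.rpc_targets = []

    def add_job(self, job_id, driver_node_id):
        self.jobs[job_id] = Job(job_id, driver_node_id)

    def on_node_dead(self, node_id):
        # Mark every job whose driver ran on the dead node as finished.
        for job in self.jobs.values():
            if job.driver_node_id == node_id:
                job.finished = True

    def get_all_job_info(self):
        # Only unfinished jobs are queried, so drivers on dead nodes
        # are never contacted and cannot cause a timeout.
        self.rpc_targets = [
            job.job_id for job in self.jobs.values() if not job.finished
        ]
        return list(self.jobs.values())


manager = ToyJobManager()
manager.add_job("job-1", "old-head")
manager.add_job("job-2", "new-head")
manager.on_node_dead("old-head")
manager.get_all_job_info()
print(manager.rpc_targets)
```

In this sketch, once the old head node is reported dead, only the job on the new head node remains a candidate for the pending-task query.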
Why are these changes needed?
Related issue number