Fix double-pausing shard snapshot #109148


DaveCTurner
Contributor

Closes #109143

@DaveCTurner DaveCTurner added >bug :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v8.14.1 v8.15.0 labels May 29, 2024

@elasticsearchmachine
Collaborator

Hi @DaveCTurner, I've created a changelog YAML for you.

@DaveCTurner DaveCTurner marked this pull request as ready for review May 31, 2024 06:05
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine elasticsearchmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label May 31, 2024
Member

@ywangd ywangd left a comment


LGTM

I think this fixes ES-8563 as well. Could you please cross link between this PR and ES-8563? Thanks!

Comment on lines 3357 to 3360
    if (updatedState.state() == ShardState.PAUSED_FOR_NODE_REMOVAL) {
        // leave subsequent entries for this shard alone until this one is unpaused
        iterator.remove();
    }
Member


It still feels a bit odd that tryStartNextTaskAfterSnapshotUpdated can initialize a shard snapshot without checking whether that shard is already active in a snapshot that comes before it. I think it relied on the fact that all the updated states a data node can send are "non-active", which was true until we added PAUSED_FOR_NODE_REMOVAL. Theoretically this could happen again in future if we add another similar ShardState. That said, I can see it would be quite a lot of work to change things in tryStartNextTaskAfterSnapshotUpdated, so I am good with this fix.

Contributor Author


++ valid point, I think we can at least add an assertion to catch a future bug in this area - see 615cebf.
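
For readers following the thread, here is a minimal, self-contained sketch of the idea discussed above. It is not the Elasticsearch code: `ShardPauseSketch`, its reduced `State` enum, `applyUpdate` and `checkAtMostOneActive` are all hypothetical names invented for illustration, and the real `ShardState` has more values. The sketch models the entries for a single shard as an ordered list, skips starting the next queued entry when the update is `PAUSED_FOR_NODE_REMOVAL`, and then checks the invariant that at most one entry for the shard is active, which is roughly the kind of assertion the reply suggests.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model for illustration only; not the real SnapshotsService logic.
final class ShardPauseSketch {

    // Reduced, assumed set of states; the real ShardState enum has more values.
    enum State {
        QUEUED, INIT, PAUSED_FOR_NODE_REMOVAL, SUCCESS;

        // Assumption: a paused shard snapshot still counts as active.
        boolean isActive() {
            return this == INIT || this == PAUSED_FOR_NODE_REMOVAL;
        }
    }

    // Applies an update for one shard to the entry at 'index' and returns the
    // new per-snapshot states for that shard, in snapshot order.
    static List<State> applyUpdate(List<State> entries, int index, State updated) {
        List<State> result = new ArrayList<>(entries);
        result.set(index, updated);
        if (updated != State.PAUSED_FOR_NODE_REMOVAL) {
            // The shard is free again: start the next queued entry, if any.
            for (int i = index + 1; i < result.size(); i++) {
                if (result.get(i) == State.QUEUED) {
                    result.set(i, State.INIT);
                    break;
                }
            }
        }
        // Otherwise leave subsequent entries alone until this one is unpaused.
        checkAtMostOneActive(result);
        return result;
    }

    // Invariant check: a shard must not be active in more than one entry.
    static void checkAtMostOneActive(List<State> entries) {
        long active = entries.stream().filter(State::isActive).count();
        if (active > 1) {
            throw new IllegalStateException("shard active in " + active + " entries: " + entries);
        }
    }

    public static void main(String[] args) {
        List<State> before = Arrays.asList(State.INIT, State.QUEUED);
        // prints [PAUSED_FOR_NODE_REMOVAL, QUEUED]
        System.out.println(applyUpdate(before, 0, State.PAUSED_FOR_NODE_REMOVAL));
        // prints [SUCCESS, INIT]
        System.out.println(applyUpdate(before, 0, State.SUCCESS));
    }
}
```

In this model, starting the queued entry while the first one is paused would leave two active entries for the same shard, and the check would fail, so a newly added active state that is not handled would be caught early rather than surfacing as a double pause.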

@DaveCTurner DaveCTurner added auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) auto-backport-and-merge labels May 31, 2024

@elasticsearchmachine elasticsearchmachine merged commit f416688 into elastic:main May 31, 2024
15 checks passed
@DaveCTurner DaveCTurner deleted the 2024/05/29/snapshot-double-pause-fix branch May 31, 2024 12:37
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request May 31, 2024
@elasticsearchmachine
Collaborator

💚 Backport successful

Status Branch Result
8.14

elasticsearchmachine pushed a commit that referenced this pull request May 31, 2024
Closes #109143
@DaveCTurner DaveCTurner restored the 2024/05/29/snapshot-double-pause-fix branch June 17, 2024 06:16
Labels
auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) >bug :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. v8.14.1 v8.15.0
Development

Successfully merging this pull request may close these issues.

[CI] SnapshotStressTestsIT testRandomActivities failing
3 participants