
Handling exceptions on watcher reload #105442


Merged — 3 commits merged into elastic:main on Mar 11, 2024

Conversation

sakurai-youhei (Member)

Exceptions during watcher reload, such as those thrown while loading watches, may prevent the trigger service from starting, leaving the watcher unable to trigger anything. Coupled with the fact that a reload does not occur unless the routing table changes, the current exception handling can leave the watcher non-functional for some time.

This PR improves the exception handling to allow the reload to occur again even if the routing table stays identical.

Closes #69842
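For illustration, here is a minimal standalone sketch of the pattern this PR applies; the class ReloadOnFailureSketch and the interface WatcherServiceLike are hypothetical stand-ins, not the actual Elasticsearch classes. The idea: when a reload fails, clear the cached shard routings so that the next cluster-state update triggers another reload attempt even if the routing table is unchanged.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

// Hypothetical stand-in for the watcher lifecycle logic; not Elasticsearch code.
class ReloadOnFailureSketch {

    private final AtomicReference<List<String>> previousShardRoutings = new AtomicReference<>(List.of());

    interface WatcherServiceLike {
        void reload(List<String> shardRoutings, String reason, Consumer<Exception> failureListener);
    }

    void clusterChanged(List<String> localShardRoutings, WatcherServiceLike watcherService) {
        // Before this PR: an unchanged routing table means no reload, so a reload
        // that failed earlier is never retried until the routing table changes.
        if (localShardRoutings.equals(previousShardRoutings.get())) {
            return;
        }
        previousShardRoutings.set(localShardRoutings);
        watcherService.reload(localShardRoutings, "new local watcher shard allocation ids", exception -> {
            // With this PR's pattern: forget the routings recorded above so that the
            // next cluster-state update reloads again even if the table is identical.
            previousShardRoutings.set(List.of());
        });
    }
}
```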

@elasticsearchmachine elasticsearchmachine added the external-contributor Pull request authored by a developer outside the Elasticsearch team label Feb 13, 2024
@sakurai-youhei sakurai-youhei marked this pull request as ready for review February 13, 2024 13:25
@elasticsearchmachine elasticsearchmachine added the Team:Data Management Meta label for data/management team label Feb 13, 2024
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-data-management (Team:Data Management)

@sakurai-youhei sakurai-youhei marked this pull request as draft February 13, 2024 18:25
@sakurai-youhei sakurai-youhei marked this pull request as ready for review February 13, 2024 18:25
@elasticsearchmachine (Collaborator)

Hi @sakurai-youhei, I've created a changelog YAML for you.

@masseyke (Member)

Thanks @sakurai-youhei. This seems like a good thing to do. A couple of questions, though:
Do you think it would be difficult to cover this with an integration test? It would be great to have one that causes a failure during reload and verifies that the reload is retried.
Also, I notice that if I revert your change but leave the tests in place, 2 out of the 3 tests still pass. Was that intentional?

@sakurai-youhei (Member, Author)

sakurai-youhei commented Feb 15, 2024

@masseyke Thank you for reviewing this PR.

[1]

Do you think it would be difficult to cover this with an integration test? It would be great to have one that causes a failure during reload and verifies that the reload is retried.

To be honest, it's hard to trigger the retry intentionally, because it requires an exception in the code path before triggerService.start(watches) is called. #69842 shows that the exception originated from refreshing the watcher index, but I don't think that is reasonably reproducible.

One option would be to make the refresh timeout, which is currently fixed at 5 seconds, configurable so that the timeout could be reproduced easily, but I'm afraid that kind of extra change would be an unwelcome scope extension. If you have other ideas, please let me know and I will explore the options.

By the way, as a compromise, the retry path is exercised with a RuntimeException forcibly thrown in one of the unit test cases. (Sorry, I mixed up the test cases; the retry itself is not currently verified using any exception.)
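As a side note, here is a self-contained sketch of the kind of check being discussed, exercising the hypothetical ReloadOnFailureSketch from the description above rather than the real test harness: the first reload is forced to fail with a RuntimeException, the failure callback clears the cached routings, and a second notification with an identical routing table reloads again.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical check against the ReloadOnFailureSketch above; not Elasticsearch test code.
class ReloadRetrySketchCheck {
    public static void main(String[] args) {
        AtomicInteger reloadAttempts = new AtomicInteger();
        ReloadOnFailureSketch lifecycle = new ReloadOnFailureSketch();

        // Fake service: the first reload fails (think: a refresh timeout while loading
        // watches); every later reload succeeds.
        ReloadOnFailureSketch.WatcherServiceLike failOnce = (routings, reason, onFailure) -> {
            if (reloadAttempts.incrementAndGet() == 1) {
                onFailure.accept(new RuntimeException("simulated reload failure"));
            }
        };

        List<String> unchangedRoutings = List.of("shard-0", "shard-1");
        lifecycle.clusterChanged(unchangedRoutings, failOnce); // attempt 1: fails, cache cleared
        lifecycle.clusterChanged(unchangedRoutings, failOnce); // attempt 2: same table, retried anyway

        if (reloadAttempts.get() != 2) {
            throw new AssertionError("expected a retry after the failed reload, got " + reloadAttempts.get());
        }
        System.out.println("reload was retried despite an identical routing table");
    }
}
```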

[2]

Also, I notice that if I revert your change but leave the tests in place, 2 out of the 3 tests still pass. Was that intentional?

Yes. Those passing tests document the prerequisite behaviors behind #69842 that make the change in this PR necessary. If either of them changes in the future, the change introduced in this PR may also need to be reconsidered, so I included those cases.

@sakurai-youhei (Member, Author)

@masseyke Can I get your help in getting this change released? Issue #69842 harms the reliability of watcher executions in some circumstances, and I want to eliminate that problem with this PR.

@masseyke (Member)

masseyke commented Mar 6, 2024

I've got a lot going on, but I will try to take a look at this later this week.

```diff
@@ -166,7 +166,9 @@ public void clusterChanged(ClusterChangedEvent event) {
             if (watcherService.validate(event.state())) {
                 previousShardRoutings.set(localAffectedShardRoutings);
                 if (state.get() == WatcherState.STARTED) {
-                    watcherService.reload(event.state(), "new local watcher shard allocation ids");
+                    watcherService.reload(event.state(), "new local watcher shard allocation ids", (exception) -> {
+                        clearAllocationIds(); // will cause reload again
```
@masseyke (Member) commented Mar 8, 2024

Do we also need to set the state to WatcherState.STARTING here, on exception, when we clear the allocation IDs? It's been a couple of years since I've been in here, so I'm not sure what bad things might happen if the state is STARTED but watcher is actually unavailable.

@sakurai-youhei (Member, Author)

@masseyke I understand the point, but I don't think so. If the state were changed to STARTING here, the watcher service would no longer be reloaded, which is the problematic state that was also reported in #44981. Since there are only four states, staying in STARTED (even though it is effectively DEGRADED at that point) seems acceptable, in my opinion.
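For context, a compressed, hypothetical illustration of this point (the real WatcherState enum has the four values mentioned above; the class StateGateSketch and its methods are made up for the example): the reload path is gated on STARTED, so demoting the node to STARTING after a failed reload would suppress any further reload attempts.

```java
// Hypothetical, compressed illustration; mirrors the guard in the hunk above.
enum WatcherState { STARTING, STARTED, STOPPING, STOPPED }

class StateGateSketch {
    private WatcherState state = WatcherState.STARTED;

    // No reload unless the node-local state is STARTED.
    void onClusterChanged(Runnable reload) {
        if (state == WatcherState.STARTED) {
            reload.run();
        }
    }

    // If a failed reload demoted the state to STARTING, onClusterChanged would skip
    // every future reload, reproducing the stuck state described in #44981.
    void demoteToStartingOnFailure() {
        state = WatcherState.STARTING;
    }
}
```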

@masseyke (Member)

OK, that makes sense. And it's not going to be worse than the current situation (having it in the STARTED state, but not doing anything).

@masseyke (Member) left a review comment

Looks like a good change to me.

@masseyke masseyke added v8.13.1 and removed v8.12.3 labels Mar 11, 2024
@masseyke masseyke merged commit 9fe8b96 into elastic:main Mar 11, 2024
@elasticsearchmachine (Collaborator)

💚 Backport successful

Backported branches: 7.17, 8.13

sakurai-youhei added a commit to sakurai-youhei/elasticsearch that referenced this pull request Mar 11, 2024
sakurai-youhei added a commit to sakurai-youhei/elasticsearch that referenced this pull request Mar 11, 2024
Labels
auto-backport (Automatically create backport pull requests when merged), >bug, :Data Management/Watcher, external-contributor (Pull request authored by a developer outside the Elasticsearch team), Team:Data Management (Meta label for data/management team), v7.17.19, v8.13.1, v8.14.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Watcher should retry a failed reload
3 participants