Handling exceptions on watcher reload #105442
Conversation
Pinging @elastic/es-data-management (Team:Data Management)
Hi @sakurai-youhei, I've created a changelog YAML for you.
Thanks @sakurai-youhei. This seems like a good thing to do. A couple of questions though:
@masseyke Thank you for reviewing this PR. [1]
TBH, it's hard to cause the retry intentionally because it requires an exception at the part before […]. One way would be to make the refresh timeout, which is currently fixed at 5 seconds, configurable so the timeout can be reproduced easily (see the sketch after this reply), but I'm afraid that kind of extra change would be an unfavorable scope extension. If you have other ideas, please let me know, and I will explore the options.
[2]
Yes. Those passing tests capture the prerequisite behaviors behind #69842 that the change in this PR relies on. If either of them changes in the future, the change introduced through this PR may also need to be reconsidered, so I included the cases.
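A minimal sketch of how such a configurable refresh timeout might be declared with Elasticsearch's `Setting#timeSetting`; the setting key, default location, and properties here are made up for illustration and are not part of this PR:

```java
import org.elasticsearch.common.settings.Setting;
import org.elasticsearch.core.TimeValue;

class WatcherSettingsSketch {
    // Hypothetical setting; the key, class name, and properties are illustrative only.
    static final Setting<TimeValue> REFRESH_TIMEOUT = Setting.timeSetting(
        "xpack.watcher.trigger.refresh.timeout",   // made-up key
        TimeValue.timeValueSeconds(5),             // the currently hard-coded value
        Setting.Property.NodeScope,
        Setting.Property.Dynamic
    );
}
```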
I've got a lot going on, but I will try to take a look at this later this week. |
```diff
@@ -166,7 +166,9 @@ public void clusterChanged(ClusterChangedEvent event) {
     if (watcherService.validate(event.state())) {
         previousShardRoutings.set(localAffectedShardRoutings);
         if (state.get() == WatcherState.STARTED) {
-            watcherService.reload(event.state(), "new local watcher shard allocation ids");
+            watcherService.reload(event.state(), "new local watcher shard allocation ids", (exception) -> {
+                clearAllocationIds(); // will cause reload again
```
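For context, a minimal sketch of what the service side of this overload could look like; `WatcherServiceSketch` and `reloadWatches` are hypothetical stand-ins, and the real `WatcherService` reloads asynchronously and does considerably more:

```java
import java.util.function.Consumer;

// Hypothetical, simplified sketch; not the actual WatcherService.
class WatcherServiceSketch {
    void reload(Object clusterState, String reason, Consumer<Exception> failureHandler) {
        try {
            reloadWatches(clusterState, reason); // stop execution, reload watches, restart triggers
        } catch (Exception e) {
            // Surface the failure to the lifecycle service so it can clear
            // the allocation ids and force another reload attempt.
            failureHandler.accept(e);
        }
    }

    private void reloadWatches(Object clusterState, String reason) {
        // placeholder for the real reload work
    }
}
```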
Do we need to also set the state to WatcherState.STARTING here on exception when we clear the allocation ids? It's been a couple of years since I've been in here, so I'm not sure what bad things might happen if the state is STARTED but watcher is actually unavailable.
@masseyke I understand the point, but I don't think so. If the state changed to STARTING here, the watcher service would no longer be reloaded, which is the problematic state also reported in #44981. Since there are only four states, staying in STARTED (though it is effectively DEGRADED at that point) seems acceptable, in my opinion.
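To make the gating concrete, a self-contained sketch under the assumption that `WatcherState` has the four states named above; everything else in this class is an illustrative stand-in for the lifecycle service:

```java
import java.util.concurrent.atomic.AtomicReference;

// Illustrative only: WatcherState mirrors the four real states, but the
// rest of this class is a made-up stand-in for the lifecycle service.
class LifecycleSketch {
    enum WatcherState { STARTING, STARTED, STOPPING, STOPPED }

    private final AtomicReference<WatcherState> state = new AtomicReference<>(WatcherState.STARTED);

    void clusterChanged() {
        // Reload is gated on STARTED. Flipping the state to STARTING inside
        // the failure handler would make this guard skip every later attempt,
        // which is exactly the stuck state reported in #44981.
        if (state.get() == WatcherState.STARTED) {
            // watcherService.reload(...)
        }
    }
}
```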
OK that makes sense. And it's not going to be worse than the current situation (having it in STARTED state, but not doing anything).
Looks like a good change to me.
Exceptions during watcher reload, such as those thrown while loading watches, can prevent the trigger service from starting, leaving the watcher unable to trigger anything. Combined with the fact that a reload does not occur unless the routing table changes, the current exception handling can leave the watcher nonfunctional for an extended period.
This PR improves the exception handling so that the reload can occur again even if the routing table stays identical.
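A minimal, self-contained sketch of the retry mechanism, with hypothetical names throughout: clearing the remembered allocation ids makes the next cluster state update look like a routing change, so the reload runs again.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

// Hypothetical, simplified model of the retry loop; all names are illustrative.
class ReloadRetrySketch {
    private final AtomicReference<List<String>> previousShardRoutings = new AtomicReference<>(List.of());

    void clusterChanged(List<String> localAffectedShardRoutings) {
        // Normally, an unchanged routing table means the reload is skipped.
        if (localAffectedShardRoutings.equals(previousShardRoutings.get())) {
            return;
        }
        previousShardRoutings.set(localAffectedShardRoutings);
        reload(exception -> clearAllocationIds()); // on failure, forget the routings...
    }

    private void clearAllocationIds() {
        // ...so the next clusterChanged sees a mismatch and reloads again,
        // even though the routing table itself never changed.
        previousShardRoutings.set(List.of());
    }

    private void reload(Consumer<Exception> onFailure) {
        // placeholder for the real reload work
    }
}
```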
Closes #69842