
Handling exceptions on watcher reload #105442


Merged — 3 commits merged into elastic:main on Mar 11, 2024

Conversation

sakurai-youhei (Member)

Exceptions during watcher reload, such as those thrown while loading watches, may prevent the trigger service from starting, leaving the watcher unable to trigger anything. Coupled with the fact that a reload does not occur unless the routing table changes, the current exception handling can leave the watcher non-functional for some time.

This PR improves the exception handling to allow the reload to occur again even if the routing table stays identical.

Closes #69842
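For illustration, here is a minimal standalone sketch of the pattern this PR applies; the class ReloadOnFailureSketch and the interface WatcherServiceLike are hypothetical stand-ins, not the actual Elasticsearch classes. The idea: when a reload fails, clear the cached shard routings so that the next cluster-state update triggers another reload attempt even if the routing table is unchanged.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

// Hypothetical stand-in for the watcher lifecycle logic; not Elasticsearch code.
class ReloadOnFailureSketch {

    private final AtomicReference<List<String>> previousShardRoutings = new AtomicReference<>(List.of());

    interface WatcherServiceLike {
        void reload(List<String> shardRoutings, String reason, Consumer<Exception> failureListener);
    }

    void clusterChanged(List<String> localShardRoutings, WatcherServiceLike watcherService) {
        // Before this PR: an unchanged routing table means no reload, so a reload
        // that failed earlier is never retried until the routing table changes.
        if (localShardRoutings.equals(previousShardRoutings.get())) {
            return;
        }
        previousShardRoutings.set(localShardRoutings);
        watcherService.reload(localShardRoutings, "new local watcher shard allocation ids", exception -> {
            // With this PR's pattern: forget the routings recorded above so that the
            // next cluster-state update reloads again even if the table is identical.
            previousShardRoutings.set(List.of());
        });
    }
}
```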

@elasticsearchmachine elasticsearchmachine added the external-contributor Pull request authored by a developer outside the Elasticsearch team label Feb 13, 2024
@sakurai-youhei sakurai-youhei marked this pull request as ready for review February 13, 2024 13:25
@elasticsearchmachine elasticsearchmachine added the Team:Data Management Meta label for data/management team label Feb 13, 2024
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-data-management (Team:Data Management)

@sakurai-youhei sakurai-youhei marked this pull request as draft February 13, 2024 18:25
@sakurai-youhei sakurai-youhei marked this pull request as ready for review February 13, 2024 18:25
@elasticsearchmachine (Collaborator)

Hi @sakurai-youhei, I've created a changelog YAML for you.

@masseyke (Member)

Thanks @sakurai-youhei. This seems like a good thing to do. A couple of questions, though:
Do you think it would be difficult to cover this with an integration test? It would be great to have one that causes a failure during reload and verifies that the reload is retried.
Also, I notice that if I revert your change but leave the tests in place, 2 out of the 3 tests still pass. Was that intentional?

@sakurai-youhei (Member, Author)

sakurai-youhei commented Feb 15, 2024

@masseyke Thank you for reviewing this PR.

[1]

Do you think it would be difficult to cover this with an integration test? It would be great to have one that causes a failure during reload and verifies that the reload is retried.

To be honest, it's hard to trigger the retry intentionally, because it requires an exception in the code path before triggerService.start(watches) is called. #69842 shows that the exception originated from refreshing the watcher index, but I don't think that is reasonably reproducible.

One option would be to make the refresh timeout, which is currently fixed at 5 seconds, configurable so that the timeout could be reproduced easily, but I'm afraid that kind of extra change would be an unwelcome scope extension. If you have other ideas, please let me know and I will explore the options.

By the way, as a compromise, the retry path is exercised with a RuntimeException forcibly thrown in one of the unit test cases. (Sorry, I mixed up the test cases; the retry itself is not currently verified using any exception.)
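As a side note, here is a self-contained sketch of the kind of check being discussed, exercising the hypothetical ReloadOnFailureSketch from the description above rather than the real test harness: the first reload is forced to fail with a RuntimeException, the failure callback clears the cached routings, and a second notification with an identical routing table reloads again.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical check against the ReloadOnFailureSketch above; not Elasticsearch test code.
class ReloadRetrySketchCheck {
    public static void main(String[] args) {
        AtomicInteger reloadAttempts = new AtomicInteger();
        ReloadOnFailureSketch lifecycle = new ReloadOnFailureSketch();

        // Fake service: the first reload fails (think: a refresh timeout while loading
        // watches); every later reload succeeds.
        ReloadOnFailureSketch.WatcherServiceLike failOnce = (routings, reason, onFailure) -> {
            if (reloadAttempts.incrementAndGet() == 1) {
                onFailure.accept(new RuntimeException("simulated reload failure"));
            }
        };

        List<String> unchangedRoutings = List.of("shard-0", "shard-1");
        lifecycle.clusterChanged(unchangedRoutings, failOnce); // attempt 1: fails, cache cleared
        lifecycle.clusterChanged(unchangedRoutings, failOnce); // attempt 2: same table, retried anyway

        if (reloadAttempts.get() != 2) {
            throw new AssertionError("expected a retry after the failed reload, got " + reloadAttempts.get());
        }
        System.out.println("reload was retried despite an identical routing table");
    }
}
```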

[2]

Also, I notice that if I revert your change but leave the tests in place, 2 out of the 3 tests still pass. Was that intentional?

Yes. Those passing tests document the prerequisite behaviors behind #69842 that make the change in this PR necessary. If either of them changes in the future, the change introduced in this PR may also need to be reconsidered, so I included those cases.

@sakurai-youhei (Member, Author)

@masseyke Can I get your help in getting this change released? Issue #69842 harms the reliability of watcher executions in some circumstances, and I want to eliminate that problem with this PR.

@masseyke (Member)

masseyke commented Mar 6, 2024

I've got a lot going on, but I will try to take a look at this later this week.

```diff
@@ -166,7 +166,9 @@ public void clusterChanged(ClusterChangedEvent event) {
             if (watcherService.validate(event.state())) {
                 previousShardRoutings.set(localAffectedShardRoutings);
                 if (state.get() == WatcherState.STARTED) {
-                    watcherService.reload(event.state(), "new local watcher shard allocation ids");
+                    watcherService.reload(event.state(), "new local watcher shard allocation ids", (exception) -> {
+                        clearAllocationIds(); // will cause reload again
```
@masseyke (Member) commented Mar 8, 2024

Do we also need to set the state to WatcherState.STARTING here, on exception, when we clear the allocation IDs? It's been a couple of years since I've been in here, so I'm not sure what bad things might happen if the state is STARTED but watcher is actually unavailable.

@sakurai-youhei (Member, Author)

@masseyke I understand the point, but I don't think so. If the state were changed to STARTING here, the watcher service would no longer be reloaded, which is the problematic state that was also reported in #44981. Since there are only four states, staying in STARTED (even though it is effectively DEGRADED at that point) seems acceptable, in my opinion.
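For context, a compressed, hypothetical illustration of this point (the real WatcherState enum has the four values mentioned above; the class StateGateSketch and its methods are made up for the example): the reload path is gated on STARTED, so demoting the node to STARTING after a failed reload would suppress any further reload attempts.

```java
// Hypothetical, compressed illustration; mirrors the guard in the hunk above.
enum WatcherState { STARTING, STARTED, STOPPING, STOPPED }

class StateGateSketch {
    private WatcherState state = WatcherState.STARTED;

    // No reload unless the node-local state is STARTED.
    void onClusterChanged(Runnable reload) {
        if (state == WatcherState.STARTED) {
            reload.run();
        }
    }

    // If a failed reload demoted the state to STARTING, onClusterChanged would skip
    // every future reload, reproducing the stuck state described in #44981.
    void demoteToStartingOnFailure() {
        state = WatcherState.STARTING;
    }
}
```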

@masseyke (Member)

OK, that makes sense. And it's not going to be worse than the current situation (having it in the STARTED state, but not doing anything).

@masseyke (Member) left a review comment

Looks like a good change to me.

@masseyke masseyke added v8.13.1 and removed v8.12.3 labels Mar 11, 2024
@masseyke masseyke merged commit 9fe8b96 into elastic:main Mar 11, 2024
@elasticsearchmachine (Collaborator)

💚 Backport successful

Backported branches: 7.17, 8.13

sakurai-youhei added a commit to sakurai-youhei/elasticsearch that referenced this pull request Mar 11, 2024
sakurai-youhei added a commit to sakurai-youhei/elasticsearch that referenced this pull request Mar 11, 2024
Labels
auto-backport (Automatically create backport pull requests when merged), >bug, :Data Management/Watcher, external-contributor (Pull request authored by a developer outside the Elasticsearch team), Team:Data Management (Meta label for data/management team), v7.17.19, v8.13.1, v8.14.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Watcher should retry a failed reload
3 participants