[CI] SearchableSnapshotActionIT testSearchableSnapshotsInHotPhasePinnedToHotNodes failing #107421

luigidellaquila · 2024-04-12T13:46:41Z

Build scan:
https://gradle-enterprise.elastic.co/s/unsdhiimapjfc/tests/:x-pack:plugin:ilm:qa:multi-node:javaRestTest/org.elasticsearch.xpack.ilm.actions.SearchableSnapshotActionIT/testSearchableSnapshotsInHotPhasePinnedToHotNodes

Reproduction line:

gradlew ':x-pack:plugin:ilm:qa:multi-node:javaRestTest' --tests "org.elasticsearch.xpack.ilm.actions.SearchableSnapshotActionIT.testSearchableSnapshotsInHotPhasePinnedToHotNodes" -Dtests.seed=63B915F409E7F1D1 -Dtests.locale=es-AR -Dtests.timezone=US/Mountain -Druntime.java=21

Applicable branches:
main

Reproduces locally?:
No

Failure history:
Failure dashboard for org.elasticsearch.xpack.ilm.actions.SearchableSnapshotActionIT#testSearchableSnapshotsInHotPhasePinnedToHotNodes

Failure excerpt:

org.elasticsearch.client.ResponseException: method [DELETE], host [http://[::1]:59427], URI [_data_stream/logs-ihhfczesvn], status line [HTTP/1.1 400 Bad Request]
{"error":{"root_cause":[{"type":"snapshot_in_progress_exception","reason":"Cannot delete indices that are being snapshotted: [[.ds-logs-ihhfczesvn-2024.04.12-000002/ABSsl8IPTn-zW16PB5RMwg]]. Try again after snapshot finishes or cancel the currently running snapshot."}],"type":"snapshot_in_progress_exception","reason":"Cannot delete indices that are being snapshotted: [[.ds-logs-ihhfczesvn-2024.04.12-000002/ABSsl8IPTn-zW16PB5RMwg]]. Try again after snapshot finishes or cancel the currently running snapshot."},"status":400}

  at __randomizedtesting.SeedInfo.seed([63B915F409E7F1D1:1653EFC6B3966860]:0)
  at org.elasticsearch.client.RestClient.convertResponse(RestClient.java:351)
  at org.elasticsearch.client.RestClient.performRequest(RestClient.java:317)
  at org.elasticsearch.client.RestClient.performRequest(RestClient.java:292)
  at org.elasticsearch.xpack.ilm.actions.SearchableSnapshotActionIT.testSearchableSnapshotsInHotPhasePinnedToHotNodes(SearchableSnapshotActionIT.java:873)
  at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
  at java.lang.reflect.Method.invoke(Method.java:580)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:1583)

The text was updated successfully, but these errors were encountered:

elasticsearchmachine · 2024-04-12T13:47:04Z

Pinging @elastic/es-distributed (Team:Distributed)

idegtiarenko · 2024-05-17T07:23:05Z

The failure reported above concerns ILM,

In particular the test creates a data stream (with backing index .ds-logs-ihhfczesvn-2024.04.12-000001) and rolls it over to create a searchable snapshot out of it. In the reported failure the second rollover was happening and we were snapshotting .ds-logs-ihhfczesvn-2024.04.12-000002 (even after the datastream and policy were removed).

I am re-labeling to data management to check why.

elasticsearchmachine · 2024-05-17T07:24:33Z

Pinging @elastic/es-data-management (Team:Data Management)

nielsbauman · 2024-05-17T09:36:07Z

Leaving some notes here for when I get back to this:

ILM rolls over data stream after inserting 1 doc:

[2024-04-12T13:35:00,899][INFO ][o.e.x.c.i.WaitForRolloverReadyStep] [javaRestTest-3] index [.ds-logs-ihhfczesvn-2024.04.12-000001] is ready for rollover, conditions: [{[min_docs: 1]=true, [max_primary_shard_docs: 200000000]=false, [max_docs: 1]=true}]
....
[2024-04-12T13:35:00,985][DEBUG][o.e.x.i.IndexLifecycleRunner] [javaRestTest-3] [.ds-logs-ihhfczesvn-2024.04.12-000001] running policy with async action step [{"phase":"hot","action":"rollover","name":"attempt-rollover"}]
[2024-04-12T13:35:00,994][INFO ][o.e.c.m.MetadataCreateIndexService] [javaRestTest-3] [.ds-logs-ihhfczesvn-2024.04.12-000002] creating index, cause [rollover_data_stream], templates [wtsrqyic], shards [1]/[1]
Another rollover gets executed:

[2024-04-12T13:35:01,179][INFO ][o.e.c.m.MetadataCreateIndexService] [javaRestTest-3] [.ds-logs-ihhfczesvn-2024.04.12-000003] creating index, cause [rollover_data_stream], templates [wtsrqyic], shards [1]/[1]

The second index is still in the initialization phase, so this rollover isn't caused by another document being ingested in the second index for some reason.
ILM proceeds to initialize the second and third index, while the first index is still awaiting the rollover request:

[2024-04-12T13:35:01,358][TRACE][o.e.x.i.IndexLifecycleRunner] [javaRestTest-3] [.ds-logs-ihhfczesvn-2024.04.12-000001] retrieved current step key: {"phase":"hot","action":"rollover","name":"attempt-rollover"}

Note this attempt-rollover step isn't waiting for the shards to become active; that's what we have the wait-for-active-shards step for.
ILM logs the callback after the rollover request:

[2024-04-12T13:35:01,359][TRACE][o.e.x.i.IndexLifecycleRunner] [javaRestTest-3] cs-change-async-action-callback, [.ds-logs-ihhfczesvn-2024.04.12-000001], current-step: {"phase":"hot","action":"rollover","name":"attempt-rollover"}
The remaining steps/logs aren't as interesting, as in this first part, we already saw 2 rollovers while we should have only seen 1.

Conclusion so far: ILM fired a rollover request, which created a new index, but then a second rollover request got executed before the callback of the first rollover request was completed.

nielsbauman · 2024-05-17T09:41:22Z

Ah after reading through my notes again (I am a rubber duck?), I'm guessing the first rollover is from ILM and the second (erroneous) one is from

elasticsearch/x-pack/plugin/ilm/qa/multi-node/src/javaRestTest/java/org/elasticsearch/xpack/ilm/actions/SearchableSnapshotActionIT.java

Line 860 in 25e3551

rolloverMaxOneDocCondition(client(), dataStream);

I'll try to reproduce this later today. That seems like the most plausible explanation right now. Although, I would have expected this to happen more often.

Manually rolling over the data stream was 1. not necessary, and 2. causing issues when it happened after ILM had already rolled over, which would make ILM convert the second backing index into a searchable snapshot as well, possibly preventing the data stream from being deleted. Fixes elastic#107421

nielsbauman · 2024-05-17T18:43:31Z

I wasn't able to reproduce this right away, as the timing has to be just right in order for this to happen, but I'm fairly confident that #108790 will prevent this failure from happening again.

luigidellaquila added :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI labels Apr 12, 2024

elasticsearchmachine added blocker Team:Distributed Meta label for distributed team labels Apr 12, 2024

volodk85 added low-risk An open issue or test failure that is a low risk to future releases and removed blocker labels Apr 24, 2024

idegtiarenko assigned idegtiarenko and unassigned idegtiarenko May 17, 2024

idegtiarenko added :Data Management/ILM+SLM Index and Snapshot lifecycle management and removed :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs labels May 17, 2024

elasticsearchmachine added Team:Data Management Meta label for data/management team and removed Team:Distributed Meta label for distributed team labels May 17, 2024

nielsbauman self-assigned this May 17, 2024

nielsbauman mentioned this issue May 17, 2024

Avoid rolling over twice in searchable snapshot test #108790

Merged

nielsbauman closed this as completed in #108790 May 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CI] SearchableSnapshotActionIT testSearchableSnapshotsInHotPhasePinnedToHotNodes failing #107421

[CI] SearchableSnapshotActionIT testSearchableSnapshotsInHotPhasePinnedToHotNodes failing #107421

luigidellaquila commented Apr 12, 2024

elasticsearchmachine commented Apr 12, 2024

idegtiarenko commented May 17, 2024

elasticsearchmachine commented May 17, 2024

nielsbauman commented May 17, 2024

nielsbauman commented May 17, 2024

nielsbauman commented May 17, 2024

[CI] SearchableSnapshotActionIT testSearchableSnapshotsInHotPhasePinnedToHotNodes failing #107421

[CI] SearchableSnapshotActionIT testSearchableSnapshotsInHotPhasePinnedToHotNodes failing #107421

Comments

luigidellaquila commented Apr 12, 2024

elasticsearchmachine commented Apr 12, 2024

idegtiarenko commented May 17, 2024

elasticsearchmachine commented May 17, 2024

nielsbauman commented May 17, 2024

nielsbauman commented May 17, 2024

nielsbauman commented May 17, 2024