Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] SearchableSnapshotActionIT testSearchableSnapshotsInHotPhasePinnedToHotNodes failing #107421

Closed
luigidellaquila opened this issue Apr 12, 2024 · 6 comments · Fixed by #108790
Assignees
Labels
:Data Management/ILM+SLM Index and Snapshot lifecycle management low-risk An open issue or test failure that is a low risk to future releases Team:Data Management Meta label for data/management team >test-failure Triaged test failures from CI

Comments

@luigidellaquila
Copy link
Contributor

Build scan:
https://gradle-enterprise.elastic.co/s/unsdhiimapjfc/tests/:x-pack:plugin:ilm:qa:multi-node:javaRestTest/org.elasticsearch.xpack.ilm.actions.SearchableSnapshotActionIT/testSearchableSnapshotsInHotPhasePinnedToHotNodes

Reproduction line:

gradlew ':x-pack:plugin:ilm:qa:multi-node:javaRestTest' --tests "org.elasticsearch.xpack.ilm.actions.SearchableSnapshotActionIT.testSearchableSnapshotsInHotPhasePinnedToHotNodes" -Dtests.seed=63B915F409E7F1D1 -Dtests.locale=es-AR -Dtests.timezone=US/Mountain -Druntime.java=21

Applicable branches:
main

Reproduces locally?:
No

Failure history:
Failure dashboard for org.elasticsearch.xpack.ilm.actions.SearchableSnapshotActionIT#testSearchableSnapshotsInHotPhasePinnedToHotNodes

Failure excerpt:

org.elasticsearch.client.ResponseException: method [DELETE], host [http://[::1]:59427], URI [_data_stream/logs-ihhfczesvn], status line [HTTP/1.1 400 Bad Request]
{"error":{"root_cause":[{"type":"snapshot_in_progress_exception","reason":"Cannot delete indices that are being snapshotted: [[.ds-logs-ihhfczesvn-2024.04.12-000002/ABSsl8IPTn-zW16PB5RMwg]]. Try again after snapshot finishes or cancel the currently running snapshot."}],"type":"snapshot_in_progress_exception","reason":"Cannot delete indices that are being snapshotted: [[.ds-logs-ihhfczesvn-2024.04.12-000002/ABSsl8IPTn-zW16PB5RMwg]]. Try again after snapshot finishes or cancel the currently running snapshot."},"status":400}

  at __randomizedtesting.SeedInfo.seed([63B915F409E7F1D1:1653EFC6B3966860]:0)
  at org.elasticsearch.client.RestClient.convertResponse(RestClient.java:351)
  at org.elasticsearch.client.RestClient.performRequest(RestClient.java:317)
  at org.elasticsearch.client.RestClient.performRequest(RestClient.java:292)
  at org.elasticsearch.xpack.ilm.actions.SearchableSnapshotActionIT.testSearchableSnapshotsInHotPhasePinnedToHotNodes(SearchableSnapshotActionIT.java:873)
  at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
  at java.lang.reflect.Method.invoke(Method.java:580)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:1583)

@luigidellaquila luigidellaquila added :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI labels Apr 12, 2024
@elasticsearchmachine elasticsearchmachine added blocker Team:Distributed Meta label for distributed team labels Apr 12, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@volodk85 volodk85 added low-risk An open issue or test failure that is a low risk to future releases and removed blocker labels Apr 24, 2024
@idegtiarenko
Copy link
Contributor

The failure reported above concerns ILM,

In particular the test creates a data stream (with backing index .ds-logs-ihhfczesvn-2024.04.12-000001) and rolls it over to create a searchable snapshot out of it. In the reported failure the second rollover was happening and we were snapshotting .ds-logs-ihhfczesvn-2024.04.12-000002 (even after the datastream and policy were removed).

I am re-labeling to data management to check why.

@idegtiarenko idegtiarenko added :Data Management/ILM+SLM Index and Snapshot lifecycle management and removed :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs labels May 17, 2024
@elasticsearchmachine elasticsearchmachine added Team:Data Management Meta label for data/management team and removed Team:Distributed Meta label for distributed team labels May 17, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@nielsbauman nielsbauman self-assigned this May 17, 2024
@nielsbauman
Copy link
Contributor

Leaving some notes here for when I get back to this:

  • ILM rolls over data stream after inserting 1 doc:

    [2024-04-12T13:35:00,899][INFO ][o.e.x.c.i.WaitForRolloverReadyStep] [javaRestTest-3] index [.ds-logs-ihhfczesvn-2024.04.12-000001] is ready for rollover, conditions: [{[min_docs: 1]=true, [max_primary_shard_docs: 200000000]=false, [max_docs: 1]=true}]
    ....
    [2024-04-12T13:35:00,985][DEBUG][o.e.x.i.IndexLifecycleRunner] [javaRestTest-3] [.ds-logs-ihhfczesvn-2024.04.12-000001] running policy with async action step [{"phase":"hot","action":"rollover","name":"attempt-rollover"}]
    [2024-04-12T13:35:00,994][INFO ][o.e.c.m.MetadataCreateIndexService] [javaRestTest-3] [.ds-logs-ihhfczesvn-2024.04.12-000002] creating index, cause [rollover_data_stream], templates [wtsrqyic], shards [1]/[1]

  • Another rollover gets executed:

    [2024-04-12T13:35:01,179][INFO ][o.e.c.m.MetadataCreateIndexService] [javaRestTest-3] [.ds-logs-ihhfczesvn-2024.04.12-000003] creating index, cause [rollover_data_stream], templates [wtsrqyic], shards [1]/[1]

    The second index is still in the initialization phase, so this rollover isn't caused by another document being ingested in the second index for some reason.

  • ILM proceeds to initialize the second and third index, while the first index is still awaiting the rollover request:

    [2024-04-12T13:35:01,358][TRACE][o.e.x.i.IndexLifecycleRunner] [javaRestTest-3] [.ds-logs-ihhfczesvn-2024.04.12-000001] retrieved current step key: {"phase":"hot","action":"rollover","name":"attempt-rollover"}

    Note this attempt-rollover step isn't waiting for the shards to become active; that's what we have the wait-for-active-shards step for.

  • ILM logs the callback after the rollover request:

    [2024-04-12T13:35:01,359][TRACE][o.e.x.i.IndexLifecycleRunner] [javaRestTest-3] cs-change-async-action-callback, [.ds-logs-ihhfczesvn-2024.04.12-000001], current-step: {"phase":"hot","action":"rollover","name":"attempt-rollover"}

  • The remaining steps/logs aren't as interesting, as in this first part, we already saw 2 rollovers while we should have only seen 1.

Conclusion so far: ILM fired a rollover request, which created a new index, but then a second rollover request got executed before the callback of the first rollover request was completed.

@nielsbauman
Copy link
Contributor

Ah after reading through my notes again (I am a rubber duck?), I'm guessing the first rollover is from ILM and the second (erroneous) one is from


I'll try to reproduce this later today. That seems like the most plausible explanation right now. Although, I would have expected this to happen more often.

nielsbauman added a commit to nielsbauman/elasticsearch that referenced this issue May 17, 2024
Manually rolling over the data stream was 1. not necessary, and
2. causing issues when it happened after ILM had already rolled
over, which would make ILM convert the second backing index into a
searchable snapshot as well, possibly preventing the data stream
from being deleted.

Fixes elastic#107421
@nielsbauman
Copy link
Contributor

I wasn't able to reproduce this right away, as the timing has to be just right in order for this to happen, but I'm fairly confident that #108790 will prevent this failure from happening again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/ILM+SLM Index and Snapshot lifecycle management low-risk An open issue or test failure that is a low risk to future releases Team:Data Management Meta label for data/management team >test-failure Triaged test failures from CI
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants