Fix TasksIT#testGetTaskWaitForCompletionWithoutStoringResult #108094

arteam · 2024-04-30T14:24:24Z

It seems that the failure (the missing index) has always existed in the test scenario and is supposed to be handled by TransportGetTaskAction.java. We catch IndexNotFoundException here and convert it to ResourceNotFoundException. Then we catch ResourceNotFoundException here and return a snapshot of the task as the response.
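A minimal sketch of the two-step handling described above, for readers who don't have TransportGetTaskAction open. The exception classes and both methods are simplified, hypothetical stand-ins that only borrow the names from this description; this is not the actual Elasticsearch code.

```java
import java.util.function.Supplier;

public class TaskLookupSketch {
    // Hypothetical stand-ins for the Elasticsearch exception types named above.
    static class IndexNotFoundException extends RuntimeException {
        IndexNotFoundException(String index) {
            super("no such index [" + index + "]");
        }
    }

    static class ResourceNotFoundException extends RuntimeException {
        ResourceNotFoundException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Step 1: the index lookup translates a missing index into "resource not found".
    static String getFinishedTaskFromIndex(Supplier<String> indexLookup) {
        try {
            return indexLookup.get();
        } catch (IndexNotFoundException e) {
            throw new ResourceNotFoundException("task result isn't stored", e);
        }
    }

    // Step 2: the wait-for-completion path swallows "resource not found" and
    // falls back to the snapshot it took of the running task.
    static String waitedForCompletion(Supplier<String> indexLookup, String snapshot) {
        try {
            return getFinishedTaskFromIndex(indexLookup);
        } catch (ResourceNotFoundException e) {
            return snapshot;
        }
    }

    public static void main(String[] args) {
        Supplier<String> missingIndex = () -> {
            throw new IndexNotFoundException(".tasks");
        };
        // prints "snapshot of task 42"
        System.out.println(waitedForCompletion(missingIndex, "snapshot of task 42"));
    }
}
```

In this sketch, calling getFinishedTaskFromIndex directly surfaces the ResourceNotFoundException to the caller, while going through waitedForCompletion hides it.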

In the stack trace, getFinishedTaskFromIndex was called from getRunningTaskFromNode, not from waitedForCompletion, because of a race between the get request and the unblock request, which are sent asynchronously. I've changed the waitForCompletionTestCase test method to unblock the task only after the request has started waiting for task completion, which we detect by registering a removal listener. By doing so, we make sure we test the "wait for completion" branch while the task is running.

The part about the missing index seems to be irrelevant, since waitedForCompletion is able to suppress the error and return a snapshot of the running task, which is not possible if getFinishedTaskFromIndex gets called directly from getRunningTaskFromNode.

Resolves #107823

Make sure the `.tasks` index is created before we start testing task completion without storing its result. To achieve that, we store a fake task before we start `waitForCompletionTestCase`. Resolves #107823

elasticsearchmachine · 2024-04-30T14:24:48Z

Pinging @elastic/es-distributed (Team:Distributed)

arteam · 2024-05-02T06:29:40Z

@elasticmachine update branch

arteam · 2024-05-07T14:01:22Z

@elasticmachine update branch

henningandersen · 2024-05-10T14:33:12Z

The linked issue says that the tasks index got deleted, but that does not seem to match the resolution here? Can we find out why the tasks index was deleted too soon instead?

arteam · 2024-05-13T07:50:09Z

@henningandersen I believe the comment in the linked issue is wrong. The index was never deleted, because the test doesn't create the index. The test waits for the completion of a task, and the task only completes because we have special error handling for the case where the index doesn't exist. I guess in some cases the error handling can't figure out that the root cause was IndexNotFoundException, which should be converted to ResourceNotFoundException, which is silently ignored.

I believe we should just explicitly create the index, because testGetTaskWaitForCompletionWithoutStoringResult is supposed to test task completion, not the error handling for missing indexes, which is done in testGetTaskNotFound and testTasksGetWaitForNoTask.

henningandersen · 2024-05-15T06:54:51Z

@arteam it still smells like we might be covering up for a bug here. AFAICS, we expect the logic to work regardless of whether the index exists or not. Can you elaborate on how the test differentiates between whether the task exists or not? Since if it is within the actual tasks code, we may want to target that instead (as well as add a dedicated test for it).

DaveCTurner · 2024-05-15T14:07:31Z

On Wed, May 15, 2024 at 3:02 PM Artem Prigoda wrote:

> Started digging more deeply and the test stopped failing after #108052 got merged

I'm pretty sure #108052 had no effect here, it was a pure refactoring.



arteam · 2024-05-21T07:39:22Z

@elasticmachine update branch


arteam · 2024-05-21T07:54:40Z

@henningandersen That was a very good catch! getFinishedTaskFromIndex was called from getRunningTaskFromNode, not from waitedForCompletion. There indeed seems to be a race between creating a get request and unblocking request which are sent asynchronously. I've changed waitForCompletionTestCase to unblock the task only after the request has started waiting for task completion, which we detect by registering a removal listener. By doing so, we make sure we test the "wait for completion" branch while the task is running.

The part about the missing index seems to be irrelevant, since waitedForCompletion is able to suppress the error and return a snapshot of the running task, which is not possible if getFinishedTaskFromIndex gets called directly from getRunningTaskFromNode.
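To illustrate the ordering that the listener enforces, here is a rough, self-contained sketch using plain java.util.concurrent. The RegistrationHook interface is a hypothetical analogue of onRemovedTaskListenerRegistered, and the futures stand in for the blocked test task and the get request. None of this is the real TasksIT code.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class UnblockAfterWaitSketch {
    /** Hypothetical stand-in for a hook such as onRemovedTaskListenerRegistered. */
    interface RegistrationHook {
        void onWaiterRegistered();
    }

    static String run() {
        // Completes when the blocked test task is allowed to finish.
        CompletableFuture<String> blockedTask = new CompletableFuture<>();
        CountDownLatch waiterRegistered = new CountDownLatch(1);
        RegistrationHook hook = waiterRegistered::countDown;

        // The "wait for completion" request, sent asynchronously.
        CompletableFuture<String> waitResponse = CompletableFuture.supplyAsync(() -> {
            if (blockedTask.isDone()) {
                return "task already gone"; // the branch the flaky runs ended up in
            }
            CompletableFuture<String> completion = blockedTask.thenApply(name -> "waited for " + name);
            hook.onWaiterRegistered(); // signal that the waiter is now in place
            return completion.join();
        });

        try {
            // Unblock only after the waiter is known to be registered.
            if (!waiterRegistered.await(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("waiter was never registered");
            }
            blockedTask.complete("test task");
            return waitResponse.get(10, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // prints "waited for test task"
        System.out.println(run());
    }
}
```

Because the task is completed only after the latch is released, the asynchronous request can never observe an already finished task in this sketch, so the "wait for completion" branch is the one exercised every time.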

arteam · 2024-05-24T14:39:15Z

@henningandersen Any chance you would be able to take a look at the changes in the PR?

henningandersen · 2024-05-27T09:00:00Z

> There indeed seems to be a race between creating a get request and unblocking request which are sent asynchronously

Did you manage to reproduce this by putting in a sleep somewhere? I'd like to fully understand the situation.

arteam · 2024-05-27T12:40:41Z

@henningandersen Yes, the error is reproduced trivially if you unblock the request first and add a small delay before calling the waitForCompletion request.

// Unblock the request so the wait for completion request can finish
client().execute(UNBLOCK_TASK_ACTION, new TestTaskPlugin.UnblockTestTasksRequest()).get();
Thread.sleep(1000);
// Spin up a request to wait for the test task to finish
waitResponseFuture = wait.apply(taskId);

henningandersen · 2024-05-27T12:42:38Z

> trivially if you unblock the request first and add a small delay before calling the waitForCompletion request.

I am not sure I understand why it would be an ok reproduction to swap the order of unblock and wait here, can you elaborate? Is it possible to just add a sleep somewhere else to see it fail?

arteam · 2024-05-27T13:09:24Z

> I am not sure I understand why it would be an ok reproduction to swap the order of unblock and wait here, can you elaborate? Is it possible to just add a sleep somewhere else to see it fail?

@henningandersen I believe the issue is that the order is undefined, since both operations are run asynchronously. We do not check that the request clusterAdmin().prepareGetTask(id).setWaitForCompletion(true) is finished before unblocking the task. We just get an ActionFuture and immediately call client().execute(UNBLOCK_TASK_ACTION, new TestTaskPlugin.UnblockTestTasksRequest()).get().

So, depending on which one of these requests wins the race and is processed first, we get a different result. That's why taskManager.getTask(request.getTaskId().getId()) in TransportGetTaskAction can return null, which happens if the unblock request manages to win the race.
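The two outcomes can be shown deterministically with a toy model. This is only a sketch: the map is a hypothetical miniature of the task manager, and the returned strings merely label which branch would be taken.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class GetTaskRaceSketch {
    // Mirrors the null check on taskManager.getTask(...) described above.
    static String getTask(Map<Long, String> runningTasks, long id) {
        String task = runningTasks.get(id);
        if (task == null) {
            return "not running: look up the stored result in the index";
        }
        return "running: wait for completion";
    }

    // Unblocking lets the task finish, which removes it from the running tasks.
    static void unblock(Map<Long, String> runningTasks, long id) {
        runningTasks.remove(id);
    }

    public static void main(String[] args) {
        Map<Long, String> runningTasks = new ConcurrentHashMap<>();
        runningTasks.put(1L, "test task");

        // The get request wins the race: the task is still registered.
        System.out.println(getTask(runningTasks, 1L));

        // The unblock request wins the race: the lookup now comes back empty.
        unblock(runningTasks, 1L);
        System.out.println(getTask(runningTasks, 1L));
    }
}
```

In the real test both calls are in flight at the same time, so which of the two lines you get depends on scheduling. That is why the failure was intermittent.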

henningandersen · 2024-05-27T13:32:13Z

Thanks, that makes sense. I would have hoped we could put in a simple sleep somewhere to provoke it, but I have not been successful with that yet.

henningandersen · 2024-05-28T06:48:22Z

I can reproduce this using -Dtests.seed=F52B12BE60A068C8 and a sleep at the beginning of getRunningTaskFromNode.

henningandersen

LGTM.

Thanks for the extra iterations, this version looks good (have a few smaller comments only).

.../main/java/org/elasticsearch/action/admin/cluster/node/tasks/get/TransportGetTaskAction.java

henningandersen · 2024-05-28T07:06:18Z

.../src/internalClusterTest/java/org/elasticsearch/action/admin/cluster/node/tasks/TasksIT.java

+            @Override
+            public void onRemovedTaskListenerRegistered(RemovedTaskListener removedTaskListener) {
+                // Unblock the request only after it started waiting for task completion
+                if (removedTaskListener.toString().startsWith("Completing running task Task{id=" + taskId.getId())) {


This seems a bit strange, I think it works without it too, since there should be no other wait for completions going on.

@henningandersen There seems to be a bug in TestTaskPlugin#TransportTestTaskAction. It checks whether a task is blocked by running waitUntil for 10 seconds, but doesn't check whether waitUntil finished successfully.
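For illustration only, this is roughly what checking the outcome of a timed wait could look like. The waitUntil helper below is a hypothetical polling utility written for this sketch, not the one used by TestTaskPlugin, and this is not presented as the change made in this PR.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

public class WaitUntilSketch {
    // Hypothetical polling helper: reports whether the condition held before the timeout.
    static boolean waitUntil(BooleanSupplier condition, long timeout, TimeUnit unit) {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        while (System.nanoTime() - deadline < 0) {
            if (condition.getAsBoolean()) {
                return true;
            }
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return condition.getAsBoolean();
    }

    // Fails loudly instead of silently continuing when the wait times out.
    static void awaitUnblocked(BooleanSupplier unblocked) {
        if (!waitUntil(unblocked, 100, TimeUnit.MILLISECONDS)) {
            throw new IllegalStateException("task was not unblocked before the timeout");
        }
    }

    public static void main(String[] args) {
        awaitUnblocked(() -> true); // returns normally
        try {
            awaitUnblocked(() -> false);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Ignoring the boolean result means a task that never got unblocked is treated the same as one that did, which can hide the real cause of a later failure.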


arteam · 2024-05-28T07:43:39Z

@elasticmachine update branch


arteam · 2024-05-29T07:15:07Z

Thank you!


Fix TasksIT#testGetTaskWaitForCompletionWithoutStoringResult

bf3b27d

Make sure the `.tasks` index is created before we start testing task completion without storing its result. To achieve that, we store a fake task before we start `waitForCompletionTestCase`. Resolves #107823

arteam added the >test and :Distributed/Task Management labels Apr 30, 2024

elasticsearchmachine added the Team:Distributed and v8.15.0 labels Apr 30, 2024

arteam requested review from idegtiarenko and DaveCTurner April 30, 2024 17:06

Merge branch 'main' into save-fake-tasks-to-create-task-index

39fb24a

arteam requested review from idegtiarenko, volodk85 and DaveCTurner and removed request for idegtiarenko and DaveCTurner May 2, 2024 07:26

Merge branch 'main' into save-fake-tasks-to-create-task-index

88eeddc

arteam requested review from idegtiarenko, DaveCTurner, volodk85 and a team and removed request for idegtiarenko, volodk85 and DaveCTurner May 8, 2024 07:56

arteam added 4 commits May 21, 2024 01:19

Unblock request only after we started waiting for completion

800d56c

Update comment

63396ac

Remove outdated comment

1b2bded

Revert "Fix TasksIT#testGetTaskWaitForCompletionWithoutStoringResult"

e8241a5

This reverts commit bf3b27d.

elasticmachine and others added 2 commits May 21, 2024 08:39

Merge branch 'main' into save-fake-tasks-to-create-task-index

59036cd

Make sure we register onRemovedTaskListenerRegistered before we wait …

f59ff4e

…for completion

arteam requested a review from henningandersen May 21, 2024 07:55

henningandersen approved these changes May 28, 2024


Update server/src/main/java/org/elasticsearch/action/admin/cluster/no…

ceea234

…de/tasks/get/TransportGetTaskAction.java Co-authored-by: Henning Andersen <33268011+henningandersen@users.noreply.github.com>

elasticmachine and others added 6 commits May 28, 2024 08:43

Merge branch 'main' into save-fake-tasks-to-create-task-index

533dfe6

Adjust test for the new task name

a928f28

Remove check for removedTaskListener type

a61e9c9

Make sure the task gets unblocked

f235b87

Revert "Make sure the task gets unblocked"

7e6f7de

This reverts commit f235b87.

Remove listener after test finished

26170ea

arteam merged commit e622101 into main May 29, 2024
16 checks passed

arteam deleted the save-fake-tasks-to-create-task-index branch May 29, 2024 07:15
