Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issues with ansible-playbook-callbacks test #82407

Merged
merged 1 commit into from Jan 17, 2024

Conversation

markgoddard
Copy link
Contributor

@markgoddard markgoddard commented Dec 12, 2023

SUMMARY

This PR fixes various issues with the ansible-playbook-callbacks integration test.

ISSUE TYPE
  • Test Pull Request
ADDITIONAL INFORMATION

The timing of the async tasks was a little unpredictable, meaning that sometimes we would get an unexpected number of v2_runner_on_async_poll callbacks, and fail the test. This change fixes the issue by increasing the poll interval to 2 seconds and the sleep duration to 3 seconds, such that on a reasonably responsive system we will poll twice per task, with the sleep ending in the middle of the two polls.

The include_me.yml file does not exist in this integration test. It has been added.

The remote_tmp_dir.path expression is invalid - the setup_remote_tmp_dir role uses set_fact to set remote_tmp_dir to remote_tmp_dir.path.

The integration tests run with ANSIBLE_HOST_PATTERN_MISMATCH=error, meaning that the final play was never reached. Set ANSIBLE_HOST_PATTERN_MISMATCH=warning to continue past the play and trigger the v2_playbook_on_no_hosts_matched callback.

@ansibot ansibot added test This PR relates to tests. needs_triage Needs a first human triage before being processed. labels Dec 12, 2023
@webknjaz

This comment was marked as resolved.

This comment was marked as resolved.

@bcoca bcoca removed the needs_triage Needs a first human triage before being processed. label Dec 12, 2023
@bcoca bcoca requested a review from mattclay December 12, 2023 15:38
Copy link
Contributor

@mkrizek mkrizek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for identifying these issues.

In addition to the inline comments the playbook also needs to run with force_handlers turned off as it is on by default for ansible-test so v2_playbook_on_no_hosts_remaining is called as intended. (Un)setting ANSIBLE_FORCE_HANDLERS temporarily for the playbook should be all that is needed.

@@ -96,26 +96,22 @@
ignore_errors: true

- name: async poll ok
command: sleep 2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is needed as I have yet to see this being an issue in our CI. Generally we want the sleep to be as short as possible to reduce tests running time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I definitely hit the issue locally, and I saw it in CI once too, but that may have been with an earlier iteration of the values. Either way, it should be possible to run the integration tests on a laptop reliably.

I understand the concern around test runtime, and set it up to ensure that only one poll callback is sent per task.

The problem is that the sleep duration is a multiple of the poll interval, so the number of polls can vary. I chose these numbers to make the sleep finish in the middle of the 1st and 2nd poll.

I'll try halving all of the numbers, which should reduce the task duration to 2s, as it was before.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm testing this locally like this:

for i in $(seq 1 10); do if ! ansible-test integration ansible-playbook-callbacks; then echo "Failed on $i"; break; fi; done

With the original numbers (sleep 2 poll 1) and it failed on the 2nd attempt. Tried again and it failed on the 5th attempt.

I tried it twice with the change proposed here (sleep 3 poll 2) and it did not fail.

I tried it with the numbers halved (sleep 1.5 poll 1) and it failed on the second attempt.

I tried a few other combinations, but couldn't find something that worked well (poll must be an integer).

So I'm inclined to keep the numbers proposed here - a slow test is better than an unreliable one.


- name: async poll failed
shell: sleep 2; false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

@ansibot ansibot added the needs_revision This PR fails CI tests or a maintainer has requested a review/revision of the PR. label Dec 13, 2023
Copy link
Contributor Author

@markgoddard markgoddard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for identifying these issues.

In addition to the inline comments the playbook also needs to run with force_handlers turned off as it is on by default for ansible-test so v2_playbook_on_no_hosts_remaining is called as intended. (Un)setting ANSIBLE_FORCE_HANDLERS temporarily for the playbook should be all that is needed.

Thank you for the review. This PR was a spin-off from #81550 in which I'm doing some work around v2_playbook_on_no_hosts_remaining and indeed noticed a change in behaviour with force handlers. Due to the bug being fixed there (#81549), that callback does not fire unless max_fail_percentage is set (which it is in the final play).

I'm inclined to keep force handlers in place, to ensure that the fix for #81549 works in that case. The current proposed solution does, but originally it did not, and this integration test caught that.

@@ -96,26 +96,22 @@
ignore_errors: true

- name: async poll ok
command: sleep 2
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I definitely hit the issue locally, and I saw it in CI once too, but that may have been with an earlier iteration of the values. Either way, it should be possible to run the integration tests on a laptop reliably.

I understand the concern around test runtime, and set it up to ensure that only one poll callback is sent per task.

The problem is that the sleep duration is a multiple of the poll interval, so the number of polls can vary. I chose these numbers to make the sleep finish in the middle of the 1st and 2nd poll.

I'll try halving all of the numbers, which should reduce the task duration to 2s, as it was before.

The timing of the async tasks was a little unpredictable, meaning that
sometimes we would get an unexpected number of v2_runner_on_async_poll
callbacks, and fail the test. This change fixes the issue by increasing
the poll interval to 2 seconds and the sleep duration to 3 seconds, such
that on a reasonably responsive system we will poll twice per task, with
the sleep ending in the middle of the two polls.

The include_me.yml file does not exist in this integration test. It has
been added.

The remote_tmp_dir.path expression is invalid - the setup_remote_tmp_dir
role uses set_fact to set remote_tmp_dir to remote_tmp_dir.path.

The integration tests run with ANSIBLE_HOST_PATTERN_MISMATCH=error,
meaning that the final play was never reached. Set
ANSIBLE_HOST_PATTERN_MISMATCH=warning to continue past the play and
trigger the v2_playbook_on_no_hosts_matched callback.
@ansibot ansibot added stale_review Updates were made after the last review and the last review is more than 7 days old. and removed needs_revision This PR fails CI tests or a maintainer has requested a review/revision of the PR. labels Dec 13, 2023
@markgoddard
Copy link
Contributor Author

Hi @mkrizek, how is it looking now?

@ansibot ansibot added the stale_ci This PR has been tested by CI more than one week ago. Close and re-open this PR to get it retested. label Jan 2, 2024
@webknjaz
Copy link
Member

@mkrizek could you take a look? It seems like this may fix flaky failures in our own CI that we observe quite regularly.

@@ -5,6 +5,7 @@ set -eux
export ANSIBLE_CALLBACK_PLUGINS=../support-callback_plugins/callback_plugins
export ANSIBLE_ROLES_PATH=../
export ANSIBLE_STDOUT_CALLBACK=callback_debug
export ANSIBLE_HOST_PATTERN_MISMATCH=warning
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think this should be set for the all-callbacks.yml playbook only as opposed to the whole script in case more tests are added in the future but not a blocker.

Copy link
Contributor

@mkrizek mkrizek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@webknjaz I was not able to reproduce the failures described in https://github.com/ansible/ansible/pull/82407/files#r1425469321 locally so I cannot verify the changes being correct but feel free to merge to see if it fixes the flaky tests you mentioned.

@markgoddard Thank you for the PR.

The difference in callbacks being run regarding on force_handlers being un/set can be worked on in a separate PR.

@ansibot ansibot added needs_revision This PR fails CI tests or a maintainer has requested a review/revision of the PR. and removed stale_review Updates were made after the last review and the last review is more than 7 days old. labels Jan 16, 2024
@markgoddard
Copy link
Contributor Author

Thank you for the review @mkrizek. Ansibot has added the needs_revision label. Is it because the branch is old? Does it need a merge/rebase?

@webknjaz
Copy link
Member

I cannot verify the changes being correct but feel free to merge to see if it fixes the flaky tests you mentioned.

I was hoping somebody with a better understanding of the consequences could approve first. The tests are flaky, but the failures happen once in a few days. To check this experimentally, we'd need to watch the CI nightlies post-merge for weeks, before declaring the fix working. I didn't want to experiment like that.

@mattclay
Copy link
Member

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@ansibot ansibot removed the stale_ci This PR has been tested by CI more than one week ago. Close and re-open this PR to get it retested. label Jan 16, 2024
@mkrizek mkrizek merged commit 4a2de76 into ansible:devel Jan 17, 2024
62 checks passed
@markgoddard markgoddard deleted the fix-ansible-playbook-callbacks branch January 19, 2024 11:30
@markgoddard
Copy link
Contributor Author

Thank you for approving. This has unlocked #81550.

@webknjaz
Copy link
Member

FTR I think I haven't seen those failures since the merge. So I think it'd be useful to have this on other branches too (as the issues weren't unique to devel).

@markgoddard could you send us backport PRs against the stable versions (2.16, 2.15, 2.14 and 2.12) per https://docs.ansible.com/ansible/latest/community/development_process.html#backporting-merged-prs-in-ansible-core?

markgoddard added a commit to stackhpc/ansible that referenced this pull request Jan 31, 2024
The timing of the async tasks was a little unpredictable, meaning that
sometimes we would get an unexpected number of v2_runner_on_async_poll
callbacks, and fail the test. This change fixes the issue by increasing
the poll interval to 2 seconds and the sleep duration to 3 seconds, such
that on a reasonably responsive system we will poll twice per task, with
the sleep ending in the middle of the two polls.

The include_me.yml file does not exist in this integration test. It has
been added.

The remote_tmp_dir.path expression is invalid - the setup_remote_tmp_dir
role uses set_fact to set remote_tmp_dir to remote_tmp_dir.path.

The integration tests run with ANSIBLE_HOST_PATTERN_MISMATCH=error,
meaning that the final play was never reached. Set
ANSIBLE_HOST_PATTERN_MISMATCH=warning to continue past the play and
trigger the v2_playbook_on_no_hosts_matched callback.

(cherry picked from commit 4a2de76)
markgoddard added a commit to stackhpc/ansible that referenced this pull request Jan 31, 2024
The timing of the async tasks was a little unpredictable, meaning that
sometimes we would get an unexpected number of v2_runner_on_async_poll
callbacks, and fail the test. This change fixes the issue by increasing
the poll interval to 2 seconds and the sleep duration to 3 seconds, such
that on a reasonably responsive system we will poll twice per task, with
the sleep ending in the middle of the two polls.

The include_me.yml file does not exist in this integration test. It has
been added.

The remote_tmp_dir.path expression is invalid - the setup_remote_tmp_dir
role uses set_fact to set remote_tmp_dir to remote_tmp_dir.path.

The integration tests run with ANSIBLE_HOST_PATTERN_MISMATCH=error,
meaning that the final play was never reached. Set
ANSIBLE_HOST_PATTERN_MISMATCH=warning to continue past the play and
trigger the v2_playbook_on_no_hosts_matched callback.

(cherry picked from commit 4a2de76)
@markgoddard
Copy link
Contributor Author

FTR I think I haven't seen those failures since the merge. So I think it'd be useful to have this on other branches too (as the issues weren't unique to devel).

@markgoddard could you send us backport PRs against the stable versions (2.16, 2.15, 2.14 and 2.12) per https://docs.ansible.com/ansible/latest/community/development_process.html#backporting-merged-prs-in-ansible-core?

I've proposed backports to 2.16 and 2.15, but the callbacks integration test doesn't exist in 2.14 and earlier branches.

#82630
#82631

@ansible ansible locked and limited conversation to collaborators Feb 14, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
needs_revision This PR fails CI tests or a maintainer has requested a review/revision of the PR. test This PR relates to tests.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants