New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix issues with ansible-playbook-callbacks test #82407
Fix issues with ansible-playbook-callbacks test #82407
Conversation
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for identifying these issues.
In addition to the inline comments the playbook also needs to run with force_handlers
turned off as it is on by default for ansible-test
so v2_playbook_on_no_hosts_remaining
is called as intended. (Un)setting ANSIBLE_FORCE_HANDLERS
temporarily for the playbook should be all that is needed.
@@ -96,26 +96,22 @@ | |||
ignore_errors: true | |||
|
|||
- name: async poll ok | |||
command: sleep 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is needed as I have yet to see this being an issue in our CI. Generally we want the sleep to be as short as possible to reduce tests running time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I definitely hit the issue locally, and I saw it in CI once too, but that may have been with an earlier iteration of the values. Either way, it should be possible to run the integration tests on a laptop reliably.
I understand the concern around test runtime, and set it up to ensure that only one poll callback is sent per task.
The problem is that the sleep duration is a multiple of the poll interval, so the number of polls can vary. I chose these numbers to make the sleep finish in the middle of the 1st and 2nd poll.
I'll try halving all of the numbers, which should reduce the task duration to 2s, as it was before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm testing this locally like this:
for i in $(seq 1 10); do if ! ansible-test integration ansible-playbook-callbacks; then echo "Failed on $i"; break; fi; done
With the original numbers (sleep 2 poll 1) and it failed on the 2nd attempt. Tried again and it failed on the 5th attempt.
I tried it twice with the change proposed here (sleep 3 poll 2) and it did not fail.
I tried it with the numbers halved (sleep 1.5 poll 1) and it failed on the second attempt.
I tried a few other combinations, but couldn't find something that worked well (poll must be an integer).
So I'm inclined to keep the numbers proposed here - a slow test is better than an unreliable one.
|
||
- name: async poll failed | ||
shell: sleep 2; false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for identifying these issues.
In addition to the inline comments the playbook also needs to run with force_handlers turned off as it is on by default for ansible-test so v2_playbook_on_no_hosts_remaining is called as intended. (Un)setting ANSIBLE_FORCE_HANDLERS temporarily for the playbook should be all that is needed.
Thank you for the review. This PR was a spin-off from #81550 in which I'm doing some work around v2_playbook_on_no_hosts_remaining
and indeed noticed a change in behaviour with force handlers. Due to the bug being fixed there (#81549), that callback does not fire unless max_fail_percentage is set (which it is in the final play).
I'm inclined to keep force handlers in place, to ensure that the fix for #81549 works in that case. The current proposed solution does, but originally it did not, and this integration test caught that.
@@ -96,26 +96,22 @@ | |||
ignore_errors: true | |||
|
|||
- name: async poll ok | |||
command: sleep 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I definitely hit the issue locally, and I saw it in CI once too, but that may have been with an earlier iteration of the values. Either way, it should be possible to run the integration tests on a laptop reliably.
I understand the concern around test runtime, and set it up to ensure that only one poll callback is sent per task.
The problem is that the sleep duration is a multiple of the poll interval, so the number of polls can vary. I chose these numbers to make the sleep finish in the middle of the 1st and 2nd poll.
I'll try halving all of the numbers, which should reduce the task duration to 2s, as it was before.
The timing of the async tasks was a little unpredictable, meaning that sometimes we would get an unexpected number of v2_runner_on_async_poll callbacks, and fail the test. This change fixes the issue by increasing the poll interval to 2 seconds and the sleep duration to 3 seconds, such that on a reasonably responsive system we will poll twice per task, with the sleep ending in the middle of the two polls. The include_me.yml file does not exist in this integration test. It has been added. The remote_tmp_dir.path expression is invalid - the setup_remote_tmp_dir role uses set_fact to set remote_tmp_dir to remote_tmp_dir.path. The integration tests run with ANSIBLE_HOST_PATTERN_MISMATCH=error, meaning that the final play was never reached. Set ANSIBLE_HOST_PATTERN_MISMATCH=warning to continue past the play and trigger the v2_playbook_on_no_hosts_matched callback.
8c5bd40
to
9642495
Compare
Hi @mkrizek, how is it looking now? |
@mkrizek could you take a look? It seems like this may fix flaky failures in our own CI that we observe quite regularly. |
@@ -5,6 +5,7 @@ set -eux | |||
export ANSIBLE_CALLBACK_PLUGINS=../support-callback_plugins/callback_plugins | |||
export ANSIBLE_ROLES_PATH=../ | |||
export ANSIBLE_STDOUT_CALLBACK=callback_debug | |||
export ANSIBLE_HOST_PATTERN_MISMATCH=warning |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think this should be set for the all-callbacks.yml
playbook only as opposed to the whole script in case more tests are added in the future but not a blocker.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@webknjaz I was not able to reproduce the failures described in https://github.com/ansible/ansible/pull/82407/files#r1425469321 locally so I cannot verify the changes being correct but feel free to merge to see if it fixes the flaky tests you mentioned.
@markgoddard Thank you for the PR.
The difference in callbacks being run regarding on force_handlers
being un/set can be worked on in a separate PR.
Thank you for the review @mkrizek. Ansibot has added the needs_revision label. Is it because the branch is old? Does it need a merge/rebase? |
I was hoping somebody with a better understanding of the consequences could approve first. The tests are flaky, but the failures happen once in a few days. To check this experimentally, we'd need to watch the CI nightlies post-merge for weeks, before declaring the fix working. I didn't want to experiment like that. |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
Thank you for approving. This has unlocked #81550. |
FTR I think I haven't seen those failures since the merge. So I think it'd be useful to have this on other branches too (as the issues weren't unique to @markgoddard could you send us backport PRs against the stable versions (2.16, 2.15, 2.14 and 2.12) per https://docs.ansible.com/ansible/latest/community/development_process.html#backporting-merged-prs-in-ansible-core? |
The timing of the async tasks was a little unpredictable, meaning that sometimes we would get an unexpected number of v2_runner_on_async_poll callbacks, and fail the test. This change fixes the issue by increasing the poll interval to 2 seconds and the sleep duration to 3 seconds, such that on a reasonably responsive system we will poll twice per task, with the sleep ending in the middle of the two polls. The include_me.yml file does not exist in this integration test. It has been added. The remote_tmp_dir.path expression is invalid - the setup_remote_tmp_dir role uses set_fact to set remote_tmp_dir to remote_tmp_dir.path. The integration tests run with ANSIBLE_HOST_PATTERN_MISMATCH=error, meaning that the final play was never reached. Set ANSIBLE_HOST_PATTERN_MISMATCH=warning to continue past the play and trigger the v2_playbook_on_no_hosts_matched callback. (cherry picked from commit 4a2de76)
The timing of the async tasks was a little unpredictable, meaning that sometimes we would get an unexpected number of v2_runner_on_async_poll callbacks, and fail the test. This change fixes the issue by increasing the poll interval to 2 seconds and the sleep duration to 3 seconds, such that on a reasonably responsive system we will poll twice per task, with the sleep ending in the middle of the two polls. The include_me.yml file does not exist in this integration test. It has been added. The remote_tmp_dir.path expression is invalid - the setup_remote_tmp_dir role uses set_fact to set remote_tmp_dir to remote_tmp_dir.path. The integration tests run with ANSIBLE_HOST_PATTERN_MISMATCH=error, meaning that the final play was never reached. Set ANSIBLE_HOST_PATTERN_MISMATCH=warning to continue past the play and trigger the v2_playbook_on_no_hosts_matched callback. (cherry picked from commit 4a2de76)
I've proposed backports to 2.16 and 2.15, but the callbacks integration test doesn't exist in 2.14 and earlier branches. |
SUMMARY
This PR fixes various issues with the ansible-playbook-callbacks integration test.
ISSUE TYPE
ADDITIONAL INFORMATION
The timing of the async tasks was a little unpredictable, meaning that sometimes we would get an unexpected number of v2_runner_on_async_poll callbacks, and fail the test. This change fixes the issue by increasing the poll interval to 2 seconds and the sleep duration to 3 seconds, such that on a reasonably responsive system we will poll twice per task, with the sleep ending in the middle of the two polls.
The include_me.yml file does not exist in this integration test. It has been added.
The remote_tmp_dir.path expression is invalid - the setup_remote_tmp_dir role uses set_fact to set remote_tmp_dir to remote_tmp_dir.path.
The integration tests run with ANSIBLE_HOST_PATTERN_MISMATCH=error, meaning that the final play was never reached. Set ANSIBLE_HOST_PATTERN_MISMATCH=warning to continue past the play and trigger the v2_playbook_on_no_hosts_matched callback.