
Flaky unit tests: linkcheck builder (tests/test_build_linkcheck.py) #11299

Closed
jayaddison opened this issue Apr 7, 2023 · 19 comments · Fixed by #11326

Comments

@jayaddison
Contributor

jayaddison commented Apr 7, 2023

Describe the bug

The test_linkcheck_allowed_redirects test appears to be failing rarely and intermittently.

Other tests within the same linkcheck builder test module also appear to be affected.

It's possible that these are slightly separate issues, but they are likely related (and can probably be fixed at the same time).

How to Reproduce

Run the relevant main.yml continuous integration pytest tests on a repeated basis. This is demonstrated by jayaddison#1.

Environment Information

GitHub Actions workflows (examples are linked in the bug description section)

Sphinx extensions

N/A

Additional context

No response

@picnixz
Member

picnixz commented Apr 7, 2023

First, it seems that it only affects the Python 3.11 and Python 3.12 builds. One possibility that comes to mind is that we might have a race condition or a deadlock somewhere. This may be due to the new implementation of the HTTP server with filelocks, but I am not entirely sure about it.

In order to reproduce, I think we need to sequentially run the Python 3.10, 3.11 and 3.12 builds, so that when the Python 3.11 build starts, the Python 3.10 build may not necessarily be "properly" finished (perhaps the timeout is too short for acquiring/releasing all locks properly?). Alternatively, I would suggest checking whether such failures occur prior to a80e3fd.
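For context, the filelock approach looks roughly like this (an illustrative sketch only, not the actual tests/utils.py implementation; the lock filename and timeout are made up):

from filelock import FileLock

# Serialize access to the shared localhost port across test runs/processes;
# if acquiring the lock takes longer than the timeout, the test errors out.
with FileLock("sphinx-linkcheck-localhost-7777.lock", timeout=15):
    ...  # start the local HTTP server, run the test, shut the server down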

@jayaddison
Contributor Author

Do you know where the test parallelism is configured/declared, @picnixz? (I was looking around for it, but haven't figured it out yet)

@picnixz
Member

picnixz commented Apr 7, 2023

If you want to run it using CI, I think this should be in the .github/workflows/main.yml file itself. To do it locally, the best way is to write a bash script which would sequentially (or concurrently, depending on what you want to test) run the pytest commands themselves.
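A rough Python equivalent of that idea (this script is illustrative only and does not exist in the repository):

import subprocess

# Re-run the linkcheck test module sequentially to try to surface the flake.
for attempt in range(1, 51):
    result = subprocess.run(
        ["python", "-m", "pytest", "tests/test_build_linkcheck.py", "-x"]
    )
    if result.returncode != 0:
        print(f"failure reproduced on attempt {attempt}")
        break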

I am not even sure that the issue is due to deadlocks/race conditions. Maybe it's an issue that we can only see on GitHub, for instance. Maybe @marxin has some ideas?

@jayaddison
Contributor Author

Ah, ok, understood (I think). So the current codebase doesn't specify parallelism for the tests, but it can be enabled locally, and doing that is what causes the localhost port-binding issue that #11294 addressed.

Maybe an odd idea: could/should we use a separate port for each test that runs a local HTTP server?
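As a loose illustration of that idea (the helper is hypothetical; note that a port chosen this way can in principle be claimed by another process before the test server binds it):

import socket

def pick_unused_port() -> int:
    """Ask the OS for a free ephemeral port instead of hard-coding 7777."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("localhost", 0))  # port 0 lets the kernel pick a free port
        return sock.getsockname()[1]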

@picnixz
Member

picnixz commented Apr 7, 2023

Concerning #11294, you are right. The issue is that this might have induced some issues with the GitHub CI workflow, and this is extremely hard to reproduce (we don't know whether it is because of the underlying CI framework, because of pytest itself, or because of the (physical) machine that is running the CI checks).

Using a separate port for each test would be a possibility, but this means that we need to know which ports are in use and which are not. There is no guarantee that when a test finishes, the port it was using is free again. In particular, we would need as many ports as there are tests, but this 1) is prone to errors if we choose the ports incorrectly, 2) requires knowing which ports we can use, and 3) may conflict with local ports when we are testing. One possibility is to wait between the execution of two HTTP-related tests, but this would increase the duration of the tests (which already take quite a long time to run on GitHub).

One possibility is to first run all non-HTTP tests concurrently as much as possible, and then run all HTTP-related tests sequentially. This may also help isolate future network-related issues.
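One rough way to express that split (illustrative only; the marker and this conftest.py change do not exist in the repository):

# conftest.py sketch: tag tests that start a local HTTP(S) server so that they
# can be collected separately from the rest of the suite.
def pytest_configure(config):
    config.addinivalue_line(
        "markers", "localserver: test starts a local HTTP(S) server"
    )

# CI could then run the two groups separately, for example:
#   pytest -m "not localserver" -n auto   (parallel; -n needs pytest-xdist)
#   pytest -m localserver                 (sequential, one server at a time)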

@AA-Turner
Member

In order to reproduce, I think we need to sequentially run the Python 3.10, 3.11 and 3.12 builds, so that when the Python 3.11 build starts, the Python 3.10 build may not necessarily be "properly" finished (perhaps the timeout is too short for acquiring/releasing all locks properly?). Alternatively, I would suggest checking whether such failures occur prior to a80e3fd.

In GitHub Actions the runners are all independent, there's no shared state between them. I think the likeliest issue is simply that the runners have lower resources (2 cores/etc) and hit the low timeout.

I lowered the timeouts in 97f07ca to reduce the time taken by the linkcheck tests, and had to increase them once in 7d928b3 to fix instability; perhaps we need to increase them again.

A

@jayaddison
Contributor Author

+1 on the observation about no-shared-state between the runners.

My understanding is that pytest should run the tests in a single process, sequentially, and without any async, so any temporary webserver + file-locking usage should be set up and torn down between tests.

It's a good goal to allow running the tests with parallelism. I've found the responses library useful for mocking requests calls during tests without requiring a webserver. It's an additional dependency, but it could allow factoring out any reliance on network socket isolation.
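A minimal illustration of how that looks (not actual Sphinx test code; the URL and test name are made up):

import requests
import responses

@responses.activate
def test_head_request_is_mocked():
    # `responses` intercepts calls made through `requests`, so no socket,
    # port, or local webserver is needed.
    responses.add(responses.HEAD, "http://example.test/path1", status=200)
    resp = requests.head("http://example.test/path1", allow_redirects=True)
    assert resp.status_code == 200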

@jayaddison
Contributor Author

It's a good goal to allow running the tests with parallelism. I've found the responses library useful for mocking requests calls during tests without requiring a webserver. It's an additional dependency, but it could allow factoring out any reliance on network socket isolation.

After investigating further, it doesn't look like responses would be applicable to all of the existing HTTP tests in Sphinx. In particular, there are tests that exercise TLS/SSL-level behaviour (like this one, for example), and responses does not provide mocking at that level (only at the HTTP-request level).

@picnixz
Member

picnixz commented Apr 11, 2023

Another "broken" test: https://github.com/sphinx-doc/sphinx/actions/runs/4659078989/jobs/8245535037?pr=11312

@jayaddison
Contributor Author

The file locking approach seems fine to me, and upping the timeout seems OK, but it would be nice to find the cause.

Looking at the HTTP server logs in the case of the test_linkcheck_allowed_redirects failures could be instructive. Here's one example (with some of the assertion failure context).

>       assert records["http://localhost:7777/path1"]["status"] == "working"
E       AssertionError: assert 'broken' == 'working'
E         - working
E         + broken

tests/test_build_linkcheck.py:353: AssertionError
----------------------------- Captured stderr call -----------------------------
127.0.0.1 - - [] "HEAD /path2 HTTP/1.1" 405 -
127.0.0.1 - - [] "HEAD /path1 HTTP/1.1" 405 -
127.0.0.1 - - [] "GET /path2 HTTP/1.1" 302 -
127.0.0.1 - - [] "GET /?redirected=1 HTTP/1.1" 204 -

So it seems like the HEAD request to path1 wasn't followed by a subsequent GET request. The relevant code section is this, I think:

try:
    # try a HEAD request first, which should be easier on
    # the server and the network
    response = requests.head(req_url, allow_redirects=True,
                             config=self.config, auth=auth_info,
                             **kwargs)
    response.raise_for_status()
# Servers drop the connection on HEAD requests, causing
# ConnectionError.
except (ConnectionError, HTTPError, TooManyRedirects) as err:
    if isinstance(err, HTTPError) and err.response.status_code == 429:
        raise
    # retry with GET request if that fails, some servers
    # don't like HEAD requests.
    response = requests.get(req_url, stream=True,
                            config=self.config,
                            auth=auth_info, **kwargs)
    response.raise_for_status()

(I'm pretty slow to figure this stuff out, and perhaps this is verbose. I'm hoping that writing out the investigation process may help one of us discover more about the cause)

@jayaddison
Contributor Author

Hehe, as usual, I haven't read carefully enough. The path1 GET request failure was a timeout: that appears further down in the build logs when the results are printed:

(           index: line    1) broken    http://localhost:7777/path1 - HTTPConnectionPool(host='localhost', port=7777): Read timed out. (read timeout=0.05)

So I guess that lends more support to the increase-the-timeout-threshold approach.

@jayaddison
Contributor Author

(Determining the root cause, the py310-and-beyond nature of the flakiness, and the recent appearance of the flakiness are all on my mind too, as is the fact that I haven't been able to replicate this without introducing an artificial time.sleep(0.15) call in the test util webserver run method.)

@jayaddison
Contributor Author

I wonder whether adding a /healthcheck endpoint to the server and waiting for that to return an HTTP 2xx before commencing app.build in each test case might help.

Theory: perhaps some setup costs for the webserver itself are being delayed until after we begin the linkcheck builder. That could be related to why the path2 HEAD request is logged first in the HTTP server logs (noticeable, although possibly a red herring).
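A rough sketch of the idea (the /healthcheck endpoint and helper are hypothetical; they do not exist in the current test server):

import time
import requests

def wait_until_ready(url="http://localhost:7777/healthcheck", attempts=20):
    # Poll until the test webserver answers, so that app.build() only starts
    # once the server is actually accepting requests.
    for _ in range(attempts):
        try:
            if requests.get(url, timeout=0.5).status_code < 400:
                return
        except requests.RequestException:
            pass
        time.sleep(0.05)
    raise RuntimeError("test webserver never became ready")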

@jayaddison
Contributor Author

Hrm. No, setup time seems less applicable given that both HEAD requests succeed, and that it's only a subsequent GET request that fails (after the webserver is up-and-running).

@jayaddison
Contributor Author

(Determining the root cause, the py310-and-beyond nature of the flakiness, and the recent appearance of the flakiness are all on my mind too, as is the fact that I haven't been able to replicate this without introducing an artificial time.sleep(0.15) call in the test util webserver run method.)

Moving that time.sleep call into the RedirectOnceHandler could be a more accurate way to simulate the failure case (note: the name of the handler describes the number of redirects that each individual request receives, not the total number of redirects that the handler performs).
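Loosely, something like this (illustrative only; the real RedirectOnceHandler lives in tests/test_build_linkcheck.py and looks different):

import time
from http.server import BaseHTTPRequestHandler

class SlowRedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Delay inside the handler, so the client's short read timeout
        # (0.05s in the failing test) is exceeded mid-response.
        time.sleep(0.15)
        self.send_response(302)
        self.send_header("Location", "http://localhost:7777/?redirected=1")
        self.end_headers()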

@jayaddison
Contributor Author

An idea, navigating by difference: could there be clues in the differences between the test_linkcheck_allowed_redirects test and the other tests in the same file? Why does it seem to be the only flaky test in the module?

Another note: the redirect timeout configuration value is used as a per-request argument -- it's not a total for the duration of the linkcheck builder's operations.
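In plain requests terms (not Sphinx code), that means each attempt gets its own budget:

import requests

kwargs = {"timeout": 0.05}
try:
    # the HEAD attempt gets a full 0.05s read timeout of its own...
    requests.head("http://localhost:7777/path1", allow_redirects=True, **kwargs)
except requests.RequestException:
    # ...and the GET retry gets another 0.05s, rather than sharing a deadline
    requests.get("http://localhost:7777/path1", stream=True, **kwargs)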

@jayaddison
Contributor Author

Another "broken" test: https://github.com/sphinx-doc/sphinx/actions/runs/4659078989/jobs/8245535037?pr=11312

Ok, yep - that was test_too_many_requests_retry_after_HTTP_date, and during (fairly naive) replicability testing in jayaddison#1, test_raw_node is affected too.

So in fact this issue is not specific to test_linkcheck_allowed_redirects at all (and again, I should have read the build outputs and details more carefully, because that could've saved me a bunch of time yesterday).

@jayaddison jayaddison changed the title Flaky unit test: test_linkcheck_allowed_redirects Flaky unit tests: linkcheck builder (tests/test_build_linkcheck.py) Apr 14, 2023
@jayaddison
Contributor Author

tl;dr - my vote would be to increase the timeout thresholds

I ran a series of experimental evaluations in jayaddison#1 to try to replicate the intermittent timeout flakiness (and to verify what I considered at one point to be a potential fix, but which I have since decided should not be relevant).

The reason for testing in that pull request is partly that I couldn't easily determine a way to replicate the behaviour locally / outside of GitHub Actions. I used a reduced (but otherwise unaltered) GitHub Actions workflow configuration in that branch to test the results.

In short: the issue only appeared twice out of 100 experimental evaluations (and both occurrences, kinda annoyingly, were without the suggested fix in place; I was hoping to find a timeout failure despite the fix being merged, to confirm that the suggested fix was invalid).

@jayaddison
Contributor Author

Recap:

I think I may have found a more robust fix: jayaddison@d6f29be, and I am currently testing it in jayaddison#2.

It seems slightly unpythonic, and I'm not completely sure whether the reasoning behind it is valid, so if anyone can see problems with it, please let me know.
