
Flaky unit test: test_auth_header_uses_first_match #11348

Closed
jayaddison opened this issue Apr 22, 2023 · 31 comments · Fixed by #11426

Comments

@jayaddison
Contributor

Describe the bug

The test_auth_header_uses_first_match test seems to fail intermittently, particularly when threaded test HTTP servers are in use (as in #11340).

How to Reproduce

Visible in continuous build output in jayaddison#2.

Environment Information

N/A

Sphinx extensions

N/A

Additional context

No response

@jayaddison
Contributor Author

(I'm opening this because I have a theory about why it's happening, and a potential fix that I'd like to track against this bug, separately from other discussions.)

@jayaddison
Contributor Author

jayaddison commented Apr 25, 2023

For a sample build failure, see the logs at https://github.com/jayaddison/sphinx/actions/runs/4765570057/jobs/8471467146?pr=2#step:10:1918 from commit jayaddison@88fb703 .

The test in question spins up an HTTP server that captures the client's request headers during linkchecking.

The server runs in a separate thread, and should be closed and shut down before the unit test continues and extracts results from a shared list object.

I think that the append operation -- placing the client header string into the list -- definitely occurs within the server thread before control continues in the unit test. What I'm less clear about is whether that change is reflected during the read that the unit test thread performs from the list.
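To make the shape of the fixture concrete, here is a minimal self-contained sketch of the pattern being described -- a threaded test server appending the client's headers to a shared list, with the thread join acting as the synchronization boundary. (Names are illustrative, not Sphinx's actual test code.)

```python
import threading
import urllib.request
from http.server import HTTPServer, BaseHTTPRequestHandler

def make_handler(records):
    # Handler factory: each request's headers are appended to the shared list.
    class RecordingHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(200, "OK")
            self.end_headers()
            # This append runs in the server thread.
            records.append(self.headers.as_string())

        def log_message(self, *args):
            pass  # keep test output quiet

    return RecordingHandler

records = []
server = HTTPServer(("localhost", 0), make_handler(records))
thread = threading.Thread(target=server.serve_forever)
thread.start()
try:
    port = server.server_address[1]
    urllib.request.urlopen(f"http://localhost:{port}/").read()
finally:
    server.shutdown()      # stop serve_forever()
    server.server_close()
    thread.join()          # after the join, the server thread's writes
                           # are visible to the test thread

assert records and "Host:" in records[0]
```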

@francoisfreitag
Contributor

The linked failure is a timeout, not the result of an unexpected list.
The context manager http_server ensures that the server is shut down and the thread is joined before resuming the test, so the value is in the list before the test thread resumes. I fail to see how the test could fail to read the value from the list?

@jayaddison
Contributor Author

The linked failure is a timeout, not the result of an unexpected list.

Thanks; my mistake. See https://github.com/jayaddison/sphinx/actions/runs/4765570057/jobs/8471467146?pr=2#step:10:1918 for an example of the failure.

I agree that it is unusual that the header value is not found in the list; the thread/timing theory is the best explanation that I have so far.

@francoisfreitag
Contributor

Your first link was actually correct, I misread the stdout capture as the test output 🤦. Sorry!

Here’s what I believe happens:

  1. http_server starts the server thread
  2. test thread client makes the request -- read timeout (as indicated in the stdout capture)
  3. http_server shuts down and joins -- maybe the connect timeout was hit, so the server never received the request
  4. test thread resumes, tries to read the headers, 💥

The timeout is the root cause of the failure. If the request wasn’t interrupted, the server would have received and processed it, and the test would have succeeded.

@jayaddison
Contributor Author

No problem - and interesting! That makes sense. I hadn't noticed the timeouts occurring alongside.

Hrm. Well.. on the plus side, I suppose that means that #11350 is not necessary, so I'll close and revert that from #11340 soon.

@jayaddison
Contributor Author

jayaddison commented Apr 26, 2023

  1. http_server shuts down and joins -- maybe the connect timeout was hit, so the server never received the request

I'm thinking a bit more about this as a detail, and what it might imply. We would expect to see that behaviour if the client was blocked waiting to obtain a connection from the urllib3 connection pool, correct?

(maybe it's not the only way that a timeout could occur without the server receiving a request, but it is one possibility?)

@jayaddison
Contributor Author

@francoisfreitag I appreciate your help on this - I have to admit to being fairly confused about the latest results from #11340, where multiple unit test failures on a30596f show this test_auth_header_uses_first_match failure (and failure to capture the header) -- but without timeouts as the root cause.

It seems like a good puzzle, and I think that session-based requests should be unit-testable, but I don't seem to be making much progress on finding the remaining problems.

@francoisfreitag
Contributor

We would expect to see that behaviour if the client was blocked waiting to obtain a connection from the urllib3 connection pool, correct?

It could be yes. As you say, it’s not the only way.
I see the server thread sends the response headers (so using the network) before capturing the header. Perhaps the thread could first store the request headers, since that's the most important thing. But that's only a band-aid.

latest results from #11340

That’s surprising indeed. I can reproduce when running just that test in a tight loop. 🤔

@francoisfreitag
Contributor

francoisfreitag commented Apr 26, 2023

With a bisect, the culprit seems to be:
d91ec13: tests: use threaded Python basic HTTP server implementation during unit tests

I believe the reason is that ThreadingHTTPServer uses daemon_threads = True. From the Python docs of ThreadingMixIn:

socketserver.ThreadingMixIn.server_close() now waits until all child processes and non-daemonic threads complete.

Since we have a daemon thread, we may kill the server but have the daemon thread (which handles the request) linger, and have the test check the list content before the daemon thread gets to send the response.
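One way to see the difference: http.server.ThreadingHTTPServer ships with daemon_threads = True, so server_close() does not wait for in-flight handler threads. A subclass (name illustrative) that flips it back to False restores the documented ThreadingMixIn behaviour, where server_close() waits for every non-daemonic handler thread before returning.

```python
from http.server import ThreadingHTTPServer

class JoiningHTTPServer(ThreadingHTTPServer):
    # With non-daemonic handler threads, server_close() waits for all
    # handler threads to finish, so shared state written by handlers is
    # complete before the calling (test) thread resumes.
    daemon_threads = False

assert ThreadingHTTPServer.daemon_threads is True
assert JoiningHTTPServer.daemon_threads is False
```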

@jayaddison
Contributor Author

That's a good line of thinking - I'm tempted to try setting daemon_threads=False. But... wouldn't we expect to find a client timeout if the server response occurs after the app build (and shutdown)?

At https://github.com/sphinx-doc/sphinx/actions/runs/4811667688/jobs/8565941211?pr=11340#step:10:1928 there is some output from the client that suggests to me that it does receive a response:

(           index: line    1) ok        http://localhost:7777/

@francoisfreitag
Contributor

Why was the HTTPServer changed to ThreadingHTTPServer in the first place? Using HTTPServer does not introduce that complexity, and before spending time configuring the ThreadingHTTPServer, I would like to understand the expected gains.

@francoisfreitag
Contributor

there is some output from the client that suggests to me that it does receive a response:

self.send_response(200, "OK")
self.end_headers()
records.append(self.headers.as_string())

It’s not incompatible, I believe we can have the following situation:

  1. The HTTP server thread sends the response (line 189, end_headers), I imagine it releases the GIL when sending bytes over the network
  2. The test thread acquires the GIL to resume
  3. The test thread completes its remaining work (handles the response, shuts down the HTTP server), keeping/reacquiring the GIL, until it performs the assertion.

@jayaddison
Contributor Author

there is some output from the client that suggests to me that it does receive a response:

It’s not incompatible, I believe we can have the following situation:
...

Ok, that makes some sense. During a recent rebase of #11340, I dropped a previous commit 4d485ae (now off-branch) that was experimenting along those lines:

--- a/tests/test_build_linkcheck.py
+++ b/tests/test_build_linkcheck.py
@@ -252,9 +252,9 @@ def capture_headers_handler(records):
             self.do_GET()
 
         def do_GET(self):
+            records.append(self.headers.as_string())
             self.send_response(200, "OK")
             self.end_headers()
-            records.append(self.headers.as_string())
     return HeadersDumperHandler
 
 

... so perhaps that could be reintroduced. Completion of the app.build() in the test thread should provide a synchronization boundary, I think? (maybe we would see the timeout reappear as a result)

Why was the HTTPServer changed to ThreadingHTTPServer in the first place? Using HTTPServer does not introduce that complexity, and before spending time configuring the ThreadingHTTPServer, I would like to understand the expected gains.

Before introducing that change, I had seen cases where client-server interaction seemed to pause (perhaps deadlocked, I thought), resulting in timeouts on the client in tests such as test_defaults.

When I was at 7162671 (off-branch) and experimented by removing the linkcheck timeout configuration for each of the test cases, I was able to get most of the tests to pass -- but extremely slowly (20s+ to run pytest tests/test_build_linkcheck.py -k test_defaults), as if there were co-operative-but-extremely-inefficient multitasking between test clients and servers.

By enabling the threaded test server (originally commit cbedb4b, again now off-branch), I felt that the multitasking had been unblocked, since the workloads were no longer all waiting for time (and blockable resources) within a single thread.

@jayaddison
Contributor Author

Also, an additional benefit (that I did hope for when switching to ThreadingHTTPServer, although I could easily be making this up) is that it seems like a good idea for the test servers to behave more similarly to the webservers that Sphinx will encounter during linkchecking on the Internet -- in particular, requests are unlikely to be handled in sequence, one at a time, by a stateful recipient thread.

There is some risk in the switch to multi-threading since it could expose test or implementation fragility -- but the reward is that it may expose genuine implementation problems. Ideally we'd likely want tests that are both reliable (to confidently catch regressions and maintain functionality while allowing development to proceed quickly) and also realistic (providing coverage for situations like communication-timing issues).

@jayaddison
Contributor Author

Ok, that makes some sense. During a recent rebase of #11340, I dropped a previous commit 4d485ae (now off-branch) that was experimenting along those lines:
...
... so perhaps that could be reintroduced. Completion of the app.build() in the test thread should provide a synchronization boundary, I think? (maybe we would see the timeout reappear as a result)

Before re-introducing that commit to #11340, eleven test failures had occurred for the latest commit (a30596f). After cherry-picking 4d485ae, the failure count has reduced to zero.

(note: that's based on a single run of the unit tests for each of those commits; running the tests repeatedly would help to gain more confidence)

@jayaddison
Contributor Author

jayaddison commented Apr 27, 2023

I believe the reason is that ThreadingHTTPServer uses daemon_threads = True.

A correction for this: I believe that the default for ThreadingHTTPServer's daemon_threads attribute is False (no, it turns out that the default is indeed True; see edit notes below), meaning that a server that inherits from ThreadingHTTPServer (and to be pedantic: one that doesn't override any relevant attributes/methods) does wait for all of its threads to complete when server_close() is called.

That is the behaviour we want, I think - we want complete server resource cleanup after each individual unit test completes.

Edit: I had been reading a somewhat-but-not-exactly-related documentation section about socketserver.ThreadingMixIn when writing the original version of this comment. socketserver.ThreadingMixIn does have a default daemon_threads value of False, but that is different from http.server.ThreadingHTTPServer's default.

Python 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from socketserver import ThreadingMixIn
>>> ThreadingMixIn.daemon_threads
False
>>> from http.server import ThreadingHTTPServer
>>> ThreadingHTTPServer.daemon_threads
True

@jayaddison
Contributor Author

I believe the reason is that ThreadingHTTPServer uses daemon_threads = True.

A correction for this: I believe that the default for ThreadingHTTPServer's daemon_threads attribute is False, meaning that a server that inherits from ThreadingHTTPServer (and to be pedantic: one that doesn't override any relevant attributes/methods) does wait for all of its threads to complete when server_close() is called.

...and I have a goldfish-like memory. I experimented with daemon_threads=True early on after enabling ThreadingHTTPServer.

#11324 (comment)

I later removed that, generally on the principle that the smallest set of changes to achieve the desired result is best, and I did not have a reason to believe that enabling daemon threads was helping.

@jayaddison
Contributor Author

I think that this bug was only relevant to my development branch of #11340. I'll re-open it if this test failure begins occurring again in the mainline branch.

@jayaddison
Contributor Author

jayaddison commented Apr 27, 2023

there is some output from the client that suggests to me that it does receive a response:

It’s not incompatible, I believe we can have the following situation:
...

Ok, that makes some sense. During a recent rebase of #11340, I dropped a previous commit 4d485ae (now off-branch) that was experimenting along those lines:

--- a/tests/test_build_linkcheck.py
+++ b/tests/test_build_linkcheck.py
@@ -252,9 +252,9 @@ def capture_headers_handler(records):
             self.do_GET()
 
         def do_GET(self):
+            records.append(self.headers.as_string())
             self.send_response(200, "OK")
             self.end_headers()
-            records.append(self.headers.as_string())
     return HeadersDumperHandler
 
 

... so perhaps that could be reintroduced. Completion of the app.build() in the test thread should provide a synchronization boundary, I think? (maybe we would see the timeout reappear as a result)

...

Before re-introducing that commit to #11340, eleven test failures had occurred for the latest commit (a30596f). After cherry-picking 4d485ae, the failure count has reduced to zero.

Not quite zero, after all. Although applying the change does significantly reduce the number of occurrences, the empty-headers result does still appear:

https://github.com/sphinx-doc/sphinx/actions/runs/4820527285/jobs/8585137268#step:10:1889

@jayaddison
Contributor Author

Wrong link (copy-paste error); that should be: https://github.com/sphinx-doc/sphinx/actions/runs/4820527285/jobs/8585137268#step:10:1889

@francoisfreitag
Contributor

A correction for this: I believe that the default for ThreadingHTTPServer's daemon_threads attribute is False, meaning that a server that inherits from ThreadingHTTPServer (and to be pedantic: one that doesn't override any relevant attributes/methods) does wait for all of its threads to complete when server_close() is called.

That’s not what I see in CPython source:

https://github.com/python/cpython/blob/817414321c236a77e05c621911d6f694db1262e2/Lib/http/server.py#L142-L143

20s+ to run pytest tests/test_build_linkcheck.py -k test_defaults

I’m unsure how to replicate? On 4eb706e, with Python 3.10.10 and DU 0.19, this test takes less than 0.2s on my machine.

it seems like a good idea for the test servers to behave more-similarly to the webservers that Sphinx will encounter during linkchecking on the Internet, and in particular that requests are unlikely to be handled in-sequence, one-at-a-time, by a stateful recipient thread.

I do see value here, but that also opens up a can of worms for the implementation, where the server needs to track its threads and join them. It’s certainly possible, but given that requests are few and mostly plain HEAD/GET, I’m not sure the tradeoff is worth it. 🤷

@jayaddison
Contributor Author

A correction for this: I believe that the default for ThreadingHTTPServer's daemon_threads attribute is False, meaning that a server that inherits from ThreadingHTTPServer (and to be pedantic: one that doesn't override any relevant attributes/methods) does wait for all of its threads to complete when server_close() is called.

That’s not what I see in CPython source:

https://github.com/python/cpython/blob/817414321c236a77e05c621911d6f694db1262e2/Lib/http/server.py#L142-L143

Ok, thank you. I had been reading the ThreadingMixIn documentation, where the default is described as False. I should not have assumed (nor said) that ThreadingHTTPServer uses that value; my apologies.

20s+ to run pytest tests/test_build_linkcheck.py -k test_defaults

I’m unsure how to replicate? On 4eb706e, with Python 3.10.10 and DU 0.19, this test takes less than 0.2s on my machine.

I'm having difficulty replicating that slow performance result currently too, when testing against both 7162671 and 4eb706e.

it seems like a good idea for the test servers to behave more-similarly to the webservers that Sphinx will encounter during linkchecking on the Internet, and in particular that requests are unlikely to be handled in-sequence, one-at-a-time, by a stateful recipient thread.

I do see value here, but that also opens up a can of worms for the implementation, where the server needs to track its threads and join them. It’s certainly possible, but given that requests are few and mostly plain HEAD/GET, I’m not sure the tradeoff is worth it. 🤷

That's reasonable.

I'd prefer to provide some level of parallelism testing alongside the session-based linkcheck client upgrade though: if the change isn't robust, then someone else is going to have to spend time investigating later, and I'd prefer not to cause (or at least to minimize) that time cost.

That said: I'm not particularly proud of how verbose I was during all this. I was determined to figure out various problems, and would like to co-operate effectively, but I think I verged on spamming.

I'll continue to try to replicate the slow-test-duration problem, but may pause for a while. If I'm not contributing effectively then it's better to take some time out.

@jayaddison
Contributor Author

20s+ to run pytest tests/test_build_linkcheck.py -k test_defaults

I’m unsure how to replicate? On 4eb706e, with Python 3.10.10 and DU 0.19, this test takes less than 0.2s on my machine.

I'm having difficulty replicating that slow performance result currently too, when testing against both 7162671 and 4eb706e.

I'll continue to try to replicate the slow-test-duration problem, but may pause for a while. If I'm not contributing effectively then it's better to take some time out.

Ok: ed45e3d provides a relevant testing commit.

From that commit, I repeatedly find that pytest tests/test_build_linkcheck.py -k test_defaults takes around five seconds to run a single test -- and that single test does pass successfully.

(the 20s+ duration I quoted earlier may have been my memory of how long the entire test_build_linkcheck took under those conditions)

The relevant changes from that commit are:

@francoisfreitag
Contributor

Could it be that the connection to the server is not immediately closed with HTTP/1.1 (uses persistent connections by default), and the HTTPServer waits for the connection to be closed by linkcheck (i.e. requests)?
The client makes its GET request with stream=True, which means the body is not consumed. If requests does not special-case content-length==0, then the server would keep alive the connection from the request, and not start processing the new request until the keep-alive expires, exhibiting the slowdowns and timeouts.

I believe you’re familiar with https://docs.python-requests.org/en/latest/user/advanced/#body-content-workflow:

If you set stream to True when making a request, Requests cannot release the connection back to the pool unless you consume all the data or call Response.close. This can lead to inefficiency with connections. If you find yourself partially reading request bodies (or not reading them at all) while using stream=True, you should make the request within a with statement to ensure it’s always closed:
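Applied to a link-check-style request, that recommendation looks roughly like this (a sketch assuming the requests library; the function name and timeout value are illustrative, not linkcheck's actual code):

```python
import requests

def check_link(session: requests.Session, url: str) -> int:
    # stream=True means the body is never consumed here; the `with` block
    # guarantees Response.close() runs, releasing the connection back to
    # the pool even though nothing was read.
    with session.get(url, stream=True, timeout=5) as response:
        return response.status_code
```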

So, if I’m following correctly:

  1. Enabling HTTP/1.1 caused the test suite to slow down significantly (keep-alive issues)
  2. Using ThreadingHTTPServer improved the situation some (through the use of the connection pool, but caused flakiness)
  3. Correcting the code from linkcheck to call response.close() (through the context manager) helped improve things, but tests remain flaky because of the ThreadingHTTPServer.

What do you think of the following plan:

  • Keep HTTP/1.1 and context manager for requests, using HTTPServer
  • Add a test using a pool with a single connection in the pool, and a large-but-not-too-large timeout (e.g. 1s) to exhibit the connection pool exhaustion from not response.close()-ing?

I believe it would solve the actual issue with linkcheck, keep the test suite reasonably simple, and hopefully prevent regressions.
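The single-connection pool in that plan could be configured along these lines (a sketch, untested against linkcheck; the adapter parameters are standard requests/urllib3 options):

```python
import requests
from requests.adapters import HTTPAdapter

# A session whose transport pool holds exactly one connection. If a
# streamed response is left unclosed, that one connection stays checked
# out, and a second request must wait on the pool (pool_block=True)
# rather than silently opening an extra connection.
session = requests.Session()
adapter = HTTPAdapter(pool_connections=1, pool_maxsize=1, pool_block=True)
session.mount("http://", adapter)
```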

P.S. even though verbose, I do appreciate your work on identifying and addressing the issue, and detailing your thoughts do help sometimes :)

@jayaddison
Contributor Author

Could it be that the connection to the server is not immediately closed with HTTP/1.1 (uses persistent connections by default), and the HTTPServer waits for the connection to be closed by linkcheck (i.e. requests)? The client makes its GET request with stream=True, which means the body is not consumed. If requests does not special-case content-length==0, then the server would keep alive the connection from the request, and not start processing the new request until the keep-alive expires, exhibiting the slowdowns and timeouts.

Yep, that's a great and concise description of the problem, thank you :)

The same conditions also affect HTTP HEAD requests made by the requests library. From initial testing, httpx seems to handle the situation slightly better, possibly due to differences in redirect-handling logic.

I believe you’re familiar with https://docs.python-requests.org/en/latest/user/advanced/#body-content-workflow:

👍

So, if I’m following correctly:

1. Enabling HTTP/1.1 caused the test suite to slow down significantly (keep-alive issues)

2. Using `ThreadingHTTPServer` improved the situation some (through the use of the connection pool, but caused flakiness)

3. Correcting the code from linkcheck to call `response.close()` (through the context manager) helped improve things, but tests remain flaky because of the `ThreadingHTTPServer`.
  1. ✅ correct
  2. ✅ correct
  3. ✅ correct -- and also configuring close_connection = True server-side for HTTP HEAD endpoints helped
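That server-side tweak looks roughly like this (a sketch; the handler name is illustrative):

```python
from http.server import BaseHTTPRequestHandler

class ClosingHeadHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # keep-alive is the default at this version

    def do_HEAD(self):
        self.send_response(200, "OK")
        self.send_header("Content-Length", "0")
        self.end_headers()
        # Tell the server loop to drop this connection after the response,
        # instead of holding it open under HTTP/1.1 keep-alive.
        self.close_connection = True

    def log_message(self, *args):
        pass  # keep test output quiet
```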

⚠️ A detail that I care about: I encountered the test_auth_header_uses_first_match (Authorization header) failure in continuous integration even with all three of the close(), close_connection, and header-collection-relocation (bd2ed0b) changes in place. This build log demonstrates that, and I think it may expose another deeper puzzle (please consider this optional! I've taken enough of your time, and I can follow up on it separately).

What do you think of the following plan:

* Keep HTTP/1.1 and context manager for requests, using `HTTPServer`

👍 that sounds good.

* Add a test using a pool with a single connection in the pool, and a large-but-not-too-large timeout (e.g. 1s) to exhibit the connection pool exhaustion from not `response.close()`-ing?

I believe it would solve the actual issue with linkcheck, keep the test suite reasonably simple, and hopefully prevent regressions.

Perfect, yep - good engineering practice to demonstrate the problem being solved before adding/modifying code to address it. In this case, I lacked a clear understanding of what I was solving (even though I had a vague idea something was wrong).

P.S. even though verbose, I do appreciate your work on identifying and addressing the issue, and detailing your thoughts do help sometimes :)

Thanks again!

@francoisfreitag
Contributor

test_auth_header_uses_first_match

I’ve been running the test in a tight loop on bd2ed0b for a good hour (so more than 20k attempts) without reproducing. That sounds like a blip, possibly from the machine running the test being momentarily slow.

When the server begins its answer (before actually sending bytes over the network), it logs a line in the form 127.0.0.1 - - [03/May/2023 18:09:08] "HEAD / HTTP/1.1" 200 -. We can see from the action output that this line was not logged, so we did not even get to the point where the server calls send_response(). Also, the list.append() operation to store the headers wasn’t executed (hence the test failure).

All I can say is it took more than 0.075s for the request to be sent, and for the server to start processing it. It seems rare enough that we probably don’t need to spend too much time on it. But up to you.

@jayaddison
Contributor Author

Ok, thanks for testing that. That's certainly convincing enough for me not to worry about it, and to develop a fresh pull request (with the extra test, this time).

Perhaps that timed-out test was affected by exhaustion of the connection pool client-side? (perhaps explaining the fact that the server never produced any log output; see also yet another side quest that seemed to hint at inter-test resource interference related to garbage collection)

Anyway: I'll begin developing the planned changes.

@jayaddison
Contributor Author

Add a test using a pool with a single connection in the pool, and a large-but-not-too-large timeout (e.g. 1s) to exhibit the connection pool exhaustion from not response.close()-ing?

@francoisfreitag Sanity-checking a detail here: does this imply that during the test, the server would need to be able to handle more requests in parallel than there are connections in the client's connection pool?

(in other words: that it must be clear that the test is exercising the client resource consumption behaviour, and not the limitations of the server)

@francoisfreitag
Contributor

For what I had in mind, the server capacity did not matter. I was expecting the server to keep the connection alive until the keep-alive timeout, since the first request connection was not closed properly. With the first request picking the connection from the pool (of one connection), requests won’t be able to start another connection for the next request until that first connection reaches the end of the keep-alive timeout. I expect the request timeout to occur before the keep-alive timeout, thus failing the test.

I haven’t tested this at all, so don’t put too much weight behind my words.

@jayaddison
Contributor Author

No problem, that approach sounds sensible - just wanted to check, so thanks for confirming. I hope to take a look at that tomorrow.
