linkcheck builder: begin using session-based HTTP requests #11340

jayaddison · 2023-04-18T17:38:03Z

Feature or Bugfix

Bugfix
Refactoring

Purpose

Reduce the number of HTTP connections used during linkchecking, reducing network resource usage

Detail

~~Intersphinx requests and image downloading are also migrated to session-based requests~~
Resolves linkcheck builder: begin using requests.Session functionality during linkchecking #11324.

Relates

Builds upon code from linkcheck builder: close streamed HTTP response objects when no content reads are required #11318.

…nse-content reads are required

…nkchecking This ensures that the close method of the response is called when the response variable goes out-of-scope

…connections

…RNINGS configuration

…uring linkchecking" This reverts commit 9ec505f.

…ormed

…ived" This reverts commit 98df982.

…een performed" This reverts commit 6fd82b3.

…nagers during linkchecking"" This reverts commit 692fdef.

jayaddison · 2023-04-18T17:42:57Z

Although it might seem strange, I do expect (and intend) the unit tests to fail with 4305d11 to begin with. Despite the pull request title, the session support (ad12d25) is reverted as of that commit, and so the assertion that only a single HTTP connection is used should fail.

Reverting 3bf835c should re-introduce session support, and allow those tests to pass.

AA-Turner · 2023-04-21T01:03:10Z

Intermittant failures on Python 3.10 with DU 0.18 or 0.19--other than that looks good.

A

AA-Turner · 2023-04-21T01:03:30Z

sphinx/builders/linkcheck.py

@@ -424,6 +425,7 @@ def check(docname: str) -> tuple[str, str, int]:
            check_request = self.wqueue.get()
            next_check, hyperlink = check_request
            if hyperlink is None:
+                self._session.close()


Why do we close here?

I wanted to ensure that the requests session's resources are cleaned up when the linkchecker completes -- some warnings appeared during testing to indicate that they weren't -- and this seemed to be a good place for it (the None hyperlink is used as a marker (sentinel?) to indicate that the worker should shutdown).

jayaddison · 2023-04-21T10:15:32Z

Intermittant failures on Python 3.10 with DU 0.18 or 0.19--other than that looks good.

Thanks for the review; I'm taking a look into those test failures now.

FWIW: I saw similar behaviour using Python 3.11 before adding acec710 - and I think that it's possible that calling flush_headers during the response may help (but I'll try to confirm that).

jayaddison · 2023-04-21T10:26:26Z

FWIW: I saw similar behaviour using Python 3.11 before adding acec710 - and I think that it's possible that calling flush_headers during the response may help (but I'll try to confirm that).

On second thoughts: the asserted header is an auth header from the (linkcheck) client - so that makes the server headers seem less-relevant. Either way: I'll investigate.

jayaddison · 2023-04-21T11:52:24Z

Intermittant failures on Python 3.10 with DU 0.18 or 0.19--other than that looks good.

There seem to be two problems that contribute to the failures:

Reading from the client headers after beginning the handler's communication back to the client seems unreliable (sometimes an empty header-set is received)
With stream=False (the default) for body content reads, the server may block during this HEAD request. That might indicate a requests bug, since for an HTTP HEAD request I don't think that body content should be expected. An alternative would be to send a zero Content-Length header, as we do in another server handler fixture -- personally I'd prefer to preserve the two different server behaviours because it provides some test coverage diversity.

I've prepared fixups for each of these -- they're testing well so far -- and may also backport them to #11318 for pedantic consistency.

…connections

jayaddison · 2023-04-21T11:57:43Z

I've prepared fixups for each of these -- they're testing well so far -- and may also backport them to #11318 for pedantic consistency.

Maybe it's a better idea not to backport them. That would help to confirm whether these issues affect both individual (non-session) HTTP requests, or only session-based HTTP requests.

jayaddison · 2023-04-21T12:18:56Z

With stream=False (the default) for body content reads, the server may block during this HEAD request. That might indicate a requests bug, since for an HTTP HEAD request I don't think that body content should be expected. An alternative would be to send a zero Content-Length header, as we do in another server handler fixture -- personally I'd prefer to preserve the two different server behaviours because it provides some test coverage diversity.

Even after adjusting the code to use stream=True, there still seems to be the possibility of a stalled connection (resulting in intermittent timeouts).

I'm still relatively confident that sending the Content-Length header from all of the server test handler fixtures would resolve this.. but I'm still trying to understand the problem in more detail to figure out whether there's a deeper bug that can be resolved, or whether it's unsolvable.

jayaddison · 2023-04-26T18:21:23Z

I seem to be running in circles on this one without really figuring out why the remaining unit test failures (all timeouts, I believe) are occuring.

Could it be that multiple unit tests share the same urllib3 connection pool, and that the workers that each linkcheck test spins up aren't being released? (resource contention across tests)

I'll try to test that theory in my personal fork of the repository.

jayaddison · 2023-04-26T18:23:32Z

I seem to be running in circles on this one without really figuring out why the remaining unit test failures (all timeouts, I believe) are occuring.

🤦 I should read before commenting!

The failures don't exhibit timeouts here, but they do exhibit the test_auth_header_uses_first_match header-recording problem.

…connections

…0 protocol (default) to HTTP/1.1

…it tests Ref: https://docs.python.org/3/library/http.server.html#http.server.ThreadingHTTPServer

…ting the 'close_connection' handler attribute instead of calling connection.close Ref: https://docs.python.org/3/library/http.server.html#http.server.BaseHTTPRequestHandler.close_connection

…gth response header during redirects

…ts_on_HEAD'

…er capture to before any server-side communication has been initiated by the handler (cherry picked from commit 4d485ae)

jayaddison · 2023-04-27T13:38:40Z

There was a failure case that I hadn't seen before in https://github.com/sphinx-doc/sphinx/actions/runs/4820054596/jobs/8584030994 from commit 0ed465c here.

From the test failure output of test_too_many_requests_retry_after_HTTP_date:

E         + Traceback (most recent call last):
E         +   File "C:\hostedtoolcache\windows\Python\3.11.3\x64\Lib\socketserver.py", line 691, in process_request_thread
E         +     self.finish_request(request, client_address)
E         +   File "C:\hostedtoolcache\windows\Python\3.11.3\x64\Lib\socketserver.py", line 361, in finish_request
E         +     self.RequestHandlerClass(request, client_address, self)
E         +   File "C:\hostedtoolcache\windows\Python\3.11.3\x64\Lib\socketserver.py", line 755, in __init__
E         +     self.handle()
E         +   File "C:\hostedtoolcache\windows\Python\3.11.3\x64\Lib\http\server.py", line 434, in handle
E         +     self.handle_one_request()
E         +   File "C:\hostedtoolcache\windows\Python\3.11.3\x64\Lib\http\server.py", line 400, in handle_one_request
E         +     self.raw_requestline = self.rfile.readline(65537)
E         +                            ^^^^^^^^^^^^^^^^^^^^^^^^^^
E         +   File "C:\hostedtoolcache\windows\Python\3.11.3\x64\Lib\socket.py", line 706, in readinto
E         +     return self._sock.recv_into(b)
E         +            ^^^^^^^^^^^^^^^^^^^^^^^
E         + ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host

I might rebase in a moment, because I think we can drop #11349 from the changeset here. But I'd like to revisit this failure even if it doesn't reappear immediately.

jayaddison · 2023-04-27T14:35:29Z

Perhaps a better approach than attempting to merge a set of inter-related changes in a single branch here would be to gradually introduce each one as a separate, simpler pull request.

That would probably help to narrow down the cause of any problems.

Apologies for the noise @AA-Turner @francoisfreitag, and thanks for the review and support. I'm going to close this and will re-open a few smaller, isolated changes (like adding connection-measurement support in the tests) soon.

The goal continues to be to enable session-based linkchecking.

jayaddison · 2023-04-27T16:41:30Z

There was a failure case that I hadn't seen before in https://github.com/sphinx-doc/sphinx/actions/runs/4820054596/jobs/8584030994 from commit 0ed465c here.

I now have local repro of this (ConnectionResetError) error, so hopefully that will help to narrow-in on a fix.

This is an intermittent test failure in test_too_many_requests_retry_after_int_delay and seems to stop occurring when I remove the preceding test, test_TooManyRedirects_on_HEAD, suggesting that there may be an unintended dependency between those two unit tests.

jayaddison · 2023-04-27T16:56:57Z

This is an intermittent test failure in test_too_many_requests_retry_after_int_delay and seems to stop occurring when I remove the preceding test, test_TooManyRedirects_on_HEAD, suggesting that there may be an unintended dependency between those two unit tests.

I think that this indicates that the failure occurs within test_TooManyRedirects_on_HEAD; the subsequent test test_too_many_requests_retry_after_int_delay just happens to be the runtime context where the (previous test) server thread's exception is raised and printed to standard-error, resulting in that second test failing.

This is confirmed by removing test_too_many_requests_retry_after_int_delay; the failure then appears in test_too_many_requests_retry_after_HTTP_date (the next test down) instead.

So what's the problem in test_TooManyRedirects_on_HEAD?

Well, the test involves a server that will continually redirect the client.

Meanwhile, our test client will close the connection when it reaches some threshold number of redirects.

jayaddison · 2023-04-27T16:59:57Z

Well, the test involves a server that will continually redirect the client.

...only as long as the client performs HTTP HEAD requests, though. The HTTP GET handler should not perform a redirect...

jayaddison · 2023-04-27T18:28:34Z

This is an intermittent test failure in test_too_many_requests_retry_after_int_delay and seems to stop occurring when I remove the preceding test, test_TooManyRedirects_on_HEAD, suggesting that there may be an unintended dependency between those two unit tests.

I think that this indicates that the failure occurs within test_TooManyRedirects_on_HEAD; the subsequent test test_too_many_requests_retry_after_int_delay just happens to be the runtime context where the (previous test) server thread's exception is raised and printed to standard-error, resulting in that second test failing.

This is confirmed by removing test_too_many_requests_retry_after_int_delay; the failure then appears in test_too_many_requests_retry_after_HTTP_date (the next test down) instead.

So what's the problem in test_TooManyRedirects_on_HEAD?

Well, the test involves a server that will continually redirect the client.

Meanwhile, our test client will close the connection when it reaches some threshold number of redirects.

Ok, I think I've found the problem. It is straightforward to replicate a ConnectionResetError from test_TooManyRedirects_on_HEAD when:

Threaded HTTP servers are in use within the tests
The test HTTP servers speak the HTTP/1.1 protocol
Session-based requests are not in use
Context-managed requests are not in use

The problem occurs after:

The test client follows multiple HTTP HEAD redirects (that's fine)
The test client exceeds the max-redirect limit of 5 (also fine)
The test client issues an HTTP GET fallback in streaming mode (also OK)
The test server responds HTTP 200 OK (still fine up until now)
The test client disconnects, because it was in streaming mode and linkchecking is complete
The test server writes a response body that will never be read

The last two items may occur in the inverse order and the result will be the same.

The Content-Length header in this case does not help; the client, by design, does not intend to read the remainder of the server response.

I think it'd be possible to update the InfiniteRedirectOnHeadHandler test server handler to close the server-side connection after the response is written to fix this, without changing the test's semantics.

…nnection after response body is written, since we do not anticipate the client will read it Ref: sphinx-doc#11340 (comment)

jayaddison · 2023-04-28T09:10:06Z

Ok, I think I've found the problem. It is straightforward to replicate a ConnectionResetError from test_TooManyRedirects_on_HEAD when:
...

Here's an attempt to create a minimal repro case for this problem:

from contextlib import contextmanager
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
import time
from threading import Thread

import pytest


class TestHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"

    def do_GET(self):
        content = b"content"
        self.send_response(200, "OK")
        self.send_header("Content-Length", len(content))
        self.end_headers()
        self.wfile.write(content)
        self.wfile.flush()


class TestServer(Thread):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.server = ThreadingHTTPServer(("localhost", 7777), TestHandler)

    def run(self):
        self.server.serve_forever()

    def join(self):
        self.server.shutdown()
        self.server.server_close()
        super().join()


@contextmanager
def server_context():
    server_thread = TestServer()
    server_thread.start()
    yield server_thread
    server_thread.join()


@pytest.mark.sphinx('linkcheck', testroot='linkcheck-localserver', freshenv=True)
def test_a(app, capsys):
    with server_context():
        app.build()


@pytest.mark.sphinx('linkcheck', testroot='linkcheck-localserver', freshenv=True)
def test_b(app, capsys):
    time.sleep(0.5)

    _stdout, stderr = capsys.readouterr()
    assert stderr == ""

A few notes:

The problem only seems to occur when the same app configured is re-used between the two unit tests, as specified using pytest.mark.sphinx decoration
The second unit test (where the error appears in the stderr output, but not where the stacktrace originates) does not have to run the linkchecker - it is enough for it to time.sleep for a long-enough duration

I'm going to try to simplify the conditions further, likely by temporarily removing code from the pytest marker and linkchecker code.

jayaddison · 2023-04-28T12:40:41Z

Ok, after further reduction of the repro case, I don't think that anything about the ConnectionResetError / inter-unit-test-log-capture issue is specific to Sphinx. Here is a repro case where I can reliably reproduce the problem using cPython 3.11.2:

from contextlib import contextmanager
import gc
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from threading import Thread
import time

import pytest
import requests


class TestHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"

    def do_GET(self):
        content = b"content"
        self.send_response(200, "OK")
        self.send_header("Content-Length", len(content))
        self.end_headers()
        self.wfile.write(content)
        self.wfile.flush()


class TestServer(Thread):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.server = ThreadingHTTPServer(("localhost", 8080), TestHandler)

    def run(self):
        self.server.serve_forever()

    def join(self):
        self.server.shutdown()
        self.server.server_close()
        super().join()


@contextmanager
def server_context():
    server_thread = TestServer()
    server_thread.start()
    yield server_thread
    server_thread.join()


def test_a():
    with server_context():
        requests.get('http://localhost:8080', stream=True)


def test_b(capsys):
    gc.collect()
    time.sleep(0.1)

    _stdout, stderr = capsys.readouterr()
    assert stderr == ""

Further notes:

Introducing gc to request garbage collection makes the issue more repeatable using cPython.
Therefore the theory would be: the recv_into exception relates to socket-handling (a thread?) that is deferred until the socket resources are cleaned up; that can occur outside of the unit test where those resources were created.
The behaviour here may be OS-specific, since some urllib3 and socket implementation functionality differs by platform.
There may not be much that pytest could do to handle this, since garbage-collecting during teardown of every test would likely introduce significant overhead.

I'm not sure that there's anything else to do for this particular behaviour, given that we can fix it in the unit tests by having the server set self.close_connection = True during request handling (in do_GET).

It's potentially a slightly-annoying cause of noise in server logs, but everything is behaving to-spec, as far as I can tell (maybe that could indicate an HTTP spec limitation, or under-definition of the spec somewhere - I'm not completely sure).

…nnection after response body is written, since we do not anticipate the client will read it Ref: sphinx-doc#11340 (comment)

jayaddison added 14 commits April 11, 2023 16:17

linkcheck builder: close streamed HTTP response objects when no respo…

f538440

…nse-content reads are required

Add test coverage for anchor-in-document linkchecking

524368f

refactor: use requests.Response objects as context managers during li…

9ec505f

…nkchecking This ensures that the close method of the response is called when the response variable goes out-of-scope

cleanup: remove unused HTML test data file

c708cc9

Merge branch 'master' into issue-11317/linkcheck-close-streamed-http-…

4eb7896

…connections

Nitpick: remove duplicated HTML tag-closing character

95dd08a

Merge branch 'master' into issue-11317/linkcheck-close-streamed-http-…

2aceccc

…connections

continuous integration: unit test workflow: simplify default PYTHONWA…

a3a340b

…RNINGS configuration

Revert "refactor: use requests.Response objects as context managers d…

692fdef

…uring linkchecking" This reverts commit 9ec505f.

Fixup: do not close the response before anchor-checking has been perf…

6fd82b3

…ormed

Fixup: close the response after an HTTP HEAD response is received

98df982

Revert "Fixup: close the response after an HTTP HEAD response is rece…

ec415ea

…ived" This reverts commit 98df982.

Revert "Fixup: do not close the response before anchor-checking has b…

5769478

…een performed" This reverts commit 6fd82b3.

Revert "Revert "refactor: use requests.Response objects as context ma…

8269c6d

…nagers during linkchecking"" This reverts commit 692fdef.

This comment was marked as outdated.

Sign in to view

jayaddison changed the title ~~linkcheck builder: begin using session-based HTTP requests~~ web content retrieval: begin using session-based HTTP requests Apr 18, 2023

jayaddison changed the title ~~web content retrieval: begin using session-based HTTP requests~~ linkcheck builder: begin using session-based HTTP requests Apr 18, 2023

This comment was marked as outdated.

Sign in to view

jayaddison mentioned this pull request Apr 19, 2023

linkcheck builder: close streamed HTTP response objects when no content reads are required #11318

Merged

AA-Turner approved these changes Apr 21, 2023

View reviewed changes

AA-Turner reviewed Apr 21, 2023

View reviewed changes

Merge branch 'master' into issue-11317/linkcheck-close-streamed-http-…

89442fc

…connections

jayaddison and others added 9 commits April 27, 2023 13:44

Merge branch 'master' into issue-11317/linkcheck-close-streamed-http-…

69a6828

…connections

linkcheck builder: begin using session-based HTTP requests

5fafefa

tests: linkcheck builder: update test webserver handlers from HTTP/1.…

c6191a1

…0 protocol (default) to HTTP/1.1

tests: use threaded Python basic HTTP server implementation during un…

616c55b

…it tests Ref: https://docs.python.org/3/library/http.server.html#http.server.ThreadingHTTPServer

tests: linkcheck: ConnectionResetHandler: close the connection by set…

2615fe4

…ting the 'close_connection' handler attribute instead of calling connection.close Ref: https://docs.python.org/3/library/http.server.html#http.server.BaseHTTPRequestHandler.close_connection

tests: linkcheck: make_redirect_handler: send zero-valued content-len…

982f862

…gth response header during redirects

tests: linkcheck: add connection-count measurement

a6863bb

tests: linkcheck: reduce worker count to one for 'test_TooManyRedirec…

01c250e

…ts_on_HEAD'

tests: linkcheck: capture_headers_handler: relocate the (client) head…

bd2ed0b

…er capture to before any server-side communication has been initiated by the handler (cherry picked from commit 4d485ae)

jayaddison force-pushed the issue-11324/linkcheck-sessioned-requests branch from 0ed465c to bd2ed0b Compare April 27, 2023 13:44

This comment was marked as outdated.

Sign in to view

jayaddison closed this Apr 27, 2023

jayaddison deleted the issue-11324/linkcheck-sessioned-requests branch April 27, 2023 14:35

This comment was marked as outdated.

Sign in to view

github-actions bot locked as resolved and limited conversation to collaborators May 29, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

linkcheck builder: begin using session-based HTTP requests #11340

linkcheck builder: begin using session-based HTTP requests #11340

jayaddison commented Apr 18, 2023 •

edited

jayaddison commented Apr 18, 2023 •

edited

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

AA-Turner commented Apr 21, 2023

AA-Turner Apr 21, 2023

jayaddison Apr 21, 2023

jayaddison commented Apr 21, 2023

jayaddison commented Apr 21, 2023

jayaddison commented Apr 21, 2023 •

edited

jayaddison commented Apr 21, 2023

jayaddison commented Apr 21, 2023

jayaddison commented Apr 26, 2023

jayaddison commented Apr 26, 2023

jayaddison commented Apr 27, 2023

This comment was marked as outdated.

jayaddison commented Apr 27, 2023

This comment was marked as outdated.

jayaddison commented Apr 27, 2023

jayaddison commented Apr 27, 2023

jayaddison commented Apr 27, 2023

This comment was marked as outdated.

jayaddison commented Apr 27, 2023

jayaddison commented Apr 28, 2023

jayaddison commented Apr 28, 2023

linkcheck builder: begin using session-based HTTP requests #11340

linkcheck builder: begin using session-based HTTP requests #11340

Conversation

jayaddison commented Apr 18, 2023 • edited

Feature or Bugfix

Purpose

Detail

Relates

jayaddison commented Apr 18, 2023 • edited

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

AA-Turner commented Apr 21, 2023

AA-Turner Apr 21, 2023

Choose a reason for hiding this comment

jayaddison Apr 21, 2023

Choose a reason for hiding this comment

jayaddison commented Apr 21, 2023

jayaddison commented Apr 21, 2023

jayaddison commented Apr 21, 2023 • edited

jayaddison commented Apr 21, 2023

jayaddison commented Apr 21, 2023

jayaddison commented Apr 26, 2023

jayaddison commented Apr 26, 2023

jayaddison commented Apr 27, 2023

This comment was marked as outdated.

jayaddison commented Apr 27, 2023

This comment was marked as outdated.

jayaddison commented Apr 27, 2023

jayaddison commented Apr 27, 2023

jayaddison commented Apr 27, 2023

This comment was marked as outdated.

jayaddison commented Apr 27, 2023

jayaddison commented Apr 28, 2023

jayaddison commented Apr 28, 2023

jayaddison commented Apr 18, 2023 •

edited

jayaddison commented Apr 18, 2023 •

edited

jayaddison commented Apr 21, 2023 •

edited