Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

linkcheck builder: close streamed HTTP response objects when no content reads are required #11318

Conversation

jayaddison
Copy link
Contributor

@jayaddison jayaddison commented Apr 11, 2023

Feature or Bugfix

  • Bugfix

Purpose

  • Intended to reduce warning noise (ResourceWarning output) during linkcheck build tests

Detail

  • The linkchecker makes HTTP requests in a couple of places in streaming mode to minimize the amount of traffic/data required
  • Since requests doesn't know whether the caller may intend to read content from a streamed HTTP response object, it holds the socket open
  • We can explicitly close invoke each request using a context manager, since we know for certain that we are not going to read data from it in these cases to allow the corresponding __exit__ method to perform cleanup

Relates

@jayaddison jayaddison marked this pull request as draft April 11, 2023 16:00
@@ -318,6 +318,7 @@ def check_uri() -> tuple[str, str, int]:
# Read the whole document and see if #anchor exists
response = requests.get(req_url, stream=True, config=self.config,
auth=auth_info, **kwargs)
response.close() # no HTTP body reads required; close the response
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading the content is required here, as part of check_anchor.

(so a context-manager approach might be more appropriate here)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A better explanation, because I forgot what was going on here:

In the general case, we want to ensure that response.raise_for_status and response.close are both called after receiving a response from the server. They provide error-checking and resource cleanup, respectively.

Most of the time, we can call close as soon as the HTTP status line (200 OK, etc) is received.

The anchor-checking case (looking for a specific HTML ID within a documentation webpage) is different though: it doesn't seem to make sense to anchor-check on error responses (HTTP 4xx), even though theoretically the server could include HTML and the anchors we're looking for in the response. Documentation references shouldn't be served from pages that produce 4xx responses.

So in that case we want to call response.raise_for_status first.

It might be possible to move the response.close call into the finally clause of the try...except handler.

The context manager approach provided by the requests library calls response.close when the response context-manager exits, though: that's convenient and could be easier than writing a finally clause that has to handle multiple different arrival paths.

@jayaddison jayaddison force-pushed the issue-11317/linkcheck-close-streamed-http-connections branch from 3115b20 to 9ec505f Compare April 11, 2023 18:14
@jayaddison jayaddison marked this pull request as ready for review April 11, 2023 18:28
@jayaddison jayaddison marked this pull request as draft April 12, 2023 16:19
@jayaddison
Copy link
Contributor Author

Converting to draft status while learning more about the possible effects on local socket re-use (the linkchecker is likely to contact a host multiple times while checking URL/anchor validity, and so it'd be worthwhile to be cautious about changes here).

@jayaddison
Copy link
Contributor Author

jayaddison commented Apr 14, 2023

So my understanding of this at the moment is:

  • The Sphinx linkcheck builder makes multiple, individual HTTP requests, often to the same host
  • The requests module provides opt-in connection pooling itself; Sphinx doesn't use this currently
  • requests uses urllib3 as a default HTTP adapter
  • urllib3 does provide connection pooling; requests doesn't use that for individual, non-session requests
    • Confirmed locally: checked the open connections on my machine between requests.get(...) and s = requests.Session(); s.get(...) -- the latter did hold a connection open
  • Asking requests to make an HTTP in streaming mode means that we need to tell requests when we no longer require the requests.Response object, because it can't know whether we're going to read from it
    • Garbage collection can sometimes do this for us, although apparently (based on what I've read on the urllib3 issue tracker) not all Python implementations are great at this
  • Using context management seems like a neat (meaning: concise, readable) way to indicate when we no longer require the requests.Response object

Based on those, I do think that this is sensible, and I don't think that it should cause any regressions.

Implementing #11324 at a later date could improve network-traffic utilisation; this pull request should instead only affect local client socket resource usage (and cleanup).

Update: draft implementation of #11324 is available in #11340 (and builds upon this pull request).

@jayaddison jayaddison marked this pull request as ready for review April 14, 2023 15:16
Copy link
Contributor

@francoisfreitag francoisfreitag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise LGTM.

Side note, if we want to speedup linkcheck tests, we may replace links going to servers on the internet with going to the local HTTP server, its latency should be lower and very likely faster.

tests/test_build_linkcheck.py Show resolved Hide resolved
@jayaddison
Copy link
Contributor Author

@AA-Turner this branch should resolve the linkcheck test flakiness short-term, until I can figure out and clear up the mess I'm making of #11340.

@jayaddison jayaddison force-pushed the issue-11317/linkcheck-close-streamed-http-connections branch from d74c6d3 to 77da299 Compare May 8, 2023 13:17
@AA-Turner AA-Turner merged commit c9d0933 into sphinx-doc:master May 9, 2023
22 checks passed
@AA-Turner
Copy link
Member

Thanks!

@jayaddison
Copy link
Contributor Author

Thank you, @AA-Turner!

@AA-Turner AA-Turner modified the milestones: some future version, 7.1.0 May 9, 2023
@jayaddison jayaddison deleted the issue-11317/linkcheck-close-streamed-http-connections branch May 9, 2023 16:17
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 9, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

linkcheck builder: streamed HTTP requests do not close their local socket
3 participants