RFC: what do you think of this branch adding async support to urllib3? #1323
I should probably also note that it's easy to add more backends; we only started with trio and twisted because we figured if we could get those working then we were flexible enough to handle anything else. (And we're using twisted instead of asyncio because to handle proxies we need starttls support, which is possible in asyncio right now but requires gross hacks.) |
This is all you need to implement starttls on asyncio 3.6 (will probably work on later 3.5 too):

```python
async def start_tls(loop, transport, protocol, sslcontext, *,
                    server_side=False,
                    server_hostname=None):
    """Upgrade transport to TLS.

    Return a new transport that *protocol* should start using
    immediately.
    """
    if ssl is None:
        raise RuntimeError('Python ssl module is not available')
    if not isinstance(sslcontext, ssl.SSLContext):
        raise TypeError(
            f'sslcontext is expected to be an instance of ssl.SSLContext, '
            f'got {sslcontext!r}')
    waiter = loop.create_future()
    ssl_protocol = asyncio.sslproto.SSLProtocol(
        loop, protocol, sslcontext, waiter,
        server_side, server_hostname,
        call_connection_made=False)
    transport.set_protocol(ssl_protocol)
    loop.call_soon(ssl_protocol.connection_made, transport)
    if not transport.is_reading():
        loop.call_soon(transport.resume_reading)
    await waiter
    return ssl_protocol._app_transport
```

(This is a bit simplified version of my start-tls 3.7 commit: python/cpython@f111b3d) |
So it would be interesting to see if this works with asyncio too. |
I'm sure it can, though it might be better to wait a bit to make sure the internal APIs have stabilized -- the twisted and asyncio backends will probably be very similar, and the twisted version is by far the most non-trivial (because it has to invert the inversion of control), so maintaining two similar versions for now might be churn-y. OTOH if you want to do the maintaining I guess that'd be fine :-) |
It's really up to you ;) I just wanted to note that "which is possible in asyncio right now but requires gross hacks" is a bit exaggerated and shouldn't be a showstopper for having asyncio support from the beginning if this gets merged. |
@1st1 yeah, I only knew of the trick aiohttp uses. Does the |
Yes. I was going to write a few more tests before beta-2 to see if the cancellation is handled correctly (IOW if we need try..except around that await). In any case, if this gets a green light, I'd appreciate this snippet being used to enable asyncio support in urllib3 so that we can test how well it works before 3.7 is released. |
Ping! |
I'm generally wholly in favor of this, but it's harder to review without a pull request, which is why I haven't. |
That's great to hear!
Well, this is why I'm asking about process above :-). The problem is that this is touching a huge proportion of the code. @Lukasa's v2 branch is already "2,431 additions and 2,193 deletions", and then this is a bunch more stuff on top of that. And unfortunately, the files have to be renamed in a way that confuses git. Basically this isn't something that can be treated as a simple feature branch. Which is not surprising for a v2 branch, but if we're going to go this way we do need to figure out what our strategy is.

One option would be to keep developing this branch, possibly after merging it to the v2 branch in the main repo so that you all can watch changes going into it. (If we did that then it'd probably be easier if I had a commit bit so I could merge PRs into the v2 branch.) This certainly seems doable. Given the file renames, keeping up with master will be a bit annoying – in particular, it doesn't seem possible to convince git to automatically figure out that changes to a file should be carried across its rename.

Another option would be to throw away the current branch, and start reconstructing the changes as a series of individual PRs on top of master. I'm honestly not sure how hard this would be. The httplib→h11 changes are largely self-contained, so I guess could be reconstructed fairly easily, and the big change moving files around is a big annoying thing, but again easy to reconstruct. It's always going to be the case though that (a) dropping httplib, and (b) making everything async are both going to be giant big-bang changes. It would be very difficult to make these changes as a series of incremental tweaks where everything keeps working after each change. |
Update: I was able to successfully merge from master. |
Sorry for the late response. I've been pondering this for a while, but should have posted an initial comment sooner.

I'm -1 on the code generation approach. While async IO is gaining in popularity, especially in situations where scalability is paramount, most of our users are still going to be using the synchronous API. By doing code generation, we're demoting that API, which most users use, to being not even present in the canonical source tree. What's more, it introduces substantial technical debt. Do we expect contributors to regenerate the synchronous API every time they make a change? If so, that leads to a substantially complicated code review process (both because it's an extra step for the contributor and because the maintainer will need to keep an eye out for changes of this kind). Do we expect the maintainer to regenerate it before pushing to PyPI? If so, that's an extra step that could go wrong, and means that someone using urllib3 for the synchronous API can no longer simply clone the repository and use it. And, it introduces a trust issue in that there's an artifact of the release that doesn't exist canonically anywhere else.

I'm +1 on adding an asynchronous API in general. Having a robust async API that can be used by the major packages will likely substantially reduce our support burden.

I'm +1 on moving to h11. This is a change I would like to see as a separate pull request on master, although I probably want to postpone to a future release where it's the only major change to ensure that we provide an easy backwards path for people who may potentially encounter issues with it. For this change, it would be nice if the internal API was built in a way that encourages modularity so that we can easily hook into it to add HTTP/2 support with hyper-h2 later on.

Also on the h11 front, note that this would be our first hard external requirement. I'd like to invite any Requests people (particularly @sigmavirus24, @Lukasa, and @nateprewitt) to weigh in on their experiences as a core library with hard external requirements and suggest ways that we can integrate this requirement in a minimally-disruptive manner to our downstream consumers.

Overall: we need much smaller, self-contained PRs in order to make this succeed. PRs above 500 lines or so become increasingly more difficult to review, and problems will be missed. This may mean adding and figuring out test cases for the underpinnings of the async API before actually creating that API, but it's important to me that the integrity of our code review process remain strong, which means that all code must be thoroughly reviewed. |
@haikuginger Thanks for the reply! I totally understand needing time to think things over for something like this... This is a long comment so I'm going to use headlines to break it up!

What exactly the technical consequences of this approach would be
The way the prototype works, the synchronous code is auto-generated as part of the build step. But you're correct that this would mean giving up on being able to just clone the repository and use it without going through a build step, and this definitely creates some extra friction for development. We can do things to minimize this cost. Right now the branch has a …

Tradeoffs and compromises
I want all these things too. I don't think it's possible to get all of them, and so we'll have to work out some compromise or another.

sync + async

One challenge is supporting sync + async without code generation. I've had many many conversations about this over the last few years, with folks like @Lukasa and @kennethreitz, and it's just a really hard problem. Realistically, the only options anyone has demonstrated are code duplication or code generation.

The code duplication can take a few different forms -- you can not support async at all, and then the community ends up supporting multiple independent libraries, splitting maintainer attention (see: aiohttp, treq, asks, ...). It's very unlikely that these can all be maintained to the same high standard that urllib3 has set. Or, you can write everything as confusing callback-based code, restrict yourself to supporting callback-based async libraries (so e.g. not trio), and then maintain a synchronous shim around all of it. Or, you could write async code like in our branch, but instead of doing code generation you could maintain a second synchronous shim around every public API that implements the sync version using the async version. However, aside from being a lot of extra work, this last approach also requires giving up on Python 2 support, which seems premature for urllib3 given that you're upstream of projects like pip and boto.

If there's another option we've missed, then I'd love to know. We only ended up working on the code generation approach after ruling out everything else we could think of. And it's true it has some downsides. But.... it works :-). Do you have any alternative suggestions?

h11

Now that @Lukasa's no longer at HPE, I don't think anyone is volunteering to do this work unless it also gives us async support (lmk if that's wrong). So while logically this could be done as a standalone feature, I don't think it's viable from a project management perspective?

Re: managing the dependency, when an h11-based urllib3 starts shipping I would definitely suggest vendoring h11 (at least at first), because this will be a huge step function increase in h11's usage, and vendoring gives us more flexibility if we discover we need to make emergency fixes.

My impression is that HTTP/2 support will require some significant rearchitecting, rather than being something that can easily be hidden behind an abstract HTTP connection interface. Which isn't a problem, it just means that I don't think it's worth spending a lot of time now trying to design an abstract HTTP connection interface given we'll have to redo it later anyway. The things we can do now to prepare for HTTP/2 support are things like hiding the connection object from the public API, and generally keeping it in mind when designing new public APIs. (E.g., it suggests that maybe connection shutdown should be async.)

Splitting everything up into small PRs

I don't really know how to do this. OTOH, I'm more hopeful that we could figure something out here than that we can avoid code generation :-).

First, a question: all this work is based on @Lukasa's v2 branch in the official repo. That's just focused on moving to h11, and it's incomplete, but there's already a ton of work there: 278 commits, 2431 insertions(+), 2193 deletions(-). Is this something you'd want to see broken up into small PRs as well, or does the fact that it was already done in the main urllib3 repo, at least in part via PRs into the v2 branch, give us some kind of dispensation? Because I understand what we've done on top of that much better than I understand everything @Lukasa did :-). (@Lukasa, if you have any thoughts on how this change could be managed in a way that won't break all our heads, I'd love to hear them!)

I can imagine splitting this up into a series of PRs into the v2 branch. In fact a lot of it was already done that way, just in this repo rather than the main repo. One possible way forward would be to merge what we have into the v2 branch (either in one big chunk or piecemeal), and then continue to develop it as a series of PRs to that branch, so that the urllib3 maintainers can keep an eye on what's happening as we go.

I guess maybe it is also worth pointing out that if you want to add h11 as a dependency, that by itself effectively means you're importing ~1000 lines of my code without reviewing it, so maybe it's not so bad to trust my reviews of @pquentin's work :-). Obviously it's not quite the same though...

tl;dr

I guess the main question is just, does anyone have any realistic alternatives to code generation? That's the big question everything else hinges on. |
A question regarding the code generation approach: I understood @haikuginger's main concern as being with the source tree containing only the async code by default, and generating the sync code from that, even though it's the async version that's new, and the vast majority of library clients will still be relying on the generated synchronous code. So my question would be whether or not the code generation approach would be amenable to a comment, decorator, or marker function based solution whereby it was the synchronous code that lived in source control, and the code generator injected async and await keywords in the appropriate places to make the async version. For example:
Which would become:
You'd need at least those two directives (convert a `def` to `async def`, and insert `await` at a marked call). You could potentially even use the current code generator to emit the first version of the annotated synchronous code (by having it add the directive comments when it drops an `await`). |
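To make the directive idea concrete, here is a toy sync→async transformer of my own. The pragma spellings (`# pragma: async-def`, `# pragma: await`) are invented for illustration; they are not from the thread, and the translator is deliberately line-based and naive (e.g. it splits on the first `=`):

```python
import re

ASYNC_DEF = "# pragma: async-def"
AWAIT = "# pragma: await"

def sync_to_async(source):
    """Toy line-based sync->async translator driven by comment pragmas.

    Handles only the two directive forms discussed above: marking a
    function as a coroutine, and marking a statement whose call should
    be awaited (either an assignment or a standalone expression).
    """
    out = []
    for line in source.splitlines():
        if line.endswith(ASYNC_DEF):
            code = line[: -len(ASYNC_DEF)].rstrip()
            line = re.sub(r"\bdef\b", "async def", code, count=1)
        elif line.endswith(AWAIT):
            code = line[: -len(AWAIT)].rstrip()
            lhs, sep, rhs = code.partition("=")
            if sep:  # assignment: insert await before the right-hand side
                line = lhs.rstrip() + " = await " + rhs.strip()
            else:    # standalone expression statement
                indent = code[: len(code) - len(code.lstrip())]
                line = indent + "await " + code.strip()
        out.append(line)
    return "\n".join(out)
```

For input like `data = sock.recv(1024)  # pragma: await`, the generated line becomes `data = await sock.recv(1024)`, while the pragma-free lines pass through untouched.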
I would certainly be more comfortable with an explicit pragma, as suggested by @ncoghlan, which preserves the synchronous API as the first-class citizen. I'd still like to see if we can arrive at a solution that avoids codegen in general, though. The way I see it is that if the constraints of the Python language syntax make it difficult or impossible to use the same code for both synchronous and asynchronous I/O, then, by default, sync and async versions of a similar function are fundamentally different code, even if textually similar. If, on the other hand, we believe that sync and async are actually fundamentally similar when the code inside the block is similar, but the Python syntax doesn't allow for the needed modularity, then I would consider this an issue with the syntax itself. IMO, the correct way to fix such an issue (if, in fact, it is one) would not be to work around it in library code, but to file a PEP or create a library that adjusts the Python syntax or provides additional faculties to declare the isomorphism. I'm going to be spending the next day or so digging into asyncio to make sure I'm able to carry on a competent conversation about it. Thoughts from anyone else who'd be affected by this change are very welcome. |
@ncoghlan It's not clear to me how your pragmas know where to insert the `await`. It seems like we're worried that not enough people know about Python's async/await syntax, so we're considering inventing a new syntax that's more awkward and that nobody knows?

@haikuginger So here's the basic problem.
So as long as we support Python 2, it seems completely impossible to have a single implementation of urllib3's logic that supports both sync + async without using code generation. This also makes it hard to fix anything with a PEP, since the only thing a PEP can affect is Python 3.8+.

What if we drop Python 2? That increases our options a little bit, but then we have another problem. In async mode, lots of urllib3's public functions/methods change their signature slightly: they're annotated as async instead of sync. So the extra option we gain is: we can keep everything async internally, but then wrap it in a synchronous shim. But this still effectively means maintaining two copies of our public API, which have to be manually kept in sync (pun not intended).

If it helps, the code generation is about as trivial as code generation can be: it just strips out the async/await annotations.
That's why we're inclined to see the sync and async code as being fundamentally similar: literally all of the actual logic and semantics are identical. The only difference is removing this extra bookkeeping that async code requires. |
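To illustrate how mechanical that stripping can be, here is a toy async→sync translator built on Python's `ast` module. This is my own sketch of the idea, not the actual script used on the bleach-spike branch, and it handles only `async def` and `await` (a real tool would also cover `async with`, `async for`, and symbol renaming):

```python
import ast

class StripAsync(ast.NodeTransformer):
    """Rewrite 'async def' to 'def' and drop 'await' keywords."""

    def visit_AsyncFunctionDef(self, node):
        self.generic_visit(node)
        # Copy every field across, so the result is the same function
        # definition minus the coroutine marker.
        sync_node = ast.FunctionDef(
            **{field: getattr(node, field) for field in node._fields}
        )
        return ast.copy_location(sync_node, node)

    def visit_Await(self, node):
        self.generic_visit(node)
        return node.value  # 'await expr' -> 'expr'

def async_to_sync(source):
    """Return a synchronous rendering of async Python source."""
    tree = StripAsync().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # requires Python 3.9+

sync_src = async_to_sync(
    "async def double(x):\n"
    "    return await add(x, x)"
)
print(sync_src)  # a plain 'def double(x)' with the await stripped
```

The generated code is ordinary synchronous Python: it can be exec'd and called directly, with no event loop involved.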
I think there's almost zero chance that a simple PEP-able solution exists for this problem, so I wouldn't expect anything to happen in 3.8.
I propose a different approach. Instead of changing everything to use async/await and having a tool to convert it to synchronous code, why don't we design a tool to produce async/await code from synchronous code. Let's call this new hypothetical tool "mirror". And let's imagine we want to make the following snippet of code sync/async compatible:

```python
def send_request(self, request, read_timeout):
    """
    Given a Request object, performs the logic required to get a response.
    """
    h11_response = _start_http_request(
        request, self._state_machine, self._sock
    )
    return _response_from_h11(h11_response, self)
```

We can import the `mirror` module and annotate the code with it:

```python
import mirror

@mirror.coroutine
def send_request(self, request, read_timeout):
    """
    Given a Request object, performs the logic required to get a response.
    """
    h11_response = mirror.wait(_start_http_request(
        request, self._state_machine, self._sock
    ))
    return _response_from_h11(h11_response, self)
```

Now, for synchronous code, `mirror.wait` and `mirror.coroutine` are simple pass-throughs, so the annotated code keeps working unchanged. For asynchronous code, our hypothetical `mirror` tool would rewrite it to:

```python
async def send_request(self, request, read_timeout):
    """
    Given a Request object, performs the logic required to get a response.
    """
    h11_response = await (_start_http_request(
        request, self._state_machine, self._sock
    ))
    return _response_from_h11(h11_response, self)
```

This approach still requires a pre-compiler, but now the pre-compiler is only required to produce async code and is completely optional. It is similar to what @ncoghlan proposed here earlier, but requires no new sub-syntax in comments, instead relying on simple function calls and decorators.
If we create a library to handle this transition for urllib3, I'll use it in my asyncpg project to make a synchronous version of it. |
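For the synchronous case, the two `mirror` primitives really can be this small. The sketch below is my own guess at the hypothetical module's sync-mode behaviour (only the names `coroutine` and `wait` come from the proposal above; `send_request` here is a stand-in, not urllib3 code):

```python
# Sketch of the hypothetical "mirror" module's synchronous behaviour.

def coroutine(fn):
    # In sync mode the decorator changes nothing.
    return fn

def wait(value):
    # In sync mode the wrapped call has already executed eagerly,
    # so wait() just passes its result through.
    return value

@coroutine
def send_request(request):
    # Stand-in for the real blocking I/O work.
    return wait(("response for", request))
```

Because both primitives are identity operations, the annotated code runs synchronously with essentially zero overhead; all of the real work happens in the optional pre-compiler that produces the async variant.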
@njsmith My simple example pragmas would require that all await expression be either standalone expression statements or else the RHS of an assignment statement, so your print example would need to be written as:
That said, I do agree that having the async version be the "source of truth" is a better option, specifically due to the fact that it avoids the question of "Where should the It also occurs to me that @haikuginger's code generation concerns could potentially be mitigated by using an approach similar to the one CPython uses for many of its generated files: check them in to source control, and then have CI complain if running the code generator changes the contents of the generated files. We do that in CPython so that you don't typically need to bootstrap and run |
Hmm. The "where should the await go" question could be solved easily. Thus python 2 compatible code would easily and bidirectionally(!) transform into an async version without further mangling. Bidirectionality is important. When working on an async bug I don't want to have to replicate my fixes into the sync version. That's tedious and error prone. |
@smurfix makes an extremely interesting syntax suggestion which could form the basis for a code transformation process I'd be comfortable with. Doing it this way could create an extremely distinctive token for replacement while preserving the synchronous API as a first-class citizen. Combined with @ncoghlan's suggestion of having CI autofail if codegen modifies anything, and a bidirectional toolchain that detects which version has been modified and propagates the modification to the other version, I think we're getting somewhere. I do think we wouldn't want to necessarily call our placeholder token `await`; something like one of these might work instead:

```python
return 42 + (resolve_potential_coroutine <= bar())
return 42 + (finalize << bar())
return 42 + (complete | bar())
```
|
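These operator-based tokens work because the marker object can overload the operator and, in the synchronous build, simply pass the right-hand operand through. A minimal sketch of that mechanism (my own illustration; `AwaitMarker`, `complete`, `bar`, and `answer` are invented names):

```python
class AwaitMarker:
    """In sync code, 'complete | expr' (or '<<', or '<=') is a no-op;
    a code generator would rewrite the whole expression to 'await expr'
    when producing the async version."""

    def __or__(self, other):
        return other

    def __lshift__(self, other):
        return other

    def __le__(self, other):
        return other

complete = AwaitMarker()

def bar():
    # Stand-in for a call that would be awaitable in the async build.
    return 8

def answer():
    return 42 + (complete | bar())
```

In sync mode `complete | bar()` evaluates `bar()` eagerly and returns its result unchanged, so the marker costs one extra method call and nothing else.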
So, I'm not sure if Requests' plans influence any decisions here or not, but the tentative plans for Requests 3.0 are:
Just wanted to throw that out here. Potentially not supporting 2 would make this work a lot easier. |
@haikuginger I also agree that we should use something other than "await", for the simple reason that the sync version would also work with Python3. Personally I'd just go with |
Using The
(Whether the transformation marker looks better on the preceding line or as a trailing comment would presumably become clearer once some of the code had been suitably marked up - the trailing comment would require that the code transformer be a bit more clever, but it would still only involve looking ahead to the end of the current line) |
I'm glad to see out of the box thinking! I would never have thought about those options.
|
@pquentin Well …
|
For the last point, you're right, it looks like I don't fully understand what your |
@kennethreitz It turns out that not supporting py2 doesn't help as much as you might think. If we only care about py35+, then it's possible for the sync and async versions to share their internal implementation. But, the whole public API still needs to be duplicated, which is a job for a computer rather than a human, and once you've got tools to do that, supporting py2 isn't much harder... |
Lots of interesting ideas here! It sounds like there are two orthogonal decisions. One is whether to write async and generate sync or write sync and generate async -- I'm going to call these "a2s" and "s2a" respectively so my fingers don't get tired. The other decision is whether to check the generated code into git.

use async to generate sync (a2s), or vice-versa (s2a)?

I think the s2a proposals above could work, though there'd need to be some more stuff added. AFAICT, the main advantage of making the sync code the source of truth (s2a) is that someone could do a `git clone` and use the synchronous code directly, without a build step.

For contributors modifying the code, I don't think this choice affects the workflow. Either way, the test suite will want to test both versions of the code, so the workflow will be (1) modify canonical version, (2) regenerate the other version, (3) run the tests. (And probably in practice there'll be a script that takes care of doing regeneration + testing with a single button press.)

There are also some downsides to s2a though:

Should generated code be checked in to git?

Again, both options are perfectly doable. Which is better? Again, the main thing this would get us is that it would allow users who want to do a `git clone` to skip the build step.

For contributors modifying the code, and reviewers, this adds some extra hassle: first, you have to remember to regenerate before committing. I guess this isn't a huge deal because you have to regenerate before testing anyway, and hopefully you test before committing, right? But it does mess up some workflows, e.g. if you're doing line-by-line staging to split up a change into self-contained commits. It also adds clutter to reviews -- we can check that things were regenerated properly in CI, so the reviewers don't have to actually look at the generated code, but it'll still show up in diffs.

Evaluation

Using both s2a and checking in generated artifacts definitely doesn't make sense, because their advantages are basically the same but their disadvantages are different, so you end up paying twice but only getting one benefit. If we have to pick between s2a and checking in generated artifacts, checking in generated artifacts seems better to me: it's straightforward and easy to understand.

I am very dubious that the 'git clone and don't build' use case is important enough to justify any of these approaches. Both of these proposals add extra ongoing hassle for contributors, in order to save a small amount of hassle for a small minority of users who are doing something that arguably shouldn't be supported in the first place, and honestly we need contributors more than we need users. But the decision here is ultimately up to the urllib3 maintainers.

One more thing...

There's another proposal above that I didn't talk about yet: storing both versions, and having them both be the source of truth. So it's basically s2a and a2s and checking in the source, all at once.
This sounds really unpleasant to me: basically it has all the disadvantages of all of these approaches, plus new ones like: if there's a discrepancy between the two versions in some tree, how does the tooling figure out which side should be regenerated? This feels way too confusing and over-engineered IMO. Also, if I'm reading right, the reason @smurfix proposed bidirectionality is because he doesn't want to deal with an ad hoc encoding of async annotations into synchronous python, not that he actually wants bidirectionality per se :-). |
Well … partly. I am proposing bidirectionality (i.e. either version may serve as the "source") because that's the way I work. "Hack, save source, go to other window, run testcase", no matter whether I'm working on the sync or async version. No converter involved. And let's face it, tooling can mechanically generate the "other" version based on which version has been modified. If the answer is "both", it should check that translating the older file matches the newer one, else complain. Obviously, this means I'd check in both versions. Usually I don't like that at all, but in this case it'd make enough sense IMHO. Otherwise I'm strongly in favor of keeping the async version as the True Source, if only because removing async annotations is way easier than adding them. |
It's not up to PyPI, PyPI just presents a list of files and related information about them and the installer picks which one it installs. For modern versions of pip, it would get the most recent version whose `Requires-Python` metadata matches the running interpreter. Although I feel like this is a bit of a digression, because @njsmith has already stated that maintaining Python 2.x support isn't particularly hard if you have sync support, the hard/sticky part is maintaining sync support. |
The way you wrote this makes me uncertain whether you understood the trick that we're using here :-). There are no sync wrappers; instead, we basically treat the async code as a DSL for generating the sync code. This is really easy, because for everything except the lowest-level I/O functions, the async code is already just the sync code + async/await annotations; stripping out those annotations gives us a sync version of the library. So there's no need to maintain any kind of wrapping layer, and there are no extra function calls.

The nice thing about this strategy is that it's totally generic; once the async→sync script is debugged, you can take any library that's built on top of urllib3 and turn it into a sync+async library. All you have to do is add the async/await annotations at the appropriate places in your code and then add the async→sync plugin to your setup.py. Since you're here and working on boto these days, let me ask: do you think this is something you'd be interested in using in boto? I know @brettcannon is eager to get async support into the Azure SDK... |
Nah I grok'd it, I just worded it poorly since I don't have a word to describe what it is.
Possibly! |
While I suspect you're not going to need it (given the way the rest of the discussion is going), I'll note that https://packaging.python.org/guides/dropping-older-python-versions/ exists now to cover the ins-and-outs of using Python-Requires to deliver older maintenance releases to users on older Python versions. As far as what to call the synthetic synchronous API in Nathaniel's proposal goes, perhaps "synchronous shadow API" would work? |
* Implement wait_for_{read,write} directly on top of 'select' module

  Our vendored backport of the 'selectors' module is huge, its tests are flaky, and the only thing we use it for is to... turn around and implement some trivial select() operations. It lets urllib3 use epoll or kqueue... but the way urllib3 uses them, they're actually *less* efficient than just using poll or select. This commit removes the dependency on 'selectors', by implementing urllib3's wait_for_{read,write} helpers directly on top of poll/select. Because I'm sneaky, it does this in terms of a more generic wait_for_socket(...) operation, which is exactly the primitive that we need for the bleach-spike branch (see gh-1323), so this should also help keep the diff down.

* Delete selectors backport

  Now that we're not using it anymore (see previous commit), it can be deleted.

* Address comments from review

  - Check that poll works before using it
  - Add a test for detecting readability after the peer closes the socket
  - Remove an unnecessary coverage-control comment

* Delay choice of wait_for_socket backend until first call

  To work around things like:

      import urllib3
      from gevent import monkey
      monkey.patch_all()

  Since this means we no longer make an import-time decision about wait_for_socket support, I also got rid of the constant HAS_WAIT_FOR_SOCKET.

* Add test for calling wait_for_socket on a closed socket

* Fix lint errors
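The shape of the resulting helper is roughly this. It's a minimal sketch using only select(); the actual commit also probes whether poll works before using it and keeps the read/write variants as separate public helpers:

```python
import select

def wait_for_socket(sock, read=False, write=False, timeout=None):
    """Block until the socket is ready for the requested operation(s).

    Returns True if the socket became readable/writable within the
    timeout, False otherwise. timeout=None means wait forever.
    """
    rlist = [sock] if read else []
    wlist = [sock] if write else []
    rready, wready, _ = select.select(rlist, wlist, [], timeout)
    return bool(rready or wready)
```

A plain connected socket is immediately writable, and becomes readable as soon as the peer sends data, which is exactly the readiness signal urllib3's helpers need.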
Given how hard it is to make substantial changes to Assuming this as a given, we do want to support both sync and async use cases. Having done a lot of thinking on async Python lately, to me, conceptually, |
@jmehnle The trivial event loop would look like run_secretly_sync_async_fn, right? As discussed a few comments later in that link, doing this for the whole urllib3 API is quite a lot of work, and we already get this for free if we want Python 2 support. But if you drop Python 2 support, then it's certainly an option. As noted by @shazow, it's not clear that we'll be able to finish this before Python 2 EOL anyway. |
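The trick behind a "secretly sync" runner is that a coroutine whose awaits never actually suspend can be driven to completion with a single `send(None)`. A sketch of the idea (the function and example names here are my own):

```python
def run_secretly_sync(coro):
    """Run a coroutine that never genuinely suspends.

    If every 'await' inside resolves immediately, the first send()
    drives the coroutine all the way to its return value.
    """
    try:
        coro.send(None)
    except StopIteration as exc:
        return exc.value
    raise RuntimeError("coroutine tried to perform real async I/O")

async def add(a, b):
    # No real awaiting happens here, so this is secretly synchronous.
    return a + b
```

This is the sense in which a "trivial event loop" can back a synchronous API: no loop object exists at all, just one `send()` and a caught `StopIteration` carrying the result.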
Re trivial event loop: Yeah, pretty much. I think the synchronous shim layer could be generated at run-time using meta-programming techniques. |
why not make everything async in the backend, and provide both async and sync code to the final user, where the sync version is a wrapper that gets an event loop and runs until complete? |
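A minimal sketch of what such a wrapper looks like (the names `make_sync` and `fetch_length` are invented for illustration; this spins up a fresh event loop per call, which is part of why the approach has practical limits):

```python
import asyncio
import functools

def make_sync(async_fn):
    # Wrap an async function so callers can use it synchronously;
    # each call runs the coroutine to completion on its own loop.
    @functools.wraps(async_fn)
    def wrapper(*args, **kwargs):
        return asyncio.run(async_fn(*args, **kwargs))
    return wrapper

async def fetch_length(data):
    await asyncio.sleep(0)  # stand-in for real network I/O
    return len(data)

fetch_length_sync = make_sync(fetch_length)
```

Note that `asyncio.run` refuses to be called from inside a running event loop, so a sync facade built this way cannot be used from async code, which is one of the sharp edges projects hit with this design.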
@rspadim That has been tried in HTTPX and determined that it's not feasible. I can't recall why exactly but I'm sure you could dig up the reasoning in their issue tracker. |
Curious to know if this has been discussed in the last few years? Is this something that will get added to urllib3 or is it likely to wait for a urllib4 or something similar? |
I've been meaning to close this issue for a while now. To be clear, the https://github.com/python-trio/hip effort had made great progress:
But then making each test pass in async was going to be yet another slog, and after that I would have to tackle requests. It was becoming lonely and urllib3 was continuing to move fast. So hip has stalled, and I'm now focusing my limited time on urllib3 itself. The goal is to (very) slowly move towards h11 support in urllib3, and maybe eventually async support. We've learned various lessons with hip, and will apply them to urllib3 in time. The first step here is to decouple us from the Python standard library HTTP support in urllib3 2.0: #1985. Closing the issue as there's nothing actionable left here. Thanks everyone for the support! edit: if you want an async client today, try httpx (it supports trio and asyncio) or aiohttp (asyncio only) |
Today, can urllib3 support async?
We (mostly @pquentin and I) have been working on a proof of concept for adding pluggable async support to urllib3, with the hope of eventually getting this into upstream urllib3. It's reached the point where there are still lots of missing bits, but there's an end-to-end demo working and we don't see any major obstacles to getting everything else working, so I wanted to start getting feedback from the urllib3 maintainers about whether this looks like something you'd be interested in eventually merging, and what it would take to get there.
Demo
So: hi! Check it out – a single py2.py3 wheel that keeps the classic synchronous API, and on python 3.6+ it can also run in async mode on both Trio and Twisted:
Things that (probably) don't work yet include: HTTPS, timeouts, proper connection reuse, using Python 2 to run `setup.py bdist_wheel`, and running the test suite. OTOH some non-trivial things do work, like chunked transfer encoding, and there's at least code in there for proxy support and handling early responses from the server.

what sorcery is this
This is based on @Lukasa's "v2" branch, so it's using h11. (Goodbye httplib 👋.) The v2 branch is currently stalled in a pretty half-finished state (e.g. the I/O layer is a bunch of raw select loops open-coded everywhere that IIRC don't work; I'm sure @Lukasa would have cleaned it up if he had more time); we cleaned all that up and made it work. The major new code is here: https://github.com/python-trio/urllib3/blob/bleach-spike/urllib3/_async/connection.py
And then we made the low-level I/O pluggable – there's a small-ish ad hoc API providing the set of I/O operations that urllib3 actually needs, and several implementations using different backends. Note that the I/O backend interface is internal, so we can adjust it as needed; there's no attempt to provide a generic abstract I/O interface that would work for anyone else. The code for the backends is here: https://github.com/python-trio/urllib3/tree/bleach-spike/urllib3/_backends
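For a sense of the shape of such an interface, here's a hypothetical sketch (the names and signatures are invented for illustration; the actual bleach-spike backend API differs):

```python
from abc import ABC, abstractmethod

class AsyncBackend(ABC):
    """Hypothetical sketch of a pluggable I/O backend.

    The idea: urllib3 codes against a handful of primitives, and each
    event loop (Trio, Twisted, a synchronous fallback, ...) supplies a
    concrete implementation. The real bleach-spike interface differs.
    """

    @abstractmethod
    async def connect(self, host, port, source_address=None):
        """Open a TCP connection and return an AsyncStream."""

class AsyncStream(ABC):
    @abstractmethod
    async def send_all(self, data):
        """Send all of `data` before returning."""

    @abstractmethod
    async def receive_some(self, max_bytes):
        """Return up to `max_bytes` of data, or b'' at EOF."""

    @abstractmethod
    async def start_tls(self, ssl_context, server_hostname):
        """Upgrade this stream to TLS in place (needed for CONNECT proxies)."""
```

Keeping the surface this small is what makes a Twisted backend feasible at all: only these few operations have to invert Twisted's inversion of control.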
Then we started adding async/await annotations to the rest of urllib3's code. That gave us a version that could work on Trio and Twisted. But wait, we also want to support python 2! And everyone using the existing synchronous API on python 3 too, for that matter. But we don't want to maintain two copies of the code. Unfortunately, Python absolutely insists that async APIs and synchronous APIs be loaded from different copies of the code; there's no way around this. (Mayyybe if we dropped python 2 support and were willing to maintain a giant shim layer then it would be possible, but I don't think either of those things is true.)
Solution: we maintain one copy of the code – the version with async/await annotations – and then a little script maintains the synchronous copy by automatically stripping them out again. It's not beautiful, but as far as I can tell all the alternatives are worse.
Currently the async version (source of truth) lives in `urllib3/_async/`, and then `setup.py` has the code to automatically generate `urllib3/_sync/...` at build time. (This script should probably get factored out into a separate project.) The script is not at all clever; you can see the exact set of transformations in the file, but basically it just tokenizes the source, deletes `async` and `await` tokens, and renames a few other tokens like `__aenter__` → `__enter__`. There's no complicated AST manipulation, and comments are preserved. Then `urllib3/__init__.py` imports things from `urllib3._sync` and (if it's running on a new enough Python) `urllib3._async`.

The resulting dev experience is sort of half-way between that of a pure-Python project and one with a C extension: instead of an edit/run cycle, you need an edit/compile/run cycle. But you still end up with a nice py2.py3-none-any wheel at the end, so you don't need to stress out about providing binary builds for different platforms, and the builds are extremely fast because it's just some shallow text manipulation, not invoking a C compiler.
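To make the approach concrete, here's a toy version of such a stripping script built on the stdlib `tokenize` module (the branch's real script handles more renames and edge cases; this is just an illustration of the token-level, no-AST style):

```python
import io
import tokenize

# A few of the renames the real script would need (illustrative subset).
RENAMES = {
    "__aenter__": "__enter__",
    "__aexit__": "__exit__",
    "__anext__": "__next__",
    "StopAsyncIteration": "StopIteration",
}

def async_to_sync(source):
    """Strip `async`/`await` keywords and rename async-only dunders.

    Uses only shallow token manipulation: no AST work, and comments
    survive because they're ordinary tokens.
    """
    result = []
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    for tok in tokens:
        if tok.type != tokenize.STRING and tok.string in ("async", "await"):
            continue  # drop the keyword entirely
        if tok.type == tokenize.NAME and tok.string in RENAMES:
            tok = tok._replace(string=RENAMES[tok.string])
        result.append((tok.type, tok.string))
    # Two-element tuples put untokenize() in "compatibility mode":
    # spacing is regenerated rather than preserved exactly, but the
    # output is still valid Python.
    return tokenize.untokenize(result)
```

Feeding it `async def fetch(conn): ...` yields a plain `def fetch(conn): ...` that runs on any supported Python.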
Oh, and for backcompat we also added some shim files like `urllib3/connectionpool.py` that just re-export stuff from the corresponding files in `urllib3/_sync/...`, since these submodules are documented as part of the API. This also has the benefit of making it easier to avoid accidentally exporting things in the future; urllib3's public API is perhaps larger than it should be.

Importantly, this basic strategy would also work for libraries that use urllib3, so it provides a path forward for higher-level projects like requests, botocore, etc. to provide dual sync/async APIs while supporting Python 2 + multiple async backends.
Backwards compatibility
So far this is looking surprisingly good. Switching to h11 means losing most of `urllib3.connection`, since that's directly exposing httplib APIs. In async mode we can't support lazily loading `response.data`, but that's fine, and we can still support it in sync mode if we want. Right now the branch keeps the "close" operations synchronous, but we might want to switch to making them async, because it might make things easier for HTTP/2 later. If we do switch to async close, that will force some broader changes – in particular, `RecentlyUsedContainer` assumes that it can close things from `__setitem__` and `__delitem__`, which can't be made async. It wouldn't be hard to rewrite `RecentlyUsedContainer`, but right now it's technically a public API.

I'm not sure about some of the more obscure stuff like `urllib3.contrib`. In particular `urllib3.contrib.pyopenssl`, `urllib3.contrib.socks`, and `urllib3.contrib.securetransport` seem difficult to handle in an I/O-backend-agnostic way – maybe we could hack them to keep working on the sync backend only? And I don't know what to think about the appengine support. Maybe it's fine because it doesn't actually use urllib3 internals at all?

But overall, my impression is that we could keep quite a high degree of backwards compatibility if we want. Perhaps we'd want to break things for other reasons if we're doing a v2, but that's a separate discussion.
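To make the `RecentlyUsedContainer` point concrete, here's a simplified sketch of the pattern (not urllib3's actual code): disposal of evicted connections happens inside `__setitem__`, and Python requires dunder methods to be plain `def`, so a close operation that had to be awaited couldn't live there:

```python
from collections import OrderedDict

class RecentlyUsedContainer:
    """Simplified LRU-ish container in the style of urllib3's.

    `dispose_func` (e.g. ``lambda conn: conn.close()``) is invoked on
    evicted values -- synchronously, because __setitem__ can't be async.
    """

    def __init__(self, maxsize=10, dispose_func=None):
        self._maxsize = maxsize
        self._dispose_func = dispose_func
        self._container = OrderedDict()

    def __setitem__(self, key, value):
        evicted = None
        if key in self._container:
            evicted = self._container.pop(key)
        elif len(self._container) >= self._maxsize:
            _, evicted = self._container.popitem(last=False)  # drop oldest
        self._container[key] = value
        if evicted is not None and self._dispose_func is not None:
            # The sticking point: this call could not become
            # `await self._dispose_func(evicted)` inside a dunder method.
            self._dispose_func(evicted)

    def __getitem__(self, key):
        self._container.move_to_end(key)  # mark as recently used
        return self._container[key]
```

An async-close design would have to move disposal out of the dunder methods entirely, e.g. by collecting evicted items and awaiting their close from an async caller.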
Questions
Basically, at this point it seems clear that this approach can work, and what's left is just straightforward work. So:
Does this seem like the direction we want urllib3 to go?
If so, then how would you like that process to go?
I know @Lukasa is wary of the code generation approach, but the alternatives seem to be (a) maintaining a ton of redundant code, or (b) not supporting async in urllib3, in which case the community will... end up maintaining a ton of redundant code anyway, as each I/O library separately implements its own fake copy of requests. And I don't believe we have enough HTTP experts in the community to do that well; I think there are huge benefits to concentrating all our maintainers on a single library if we can.
CC: @kennethreitz @markrwilliams