Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix websocket connection leak #7978

Merged
merged 13 commits into from Dec 18, 2023
Merged

Fix websocket connection leak #7978

merged 13 commits into from Dec 18, 2023

Conversation

bdraco
Copy link
Member

@bdraco bdraco commented Dec 17, 2023

fixes #6325

The existing implementation never closed the transport unless heartbeat was enabled (not default), the pong was not received, and .close() had not been called before the pong was not received since it cancelled the timer.

We now always close the transport after we get the WSMsgType.CLOSE message or an abnormal closure occurs instead of hoping asyncio calls connection_lost which may not happen in a reasonable time frame (or ever?) for a variety of reasons deeper in the stack (network, bad client implementations, etc).

Any time we are closing/setting self._close_code we must either call self.close() or close the transport otherwise its possible to leak the connection.

To facilitate that, and reduce the chance of future refactoring re-introducing the problem, _set_code_close_transport is added which sets the code and closes the transport.

All other places self._close_code is set have been audited to ensure they also call self.close() There is one exception if self._autoclose is set to False and the client closes the connection, than its the responsibility of the caller to close the connection as otherwise it can leak. This is now documented.

Copy link

codecov bot commented Dec 17, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (5e44ba4) 97.43% compared to head (d0b459d) 97.44%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7978      +/-   ##
==========================================
+ Coverage   97.43%   97.44%   +0.01%     
==========================================
  Files         107      107              
  Lines       32346    32370      +24     
  Branches     3748     3751       +3     
==========================================
+ Hits        31516    31544      +28     
+ Misses        629      624       -5     
- Partials      201      202       +1     
Flag Coverage Δ
CI-GHA 97.36% <100.00%> (+0.01%) ⬆️
OS-Linux 97.03% <100.00%> (+0.01%) ⬆️
OS-Windows 95.52% <100.00%> (+<0.01%) ⬆️
OS-macOS 96.85% <100.00%> (+0.01%) ⬆️
Py-3.10.11 95.44% <100.00%> (+<0.01%) ⬆️
Py-3.10.13 96.84% <100.00%> (+0.01%) ⬆️
Py-3.11.6 ?
Py-3.11.7 96.57% <100.00%> (+0.04%) ⬆️
Py-3.12.0 96.50% <100.00%> (+0.01%) ⬆️
Py-3.12.1 96.62% <100.00%> (+0.01%) ⬆️
Py-3.8.10 95.41% <100.00%> (+<0.01%) ⬆️
Py-3.8.18 96.77% <100.00%> (+0.01%) ⬆️
Py-3.9.13 95.41% <100.00%> (+<0.01%) ⬆️
Py-3.9.18 96.80% <100.00%> (+<0.01%) ⬆️
Py-pypy7.3.13 96.36% <100.00%> (+0.01%) ⬆️
VM-macos 96.85% <100.00%> (+0.01%) ⬆️
VM-ubuntu 97.03% <100.00%> (+0.01%) ⬆️
VM-windows 95.52% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@bdraco bdraco added backport-3.9 Trigger automatic backporting to the 3.9 release branch by Patchback robot backport-3.10 Trigger automatic backporting to the 3.10 release branch by Patchback robot labels Dec 17, 2023
@psf-chronographer psf-chronographer bot added the bot:chronographer:provided There is a change note present in this PR label Dec 17, 2023
tests/test_web_websocket.py Dismissed Show dismissed Hide dismissed
@bdraco bdraco changed the title DO NOT MERGE: Fix websocket connection leak Fix websocket connection leak Dec 18, 2023
@bdraco
Copy link
Member Author

bdraco commented Dec 18, 2023

pushed to production a new hours ago. all working well

Copy link
Member

@Dreamsorcerer Dreamsorcerer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if the leak in that demo is gone, then lets merge this.

@bdraco
Copy link
Member Author

bdraco commented Dec 18, 2023

I am no longer able reproduce it with the provided script after this change.

I will deploy the change more widely on my production later today and merge if no regressions are observed

@bdraco
Copy link
Member Author

bdraco commented Dec 18, 2023

deployed widely. watching logs

@bdraco bdraco marked this pull request as ready for review December 18, 2023 20:26
@bdraco
Copy link
Member Author

bdraco commented Dec 18, 2023

Logs look good. Home Assistant's object logger / leak checker doesn't see the leak anymore.

@bdraco bdraco merged commit 6f1c608 into master Dec 18, 2023
32 of 34 checks passed
@bdraco bdraco deleted the fix_ws_close branch December 18, 2023 21:27
Copy link
Contributor

patchback bot commented Dec 18, 2023

Backport to 3.9: 💔 cherry-picking failed — conflicts found

❌ Failed to cleanly apply 6f1c608 on top of patchback/backports/3.9/6f1c608fe756baa4b82ea8714373cebab0252265/pr-7978

Backporting merged PR #7978 into master

  1. Ensure you have a local repo clone of your fork. Unless you cloned it
    from the upstream, this would be your origin remote.
  2. Make sure you have an upstream repo added as a remote too. In these
    instructions you'll refer to it by the name upstream. If you don't
    have it, here's how you can add it:
    $ git remote add upstream https://github.com/aio-libs/aiohttp.git
  3. Ensure you have the latest copy of upstream and prepare a branch
    that will hold the backported code:
    $ git fetch upstream
    $ git checkout -b patchback/backports/3.9/6f1c608fe756baa4b82ea8714373cebab0252265/pr-7978 upstream/3.9
  4. Now, cherry-pick PR Fix websocket connection leak #7978 contents into that branch:
    $ git cherry-pick -x 6f1c608fe756baa4b82ea8714373cebab0252265
    If it'll yell at you with something like fatal: Commit 6f1c608fe756baa4b82ea8714373cebab0252265 is a merge but no -m option was given., add -m 1 as follows instead:
    $ git cherry-pick -m1 -x 6f1c608fe756baa4b82ea8714373cebab0252265
  5. At this point, you'll probably encounter some merge conflicts. You must
    resolve them in to preserve the patch from PR Fix websocket connection leak #7978 as close to the
    original as possible.
  6. Push this branch to your fork on GitHub:
    $ git push origin patchback/backports/3.9/6f1c608fe756baa4b82ea8714373cebab0252265/pr-7978
  7. Create a PR, ensure that the CI is green. If it's not — update it so that
    the tests and any other checks pass. This is it!
    Now relax and wait for the maintainers to process your pull request
    when they have some cycles to do reviews. Don't worry — they'll tell you if
    any improvements are necessary when the time comes!

🤖 @patchback
I'm built with octomachinery and
my source is open — https://github.com/sanitizers/patchback-github-app.

Copy link
Contributor

patchback bot commented Dec 18, 2023

Backport to 3.10: 💔 cherry-picking failed — conflicts found

❌ Failed to cleanly apply 6f1c608 on top of patchback/backports/3.10/6f1c608fe756baa4b82ea8714373cebab0252265/pr-7978

Backporting merged PR #7978 into master

  1. Ensure you have a local repo clone of your fork. Unless you cloned it
    from the upstream, this would be your origin remote.
  2. Make sure you have an upstream repo added as a remote too. In these
    instructions you'll refer to it by the name upstream. If you don't
    have it, here's how you can add it:
    $ git remote add upstream https://github.com/aio-libs/aiohttp.git
  3. Ensure you have the latest copy of upstream and prepare a branch
    that will hold the backported code:
    $ git fetch upstream
    $ git checkout -b patchback/backports/3.10/6f1c608fe756baa4b82ea8714373cebab0252265/pr-7978 upstream/3.10
  4. Now, cherry-pick PR Fix websocket connection leak #7978 contents into that branch:
    $ git cherry-pick -x 6f1c608fe756baa4b82ea8714373cebab0252265
    If it'll yell at you with something like fatal: Commit 6f1c608fe756baa4b82ea8714373cebab0252265 is a merge but no -m option was given., add -m 1 as follows instead:
    $ git cherry-pick -m1 -x 6f1c608fe756baa4b82ea8714373cebab0252265
  5. At this point, you'll probably encounter some merge conflicts. You must
    resolve them in to preserve the patch from PR Fix websocket connection leak #7978 as close to the
    original as possible.
  6. Push this branch to your fork on GitHub:
    $ git push origin patchback/backports/3.10/6f1c608fe756baa4b82ea8714373cebab0252265/pr-7978
  7. Create a PR, ensure that the CI is green. If it's not — update it so that
    the tests and any other checks pass. This is it!
    Now relax and wait for the maintainers to process your pull request
    when they have some cycles to do reviews. Don't worry — they'll tell you if
    any improvements are necessary when the time comes!

🤖 @patchback
I'm built with octomachinery and
my source is open — https://github.com/sanitizers/patchback-github-app.

bdraco added a commit that referenced this pull request Dec 18, 2023
(cherry picked from commit 6f1c608)
bdraco added a commit that referenced this pull request Dec 18, 2023
(cherry picked from commit 6f1c608)
bdraco added a commit that referenced this pull request Dec 19, 2023
bdraco added a commit that referenced this pull request Dec 19, 2023
@johntdyer
Copy link

@Dreamsorcerer any idea when 3.9.2 might be released.... presumably including this change?

@Dreamsorcerer
Copy link
Member

Will probably come round to it shortly, have been fully focused on a PR for aiohttp-admin for the last month.

@spikefishjohn
Copy link

Howdy all and @bdraco!

I'm working on a bug report for GNS3 that I think this PR lines up with. The GNS3 issue is that when a web socket client sends a close message the server never closes the tcp socket. I tried the patch above but it doesn't address the issue.

High level GNS3 is calling ws.receive() at which point aiohttp receive() the close message.

I believe the problem is because ws.receive is setting self._closing = True when it reiceves a close message. This then causes self.close() to return here which prevents if msg.type == WSMsgType.CLOSE: from being reached.

I'm almost thinking this should be removed but i'm not sure what the intent of that is so i'm unsure if that is the proper fix.

Should I open a new issue for this?

@bdraco
Copy link
Member Author

bdraco commented Feb 26, 2024

Howdy all and @bdraco!

I'm working on a bug report for GNS3 that I think this PR lines up with. The GNS3 issue is that when a web socket client sends a close message the server never closes the tcp socket. I tried the patch above but it doesn't address the issue.

High level GNS3 is calling ws.receive() at which point aiohttp receive() the close message.

I believe the problem is because ws.receive is setting self._closing = True when it reiceves a close message. This then causes self.close() to return here which prevents if msg.type == WSMsgType.CLOSE: from being reached.

I'm almost thinking this should be removed but i'm not sure what the intent of that is so i'm unsure if that is the proper fix.

Should I open a new issue for this?

Yes please. Thank you

@spikefishjohn
Copy link

Will do, thanks for the quick reply.

@pxscott
Copy link

pxscott commented Mar 22, 2024

Hi,

With the changes for this bug, the behavior of asyncio.wait_for(ws.receive(), timeout=1) has changed. Before, this could be used to do a poll for new messages, but now hitting the timeout causes the websocket to be closed. This causes problems in our code, so we are pinning to 3.9.1.

  • Scott McRae

@bdraco
Copy link
Member Author

bdraco commented Mar 22, 2024

Hi,

With the changes for this bug, the behavior of asyncio.wait_for(ws.receive(), timeout=1) has changed. Before, this could be used to do a poll for new messages, but now hitting the timeout causes the websocket to be closed. This causes problems in our code, so we are pinning to 3.9.1.

  • Scott McRae

Use asyncio.wait instead or wrap the receive call in asyncio.shield to prevent cancellation

@Dreamsorcerer
Copy link
Member

Use asyncio.wait instead or wrap the receive call in asyncio.shield to prevent cancellation

We should probably look at this more closely. The TimeoutError must come from our timeout() call, so that behaviour is probably correct. But, a CancelledError could come from anywhere and maybe shouldn't close the connection. It may be that we need to check if the cancellation came from outside the task (asyncio.current_task().cancelling()) and skip closing otherwise.

More information about this at:
https://superfastpython.com/asyncio-task-cancellation-best-practices/#Practice_1_Do_Not_Consume_CancelledError
And we recently implemented this in aiohttp-sse: https://github.com/aio-libs/aiohttp-sse/pull/459/files

@bdraco
Copy link
Member Author

bdraco commented Mar 26, 2024

I think the cancellation handling is fine here, but closing the connection when the receive times out/is cancelled is what was not desired.

#8251 will restore the behavior to keep the connection open if receive times out or is cancelled

@pxscott
Copy link

pxscott commented Mar 27, 2024

That's perfect. Thanks!

@Dreamsorcerer
Copy link
Member

asyncio.wait_for(ws.receive(), timeout=1)

Also, worth pointing out that receive has a timeout parameter, so you don't need wait_for() at all, in fact you're just doubling the number of timeouts being constructed.

@pxscott
Copy link

pxscott commented Mar 29, 2024

Yeah, I noticed that when debugging this. This code was written awhile back, I think before that parameter existed. But changing to use the parameter didn't help with the close logic, so I just pinned the version for now. I'll try using the parameter when i do upgrade once this patch is released.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-3.9 Trigger automatic backporting to the 3.9 release branch by Patchback robot backport-3.10 Trigger automatic backporting to the 3.10 release branch by Patchback robot bot:chronographer:provided There is a change note present in this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Too many websocket client disconnect, memory not release
5 participants