[fork][tracking issue] grpc thread pool hanging on fork #31885
Comments
Stack trace we captured:
CC @veblush
As a meta point, the quality control on gRPC is unfortunately quite lacking from an OSS perspective. Almost all recent major gRPC releases have caused issues. gRPC is such a foundational piece of code that the stability really needs to be there, and we need to be able to trust the releases. The Python package is especially problematic here due to how the Python package ecosystem is set up: once a package is released, it can easily end up in everybody's deployment, and libraries built on top have limited control over which version is shipped (other than pinning, which will cause conflicts with other libraries if everybody does it). For the C++ version, we can at least control which version we link in and do the quality control on our side :) Let us know how or if we can help improve this situation from the Ray side. We can, for example, offer to test your RCs more thoroughly in our CI and communicate failures before the release.
We're sorry to hear that. This seems to be a fork-related issue, which we are aware of and are actively working toward a resolution for.
Yes please. We put out RCs before every release in the hope that we will catch such issues. As you point out, this is especially important for Python packages because of how quickly they end up throughout the ecosystem. Please let us know if you need any guidance/help on how to do this.
Sounds great, thanks for your support! If we haven't already done so, I'll connect you with @scv119, who can make the necessary modification to our CI. Ideally we will just run all our premerge tests with your latest RC; that will discover a decent amount of breakage before it makes it out into the wild.
Hey @gnossen - is there a tracking issue/PR/thread where we could follow the progress of this?
I think this is the most relevant tracking issue at the moment. We just merged a fix for what seems to be the highest-percentage fork offender. Can you please try out the artifacts here to see if it fixes the issue for you? Edit: I see this report was for 3.8. We don't build 3.8 artifacts for PRs. I'll comment again when the master job has run, which should include 3.8 artifacts.
This master build has the full supported version range. Please try it out and let us know if it resolves the issue for you.
@rickyyx You should be able to test with the nightly builds as well.
Thanks @drfloob! Just tried with the nightly build. The repro script above still creates the thread pool issue, and it looks like the repro is even more reliable now. Wondering if you have looked into the deadlocks mentioned here: #31772 (comment)
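(For readers who don't have the script referenced above in front of them: the following is not the original repro, just a minimal sketch of the fork-after-use pattern that typically triggers the "Waiting for thread pool to idle before forking" message. The server address and the fork-support environment variable are assumptions.)

```python
# Hypothetical minimal sketch -- NOT the original reporter's repro script.
# Assumes a gRPC target at localhost:50051 (it need not be up) and that
# fork support is active (e.g. GRPC_ENABLE_FORK_SUPPORT=1).
import os
import grpc

channel = grpc.insecure_channel("localhost:50051")
try:
    # Exercise the channel so the core thread pool spins up before the fork.
    grpc.channel_ready_future(channel).result(timeout=1)
except grpc.FutureTimeoutError:
    pass

pid = os.fork()  # prefork handlers wait here for the thread pool to idle
if pid == 0:
    # Child process: touching gRPC again after fork is where hangs were reported.
    grpc.insecure_channel("localhost:50051")
    os._exit(0)
os.waitpid(pid, 0)
```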
Thanks @rickyyx. My fix in #31969 removed the most commonly seen cause of that deadlock (an ExecCtx created during a fork event), but I would not be surprised if there were others. That change alone should not increase the flake rate, though. I don't think it's the right solution here, but we can try modifying the thread pool to skip running any callbacks during fork events. I'll start a discussion on a PR.
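(A conceptual sketch of the "skip callbacks during fork events" idea, written in Python rather than gRPC's actual C++ thread pool code; the class and method names are invented for illustration.)

```python
# Conceptual sketch only: gRPC's real thread pool lives in the C++ core and
# differs substantially. A prefork handler marks the pool as forking, workers
# stop picking up new callbacks, and the handler waits for the pool to go
# idle -- the same condition behind "Waiting for thread pool to idle before
# forking".
import os
import queue
import threading


class ForkAwarePool:
    def __init__(self):
        self._tasks = queue.Queue()
        self._cv = threading.Condition()
        self._forking = False
        self._running = 0
        threading.Thread(target=self._worker, daemon=True).start()
        os.register_at_fork(
            before=self._before_fork,
            after_in_parent=self._after_fork,
            after_in_child=self._after_fork,
        )

    def submit(self, fn):
        self._tasks.put(fn)

    def _worker(self):
        while True:
            fn = self._tasks.get()
            with self._cv:
                while self._forking:      # defer callbacks while a fork is in flight
                    self._cv.wait()
                self._running += 1
            try:
                fn()
            finally:
                with self._cv:
                    self._running -= 1
                    self._cv.notify_all()

    def _before_fork(self):
        with self._cv:
            self._forking = True
            while self._running:          # wait for the pool to idle before forking
                self._cv.wait()

    def _after_fork(self):
        # A real implementation would also re-create worker threads in the child.
        with self._cv:
            self._forking = False
            self._cv.notify_all()
```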
Hello. We are seeing this issue on Linux as well, on 1.51.1, with the same "Waiting for thread pool to idle before forking" message in a loop, so I don't think it's just macOS.
I'm still seeing this issue both when building grpc from source at master and when using 1.51.1, with Python 3.8.9. Reproduction steps are as follows, if that's helpful:
1.55.0 hangs on Mac in gunicorn/flask run with "Waiting for thread pool to idle before forking." grpc/grpc#31885
Just jumping in to say that this issue is now causing some of our users to be unable to use our package, since our pin is incompatible with a pin in newer versions of tensorflow. The offer still stands to do another live debugging session that reproduces the hang, if that would be at all helpful to get to the bottom of this (or to run any additional logging against our test suite that reproduces the problem).
We were able to find and hotfix the problem. In order to fix it, you have to reset … I can upstream the fixes, but I see no way for them to be merged.
@georgthegreat The issue with the absl deadlock checker is a known one when gRPC is built from source in a debug configuration.

@gibsondan Thank you for the offer. I'll follow up directly. Do you have a reference to the tensorflow pin?
@gnossen thanks for making this clear. This indeed looks like our case: we build grpc from source, and the problem appeared in a debug build.
I believe that we are running into the same bug as well, only we are using PHP. I get deadlocks when making curl requests, which appear to be stuck in a deadlock related to gRPC. This is the GDB backtrace of a deadlock state:
grpc extension version:
Looks like 1.56.0 may have fixed this? I was seeing this with 1.55.x, but it's no longer happening on 1.56.0. Python 3.9.16, grpcio==1.56.0, grpcio-status==1.44.0, macOS 13.4.1, Apple Silicon.
Seconding the above, the dagster repro of the hang seems to be gone in 1.56.0! Is that expected?
It seems like the bug grpc/grpc#31885 that caused the problems with the Ray Client tests has been fixed in grpcio 1.56, so we are removing the pin so that people can upgrade to fix https://nvd.nist.gov/vuln/detail/CVE-2023-32731. Pinning to just the latest version would be too restrictive, so we remove the pin entirely (the Ray client works with other versions as well, except for some corner cases).
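(The version bounds below are purely illustrative and are not Ray's actual constraints; they only sketch the kind of install_requires change the commit message describes.)

```python
# Hypothetical before/after install_requires entries; the version bounds are
# illustrative only and are not Ray's actual constraints.
install_requires_before = [
    "grpcio==1.51.3",   # hard pin added to dodge the fork hang; conflicts with
                        # packages (e.g. newer tensorflow) that want newer grpcio
]
install_requires_after = [
    "grpcio>=1.42.0",   # pin removed so users can upgrade to 1.56+, which also
                        # picks up the fix for CVE-2023-32731
]
```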
Seems to work on 1.56.0, thank you!
Versions
What version of gRPC and what language are you using?
grpc 1.51.1
What operating system (Linux, Windows,...) and version?
macOS Catalina 10.15
What runtime / compiler are you using (e.g. python version or version of gcc)
Python 3.8.15
What did you do?
Please provide either 1) A unit test for reproducing the bug or 2) Specific steps for us to follow to reproduce the bug. If there’s not enough information to debug the problem, gRPC team may close the issue at their discretion. You’re welcome to re-open the issue once you have a reproduction.
What did you expect to see?
Script runs ok.
What did you see instead?
Script hanging with "Waiting for thread pool to idle before forking" logged in a loop.
Make sure you include information that can help us debug (full error message, exception listing, stack trace, logs).
See TROUBLESHOOTING.md for how to diagnose problems better.
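(As a hedged illustration of what TROUBLESHOOTING.md points to: gRPC's debug logging can be enabled through environment variables set before grpc is imported. The tracer names accepted by GRPC_TRACE vary by release, so treat the values here as examples, and localhost:50051 as a placeholder target.)

```python
# Illustrative debugging setup in the spirit of TROUBLESHOOTING.md; the exact
# tracer names accepted by GRPC_TRACE vary by release, so treat these values
# as examples rather than a definitive list.
import os

os.environ["GRPC_VERBOSITY"] = "debug"       # surface internal log messages
os.environ["GRPC_TRACE"] = "api,call_error"  # comma-separated tracer names
import grpc  # noqa: E402  -- set the env vars before grpc loads its C core

channel = grpc.insecure_channel("localhost:50051")  # placeholder target
```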
Anything else we should know about your project / environment?