[CD] Fix slim-wheel cuda_nvrtc import problem #145614
Similar fix as: #144816
Fixes: #145580
Found during testing of #138340

Please note that both nvrtc and nvjitlink exist for CUDA 11.8, 12.4 and 12.6, so we can safely remove the if statement. Preloading can apply to all supported CUDA versions.

CUDA 11.8 path:
```
(.venv) root@b4ffe5c8ac8c:/pytorch/.ci/pytorch/smoke_test# ls /.venv/lib/python3.12/site-packages/torch/lib/../../nvidia/cuda_nvrtc/lib
__init__.py __pycache__ libnvrtc-builtins.so.11.8 libnvrtc-builtins.so.12.4 libnvrtc.so.11.2 libnvrtc.so.12
(.venv) root@b4ffe5c8ac8c:/pytorch/.ci/pytorch/smoke_test# ls /.venv/lib/python3.12/site-packages/torch/lib/../../nvidia/nvjitlink/lib
__init__.py __pycache__ libnvJitLink.so.12
```

Test with rc 2.6 and CUDA 11.8:
```
python cudnn_test.py
2.6.0+cu118
---------------------------------------------SDPA-Flash---------------------------------------------
ALL GOOD
---------------------------------------------SDPA-CuDNN---------------------------------------------
ALL GOOD
```

Thank you @nWEIdia for discovering this issue

Pull Request resolved: #145582
Approved by: https://github.com/nWEIdia, https://github.com/eqy, https://github.com/kit1980, https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
(cherry picked from commit 9752c7c)
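For readers unfamiliar with the preloading idea in the description, here is a minimal sketch of how a wheel-installed library could be located and loaded by absolute path. It is not the code changed in this PR: the helper names `find_wheel_lib` and `preload_wheel_lib`, and the glob pattern in the usage note, are hypothetical. The only thing taken from the description is the `nvidia/<package>/lib/` directory layout shown in the `ls` output.

```python
import ctypes
import glob
import os
import sys
from typing import Iterable, Optional


def find_wheel_lib(
    lib_folder: str, lib_pattern: str, search_paths: Iterable[str]
) -> Optional[str]:
    """Return the first file matching <path>/nvidia/<lib_folder>/lib/<lib_pattern>.

    Hypothetical helper: the names and signature are illustrative only.
    """
    for path in search_paths:
        pattern = os.path.join(path, "nvidia", lib_folder, "lib", lib_pattern)
        candidates = sorted(glob.glob(pattern))
        if candidates:
            return candidates[0]
    return None


def preload_wheel_lib(lib_folder: str, lib_pattern: str) -> None:
    """Load the library by absolute path so later lookups by name can resolve it."""
    lib_path = find_wheel_lib(lib_folder, lib_pattern, sys.path)
    if lib_path is None:
        raise ValueError(f"{lib_pattern} not found under any sys.path entry")
    # ctypes.CDLL opens the shared object in the current process.
    ctypes.CDLL(lib_path)
```

A call such as `preload_wheel_lib("cuda_nvrtc", "libnvrtc.so.*[0-9]")` would match `libnvrtc.so.11.2` but not `libnvrtc-builtins.so.11.8`, because the pattern requires the literal prefix `libnvrtc.so.`.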
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/145614
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 9 Pending
As of commit 34c3e25 with merge base f7e621c:
NEW FAILURE - The following job has failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@atalman I noticed that you might have tested cu124 first and then cu118: your test directory contains both libnvrtc.so.11.2 and libnvrtc.so.12.

if "nvidia/cuda_runtime/lib/libcudart.so" not in _maps:

The libcudart.so check above might be too strict; I guess the existence of libcudart.so.* should be fine?

/usr/local/lib/python3.12/dist-packages/torch# ls ../nvidia/cuda_runtime/lib/
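One way to probe the strictness question is to run the quoted substring test against sample text. This is a minimal sketch, assuming `_maps` is a plain string holding the process's memory-map listing (the thread does not show how it is built). The helper name `cudart_from_wheel` and both sample lines are made up for illustration.

```python
def cudart_from_wheel(maps_text: str) -> bool:
    """Apply the substring test quoted above to the given text."""
    return "nvidia/cuda_runtime/lib/libcudart.so" in maps_text


# Made-up memory-map lines, for illustration only.
wheel_line = (
    "7f00aa000000-7f00aa0a0000 r-xp 00000000 08:01 1234 "
    "/usr/local/lib/python3.12/dist-packages/nvidia/cuda_runtime/lib/libcudart.so.11.0\n"
)
system_line = (
    "7f00bb000000-7f00bb0a0000 r-xp 00000000 08:01 5678 "
    "/usr/local/cuda/lib64/libcudart.so.11.0\n"
)

print(cudart_from_wheel(wheel_line))   # True: versioned name under the wheel directory
print(cudart_from_wheel(system_line))  # False: same file name, different directory
```

Because `in` on strings is a substring match, a versioned name such as `libcudart.so.11.0` still satisfies the test as long as the `nvidia/cuda_runtime/lib/` prefix is present in the path.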
That holds only if the /usr/local/lib/python3.12/dist-packages/nvidia/cuda_nvrtc/lib/libnvrtc.so symlink is created. With the default installation, there is only libnvrtc.so.11.2.
Looks like you are right, standalone cu118 is not loading:
However, the file is there:
FYI, the statement:
As per @nWEIdia, the workaround is:
Yeah, not sure why, but two workarounds have been identified so far (either of them works):
export LD_LIBRARY_PATH=/usr/local/lib/python3.12/dist-packages/nvidia/cuda_nvrtc/lib/:$LD_LIBRARY_PATH
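For test scripts that drive the check from Python rather than a shell, the `LD_LIBRARY_PATH` workaround above could be applied to a child process. This is a rough sketch under that assumption; `env_with_lib_dir` and `run_import_check` are hypothetical helper names, and the directory is whatever path the `export` line points at on the machine in question.

```python
import os
import subprocess
import sys
from typing import Dict, Mapping


def env_with_lib_dir(lib_dir: str, base_env: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of base_env with lib_dir prepended to LD_LIBRARY_PATH."""
    env = dict(base_env)
    current = env.get("LD_LIBRARY_PATH", "")
    # Avoid a trailing colon when the variable was previously unset or empty.
    env["LD_LIBRARY_PATH"] = f"{lib_dir}:{current}" if current else lib_dir
    return env


def run_import_check(lib_dir: str) -> int:
    """Run `python -c "import torch"` in a child process with the adjusted path."""
    env = env_with_lib_dir(lib_dir, os.environ)
    result = subprocess.run([sys.executable, "-c", "import torch"], env=env)
    return result.returncode
```

The child process matters here: the dynamic loader generally reads `LD_LIBRARY_PATH` when a process starts, so changing `os.environ` inside an already running interpreter is not expected to have the same effect.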
I am going to switch the preload order, but I need the test case for the first issue. Would they be incompatible (both wanting to be preloaded first)?
Update: it seems the libnvjitlink test would just be "python -c 'import torch'", so if the libnvrtc test case works, the libnvjitlink test must also have worked fine.
There is no libnvjitlink in CUDA-11.x, so attempts to load it first will abort the execution and prevent the script from preloading nvrtc.

Fixes issues reported in #145614 (comment)

Pull Request resolved: #145638
Approved by: https://github.com/atalman, https://github.com/kit1980, https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
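The ordering problem described in this commit message can be illustrated with a toy model. This is only a sketch: `preload_in_order` and `fake_cuda11_loader` are hypothetical names, and the loader is a stand-in that fails for nvjitlink to mimic a CUDA 11.x wheel set, not the real preloading code.

```python
from typing import Callable, List, Sequence


def fake_cuda11_loader(name: str) -> None:
    """Stand-in loader: pretend nvjitlink is absent, as stated for CUDA 11.x."""
    if name == "nvjitlink":
        raise ValueError(f"{name} not found")


def preload_in_order(order: Sequence[str], load: Callable[[str], None]) -> List[str]:
    """Return the libraries loaded before the first failure stops the sequence."""
    loaded: List[str] = []
    try:
        for name in order:
            load(name)
            loaded.append(name)
    except ValueError:
        # The first failure ends the loop; later entries are never attempted.
        pass
    return loaded


print(preload_in_order(["nvjitlink", "nvrtc"], fake_cuda11_loader))  # []
print(preload_in_order(["nvrtc", "nvjitlink"], fake_cuda11_loader))  # ['nvrtc']
```

With nvjitlink first, the failure happens before nvrtc is ever attempted; with nvrtc first, nvrtc is loaded even though nvjitlink still fails afterwards.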
[CUDA] Change slim-wheel libraries load order (#145638)

There is no libnvjitlink in CUDA-11.x, so attempts to load it first will abort the execution and prevent the script from preloading nvrtc.

Fixes issues reported in #145614 (comment)

Pull Request resolved: #145638
Approved by: https://github.com/atalman, https://github.com/kit1980, https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
(cherry picked from commit 2a70de7)
Co-authored-by: Wei Wang <weiwan@nvidia.com>
cc @seemethere @malfet @osalpekar