all GPUs in host are visible to job submit with '--gpus=0' #662

Open
nowenL opened this issue Sep 3, 2021 · 5 comments

@nowenL

nowenL commented Sep 3, 2021

Env

arena version: v0.8.6+a2bec8c

k8s server version: {Major:"1", Minor:"20+", GitVersion:"v1.20.4-aliyun.1", GitCommit:"7a23884", GitTreeState:"", BuildDate:"2021-05-31T13:47:24Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}

Problem

  1. Submit a tfjob with the flag '--gpus=0', using a Docker image that sets NVIDIA_VISIBLE_DEVICES=all:
arena submit tf --gpus=0 --name=test --namespace="train-ai" --image="nvidia/cuda:11.3.0-cudnn8-runtime-ubuntu20.04" sleep 99999
  2. Attach to the job and run nvidia-smi. All GPUs are visible to the job:
# nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.102.04   Driver Version: 450.102.04   CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:08.0 Off |                    0 |
| N/A   31C    P0    39W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:09.0 Off |                    0 |
| N/A   50C    P0   261W / 300W |  17706MiB / 32510MiB |     95%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:00:0A.0 Off |                    0 |
| N/A   46C    P0   157W / 300W |  31426MiB / 32510MiB |     84%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:00:0B.0 Off |                    0 |
| N/A   37C    P0   251W / 300W |  30684MiB / 32510MiB |     45%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
  3. Expected: no GPU is visible to the job, since '--gpus=0' is set on the command line.
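
The check in the steps above can also be scripted. The following is a minimal sketch, not part of arena: it assumes `nvidia-smi` may or may not be present inside the container and relies on `nvidia-smi -L`, which prints one line per visible GPU. The function names `count_gpus` and `visible_gpu_count` are hypothetical.

```python
import re
import subprocess

# `nvidia-smi -L` prints one line per device, starting with "GPU <index>:".
_GPU_LINE = re.compile(r"^GPU \d+:")


def count_gpus(listing: str) -> int:
    """Count device lines in the text printed by `nvidia-smi -L`."""
    return sum(1 for line in listing.splitlines() if _GPU_LINE.match(line))


def visible_gpu_count() -> int:
    """Run `nvidia-smi -L` and return how many GPUs this container can see.

    Returns 0 when nvidia-smi is missing or exits with an error, which is
    what a CPU-only job would be expected to observe.
    """
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return 0
    return count_gpus(result.stdout)


if __name__ == "__main__":
    print(f"visible GPUs: {visible_gpu_count()}")
```

For a job submitted with '--gpus=0', the expectation stated in step 3 corresponds to this script reporting 0.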
@cheyang
Collaborator

cheyang commented Sep 22, 2021

/assign @happy2048

@google-oss-robot

@cheyang: GitHub didn't allow me to assign the following users: happy2048.

Note that only kubeflow members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @happy2048

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wsxiaozhang
Collaborator

@nowenL
Basically, we can think of '--gpus=0' as a flag that tells arena not to explicitly attach GPU devices to the job containers. However, that may have different semantics from a job submitted without any gpus parameter, although both mean running the job on CPU only.

2 quick questions to confirm:

  1. what's the expected behavior, when you use --gpus=0? do you mean you just want to run this job without gpu?
  2. if yes, then why is the cuda image used for the job?

@nowenL
Author

nowenL commented Sep 24, 2021

@wsxiaozhang
Thanks for the reply. To answer your questions:

what's the expected behavior, when you use --gpus=0? do you mean you just want to run this job without gpu?

As mentioned above, I expect no GPU to be visible to the job. And yes, I want to run the job without a GPU.

if yes, then why is the cuda image used for the job?

It's a minimal example to reproduce the issue. This may happen in practice as well, for example when users reuse their GPU base image for a CPU training workload. In any case, the root cause is 'NVIDIA_VISIBLE_DEVICES=all', and the CUDA image is just one way to trigger it.

Another major concern is that, by simply setting an env, any cluster user can gain control of a GPU on the host even if it's assigned to other jobs. This looks like a vulnerability and can become critical in some cases.
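
To make the concern concrete, here is a rough sketch of a check that flags containers which set NVIDIA_VISIBLE_DEVICES without requesting any GPU. It is only an illustration under stated assumptions: the pod spec is a plain dict shaped like a Kubernetes PodSpec, the GPU resource name is `nvidia.com/gpu`, and the function name `containers_bypassing_gpu_limits` is hypothetical. It would not catch the case in the reproduction above, where the variable comes from the image rather than from the pod spec.

```python
from typing import Any, Dict, List

GPU_RESOURCE = "nvidia.com/gpu"  # assumed resource name for NVIDIA GPUs
ENV_NAME = "NVIDIA_VISIBLE_DEVICES"
# Values assumed here to expose no GPU to the container.
SAFE_VALUES = {"", "void", "none"}


def containers_bypassing_gpu_limits(pod_spec: Dict[str, Any]) -> List[str]:
    """Return names of containers that set NVIDIA_VISIBLE_DEVICES to a
    GPU-exposing value without requesting any nvidia.com/gpu resource."""
    flagged = []
    for container in pod_spec.get("containers", []):
        limits = (container.get("resources") or {}).get("limits") or {}
        requests_gpu = int(limits.get(GPU_RESOURCE, 0)) > 0
        if requests_gpu:
            continue
        for env in container.get("env") or []:
            exposes = env.get("value", "") not in SAFE_VALUES
            if env.get("name") == ENV_NAME and exposes:
                flagged.append(container["name"])
                break
    return flagged
```

A check like this could run at admission time or as an audit over existing pods; either choice is outside what this issue describes.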

@wsxiaozhang
Collaborator

@nowenL Got your points now, that's fair.
The coming release will fix this by overwriting NVIDIA_VISIBLE_DEVICES with the value 'void', which also disables NVIDIA_DRIVER_CAPABILITIES. That way, as long as you specify --gpus=0 or --worker_gpus=0, arena will disable GPU mounting accordingly.
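
The override described above could look roughly like the following. This is a sketch of the idea only, not arena's actual code (which is not shown in this thread); the function name `effective_env` and the dict-based environment are assumptions for illustration. It relies on the fact that an env value set in the pod spec takes precedence over an ENV baked into the image.

```python
from typing import Dict

ENV_NAME = "NVIDIA_VISIBLE_DEVICES"


def effective_env(gpus: int, env: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of `env` to put in the container spec.

    When the job requests zero GPUs, force NVIDIA_VISIBLE_DEVICES to
    'void' so that a value such as 'all' coming from the image or from
    the user is overridden. Otherwise leave the environment untouched.
    """
    result = dict(env)  # do not mutate the caller's mapping
    if gpus == 0:
        result[ENV_NAME] = "void"
    return result


# Usage: a CPU-only submission whose environment asks for every GPU.
print(effective_env(0, {"NVIDIA_VISIBLE_DEVICES": "all"}))
```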
