
Unable to run notebooks on GPU (on-premises), running Kubeflow on Rancher's k3s cluster #2683

Open
khalidkhushal opened this issue Apr 17, 2024 · 6 comments

Comments

@khalidkhushal

Hello everyone,
I have installed Kubeflow on a k3s cluster.
I have configured the cluster to access GPUs, and it works fine for test pods scheduled on the GPU node; I can see nvidia-smi output inside those pods.

However, I also want notebooks created from the Kubeflow dashboard to run on GPUs, and I cannot find anything on how to do this. I selected GPUs in the notebook-creation UI on the dashboard, but inside the running Jupyter notebook the following still fails:

torch.cuda.is_available() returns False

For k3s, we need to add "runtimeClassName: nvidia" to the pod spec, but there is no way I can do this in the Kubeflow manifests.
Please suggest something or point out what I am missing.

Thanks in advance.
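(For reference, a minimal GPU smoke-test pod of the kind described above, with the k3s nvidia runtime class set explicitly, might look like the sketch below. The pod name and image are illustrative, not taken from this issue.)

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test                              # illustrative name
spec:
  runtimeClassName: nvidia                          # the field k3s needs so the pod runs under the NVIDIA runtime
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.2.0-base-ubuntu22.04    # any CUDA base image that ships nvidia-smi
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1                         # needs the NVIDIA device plugin on the node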

@kubeflow-bot kubeflow-bot added this to To Do in Needs Triage Apr 17, 2024
@yingding

yingding commented Apr 23, 2024

You need to define "nvidia" as your default runtime in .../containerd/config.toml.tmpl for k3s.

[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"
...

This is not a Kubeflow issue, but rather a k8s containerd runtime configuration issue. Please look up the k3s or Rancher references for the nvidia runtime.
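(For reference, on a default k3s install the generated containerd config lives under /var/lib/rancher/k3s/agent/etc/containerd/, and a custom config.toml.tmpl placed there would also need the nvidia runtime registered, not only set as default. A sketch of the relevant entries, assuming the NVIDIA container toolkit is installed at its usual path; exact paths and sections can differ between k3s versions:)

# /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl (sketch)
# k3s regenerates config.toml from this template on restart.

[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/bin/nvidia-container-runtime"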

@khalidkhushal
Author


Thanks for your response!

I don't have /containerd/config.toml.tmpl but in /containerd/config.toml I have:

[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

Do I need to create the /containerd/config.toml.tmpl file as well?
If yes, what else do I need to configure in it?

@zhsung

zhsung commented Apr 29, 2024

Hello.

First of all, please note that I am not a native English speaker and am using a translator, so communication may not be smooth.

I'm asking because I'm experiencing a similar problem.
I set up k8s with 1 GPU server (worker) and 3 CPU servers (1 master, 2 workers), using Calico CNI and containerd.
The notebook I created in Kubeflow with GPU enabled is stuck in Pending with the message below:


0/4 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 3 Insufficient nvidia.com/gpu. preemption: 0/4 nodes are available: 1 Preemption is not helpful for scheduling, 3 No preemption victims found for incoming pod.

There was no problem when I was using k8s 1.20 + Kubeflow 1.5.0 with docker + nvidia-docker2.

Where should I look further?

@zhsung

zhsung commented Apr 29, 2024


My problem was the taint I had set on the GPU node.
When I removed the taint with kubectl taint nodes g01 gpu-, the NVIDIA-related pods were created on node g01, and the notebook created with the GPU option came up normally on g01.
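(As an alternative to removing the taint entirely, a matching toleration on the GPU workloads would also let them schedule. A generic sketch, assuming the taint key is gpu and the effect is NoSchedule; check kubectl describe node g01 for the actual values:)

tolerations:
  - key: "gpu"
    operator: "Exists"      # tolerate the gpu taint regardless of its value
    effect: "NoSchedule"    # must match the effect actually set on the node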

In the order they were created:

  1. gpu-operator-xxx~xxx-node-feature-discovery-worker-xxxx
  2. nvidia-dcgm-exporter-xxxx
  3. gpu-feature-discovery-xxxx
  4. nvidia-device-plugin-daemonset-xxxx
  5. nvidia-container-toolkit-daemonset-xxxx
  6. nvidia-mig-manager-xxxx
  7. nvidia-operator-validator-xxxxx
  8. nvidia-cuda-validator-xxxxx

xxxx is a random number or string.

After the above pods were created and running, the GPU notebook was created normally, and nvidia-smi also works in the notebook's terminal.

It was probably the helm command from this page ( https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-cri-o ), which I ran yesterday, that did it.
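(The pod names above look like NVIDIA GPU Operator components. It is not clear from the link exactly which chart was used, but the GPU Operator is typically installed with something along these lines, per NVIDIA's documentation; the namespace and generated release name are just the documented defaults:)

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator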

@khalidkhushal
Author


Are you using k3s?
If not, how are you deploying your Kubernetes cluster?

@zhsung

zhsung commented Apr 29, 2024


I didn't use k3s.

I installed kubeadm, kubelet, and kubectl in order, and then set up k8s 1.28 step by step.

After installing the k8s dashboard, I installed Kubeflow 1.8 from GitHub and confirmed that all pods were running normally.

I don't remember exactly because the GPU server could not be found at the time, but I ran the helm command from the NVIDIA site mentioned above on the master server yesterday and checked it today.

After seeing the message that the taint was the problem and removing it with "kubectl taint nodes gpu-server gpu-", NVIDIA-related pods were suddenly created on gpu-server in the terminal where I was running "watch -n1 kubectl get po -Ao wide".

Once the newly created pods reached the Running state, I tested again and the notebooks were created normally.
