
Unable to run notebooks on GPU (on-premises), running Kubeflow on Rancher's k3s cluster #2683

Open
khalidkhushal opened this issue Apr 17, 2024 · 6 comments

Comments

@khalidkhushal

Hello everyone,
I have installed Kubeflow on a k3s cluster.
I have configured the cluster to access GPUs, and it works fine for test pods scheduled on the GPU node; I can see nvidia-smi output inside those pods.

However, I also want notebooks created from the Kubeflow dashboard to run on GPUs, and I cannot find anything on how to do this. I selected GPUs in the notebook-creation UI on the dashboard, but inside the running Jupyter notebook the following still fails:

torch.cuda.is_available() returns False

For k3s, we need to add "runtimeClassName: nvidia" to the pod spec, but there is no way I can do this in the Kubeflow manifests.
Please suggest something or point out what I am missing.

Thanks in advance.
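(For reference, a minimal GPU smoke-test pod of the kind described above, with the k3s nvidia runtime class set explicitly, might look like the sketch below. The pod name and image are illustrative, not taken from this issue.)

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test                              # illustrative name
spec:
  runtimeClassName: nvidia                          # the field k3s needs so the pod runs under the NVIDIA runtime
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.2.0-base-ubuntu22.04    # any CUDA base image that ships nvidia-smi
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1                         # needs the NVIDIA device plugin on the node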

@kubeflow-bot kubeflow-bot added this to To Do in Needs Triage Apr 17, 2024
@yingding

yingding commented Apr 23, 2024

You need to define "nvidia" as your default runtime in .../containerd/config.toml.tmpl for k3s.

[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"
...

This is not a Kubeflow issue, but rather a k8s containerd runtime configuration issue. Please look up the k3s or Rancher references for the nvidia runtime.
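(For reference, on a default k3s install the generated containerd config lives under /var/lib/rancher/k3s/agent/etc/containerd/, and a custom config.toml.tmpl placed there would also need the nvidia runtime registered, not only set as default. A sketch of the relevant entries, assuming the NVIDIA container toolkit is installed at its usual path; exact paths and sections can differ between k3s versions:)

# /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl (sketch)
# k3s regenerates config.toml from this template on restart.

[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/bin/nvidia-container-runtime"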

@khalidkhushal
Author


Thanks for your response!

I don't have /containerd/config.toml.tmpl but in /containerd/config.toml I have:

[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

Do I need to create the /containerd/config.toml.tmpl file as well?
If yes, what else do I need to configure in it?

@zhsung

zhsung commented Apr 29, 2024

Hello.

First of all, please note that I am not a native English speaker and am using a translator, so communication may not be smooth.

I'm asking because I'm experiencing a similar problem.
I set up k8s with 1 GPU server (worker) and 3 CPU servers (1 master, 2 workers), using Calico CNI and containerd.
The notebook I created in Kubeflow with GPU enabled is stuck in Pending with the message below:


0/4 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 3 Insufficient nvidia.com/gpu. preemption: 0/4 nodes are available: 1 Preemption is not helpful for scheduling, 3 No preemption victims found for incoming pod.

There was no problem when I was using k8s 1.20 + Kubeflow 1.5.0 with docker + nvidia-docker2.

Where should I look further?

@zhsung

zhsung commented Apr 29, 2024


My problem was the taint I had set on the GPU node.
When I removed the taint with kubectl taint nodes g01 gpu-, the NVIDIA-related pods were created on node g01, and the notebook created with the GPU option came up normally on g01.
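(As an alternative to removing the taint entirely, a matching toleration on the GPU workloads would also let them schedule. A generic sketch, assuming the taint key is gpu and the effect is NoSchedule; check kubectl describe node g01 for the actual values:)

tolerations:
  - key: "gpu"
    operator: "Exists"      # tolerate the gpu taint regardless of its value
    effect: "NoSchedule"    # must match the effect actually set on the node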

In the order they were created:

  1. gpu-operator-xxx~xxx-node-feature-discovery-worker-xxxx
  2. nvidia-dcgm-exporter-xxxx
  3. gpu-feature-discovery-xxxx
  4. nvidia-device-plugin-daemonset-xxxx
  5. nvidia-container-toolkit-daemonset-xxxx
  6. nvidia-mig-manager-xxxx
  7. nvidia-operator-validator-xxxxx
  8. nvidia-cuda-validator-xxxxx

xxxx is a random number or string.

After the above pods were created and running, the GPU notebook was created normally, and nvidia-smi also works in the notebook's terminal.

It was probably the helm command from this page ( https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-cri-o ), which I ran yesterday, that did it.
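(The pod names above look like NVIDIA GPU Operator components. It is not clear from the link exactly which chart was used, but the GPU Operator is typically installed with something along these lines, per NVIDIA's documentation; the namespace and generated release name are just the documented defaults:)

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator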

@khalidkhushal
Author


Are you using k3s?
If not, how are you deploying your Kubernetes cluster?

@zhsung

zhsung commented Apr 29, 2024


I didn't use k3s.

I installed kubeadm, kubelet, and kubectl in order, and then set up k8s 1.28 step by step.

After installing the k8s dashboard, I installed Kubeflow 1.8 from GitHub and confirmed that all pods were running normally.

I don't remember exactly because the GPU server could not be found at the time, but I ran the helm command from the NVIDIA site mentioned above on the master server yesterday and checked it today.

After seeing the message that the taint was the problem and removing it with "kubectl taint nodes gpu-server gpu-", NVIDIA-related pods were suddenly created on gpu-server in the terminal where I was running "watch -n1 kubectl get po -Ao wide".

Once the newly created pods reached the Running state, I tested again and the notebooks were created normally.
