Unable to run notebooks on GPU (on-premises), running Kubeflow on a Rancher k3s cluster #2683
You need to define "nvidia" as your default runtime in the containerd config:

[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "nvidia"
...

It is not a Kubeflow issue, but rather a k8s containerd runtime config issue. Please search the k3s or Rancher references for the nvidia runtime.
Thanks for your response! I don't have that config. Do I need to create the /containerd/config.toml.tmpl file as well?
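For reference, here is a minimal sketch of creating such a template on k3s. The file path and the BinaryName location are assumptions taken from the k3s and NVIDIA Container Toolkit documentation, so verify them against your own install:

```sh
# Sketch only: k3s regenerates its containerd config from
# /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl on restart
# (path assumed from the k3s docs). In practice, start from a copy of the
# generated config.toml and add the nvidia sections shown here.
cat <<'EOF' | sudo tee /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
[plugins."io.containerd.grpc.v1.cri".containerd]
  # make the NVIDIA runtime the default so every container gets the GPU hooks
  default_runtime_name = "nvidia"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  # installed by the NVIDIA Container Toolkit; the binary path is an assumption
  BinaryName = "/usr/bin/nvidia-container-runtime"
EOF

# restart the k3s service so the template is picked up
sudo systemctl restart k3s
```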
Hello. First of all, I'd like to mention that I'm a non-English speaker using a translator, so communication may not be smooth. I'm asking because I'm experiencing a similar problem:

0/4 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 3 Insufficient nvidia.com/gpu. preemption: 0/4 nodes are available: 1 Preemption is not helpful for scheduling, 3 No preemption victims found for incoming pod.

There was no problem when using k8s 1.20 + Kubeflow 1.5.0 with docker + nvidia-docker2. Where should I look further?
My problem was the taint I had set on the GPU node. The nvidia-related pods then came up, in the order they were created ("xxxx" in the pod names is a random suffix). After those pods were created and running, the notebook with a GPU was created normally, and nvidia-smi also runs normally in the notebook's terminal. Perhaps the helm command from this site (https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-cri-o), which I ran when installing yesterday, did the work.
Are you using k3s?
I didn't use k3s. I installed kubeadm, kubelet, and kubectl in order, and then set up k8s 1.28 step by step. After installing the k8s dashboard, I installed Kubeflow 1.8 from GitHub and confirmed that all pods were running normally. I don't remember exactly because the GPU server couldn't be scheduled, but I ran the helm command from the NVIDIA site mentioned above on the master node yesterday and checked again today. After seeing the message that the taint was the problem and removing it with the "kubectl taint nodes gpu-server gpu-" command, nvidia-related pods were suddenly created on the gpu-server, which I confirmed in the terminal where I was running "watch -n1 kubectl get po -Ao wide". After waiting for the newly created pods to reach the "Running" state, I tested and the notebooks were created normally.
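For anyone who hits the same "untolerated taint / Insufficient nvidia.com/gpu" scheduling error, a short sketch of inspecting and removing such a taint; the node name gpu-server and taint key gpu are taken from the comment above, so substitute your own:

```sh
# list every node together with its taints
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints'

# remove the taint with key "gpu" from node "gpu-server";
# the trailing "-" after the key deletes the taint
kubectl taint nodes gpu-server gpu-

# then watch the nvidia pods come up on the GPU node
watch -n1 kubectl get po -A -o wide
```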
Hello everyone,
I have installed Kubeflow on a k3s cluster.
I have configured the cluster to access GPUs, and it works fine for test pods on GPU; I can see nvidia-smi output in those pods.
But I want notebooks created from the Kubeflow dashboard to run on GPUs, and I can't find anything on how to do this. I selected GPUs in the notebook-creation UI on the dashboard, but the following command still returns False when run inside the Jupyter notebook:

import torch
torch.cuda.is_available()  # returns False
For k3s, we need to add "runtimeClassName: nvidia" to the pod spec, but there is no way I can do this in the Kubeflow manifests (see the sketch below).
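For illustration, a minimal sketch of what a plain pod needs on k3s when nvidia is not the default runtime. The RuntimeClass maps the name "nvidia" to the containerd runtime handler (the NVIDIA GPU Operator usually creates this object for you); the CUDA image tag is a placeholder:

```sh
kubectl apply -f - <<'EOF'
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia          # must match the runtime name in the containerd config
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  runtimeClassName: nvidia   # the field the notebook controller does not expose
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04   # placeholder tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
```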
Please suggest something or point me in the right direction if I am missing something.
Thanks in advance.