arena top job lost resource information #1082

Closed
kangzemin opened this issue May 8, 2024 · 15 comments · Fixed by #1087
Comments

@kangzemin

arena top job lost resource information
image

arena: v0.9.14
BuildDate: 2024-04-10T12:54:22Z
GitCommit: adb43b8
GitTreeState: clean
GitTag: v0.9.14
GoVersion: go1.20.12
Compiler: gc
Platform: linux/amd64

@Syulin7
Collaborator

Syulin7 commented May 8, 2024

@kangzemin The GPU resource information is read from Prometheus metrics, so the cluster needs a Prometheus service that provides metrics such as "nvidia_gpu_duty_cycle". For reference, see: https://github.com/kubeflow/arena/blob/master/pkg/apis/types/gpu_metric.go

@kangzemin
Author

Thank you for your guidance!
When I deployed the exporter following https://github.com/kubeflow/arena/blob/master/docs/top/prometheus.md, there was an error:
image
So I edited kubernetes-artifacts/prometheus/gpu-exporter.yaml and deleted lines 26 and 30 (type: FileOrCreate).
Now the pod is running.

But the exporter pod log shows:

time="2024-05-11T03:28:50Z" level=info msg="runtime is docker"
{"level":"error","msg":"GetDriverVersion(): 535.161.07","time":"2024-05-11T03:28:50Z"}

Is there something wrong?

Kubernetes version:

Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.25.4-1", GitCommit:"f23e643ebd790a62a54b376116d094a732f28263", GitTreeState:"archive", BuildDate:"2023-02-02T00:35:13Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.25.4-1", GitCommit:"f23e643ebd790a62a54b376116d094a732f28263", GitTreeState:"archive", BuildDate:"2023-02-01T01:11:45Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"linux/amd64"}

Note: my Kubernetes runtime is containerd.
nvidia-smi:

 NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2

@Syulin7
Collaborator

Syulin7 commented May 13, 2024

> so I fix kubernetes-artifacts/prometheus/gpu-exporter.yaml,and delete line 26and30(type: FileOrCreate)

@kangzemin You need to mount the node's containerd.sock to /run/containerd/containerd.sock inside the container.
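
A minimal sketch of what that mount could look like, written as a Python helper that patches a pod spec held as a dictionary. The helper name and the volume name `containerd-sock` are hypothetical, and using the `Socket` hostPath type is an assumption rather than something taken from the arena manifests; only the socket path comes from the comment above.

```python
import copy

SOCKET_PATH = "/run/containerd/containerd.sock"  # path named in the comment above


def add_containerd_socket_mount(pod_spec, volume_name="containerd-sock"):
    """Return a copy of pod_spec with the node's containerd socket mounted
    at the same path in every container. volume_name is a hypothetical name."""
    spec = copy.deepcopy(pod_spec)
    volumes = spec.setdefault("volumes", [])
    if not any(v.get("name") == volume_name for v in volumes):
        volumes.append({
            "name": volume_name,
            # Assumption: hostPath type "Socket" expects a UNIX socket at the path.
            "hostPath": {"path": SOCKET_PATH, "type": "Socket"},
        })
    for container in spec.get("containers", []):
        mounts = container.setdefault("volumeMounts", [])
        if not any(m.get("mountPath") == SOCKET_PATH for m in mounts):
            mounts.append({"name": volume_name, "mountPath": SOCKET_PATH})
    return spec
```

The helper is idempotent, so running it on an already patched spec leaves it unchanged. The resulting dictionary mirrors the `volumes` and `volumeMounts` fields you would otherwise write by hand in gpu-exporter.yaml.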

@kangzemin
Author

@Syulin7 OK, I mounted the node's /run/containerd/containerd.sock to /run/containerd/containerd.sock inside the container, and the exporter pod is running.

But the exporter pod log still shows an error:

time="2024-05-14T06:36:07Z" level=info msg="runtime is containerd"
{"level":"error","msg":"GetDriverVersion(): 535.161.07","time":"2024-05-14T06:36:07Z"}

The query against Prometheus returns an empty result:

kubectl get --raw '/api/v1/namespaces/arena-system/services/prometheus-svc:prometheus/proxy/api/v1/query?query=nvidia_gpu_num_devices' 
{"status":"success","data":{"resultType":"vector","result":[]}}
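
For anyone scripting this check, here is a small sketch that tells a successful but empty answer apart from a failed query. It assumes the response body has already been fetched (for example with the `kubectl get --raw` command above); the function name `has_samples` is hypothetical.

```python
import json


def has_samples(response_text):
    """Return True if a Prometheus instant-query response contains at least one sample."""
    body = json.loads(response_text)
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body.get('error', 'unknown error')}")
    return len(body["data"]["result"]) > 0


# The response pasted above: the query succeeded but matched no series.
empty = '{"status":"success","data":{"resultType":"vector","result":[]}}'
print(has_samples(empty))  # False
```

A `False` here means Prometheus is reachable but has no series for the queried metric, which points at the exporter or the scrape configuration rather than at the query itself.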

gpu-exporter

@Syulin7
Collaborator

Syulin7 commented May 15, 2024

@kangzemin Execute the following command to check if node-gpu-exporter exposes metrics.

kubectl get --raw '/api/v1/namespaces/arena-system/services/node-gpu-exporter:http-metrics/proxy/'

@kangzemin
Author

This is the result:
image

image

image

@Syulin7
Collaborator

Syulin7 commented May 15, 2024

#1087

@kangzemin I submitted a PR to fix this issue. Please refer to this PR to redeploy the service.

The Prometheus deployed here is for testing only. In a production environment, you should deploy your own Prometheus service and ensure data persistence.

@kangzemin
Author

@Syulin7 OK, thank you!

@Syulin7
Collaborator

Syulin7 commented May 17, 2024

@kangzemin Does it work after trying again? Are there any other issues?

@kangzemin
Author

The problem still exists.
image
But Prometheus looks normal:

kubectl get --raw '/api/v1/namespaces/arena-system/services/node-gpu-exporter:http-metrics/proxy/'

prometheus

In Grafana, only the gpunode dashboard has data; the other dashboards are empty.

Can you give me some advice?

@Syulin7
Collaborator

Syulin7 commented May 27, 2024

@kangzemin It seems that the metrics collected by node-gpu-exporter do not include the pod_name. Have you updated the node-gpu-exporter image and modified the resource limit value according to #1087?
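
One way to confirm this from the exporter output is sketched below. It assumes the exporter serves the Prometheus text exposition format and that `pod_name` is exposed as a label on `nvidia_gpu_duty_cycle`; the function name, the `minor_number` label, and the sample lines are made up for illustration and are not real exporter output.

```python
import re

SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+\S+')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def series_missing_label(metrics_text, metric="nvidia_gpu_duty_cycle", label="pod_name"):
    """Return the sample lines for metric that carry no non-empty value for label."""
    missing = []
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if not match or match.group(1) != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group(2) or ""))
        if not labels.get(label):
            missing.append(line)
    return missing


# Hypothetical exporter output, for illustration only.
sample = '''# TYPE nvidia_gpu_duty_cycle gauge
nvidia_gpu_duty_cycle{minor_number="0",pod_name="job-0"} 35
nvidia_gpu_duty_cycle{minor_number="1"} 0
'''
print(series_missing_label(sample))  # lists only the line without pod_name
```

If every GPU that is running a job shows up in the returned list, the exporter is not attaching pod information to its metrics, which matches the symptom described here.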

@kangzemin
Author

@Syulin7 Yes, I updated the deployment to use image gpu-prometheus-exporter:v1.0.1-b2c2f9b, with limits of cpu 1 and memory 2000Mi.
arena top job still shows N/A for the GPU information.

@Syulin7
Collaborator

Syulin7 commented May 29, 2024

@kangzemin This should be related to your cluster configuration, please contact me via email.
