Unable to run InferenceService on a local cluster #2715

Open · 4 of 7 tasks
yurkoff-mv opened this issue May 14, 2024 · 1 comment

@yurkoff-mv

Validation Checklist

  • Is this a Kubeflow issue?
  • Are you posting in the right repository?
  • Did you follow the installation guide (https://github.com/kubeflow/manifests?tab=readme-ov-file)?
  • Is the issue report properly structured and detailed with version numbers?
  • Is this for Kubeflow development?
  • Would you like to work on this issue?
  • Join our Slack channel using wg-manifests.

Version

1.8

Describe your issue

I have a local cluster without internet access, running manifests version 1.8, which I deployed using images imported as tar files. I also imported the image for the InferenceService as a tar file, but the service does not start. Running microk8s kubectl describe inferenceservices -n kubeflow-namespace llm shows the following error:
Revision "llm -predictor-00001" failed with message: Unable to fetch image "yurkoff/torchserve-kfs:0.9.0-gpu": failed to resolve image to digest: Get "https://index.docker.io/v2 /": read tcp 10.1.22.219:48238->54.198.86.24:443: read: connection reset by peer.
Moreover, the image is present in containerd, as microk8s ctr shows:
microk8s ctr images list | grep yurkoff
docker.io/yurkoff/torchserve-kfs:0.9.0-gpu application/vnd.docker.distribution.manifest.v2+json sha256:1b771d7c0c2d26f78e892997cb00e6051c77cf3654827c4715aa5a502267ee76 5.7 GiB linux/amd64 io.cri-containerd.image=managed
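
The error points at Knative Serving's tag-to-digest resolution: when a Revision is created, the Knative controller itself contacts the registry to resolve the image tag to a digest, so a pod-level imagePullPolicy and a locally imported image do not help on an air-gapped cluster. A minimal sketch of one possible workaround, assuming Knative Serving is installed in the knative-serving namespace (the key is registries-skipping-tag-resolving in recent Knative releases; older releases spell it registriesSkippingTagResolving):

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-deployment
  namespace: knative-serving
data:
  # Registries for which Knative skips tag-to-digest resolution,
  # so no network call to the registry is made at Revision creation.
  registries-skipping-tag-resolving: "docker.io,index.docker.io"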

My YAML file for the InferenceService (note that I deliberately set imagePullPolicy: "Never"):

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: "llm"
  namespace: "kubeflow-namespace"
spec:
  predictor:
    pytorch:
      protocolVersion: v1
      runtimeVersion: "0.9.0-gpu"
      image: "yurkoff/torchserve-kfs:0.9.0-gpu"
      imagePullPolicy: "Never"
      storageUri: pvc://torchserve-claim/llm
      resources:
        requests:
          cpu: "2"
          memory: 16Gi
          nvidia.com/gpu: "1"
        limits:
          cpu: "4"
          memory: 30Gi
          nvidia.com/gpu: "1"
    minReplicas: 1
    maxReplicas: 1
    timeout: 180
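
Note that imagePullPolicy: "Never" only reaches the kubelet; it does not prevent the controller-side resolution described above. An alternative sketch that avoids the registry call entirely is to pin the image by the digest already shown in the ctr listing, since a digest reference needs no resolution (only the image field changes; the rest of the spec stays as above):

      # Digest taken from the microk8s ctr listing above; no tag to resolve.
      image: "yurkoff/torchserve-kfs@sha256:1b771d7c0c2d26f78e892997cb00e6051c77cf3654827c4715aa5a502267ee76"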

Steps to reproduce the issue

On another machine with internet access:

  1. microk8s ctr images pull docker.io/yurkoff/torchserve-kfs:0.9.0-gpu
  2. microk8s ctr images export yurkoff_torchserve-kfs_0.9.0-gpu.tar docker.io/yurkoff/torchserve-kfs:0.9.0-gpu

On the local machine without internet access:

  1. microk8s ctr images import yurkoff_torchserve-kfs_0.9.0-gpu.tar
  2. microk8s kubectl apply -f llm_isvc.yaml
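
If the config-deployment route sketched above is preferred, the same change can be applied imperatively (same assumption that Knative Serving lives in the knative-serving namespace), and the Revision re-checked afterwards:

  1. microk8s kubectl patch configmap config-deployment -n knative-serving --type merge -p '{"data":{"registries-skipping-tag-resolving":"docker.io,index.docker.io"}}'
  2. microk8s kubectl describe inferenceservices -n kubeflow-namespace llm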


@juliusvonkohout
Copy link
Member

Hello, I do not see how this is Kubeflow related; I only see microk8s issues.
