Unable to run InferenceService on a local cluster #2715

Open · 4 of 7 tasks
yurkoff-mv opened this issue May 14, 2024 · 1 comment

@yurkoff-mv

Validation Checklist

  • Is this a Kubeflow issue?
  • Are you posting in the right repository?
  • Did you follow the installation guide (https://github.com/kubeflow/manifests?tab=readme-ov-file)?
  • Is the issue report properly structured and detailed with version numbers?
  • Is this for Kubeflow development?
  • Would you like to work on this issue?
  • Join our Slack channel using wg-manifests.

Version

1.8

Describe your issue

I have a local cluster without internet access, running manifests version 1.8, which I deployed using images imported as tar files. I also imported the image for the InferenceService as a tar file, but the service does not start. Running microk8s kubectl describe inferenceservices -n kubeflow-namespace llm shows the following error:
Revision "llm -predictor-00001" failed with message: Unable to fetch image "yurkoff/torchserve-kfs:0.9.0-gpu": failed to resolve image to digest: Get "https://index.docker.io/v2 /": read tcp 10.1.22.219:48238->54.198.86.24:443: read: connection reset by peer.
Moreover, the image is present in containerd, as microk8s ctr shows:
microk8s ctr images list | grep yurkoff
docker.io/yurkoff/torchserve-kfs:0.9.0-gpu application/vnd.docker.distribution.manifest.v2+json sha256:1b771d7c0c2d26f78e892997cb00e6051c77cf3654827c4715aa5a502267ee76 5.7 GiB linux/amd64 io.cri-containerd.image=managed
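
The error points at Knative Serving's tag-to-digest resolution: when a Revision is created, the Knative controller itself contacts the registry to resolve the image tag to a digest, so a pod-level imagePullPolicy and a locally imported image do not help on an air-gapped cluster. A minimal sketch of one possible workaround, assuming Knative Serving is installed in the knative-serving namespace (the key is registries-skipping-tag-resolving in recent Knative releases; older releases spell it registriesSkippingTagResolving):

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-deployment
  namespace: knative-serving
data:
  # Registries for which Knative skips tag-to-digest resolution,
  # so no network call to the registry is made at Revision creation.
  registries-skipping-tag-resolving: "docker.io,index.docker.io"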

My YAML file for the InferenceService (note that I deliberately set imagePullPolicy: "Never"):

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: "llm"
  namespace: "kubeflow-namespace"
spec:
  predictor:
    pytorch:
      protocolVersion: v1
      runtimeVersion: "0.9.0-gpu"
      image: "yurkoff/torchserve-kfs:0.9.0-gpu"
      imagePullPolicy: "Never"
      storageUri: pvc://torchserve-claim/llm
      resources:
        requests:
          cpu: "2"
          memory: 16Gi
          nvidia.com/gpu: "1"
        limits:
          cpu: "4"
          memory: 30Gi
          nvidia.com/gpu: "1"
    minReplicas: 1
    maxReplicas: 1
    timeout: 180
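
Note that imagePullPolicy: "Never" only reaches the kubelet; it does not prevent the controller-side resolution described above. An alternative sketch that avoids the registry call entirely is to pin the image by the digest already shown in the ctr listing, since a digest reference needs no resolution (only the image field changes; the rest of the spec stays as above):

      # Digest taken from the microk8s ctr listing above; no tag to resolve.
      image: "yurkoff/torchserve-kfs@sha256:1b771d7c0c2d26f78e892997cb00e6051c77cf3654827c4715aa5a502267ee76"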

Steps to reproduce the issue

On another machine with internet access:

  1. microk8s ctr images pull docker.io/yurkoff/torchserve-kfs:0.9.0-gpu
  2. microk8s ctr images export yurkoff_torchserve-kfs_0.9.0-gpu.tar docker.io/yurkoff/torchserve-kfs:0.9.0-gpu

On the local machine without internet access:

  1. microk8s ctr images import yurkoff_torchserve-kfs_0.9.0-gpu.tar
  2. microk8s kubectl apply -f llm_isvc.yaml
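
If the config-deployment route sketched above is preferred, the same change can be applied imperatively (same assumption that Knative Serving lives in the knative-serving namespace), and the Revision re-checked afterwards:

  1. microk8s kubectl patch configmap config-deployment -n knative-serving --type merge -p '{"data":{"registries-skipping-tag-resolving":"docker.io,index.docker.io"}}'
  2. microk8s kubectl describe inferenceservices -n kubeflow-namespace llm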


@juliusvonkohout
Copy link
Member

Hello, I do not see how this is Kubeflow related; I only see microk8s issues.
