Docker start from checkpoint fails occasionally as content sha256 already exists #42900
The error is emitted here: moby/libcontainerd/remote/client.go, lines 188–190 at 2773f81.
And it originates from this code: moby/libcontainerd/remote/client.go, lines 906–908 at 2773f81.
Perhaps it's safe to ignore that error; I see we have some handling for that in other parts of the code that store content in containerd, e.g. lines 149–155 at 2773f81.
Hmm, looking a bit more at the code; IIUC, the content is stored in containerd's metadata using the checkpoint-dir as reference: moby/libcontainerd/remote/client.go, line 172 at 2773f81.
If I'm correct, this error may occur if either multiple containers are restored from the same checkpoint directory, or if there's a race condition where the content wasn't removed yet after the container exited. I see the same function mentioned above also cleans up the checkpoint from containerd's content store: moby/libcontainerd/remote/client.go, lines 175–176 at 2773f81.
Wondering if that's a problem if multiple containers try to start from the same checkpoint 🤔 (or if there's still some lease / reference counting that would prevent that from happening). @cpuguy83 any thoughts?
@thaJeztah thanks for the quick response!
I use different checkpoint directories for different containers, but the checkpoint name is the same for all containers (do you think that could be an issue?)

Edit: In the API call, I do not specify the checkpoint directory. I just specify the checkpoint name (which is the same for all containers).
I only gave it a cursory look, so I'm not sure (it could be that it only passes the base dir). That said, looking at the error message again: the error mentions a digest that already exists.
So I'm now wondering if the "(directory) name" is just a red herring, and it's the content of the checkpoint that's the same (not sure what the chances are of that; perhaps an "empty" state?). Of course, the checksum could still be a checksum of the name 😅. To summarise: this needs someone to take a dive into what's happening 😂.
We also run into this problem in our CI tests: checkpoint-restore/criu#1567
It looks like a race condition in containerd.
The problem seems to occur when creating a container from a checkpoint immediately after the checkpoint has been created. Judging from the comment in moby/vendor/github.com/containerd/containerd/content/local/store.go, lines 143–145 at 1192b46,
it looks like we might need to add a global lock around moby/libcontainerd/remote/client.go, lines 175–176 at 2773f81.
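A minimal sketch of the kind of global lock suggested above, against a toy in-memory store. All names here are illustrative, not moby's actual identifiers; the real fix would have to serialize the checkpoint-content write path against the cleanup path in libcontainerd.

```go
package main

import (
	"fmt"
	"sync"
)

// checkpointMu serializes writes and removals of checkpoint content, so a
// restore cannot race with the cleanup of a just-created checkpoint.
var checkpointMu sync.Mutex

type contentStore struct{ blobs map[string][]byte }

func (s *contentStore) write(digest string, data []byte) {
	checkpointMu.Lock()
	defer checkpointMu.Unlock()
	s.blobs[digest] = data
}

func (s *contentStore) remove(digest string) {
	checkpointMu.Lock()
	defer checkpointMu.Unlock()
	delete(s.blobs, digest)
}

func main() {
	s := &contentStore{blobs: map[string][]byte{}}
	var wg sync.WaitGroup
	// Concurrent restore/cleanup pairs no longer race on the store.
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			s.write("sha256:2921a6...", []byte("state"))
			s.remove("sha256:2921a6...")
		}()
	}
	wg.Wait()
	fmt.Println("done, blobs left:", len(s.blobs))
}
```

A process-wide mutex is the bluntest instrument here; per-digest locking or containerd leases might be better, but this shows the shape of the proposal.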
There is a race condition in docker/containerd that occasionally causes the following error when a container has been restored immediately after checkpoint. This problem is unrelated to criu and has been reported in moby/moby#42900 Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
I'm running into this exact same issue when trying to follow https://criu.org/Docker.

$ docker run -d --name looper busybox /bin/sh -c 'i=0; while true; do echo $i; i=$(expr $i + 1); sleep 1; done'
fa322f8bad73d3fc4ad4558aa73c4d9e1e744daa1eec392eef2465e659996b83
$ docker checkpoint create looper checkpoint1
checkpoint1
$ docker start --checkpoint checkpoint1 looper
Error response from daemon: failed to create task for container: content digest 2921a6b88e538747da49680beffa44afc8a1e487fe14bdea776430d91af86725: not found: unknown
$ docker start --checkpoint checkpoint1 looper
Error response from daemon: failed to upload checkpoint to containerd: commit failed: content sha256:2921a6b88e538747da49680beffa44afc8a1e487fe14bdea776430d91af86725: already exists

Output of docker version:

$ docker version
Client: Docker Engine - Community
Version: 25.0.2
API version: 1.44
Go version: go1.21.6
Git commit: 29cf629
Built: Thu Feb 1 00:22:57 2024
OS/Arch: linux/amd64
Context: default
Server: Docker Engine - Community
Engine:
Version: 25.0.2
API version: 1.44 (minimum version 1.24)
Go version: go1.21.6
Git commit: fce6e0c
Built: Thu Feb 1 00:22:57 2024
OS/Arch: linux/amd64
Experimental: true
containerd:
Version: 1.6.28
GitCommit: ae07eda36dd25f8a1b98dfbf587313b99c0190bb
runc:
Version: 1.1.12
GitCommit: v1.1.12-0-g51d5e94
docker-init:
Version: 0.19.0
GitCommit: de40ad0

Output of docker info:

$ docker info
Client: Docker Engine - Community
Version: 25.0.2
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.12.1
Path: /usr/libexec/docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.24.5
Path: /usr/libexec/docker/cli-plugins/docker-compose
Server:
Containers: 2
Running: 0
Paused: 0
Stopped: 2
Images: 3
Server Version: 25.0.2
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 runc
Default Runtime: runc
Init Binary: docker-init
containerd version: ae07eda36dd25f8a1b98dfbf587313b99c0190bb
runc version: v1.1.12-0-g51d5e94
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: builtin
cgroupns
Kernel Version: 5.15.0-92-generic
Operating System: Ubuntu 22.04.3 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 15.61GiB
Name: c10-03.sysnet.ucsd.edu
ID: 30256552-1c1a-4307-9b2f-e4c8fff589cc
Docker Root Dir: /var/lib/docker
Debug Mode: false
Experimental: true
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
I tried waiting for a while, but that didn't seem to help in my case. I'd be happy to share more information if it helps with debugging. I'm looking to explore the support for live container migration, and this has been a blocker for me. I would appreciate it if someone could point out a workaround, or guide me on the required fix, which I can try taking up.
@mayank-02: This is fixed with #47456.
The error still exists after the fix, unless I'm missing something: #47456 (comment)

Edit: I was missing that the backport only applies to v25.0.4 onwards.
Thanks for checking! Let me close the issue then.
Description
I'm running a large workload where I need to repeatedly take container checkpoints and then restart them. Occasionally the start from checkpoint fails with the following error:
I use the API call directly to start the container as follows:
When making the above API call, the start occasionally fails with a message saying the sha256 content already exists. I'm wondering what could be the reason for this.
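For reference, starting from a checkpoint presumably goes through the experimental start endpoint with the checkpoint query parameter. A small sketch of how that request path is formed; the container name ("looper"), checkpoint name ("checkpoint1"), and the helper name startURL are illustrative, and the checkpoint-dir parameter is my assumption based on the CLI's --checkpoint-dir flag:

```go
package main

import (
	"fmt"
	"net/url"
)

// startURL builds the request path for POST /containers/{name}/start with
// the experimental "checkpoint" (and optional "checkpoint-dir") query
// parameters of the Docker Engine API.
func startURL(name, checkpoint, checkpointDir string) string {
	q := url.Values{}
	q.Set("checkpoint", checkpoint)
	if checkpointDir != "" {
		q.Set("checkpoint-dir", checkpointDir)
	}
	return "/containers/" + name + "/start?" + q.Encode()
}

func main() {
	fmt.Println(startURL("looper", "checkpoint1", ""))
	// /containers/looper/start?checkpoint=checkpoint1
}
```

If no checkpoint-dir is sent (as in the report above), the daemon falls back to its default checkpoint location, which is consistent with the earlier question about whether the directory or the checkpoint name is the colliding part.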
Output of docker version:

Output of docker info: