
Improve container and imagefs stats #233

Merged 3 commits from the improve-container-stats branch into Mirantis:master on Oct 9, 2023

Conversation

@xinfengliu (Contributor) commented Sep 5, 2023

We have seen high CPU usage in dockerd and cri-dockerd in some cases. The profiling data shows the high CPU was caused by calculating container RW layer sizes. There is also an issue report in #135.

This PR does:

  • Limit the concurrency of ListContainerStats.
    Currently the concurrency equals the number of k8s containers, which can cause thundering-herd load on big nodes running a large number of pods. This PR uses a pipeline pattern so that ListContainerStats uses at most half the number of CPUs (see the sketch after this list). Please see later comments in this PR for updates.

  • Cache container stats with a dynamic TTL based on how long collecting the stats takes.
    Use another cache implementation to overcome the following shortcomings of k8s.io/client-go/tools/cache:

    1. It cannot set a different cache TTL for different entries.
    2. It can cause time-consuming operations to fire in parallel. For example, InspectContainerWithSize can take minutes to finish; if multiple requests come in, multiple InspectContainerWithSize calls will run, putting high load on the system.
    • The customized cache prevents time-consuming operations from running in parallel. In addition, it adds lastTimeTaken to the cache TTL to make the TTL dynamic. Normally the operation takes milliseconds or a few seconds, so adding lastTimeTaken barely changes the TTL. Some resource-consuming operations may take minutes to finish; it does not make sense to keep a low TTL for them since the stats are already not realtime, and a longer TTL helps reduce the load.
  • Cache imageFS stats.

  • For Linux, calculate imageFS using the docker root directory (obsoletes Fix evaluation of usage for ImageFsInfo method #221). ImageFs means the filesystem that container runtimes use to store container images and container writable layers. Counting only the image directory itself is wrong and does not help kubelet make correct decisions about garbage collection or evictions.

  • Rename image_linux.go and image_windows.go to imagefs_linux.go and imagefs_windows.go to better reflect the files' purpose.
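A minimal sketch of the bounded-concurrency pipeline idea described above. The names containerStats and getContainerStats are hypothetical stand-ins for the real CRI types and calls; the PR's actual implementation in core/stats.go may differ:

```go
package stats

import (
	"context"
	"runtime"
	"sync"
)

// containerStats is a hypothetical stand-in for the CRI ContainerStats message.
type containerStats struct{ id string }

// getContainerStats is a hypothetical stand-in for the per-container stats call.
func getContainerStats(ctx context.Context, id string) (*containerStats, error) {
	return &containerStats{id: id}, nil
}

// listContainerStats fans container IDs out to a fixed pool of workers
// instead of spawning one goroutine per container, capping concurrency
// at half the number of CPUs.
func listContainerStats(ctx context.Context, ids []string) []*containerStats {
	workers := runtime.NumCPU() / 2
	if workers < 1 {
		workers = 1
	}

	in := make(chan string)
	out := make(chan *containerStats, len(ids))

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for id := range in {
				if s, err := getContainerStats(ctx, id); err == nil {
					out <- s
				}
			}
		}()
	}

	for _, id := range ids {
		in <- id
	}
	close(in)
	wg.Wait()
	close(out)

	stats := make([]*containerStats, 0, len(ids))
	for s := range out {
		stats = append(stats, s)
	}
	return stats
}
```

This only illustrates the bounded-concurrency idea; a later comment in this PR reworks the worker count (NumCPU() * 6) to avoid long delays when there are many containers.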

@nwneisen self-requested a review on September 5, 2023 19:05
@nwneisen (Collaborator) left a comment:

LGTM! Thank you!

@xinfengliu force-pushed the improve-container-stats branch 3 times, most recently from 97c3d86 to 5af63d2, on September 10, 2023 10:26
@xinfengliu (Contributor, Author) commented Sep 10, 2023

I pushed a few more commits over the weekend based on my testing and debugging.

  • Use statfs for imageFsInfo() on Linux.

    I think usedBytes and iNodesUsed are only meaningful at the filesystem level for kubelet's decisions on garbage collection and eviction. Using statfs also avoids an expensive walk of the docker directory.

  • Reduce unnecessary function calls.
    Pick up "Reduce the number of expensive calls in the Windows stats queries for dockershim" (kubernetes/kubernetes#104287), which cri-dockerd had missed.

  • Refactor the cache for docker info and docker version.

    When docker info takes a long time (e.g. under high system load), multiple incoming requests (e.g. getContainerStats) that need docker info can all miss the k8s.io/client-go/tools/cache entry and each fire a docker info request to dockerd. I have seen a user's docker daemon stack dump with over 800 goroutines processing docker info, even though that user's cri-dockerd already included "Cache docker info and version" #209.

    I used a cri-dockerd build with debugging info and occasionally captured this:

Sep 10 12:25:09 ub2004-3 cri-dockerd-mke[625387]: time="2023-09-10T12:25:09+08:00" level=info msg="Calling dockerd for docker info"
Sep 10 12:25:09 ub2004-3 cri-dockerd-mke[625387]: time="2023-09-10T12:25:09+08:00" level=info msg="Calling dockerd for docker info"
Sep 10 12:25:09 ub2004-3 cri-dockerd-mke[625387]: time="2023-09-10T12:25:09+08:00" level=info msg="Calling dockerd for docker info"
Sep 10 12:25:09 ub2004-3 cri-dockerd-mke[625387]: time="2023-09-10T12:25:09+08:00" level=info msg="Calling dockerd for docker info"
Sep 10 12:25:09 ub2004-3 cri-dockerd-mke[625387]: time="2023-09-10T12:25:09+08:00" level=info msg="Calling dockerd for docker info"
Sep 10 12:25:09 ub2004-3 cri-dockerd-mke[625387]: time="2023-09-10T12:25:09+08:00" level=info msg="Calling dockerd for docker info"
Sep 10 12:25:09 ub2004-3 cri-dockerd-mke[625387]: time="2023-09-10T12:25:09+08:00" level=info msg="Calling dockerd for docker info"
Sep 10 12:25:09 ub2004-3 cri-dockerd-mke[625387]: time="2023-09-10T12:25:09+08:00" level=info msg="Calling dockerd for docker info"
...

I changed this to use the home-made cache (a sketch of the idea appears at the end of this comment). Even with this home-made cache, there can still be heavy lock contention on docker info from getContainerStats.

  • Rework the concurrency control for ListContainerStats.

getContainerStats may take some time. When there are many containers, the whole ListContainerStats call can be delayed for a long time if the number of workers is small, so we want a larger worker count to avoid overly long delays. For a common node with 8 CPUs running dozens of pods, NumCPU() * 6 may be a moderate number.

From analysis of a dockerd stack dump, I think the bottleneck is in daemon/stats/collector.go: it handles all ContainerStats requests serially, which does not scale with the large number of ContainerStats requests triggered by ListContainerStats. Another issue is that each ContainerStats needs a condition-variable lock to register with the Collector; it may miss the current Collector.Run loop, and each loop sleeps for 1s at the end.
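A minimal sketch of the home-made cache idea mentioned above: one refresh at a time per key, with the TTL stretched by lastTimeTaken. The names ttlCache, newTTLCache, and get are hypothetical, not the PR's actual types, and the real cache may differ in structure:

```go
package cache

import (
	"sync"
	"time"
)

// entry holds a cached value together with its expiry time.
type entry struct {
	value   interface{}
	expires time.Time
}

// ttlCache serializes refreshes per key and stretches each entry's TTL
// by the time the last refresh took, so slow operations are cached longer.
type ttlCache struct {
	mu      sync.Mutex
	baseTTL time.Duration
	entries map[string]*entry
	locks   map[string]*sync.Mutex // one refresh at a time per key
}

func newTTLCache(baseTTL time.Duration) *ttlCache {
	return &ttlCache{
		baseTTL: baseTTL,
		entries: make(map[string]*entry),
		locks:   make(map[string]*sync.Mutex),
	}
}

// get returns the cached value for key, or runs fn to refresh it.
// Concurrent callers for the same key block on the per-key lock, so an
// expensive call such as docker info runs once instead of hundreds of times.
func (c *ttlCache) get(key string, fn func() (interface{}, error)) (interface{}, error) {
	c.mu.Lock()
	lock, ok := c.locks[key]
	if !ok {
		lock = &sync.Mutex{}
		c.locks[key] = lock
	}
	c.mu.Unlock()

	lock.Lock() // waiters queue here while one caller refreshes
	defer lock.Unlock()

	c.mu.Lock()
	if e, ok := c.entries[key]; ok && time.Now().Before(e.expires) {
		c.mu.Unlock()
		return e.value, nil // still fresh, no call to dockerd
	}
	c.mu.Unlock()

	start := time.Now()
	v, err := fn()
	if err != nil {
		return nil, err
	}
	lastTimeTaken := time.Since(start)

	c.mu.Lock()
	// Dynamic TTL: base TTL plus lastTimeTaken, so operations that take
	// minutes are cached correspondingly longer.
	c.entries[key] = &entry{value: v, expires: time.Now().Add(c.baseTTL + lastTimeTaken)}
	c.mu.Unlock()
	return v, nil
}
```

This only illustrates the idea; the PR's actual cache implementation may differ in naming and detail.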

info, err := ds.getDockerInfo()
if err != nil {
    logrus.Error(err, "Failed to get docker info")
    return nil, err
}

bytes, inodes, err := dirSize(filepath.Join(info.DockerRootDir, "image"))
A Collaborator commented on the snippet above:

This additional work looks promising to me (I need to spend more time on the specifics later in the week when I find time); it's worth noting that this will conflict with/obsolete #221.

This will also have similar problems when the containerd snapshotter backend is used with dockerd (not that anyone is likely using cri-dockerd against Moby 24.0 with this feature in production yet), as I mentioned in #221 (comment).

cc @surik

@xinfengliu (Contributor, Author) commented:

Sorry for the messy commits, I just squashed them.

Highlighted recent changes:

@neersighted (Collaborator) left a comment:

Looks like store.NewObjectCache is unused with these changes -- can we please add removing it to this PR?

@xinfengliu (Contributor, Author) replied:

Looks like store.NewObjectCache is unused with these changes -- can we please add removing it to this PR?

Done. Thanks for pointing it out.

@xinfengliu (Contributor, Author) commented:

@neersighted
Could you help get moby/moby#46448 reviewed? Thanks. (I don't have permission to request reviewers.)

@xinfengliu (Contributor, Author) commented:

Some new updates:

  • Since the time for calculating the container RW layer size is unpredictable (in some cases it can take multiple minutes, exceeding the default docker client timeout of 1m59s), it is beneficial to decouple the container RW layer size calculation.

  • The ImageFS used_bytes is now computed via the docker images API, which is consistent with how cadvisor does it, although I still think it is not very meaningful since docker volumes and builder caches can also consume a lot of space. (See the sketch at the end of this comment.)

The new commits have been rebased on the master branch and squashed.
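A minimal sketch of the imageFS computation described above: statfs for filesystem-level stats plus the docker images API for used_bytes. The helper imageFsUsage and the imageFsStats struct are hypothetical, the moby client and golang.org/x/sys/unix imports are assumptions, and the simple sum of ImageSummary.Size ignores layers shared between images:

```go
package imagefs

import (
	"context"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
	"golang.org/x/sys/unix"
)

// imageFsStats is a hypothetical container for the values reported in ImageFsInfo.
type imageFsStats struct {
	capacityBytes  uint64
	availableBytes uint64
	usedBytes      int64
	inodesUsed     uint64
}

// imageFsUsage computes filesystem-level stats for the docker root
// directory via statfs, and usedBytes by summing the sizes reported by
// the docker images API.
func imageFsUsage(ctx context.Context, cli *client.Client, dockerRootDir string) (*imageFsStats, error) {
	var fs unix.Statfs_t
	if err := unix.Statfs(dockerRootDir, &fs); err != nil {
		return nil, err
	}

	images, err := cli.ImageList(ctx, types.ImageListOptions{})
	if err != nil {
		return nil, err
	}
	var used int64
	for _, img := range images {
		used += img.Size // note: double-counts layers shared between images
	}

	return &imageFsStats{
		capacityBytes:  fs.Blocks * uint64(fs.Bsize),
		availableBytes: fs.Bavail * uint64(fs.Bsize),
		usedBytes:      used,
		inodesUsed:     fs.Files - fs.Ffree,
	}, nil
}
```

A caller would presumably pass the DockerRootDir obtained from docker info; the actual code in core/imagefs_linux.go may structure this differently.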

@xinfengliu force-pushed the improve-container-stats branch 4 times, most recently from a328876 to 42f5668, on September 20, 2023 07:43
- Use pipeline pattern for controlling the concurrency of ListContainerStats
- decouple computing container RW layer size and cache it.
- for linux, calculate imageFS using statfs and docker images api
- cache imageFS stats
- pick up kubernetes#104287

Signed-off-by: Xinfeng Liu <XinfengLiu@icloud.com>

remove store/object_cache.go

Signed-off-by: Xinfeng Liu <XinfengLiu@icloud.com>

decouple container RW layer size calculation

Signed-off-by: Xinfeng Liu <XinfengLiu@icloud.com>

use docker images api for imagefs usedBytes

Signed-off-by: Xinfeng Liu <XinfengLiu@icloud.com>
Signed-off-by: Xinfeng Liu <XinfengLiu@icloud.com>
@xinfengliu (Contributor, Author) commented:

@nwneisen I pushed a new commit per your recent feedback. Please let me know if there's anything else I need to do.

@nwneisen (Collaborator) commented Oct 9, 2023

@xinfengliu The changes look good. I will go ahead and merge this in. Thank you for the PR.

@nwneisen merged commit 8c694f1 into Mirantis:master on Oct 9, 2023 (2 checks passed)
@zhangguanzhang (Contributor) commented:

After updating to v0.3.5, this error is fixed:

ERRO[0037] Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? Failed to get stats from container ...
ERRO[0037] Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? Failed to get stats from container ...
ERRO[0037] Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? Failed to get stats from container ...
