fix(build): run zeebe user in docker image by default #13418
Conversation
I'd like to know how I can test things further to give more confidence that we can move forward with this change. I'm also not very familiar with the QA Integration test CI that is failing. Perhaps it's related to the helm chart change that must be made in parallel to this one.
@jessesimpson36 thanks for creating this PR. I would pick this up after we come to a conclusion on #13302, as changes are required to e.g. the failing
I think this can be merged before that anyway. The changes to the update tests will still need to be done either way, and while the asymmetric partition test might be fixed by switching to Alpine, honestly I'm not so confident about that move anymore, so we might as well deal with it now. The main issue is that we need to install some utilities in the image for testing, and if the image is not running as root this is not possible anymore.

@jessesimpson36 - the failing test uses We can't just disconnect the containers from the Docker network, because we want to simulate the following: given 3 nodes A, B, and C, we want to simulate that A cannot talk to C, while B can talk to both nodes. I see two options: use reverse proxies to simulate the disconnection (e.g. something like toxiproxy), or dynamically build a Docker image on top of our image which has I don't expect @jessesimpson36 to do that however, so someone from our team should. I would normally say @megglos, as he seems to be DRI for this topic in our team (maybe?), but he's currently out sick. If he's not back tomorrow, I can take it over.

That said, IIRC, the main blocker for us was that this is a breaking change. As we see with the update tests, deploying the original image then updating to the new one, with
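For the "dynamically build a Docker image on top of our image" option, a throwaway test-only image could install the network tooling as root and then drop back to the unprivileged user. This is a sketch only; the base tag, package names, and user name are assumptions, not the actual test setup:

```dockerfile
# Hypothetical test-only image; base tag and package names are assumptions.
FROM camunda/zeebe:current-test

# Package installation needs root once the image's default user is unprivileged.
USER root
RUN apt-get update && \
    apt-get install -y --no-install-recommends iproute2 iptables && \
    rm -rf /var/lib/apt/lists/*

# Drop back to the unprivileged user for the actual test run.
USER zeebe
```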
@npepinpe could we extend the
Good catch, we could! I don't know how easy it is, as it's not directly exposed by Testcontainers, so we would have to essentially reimplement their

Though, using Toxiproxy makes the solution much more portable in general 🤷
From my perspective, biting the bullet on users changing volume permissions isn't as big an issue compared to the bullets I'm already munching on. Right now I've already had to instruct customers on how to set their own user within these images, how to specify their own image/registry in the helm chart, and then how to set the fsGroup, which will change the file permissions for them.

The helm chart can be updated to set fsGroup by default, which will change the file permissions when volumes are mounted on upgrade. Users specifying their own fsGroup won't be affected.

If there are workflows you'd like me to test, such as a helm upgrade with this new image and existing data inside the helm chart, just to prove a smooth migration can happen, I'm open to doing that. I'm also happy to attempt modifying the existing tests myself.
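For reference, the kind of pod-level setting being discussed looks roughly like this. The field names follow the Kubernetes pod security context API, but where exactly this lands in the chart's values.yaml is an assumption:

```yaml
# Sketch of a pod security context granting group ownership of mounted
# volumes; the top-level key and GID value are assumptions.
zeebe:
  podSecurityContext:
    fsGroup: 1000  # volumes are chown'd to this group on mount
```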
I created a draft PR in the helm chart repo with changes that may be necessary to support the non-root zeebe.
In some basic testing, I found that the storage class seems to impact the effectiveness of

Hmm...
Good point, I would also prefer to move away from installing custom tooling and rather use a dedicated component like toxiproxy to simulate network interruptions. Happy to try this out if it can wait 1-2 days.
thanks for investigating! I think in such situations a workaround could be that users override the user the image runs with back to root? We could list that as a breaking change, but assuming the majority uses the helm chart, we should be good already. We also need to check the k8s controller for SaaS then to make use of fsGroup, right?
For existing deployments, customers can still run as root if this is an issue for them, right? Though it would be nice to provide them with an update path.
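The run-as-root workaround mentioned here would roughly look like the following values override. The exact key path depends on the chart version and is an assumption:

```yaml
# Opt back into running as root for existing deployments that cannot
# migrate volume permissions yet; the key path is an assumption.
zeebe:
  containerSecurityContext:
    runAsUser: 0
```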
So unfortunately using Toxiproxy is not as accurate, since we have multiple ports and our nodes don't advertise different ports to different nodes. This means if we block traffic on one of the proxy ports, we block it for all nodes. Unfortunately, Toxiproxy doesn't support applying toxics only for a specific source/destination =/
@megglos are you taking care of fixing the tests or should I? |
if you are looking into it already anyway, go for it :) happy to review! |
@megglos - actually can you take over, I just realized I'm medic next week, and I still need to look into the randomized Raft bug today + wrap up the job worker integration for job push. |
@npepinpe I fixed the tests by running the apt commands as root using the Docker client API, plus running the previous zeebe image with the zeebe user to avoid permission issues after the update. I would raise a PR to Testcontainers to offer a

Can you review the current state please?
🚀
Good idea to open a PR upstream to allow specifying the user for the command. Are you doing it? Should I do it? It wasn't 100% clear :)
I have it ready, PR to be opened shortly
Opened a PR on testcontainers-java
Yep. I can talk with Chaima about this. Also, yes, creating a new image running as root or setting runAsUser: 0 would be a good workaround.
Ready to merge from my perspective. I would be willing to test this on SaaS (updating from 8.2.x to a generation with a zeebe snapshot) and follow up with the controller team if changes are needed. I guess they will be, looking at https://github.com/camunda-cloud/camunda-operator/blob/main/templates/zeebe_statefulset.yaml, which has no fsGroup set, right? @jessesimpson36

Docs will follow in camunda/camunda-docs#2340 once we are certain to keep it like that.

@npepinpe do you see any risk of breaking things like the weekly benchmarks when merging this? 🤔 As they start fresh and don't involve updates, it should be fine?
Should have no impact on benchmarks 👍 |
bors merge |
13418: fix(build): run zeebe user in docker image by default r=megglos a=jessesimpson36 ## Description The purpose of this PR is to use the non-privileged user by default. I am from the team that maintains the helm chart and docker-compose deployment methods of the Camunda platform. My interest in making this PR is to reduce the number of calls and support tickets that my team goes on. Customers often use security tools like AquaSecurity, which block "insecure" images from being run (even if we define a non-privileged user at run time via runAsUser). ## Related issues closes #12382 ## Related PR This PR is an alternative to an existing PR that runs a non-privileged user under gosu (which would be a run-time non-root user). #12931 Co-authored-by: Jesse Simpson <jesse.simpson@camunda.com> Co-authored-by: Meggle (Sebastian Bathke) <sebastian.bathke@camunda.com>
Build failed: |
bors retry |
Build failed: |
hit a new flake #13537 which is not related to the changes here
bors retry |
Build succeeded: |
Since camunda/camunda#13418 got merged, the zeebe container runs with an unprivileged `zeebe` user. As the chaos tooling uses apt to install some tools needed to e.g. stress the CPU or modify IP routes, we also need to override `runAsUser` to root to make that still possible. Unfortunately, the k8s exec API does not offer a way to override the user, see https://github.com/kubernetes/kubectl/blob/master/pkg/cmd/exec/exec.go#L104

The first commit resolves an issue that led to a test failure, see testcontainers/testcontainers-go#1359 (comment)
Description
The purpose of this PR is to use the non-privileged user by default. I am from the team that maintains the helm chart and docker-compose deployment methods of the Camunda platform. My interest in making this PR is to reduce the number of calls and support tickets that my team goes on. Customers often use security tools like AquaSecurity, which block "insecure" images from being run (even if we define a non-privileged user at run time via runAsUser).
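In Dockerfile terms, the change boils down to declaring an unprivileged default user in the image itself rather than relying on runtime overrides. A simplified sketch, assuming the user still needs to be created and ownership fixed; the UID/GID values and install path are assumptions, not the actual Dockerfile:

```dockerfile
# Simplified sketch of the approach; UID/GID and paths are assumptions.
RUN groupadd --gid 1000 zeebe && \
    useradd --uid 1000 --gid zeebe --no-create-home zeebe && \
    chown -R zeebe:zeebe /usr/local/zeebe

# Unprivileged by default, so security scanners no longer flag the
# image for running as root.
USER zeebe
```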
Related issues
closes #12382
Related PR
This PR is an alternative to an existing PR that runs a non-privileged user under gosu (which would be a run-time non-root user).
#12931
Definition of Done
Not all items need to be done depending on the issue and the pull request.
Code changes:
With regard to the helm charts, the user/customer needs to understand how to set the `fsGroup` option to ensure their volumes have the proper ownership.

Backport labels are added (e.g. `backport stable/1.3`) to the PR; in case that fails you need to create backports manually. This should not be necessary for this PR.
Testing:
I tested this change by loading the locally built image into my kind cluster and deploying the helm chart. I found that I needed to make a change to our helm chart by removing the already-mounted /usr/local/bin/startup.sh file. I'm still not quite sure why the helm chart needed to overwrite that file.
After making that helm chart change, the pod started up successfully.
Documentation:
Other teams:
If the change impacts another team an issue has been created for this team, explaining what they need to do to support this change.
The helm chart needs to be changed, as it currently overwrites the startup.sh file inside the image with its own mounted startup.sh.
Please refer to our review guidelines.