fix(build): run zeebe user in docker image by default #13418

jessesimpson36 · 2023-07-10T19:07:18Z

Description

The purpose of this PR is to run the image as the non-privileged user by default. I am from the team that maintains the helm chart and docker-compose deployment methods of the camunda platform. My interest in making this PR is to reduce the number of calls and support tickets that my team handles. Customers often use security tools like AquaSecurity, which block "insecure" images from being run (even if we define a non-privileged user at run-time via runAsUser).

Related issues

closes #12382

Related PR

This PR is an alternative to a PR that currently exists which runs a non-privileged user under gosu (which would be a run-time non-root user).
#12931

Definition of Done

Not all items need to be done depending on the issue and the pull request.

Code changes:

The changes are backwards compatible with previous versions

With regards to the helm charts, the user/customer needs to understand how to set the fsGroup option to ensure their volumes have the proper ownership.

If it fixes a bug then PRs are created to backport the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. backport stable/1.3) to the PR, in case that fails you need to create backports manually.

This should not be necessary for this PR.

Testing:

There are unit/integration tests that verify all acceptance criteria of the issue
New tests are written to ensure backwards compatibility with further versions
The behavior is tested manually

I tested this change via loading the locally built image into my kind cluster and deploying the helm chart. I found that I needed to make a change to our helm chart by removing the already mounted /usr/local/bin/startup.sh file. I'm still not quite sure why the helm chart needed to overwrite that file.

After making that helm chart change, the pod started up successfully.

The change has been verified by a QA run
The impact of the changes is verified by a benchmark

Documentation:

The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
If the PR changes how BPMN processes are validated (e.g. support new BPMN element) then the Camunda modeling team should be informed to adjust the BPMN linting.

Other teams:
If the change impacts another team an issue has been created for this team, explaining what they need to do to support this change.

Distribution

The Helm chart needs to be changed, because it currently mounts a startup.sh file that overwrites the startup.sh file inside the image.

Please refer to our review guidelines.

jessesimpson36 · 2023-07-10T19:44:17Z

I'd like to know how I can test things further to give more confidence that we can move forward with this change. I'm also not very familiar with the QA Integration test CI that is failing. Perhaps it's related to the helm chart change that must be made in parallel to this one.

megglos · 2023-07-11T07:34:42Z

@jessesimpson36 thanks for creating this PR, I would pick this up after we came to a conclusion on #13302 as changes required to e.g. the failing AsymmetricNetworkPartitionIT test will depend on the base image used going forward.

megglos · 2023-07-11T07:35:13Z

Relates to https://github.com/camunda/product-hub/issues/717

npepinpe · 2023-07-11T11:17:06Z

I think this can be merged before anyway. The changes to the update tests that need to be done will still need to be done anyway, and while the asymmetric partition test might be fixed by switching to Alpine, honestly I'm not so confident about that move anymore, so we might as well deal with it now.

The main issue is that we need to install some utilities in the image for testing, and if the image is not running as root this is not possible anymore.

@jessesimpson36 - the failing test uses ip to add unreachable routes between nodes to simulate a partial network partition. I'd rather not add ip as part of the base image if it's only required for testing, but it looks like no common tools (e.g. route, ip, etc.) for this are in the base image.

We can't just disconnect the containers from the Docker network because we want to simulate the following: given 3 nodes, A, B, and C, we want to simulate that A cannot talk to C, and B can talk to both nodes.
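To make the topology concrete, here is a minimal sketch of the partition described above. The node names and IP addresses are hypothetical and do not come from the actual test, and cutting the link on both sides is an assumption of this sketch. It derives the `ip route add unreachable` commands each node would need so that A and C cannot reach each other while B still reaches both.

```python
# Hypothetical node addresses; a real test would resolve these from the containers.
NODES = {"A": "10.0.0.2", "B": "10.0.0.3", "C": "10.0.0.4"}

# Pairs of nodes that must not be able to talk to each other.
BLOCKED = {frozenset({"A", "C"})}


def can_talk(src: str, dst: str) -> bool:
    """Return True if src and dst are not separated by the partition."""
    return frozenset({src, dst}) not in BLOCKED


def route_commands() -> dict:
    """Build, per node, the iproute2 commands that cut the blocked links."""
    commands = {name: [] for name in NODES}
    for pair in BLOCKED:
        first, second = sorted(pair)
        # Assumption: the route is made unreachable on both ends of the link.
        commands[first].append(f"ip route add unreachable {NODES[second]}")
        commands[second].append(f"ip route add unreachable {NODES[first]}")
    return commands
```

Disconnecting a container from the Docker network cannot express this, because it would cut all of that node's links at once, including the ones to B.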

I see two options: use reverse proxies to simulate the disconnection (e.g. something like toxiproxy), or dynamically build a Docker image on top of our image which has ip installed. I would prefer the Toxiproxy solution.

I don't expect @jessesimpson36 to do that however, so someone from our team should. I would normally say @megglos as he seems to be DRI for this topic in our team (maybe?), but he's currently out sick. If he's not back tomorrow, I can take it over.

That said, IIRC, the main blocker for us was that this is a breaking change. As we see with the update tests, deploying the original image then updating to the new one, with zeebe as the user, will cause deployments to fail due to invalid volume permissions (i.e. the data was previously written under root). Did we decide to just bite the bullet then?
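The failure mode can be illustrated with a small sketch. It assumes plain POSIX permission bits (no ACLs or capabilities), and the function name is hypothetical. A data directory created by root with mode 755 is not writable for a different user, unless that user belongs to the owning group and the group has write permission.

```python
import os
import stat


def can_write(path, uid, gids=()):
    """Approximate whether `uid` (with supplementary `gids`) may write to `path`.

    Only the classic owner/group/other mode bits are considered.
    """
    st = os.stat(path)
    if uid == 0:
        # root bypasses the mode bits.
        return True
    if st.st_uid == uid:
        return bool(st.st_mode & stat.S_IWUSR)
    if st.st_gid in gids:
        return bool(st.st_mode & stat.S_IWGRP)
    return bool(st.st_mode & stat.S_IWOTH)
```

Running such a check against the mounted data directory with the uid of the `zeebe` user would show whether an existing volume needs its ownership or group permissions changed before the update.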

megglos · 2023-07-11T12:10:29Z

I see two options: use reverse proxies to simulate the disconnection (e.g. something like toxiproxy), or dynamically build a Docker image on top of our image which has ip installed. I would prefer the Toxiproxy solution.

@npepinpe could we extend the execInContainer api to allow specifying a user for exec? then we would be able to install tooling at will again

npepinpe · 2023-07-11T12:14:05Z

Good catch, we could! I don't know how easy it is, as it's not directly exposed by Testcontainers, so we have to essentially reimplement their ExecInContainerPattern =/ But that might still be simpler than using a proxy or something like that.

Though, using Toxiproxy makes the solution much more portable in general 🤷

jessesimpson36 · 2023-07-11T20:18:56Z

From my perspective, biting the bullet about users changing volume permissions isn't as big of an issue compared to the current bullets I'm already munching on. Right now, I've already had to instruct customers on how to set their own user within these images, how to specify their own image / registry in the helm chart, and then to set the fsGroup which will change the file permissions for them.

The helm chart can be updated to set fsGroup by default, which will change the file permissions when volumes are mounted, on upgrade. Users specifying their own fsGroup won't be affected.

If there are workflows you'd like me to test, such as a helm upgrade with this new image and existing data inside the helm chart, just to prove a smooth migration can happen, I'm open to doing that. I'm also happy to attempt modifying the existing tests myself.

jessesimpson36 · 2023-07-12T01:20:59Z

I created a draft PR in the helm chart repo with changes that may be necessary to support the non-root zeebe.

jessesimpson36 · 2023-07-12T01:29:10Z

In some basic testing, I found that the storage class seems to impact the effectiveness of fsGroup at changing the folder permissions. NFS and local-path PVC types don't change file permissions, but more cloud-native storageclasses do change permissions when fsGroup is set.

Hmm...

megglos · 2023-07-12T06:39:10Z

Though, using Toxiproxy makes the solution much more portable in general 🤷

Good point, I would also prefer to move away from installing custom tooling and rather use a dedicated component like toxiproxy to simulate network interruptions. Happy to try this out if it can wait 1/2 days.

megglos · 2023-07-12T06:41:43Z

In some basic testing, I found that the storage class seems to impact the effectiveness of fsGroup at changing the folder permissions. NFS and local-path PVC types don't change file permissions, but more cloud-native storageclasses do change permissions when fsGroup is set.

Hmm...

thanks for investigating! I think in such situations a workaround could be that users override the user the image runs with and switch back to root? We could list that as a breaking change, but assuming the majority uses the helm chart, we should be good then already.

We also need to check the k8s controller for SaaS then to make use of fsGroup, right?

npepinpe · 2023-07-12T06:51:20Z

For existing deployments, customers can still run as root if this is an issue for them right? Though it would be nice to provide them with an update path.

npepinpe · 2023-07-13T08:00:50Z

So unfortunately using Toxiproxy is not as accurate, since we have multiple ports, and our nodes don't advertise different ports for different nodes. This means if we block traffic on one of the proxy ports, we block it for all nodes. Unfortunately, Toxiproxy doesn't support applying toxics only for specific source/destination =/
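A toy model of this limitation, assuming one proxy in front of each node (all names are hypothetical, and no real Toxiproxy API is used): because every peer reaches a node through the same proxy, disabling it to cut A from C also cuts B from C.

```python
class ProxiedCluster:
    """Each node sits behind exactly one proxy that all peers connect through."""

    def __init__(self, nodes):
        self.nodes = set(nodes)
        # Proxy state per destination node; there is no per-source setting.
        self.proxy_enabled = {node: True for node in self.nodes}

    def block_traffic_to(self, node):
        """Disable the proxy in front of `node`."""
        self.proxy_enabled[node] = False

    def can_reach(self, src, dst):
        """A connection succeeds only if the destination's proxy is enabled."""
        return src != dst and self.proxy_enabled[dst]


cluster = ProxiedCluster(["A", "B", "C"])
cluster.block_traffic_to("C")  # intended effect: only cut A -> C
```

In this model `cluster.can_reach("B", "C")` is also false, which is exactly the partition the test must avoid.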

npepinpe · 2023-07-13T09:09:56Z

@megglos are you taking care of fixing the tests or should I?

megglos · 2023-07-13T09:11:39Z

@megglos are you taking care of fixing the tests or should I?

if you are looking into it already anyway, go for it :) happy to review!

npepinpe · 2023-07-14T07:51:01Z

@megglos - actually can you take over, I just realized I'm medic next week, and I still need to look into the randomized Raft bug today + wrap up the job worker integration for job push.

megglos · 2023-07-17T06:39:44Z

@npepinpe I fixed the tests by running the apt commands as root using the docker client api + running the previous zeebe image with the zeebe user to avoid permission issues after update.
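For illustration, the same idea expressed with the Docker CLI rather than the Docker client API used in the tests: `docker exec --user` runs a single command as root inside a container whose default user is unprivileged. The helper name, container name, and package below are hypothetical.

```python
import shlex


def exec_as_user(container, user, command):
    """Build a `docker exec` invocation that runs `command` as `user`."""
    return ["docker", "exec", "--user", user, container, *command]


# Hypothetical container name and package; the real test picks its own.
install = exec_as_user(
    "zeebe-0",
    "root",
    ["sh", "-c", "apt-get update && apt-get install -y iproute2"],
)
print(shlex.join(install))
```

Passing the list to `subprocess.run(install, check=True)` would execute it against a running container.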

I would raise a PR to testcontainers to offer a withUser api in ExecInContainerPattern, if we get this merged we could directly make use of it.

Can you review the current state please?

npepinpe

🚀

Good idea to open a PR upstream to allow specifying the user for the command. Are you doing it? Should I do it? It wasn't 100% clear :)

qa/update-tests/pom.xml

qa/update-tests/src/test/java/io/camunda/zeebe/test/ContainerState.java

megglos · 2023-07-17T08:02:56Z

🚀

Good idea to open a PR upstream to allow specifying the user for the command. Are you doing it? Should I do it? It wasn't 100% clear :)

I have it ready, PR to be opened shortly

megglos · 2023-07-17T10:33:57Z

Opened a PR on testcontainers-java
testcontainers/testcontainers-java#7311

jessesimpson36 · 2023-07-17T14:45:13Z

In some basic testing, I found that the storage class seems to impact the effectiveness of fsGroup at changing the folder permissions. NFS and local-path PVC types don't change file permissions, but more cloud-native storageclasses do change permissions when fsGroup is set.
Hmm...

thanks for investigating! I think in such situations a workaround could be that users override the user the image runs with and switch back to root? We could list that as a breaking change, but assuming the majority uses the helm chart, we should be good then already.

We also need to check the k8s controller for SaaS then to make use of fsGroup, right?

Yep. I can talk with Chaima about this. Also, yes, creating a new image running as root or setting runAsUser: 0 would be a good workaround.
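As a sketch of the two options discussed here, the pod-level `securityContext` could be built as follows. It is expressed as a Python dict that serializes to JSON, which Kubernetes accepts in place of YAML. The helper name is hypothetical, and the gid 1000 is an assumption that is not confirmed anywhere in this thread.

```python
import json
from typing import Optional


def pod_security_context(run_as_root: bool = False, fs_group: Optional[int] = None) -> dict:
    """Return a pod-level securityContext for one of the two workarounds."""
    context = {}
    if run_as_root:
        # Workaround for storage classes where fsGroup has no effect.
        context["runAsUser"] = 0
    if fs_group is not None:
        # Kubernetes changes the volume's group ownership to this gid on mount,
        # where the volume type supports it.
        context["fsGroup"] = fs_group
    return context


# 1000 is an assumed gid for the zeebe user.
print(json.dumps({"securityContext": pod_security_context(fs_group=1000)}, indent=2))
print(json.dumps({"securityContext": pod_security_context(run_as_root=True)}, indent=2))
```

The first variant keeps the container non-root and relies on the storage class honoring fsGroup; the second restores the previous root behavior for volumes where that is not the case.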

megglos · 2023-07-18T07:46:49Z

Ready to merge from my perspective.

I would be willing to test this on SaaS (updating from 8.2.x to a generation with zeebe snapshot) and follow up with the controller team if changes are needed. I guess they will be, though, looking at https://github.com/camunda-cloud/camunda-operator/blob/main/templates/zeebe_statefulset.yaml which has no fsGroup set, right? @jessesimpson36

Docs will be followed up with camunda/camunda-docs#2340 after we are certain to keep it like that.

@npepinpe do you see any risk on breaking things like weekly benchmarks when merging this? 🤔 as they start fresh and don't involve updates it should be fine?

npepinpe · 2023-07-18T08:41:22Z

Should have no impact on benchmarks 👍

megglos · 2023-07-18T13:29:33Z

bors merge

13418: fix(build): run zeebe user in docker image by default r=megglos a=jessesimpson36

Co-authored-by: Jesse Simpson <jesse.simpson@camunda.com>
Co-authored-by: Meggle (Sebastian Bathke) <sebastian.bathke@camunda.com>

zeebe-bors-camunda · 2023-07-18T13:40:41Z

Build failed:

Test summary

megglos · 2023-07-18T13:47:28Z

bors retry

zeebe-bors-camunda · 2023-07-18T13:58:34Z

Build failed:

Test summary

megglos · 2023-07-18T14:16:23Z

hit new flake #13537 which is not related to the changes here

megglos · 2023-07-18T14:16:42Z

bors retry

zeebe-bors-camunda · 2023-07-18T14:29:16Z

Build succeeded:

Test summary

Since camunda/camunda#13418 got merged, the zeebe container runs with an unprivileged `zeebe` user. As the chaos tooling makes use of apt to install some tools needed to e.g. stress the CPU or modify IP routes, we also need to override `runAsUser` to root to keep that possible. Unfortunately, the k8s exec API does not offer a way to override the user, see https://github.com/kubernetes/kubectl/blob/master/pkg/cmd/exec/exec.go#L104. The first commit resolves an issue that led to a test failure, see testcontainers/testcontainers-go#1359 (comment)

jessesimpson36 mentioned this pull request Jul 10, 2023

Docker: Run the zeebe process with an unprivileged user by default #12382

Closed

jessesimpson36 requested a review from megglos July 10, 2023 19:44

jessesimpson36 marked this pull request as ready for review July 10, 2023 19:44

jessesimpson36 changed the title ~~fix: run zeebe user in docker image by default~~ fix(docker): run zeebe user in docker image by default Jul 10, 2023

jessesimpson36 changed the title ~~fix(docker): run zeebe user in docker image by default~~ fix(build): run zeebe user in docker image by default Jul 10, 2023

jessesimpson36 mentioned this pull request Jul 12, 2023

refactor: support non-root user by default in zeebe camunda/camunda-platform-helm#778

Merged

7 tasks

megglos force-pushed the jessesimpson36/use-zeebe-user branch 5 times, most recently from 2a16dfb to 31c8829 Compare July 14, 2023 20:57

megglos requested review from npepinpe and removed request for megglos July 17, 2023 06:39

npepinpe approved these changes Jul 17, 2023

qa/update-tests/pom.xml (outdated)

qa/update-tests/src/test/java/io/camunda/zeebe/test/ContainerState.java

megglos force-pushed the jessesimpson36/use-zeebe-user branch from 31c8829 to 1b8394b Compare July 18, 2023 06:55

megglos mentioned this pull request Jul 18, 2023

Remove explicit usage of zeebe user for previous version in update tests #13528

Open

jessesimpson36 and others added 3 commits July 18, 2023 09:22

fix: run zeebe user in docker image by default

3ceb62b

fix: use root user for installing network utils

3e6c199

fix: use zeebe user for old image on update test

5227011

megglos force-pushed the jessesimpson36/use-zeebe-user branch from 1b8394b to 5227011 Compare July 18, 2023 07:23

megglos mentioned this pull request Jul 18, 2023

Docker: Run the zeebe process with an unprivileged user by default #12931

Closed

14 tasks

zeebe-bors-camunda bot merged commit 1628bb8 into main Jul 18, 2023
33 checks passed

zeebe-bors-camunda bot deleted the jessesimpson36/use-zeebe-user branch July 18, 2023 14:29

jessesimpson36 mentioned this pull request Jul 28, 2023

fix: runs image as non-root user camunda/connectors#952

Merged

megglos mentioned this pull request Aug 1, 2023

fix(setup): runAs root to install tooling zeebe-io/zeebe-chaos#388

Merged
