Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(build): run zeebe user in docker image by default #13418

Merged
merged 3 commits into from
Jul 18, 2023

Conversation

jessesimpson36
Copy link
Contributor

@jessesimpson36 jessesimpson36 commented Jul 10, 2023

Description

The purpose of this PR is to use the non-privileged user by default. I am from the team that maintains the helm chart and docker-compose deployment methods of the camunda platform. My interest in making this PR is to reduce the amount of calls and support tickets that my team goes on. Customers often use security tools like AquaSecurity which blocks "insecure" images from being ran (even if we define a non-privileged user in run-time via runAsUser.

Related issues

closes #12382

Related PR

This PR is an alternative to a PR that currently exists which runs a non-privileged user under gosu (which would be a run-time non-root user).
#12931

Definition of Done

Not all items need to be done depending on the issue and the pull request.

Code changes:

  • The changes are backwards compatibility with previous versions

With regards to the helm charts, the user/customer needs to understand how to set the fsGroup option to ensure their volumes have the proper ownership.

  • If it fixes a bug then PRs are created to backport the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. backport stable/1.3) to the PR, in case that fails you need to create backports manually.

This should not be necessary for this PR.

Testing:

  • There are unit/integration tests that verify all acceptance criterias of the issue
  • New tests are written to ensure backwards compatibility with further versions
  • The behavior is tested manually

I tested this change via loading the locally built image into my kind cluster and deploying the helm chart. I found that I needed to make a change to our helm chart by removing the already mounted /usr/local/bin/startup.sh file. I'm still not quite sure why the helm chart needed to overwrite that file.

After making that helm chart change, the pod started up successfully.

  • The change has been verified by a QA run
  • The impact of the changes is verified by a benchmark

Documentation:

  • The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
  • If the PR changes how BPMN processes are validated (e.g. support new BPMN element) then the Camunda modeling team should be informed to adjust the BPMN linting.

Other teams:
If the change impacts another team an issue has been created for this team, explaining what they need to do to support this change.

Helm chart needs to be changed with the startup.sh file currently overwriting the startup.sh file inside the image.

Please refer to our review guidelines.

@jessesimpson36
Copy link
Contributor Author

I'd like to know how I can test things further to give more confidence that we can move forward with this change. I'm also not very familiar with the QA Integration test CI that is failing. Perhaps it's related to the helm chart change that must be made in parallel to this one.

@jessesimpson36 jessesimpson36 marked this pull request as ready for review July 10, 2023 19:44
@jessesimpson36 jessesimpson36 changed the title fix: run zeebe user in docker image by default fix(docker): run zeebe user in docker image by default Jul 10, 2023
@jessesimpson36 jessesimpson36 changed the title fix(docker): run zeebe user in docker image by default fix(build): run zeebe user in docker image by default Jul 10, 2023
@megglos
Copy link
Contributor

megglos commented Jul 11, 2023

@jessesimpson36 thanks for creating this PR, I would pick this up after we came to a conclusion on #13302 as changes required to e.g. the failing AsymmetricNetworkPartitionIT test will depend on the base image used going forward.

@megglos
Copy link
Contributor

megglos commented Jul 11, 2023

@npepinpe
Copy link
Member

npepinpe commented Jul 11, 2023

I think this can be merged before anyway. The changes to the update tests that need to be done will still need to be done anyway, and while the asymmetric partition test might be fixed by switching to Alpine, honestly I'm not so confident about that move anymore, so we might as well deal with it now.

The main issue is that we need to install some utilities in the image for testing, and if the image is not running as root this is not possible anymore.

@jessesimpson36 - the failing test uses ip to add unreachable routes between nodes to simulate a partial network partition. I'd rather not add ip as part of the base image if it's only required for testing, but it looks like no common tools (e.g. route, ip, etc.) for this are in the base image.

We can't just disconnect the containers from the Docker network because we want to simulate the following: given 3 nodes, A, B, and C, we want to simulate that A cannot talk to C, and B can talk to both nodes.

I see two options: use reverse proxies to simulate the disconnection (e.g. something like toxiproxy), or dynamically build a Docker image on top of our image which has ip installed. I would prefer the Toxiproxy solution.

I don't expect @jessesimpson36 to do that however, so someone from our team should. I would normally say @megglos as he seems to be DRI for this topic in our team (maybe?), but he's currently out sick. If he's not back tomorrow, I can take it over.

That said, IIRC, the main blocker for us was that this is a breaking change. As we see with the update tests, deploying the original image then updating to the new one, with zeebe as the user, will cause deployments to fail due to invalid volume permissions (i.e. the data was previously written under root). Did we decide to just bite the bullet then?

@megglos
Copy link
Contributor

megglos commented Jul 11, 2023

I see two options: use reverse proxies to simulate the disconnection (e.g. something like toxiproxy), or dynamically build a Docker image on top of our image which has ip installed. I would prefer the Toxiproxy solution.

@npepinpe could we extend the execInContainer api to allow specifying a user for exec? then we would be able to install tooling at will again

@npepinpe
Copy link
Member

npepinpe commented Jul 11, 2023

Good catch, we could! I don't know how easy it is, as it's not directly exposed by Testcontainers, so we have to essentially reimplement their ExecInContainerPattern =/ But that might still be simpler than using a proxy or something like that.

Though, using Toxiproxy makes the solution much more portable in general 🤷

@jessesimpson36
Copy link
Contributor Author

From my perspective, biting the bullet about users changing volume permissions isn't as big of an issue compared to the current bullets I'm already munching on. Right now, I've already had to instruct customers on how to set their own user within these images, how to specify their own image / registry in the helm chart, and then to set the fsGroup which will change the file permissions for them.

The helm chart can be updated to set fsGroup by default, which will change the file permissions when volumes are mounted, on upgrade. Users specifying their own fsGroup won't be affected.

If theres workflows you'd like me to test, such as a helm upgrade with this new image and existing data inside the helm chart, just to prove a smooth migration can happen, I'm open to doing that. I'm also good with attempting myself to modify the existing tests.

@jessesimpson36
Copy link
Contributor Author

I created a draft PR in the helm chart repo with changes that may be necessary to support the non-root zeebe.

@jessesimpson36
Copy link
Contributor Author

In some basic testing, I found that the storage class seems to impact the effectiveness of fsGroup at changing the folder permissions. NFS and local-path PVC types don't change file permissions, but more cloud-native storageclasses do change permissions when fsGroup is set.

Hmm...

@megglos
Copy link
Contributor

megglos commented Jul 12, 2023

Though, using Toxiproxy makes the solution much more portable in general 🤷

Good point, I would also prefer to move away from installing custom tooling and rather use a dedicated component like toxiproxy to simulate network interruptions. Happy to try this out if it can wait 1/2 days.

@megglos
Copy link
Contributor

megglos commented Jul 12, 2023

In some basic testing, I found that the storage class seems to impact the effectiveness of fsGroup at changing the folder permissions. NFS and local-path PVC types don't change file permissions, but more cloud-native storageclasses do change permissions when fsGroup is set.

Hmm...

thanks for investigating! I think in such situations a workaround could be that users override the user the image runs with with root again? We could list that as a breaking change, but assuming the majority uses the helm chart, we should be good then already.

We also need to check the k8s controller for SaaS then to make use of fsGroup, right?

@npepinpe
Copy link
Member

For existing deployments, customers can still run as root if this is an issue for them right? Though it would be nice to provide them with an update path.

@npepinpe
Copy link
Member

So unfortunately using Toxiproxy is not as accurate, since we have multiple ports, and our nodes don't advertise different ports for different nodes. This means if we block traffic on one of the proxy ports, we block it for all nodes. Unfortunately, Toxiproxy doesn't support applying toxics only for specific source/destination =/

@npepinpe
Copy link
Member

@megglos are you taking care of fixing the tests or should I?

@megglos
Copy link
Contributor

megglos commented Jul 13, 2023

@megglos are you taking care of fixing the tests or should I?

if you are looking into it already anyway, go for it :) happy to review!

@npepinpe
Copy link
Member

@megglos - actually can you take over, I just realized I'm medic next week, and I still need to look into the randomized Raft bug today + wrap up the job worker integration for job push.

@megglos megglos force-pushed the jessesimpson36/use-zeebe-user branch 5 times, most recently from 2a16dfb to 31c8829 Compare July 14, 2023 20:57
@megglos
Copy link
Contributor

megglos commented Jul 17, 2023

@npepinpe I fixed the tests by running the apt commands as root using the docker client api + running the previous zeebe image with the zeebe user to avoid permission issues after update.

I would raise a PR to testcontainers to offer a withUser api in ExecInContainerPattern, if we get this merged we could directly make use of it.

Can you review the current state please?

@megglos megglos requested review from npepinpe and removed request for megglos July 17, 2023 06:39
Copy link
Member

@npepinpe npepinpe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

Good idea to open a PR upstream to allow specifying the user for the command. Are you doing it? Should I do it? It wasn't 100% clear :)

qa/update-tests/pom.xml Outdated Show resolved Hide resolved
@megglos
Copy link
Contributor

megglos commented Jul 17, 2023

🚀

Good idea to open a PR upstream to allow specifying the user for the command. Are you doing it? Should I do it? It wasn't 100% clear :)

I have it ready, PR to be opened shortly

@megglos
Copy link
Contributor

megglos commented Jul 17, 2023

Opened a PR on testcontainers-java
testcontainers/testcontainers-java#7311

@jessesimpson36
Copy link
Contributor Author

jessesimpson36 commented Jul 17, 2023

In some basic testing, I found that the storage class seems to impact the effectiveness of fsGroup at changing the folder permissions. NFS and local-path PVC types don't change file permissions, but more cloud-native storageclasses do change permissions when fsGroup is set.
Hmm...

thanks for investigating! I think in such situations a workaround could be that users override the user the image runs with with root again? We could list that as a breaking change, but assuming the majority uses the helm chart, we should be good then already.

We also need to check the k8s controller for SaaS then to make use of fsGroup, right?

Yep. I can talk with Chaima about this. Also, yes, creating a new image running as root or setting runAsUser: 0 would be a good workaround.

@megglos megglos force-pushed the jessesimpson36/use-zeebe-user branch from 1b8394b to 5227011 Compare July 18, 2023 07:23
@megglos
Copy link
Contributor

megglos commented Jul 18, 2023

Ready to merge from my perspective.

I would be willing to test this on SaaS (updating from 8.2.x to a generation with zeebe snapshot) and follow-up with the controller team if changes are needed. I guess it will be though looking at https://github.com/camunda-cloud/camunda-operator/blob/main/templates/zeebe_statefulset.yaml having no fsGroup set, right? @jessesimpson36

Docs will be followed up with camunda/camunda-docs#2340 after we are certain to keep it like that.

@npepinpe do you see any risk on breaking things like weekly benchmarks when merging this? 🤔 as they start fresh and don't involve updates it should be fine?

@npepinpe
Copy link
Member

Should have no impact on benchmarks 👍

@megglos
Copy link
Contributor

megglos commented Jul 18, 2023

bors merge

zeebe-bors-camunda bot added a commit that referenced this pull request Jul 18, 2023
13418: fix(build): run zeebe user in docker image by default r=megglos a=jessesimpson36

## Description

The purpose of this PR is to use the non-privileged user by default.  I am from the team that maintains the helm chart and docker-compose deployment methods of the camunda platform.  My interest in making this PR is to reduce the amount of calls and support tickets that my team goes on.  Customers often use security tools like AquaSecurity which blocks "insecure" images from being ran (even if we define a non-privileged user in run-time via  runAsUser.

## Related issues

closes #12382 

## Related PR

This PR is an alternative to a PR that currently exists which runs a non-privileged user under gosu (which would be a run-time non-root user).
#12931




Co-authored-by: Jesse Simpson <jesse.simpson@camunda.com>
Co-authored-by: Meggle (Sebastian Bathke) <sebastian.bathke@camunda.com>
@zeebe-bors-camunda
Copy link
Contributor

Build failed:

@megglos
Copy link
Contributor

megglos commented Jul 18, 2023

bors retry

zeebe-bors-camunda bot added a commit that referenced this pull request Jul 18, 2023
13418: fix(build): run zeebe user in docker image by default r=megglos a=jessesimpson36

## Description

The purpose of this PR is to use the non-privileged user by default.  I am from the team that maintains the helm chart and docker-compose deployment methods of the camunda platform.  My interest in making this PR is to reduce the amount of calls and support tickets that my team goes on.  Customers often use security tools like AquaSecurity which blocks "insecure" images from being ran (even if we define a non-privileged user in run-time via  runAsUser.

## Related issues

closes #12382 

## Related PR

This PR is an alternative to a PR that currently exists which runs a non-privileged user under gosu (which would be a run-time non-root user).
#12931




Co-authored-by: Jesse Simpson <jesse.simpson@camunda.com>
Co-authored-by: Meggle (Sebastian Bathke) <sebastian.bathke@camunda.com>
@zeebe-bors-camunda
Copy link
Contributor

Build failed:

@megglos
Copy link
Contributor

megglos commented Jul 18, 2023

hit new flake #13537 which is not related to the changes here

@megglos
Copy link
Contributor

megglos commented Jul 18, 2023

bors retry

@zeebe-bors-camunda
Copy link
Contributor

Build succeeded:

@zeebe-bors-camunda zeebe-bors-camunda bot merged commit 1628bb8 into main Jul 18, 2023
33 checks passed
@zeebe-bors-camunda zeebe-bors-camunda bot deleted the jessesimpson36/use-zeebe-user branch July 18, 2023 14:29
lenaschoenburg added a commit to zeebe-io/zeebe-chaos that referenced this pull request Aug 1, 2023
Since camunda/camunda#13418 got merged the zeebe
container runs with an unprivileged `zeebe` user.

As the chaos tooling makes use of apt to install some tools needed to
e.g. stress the cpu or modify ip routes we need to also overwrite the
`runAsUser` to root to make that possible still.

Unfortnately the k8s exec API is not offering a way to override the user
see
https://github.com/kubernetes/kubectl/blob/master/pkg/cmd/exec/exec.go#L104

The first commit resolves an issue that lead to test failure, see
testcontainers/testcontainers-go#1359 (comment)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Docker: Run the zeebe process with an unprivileged user by default
3 participants