Make puma cluster process suitable as PID 1 #3255

casperisfine · 2023-10-09T09:12:46Z

We recently ran into an issue in production where our puma containers had thousands of zombie (defunct) processes.

This was caused by the web application spawning some multi-process command (Google Chrome) and sometimes interrupting it hard.

- pid=1 puma cluster
 \- pid=2 puma worker
  \- pid=3 google chrome
   \- pid=4 google chrome subprocess

In such a scenario, if pid=3 dies without first reaping pid=4, then pid=4 gets reparented as a child of pid=1, and pid=1 becomes responsible for reaping it.
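To make the "zombie" state concrete, here is a minimal, Linux-only sketch that reads a process's state letter from `/proc`. The helper name `process_state` is hypothetical and is not part of Puma; it only illustrates that a child which has exited but has not been waited on shows up as `Z` (defunct).

```ruby
# Linux-only sketch: return the state letter of a process as reported by
# /proc/<pid>/stat, e.g. "S" (sleeping), "R" (running) or "Z" (zombie).
# Returns nil once the process entry no longer exists.
# `process_state` is a hypothetical helper name used for illustration.
def process_state(pid)
  stat = File.read("/proc/#{pid}/stat")
  # The format is "pid (comm) state ..."; comm may itself contain ")",
  # so locate the last closing parenthesis and skip the following space.
  stat[stat.rindex(")") + 2]
rescue Errno::ENOENT, Errno::ESRCH
  nil
end
```

Under these assumptions, a forked child that exits immediately should report `Z` until its parent calls `Process.wait(pid)`, after which the entry disappears.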

On classic style hosting, PID 1 is the init system (e.g. systemd), but with containers PID 1 is the container initial command, in our case the Puma cluster process.

Some people use a minimal init implementation such as tini to handle this, but I believe this falls into the category of "unknown unknowns": you need to already know about this concern to avoid the problem.

This commit simply changes the wait_workers method to reap all children, not just the workers it knows about.

If it ends up reaping an unknown process, it logs it.

This is pretty much exactly what Unicorn and Pitchfork do, and it removes the need for an init system in containers.
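The idea of reaping every exited child, known or not, can be sketched roughly as follows. This is not Puma's actual `wait_workers` code: the method name `reap_all_children`, its parameters, and the log wording are assumptions made for illustration. It relies only on `Process.wait2(-1, Process::WNOHANG)`, which returns `nil` when children exist but none has exited, and raises `Errno::ECHILD` when there are no children at all.

```ruby
# Hypothetical sketch, not Puma's implementation.
# known_pids: Array of worker PIDs the parent tracks (mutated in place).
# log: anything responding to `<<` (an IO such as $stderr, or an Array).
# Returns a list of [pid, kind, status] for every child reaped in this pass.
def reap_all_children(known_pids, log: $stderr)
  reaped = []
  loop do
    # -1 means "any child"; WNOHANG makes the call non-blocking.
    pid, status = Process.wait2(-1, Process::WNOHANG)
    break unless pid # children exist, but none has exited yet

    kind = known_pids.include?(pid) ? :worker : :unknown
    known_pids.delete(pid)
    # Illustrative log line for processes that were reparented to us.
    log << "reaped unknown child process pid=#{pid} status=#{status}\n" if kind == :unknown
    reaped << [pid, kind, status]
  end
  reaped
rescue Errno::ECHILD
  reaped # no children left at all
end
```

A parent would call this periodically from its main loop; any entry tagged `:worker` is a candidate for a restart, while `:unknown` entries are orphans that were reparented to the process.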

PS: I really don't know how to cover this with tests; it would require the test to be run as PID 1.

casperisfine · 2023-10-09T09:32:55Z

The CI failures seem legit, I'll start digging into them.

nateberkopec · 2023-10-11T03:55:27Z

> On classic style hosting, PID 1 is the init system (e.g. systemd), but with containers PID 1 is the container initial command, in our case the Puma cluster process.

TIL re: the container part

casperisfine · 2023-10-17T14:31:25Z

Apologies for the delay, I was a bit swamped last week.

I believe I fixed CI. Two jobs are failing on the last push, but both are ruby-head, so I suspect it's not related to my PR.

It should be ready for review now.

nateberkopec

Good for me but I want @MSP-Greg to approve as well

MSP-Greg · 2023-10-18T14:29:59Z

@byroot @casperisfine

Thanks for the PR. I'm working on a revised test suite, and this passed when cherry picked on top of it.

> both are ruby-head I suspect it's not related to my PR.

Ruby head has been unstable recently, especially on resource-limited CI VMs. Or rather, it fails in the cloud, but never on my local desktop...

casperisfine · 2023-10-18T15:28:19Z

Ah thanks, I'll rebase then.

casperisfine · 2023-10-18T15:29:49Z

Nvm, I just realized you haven't merged these fixes yet.

nateberkopec · 2023-10-19T01:50:54Z

Thanks! Should get 6.4.1 out tomorrow or so.

Co-authored-by: Jean Boussier <jean.boussier@gmail.com>

* Fix child processes not being reaped when `Process.detach` used

  Starting with Puma v6.4.1, we observed that killed Puma cluster workers were never being restarted when the parent was run as PID 1. For example, I issued a `kill 44` and PID 44 remained in the `defunct` state:

  ```
  git@gitlab-webservice-default-78664bb757-2nxvh:/var/log/gitlab$ ps -ef
  UID          PID    PPID  C STIME TTY          TIME CMD
  git            1       0  0 Jan09 ?        00:01:39 puma 6.4.1 (tcp://0.0.0.0:8080) [gitlab-puma-worker]
  git           23       1  0 Jan09 ?        00:05:46 /usr/local/bin/gitlab-logger /var/log/gitlab
  git           41       1  0 Jan09 ?        00:01:55 ruby /srv/gitlab/bin/metrics-server
  git           44       1  0 Jan09 ?        00:02:41 [ruby] <defunct>
  git           46       1  0 Jan09 ?        00:02:38 puma: cluster worker 1: 1 [gitlab-puma-worker]
  git           48       1  0 Jan09 ?        00:02:42 puma: cluster worker 2: 1 [gitlab-puma-worker]
  git           49       1  0 Jan09 ?        00:02:41 puma: cluster worker 3: 1 [gitlab-puma-worker]
  git         5205       0  0 21:57 pts/0    00:00:00 bash
  git         5331    5205  0 22:00 pts/0    00:00:00 ps -ef
  ```

  Further investigation showed that the `Process.wait2(-1, Process::WNOHANG)` call introduced in #3255 never appears to return anything when `Process.detach` is run on some process that has not exited. This bug appears to be present from Ruby 2.6 to 3.2, but has been fixed in Ruby 3.3: https://bugs.ruby-lang.org/issues/19837

  Previously `Process.wait(w.pid, Process::WNOHANG)` was called on each known worker PID. #3255 changed this behavior to do this only if the `fork_worker` config parameter is enabled, but it seems that we should always do this to ensure that terminated workers are reaped in a timely manner.

  Closes #3313

* Add integration test for Puma worker reaping

  This test ensures that Puma handles the `Process.detach` bug described in https://bugs.ruby-lang.org/issues/19837.
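As a rough illustration of the approach described in this commit message (polling each known worker PID individually instead of relying solely on `Process.wait2(-1, ...)`), a sketch could look like the following. The method name `reap_known_workers` and its return value are assumptions for illustration and do not reflect Puma's actual code.

```ruby
# Hypothetical sketch, not Puma's implementation.
# worker_pids: Array of PIDs of workers forked by this process (mutated in place).
# Returns the PIDs that are no longer running children of this process.
def reap_known_workers(worker_pids)
  reaped = []
  worker_pids.each do |pid|
    begin
      # Non-blocking wait on one specific child: returns the pid if it has
      # exited (and reaps it), or nil if it is still running.
      reaped << pid if Process.wait(pid, Process::WNOHANG)
    rescue Errno::ECHILD
      # Not our child any more, e.g. it was already reaped elsewhere.
      reaped << pid
    end
  end
  worker_pids.reject! { |pid| reaped.include?(pid) }
  reaped
end
```

Because each call targets a specific PID, this check does not depend on the wait-for-any-child path that the commit message reports as unreliable on Ruby 2.6 to 3.2 when `Process.detach` is in use.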

casperisfine force-pushed the reap-all-children branch from e4ebbab to ad09f92 Compare October 17, 2023 13:34

casperisfine force-pushed the reap-all-children branch from ad09f92 to 54f01eb Compare October 17, 2023 13:38

nateberkopec approved these changes Oct 18, 2023

MSP-Greg approved these changes Oct 18, 2023

MSP-Greg mentioned this pull request Oct 18, 2023: 6.4.1 #3261 (Closed)

nateberkopec merged commit a4826bb into puma:master Oct 19, 2023

stanhu mentioned this pull request Jan 10, 2024: Puma cluster not reaping child processes with Puma 6.4.1 #3313 (Closed)

stanhu mentioned this pull request Jan 10, 2024: Fix child processes not being reaped when Process.detach used #3314 (Merged, 7 tasks)

dentarg added the feature label Mar 21, 2024

dentarg added the bug label Mar 21, 2024

dentarg mentioned this pull request Jul 2, 2024: Seeing multiple "reaped unknown child process" messages #3419 (Closed)
