[Fix #2786] Worker 0 timing out during phased restart #3225

joshuay03 · 2023-09-15T00:23:46Z

Description

Closes #2786.

Ensures that worker 0 is pinged on every forked worker's post-boot ping to prevent it from timing out.

An integration test for this would be something like #2786 (comment) but might not be practical considering the time it takes. I'll add some unit tests when I get a chance unless anyone has a better idea.

I've added a regression test with a lowered worker timeout to simulate a real-world scenario with a high worker count and a high worker timeout.
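
To illustrate the timing argument above, here is a minimal sketch, not Puma's implementation. It assumes worker 0 does not check in by itself while the other workers are being forked one after another, and that each forked worker takes a fixed time to boot. The method name `worker_0_times_out?` and all of its parameters are hypothetical.

```ruby
# Simplified model of a phased restart with fork_worker (illustrative only).
# worker_count:  total number of workers, including worker 0
# boot_time:     assumed seconds each forked worker needs to boot
# timeout:       seconds without a check-in before worker 0 is considered timed out
# ping_worker_0: when true, every forked worker's post-boot ping also refreshes worker 0
def worker_0_times_out?(worker_count:, boot_time:, timeout:, ping_worker_0:)
  now = 0.0
  last_checkin = now # worker 0 last checked in when the restart began

  (1...worker_count).each do |_index|
    now += boot_time                    # one more forked worker finishes booting
    last_checkin = now if ping_worker_0 # its post-boot ping also counts for worker 0
    return true if now - last_checkin > timeout
  end

  false
end

# Total boot time (9 * 0.5s) exceeds the timeout (2s), so worker 0 is flagged
# unless it is pinged along the way.
without_fix = worker_0_times_out?(worker_count: 10, boot_time: 0.5, timeout: 2, ping_worker_0: false)
with_fix    = worker_0_times_out?(worker_count: 10, boot_time: 0.5, timeout: 2, ping_worker_0: true)
```

In this model the timeout only appears once the summed boot time of the forked workers exceeds the single worker timeout, which is why a high worker count or a low timeout is needed to reproduce it.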

Your checklist for this pull request

I have reviewed the guidelines for contributing to this repository.
I have added (or updated) appropriate tests if this PR fixes a bug or adds a feature.
My pull request is 100 lines added/removed or less so that it can be easily reviewed.
If this PR doesn't need tests (docs change), I added [ci skip] to the title of the PR.
If this closes any issues, I have added "Closes #issue" to the PR description or my commit messages.
I have updated the documentation accordingly.
All new and existing tests passed, including Rubocop.

lib/puma/cluster.rb

joshuay03 · 2023-09-15T03:18:53Z

Thanks! @MSP-Greg

MSP-Greg · 2023-09-15T03:36:56Z

@joshuay03

Thanks for the PR. It's certainly an issue with lots of workers and threads. Locally, I don't think I ever tried a large number of workers.

I'm wondering about a test, but after making several typos while watching a US football game...

test/test_integration_cluster.rb

MSP-Greg · 2023-09-16T03:31:27Z

@joshuay03

I've also worked on a test, and I've got one that passes with the PR and fails with master. But it involves some changes to one of the helper files, and it's tanked all the JRuby CI. Of course, it works with JRuby locally...

I decided to look at it tomorrow with a 'fresher' set of eyes.

MSP-Greg · 2023-09-26T19:47:19Z

@joshuay03

Thanks again for this. Using @server.gets may involve a bit of blocking, which can be problematic with parallel testing and CI servers running multiple VMs.

Can you rebase and use the following? I've tried to rebase PRs before, and things seem to go south when I've then done a force push. Note that I dropped the worker count to 10.

JFYI, this is only possible with a rebase, as I just added @server_log, which accumulates the server log whenever a method that reads the server log output is called. get_worker_pids waits for all the workers to be booted; its 1st parameter is the phase...

Thanks.

  def test_fork_worker_phased_restart_with_high_worker_count
    worker_count = 10

    cli_server "test/rackup/hello.ru", config: <<~RUBY
      fork_worker 0
      worker_check_interval 1
      # lower worker timeout from default (60) to avoid test timeout
      worker_timeout 2
      # to simulate worker 0 timeout, total boot time for all workers
      # needs to exceed single worker timeout
      workers #{worker_count}
    RUBY

    # workers is the default
    get_worker_pids 0, worker_count

    Process.kill :USR1, @pid

    get_worker_pids 1, worker_count

    # below is so all of @server_log isn't output for failure
    refute @server_log[/.*Terminating timed out worker.*/]
  end

Co-authored-by: MSP-Greg <Greg.mpls@gmail.com>
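
The idea of accumulating the server log while waiting for specific output can be sketched roughly as follows. This is an assumption-laden illustration, not the helper code in Puma's test suite: `LogReader`, `wait_for`, and the sample log lines are all hypothetical, and a `StringIO` stands in for the server's output stream.

```ruby
require "stringio"

# Hypothetical helper: keeps every line it reads so later assertions
# can inspect the whole log instead of reading the stream again.
class LogReader
  attr_reader :log

  def initialize(io)
    @io = io
    @log = +""
  end

  # Reads lines until one matches pattern, appending each line to @log.
  # Returns the matching line, or nil if the stream ends first.
  def wait_for(pattern)
    while (line = @io.gets)
      @log << line
      return line if line.match?(pattern)
    end
    nil
  end
end

# Made-up sample output, standing in for a real server log.
server = StringIO.new(<<~LOG)
  worker 0 booted, phase: 1
  worker 1 booted, phase: 1
LOG

reader = LogReader.new(server)
reader.wait_for(/worker 1 booted/)

# Same assertion style as the test above, applied to the accumulated log.
timed_out = reader.log[/.*Terminating timed out worker.*/]
```

Because the log is collected as a side effect of waiting, the final check is a plain string match and needs no further reads from the server.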

joshuay03 · 2023-09-26T23:34:51Z

@MSP-Greg

Done 👍🏽 This setup is much better, thanks for that!

joshuay03 force-pushed the phased-restart-worker-0-timing-out branch from 21cf991 to cb5df5e on September 15, 2023 00:25

nateberkopec added the bug and waiting-for-changes labels Sep 15, 2023

joshuay03 force-pushed the phased-restart-worker-0-timing-out branch from 34e0018 to 3a3fd49 on September 16, 2023 01:53

joshuay03 force-pushed the phased-restart-worker-0-timing-out branch 2 times, most recently from fccca5c to f682f0a on September 17, 2023 12:17

joshuay03 mentioned this pull request Sep 17, 2023

[ci skip] | Docs | Clarify worker_timeout minimum value #3226

Merged

7 tasks

joshuay03 and others added 5 commits September 27, 2023 09:02

[Fix puma#2786] Worker 0 timing out during phased restart

8409db3

test/test_integration_cluster.rb - fixup for logging changes

ad1c2e5

Add test which simulates worker 0 timeout

ef467fd

Bump workers, reduce timeout for more realistic but also faster test

783dcde

Use server log consumption with less blocking in test

f88961b

Co-authored-by: MSP-Greg <Greg.mpls@gmail.com>

joshuay03 force-pushed the phased-restart-worker-0-timing-out branch from 4cdb945 to f88961b on September 26, 2023 23:03

MSP-Greg approved these changes Sep 26, 2023

nateberkopec merged commit 252890c into puma:master Sep 27, 2023
59 checks passed

joshuay03 deleted the phased-restart-worker-0-timing-out branch September 27, 2023 04:33

joshuay03 mentioned this pull request Oct 20, 2023

fork_worker keeps serving stale code indefinitely and at random after phased restart (specially after second phased restart) #2470

Closed
