Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Fix #2786] Worker 0 timing out during phased restart #3225

Merged

Conversation

joshuay03
Copy link
Contributor

@joshuay03 joshuay03 commented Sep 15, 2023

Description

Closes #2786.

Ensures that worker 0 is pinged on every forked worker's post-boot ping to prevent it from timing out.

An integration test for this would be something like #2786 (comment) but might not be practical considering the time it takes. I'll add some unit tests when I get a chance unless anyone has a better idea.

I've added a regression test with a lowered worker timeout to simulate a high worker count + high worker timeout real world scenario.

Your checklist for this pull request

  • I have reviewed the guidelines for contributing to this repository.
  • I have added (or updated) appropriate tests if this PR fixes a bug or adds a feature.
  • My pull request is 100 lines added/removed or less so that it can be easily reviewed.
  • If this PR doesn't need tests (docs change), I added [ci skip] to the title of the PR.
  • If this closes any issues, I have added "Closes #issue" to the PR description or my commit messages.
  • I have updated the documentation accordingly.
  • All new and existing tests passed, including Rubocop.

@joshuay03 joshuay03 force-pushed the phased-restart-worker-0-timing-out branch from 21cf991 to cb5df5e Compare September 15, 2023 00:25
@joshuay03
Copy link
Contributor Author

Thanks! @MSP-Greg

@MSP-Greg
Copy link
Member

@joshuay03

Thanks for the PR. It's certainly an issue with lots of workers and threads. Locally, I don't think I ever tried a large number of workers.

I'm wondering about a test, but after making several typos while watching a US football game...

@nateberkopec nateberkopec added bug waiting-for-changes Waiting on changes from the requestor labels Sep 15, 2023
@joshuay03 joshuay03 force-pushed the phased-restart-worker-0-timing-out branch from 34e0018 to 3a3fd49 Compare September 16, 2023 01:53
@MSP-Greg
Copy link
Member

@joshuay03

I've also worked on a test, I've got one working with the PR and failing with master. But, it involves some changes to one of the helper files, and it's tanked all the JRuby CI. Of course, it works with JRuby locally...

I decided to look at it tomorrow with a 'fresher' set of eyes.

@joshuay03 joshuay03 force-pushed the phased-restart-worker-0-timing-out branch 2 times, most recently from fccca5c to f682f0a Compare September 17, 2023 12:17
@MSP-Greg
Copy link
Member

@joshuay03

Thanks again for this. Using @server.gets may involve a bit of blocking, which can be problematic with parallel testing and CI servers running with multiple VM's.

Can you rebase and use the following? I've tried to rebase PR's before, and things seem to go south when I've then done a force push. Note that I dropped the worker count to 10.

JFYI, this only possible with a rebase, as I just added @server_log, which is the server log that's accumulated as any method that reads the server log output is called. get_worker_pids waits for all the workers to be booted, 1st parameter is the phase...

Thanks.

  def test_fork_worker_phased_restart_with_high_worker_count
    worker_count = 10

    cli_server "test/rackup/hello.ru", config: <<~RUBY
      fork_worker 0
      worker_check_interval 1
      # lower worker timeout from default (60) to avoid test timeout
      worker_timeout 2
      # to simulate worker 0 timeout, total boot time for all workers
      # needs to exceed single worker timeout
      workers #{worker_count}
    RUBY

    # workers is the default
    get_worker_pids 0, worker_count

    Process.kill :USR1, @pid

    get_worker_pids 1, worker_count

    # below is so all of @server_log isn't output for failure
    refute @server_log[/.*Terminating timed out worker.*/]
  end

@joshuay03 joshuay03 force-pushed the phased-restart-worker-0-timing-out branch from 4cdb945 to f88961b Compare September 26, 2023 23:03
@joshuay03
Copy link
Contributor Author

@MSP-Greg

Done 👍🏽 This setup is much better, thanks for that!

@nateberkopec nateberkopec merged commit 252890c into puma:master Sep 27, 2023
59 checks passed
@joshuay03 joshuay03 deleted the phased-restart-worker-0-timing-out branch September 27, 2023 04:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug waiting-for-changes Waiting on changes from the requestor
Projects
None yet
Development

Successfully merging this pull request may close these issues.

fork_worker worker termination during phased-restart
3 participants