
Allow sending SIGTERM to workers on timeout #157

Merged
merged 4 commits into zombocom:master on Dec 11, 2019

Conversation

schneems
Member

When rack-timeout fires, it can put an application in a "bad" state. This is due to https://www.schneems.com/2017/02/21/the-oldest-bug-in-ruby-why-racktimeout-might-hose-your-server/.

We can give applications the option of restarting the entire process when a timeout is hit by sending it a SIGTERM. Why would we want that?

While the `Thread.raise` API is unsafe, restarting a process is safe. If the application is running a multi-worker web server such as Unicorn or Puma, then when a worker process gets a SIGTERM it will be shut down and restarted by the "master" process.

In Puma, when `SIGTERM` is sent to a worker, that worker stops accepting new requests but continues to process all requests currently running in threads. This means that sending a `SIGTERM` to the current process will not affect requests that are currently running (this is a good thing).

What is the downside of sending a TERM to the current worker when there is a timeout? When you TERM a worker, you lose some capacity to process requests until Puma brings up another process, which may take a second or two. If all processes on a dyno are TERM'd, then your app would stop serving requests until Puma boots new processes, which, again, would take time. If the application is hitting timeouts because it is under-provisioned, then sending TERM to the process reduces processing power and adds a delay in processing new requests. In the worst case, this could mean that the application is stuck in a process restart loop.

Anecdotally, I've seen that applications can usually tolerate a number of timeout events; apps seeing fewer than a dozen a day tend not to experience problems. I don't have an empirical number, but if I had to guess, I would say it would make sense for it to be related to MAX_THREADS. Why?

Imagine that every rack-timeout request that fails corrupts something. It's likely to be a connection of some kind, and if you've got MAX_THREADS=5 then you've got 5 database connections.
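For context, the wiring that usually makes this true in a Rails-on-Puma app (a sketch of common configuration, not part of this PR) ties the Puma thread count and the ActiveRecord pool size to the same MAX_THREADS value, so each worker tends to hold one connection per thread:

    # config/puma.rb (common Heroku-style setup, shown only as a sketch)
    max_threads = Integer(ENV.fetch("MAX_THREADS", 5))
    threads max_threads, max_threads                  # each worker runs MAX_THREADS request threads
    workers Integer(ENV.fetch("WEB_CONCURRENCY", 2))

    # config/database.yml typically sets `pool: <%= ENV.fetch("MAX_THREADS", 5) %>`,
    # so every request thread can check out its own database connection.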

Since this is a theory I'm making the setting configurable so we can experiment with it in the wild.
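As a rough sketch of how an app might opt in (the option name `term_on_timeout` and the value shown are assumptions here; see the gem's README for the exact setting and accepted values):

    # config.ru (sketch): opt in to TERMing the worker when a request times out.
    # `term_on_timeout` and `MyApp` are illustrative names, not confirmed API.
    require "rack-timeout"

    use Rack::Timeout, service_timeout: 15, term_on_timeout: 1
    run MyApp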

cc/ @ericc572

@schneems schneems changed the title Allow sending SIGTERM to workers on timeout [WIP] Allow sending SIGTERM to workers on timeout Nov 14, 2019
@schneems
Member Author

Internal support ticket number 784120

@schneems schneems merged commit c328e41 into zombocom:master Dec 11, 2019
@schneems schneems changed the title [WIP] Allow sending SIGTERM to workers on timeout Allow sending SIGTERM to workers on timeout Dec 11, 2019
@wuputah
Collaborator

wuputah commented Dec 11, 2019

Thank you for the PR. I have been thinking about this. While it's certainly safe because it's behind a setting, there are a number of counter-arguments to this approach.

  • TERMing children in response to a timeout is server-specific behavior. Rack::Timeout is not specific to a particular Rack server. Presumably, many people have Rack::Timeout installed when running locally, so they may need to disable this behavior when not running Puma, or when running Puma without `-w X`.
  • Ultimately, the real purpose of Rack::Timeout is to generate a stack trace so you are aware of, and can debug, slow-running requests. TERMing a child could cause errors to never reach error trackers (Rollbar, etc.), though this particular behavior is not yet clear to me and would require some testing in an app where an error tracker is present. This depends on how rapidly Puma responds to the TERM signal in destroying this particular thread.
    • Additionally, any other exception handling the app itself has in place may be interrupted.
  • While the concern around interrupting database connection state is certainly valid, it's common for timeouts to interrupt other, less critical network calls, like HTTP traffic, which are less dangerous to interrupt.
  • There's code in Rack::Timeout that lets you respond to particular events (observers), so the implementation itself could possibly be done via an observer, but I'd have to look into this possibility more closely.

@schneems
Member Author

> so they may need to disable this behavior when not running Puma, or when running Puma without `-w X`.

I would actually love to add some integration to Puma. I've not looked into it too deeply, but at least it would be good to know whether you're running with workers or not. Worst case, I might have to add a `Puma.active_config` integration point or something. It would be better if we could find that info without one. It would be even better if we could figure out whether a process is a child of another process in a server-agnostic way. I googled around a bit and didn't see an obvious easy answer, but that doesn't mean one doesn't exist.

Since Rails 5.0, Puma has shipped as the default web server, and in practice I've rarely (if ever?) seen a case where someone was using processes in prod but not locally.

> Ultimately, the real purpose of Rack::Timeout is to generate a stack trace so you are aware of, and can debug, slow-running requests.

The exception and its backtrace still get raised here.

> This depends on how rapidly Puma responds to the TERM signal in destroying this particular thread. Additionally, any other exception handling the app itself has in place may be interrupted.

Puma doesn't actually kill a request when it gets a TERM; it waits for all requests to finish. That includes the request that hit the timeout. It will be "finished" once the timeout error is raised, whatever error handling applies runs, and a 500 response is generated.

Unicorn (assuming that you've done the Heroku TERM to KILL switcheroo) also waits on the current request before exiting, though it has an exit timeout I believe.
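For reference, the "switcheroo" mentioned above is the Unicorn signal-handling config Heroku documents, roughly like this (a sketch; details vary by setup):

    # config/unicorn.rb (sketch): make TERM behave like a graceful QUIT shutdown
    before_fork do |server, worker|
      Signal.trap 'TERM' do
        puts 'Unicorn master intercepting TERM and sending myself QUIT instead'
        Process.kill 'QUIT', Process.pid
      end
    end

    after_fork do |server, worker|
      Signal.trap 'TERM' do
        puts 'Unicorn worker intercepting TERM and doing nothing. Wait for master to send QUIT'
      end
    end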

> While the concern around interrupting database connection state is certainly valid, it's common for timeouts to interrupt other, less critical network calls, like HTTP traffic, which are less dangerous to interrupt.

True. My ideal end goal is to introduce some kind of interface like `Thread.raise_when_safe(timeout: 1.seconds)` where an exception won't be raised until after the thread is no longer executing an ensure block. I talked to Matz about this and he said you can't do that because there might be an infinite loop that you might need to kill, so my proposal is to add a timeout that could fire in the event of "stuck" ensure code.
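To make the intent concrete, here is a sketch of those semantics in plain Ruby (`Thread.raise_when_safe` does not exist in Ruby; this assumes the target thread maintains a `thread[:in_ensure]` flag around its cleanup code):

    # Defer raising into the thread while it reports being inside ensure/cleanup
    # code, but give up and raise anyway once the timeout elapses so a "stuck"
    # ensure block can't postpone the raise forever.
    def raise_when_safe(thread, exception, timeout:)
      deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
      sleep 0.01 while thread[:in_ensure] &&
                       Process.clock_gettime(Process::CLOCK_MONOTONIC) < deadline
      thread.raise(exception)
    end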

I will mention that the customer on ticket 784120 reported that this branch solved their immediate problem.

Even so, it's still a bandaid - ultimately we want to get people who are hitting H12s to not get them at all. But we also need these failsafes (like rack-timeout) to not turn around and cause additional issues.

I'm working on an H12 Dev Center article and would like to mention this setting.

> There's code in Rack::Timeout that lets you respond to particular events (observers), so the implementation itself could possibly be done via an observer, but I'd have to look into this possibility more closely.

Would be interesting to see if this can be cleaner.
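For reference, a rough sketch of that observer route (the observer API and env key are as described in rack-timeout's README; treat the details as assumptions):

    # Sketch: send SIGTERM to the current worker when a request transitions to
    # the timed-out state, via rack-timeout's state-change observer hook.
    Rack::Timeout.register_state_change_observer(:term_on_timeout) do |env|
      info = env["rack-timeout.info"]
      Process.kill("TERM", Process.pid) if info && info.state == :timed_out
    end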

@wuputah
Collaborator

wuputah commented Dec 11, 2019

I don't know what rails start does by default (use workers or not), but in older versions of Rails, as I recall, it would not use Puma at all, while a Procfile may specify a different web server with specific options. The config option here could be specified only in c/e/production.rb, of course, but if we want to head in the direction of turning this on by default, smarter default behavior might be warranted.

`defined?(Puma)` and then `Puma.cli_config` looks like it should work for current versions of Puma. This is a `Puma::Configuration` object and `.options[:workers]` should have what you need.
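Putting that together, the guard might look something like this (a sketch; `Puma.cli_config` is nil when Puma is booted outside its CLI, and the `term_on_timeout` variable name is just illustrative):

    # Only enable TERM-on-timeout when Puma was started via its CLI with workers.
    workers =
      if defined?(Puma) && Puma.respond_to?(:cli_config) && Puma.cli_config
        Puma.cli_config.options[:workers].to_i
      else
        0
      end

    term_on_timeout = workers > 0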

It'd be real nice if we didn't have to destroy the entire worker to solve this problem. If the thread was destroyed instead, would that help? (Probably not.) And is there even any way to tell Puma to exit a thread gracefully when the request is finished? That'd have to be via its API, rather than a signal.

But if the problem is, most typically, corrupted database connections in a pool, or database connection pool exhaustion, I'm not sure what else can be done, aside from fixing the problem closer to the source (in Ruby) as you suggested.

@schneems
Member Author

Looks like that config object is not always available:

module Puma
  class << self
    # The CLI exports its Puma::Configuration object here to allow
    # apps to pick it up. An app needs to use it conditionally though
    # since it is not set if the app is launched via another
    # mechanism than the CLI class.
    attr_accessor :cli_config
  end
end

I could scan through ObjectSpace to grab a reference to the launcher, but that's pretty invasive (it's essentially what PumaWorkerKiller does, FWIW). If this is functionality we want, I would rather work towards getting it into Puma proper, exposing config values directly even when not booted via the CLI. I'm not sure how large the scope of that work would be, but I'm guessing it's less trivial than it sounds.
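For illustration only, the "invasive" ObjectSpace approach looks roughly like this (similar in spirit to PumaWorkerKiller; the exact class and reader names vary between Puma versions, so treat it as a sketch, not an API):

    # Scan the heap for Puma's launcher and read the worker count off of it.
    launcher = ObjectSpace.each_object(Puma::Launcher).first
    workers  = launcher && launcher.options[:workers]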

@wuputah
Collaborator

wuputah commented Dec 12, 2019

I'm curious what that means in practice. Does that mean it would work with `web: puma -w 3 -t 16:16 -p $PORT` but not with `web: rails start -p $PORT`, because Rails starts Puma directly via API commands?

Throne3d added a commit to glowfic-constellation/glowfic that referenced this pull request Dec 27, 2019
Per the following articles, Rails (and specifically
rack-timeout) can handle timeouts badly, often failing
to return DB connections to the pool.

To resolve this, rack-timeout has a new configuration
option to SIGTERM Puma worker processes upon a timeout,
making sure resources are cleaned up properly (by
deferring to Puma's worker restart process).

zombocom/rack-timeout#157

https://www.schneems.com/2017/02/21/the-oldest-bug-in-ruby-why-racktimeout-might-hose-your-server/

https://github.com/ankane/the-ultimate-guide-to-ruby-timeouts/blob/a72ea3234e732942bd855735c1f8efa40e23de57/README.md#rack-timeout

Future work on this issue should add timeouts to our Postgres
connection, making sure that Rack::Timeout is more of a
"last resort".
dentarg added a commit to dentarg/rack-timeout that referenced this pull request Mar 31, 2020
This was originally added as part of zombocom#157, and released in 0.6.0.

Then it was reverted in zombocom#161, but I think it should be kept in, as
removing it is a breaking change. (Yeah, I know this gem is not 1.0 yet,
but will it ever be? :-) It will soon have existed for 10 years.)

It is useful, as it allows you to use Rack::Timeout like this:

    use Rack::Timeout, service_timeout: ENV.fetch("RACK_TIMEOUT_SERVICE_TIMEOUT")