
Allow sending SIGTERM to workers on timeout #157

Merged
merged 4 commits into zombocom:master on Dec 11, 2019

Conversation

schneems
Member

When rack-timeout fires, it can put an application in a "bad" state. This is due to https://www.schneems.com/2017/02/21/the-oldest-bug-in-ruby-why-racktimeout-might-hose-your-server/.

We can give applications the option of restarting the entire process when a timeout is hit by sending it a SIGTERM. Why would we want that?

While the `Thread.raise` API is unsafe, restarting a process is safe. If the application is running a multi-worker web server such as Unicorn or Puma, then when a worker process gets a SIGTERM it will be shut down and restarted by the "master" process.

In Puma, when `SIGTERM` is sent to a worker, that worker stops accepting new requests but continues to process all requests currently running in threads. This means that sending a `SIGTERM` to the current process will not affect requests that are currently running (this is a good thing).

What is the downside of sending a TERM to the current worker when there is a timeout? When you TERM a worker, you lose some capacity to process requests until Puma brings up another process, which may take a second or two. If all processes on a dyno are TERM'd, then your app would stop serving requests until Puma boots new processes, which, again, would take time. If the application is hitting timeouts because it is under-provisioned, then sending TERM to the process reduces processing power and adds a delay in processing new requests. In the worst case, this could mean that the application is stuck in a process restart loop.

Anecdotally, I've seen that applications can usually tolerate a number of timeout events; apps seeing fewer than a dozen a day tend not to experience problems. I don't have an empirical number, but if I had to guess, I would say it would make sense for it to be related to MAX_THREADS. Why?

Imagine that every rack-timeout request that fails corrupts something. It's likely to be a connection of some kind, and if you've got MAX_THREADS=5 then you've got 5 database connections.
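For context, the wiring that usually makes this true in a Rails-on-Puma app (a sketch of common configuration, not part of this PR) ties the Puma thread count and the ActiveRecord pool size to the same MAX_THREADS value, so each worker tends to hold one connection per thread:

    # config/puma.rb (common Heroku-style setup, shown only as a sketch)
    max_threads = Integer(ENV.fetch("MAX_THREADS", 5))
    threads max_threads, max_threads                  # each worker runs MAX_THREADS request threads
    workers Integer(ENV.fetch("WEB_CONCURRENCY", 2))

    # config/database.yml typically sets `pool: <%= ENV.fetch("MAX_THREADS", 5) %>`,
    # so every request thread can check out its own database connection.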

Since this is a theory I'm making the setting configurable so we can experiment with it in the wild.
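As a rough sketch of how an app might opt in (the option name `term_on_timeout` and the value shown are assumptions here; see the gem's README for the exact setting and accepted values):

    # config.ru (sketch): opt in to TERMing the worker when a request times out.
    # `term_on_timeout` and `MyApp` are illustrative names, not confirmed API.
    require "rack-timeout"

    use Rack::Timeout, service_timeout: 15, term_on_timeout: 1
    run MyApp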

cc/ @ericc572

@schneems schneems changed the title Allow sending SIGTERM to workers on timeout [WIP] Allow sending SIGTERM to workers on timeout Nov 14, 2019
@schneems
Member Author

Internal support ticket number 784120

@schneems schneems merged commit c328e41 into zombocom:master Dec 11, 2019
@schneems schneems changed the title [WIP] Allow sending SIGTERM to workers on timeout Allow sending SIGTERM to workers on timeout Dec 11, 2019
@wuputah
Collaborator

wuputah commented Dec 11, 2019

Thank you for the PR. I have been thinking about this. While it's certainly safe because it's behind a setting, there are a number of counter-arguments to this approach.

  • TERMing children in response to a timeout is server-specific behavior. Rack::Timeout is not specific to a particular Rack server. Presumably, many people have Rack::Timeout installed when running locally, so they may need to disable this behavior when not running Puma, or when running Puma without `-w X`.
  • Ultimately, the real purpose of Rack::Timeout is to generate a stack trace so you are aware of, and can debug, slow-running requests. TERMing a child could cause errors to never reach error trackers (Rollbar, etc.), though this particular behavior is not yet clear to me and would require some testing in an app where an error tracker is present. This depends on how rapidly Puma responds to the TERM signal in destroying this particular thread.
    • Additionally, any other exception handling the app itself has in place may be interrupted.
  • While the concern around interrupting database connection state is certainly valid, it's common for timeouts to interrupt other, less critical network calls, like HTTP traffic, which are less dangerous to interrupt.
  • There's code in Rack::Timeout that lets you respond to particular events (observers), so the implementation itself could possibly be done via an observer, but I'd have to look into this possibility more closely.

@schneems
Member Author

> so they may need to disable this behavior when not running Puma, or when running Puma without `-w X`.

I would actually love to add some integration to Puma. I've not looked into it too deeply, but at least it would be good to know whether you're running with workers or not. Worst case, I might have to add a `Puma.active_config` integration point or something. It would be better if we could find that info without one. It would be even better if we could figure out whether a process is a child of another process in a server-agnostic way. I googled around a bit and didn't see an obvious easy answer, but that doesn't mean one doesn't exist.

Since Rails 5.0, Puma has shipped as the default web server, and in practice I've rarely (if ever?) seen a case where someone was using processes in prod but not locally.

> Ultimately, the real purpose of Rack::Timeout is to generate a stack trace so you are aware of, and can debug, slow-running requests.

The exception and its backtrace still get raised here.

> This depends on how rapidly Puma responds to the TERM signal in destroying this particular thread. Additionally, any other exception handling the app itself has in place may be interrupted.

Puma doesn't actually kill a request when it gets a TERM; it waits for all requests to finish. That includes the request that hit the timeout. It will be "finished" once the timeout error is raised, whatever error handling applies runs, and a 500 response is generated.

Unicorn (assuming that you've done the Heroku TERM to KILL switcheroo) also waits on the current request before exiting, though it has an exit timeout I believe.
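For reference, the "switcheroo" mentioned above is the Unicorn signal-handling config Heroku documents, roughly like this (a sketch; details vary by setup):

    # config/unicorn.rb (sketch): make TERM behave like a graceful QUIT shutdown
    before_fork do |server, worker|
      Signal.trap 'TERM' do
        puts 'Unicorn master intercepting TERM and sending myself QUIT instead'
        Process.kill 'QUIT', Process.pid
      end
    end

    after_fork do |server, worker|
      Signal.trap 'TERM' do
        puts 'Unicorn worker intercepting TERM and doing nothing. Wait for master to send QUIT'
      end
    end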

> While the concern around interrupting database connection state is certainly valid, it's common for timeouts to interrupt other, less critical network calls, like HTTP traffic, which are less dangerous to interrupt.

True. My ideal end goal is to introduce some kind of interface like `Thread.raise_when_safe(timeout: 1.seconds)` where an exception won't be raised until after the thread is no longer executing an ensure block. I talked to Matz about this and he said you can't do that because there might be an infinite loop that you might need to kill, so my proposal is to add a timeout that could fire in the event of "stuck" ensure code.
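To make the intent concrete, here is a sketch of those semantics in plain Ruby (`Thread.raise_when_safe` does not exist in Ruby; this assumes the target thread maintains a `thread[:in_ensure]` flag around its cleanup code):

    # Defer raising into the thread while it reports being inside ensure/cleanup
    # code, but give up and raise anyway once the timeout elapses so a "stuck"
    # ensure block can't postpone the raise forever.
    def raise_when_safe(thread, exception, timeout:)
      deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
      sleep 0.01 while thread[:in_ensure] &&
                       Process.clock_gettime(Process::CLOCK_MONOTONIC) < deadline
      thread.raise(exception)
    end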

I will mention that the customer on ticket 784120 reported that this branch solved their immediate problem.

Even so, it's still a bandaid - ultimately we want to get people who are hitting H12s to not get them at all. But we also need these failsafes (like rack-timeout) to not turn around and cause additional issues.

I'm working on an H12 Dev Center article and would like to mention this setting.

> There's code in Rack::Timeout that lets you respond to particular events (observers), so the implementation itself could possibly be done via an observer, but I'd have to look into this possibility more closely.

Would be interesting to see if this can be cleaner.
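For reference, a rough sketch of that observer route (the observer API and env key are as described in rack-timeout's README; treat the details as assumptions):

    # Sketch: send SIGTERM to the current worker when a request transitions to
    # the timed-out state, via rack-timeout's state-change observer hook.
    Rack::Timeout.register_state_change_observer(:term_on_timeout) do |env|
      info = env["rack-timeout.info"]
      Process.kill("TERM", Process.pid) if info && info.state == :timed_out
    end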

@wuputah
Collaborator

wuputah commented Dec 11, 2019

I don't know what rails start does by default (use workers or not), but in older versions of Rails, as I recall, it would not use Puma at all, while a Procfile may specify a different web server with specific options. The config option here could be specified only in c/e/production.rb, of course, but if we want to head in the direction of turning this on by default, smarter default behavior might be warranted.

`defined?(Puma)` and then `Puma.cli_config` looks like it should work for current versions of Puma. This is a `Puma::Configuration` object and `.options[:workers]` should have what you need.
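Putting that together, the guard might look something like this (a sketch; `Puma.cli_config` is nil when Puma is booted outside its CLI, and the `term_on_timeout` variable name is just illustrative):

    # Only enable TERM-on-timeout when Puma was started via its CLI with workers.
    workers =
      if defined?(Puma) && Puma.respond_to?(:cli_config) && Puma.cli_config
        Puma.cli_config.options[:workers].to_i
      else
        0
      end

    term_on_timeout = workers > 0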

It'd be real nice if we didn't have to destroy the entire worker to solve this problem. If the thread was destroyed instead, would that help? (Probably not.) And is there even any way to tell Puma to exit a thread gracefully when the request is finished? That'd have to be via its API, rather than a signal.

But if the problem is, most typically, corrupted database connections in a pool, or database connection pool exhaustion, I'm not sure what else can be done, aside from fixing the problem closer to the source (in Ruby) as you suggested.

@schneems
Member Author

Looks like that config object is not always available:

module Puma
  class << self
    # The CLI exports its Puma::Configuration object here to allow
    # apps to pick it up. An app needs to use it conditionally though
    # since it is not set if the app is launched via another
    # mechanism than the CLI class.
    attr_accessor :cli_config
  end
end

I could scan through ObjectSpace to grab a reference to the launcher, but that's pretty invasive (it's essentially what PumaWorkerKiller does, FWIW). If this is functionality we want, I would rather work towards getting it into Puma proper, exposing config values directly even when not booted via the CLI. I'm not sure how large the scope of that work would be, but I'm guessing it's less trivial than it sounds.
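For illustration only, the "invasive" ObjectSpace approach looks roughly like this (similar in spirit to PumaWorkerKiller; the exact class and reader names vary between Puma versions, so treat it as a sketch, not an API):

    # Scan the heap for Puma's launcher and read the worker count off of it.
    launcher = ObjectSpace.each_object(Puma::Launcher).first
    workers  = launcher && launcher.options[:workers]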

@wuputah
Collaborator

wuputah commented Dec 12, 2019

I'm curious what that means in practice. Does that mean it would work with `web: puma -w 3 -t 16:16 -p $PORT` but not with `web: rails start -p $PORT`, because Rails starts Puma directly via API commands?

Throne3d added a commit to glowfic-constellation/glowfic that referenced this pull request Dec 27, 2019
Per the following articles, Rails (and specifically
rack-timeout) can handle timeouts badly, often failing
to return DB connections to the pool.

To resolve this, rack-timeout has a new configuration
option to SIGTERM Puma worker processes upon a timeout,
making sure resources are cleaned up properly (by
deferring to Puma's worker restart process).

zombocom/rack-timeout#157

https://www.schneems.com/2017/02/21/the-oldest-bug-in-ruby-why-racktimeout-might-hose-your-server/

https://github.com/ankane/the-ultimate-guide-to-ruby-timeouts/blob/a72ea3234e732942bd855735c1f8efa40e23de57/README.md#rack-timeout

Future work on this issue should add timeouts to our Postgres
connection, making sure that Rack::Timeout is more of a
"last resort".
dentarg added a commit to dentarg/rack-timeout that referenced this pull request Mar 31, 2020
This was originally added as part of zombocom#157, and released in 0.6.0.

Then it was reverted in zombocom#161, but I think it should be kept in, as
removing it is a breaking change. (Yeah, I know this gem is not 1.0 yet,
but will it ever be? :-) It will soon have existed for 10 years.)

It is useful, as it allows you to use Rack::Timeout like this:

    use Rack::Timeout, service_timeout: ENV.fetch("RACK_TIMEOUT_SERVICE_TIMEOUT")