Allow sending SIGTERM to workers on timeout #157
Conversation
Assigning multiple values creates an intermediate array that isn't needed.
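For context, this review comment is about Ruby's parallel assignment, which (at least historically in MRI) gathers the right-hand side into a temporary Array before destructuring it. A hedged illustration:

```ruby
x = rand
y = rand

# Parallel assignment: the right-hand side is collected into a
# temporary Array and then destructured into a and b. (MRI may
# optimize some simple cases, depending on version.)
a, b = x, y

# Two separate assignments avoid the intermediate array entirely:
a = x
b = y
```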
Force-pushed from 36c66b7 to 9119b3b.
When rack-timeout fires, it can put an application in a "bad" state. This is due to https://www.schneems.com/2017/02/21/the-oldest-bug-in-ruby-why-racktimeout-might-hose-your-server/. We can give applications the option of restarting the entire process when a timeout is hit by sending it a SIGTERM.

Why would we want that? While the `Thread.raise` API is unsafe, restarting a process is safe. If the application is running a multiple-worker webserver such as Unicorn or Puma, when a worker process gets a SIGTERM it will be shut down and restarted by the "master" process. In Puma, when `SIGTERM` is sent to a worker, that worker stops accepting new requests but continues to process all requests currently running in threads. This means that sending a SIGTERM to the current process will not affect in-flight requests (this is a good thing).

What is the downside of sending a TERM to the current worker when there is a timeout? When you TERM a worker, you lose some capacity to process requests until Puma brings up another process, which may take a second or two. If all processes on a dyno are TERM'd, your app would stop serving requests until Puma boots new processes, which, again, takes time. If the application is hitting timeouts because it is under-provisioned, then sending TERM to the process reduces processing power and adds a delay in processing new requests. In the worst case, the application could get stuck in a process restart loop.

Anecdotally, I've seen that applications can usually tolerate a number of timeout events; apps seeing fewer than a dozen a day tend not to experience problems. I don't have an empirical number, but if I had to guess, I would say the threshold is related to MAX_THREADS. Why? Imagine that every rack-timeout request that fails corrupts something. It's likely to be a connection of some kind, and if you've got MAX_THREADS=5 then you've got 5 database connections.

Since this is a theory, I'm making the setting configurable so we can experiment with it in the wild.

cc/ @ericc572
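A minimal sketch of the idea (illustrative only, not rack-timeout's actual internals or final API): a middleware that TERMs its own worker process when a request runs past the allowed service time, deferring the actual restart to Puma's master process. A real implementation fires while the request is still running; this simplified version checks after the fact.

```ruby
# Illustrative sketch -- not rack-timeout's real implementation.
# Usage in config.ru: use TermOnTimeout, service_timeout: 15
class TermOnTimeout
  def initialize(app, service_timeout:)
    @app = app
    @service_timeout = service_timeout
  end

  def call(env)
    started  = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    response = @app.call(env)
    elapsed  = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    # Instead of only raising in the request thread (the unsafe
    # Thread.raise path), ask the webserver to recycle this entire
    # worker. Puma's master process will boot a replacement.
    Process.kill("SIGTERM", Process.pid) if elapsed > @service_timeout
    response
  end
end
```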
Force-pushed from 9119b3b to 6f32baa.
Internal support ticket number 784120

Thank you for the PR. I have been thinking about this. While it's certainly safe because it's behind a setting, there are a number of counter-arguments to this approach.
I would actually love to add some integration with Puma. I've not looked into it too deeply, but at least it would be good to know if you're running with workers or not. Worst case scenario I might have to add a

Since Rails 5.0, Puma has shipped as the default webserver, and in practice I've rarely (if ever?) seen a case where someone was using processes in prod but not locally.
A backtrace still gets raised here.
Puma doesn't actually kill a request when it gets a TERM; it waits for all in-flight requests to finish. That includes the request that hit the timeout: it will be "finished" when the timeout error is raised, whatever error handling applies has run, and a 500 response has been generated. Unicorn (assuming that you've done the Heroku TERM-to-QUIT switcheroo) also waits on the current request before exiting, though I believe it has an exit timeout.
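For reference, this is the signal swap being referred to (the Heroku-documented Unicorn pattern, reproduced from memory, so treat it as a sketch). Unicorn treats TERM as a quick shutdown and QUIT as a graceful one, so the config traps the TERM Heroku sends and re-sends QUIT to itself:

```ruby
# config/unicorn.rb -- sketch of the Heroku signal swap for Unicorn.
before_fork do |server, worker|
  Signal.trap 'TERM' do
    # Heroku sends TERM, which Unicorn interprets as "quit now".
    # Convert it to QUIT so in-flight requests are allowed to finish.
    puts 'Unicorn master intercepting TERM and sending myself QUIT instead'
    Process.kill 'QUIT', Process.pid
  end
end
```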
True. My ideal end goal is to introduce some kind of interface like

I will mention that the customer on ticket 784120 reported that this branch solved their immediate problem. Even so, it's still a band-aid; ultimately we want to get people who are hitting H12s to not get them at all. But we also need these failsafes (like rack-timeout) to not turn around and cause additional issues. I'm working on an H12 Dev Center article and would like to mention this setting.
Would be interesting to see if this can be cleaner.
I don't know what
It'd be real nice if we didn't have to destroy the entire worker to solve this problem. If the thread was destroyed instead, would that help? (Probably not.) And is there even any way to tell Puma to exit a thread gracefully when the request is finished? That'd have to be via its API, rather than a signal. But if the problem is, most typically, corrupted database connections in a pool, or database connection pool exhaustion, I'm not sure what else can be done, aside from fixing the problem closer to the source (in Ruby) as you suggested.
Looks like that config object is not always available:

```ruby
module Puma
  class << self
    # The CLI exports its Puma::Configuration object here to allow
    # apps to pick it up. An app needs to use it conditionally though
    # since it is not set if the app is launched via another
    # mechanism than the CLI class.
    attr_accessor :cli_config
  end
end
```

I could scan through ObjectSpace to grab a reference to the launcher, but that's pretty invasive (it's essentially what PumaWorkerKiller does, FWIW). If this is functionality we want, I would rather work towards getting it into Puma proper, exposing config values directly even when not booted via the CLI. Not sure how large the scope of that work would be, but I'm guessing it's less trivial than it sounds.
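To make the conditional-use point concrete, here is a hedged sketch of what a consumer has to do. The nil guard is required because `cli_config` is unset outside the CLI boot path; the `options[:workers]` lookup is my assumption about `Puma::Configuration`'s interface, not something verified against a specific Puma version.

```ruby
# Sketch: reading Puma's worker count defensively. cli_config is
# nil when the app wasn't booted through Puma's CLI, so every
# access needs a guard. (options[:workers] is assumed, not verified.)
workers =
  if defined?(Puma) && Puma.respond_to?(:cli_config) && Puma.cli_config
    Puma.cli_config.options[:workers]
  end

# workers is nil -> can't tell; 0 -> single-process mode;
# >= 1 -> cluster mode, where TERM-ing one worker is recoverable.
```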
I'm curious what that means in practice. Does that mean it would work if
Per the following articles, Rails (and specifically rack-timeout) can handle timeouts inappropriately, often failing to return DB connections to the pool. To resolve this, rack-timeout has a new configuration option to SIGTERM Puma worker processes upon a timeout, making sure they clean up resources (by deferring to Puma's process restart behavior).

zombocom/rack-timeout#157
https://www.schneems.com/2017/02/21/the-oldest-bug-in-ruby-why-racktimeout-might-hose-your-server/
https://github.com/ankane/the-ultimate-guide-to-ruby-timeouts/blob/a72ea3234e732942bd855735c1f8efa40e23de57/README.md#rack-timeout

Future work on this issue should add timeouts to our Postgres connection, making sure that Rack::Timeout is more of a "last resort".
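A hedged sketch of that future work: give the database its own timeouts so slow queries die at the source and Rack::Timeout really is the last resort. `connect_timeout` and `statement_timeout` are standard PostgreSQL/libpq settings, but double-check the exact configuration keys against your adapter version.

```ruby
# Sketch: database-level timeouts so Rack::Timeout is a last resort.
# Verify these keys against your Rails/pg adapter version.
require "active_record"

ActiveRecord::Base.establish_connection(
  adapter:         "postgresql",
  database:        "myapp_production",
  connect_timeout: 5,                          # seconds to establish a connection
  variables:       { statement_timeout: "5s" } # server kills queries after 5s
)
```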
This was originally added as part of zombocom#157, and released in 0.6.0. Then it was reverted in zombocom#161, but I think it should be kept in, as removing it is a breaking change. (Yeah, I know this gem is not 1.0 yet, but will it ever be? :-) It has existed for almost 10 years.) It is useful, as it allows you to use Rack::Timeout like this:

```ruby
use Rack::Timeout, service_timeout: ENV.fetch("RACK_TIMEOUT_SERVICE_TIMEOUT")
```
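The reason this matters is that `ENV.fetch` always returns a String, so the setting only works if rack-timeout coerces string values to numbers. A hedged sketch of that coercion (illustrative, not the gem's actual code):

```ruby
# Illustrative: why string coercion is needed for ENV-driven settings.
raw = ENV.fetch("RACK_TIMEOUT_SERVICE_TIMEOUT", "15") # ENV values are Strings
service_timeout = Float(raw)                          # "15" -> 15.0; raises on garbage
# Without coercion inside the gem, every app would have to do this
# conversion itself before passing the setting in.
```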