
Reduce memory usage for large file uploads #3062

Merged
merged 2 commits into puma:master from willkoehler:reduce_read_body_memory on Jan 29, 2023

Conversation

willkoehler
Contributor

@willkoehler willkoehler commented Jan 22, 2023

The motivation for this change is to reduce memory used by a Rails application running on Heroku that handles large file uploads (~100 MB). In its current state, the application (running Puma 6.0.2 + Rails 5.2.8 + Ruby 2.7.6) starts using swap and triggering "memory quota exceeded" errors on the Heroku dyno after only a few file uploads.

This PR is a proof of concept that appears to significantly reduce the memory used by Puma during the data phase of the HTTP request. The change is to reuse a single read buffer in Puma::Client#read_body and Puma::Client#read_chunked_body instead of allocating a new buffer on each call to read_nonblock().
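For readers skimming the diff, here is a minimal sketch of the pattern, assuming a simplified reader class (names other than read_nonblock are illustrative; this is not the actual Puma::Client code):

```ruby
# Sketch of the buffer-reuse idea: pass the same String to read_nonblock as
# the output buffer, so each chunk overwrites it instead of allocating a
# fresh String on every call.
class BodyReader
  CHUNK_SIZE = 16 * 1024  # stand-in read size, not necessarily Puma's value

  def initialize(io, body)
    @io = io
    @body = body        # e.g. the Tempfile the request body is spooled into
    @read_buffer = +""  # one reusable, unfrozen buffer per client
  end

  # Returns false at EOF, true otherwise.
  def read_some
    chunk = @io.read_nonblock(CHUNK_SIZE, @read_buffer, exception: false)
    return false if chunk.nil?               # EOF
    return true if chunk == :wait_readable   # nothing available yet
    @body << chunk  # append (copies bytes) before the next read reuses the buffer
    true
  end
end
```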

I proved this change worked locally by throttling the network speed in Chrome dev tools and watching application memory usage as Puma received the file upload request. I then reproduced this experiment on a Heroku dyno, uploading a 100 Mb file to my application while watching memory on the dyno with top. I grabbed screenshots just before Puma handed the request off to Rack to isolate any memory usage to Puma alone.

Puma v6.0.2 - Initial memory

[screenshot: memory13_at_idle_puma_6_0_2]

Puma v6.0.2 - Memory at 95% uploaded

[screenshot: memory14_95percent_uploaded_puma_6_0_2]

This PR - Initial memory

[screenshot: memory11_at_idle_with_fix]

This PR - Memory at 95% uploaded

[screenshot: memory12_95percent_uploaded_with_fix]

To demonstrate this change had a positive impact on the application memory footprint overall (not just the Puma data phase), I watched Heroku's memory usage graph while uploading 5 large (100Mb) files. This PR reduced memory usage by ~100Mb (25% reduction) compared to Puma 6.0.2. See screenshot below.

[screenshot: heroku_memory]

My biggest concern is that this solution may not be thread-safe. To test this, I ran Puma locally with WEB_CONCURRENCY=0 bin/rails s and performed 3 simultaneous 100Mb file uploads. Each of the files was received successfully.

One final note: I did not change the read_nonblock call in Puma::Client#try_to_finish. Doing so creates a shared-memory conflict between data, @buffer, and @read_buffer (they would all be references to the same object). This would need to be fixed by duping data before assigning it to @buffer:

if @buffer
  @buffer << data
else
  @buffer = data.dup
end

I felt that this additional complexity may be confusing and easily broken in the future.
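To make the hazard concrete, here is a tiny standalone illustration (it uses readpartial with an output buffer so the snippet stays deterministic; Puma uses read_nonblock, but the aliasing behaves the same way):

```ruby
require 'socket'

r, w = UNIXSocket.pair
read_buffer = +""

w.write("first chunk")
data = r.readpartial(4096, read_buffer)  # data and read_buffer are the same object
buffer = data                            # buffer now aliases the shared read buffer

w.write("second chunk")
r.readpartial(4096, read_buffer)         # overwrites the shared buffer in place
puts buffer                              # => "second chunk" -- the first chunk is lost
```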

Your checklist for this pull request

  • I have reviewed the guidelines for contributing to this repository.
  • I have added (or updated) appropriate tests if this PR fixes a bug or adds a feature.
  • My pull request is 100 lines added/removed or less so that it can be easily reviewed.
  • If this PR doesn't need tests (docs change), I added [ci skip] to the title of the PR.
  • If this closes any issues, I have added "Closes #issue" to the PR description or my commit messages.
  • I have updated the documentation accordingly.
  • All new and existing tests passed, including Rubocop.

@nateberkopec
Member

My biggest concern is that this solution may not be thread-safe.

Someone check me here but I believe Client objects are only accessed by the Reactor thread, so I don't think there can be a thread safety issue here because this state is not shared.

@@ -98,6 +98,8 @@ def initialize(io, env=nil)
@body_remain = 0

@in_last_chunk = false

@read_buffer = +""
Member

What is this? I haven't seen this syntax before

Member

It is the short syntax for a non-frozen String (if string literals were frozen by default). Same as String.new, which would read better IMHO.
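For anyone else who hasn't run into it, a quick illustration (assuming a file with the frozen_string_literal: true magic comment, which Puma's source files use):

```ruby
# frozen_string_literal: true

"".frozen?          # => true  (the literal is frozen by the magic comment)
(+"").frozen?       # => false (unary + returns an unfrozen copy)
String.new.frozen?  # => false (equivalent result, arguably more readable)
```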

Contributor

+"" also has better performance.

require 'benchmark'

Benchmark.bm do |bm|
  bm.report { 100000000.times { String.new } }
  bm.report { 100000000.times { +'' } }
end
       user     system      total        real
   4.359000   0.000000   4.359000 ( 10.255661)
   2.954000   0.000000   2.954000 (  6.286254)

Contributor Author

I originally used String.new but this triggered a Rubocop Performance/UnfreezeString error.

I like String.new because its intention is clear and we only allocate once. But I also see the benefit of using +"" as a rule app-wide if that is what the team prefers. Performance is important in Puma, and there are places where this gain may be meaningful.

Member

Thanks, it's obviously not very easily googleable 😆

@dentarg
Member

dentarg commented Jan 23, 2023

Can you contribute a repro app where you see the same gain? I think it would be interesting for others to reproduce/play with this, to verify gains.

@willkoehler
Contributor Author

willkoehler commented Jan 23, 2023

Someone check me here but I believe Client objects are only accessed by the Reactor thread, so I don't think there can be a thread safety issue here because this state is not shared.

This was my take as well. It appeared that @read_buffer would belong to (and only be used by) a single request at any given time.

@willkoehler
Contributor Author

willkoehler commented Jan 23, 2023

Can you contribute a repro app where you see the same gain? I think it would be interesting for others to reproduce/play with this, to verify gains.

I'm happy to do that. My app uses Rails + Carrierwave + S3. I can bootstrap a simple, one-page Rails app with the same upload behavior. But I wonder if it would be worth simplifying even further to a plain Rails app with local ActiveStorage?

@willkoehler
Contributor Author

To keep things as simple as possible, I'm building a small Sinatra app to demonstrate the same behavior. I have a proof of concept running. I'll circle back here when I have the final version pushed up with reproducible steps to demonstrate the memory consumption.

@nateberkopec
Member

If it's just built in Sinatra, you could drop it into the benchmarks folder of the repository as well.

@nateberkopec nateberkopec added the waiting-for-changes Waiting on changes from the requestor label Jan 25, 2023
@willkoehler
Contributor Author

The demo app is up: https://github.com/willkoehler/puma_demo

I see the same essential memory behavior in this demo app as in the Rails app. However, Ruby garbage collection appears to be a bit more efficient in the demo app: memory doesn't grow in the same unbounded way over multiple uploads as I'm seeing in the Rails app. Still, the demo app shows a clear benefit with this PR vs Puma 6.0.2 during the data phase of the HTTP request.

@dentarg
Member

dentarg commented Jan 26, 2023

Thanks for the demo app. I simplified it a tiny bit (just a config.ru, upload using curl) and added puma_worker_killer (https://github.com/zombocom/puma_worker_killer) just to log the memory usage (every 0.3 seconds): https://github.com/dentarg/gists/tree/master/gists/puma_pr-3062-upload-file-mem
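The gist above has the real files; roughly, the setup looks like the sketch below (option names are puma_worker_killer's, the thresholds are arbitrary, and the file contents are only illustrative):

```ruby
# config.ru -- minimal upload endpoint (sketch)
run lambda { |env|
  bytes = env['rack.input'].read.bytesize   # drain the request body
  [200, { 'content-type' => 'text/plain' }, ["read #{bytes} bytes\n"]]
}
```

```ruby
# puma.rb -- one worker, memory logged by puma_worker_killer
workers 1
threads 0, 5
port 9090

before_fork do
  require 'puma_worker_killer'
  PumaWorkerKiller.config do |c|
    c.ram       = 4096  # MB threshold, set high so nothing actually gets killed
    c.frequency = 0.3   # seconds between "Consuming X mb" log lines
  end
  PumaWorkerKiller.start
end
```

The upload itself is then a single curl --form POST against the /upload route (the lambda above answers any path).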

Puma 6.0.2

[36093] Puma starting in cluster mode...
[36093] * Puma version: 6.0.2 (ruby 3.1.3-p185) ("Sunflower")
[36093] *  Min threads: 0
[36093] *  Max threads: 5
[36093] *  Environment: development
[36093] *   Master PID: 36093
[36093] *      Workers: 1
[36093] *     Restarts: (✔) hot (✔) phased
[36093] * Listening on http://0.0.0.0:9090
[36093] Use Ctrl-C to stop
[36093] - Worker 0 (PID: 36094) booted in 0.07s, phase: 0
[36093] PumaWorkerKiller: Consuming 70.703125 mb with master and 1 workers.
[36093] PumaWorkerKiller: Consuming 70.703125 mb with master and 1 workers.
...
[36093] PumaWorkerKiller: Consuming 110.40625 mb with master and 1 workers.
[36093] PumaWorkerKiller: Consuming 161.328125 mb with master and 1 workers.
[36093] PumaWorkerKiller: Consuming 276.515625 mb with master and 1 workers.
127.0.0.1 - - [26/Jan/2023:22:51:24 +0100] "POST /upload HTTP/1.1" 200 7 0.1335
[36093] PumaWorkerKiller: Consuming 250.859375 mb with master and 1 workers.
[36093] PumaWorkerKiller: Consuming 250.875 mb with master and 1 workers.
[36093] PumaWorkerKiller: Consuming 250.9375 mb with master and 1 workers.

This PR

[36240] Puma starting in cluster mode...
[36240] * Puma version: 6.0.2 (ruby 3.1.3-p185) ("Sunflower")
[36240] *  Min threads: 0
[36240] *  Max threads: 5
[36240] *  Environment: development
[36240] *   Master PID: 36240
[36240] *      Workers: 1
[36240] *     Restarts: (✔) hot (✔) phased
[36240] * Listening on http://0.0.0.0:9090
[36240] Use Ctrl-C to stop
[36240] - Worker 0 (PID: 36242) booted in 0.07s, phase: 0
[36240] PumaWorkerKiller: Consuming 71.125 mb with master and 1 workers.
[36240] PumaWorkerKiller: Consuming 71.15625 mb with master and 1 workers.
...
[36240] PumaWorkerKiller: Consuming 71.578125 mb with master and 1 workers.
[36240] PumaWorkerKiller: Consuming 140.25 mb with master and 1 workers.
[36240] PumaWorkerKiller: Consuming 213.28125 mb with master and 1 workers.
127.0.0.1 - - [26/Jan/2023:22:51:51 +0100] "POST /upload HTTP/1.1" 200 7 0.1279
[36240] PumaWorkerKiller: Consuming 213.28125 mb with master and 1 workers.
[36240] PumaWorkerKiller: Consuming 213.3125 mb with master and 1 workers.
[36240] PumaWorkerKiller: Consuming 213.3125 mb with master and 1 workers.

@dentarg
Member

dentarg commented Jan 26, 2023

Forgot to mention, the file I uploaded above was 322 MB (337322761 bytes)

@willkoehler
Contributor Author

willkoehler commented Jan 26, 2023

Nice 🙌 I like the trick of using puma_worker_killer to monitor memory. Using curl to make the request lets us use the --limit-rate param to throttle the bandwidth, which more closely simulates production behavior, so we can see how memory grows and is garbage collected during the data phase.

There's a lot going on here to unpack memory-wise, and I want to make sure we're separating Puma's memory use from the rest of the app. In particular, once Puma hands the request off to Rack and Sinatra, I see additional memory allocation that we won't be able to control.

EDIT: Just skip to the next comment #3062 (comment). It's a lot clearer than what I tried to say below

...but if you want to keep reading, here was my original thought:

If we throttle the request bandwidth with --limit-rate I think what we will see is that, with Puma 6.0.2, memory grows during the data phase of the request (inside the Puma code) and it will look like this

[36093] PumaWorkerKiller: Consuming 70.703125 mb with master and 1 workers.
[36093] PumaWorkerKiller: Consuming 70.703125 mb with master and 1 workers.

**** Memory keeps growing - mostly increasing, occasionally dropping when garbage collection runs

[36093] PumaWorkerKiller: Consuming 110.40625 mb with master and 1 workers.

With my PR I expect it will look like this

[36240] PumaWorkerKiller: Consuming 71.125 mb with master and 1 workers.
[36240] PumaWorkerKiller: Consuming 71.15625 mb with master and 1 workers.

**** Memory stays about the same

[36240] PumaWorkerKiller: Consuming 71.578125 mb with master and 1 workers.

I believe the last few memory logs in each of the samples above (where the memory usage jumps) are from after Puma has handed the request off to Rack:

[36240] PumaWorkerKiller: Consuming 71.578125 mb with master and 1 workers.

**** Request handed from Puma to Rack/Sinatra. Memory use jumps here but is out of our control.

[36240] PumaWorkerKiller: Consuming 140.25 mb with master and 1 workers.
[36240] PumaWorkerKiller: Consuming 213.28125 mb with master and 1 workers.
127.0.0.1 - - [26/Jan/2023:22:51:51 +0100] "POST /upload HTTP/1.1" 200 7 0.1279
[36240] PumaWorkerKiller: Consuming 213.28125 mb with master and 1 workers.
[36240] PumaWorkerKiller: Consuming 213.3125 mb with master and 1 workers.
[36240] PumaWorkerKiller: Consuming 213.3125 mb with master and 1 workers.

I'm going to update my app locally with your changes, reproduce your test and see if I can add some clarity about where memory is being allocated and how this PR may help.

@willkoehler
Contributor Author

willkoehler commented Jan 27, 2023

I updated my app and tested with puma_worker_killer and curl. Thanks again for the suggestion 🙌 This is much nicer than the way I was testing.

Here are the results uploading a 115MB video.

Run the app: bundle exec puma -p 9090 --config puma.rb -w 1
Make a rate-limited request: curl --form "video=@IMG_0009.MOV" --limit-rate 10M http://localhost:9090/upload

Puma 6.0.2

[11820] Puma starting in cluster mode...
[11820] * Puma version: 6.0.2 (ruby 3.2.0-p0) ("Sunflower")
[11820] *  Min threads: 0
[11820] *  Max threads: 5
[11820] *  Environment: development
[11820] *   Master PID: 11820
[11820] *      Workers: 1
[11820] *     Restarts: (✔) hot (✔) phased
[11820] * Listening on http://0.0.0.0:3000
[11820] Use Ctrl-C to stop
[11820] - Worker 0 (PID: 11949) booted in 0.06s, phase: 0
[11820] PumaWorkerKiller: Consuming 70.984375 mb with master and 1 workers.
[11820] PumaWorkerKiller: Consuming 70.984375 mb with master and 1 workers.

...curl request made - memory increases as file is received

[11820] PumaWorkerKiller: Consuming 72.796875 mb with master and 1 workers.
[11820] PumaWorkerKiller: Consuming 75.921875 mb with master and 1 workers.
[11820] PumaWorkerKiller: Consuming 78.953125 mb with master and 1 workers.
[11820] PumaWorkerKiller: Consuming 82.15625 mb with master and 1 workers.
[11820] PumaWorkerKiller: Consuming 85.265625 mb with master and 1 workers.
[11820] PumaWorkerKiller: Consuming 88.046875 mb with master and 1 workers.

...(clipped out lines) memory keeps increasing while request is received

[11820] PumaWorkerKiller: Consuming 121.53125 mb with master and 1 workers.
[11820] PumaWorkerKiller: Consuming 122.75 mb with master and 1 workers.
[11820] PumaWorkerKiller: Consuming 125.40625 mb with master and 1 workers.

...request handed off from Puma to Rack/Sinatra

[11820] PumaWorkerKiller: Consuming 220.6875 mb with master and 1 workers.
127.0.0.1 - - [26/Jan/2023:20:09:56 -0500] "POST /upload HTTP/1.1" 200 162 0.0553
[11820] PumaWorkerKiller: Consuming 228.96875 mb with master and 1 workers.
[11820] PumaWorkerKiller: Consuming 228.96875 mb with master and 1 workers.

This PR

[20815] Puma starting in cluster mode...
[20815] * Puma version: 6.0.2 (ruby 3.2.0-p0) ("Sunflower")
[20815] *  Min threads: 0
[20815] *  Max threads: 5
[20815] *  Environment: development
[20815] *   Master PID: 20815
[20815] *      Workers: 1
[20815] *     Restarts: (✔) hot (✔) phased
[20815] * Listening on http://0.0.0.0:3000
[20815] Use Ctrl-C to stop
[20815] - Worker 0 (PID: 20944) booted in 0.1s, phase: 0
[20815] PumaWorkerKiller: Consuming 73.25 mb with master and 1 workers.
[20815] PumaWorkerKiller: Consuming 73.25 mb with master and 1 workers.

...curl request made - memory stays level as file is received

[20815] PumaWorkerKiller: Consuming 73.28125 mb with master and 1 workers.
[20815] PumaWorkerKiller: Consuming 73.296875 mb with master and 1 workers.
[20815] PumaWorkerKiller: Consuming 73.34375 mb with master and 1 workers.
[20815] PumaWorkerKiller: Consuming 73.359375 mb with master and 1 workers.
[20815] PumaWorkerKiller: Consuming 73.359375 mb with master and 1 workers.
[20815] PumaWorkerKiller: Consuming 73.359375 mb with master and 1 workers.

...(clipped out lines) memory continues to stay level

[20815] PumaWorkerKiller: Consuming 73.703125 mb with master and 1 workers.
[20815] PumaWorkerKiller: Consuming 73.703125 mb with master and 1 workers.
[20815] PumaWorkerKiller: Consuming 73.703125 mb with master and 1 workers.

...request handed off from Puma to Rack/Sinatra

[20815] PumaWorkerKiller: Consuming 181.96875 mb with master and 1 workers.
127.0.0.1 - - [26/Jan/2023:20:27:16 -0500] "POST /upload HTTP/1.1" 200 162 0.0585
[20815] PumaWorkerKiller: Consuming 183.78125 mb with master and 1 workers.
[20815] PumaWorkerKiller: Consuming 183.78125 mb with master and 1 workers.

With larger files, garbage collection sometimes kicks in and improves the memory picture. But I still think this change has benefits:

  • I've found that garbage collection isn't perfect. Memory still tends to grow in real-world apps.
  • If there's a straightforward way to avoid allocating the memory in the first place (and the garbage collection work that comes with it), then we reduce the overall load on the system.
  • There's more improvement to be made after Puma hands the request over to Rack. I have ideas there as well, but that's a topic for a separate PR.

@nateberkopec
Member

Any chance we could commit this demo app to /benchmarks? Say in a new folder?

willkoehler added a commit to willkoehler/puma that referenced this pull request Jan 27, 2023
By re-using a single read buffer (vs allocating a new buffer on each call to read_nonblock), we significantly reduce the memory used when receiving large request bodies, such as those generated by Multipart/form-data requests with file uploads.
willkoehler added a commit to willkoehler/puma that referenced this pull request Jan 27, 2023
@willkoehler
Contributor Author

Any chance we could commit this demo app to /benchmarks? Say in a new folder?

For sure 👍 I simplified the demo app and added it to /benchmarks/local/sinatra. Feel free to suggest a different name or location.

@nateberkopec
Member

Excellent. Thank you very much for your contribution.

@nateberkopec nateberkopec merged commit 5973781 into puma:master Jan 29, 2023
@willkoehler willkoehler deleted the reduce_read_body_memory branch January 29, 2023 14:20
@willkoehler
Contributor Author

Thanks everyone for the review and vetting 🙌 I'm running 5973781 in production. I'll report back if I see anything strange.

@sanzstez

@willkoehler Can you please check this strange behaviour #3145 produced by your PR?

@willkoehler
Contributor Author

Hi @sanzstez I'd be happy to help look into that. Let's continue the conversation in #3145

@boazsegev

@willkoehler, thank you for the demo app. I used it to test iodine 👍🏻

I think this is a wonderful approach. Now that @read_buffer isn't re-allocated for every read, I would consider increasing the buffer size. The system's buffer won't be less than 64 KB, and using smaller chunks increases system calls (which hurts performance)... I would push for a 32 KB or 64 KB buffer instead (IMHO).

Good luck!

@willkoehler
Contributor Author

Hi @boazsegev, I'm glad the demo-app was helpful 🙌

This was my first Puma contribution, so the decision about the buffer size is above my "pay grade". However, it sounds plausible that a larger buffer size would be an improvement and it's probably worth opening a separate issue to start the conversation.

Feel free to tag me in the issue. I would be happy to explore the benefits with you and make a case for increasing the buffer size if it ends up improving performance or memory efficiency.

@boazsegev

boazsegev commented May 18, 2023

Hi @willkoehler and @nateberkopec ,

As you know, I'm busy maintaining my own server (iodine)... however, please allow me to make a quick case for increasing Puma's buffer size by comparing the 4 KB buffer to iodine's 8 KB buffer (for HTTP reads).

Comparison performed on my machine, while using curl with throttling set to 100M:

  • Puma uses about 21-33% CPU, curl uses ~7% CPU (takes 4.22 seconds to send 0.5Gb)
  • Iodine uses about 16-43% CPU, curl uses ~6% CPU (takes 4.18 seconds to send 0.5Gb)

I assume non-CPU time is lost to system calls and waiting on IO, as the request is throttled.

When the IO is slow enough, both show ~3-4% CPU usage. But when IO is faster (more saturated), the increased buffer means that the program spends more time running its own tasks (rather than system tasks).

Note that there are other differences (e.g., the iodine server is written in C), but at a quick glance it might indicate it's worthwhile to experiment with this.

I understand this might be better in a new issue, but I do not think I'll open an issue as that would also imply the responsibility to contribute (which might take me a while to get to).

Cheers!

@MSP-Greg
Member

Wondering whether this could be a config option? Maybe req_body_buffer_size or something similar?

That allows for a typical config out of the box, but users dealing with large files can increase it...
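Purely as a sketch of the idea (req_body_buffer_size is not an existing Puma option; the name is just the one suggested above):

```ruby
# config/puma.rb -- hypothetical option, does not exist in Puma today
req_body_buffer_size 64 * 1024   # read request bodies in 64 KB chunks
```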
