Performance: Encoding of keys/values in CommandArgs when using a codec that implements ToByteBufEncoder #2610

shikharid · 2024-01-24T16:32:00Z

Feature Request

Remove unwanted allocation when encoding CommandArgs when using a codec of type ToByteBufEncoder

Is your feature request related to a problem? Please describe

I was analysing CPU/memory for a JVM server that uses Redis heavily (on the order of tens of thousands of hmget/sec across ~10 connections) and noticed that a good share of the VM's CPU usage (~30% of total and ~70% of the event loops' CPU time) comes from CommandArgs.KeyArgument.encode.

All the connections use a ByteArrayCodec. The majority of that time is basically Netty's ByteBuf gc/pooling logic.
The reason seems to be our specific use case, where every hmget call has 100s of keys and encode gets called for each key.

I looked through the implementation and it felt like these allocs can be avoided entirely for the special case where the user does all req/resp in byte arrays (like folks who use the ByteArrayCodec), or for users who can exactly estimate the size of a key/val.

Two disclaimers:

The jfr sample used to analyse this is small (5 mins during peak load)
I have not invested extensive time with lettuce code, so might have been foolish to assume my suggestion below works

Describe the solution you'd like

The relevant code is in CommandArgs.KeyArgument.encode() and CommandArgs.ValueArgument.encode():

            if (codec instanceof ToByteBufEncoder) {

                ToByteBufEncoder<K, V> toByteBufEncoder = (ToByteBufEncoder<K, V>) codec;
                ByteBuf temporaryBuffer = target.alloc().buffer(toByteBufEncoder.estimateSize(val) + 6);

                try {
                    toByteBufEncoder.encodeValue(val, temporaryBuffer);
                    ByteBufferArgument.writeByteBuf(target, temporaryBuffer);
                } finally {
                    temporaryBuffer.release();
                }

                return;
            }

The solution would be to replace the above with something like:

            if (codec instanceof ToByteBufEncoder) {
                ToByteBufEncoder<K, V> toByteBufEncoder = (ToByteBufEncoder<K, V>) codec;
                // Below lines are basically what ByteBufferArgument.writeByteBuf does
                target.writeByte('$');
                IntegerArgument.writeInteger(target, toByteBufEncoder.estimateSize(val));
                target.writeBytes(CRLF);
                toByteBufEncoder.encodeValue(val, target);
                target.writeBytes(CRLF);
                return;
            }

Now this has a caveat, which is why I'm creating this issue: it assumes the value returned by estimateSize is not "estimated" but "exact".

Maybe we can give user the control by adding another method in ToByteBufEncoder which can tell us if the codec can predict exact sizes.
In ByteArrayCodec we always can.

This ensures no additional allocs and essentially makes it garbage free.
Happy to contribute and impl the solution.
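
To make the caveat concrete, here is a minimal sketch that uses only JDK types rather than lettuce or Netty. `BulkFrameSketch`, `writeBulk` and `isWellFormed` are hypothetical names for illustration. The sketch writes the `$<length>\r\n<payload>\r\n` framing with a length supplied up front, which is only safe when that length equals the number of payload bytes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the direct-write path; not lettuce code.
final class BulkFrameSketch {

    private static final byte[] CRLF = { '\r', '\n' };

    // Writes "$<declaredLength>\r\n<payload>\r\n" without a temporary buffer.
    static byte[] writeBulk(byte[] payload, int declaredLength) {
        ByteArrayOutputStream target = new ByteArrayOutputStream();
        target.write('$');
        byte[] length = Integer.toString(declaredLength).getBytes(StandardCharsets.US_ASCII);
        target.write(length, 0, length.length);
        target.write(CRLF, 0, CRLF.length);
        target.write(payload, 0, payload.length);
        target.write(CRLF, 0, CRLF.length);
        return target.toByteArray();
    }

    // True when the declared length prefix matches the number of payload bytes.
    static boolean isWellFormed(byte[] frame) {
        int i = 1;
        int declared = 0;
        while (frame[i] != '\r') {
            declared = declared * 10 + (frame[i] - '0');
            i++;
        }
        int payloadStart = i + 2;          // skip the CRLF after the length
        int payloadEnd = frame.length - 2; // index of the trailing CRLF
        return frame[0] == '$' && payloadEnd - payloadStart == declared;
    }
}
```

With a byte array the declared length is simply `payload.length`, so the frame is always well formed. With a string whose estimate is derived from the character count, any multi-byte character makes the prefix too small and the frame invalid.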

Describe alternatives you've considered

Use a thread-local on-heap ByteBuf as the temporary buffer, since we know that its lifecycle is method scoped (resizing as needed and occasionally discarding bytes if it encodes a very large key).
Somehow encode keys/vals in "batches" so as to allocate and copy to target only once. Seems tricky to do.
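
The first alternative could look roughly like the sketch below. It uses a plain `byte[]` instead of a Netty ByteBuf so that it stays self-contained, and `ScratchBuffer`, its two constants and both methods are hypothetical names and values chosen for illustration.

```java
// Hypothetical sketch of the thread-local scratch buffer alternative; not lettuce code.
final class ScratchBuffer {

    private static final int INITIAL_CAPACITY = 256;            // assumed starting size
    private static final int MAX_RETAINED_CAPACITY = 64 * 1024; // assumed cap before discarding

    private static final ThreadLocal<byte[]> SCRATCH =
            ThreadLocal.withInitial(() -> new byte[INITIAL_CAPACITY]);

    // Returns this thread's array, growing it when it is smaller than minCapacity.
    static byte[] acquire(int minCapacity) {
        byte[] buffer = SCRATCH.get();
        if (buffer.length < minCapacity) {
            buffer = new byte[Math.max(minCapacity, buffer.length * 2)];
            SCRATCH.set(buffer);
        }
        return buffer;
    }

    // Called when the method-scoped use ends. Oversized arrays are dropped so that
    // a single very large key does not pin memory for the lifetime of the thread.
    static void release(byte[] buffer) {
        if (buffer.length > MAX_RETAINED_CAPACITY) {
            SCRATCH.set(new byte[INITIAL_CAPACITY]);
        }
    }
}
```

This removes the per-key allocation but still copies every key twice, once into the scratch array and once into the target, which is why writing directly to the target is the preferred option when the size is known.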

Teachability, Documentation, Adoption, Migration Strategy

NA

shikharid · 2024-01-24T17:48:50Z

Also noticed this: a good chunk of the memory allocs done in the lettuce threads came from creating RedisStateMachine.State objects.
Since it's a very simple class and its entire usage is inside the RedisStateMachine class, why not go along the lines of what Netty does with such hot objects and cache them in a pool?
Can use Netty's ObjectPool (used extensively across Netty).

Seems pretty simple:

Take from the pool whenever doing new State()
Reset its state and release the objects when its no longer needed ( reset() and remove())

If somehow we miss releasing to the netty ObjectPool (reset not called before RedisStateMachine object was garbage collected), it won't be a memory leak as the pool itself keeps no reference (see io.netty.util.Recycler)
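
A deliberately simplified sketch of the pooling idea follows. It is not Netty's Recycler or ObjectPool, which are more elaborate. `StatePoolSketch`, the `State` fields and the cap of 32 pooled instances are assumptions made for illustration.

```java
import java.util.ArrayDeque;

// Hypothetical, simplified per-thread pool; not Netty's Recycler.
final class StatePoolSketch {

    // Stand-in for RedisStateMachine.State; the fields are assumptions.
    static final class State {
        int type;
        int count = -1;

        void reset() {
            type = 0;
            count = -1;
        }
    }

    private static final int MAX_POOLED = 32; // assumed cap per thread

    private static final ThreadLocal<ArrayDeque<State>> POOL =
            ThreadLocal.withInitial(ArrayDeque::new);

    // Hands out a pooled instance when one is available, otherwise allocates.
    static State acquire() {
        State pooled = POOL.get().pollLast();
        return pooled != null ? pooled : new State();
    }

    // Clears the instance and returns it to the pool unless the pool is full.
    static void release(State state) {
        state.reset();
        ArrayDeque<State> pool = POOL.get();
        if (pool.size() < MAX_POOLED) {
            pool.addLast(state);
        }
    }
}
```

An acquired instance is removed from the deque, so the pool holds no reference to it while it is in use. An instance that is never released is collected like any other object, which matches the no-leak argument above.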

Again happy to contribute and will create a separate issue if the optimisation makes sense

Setup Details

node: m6gd.xlarge ec2 nodes
pod: jvm running with 4c/8g and 2g max direct memory (jdk17)
lettuce:

8 conns used by 8 different threads (1 conn/thread)
auto flush disabled, cmds pipelined and flushed at fixed intervals/queue-size/max-cmd-age
dedicated epoll event loop, no other compute in its threads

shikharid · 2024-01-25T04:41:57Z

Have done a crudish untested version of the changes needed, if project owners agree then will add a more formal version in separate PRs

mp911de · 2024-01-25T08:05:51Z

Thanks for looking into this. We follow this approach to ensure protocol compliance by creating the encoded representation first and then writing the correct byte size to the argument buffer. estimateSize is implemented on a best-effort basis. For the byte-array codec, estimateSize is precise. For all other implementations this is a guesstimate, especially if you take a look at the StringCodec implementation with averageBytesPerChar * value.length().

It could make sense to introduce the optimization in the form that ToByteBufEncoder reports whether the estimate is the exact size (boolean isEstimateExact() or the like) and then follow your proposal to write onto the output buffer directly.
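
A minimal sketch of that shape, using only JDK types: `SizedEncoder` is a hypothetical stand-in for ToByteBufEncoder, and `isEstimateExact()` is the name floated above, not a confirmed API. Encoders that cannot guarantee an exact size keep going through a temporary buffer.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical encoder contract; not lettuce's ToByteBufEncoder.
interface SizedEncoder<T> {

    int estimateSize(T value);

    void encode(T value, ByteArrayOutputStream target);

    // Encoders opt in only when estimateSize always returns the exact byte count.
    default boolean isEstimateExact() {
        return false;
    }
}

final class ExactEstimateSketch {

    private static final byte[] CRLF = { '\r', '\n' };

    static final SizedEncoder<byte[]> BYTE_ARRAY = new SizedEncoder<byte[]>() {
        @Override public int estimateSize(byte[] value) { return value.length; }
        @Override public void encode(byte[] value, ByteArrayOutputStream target) {
            target.write(value, 0, value.length);
        }
        @Override public boolean isEstimateExact() { return true; }
    };

    static final SizedEncoder<String> UTF8_STRING = new SizedEncoder<String>() {
        @Override public int estimateSize(String value) { return value.length(); } // a guess only
        @Override public void encode(String value, ByteArrayOutputStream target) {
            byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
            target.write(bytes, 0, bytes.length);
        }
    };

    static <T> void writeArgument(SizedEncoder<T> encoder, T value, ByteArrayOutputStream target) {
        if (encoder.isEstimateExact()) {
            // Fast path: the length is known up front, so encode straight into the target.
            writeLength(target, encoder.estimateSize(value));
            encoder.encode(value, target);
        } else {
            // Fallback: encode first, then write the measured length.
            ByteArrayOutputStream temporary = new ByteArrayOutputStream(encoder.estimateSize(value));
            encoder.encode(value, temporary);
            byte[] encoded = temporary.toByteArray();
            writeLength(target, encoded.length);
            target.write(encoded, 0, encoded.length);
        }
        target.write(CRLF, 0, CRLF.length);
    }

    private static void writeLength(ByteArrayOutputStream target, int length) {
        target.write('$');
        byte[] digits = Integer.toString(length).getBytes(StandardCharsets.US_ASCII);
        target.write(digits, 0, digits.length);
        target.write(CRLF, 0, CRLF.length);
    }
}
```

Defaulting `isEstimateExact()` to `false` keeps existing codecs on the compliant path unless they explicitly opt in.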

Regarding the State stack in RedisStateMachine, it could be worth considering to make the array allocation-free by preinitializing the array with 32 instances of State. Instead of pooling instances per Thread, any release would just clear type and count.
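
As a rough sketch of the preinitialized-array idea: `PreallocatedStackSketch` and its members are hypothetical names, the `State` fields are stand-ins, and throwing when the depth exceeds the array size is an assumption rather than a description of what RedisStateMachine does.

```java
// Hypothetical sketch of a preallocated state stack; not the RedisStateMachine code.
final class PreallocatedStackSketch {

    static final class State {
        int type;       // stand-in for the response type being parsed
        int count = -1; // stand-in for the remaining element count

        void reset() {
            type = 0;
            count = -1;
        }
    }

    private final State[] stack = new State[32];
    private int size;

    PreallocatedStackSketch() {
        // Allocate every slot once so that parsing never creates State objects.
        for (int i = 0; i < stack.length; i++) {
            stack[i] = new State();
        }
    }

    // Reuses the instance already stored in the next free slot.
    State push() {
        if (size == stack.length) {
            throw new IllegalStateException("nesting deeper than " + stack.length);
        }
        return stack[size++];
    }

    // Clears the top instance in place and leaves it in the array for reuse.
    void pop() {
        stack[--size].reset();
    }

    int depth() {
        return size;
    }
}
```

Because each state machine owns its array, no thread-local lookup or pool bookkeeping is needed.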

In any case, before we proceed with code changes, it would be good to have some measurements before and after such changes to learn how much of an impact such a change provides.

shikharid · 2024-01-25T08:34:00Z

Thanks, the preallocated Stack makes good sense.
Will write some jmh benchmarks and changes and share a PR soon.
Should I combine the changes in 1 PR or create two separate ones?

mp911de · 2024-01-25T09:23:03Z

Let's go with a single pull request that has two commits for easier reviewing.

shikharid · 2024-01-25T11:12:26Z

Just curious though, where does the number 32 come from? Why won't we ever need more than that?
Couldn't figure out based on commit msgs and skimming through RedisStateMachine

mp911de · 2024-01-25T14:05:23Z

32 is an estimate of Redis' response depth (array-in-array-in-object-…). Most key/value commands have single-depth responses, while some extended Stream commands (such as XPENDING) have a nesting level of 6. I think that even 16 should be fine, but 32 brings us more on the safe side.

shikharid · 2024-01-25T16:53:31Z

Got it, Thanks.

shikharid · 2024-02-05T05:52:32Z

Hey, I have been caught up in some work last 2 weeks. Will do it this week and send a PR around the end.

shikharid · 2024-02-23T06:12:09Z

@mp911de added changes and benchmarks #2768

Good gains as per benchmarks, have a look

mp911de · 2024-02-26T10:15:42Z

Thanks a lot, this was a decent improvement.

shikharid · 2024-03-01T08:38:09Z

Thanks, any chance of a release soon which includes these?

mp911de · 2024-03-01T09:46:06Z

6.3.2.RELEASE is scheduled for March 15

mp911de added the type: enhancement and status: waiting-for-feedback labels Jan 25, 2024

shikharid added a commit to shikharid/lettuce-core that referenced this issue Feb 23, 2024

Fixes stale benchmark, adds gc and thrpt profiling to RedisStateMachi…

36a8b94

…ne benchmark redis#2610

shikharid added a commit to shikharid/lettuce-core that referenced this issue Feb 23, 2024

Use pre-allocated State's in RedisStateMachine, avoiding need for Sta…

878c4e1

…te object allocs redis#2610

shikharid added a commit to shikharid/lettuce-core that referenced this issue Feb 23, 2024

Directly encode key/value to ByteBuf when codec knows exact byte sizes …

0525aec

…redis#2610 * adds benchmarks to show perf gains * about 10x improvement in perf, with no added gc overhead

shikharid mentioned this issue Feb 23, 2024

Remove State allocs in RedisStateMachine and reduce allocs in ByteArrayCodec #2768

Closed

4 tasks

mp911de linked a pull request Feb 26, 2024 that will close this issue

Remove State allocs in RedisStateMachine and reduce allocs in ByteArrayCodec #2768

Closed

4 tasks

mp911de added this to the 6.3.2.RELEASE milestone Feb 26, 2024

mp911de pushed a commit that referenced this issue Feb 26, 2024

Directly encode key/value to ByteBuf when codec knows exact byte si…

68c89ae

…zes #2610 * adds benchmarks to show perf gains * about 10x improvement in perf, with no added gc overhead Original pull request: #2768

mp911de added a commit that referenced this issue Feb 26, 2024

Polishing #2610

6185ebd

Reduce code duplications. Add exact optimization to ASCII StringCodec. Tweak Javadoc. Original pull request: #2768

mp911de pushed a commit that referenced this issue Feb 26, 2024

Directly encode key/value to ByteBuf when codec knows exact byte si…

45b6ad3

…zes #2610 * adds benchmarks to show perf gains * about 10x improvement in perf, with no added gc overhead Original pull request: #2768

mp911de added a commit that referenced this issue Feb 26, 2024

Polishing #2610

0458b21

Reduce code duplications. Add exact optimization to ASCII StringCodec. Tweak Javadoc. Original pull request: #2768

mp911de removed the status: waiting-for-feedback label Feb 26, 2024

mp911de closed this as completed Feb 26, 2024
