[fix] [issue 1051]: Fix inaccurate producer mem limit in chunking and schema #1055

Merged
merged 5 commits into apache:master on Jul 20, 2023

Conversation

Gleiphir2769
Contributor

@Gleiphir2769 Gleiphir2769 commented Jul 11, 2023

Fixes #1051

Motivation

The producer memory limit has some problems when EnableChunking=true or Schema is set.

  • When Schema is set, the actual message payload is msg.Value. len(msg.Payload) may be 0, so memory cannot be reserved accurately.

    uncompressedPayload := msg.Payload
    uncompressedPayloadSize := int64(len(uncompressedPayload))
    var schemaPayload []byte
    var err error
    if msg.Value != nil && msg.Payload != nil {
        p.log.Error("Can not set Value and Payload both")
        runCallback(request.callback, nil, request.msg, errors.New("can not set Value and Payload both"))
        return
    }
    // The block chan must be closed when returned with exception
    defer request.stopBlock()
    if !p.canAddToQueue(request, uncompressedPayloadSize) {
        return
    }

  • In chunking, if the producer hits the memory limit, it should release the memory for the chunks that have already been sent out. But the calculation for this release is not accurate: it should be uncompressedPayloadSize - int64(lhs) instead of uncompressedPayloadSize - int64(rhs)

    if chunkID != 0 && !p.canAddToQueue(nsr, 0) {
        p.releaseSemaphoreAndMem(uncompressedPayloadSize - int64(rhs))
        return
    }

  • In chunking, if internalSingleSend fails, it should release the memory for that single chunk. But currently we release the memory for all the chunks repeatedly.

    if err != nil {
        runCallback(request.callback, nil, request.msg, err)
        p.releaseSemaphoreAndMem(int64(len(msg.Payload)))
        p.log.WithError(err).Errorf("Single message serialize failed %s", msg.Value)
        return
    }

  • When the producer receives the receipt from the broker, it should release the memory it reserved before sending. But it releases the wrong size when chunking or a schema is used (a consolidated sketch of the corrected accounting follows this list).

    if sr.msg != nil {
        atomic.StoreInt64(&p.lastSequenceID, int64(pi.sequenceID))
        p.releaseSemaphoreAndMem(int64(len(sr.msg.Payload)))
        p.metrics.PublishLatency.Observe(float64(now-sr.publishTime.UnixNano()) / 1.0e9)
        p.metrics.MessagesPublished.Inc()
        p.metrics.MessagesPending.Dec()
        payloadSize := float64(len(sr.msg.Payload))
        p.metrics.BytesPublished.Add(payloadSize)
        p.metrics.BytesPending.Sub(payloadSize)
    }
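
To make the intent of the four fixes concrete, here is a consolidated sketch of the corrected accounting (not the merged patch; the use of the producer's schema, the lhs boundary, and the reservedMem field are illustrative):

    // (1) With a schema, msg.Payload may be nil; size the reservation from the
    //     schema-encoded value instead of len(msg.Payload).
    uncompressedPayload := msg.Payload
    if uncompressedPayload == nil && msg.Value != nil {
        encoded, err := p.options.Schema.Encode(msg.Value) // assumes the producer's schema is used
        if err != nil {
            runCallback(request.callback, nil, request.msg, err)
            return
        }
        uncompressedPayload = encoded
    }
    uncompressedPayloadSize := int64(len(uncompressedPayload))
    if !p.canReserveMem(request, uncompressedPayloadSize) {
        return
    }

    // (2) If a chunked send stops at the current chunk's left boundary lhs,
    //     release only the part that was never handed to the broker.
    p.releaseSemaphoreAndMem(uncompressedPayloadSize - int64(lhs))

    // (3) In the receipt handler, release exactly what was reserved rather than
    //     len(sr.msg.Payload), which is 0 for schema messages and covers only
    //     one chunk of a chunked message.
    p.releaseSemaphoreAndMem(sr.reservedMem) // reservedMem: hypothetical field holding the reserved size

The common thread in all three places is symmetry: whatever size is reserved before sending is exactly the size released afterwards.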

Modifications

  • Fix all the memory limit problems related to chunking and schema
  • Add unit tests to cover these scenarios

Verifying this change

  • Make sure that the change passes the CI checks.

Does this pull request potentially affect one of the following parts:

If yes was chosen, please highlight the changes

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API: (yes / no)
  • The schema: (yes / no / don't know)
  • The default values of configurations: (yes / no)
  • The wire protocol: (yes / no)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / GoDocs / not documented)
  • If a feature is not applicable for documentation, explain why?
  • If a feature is not documented yet in this PR, please create a followup issue for adding the documentation

@Gleiphir2769 Gleiphir2769 marked this pull request as draft July 11, 2023 15:35
@gunli
Contributor

gunli commented Jul 12, 2023

@Gleiphir2769 Great job! Actually, I am preparing to fix this issue and #1043 and to refactor the callback/releaseSemaphore/releaseMemory/metrics logic together; it will be a BIG PR. I am busy these days, so if you have time, you can work on it together with me.

My idea is:

  1. Calculate the required resources (semaphore/memory; when chunking, more than one semaphore; we cache the compressedPayload/meta in the sendRequest when calculating) before we put a request into the dataChan. If there are not enough resources, fail fast; this way, we can delete the sendRequest.blockCh field and no longer need to block (see the sketch after this list);
  2. Add a sendRequest.done() method; when a request is done (succeeded or failed), call it to release the resources the request holds, run the callback, report metrics, and write debug logs. This way, we manage the resources and logic in one place and don't have to do these things across the whole file.
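
A rough sketch of point 1, assuming the compressed size (and therefore the chunk count) can be computed in the caller's goroutine; tryReserve and totalChunks are illustrative names rather than existing APIs:

    // fail fast before entering dataChan: work out how many chunks are needed
    // and try to take all the semaphores and memory at once
    totalChunks := 1
    if p.options.EnableChunking && compressedSize > maxMessageSize {
        totalChunks = (compressedSize + maxMessageSize - 1) / maxMessageSize // ceiling division
    }
    // tryReserve is a hypothetical helper wrapping the publish semaphore and the
    // client-wide memory limit; it either reserves everything or nothing
    if !p.tryReserve(totalChunks, int64(len(uncompressedPayload))) {
        runCallback(sr.callback, nil, sr.msg, errors.New("client resources exhausted"))
        return
    }
    sr.totalChunks = totalChunks // cached on the request so the release is symmetric (hypothetical field)
    p.dataChan <- sr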

@Gleiphir2769
Contributor Author

Gleiphir2769 commented Jul 12, 2023

Hi @gunli.

Calculate the required resources before we put a request into the dataChan

Before chunking was introduced, the semaphore was acquired before entering the dataChan. I moved it from internalSendAsync to internalSend because chunking needs to get maxMessageSize by asking the broker.

maxMessageSize := int(p._getConn().GetMaxMessageSize())
// compress payload if not batching

If we can get it before internalSend, we can reserve resources in an easier way.

Add a sendRequest.done() method, when a request is done (succeed or failed), call it

Sounds great. It's a bit difficult to understand sendRequest.callback() now.

And I think we can fix these bugs first. The refactoring work can be done in parallel. What do you think?

@gunli
Contributor

gunli commented Jul 12, 2023

Hi @Gleiphir2769

If we can get it before internalSend, we can reserve resources in an easier way.

I think it is not a problem. You can check the code of connectionPool.GetConnection()/conn.waitUntilReady()/conn.doHandshake: when we get a conn from the connection pool, the conn is ready, and MaxMessageSize is cached in the connection. And in newPartitionProducer, when a producer is created, the conn is ready, so it is safe to get the conn, and GetMaxMessageSize() is safe too.

It's a bit difficult to understand sendRequest.callback() now

sendRequest.callback is just a field that stores the callback function; in sendRequest.done(), we call this callback, something like this:

func (sr *sendRequest) done(id *MessageID, err error) {
	if sr.semaphore != nil {
		sr.semaphore.Release()
	}
	sr.memLimit.Release()
	runCallback(sr.callback, id, err)
	metrics.IncXXX()
	// write debug/error logs
	// ...
}

In any other place where the request is done, we just call request.done(); there is no need to care about resources/callbacks/metrics/debug logs there, and the code will be clearer.
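
For example, the failure path of internalSingleSend quoted in the description would then shrink to something like this (illustrative only):

    if err != nil {
        // done() releases the semaphore and the reserved memory, runs the
        // callback, reports metrics and writes the error log in one place
        request.done(nil, err)
        return
    }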

And I think we can fix these bugs first. The refactoring work can be done in parallel. What do you think?

I agree with that.

@Gleiphir2769 Gleiphir2769 marked this pull request as ready for review July 12, 2023 17:11
@Gleiphir2769
Contributor Author

Gleiphir2769 commented Jul 12, 2023

Could you give a review? @gunli @RobertIndie @shibd

Please feel free to leave your comment, thanks!

@@ -489,14 +488,13 @@ func (p *partitionProducer) internalSend(request *sendRequest) {

// The block chan must be closed when returned with exception
defer request.stopBlock()
-if !p.canAddToQueue(request, uncompressedPayloadSize) {
+if !p.canAddToQueue(request) {
Contributor

@gunli gunli Jul 13, 2023


This forgets to runCallback; it would be better to add a debug log here too.

Contributor Author


runCallback is invoked by canReserveMem if it fails.

@@ -542,6 +538,11 @@ func (p *partitionProducer) internalSend(request *sendRequest) {

uncompressedSize := len(uncompressedPayload)

+// try to reserve memory for uncompressedPayload
+if !p.canReserveMem(request, int64(uncompressedSize)) {
+	return
Contributor

@gunli gunli Jul 13, 2023


This forgets to release the semaphore and to runCallback; it would be better to add a debug log here too.

Contributor Author


The semaphore is released by canReserveMem if it fails, and runCallback is invoked there as well.

Contributor Author


And I think having the semaphore released by canReserveMem may not be a good idea. It forces canReserveMem to always be invoked after canAddToQueue. What do you think?

Contributor


I think so. Doing runCallback/releaseSemaphore/releaseMemory inside canAddToQueue and canReserveMem violates the Single Responsibility Principle.

Contributor


The final solution is to encapsulate the resource-releasing logic in request.done(); anywhere there is an error, just call request.done().

Member


I'm +1 for moving the semaphore release out of canReserveMem. It's better that we release it here rather than in canReserveMem until we find a good solution for it.
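
For illustration, the decoupled shape discussed in this thread would keep each helper as a plain yes/no check and let the call site undo what it acquired, roughly like this (the error messages and the direct publishSemaphore call are illustrative, not the merged code):

    if !p.canAddToQueue(request) {
        runCallback(request.callback, nil, request.msg, errors.New("producer send queue is full"))
        return
    }
    if !p.canReserveMem(request, int64(uncompressedSize)) {
        // undo only what this call site acquired above: the queue slot
        p.publishSemaphore.Release()
        runCallback(request.callback, nil, request.msg, errors.New("client memory buffer is full"))
        return
    }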

@gunli
Contributor

gunli commented Jul 13, 2023

I think it is better to merge #1052 first and then rebase this PR after that, or it will conflict. @Gleiphir2769 @RobertIndie @shibd

Member

@RobertIndie RobertIndie left a comment


Good work! Overall looks good to me. Left some comments.

pulsar/producer_partition.go (outdated review thread, resolved)
assert.Error(t, err)
assert.ErrorContains(t, err, getResultStr(ClientMemoryBufferIsFull))

// wait until all the chunks have been released
Member


Would it be better to add producer.flush before retryAssert?

Contributor Author


Because DisableBatching=true, producer.flush is useless here and causes a panic.

Member


producer.flush is also useful when batching is disabled and sendAsync is used. But I just found that we already send a message synchronously at line 2047, so we don't need to flush the producer here.

and causes a panic.

Why does it panic? That seems like unexpected behavior.
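
For context, the wait before the assertion amounts to polling until the reserved memory drops back to zero, roughly like this (a stand-in for the test's retryAssert helper; it assumes the test can read the client's memory-limit usage):

    // poll until the producer has released everything it reserved for the chunks
    deadline := time.Now().Add(5 * time.Second)
    for time.Now().Before(deadline) && c.(*client).memLimit.CurrentUsage() > 0 {
        time.Sleep(100 * time.Millisecond)
    }
    assert.Zero(t, c.(*client).memLimit.CurrentUsage())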

@RobertIndie
Member

Calculate the required resources (semaphore/memory; when chunking, more than one semaphore; we cache the compressedPayload/meta in the sendRequest when calculating) before we put a request into the dataChan. If there are not enough resources, fail fast; this way, we can delete the sendRequest.blockCh field and no longer need to block;

When chunking is enabled, we cannot get the total number of chunks before pushing the request to the dataChan. And there may be a deadlock issue similar to apache/pulsar#17446.

Add a sendRequest.done() method; when a request is done (succeeded or failed), call it to release the resources the request holds, run the callback, report metrics, and write debug logs. This way, we manage the resources and logic in one place and don't have to do these things across the whole file.

+1 for this. It's a good practice for managing the resources.

@Gleiphir2769
Contributor Author

Gleiphir2769 commented Jul 14, 2023

Please rerun the workflow for this PR, thanks! cc @RobertIndie

@gunli
Contributor

gunli commented Jul 14, 2023

When chunking is enabled, we cannot get the total number of chunks before pushing the request to the dataChan. And there may be a deadlock issue similar to apache/pulsar#17446.

@RobertIndie I think it is possible to do that; the trade-off is that we have to call schema.Encode() and p.compressionProvider.Compress() before entering the dataChan, which will affect the performance of the application's goroutine. Or, add another channel and goroutine to the producer (not partitionProducer) to do this preparation work.

@gunli
Contributor

gunli commented Jul 14, 2023

@RobertIndie Would you please review #1049? If it is OK, please merge it; this PR should be rebased and updated after that PR.

@RobertIndie
Member

I think it is possible to do that; the trade-off is that we have to call schema.Encode() and p.compressionProvider.Compress() before entering the dataChan, which will affect the performance of the application's goroutine. Or, add another channel and goroutine to the producer (not partitionProducer) to do this preparation work.

@gunli Yes. That would be a performance issue. But if we introduce another channel, we still need to wait for the channel and block the user goroutine. And it also introduces more complexity.

@gunli
Contributor

gunli commented Jul 14, 2023

Yes. That would be a performance issue.

@RobertIndie I checked the code of the compression Provider; there is a CompressMaxSize method in it, so I think the compression step can be avoided. I know little about schema, but if all the schemas can provide a method like that, everything can be done cleanly.

And I checked the code of the Java client's ProducerImpl.sendAsync(); it seems schema encoding and compression are done in the user's/application's thread.
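
For illustration, the estimate could work roughly like this (a sketch; it relies on CompressMaxSize returning a worst-case output size, so the chunk count can only be over-estimated, never under-estimated):

    // estimate the number of chunks without actually compressing the payload
    maxCompressed := p.compressionProvider.CompressMaxSize(len(uncompressedPayload))
    maxMessageSize := int(p._getConn().GetMaxMessageSize())
    estimatedChunks := (maxCompressed + maxMessageSize - 1) / maxMessageSize // ceiling division
    // reserving estimatedChunks semaphores up front over-reserves at worst;
    // the surplus can be released once the real compressed size is known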

@Gleiphir2769
Contributor Author

Ping @RobertIndie

@RobertIndie
Member

And I checked the code of the Java client's ProducerImpl.sendAsync(); it seems schema encoding and compression are done in the user's/application's thread.

Thanks, @gunli. I also found a bug related to this: #1057. The initial idea I came up with is to have the operation of pushing a message to the producer queue happen in the user thread, just like the Java client does. Let's move this discussion into that issue (or a new issue if it's not related).

@Gleiphir2769
Contributor Author

I have rebased this branch onto master. Please rerun the workflow. Thanks! @RobertIndie

@gunli
Contributor

gunli commented Jul 19, 2023

@RobertIndie Would you please merge the latest PRs #1051, #1057, and #1059? We are eager to do the remaining refactoring work after these PRs :)

@RobertIndie RobertIndie merged commit 28f61d2 into apache:master Jul 20, 2023
6 checks passed
RobertIndie pushed a commit that referenced this pull request Sep 7, 2023
… schema (#1055)

### Motivation

The producer memory limit has some problems when `EnableChunking=true` or `Schema` is set.
- When `Schema` is set, the actual message payload is `msg.Value`. `len(msg.Payload)` may be 0, so memory cannot be reserved accurately.
https://github.com/apache/pulsar-client-go/blob/be3574019383ac0cdc65fec63e422fcfd6c82e4b/pulsar/producer_partition.go#L479-L494

- In chunking, if the producer hits the memory limit, it should release the memory for **the chunks that have already been sent out**. But the calculation for this release is not accurate: it should be `uncompressedPayloadSize - int64(lhs)` instead of `uncompressedPayloadSize - int64(rhs)`
https://github.com/apache/pulsar-client-go/blob/be3574019383ac0cdc65fec63e422fcfd6c82e4b/pulsar/producer_partition.go#L662-L664

- In chunking, if `internalSingleSend` fails, it should release the memory for **that single chunk**. But currently we release the memory for all the chunks repeatedly.
https://github.com/apache/pulsar-client-go/blob/be3574019383ac0cdc65fec63e422fcfd6c82e4b/pulsar/producer_partition.go#L838-L843

- When the producer receives the receipt from the broker, it should release the memory **it reserved before sending**. But it releases the wrong size when `chunking` or `schema` is used.
https://github.com/apache/pulsar-client-go/blob/be3574019383ac0cdc65fec63e422fcfd6c82e4b/pulsar/producer_partition.go#L1221-L1230

### Modifications

- Fix all the memory limit problems related to `chunking` and `schema`
- Add unit tests to cover these scenarios

---------

Co-authored-by: shenjiaqi.2769 <shenjiaqi.2769@bytedance.com>
(cherry picked from commit 28f61d2)

Successfully merging this pull request may close these issues.

[Bug][Producer] Inaccurate producer memory limit issue in chunking and schema
3 participants