
Fix Drain() infinite loop and add test for concurrent Next() calls #1525

Merged
merged 6 commits into nats-io:main from mdawar:drain-fix on Jan 15, 2024

Conversation

@mdawar (Contributor) commented on Jan 14, 2024

This is a possible fix for #1524.

Previously, when calling Stop(), we checked whether the done channel was closed to exit the loop in Next(). As stated in #1524, this didn't work for Drain(), because Drain() also closes the done channel, so we now check whether the msgs channel is closed to exit the loop. This required changing how locking was done to prevent a deadlock: Next() was holding the lock until it returned (which might take a long time), and that prevented cleanup() from executing (it was waiting for the lock) to unsubscribe, which is where the msgs channel is closed.
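For context, here's a minimal, self-contained sketch of the deadlock shape (simplified types, not the actual jetstream code):

```go
package sketch

import (
	"errors"
	"sync"
)

type sub struct {
	sync.Mutex
	msgs chan []byte
}

// Next holds the lock for its entire run, possibly blocking for a long time.
func (s *sub) Next() ([]byte, error) {
	s.Lock()
	defer s.Unlock()
	msg, ok := <-s.msgs // blocks until a message arrives or msgs is closed
	if !ok {
		return nil, errors.New("iterator closed")
	}
	return msg, nil
}

// cleanup is where msgs gets closed, but it waits on the same lock that
// Next() is holding above, so neither side can make progress: a deadlock.
func (s *sub) cleanup() {
	s.Lock()
	defer s.Unlock()
	close(s.msgs)
}
```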

Changes in this pull request:

  1. Added a test for graceful shutdown using Drain()
  2. Added a test for concurrent Next() calls with the StopAfter option (auto unsubscribe)
  3. Removed the lock from the cleanup() function
  4. Removed an unused drained channel from pullSubscription

A better solution might be to use multiple locks, for example one lock dedicated only to the subscription.

@piotrpio self-requested a review on January 14, 2024, 21:02
@piotrpio (Collaborator) left a comment:

Really good work here and, again, thank you for the detailed issue and for the PR! In my opinion there are some problems with your approach though, as explained in the comment below - please let me know what you think.

```diff
@@ -537,69 +535,79 @@ var (
 )

 func (s *pullSubscription) Next() (Msg, error) {
 	s.Lock()
```
@piotrpio (Collaborator) commented:

I don't think removing the lock from the entire Next() execution is a good idea. For example, it can cause problems with the StopAfter option, where executing Next() concurrently may lead to delivering more messages to the caller than specified by StopAfter.

Holding the lock for the duration of Next() is challenging, and you're absolutely right that we need a way to unlock it for cleanup - therefore, I believe catching the closure of s.done is necessary. Based on your branch, I came up with a different solution (it's a bit crude right now, as I just wanted to give an example; this would have to be cleaned up a bit): https://github.com/nats-io/nats.go/blob/fix-drain-in-messages/jetstream/pull.go#L537

Here's the gist of it:

  1. When s.done is closed, we unlock the mutex so that the subscription can be cleaned up properly and s.msgs can be closed. We need a way to conditionally unlock the mutex in a defer, thus the done bool (I really don't like that...).

  2. If we detect we are draining, we set done to true and continue. The next iterations of the loop check the state of done and, if it's set, go to a select statement that does not listen on s.done. The two select statements are identical except for whether we have case <-s.done.

I extracted handleIncomingMessage() and handleError() methods to make it a bit more readable, but for now they're just copy-pasted from the select. A condensed sketch of the approach follows below.
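A condensed sketch of that flow (simplified from the linked branch; names and error handling are pared down):

```go
func (s *pullSubscription) Next() (Msg, error) {
	s.Lock()
	done := false
	defer func() {
		// Conditional unlock: once draining starts we have already
		// unlocked, so the defer must not unlock a second time.
		if !done {
			s.Unlock()
		}
	}()
	for {
		if done {
			// Same as the select below, minus the <-s.done case:
			// deliver what's left until cleanup() closes s.msgs.
			msg, ok := <-s.msgs
			if !ok {
				return nil, ErrMsgIteratorClosed
			}
			return msg, nil
		}
		select {
		case msg, ok := <-s.msgs:
			if !ok {
				return nil, ErrMsgIteratorClosed
			}
			return msg, nil
		case <-s.done:
			// Draining: release the lock so the subscription can be
			// cleaned up and s.msgs closed, then keep looping.
			done = true
			s.Unlock()
		}
	}
}
```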

@piotrpio (Collaborator) commented:

Alternatively, as you mentioned, we could use a separate lock just to make sure Next() cannot be executed concurrently.

@mdawar (Contributor, author) commented:

Sorry about that, I overlooked the StopAfter option. I think we should add a test to verify this behavior before we move on; the current test calls Next() sequentially.
After we add this test, we should be able to refactor the code without breaking things.

There are also more elegant solutions:

  1. Using a dedicated lock for the subscription (used in cleanup()) and another lock for the counter fields, or maybe for the rest of the fields
  2. Using atomic values for the counter fields (a rough sketch follows below)

Which solution do you prefer?
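For illustration, a rough sketch of option 2 (field and method names here are hypothetical, not from the codebase):

```go
import "sync/atomic"

// counters tracks delivered messages without the main lock, so enforcing
// StopAfter would not require holding the mutex across Next().
type counters struct {
	delivered atomic.Uint64 // incremented lock-free from Next()
	stopAfter uint64        // 0 means no limit
}

// tryDeliver reserves one delivery slot; callers stop once it returns false,
// so at most stopAfter messages are handed out even with concurrent callers.
func (c *counters) tryDeliver() bool {
	if c.stopAfter == 0 {
		return true
	}
	return c.delivered.Add(1) <= c.stopAfter
}
```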

@piotrpio (Collaborator) commented:

I think using a separate lock for accessing the subscription would be the preferable solution, since I would be hesitant to allow concurrent Next() calls - for concurrency, the suggested approach would be to create a whole new MessagesContext() for the same consumer (see the sketch below). A separate lock sounds like it could actually simplify some things though, so that's nice.
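For reference, a sketch of that pattern, where cons is an assumed jetstream.Consumer and startWorkers is a hypothetical helper:

```go
import "github.com/nats-io/nats.go/jetstream"

// startWorkers gives each goroutine its own MessagesContext on the same
// consumer instead of sharing a single iterator across goroutines.
func startWorkers(cons jetstream.Consumer, n int) {
	for i := 0; i < n; i++ {
		go func() {
			it, err := cons.Messages()
			if err != nil {
				return
			}
			defer it.Stop()
			for {
				msg, err := it.Next()
				if err != nil {
					return // e.g. ErrMsgIteratorClosed after Stop() or Drain()
				}
				_ = msg.Ack()
			}
		}()
	}
}
```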

Do you have time and would like to tackle this? Or should I take over?

@mdawar (Contributor, author) commented:

@piotrpio Yes, you're right, this is a better solution.
I started working on adding a test to verify concurrent Next() calls, so that any later changes won't break this behavior.
I'll see what I can do, and feel free to do whatever works best on your end.
I'll keep you updated with what I come up with.

@mdawar (Contributor, author) commented:

I think that in cleanup() there's no risk of race conditions, so it might be OK without holding the lock.
It only reads the subscription field, which is set by the methods that create the pullSubscription struct (pullConsumer.Consume, pullConsumer.Messages and pullConsumer.fetch), and the actual Subscription has its own mutex.

But I don't know whether this is acceptable with regard to future changes to the code that might introduce race conditions. A sketch of what this could look like follows below.
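For example, a lock-free cleanup() with the rationale recorded inline might look like this (a simplification, not the merged code verbatim):

```go
func (s *pullSubscription) cleanup() {
	// No locking needed here: s.subscription is only set by the methods
	// that create the pullSubscription (pullConsumer.Consume,
	// pullConsumer.Messages, pullConsumer.fetch), and *nats.Subscription
	// guards its own state with its own mutex.
	if s.subscription == nil || !s.subscription.IsValid() {
		return
	}
	// Unsubscribing tears down delivery; per the PR description this is
	// where s.msgs ends up closed, letting Next() exit its loop.
	_ = s.subscription.Unsubscribe()
}
```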

@piotrpio (Collaborator) commented:

If it's OK right now (and it looks like it is) and it does not produce a race, I think we can try to go without the lock - if we need locking mechanisms in the future, we can always add them. Just please (if you're working on it) add an appropriate comment on why the lock is not needed.

Thank you again for your contribution, it's extremely valuable.

@mdawar (Contributor, author) commented:

OK, I will add the comment right now.

@mdawar changed the title from "Fix Drain() infinite loop and minimize lock scope in Next()" to "Fix Drain() infinite loop and add test for concurrent Next() calls" on Jan 15, 2024
@piotrpio (Collaborator) left a comment:

That looks great, LGTM!

@piotrpio merged commit a8a8d18 into nats-io:main on Jan 15, 2024
1 check passed
@mdawar deleted the drain-fix branch on January 16, 2024, 07:33
@piotrpio mentioned this pull request on Feb 14, 2024