fix: enforce strong consistency in channel_store #4740

Merged
kostasrim merged 2 commits into main from kpr2
Mar 10, 2025

Contributor

@kostasrim kostasrim commented Mar 10, 2025

The channel store uses read-copy-update (RCU) to distribute its changes to all proactors. The problem is that each proactor loads the new channel store pointer with memory_order_relaxed, which *does not guarantee* that it fetches the latest value of the channel store. Hence, the fix is to use sequential consistency to force fetching the latest value of the channel store. Should fix #4724

Fixes #4659 and #4724
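
For readers unfamiliar with the pattern, here is a minimal sketch of RCU-style pointer publication with explicit memory orders. It is not Dragonfly's code: `Store`, `most_recent`, `Publish`, `Read`, and `RunOnce` are hypothetical names, and the sketch assumes a release store paired with an acquire load. The default `memory_order_seq_cst` is stronger and includes these guarantees.

```cpp
#include <atomic>
#include <cassert>
#include <string>
#include <thread>

// Hypothetical stand-in for the channel store; not Dragonfly's actual type.
struct Store {
  std::string channel;
  int version = 0;
};

// Hypothetical control block: points at the most recently published store.
std::atomic<const Store*> most_recent{nullptr};

// Writer side: the release store makes every write to *next that happened
// before this call visible to a reader that observes the new pointer.
void Publish(const Store* next) {
  most_recent.store(next, std::memory_order_release);
}

// Reader side: the acquire load pairs with the release store in Publish.
const Store* Read() {
  return most_recent.load(std::memory_order_acquire);
}

// Runs one writer and one reader thread; returns the version the reader saw.
int RunOnce() {
  Store next;  // outlives both threads because they are joined below
  int seen = 0;
  std::thread reader([&seen] {
    const Store* s = Read();
    while (s == nullptr) {  // spin until the writer has published
      std::this_thread::yield();
      s = Read();
    }
    // Safe to read the fields: acquire/release ordered them before the pointer.
    assert(s->channel == "kostas");
    seen = s->version;
  });
  std::thread writer([&next] {
    next.channel = "kostas";
    next.version = 1;
    Publish(&next);
  });
  writer.join();
  reader.join();
  Publish(nullptr);  // `next` is about to go out of scope
  return seen;
}
```

With a relaxed load in `Read`, the reader could still observe the new pointer, but nothing would order the writes to `channel` and `version` before it, so reading those fields would be a data race.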

@kostasrim kostasrim requested a review from adiholden March 10, 2025 11:34
@kostasrim kostasrim self-assigned this Mar 10, 2025
@kostasrim kostasrim requested a review from mkaruza March 10, 2025 11:34
@@ -152,7 +153,7 @@ unsigned ChannelStore::SendMessages(std::string_view channel, facade::ArgRange m
it++;
}
};
- shard_set->pool()->DispatchBrief(std::move(cb));
+ shard_set->pool()->AwaitBrief(std::move(cb));
Collaborator

This function is called inside DbSlice::DeleteExpiredStep, which we assume does not preempt, but this change breaks that assumption.

Contributor Author

🤔 Then we can:

  1. Use DispatchBrief only when called by DeleteExpiredStep
  2. Ditch strong consistency when sending messages via publish and add a sleep to the test

I would rather opt for (1).

Other than that good catch!

Wdyt ? @adiholden

Collaborator

I think we should add a sleep to the test case. From what I understand, publish does not guarantee that the message was sent to all subscribers, only that it was queued to be sent.
Also, when sending the message we use conn->SendPubMessageAsync, which I think also does not guarantee that the message has already been sent by the time AwaitBrief returns here.

Contributor Author

Sounds good
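
To make the distinction discussed in this thread concrete, here is a toy, single-threaded model of a fire-and-forget dispatch versus a blocking await. It is only a sketch: `ToyProactor`, `Dispatch`, `Await`, `RunOne`, and the two helper functions are hypothetical and do not reflect the real DispatchBrief/AwaitBrief implementation.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical toy queue standing in for one proactor's task queue.
class ToyProactor {
 public:
  // Fire-and-forget: enqueue the callback and return without running it.
  void Dispatch(std::function<void()> cb) { queue_.push_back(std::move(cb)); }

  // Blocking: enqueue the callback, then drain the queue until it has run.
  void Await(std::function<void()> cb) {
    bool done = false;
    queue_.push_back([&done, cb = std::move(cb)] {
      cb();
      done = true;
    });
    while (!done) RunOne();
    assert(done);
  }

  // Runs the oldest pending callback, if any.
  void RunOne() {
    if (queue_.empty()) return;
    std::function<void()> cb = std::move(queue_.front());
    queue_.pop_front();
    cb();
  }

 private:
  std::deque<std::function<void()>> queue_;
};

// Number of callbacks that had run at the moment Dispatch returned.
int SentWhenDispatchReturns() {
  ToyProactor p;
  int sent = 0;
  p.Dispatch([&sent] { ++sent; });
  int observed = sent;  // the callback is still sitting in the queue
  p.RunOne();           // it only runs later
  return observed;
}

// Number of callbacks that had run at the moment Await returned.
int SentWhenAwaitReturns() {
  ToyProactor p;
  int sent = 0;
  p.Await([&sent] { ++sent; });
  return sent;
}
```

Even in the awaiting variant, the caller only learns that the callback ran. If the callback itself merely queues an asynchronous send, as noted above for conn->SendPubMessageAsync, the message may still not have left the server.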

@@ -124,8 +124,9 @@ unsigned ChannelStore::SendMessages(std::string_view channel, facade::ArgRange m

// Make sure none of the threads publish buffer limits is reached. We don't reserve memory ahead
// and don't prevent the buffer from possibly filling, but the approach is good enough for
// limiting fast producers. Most importantly, we can use DispatchBrief below as we block here
Collaborator

I actually don't understand the comment that was removed. What does "we block here" mean?

Contributor Author

I removed it because I replaced DispatchBrief with AwaitBrief

Contributor Author

But ignore that; what is important is to get the semantics right. I will adjust the comment.

- ChannelStore::control_block.most_recent.load(memory_order_relaxed));
+ // Do not use memory_order_relaxed, we need to fetch the latest value of
+ // the control block
+ ChannelStore::control_block.most_recent.load());
Collaborator

Please specify explicitly the memory order you want to use.
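
As a small illustration of this request: `std::atomic<T>::load()` defaults its argument to `std::memory_order_seq_cst`, so the two functions below behave identically, but the second states the intent at the call site. `latest_version`, `LoadImplicit`, `LoadExplicit`, and `Demo` are hypothetical names, not part of the patch.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for ChannelStore::control_block.most_recent.
std::atomic<int> latest_version{0};

// Implicit: the default argument of load() is std::memory_order_seq_cst.
int LoadImplicit() { return latest_version.load(); }

// Explicit: same ordering, but visible to anyone reading the call site.
int LoadExplicit() { return latest_version.load(std::memory_order_seq_cst); }

// Both spellings observe the same value.
void Demo() {
  latest_version.store(7, std::memory_order_seq_cst);
  assert(LoadImplicit() == LoadExplicit());
}
```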

@kostasrim kostasrim requested a review from adiholden March 10, 2025 13:37
Collaborator

romange commented Mar 10, 2025

I am pretty sure we used DispatchBrief by design. What is a strongly consistent publish? Is Redis publish strongly consistent?
A server cannot know when a client will read its notifications, so how do you define it?

@@ -2970,6 +2969,9 @@ async def test_cluster_sharded_pub_sub(df_factory: DflyInstanceFactory):
consumer.ssubscribe("kostas")

await c_nodes[0].execute_command("SPUBLISH kostas hello")
# We need to sleep because we use DispatchBrief internally. Otherwise we can't really guarantee
Collaborator

You cannot guarantee it in any case. How do you know that the message arrived at the client? Maybe they are in Australia?

Contributor Author

You don't. My point was that after the client gets a reply to a published message, we know Dragonfly processed and sent all the messages (even if there were no subscribers). I agree it's overkill; a sleep is more than enough here.

@kostasrim
Contributor Author

> I am pretty sure we used DispatchBrief by design. What is a strongly consistent publish? Is Redis publish strongly consistent? A server cannot know when a client will read its notifications, so how do you define it?

What I mean by "strongly consistent" is that after I get an OK from publish, I know Dragonfly has processed this message and sent it to all the subscribers, which is something we can't guarantee now. So it was not about "guaranteeing delivery" but more about "guaranteeing that Dragonfly sent the messages to all the subscribers, if any".

@kostasrim kostasrim merged commit 2ff8603 into main Mar 10, 2025
25 checks passed
@kostasrim kostasrim deleted the kpr2 branch March 10, 2025 15:50
Successfully merging this pull request may close these issues.

test_cluster_sharded_pubsub_shard_commands failed
test_cluster_sharded_pub_sub failed
3 participants