Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: race condition in ReplicaOfInternal #4653

Merged
merged 6 commits into from
Feb 26, 2025
Merged

fix: race condition in ReplicaOfInternal #4653

merged 6 commits into from
Feb 26, 2025

Conversation

kostasrim
Copy link
Contributor

@kostasrim kostasrim commented Feb 24, 2025

#4636 introduced a small race . The problem is that Start() might have failed (because a subsequent ReplicaOf command cancelled it). ReplicaOfInternal would ignore the ec and start the replication fiber via replica_->StartMainReplicationFiber but the second ReplicaOf command that followed updated replica_ so we tried to start replication twice on the same replica instance triggering a check failed that the fiber is running.

We did not have that problem before because we called new_replica_->Start() which would start replication. Even if the replication got cancelled because of a second ReplicaOf command, it would be fine because the MainReplicationFiber would start twice on two separate Replica instances and one of them would cancel right away becauseR_CANCELLED was not set (since Stop() explicitly set it by cancelling the context)

@kostasrim kostasrim self-assigned this Feb 24, 2025
Signed-off-by: kostas <kostas@dragonflydb.io>
@@ -2845,7 +2851,7 @@ void ServerFamily::ReplicaOfInternal(CmdArgList args, Transaction* tx, SinkReply
}

if (on_err == ActionOnConnectionFail::kReturnOnError) {
replica_->StartMainReplicationFiber(builder);
new_replica->StartMainReplicationFiber(builder);
Copy link
Contributor Author

@kostasrim kostasrim Feb 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the FIX. Same replica_ instance was used by two fibers and we ended up starting the main replication fiber twice. The rest of the changes above are an early exit to avoid creating the fiber all together in case of an error or if the context got cancelled. This small optimization is not really needed even in the presence of parallel ReplicaOf commands because MainReplicationFiber exits early if the context is cancelled (that's why we did not need to check for ec before my changes) -- so as long as this replication fiber is part of the correct instance -- cancellation would take care of the rest.

// However, it could be the case that Start() above connected succefully and by the time
// we acquire the lock, the context got cancelled because another ReplicaOf command
// executed and acquired the replicaof_mu_ before us.
if (ec || new_replica->IsContextCancelled()) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small optimization. We do not need to start the replication fiber if we got an error or if the context is cancelled.

This would be handled anyway by new_replica->StartMainReplicationFiber which would spawn the MainReplicationFiber and then exit because it got cancelled. But yeah, no reason to do that

@kostasrim
Copy link
Contributor Author

kostasrim commented Feb 24, 2025

@@ -72,6 +72,10 @@ class Replica : ProtocolClient {

std::error_code TakeOver(std::string_view timeout, bool save_flag);

bool IsContextCancelled() const {
return !cntx_.IsRunning();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you really need such a function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can just start the replication fiber and let it cancel. I wanted to avoid that alltogether so I added this. As I explained, this is not part of the fix but an "early optimization" (which is kinda funny because fiber cost is insignificant and this path is non critical)

// we acquire the lock, the context got cancelled because another ReplicaOf command
// executed and acquired the replicaof_mu_ before us.
if (ec || new_replica->IsContextCancelled()) {
if (replica_ == new_replica) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if replica != new_replica for the first command and the second command was cancelled. In this case we will not have replica at all and botjh commands are failed. Maybe lets discuss it with @adiholden

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then we should not have a replica! The second command cancelled the previously then it cancelled itself ? Isn't that what you are describing or am I missing something ?

// we acquire the lock, the context got cancelled because another ReplicaOf command
// executed and acquired the replicaof_mu_ before us.
if (ec || new_replica->IsContextCancelled()) {
if (replica_ == new_replica) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how can replica_ == new_replica if new_replica->IsContextCancelled()?
when we cancel replication (from replica of command)
we take replicaof mutex
call replica_->Stop
and set replica_ var
So from what I understadn we can not get to the condition replica_ == new_replica and new_replica->IsContextCancelled()

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another comment here.
I think that if replica was canceld and ec is not error you do not return response to client.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So from what I understadn we can not get to the condition replica_ == new_replica and new_replica->IsContextCancelled()

Yes that's true and it is not an issue. The IsContextCancelled() is ored with ec because ec might be 0 (meaning success) but the context got cancelled. For that case we just return. However, if ec has an error and replica == new_replica then it means we need to cancel everything alltogether (there is no other replica of command and we just cancelled ourselves)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for the missing response

@@ -2845,7 +2851,7 @@ void ServerFamily::ReplicaOfInternal(CmdArgList args, Transaction* tx, SinkReply
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be more clear I think if you enter the Drakarys call under if (on_err == ActionOnConnectionFail::kReturnOnError)instead of if (tx)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not want to change the semantics. I will update on this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep +1 for the suggestion. This also removes tx alltogether as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh Drakarys needs, nvm

comments
Signed-off-by: kostas <kostas@dragonflydb.io>
@kostasrim kostasrim requested a review from adiholden February 25, 2025 13:01
SetMasterFlagOnAllThreads(true);
replica_.reset();
}
// SendError if we got cancelled and Start() above succeeded. Otherwise, if ec,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is confusing. Please send the error from one place here. lets not send it in start but only here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to write like this
if (ec && replica_ == new_replica) {
service_.SwitchState(GlobalState::LOADING, GlobalState::ACTIVE);
SetMasterFlagOnAllThreads(true);
replica_.reset();
}
if (ec || cancelled) {
builder->SendError
return;
} else {
builder->SendOk();
}
if (on_err == ActionOnConnectionFail::kReturnOnError) {
Drakarys(tx, DbSlice::kDbAll);
new_replica->StartMainReplicationFiber(builder);
}

if (on_err == ActionOnConnectionFail::kReturnOnError) {
replica_->StartMainReplicationFiber(builder);
Drakarys(tx, DbSlice::kDbAll);
new_replica->StartMainReplicationFiber(builder);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to send the ok also here and not inside StartMainReplicationFiber

@kostasrim kostasrim requested a review from adiholden February 26, 2025 08:24
VLOG(1) << "Starting replication";
ProactorBase* mythread = ProactorBase::me();
CHECK(mythread);

auto check_connection_error = [this, builder](error_code ec, const char* msg) -> error_code {
auto check_connection_error = [this](error_code ec, const char* msg) -> GenericError {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can RETURN_ON_ERR work on GenericError?

Copy link
Contributor Author

@kostasrim kostasrim Feb 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generic error is implicitly convertible to error code. However RETURN_ON_ERR does this:

GenericError -> ErrorCode then RETURN_ON_ERR returns that ErrorCode, which is then converted back to GenericError.

This looses the details part of the GenericError. I will fix this.

I am SAVING my rant around implicit conversions.

cntx_.ReportCancelError();
return {{}, absl::StrCat(msg, ec.message())};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am confused how this is working.. why shouldnt this be return {ec, absl::StrCat(msg, ec.message())}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look how GenericError::Format is implemented. The above expression is turned to absl::StrCat(msg, ec.message() when formatted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also we can construct GenericError directly from string. We can omit the first argument {} because we have that constructor. I will take care of this as well

cntx_.ReportCancelError();
return {absl::StrCat(msg, ec.message())};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the GenericError does this concatenation for us you can do
return {ec, msg}

Copy link
Contributor Author

@kostasrim kostasrim Feb 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not do the same concatenation.

330 std::string GenericError::Format() const {
331   if (!ec_ && details_.empty())
332     return "";
333 
334   if (details_.empty())
335     return ec_.message();
336   else if (!ec_)
337     return details_;
338   else
339     return absl::StrCat(ec_.message(), ": ", details_);
340 }

ec is true so absl::StrCat(ec_.message(), ": ", details_); is not identical to {absl::StrCat(msg, ec.message())};

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we care about the order here? does this realy needs a different implementaion?

Copy link
Contributor Author

@kostasrim kostasrim Feb 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would need to check but it's irrelevant. At the end of the day we are discussing here return {ec, msg} vs return {msg}. Either way is fine and it's not important

@@ -31,6 +31,14 @@ using facade::kWrongTypeErr;

#define RETURN_ON_ERR(x) RETURN_ON_ERR_T(std::error_code, x)

#define RETURN_ON_GENERIC_ERR(x) \
Copy link
Contributor Author

@kostasrim kostasrim Feb 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adiholden

I had to, there is no really good solution otherwise. Doing something RETURN_ON_ERR_T(GenericError, x) won't work. Yes I tried modifying the micro above by converting the statement:

std::error_code __ec = (x); ----> T __ex = (x) but a) this is not properly being parsed and we get compiler errors, b) we also do an extra copy in that expression which is fine for error code because it's trivially copyable but not for Generic error.

@kostasrim kostasrim requested a review from adiholden February 26, 2025 11:06
@kostasrim kostasrim merged commit 8512f72 into main Feb 26, 2025
10 checks passed
@kostasrim kostasrim deleted the kpr18 branch February 26, 2025 12:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants