-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: race condition in ReplicaOfInternal #4653
Conversation
Signed-off-by: kostas <kostas@dragonflydb.io>
src/server/server_family.cc
Outdated
@@ -2845,7 +2851,7 @@ void ServerFamily::ReplicaOfInternal(CmdArgList args, Transaction* tx, SinkReply | |||
} | |||
|
|||
if (on_err == ActionOnConnectionFail::kReturnOnError) { | |||
replica_->StartMainReplicationFiber(builder); | |||
new_replica->StartMainReplicationFiber(builder); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the FIX. Same replica_
instance was used by two fibers and we ended up starting the main replication fiber twice. The rest of the changes above are an early exit to avoid creating the fiber all together in case of an error or if the context got cancelled. This small optimization is not really needed even in the presence of parallel ReplicaOf
commands because MainReplicationFiber exits early if the context is cancelled (that's why we did not need to check for ec before my changes) -- so as long as this replication fiber is part of the correct instance -- cancellation would take care of the rest.
src/server/server_family.cc
Outdated
// However, it could be the case that Start() above connected succefully and by the time | ||
// we acquire the lock, the context got cancelled because another ReplicaOf command | ||
// executed and acquired the replicaof_mu_ before us. | ||
if (ec || new_replica->IsContextCancelled()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small optimization. We do not need to start the replication fiber if we got an error or if the context is cancelled.
This would be handled anyway by new_replica->StartMainReplicationFiber
which would spawn the MainReplicationFiber and then exit because it got cancelled. But yeah, no reason to do that
@@ -72,6 +72,10 @@ class Replica : ProtocolClient { | |||
|
|||
std::error_code TakeOver(std::string_view timeout, bool save_flag); | |||
|
|||
bool IsContextCancelled() const { | |||
return !cntx_.IsRunning(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you really need such a function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can just start the replication fiber and let it cancel. I wanted to avoid that alltogether so I added this. As I explained, this is not part of the fix but an "early optimization" (which is kinda funny because fiber cost is insignificant and this path is non critical)
// we acquire the lock, the context got cancelled because another ReplicaOf command | ||
// executed and acquired the replicaof_mu_ before us. | ||
if (ec || new_replica->IsContextCancelled()) { | ||
if (replica_ == new_replica) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if replica != new_replica for the first command and the second command was cancelled. In this case we will not have replica at all and botjh commands are failed. Maybe lets discuss it with @adiholden
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then we should not have a replica! The second command cancelled the previously then it cancelled itself ? Isn't that what you are describing or am I missing something ?
// we acquire the lock, the context got cancelled because another ReplicaOf command | ||
// executed and acquired the replicaof_mu_ before us. | ||
if (ec || new_replica->IsContextCancelled()) { | ||
if (replica_ == new_replica) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how can replica_ == new_replica if new_replica->IsContextCancelled()?
when we cancel replication (from replica of command)
we take replicaof mutex
call replica_->Stop
and set replica_ var
So from what I understadn we can not get to the condition replica_ == new_replica and new_replica->IsContextCancelled()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another comment here.
I think that if replica was canceld and ec is not error you do not return response to client.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So from what I understadn we can not get to the condition replica_ == new_replica and new_replica->IsContextCancelled()
Yes that's true and it is not an issue. The IsContextCancelled()
is ored
with ec
because ec
might be 0
(meaning success) but the context got cancelled
. For that case we just return. However, if ec
has an error and replica == new_replica
then it means we need to cancel everything alltogether (there is no other replica of command and we just cancelled ourselves)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for the missing response
src/server/server_family.cc
Outdated
@@ -2845,7 +2851,7 @@ void ServerFamily::ReplicaOfInternal(CmdArgList args, Transaction* tx, SinkReply | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will be more clear I think if you enter the Drakarys call under if (on_err == ActionOnConnectionFail::kReturnOnError)instead of if (tx)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not want to change the semantics. I will update on this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep +1 for the suggestion. This also removes tx
alltogether as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh Drakarys needs, nvm
src/server/server_family.cc
Outdated
SetMasterFlagOnAllThreads(true); | ||
replica_.reset(); | ||
} | ||
// SendError if we got cancelled and Start() above succeeded. Otherwise, if ec, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is confusing. Please send the error from one place here. lets not send it in start but only here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest to write like this
if (ec && replica_ == new_replica) {
service_.SwitchState(GlobalState::LOADING, GlobalState::ACTIVE);
SetMasterFlagOnAllThreads(true);
replica_.reset();
}
if (ec || cancelled) {
builder->SendError
return;
} else {
builder->SendOk();
}
if (on_err == ActionOnConnectionFail::kReturnOnError) {
Drakarys(tx, DbSlice::kDbAll);
new_replica->StartMainReplicationFiber(builder);
}
src/server/server_family.cc
Outdated
if (on_err == ActionOnConnectionFail::kReturnOnError) { | ||
replica_->StartMainReplicationFiber(builder); | ||
Drakarys(tx, DbSlice::kDbAll); | ||
new_replica->StartMainReplicationFiber(builder); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest to send the ok also here and not inside StartMainReplicationFiber
VLOG(1) << "Starting replication"; | ||
ProactorBase* mythread = ProactorBase::me(); | ||
CHECK(mythread); | ||
|
||
auto check_connection_error = [this, builder](error_code ec, const char* msg) -> error_code { | ||
auto check_connection_error = [this](error_code ec, const char* msg) -> GenericError { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can RETURN_ON_ERR work on GenericError?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generic error is implicitly convertible to error code. However RETURN_ON_ERR does this:
GenericError -> ErrorCode
then RETURN_ON_ERR returns that ErrorCode, which is then converted back to GenericError.
This looses the details
part of the GenericError. I will fix this.
I am SAVING my rant around implicit conversions
.
src/server/replica.cc
Outdated
cntx_.ReportCancelError(); | ||
return {{}, absl::StrCat(msg, ec.message())}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am confused how this is working.. why shouldnt this be return {ec, absl::StrCat(msg, ec.message())}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Look how GenericError::Format
is implemented. The above expression is turned to absl::StrCat(msg, ec.message()
when formatted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also we can construct GenericError
directly from string. We can omit
the first argument {}
because we have that constructor. I will take care of this as well
cntx_.ReportCancelError(); | ||
return {absl::StrCat(msg, ec.message())}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the GenericError does this concatenation for us you can do
return {ec, msg}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does not do the same concatenation.
330 std::string GenericError::Format() const {
331 if (!ec_ && details_.empty())
332 return "";
333
334 if (details_.empty())
335 return ec_.message();
336 else if (!ec_)
337 return details_;
338 else
339 return absl::StrCat(ec_.message(), ": ", details_);
340 }
ec is true
so absl::StrCat(ec_.message(), ": ", details_);
is not identical to {absl::StrCat(msg, ec.message())};
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we care about the order here? does this realy needs a different implementaion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would need to check but it's irrelevant. At the end of the day we are discussing here return {ec, msg}
vs return {msg}
. Either way is fine and it's not important
@@ -31,6 +31,14 @@ using facade::kWrongTypeErr; | |||
|
|||
#define RETURN_ON_ERR(x) RETURN_ON_ERR_T(std::error_code, x) | |||
|
|||
#define RETURN_ON_GENERIC_ERR(x) \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to, there is no really good solution otherwise. Doing something RETURN_ON_ERR_T(GenericError, x)
won't work. Yes I tried modifying the micro above by converting the statement:
std::error_code __ec = (x);
----> T __ex = (x)
but a) this is not properly being parsed and we get compiler errors, b) we also do an extra copy in that expression which is fine for error
code because it's trivially copyable
but not for Generic error.
#4636 introduced a small race . The problem is that
Start()
might have failed (because a subsequent ReplicaOf command cancelled it).ReplicaOfInternal
would ignore theec
and start the replication fiber viareplica_->StartMainReplicationFiber
but the second ReplicaOf command that followed updatedreplica_
so we tried to start replication twice on the same replica instance triggering a check failed that the fiber is running.We did not have that problem before because we called
new_replica_->Start()
which would start replication. Even if the replication got cancelled because of a secondReplicaOf
command, it would be fine because theMainReplicationFiber
would start twice on two separate Replica instances and one of them would cancel right away becauseR_CANCELLED
was not set (sinceStop()
explicitly set it by cancelling the context)