-
-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NRG: Remove stepdown channel, handle inline #4990
Conversation
Fault injection test show promising results so far (more in the pipe) |
6fb2ad7
to
8339873
Compare
Rebased on top of latest |
8339873
to
a8cb377
Compare
a8cb377
to
3315325
Compare
@mprimi how did the rest of the test perform on this one? |
Tests showed no difference from baseline. |
server/raft.go
Outdated
@@ -4063,13 +4051,15 @@ const ( | |||
noVote = _EMPTY_ | |||
) | |||
|
|||
func (n *raft) switchToFollower(leader string) { | |||
func (n *raft) switchToFollower(leader string, locked bool) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep this one as unlocked and add in a switchToFollowerLocked() that is called from switchToFollower().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will also mean adding stepdown()
and stepdownLocked()
. If you're OK with that then I can make the change in the morning. Otherwise this just cascades down from the stepdown()
calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make reading easier, I was struggling til I saw what the bool meant way down in the PR. Meaning when we go back to this code after some time away will have similar issue IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
@@ -2922,7 +2915,7 @@ func (n *raft) runAsCandidate() { | |||
// We vote for ourselves. | |||
votes := 1 | |||
|
|||
for { | |||
for n.State() == Candidate { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We tried this before with the for loops and I thought it presented issues no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We did have these in before, yes, and we reverted the non-leader loops in #4725 because we thought it had broken observer mode, but it was later proven to be an unrelated change at fault there. We just never put them back in the end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you recall what the unrelated change was?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#4727 — we initially thought that the Observer
state was broken by the condition on the runAsFollower
loop, but it turned out that the Observer
state const was never used. The unit test I added in that PR proved it still worked after we cleaned up the unused Observer
state const.
The stepdown channel interleaves with other channels such as the apply queue, leader change notifications etc in the `runAs` goroutines in an unpredictable order, so processing a stepdown request might be delayed behind other work. Doing this inline should be safer with stronger guarantees. Signed-off-by: Neil Twigg <neil@nats.io>
3315325
to
0ae4fe2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
This reverts #4990 for now. From a Raft correctness perspective I think removing the step-down channel is the right thing to do, but it has a negative impact on the amount of time taken to complete stream moves currently (seemingly they wait for an election timeout) so I want to better understand what's going on there and to make sure that the cooperative leader handover is doing the right thing before we reintroduce those changes. `TestJetStreamSuperClusterMovingStreamAndMoveBack/R3` shows the issue for future reference. Signed-off-by: Neil Twigg <neil@nats.io>
The stepdown channel interleaves with other channels such as the apply queue, leader change notifications etc in the
runAs
goroutines in an unpredictable order, so processing a stepdown request might be delayed behind other work. Doing this inline should be safer with stronger guarantees.Signed-off-by: Neil Twigg neil@nats.io