Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NRG: Remove stepdown channel, handle inline #4990

Merged
merged 1 commit into from Feb 14, 2024
Merged

Conversation

neilalexander
Copy link
Member

The stepdown channel interleaves with other channels such as the apply queue, leader change notifications etc in the runAs goroutines in an unpredictable order, so processing a stepdown request might be delayed behind other work. Doing this inline should be safer with stronger guarantees.

Signed-off-by: Neil Twigg neil@nats.io

@mprimi
Copy link
Contributor

mprimi commented Jan 25, 2024

Fault injection test show promising results so far (more in the pipe)

@neilalexander
Copy link
Member Author

Rebased on top of latest main.

@Jarema
Copy link
Member

Jarema commented Feb 13, 2024

@mprimi how did the rest of the test perform on this one?
Might be worth rerunning them considering it was rebased few times.

@mprimi
Copy link
Contributor

mprimi commented Feb 13, 2024

Tests showed no difference from baseline.
Personally i have confidence this is a change for the better from protocol implementation perspective.
Since this targets main, I would suggest merging (after review) and seeing how it does in nightly test going forward.

server/raft.go Outdated
@@ -4063,13 +4051,15 @@ const (
noVote = _EMPTY_
)

func (n *raft) switchToFollower(leader string) {
func (n *raft) switchToFollower(leader string, locked bool) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep this one as unlocked and add in a switchToFollowerLocked() that is called from switchToFollower().

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will also mean adding stepdown() and stepdownLocked(). If you're OK with that then I can make the change in the morning. Otherwise this just cascades down from the stepdown() calls.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make reading easier, I was struggling til I saw what the bool meant way down in the PR. Meaning when we go back to this code after some time away will have similar issue IMO.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@@ -2922,7 +2915,7 @@ func (n *raft) runAsCandidate() {
// We vote for ourselves.
votes := 1

for {
for n.State() == Candidate {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We tried this before with the for loops and I thought it presented issues no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We did have these in before, yes, and we reverted the non-leader loops in #4725 because we thought it had broken observer mode, but it was later proven to be an unrelated change at fault there. We just never put them back in the end.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you recall what the unrelated change was?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#4727 — we initially thought that the Observer state was broken by the condition on the runAsFollower loop, but it turned out that the Observer state const was never used. The unit test I added in that PR proved it still worked after we cleaned up the unused Observer state const.

The stepdown channel interleaves with other channels such as the apply queue,
leader change notifications etc in the `runAs` goroutines in an unpredictable
order, so processing a stepdown request might be delayed behind other work.
Doing this inline should be safer with stronger guarantees.

Signed-off-by: Neil Twigg <neil@nats.io>
Copy link
Member

@wallyqs wallyqs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@derekcollison derekcollison merged commit e4ee17e into main Feb 14, 2024
4 checks passed
@derekcollison derekcollison deleted the neil/nrgstepdown branch February 14, 2024 17:58
wallyqs added a commit that referenced this pull request Feb 14, 2024
@neilalexander neilalexander mentioned this pull request Mar 11, 2024
derekcollison added a commit that referenced this pull request Mar 11, 2024
This reverts #4990 for now.

From a Raft correctness perspective I think removing the step-down
channel is the right thing to do, but it has a negative impact on the
amount of time taken to complete stream moves currently (seemingly they
wait for an election timeout) so I want to better understand what's
going on there and to make sure that the cooperative leader handover is
doing the right thing before we reintroduce those changes.

`TestJetStreamSuperClusterMovingStreamAndMoveBack/R3` shows the issue
for future reference.

Signed-off-by: Neil Twigg <neil@nats.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants