NRG: Remove stepdown channel, handle inline #4990

neilalexander · 2024-01-24T13:29:07Z

The stepdown channel interleaves with other channels, such as the apply queue and leader change notifications, in the `runAs` goroutines in an unpredictable order, so processing a stepdown request might be delayed behind other work. Handling it inline should be safer and give stronger guarantees.

Signed-off-by: Neil Twigg <neil@nats.io>
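To illustrate the concern, here is a minimal sketch, not the server's actual code. The `node` type, `applyCh`, `stepdownCh`, and both helper functions are hypothetical names chosen for this example. It relies only on the documented behaviour of Go's `select`, which picks among ready cases at random, so a queued stepdown request can sit behind an arbitrary amount of apply work, whereas an inline state change takes effect before the call returns.

```go
package main

import (
	"fmt"
	"sync"
)

type state int

const (
	stateFollower state = iota
	stateLeader
)

// node is a hypothetical stand-in for a Raft node; it is not the
// nats-server raft type.
type node struct {
	mu    sync.Mutex
	state state
}

// stepdownInline changes state under the lock, so the node is no longer
// leader by the time the call returns.
func (n *node) stepdownInline() {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.state = stateFollower
}

func (n *node) isLeader() bool {
	n.mu.Lock()
	defer n.mu.Unlock()
	return n.state == stateLeader
}

// appliedBeforeStepdown simulates a leader loop that selects over an apply
// queue and a stepdown channel that are both ready. It returns how many
// apply entries were processed before the stepdown request was seen.
func appliedBeforeStepdown(pending int) int {
	applyCh := make(chan int, pending)
	stepdownCh := make(chan struct{}, 1)
	for i := 0; i < pending; i++ {
		applyCh <- i
	}
	stepdownCh <- struct{}{}

	applied := 0
	for {
		select {
		case <-applyCh:
			applied++ // Still acting as leader while this runs.
		case <-stepdownCh:
			return applied
		}
	}
}

func main() {
	// The count varies between runs because select chooses randomly
	// among the ready cases.
	fmt.Println("applied before channel stepdown:", appliedBeforeStepdown(8))

	n := &node{state: stateLeader}
	n.stepdownInline()
	fmt.Println("leader after inline stepdown:", n.isLeader())
}
```

Running the sketch several times should show a varying count for the channel-based path, while the inline path always reports that the node is no longer leader.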

mprimi · 2024-01-25T01:25:00Z

Fault injection tests show promising results so far (more in the pipeline).

neilalexander · 2024-01-25T09:55:31Z

Rebased on top of latest main.

Jarema · 2024-02-13T09:09:26Z

@mprimi how did the rest of the tests perform on this one?
It might be worth rerunning them, considering it was rebased a few times.

mprimi · 2024-02-13T20:51:49Z

Tests showed no difference from baseline.
Personally, I am confident this is a change for the better from a protocol implementation perspective.
Since this targets main, I would suggest merging (after review) and seeing how it does in the nightly tests going forward.

derekcollison · 2024-02-13T21:10:06Z

server/raft.go

@@ -4063,13 +4051,15 @@ const (
 	noVote   = _EMPTY_
 )

-func (n *raft) switchToFollower(leader string) {
+func (n *raft) switchToFollower(leader string, locked bool) {


Let's keep this one as unlocked and add in a switchToFollowerLocked() that is called from switchToFollower().

This will also mean adding stepdown() and stepdownLocked(). If you're OK with that then I can make the change in the morning. Otherwise this just cascades down from the stepdown() calls.

It makes reading easier. I was struggling until I saw what the bool meant way down in the PR, which means that when we come back to this code after some time away we will have a similar issue, IMO.
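The naming convention discussed in this thread can be sketched as follows. This is a simplified illustration under assumptions, not the code that was merged: the `node` struct, its fields, and the `handleHigherTerm` caller are hypothetical, and only the function names `switchToFollower`, `switchToFollowerLocked`, `stepdown`, and `stepdownLocked` come from the review comments.

```go
package main

import (
	"fmt"
	"sync"
)

type state int

const (
	stateFollower state = iota
	stateCandidate
	stateLeader
)

// node is a hypothetical, simplified node, not the actual raft struct.
type node struct {
	mu     sync.Mutex
	state  state
	leader string
}

// switchToFollower takes the lock and delegates to the Locked variant.
func (n *node) switchToFollower(leader string) {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.switchToFollowerLocked(leader)
}

// switchToFollowerLocked must be called with n.mu already held.
func (n *node) switchToFollowerLocked(leader string) {
	n.leader = leader
	n.state = stateFollower
}

// stepdown takes the lock and delegates to stepdownLocked.
func (n *node) stepdown(newLeader string) {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.stepdownLocked(newLeader)
}

// stepdownLocked must be called with n.mu already held.
func (n *node) stepdownLocked(newLeader string) {
	n.switchToFollowerLocked(newLeader)
}

// handleHigherTerm is a hypothetical caller that already holds the lock
// for other work, so it uses the Locked variant instead of passing a bool.
func (n *node) handleHigherTerm(newLeader string) {
	n.mu.Lock()
	defer n.mu.Unlock()
	if n.state == stateLeader {
		n.stepdownLocked(newLeader)
	}
}

func main() {
	n := &node{state: stateLeader}
	n.handleHigherTerm("peerB")
	fmt.Println(n.state == stateFollower, n.leader)

	n.state = stateCandidate
	n.stepdown("peerC")
	fmt.Println(n.state == stateFollower, n.leader)
}
```

With this shape the call site states the locking expectation in the function name, which is the readability point raised above, and a reader does not need to trace what a trailing `bool` argument means.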

derekcollison · 2024-02-13T21:10:36Z

server/raft.go

@@ -2922,7 +2915,7 @@ func (n *raft) runAsCandidate() {
 	// We vote for ourselves.
 	votes := 1

-	for {
+	for n.State() == Candidate {


We tried this before with the for loops, and I thought it presented issues, no?

We did have these in before, yes, and we reverted the non-leader loops in #4725 because we thought it had broken observer mode, but it was later proven to be an unrelated change at fault there. We just never put them back in the end.

Do you recall what the unrelated change was?

#4727 — we initially thought that the Observer state was broken by the condition on the runAsFollower loop, but it turned out that the Observer state const was never used. The unit test I added in that PR proved it still worked after we cleaned up the unused Observer state const.
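For readers unfamiliar with the loop change in this hunk, the following is a rough sketch of why a state-guarded loop pairs well with inline state changes. It is an assumption-laden toy, not the server's `runAsCandidate`: the `node` type, the `votes` and `quit` channels, the `needed` parameter, and `setState` are all hypothetical. The idea shown is that when another code path changes the state inline, the loop condition fails on the next iteration and the loop exits without needing a dedicated channel.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

type State int32

const (
	Follower State = iota
	Candidate
	Leader
)

// node is a hypothetical node whose state is stored atomically so the
// loop condition can read it without holding a lock.
type node struct {
	state atomic.Int32
}

func (n *node) State() State     { return State(n.state.Load()) }
func (n *node) setState(s State) { n.state.Store(int32(s)) }

// runAsCandidate loops only while the node is still a candidate. Any code
// path that changes the state causes the loop to exit on its next check.
// It returns the number of votes counted, including our own.
func (n *node) runAsCandidate(votes <-chan struct{}, quit <-chan struct{}, needed int) int {
	got := 1 // We vote for ourselves.
	for n.State() == Candidate {
		select {
		case <-votes:
			got++
			if got >= needed {
				n.setState(Leader)
			}
		case <-quit:
			return got
		}
	}
	return got
}

func main() {
	n := &node{}
	n.setState(Candidate)

	votes := make(chan struct{}, 5)
	for i := 0; i < 5; i++ {
		votes <- struct{}{}
	}
	quit := make(chan struct{})

	got := n.runAsCandidate(votes, quit, 3)
	fmt.Println("votes counted:", got, "became leader:", n.State() == Leader)

	// If the state was already changed inline, the loop body never runs.
	n.setState(Follower)
	fmt.Println("votes counted as follower:", n.runAsCandidate(votes, quit, 3))
}
```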


wallyqs

LGTM


This reverts #4990 for now. From a Raft correctness perspective I think removing the step-down channel is the right thing to do, but it has a negative impact on the amount of time taken to complete stream moves currently (seemingly they wait for an election timeout) so I want to better understand what's going on there and to make sure that the cooperative leader handover is doing the right thing before we reintroduce those changes. `TestJetStreamSuperClusterMovingStreamAndMoveBack/R3` shows the issue for future reference. Signed-off-by: Neil Twigg <neil@nats.io>

neilalexander force-pushed the neil/nrgstepdown branch from 6fb2ad7 to 8339873 Compare January 25, 2024 09:55

neilalexander marked this pull request as ready for review January 25, 2024 09:55

neilalexander requested a review from a team as a code owner January 25, 2024 09:55

neilalexander force-pushed the neil/nrgstepdown branch from 8339873 to a8cb377 Compare February 2, 2024 10:33

neilalexander force-pushed the neil/nrgstepdown branch from a8cb377 to 3315325 Compare February 12, 2024 15:33

derekcollison reviewed Feb 13, 2024


neilalexander force-pushed the neil/nrgstepdown branch from 3315325 to 0ae4fe2 Compare February 14, 2024 09:58

wallyqs approved these changes Feb 14, 2024


derekcollison merged commit e4ee17e into main Feb 14, 2024
4 checks passed

derekcollison deleted the neil/nrgstepdown branch February 14, 2024 17:58

wallyqs mentioned this pull request Feb 14, 2024

Cherry picks for Release v2.10.11 #5084

Merged

wallyqs added a commit that referenced this pull request Feb 14, 2024

Cherry picks for Release v2.10.11 (#5084)

11f5808

- #5083 - #4990 - #5085 - #5086

neilalexander mentioned this pull request Mar 11, 2024

Revert #4990 #5200

Merged

neilalexander mentioned this pull request Apr 24, 2024

NRG (2.11): Remove stepdown channel, handle inline #5344

Draft
