Fix test approach for detecting issues #93

storozhukBM · 2022-06-27T20:34:55Z

The problem in the previous approach was that mock time literally didn't run between limite.Take() calls, so now.Sub(oldState.last) would always be 0 that can be problematic for some issues detection.

This PR adds time increments between limite.Take() calls in (r *runnerImpl) startTaking. Unfortunately our library for clock mocking github.com/benbjohnson/clock has some races reported here benbjohnson/clock#44 and here benbjohnson/clock#42.

So I decided to use the time mocking approach that Ian Lance Taylor used 19 days ago in https://github.com/golang/time repository here golang/time@579cf78

Note: this tests approach detects an error in int64 based limiter implementation that the previous test didn't manage to catch.

codecov · 2022-06-27T20:39:13Z

Codecov Report

Merging #93 (31dcb6c) into main (b62b799) will not change coverage.
Report is 8 commits behind head on main.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##              main       #93   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            4         4           
  Lines           95        97    +2     
=========================================
+ Hits            95        97    +2

Files	Coverage Δ
ratelimit.go	`100.00% <100.00%> (ø)`

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

rabbbit · 2022-06-27T21:59:11Z

ratelimit_test.go

 	"github.com/stretchr/testify/assert"
 )

+// testTime is a fake time used for testing.
+type testTime struct {


If this is copied over from golang/time@579cf78, could we:
(1) put this is a separate file
(2) attribute the source clearly
(3) call out differences in the implementation? For example, I see that advanceUnlocked is different than the reference implementation, but I cannot easily tell why.

rabbbit · 2022-06-27T22:11:34Z

Thanks!

This looks complicated, it'll take me some time to grok this (sorry) - I would expect maybe some time over the weekend.

It's interesting that benbjohnson/clock#42 was fixed by @abhinav, looks like the clock bit us a few times.

Note: this tests approach detects an error in int64 based limiter implementation that the previous test didn't manage to catch.

@storozhukBM - would you like to partially revert #91? We could add your new implementation to the repo, but not make it the default (yet). This way we could track the (1) clock change (2) bug fix in the same PR eventually. I can also do this, but I thought you might want the addition to be attributed to you in git log :)

storozhukBM · 2022-06-28T00:13:29Z

would you like to partially revert #91? We could add your new implementation to the repo, but not make it the default (yet). This way we could track the (1) clock change (2) bug fix in the same PR eventually. I can also do this, but I thought you might want the addition to be attributed to you in git log :)

2 Questions:
Should I do the revert in a separate PR or this one?
Should I fix the bug in the int64 based implementation right away, since I already have a fix?

rabbbit · 2022-06-29T02:43:53Z

(1) I'd do a separate diff. This one could be about fixing the fake time. (2) Your call really. I would slightly to push the broken version for now, and then fix it, but happy either way. tdlr; is that I'd like to play more with the clock implementations/races (or at least understand your implementation - which I currently don't) before landing this PR.

…

On Mon, Jun 27, 2022 at 5:13 PM Bohdan Storozhuk ***@***.***> wrote: @storozhukBM <https://github.com/storozhukBM> - would you like to partially revert #91 <#91>? We could add your new implementation to the repo, but not make it the default (yet). This way we could track the (1) clock change (2) bug fix in the same PR eventually. I can also do this, but I thought you might want the addition to be attributed to you in git log :) 2 Questions: Should I do the revert in a separate PR or this one? Should I fix the but in the int64 based implementation right away, since I already have a fix? — Reply to this email directly, view it on GitHub <#93 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AACG3UWENEQ5O6BZERYC7YDVRI7TJANCNFSM5Z7XTTVA> . You are receiving this because you commented.Message ID: ***@***.***>

storozhukBM · 2022-06-29T20:09:30Z

@rabbbit

Let's do the following thing:

I'll create a PR with an almost complete revert of Revert "New atomic-based implementation squeezed into int64" #91, except making int64-based rate limiter a default implementation.
After its merge, I'll rebase this branch with a new master, so this branch will start to fail.
Then, I'll make a second PR that fixes a known bug.
After the fix PR gets merged, I'll rebase this branch again, and tests will show that the bug is fixed.

storozhukBM · 2022-06-29T20:28:56Z

@rabbbit
First step PR #94

@twelsh-aw

This limiter was introduced and merged in the following PR #85 Later @twelsh-aw found an issue with this implementation #90 So @rabbbit reverted this change in #91 Our tests did not detect this issue, so we have a separate PR #93 that enhances our tests approach to detect potential errors better. With this PR, we want to restore the int64-based atomic rate limiter implementation as a non-default rate limiter and then check that #93 will detect the bug. Right after it, we'll open a subsequent PR to fix this bug.

rabbbit · 2022-07-02T06:02:31Z

Done, #94 merged. sorry for the delay. Thanks!

…

On Wed, Jun 29, 2022 at 1:29 PM Bohdan Storozhuk ***@***.***> wrote: @rabbbit <https://github.com/rabbbit> First step PR #94 <#94> — Reply to this email directly, view it on GitHub <#93 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AACG3URGCAFOGW4QFOJ636TVRSWZLANCNFSM5Z7XTTVA> . You are receiving this because you were mentioned.Message ID: ***@***.***>

@twelsh-aw

This limiter was introduced and merged in the following PR uber-go#85 Later @twelsh-aw found an issue with this implementation uber-go#90 So @rabbbit reverted this change in uber-go#91 Our tests did not detect this issue, so we have a separate PR uber-go#93 that enhances our tests approach to detect potential errors better. With this PR, we want to restore the int64-based atomic rate limiter implementation as a non-default rate limiter and then check that uber-go#93 will detect the bug. Right after it, we'll open a subsequent PR to fix this bug.

# Conflicts: # ratelimit_test.go

storozhukBM · 2022-07-02T19:47:19Z

@rabbbit

Now we can see that new test approach detects issues with atomic_int64 implementation.

@storozhukBM

In #93 @storozhukBM adds test parallization, which is unrelated to the actual change. I'd rather have this as a separate PR. The test currently take ~12 seconds on my laptop, so I see why we might want to run them in parallel. There are other parallelization opportunities (we could paralelize each of the subtests for each of the rate-limiters), but this is a good start. Before: ``` > go test ./... -race -count 1 ok go.uber.org/ratelimit 12.739s ``` After: ``` > go test ./... -race -count 1 ok go.uber.org/ratelimit 6.178s ```

ratelimit_test.go

@storozhukBM

In #93 @storozhukBM adds test parallization, which is unrelated to the actual change. I'd rather have this as a separate PR. The test currently take ~12 seconds on my laptop, so I see why we might want to run them in parallel. There are other parallelization opportunities (we could paralelize each of the subtests for each of the rate-limiters), but this is a good start. Before: ``` > go test ./... -race -count 1 ok go.uber.org/ratelimit 12.739s ``` After: ``` > go test ./... -race -count 1 ok go.uber.org/ratelimit 6.178s ```

rabbbit · 2022-07-11T05:49:21Z

ratelimit_mock_time_test.go

+https://github.com/golang/time/commit/579cf78fd858857c0d766e0d63eb2b0ccf29f436
+
+Modified parts:
+ - advanceUnlocked method uses in-place filtering of timers


is the goal of in-place filtering just performance of the fake-clock?

I don't think this feels very important (?) - I'd rather have simpler code that's slightly slower?

For me, this approach looks simpler as well as more performant, but check the original and give me your opinion, if you think we should keep the original, I don't mind.

It's just that the whole i, j looks like we would need tests for tests (off by one errors).

I think I would rather pull the other version as is, and then push a separate PR to fix the copying with clear changes. If we worried about the performance though, wouldn't we want to sort once on insert, rather than advance?

But:

see my other comment below

I understand if fixing the time is not your primary focus for now :)

OK, I can go back to the prev version.

ratelimit_mock_time_test.go

ratelimit_test.go

rabbbit · 2022-07-11T05:56:51Z

ratelimit_test.go

+				for {
+					move()
+				}


Sorry, but I think this needs a much larger explanation - perhaps as comments in code?

Am I reading this right that this is an infinite loop, trying to move to infinitely move to the next timer ... by recovering from a panic?

Irrespective of goroutine leaks, using panics for signaling - do you think the maxDuration is not useful here? It seems we should be able to just pass the duration into advance, theoretically?

Irrespective of goroutine leaks

I think 1 goroutine leak per test case is not a big deal, but if you want, I can add some exit logic here.

using panics for signaling

Code from Ian just panics when you try to run the next timer, and there is nothing to run. We can potentially change that behavior to not handle panics, but I wanted to limit changes in that code.

Sorry, but I think this needs a much larger explanation - perhaps as comments in code?

I'll write some description here first and when we clarify all questions about it, I'll move it into code comments:

testTime internally has a queue of all timers currently registered within the clock. In our case, there will be timers for goroutines that try to consume permissions and timers for goroutines that should run test asserts.
When we call testTime.advanceToTimer() it basically shifts time to the next registered timer, runs it, and also runs runtime.Gosched() multiple times to give a chance for other goroutines to run and react to what happened in the current timer. Usually, when we run runtime.Gosched() other goroutines will observe the new state, make their moves, and potentially create new timers inside this clock to wait for some
other state change.

If testTime.advanceToTimer() panics, it means that there are no timers to run, and it can happen if the test is already finished or it didn't manage to set up the full test case, so we just sleep a bit to allow and try again to run timers.

Do you think the maxDuration is not useful here?

No, we can use that if we want to get out of the currently infinite loop, we can check on every iteration if testTime is already passed maxDuration.

It seems we should be able to just pass the duration into advance, theoretically?

Yes, but since we advance time in different goroutines in this approach, it is necessary for us to do it one step at a time and give other goroutines the opportunity to make their steps as well. There also can be situations where we start to advance time in this goroutine before test goroutines spawned all permission consumers and asserting goroutines, so we need to just wait for everything to setup, that is what we do when panic happens.

So perhaps naively, but do we need both

advanceToTimer

advance(Unlocked?)

Our test runner knows exactly how much time needs to move (maxDuration). We can just pass it into advance. advanceUnlocked already does all the runtime.Gosched needed.

And soit seem to me that advance should theoretically be enough?
If the problem is the initial goroutine setup, should we just add the 16*runtime.Gosched at the beginning of advance to let the scheduler kick in?
This would also help with r.clock.advance(time.Nanosecond) you have below.

I understand this kinda modifies the code we copy-paste, but I'd feel much more comfortable with we decreased the API by 50% here.

@rabbbit

Right now, it is hard for me to make changes because tests do not pass until we merge #95

I agree with the comments and changes that you propose right now, but they don't change the approach in general.
I propose to merge #95, and then I can apply your suggested changes, or we can merge this PR as it is right now, and I'll make these improvements in subsequent PRs.

Merged #95. I might still want to re-review it later on, it feels like some simplification is possible.

I'll update this branch with new version and this should potentially fix all tests, event though I'm not sure how your new test for initial calls will work 🤞

@storozhukBM

In uber-go#93 @storozhukBM adds test parallization, which is unrelated to the actual change. I'd rather have this as a separate PR. The test currently take ~12 seconds on my laptop, so I see why we might want to run them in parallel. There are other parallelization opportunities (we could paralelize each of the subtests for each of the rate-limiters), but this is a good start. Before: ``` > go test ./... -race -count 1 ok go.uber.org/ratelimit 12.739s ``` After: ``` > go test ./... -race -count 1 ok go.uber.org/ratelimit 6.178s ```

@twelsh-aw

This PR fixes the issue found by @twelsh-aw with int64 based implementation #90 Our tests did not detect this issue, so we have a separate PR #93 that enhances our tests approach to detect potential errors better.

storozhukBM · 2022-07-13T23:14:51Z

@rabbbit
After merging with the recent main branch, all tests pass, except TestInitial for mutex rate limiter for some reason. I'll investigate what happens with it when I have time in the next few days.

rabbbit · 2022-07-13T23:19:09Z

It's possible that mutex behaviour was broken, but we didn't detect it because of the broken clock :|

…

On Wed, Jul 13, 2022 at 4:15 PM Bohdan Storozhuk ***@***.***> wrote: @rabbbit <https://github.com/rabbbit> After merging with the recent main branch, all tests pass, except TestInitial for mutex rate limiter for some reason. I'll investigate what happens with it when I have time in the next few days. — Reply to this email directly, view it on GitHub <#93 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AACG3UQFR7DHSNS6T3LYLADVT5EXLANCNFSM5Z7XTTVA> . You are receiving this because you were mentioned.Message ID: ***@***.***>

storozhukBM · 2022-07-13T23:22:21Z

@rabbbit

It's possible that mutex behaviour was broken, but we didn't detect it
because of the broken clock :|

No, the use of advanceToTimer approach removes the issue. I think just advance(time.Duration) is not sufficient with this testTime implementation. I'll figure it out.

Fetch exact numbers for goroutines in runtime to make that amount of Gosched calls.

storozhukBM · 2022-07-14T01:47:59Z

@rabbbit
I think I addressed all your comments, so please take another look

rabbbit

Almost SGTM:

I don't think we need the GetName (?)
I don't like the amount of time.Sleep but perhaps it's unavoidable

I've been wondering about some simplified fake clock library - given the amount of issues we've seen it seems it must be possible to make it simpler -__-

rabbbit · 2022-07-14T03:30:21Z

ratelimit_test.go

 type testRunner interface {
+	getName() string


why do we need this?

The name is included in the test-name automatically via tt.run (?)

rabbbit · 2022-07-14T03:30:24Z

ratelimit_test.go

 	"github.com/stretchr/testify/assert"
 )

+const advanceBackoffDuration = 5 * time.Millisecond


Do you know why we need this? Is the gosched not enough?

You can notice that gosched calls happen under lock. It means that all goroutines that would like to get the current time from this clock or register some additional timer won't be able to do so, even if we run these gosched calls. In our case, it means that most "startTaking" goroutines will never work while we step through timers, but they need the most time resource in our tests run.

So we actually need both types of pauses between runs. One is using gosched under lock for all goroutines that were waiting for the timer to click and are not blocked on mutex of this clock.
The other one is for all goroutines that are blocked on mutex of this clock and need to make some progress between timer advances.

rabbbit · 2022-07-14T03:30:58Z

ratelimit_test.go

@@ -102,6 +115,7 @@ func (r *runnerImpl) startTaking(rls ...Limiter) {
 			for _, rl := range rls {
 				rl.Take()
 			}
+			r.clock.advance(time.Nanosecond, 0)


and we need this one as well?

Yes, this one imitates time progressions between rl.Take() calls. With a real clock, when you call rl.Take() do something and call rl.Take() again time.Now() will move forward in time, so we want to imitate that.

storozhukBM · 2022-07-14T07:53:37Z

I've been wondering about some simplified fake clock library - given the amount of issues we've seen it seems it must be possible to make it simpler -__-

If you take a look at the libraries we have used in the past, like https://github.com/benbjohnson/clock/ it is MUCH more complex, has concurrency bugs, and `time. Sleep's as well here

I researched every library that I was able to find on Github, and this approach from golang/time is the most straightforward I've seen.

One more consideration that we should take into account is that we can find a lib or implementation that will pass all our tests but is going to detect the real bug. After we change the time library, we should use mutation testing at least manually introduce some bugs into the existing rate limiter implementations that we have and make sure that our tests with the new clock library detect the issue.

storozhukBM · 2022-08-23T17:53:42Z

Hey @rabbbit, what do you think? Should we move forward with this PR?

rabbbit · 2022-10-31T03:16:27Z

We spoke about this briefly in #101 and #90 - I'm not sure about merging this:

without in an depth review
without a plan for this library

For now I'll put this as "on-hold". This means that the library is effectively in code-freeze, as our test clock implementation is buggy.

storozhukBM · 2022-10-31T03:18:24Z

@rabbbit can we find someone form Uber to make a depth review?

There were quite a few fixes in that repo, let's pull them in. We've had a few mock-clock bugs before (#90), I'm hoping that with the new versions #93 won't be necessary. I'll try reverting #95 on a branch temporarily to see if it would have been caught.

storozhukBM · 2024-05-28T17:20:57Z

We maybe can get something in standard library for this golang/go#67434

storozhukBM marked this pull request as ready for review June 27, 2022 20:38

storozhukBM mentioned this pull request Jun 27, 2022

atomicInt64Limiter WithoutSlack doesn't block #90

Closed

rabbbit reviewed Jun 27, 2022

View reviewed changes

storozhukBM mentioned this pull request Jun 29, 2022

Restore int64 based atomic rate limiter #94

Merged

Fix test approach for detecting issues

1053ca8

storozhukBM force-pushed the fix_tests_approach_for_issue_detection branch from 9d5dd18 to 1053ca8 Compare July 2, 2022 19:40

Merge branch 'main' into fix_tests_approach_for_issue_detection

abdb9ce

# Conflicts: # ratelimit_test.go

storozhukBM mentioned this pull request Jul 2, 2022

Fix no slack option for int64 based option #95

Merged

rabbbit mentioned this pull request Jul 10, 2022

Run tests in parallel. #98

Merged

rabbbit reviewed Jul 10, 2022

View reviewed changes

ratelimit_test.go Outdated Show resolved Hide resolved

rabbbit reviewed Jul 11, 2022

View reviewed changes

storozhukBM and others added 2 commits July 13, 2022 23:51

Merge branch 'main' into fix_tests_approach_for_issue_detection

c6476cc

Adapt TestInitial to new testTime approach

47a6978

Use advanceToTimer approach in TestInitial

99e7b2c

storozhukBM added 6 commits July 14, 2022 00:29

Go back to the original approach of removing timers from the list.

86916e3

Fetch exact numbers for goroutines in runtime to make that amount of Gosched calls.

Simplify advanceToTimer routine

260ee11

Go back to 10 millis between advances

4e9eb9d

Use advanceFor instead of for loop

6ca7158

Sort internal timers and use only one function for advance

41e59b7

Add more comments

e2a95a9

rabbbit approved these changes Jul 14, 2022

View reviewed changes

Remove test runner name

31dcb6c

rabbbit mentioned this pull request Oct 31, 2022

Make int64 based atomic ratelimiter default #101

Merged

rabbbit mentioned this pull request Jul 9, 2023

Update clock dependency #116

Open

Fix test approach for detecting issues #93

Are you sure you want to change the base?

Fix test approach for detecting issues #93

Conversation

storozhukBM commented Jun 27, 2022

codecov bot commented Jun 27, 2022 • edited

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rabbbit commented Jun 27, 2022

storozhukBM commented Jun 28, 2022 • edited

rabbbit commented Jun 29, 2022 via email

storozhukBM commented Jun 29, 2022

storozhukBM commented Jun 29, 2022

rabbbit commented Jul 2, 2022 via email

storozhukBM commented Jul 2, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

storozhukBM Jul 11, 2022 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

storozhukBM commented Jul 13, 2022 • edited

rabbbit commented Jul 13, 2022 via email

storozhukBM commented Jul 13, 2022

storozhukBM commented Jul 14, 2022

rabbbit left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

storozhukBM commented Jul 14, 2022

storozhukBM commented Aug 23, 2022

rabbbit commented Oct 31, 2022

storozhukBM commented Oct 31, 2022

storozhukBM commented May 28, 2024

codecov bot commented Jun 27, 2022 •

edited

storozhukBM commented Jun 28, 2022 •

edited

storozhukBM Jul 11, 2022 •

edited

storozhukBM commented Jul 13, 2022 •

edited