
rls: fix a data race involving the LRU cache #5925

Merged
merged 2 commits into grpc:master on Jan 13, 2023

Conversation

easwars (Contributor) commented Jan 11, 2023

Summary of changes:

  • The data cache and the pending request map were guarded by a sync.RWMutex. In the RPC path, the cache lookup was done under a read-lock, while most of the other operations were done under a write-lock, on the assumption that a cache lookup was a read operation. But as it turns out, a cache lookup modifies the LRU cache because it makes the matched entry the most recently used (see the sketch after this list).
    • With this change, we get rid of the sync.RWMutex and move to a regular sync.Mutex.
  • When handling a config update, the RLS policy could end up resizing the data cache. This was being done without holding the lock. This change brings the resize operation under the lock.
  • Get rid of the iterateAndRun method on the dataCache type.
    • I'm still not sure whether deleting an item from the list while traversing it is safe. So I tried other options, and it turns out to be easier to traverse the map and perform the corresponding operations.
  • Get rid of the onEvict field in the cacheEntry type. It was not being used.
  • Simplify code in the picker, since we no longer have a read-write lock.
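
A minimal, self-contained Go sketch of why an LRU lookup is effectively a write. The lruCache type below is illustrative only, not the actual dataCache from this package; it just shows the MoveToFront mutation that makes a read-lock insufficient.

package lru

import "container/list"

// lruCache is an illustrative LRU cache: entries live in a doubly linked
// list ordered from most- to least-recently used, plus a map for O(1) lookup.
type lruCache struct {
	ll    *list.List
	items map[string]*list.Element
}

func newLRUCache() *lruCache {
	return &lruCache{ll: list.New(), items: make(map[string]*list.Element)}
}

// Get looks up key and, on a hit, promotes the entry to the front of the
// list. That promotion writes to the list, which is why a lookup is not a
// pure read: two goroutines holding only a read-lock would race here.
func (c *lruCache) Get(key string) (any, bool) {
	e, ok := c.items[key]
	if !ok {
		return nil, false
	}
	c.ll.MoveToFront(e) // mutation on the "read" path
	return e.Value, true
}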

This takes care of a P1 internal issue.

RELEASE NOTES:

  • rls: fix a data race involving the LRU cache

easwars requested a review from dfawley on January 11, 2023
easwars added this to the 1.53 Release milestone on Jan 11, 2023
// cacheMu guards access to the data cache and pending requests map.
cacheMu sync.RWMutex
// cacheMu guards access to the data cache and pending requests map. We
// cannot use an RWMutex here since even a operation like
Member

*an operation

Contributor Author

Done.

Comment on lines 274 to 277
resizeCache := false
if newCfg.cacheSizeBytes != b.lbCfg.cacheSizeBytes {
	resizeCache = true
}
Member

// Resize the cache if the size in the config has changed.
resizeCache := newCfg.cacheSizeBytes != b.lbCfg.cacheSizeBytes

(Comment optional, but explains what could be seen as "more confusing logic", and still uses fewer lines.)

Contributor Author

Done. Thanks.

@@ -284,6 +286,14 @@ func (b *rlsBalancer) UpdateClientConnState(ccs balancer.ClientConnState) error
b.updateCh.Put(resumePickerUpdates{done: done})
b.stateMu.Unlock()
<-done

if resizeCache {
// If the new config changes the size of the data cache, we might have to
Member

"If the new config reduces the size"...?

Contributor Author

Done.

@@ -284,6 +286,14 @@ func (b *rlsBalancer) UpdateClientConnState(ccs balancer.ClientConnState) error
b.updateCh.Put(resumePickerUpdates{done: done})
b.stateMu.Unlock()
<-done

if resizeCache {
Member

Why not do this operation above, where we compute it, and remove the bool entirely?

Contributor Author

cacheMu needs to be grabbed before stateMu, which is why we can't do this operation up there. Added a comment explaining this.
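
To make the ordering constraint concrete, here is a simplified sketch; balancerSketch and handleConfigUpdate are hypothetical stand-ins for the real rlsBalancer and UpdateClientConnState, but the cacheMu-before-stateMu rule and the deferred resize follow the description above.

package lockorder

import "sync"

// balancerSketch models only the two locks from this discussion; it is not
// the real rlsBalancer type.
type balancerSketch struct {
	// Lock ordering: cacheMu must be acquired before stateMu. Grabbing them
	// in the opposite order on some other code path risks a deadlock.
	cacheMu        sync.Mutex
	stateMu        sync.Mutex
	cacheSizeBytes int64
}

func (b *balancerSketch) handleConfigUpdate(newSizeBytes int64) {
	// resizeCache is computed while processing the config, but the resize
	// itself has to wait until stateMu is no longer held.
	resizeCache := newSizeBytes != b.cacheSizeBytes

	b.stateMu.Lock()
	b.cacheSizeBytes = newSizeBytes // stand-in for applying the new config
	b.stateMu.Unlock()

	if resizeCache {
		// stateMu has been released, so taking cacheMu here cannot invert
		// the cacheMu-before-stateMu ordering.
		b.cacheMu.Lock()
		// ... evict least-recently-used entries until the cache fits ...
		b.cacheMu.Unlock()
	}
}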

// to write to the cache (if we have to send out an RLS request), we will
// release the read-lock and acquire a write-lock.
p.lb.cacheMu.RLock()
p.lb.cacheMu.Lock()
Member

Can we reduce the amount of time we're holding this at all?

Contributor Author

The following actions happen under the lock in this method (see the sketch after this exchange):

  • read the cache entry:
    • this is non-blocking and should be fast.
  • send an RLS request if required:
    • this is the only thing that could block, but the control channel implementation spawns a goroutine to do the actual work of sending the RLS RPC, so this should also be fast.
  • queue the pick or delegate to child policies (or default policy):
    • this should also be non-blocking and fast.

So I think we are OK to hold the lock for the whole duration of this method. If we are blocking or taking too long here, we have other problems anyway, since LB policies are expected to be quick in Pick() and not block there.

What do you think?

Member

Seems fine, I was just wondering because the defer makes it impossible to give it up early on one path vs. another, and ideally we hold this for as short a time as possible.
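
For illustration, a self-contained sketch of the locking shape discussed in this thread. The types and the pick helper below are placeholders, not the actual rlsPicker code, but they show the single mutex held via defer for the whole pick path and the three steps listed above:

package picksketch

import "sync"

// cacheEntry, lbSketch and pickerSketch are placeholders standing in for the
// rls balancer and picker; only the locking shape mirrors the discussion.
type cacheEntry struct{ stale bool }

type lbSketch struct {
	cacheMu sync.Mutex // a plain Mutex now: even a lookup mutates LRU order
	cache   map[string]*cacheEntry
}

type pickerSketch struct{ lb *lbSketch }

// pick holds cacheMu for the whole method via defer. Each step is expected
// to be fast: the lookup is in-memory, the "RLS request" only hands work off
// to another goroutine, and delegating or queueing the pick does not block.
func (p *pickerSketch) pick(key string) *cacheEntry {
	p.lb.cacheMu.Lock()
	defer p.lb.cacheMu.Unlock()

	entry := p.lb.cache[key] // 1. read the cache entry (fast, in-memory)
	if entry == nil || entry.stale {
		go func() {
			// 2. stand-in for sending the RLS request off the hot path
		}()
	}
	return entry // 3. delegate to a child policy or queue the pick
}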

dfawley assigned easwars and unassigned dfawley on Jan 12, 2023
easwars (Contributor Author) commented Jan 12, 2023

Thanks for the quick review!!

easwars assigned dfawley and unassigned easwars on Jan 12, 2023
dfawley assigned easwars and unassigned dfawley on Jan 12, 2023
easwars merged commit 9228cff into grpc:master on Jan 13, 2023
1 check passed
easwars deleted the rls_crash_purge_data_cache branch on January 13, 2023 00:02
github-actions bot locked as resolved and limited conversation to collaborators on Jul 12, 2023