Prevent persistent cache data races #402

Merged: 10 commits merged into dev from chlowell/cache-races on Apr 13, 2023
Conversation

@chlowell (Collaborator) commented Apr 7, 2023

Concurrent authentication causes data loss when using a persistent cache because base.Client doesn't synchronize its access. The data flow of authentication looks like this:

  1. load persistent cache into memory (clobber data in memory)
  2. authenticate
  3. store authentication data in memory
  4. write in-memory data to persistent cache (clobber persisted data)

When multiple goroutines authenticate concurrently, one may execute step 1 while another is between steps 3 and 4, deleting new data in memory before it's persisted. My solution in this PR is simply to lock around accessing the persistent cache.
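Roughly, the shape of the fix, as a simplified sketch with hypothetical types (not the actual base.Client code):

```go
package cachesketch

import (
	"context"
	"sync"
)

// tokenCache stands in for the client's in-memory token cache.
type tokenCache struct {
	tokens map[string]string
}

// accessor stands in for the persistent (external) cache: Replace loads
// persisted data into memory, Export writes in-memory data back out.
type accessor interface {
	Replace(ctx context.Context, cache *tokenCache) error
	Export(ctx context.Context, cache *tokenCache) error
}

// client is a simplified stand-in for base.Client.
type client struct {
	cache    *tokenCache
	external accessor
	cacheMu  sync.RWMutex // serializes access to the persistent cache
}

// storeToken holds the lock across the whole load -> mutate -> export sequence,
// so another goroutine's Replace can't clobber a new token before it's persisted.
func (c *client) storeToken(ctx context.Context, key, token string) error {
	c.cacheMu.Lock()
	defer c.cacheMu.Unlock()
	if err := c.external.Replace(ctx, c.cache); err != nil {
		return err
	}
	c.cache.tokens[key] = token
	return c.external.Export(ctx, c.cache)
}
```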

To prevent repetition, I simplified several methods by consolidating the (internal) interfaces for partitioned and flat caches. These interfaces had almost the same Read/Write API; the difference was an unnecessary parameter. While I was at it, I also eliminated unnecessary writes to persistence--methods that don't write the in-memory cache no longer export it to the persistent cache.

@chlowell added the "bug" label Apr 7, 2023
@chlowell marked this pull request as ready for review Apr 7, 2023
@bgavrilMS requested a review from pmaytak Apr 11, 2023
@bgavrilMS (Member)

Can you describe the actual user scenario that got broken? i.e. how does this race manifest?

A few thoughts:

MSAL.NET chose a slightly more granular approach to locking

  1. LOCK. load persistent cache into memory (clobber data in memory). RELEASE LOCK
  2. authenticate
  3. LOCK. re-load persistent cache into memory
  4. store authentication data in memory
  5. write in-memory data to persistent cache (clobber persisted data). RELEASE LOCK

This ensures that we don't lock for the entire duration of the HTTP request.
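As a rough sketch of that sequence in Go, reusing the hypothetical types from the sketch in the PR description (this is not actual MSAL.NET or MSAL Go code):

```go
// acquireToken sketches the granular sequence above: the lock is held while
// reading or writing the persistent cache, but not during the HTTP request.
func (c *client) acquireToken(ctx context.Context, key string,
	authenticate func(context.Context) (string, error)) (string, error) {
	// 1. LOCK, load persistent cache into memory, RELEASE LOCK.
	c.cacheMu.Lock()
	err := c.external.Replace(ctx, c.cache)
	c.cacheMu.Unlock()
	if err != nil {
		return "", err
	}

	// 2. Authenticate (the HTTP request) without holding the lock.
	token, err := authenticate(ctx)
	if err != nil {
		return "", err
	}

	// 3-5. LOCK, re-load, store the new token, write back, RELEASE LOCK.
	c.cacheMu.Lock()
	defer c.cacheMu.Unlock()
	if err := c.external.Replace(ctx, c.cache); err != nil {
		return "", err
	}
	c.cache.tokens[key] = token
	return token, c.external.Export(ctx, c.cache)
}
```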

Eventual consistency vs race issues

In MSAL.NET we experimented with disabling that lock. Unfortunately this seems to lead to a situation where a cache node (e.g. for Redis) now contains too many tokens. @pmaytak is looking at this.

I think the fix looks fine, but would be good to have a rough idea of the perf impact.

@chlowell (Collaborator, Author)

The race I described above causes unnecessary reauthentication because some tokens don't make it to the persistent cache and are lost the next time the client loads persisted data. Reauthentication could be acceptable if it were rare, but empirically I see data loss ~25% of the time with only 2 goroutines authenticating concurrently.

Benchmarks:

The scenario is 10 goroutines concurrently authenticating and retrieving a cached token, with the cache persisted to a file. "Baseline" is the latest published version i.e. none of the locking added in this PR. "Read lock" is this PR as it is now, with a read lock in AcquireTokenSilent. "Exclusive lock" changes that read lock to an exclusive one.

[benchmark results image: baseline vs. read lock vs. exclusive lock]

It's interesting that this PR actually improves the benchmark despite adding locks. I believe this is because it also removes unnecessary writes; apparently those are more expensive in this scenario than the new locks.
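For reference, the benchmark had roughly this shape; the sketch below reuses the hypothetical types from earlier in the thread rather than the real public API, and a stub stands in for the HTTP call:

```go
// In a _test.go file of the same sketch package; imports: context,
// encoding/json, fmt, os, path/filepath, sync, testing.

// fileAccessor is a minimal file-backed accessor for the sketches above.
type fileAccessor struct{ path string }

func (f fileAccessor) Replace(_ context.Context, c *tokenCache) error {
	data, err := os.ReadFile(f.path)
	if os.IsNotExist(err) {
		return nil // nothing persisted yet
	} else if err != nil {
		return err
	}
	return json.Unmarshal(data, &c.tokens)
}

func (f fileAccessor) Export(_ context.Context, c *tokenCache) error {
	data, err := json.Marshal(c.tokens)
	if err != nil {
		return err
	}
	return os.WriteFile(f.path, data, 0600)
}

// BenchmarkConcurrentAuth runs 10 goroutines that each authenticate (with a
// stub instead of a real HTTP call) against a file-persisted cache.
func BenchmarkConcurrentAuth(b *testing.B) {
	c := &client{
		cache:    &tokenCache{tokens: map[string]string{}},
		external: fileAccessor{path: filepath.Join(b.TempDir(), "cache.json")},
	}
	stubAuth := func(context.Context) (string, error) { return "token", nil }

	for i := 0; i < b.N; i++ {
		var wg sync.WaitGroup
		for g := 0; g < 10; g++ {
			g := g
			wg.Add(1)
			go func() {
				defer wg.Done()
				key := fmt.Sprintf("user%d", g)
				if _, err := c.acquireToken(context.Background(), key, stubAuth); err != nil {
					b.Error(err)
				}
			}()
		}
		wg.Wait()
	}
}
```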

@bgavrilMS (Member)

@pmaytak - can we do something similar for MSAL.NET?

@bgavrilMS (Member) commented Apr 12, 2023

@chlowell - 30% perf improvement for cached token acquisition is massive! Ship it :)

Are there any changes to the scenario where we rely just on MSAL's memory storage? (i.e. no token cache serialization)

@sonarcloud (bot) commented Apr 12, 2023

Kudos, SonarCloud Quality Gate passed!

- Bugs: 0 (rating A)
- Vulnerabilities: 0 (rating A)
- Security Hotspots: 0 (rating A)
- Code Smells: 0 (rating A)
- Coverage: no coverage information
- Duplication: 0.0%

@rayluo (Contributor) commented Apr 13, 2023

> It's interesting that this PR actually improves the benchmark despite adding locks. I believe this is because it also removes unnecessary writes; apparently those are more expensive in this scenario than the new locks.

Given that this PR improves the baseline, that hypothesis is reasonable, and I am also giving a Ship-It! :-)

The following comments are mostly FYIs for open discussion.

> solution in this PR is simply to lock around accessing the persistent cache

> MSAL.NET chose a slightly more granular approach to locking
> ......
> This ensures that we don't lock for the entire duration of the HTTP request.

MSAL .NET's approach was also adopted in MSAL Python and MSAL Java. The MSAL Go implementation in this PR is granular in the same way, as opposed to locking around the entire token acquisition HTTP request. So we are all good here.

"Read lock" is this PR as it is now, with a read lock in AcquireTokenSilent. "Exclusive lock" changes that read lock to an exclusive one.

Were the "read lock" and "exclusive lock" referred to the actual file lock characteristic? FWIW, MSAL Python (EX) implementation used to acquire a lock exclusively, during both token read and token write. Since then, an improvement kept only lock behavior during token write, and removed the lock attempt on token read. It just assumes a successful read (meaning the token cache file is in valid json format) would mean the data is clean. It also seems to work well.

> The race I described above causes unnecessary reauthentication because some tokens don't make it to the persistent cache and are lost the next time the client loads persisted data. Reauthentication could be acceptable if it were rare, but empirically I see data loss ~25% of the time with only 2 goroutines authenticating concurrently.

Out of curiosity, was that "empirical ~25% data loss with only 2 goroutines authenticating concurrently" a real-world scenario (an end user occasionally running 2 instances of the same app manually) or an in-lab scenario (spinning up 2 instances and having them loop authentication repeatedly)? A ~25% data loss rate sounds higher than we would expect for the former.

@chlowell (Collaborator, Author) commented Apr 13, 2023

Were the "read lock" and "exclusive lock" referred to the actual file lock characteristic?

They refer to whether AcquireTokenSilent held a read lock or an exclusive lock while loading the external cache, i.e., whether we let multiple goroutines (lightweight threads in the same process) load the cache concurrently. The file lock coordinating external cache access across multiple processes will be implemented in the extensions module.

> FWIW, the MSAL Python (EX) implementation used to acquire an exclusive lock during both token reads and token writes. A later improvement kept the lock only for token writes and removed it for token reads; it simply assumes that a successful read (meaning the token cache file is valid JSON) means the data is clean. That also seems to work well.

👍 I'll try this in the extensions module.

> Out of curiosity, was that "empirical ~25% data loss with only 2 goroutines authenticating concurrently" a real-world scenario (an end user occasionally running 2 instances of the same app manually) or an in-lab scenario (spinning up 2 instances and having them loop authentication repeatedly)?

The test was a single process with two goroutines concurrently acquiring a token and then silently authenticating, once. This is a realistic simulation, though conservative, because Go programs commonly launch many goroutines. On my machine, after ~25% of test runs the external cache had only one token. The upshot is that at the lowest level of concurrency there was a significant chance the client would delete a token from its in-memory cache before writing it to the external cache. To be clear, the problem was in the client, not the external cache; an external cache can prevent concurrent reads and writes of its data but not the client destructively overwriting its own in-memory data.
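The test had roughly this shape, expressed with the hypothetical sketch types from earlier in the thread (not the actual test code):

```go
// In the same _test.go file as the benchmark sketch above.

// TestConcurrentAuthPersistsBothTokens has two goroutines authenticate once,
// concurrently, then checks that both tokens reached the file-backed cache.
// Without the locking added in this PR, one goroutine's Replace could clobber
// the other's token in memory before Export ran, leaving only one token here.
func TestConcurrentAuthPersistsBothTokens(t *testing.T) {
	path := filepath.Join(t.TempDir(), "cache.json")
	c := &client{
		cache:    &tokenCache{tokens: map[string]string{}},
		external: fileAccessor{path: path},
	}

	var wg sync.WaitGroup
	for _, user := range []string{"user1", "user2"} {
		user := user
		wg.Add(1)
		go func() {
			defer wg.Done()
			auth := func(context.Context) (string, error) { return "token-" + user, nil }
			if _, err := c.acquireToken(context.Background(), user, auth); err != nil {
				t.Error(err)
			}
		}()
	}
	wg.Wait()

	data, err := os.ReadFile(path)
	if err != nil {
		t.Fatal(err)
	}
	persisted := map[string]string{}
	if err := json.Unmarshal(data, &persisted); err != nil {
		t.Fatal(err)
	}
	if len(persisted) != 2 {
		t.Fatalf("expected 2 persisted tokens, got %d", len(persisted))
	}
}
```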

@chlowell merged commit 4c397f8 into dev Apr 13, 2023
4 checks passed
@chlowell deleted the chlowell/cache-races branch Apr 13, 2023
@rayluo mentioned this pull request Apr 19, 2023