Prevent persistent cache data races #402

Merged: 10 commits merged into dev from chlowell/cache-races on Apr 13, 2023
Conversation

@chlowell (Collaborator) commented Apr 7, 2023

Concurrent authentication causes data loss when using a persistent cache because base.Client doesn't synchronize its access. The data flow of authentication looks like this:

  1. load persistent cache into memory (clobber data in memory)
  2. authenticate
  3. store authentication data in memory
  4. write in-memory data to persistent cache (clobber persisted data)

When multiple goroutines authenticate concurrently, one may execute step 1 while another is between steps 3 and 4, deleting new data in memory before it's persisted. My solution in this PR is simply to lock around accessing the persistent cache.
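Roughly, the shape of the fix, as a simplified sketch with hypothetical types (not the actual base.Client code):

```go
package cachesketch

import (
	"context"
	"sync"
)

// tokenCache stands in for the client's in-memory token cache.
type tokenCache struct {
	tokens map[string]string
}

// accessor stands in for the persistent (external) cache: Replace loads
// persisted data into memory, Export writes in-memory data back out.
type accessor interface {
	Replace(ctx context.Context, cache *tokenCache) error
	Export(ctx context.Context, cache *tokenCache) error
}

// client is a simplified stand-in for base.Client.
type client struct {
	cache    *tokenCache
	external accessor
	cacheMu  sync.RWMutex // serializes access to the persistent cache
}

// storeToken holds the lock across the whole load -> mutate -> export sequence,
// so another goroutine's Replace can't clobber a new token before it's persisted.
func (c *client) storeToken(ctx context.Context, key, token string) error {
	c.cacheMu.Lock()
	defer c.cacheMu.Unlock()
	if err := c.external.Replace(ctx, c.cache); err != nil {
		return err
	}
	c.cache.tokens[key] = token
	return c.external.Export(ctx, c.cache)
}
```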

To prevent repetition, I simplified several methods by consolidating the (internal) interfaces for partitioned and flat caches. These interfaces had almost the same Read/Write API; the difference was an unnecessary parameter. While I was at it, I also eliminated unnecessary writes to persistence--methods that don't write the in-memory cache no longer export it to the persistent cache.

@chlowell added the "bug" label Apr 7, 2023
@chlowell marked this pull request as ready for review Apr 7, 2023
@bgavrilMS requested a review from pmaytak Apr 11, 2023
@bgavrilMS (Member)

Can you describe the actual user scenario that got broken? i.e. how does this race manifest?

A few thoughts:

MSAL.NET chose a slightly more granular approach to locking

  1. LOCK. load persistent cache into memory (clobber data in memory). RELEASE LOCK
  2. authenticate
  3. LOCK. re-load persistent cache into memory
  4. store authentication data in memory
  5. write in-memory data to persistent cache (clobber persisted data). RELEASE LOCK

This ensures that we don't lock for the entire duration of the HTTP request.
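As a rough sketch of that sequence in Go, reusing the hypothetical types from the sketch in the PR description (this is not actual MSAL.NET or MSAL Go code):

```go
// acquireToken sketches the granular sequence above: the lock is held while
// reading or writing the persistent cache, but not during the HTTP request.
func (c *client) acquireToken(ctx context.Context, key string,
	authenticate func(context.Context) (string, error)) (string, error) {
	// 1. LOCK, load persistent cache into memory, RELEASE LOCK.
	c.cacheMu.Lock()
	err := c.external.Replace(ctx, c.cache)
	c.cacheMu.Unlock()
	if err != nil {
		return "", err
	}

	// 2. Authenticate (the HTTP request) without holding the lock.
	token, err := authenticate(ctx)
	if err != nil {
		return "", err
	}

	// 3-5. LOCK, re-load, store the new token, write back, RELEASE LOCK.
	c.cacheMu.Lock()
	defer c.cacheMu.Unlock()
	if err := c.external.Replace(ctx, c.cache); err != nil {
		return "", err
	}
	c.cache.tokens[key] = token
	return token, c.external.Export(ctx, c.cache)
}
```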

Eventual consistency vs race issues

In MSAL.NET we experimented with disabling that lock. Unfortunately this seems to lead to a situation where a cache node (e.g. for Redis) now contains too many tokens. @pmaytak is looking at this.

I think the fix looks fine, but would be good to have a rough idea of the perf impact.

@chlowell (Collaborator, Author)

The race I described above causes unnecessary reauthentication because some tokens don't make it to the persistent cache and are lost the next time the client loads persisted data. Reauthentication could be acceptable if it were rare, but empirically I see data loss ~25% of the time with only 2 goroutines authenticating concurrently.

Benchmarks:

The scenario is 10 goroutines concurrently authenticating and retrieving a cached token, with the cache persisted to a file. "Baseline" is the latest published version i.e. none of the locking added in this PR. "Read lock" is this PR as it is now, with a read lock in AcquireTokenSilent. "Exclusive lock" changes that read lock to an exclusive one.

[benchmark results image: baseline vs. read lock vs. exclusive lock]

It's interesting that this PR actually improves the benchmark despite adding locks. I believe this is because it also removes unnecessary writes; apparently those are more expensive in this scenario than the new locks.
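For reference, the benchmark had roughly this shape; the sketch below reuses the hypothetical types from earlier in the thread rather than the real public API, and a stub stands in for the HTTP call:

```go
// In a _test.go file of the same sketch package; imports: context,
// encoding/json, fmt, os, path/filepath, sync, testing.

// fileAccessor is a minimal file-backed accessor for the sketches above.
type fileAccessor struct{ path string }

func (f fileAccessor) Replace(_ context.Context, c *tokenCache) error {
	data, err := os.ReadFile(f.path)
	if os.IsNotExist(err) {
		return nil // nothing persisted yet
	} else if err != nil {
		return err
	}
	return json.Unmarshal(data, &c.tokens)
}

func (f fileAccessor) Export(_ context.Context, c *tokenCache) error {
	data, err := json.Marshal(c.tokens)
	if err != nil {
		return err
	}
	return os.WriteFile(f.path, data, 0600)
}

// BenchmarkConcurrentAuth runs 10 goroutines that each authenticate (with a
// stub instead of a real HTTP call) against a file-persisted cache.
func BenchmarkConcurrentAuth(b *testing.B) {
	c := &client{
		cache:    &tokenCache{tokens: map[string]string{}},
		external: fileAccessor{path: filepath.Join(b.TempDir(), "cache.json")},
	}
	stubAuth := func(context.Context) (string, error) { return "token", nil }

	for i := 0; i < b.N; i++ {
		var wg sync.WaitGroup
		for g := 0; g < 10; g++ {
			g := g
			wg.Add(1)
			go func() {
				defer wg.Done()
				key := fmt.Sprintf("user%d", g)
				if _, err := c.acquireToken(context.Background(), key, stubAuth); err != nil {
					b.Error(err)
				}
			}()
		}
		wg.Wait()
	}
}
```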

@bgavrilMS (Member)

@pmaytak - can we do something similar for MSAL.NET?

@bgavrilMS (Member) commented Apr 12, 2023

@chlowell - 30% perf improvement for cached token acquisition is massive! Ship it :)

Are there any changes to the scenario where we rely just on MSAL's memory storage? (i.e. no token cache serialization)

@sonarcloud (bot) commented Apr 12, 2023

Kudos, SonarCloud Quality Gate passed!

- Bugs: 0 (rating A)
- Vulnerabilities: 0 (rating A)
- Security Hotspots: 0 (rating A)
- Code Smells: 0 (rating A)
- Coverage: no coverage information
- Duplication: 0.0%

@rayluo (Contributor) commented Apr 13, 2023

> It's interesting that this PR actually improves the benchmark despite adding locks. I believe this is because it also removes unnecessary writes; apparently those are more expensive in this scenario than the new locks.

Given that this PR improves the baseline, that hypothesis is reasonable, and I am also giving a Ship-It! :-)

The following comments are mostly FYIs for open discussion.

> solution in this PR is simply to lock around accessing the persistent cache

> MSAL.NET chose a slightly more granular approach to locking
> ......
> This ensures that we don't lock for the entire duration of the HTTP request.

MSAL .NET's approach was also adopted in MSAL Python and MSAL Java. The MSAL Go implementation in this PR is granular in the same way, as opposed to locking around the entire token acquisition HTTP request. So we are all good here.

"Read lock" is this PR as it is now, with a read lock in AcquireTokenSilent. "Exclusive lock" changes that read lock to an exclusive one.

Were the "read lock" and "exclusive lock" referred to the actual file lock characteristic? FWIW, MSAL Python (EX) implementation used to acquire a lock exclusively, during both token read and token write. Since then, an improvement kept only lock behavior during token write, and removed the lock attempt on token read. It just assumes a successful read (meaning the token cache file is in valid json format) would mean the data is clean. It also seems to work well.

> The race I described above causes unnecessary reauthentication because some tokens don't make it to the persistent cache and are lost the next time the client loads persisted data. Reauthentication could be acceptable if it were rare, but empirically I see data loss ~25% of the time with only 2 goroutines authenticating concurrently.

Out of curiosity, was that "empirical ~25% data loss with only 2 goroutines authenticating concurrently" a real-world scenario (an end user occasionally running 2 instances of the same app manually) or an in-lab scenario (spinning up 2 instances and having them loop authentication repeatedly)? A ~25% data loss rate sounds higher than we would expect for the former.

@chlowell (Collaborator, Author) commented Apr 13, 2023

Were the "read lock" and "exclusive lock" referred to the actual file lock characteristic?

They refer to whether AcquireTokenSilent held a read lock or an exclusive lock while loading the external cache, i.e., whether we let multiple goroutines (lightweight threads in the same process) load the cache concurrently. The file lock coordinating external cache access across multiple processes will be implemented in the extensions module.

> FWIW, the MSAL Python (EX) implementation used to acquire an exclusive lock during both token reads and token writes. A later improvement kept the lock only for token writes and removed it for token reads; it simply assumes that a successful read (meaning the token cache file is valid JSON) means the data is clean. That also seems to work well.

👍 I'll try this in the extensions module.

> Out of curiosity, was that "empirical ~25% data loss with only 2 goroutines authenticating concurrently" a real-world scenario (an end user occasionally running 2 instances of the same app manually) or an in-lab scenario (spinning up 2 instances and having them loop authentication repeatedly)?

The test was a single process with two goroutines concurrently acquiring a token and then silently authenticating, once. This is a realistic simulation, though conservative, because Go programs commonly launch many goroutines. On my machine, after ~25% of test runs the external cache had only one token. The upshot is that at the lowest level of concurrency there was a significant chance the client would delete a token from its in-memory cache before writing it to the external cache. To be clear, the problem was in the client, not the external cache; an external cache can prevent concurrent reads and writes of its data but not the client destructively overwriting its own in-memory data.
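The test had roughly this shape, expressed with the hypothetical sketch types from earlier in the thread (not the actual test code):

```go
// In the same _test.go file as the benchmark sketch above.

// TestConcurrentAuthPersistsBothTokens has two goroutines authenticate once,
// concurrently, then checks that both tokens reached the file-backed cache.
// Without the locking added in this PR, one goroutine's Replace could clobber
// the other's token in memory before Export ran, leaving only one token here.
func TestConcurrentAuthPersistsBothTokens(t *testing.T) {
	path := filepath.Join(t.TempDir(), "cache.json")
	c := &client{
		cache:    &tokenCache{tokens: map[string]string{}},
		external: fileAccessor{path: path},
	}

	var wg sync.WaitGroup
	for _, user := range []string{"user1", "user2"} {
		user := user
		wg.Add(1)
		go func() {
			defer wg.Done()
			auth := func(context.Context) (string, error) { return "token-" + user, nil }
			if _, err := c.acquireToken(context.Background(), user, auth); err != nil {
				t.Error(err)
			}
		}()
	}
	wg.Wait()

	data, err := os.ReadFile(path)
	if err != nil {
		t.Fatal(err)
	}
	persisted := map[string]string{}
	if err := json.Unmarshal(data, &persisted); err != nil {
		t.Fatal(err)
	}
	if len(persisted) != 2 {
		t.Fatalf("expected 2 persisted tokens, got %d", len(persisted))
	}
}
```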

@chlowell merged commit 4c397f8 into dev Apr 13, 2023
4 checks passed
@chlowell deleted the chlowell/cache-races branch Apr 13, 2023
@rayluo mentioned this pull request Apr 19, 2023