Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update ETS atomically on tracker update #172

Merged

Conversation

jonatanklosko
Copy link
Contributor

@jonatanklosko jonatanklosko commented May 24, 2023

Tracker update is handled as leave + join and ETS updates are applied in this notion, that is, a delete followed by an insert. Since tracker lookup reads ETS on the caller side, it can happen concurrently to tracker update, so there is a race condition when the ETS read happens between said delete and insert.

The "values" ETS table (which keeps the metadata) is an :ordered_set, which means that we can do an atomic update just by inserting the new value, without prior delete. So I made two changes:

  • for the local case I added State.leave_join, which works just like State.leave + State.join, but without the unnecessary delete
  • for the remote case I updated State.merge to gather desired inserts/deletes to ETS, do a simple diff and only then apply the changes (which makes an entry update atomic)

The changes only affect applying changes to the internal ETS storage, the update is reported as leave + join in exactly the same way.

Context: in Livebook we use tracker for registering notebook session processes and we also store basic session information necessary for listing, which gets updated. We have a lot of tests that create and lookup sessions and we started to observe more and more intermittent test failures that basically go: create session -> lookup session -> session not found. Running tests in a loop I could reproduce this behaviour and tracked it down to this race condition, with this PR it no longer happens.

Comment on lines +97 to +100
state = bump_clock(%State{state | clouds: pruned_clouds, delta: new_delta})

# Update ETS entry and produce add-like delta
state = bump_clock(state)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if we need to bump the clock twice, I just mirrored the exact behaviour of leave + join.

@chrismccord chrismccord merged commit fc5686f into phoenixframework:main May 24, 2023
@chrismccord
Copy link
Member

❤️❤️❤️🐥🔥

@jonatanklosko jonatanklosko deleted the jk-atomic-ets-update branch May 24, 2023 20:26
@chrismccord
Copy link
Member

Released in v2.1.2. Thank you so much!

@jonatanklosko
Copy link
Contributor Author

Beautiful, thanks! <3

@arjan
Copy link

arjan commented Jun 13, 2023

With this change I am seeing issues where the presence ETS values tables on different nodes are drifting out of sync. It seems that presence entries from processeses on other nodes are not always cleaned up. This forced me to roll back to 2.1.1 for the time being. I will see if I can come up with a reproducible case.

@jonatanklosko
Copy link
Contributor Author

jonatanklosko commented Jun 13, 2023

@arjan are you comparing the ETS rows, or the result of Presence.list? Also, do they differ just in excess processes or the metadata itself?

I went through the changes a couple times and so far haven't pinpointed anything that would introduce any meaningful difference.

@arjan
Copy link

arjan commented Jun 14, 2023

I noticed it because our stats use the output of Presence.list and the counts were way off yesterday morning. Hopefully I can reproduce it today.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants