
Implement nightly improvements (~30% read improvement) #43

Closed · wants to merge 4 commits

Conversation

terrarier2111 (Contributor) commented Nov 18, 2022

These are some nice performance improvements, but I have (at least) one more in the pipeline: it'll enable us to remove one of the three branches in the get call.

Review thread on the following diff context:

```rust
if let Some(inner) = self.get_inner(&thread.0) {
    return Ok(inner);
}
```

Amanieu (Owner):

Why not just use thread_id::get() here? A thread ID is going to need to be allocated anyways.

terrarier2111 (Contributor, Author):

There's no thread_id::get() anymore in the nightly version, because that used the thread_local! macro, which introduced unnecessary overhead in the form of at least one branch plus other machinery that LLVM wasn't able to optimize out very well. It does a decent job with this version, though, where getting/setting the thread local gets inlined.
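For concreteness, here is a minimal sketch of the two ways of declaring the TLS variable being compared here; the names and the Cell payload are placeholders, not this crate's actual code:

```rust
// Illustrative sketch only; names and types are placeholders.
#![feature(thread_local)] // nightly-only, at the crate root

use std::cell::Cell;

// Stable: the macro yields a std::thread::LocalKey, and every access goes
// through LocalKey::with, whose extra machinery the optimizer cannot always
// strip away.
thread_local! {
    static THREAD_VIA_MACRO: Cell<usize> = const { Cell::new(0) };
}

// Nightly: a raw #[thread_local] static is just a TLS slot, so a read can be
// inlined down to a plain thread-local load.
#[thread_local]
static THREAD_RAW: Cell<usize> = Cell::new(0);

fn read_via_macro() -> usize {
    THREAD_VIA_MACRO.with(|t| t.get())
}

fn read_raw() -> usize {
    THREAD_RAW.get()
}
```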

src/lib.rs: resolved review thread
Amanieu (Owner) commented Nov 25, 2022

I spent some time reviewing this and found the new logic to be very hard to follow.

If I understand correctly, the main perf improvement comes from avoiding the need to initialize the thread ID on the fast path. Could this be neatly wrapped in a thread_id::try_get() function that returns an Option<Thread>? Then the call to thread_id::get() which initializes the TLS could be moved into the cold insert() function.
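For reference, a rough sketch of what such an interface could look like; the Thread type, the zero sentinel, and the ID allocator below are assumptions for illustration, not the crate's actual definitions:

```rust
// Sketch of a try_get()/get() split; everything here is a placeholder.
mod thread_id {
    use std::cell::Cell;
    use std::sync::atomic::{AtomicUsize, Ordering};

    #[derive(Clone, Copy)]
    pub struct Thread(pub usize);

    thread_local! {
        // 0 acts as a sentinel meaning "no ID assigned to this thread yet".
        static THREAD: Cell<usize> = const { Cell::new(0) };
    }

    /// Fast path: return the thread ID only if one has already been assigned.
    pub fn try_get() -> Option<Thread> {
        THREAD.with(|t| {
            let id = t.get();
            if id == 0 { None } else { Some(Thread(id)) }
        })
    }

    /// Slow path: assign an ID if needed; called from the cold insert() path.
    pub fn get() -> Thread {
        try_get().unwrap_or_else(|| {
            let id = allocate_id();
            THREAD.with(|t| t.set(id));
            Thread(id)
        })
    }

    fn allocate_id() -> usize {
        static NEXT: AtomicUsize = AtomicUsize::new(1);
        NEXT.fetch_add(1, Ordering::Relaxed)
    }
}
```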

terrarier2111 (Contributor, Author):

> I spent some time reviewing this and found the new logic to be very hard to follow.
>
> If I understand correctly, the main perf improvement comes from avoiding the need to initialize the thread ID on the fast path. Could this be neatly wrapped in a thread_id::try_get() function that returns an Option<Thread>? Then the call to thread_id::get() which initializes the TLS could be moved into the cold insert() function.

Sorry for the long delay. I did what you asked and extracted the unsafe code portions into safe wrapper functions with extensive safety comments.

Amanieu (Owner) commented Dec 12, 2022

Hmm, that's still quite complex. I had a go at implementing this myself in #44. Could you have a look and see if this addresses the performance issues you are seeing?

terrarier2111 (Contributor, Author) commented Dec 13, 2022

> Hmm, that's still quite complex. I had a go at implementing this myself in #44. Could you have a look and see if this addresses the performance issues you are seeing?

Your PR improves perf a bit (10% compared to master), but it is still a 20% regression compared to this PR. If you want, I can integrate your changes into this PR so we get both better stable and better nightly performance.

Amanieu (Owner) commented Dec 13, 2022

Sure! I think the only change needed here is to use #[thread_local] instead of thread_local! for the first TLS variable?

Amanieu (Owner) commented Dec 13, 2022

With that said, I much prefer my version where the only interface to thread_id is get and try_get. Also I found that not passing the Thread to insert improves performance because it is otherwise passed by value.
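As a loose illustration of that last point (free-standing stand-in functions, not the crate's real signatures), the idea is for the cold path to re-derive the thread handle itself rather than receiving it by value from the hot caller:

```rust
// Sketch only; the types and lookup logic are stand-ins.
#[derive(Clone, Copy)]
struct Thread {
    id: usize,
}

fn try_current() -> Option<Thread> {
    None // placeholder for thread_id::try_get()
}

fn current() -> Thread {
    Thread { id: 1 } // placeholder for thread_id::get(), which may init TLS
}

fn lookup(_thread: Thread) -> Option<&'static str> {
    None // placeholder for the per-thread table lookup
}

// Hot path: nothing is passed by value into the cold function.
fn get_or_insert() -> &'static str {
    if let Some(thread) = try_current() {
        if let Some(v) = lookup(thread) {
            return v;
        }
    }
    insert_cold()
}

// Cold path: takes no Thread parameter; it re-derives the ID internally.
#[cold]
fn insert_cold() -> &'static str {
    let _thread = current();
    "inserted"
}
```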

terrarier2111 (Contributor, Author) commented Dec 13, 2022

> With that said, I much prefer my version where the only interface to thread_id is get and try_get. Also I found that not passing the Thread to insert improves performance because it is otherwise passed by value.

To be honest, I don't see a way to reduce the interface further without moving the unsafe parts back into the main lib portion where they used to be; if you have a better idea I'd be glad to revise the interface. Right, nice catch, I'll add that optimization as well. I should be done by tomorrow and will push the changes up.

Amanieu (Owner) commented Dec 13, 2022

If I read your code correctly, it does essentially the same thing as #44:

  • get and the fast path in get_or_try both only look at THREAD, which has a const initializer and no destructor. => thread_id::try_get().
  • insert will allocate a new thread ID if needed, or use the existing one. => thread_id::get()

src/lib.rs: resolved review thread (outdated)
terrarier2111 (Contributor, Author):

> If I read your code correctly, it does essentially the same thing as #44:
>
> * `get` and the fast path in `get_or_try` both only look at `THREAD`, which has a const initializer and no destructor. => `thread_id::try_get()`.
> * `insert` will allocate a new thread ID if needed, or use the existing one. => `thread_id::get()`

I can't tell you exactly why my version is that much faster than yours, but I still suspect it has to do with LLVM being able to tell that what I'm doing with the macro-less thread local can be inlined in some places, while for yours it cannot.

terrarier2111 (Contributor, Author):

So I checked on my workstation (which has an AMD processor; a different machine from the one I ran the earlier benchmarks on), and after integrating your changes into my branch I am still getting a 5% edge over your approach on reads and a 20% edge on inserts. That isn't as big as on my laptop (which has an Intel processor), but it is still a clear improvement.

terrarier2111 (Contributor, Author):

I integrated your changes for stable while still retaining the nightly feature for additional perf. Are you okay with this?

terrarier2111 (Contributor, Author):

@Amanieu, any update on this? I'd love to have a new release with these improvements so I can publish my crate (which depends on them) on crates.io.

Amanieu (Owner) commented Feb 3, 2023

I've implemented the same optimization in #44, which I feel has much nicer code. If that looks OK to you, I think I will merge #44 and close this PR. I will of course credit you in the PR.

terrarier2111 (Contributor, Author):

Generally speaking I'm okay with that, but I'd prefer to leave this PR open (or at least open an issue referencing this PR) to track the potential for further optimization once #[thread_local] gets stabilized. Should I open an issue and close this PR, or should I just leave it open?

Amanieu (Owner) commented Feb 3, 2023

I did add the #[thread_local] optimization to #44. Any further optimizations should be done in a new PR.

terrarier2111 (Contributor, Author):

> I did add the #[thread_local] optimization to #44. Any further optimizations should be done in a new PR.

Yeah, that's alright.

terrarier2111 (Contributor, Author):

Could you create a new release after #44 gets merged?

Amanieu (Owner) commented Feb 8, 2023

Closing in favor of #44.

Amanieu closed this on Feb 8, 2023