
Kernel: Wake cores from idle directly rather than through a host thread #6837

Merged · 5 commits · May 22, 2024

Conversation

riperiperi (Member) commented May 19, 2024

Right now when a core enters an idle state, leaving that idle state requires us to first signal the core's idle thread, which then signals the correct thread that we want to run on the core. This means that in a lot of cases, we're paying double: two wakeups for every thread brought out of an idle state.

This PR moves this process to happen on the thread that is waking others out of idle, instead of an idle thread that needs to be awoken first.

For compatibility, the process has been kept as similar as possible: the logic from `IdleThreadLoop` has been migrated to `TryLeaveIdle`, gated by a condition variable that lets it run only once at a time for each core. A core is only considered for wake from idle if idle is both active and has been signalled; the signal is consumed and the active state is cleared when the core leaves idle. Maybe we could go further with this to avoid waiting on other thread signals to complete, but a port of the current behaviour is the safest improvement for now.
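For illustration, here is a minimal C# sketch of that gating as described above; the names (`_idleGates`, `_idleActive`, `_idleSignalled`, `SwitchToNextThread`) are hypothetical and not the actual `KScheduler` code:

```csharp
// Minimal sketch of per-core idle wake gating; all names are illustrative.
// These members would live on a scheduler-like class.
private readonly object[] _idleGates;   // one gate (lock) per core
private readonly bool[] _idleActive;    // core is currently idle
private readonly bool[] _idleSignalled; // core has been asked to wake

private void TryLeaveIdle(int core)
{
    lock (_idleGates[core]) // lets this run only once at a time per core
    {
        // A core is only considered for wake if idle is both
        // active and signalled.
        if (_idleActive[core] && _idleSignalled[core])
        {
            _idleSignalled[core] = false; // consume the signal
            _idleActive[core] = false;    // clear the active state

            // Select and dispatch the next guest thread for this core;
            // previously this work ran on the core's dedicated idle thread.
            SwitchToNextThread(core);
        }
    }
}
```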

Dummy threads (previously just the idle thread) have been changed to have no host thread, as the work is now done by threads entering idle and signalling out of it. The idle thread has since been removed entirely, and idle core state now lives directly on the scheduler.

This could put a bit of extra work on threads that would have triggered `_idleInterruptEvent` before, but I'd expect less time wasted than signalling all those reset events and paying the OS overhead that follows. Worst case, other threads performing these signals at the same time will have to wait for each other, but that's still going to be a very short amount of time.

Improvements are very slight, but are best seen in games with heavy (or very misguided) multithreading, such as Pokemon: Legends Arceus. Improvements are expected in Scarlet/Violet and TOTK, but are harder to measure due to GPU trouble.

Testing on Linux/macOS is still to be done; we definitely need to test more games, as this affects all of them (obviously) and any issues might be rare to encounter.

@github-actions github-actions bot added horizon Related to Ryujinx.HLE kernel Related to the kernel labels May 19, 2024
riperiperi (Member, Author) commented May 19, 2024

Legends: Arceus provides the best view of what the difference is for core scheduling. There's a part of the core game loop where it wastefully swaps between three threads running what is basically a sequential workload for a few milliseconds. If we zoom in here with a profiler, we can see the behaviour before and after:

Before

[profiler screenshot: before]

You can see that guest threads 50, 54 and 53 are constantly blocking each other in a clear pattern. However, when each thread suspends, it also signals the idle threads for each core (OS threads 0-3). Those threads then wake the next guest thread, so two OS context switches (shown by the arrows) are needed for the game to switch to the next thread.

After

[profiler screenshot: after]

The threads follow a similar pattern where they signal each other sequentially, but they now wake each other directly rather than waking idle threads first. You can see this via the arrows, where it's clearer which threads are unblocking each other. This won't be perfect: an unrelated thread could still wake a thread that was unblocked by some other thread that hasn't gotten to the idle awakening step, but it's nicer for debugging and saves one OS context switch per idle wake.

It's worth noting that the profiler I'm using will exaggerate the runtime of threads, as it captures all context switches but with much lower time precision, and it seems to round start times down and end times up. It also seems to slow down context switches a lot more, so with the new approach the game runs notably faster under a profiler.

On my Windows desktop with a Ryzen 3900X, there is a small boost to performance (peak performance shown; the average performance difference is about the same, measured at the same location):

Before

[performance screenshot: before]

After

[performance screenshot: after]

I still need to see whether overall CPU usage drops, and how this might impact systems with fewer cores or power saving.

@riperiperi riperiperi added the performance Performance issue or improvement label May 19, 2024
gdkchan (Member) commented May 19, 2024

I wonder how hard it would be to remove `_idleThread` entirely. It seems one of the remaining uses is `currentThread.AddCpuTime(ticksDelta);`, which is used to measure the amount of time a core is idle. It shouldn't be hard to special-case this with a field on the `KScheduler` to accumulate idle time instead. As for the other uses, they are just for checking if the current/next thread is the "idle thread"; `null` could also be used to indicate "idle thread".
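A rough sketch of what that special case could look like (field and method names here are illustrative, not actual Ryujinx code):

```csharp
// Sketch: once the idle KThread is gone, accumulate idle time in a
// field on the scheduler itself; _idleTimeTicks and AddTickTime are
// hypothetical names.
private long _idleTimeTicks;

private void AddTickTime(KThread currentThread, long ticksDelta)
{
    if (currentThread == null) // null stands in for the "idle thread"
    {
        _idleTimeTicks += ticksDelta; // idle time tracked on the scheduler
    }
    else
    {
        currentThread.AddCpuTime(ticksDelta);
    }
}
```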

riperiperi (Member, Author) commented May 20, 2024

Did some testing on Steam Deck, and its performance appears to be affected a lot more. The system has 4 cores instead of the 12 on my desktop, runs Linux instead of Windows, and has aggressive power saving measures. All tests are run on battery, and screenshots are taken a few minutes after the test begins so the power usage numbers settle.

Uncapped framerate

Average performance greatly improves: fluctuations go from 35-36 to 42-43 FPS (around 4-5 ms saved per frame). Overall power usage seems similar, but more of it goes into the GPU to reach the new, higher framerate (not shown on the screenshot, but the general pattern is there when watching it). Frame times are a lot more stable.

Before

[Steam Deck screenshot: uncapped, before]

After

[Steam Deck screenshot: uncapped, after]

Capped Framerate

When framerate is capped, power usage greatly decreases. Focus on the wattage numbers next to "battery" and "cpu", and the clock speeds it settles on. Frametime is a lot more consistent. Fan speed and temperatures are much lower; the fan quickly becomes inaudible when the cap is turned on.

Before

[Steam Deck screenshot: capped, before]

After

[Steam Deck screenshot: capped, after]

I've always wondered why this game was underperforming on deck, I guess now we have the answer.

@riperiperi riperiperi marked this pull request as ready for review May 20, 2024 19:20
@ryujinx-mako ryujinx-mako bot requested review from AcK77, gdkchan, TSRBerry and a team May 20, 2024 19:20
gdkchan (Member) left a comment


lgtm, thanks. I tested a few games here on Windows and macOS and had no issues. I didn't play for long, though, so it might be worth getting more extended testing from someone else. Very nice to see the idle threads gone; it should make debugging a bit simpler. I didn't know it could have such a significant impact on Steam Deck too, so that was a nice surprise.

@ryujinx-mako ryujinx-mako bot requested a review from a team May 21, 2024 00:28
LukeWarnut (Contributor) commented

I tested Smash Ultimate for an extended amount of time and didn't find anything out of the ordinary. I also briefly tried a few others and got the same result.

TSRBerry (Member) left a comment


Works great! Can't really comment on the code changes tbh. They make sense to me and I don't see any issues, but I'm very inexperienced in that area, so that's not really worth much.

@gdkchan gdkchan merged commit c1ed150 into Ryujinx:master May 22, 2024
15 checks passed