
high resource usage in the housekeeper #154

Closed
Geal opened this issue Jun 24, 2022 · 10 comments

Geal commented Jun 24, 2022

Hi, we're using moka for a query cache in the Apollo Router, and we are seeing high CPU usage in the housekeeper code, as shown in this flamegraph of a benchmark that uses one core and does not insert or get anything from the moka cache (the large tower on the left is the housekeeper):
[flamegraph image]

With 4 cores, it is much lower, but still 2.45% of the sampled time, for a cache that is doing nothing:
[flamegraph image]

Do you have any idea why it would cost that much to run the housekeeper?


tatsuya6502 commented Jun 26, 2022

Hi. Thank you for reporting the issue.

Which OS and CPU architecture did you use? Perhaps Linux x86_64? And you used a multi-core machine, right?

When a moka cache is not busy (including when it is idle), its housekeeper task runs every 0.3 seconds. If the cache is idle, each run spends very little time because there is no work to do. I see it takes only 0.05% of the space in the flamegraph.

[housekeeper-task-run flamegraph image]
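
To make the timing concrete, here is a minimal sketch (not moka's actual internals) of scheduling a periodic housekeeping job on a small background thread pool with the scheduled-thread-pool crate, using the same 0.3-second interval:

// Minimal sketch (not moka's actual internals) of a periodic housekeeping
// job driven by the scheduled-thread-pool crate at a 0.3 s interval.
use scheduled_thread_pool::ScheduledThreadPool;
use std::time::Duration;

fn main() {
    let pool = ScheduledThreadPool::new(1);

    // Run the (hypothetical) housekeeping closure every 300 ms.
    let handle = pool.execute_at_fixed_rate(
        Duration::from_millis(300), // initial delay
        Duration::from_millis(300), // period
        || {
            // In a real cache this would drain pending reads/writes and
            // apply evictions; here it is just a placeholder.
        },
    );

    std::thread::sleep(Duration::from_secs(3));
    handle.cancel(); // stop the periodic job
}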

I do not have an answer yet, but I found the following:

  • As you said, the large tower on the left is the housekeeper, taking up 29.34% of the space in the single-core flamegraph.
  • In the tower:
    • Running the actual housekeeper task takes up only 0.05%.
    • futex_wait in a system call takes up ~22.20% (12.51% + 7.61% + 1.44% + 0.64%).
    • futex_wait was called by Condvar::wait_until of the parking_lot_core crate, which puts the calling thread to sleep for a given duration.

If I understand correctly, the system spent a lot of time (?) in futex_wait calls, and futex_wait is used to put a thread to sleep for a given duration.
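
To illustrate where futex_wait comes from, here is a sketch (not the actual scheduled-thread-pool/parking_lot code) of a timed wait on a parking_lot Condvar; on Linux, the timed sleep below reaches the kernel through a futex_wait system call, which is why idle worker threads show up under futex_wait in the flamegraph:

// Sketch of a timed condition-variable wait with parking_lot. On Linux the
// timed sleep is implemented with futex_wait.
use parking_lot::{Condvar, Mutex};
use std::time::{Duration, Instant};

fn wait_for_work(has_work: &Mutex<bool>, cvar: &Condvar, max_idle: Duration) {
    let deadline = Instant::now() + max_idle;
    let mut guard = has_work.lock();
    // Loop because wait_until may return before the deadline without a
    // notification (a spurious wakeup).
    while !*guard {
        if cvar.wait_until(&mut guard, deadline).timed_out() {
            break; // woke up because the deadline passed
        }
    }
}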

Maybe native_write_msr at the top of the stack is the key. I am not sure yet, but it might be used to record Linux perf data?

If so, I could try different platforms without perf; e.g. macOS or FreeBSD with dtrace.


Geal commented Jun 26, 2022

It's on Linux x86_64, yes.


tatsuya6502 commented Jun 28, 2022

A quick update:

I am currently investigating why the system spent so much time in Linux futex_wait while all the worker threads in the thread pool were mostly sleeping.

I created a testing environment on Amazon EC2, but have not had time to write a test program to reproduce the issue:

  • Ubuntu Server 22.04, x86_64, 4 vCPUs
  • In that environment, I confirmed that I can generate flamegraphs including kernel stacks by using [cargo-]flamegraph with Linux perf.

I read the relevant code in the scheduled-thread-pool, parking_lot, and parking_lot_core crates. I also read some blog articles to learn the very basics of Linux futexes.

My current theory (not verified) is that spurious wakeups are occurring in futex_wait.

  • In the original single-core flamegraph, I see that futex_wait was calling schedule to return the CPU core to the kernel.
    • This would mean that the worker threads were woken up by the kernel but were not ready to proceed.
  • Maybe the kernel was waking the threads up because there were not enough runnable tasks to keep the CPU cores busy?


Geal commented Jun 29, 2022

Maybe it's a function of how busy the CPUs are, yes. In my initial tests I limit the number of cores used by the router like this: https://github.com/apollographql/router/blob/main/apollo-router/src/executable.rs#L139-L146
(this is necessary because it's hard to saturate the router with traffic if it has access to all cores)
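
For illustration only (the linked router code may do it differently), one common way to restrict a service to N cores is to cap the number of Tokio worker threads when building the runtime:

// Hypothetical example of capping the number of runtime worker threads; the
// actual router code at the link above may do this differently.
use tokio::runtime::Builder;

fn build_runtime(num_cores: usize) -> std::io::Result<tokio::runtime::Runtime> {
    Builder::new_multi_thread()
        .worker_threads(num_cores) // limit how many cores the runtime uses
        .enable_all()
        .build()
}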


tatsuya6502 commented Jul 7, 2022

Hi. Sorry for the slow progress. I was busy finalizing Moka v0.9.0. Now that it is released, I hope I have more time.

I wrote a simple program using the scheduled-thread-pool crate with a modified parking_lot crate to check whether there were spurious wakeups. However, I could not observe/reproduce the issue with that simple program.

I think I will change direction now. Instead of continuing to investigate this, I will add an option to remove the thread pools from moka::sync::Cache. If there is no thread pool, there will be no Condvar waiting on a Linux futex and the problem will not exist in the first place.

Some people have already suggested that I not use thread pools by default and instead have the cache use client threads to drive the housekeeping jobs. I checked the Moka cache usage in the Apollo Router and found that it does not use any of the advanced features that Moka provides, so removing the thread pools for that specific use case will be very easy.
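
Conceptually, that design looks something like the following sketch (purely illustrative, not moka's actual design or API): the cache piggybacks maintenance on the threads that call it, for example every Nth operation.

// Conceptual sketch of "client threads drive the housekeeping": every Nth
// insert on the calling thread also runs the pending maintenance work inline.
// This is illustrative only and is not moka's actual design or API.
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

const MAINTENANCE_EVERY: usize = 64;

struct TinyCache {
    map: Mutex<HashMap<String, String>>,
    ops: AtomicUsize,
}

impl TinyCache {
    fn new() -> Self {
        Self { map: Mutex::new(HashMap::new()), ops: AtomicUsize::new(0) }
    }

    fn insert(&self, key: String, value: String) {
        self.map.lock().unwrap().insert(key, value);
        // The caller's thread occasionally pays for housekeeping.
        if self.ops.fetch_add(1, Ordering::Relaxed) % MAINTENANCE_EVERY == 0 {
            self.run_pending_maintenance();
        }
    }

    fn run_pending_maintenance(&self) {
        // Placeholder: a real cache would apply evictions, expirations, etc.
    }
}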

I also noticed that you are moving away from Moka for the generic response cache layer (the cache-tools branch). Is that right? What about APQ? Will you continue using Moka there?

tatsuya6502 self-assigned this Jul 7, 2022

Geal commented Jul 8, 2022

The cache-tools branch is about refactoring the various cache implementations in the router (there were 5 different ones when I started that work 😅) around one cache interface that can have multiple levels, in memory and external (Redis, memcached).
We have not decided to move away from moka yet; we're evaluating cache libraries that we can fit into that model. If we find a good library that handles the multiple levels for us, we might go for it.

If there's one point that would motivate me to move away from moka, it's that it feels too complicated for me to go in and fix quickly during an incident. But that's not a shortcoming of the library itself; you chose to cover a lot of ground in one coherent implementation. An example: with get_or_insert_with, moka can return a value or perform the work to retrieve a new one while coalescing concurrent queries, and this affects multiple parts of the cache. In the router we handle query coalescing ourselves (because we can fit it better into the GraphQL query model), so we mainly need a synchronized KV store.


tatsuya6502 commented Jul 11, 2022

Hi. Thank you very much for sharing your thoughts. I am not aware of any Rust library that handles multiple levels (tiers) of caches, but I hope you will find one. I do not yet have a clear idea of how the tiered cache in the cache-tools branch would work; I will read the code closely and try to figure it out.

In the latest version of Moka (v0.9.0), I added support for an eviction listener callback function. That feature was requested by multiple users, and I wondered if it would be useful when implementing a tiered cache. The listener can be used to move an entry evicted from the Moka cache to an external cache.
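
For illustration, here is a minimal sketch of wiring an eviction listener to a secondary store; a plain in-memory map stands in for Redis to keep the sketch self-contained, and it assumes the v0.9 builder API where the listener receives the key, value, and removal cause:

// Minimal sketch of using an eviction listener to push evicted entries into a
// secondary store. A Mutex<HashMap> stands in for Redis here; this assumes the
// v0.9 eviction_listener builder API.
use moka::sync::Cache;
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

fn main() {
    let secondary: Arc<Mutex<HashMap<String, String>>> =
        Arc::new(Mutex::new(HashMap::new()));
    let secondary_for_listener = Arc::clone(&secondary);

    let primary: Cache<String, String> = Cache::builder()
        .max_capacity(10_000)
        .eviction_listener(move |key: Arc<String>, value: String, _cause| {
            // Move the evicted entry to the secondary (external) cache.
            secondary_for_listener
                .lock()
                .unwrap()
                .insert(key.as_ref().clone(), value);
        })
        .build();

    primary.insert("hello".to_string(), "world".to_string());
}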

So I wrote an example tiered cache implementation here, which uses Moka as the primary cache and Redis as the secondary cache:

I am not sure if it fits your needs; I guess not. I also found some issues (limitations) in the current API and internal design, and listed them in the "Limitations" section of the README. I hope we can solve them in the future.

I agree that the Moka cache's implementation is complicated. I chose to go in the direction of providing the convenient features that users requested. I also put my knowledge and ten years of experience with distributed key-value and object stores into the Moka cache to make it more performant, but this also makes it more complicated. My hope is that it becomes mature enough that nobody has to go in and quickly fix it.

On the other hand, a few users have asked for a much simpler kind of cache, like a HashMap with size constraints and entry expiration. We have an experimental concurrent cache implementation called moka::dash::Cache for that. It has no advanced features, so its implementation is a bit simpler.

I am thinking of moving it to a separate crate to make it clear that it is a simpler version. Maybe I will make it even simpler than it is now. You might be interested in it.

tatsuya6502 commented

Quoting my earlier comment: "I will add an option to remove the thread pools from moka::sync::Cache. If there is no thread pool, there will be no Condvar waiting on a Linux futex and the problem will not exist in the first place."

I am working on it here: #165

You can try it by doing the following:

Modify Cargo.toml:

[dependencies]
moka = { git = "https://github.com/moka-rs/moka", branch = "disable-housekeeper-threads" }

When creating a cache instance, set thread_pool_enabled to false using the cache builder.

use moka::sync::Cache;

let cache = Cache::builder()
    .max_capacity(MAX_CAPACITY)
    .thread_pool_enabled(false)
    .build();

For backward compatibility, thread_pool_enabled is set to true by default. We will change the default to false in the future (v0.10.0?).


tatsuya6502 commented Aug 4, 2022

Quoting my earlier comment: "I will add an option to remove the thread pools from moka::sync::Cache. If there is no thread pool, there will be no Condvar waiting on a Linux futex and the problem will not exist in the first place."

I have published Moka v0.9.3 with this feature. (doc)

I will keep this issue open to investigate the original problem; I will do that when I have more time.

tatsuya6502 commented

Closing as complete.

v0.12.0 has major breaking changes. Please read the MIGRATION-GUIDE.md for details.
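
For readers arriving at this issue from a newer release: in v0.12 the background threads are gone and maintenance is driven by the calling threads. A minimal sketch, assuming the v0.12 sync API (see MIGRATION-GUIDE.md for the authoritative details):

// Sketch assuming the v0.12 sync API: no background threads; pending
// maintenance runs on the calling threads, and can also be triggered
// explicitly. See MIGRATION-GUIDE.md for the authoritative details.
use moka::sync::Cache;

fn main() {
    let cache: Cache<String, String> = Cache::builder()
        .max_capacity(10_000)
        .build();

    cache.insert("hello".to_string(), "world".to_string());

    // Explicitly run any pending maintenance tasks on this thread.
    cache.run_pending_tasks();
}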
