Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Poor performance of wasmtime file I/O maybe because tokio #7973

Open
liutao-liu opened this issue Feb 21, 2024 · 22 comments
Open

Poor performance of wasmtime file I/O maybe because tokio #7973

liutao-liu opened this issue Feb 21, 2024 · 22 comments
Labels
bug Incorrect behavior in the current implementation that needs fixing

Comments

@liutao-liu
Copy link
Contributor

liutao-liu commented Feb 21, 2024

Test Case

test.c

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>

int main()
{
    FILE *fp; 
    char str[14];
    for (int i = 1; i < 1000; i++)
    {

        int fd = open("test.txt", O_CREAT | O_WRONLY, S_IRUSR | S_IWUSR);
        for (int j = 0; j < 1000; j++)
        {

            lseek(fd, i * 10 + j * 13, SEEK_SET);
            write(fd, "hello world! ", 13);
        }
        fsync(fd);
        close(fd); 
    }

    return 0;
}

Steps to Reproduce

  • first,compile test.c in the preceding test case into WASM bytecode using the WASI SDK
wasi-sdk/bin/clang -O3 test.c -o test.wasm
  • second, WASMTIME AOT compile ,got test.aot.
wasmtime compile  -W simd,relaxed-simd test.wasm -o test.aot
  • third, Test Case Running Duration. It takes about 40 seconds.
time wasmtime run --allow-precompiled --dir ./ test.aot

Expected Results

wasmtime takes about the same time as native and wamr.

Actual Results

Wasmtime takes about 23 seconds.
The same test.c, native or wamr only takes about 2 seconds.

Versions and Environment

Wasmtime version :16.0.0

Operating system: ubuntu 20.04

Architecture: aarch64 (same as x86 for this case)

Extra Info

Profile

#  profile for wasmtime
perf record -g -k mono wasmtime run --profile=jitdump --allow-precompiled --dir ./ test.aot
sudo perf inject --jit --input perf.data --output perf.jit.data
perf report -i perf.jit.data

As shown in the following figure, most performance hotspots are on Tokio. This is because wasmtime uses Tokio to implement the file I/O interface, involving:

__imported_wasi_snapshot_preview1_fd_read  
__imported_wasi_snapshot_preview1_fd_seek  
__imported_wasi_snapshot_preview1_fd_sync 
__imported_wasi_snapshot_preview1_fd_write

image

#  profile for native
perf record -g -k mono ./test
perf report 

image

System Call Times Statistics

As shown in the following figure, the number of wasmtime system call times is three times that of native.
Is it because wasmtime uses tokio to implement file IO operations, and the number of file I/O operations is three times that of native, resulting in poor performance?

# strace for wasmtime
strace -c wasmtime run --allow-precompiled --dir ./ test.aot

image

# strace for native( ths same as wamr )
strace -c ./test 

image

Why do we use Tokio to implement file I/O? Have we considered performance?

@liutao-liu liutao-liu added the bug Incorrect behavior in the current implementation that needs fixing label Feb 21, 2024
@liutao-liu liutao-liu changed the title performance of wasmtime file I/O is poor, maybe because tokio Poor performance of wasmtime file I/O maybe because tokio Feb 21, 2024
@pchickey
Copy link
Collaborator

pchickey commented Feb 21, 2024

Thanks for this detailed report!

The short answer is, we haven't benchmarked performance of file IO yet, and right off the top of my head we have a couple optimization ideas we haven't explored because we were just trying to get things out the door.

I will dig in deeper and see if we can come up with some improvements here.

@liutao-liu
Copy link
Contributor Author

liutao-liu commented Feb 22, 2024

Thank you for your prompt response.
Besides tokio, does wasmtime have any other I/O solutions?

Or is there any configuration on wasmtime that can improve tokio's io performance a bit?

@pchickey
Copy link
Collaborator

pchickey commented Feb 22, 2024

Those are great questions. This is a long answer but this is a pretty involved topic, so please excuse this wall of text, and let me know if there is anything about this I should explain better.

We chose to use tokio for wasmtime-wasi's implementation partly because for many production embedders (i.e. not users of the wasmtime-cli binary, but Fastly and Fermyon and Cosmonic and etc who use the wasmtime crate directly in their server software), wasmtime and WASI need to be implemented on top of an async runtime, and tokio is the async ecosystem of choice for all of our known embedders.

Another significant factor is that wasmtime-wasi-http is built on an HTTP implementation and needs to integrate with the pollables and streams provided by wasmtime-wasi. For production HTTP stacks in Rust, hyper, which is built on tokio, is essentially the only option. A significant fraction of all internet traffic passes through hyper today - I'm not aware of any credible alternatives.

In order to fit in with the same pollable and stream resources as wasi-sockets, wasi-http, and etc, wasi-filesystem is also implemented on top of tokio. Because Linux doesn't provide a non-blocking way to do File IO (except for io_uring, which afaik no production-ready Rust systems are using yet, though hopefully they will soon), the correct way to do File IO in tokio is to move that work onto a special blocking IO thread-pool managed by tokio, which costs two synchronizations between threads per blocking operation for the executor to hand the work off from an async task, and to get notified of completion to resume the async task - that our hypothesis for where the futex and epoll_pwait syscalls are coming from.

I don't actually know where the extra write syscalls are coming from, but I haven't had time to dig in deeper yet. There may be an obvious low-hanging fruit there? If we could it cut down to just one write per WASI file blocking-write-and-flush that would be a nice win.

Finally, why does wasmtime-cli, which is a totally synchronous Rust program, use tokio under the hood to implement the WASI interface? Basically, because we only had time to implement wasmtime-wasi once. When you use wasmtime-wasi with a synchronous wasmtime (i.e. wasmtime::Config::async_support(false), which is the default), wasmtime-wasi provides the Linker with a shim that lazily creates a private tokio runtime, and calls the async Rust implementations of all the pollable and stream bits (and various other wasi-filesystem operations that turn into blocking-syscalls) underneath that private tokio.

There is no configuration available to change whether blocking file IO is moved to separate thread or not - its very fundamental to how tokio and the wasmtime-wasi implementation work. The only way to change that behavior would be to rewrite wasmtime-wasi with a completely different internal architecture, to solely use synchronous Rust.

A rewrite of wasmtime-wasi for the synchronous register of Rust would take a significant amount of time to write and maintain, and it would be challenging to integrate with the rest of the ecosystem - it would essentially cleave the wasmtime ecosystem in two. For example, wasmtime-wasi-http, or whatever other crates folks write beyond this tree that integrate with wasmtime-wasi - and thats virtually every interface that does any sort of IO, since streams and pollables are so fundamental - currently use wasmtime-wasi's pub trait Subscribe, pub trait HostInputStream, and pub trait HostOutputStream, which all use async Rust functions and assume a tokio executor. We don't expect that problem to change anytime soon, and in fact we may expect it to become even more deeply integrated with wasmtime as the Component Model works on having native async as part of the next big effort.

To give historical context: I designed wasi-common so that tokio was an optional dependency, and we were able to get away with it during WASI Preview 1 because WASI didnt yet have streams or an extensible poll interface. Our experience embedding wasi-common in various contexts where we needed to wait on e.g. HTTP request readiness, or treat HTTP bodies as a stream, highlighted that the design of WASI Preview 1's poll_oneoff was not the right design for composable systems. So, we changed the design of WASI itself in Preview 2 to have a pollable resource that can be created by any interface that needs to express waiting on readiness, and with that we more or less got forced into using async Rust for async trait Subscribe as the host implementation of each pollable.

We may be able to find a way, for synchronous embeddings, to break some of tokio's rules about performing blocking syscalls on the "main thread", because that operation should only affect wasmtime-wasi and other crates that build on top of it. If we could break those rules, we could provide a faster path to perform some blocking file IO operations (likely input-stream.blocking-read and output-stream.blocking-write-and-flush only) without the cost of context switches. However, thats just a guess we have, and I am apprehensive to break tokio's rules because the consequences of that for other crates (like wasmtime-wasi-http, as well as whatever other code that others implement outside this tree) could both be pretty severe, and difficult to understand or document. We will see if we can investigate that more, but to be totally honest its not at the top of my priorities right now. If you can help us understand the business case for improving performance, that may help us move it up in priority.

@liutao-liu
Copy link
Contributor Author

liutao-liu commented Feb 23, 2024

Thank you very much for explaining the reasons for choosing tokio in such detail.

Currently, I am using FlashDB as my database. FlashDB has two types of file APIs: LIBC file API (like fopen/fread/fwrte/fclose) and Posix file API (like open/read/write/close). After being compiled by the wasi-sdk, the two types of file APIs invoke the same type of interface:
__imported_wasi_snapshot_preview1_fd_read
__imported_wasi_snapshot_preview1_fd_seek
__imported_wasi_snapshot_preview1_fd_sync
__imported_wasi_snapshot_preview1_fd_write...and so on,.

For the FlashDB scenario, I now expect the file IO performance of wasmtime to be close to native, so I plan to try to modify the implementation of wasmtime's file IO, and directly use rust's file IO instead of using tokio. This is just a try, of course, very much look forward to your suggestions.

@squillace
Copy link

@pchickey: Just to add to the users of async wasmtime at scale, Microsoft also uses this, and has built it into the containerd/runwasi project for generalized Kubernetes usage as well. Great description of how this came about, thank you for the time and effort.

@pchickey
Copy link
Collaborator

@liutao-liu Thanks for your response, that use case makes sense and its one that many users might encounter whether using FlashDB or sqlite or etc.

One detail I glossed over: we are using cap_std::fs::File (a thin wrapper on std File) to actually perform the io operations, not tokio::fs::File, but we are doing so inside a tokio::task::spawn_blocking, which makes it morally equivalent to the way tokio File wraps std File. The trick to avoiding the thread synchronizations is getting rid of spawn_blocking, which is architecturally kinda tricky to avoid, but it may be possible to hack through it, especially if you only need wasmtime-cli to work and not the more difficult composition scenarios I outlined.

@liutao-liu
Copy link
Contributor Author

liutao-liu commented Mar 8, 2024

Hi @pchickey,thank you for your modification tips. After I made the following changes according to your tips, the performance of wasmtime has been greatly improved.
image
image

In the above test case, the wasmtime time has been optimized from 23 seconds to 6 seconds. As you might expect, the futex system call is gone. Also, the write and read system calls become pwrite64 and pread64,i haven't figured out why yet.

Is it necessary to submit a PR for my changes? I added a run option to control whether or not to block in tokio. I think it is still necessary to provide an option for users to choose whether to use tokio

image

@alexcrichton
Copy link
Member

Thanks for testing that out @liutao-liu! I think it'd be reasonable to land something along these lines into the CLI itself, although I'd personally prefer to avoid a flag here since it'd be best to have the behavior turned on by default. What I might propose is something like:

  • Add a new flag to WasiCtx that indicates whether blocking the current thread is ok. This defaults to faults but the CLI would set it to true. That would then be consulted during spawn_blocking to do what you're doing, but from configuration via WasiCtx instead.
  • For the second slowdown, avoiding in_tokio, I think the best solution would be to do that as part of the CLI. For example the CLI could call in_tokio originally and that way all of wasm is executed within a tokio context. I believe that would hit the fast path in the in_tokio function where nothing is done.

Would that work for you?

@liutao-liu
Copy link
Contributor Author

liutao-liu commented Mar 11, 2024

Thanks for testing that out @liutao-liu! I think it'd be reasonable to land something along these lines into the CLI itself, although I'd personally prefer to avoid a flag here since it'd be best to have the behavior turned on by default. What I might propose is something like:

  • Add a new flag to WasiCtx that indicates whether blocking the current thread is ok. This defaults to faults but the CLI would set it to true. That would then be consulted during spawn_blocking to do what you're doing, but from configuration via WasiCtx instead.
  • For the second slowdown, avoiding in_tokio, I think the best solution would be to do that as part of the CLI. For example the CLI could call in_tokio originally and that way all of wasm is executed within a tokio context. I believe that would hit the fast path in the in_tokio function where nothing is done.

Would that work for you?

I understand your first propose, I can move the flag to wasictx. That's a good idea, it would be simpler.

But I don't understand your second propose. Do you mean to run the whole wasmtime in tokio? Can you explain that in more detail?

@alexcrichton
Copy link
Member

For the second point, sorry I think this is actually the right function, namely with_ambient_tokio_runtime.

Can you try wrapping this invocation of Func::call in that function and see if it improves the performance you're seeing?

@liutao-liu
Copy link
Contributor Author

Thanks for testing that out @liutao-liu! I think it'd be reasonable to land something along these lines into the CLI itself, although I'd personally prefer to avoid a flag here since it'd be best to have the behavior turned on by default. What I might propose is something like:

  • Add a new flag to WasiCtx that indicates whether blocking the current thread is ok. This defaults to faults but the CLI would set it to true. That would then be consulted during spawn_blocking to do what you're doing, but from configuration via WasiCtx instead.
  • For the second slowdown, avoiding in_tokio, I think the best solution would be to do that as part of the CLI. For example the CLI could call in_tokio originally and that way all of wasm is executed within a tokio context. I believe that would hit the fast path in the in_tokio function where nothing is done.

Would that work for you?

Hello @alexcrichton , I tested the solution you proposed and it actually took an average of 14 seconds. Compared with the original solution (23 seconds), this improvement is not ideal. This is because each I/O operation is performed by invoking spawn_blocking, which still causes a large number of asynchronous waits.
image

@alexcrichton
Copy link
Member

Sorry, but to confirm, did you keep the changes mentioned above, e.g. the --block-file-io-in-tokio flag?

I sketched out what those changes might look like in this commit, but I think we'll both want to skip the spawn_blocking (conditionally) and additionally have the with_ambient_tokio_runtime bits.

Can you confirm whether that commit has the performance that you're looking for?

@liutao-liu
Copy link
Contributor Author

liutao-liu commented Mar 20, 2024

Sorry, but to confirm, did you keep the changes mentioned above, e.g. the --block-file-io-in-tokio flag?

I sketched out what those changes might look like in this commit, but I think we'll both want to skip the spawn_blocking (conditionally) and additionally have the with_ambient_tokio_runtime bits.

Can you confirm whether that commit has the performance that you're looking for?

With these two changes, it takes 6 seconds, hardly any further improvement.

Also, you mentioned the new flags added in wasictx, which can't be read here.in_tokio

@alexcrichton
Copy link
Member

Oops sorry I forgot to actually turn the option on. I do realize that the goal is to avoid spawn_blocking, and most of the patch I linked was doing that. If you try this branch's latest commit that should remove the spawn_blocking. Can you test and see if that performance is what you're looking for?

@liutao-liu
Copy link
Contributor Author

liutao-liu commented Mar 20, 2024

Oops sorry I forgot to actually turn the option on. I do realize that the goal is to avoid spawn_blocking, and most of the patch I linked was doing that. If you try this branch's latest commit that should remove the spawn_blocking. Can you test and see if that performance is what you're looking for?

Using your latest commit, it took 7 seconds. You haven't changed the code here. After I modified it based on yours, it took 6 seconds.

fn in_tokio<F: Future>(future: F) -> Result<F::Output> {
        Ok(futures::executor::block_on(future))
    }

@alexcrichton
Copy link
Member

Ok, thanks for confirming!

You're right that I didn't change in_tokio and that was intentional. Changing in_tokio would break other functionality, even on an opt-in basis, so we can't quite so easily switch to a different executor and assume that it works. Some users of in_tokio require the tokio event loop to get turned, and other users don't necessarily need it. To use a totally different executor we'd have to classify which is which.

@alexcrichton
Copy link
Member

I've opened #8190 with the changes I made above cleaned up a bit. That's probably at least a good new baseline to start from in terms of optimizing.

@liutao-liu
Copy link
Contributor Author

liutao-liu commented Mar 30, 2024

//hi @pchickey,
I'm still working on optimizing wasmtime's fd_read\fd_write performance, and I've found BorrowChecker to be a performance hotspot, at around 7% cpu. I looked at the history of commits for crates/wiggle and couldn't see why BorrowChecker was introduced. Do you know the historical background to the introduction of BorrowChecker?
Which practical application scenarios require the use of BorrowChecker, and does the wasmtime-cli also require the use of BorrowChecker?

image

@liutao-liu
Copy link
Contributor Author

liutao-liu commented Mar 31, 2024

Wasmtime's borrow checker should be modeled after Rust's borrow checker, right? The borrow checker of Rust will only be executed by the compiler in the static code analysis phase, so it will not have a negative impact on runtime performance. The borrow checker of wasmtime is executed during runtime, which can have a negative impact on performance. I don't know why is wasmtime not allowing hosts to hold simultaneously share borrow and mut borrow pointer of guest memory. I feel like this can be relaxed like C/C++, with .wasm ensuring memory security.

@liutao-liu
Copy link
Contributor Author

Issue#734 have some explaination of the backgroud of borrow checker, the proposal mentioned in the issue is which proposal?

@iximeow
Copy link
Collaborator

iximeow commented Apr 1, 2024

I don't know why is wasmtime not allowing hosts to hold simultaneously share borrow and mut borrow pointer of guest memory. I feel like this can be relaxed like C/C++, with .wasm ensuring memory security.

wasm memory sandboxing is not really related to BorrowChecker - BorrowChecker ensures that regardless of what a wasm program may provide to Wasmtime as hostcall arguments, Wasmtime cannot be caused to violate Rust safety rules. imagine if an embedder exposed a host-optimized memcpy(dest, src, len); if a wasm program called that with overlapping dest and src, implementing such a function by just forwarding the arguments to libc's memcpy() would yield UB in Wasmtime, even though the wasm guest was safely sandboxed from directly accessing memory outside its heap.

(BorrowChecker also ensures references are properly aligned and lie entirely inside the wasm memory they ought to be in, which are checks even a C/C++ embedder of wasmtime would need and would cause some overhead. this is all pretty similar to the kinds of checks pread(2) and similar do before using user inputs)

while it might be possible to defer these checks in some cases (passthrough to a host pread(2) is probably a good example..) it's still tricky in Wasmtime because we don't know that pread(2) isn't happening at the end of a call chain like:
wasm -> embedder_hostcall(buf, len) -> BorrowChecker in wasmtime tracks &[buf] -> call back into wasm -> wasm calls wasi-common's fd_read(buf, len). so from fd_read's perspective it's only safe to pass through arguments to the host OS if Wasmtime has no other borrow overlapping with the buffer to be read into.

iirc the proposal in #734 is basically what became BorrowChecker and BorrowChecker is still what what we plan to use even as wig/wit/witx has evolved, though Pat or Alex would definitely be more familiar with that.

re. simultaneously holding share borrow and mut borrow of guest memory, BorrowChecker exists partially to support that, erroring if those borrows would overlap. disallowing concurrent borrows of guest memory would make BorrowChecker much simpler and faster, but a much more limited host interface 😅

alexcrichton added a commit to alexcrichton/wasmtime that referenced this issue Apr 1, 2024
This commit is a refactoring and modernization of wiggle's
`BorrowChecker` implementation. This type is quite old and predates
everything related to the component model for example. This type
additionally predates the implementation of WASI threads for Wasmtime as
well. In general, this type is old and has not been updated in a long
time.

Originally a `BorrowChecker` was intended to be a somewhat cheap method
of enabling the host to have active safe shared and mutable borrows to
guest memory. Over time though this hasn't really panned out. The WASI
threads proposal, for example, doesn't allow safe shared or mutable
borrows at all. Instead everything must be modeled as a copy in or copy
out of data. This means that all of `wasmtime-wasi` and `wasi-common`
have largely already been rewritten in such a way to minimize borrows
into linear memory.

Nowadays the only types that represent safe borrows are the `GuestSlice`
type and its equivalents (e.g. `GuestSliceMut`, `GuestStr`, etc). These
are minimally used throughout `wasi-common` and `wasmtime-wasi` and when
they are used they're typically isolated to a small region of memory.

This is all coupled with the facst that `BorrowChecker` never ended up
being optimized. It's a `Mutex<HashMap<..>>` effectively and a pretty
expensive one at that. The `Mutex` is required because `&BorrowChecker`
must both allow mutations and be `Sync`. The `HashMap` is used to
implement precise byte-level region checking to fulfill the original
design requirements of what `wiggle` was envisioned to be.

Given all that, this commit guts `BorrowChecker`'s implementation and
functionality. The type is now effectively a glorified `RefCell` for the
entire span of linear memory. Regions are no longer considered when
borrows are made and instead a shared borrow is considered as borrowing
the entirety of shared memory. This means that it's not possible to
simultaneously have a safe shared and mutable borrow, even if they're
disjoint, at the same time.

The goal of this commit is to address performance issues seen in bytecodealliance#7973
which I've seen locally as well. The heavyweight implementation of
`BorrowChecker` isn't really buying us much nowadays, especially with
much development having since moved on to the component model. The hope
is that this much coarser way of implementing borrow checking, which
should be much more easily optimizable, is sufficient for the needs of
WASI and not a whole lot else.
@alexcrichton
Copy link
Member

The BorrowChecker type is quite old in Wasmtime and I don't have anything to add about its rationale over what @iximeow already said. That being said I believe it's over-powered relative to what we need it to do, hence the cost you're seeing, so I posted #8277 which should remove the performance issue you're seeing related to BorrowChecker. That comes at a cost of redefining what it does, but I believe that should be ok given how it's used today.

github-merge-queue bot pushed a commit that referenced this issue Apr 1, 2024
This commit is a refactoring and modernization of wiggle's
`BorrowChecker` implementation. This type is quite old and predates
everything related to the component model for example. This type
additionally predates the implementation of WASI threads for Wasmtime as
well. In general, this type is old and has not been updated in a long
time.

Originally a `BorrowChecker` was intended to be a somewhat cheap method
of enabling the host to have active safe shared and mutable borrows to
guest memory. Over time though this hasn't really panned out. The WASI
threads proposal, for example, doesn't allow safe shared or mutable
borrows at all. Instead everything must be modeled as a copy in or copy
out of data. This means that all of `wasmtime-wasi` and `wasi-common`
have largely already been rewritten in such a way to minimize borrows
into linear memory.

Nowadays the only types that represent safe borrows are the `GuestSlice`
type and its equivalents (e.g. `GuestSliceMut`, `GuestStr`, etc). These
are minimally used throughout `wasi-common` and `wasmtime-wasi` and when
they are used they're typically isolated to a small region of memory.

This is all coupled with the facst that `BorrowChecker` never ended up
being optimized. It's a `Mutex<HashMap<..>>` effectively and a pretty
expensive one at that. The `Mutex` is required because `&BorrowChecker`
must both allow mutations and be `Sync`. The `HashMap` is used to
implement precise byte-level region checking to fulfill the original
design requirements of what `wiggle` was envisioned to be.

Given all that, this commit guts `BorrowChecker`'s implementation and
functionality. The type is now effectively a glorified `RefCell` for the
entire span of linear memory. Regions are no longer considered when
borrows are made and instead a shared borrow is considered as borrowing
the entirety of shared memory. This means that it's not possible to
simultaneously have a safe shared and mutable borrow, even if they're
disjoint, at the same time.

The goal of this commit is to address performance issues seen in #7973
which I've seen locally as well. The heavyweight implementation of
`BorrowChecker` isn't really buying us much nowadays, especially with
much development having since moved on to the component model. The hope
is that this much coarser way of implementing borrow checking, which
should be much more easily optimizable, is sufficient for the needs of
WASI and not a whole lot else.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior in the current implementation that needs fixing
Projects
None yet
Development

No branches or pull requests

5 participants