
wast: Parsing giant wat takes too much memory #1095

Closed
Xyah3PBeHB opened this issue Jul 2, 2023 · 5 comments

@Xyah3PBeHB

I have a giant wat file (385,794,195 bytes) with 127,384 functions (568 of which are imports), and I used heaptrack to analyze the memory usage when converting that wat to wasm with wasm-tools and with wat2wasm (from wabt).

wasm-tools

heaptrack target/release/wasm-tools parse test.wat -o test1.wasm

total runtime: 37.60s.
calls to allocation functions: 20932994 (556743/s)
temporary memory allocations: 8617065 (229183/s)
peak heap memory consumption: 6.20G
peak RSS (including heaptrack overhead): 6.06G
total memory leaked: 352B

wat2wasm

heaptrack wabt-1.0.33/bin/wat2wasm test.wat -o test2.wasm --enable-annotations

total runtime: 31.10s.
calls to allocation functions: 23266595 (748122/s)
temporary memory allocations: 1561856 (50220/s)
peak heap memory consumption: 2.83G
peak RSS (including heaptrack overhead): 3.01G
total memory leaked: 0B

To my surprise, the peak memory of wasm-tools was more than double that of wat2wasm. This is the wat file used for testing.

alexcrichton added a commit to alexcrichton/wasm-tools that referenced this issue Jul 5, 2023
This commit shrinks the size of the `Instruction` enum on x86_64 from
152 bytes to 80 bytes. This is in an effort to reduce the resource
consumption reported in bytecodealliance#1095 since it's definitely excessively large at
this point. This reduces the peak memory usage from 6G to 5G when
parsing that input `*.wat` file.

Shrinking this enum has been done by running a build with
`RUSTFLAGS=-Zprint-type-sizes` and then adding a `Box` to variants with
excessively large payloads. The current size of 80 bytes is still a bit
large, but I didn't want to go too crazy with inserting boxes just yet.
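
A rough illustration of the boxing trick, using hypothetical stand-in types rather than the real `Instruction` variants:

```rust
// A Rust enum is at least as large as its largest variant, so boxing one
// oversized payload shrinks every variant of the enum at once.
use std::mem::size_of;

// Hypothetical oversized payload standing in for a large instruction immediate.
struct BigImmediate {
    lanes: [u8; 64],
    memarg: (u64, u64, u32),
}

enum InstrUnboxed {
    Nop,
    Big(BigImmediate), // stored inline, inflating the size of the whole enum
}

enum InstrBoxed {
    Nop,
    Big(Box<BigImmediate>), // stored on the heap; this variant is pointer-sized
}

fn main() {
    println!("unboxed: {} bytes", size_of::<InstrUnboxed>());
    println!("boxed:   {} bytes", size_of::<InstrBoxed>());
}
```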
alexcrichton added a commit to alexcrichton/wasm-tools that referenced this issue Jul 5, 2023
Otherwise parsing an `Index` requires heap allocations due to the error
message tracking in `lookahead1` which typically gets thrown away as
other branches succeed in parsing.

I found this via DHAT while looking into bytecodealliance#1095 and while it wasn't much
of a perf improvement it drastically reduced the number of temporary
allocations performed at this location which seems like a nice little
micro-optimization to have nonetheless.
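
The pattern, sketched here with hypothetical types (the real code goes through the `wast` parser's lookahead machinery), is to pick the branch from a cheap peek and only build an error message on the path that actually fails:

```rust
// Hypothetical sketch: branch on an already-known token kind instead of
// constructing an "expected one of ..." error string for every attempt.
#[derive(Clone, Copy, PartialEq)]
enum TokenKind {
    Integer,
    Identifier,
    Other,
}

enum Index {
    Numeric(u32),
    Named(String),
}

fn parse_index(peeked: TokenKind, text: &str) -> Result<Index, String> {
    match peeked {
        TokenKind::Integer => {
            let n = text.parse().map_err(|_| "invalid integer".to_string())?;
            Ok(Index::Numeric(n))
        }
        TokenKind::Identifier => Ok(Index::Named(text.to_string())),
        // The error message (and its allocation) is only produced when
        // parsing genuinely fails, not on every speculative attempt.
        _ => Err("expected an index (integer or identifier)".to_string()),
    }
}
```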
@alexcrichton
Member

Thanks for the report! I've been digging in a bit and there's quite a bit of fruit to pick in this regard, so I'll try to make some PRs in the future to reduce the overhead here.

If I may be so bold as to ask, did this cause issues for you somewhere? Basically, I'm curious what motivated you to make the comparison.

fitzgen pushed a commit that referenced this issue Jul 6, 2023
fitzgen pushed a commit that referenced this issue Jul 6, 2023
alexcrichton added a commit to alexcrichton/wasm-tools that referenced this issue Jul 6, 2023
This commit is another improvement towards addressing bytecodealliance#1095 where the
goal here is to shrink the size of `Token` and reduce the allocated
memory that it retains. Currently the entire input string is tokenized
and stored as a list of tokens for `Parser` to process. This means that
the size of a token has a large effect on the size of this vector for
large inputs.

Even before this commit tokens had been slightly optimized for size
where some variants were heap-allocated with a `Box`. In profiling with
DHAT, however, it appears that a large portion of peak memory was these
boxes, namely for integer/float tokens which appear quite a lot in many
inputs.

The changes in this commit were to:

* Shrink the size of `Token` to two words. This is done by removing all
  pointers from `Token` and instead only storing a `TokenKind` which is
  packed to 32-bits or less. Span information is still stored in a
  `Token`, however.

* With no more payload, tokens which previously had a payload, such as
  integers, strings, and floats, are now re-parsed. They're sort of
  parsed once while lexing, then again when the token is interpreted
  later on. Some of this is fundamental where the parsing currently
  happens in a type-specific context but the context isn't known during
  lexing (e.g. if something is parsed as `u8` then that shouldn't accept
  `256` as input).

The hypothesis behind this is that tokens are far more often keywords,
whitespace, and comments rather than integers, strings, and floats. This
means that if these tokens require some extra work then that should
hopefully "come out in the wash" after and this representation would
otherwise allow for other speedups.

Locally the example in bytecodealliance#1095 has a peak memory usage reduced from 5G to
4G from this commit and additionally the parsing time drops from 8.9s to
7.6s.
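
A minimal sketch of the resulting shape, with hypothetical types rather than the real wast lexer:

```rust
// Hypothetical sketch: a token is just a small kind tag plus its span, and
// any payload is re-parsed from the original source text on demand, in a
// type-aware context.
#[derive(Clone, Copy)]
enum TokenKind {
    Keyword,
    Integer,
    Whitespace,
    Comment,
}

#[derive(Clone, Copy)]
struct Token {
    kind: TokenKind,
    offset: u32, // byte offset into the source
    len: u32,    // token length in bytes
}

impl Token {
    /// Re-parse the integer payload once the parser knows the target type;
    /// in a `u8` context, for example, this correctly rejects "256".
    fn parse_u8(&self, src: &str) -> Option<u8> {
        match self.kind {
            TokenKind::Integer => src
                .get(self.offset as usize..(self.offset + self.len) as usize)?
                .parse()
                .ok(),
            _ => None,
        }
    }
}
```

With this layout the sketched token is a dozen bytes, comfortably within two machine words on x86_64, so the buffered token list shrinks accordingly.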
@Xyah3PBeHB
Author

First of all, thank you for optimizing the memory usage. I'd be happy to share the origin of this issue.
In the past I used wat2wasm to analyze wasm. Recently I planned to use wasmparser to automate some operations, so I started by trying the wasm-tools tool set. But an operation I had always been able to complete before (converting a giant wat back to wasm) ran out of memory; my device hasn't been upgraded in many years and only has 16G of memory. So I wanted to compare it with the tools I had used in the past.

@alexcrichton
Member

Makes sense and thanks for the context!

fitzgen pushed a commit that referenced this issue Jul 7, 2023
alexcrichton added a commit to alexcrichton/wasm-tools that referenced this issue Jul 8, 2023
This commit is a change to the internals of parsing for wasm text files.
Previously the entire input would be lex'd and stored into an array for
parsing to access. This is, however, the largest contributing factor to
the peak memory usage reported in bytecodealliance#1095. Tokens are only used a handful
of times so buffering the entire file in-memory is quite wasteful.

This change was tricky to apply, however, because the original rationale
for lexing in this manner was performance related. The recursive-descent
style `Parser` trait encourages `peek`-ing tokens and will often involve
attempting a parse only to later unwind and try something else instead.
This means that each individual token, especially whitespace and
comments, would otherwise naively be lexed many times. For example if
the buffer of all tokens is "simply" removed some instrumented analysis
showed that over half of all tokens in the input file were lexed more
than once. This means that simply removing the buffer resulted in a
performance regression.

This performance regression is in some sense inherently not addressable
with a lazy-lexing strategy. I implemented a fixed-width cache of the
latest tokens lex'd but it still didn't perform as well as caching all
tokens. I think this is because lexing is quite fast and adding the
layer of the cache spent a lot of time in checking and managing the
cache. While this performance regression may not be 100% fixable,
I've settled on a strategy that's a bit more of a half-measure.

The general idea for the `Parser` is now that it stores the current
position in the file plus the next "significant" token at the same time.
Here "significant" means not-whitespace and not-comments for example.
This enables the parser to always know the next token having pre-skipped
whitespace and comments. Thus any single-token "peek" operations don't
actually need to do any lexing, they can instead look at the current
token and decide what to do based on that. This is enabled with a few
new `Cursor::peek_*` methods to avoid generating the next `Cursor` which
would otherwise require a lexing operation.

Overall this means that the majority of tokens in the case of bytecodealliance#1095 are
lexed only once. There's still ~10% of tokens that are lexed 2+ times,
but as for the performance numbers: before this commit parsing that file
took 7.6s with 4G of memory, and after this commit it takes 7.9s with 2G
of memory. This means that for a 4% regression in time we're reducing
memory usage by half.
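
Sketched with hypothetical types (the real implementation lives in the wast lexer and `Parser`), the idea looks roughly like this:

```rust
// Hypothetical sketch: the parser tracks its byte position plus the next
// "significant" (non-whitespace, non-comment) token, so single-token peeks
// never need to lex anything.
#[derive(Clone, Copy)]
enum TokenKind {
    Whitespace,
    Comment,
    Keyword,
    Integer,
}

#[derive(Clone, Copy)]
struct Token {
    kind: TokenKind,
    start: usize,
    len: usize,
}

struct ParserState<'a> {
    src: &'a str,
    pos: usize,
    next: Option<Token>, // pre-lexed next significant token
}

impl<'a> ParserState<'a> {
    /// Lex forward until a significant token (or end of input) is found and
    /// cache it; whitespace and comments are skipped and never stored.
    fn refill(&mut self) {
        self.next = None;
        while let Some(tok) = lex_one(self.src, self.pos) {
            self.pos = tok.start + tok.len;
            if !matches!(tok.kind, TokenKind::Whitespace | TokenKind::Comment) {
                self.next = Some(tok);
                break;
            }
        }
    }

    /// Lex-free, allocation-free peek at the upcoming token.
    fn peek_kind(&self) -> Option<TokenKind> {
        self.next.map(|t| t.kind)
    }

    /// Consume the cached token and pre-lex the one after it.
    fn bump(&mut self) -> Option<Token> {
        let tok = self.next;
        self.refill();
        tok
    }
}

// Placeholder lexer for the sketch; the real lexer tokenizes WebAssembly text.
fn lex_one(src: &str, pos: usize) -> Option<Token> {
    let _ = (src, pos);
    None
}
```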
fitzgen pushed a commit that referenced this issue Jul 11, 2023
* Update `peek`-related methods with a `Result`

This is in preparation for an upcoming change where lexing will be
performed lazily rather than eagerly. In this situation lexing is
possibly done during a `peek` operation which means that errors can
happen at that time.

This commit purely updates the related methods to return
`Result<Option<T>>` instead of `Option<T>`, but it doesn't actually
introduce any functional change other than that. The return value from
these methods is always `Ok(...)` at this time, and a future commit will
change that to possibly being an error.

This is split out from the future change since it's a fair amount of
churn with lots of new `?` operators popping up everywhere.
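
In signature terms the change is essentially the following (hypothetical method names, not the exact `Parser`/`Cursor` API):

```rust
// Hypothetical sketch of the signature change. Once lexing happens lazily
// inside a peek, the peek itself can hit a lex error, so that error has to
// be surfaced through the return type.
struct LexError;

trait PeekEager {
    /// Old shape: the entire input was lexed up front, so peeking can't fail.
    fn peek_keyword(&self) -> Option<&str>;
}

trait PeekLazy {
    /// New shape: `Err(..)` reports a lex error, `Ok(None)` means "no keyword here".
    fn peek_keyword(&self) -> Result<Option<&str>, LexError>;
}
```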

* Fix some issues with support for annotations

This commit fixes a few issues with annotation-related support in the
wasm text format:

* The `@producers` section is now parsed and recognized for components
  where previously it was ignored.
* Similarly the `@name` annotation is now properly recognized for
  components instead of being ignored in a few places.
* Registration of annotations has been moved to top-level parsers to
  avoid adding and re-adding entries to the known annotations map.

* Don't lex the entire input immediately

* Fix tests
@alexcrichton
Member

Ok, after #1110 I think we're in a much better state now. Runtime still isn't fantastic, but 2G is much better than the previous 6G. More improvements are still possible, but is that suitable for your use case? If so I'll probably go ahead and close this; if not, there are still a couple of things that could be optimized further.

@Xyah3PBeHB
Author

Thank you very much for your efforts. I made a comparison based on the latest commit. Now not only is the peak memory lower than wat2wasm's, but the runtime has also improved dramatically and is far ahead. This exceeds the improvement I expected, and it is more than enough to use on my old device.

wasm-tools

heaptrack target/release/wasm-tools parse sample.wat -o test1.wasm

total runtime: 11.11s.
calls to allocation functions: 4072605 (366538/s)
temporary memory allocations: 1748744 (157388/s)
peak heap memory consumption: 2.04G
peak RSS (including heaptrack overhead): 2.01G
total memory leaked: 368B

wat2wasm

heaptrack wabt-1.0.33/bin/wat2wasm sample.wat -o test2.wasm

total runtime: 25.57s.
calls to allocation functions: 23266595 (910024/s)
temporary memory allocations: 1561856 (61088/s)
peak heap memory consumption: 2.83G
peak RSS (including heaptrack overhead): 3.02G
total memory leaked: 0B
