Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't lex the entire input immediately #1110

Merged

Commits on Jul 8, 2023

  1. Update peek-related methods with a Result

    This is in preparation for an upcoming change where lexing will be
    performed lazily rather than eagerly. In this situation lexing is
    possibly done during a `peek` operation which means that errors can
    happen at that time.
    
    This commit purely updates the related methods to return
    `Result<Option<T>>` instead of `Option<T>`, but it doesn't actually
    introduce any functional change other than that. The return value from
    these methods is always `Ok(...)` at this time, and a future commit will
    change that to possibly being an error.
    
    This is split out from the future change since it's a fair amount of
    churn with lots of new `?` operators popping up everywhere.
    alexcrichton committed Jul 8, 2023
    Configuration menu
    Copy the full SHA
    c339eb5 View commit details
    Browse the repository at this point in the history
  2. Fix some issues with support for annotations

    This commit fixes a few issues with annotation-related support in the
    wasm text format:
    
    * The `@producers` section is not parsed and recognized for components
      where previously it was ignored.
    * Similarly the `@name` annotation is not properly recognized for
      components instead of being ignored in a few places.
    * Registration of annotations has been moved to top-level parsers to
      avoid adding and re-adding entries to the known annotations map.
    alexcrichton committed Jul 8, 2023
    Configuration menu
    Copy the full SHA
    7f3d248 View commit details
    Browse the repository at this point in the history
  3. Don't lex the entire input immediately

    This commit is a change to the internals of parsing for wasm text files.
    Previously the entire input would be lex'd and stored into an array for
    parsing to access. This is, however, the largest contributing factor to
    the peak memory usage reported in bytecodealliance#1095. Tokens are only used a handful
    of times so buffering the entire file in-memory is quite wasteful.
    
    This change was tricky to apply, however, because the original rationale
    for lexing in this manner was performance related. The recursive-descent
    style `Parser` trait encourages `peek`-ing tokens and will often involve
    attempting a parse only to later unwind and try something else instead.
    This means that each individual token, especially whitespace and
    comments, would otherwise naively be lexed many times. For example if
    the buffer of all tokens is "simply" removed some instrumented analysis
    showed that over half of all tokens in the input file were lexed more
    than once. This means that simply removing the buffer resulted in a
    performance regression.
    
    This performance regression is in some sense inherently not addressable
    with a lazy-lexing strategy. I implemented a fixed-width cache of the
    latest tokens lex'd but it still didn't perform as well as caching all
    tokens. I think this is because lexing is quite fast and adding the
    layer of the cache spent a lot of time in checking and managing the
    cache. While this performance regression may not be 100% fixable though
    I've settled on a strategy that's a bit more of a half-measure.
    
    The general idea for the `Parser` is now that it stores the current
    position in the file plus the next "significant" token at the same time.
    Here "significant" means not-whitespace and not-comments for example.
    This enables the parser to always know the next token having pre-skipped
    whitespace and comments. Thus any single-token "peek" operations don't
    actually need to do any lexing, they can instead look at the current
    token and decide what to do based on that. This is enabled with a few
    new `Cursor::peek_*` methods to avoid generating the next `Cursor` which
    would otherwise require a lexing operation.
    
    Overall this means that the majority of tokens in the case of bytecodealliance#1095 are
    lexed only once. There's still ~10% of tokens that are lexed 2+ times,
    but the performance numbers are before this commit parsing that file
    took 7.6s with 4G of memory, and after this commit it takes 7.9s with 2G
    of memory. This means that for a 4% regression in time we're reducing
    memory usage by half.
    alexcrichton committed Jul 8, 2023
    Configuration menu
    Copy the full SHA
    8acbfd2 View commit details
    Browse the repository at this point in the history
  4. Fix tests

    alexcrichton committed Jul 8, 2023
    Configuration menu
    Copy the full SHA
    4d44196 View commit details
    Browse the repository at this point in the history