
Make the serde helper process persistent. #2523

Closed
wants to merge 1 commit

Conversation

@jsgf commented on Jul 23, 2023

Rather than starting the helper process for every proc-macro invocation, start it on the first macro invocation and keep it around. The goal is to amortize the process startup cost across multiple macro invocations.

The process loops waiting for multiple requests, until it gets an EOF on its stdin, whereupon it exits. This will happen when the rustc process exits and closes its end of the stdin pipe.

To do this, this change extends the protocol to prefix each chunk of data with a 32-bit little-endian length, so that each side knows how much to read, rather than relying on reading to EOF.
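As a rough sketch of that framing (function names here are illustrative, not the PR's actual API): each message on the pipe is preceded by its length as a 32-bit little-endian integer, and an EOF while reading a length prefix signals that the peer has closed its end.

```rust
use std::io::{self, Read, Write};

/// Write one frame: a 32-bit little-endian length prefix, then the payload.
fn write_frame<W: Write>(mut w: W, data: &[u8]) -> io::Result<()> {
    let len = u32::try_from(data.len())
        .map_err(|_| io::Error::new(io::ErrorKind::InvalidInput, "frame too large"))?;
    w.write_all(&len.to_le_bytes())?;
    w.write_all(data)
}

/// Read one frame, or `None` if the stream was closed before a length prefix.
fn read_frame<R: Read>(mut r: R) -> io::Result<Option<Vec<u8>>> {
    let mut len_buf = [0u8; 4];
    match r.read_exact(&mut len_buf) {
        Ok(()) => {}
        // EOF before any prefix bytes: the other side closed the pipe.
        Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(None),
        Err(e) => return Err(e),
    }
    let mut data = vec![0u8; u32::from_le_bytes(len_buf) as usize];
    r.read_exact(&mut data)?;
    Ok(Some(data))
}
```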

In combination with #2522, this has some additional risk of memory overflow, but triggering it would require an enormous number of serde invocations in one crate.
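Putting the two pieces together, the helper's main loop might look roughly like this (a hypothetical sketch, not the PR's code; `serve` and its handler are made-up names): read a length-prefixed request, answer it, and repeat until stdin hits EOF because rustc has exited and closed its end of the pipe.

```rust
use std::io::{self, Read, Write};

/// Serve length-prefixed requests until EOF on `input`, then exit cleanly.
fn serve(
    mut input: impl Read,
    mut output: impl Write,
    handle: impl Fn(&[u8]) -> Vec<u8>,
) -> io::Result<()> {
    loop {
        let mut len_buf = [0u8; 4];
        match input.read_exact(&mut len_buf) {
            Ok(()) => {}
            // EOF while waiting for the next request: the client is gone.
            Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(()),
            Err(e) => return Err(e),
        }
        let mut request = vec![0u8; u32::from_le_bytes(len_buf) as usize];
        input.read_exact(&mut request)?;
        let response = handle(&request);
        // Echo the response back with the same 32-bit LE length framing.
        output.write_all(&(response.len() as u32).to_le_bytes())?;
        output.write_all(&response)?;
    }
}
```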

Using the same clang-ast benchmark as in the original PR, there's a noticeable improvement:

Process-per-macro:

$ hyperfine ' cargo expand --ugly --test exhaustive >expand.rs'
Benchmark 1:  cargo expand --ugly --test exhaustive >expand.rs
  Time (mean ± σ):      1.891 s ±  0.192 s    [User: 1.010 s, System: 0.460 s]
  Range (min … max):    1.666 s …  2.173 s    10 runs

Persistent process:

$ hyperfine ' cargo expand --ugly --test exhaustive >expand.rs'
Benchmark 1:  cargo expand --ugly --test exhaustive >expand.rs
  Time (mean ± σ):      1.524 s ±  0.233 s    [User: 0.561 s, System: 0.102 s]
  Range (min … max):    1.297 s …  2.074 s    10 runs

@dtolnay (Member) left a comment
Thanks for the PR. This makes sense to me in the context of #2524, but if we are not merging that, would you still feel inclined to move forward with this?

I think the performance of the existing, more straightforward approach is fine: in #2514 I measured that expanding that one enormous clang-ast test case takes 1.91s on my machine (which agrees closely with the 1.891s on yours), and I measured an overhead of 3ms per macro call relative to the old serde_derive, while saving 4.4 seconds from the compilation of serde_derive and its dependency graph. That means the break-even point would be somewhere around 1500 serde derives; it would be extremely unusual for any codebase to have that many, and millisecond differences would be in the noise for real-world builds, as opposed to builds where we've intentionally isolated just the measurement of macro expansion time.

I think this PR would be more compelling if either:

  • we had less headroom than the ~1500 derives. For example, if precompiled macros were slower than the old serde_derive for anybody with >40 serde derives in their program, then I would be more inclined to consider this optimization.

  • or, if this PR recouped substantially more than 1.25ms per macro call (maybe 4+ ms). If release-mode precompiled derives ended up being noticeably faster than a natively built debug-mode proc macro, then this technique would have novelty value that would make it worth considering merging.
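For what it's worth, the ~1500 break-even figure above can be sanity-checked from the two numbers quoted in the comment (4.4 s saved on compiling serde_derive, ~3 ms of overhead per macro call); the figures below are taken from the comment, not re-measured:

```rust
// Back-of-envelope break-even: how many derives until the per-call
// overhead eats the one-time compilation savings?
fn break_even(saved_secs: f64, overhead_secs_per_derive: f64) -> f64 {
    saved_secs / overhead_secs_per_derive
}
```

`break_even(4.4, 0.003)` comes out to roughly 1467 derives, i.e. "somewhere around 1500" as stated.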

@Mingun (Contributor) commented on Jul 25, 2023

@dtolnay, are you willing to measure the same optimization on Windows? It is well known that starting processes on Windows is slower than on Linux, so this PR may show a bigger difference there.

@jsgf (Author) commented on Jul 25, 2023

Yeah I think measuring on Linux is the best case for execve performance. Mac and definitely Windows have much slower process spawn times, especially if there's any kind of anti-virus / "client protection" stuff enabled. I suspect we'd see much larger benefits/earlier breakeven on those platforms.

I'd be inclined to keep this design anyway even if there were no performance wins in this case, since it's not very complex and it's a proof of concept for generalizing this "helper process" idea. Of course, starting a whole new process from scratch for each macro does give a very strong level of hermeticity, but it is very expensive from a system POV (even though it's a heavily optimized path on Linux), and it's not too hard to achieve the same guarantees at the Rust language level.

dtolnay added a commit that referenced this pull request on Aug 18, 2023:
This adds too much decompression overhead per invocation.

It could work if the subprocess were reused across multiple macro
expansions (#2523).
@dtolnay (Member) left a comment

I will revisit this when we precompile for macOS or Windows.

@dtolnay closed this on Aug 18, 2023