
Make the serde helper process persistent. #2523

Closed
wants to merge 1 commit

Conversation

@jsgf commented on Jul 23, 2023

Rather than starting the helper process for every proc-macro invocation, start it on the first macro invocation and keep it around. The goal is to amortize the process startup cost across multiple macro invocations.

The process loops waiting for multiple requests, until it gets an EOF on its stdin, whereupon it exits. This will happen when the rustc process exits and closes its end of the stdin pipe.

To do this, this change extends the protocol to prefix each chunk of data with a 32-bit little-endian length, so that each side knows how much to read, rather than relying on reading to EOF.
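As a rough sketch of that framing (function names here are illustrative, not the PR's actual API): each message on the pipe is preceded by its length as a 32-bit little-endian integer, and an EOF while reading a length prefix signals that the peer has closed its end.

```rust
use std::io::{self, Read, Write};

/// Write one frame: a 32-bit little-endian length prefix, then the payload.
fn write_frame<W: Write>(mut w: W, data: &[u8]) -> io::Result<()> {
    let len = u32::try_from(data.len())
        .map_err(|_| io::Error::new(io::ErrorKind::InvalidInput, "frame too large"))?;
    w.write_all(&len.to_le_bytes())?;
    w.write_all(data)
}

/// Read one frame, or `None` if the stream was closed before a length prefix.
fn read_frame<R: Read>(mut r: R) -> io::Result<Option<Vec<u8>>> {
    let mut len_buf = [0u8; 4];
    match r.read_exact(&mut len_buf) {
        Ok(()) => {}
        // EOF before any prefix bytes: the other side closed the pipe.
        Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(None),
        Err(e) => return Err(e),
    }
    let mut data = vec![0u8; u32::from_le_bytes(len_buf) as usize];
    r.read_exact(&mut data)?;
    Ok(Some(data))
}
```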

In combination with #2522, this has some additional risk of memory overflow, but triggering it would require an enormous number of serde invocations in one crate.
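Putting the two pieces together, the helper's main loop might look roughly like this (a hypothetical sketch, not the PR's code; `serve` and its handler are made-up names): read a length-prefixed request, answer it, and repeat until stdin hits EOF because rustc has exited and closed its end of the pipe.

```rust
use std::io::{self, Read, Write};

/// Serve length-prefixed requests until EOF on `input`, then exit cleanly.
fn serve(
    mut input: impl Read,
    mut output: impl Write,
    handle: impl Fn(&[u8]) -> Vec<u8>,
) -> io::Result<()> {
    loop {
        let mut len_buf = [0u8; 4];
        match input.read_exact(&mut len_buf) {
            Ok(()) => {}
            // EOF while waiting for the next request: the client is gone.
            Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(()),
            Err(e) => return Err(e),
        }
        let mut request = vec![0u8; u32::from_le_bytes(len_buf) as usize];
        input.read_exact(&mut request)?;
        let response = handle(&request);
        // Echo the response back with the same 32-bit LE length framing.
        output.write_all(&(response.len() as u32).to_le_bytes())?;
        output.write_all(&response)?;
    }
}
```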

Using the same clang-ast benchmark as in the original PR, there's a noticeable improvement:

Process-per-macro:

$ hyperfine ' cargo expand --ugly --test exhaustive >expand.rs'
Benchmark 1:  cargo expand --ugly --test exhaustive >expand.rs
  Time (mean ± σ):      1.891 s ±  0.192 s    [User: 1.010 s, System: 0.460 s]
  Range (min … max):    1.666 s …  2.173 s    10 runs

Persistent process:

$ hyperfine ' cargo expand --ugly --test exhaustive >expand.rs'
Benchmark 1:  cargo expand --ugly --test exhaustive >expand.rs
  Time (mean ± σ):      1.524 s ±  0.233 s    [User: 0.561 s, System: 0.102 s]
  Range (min … max):    1.297 s …  2.074 s    10 runs

@dtolnay (Member) left a comment
Thanks for the PR. This makes sense to me in the context of #2524, but if we are not merging that, would you still feel inclined to move forward with this?

I think the performance of the existing, more straightforward approach is fine: in #2514 I measured that expanding that one enormous clang-ast test case takes 1.91s on my machine (which agrees closely with the 1.891s on yours), and I measured an overhead of 3ms per macro call relative to the old serde_derive, while saving 4.4 seconds from the compilation of serde_derive and its dependency graph. That means the break-even point would be somewhere around 1500 serde derives; it would be extremely unusual for any codebase to have that many, and millisecond differences would be in the noise for real-world builds, as opposed to builds where we've intentionally isolated just the measurement of macro expansion time.

I think this PR would be more compelling if either:

  • we had less headroom than the ~1500 derives. For example, if precompiled macros were slower than the old serde_derive for anybody with >40 serde derives in their program, then I would be more inclined to consider this optimization.

  • or, if this PR recouped substantially more than 1.25ms per macro call (maybe 4+ ms). If release-mode precompiled derives ended up being noticeably faster than a natively built debug-mode proc macro, then this technique would have novelty value that would make it worth considering merging.
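For what it's worth, the ~1500 break-even figure above can be sanity-checked from the two numbers quoted in the comment (4.4 s saved on compiling serde_derive, ~3 ms of overhead per macro call); the figures below are taken from the comment, not re-measured:

```rust
// Back-of-envelope break-even: how many derives until the per-call
// overhead eats the one-time compilation savings?
fn break_even(saved_secs: f64, overhead_secs_per_derive: f64) -> f64 {
    saved_secs / overhead_secs_per_derive
}
```

`break_even(4.4, 0.003)` comes out to roughly 1467 derives, i.e. "somewhere around 1500" as stated.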

@Mingun (Contributor) commented on Jul 25, 2023

@dtolnay, are you willing to measure the same optimization on Windows? It is well known that starting processes on Windows is slower than on Linux, so this PR may show a bigger difference there.

@jsgf (Author) commented on Jul 25, 2023

Yeah I think measuring on Linux is the best case for execve performance. Mac and definitely Windows have much slower process spawn times, especially if there's any kind of anti-virus / "client protection" stuff enabled. I suspect we'd see much larger benefits/earlier breakeven on those platforms.

I'd be inclined to keep this design anyway even if there were no performance wins in this case, since it's not very complex and it's a proof of concept for generalizing this "helper process" idea. Of course, starting a whole new process from scratch for each macro does give a very strong level of hermeticity, but it is very expensive from a system POV (even though it's a heavily optimized path on Linux), and it's not too hard to achieve the same guarantees at the Rust language level.

dtolnay added a commit that referenced this pull request on Aug 18, 2023:
This adds too much decompression overhead per invocation.

It could work if the subprocess were reused across multiple macro
expansions (#2523).
@dtolnay (Member) left a comment

I will revisit this when we precompile for macOS or Windows.

@dtolnay closed this on Aug 18, 2023