Make the serde helper process persistent. #2523
Conversation
Rather than starting the helper process for every proc-macro invocation, start it on the first macro invocation and keep it around. The goal is to amortize the process startup cost across multiple macro invocations. The process loops waiting for requests until it gets EOF on its stdin, whereupon it exits; this happens when the rustc process exits and closes its end of the stdin pipe. To support this, the change extends the protocol to prefix each chunk of data with a 32-bit little-endian length, so that each side knows how much to read rather than relying on reading to EOF.
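The length-prefixed framing described above can be sketched as follows. This is an illustrative sketch, not the PR's actual code: the function names `write_chunk` and `read_chunk` are mine, and the real implementation is inside serde_derive's helper plumbing.

```rust
use std::io::{self, Read, Write};

// Write one chunk: a 32-bit little-endian length prefix, then the bytes.
fn write_chunk(w: &mut impl Write, data: &[u8]) -> io::Result<()> {
    let len = u32::try_from(data.len()).expect("chunk too large for u32 prefix");
    w.write_all(&len.to_le_bytes())?;
    w.write_all(data)?;
    w.flush()
}

// Read one chunk; Ok(None) signals that the peer closed the pipe (EOF
// while reading the length prefix), which is the shutdown condition.
fn read_chunk(r: &mut impl Read) -> io::Result<Option<Vec<u8>>> {
    let mut len_buf = [0u8; 4];
    match r.read_exact(&mut len_buf) {
        Ok(()) => {}
        Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(None),
        Err(e) => return Err(e),
    }
    let len = u32::from_le_bytes(len_buf) as usize;
    let mut data = vec![0u8; len];
    r.read_exact(&mut data)?;
    Ok(Some(data))
}

fn main() {
    // Round-trip through an in-memory buffer standing in for the pipe.
    let mut buf = Vec::new();
    write_chunk(&mut buf, b"hello").unwrap();
    let mut pipe = io::Cursor::new(buf);
    assert_eq!(read_chunk(&mut pipe).unwrap().as_deref(), Some(&b"hello"[..]));
    // Nothing left in the buffer: reads as EOF, i.e. shutdown.
    assert_eq!(read_chunk(&mut pipe).unwrap(), None);
}
```

Because each side knows exactly how many bytes to read, a single pipe can carry many request/response pairs, which is what makes the persistent process possible.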
Thanks for the PR. This makes sense to me in the context of #2524, but if we are not merging that, would you still feel inclined to move forward with this?
I think the performance of the existing, more straightforward approach is fine: in #2514 I measured that expanding that one enormous clang-ast test case takes 1.91s on my machine (which agrees closely with the 1.891s on your machine), with an overhead of 3ms per macro call relative to the old serde_derive, while saving 4.4 seconds from the compilation of serde_derive and its dependency graph. That puts the break-even point somewhere around 1500 serde derives. It would be extremely unusual for any codebase to have that many, and millisecond differences would be lost in the noise of a real-world build, as opposed to builds where we've intentionally isolated just the measurement of macro expansion time.
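The break-even arithmetic in that comment can be checked directly; the figures (4.4s of build time saved, 3ms of extra overhead per macro call) are the ones quoted above:

```rust
fn main() {
    // Numbers from the discussion: precompiled serde_derive saves ~4.4 s
    // compiling serde_derive and its dependency graph, at a cost of ~3 ms
    // of extra overhead per macro invocation.
    let saved_build_secs = 4.4_f64;
    let overhead_per_macro_secs = 0.003_f64;
    let break_even = saved_build_secs / overhead_per_macro_secs;
    // ~1467 derives, i.e. "somewhere around 1500" as stated above.
    assert!((break_even - 1466.7).abs() < 1.0);
    println!("break-even at ~{break_even:.0} serde derives");
}
```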
I think this PR would be more compelling if either:

- we had less headroom than 1500. For example, if precompiled macros were slower than the old serde_derive for anybody with >40 serde derives in their program, then I would be more inclined to consider this optimization.

- this PR recouped substantially more than 1.25ms per macro call (maybe 4+ ms). If release-mode precompiled derives ended up being noticeably faster than a natively built debug-mode proc macro, then this technique would have novelty value that would make it worth considering merging.
@dtolnay, are you willing to do the same optimization on Windows? It is a known thing that starting processes on Windows is slower than on Linux, so this PR may show a bigger difference there.
Yeah, I think measuring on Linux is the best case for execve performance. Mac and especially Windows have much slower process spawn times, particularly if any kind of anti-virus / "client protection" software is enabled, so I suspect we'd see much larger benefits and an earlier break-even on those platforms. I'd be inclined to keep this design even if there were no performance win in this case, since it's not very complex and it's a proof of concept for generalizing this "helper process" idea. Of course, starting a whole new process from scratch for each macro gives a very strong level of hermeticity, but it is expensive from a system point of view (even though it's a heavily optimized path on Linux), and it's not too hard to achieve the same guarantees at the Rust language level.
This adds too much decompression overhead per invocation. It could work if the subprocess were reused across multiple macro expansions (#2523).
I will revisit this when we precompile for macOS or Windows.
This, in combination with #2522, carries some additional risk of memory overflow, but that would require an enormous number of serde invocations in one crate.
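The helper's main loop described above (serve length-prefixed requests until EOF on stdin) might look roughly like this. This is a hedged sketch: `expand` is a hypothetical stand-in for the real derive expansion, and `serve` is my own name for the loop, not an identifier from the PR.

```rust
use std::io::{self, Read, Write};

// Hypothetical request handler: the real helper would run the serde
// derive expansion here. Echoing the request keeps the sketch small.
fn expand(request: &[u8]) -> Vec<u8> {
    request.to_vec()
}

// Serve length-prefixed requests until EOF on `input`.
fn serve(mut input: impl Read, mut output: impl Write) -> io::Result<()> {
    loop {
        let mut len_buf = [0u8; 4];
        match input.read_exact(&mut len_buf) {
            Ok(()) => {}
            // rustc exited and closed its end of the pipe: shut down.
            Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(()),
            Err(e) => return Err(e),
        }
        let len = u32::from_le_bytes(len_buf) as usize;
        let mut request = vec![0u8; len];
        input.read_exact(&mut request)?;
        let response = expand(&request);
        output.write_all(&(response.len() as u32).to_le_bytes())?;
        output.write_all(&response)?;
        output.flush()?;
    }
}

fn main() -> io::Result<()> {
    serve(io::stdin().lock(), io::stdout().lock())
}
```

Since the loop only exits on EOF, the process lives exactly as long as the rustc process that spawned it, which is what amortizes the startup cost across all macro invocations in the build.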
Using the same clang-ast benchmark as in the original PR, there's a noticeable improvement:
Process-per-macro:
Persistent process: