
Library and Tools Subset build times regression (50+% increase) #82583

Closed
LoopedBard3 opened this issue Feb 22, 2023 · 32 comments · Fixed by #82599
Labels: area-VM-coreclr, tenet-performance (Performance related issue)
Milestone: 8.0.0
Comments

@LoopedBard3
Member

Description

Runtime builds for "-subset tools+libs" have started taking much longer, as seen in timeouts in the performance pipelines and the official runtime build pipelines. The build time appears to have increased by 50+% (from ~40 min to 1+ hr in the perf pipeline).

Regression?

Data is below, with diffs from before and after; note that both performance pipeline runs used the same .NET SDK version: 8.0.100-alpha.1.23061.8.

Data

Specific runs that should properly capture the closest normal and regressed build:

@LoopedBard3 LoopedBard3 added the tenet-performance Performance related issue label Feb 22, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Feb 22, 2023
@jkoritzinsky
Member

We updated the Roslyn compiler version in this commit range. Maybe there's a regression in the compiler?

@hoyosjs
Member

hoyosjs commented Feb 22, 2023

This is related to #76454

@hoyosjs
Member

hoyosjs commented Feb 22, 2023

Looking at the time taken to build SPC (System.Private.CoreLib), the regression window becomes pretty obvious:

2023-02-21T21:45:00Z	12
2023-02-21T10:30:00Z	11.666666666666666
2023-02-21T16:30:00Z	11.666666666666666
2023-02-22T01:30:00Z	11.666666666666666
2023-02-21T15:45:00Z	11.6
2023-02-21T19:30:00Z	11.6
2023-02-21T22:30:00Z	11.333333333333334
2023-02-21T12:45:00Z	11.111111111111111
2023-02-21T09:00:00Z	11
2023-02-21T12:00:00Z	11
2023-02-21T18:00:00Z	11
2023-02-20T15:45:00Z	9.6666666666666661
2023-02-21T23:15:00Z	9.4
2023-02-22T02:15:00Z	9.2
2023-02-21T11:15:00Z	9.1428571428571423
2023-02-21T17:15:00Z	8.6666666666666661
2023-02-22T00:00:00Z	8.25
2023-02-21T04:30:00Z	7
2023-02-20T13:30:00Z	6.75
2023-02-20T12:45:00Z	6.666666666666667
2023-02-20T06:45:00Z	6.5
2023-02-20T11:15:00Z	6.5
2023-02-21T08:15:00Z	6.333333333333333
2023-02-20T22:30:00Z	6.2
2023-02-20T10:30:00Z	6.166666666666667
2023-02-20T07:30:00Z	6
2023-02-20T08:15:00Z	6
2023-02-20T15:00:00Z	6
2023-02-20T17:15:00Z	6
2023-02-21T00:00:00Z	6
2023-02-21T02:15:00Z	6
2023-02-21T03:00:00Z	5.8571428571428568
2023-02-20T20:15:00Z	5.8
2023-02-20T23:15:00Z	5.8

[chart of the SPC build-time data above]

The jump starts after 08:00 UTC on Feb 21, which matches the merge time of #81164.

@ghost

ghost commented Feb 22, 2023

Tagging subscribers to this area: @dotnet/area-infrastructure-libraries
See info in area-owners.md if you want to be subscribed.


@jcouv
Member

jcouv commented Feb 22, 2023

@EgorBo To validate whether this build time increase was caused by the new length-based switch dispatch analysis, there is a command-line flag that will disable the analysis: /features:disable-length-based-switch. I believe <Features>disable-length-based-switch</Features> from MSBuild will also work, but I have not confirmed the exact syntax.
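For anyone who wants to try this from MSBuild, here is a minimal sketch of what the property wiring might look like, assuming a repo-level Directory.Build.props and assuming that the Features property is the right hook (which, per the comment above, is unconfirmed):

<Project>
  <PropertyGroup>
    <!-- Hypothetical: append the flag to any features already set; it should flow to csc as /features:disable-length-based-switch -->
    <Features>$(Features);disable-length-based-switch</Features>
  </PropertyGroup>
</Project>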

hoyosjs referenced this issue Feb 22, 2023
Reverts version bump component of #81164

This PR caused heavy managed build regressions hitting all PR builds. See #82458 and #76454
@EgorBo
Member

EgorBo commented Feb 22, 2023

Can confirm that build.cmd Clr.CoreLib -c Release regressed from 30s to 40s on my machine. <Features>disable-length-based-switch</Features> didn't help, so it must be something else.

@agocke
Member

agocke commented Feb 22, 2023

Moving to Roslyn since this is definitely a compiler perf regression.

@agocke agocke transferred this issue from dotnet/runtime Feb 22, 2023
@jcouv jcouv self-assigned this Feb 22, 2023
@jcouv jcouv removed the untriaged New issue has not been triaged by the area owner label Feb 22, 2023
@am11
Member

am11 commented Feb 23, 2023

Did the Roslyn update revert (#82466) fix the 50% regression? The runtime pipeline is still timing out.

@jcouv
Member

jcouv commented Feb 23, 2023

I've been doing some investigation locally. I tried comparing build times after revert PR and before. My steps are: 1. build.cmd -clean, 2. build.cmd -restore, 3. build.cmd Clr.CoreLib -c Release.
I'm not seeing any meaningful difference between before vs. after...

An earlier comment pointed at PR #81164 as the source of the build-time regression, but I observed that the revert PR is not a complete revert of #81164. The original PR also added some source generators (Microsoft.Interop.SourceGeneration) to several projects, so I wondered if that could be the problem.

But running similar tests comparing 81164 to the commit preceding it also yielded no significant build-time difference...

Could someone look at this with me to confirm my repro steps or to provide a before/after repro?

@jkoritzinsky
Member

@hoyosjs had data showing the slowdown in the System.Private.CoreLib build, which didn't have any source-generator changes.

@hoyosjs
Member

hoyosjs commented Feb 23, 2023

Let me check the data again

@agocke
Member

agocke commented Feb 23, 2023

It's probably more productive to just look at a trace before and after the original merge, instead of before and after the revert. That should identify the problem, regardless of where it lies.

@hoyosjs
Member

hoyosjs commented Feb 23, 2023

The revert helped, even for the libraries build; the partial revert fixed the issues.

System.Private.CoreLib: [build-time chart]

Libraries: [build-time chart]

The official commands are:

  • build.cmd -subset clr.nativeprereqs -arch x86 -ci (runs first)
  • build.cmd -subset clr.corelib+clr.nativecorelib+clr.nativeaotlibs+clr.tools+clr.packages+clr.paltestlist -arch x86 -c release -ci (the one that regresses)

@jaredpar
Member

@jcouv, @hoyosjs and I spent a few hours debugging the issue this afternoon. It appears the regression occurs because the VBCSCompiler process is crashing during the build. The crash causes us to fall back to csc, which loses the benefit of the compiler server and causes the build time to increase. In the cases where VBCSCompiler does not crash, the build time is as expected.

The stack trace of the crash is below. Confirmed this is happening on the 8.0 preview runtime.

0:045> k
 # Child-SP          RetAddr               Call Site
00 (Inline Function) --------`--------     coreclr!VolatileStore [D:\a\_work\1\s\src\coreclr\inc\volatile.h @ 273] 
01 (Inline Function) --------`--------     coreclr!SetObjectReferenceUnchecked [D:\a\_work\1\s\src\coreclr\vm\object.cpp @ 319] 
02 (Inline Function) --------`--------     coreclr!PtrArray::SetAt+0x8 [D:\a\_work\1\s\src\coreclr\vm\object.h @ 705] 
03 000000f0`75aff990 00007ff8`d3e5ace6     coreclr!LoaderAllocator::SetHandleValue+0xa52fd [D:\a\_work\1\s\src\coreclr\vm\loaderallocator.cpp @ 989] 
04 000000f0`75affa00 00007ff8`d3ee8eda     coreclr!LoaderAllocator::FreeHandle+0x1a [D:\a\_work\1\s\src\coreclr\vm\loaderallocator.cpp @ 880] 
05 000000f0`75affa30 00007ff8`d3dfd9c9     coreclr!ThreadLocalBlock::FreeTLM+0xeb506 [D:\a\_work\1\s\src\coreclr\vm\threadstatics.cpp @ 61] 
06 000000f0`75affa70 00007ff8`d3dfeefb     coreclr!ThreadLocalBlock::FreeTable+0xc9 [D:\a\_work\1\s\src\coreclr\vm\threadstatics.cpp @ 93] 
07 (Inline Function) --------`--------     coreclr!Thread::DeleteThreadStaticData+0xc [D:\a\_work\1\s\src\coreclr\vm\threads.cpp @ 7727] 
08 000000f0`75affaa0 00007ff8`d3dfd298     coreclr!Thread::OnThreadTerminate+0x9f [D:\a\_work\1\s\src\coreclr\vm\threads.cpp @ 2956] 
09 000000f0`75affb70 00007ff8`d3d6a77c     coreclr!DestroyThread+0x64 [D:\a\_work\1\s\src\coreclr\vm\threads.cpp @ 924] 
0a 000000f0`75affba0 00007ff9`62a726bd     coreclr!ThreadNative::KickOffThread+0xcc [D:\a\_work\1\s\src\coreclr\vm\comsynchronizable.cpp @ 239] 
0b 000000f0`75affc00 00007ff9`6348dfb8     kernel32!BaseThreadInitThunk+0x1d
0c 000000f0`75affc30 00000000`00000000     ntdll!RtlUserThreadStart+0x28

@hoyosjs has dumps of the crash available for investigation.

Moving back to runtime

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Feb 24, 2023
@jaredpar jaredpar transferred this issue from dotnet/roslyn Feb 24, 2023
@EgorBo
Member

EgorBo commented Feb 24, 2023

@jaredpar can it be a sort of build failure if VBCS crashes?

@marek-safar marek-safar added this to the 8.0.0 milestone Feb 24, 2023
@marek-safar marek-safar removed the untriaged New issue has not been triaged by the area owner label Feb 24, 2023
@marek-safar
Contributor

/cc @jeffschwMSFT for visibility

@jeffschwMSFT
Member

@jaredpar do we either have a reproducible build or a dump?

cc @mangod9

@stephentoub
Member

@jaredpar do we either have a reproducible build or a dump?

"@hoyosjs has dumps of the crash available for investigation."

@jaredpar
Member

can it be a sort of build failure if VBCS crashes?

This is a tricky question. Consider this case, where essentially a native AV rudely tears down the VBCSCompiler process. From the perspective of the compiler MSBuild task this manifests as a broken named pipe. Raising an error for a single named-pipe failure is a dicey proposition because that will happen in the wild (I/O can and will fail). It's hard to see us shipping that to customers given the noise false positives would cause.

Looking at the totality of the MSBuild log in this case, though, made it very clear a bug was occurring: at the solution build level I could see multiple compiler-server broken-pipe failures in the log (we log server events in the binlog). Once I saw that, it was a virtual lock that a crash was happening. But MSBuild tasks don't have solution-level views of events; they only see the subset of events that come their way, so there isn't a mechanism to make this determination today.

Essentially this is another case where MSBuild / Binary Log analyzers would be a benefit to the ecosystem. At the solution build level it's very easy to spot these failures, and several other types of issues that can come up.

@rainersigwald, @marcpopMSFT, @baronfel

@jaredpar
Member

@jaredpar do we either have a reproducible build or a dump?

"@hoyosjs has dumps of the crash available for investigation."

This does not reproduce reliably; when we were debugging last night, I'd say it reproduced something like 50% of the time. We could not get it to reproduce under a debugger (VS or WinDbg). We had to instead flip the HKLM registry key to grab dumps on crashes and run the build a few times until we got the crash.

@rainersigwald
Member

But MSBuild tasks don't have solution level views of events, they only see the subset of events that come their way so there isn't a mechanism to make this determination today.

It's a poor substitute for what you actually want, but one option for this specific thing would be to keep state in MSBuild's persistence mechanism: on pipe failure, increment a persisted count, and if the count exceeds a heuristic threshold, throw a warning or error. The state would be per-node.

@jaredpar
Member

jaredpar commented Feb 24, 2023

The state would be per-node.

That generally runs counter to the problem, though. In the wild it's usually one, maybe two crashes for a large build. The reason we end up with multiple broken-pipe entries in the log is that builds run in parallel on different nodes, so each node gets one broken pipe for a single crash and its individual heuristic counter stays at one.

I agree this can have a positive impact, though. We've been using this trick in dotnet/roslyn for years to catch our own mistakes. But we're the compiler team and are pretty keen on catching server crashes in our CI, so we fail when the heuristic on a single node is > 0. A false positive once or twice a year is a good trade-off for the number of real crashes this has stopped us from shipping.

.. to keep state in MSBuild's persistence mechanism ...

Is there a real persistence mechanism? We've been using mutable static data, which is not ideal.

@agocke
Member

agocke commented Feb 24, 2023

Also remember that killing the VBCSCompiler process is considered safe shutdown for csc. I don't think adding any extra layers for csc here would help much.

Instead, I think CI should be generally interested in crashing processes, especially if we're running things on preview runtimes. Having some sort of monitoring for all crashing processes during a run seems useful.

@jaredpar
Member

jaredpar commented Feb 24, 2023

Having some sort of monitoring for all crashing processes during a run seems useful.

Agree. It's also pretty easy to set up and could be done fairly generally at the arcade level. Effectively all that needs to be done is pick a directory to put dumps in, say artifacts\dumps, and then set the following registry entries:

Windows Registry Editor Version 5.00
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps]
"DumpFolder"="c:\\generated\path\artifacts\dumps"
"DumpCount"=dword:00000001
"DumpType"=dword:00000002

Have a step that publishes the content of that directory and another that fails the build if it's not empty.
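A rough sketch of the "fail the build if it's not empty" half, expressed as an MSBuild target; the target name, the $(ArtifactsDir)dumps location, and hooking it via AfterTargets="Build" are all illustrative assumptions, not an existing arcade mechanism:

<Project>
  <Target Name="FailOnCollectedDumps" AfterTargets="Build">
    <ItemGroup>
      <!-- Collect any crash dumps WER wrote to the configured folder during the build. -->
      <_CollectedDump Include="$(ArtifactsDir)dumps\**\*.dmp" />
    </ItemGroup>
    <Error Condition="'@(_CollectedDump)' != ''"
           Text="Crash dump(s) found under $(ArtifactsDir)dumps: @(_CollectedDump) - a process crashed during the build." />
  </Target>
</Project>

Publishing the directory's contents would be a separate pipeline step.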

@stephentoub
Member

Have a step that publishes the content of that directory and another that fails the build if it's not empty.

...while working around places where we purposefully launch and crash processes, e.g. when testing Environment.FailFast.

@mangod9
Member

mangod9 commented Feb 24, 2023

@hoyosjs could you please share a link to the dumps? I assume this is happening with the Preview 1 SDK?

@hoyosjs
Member

hoyosjs commented Feb 24, 2023

Sent directly.

@mangod9
Member

mangod9 commented Feb 24, 2023

I notice that @jkoritzinsky updated the thread statics allocation logic as part of #80189. That change is included in the SDK currently in main (8.0.0-alpha.1.23057.5).

@jkoritzinsky
Member

There was a bug in that change that messed up zeroing for the dynamic case. The fix didn't make that alpha release.
We should try updating the global.json to Preview 1, which has the fix.

@mangod9
Member

mangod9 commented Feb 24, 2023

ah yeah, upgrading to Preview 1 would be the right fix then.

@hoyosjs
Member

hoyosjs commented Feb 24, 2023

This is an issue when collecting thread statics for an unloadable assembly (Microsoft.CodeAnalysis.CodeStyle in this case). As we loop to free the TLM, we hit a case where a CollectibleDynamicEntry is not properly zeroed.

@hoyosjs
Member

hoyosjs commented Feb 24, 2023

yeah, that's the right fix.

@ghost ghost added and then removed the in-pr (There is an active PR which will close this issue when it is merged) label Feb 25, 2023
@hoyosjs hoyosjs linked a pull request Feb 25, 2023 that will close this issue
@dotnet dotnet locked as resolved and limited conversation to collaborators Mar 30, 2023