Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NativeAOT] Do not emit safe point for TLS_GET_ADDR calls into native runtime. #102237

Merged
merged 4 commits into from
May 16, 2024

Conversation

VSadov
Copy link
Member

@VSadov VSadov commented May 15, 2024

Fixes: #102140

The reason for the failure is GC-reporting of uninitialized temp.

We declare a temp to hold the address to the managed TLS blob. The temp is initialized by indirecting into native TLS. In some cases fetching the native TLS involves a method call.

Roughly it looks like:

managed_ptr tlsRoot;

// unmanaged TLS pattern starts
...
...
unmanaged_ptr nativeTLS = TLS_GET_ADDR();   
nativeTLS += someAdjustment;                           // if this is a safe point, tlsRoot must be zero-inited
// unmanaged TLS pattern ends

// only here tlsRoot gets a real value
tlsRoot = nativeTLS[some indirection];

The optimizer assumes that if we did not see a safe point between prolog and the first assignment, then zero-initing is unnecessary. The problem is that optimizer does not know that TLS access may emit a call, and that call may introduce a safe point (as calls do by default).

A simplest fix would be to not emit safepoints for calls into native TLS. They cannot participate in GC stackwalks anyways.

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 15, 2024
@VSadov
Copy link
Member Author

VSadov commented May 15, 2024

/azp run runtime-nativeaot-outerloop

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@VSadov
Copy link
Member Author

VSadov commented May 15, 2024

all linux-arm64 jobs with NativeAOT have passed. I will run tests one more time to be sure.

@VSadov
Copy link
Member Author

VSadov commented May 15, 2024

/azp run runtime-nativeaot-outerloop

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@VSadov VSadov marked this pull request as ready for review May 15, 2024 17:29
@VSadov VSadov requested a review from kunalspathak May 15, 2024 17:29
Copy link
Member

@kunalspathak kunalspathak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. The TLS inlining change on arm64 has been enabled for few months now. Wondering if we tracked what triggered this failure now?

@@ -8985,7 +8985,8 @@ void emitter::emitIns_Call(EmitCallType callType,
regNumber xreg /* = REG_NA */,
unsigned xmul /* = 0 */,
ssize_t disp /* = 0 */,
bool isJump /* = false */)
bool isJump /* = false */,
bool isNoGCframe /* = false */)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you include the new parameter in the method description? for this file and in other files?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. A good point.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am also not sure if isNoGCframe is the best name. I could not come up with anything better at the time, but now I think noSafePoint might be better.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've renamed the parameter noSafePoint. I think it is better and easier to explain. I've also added comments about the parameter where we had any comments.

@@ -9079,7 +9080,7 @@ void emitter::emitIns_Call(EmitCallType callType,
emitThisByrefRegs = byrefRegs;

// for the purpose of GC safepointing tail-calls are not real calls
id->idSetIsNoGC(isJump || emitNoGChelper(methHnd));
id->idSetIsNoGC(isJump || isNoGCframe || emitNoGChelper(methHnd));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of passing around isNoGCframe, we could have added a check inside emitNoGChelper() to ignore for TLS, but adding a check for TLS methHnd might not solve the purpose, because for xarch we pass a dummy methodHandle == 1 and for arm64, it is the actual address.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started with a solution where a special/fake JIT helper would be used to specify that we are dealing with a TLS call to the native runtime. In a way that is right - this is a JIT helper, just provided by the native runtime. That did not work though, since on arm64 the methodHandle is not just a marker.

I also realized that we may want to opt-out more calls from being safeponts in the future. Not for correctness, but for "we do not need them" reason. We make nearly every call a safe point, but some calls would never be stackwalked.
Like native calls (not talking about entire pinvoke here, but the actual calls to native methods - it is ok if they are safepoints, but not necessary - a thing to think about anyways,..)

So I switched to a parameter that allows to override the default. Currently it would only be used for TLS calls, but we could use it for other cases.

This is basically the reason why I am passing around a parameter.

@VSadov
Copy link
Member Author

VSadov commented May 16, 2024

LGTM. The TLS inlining change on arm64 has been enabled for few months now. Wondering if we tracked what triggered this failure now?

Good question. #95565 made safe points interruptible and that triggered the failure.
Prior to that, a safepoint could not be "activated" unless we do a stackwalk through it, which could not happen for this one since the calee is native. Now, if we catch a thread on a safepoint we can do a stackwalk, so the safepoint can be "activated".

Note that the presence of a safepoint by itself is not a problem here. It is a well-formed safepoint and has correct GC info, so we could do stack walks. The problem is that optimizer does not know that TLS access may introduce a safe point and leaves GC locals observably uninitialized.
So it was - either teach optimizer about TLS calls or just not produce a safe point, since a native call does not need it. Not emitting a safepoint is simpler.

Another question to ask - Why was this not a problem in Fully-Interruptible methods even before #95565, since we could start a stack walk at the same call site?
That is because optimizer does not optimize initialization of GC locals in Fully Interruptible code.

@VSadov
Copy link
Member Author

VSadov commented May 16, 2024

/azp run runtime-nativeaot-outerloop

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@VSadov
Copy link
Member Author

VSadov commented May 16, 2024

/azp run runtime-nativeaot-outerloop

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@VSadov
Copy link
Member Author

VSadov commented May 16, 2024

Thanks!

@VSadov VSadov merged commit 13d753c into dotnet:main May 16, 2024
127 of 135 checks passed
@VSadov VSadov deleted the fix102140 branch May 16, 2024 16:19
Ruihan-Yin pushed a commit to Ruihan-Yin/runtime that referenced this pull request May 30, 2024
… runtime. (dotnet#102237)

* Do not emit safe point for TLS_GET_ADDR calls into native runtime.

* formatting

* renamed parameter to `noSafePoint`, added comments.

* clang formatting, bane of my existence
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[NativeAOT] System.Collections.Concurrent.Tests crashing on linux-arm64
2 participants