[mono] Propagate pthread_kill EAGAIN to thread suspend initiator (remove workaround) #34132

Closed
lambdageek opened this issue Mar 26, 2020 · 4 comments
Labels
area-VM-threading-mono runtime-mono specific to the Mono runtime tenet-reliability Reliability/stability related issue (stress, load problems, etc.)

Comments

@lambdageek
Member

lambdageek commented Mar 26, 2020

In #32377 (comment) we observed that in many test runs on Helix the GC STW sends out suspend signals with pthread_kill and gets back the error EAGAIN, which indicates that the realtime signal queue was full.

The temporary workaround #33966 that was put in place in the low-level code was to sleep for some number of milliseconds and retry some fixed number of times.
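For illustration, the sleep-and-retry idea can be sketched roughly as below. This is not the code from #33966: the function name, the retry count, and the sleep interval are all hypothetical, since the description above only says "some number of milliseconds" and "some fixed number of times".

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <time.h>

/* Hypothetical limits, chosen only for this sketch. */
#define SKETCH_MAX_RETRIES 5
#define SKETCH_SLEEP_MS 10

/*
 * Send signum to thread, retrying a bounded number of times while
 * pthread_kill reports EAGAIN. pthread_kill reports failure through its
 * return value, not through errno. Returns the last result and stores
 * the number of attempts made in *attempts_out.
 */
static int
sketch_send_suspend_signal_with_retry (pthread_t thread, int signum, int *attempts_out)
{
    assert (attempts_out != NULL);

    int attempts = 0;
    int result;

    for (;;) {
        result = pthread_kill (thread, signum);
        attempts++;

        /* Stop on success, on any other error, or when retries run out. */
        if (result != EAGAIN || attempts > SKETCH_MAX_RETRIES)
            break;

        /* Give the signal queue a moment to drain before retrying. */
        struct timespec pause = { 0, SKETCH_SLEEP_MS * 1000000L };
        nanosleep (&pause, NULL);
    }

    *attempts_out = attempts;
    return result;
}
```

If the result is still EAGAIN after the last attempt, the caller has to decide what to do next, which is the question the rest of this issue is about.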

We should instead investigate:

  • Why does the signal queue become full? Is this some pathological behavior of Mono's STW or an external stimulus?
  • Can we deal with suspend failures in a more general way by propagating an error code from mono_threads_suspend_begin_async_suspend to the suspend initiator, so that the suspend initiator can take some action?
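A minimal sketch of what the second bullet could look like, assuming the low-level step returns a status instead of handling the failure itself. Every name here (SketchSuspendResult, sketch_begin_async_suspend, sketch_suspend_all) is hypothetical and does not reflect Mono's actual signatures; the "action" taken by the initiator is simply to collect the threads that need another attempt.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stddef.h>

/* Hypothetical status codes reported back to the suspend initiator. */
typedef enum {
    SKETCH_SUSPEND_OK,
    SKETCH_SUSPEND_QUEUE_FULL,  /* pthread_kill returned EAGAIN */
    SKETCH_SUSPEND_THREAD_GONE, /* pthread_kill returned ESRCH */
    SKETCH_SUSPEND_FAILED       /* any other error */
} SketchSuspendResult;

/* Map a pthread_kill return value to a status the initiator can act on. */
static SketchSuspendResult
sketch_result_from_error (int error)
{
    switch (error) {
    case 0:      return SKETCH_SUSPEND_OK;
    case EAGAIN: return SKETCH_SUSPEND_QUEUE_FULL;
    case ESRCH:  return SKETCH_SUSPEND_THREAD_GONE;
    default:     return SKETCH_SUSPEND_FAILED;
    }
}

/* Low-level step: send the signal and report the outcome, no retry here. */
static SketchSuspendResult
sketch_begin_async_suspend (pthread_t thread, int signum)
{
    return sketch_result_from_error (pthread_kill (thread, signum));
}

/*
 * Initiator side: request suspension of every thread and copy the ones
 * that hit a full signal queue into retry_out (which must have room for
 * count entries). Returns how many threads need another attempt.
 */
static size_t
sketch_suspend_all (const pthread_t *threads, size_t count, int signum,
                    pthread_t *retry_out)
{
    assert (threads != NULL && retry_out != NULL);

    size_t pending = 0;
    for (size_t i = 0; i < count; i++) {
        if (sketch_begin_async_suspend (threads[i], signum) == SKETCH_SUSPEND_QUEUE_FULL)
            retry_out[pending++] = threads[i];
    }
    return pending;
}
```

With this shape the policy (back off, retry later, or give up) lives in the initiator rather than in the low-level signalling code.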
@lambdageek lambdageek added tenet-reliability Reliability/stability related issue (stress, load problems, etc.) FollowingUp area-VM-threading-mono runtime-mono specific to the Mono runtime labels Mar 26, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Mar 26, 2020
@lambdageek lambdageek removed the untriaged New issue has not been triaged by the area owner label Mar 26, 2020
@CoffeeFlux CoffeeFlux added this to the 6.0.0 milestone Jul 6, 2020
@SamMonoRT SamMonoRT modified the milestones: 6.0.0, 7.0.0 Jun 16, 2021
@SamMonoRT SamMonoRT modified the milestones: 7.0.0, 8.0.0 Aug 4, 2022
@SamMonoRT
Member

@lambdageek - is the temporary workaround you merged good enough, or should we consider this for 8.0 and/or close this issue? I'm not seeing any more reports of the CI failures that were initially linked to this issue.

@lambdageek
Member Author

I think we can close the issue. I think this issue is quite artificial in practice, and the retry workaround is actually pretty reasonable.

@tmds
Member

tmds commented Apr 11, 2023

@lambdageek the issue title makes me think of #77364. Could they be related?

@lambdageek
Member Author

> @lambdageek the issue title makes me think of #77364. Could they be related?

@tmds I don't think they're related directly. In #33966 we are working around the GC suspend initiator getting an unusual errno value. In #77364, it seems like it is other threads that leak an errno from some of the runtime machinery into the interop layer and then into managed code.

I suppose it is possible that a thread in #77364 is the suspend initiator. However, the hack in #33966 added a warning that should be visible on stderr if this piece of the suspend code is getting EAGAIN.

It's possible that more libc functions or syscalls on Linux set errno to EAGAIN when it's not documented, so the issues could be related in a general sense.
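To make the kind of leak described above concrete, here is a small sketch, assuming a generic save-and-restore guard around internal work. It is not how either issue was actually addressed; sketch_call_preserving_errno and sketch_noisy_internal_work are hypothetical names, and the second one is only a stand-in for runtime machinery that happens to leave EAGAIN behind.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef void (*sketch_internal_fn) (void *arg);

/* Stand-in for internal work that clobbers errno as a side effect. */
static void
sketch_noisy_internal_work (void *arg)
{
    (void) arg;
    errno = EAGAIN;
}

/*
 * Run fn while keeping the caller's errno intact, so that a value set by
 * internal machinery does not leak out to whoever reads errno next.
 */
static void
sketch_call_preserving_errno (sketch_internal_fn fn, void *arg)
{
    assert (fn != NULL);

    int saved_errno = errno;
    fn (arg);
    errno = saved_errno;
}
```

Without the guard, code that inspects errno after the internal call would see EAGAIN even though its own last call did not fail that way.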

@ghost ghost locked as resolved and limited conversation to collaborators May 11, 2023
