High rate of "lost runner" errors for web-platform-tests on macOS 13 #7754

Open
2 of 10 tasks
jgraham opened this issue Jun 20, 2023 · 17 comments
Comments

@jgraham

jgraham commented Jun 20, 2023

Description

Since approximately May 16th, we've been experiencing a high failure rate for web-platform-tests jobs running on macOS 13. This appears to be an infrastructure issue, as we get a message indicating that the agent stopped responding. It affects some, but not all, jobs, and it appears to be random within the set of jobs running similar workloads (chunks of the test suite) on macOS. It doesn't appear to be tied to a specific part of the workload (e.g. a specific test case).

One of the first affected builds is: https://dev.azure.com/web-platform-tests/wpt/_build/results?buildId=100660. A recent one is https://dev.azure.com/web-platform-tests/wpt/_build/results?buildId=102828&view=logs&jobId=9e909769-fc48-58b9-7383-225ac465e77e

Manually rerunning the failed jobs does work, but some jobs require multiple reruns, since the problem can also happen during the rerun.

We've tried to resolve the problem in the following ways:

  • Enabling automatic retries in the pipeline configuration. Either we got the configuration wrong, or these jobs are not retried.
  • Making each job smaller (i.e. running fewer tests per job). This didn't have any impact.
  • Testing on macOS-12 rather than 13. The problems started shortly after an update, but are apparently still reproducible on the older OS release (and using the latest version is important for our use case).
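
The mitigations listed here amount to splitting the suite into smaller chunks and rerunning any chunk whose agent is lost. A minimal Python sketch of that logic, for illustration only: `split_into_chunks`, `run_with_reruns`, and the `LostRunner` exception are hypothetical names, not part of wpt or Azure Pipelines, and the job is assumed to be any callable that raises `LostRunner` when the agent disappears.

```python
from typing import Callable, List, Sequence


class LostRunner(Exception):
    """Hypothetical marker for an infrastructure failure (agent stopped responding)."""


def split_into_chunks(tests: Sequence[str], total_chunks: int) -> List[List[str]]:
    """Distribute tests across total_chunks jobs in round-robin order."""
    if total_chunks < 1:
        raise ValueError("total_chunks must be at least 1")
    chunks: List[List[str]] = [[] for _ in range(total_chunks)]
    for index, test in enumerate(tests):
        chunks[index % total_chunks].append(test)
    return chunks


def run_with_reruns(job: Callable[[], None], max_attempts: int = 3) -> int:
    """Run job, rerunning only when the runner is lost.

    Returns the number of attempts used. If every attempt loses the
    runner, the last LostRunner is re-raised. Any other exception
    (for example a genuine test failure) propagates without a rerun.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            job()
            return attempt
        except LostRunner:
            if attempt == max_attempts:
                raise
    raise AssertionError("unreachable")
```

Because a rerun can itself lose the runner, the attempt limit matters: without it, a persistently failing chunk would be retried indefinitely.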

(cc @gsnedders who did most of the diagnosis work to date)

web-platform-tests/wpt#40085 is the corresponding wpt repository issue

Platforms affected

  • Azure DevOps
  • GitHub Actions - Standard Runners
  • GitHub Actions - Larger Runners

Runner images affected

  • Ubuntu 20.04
  • Ubuntu 22.04
  • macOS 11
  • macOS 12
  • macOS 13
  • Windows Server 2019
  • Windows Server 2022

Image version and build link

https://dev.azure.com/web-platform-tests/wpt/_build/results?buildId=102828&view=logs&jobId=9e909769-fc48-58b9-7383-225ac465e77e

Is it regression?

Intermittent, but seems to have regressed mid-May.

Expected behavior

Run completes successfully

Actual behavior

Intermittent failure of runs, and no retries. Historically the error was always "We stopped hearing from agent ". Now there seems to be a mixture of error messages, including that one and "The hosted runner encountered an error while running your job. (Error Type: Disconnect).".
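
For triage, the two messages quoted above can be matched mechanically to separate infrastructure failures from genuine test failures. A minimal Python sketch, assuming the job's error output is available as a string; the function name and marker tuple are illustrative, and the markers are only the messages quoted in this report, so other wordings may exist.

```python
# Error messages quoted in this report; assumed to appear verbatim in the job output.
INFRA_ERROR_MARKERS = (
    "We stopped hearing from agent",
    "The hosted runner encountered an error while running your job. "
    "(Error Type: Disconnect).",
)


def is_lost_runner_error(log_text: str) -> bool:
    """Return True if log_text contains one of the known lost-runner messages."""
    return any(marker in log_text for marker in INFRA_ERROR_MARKERS)
```

A result of False does not prove the failure was a real test failure; it only means none of the known messages matched.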

Repro steps

Failure is intermittent, but happens when we try to run a large number of tests in Safari on macOS using a WebDriver-backed testharness.

@ilia-shipitsin
Contributor

@jgraham, may I ask you to open an issue on http://support.github.com/ ?

Unfortunately, this issue tracker is for images, not for other aspects of GitHub Actions.

I can try to reproduce your issue since you've provided clear repro steps, but I cannot provide a fix.

@ilia-shipitsin
Contributor

Sorry, "support.github.com" is the wrong link for ADO agents; I'll provide a link later.

@jgraham
Author

jgraham commented Jun 29, 2023

@ilia-shipitsin any update here? Is there a different repository where this issue should be filed?

@ilia-shipitsin
Contributor

ilia-shipitsin commented Jun 29, 2023

I've escalated the issue to the proper team; they are investigating.

ilia-shipitsin self-assigned this Jul 14, 2023

@ilia-shipitsin
Contributor

@jgraham , I see that internal issue was marked as "resolved". Can you please try to enable macos-13 builds ?

@gsnedders

@jgraham , I see that internal issue was marked as "resolved". Can you please try to enable macos-13 builds ?

Things seem much better over the last few days. Thanks!

We're still seeing occasional failures, for example:

That said, I think we were always seeing some level of drops even on macos-12 images, so it no longer appears to have significantly regressed.

@ilia-shipitsin
Contributor

issue was identified on hosting level. fix is to be delivered around mid-august (reason for being better right now is not very clear).
I'm closing the issue for now. If the number of failures is still high around mid-August, feel free to reopen.

Thanks for bringing the issue to our attention.

@mikhailkoliada
Member

Let's keep it open for possible duplicates.

@antonioalwan

antonioalwan commented Aug 28, 2023

Hi @ilia-shipitsin, we're facing the same issue as described, details here. As per your comment, this issue should have been fixed by mid-August, but I checked today and I still get the timeout error when I switch my build to macos13. I am looking forward to the fix; when do we expect it to be out?

@ilia-shipitsin
Contributor

@antonioalwan, it is really impossible to tell whether your issue is the same or not. Please provide details; I would suggest opening a separate issue just to keep it clean.

@ilia-shipitsin
Contributor

Sorry, I see @mikhailkoliada has already closed the separate issue. Let him provide feedback.

@gsnedders

issue was identified on hosting level. fix is to be delivered around mid-august (reason for being better right now is not very clear).

We're still seeing this, even on the 20230821.3 agent image.

See, e.g., https://dev.azure.com/web-platform-tests/wpt/_build/results?buildId=106811&view=results, https://dev.azure.com/web-platform-tests/wpt/_build/results?buildId=106790&view=results, and https://dev.azure.com/web-platform-tests/wpt/_build/results?buildId=106767&view=results

@Steve-Glass
Contributor

👋 We anticipate finishing the work to resolve this issue by October 2023, and will comment on this thread once finished.

@AndrewGable

Is there any update here? I believe we are seeing the same errors with macos-13-xlarge runners.

The hosted runner encountered an error while running your job. (Error Type: Disconnect).

Here is an example: https://github.com/Expensify/App/actions/runs/7032789006

@mxschmitt

mxschmitt commented Dec 6, 2023

Hi @Steve-Glass,

We are also running into #7754 (comment), which degrades our testing infrastructure. Would it be possible to:

a) confirm that the errors we are running into are caused by the linked issue above (logs: microsoft/playwright#28187)?
b) let us know whether any workaround is available other than self-hosted macOS runners?
c) rule out that it's caused by a memory leak on our side?

Thank you so much!

@pafdad

pafdad commented Dec 21, 2023

We are getting this error now:

Received request to deprovision: The request was cancelled by the remote provider.

@linliu-code

Any updates?
