High rate of "lost runner" errors for web-platform-tests on macOS 13 #7754
Comments
@jgraham, may I ask you to open an issue on http://support.github.com/? Unfortunately, this issue tracker is for images, not for other aspects of GitHub Actions. I can try to reproduce your issue since you've provided clear repro steps, but I cannot provide a fix. |
Sorry, "support.github.com" is the wrong link for ADO agents. I'll provide a link later. |
@ilia-shipitsin any update here? Is there a different repository where this issue should be filed? |
I've escalated the issue to the proper team; they are investigating. |
@jgraham, I see that the internal issue was marked as "resolved". Can you please try enabling macos-13 builds? |
Things seem much better over the last few days. Thanks! We're still seeing occasional failures, for example:
That said, I think we were always seeing some level of drops even on |
The issue was identified at the hosting level. A fix is to be delivered around mid-August (the reason for things being better right now is not very clear). Thanks for bringing the issue to our attention. |
Let's keep it open for possible duplicates. |
Hi @ilia-shipitsin, we're facing the same issue as described, details here. As per your comment, this issue should have been fixed by mid-August, but I checked today and I still get the timeout error when I switch my build to use macos13. I am looking forward to the fix; when do we expect it to be out? |
@antonioalwan, it is really impossible to tell whether your issue is the same or not. Please provide details, and I would suggest opening a separate issue just to keep things clean. |
Sorry, I see @mikhailkoliada has already closed the separate issue. Let him provide feedback. |
We're still seeing this, even on the See, e.g., https://dev.azure.com/web-platform-tests/wpt/_build/results?buildId=106811&view=results, https://dev.azure.com/web-platform-tests/wpt/_build/results?buildId=106790&view=results, and https://dev.azure.com/web-platform-tests/wpt/_build/results?buildId=106767&view=results |
👋 We anticipate finishing the work to resolve this issue by October 2023, and will comment on this thread once finished. |
Is there any update here? I believe we are seeing the same errors with
Here is an example: https://github.com/Expensify/App/actions/runs/7032789006 |
Hi @Steve-Glass, we are also running into #7754 (comment), which degrades our testing infrastructure. Would it be possible to: a) confirm that the errors we are running into are caused by the linked issue above? (logs: microsoft/playwright#28187) Thank you so much! |
We are getting this error now:
|
Any updates? |
Description
Since approximately May 16th, we've been experiencing a high failure rate for web-platform-tests jobs running on macOS 13. This appears to be an infrastructure issue, as we get a message indicating that the agent stopped responding. This affects some, but not all, jobs, and it appears to be random within the set of jobs running similar workloads (chunks of the testsuite) on macOS. It doesn't appear to be tied to a specific part of the workload (e.g. a specific testcase).
One of the first affected builds is: https://dev.azure.com/web-platform-tests/wpt/_build/results?buildId=100660. A recent one is https://dev.azure.com/web-platform-tests/wpt/_build/results?buildId=102828&view=logs&jobId=9e909769-fc48-58b9-7383-225ac465e77e
Manually rerunning the failed jobs does work (but some jobs require multiple reruns, since the problem can also happen during the rerun).
We've tried to resolve the problem in the following ways:
(cc @gsnedders who did most of the diagnosis work to date)
web-platform-tests/wpt#40085 is the corresponding wpt repository issue
Platforms affected
Runner images affected
Image version and build link
https://dev.azure.com/web-platform-tests/wpt/_build/results?buildId=102828&view=logs&jobId=9e909769-fc48-58b9-7383-225ac465e77e
Is it regression?
Intermittent, but seems to have regressed mid-May.
Expected behavior
Run completes successfully
Actual behavior
Intermittent failure of runs, and no retries. Historically the error was always "We stopped hearing from agent ". Now there seems to be a mixture of error messages, including that one and "The hosted runner encountered an error while running your job. (Error Type: Disconnect).".
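For anyone triaging their own logs against this issue, here is a minimal sketch of matching the two messages quoted above. It assumes the error text is available as plain strings. The names `LOST_RUNNER_PATTERNS` and `is_lost_runner_error` are hypothetical helpers for illustration, not part of any Azure Pipelines or GitHub Actions tooling, and the patterns cover only the two messages reported here.

```python
import re

# Patterns copied from the two error messages quoted in this issue.
# Hypothetical helper names; not an Azure Pipelines or GitHub Actions API.
LOST_RUNNER_PATTERNS = [
    re.compile(r"We stopped hearing from agent"),
    re.compile(
        r"The hosted runner encountered an error while running your job\. "
        r"\(Error Type: Disconnect\)"
    ),
]


def is_lost_runner_error(message: str) -> bool:
    """Return True if the message contains one of the known lost-runner errors."""
    return any(pattern.search(message) for pattern in LOST_RUNNER_PATTERNS)
```

A match only suggests that a failure looks like the infrastructure problem described here; it does not confirm the root cause.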
Repro steps
Failure is intermittent, but happens when we try to run a large number of tests in Safari on macOS using a WebDriver-backed testharness.