[HOLD for payment 2024-04-25] macos-12-xlarge runners are being canceled at a high rate #32212

Closed
AndrewGable opened this issue Nov 29, 2023 · 28 comments
Assignees
Labels
Awaiting Payment Auto-added when associated PR is deployed to production Daily KSv2

Comments

@AndrewGable
Contributor

Problem

About 50% of our iOS builds that use the macos-13-xlarge runner are being canceled with the error:

[Build and deploy iOS](https://github.com/Expensify/App/actions/runs/7032789006/job/19137288520)
The hosted runner encountered an error while running your job. (Error Type: Disconnect).

Screenshot 2023-11-29 at 11 37 14 AM

Solution

Fix it

@AndrewGable AndrewGable self-assigned this Nov 29, 2023
@AndrewGable AndrewGable added the Daily KSv2 label Nov 29, 2023
@AndrewGable
Contributor Author

Looks related: actions/runner-images#7754

@melvin-bot melvin-bot bot added the Overdue label Dec 2, 2023
@AndrewGable
Contributor Author

GitHub confirmed this was something on their side and they are looking into it

@melvin-bot melvin-bot bot added Overdue and removed Overdue labels Dec 4, 2023
@AndrewGable
Contributor Author

Been pretty quiet from GitHub support, I will bump them

@melvin-bot melvin-bot bot added Overdue and removed Overdue labels Dec 7, 2023

melvin-bot bot commented Dec 11, 2023

@AndrewGable Whoops! This issue is 2 days overdue. Let's get this updated quick!

@AndrewGable
Contributor Author

No update from GitHub support

@melvin-bot melvin-bot bot added Overdue and removed Overdue labels Dec 12, 2023

melvin-bot bot commented Dec 15, 2023

@AndrewGable Whoops! This issue is 2 days overdue. Let's get this updated quick!


melvin-bot bot commented Dec 19, 2023

@AndrewGable 6 days overdue. This is scarier than being forced to listen to Vogon poetry!

@AndrewGable
Contributor Author

We got a workaround from GitHub support, but I am not sure we are seeing the error anymore. Looking into it today.

@melvin-bot melvin-bot bot removed the Overdue label Dec 20, 2023
@AndrewGable
Contributor Author

AndrewGable commented Dec 22, 2023

GitHub says that setting --maxsockets=1 on npm install should help, but I am not sure we want to do that while the runners are no longer failing.
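
For reference, a minimal sketch of how the suggested workaround could be scoped to macOS jobs only, so that other runners keep npm's default socket limit. The step names and the split into two steps are assumptions for illustration and are not taken from the Expensify workflows; `maxsockets` is a standard npm config option and `runner.os` is a standard GitHub Actions context value.

```yaml
# Sketch only: step names and layout are hypothetical.
steps:
  - name: Install node modules (macOS, workaround applied)
    if: runner.os == 'macOS'
    # Limit npm to one concurrent connection, as suggested by GitHub support.
    run: npm install --maxsockets=1

  - name: Install node modules (all other runners)
    if: runner.os != 'macOS'
    run: npm install
```

Limiting sockets will likely slow down installs, which is one reason to apply it only where the disconnects were observed.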

@melvin-bot melvin-bot bot added the Overdue label Dec 25, 2023

melvin-bot bot commented Dec 26, 2023

@AndrewGable Whoops! This issue is 2 days overdue. Let's get this updated quick!


melvin-bot bot commented Dec 28, 2023

@AndrewGable Huh... This is 4 days overdue. Who can take care of this?


melvin-bot bot commented Jan 1, 2024

@AndrewGable Now this issue is 8 days overdue. Are you sure this should be a Daily? Feel free to change it!


melvin-bot bot commented Jan 3, 2024

@AndrewGable 10 days overdue. I'm getting more depressed than Marvin.

@AndrewGable
Contributor Author

I'll look back into this

@melvin-bot melvin-bot bot added Overdue and removed Overdue labels Jan 3, 2024

melvin-bot bot commented Jan 9, 2024

@AndrewGable Huh... This is 4 days overdue. Who can take care of this?


melvin-bot bot commented Jan 11, 2024

@AndrewGable 6 days overdue. This is scarier than being forced to listen to Vogon poetry!

@AndrewGable
Contributor Author

I think GitHub must have fixed it on their side; I haven't seen this happen in 2+ weeks.

@melvin-bot melvin-bot bot removed the Overdue label Jan 15, 2024
@AndrewGable AndrewGable reopened this Apr 1, 2024
@kgantchev

kgantchev commented Apr 2, 2024

Hi, not to be intrusive here, but it seems that this is a recurring issue here... have you considered giving FlyCI a try?


melvin-bot bot commented Apr 2, 2024

📣 @kgantchev! 📣
Hey, it seems we don’t have your contributor details yet! You'll only have to do this once, and this is how we'll hire you on Upwork.
Please follow these steps:

  1. Make sure you've read and understood the contributing guidelines.
  2. Get the email address used to login to your Expensify account. If you don't already have an Expensify account, create one here. If you have multiple accounts (e.g. one for testing), please use your main account email.
  3. Get the link to your Upwork profile. It's necessary because we only pay via Upwork. You can access it by logging in, and then clicking on your name. It'll look like this. If you don't already have an account, sign up for one here.
  4. Copy the format below and paste it in a comment on this issue. Replace the placeholder text with your actual details.
    Screen Shot 2022-11-16 at 4 42 54 PM
    Format:
Contributor details
Your Expensify account email: <REPLACE EMAIL HERE>
Upwork Profile Link: <REPLACE LINK HERE>

@AndrewGable
Contributor Author

@kgantchev - Feel free to follow the proposal process, but no, we haven't.

@kgantchev

@AndrewGable thanks for sharing the proposal guide. I've created a proposal based on that guide.

The problem

GitHub-hosted runners are failing at an unsustainably high rate (close to 50%). This appears to be an infrastructure issue, with a message indicating that the agent stopped responding:

The hosted runner encountered an error while running your job. (Error Type: Disconnect).

In addition to the runner failure ("disconnect"), the response time from GitHub support is too slow (up to 8 days to resolve the issue).

What is the root cause of that problem?

The root cause is an infrastructure issue on GitHub's side. A complicating factor is GitHub's support, which is exceedingly slow, with response times of up to 8 days.

What changes do you think we should make in order to solve the problem?

A possible solution is to use FlyCI's macOS runners. FlyCI offers M2 runners ranging from 4 vCPUs to 8 vCPUs (macOS 13 and 14), with the largest being the flyci-macos-14-xlarge-m2 runner with 8 vCPUs and 14 GB RAM.

The FlyCI runners are highly reliable and are supported by a very responsive dev team. Support is available by e-mail and in FlyCI's Discord server, with response times that aim to always be below 24 hours.

The switch is simple:

Step 1: Install the FlyCI GitHub app and grant it permissions for this repo.
Step 2: Switch the relevant runner label to point to FlyCI's labels.

In this case, there are 3 workflow files that have the offending runner label:

  • actionlint
  • testBuild
  • platformDeploy

An example of the change looks like this for testBuild:

  iOS:
    name: Build and deploy iOS for testing
    needs: [validateActor, getBranchRef]
    if: ${{ fromJSON(needs.validateActor.outputs.READY_TO_BUILD) }}
    env:
      PULL_REQUEST_NUMBER: ${{ github.event.number || github.event.inputs.PULL_REQUEST_NUMBER }}
      DEVELOPER_DIR: /Applications/Xcode_15.0.1.app/Contents/Developer
-   runs-on: macos-13-xlarge
+   runs-on: flyci-macos-14-xlarge-m2

Note: the solution uses an M2 runner/macOS 14 (8 vCPU and 14 GB RAM), which should also provide a performance boost of about 20% compared to the M1 runners.

@AndrewGable
Contributor Author

Thanks for the proposal @kgantchev - I will consider it, but I will probably look at smaller solutions that stay on GitHub Actions first, as we've standardized on GitHub runners and don't really want to splinter them across providers.


melvin-bot bot commented Apr 5, 2024

@AndrewGable Uh oh! This issue is overdue by 2 days. Don't forget to update your issues!

@melvin-bot melvin-bot bot added the Overdue label Apr 5, 2024
@AndrewGable AndrewGable changed the title macos-13-xlarge runners are being canceled at a high rate macos-12-xlarge runners are being canceled at a high rate Apr 8, 2024
@AndrewGable
Contributor Author

Going to see if macos-13-large helps; I believe xl might have been deprecated. This will still use Intel CPUs, as we don't want to use arm64.
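
A minimal sketch of what such a runner-label change could look like, reusing the iOS job from the testBuild excerpt quoted earlier in this thread. The actual pull request is not shown here, so which workflow files and jobs were changed is an assumption, as is the previous label shown in the comment.

```yaml
# Sketch only: the job body is trimmed and the affected workflow file is assumed.
jobs:
  iOS:
    name: Build and deploy iOS for testing
    # Assumed previous value: macos-13-xlarge
    runs-on: macos-13-large
```

Only the `runs-on` label changes; the rest of the job stays as it was, which keeps the build on GitHub-hosted runners.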

@melvin-bot melvin-bot bot removed the Overdue label Apr 8, 2024
@melvin-bot melvin-bot bot added Reviewing Has a PR in review Weekly KSv2 and removed Daily KSv2 labels Apr 8, 2024
@melvin-bot melvin-bot bot added Weekly KSv2 Awaiting Payment Auto-added when associated PR is deployed to production and removed Weekly KSv2 labels Apr 18, 2024
@melvin-bot melvin-bot bot changed the title macos-12-xlarge runners are being canceled at a high rate [HOLD for payment 2024-04-25] macos-12-xlarge runners are being canceled at a high rate Apr 18, 2024

melvin-bot bot commented Apr 18, 2024

Reviewing label has been removed, please complete the "BugZero Checklist".

@melvin-bot melvin-bot bot removed the Reviewing Has a PR in review label Apr 18, 2024

melvin-bot bot commented Apr 18, 2024

The solution for this issue has been 🚀 deployed to production 🚀 in version 1.4.62-17 and is now subject to a 7-day regression period 📆. Here is the list of pull requests that resolve this issue:

If no regressions arise, payment will be issued on 2024-04-25. 🎊

@melvin-bot melvin-bot bot added Daily KSv2 Overdue and removed Weekly KSv2 labels Apr 24, 2024
@melvin-bot melvin-bot bot removed the Overdue label Apr 24, 2024
2 participants