Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delete and re-create v2 function on Cloud Run API quota exhaustion #5719

Merged
merged 8 commits into from
Jun 6, 2023

Conversation

blidd-google
Copy link
Contributor

@blidd-google blidd-google commented Apr 21, 2023

On large deployments of v2 functions (~80+), our users consistently run into Google Cloud Function's WRITE API quota. To deal with this issue, our CLI has backoff and retry logic baked into the function deploy process. However, the deploy logic is failing to detect cases in which the GCF API is running into the Cloud Run API quota limits when making Cloud Run requests on the user's behalf. To successfully retry a failed createFunction request, we need to first delete the broken GCF function resource before sending the GCF API another createFunction request.

This PR also includes minor refactors to the QueueExecutor logic to retry dynamically based on error codes specified by the request (e.g. deleteV2Function needs to retry on error code 8, while createV2Function does not want to retry on error code 8 until the broken function resource has been deleted).

@codecov-commenter
Copy link

codecov-commenter commented Apr 21, 2023

Codecov Report

Patch coverage: 90.00% and project coverage change: +0.07 🎉

Comparison is base (858695f) 54.98% compared to head (40d8eee) 55.05%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5719      +/-   ##
==========================================
+ Coverage   54.98%   55.05%   +0.07%     
==========================================
  Files         333      333              
  Lines       22990    23006      +16     
  Branches     4710     4713       +3     
==========================================
+ Hits        12640    12665      +25     
+ Misses       9220     9216       -4     
+ Partials     1130     1125       -5     
Impacted Files Coverage Δ
src/deploy/functions/release/executor.ts 88.88% <76.92%> (+3.88%) ⬆️
src/deploy/functions/release/fabricator.ts 79.40% <100.00%> (+2.71%) ⬆️

... and 1 file with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

result?: any;
error?: any;
}

const defaultRetryCodes = [429, 409, 503];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit* DEFAULT_RETRY_CODES?

pollerName: `create-${endpoint.codebase}-${endpoint.region}-${endpoint.id}`,
operationResourceName: op.name,
});
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit* shouldn't we be adding { retryCodes: [CLOUD_RUN_RESOURCE_EXHAUSTED_CODE] } here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I deliberately don't want the default fabricator retry logic here and want to surface the error to the .catch(...) in line 360 to add custom error handling logic for error code CLOUD_RUN_RESOURCE_EXHAUSTED_CODE. The reason is that on those errors, we need to delete the function before retrying.

const op: Operation = { func };
async run<T>(func: () => Promise<T>, opts?: RunOptions): Promise<T> {
// merge and de-duplicate default and provided retry codes
let retryCodes = [...defaultRetryCodes, ...(opts?.retryCodes || [])];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it's better to make retryCodes explicit and allow people to extend DEFAULT_RETRY_CODES manually

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by "manually"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OOof this has been waiting around much longer than expected. Feel free to ping me if you're ever waiting on me for more than ~1d.

I'm suggesting that it's probably a more typical design that opts.retryCodes is literally the only retry codes that will be retried. If I was a consumer of the API and said retryCodes: [420] but was also getting retries on 429 I'd be pretty surprised and frustrated. Rather, make DEFAULT_RETRY_CODES a public part of the API and make opts.retryCodes completely override DEFAULT_RETRY_CODES. Then, if I wanted to handle all defaults and 420, I'd use retryCodes: [...DEFAULT_RETRY_CODES, 420]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Today, the executor retries on 429, 409, and 503 error codes for every API call wrapped in run(...), and I would assume that we would like to avoid breaking existing behavior that may rely on retrying the default codes. So most executor runs would need to be edited to .run(async () => {...}, { retryCodes: DEFAULT_RETRY_CODES }) which may actually turn out to be a non-trivial re-write.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to say that if {retryCodes} isn't specified, it should default to DEFAULT_RETRY_CODES. If {retryCodes} is specified, we should only retry on the specified codes. It follows the principle of least surprises to only retry on the codes which we were told to retry or have sane defaults when unspecified.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see what you mean — definitely agree with this approach, next commit implements the behavior you're describing.

@blidd-google blidd-google requested a review from inlined June 5, 2023 04:20
Copy link
Member

@inlined inlined left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM pending feedback

@@ -5,15 +5,32 @@ import { ThrottlerOptions } from "../../../throttler/throttler";
* An Executor runs lambdas (which may be async).
*/
export interface Executor {
run<T>(func: () => Promise<T>): Promise<T>;
run<T>(func: () => Promise<T>, opts?: RunOptions): Promise<T>;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super nit (don't feel obligated to take it, and don't feel the need to go through re-review if you do). Typically callbacks are the last parameter. In langauges like Scala, Swift, and Ruby, this allows a special syntax where the callback is after the function call. We don't have that candy (yet) in JavaScript, but it's still pretty common.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah interesting, definitely something to keep in mind — unfortunately in this case, opts is an optional parameter and func is required so I can't switch the parameter order without making opts required as well, but I'm guessing that the idiomatic approach would be to make opts required?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way we did this in cf3v2 is using overloading:

run<T>(func: () => Promise<T>): Promise<T>;
run<T>(opts: RuntimeOptions, func: () => Promise<T>): Promise<T>;

// impl
run<T>(funcOrOpts: RuntimeOptions | () => Promise<T>, func?: () => Promise<T>) {
  let opts: RuntimeOptions = {};
  if (func) {
    opts = funcOrOpts;
   } else {
     func = funcOrOpts;
   }
   // ...
}

If this were a public tool, we might push for this. But it's an internal tool that's used only a few times in our codebase, so please don't go through the trouble.

const op: Operation = { func };
async run<T>(func: () => Promise<T>, opts?: RunOptions): Promise<T> {
// merge and de-duplicate default and provided retry codes
let retryCodes = opts?.retryCodes || [];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just make this let retryCodes = opts?.retryCodes || DEFAULT_RETRY_CODES

I think you're trying to make it so that someone who explicitly passes zero retry codes gets DEFAULT_RETRY_CODES, but I would presume such a user is using the queue for throttling, but passed retryCodes: [] because they explicitly don't want to retry.

@blidd-google blidd-google merged commit 6d9047f into master Jun 6, 2023
30 of 31 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants