Delete and re-create v2 function on Cloud Run API quota exhaustion #5719

blidd-google · 2023-04-21T16:55:59Z

On large deployments of v2 functions (~80+), our users consistently run into Google Cloud Function's WRITE API quota. To deal with this issue, our CLI has backoff and retry logic baked into the function deploy process. However, the deploy logic is failing to detect cases in which the GCF API is running into the Cloud Run API quota limits when making Cloud Run requests on the user's behalf. To successfully retry a failed createFunction request, we need to first delete the broken GCF function resource before sending the GCF API another createFunction request.

This PR also includes minor refactors to the QueueExecutor logic to retry dynamically based on error codes specified by the request (e.g. deleteV2Function needs to retry on error code 8, while createV2Function does not want to retry on error code 8 until the broken function resource has been deleted).
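A minimal sketch of what per-request retry codes could look like, assuming the override semantics the review thread later settles on. The `run` function here is a simplified stand-in for the QueueExecutor, not its actual implementation; `errorCode` and `maxAttempts` are hypothetical, and backoff delays are left out.

```typescript
// Default codes are taken from the diff under review; everything else is illustrative.
const DEFAULT_RETRY_CODES: number[] = [429, 409, 503];

interface RunOptions {
  retryCodes?: number[];
}

// Hypothetical helper: extracts a numeric `code` from an unknown thrown value.
function errorCode(err: unknown): number | undefined {
  if (typeof err === "object" && err !== null && "code" in err) {
    const code = (err as { code: unknown }).code;
    return typeof code === "number" ? code : undefined;
  }
  return undefined;
}

// Simplified stand-in for the executor's run(): retries `func` only when the
// error code is in this call's retry list. Backoff delays are omitted.
async function run<T>(
  func: () => Promise<T>,
  opts: RunOptions = {},
  maxAttempts = 3
): Promise<T> {
  const retryCodes = opts.retryCodes ?? DEFAULT_RETRY_CODES;
  for (let attempt = 1; ; attempt++) {
    try {
      return await func();
    } catch (err) {
      const code = errorCode(err);
      const retryable = code !== undefined && retryCodes.includes(code);
      if (!retryable || attempt >= maxAttempts) {
        throw err; // surface the error to the caller's own handling
      }
    }
  }
}
```

Under these assumptions, a delete call could opt in to retrying on code 8 with `run(del, { retryCodes: [...DEFAULT_RETRY_CODES, 8] })`, while a create call that omits code 8 sees the error immediately.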

codecov-commenter · 2023-04-21T17:05:47Z

Codecov Report

Patch coverage: 90.00% and project coverage change: +0.07 🎉

Comparison is base (858695f) 54.98% compared to head (40d8eee) 55.05%.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #5719      +/-   ##
==========================================
+ Coverage   54.98%   55.05%   +0.07%     
==========================================
  Files         333      333              
  Lines       22990    23006      +16     
  Branches     4710     4713       +3     
==========================================
+ Hits        12640    12665      +25     
+ Misses       9220     9216       -4     
+ Partials     1130     1125       -5

Impacted Files	Coverage Δ
src/deploy/functions/release/executor.ts	`88.88% <76.92%> (+3.88%)`	⬆️
src/deploy/functions/release/fabricator.ts	`79.40% <100.00%> (+2.71%)`	⬆️

... and 1 file with indirect coverage changes


taeold · 2023-04-27T20:48:27Z

src/deploy/functions/release/executor.ts

  result?: any;
  error?: any;
 }

+const defaultRetryCodes = [429, 409, 503];


nit* DEFAULT_RETRY_CODES?

taeold · 2023-04-27T20:52:31Z

src/deploy/functions/release/fabricator.ts

+            pollerName: `create-${endpoint.codebase}-${endpoint.region}-${endpoint.id}`,
+            operationResourceName: op.name,
+          });
+        })


nit* shouldn't we be adding { retryCodes: [CLOUD_RUN_RESOURCE_EXHAUSTED_CODE] } here?

blidd-google · 2023-05-02

I deliberately don't want the default fabricator retry logic here. I want the error to surface to the .catch(...) on line 360 so that I can add custom error handling for CLOUD_RUN_RESOURCE_EXHAUSTED_CODE. The reason is that on those errors, we need to delete the function before retrying.
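The delete-before-retry flow described in this reply can be sketched as follows. This is an illustration under stated assumptions, not the fabricator's code: `FunctionsClient`, `createAfterCleanup`, `errorCode`, and `maxAttempts` are hypothetical names, and the value 8 comes from the PR description (it is also the gRPC status code for RESOURCE_EXHAUSTED).

```typescript
// Constant name follows the review thread; value 8 is from the PR description.
const CLOUD_RUN_RESOURCE_EXHAUSTED_CODE = 8;

// Hypothetical client interface standing in for the GCF v2 API wrapper.
interface FunctionsClient {
  createFunction(name: string): Promise<string>;
  deleteFunction(name: string): Promise<void>;
}

// Hypothetical helper: extracts a numeric `code` from an unknown thrown value.
function errorCode(err: unknown): number | undefined {
  if (typeof err === "object" && err !== null && "code" in err) {
    const code = (err as { code: unknown }).code;
    return typeof code === "number" ? code : undefined;
  }
  return undefined;
}

// On quota exhaustion, delete the broken function resource and only then
// issue another create request. Any other error propagates unchanged.
async function createAfterCleanup(
  client: FunctionsClient,
  name: string,
  maxAttempts = 3
): Promise<string> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await client.createFunction(name);
    } catch (err) {
      const quotaHit = errorCode(err) === CLOUD_RUN_RESOURCE_EXHAUSTED_CODE;
      if (!quotaHit || attempt >= maxAttempts) {
        throw err;
      }
      // The failed create left a broken resource behind; remove it first.
      await client.deleteFunction(name);
    }
  }
}
```

Keeping code 8 out of the create call's retry list is what lets the error reach this handler instead of being retried blindly.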

inlined · 2023-04-28T18:33:19Z

src/deploy/functions/release/executor.ts

-    const op: Operation = { func };
+  async run<T>(func: () => Promise<T>, opts?: RunOptions): Promise<T> {
+    // merge and de-duplicate default and provided retry codes
+    let retryCodes = [...defaultRetryCodes, ...(opts?.retryCodes || [])];


I wonder if it's better to make retryCodes explicit and allow people to extend DEFAULT_RETRY_CODES manually

blidd-google · 2023-05-02

What do you mean by "manually"?

inlined · 2023-05-31

Oof, this has been waiting around much longer than expected. Feel free to ping me if you're ever waiting on me for more than ~1d.

I'm suggesting that the more typical design is for opts.retryCodes to be literally the only codes that get retried. If I were a consumer of the API and said retryCodes: [420] but was also getting retries on 429, I'd be pretty surprised and frustrated. Rather, make DEFAULT_RETRY_CODES a public part of the API and make opts.retryCodes completely override DEFAULT_RETRY_CODES. Then, if I wanted to handle all defaults and 420, I'd use retryCodes: [...DEFAULT_RETRY_CODES, 420].

blidd-google · 2023-06-03

Got it. Today, the executor retries on 429, 409, and 503 error codes for every API call wrapped in run(...), and I assume we want to avoid breaking existing behavior that may rely on retrying the default codes. So most executor runs would need to be edited to .run(async () => {...}, { retryCodes: DEFAULT_RETRY_CODES }), which may turn out to be a non-trivial rewrite.

inlined · 2023-06-05

I'm trying to say that if {retryCodes} isn't specified, it should default to DEFAULT_RETRY_CODES. If {retryCodes} is specified, we should only retry on the specified codes. It follows the principle of least surprise to retry only on the codes we were told to retry, or to fall back to sane defaults when none are specified.

blidd-google · 2023-06-05

Ah, I see what you mean. I definitely agree with this approach; the next commit implements the behavior you're describing.
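The agreed contract can be illustrated with a small sketch. The `shouldRetry` helper is hypothetical and exists only to show the rule: an unspecified list falls back to the defaults, and a specified list replaces them entirely.

```typescript
const DEFAULT_RETRY_CODES: readonly number[] = [429, 409, 503];

interface RunOptions {
  retryCodes?: readonly number[];
}

// Hypothetical helper: decides whether an error code is retried for a given call.
function shouldRetry(code: number, opts?: RunOptions): boolean {
  // A provided list overrides the defaults; only a missing list uses them.
  const retryCodes = opts?.retryCodes ?? DEFAULT_RETRY_CODES;
  return retryCodes.includes(code);
}

// No options: the defaults apply.
const retriesDefault = shouldRetry(429);
// An explicit list overrides the defaults, so 429 is no longer retried.
const retriesOverridden = shouldRetry(429, { retryCodes: [420] });
// Callers who want both extend the defaults themselves.
const retriesExtended = shouldRetry(429, { retryCodes: [...DEFAULT_RETRY_CODES, 420] });
```

With this rule, existing call sites that pass no options keep their current behavior, which addresses the rewrite concern raised earlier in the thread.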

inlined

LGTM pending feedback

inlined · 2023-06-06T16:40:23Z

src/deploy/functions/release/executor.ts

@@ -5,15 +5,32 @@ import { ThrottlerOptions } from "../../../throttler/throttler";
 * An Executor runs lambdas (which may be async).
 */
 export interface Executor {
-  run<T>(func: () => Promise<T>): Promise<T>;
+  run<T>(func: () => Promise<T>, opts?: RunOptions): Promise<T>;


super nit (don't feel obligated to take it, and don't feel the need to go through re-review if you do). Typically callbacks are the last parameter. In languages like Scala, Swift, and Ruby, this allows a special syntax where the callback is written after the function call. We don't have that candy (yet) in JavaScript, but it's still pretty common.

blidd-google · 2023-06-06

Ah interesting, definitely something to keep in mind. Unfortunately, in this case opts is an optional parameter and func is required, so I can't switch the parameter order without making opts required as well. I'm guessing the idiomatic approach would be to make opts required?

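inlined · 2023-06-12

The way we did this in cf3v2 is using overloading:

run<T>(func: () => Promise<T>): Promise<T>;
run<T>(opts: RuntimeOptions, func: () => Promise<T>): Promise<T>;
// impl
run<T>(funcOrOpts: RuntimeOptions | (() => Promise<T>), func?: () => Promise<T>) {
  let opts: RuntimeOptions = {};
  if (func) {
    // Two-argument form: the first argument is the options object.
    opts = funcOrOpts as RuntimeOptions;
  } else {
    // One-argument form: the first argument is the callback.
    func = funcOrOpts as () => Promise<T>;
  }
  // ...
}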

If this were a public tool, we might push for this. But it's an internal tool that's used only a few times in our codebase, so please don't go through the trouble.

inlined · 2023-06-06T16:43:24Z

src/deploy/functions/release/executor.ts

-    const op: Operation = { func };
+  async run<T>(func: () => Promise<T>, opts?: RunOptions): Promise<T> {
+    // merge and de-duplicate default and provided retry codes
+    let retryCodes = opts?.retryCodes || [];


I would just make this let retryCodes = opts?.retryCodes || DEFAULT_RETRY_CODES

I think you're trying to make it so that someone who explicitly passes zero retry codes gets DEFAULT_RETRY_CODES, but I would presume such a user is using the queue for throttling, but passed retryCodes: [] because they explicitly don't want to retry.
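The difference between the two fallbacks comes down to an empty array being truthy in JavaScript. In this sketch, `resolveWithLengthFallback` is a hypothetical reconstruction of a length-based fallback (the rest of the diff is not shown here), and `resolveWithOr` is the form suggested above.

```typescript
const DEFAULT_RETRY_CODES: number[] = [429, 409, 503];

// Hypothetical reconstruction of a length-based fallback: an explicit empty
// list is treated the same as "not specified".
function resolveWithLengthFallback(retryCodes?: number[]): number[] {
  const codes = retryCodes || [];
  return codes.length === 0 ? DEFAULT_RETRY_CODES : codes;
}

// The suggested form: only a missing list falls back to the defaults, because
// an empty array is truthy and therefore survives the `||`.
function resolveWithOr(retryCodes?: number[]): number[] {
  return retryCodes || DEFAULT_RETRY_CODES;
}

// Passing [] to the first variant still yields the three default codes.
const lengthFallbackCount = resolveWithLengthFallback([]).length;
// Passing [] to the second variant yields no retry codes, disabling retries.
const orFallbackCount = resolveWithOr([]).length;
```

The second variant lets a caller use the queue purely for throttling by passing `retryCodes: []`.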

blidd-google added 2 commits April 21, 2023 11:54

delete function on cloud run quota exhaustion and re-create

e0e8c5a

add changelog

79254b1

blidd-google requested review from taeold and inlined April 21, 2023 19:16

blidd-google self-assigned this Apr 21, 2023

blidd-google added the api: functions label Apr 21, 2023

taeold approved these changes Apr 27, 2023


inlined reviewed Apr 28, 2023


blidd-google and others added 3 commits May 30, 2023 14:24

Merge branch 'master' into bl-retry-on-run-quota

bc08fb8

change retry code name

4dcfea0

custom specified retry codes override defaults

f2efa6d

blidd-google requested a review from inlined June 5, 2023 04:20

Merge branch 'master' into bl-retry-on-run-quota

40d8eee

inlined approved these changes Jun 6, 2023


blidd-google and others added 2 commits June 6, 2023 13:17

allow to disable retries in executor

2708034

Merge branch 'master' into bl-retry-on-run-quota

aa2a939

blidd-google merged commit 6d9047f into master Jun 6, 2023
30 of 31 checks passed


