
GODRIVER-3172 Read response in the background after an op timeout. #1589

Merged
merged 11 commits into from Apr 12, 2024

Conversation

@matthewdale (Collaborator) commented Apr 2, 2024

Summary

Experimental change to minimize connection churn during lots-of-timeout events caused by increased database latency.

Changes:

  • Fix a bug that prevents maxTimeMS from being added to commands when a context with a deadline is used on an operation, even if timeoutMS is set on the client.
    • For Context-based deadlines, omit maxTimeMS from operations that return a cursor (Find, Aggregate) because the default behavior of maxTimeMS is to limit the lifetime of the cursor as well as the initial command, which may be surprising (see DRIVERS-2722). The existing automatic maxTimeMS behavior remains the same (i.e. when timeoutMS is set and no Context deadline is provided, Find and Aggregate still have maxTimeMS set).
  • When an operation timeout occurs, mark the in-use connection as "awaiting response" and start a new goroutine that attempts to read the response on the connection. Wait for up to 1 second to get a response from the server. If the read is successful, check the connection into the pool to be reused. If the read fails, close the connection.

Background & Motivation

The Go Driver currently closes the in-use connection when a command times out on the client side. MongoDB processes commands sent on a single connection sequentially, so sending another command on the same connection after the previous one timed out could lead to the later command taking an unpredictable extra amount of time to complete (because it had to wait for the previous response to be read). Closing the connection is the simplest way to both resolve the unknown connection state and signal MongoDB to stop working on the command.

However, closing the connection after an operation timeout can cause problems when connections are closed and created at a very high rate. We've observed that setting maxTimeMS as described in the CSOT spec can cause MongoDB to reliably respond with a server-side timeout error 25-100ms after a client-side timeout occurs. That consistent response timing makes it feasible to await the response in the background after a client-side timeout instead of closing the connection.

@mongodb-drivers-pr-bot mongodb-drivers-pr-bot bot added the priority-3-low Low Priority PR for Review label Apr 2, 2024
mongodb-drivers-pr-bot (bot) commented Apr 2, 2024

API Change Report

./x/mongo/driver

incompatible changes

ExtractErrorFromServerResponse: changed from func(./x/bsonx/bsoncore.Document) error to func(context.Context, ./x/bsonx/bsoncore.Document) error

compatible changes

Operation.OmitCSOTMaxTimeMS: added

./x/mongo/driver/operation

compatible changes

(*Aggregate).OmitCSOTMaxTimeMS: added
(*Find).OmitCSOTMaxTimeMS: added

./x/mongo/driver/topology

compatible changes

BGReadCallback: added
BGReadTimeout: added

@matthewdale matthewdale added priority-2-medium Medium Priority PR for Review and removed priority-3-low Low Priority PR for Review labels Apr 3, 2024
@matthewdale matthewdale marked this pull request as ready for review April 3, 2024 18:09
@ShaneHarvey (Member) commented:

Was this problem in the other PR addressed?

@matthewdale matthewdale added priority-1-high High Priority PR for Review and removed priority-2-medium Medium Priority PR for Review labels Apr 3, 2024
mongo/collection.go (review thread: outdated, resolved)
x/mongo/driver/operation.go (review thread: outdated, resolved)
-func ExtractErrorFromServerResponse(doc bsoncore.Document) error {
+//
+// Set isCSOT to true if "timeoutMS" is set on the Client.
+func ExtractErrorFromServerResponse(doc bsoncore.Document, isCSOT bool) error {
A collaborator commented:

Suggest not using a bool here and instead pass in the context and use csot.IsTimeoutContext(ctx) within decodeResult. This ensures that the source of CSOT-ness is derived from a context.

matthewdale (Collaborator, Author) replied:

Sounds good, will do.

mongo/integration/client_test.go (review thread: outdated, resolved)
@prestonvasquez (Collaborator) commented:

Here is a UML diagram for the proposal:

[UML diagram: read-failure workflow]

@matthewdale (Collaborator, Author) commented Apr 5, 2024

> Was this problem in the other PR addressed?

@ShaneHarvey I'm currently testing the tradeoffs between doing the read in the background vs doing it in-line with a subsequent operation. I'll document or link to those results here.

errs := make([]error, 0)
connClosed := false

err := conn.nc.SetReadDeadline(time.Now().Add(1 * time.Second))

A reviewer commented:

I would like to understand the reasoning behind the 1s deadline for reading any remaining server response. I'm wondering if this needs to be tuneable or whether it should be something dynamic like some multiple of P90 RTT.

matthewdale (Collaborator, Author) replied:

The 1-second deadline is the failsafe timeout if the server doesn't respond in a reasonable period of time after the operation timed out. I picked 1 second because, in our testing, that value covered >99% of the server responses that came after the operation timed out when maxTimeMS is set (the "overshoot duration").

The "overshoot duration" is a combination of timing variability in the client application, database, and network, so it's not clear if there is a practical measurement-based algorithm for picking that value. In the updated PR, that failsafe value is configurable using a global setting in an experimental package (topology.BGReadTimeout). For now the goal is to make that a "safe" but not configurable value, but if we observe that different use cases need different values, we may make it generally configurable.

prestonvasquez previously approved these changes Apr 10, 2024

@prestonvasquez (Collaborator) left a comment:

LGTM!

prestonvasquez previously approved these changes Apr 10 and Apr 11, 2024
@@ -683,8 +688,7 @@ func (op Operation) Execute(ctx context.Context) error {
first = false
}

-// Calculate maxTimeMS value to potentially be appended to the wire message.
-maxTimeMS, err := op.calculateMaxTimeMS(ctx, srvr.RTTMonitor().P90(), srvr.RTTMonitor().Stats())
+maxTimeMS, err := op.calculateMaxTimeMS(ctx, srvr.RTTMonitor())

A reviewer commented:

@matthewdale We just discovered an edge case here. We apparently have a background job that sets a context timeout of ~25 days (2211839988 milliseconds, to be exact), which is greater than the 32-bit max int, so they started getting (BadValue) 2211839988 value for maxTimeMS is out of range [0, 2147483647]. Obviously this is a somewhat far-off edge case; I'm asking the team what the reason is for having a context with a deadline this long.

But I wanted to flag this to you, since this PR is going to start erroring out on those edge cases. The patch we applied internally is to just take min(math.MaxInt32 - (2 * time.Millisecond), time.Duration(float64(timeLeft))) to avoid triggering that validation error. I'm not sure if the logic in calculateMaxTimeMS should just say "that timeout is too long, I'm not going to set maxTimeMS at all" or if we'd want to apply the same min logic.

The same commenter added:

Ah, the explanation of why the timeout is so long makes sense. tl;dr: processes for things like refunds/chargebacks can have month-long time frames, and various workflow orchestration systems will set a Go context to time out when that window ends, even if the current step of the workflow won't take nearly that long.

Our team is going to look at tuning DB calls to have more granular contexts/timeouts, but this still might be a broader issue y'all might be concerned about, since others could run into it.

matthewdale (Collaborator, Author) replied:

Wow, nice find! I've updated the driver to omit maxTimeMS if the calculated value is greater than max int32 so that there's no timeout you can set that will cause the "BadValue" error.

That means if an application starts an op with a timeout greater than 2147483647ms, there will be no database-side timeout automatically set. However, the client-side timeout logic will still work and will cause the driver to close the connection if that timeout is ever reached, cancelling the op on the server side (unless the command returns within the new 1-second "grace period" after the timeout).

The commenter replied:

Awesome, that approach makes sense to me and the code looks good. Nice job on the quick fix!

@matthewdale (Collaborator, Author) commented Apr 12, 2024

@ShaneHarvey I did a bunch of testing between "background reads" during check-in and "foreground reads" during check-out after an operation times out. The results unfortunately didn't show a clear winner as both behaved remarkably similarly, but each approach seemed to have one major downside:

  • The "background reads" approach can create more connections when operations time out, increasing reliance on maxConnecting to limit connection creation rate until the max pool size is reached. That may also result in applications maintaining a larger connection pool when timeouts happen.
  • The "foreground reads" approach can cause a higher error rate while lots of timeouts are happening. That may be caused by interaction between operations when timeouts occur, especially because pools by default use a LIFO algorithm to hand off idle connections.
    • The test that showed the difference in error rate attempts to replicate a database latency event by inserting then removing many documents from a collection while querying it, causing many timeouts as documents are inserted and the queries get slower. Across two test runs, the "foreground reads" approach recorded 15,910 errors over 37,605 requests, a 42.3% error rate. Across two test runs, the "background reads" approach recorded 11,538 errors over 34,141 requests, a 33.8% error rate. The average request rate for both approaches across all tests was around 350 req/sec with no significant difference in request rate between the two approaches.

I had to test both approaches at very high operation and timeout volume before I observed the above differences. In most other circumstances that I tested, they behaved almost identically.

For now I'm going to stick with the "background reads" approach to speed up delivering this fix. For the CSOT spec, I think we should allow driver teams to pick an approach suited to their language.

@matthewdale matthewdale merged commit b605d09 into v1 Apr 12, 2024
32 of 34 checks passed