VReplication: Improve handling of vplayer stalls #15797
Draft
+32
−2
Description
Observations:
Logs:
vstreamer.go:1022] stream (at source tablet) error @ 056f7cce-9679-11ee-93de-7a43718bd292:1-88,05b63ad0-9679-11ee-8848-b6dd624d298a:1-339542,4e956b44-99ec-11ee-b993-76445c58387d:1-6192085,4e980878-99ec-11ee-9055-cae1ac6f6e27:1-17946122,b912bee5-99ec-11ee-8938-46a387219895:1-10463847: EOF (errno 2013) (sqlstate HY000)
dbclient.go:125] error in stream 50, will retry after 5s: vttablet: rpc error: code = Unknown desc = stream (at source tablet) error @ 056f7cce-9679-11ee-93de-7a43718bd292:1-88,05b63ad0-9679-11ee-8848-b6dd624d298a:1-339542,4e956b44-99ec-11ee-b993-76445c58387d:1-6196344,4e980878-99ec-11ee-9055-cae1ac6f6e27:1-17946122,b912bee5-99ec-11ee-8938-46a387219895:1-10463847: unexpected EOF
binlog_connection.go:164] connection closed during binlog stream (possibly intentional): unexpected EOF
MySQL connection:
VReplication / OnlineDDL:
Thesis:
com_binlog_dump_gtid connection on the source tablet as that's what is then sent to the vplayer on the target tablet (via the VStream RPC call that the target made to the source) to write to the relay log.
slave_net_timeout. When this error ends the VStream RPC from the target to the source, the workflow goes into the error state (to be retried 5 seconds later by default).
This is a valid scenario that can be encountered. The real problem here is that it was very much NOT obvious what was happening and where the real problem lay. Ideally we should provide information that allows the user to zero in on the actual problem as described above: the queries being executed via the replication stream are doing table scans, and we'll need to address that somehow: new indexes, workflow adjustments (in the case we saw, we needed to adjust the workflow binlog source filter that OnlineDDL created), source and/or target tablet config adjustments (e.g. increasing slave_net_timeout on the source), etc.
Proposed improvements:
Related Issue(s)
Checklist