Bug Report: Tablets in RESTORE and BACKUP cause SwitchTraffic to fail #15630

Open
TonySparc opened this issue Apr 3, 2024 · 1 comment

TonySparc commented Apr 3, 2024

Overview of the Issue

When using MoveTables, a call is made to RefreshTabletsByShard, which attempts to update the topology server for all tablets in the shard. Tablets in RESTORE and BACKUP hold a lock for the duration of their restore or backup, so TabletManager.RefreshState fails because it cannot obtain the lock.

From my understanding, the RefreshTabletsByShard functionality used by SwitchTraffic should filter out BACKUP and RESTORE pods. Once one pod fails the tablet refresh, all other pods not yet processed also report a failed tablet refresh.
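
To make the suggestion concrete, here is a minimal sketch of the filtering idea. It is not the actual Vitess code: `TabletType`, `Tablet`, `refreshable`, and `partitionForRefresh` are hypothetical stand-ins defined only for this example, and the aliases in `main` are made up.

```go
package main

import "fmt"

// TabletType is a hypothetical stand-in for the tablet type enum in Vitess.
type TabletType int

const (
	Primary TabletType = iota
	Replica
	Rdonly
	Backup
	Restore
)

// Tablet is a hypothetical, simplified tablet record (alias plus type).
type Tablet struct {
	Alias string
	Type  TabletType
}

// refreshable reports whether a tablet should receive a refresh RPC.
// Tablets in BACKUP or RESTORE are assumed to hold a lock that would
// make the refresh time out, so they are excluded.
func refreshable(t Tablet) bool {
	switch t.Type {
	case Backup, Restore:
		return false
	default:
		return true
	}
}

// partitionForRefresh splits tablets into those to refresh and those skipped,
// preserving the input order within each group.
func partitionForRefresh(tablets []Tablet) (refresh, skipped []Tablet) {
	for _, t := range tablets {
		if refreshable(t) {
			refresh = append(refresh, t)
		} else {
			skipped = append(skipped, t)
		}
	}
	return refresh, skipped
}

func main() {
	tablets := []Tablet{
		{Alias: "zone1-100", Type: Primary},
		{Alias: "zone1-101", Type: Replica},
		{Alias: "zone1-102", Type: Backup},
		{Alias: "zone1-103", Type: Restore},
	}
	refresh, skipped := partitionForRefresh(tablets)
	fmt.Printf("refresh=%d skipped=%d\n", len(refresh), len(skipped))
}
```

Returning the skipped tablets separately would let the caller log them rather than drop them silently, which seems useful if one of them later becomes a serving replica or rdonly tablet.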

In our environments we have some large keyspaces with 32 shards, and each shard's backup (run as a cronjob) is scheduled at a randomly selected time to prevent performance issues. For highly sharded, large keyspaces where backups can take hours, this leaves only very small windows in which no shard is backing up and a team can therefore call SwitchTraffic successfully.

Reproduction Steps

  1. Begin a MoveTables run on a large keyspace where backups take some time, and wait for the copy to complete
  2. Restore or back up one of the pods in the keyspace
  3. Wait for that pod to go into RESTORE or BACKUP
  4. Call SwitchTraffic
  5. Observe that RefreshTabletsByShard fails on the BACKUP and RESTORE pods, preventing traffic from switching

Binary Version

> /opt/vitess/bin/vtgate --version
Version: 14.0.4-SNAPSHOT (Git revision <redacted> branch 'HEAD') built on Fri Mar 29 04:55:28 UTC 2024 by <redacted> using go1.18.7 linux/amd64

Operating System and Environment details

> cat /etc/os-release
NAME="CentOS Stream"
VERSION="8"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="CentOS Stream 8"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:8"
HOME_URL="https://centos.org/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_SUPPORT_PRODUCT_VERSION="CentOS Stream"

> uname -sr
Linux 6.1.66-hs57.el9.x86_64

> uname -m
x86_64

Log Fragments

failed to refresh tablet vt-2108385091:
  rpc error:
    code = DeadlineExceeded
    desc = context deadline exceeded
failed to refresh tablet vt-2108385092:
  rpc error:
    code = DeadlineExceeded

TonySparc added the Needs Triage and Type: Bug labels Apr 3, 2024

Contributor

mattlord commented Apr 8, 2024

Thank you for the detailed issue @TonySparc! There isn't currently a well-defined behavior here, other than the code which will effectively not switch traffic if you have "interim" tablet types like this, so it's unclear if this is a bug or a feature request (technically it would be a feature request, but I can understand how it can be seen as a bug and we could potentially treat it that way).

I would lump this in with the general issue that VReplication does not currently have any special handling for tablet types other than primary, replica, and rdonly — again, other than that they can't exist at the time we switch traffic.

We can't really just go ahead and switch traffic because that tablet could become a replica or rdonly tablet, right? What does it mean to be backing up or restoring a tablet for a shard that is about to become non-serving (at least for a set of tables)? This is certainly simpler for MoveTables as both sides of the move have serving tablets; it's just a matter of which tables they are serving. Can you please explain in detail what you expected here and what you would prefer? You detailed what happened and the practical impact it had on you, and I understand that, but there are tradeoffs and potential issues in changing the current behavior.

It may very well end up that, at least for MoveTables, we can simply ignore these tablet types. But I wanted to dig in a bit before considering fixes for the issue.

Thanks again!

mattlord added this to Backlog in VReplication via automation Apr 8, 2024
mattlord self-assigned this Apr 8, 2024
rohit-nayak-ps added the Component: VReplication label and removed the Needs Triage label Apr 9, 2024