Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vtadmin: can't see orphaned tablet records #14152

Open
derekperkins opened this issue Sep 30, 2023 · 12 comments
Open

vtadmin: can't see orphaned tablet records #14152

derekperkins opened this issue Sep 30, 2023 · 12 comments
Assignees
Labels
Component: VTAdmin VTadmin interface Type: Bug Type: Enhancement Logical improvement (somewhere between a bug and feature)
Projects

Comments

@derekperkins
Copy link
Member

derekperkins commented Sep 30, 2023

Overview of the Issue

Migrating to vtadmin from the old vtctld app, I noticed that vtadmin doesn't list all tablets, seemingly just healthy ones.

$ vtctlclient -server localhost:15999 ListAllTablets uscentral1 | grep companies
uscentral1-1005772700 companies 0 replica :0 :3306 [] <null>
uscentral1-1123889000 companies 0 primary uscentral1-companies-0-replica-c-0.vttablet:15002 uscentral1-companies-0-replica-c-0.vttablet:3306 [] 2023-09-29T06:36:48Z
image

This isn't urgent to resolve for us. It seems like the UI should list all, but maybe that's a non-trivial change given how vtadmin doesn't use the topo like the old app did.

Reproduction Steps

  1. Launch 2 tablets in a keyspace
  2. Shut down 1 tablet, but don't delete the tablet record

Binary Version

vtadmin v17.0.2
vtctld v15.0.2
vtgate v15.0.2

Operating System and Environment details

GKE v1.27

Log Fragments

No response

@derekperkins derekperkins added Type: Bug Type: Enhancement Logical improvement (somewhere between a bug and feature) Component: VTAdmin VTadmin interface Needs Triage This issue needs to be correctly labelled and triaged labels Sep 30, 2023
@deepthi
Copy link
Member

deepthi commented Oct 18, 2023

This is stemming from the fact that VTAdmin gets its list of tablets from vtgate versus vtctld. I happen to agree with you, that all tablets should be visible in VTAdmin.
Some sort of redesign may be needed to have both views available to users.

@deepthi deepthi removed the Needs Triage This issue needs to be correctly labelled and triaged label Oct 18, 2023
@mattlord mattlord added this to Inbox (unplanned/untriaged) in VTAdmin via automation Oct 18, 2023
@mattlord
Copy link
Contributor

mattlord commented Oct 18, 2023

I also agree, FWIW.

The fact that it uses a vtgate's healthcheck cache for this info poses a more serious issue (this had come up in an internal discussion previously). This is that VTAdmin currently uses a ~ random vtgate in the cluster (requests are load balanced via the internal vtgate proxy) to get this info and vtgates can be doing keyspace (--keyspaces_to_watch) and/or cell (--cells_to_watch) filtering — so you may only see a subset of the tablets in a cluster, and a different subset each time.

// ShowTablets is part of the DB interface.
func (vtgate *VTGateProxy) ShowTablets(ctx context.Context) (*sql.Rows, error) {
span, ctx := trace.NewSpan(ctx, "VTGateProxy.ShowTablets")
defer span.Finish()
vtadminproto.AnnotateClusterSpan(vtgate.cluster, span)
return vtgate.conn.QueryContext(vtgate.getQueryContext(ctx), "SHOW vitess_tablets")
}

That makes this feature simply wrong IMO and should be considered a bug rather than an enhancement. As noted, however, this would require some refactoring of this feature as it currently works as designed — making the bug label less clear.

Please correct me if I'm wrong on anything here @notfelineit or @ajm188.

@notfelineit
Copy link
Contributor

I also agree. We can migrate ShowTablets to use https://github.com/vitessio/vitess/blob/main/proto/vtctlservice.proto#L155

@derekperkins
Copy link
Member Author

That makes this feature simply wrong IMO and should be considered a bug rather than an enhancement

Just to make an explicit decision - should this be backported and/or rushed into the v18 launch?

@deepthi
Copy link
Member

deepthi commented Oct 18, 2023

should this be backported and/or rushed into the v18 launch?

Not necessarily. It's been this way since day 1. Having said that, we do have another ~3 weeks before GA, so if we are able to get it done within that time we can do a backport.

@notfelineit notfelineit moved this from Inbox (unplanned/untriaged) to To do soon-ish (planned but not started) in VTAdmin Oct 18, 2023
@notfelineit notfelineit self-assigned this Oct 18, 2023
@notfelineit notfelineit moved this from To do soon-ish (planned but not started) to In progress in VTAdmin Oct 20, 2023
@notfelineit
Copy link
Contributor

notfelineit commented Oct 20, 2023

Leaving sample of vtctld's GetTablets for reference:

➜  ~ vtctldclient GetTablets --server localhost:15999 --format json
[
  {
    "alias": {
      "cell": "zone1",
      "uid": 100
    },
    "hostname": "localhost",
    "port_map": {
      "grpc": 16100,
      "vt": 15100
    },
    "keyspace": "commerce",
    "shard": "0",
    "type": 2,
    "mysql_hostname": "localhost",
    "mysql_port": 17100,
    "default_conn_collation": 255
  },
  {
    "alias": {
      "cell": "zone1",
      "uid": 101
    },
    "hostname": "localhost",
    "port_map": {
      "grpc": 16101,
      "vt": 15101
    },
    "keyspace": "commerce",
    "shard": "0",
    "type": 1,
    "mysql_hostname": "localhost",
    "mysql_port": 17101,
    "primary_term_start_time": {
      "seconds": 1697839748,
      "nanoseconds": 143528000
    },
    "default_conn_collation": 255
  },
  {
    "alias": {
      "cell": "zone1",
      "uid": 102
    },
    "hostname": "localhost",
    "port_map": {
      "grpc": 16102,
      "vt": 15102
    },
    "keyspace": "commerce",
    "shard": "0",
    "type": 3,
    "mysql_hostname": "localhost",
    "mysql_port": 17102,
    "default_conn_collation": 255
  }
]

Compared to what we get back from current implementation via VTGate:

Cell | Keyspace | Shard | TabletType (string) | ServingState (string) | Alias | Hostname | PrimaryTermStartTime.

I think ServingState might be the only field missing.

@ajm188
Copy link
Contributor

ajm188 commented Dec 12, 2023

That makes this feature simply wrong IMO and should be considered a bug rather than an enhancement.

should this be backported and/or rushed into the v18 launch?

Not necessarily. It's been this way since day 1. Having said that, we do have another ~3 weeks before GA, so if we are able to get it done within that time we can do a backport.

To be clear, I don't view this as a bug at all; it was designed this way deliberately because we (at the time) wanted a view into whether tablets were actually reachable via the serving path (vtgates) as opposed to "simply existing in the topo."

when @notfelineit says

I think ServingState might be the only field missing.

That exact field is what we were particularly interested in.

Now, I am completely open to the idea that this is no longer the best approach and isn't serving users, and we should change it (that sounds great! let's do it!!), but this is not a bug and should not be backported or rushed.

@derekperkins
Copy link
Member Author

this is not a bug and should not be backported or rushed

vtgates can be doing keyspace (--keyspaces_to_watch) and/or cell (--cells_to_watch) filtering — so you may only see a subset of the tablets in a cluster, and a different subset each time

I understand that it is working as designed, but especially given @mattlord's comment about how it could give unexpectedly filtered information, it feels important to fix for vtadmin to be reliable.

@ajm188
Copy link
Contributor

ajm188 commented Dec 15, 2023

vtgates can be doing keyspace (--keyspaces_to_watch) and/or cell (--cells_to_watch) filtering — so you may only see a subset of the tablets in a cluster, and a different subset each time

the way we ameliorated this was to deploy a very small pool of vtgates that could see every keyspace/cell, and instructed vtadmin to only query those. the added benefit of doing this is that your application(s) use different vtgates from vtadmin, so there's little/no chance of vtadmin accidentally impacting your apps (e.g. from overwhelming the gates)

like i said i have no objection to changing this design, but i don't view it as a bug, so i don't think it's really something we should be backporting.

@notfelineit
Copy link
Contributor

notfelineit commented Apr 1, 2024

Adding latest context:

  • We need to add a better healthcheck to VTTablet to get the tablet's serving state, and subsequently, to vtctldclient to fetch that healthcheck information
  • In the next release, we should update GetTablets to call that healthcheck for serving state, and release the updated GetTablets
  • Reason for doing in two separate releases, from @deepthi:

Typically we can't use new rpcs in the same release because if vtctld is the new version but vttablet is the old version (mid-upgrade), the rpc call will fail

@deepthi
Copy link
Member

deepthi commented Apr 1, 2024

I take issue with "better healthcheck" , the healthcheck we have is great :)
The nuance here is that the existing healthcheck on vttablet is a streaming rpc. This means that a gRPC connection will be kept open and vttablet broadcasts its health on it periodically. That is n vtctlds * m vttablets. This increases memory usage on all vtctlds and vttablets depending on the actual numbers. It seems wasteful to keep these open. Instead, vtadmin can poll for health status when it needs to show it on the UI.

@notfelineit
Copy link
Contributor

@deepthi thank you for clarifying 🙌

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: VTAdmin VTadmin interface Type: Bug Type: Enhancement Logical improvement (somewhere between a bug and feature)
Projects
VTAdmin
In progress
Development

No branches or pull requests

5 participants