RFC: tablet throttler multi-metrics #15624

Open
shlomi-noach opened this issue Apr 3, 2024 · 6 comments
Labels
Component: Throttler Type: Enhancement Logical improvement (somewhere between a bug and feature)

Comments

@shlomi-noach
Contributor

shlomi-noach commented Apr 3, 2024

Today, the tablet throttler uses a single metric by which to throttle. This metric is dynamically configurable, but there is only the one. The default metric is replication lag, and it can be replaced by any query that returns a scalar value, e.g. one that returns Threads_running.

We want the throttler to measure multiple metrics at once, and we want to be able to throttle based on a selective list of metrics. Such metrics could be:

  • Replication lag
  • Threads_running
  • Custom query
  • Load average on tablet host (per core)
  • Other OS metrics

To that effect, we want:

  • Tablets to always collect multiple metrics on themselves (on their own host or their designated MySQL server)
  • PRIMARY tablet to always collect all available metrics from the replica tablets
  • Metrics should be identifiable by a designated name
  • Throttler check requests (mostly via throttler clients) should be able to specify the list of metrics on which they wish to throttle (e.g. "I care about replication lag, but fine to ignore load average")
  • User should be able to control the list of metrics for VReplication workflows (to be decided exactly how), and specifically for Online DDL. We will likely want to apply the same list of metrics for all workflows (ie we don't need different workflows to each have a different list of metrics on which to throttle)
  • Modifying list of metrics should apply dynamically to running workflows.
  • Throttler configuration should include expected thresholds per metric name.
  • We continue to apply throttler configuration across the keyspace (all tablets in all shards of a given keyspace align on the same single configuration)

Introducing a multi-metrics dimension explodes the complexity of the throttler code. However, we are thankfully also able to reduce the complexity by getting rid of dimensions that we don't really use or need, and which were inherited from freno:

  • Clusters: today we use self and shard, but self isn't really a cluster, and the code largely handles it differently from shard. We can therefore remove the "cluster" or "store" dimension.
    • Likewise we can also remove the per-cluster configuration overrides.
  • Store types: we only use MySQL. We can remove the dimension.
  • Probe settings: we always probe by tablet, and the probe layer is mostly redundant.
  • Other.

We will need to be backwards compatible: multi-metric PRIMARY should work with v19 replicas, and vice versa.

This will require a major rewrite, with some temporarily redundant code to support backwards compatibility. Hopefully we can simplify some existing complexities inherited from freno, or technical debt we've accumulated since.

Unit tests and endtoend tests will remain (and expand) to protect us against incompatibilities.

@shlomi-noach shlomi-noach added Type: Enhancement Logical improvement (somewhere between a bug and feature) Component: Throttler labels Apr 3, 2024
@shlomi-noach shlomi-noach self-assigned this Apr 3, 2024
@shlomi-noach shlomi-noach changed the title Tracking: tablet throttler multi-metrics RFC: tablet throttler multi-metrics Apr 4, 2024
@shlomi-noach
Contributor Author

Observability: we should be able to track why a certain client was throttled, ie which specific metric it was throttled on.

@shlomi-noach
Contributor Author

Throttler check requests (mostly via throttler clients) should be able to specify the list of metrics on which they wish to throttle (e.g. "I care about replication lag, but fine to ignore load average")

The set of metrics specified by the client will AND with each other, ie if the client chooses to throttle based on lag,loadavg then both lag and loadavg need to individually pass for the overall check to pass.

I don't think it makes sense to OR or to have any other combination.
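The AND semantics can be sketched roughly as follows. This is a minimal illustration, not the throttler's actual code: `CheckResult`, `checkAll`, and the sample values and thresholds are hypothetical names and numbers, and treating a metric with no collected value or no configured threshold as a failure is an assumption. The result also records which metric failed, in the spirit of the observability note above.

```go
package main

import "fmt"

// CheckResult is a hypothetical result type for this sketch.
type CheckResult struct {
	OK           bool
	FailedMetric string // first requested metric that did not pass, empty when OK
}

// checkAll ANDs the requested metrics: the check passes only if every
// requested metric is at or below its threshold.
// Assumption: a metric with no value or no threshold fails the check.
func checkAll(requested []string, values, thresholds map[string]float64) CheckResult {
	for _, name := range requested {
		value, collected := values[name]
		threshold, configured := thresholds[name]
		if !collected || !configured || value > threshold {
			return CheckResult{OK: false, FailedMetric: name}
		}
	}
	return CheckResult{OK: true}
}

func main() {
	// Illustrative numbers only.
	values := map[string]float64{"lag": 0.4, "loadavg": 2.5}
	thresholds := map[string]float64{"lag": 5.0, "loadavg": 1.0}

	// A client that only cares about lag passes.
	fmt.Printf("%+v\n", checkAll([]string{"lag"}, values, thresholds))
	// A client that asks for lag,loadavg is throttled, because loadavg fails.
	fmt.Printf("%+v\n", checkAll([]string{"lag", "loadavg"}, values, thresholds))
}
```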

@shlomi-noach
Contributor Author

shlomi-noach commented Apr 16, 2024

As mentioned above, we want to be able to change the list of considered metrics while an Online DDL operation is running (as an example). For instance, we may want Online DDL to start out throttling based on both lag and load average, and later on to stop throttling based on load average and remain with just lag.

IMO the way to do that is to associate metrics with an app name. All Online DDL operations use the app name "online-ddl". So the way would be to associate "online-ddl": "lag,loadavg".

That association will then either

  • make its way to the throttler client -- which then provides the throttler with the list of metrics it's interested in,
  • or, keeping the throttler client ignorant, computed on behalf of the client by the throttler.
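The second option, where the throttler resolves the metric list on behalf of the client, might look something like the sketch below. Everything here is hypothetical: `appMetricsRegistry`, `Assign`, `MetricsFor`, and the fallback list for apps with no association are assumptions made for illustration. The lock is there so that an association can be changed while checks are in flight, which is what "apply dynamically to running workflows" would need.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// appMetricsRegistry is a hypothetical holder for app-to-metrics associations.
type appMetricsRegistry struct {
	mu       sync.RWMutex
	byApp    map[string][]string
	fallback []string // assumption: used for apps with no association
}

func newRegistry(fallback ...string) *appMetricsRegistry {
	return &appMetricsRegistry{byApp: map[string][]string{}, fallback: fallback}
}

// Assign associates an app with a comma-separated metric list,
// e.g. Assign("online-ddl", "lag,loadavg"). It replaces any previous list.
func (r *appMetricsRegistry) Assign(app, list string) {
	var names []string
	for _, name := range strings.Split(list, ",") {
		if name = strings.TrimSpace(name); name != "" {
			names = append(names, name)
		}
	}
	r.mu.Lock()
	defer r.mu.Unlock()
	r.byApp[app] = names
}

// MetricsFor returns the metrics a check for the given app should consider.
func (r *appMetricsRegistry) MetricsFor(app string) []string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	if names, ok := r.byApp[app]; ok {
		return names
	}
	return r.fallback
}

func main() {
	registry := newRegistry("lag")
	registry.Assign("online-ddl", "lag,loadavg")
	fmt.Println(registry.MetricsFor("online-ddl"))

	// Later, while the migration is still running: drop loadavg.
	registry.Assign("online-ddl", "lag")
	fmt.Println(registry.MetricsFor("online-ddl"))
}
```

With this shape the throttler client stays ignorant of metrics: it sends only its app name, and the next check picks up whatever list is currently assigned.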

@shlomi-noach
Contributor Author

Metrics can be collected from the single tablet being probed, or from the collective shard.

  • Replication lag is normally something you wish to collect from the entire shard (including primary), because you want to know about the replicas' lag. There is a strong reason to check on all shard servers.
  • What about load average? Are you concerned with the load average on the PRIMARY or are you concerned about the metric on replicas? There is no clear answer and you probably want to check on PRIMARY only.

To that effect:

  • A metric is associated with a scope (self/shard). Each metric has a default scope. lag uses shard, others use self.
  • A normal check will use the default scopes (per metric).
  • But the user may also indicate "I wish to check the entire shard for all metrics" or "I wish to check self scope for all metrics", in which case we override the metrics' defaults.

Moreover, consider the discussion in the previous comment re: associating metrics with apps. It will be possible to fine-tune the checks even further by associating "online-ddl": "lag,shard/loadavg". Note:

  • the scope is not mandatory (nothing declared for lag, and so the scope for lag is the default one for this metric, which happens to be shard).
  • per-metric scopes are ignored by the self-checks, which are the mechanism by which the tablets collect their own metrics and by which the PRIMARY tablet collects metrics from the replicas.
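A rough sketch of how a list such as "lag,shard/loadavg" could be resolved into per-metric scopes. The names `Scope`, `ScopedMetric`, `defaultScope`, and `parseMetrics` are hypothetical. One point the text leaves open is precedence when a metric carries an explicit scope and the caller also asks for a scope for all metrics; this sketch assumes the explicit per-metric scope wins and the caller's scope replaces only the defaults.

```go
package main

import (
	"fmt"
	"strings"
)

// Scope says where a metric is evaluated: on this tablet or across the shard.
type Scope string

const (
	ScopeUndefined Scope = ""
	ScopeSelf      Scope = "self"
	ScopeShard     Scope = "shard"
)

// ScopedMetric is a metric name paired with its resolved scope.
type ScopedMetric struct {
	Name  string
	Scope Scope
}

// defaultScope follows the rule in the text: lag uses shard, others use self.
func defaultScope(metric string) Scope {
	if metric == "lag" {
		return ScopeShard
	}
	return ScopeSelf
}

// parseMetrics resolves a comma-separated list where each entry is either
// "name" or "scope/name". requestScope, when set, replaces the default scope
// of entries that declare none (assumption: explicit scopes take precedence).
func parseMetrics(list string, requestScope Scope) ([]ScopedMetric, error) {
	var result []ScopedMetric
	for _, token := range strings.Split(list, ",") {
		token = strings.TrimSpace(token)
		if token == "" {
			continue
		}
		name, scope := token, ScopeUndefined
		if prefix, rest, found := strings.Cut(token, "/"); found {
			name, scope = rest, Scope(prefix)
			if scope != ScopeSelf && scope != ScopeShard {
				return nil, fmt.Errorf("unknown scope %q in %q", prefix, token)
			}
		}
		if scope == ScopeUndefined {
			scope = requestScope
		}
		if scope == ScopeUndefined {
			scope = defaultScope(name)
		}
		result = append(result, ScopedMetric{Name: name, Scope: scope})
	}
	return result, nil
}

func main() {
	metrics, err := parseMetrics("lag,shard/loadavg", ScopeUndefined)
	if err != nil {
		panic(err)
	}
	for _, m := range metrics {
		fmt.Printf("%s is checked at scope %s\n", m.Name, m.Scope)
	}
}
```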

@shlomi-noach
Contributor Author

shlomi-noach commented May 15, 2024

Adding support for an all app, which is a catch-all for anything that doesn't have any specific rules. With all, it is possible to do inverted rules, such as "everything is rejected, except this app which is allowed". Or, "everything throttles at 0.7 ratio for the next 2 hours, except these two apps, one of which is exempted for the next 5 hours, the other throttled at 0.2 ratio for the next 30min". Or also "everything is exempted, but this app needs to go through normal throttling".

@shlomi-noach
Contributor Author

Adding a vtctldclient CheckThrottler command, which returns a detailed CheckThrottlerResponse. The command takes a tablet name as argument (potentially it could also take a shard name, much like Backup and BackupShard). It takes --app-name and --scope optional arguments as well as some extra flags.
