feat(metrics): Add label for main and other listeners #4739

Merged
abhijat merged 6 commits into main from abhijat/feat/label-for-main-listener on Mar 12, 2025

Conversation

abhijat
Contributor

@abhijat abhijat commented Mar 10, 2025

The stats collected per connection are now split by listener type: main or other.

Metrics are decorated with the label listener="main" or listener="other".

fixes #4708

The behavior of the INFO command is unchanged: it still displays the sum of connections and commands rather than separate main and other counts.
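
A minimal sketch of how split counters can still feed a single total for INFO. Only num_conns_main and num_conns_other appear in this PR; the struct name, the command counter names, and the helper functions are assumptions made for illustration, not the actual Dragonfly code.

```cpp
#include <cstdint>

// Hypothetical, trimmed-down stand-in for the connection stats struct.
struct ConnStats {
  uint32_t num_conns_main = 0;
  uint32_t num_conns_other = 0;
  uint64_t cmds_main = 0;   // assumed field name
  uint64_t cmds_other = 0;  // assumed field name
};

// INFO-style reporting keeps showing one number: the sum of both buckets.
inline uint32_t TotalConns(const ConnStats& s) {
  return s.num_conns_main + s.num_conns_other;
}

inline uint64_t TotalCmds(const ConnStats& s) {
  return s.cmds_main + s.cmds_other;
}
```

With this shape, the metrics endpoint can report each bucket separately while INFO output stays byte-for-byte compatible with what clients already parse.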

@abhijat abhijat force-pushed the abhijat/feat/label-for-main-listener branch 4 times, most recently from 8139a83 to 6321b86 on March 10, 2025 15:20

The stats collected per connection are divided according to main or
other listener.

Metrics are decorated with labels listener= main or other.

Signed-off-by: Abhijat Malviya <abhijat@dragonflydb.io>
@abhijat abhijat force-pushed the abhijat/feat/label-for-main-listener branch from 6321b86 to 7587220 on March 11, 2025 05:16
@abhijat abhijat changed the title from "[wip] feat(metrics): Add label for main and other listeners" to "feat(metrics): Add label for main and other listeners" on Mar 11, 2025
@kostasrim kostasrim requested review from kostasrim and removed request for BagritsevichStepan March 11, 2025 09:10
Comment on lines +845 to +846
EXPECT_EQ(metrics.facade_stats.conn_stats.num_conns_main, 0);
EXPECT_EQ(metrics.facade_stats.conn_stats.num_conns_other, 0);
Contributor

This is zero because we don't use an actual connection, right?

Contributor Author

Yeah, in BaseFamilyTest::Run we seem to call DispatchCommand directly on the test connection wrapper. The HandleRequests method, which would increment or decrement these counters, is never called.

Comment on lines 178 to 181
@dfly_args({"memcached_port": 11211})
async def test_metric_labels(
    df_server: DflyInstance, async_client: aioredis.Redis, memcached_client: Client
):
Contributor

We do not need a test for the Prometheus metrics, but I guess it won't hurt? You can just open your browser and check via http://localhost:6379/metrics, or just rely on the unit test. Not strongly opinionated about this tbh.

Contributor

Also, you don't need memcached_client or memcached_port.

Contributor Author

I used the memcached client to get a count of "other" connections and commands, since I see it goes through a listener other than main. Do you recommend removing this test?

Collaborator

It's actually an interesting point. We want to count the memcache listener as "main" as well, since it is also customer facing.

Contributor Author

I added a new method which returns true if the role is MAIN or the protocol is MEMCACHED, and adjusted the metrics to use it. The test is also changed to account for this and now uses the admin port for counting the "other" metrics.

This metric will be incremented for any memcached listener (or main-role listener). I hope this is the correct behavior.

Comment on lines +1284 to +1288
AppendMetricHeader("connected_clients", "", MetricType::GAUGE, &resp->body());
AppendMetricValue("connected_clients", conn_stats.num_conns_main, {"listener"}, {"main"},
                  &resp->body());
AppendMetricValue("connected_clients", conn_stats.num_conns_other, {"listener"}, {"other"},
                  &resp->body());
Contributor

@adiholden Won't this break our dashboards, since we now need to use the label in our queries as well?

Double-checking here that this is OK for us.

Contributor

Double-checked with @Pothulapati: we should be OK with the label. It won't break our dashboards.
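
For readers unfamiliar with what the label changes on the wire, here is a rough sketch of the two samples in the Prometheus text exposition format. FormatSample and RenderConnectedClients are hypothetical helpers written for this illustration. They are not Dragonfly's AppendMetricHeader or AppendMetricValue, and the exported metric name may differ from the bare "connected_clients" used here.

```cpp
#include <cstdint>
#include <string>

// Hypothetical formatter: renders one labeled sample as
//   name{label_name="label_value"} value
inline std::string FormatSample(const std::string& name, const std::string& label_name,
                                const std::string& label_value, uint64_t value) {
  return name + "{" + label_name + "=\"" + label_value + "\"} " +
         std::to_string(value) + "\n";
}

// Emits one TYPE line followed by one sample per listener bucket.
inline std::string RenderConnectedClients(uint32_t main_conns, uint32_t other_conns) {
  std::string body = "# TYPE connected_clients gauge\n";
  body += FormatSample("connected_clients", "listener", "main", main_conns);
  body += FormatSample("connected_clients", "listener", "other", other_conns);
  return body;
}
```

A query that needs the old single total can aggregate the label away, for example with sum without (listener) (connected_clients) in PromQL. Whether a given dashboard panel needs that depends on how it selects series.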

Comment on lines 2001 to 2003
uint32_t& Connection::NumConns() {
  return IsMain() ? stats_->num_conns_main : stats_->num_conns_other;
}
Contributor

I really do not like returning references, especially when we only want to increment what is being referenced. I know that Stefan suggested this, but I would also suggest a slight change: rename NumConns() to IncrNumConns() and do the increment here:

void Connection::IncrNumConnStats() {
  IsMain() ? ++stats_->blah blah....
}
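
A compilable sketch of that suggestion, assuming a heavily reduced Connection class. The counter names come from the diff above. The constructor, the DecrNumConns counterpart, and the exact method names are assumptions for illustration only.

```cpp
#include <cstdint>

struct ConnectionStats {
  uint32_t num_conns_main = 0;
  uint32_t num_conns_other = 0;
};

// Hypothetical minimal Connection; the real class carries far more state.
class Connection {
 public:
  Connection(ConnectionStats* stats, bool is_main) : stats_(stats), is_main_(is_main) {}

  bool IsMain() const { return is_main_; }

  // The mutation stays inside the class, so no reference to the counter escapes.
  void IncrNumConns() {
    if (IsMain()) {
      ++stats_->num_conns_main;
    } else {
      ++stats_->num_conns_other;
    }
  }

  // Assumed counterpart for when a connection closes.
  void DecrNumConns() {
    if (IsMain()) {
      --stats_->num_conns_main;
    } else {
      --stats_->num_conns_other;
    }
  }

 private:
  ConnectionStats* stats_;
  bool is_main_;
};
```

The trade-off is one extra method per operation in exchange for call sites that cannot accidentally assign to or hold on to the counter.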

Contributor Author

changed

Comment on lines 712 to 714
if (const Listener* ls = dynamic_cast<Listener*>(listener()); ls) {
  is_main_ = ls->IsMainInterface();
}
Contributor

I know that the static_cast you replaced looks weird and wrong, but it isn't.

  1. We do not have other Listener types, so a static_cast here is perfectly fine: the downcast is resolved at compile time and is safe.
  2. A dynamic_cast is a runtime cast that queries the RTTI information of the object, and that lookup has an associated overhead. It's needed when you are not sure about the object's type. Here, in this context and for our data types, we only have Listener, so the static_cast will always succeed and be correct.

IMO this is bad design on our part. C++ has a solution for this problem: CRTP (curiously recurring template pattern), which is a way to express static polymorphism safely (the base class performs the static_cast to the derived type, so call sites avoid an explicit cast like the one above). I never had a chance to refactor our class hierarchies, and that's why we have a static_cast here. It's not wrong, but it could be expressed better. However, I don't have the time to push such a refactor (it's not hard, but it takes time because you need to replace virtual functions with static casts done by the base class).

P.S. You have probably encountered CRTP in Seastar. All of the WRITE/READ DMA functions in there should use CRTP internally to decouple interface/impl requirements.
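
For reference, a small self-contained CRTP sketch. Every name in it (ListenerBase, MainListener, AdminListener, IsMainImpl, Describe) is hypothetical and chosen only to mirror the discussion. It does not reflect how Dragonfly's or Seastar's classes are actually laid out.

```cpp
#include <string>

// The base class is parameterised on the derived type, so the single
// static_cast lives here instead of at every call site.
template <typename Derived>
class ListenerBase {
 public:
  // Statically dispatched: no virtual call and no RTTI lookup.
  bool IsMainInterface() const { return self().IsMainImpl(); }

  std::string Describe() const { return IsMainInterface() ? "main" : "other"; }

 private:
  const Derived& self() const { return static_cast<const Derived&>(*this); }
};

class MainListener : public ListenerBase<MainListener> {
 public:
  bool IsMainImpl() const { return true; }
};

class AdminListener : public ListenerBase<AdminListener> {
 public:
  bool IsMainImpl() const { return false; }
};
```

The cast in self() is valid as long as each class passes its own type as the template argument, which is the convention the pattern relies on.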

Contributor Author

Do you recommend switching back to static_cast here to avoid paying the type query cost?

Collaborator

@abhijat we usually do not rely on RTTI in our code. It's a matter of taste, and we do not use it.

Contributor

@romange I wrote a paragraph 🤣 But yeah rtti has a small overhead as well :)

Contributor Author

moved back to static_cast

@@ -1318,7 +1318,8 @@ bool Service::InvokeCmd(const CommandId* cid, CmdArgList tail_args, SinkReplyBui
DispatchMonitor(cntx, cid, tail_args);
}

ServerState::tlocal()->RecordCmd();
const bool is_main = cntx->conn() ? cntx->conn()->IsMain() : false;
Contributor

Can cntx->conn() be null here? If so, maybe we can store the IsMain flag in ConnectionContext instead?

Contributor Author

I saw this was null during a unit test that runs the populate debug command, because it calls InvokeCmd with a locally created context whose owner is set to nullptr: https://github.com/dragonflydb/dragonfly/blob/main/src/server/debugcmd.cc#L196

Contributor

Yep, this is a stub context and we have a bug. So why don't we use the context to check whether it's main or not?

Contributor

And since we construct ConnectionContext local_cntx{cntx, stub_tx.get()}; we will also copy the IsMain information and count the right thing. Otherwise we would count those commands as non-main even if they actually are main.

Contributor Author

I added a new field on ConnectionContext. It is false by default and is copied from the owner's connection if both the owner and the owner's connection are valid.
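
A rough sketch of that behavior under simplifying assumptions. The types are reduced stand-ins, and the field name is_main and the factory StubFrom are hypothetical. The real ConnectionContext constructor takes more arguments, such as the stub transaction shown earlier in this thread.

```cpp
// Hypothetical, reduced Connection.
class Connection {
 public:
  explicit Connection(bool is_main) : is_main_(is_main) {}
  bool IsMain() const { return is_main_; }

 private:
  bool is_main_;
};

// Hypothetical, reduced ConnectionContext.
struct ConnectionContext {
  const Connection* conn = nullptr;
  bool is_main = false;  // false unless a valid owner connection says otherwise

  ConnectionContext() = default;

  explicit ConnectionContext(const Connection* c)
      : conn(c), is_main(c != nullptr && c->IsMain()) {}

  // Builds a stub context with no connection of its own. The flag is taken
  // from the owner's connection only when both pointers are valid.
  static ConnectionContext StubFrom(const ConnectionContext* owner) {
    ConnectionContext stub;
    if (owner != nullptr && owner->conn != nullptr) {
      stub.is_main = owner->conn->IsMain();
    }
    return stub;
  }
};
```

Command accounting can then read the flag from the context and never dereference a null connection, which is the crash path described above for the populate debug command.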

abhijat added 4 commits March 12, 2025 10:16

Signed-off-by: Abhijat Malviya <abhijat@dragonflydb.io>
@abhijat abhijat requested a review from kostasrim March 12, 2025 08:23
Contributor

@kostasrim kostasrim left a comment

minor comment LGTM

Comment on lines 929 to 937
if (is_main_) {
  return true;
}

if (const Listener* lsnr = static_cast<Listener*>(listener()); lsnr) {
  return lsnr->protocol() == Protocol::MEMCACHE;
}

return false;
Contributor

Suggested change
- if (is_main_) {
-   return true;
- }
- if (const Listener* lsnr = static_cast<Listener*>(listener()); lsnr) {
-   return lsnr->protocol() == Protocol::MEMCACHE;
- }
- return false;
+ const Listener* lsnr = static_cast<Listener*>(listener());
+ return is_main_ || lsnr->protocol() == Protocol::MEMCACHE;

Contributor

And maybe rename it, because HasMainListener is kind of a lie if the protocol is MEMCACHE. Maybe something like IsMainOrMemcached?

Contributor Author

yeah I'm not too happy with HasMainListener, will change this

Contributor Author

The problem with accessing lsnr->protocol() without a null check is that I found lsnr to be null during tests when we end up here with TestConnection, and it caused a segfault. From what I could see, the test connection doesn't have a listener pointer.

Contributor

@kostasrim kostasrim Mar 12, 2025

In that case: return is_main_ || (lsnr && lsnr->protocol() == Protocol::MEMCACHE)

Contributor Author

I made both changes. I ordered it so that the static_cast happens only after checking is_main_. I can change it to a single composite statement, but in that case we always have to do the static_cast.
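
A self-contained sketch of the null-safe check discussed in this thread. The types are simplified stand-ins: the sketch stores a Listener pointer directly instead of static_casting a base pointer, the REDIS enumerator is assumed, and IsMainOrMemcached is the name suggested in review, which may not be the name that was merged.

```cpp
enum class Protocol { REDIS, MEMCACHE };

// Hypothetical, reduced Listener.
class Listener {
 public:
  explicit Listener(Protocol p) : protocol_(p) {}
  Protocol protocol() const { return protocol_; }

 private:
  Protocol protocol_;
};

// Hypothetical, reduced Connection. listener may be null, as it is for the
// test connection mentioned above.
class Connection {
 public:
  Connection(bool is_main, const Listener* listener)
      : is_main_(is_main), listener_(listener) {}

  bool IsMainOrMemcached() const {
    if (is_main_) {
      return true;  // early return: the listener is never inspected
    }
    // The null check short-circuits before protocol() is called.
    return listener_ != nullptr && listener_->protocol() == Protocol::MEMCACHE;
  }

 private:
  bool is_main_;
  const Listener* listener_;
};
```

Both orderings give the same result. The early return only avoids touching the listener pointer when is_main_ already decides the answer.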

Signed-off-by: Abhijat Malviya <abhijat@dragonflydb.io>
@abhijat abhijat requested a review from kostasrim March 12, 2025 10:58
Contributor

@kostasrim kostasrim left a comment

LGTM

@abhijat abhijat merged commit ac33cd8 into main Mar 12, 2025
10 checks passed
@abhijat abhijat deleted the abhijat/feat/label-for-main-listener branch March 12, 2025 13:03
Successfully merging this pull request may close these issues.

some traffic metrics should be labeled as main/other