-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(metrics): Add label for main and other listeners #4739
Conversation
8139a83
to
6321b86
Compare
The stats collected per connection are divided according to main or other listener. Metrics are decorated with labels listener= main or other. Signed-off-by: Abhijat Malviya <abhijat@dragonflydb.io>
6321b86
to
7587220
Compare
EXPECT_EQ(metrics.facade_stats.conn_stats.num_conns_main, 0); | ||
EXPECT_EQ(metrics.facade_stats.conn_stats.num_conns_other, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is zero because we don't use an actual connection right ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, in BaseFamilyTest::Run
we seem to directly call DispatchCommand
on the test connection wrapper, the HandleRequests
method isn't called which would increment or decrement these.
@dfly_args({"memcached_port": 11211}) | ||
async def test_metric_labels( | ||
df_server: DflyInstance, async_client: aioredis.Redis, memcached_client: Client | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do not need a test for the prometheous metrics but I guess it won't hurt ? You can just open your browser and check via http://localhost:6379/metrics
or just rely on the unit test. Not strongly opinionated about this tbh
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also you don't need memcached_client
or memcached_port
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used memcached client to get a count of other
connection and command as I see it goes via a listener other than main. Do you recommend to remove this test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's actually an interesting point. We want to count the memcache listener as "main" as well. it is customer facing as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a new method which returns true if the role is MAIN
or the protocol is MEMCACHED
and adjusted the metrics to use this. The test is also changed to account for this and use the admin port for counting other
metrics.
This will metric will be incremented for any memcached listener (or main role listener), I hope this is correct behavior.
AppendMetricHeader("connected_clients", "", MetricType::GAUGE, &resp->body()); | ||
AppendMetricValue("connected_clients", conn_stats.num_conns_main, {"listener"}, {"main"}, | ||
&resp->body()); | ||
AppendMetricValue("connected_clients", conn_stats.num_conns_other, {"listener"}, {"other"}, | ||
&resp->body()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@adiholden Won't this break our dashboards ? Since now we need to use the label as well on our queries
?
Double checking here that this is ok for us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
double checked with @Pothulapati we should be ok with the label. It won't break our dashboards.
src/facade/dragonfly_connection.cc
Outdated
uint32_t& Connection::NumConns() { | ||
return IsMain() ? stats_->num_conns_main : stats_->num_conns_other; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really do not like to return references especially when we only want to increment what is being referenced. I know that Stefan suggested this but I would also suggest a slightly change: Rename NumConns()
to IncrNumConns()
and do the increment here:
void Connection::IncrNumConnStats() {
IsMain() ? ++stats_->blah blah....
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed
src/facade/dragonfly_connection.cc
Outdated
if (const Listener* ls = dynamic_cast<Listener*>(listener()); ls) { | ||
is_main_ = ls->IsMainInterface(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know that the static_cast
you replaced looks weird and wrong but it isn't.
- We do not have other
Listener
types so if youstatic_cast
here is perfectly fine and thedowncast
is safe to do so at compile time. - When you use
dynamic_cast
you are actually doing aruntime cast
which queries theRTTI
information of the object. There is an associated overhead with this cast. It's needed when "you are not sure" about the object type. Here, in this context and for our data types, we only haveListener
so thestatic_cast
will always succeed and be correct.
IMO this is our bad design. C++ has solved this problem and it's called CRTP (curiously recurring template pattern) which is a way to express static polymoprphism
in a safe way
(avoiding the explicit static_cast
we used above by delegating that to the base class via the compiler). I never had a change to refactor our class hierarchies and that's why we got static_cast
here. It's not wrong but it could be expressed better. However, I don't have the time to push such a refactor (it's not hard to do so but it does require time because you need to substitute virtual functions with static casts done by the base class).
p.s. you probably have encounter CRTP in seastar. All of the WRITE/READ DMA
functions in there should use CRTP
internally to decouple interface/impl requirements
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you recommend switching back to static_cast
here to avoid paying the type query cost?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@abhijat we usually do not rely on rtti in our code. Matter of taste and we do not use it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@romange I wrote a paragraph 🤣 But yeah rtti has a small overhead as well :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved back to static_cast
src/server/main_service.cc
Outdated
@@ -1318,7 +1318,8 @@ bool Service::InvokeCmd(const CommandId* cid, CmdArgList tail_args, SinkReplyBui | |||
DispatchMonitor(cntx, cid, tail_args); | |||
} | |||
|
|||
ServerState::tlocal()->RecordCmd(); | |||
const bool is_main = cntx->conn() ? cntx->conn()->IsMain() : false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cn cntx->conn()
be null here ? If so maybe we can store the IsMain
in ConnectionContext
instead ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw this was null during a unit test which runs the populate
debug command, because it called InvokeCmd
with a locally created context which had owner set to nullptr https://github.com/dragonflydb/dragonfly/blob/main/src/server/debugcmd.cc#L196
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep this is a stub context
and we have a bug. So, why don't we use the context to check if it's main
or not ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and since we ConnectionContext local_cntx{cntx, stub_tx.get()};
we will also copy the IsMain
information and we will count the right thing. Otherwise we would count those commands as non main even if they actually are
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a new field on ConnectionContext
, it is false by default and copied from owner connection if the owner and owner's connection are valid.
Signed-off-by: Abhijat Malviya <abhijat@dragonflydb.io>
Signed-off-by: Abhijat Malviya <abhijat@dragonflydb.io>
Signed-off-by: Abhijat Malviya <abhijat@dragonflydb.io>
Signed-off-by: Abhijat Malviya <abhijat@dragonflydb.io>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor comment LGTM
src/facade/dragonfly_connection.cc
Outdated
if (is_main_) { | ||
return true; | ||
} | ||
|
||
if (const Listener* lsnr = static_cast<Listener*>(listener()); lsnr) { | ||
return lsnr->protocol() == Protocol::MEMCACHE; | ||
} | ||
|
||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (is_main_) { | |
return true; | |
} | |
if (const Listener* lsnr = static_cast<Listener*>(listener()); lsnr) { | |
return lsnr->protocol() == Protocol::MEMCACHE; | |
} | |
return false; | |
const Listener* lsnr = static_cast<Listener*>(listener()); | |
return is_main_ || lsnr->protocol() == Protocol::MEMCACHE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and maybe rename because HasMainListener
is kinda a lie if protocol is MEMCACHE. Maybe something like IsMainOrMemcached
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah I'm not too happy with HasMainListener
, will change this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem with accessing lsnr->protocol()
without a null check is that I found it to be null during tests when we end up here with TestConnection
, and it caused segfault. From what I could see the test connection doesn't have a listener pointer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case return is_main_ || (lsnr && snr->protocol() == Protocol::MEMCACHE)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made both the changes, I ordered it so the static_cast
is only after checking for is_main_
, but I can change it to make everything in one composite statement, in that case we always have to do the static cast.
Signed-off-by: Abhijat Malviya <abhijat@dragonflydb.io>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
The stats collected per connection are divided according to main or other listener.
Metrics are decorated with labels listener= main or other.
fixes #4708
The behavior of
INFO
command is unchanged, it still displays the sum of connections and commands instead of separate main and other counts.