Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

trace grpc http requests via net/http/httptrace #6589

Closed
vtolstov opened this issue Aug 28, 2023 · 16 comments
Closed

trace grpc http requests via net/http/httptrace #6589

vtolstov opened this issue Aug 28, 2023 · 16 comments
Assignees
Labels

Comments

@vtolstov
Copy link

I need to trace grpc calls not only via interceptors, but understand how much time i spent on dns resolving and tls handshake. In plain http i can use net/http/httptrace to provide hooks.
How can i do something like this in go-grpc?

@vtolstov vtolstov added the Type: Feature New features or improvements in behavior label Aug 28, 2023
@easwars
Copy link
Contributor

easwars commented Aug 29, 2023

We currently don't expose TLS handshakes or DNS messages as trace events or via interceptors. There are some plans to make more data available via traces and metrics as part of our observability project, but we are not sure at this point whether what you are asking will be exposed.

Could you please tell us your usecase?

Also, if this (dns resolution events and tls handshakes) is something really important to you, you could implement your own custom credentials that wraps the TLS credentials implementation provided by gRPC and make some metrics available to your application. Similarly you can create your own custom resolver that wraps the DNS resolver and makes metrics available to you. But this is not something we would recommend without completely understanding your usecase.

@vtolstov
Copy link
Author

My use case is very simple. I have 50+ services and all works fine. But sometimes some services get timeouts. Dashboards from k8s does not have problems dns works fine, packet loss absent and so. I'm enable tracing in my services and see that some requests get stuck - one service call grpc endpoint , other service receives request and sent response. But half of all spent time is black hole - i'm lost something somewhere.
In http i can use httptrace to get tls timing and dns timing and add this info in trace span.
As i understand grpc uses the http2 go package so maybe httptrace can works too?

@vtolstov
Copy link
Author

@easwars ?

@vtolstov
Copy link
Author

I'm try to pass own ContextDialer to get span from context and measure time.
But as i saw - passed context does not have my span, but before Dial called i saw my span in context. Why this happened?

@easwars
Copy link
Contributor

easwars commented Aug 31, 2023

As i understand grpc uses the http2 go package so maybe httptrace can works too?

gRPC uses some types from the http2 package, but it does not rely on the http2 package for the transport. Specifically, we use the http2.Framer to read and write frames from the underlying connection, but gRPC decides when and how to send specific frames and don't expect the http2 package to manage that.

@easwars
Copy link
Contributor

easwars commented Aug 31, 2023

I'm try to pass own ContextDialer to get span from context and measure time. But as i saw - passed context does not have my span, but before Dial called i saw my span in context. Why this happened?

Maybe if you show us some code, we might be able to help better. Thanks.

@vtolstov
Copy link
Author

the example is very easy

type myxxx struct{}
Call some grpc method with rsp, err := Call(context.WithValue(ctx, myxxx, "test"), req)
custom ContextDialer passed as option
func(ctx context.Context, addr string) (net.Conn, error) {
v, ok := ctx.Value(myxxx).(string)
if ok && v == "test" {
fmt.Printf("OK")
} else {
fmt.Printf("FAIL")
}
}

i'm always get FAIL

@easwars
Copy link
Contributor

easwars commented Aug 31, 2023

The dialer passed to WithContextDialer is invoked every time a connection to a backend is being established. FYI, when your application creates a grpc.ClientConn using grpc.Dial, the ClientConn waits for addresses from the name resolver and passes them on to the load balancer, which is the one which actually creates connections to backends.

To be clear, that context that you pass to your grpc method call is not the same context that is passed to the custom dialer.

@vtolstov
Copy link
Author

vtolstov commented Sep 1, 2023

Yes i saw in code, but don't understand why context passed to call not propogated to ContextDialer.
And - without ability to pass context to Dial/Resolver etc - i cant trace timing to establish connect to endpoint. How to deal?

@vtolstov
Copy link
Author

vtolstov commented Sep 4, 2023

@easwars so, what the next steps? I can't propogate span to resolver and context dialer does not have context with my span (why??)

@vtolstov
Copy link
Author

vtolstov commented Sep 5, 2023

I think if you can pass context from grpc call to ContextDialer this can open many useful features like tracing connection establishment/reconnect detection in tracing, additional logging via injected logger and many other features. Also i think that this is very small change to code base and easy accepted by grpc authors...
@arvindbr8 @easwars ?

@easwars
Copy link
Contributor

easwars commented Sep 5, 2023

Name resolution is something that is started when the grpc.ClientConn is created and happens asynchronously. An RPC call will be blocked until the name resolver comes back with addresses. And these addresses are then passed to the LB policy which creates actual connections to backends. For every connection that is created, credentials handshake happens. To be clear, name resolution and credential handshaking will happen even without an ongoing RPC call.

At RPC time, the LB policy decides which of those backends to use to service that RPC call. There is no control flow between things happening at RPC time, and the name resolver. So, there is no way to forward the context passed to the RPC call to the name resolver. That just simply does not make sense.

I think if you can pass context from grpc call to ContextDialer this can open many useful features like tracing connection establishment/reconnect detection in tracing, additional logging via injected logger and many other features. Also i think that this is very small change to code base and easy accepted by grpc authors...

No, I don't think it will be a small change that will be easily accepted by the maintainers.

@zasweq
Copy link
Contributor

zasweq commented Sep 6, 2023

We have a stats handler that uses OpenCensus, that recordes a trace with the time the picker picked: #6422. This constructs a trace with annotations at the timestamp when the rpc was started, and the picker has successfully picked. This is not technically as granular/specific as the time of "dns resolving and tls handshake", but I think that within that timeframe can tell you if it's an rpc taking a long time or the steps up to the picker picking, which encapsulates those two events you want described. To echo Easwar's point, RPC comes after Dialing, so you can't pass an RPC context into Dial as described.

@zasweq
Copy link
Contributor

zasweq commented Sep 6, 2023

Also, taking the timestamps of the logs of dns resolver would help here.

@github-actions
Copy link

This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@vtolstov
Copy link
Author

thank you for help

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 28, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

4 participants