
major CPU performance regression #6151

Closed
lts-rad opened this issue Jun 8, 2023 · 15 comments

@lts-rad commented Jun 8, 2023

Running CoreDNS from master, we've observed a major CPU usage regression:

      flat  flat%   sum%        cum   cum%
     7.32s 23.42% 23.42%      7.78s 24.90%  runtime.(*itabTableType).find
     3.06s  9.79% 33.22%      5.05s 16.16%  runtime.mallocgc
     2.70s  8.64% 41.86%     10.48s 33.54%  runtime.getitab
     1.89s  6.05% 47.90%      1.89s  6.05%  sync.(*Mutex).Unlock
     1.62s  5.18% 53.09%      7.35s 23.52%  crypto/tls.(*Conn).Read
     1.53s  4.90% 57.98%      1.53s  4.90%  sync.(*Mutex).Lock (inline)
     1.46s  4.67% 62.66%      8.81s 28.19%  io.ReadAtLeast
     1.27s  4.06% 66.72%     25.70s 82.24%  github.com/miekg/dns.(*Conn).ReadMsgHeader
     1.26s  4.03% 70.75%      1.72s  5.50%  crypto/tls.(*Conn).handshakeContext
     1.23s  3.94% 74.69%      6.47s 20.70%  runtime.assertI2I2
     0.65s  2.08% 76.77%     12.82s 41.02%  encoding/binary.Read
     0.56s  1.79% 78.56%      0.56s  1.79%  runtime.getMCache (inline)
     0.49s  1.57% 80.13%      2.87s  9.18%  runtime.makeslice
     0.46s  1.47% 81.60%      0.46s  1.47%  runtime.itabHashFunc (inline)
     0.45s  1.44% 83.04%      0.45s  1.44%  sync/atomic.(*Bool).Load (inline)
     0.44s  1.41% 84.45%      0.44s  1.41%  runtime/internal/atomic.(*Uint8).Load
     0.41s  1.31% 85.76%      5.65s 18.08%  runtime.convI2I
     0.38s  1.22% 86.98%     29.88s 95.62%  github.com/coredns/coredns/plugin/pkg/proxy.(*Proxy).Connect
     0.38s  1.22% 88.19%      0.38s  1.22%  runtime.deductAssistCredit
     0.34s  1.09% 89.28%      0.34s  1.09%  encoding/binary.intDataSize
     0.29s  0.93% 90.21%      0.29s  0.93%  runtime.acquirem (inline)
     0.27s  0.86% 91.07%      0.30s  0.96%  runtime.nextFreeFast (inline)
     0.24s  0.77% 91.84%     25.94s 83.01%  github.com/miekg/dns.(*Conn).ReadMsg
     0.23s  0.74% 92.58%      0.23s  0.74%  crypto/tls.(*Conn).readRecordOrCCS
     0.23s  0.74% 93.31%      0.23s  0.74%  runtime.releasem (inline)

Perhaps this is related to the proxy refactor: #5951

@lts-rad added the bug label Jun 8, 2023
@chrisohaver (Member) commented Jun 8, 2023

Perhaps this is related to the proxy refactor: #5951

Can you verify? If that's the case, we can revert the change.

@lts-rad (Author) commented Jun 8, 2023

I have not been able to verify the cause yet. The issue does not reproduce quickly or immediately under load; it happens during real-world usage, and I don't know if something in particular triggers it. I am testing with Cloudflare as my upstream DNS forward destination, using DNS over HTTPS.

@lts-rad (Author) commented Jun 8, 2023

What's frustrating is that it's unclear what creates the state where so much CPU time is spent in github.com/miekg/dns.(*Conn).ReadMsgHeader, even while not many queries are actively being sent in.

It may also be related to recent changes in https://github.com/miekg/dns/blame/master/client.go

either miekg/dns#1454 or miekg/dns#1430

Going back to v1.10.0, the issue hasn't happened yet.

@chrisohaver (Member)

I just looked at #5951 again, and it seems fairly innocuous.

@lts-rad (Author) commented Jun 8, 2023

It's worth pointing out that the regression is not accompanied by a corresponding memory leak or memory consumption issue once CoreDNS reaches the bad state. I am still trying to find a reliable trigger and haven't identified one yet.

In the Go issue tracker there's an issue related to getitab and friends, which is where the main hit is happening:

A conversion that involves runtime.getitab (convI2I, assertI2I, assertE2I) is generally much slower than an ordinary
interface value construction from a statically known concrete type.

golang/go#51133

(pprof) peek runtime.convI2I
Showing nodes accounting for 31.25s, 100% of 31.25s total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context              
----------------------------------------------------------+-------------
                                             5.65s   100% |   github.com/miekg/dns.(*Conn).ReadMsgHeader
     0.41s  1.31%  1.31%      5.65s 18.08%                | runtime.convI2I
                                             5.24s 92.74% |   runtime.getitab
----------------------------------------------------------+-------------
(pprof) peek runtime.assertI2I2
Showing nodes accounting for 31.25s, 100% of 31.25s total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context              
----------------------------------------------------------+-------------
                                             3.55s 54.87% |   github.com/coredns/coredns/plugin/pkg/proxy.(*Proxy).Connect
                                             2.92s 45.13% |   github.com/miekg/dns.isPacketConn
     1.23s  3.94%  3.94%      6.47s 20.70%                | runtime.assertI2I2
                                             5.24s 80.99% |   runtime.getitab
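
To illustrate the pattern being described, here is a minimal sketch (not CoreDNS or miekg/dns code; the names are made up) of the two conversion forms that show up in the profile. Both go through runtime.getitab, which is what runtime.(*itabTableType).find is serving in the profile above:

package main

import (
	"fmt"
	"io"
	"net"
)

// Asserting one interface value against another interface type with the
// comma-ok form compiles to runtime.assertI2I2; the itab lookup happens
// inside runtime.getitab.
func isPacket(c net.Conn) bool {
	_, ok := c.(net.PacketConn)
	return ok
}

// Converting an interface value to a different interface type (net.Conn ->
// io.Reader here) compiles to runtime.convI2I and also goes through getitab.
func asReader(c net.Conn) io.Reader {
	return c
}

func main() {
	// net.Pipe gives us a net.Conn without any real network I/O.
	a, b := net.Pipe()
	defer a.Close()
	defer b.Close()

	fmt.Println("packet conn:", isPacket(a)) // false: a pipe is stream-oriented
	_ = asReader(a)                          // interface-to-interface conversion here
}

As golang/go#51133 quoted above explains, these dynamic conversions are noticeably slower than constructing an interface value from a concrete type the compiler can see statically.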

@lobshunter (Contributor)

Is there any way to reproduce the issue locally (running for several hours is acceptable)? It does sound like some kind of resource leak if it happens during real-world usage.

@cattyhouse (Contributor)

Same here. After a few hours of normal usage, CoreDNS v1.11.0 consumes all the CPU cores: CPU idle drops to 0%, the temperature goes to 75°C, and the fan spins like crazy.

@johnbelamaric (Member)

Have you tried rebuilding with an older version of Go?

@chrisohaver (Member)

A possible culprit is #6014

It keeps requests open longer when we get a non-network error reading the response, to protect against malformed-response DoS/cache-poisoning attacks. But it could be that this triggers in some common case, keeping connections open longer and swamping CoreDNS with too many concurrent connections.

@chrisohaver (Member)

If the issue is not reproducible with force_tcp added to the forward plugin, then #6014 is probably the cause.
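
For anyone trying that experiment, a minimal Corefile sketch of what the test configuration could look like, assuming a plain (non-TLS) upstream; the address is only a placeholder:

.:53 {
    forward . 1.1.1.1 {
        # send queries to the upstream over TCP even when the client used UDP
        force_tcp
    }
}

If the CPU regression no longer appears with force_tcp in place, that would point at the UDP response-read path touched by #6014.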

@chrisohaver (Member)

Never mind - it looks like #6255 sees the issue with TLS forwarding, which would exclude #6014 as the cause.

@cattyhouse (Contributor)

Have you tried rebuilding with an older version of Go?

It was built with Go 1.20.7.

#6255 is almost the same as the issue I had with v1.11.0; I use TLS forwarding as well.

@chrisohaver (Member) commented Aug 10, 2023

Looks like the OP deployed with TLS too, given the TLS handshake in the profiling data.
That said, DoT forwarding is fairly common, so it could be a red herring.

miekg/dns#1430 did change how TLS is dialed, though.

@gcs278 (Contributor) commented Aug 14, 2023

I have provided more information about a potential reproducer and root cause in a comment on #6255.

@lts-rad (Author) commented Aug 14, 2023

Nice one @gcs278, this certainly matches what we ran into.
