
Headscale stops accepting connections after ~500 nodes (likely 512) (0.23-alpha2) #1656

Open · jwischka opened this issue Dec 16, 2023 · 23 comments
Labels: bug (Something isn't working) · Milestone: v0.23.0

@jwischka

jwischka commented Dec 16, 2023

Bug description

I have a large headscale instance (~550 nodes). 0.22.3 has extreme CPU usage (i.e., 30 cores), but is able to handle all clients. 0.23-alpha2 has substantially better CPU performance, but stops accepting connections after about 500 nodes. CLI shows "context exceeded" for all queries (e.g. "headscale nodes/users/routes list") and new clients are unable to join.

CPU usage after connections stop is relatively modest (<50%), and connected clients appear to be able to access each other (e.g. ping/login) as expected.

Environment

Headscale: 0.23-alpha2
Tailscale: various, but mostly 1.54+, mostly Linux, some Mac and Windows.
OS: Headscale installed in privileged Proxmox LXC container (Ubuntu 20.04.6), reverse proxied behind nginx per official docs
Kernel: 6.2.16
Resources: 36 cores, 8GB RAM, 4GB swap (ram usage is quite small)
DB: Using postgres backend (sqlite would die after about 100 clients on 0.22.X with similar symptoms)
Config: ACLs are in use based on users. Two users have access to all nodes. Most nodes have access only to a relatively small set of other nodes via ACL. Specifically, there are 7 nodes that have access to everything, but on almost all other nodes "tailscale status" will return only 8 devices.

  • Headscale is behind a (reverse) proxy
  • Headscale runs in a container

To Reproduce

Likely difficult to reproduce, but essentially: have a large number of clients (500+) connect in.

Given that the behavior is a direct change from 0.22.3 -> 0.23-alpha2, I don't think the container or reverse proxy have anything to do with it. With some effort I can bypass the reverse proxy and do a direct port-forward from a firewall instead, but running on bare metal would be substantially more difficult.

Because I'm in a container, I can easily snapshot/test/revert possible fixes.

@jwischka jwischka added the bug Something isn't working label Dec 16, 2023
@kradalby kradalby added this to the v0.23.0 milestone Dec 17, 2023
@jwischka
Author

Update to this: I have two headscale instances behind my nginx proxy, and after splitting out the connection status per instance, it looks like connections cut off at 512 nodes. I'm not sure if that's helpful, but since it's a magic number of sorts it might be.

@jwischka jwischka changed the title Headscale stops accepting connections after ~500 nodes (0.23-alpha2) Headscale stops accepting connections after ~500 nodes (likely 512) (0.23-alpha2) Dec 19, 2023
@TotoTheDragon
Contributor

@jwischka Have you made sure you are not hitting your file descriptor limit?
If ulimit -n is still 1024 (suspiciously close to double the ~512 nodes where your issues start to arise),
try raising it with ulimit -n 4096 and see if that fixes the issue.
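
A minimal sketch of checking and raising the limit, assuming headscale runs under systemd as headscale.service (as the logs later in this thread suggest); the override path and LimitNOFILE directive are standard systemd, not headscale-specific:

# Check the limit the running headscale process actually has
grep "Max open files" /proc/$(pidof headscale)/limits

# Raise it for the current shell only (does not affect the service)
ulimit -n 4096

# Make it persistent for the systemd unit: systemctl edit opens an override file
sudo systemctl edit headscale
#   [Service]
#   LimitNOFILE=4096
sudo systemctl daemon-reload && sudo systemctl restart headscale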

@jwischka
Author

jwischka commented Feb 5, 2024

@TotoTheDragon Negative, unfortunately. Just tried raising the ulimit to 16384 on alpha3; the same issue persists. Almost immediately I get:

root@headscale:/home/user# headscale nodes list
2024-02-05T21:00:18Z TRC DNS configuration loaded dns_config={"Nameservers":["1.1.1.1"],"Proxied":true,"Resolvers":[{"Addr":"1.1.1.1"}]}
Cannot get nodes: context deadline exceeded

@TotoTheDragon
Contributor

@jwischka would you be able to test this with the current version of main?

@jwischka
Author

@TotoTheDragon forgive my ignorance but is there a snapshot build available? I don't have a build env set up, and won't be able to create one on short notice.

@kradalby
Collaborator

@jwischka just released a new alpha4, please give that a go https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha4 and report back.

@jwischka
Author

@kradalby

After updating and fixing the database config section, I'm getting the following error:

FTL Migration failed: LastInsertId is not supported by this driver error="LastInsertId is not supported by this driver"

(postgres 12)

@kradalby
Collaborator

Yep, on it. It was a regression in an upstream dependency, see #1755.

@kradalby
Collaborator

Could you please test if this is still the case with https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha5 ?

The postgres issues should now be resolved.

@jwischka
Author

@kradalby I think this may be resolved. I had some issues updating the config file, but rebuilt. Memory usage is way up, but overall processor usage is substantially lower than 0.22.2. It takes a while for a client to connect in, but it does appear that clients are reliably able to connect even when there are a lot of them. I'll monitor and report back.

@kradalby
Collaborator

Thank you. Could you elaborate on how much "way up" is in terms of memory? I would expect it to be higher, since we now keep more state in memory, but it would still be nice to compare numbers.

@jwischka
Author

@kradalby I ran for months at roughly 700MB of the 4GB in my container instance; it would vary between 500-900MB. After installing alpha5 and bumping the container to 8GB, it's now using all 8GB. It's actually consuming so much memory I can't log in. Disk IO also increased precipitously, probably (possibly) from swapping?

Initial usage doesn't appear to be that bad, but something blows up at some point. It goes from using about 200MB to about 2.4GB after 2-3 minutes. It jumps another 500MB or so a couple of minutes later. Periodically I get massive CPU spikes (2500%) with an accompanying 6GB or so usage. Every time this happens it seems to grow another 500MB or so.

As mentioned in another thread, I'm getting a ton of errors like the following in the logs:

Feb 22 00:53:04 headscale-server-name headscale[407]: 2024-02-22T00:53:04Z ERR update not sent, context cancelled error="context deadline exceeded" hostname=clientXXXX mkey=mkey:c691d490de423e2daccdd980f217e827b1431788c388ed0baf2c1c0c40413637 origin=poll-nodeupdate-onlinestatus
Feb 22 00:53:05 headscale-server-name headscale[407]: 2024-02-22T00:53:05Z ERR update not sent, context cancelled error="context deadline exceeded" hostname=clientYYYY mkey=mkey:ff1cf9ccc345d336d3c41b47f060bfeafb38ec8a19fe0f5f97243f17c01ea77a origin=poll-nodeupdate-peers-patch

Also a lot of:

Feb 22 00:54:18 headscale-server-name headscale[407]: 2024-02-22T00:54:18Z ERR ../../../home/runner/work/headscale/headscale/hscontrol/poll.go:56 > Could not write the map response error="client disconnected" node=dfa205016 node_key=[n+gDM] omitPeers=false readOnly=false stream=true
Feb 22 00:54:18 headscale-server-name headscale[407]: 2024-02-22T00:54:18Z ERR ../../../home/runner/work/headscale/headscale/hscontrol/poll.go:56 > Could not write the map response error="client disconnected" node=dfa215180 node_key=[vIPnF] omitPeers=false readOnly=false stream=true

I may have to revert if this continues, since this is a semi-production machine.

@jwischka
Author

@kradalby Further update: it looks like once the memory runs out, there are a lot of other connection issues; I go from ~500 nodes connected to between 100 and 250. Let me know if there's something else I can do to help debug this for you.

@kradalby
Collaborator

kradalby commented Mar 4, 2024

@jwischka I've started some experimental work in #1791, which should both improve performance and add some tunables for high-traffic usage, but it isn't done yet, so I would not recommend trying it in a prod env. If you have a similar non-prod env, you can.

@jwischka
Author

jwischka commented Mar 4, 2024

@kradalby Sounds good; let me know when you think it's semi-ready and I can give it a go. I've got the ability to snapshot the container and roll back, but obviously I don't want to do that a ton if there's a chance of breaking things.

@ananthb

ananthb commented Apr 10, 2024

I'm looking to run a large cluster in about a year or so, and I can contribute dev time to this effort. Anything I can help with?

@kradalby
Collaborator

Could you please try the newest alpha (https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha6) and report back?

@ananthb

ananthb commented Apr 17, 2024

I can't find release binaries for alpha6 so I'm running alpha7. I'm already seeing reduced memory usage a couple of hours in. Anything in particular I should look for?

I see new log lines talking about partial updates.

@kradalby
Collaborator

Sorry, alpha6 quickly got replaced with alpha7 because of an error. In principle, the main goal is that nodes stay connected over time, i.e. that none of them lose their connection. The main change for performance and resource usage is in how updates are batched and sent to the clients.

@jwischka
Author

@kradalby Is alpha 7 working with postgres? I'm getting errors.

Also, it's showing that Alpha 8 is out but not posted?

Thanks

@jwischka
Author

@kradalby I got alpha 8 to work.

At least so far things look stable, but it seems the issue (~500 nodes) still persists. When I try to join a client machine that requests the entire tailnet (i.e., a machine that should be able to connect to all 500 nodes), I get the following errors after things have settled down:

Apr 17 14:59:11 headscale headscale[21834]: 2024-04-17T14:59:11Z INF ../../../home/runner/work/headscale/headscale/hscontrol/poll.go:719 > a node sending a MapRequest with Noise protocol node=x13s node.id=582 omitPeers=false readOnly=false stream=true
Apr 17 14:59:11 headscale headscale[21834]: 2024-04-17T14:59:11Z INF ../../../home/runner/work/headscale/headscale/hscontrol/poll.go:719 > aquiring lock to check stream node=x13s node.id=582 omitPeers=false readOnly=false stream=true
Apr 17 14:59:14 headscale headscale[21834]: 2024-04-17T14:59:14Z INF ../../../home/runner/work/headscale/headscale/hscontrol/poll.go:719 > a node sending a MapRequest with Noise protocol node=x13s node.id=582 omitPeers=false readOnly=false stream=true
Apr 17 14:59:14 headscale headscale[21834]: 2024-04-17T14:59:14Z INF ../../../home/runner/work/headscale/headscale/hscontrol/poll.go:719 > aquiring lock to check stream node=x13s node.id=582 omitPeers=false readOnly=false stream=true

If I kill the process and the node is one of the first to try to join the tailnet, it joins quite quickly. You may understand these errors better than I do, but it looks to me like the client is timing out and requesting things again. These errors appear every few seconds, and I'm getting a lot of them (thousands of messages) in the log files.

So it looks like memory usage, processor usage, etc are better, but the connection issues may still remain. It looks like I've leveled out at 538 connections. I can try to add a few different clients here in a bit and see if it goes up, or how long it takes to connect clients. Is there anything I can supply you that would be helpful?

@kradalby
Collaborator

There are two new undocumented tuning options (please note they might change and/or be removed) which could help us figure out the holdup:

https://github.com/juanfont/headscale/blob/main/hscontrol/types/config.go#L235-L236

The first is the maximum amount of time headscale waits for updates before bundling them and sending them to the node.

The second is how many pending updates a node can buffer before the sender blocks.

Can you experiment with these numbers?

I would not set the wait higher than 1s; a lower value will flush to the client more often. I would expect a lower wait to mean lower memory but higher CPU.

For the batcher, you might see a tiny increase in memory, but I doubt it, as it's mostly just more ints allowed on a channel. You could try quite a few different large numbers and report back: 100, 300, 500, even 1000.
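
For reference, a minimal config.yaml sketch of how these two options might be set; node_mapsession_buffered_chan_size is named later in this thread, while the tuning: section and the batch_change_delay key are assumptions that should be verified against the hscontrol/types/config.go lines linked above:

# Assumed config.yaml excerpt; these keys are undocumented, verify names against config.go.
tuning:
  batch_change_delay: 800ms                # assumed key: max wait before flushing batched updates to a node
  node_mapsession_buffered_chan_size: 30   # per-node update buffer size (name taken from a later comment)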

It would help if you could try different combinations of these and put your observations in a table; it's hard for me to simulate environments like yours.

Thanks.

@jwischka
Author

Some updates -

As soon as I start to increase node_mapsession_buffered_chan_size, I immediately get a lot of crashes. Even a value of 35 will result in reliable crashes. The error message is longer than my console log, but it ends with:

Apr 17 19:53:27 headscale headscale[3521]: database/sql.(*Rows).initContextClose.gowrap1()
Apr 17 19:53:27 headscale headscale[3521]: #011/nix/store/mzg3cka0bbr5jq96ysymwziw74fnk22m-go-1.22.1/share/go/src/database/sql/sql.go:2977
Apr 17 19:53:27 headscale headscale[3521]: runtime.goexit({})
Apr 17 19:53:27 headscale headscale[3521]: #011/nix/store/mzg3cka0bbr5jq96ysymwziw74fnk22m-go-1.22.1/share/go/src/runtime/asm_amd64.s:1695 +0x1
Apr 17 19:53:27 headscale headscale[3521]: created by database/sql.(*Rows).initContextClose in goroutine 4029
Apr 17 19:53:27 headscale headscale[3521]: #011/nix/store/mzg3cka0bbr5jq96ysymwziw74fnk22m-go-1.22.1/share/go/src/database/sql/sql.go:2977 +0x150
Apr 17 19:53:27 headscale headscale[3521]: goroutine 15485 [chan receive]:
Apr 17 19:53:27 headscale headscale[3521]: database/sql.(*Tx).awaitDone(0xc004717a00)
Apr 17 19:53:27 headscale headscale[3521]: #011/nix/store/mzg3cka0bbr5jq96ysymwziw74fnk22m-go-1.22.1/share/go/src/database/sql/sql.go:2202 +0x2b
Apr 17 19:53:27 headscale headscale[3521]: created by database/sql.(*DB).beginDC in goroutine 4029
Apr 17 19:53:27 headscale headscale[3521]: #011/nix/store/mzg3cka0bbr5jq96ysymwziw74fnk22m-go-1.22.1/share/go/src/database/sql/sql.go:1915 +0x20d
Apr 17 19:53:27 headscale headscale[3521]: goroutine 15486 [runnable]:
Apr 17 19:53:27 headscale headscale[3521]: database/sql.(*Rows).initContextClose.gowrap1()
Apr 17 19:53:27 headscale headscale[3521]: #011/nix/store/mzg3cka0bbr5jq96ysymwziw74fnk22m-go-1.22.1/share/go/src/database/sql/sql.go:2977
Apr 17 19:53:27 headscale headscale[3521]: runtime.goexit({})
Apr 17 19:53:27 headscale headscale[3521]: #011/nix/store/mzg3cka0bbr5jq96ysymwziw74fnk22m-go-1.22.1/share/go/src/runtime/asm_amd64.s:1695 +0x1
Apr 17 19:53:27 headscale headscale[3521]: created by database/sql.(*Rows).initContextClose in goroutine 3521
Apr 17 19:53:27 headscale headscale[3521]: #011/nix/store/mzg3cka0bbr5jq96ysymwziw74fnk22m-go-1.22.1/share/go/src/database/sql/sql.go:2977 +0x150
Apr 17 19:53:27 headscale systemd[1]: headscale.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Apr 17 19:53:27 headscale systemd[1]: headscale.service: Failed with result 'exit-code'.
Apr 17 19:53:27 headscale systemd[1]: headscale.service: Consumed 13.084s CPU time.

Setting the value back down to 30 seems to calm things down.

Interestingly, if I increase postgres max_open_conns I can get similar errors, and I can also generate HTTP-related errors:

Apr 17 19:58:12 headscale headscale[6319]: #011/home/runner/go/pkg/mod/tailscale.com@v1.58.2/control/controlhttp/server.go:193 +0x5b
Apr 17 19:58:12 headscale headscale[6319]: tailscale.com/control/controlbase.(*Conn).readNLocked(0xc002e94d20, 0x3)
Apr 17 19:58:12 headscale headscale[6319]: #011/home/runner/go/pkg/mod/tailscale.com@v1.58.2/control/controlbase/conn.go:115 +0xe2
Apr 17 19:58:12 headscale headscale[6319]: tailscale.com/control/controlbase.(*Conn).decryptOneLocked(0xc002e94d20)
Apr 17 19:58:12 headscale headscale[6319]: #011/home/runner/go/pkg/mod/tailscale.com@v1.58.2/control/controlbase/conn.go:223 +0x1f4
Apr 17 19:58:12 headscale headscale[6319]: tailscale.com/control/controlbase.(*Conn).Read(0xc002e94d20, {0xc00278b0f8, 0x18, 0x458401?})
Apr 17 19:58:12 headscale headscale[6319]: #011/home/runner/go/pkg/mod/tailscale.com@v1.58.2/control/controlbase/conn.go:253 +0x117
Apr 17 19:58:12 headscale headscale[6319]: io.ReadAtLeast({0x22811c0, 0xc002e94d20}, {0xc00278b0f8, 0x18, 0x18}, 0x18)
Apr 17 19:58:12 headscale headscale[6319]: #011/nix/store/mzg3cka0bbr5jq96ysymwziw74fnk22m-go-1.22.1/share/go/src/io/io.go:335 +0x90
Apr 17 19:58:12 headscale headscale[6319]: io.ReadFull(...)
Apr 17 19:58:12 headscale headscale[6319]: #011/nix/store/mzg3cka0bbr5jq96ysymwziw74fnk22m-go-1.22.1/share/go/src/io/io.go:354
Apr 17 19:58:12 headscale headscale[6319]: golang.org/x/net/http2.(*serverConn).readPreface.func1()
Apr 17 19:58:12 headscale headscale[6319]: #011/home/runner/go/pkg/mod/golang.org/x/net@v0.22.0/http2/server.go:1055 +0x87
Apr 17 19:58:12 headscale headscale[6319]: created by golang.org/x/net/http2.(*serverConn).readPreface in goroutine 49462
Apr 17 19:58:12 headscale headscale[6319]: #011/home/runner/go/pkg/mod/golang.org/x/net@v0.22.0/http2/server.go:1052 +0xae
Apr 17 19:58:12 headscale systemd[1]: headscale.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Apr 17 19:58:12 headscale systemd[1]: headscale.service: Failed with result 'exit-code'.
Apr 17 19:58:12 headscale systemd[1]: headscale.service: Consumed 32.458s CPU time.

With the 0.23 alphas I've been leaving the node update check interval at 10s, but I bumped that to 30s and things seem to behave better (I had it at 35s on 0.22).
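
(Presumably this refers to the node_update_check_interval setting; a one-line config.yaml sketch under that assumption:

# Assumed key name for the "node update check interval" mentioned above; verify against config-example.yaml
node_update_check_interval: 30s
)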

I'm not really having any cpu/memory issues at the moment, but the real problem is that when I try to connect with one of my "main" users that ought to be able to see all of the other nodes (via ACL policy) it's never able to join. It just hangs for minutes without joining.

Also worth noting because it's probably relevant: ACLs seem to be broken on this instance (unless there was a change to the ACL behavior that I missed), and all of my 500+ nodes can see all of my other 500+ nodes, which is... well, kind of bad. Probably bad enough that I need to revert. I'm not exactly sure what the issue is, because I have another instance that I also updated to alpha8 and the ACLs work fine there. Both instances use a similar ACL structure, so it's somewhat odd that it works in one place and not the other.

Thoughts? I saved the config file this time, so it's pretty easy to try new stuff at this point.
