
headscale server stopped answering after a day of uptime despite listening on all the ports #1572

Closed
axxonadmin opened this issue Oct 11, 2023 · 37 comments
Comments

@axxonadmin

Went to a local shop and tried to connect to my remote headscale server (v0.23.0-alpha1) that I had gotten working several days ago.

The Tailscale client (1.48.2, Android) got stuck at "connecting".
Then I logged out and tried to log in again; it got stuck at that stage too, with no output or warning.
I got home and tried to connect from my desktop computer (1.51, Windows 11) with no success (it got stuck opening listen_addr and port in the browser to authenticate).

Then I connected to my server over SSH and ran 'headscale apikeys list' and 'headscale nodes list' with no success, so I restarted the headscale server with 'service headscale stop' / 'service headscale start' and everything started working just fine.

Is there anything I can do to help you investigate this issue if it happens again?

What I've found so far:

# headscale nodes list
2023-10-10T06:50:37Z TRC DNS configuration loaded dns_config={"ExitNodeFilteredSet":null,"Nameservers":["1.1.1.1"],"Proxied":true,"Resolvers":[{"Addr":"1.1.1.1"}]}
Cannot get nodes: context deadline exceeded

# headscale apikeys list
2023-10-10T06:57:51Z TRC DNS configuration loaded dns_config={"ExitNodeFilteredSet":null,"Nameservers":["1.1.1.1"],"Proxied":true,"Resolvers":[{"Addr":"1.1.1.1"}]}
Error getting the list of keys: rpc error: code = DeadlineExceeded desc = context deadline exceeded

tcpdump on port 8080 (which is my 'listen_addr') showed something like this (the rough capture command is noted after the dump):

listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
06:56:01.067801 IP server.http-alt > client.58443: Flags [.], ack 3966225333, win 501, length 0
06:56:01.128148 IP client.58443 > server.http-alt: Flags [.], ack 1, win 508, length 0
06:56:01.323836 IP server.http-alt > client.60098: Flags [.], ack 1688718613, win 501, options [nop,nop,TS val 3603882113 ecr 2274080242], length 0
06:56:01.388132 IP client.60098 > server.http-alt: Flags [.], ack 1, win 501, options [nop,nop,TS val 2274095346 ecr 3603867310], length 0
06:56:03.627793 IP server.http-alt > client.60094: Flags [.], ack 130672375, win 6146, options [nop,nop,TS val 3603884417 ecr 2274082546], length 0
06:56:03.688819 IP client.60094 > server.http-alt: Flags [.], ack 1, win 569, options [nop,nop,TS val 2274097650 ecr 3603868977], length 0
06:56:05.597059 IP client.58442 > server.http-alt: Flags [F.], seq 2662313159, ack 1270638799, win 511, length 0
06:56:05.597098 IP server.http-alt > client.58442: Flags [.], ack 1, win 501, length 0
06:56:05.597637 IP client.58442 > server.http-alt: Flags [R.], seq 1, ack 1, win 0, length 0
06:56:13.355788 IP server.http-alt > client.53110: Flags [.], ack 3672297188, win 501, length 0
06:56:13.419916 IP client.53110 > server.http-alt: Flags [.], ack 1, win 511, length 0
06:56:15.807181 IP server.http-alt > client.60098: Flags [P.], seq 1:133, ack 1, win 501, options [nop,nop,TS val 3603896596 ecr 2274095346], length 132: HTTP
06:56:15.872518 IP client.60098 > server.http-alt: Flags [.], ack 133, win 501, options [nop,nop,TS val 2274109830 ecr 3603896596], length 0
06:56:16.171786 IP server.http-alt > client.58443: Flags [.], ack 1, win 501, length 0
06:56:16.230261 IP client.58443 > server.http-alt: Flags [.], ack 1, win 508, length 0
06:56:18.507858 IP client.60094 > server.http-alt: Flags [.], ack 1, win 569, options [nop,nop,TS val 2274112469 ecr 3603868977], length 0
06:56:18.507892 IP server.http-alt > client.60094: Flags [.], ack 1, win 6146, options [nop,nop,TS val 3603899297 ecr 2274097650], length 0
06:56:18.731791 IP server.http-alt > client.60094: Flags [.], ack 1, win 6146, options [nop,nop,TS val 3603899521 ecr 2274097650], length 0
06:56:18.792686 IP client.60094 > server.http-alt: Flags [.], ack 1, win 569, options [nop,nop,TS val 2274112754 ecr 3603899297], length 0
06:56:21.055498 IP server.http-alt > client.60094: Flags [P.], seq 1:28, ack 1, win 6146, options [nop,nop,TS val 3603901845 ecr 2274112754], length 27: HTTP
06:56:21.116522 IP client.60094 > server.http-alt: Flags [.], ack 28, win 569, options [nop,nop,TS val 2274115077 ecr 3603901845], length 0
06:56:28.216581 IP client.53110 > server.http-alt: Flags [.], seq 0:1, ack 1, win 511, length 1: HTTP
06:56:28.216616 IP server.http-alt > client.53110: Flags [.], ack 1, win 501, options [nop,nop,sack 1 {0:1}], length 0
06:56:31.007091 IP client.58443 > server.http-alt: Flags [.], seq 0:1, ack 1, win 508, length 1: HTTP
06:56:31.007111 IP server.http-alt > client.58443: Flags [.], ack 1, win 501, options [nop,nop,sack 1 {0:1}], length 0
06:56:31.019802 IP server.http-alt > client.60098: Flags [.], ack 1, win 501, options [nop,nop,TS val 3603911809 ecr 2274109830], length 0
06:56:31.084272 IP client.60098 > server.http-alt: Flags [.], ack 133, win 501, options [nop,nop,TS val 2274125041 ecr 3603896596], length 0
06:56:36.139788 IP server.http-alt > client.60094: Flags [.], ack 1, win 6146, options [nop,nop,TS val 3603916929 ecr 2274115077], length 0
06:56:36.200912 IP client.60094 > server.http-alt: Flags [.], ack 28, win 569, options [nop,nop,TS val 2274130161 ecr 3603901845], length 0
06:56:43.307830 IP server.http-alt > client.53110: Flags [.], ack 1, win 501, length 0
06:56:43.374419 IP client.53110 > server.http-alt: Flags [.], ack 1, win 511, length 0
06:56:46.041372 IP client.60098 > server.http-alt: Flags [.], ack 133, win 501, options [nop,nop,TS val 2274139999 ecr 3603896596], length 0
06:56:46.041408 IP server.http-alt > client.60098: Flags [.], ack 1, win 501, options [nop,nop,TS val 3603926831 ecr 2274125041], length 0
06:56:46.123788 IP server.http-alt > client.58443: Flags [.], ack 1, win 501, length 0
06:56:46.123809 IP server.http-alt > client.60098: Flags [.], ack 1, win 501, options [nop,nop,TS val 3603926913 ecr 2274125041], length 0
06:56:46.182664 IP client.58443 > server.http-alt: Flags [.], ack 1, win 508, length 0
06:56:46.188073 IP client.60098 > server.http-alt: Flags [.], ack 133, win 501, options [nop,nop,TS val 2274140145 ecr 3603926831], length 0
06:56:51.147862 IP client.60094 > server.http-alt: Flags [.], ack 28, win 569, options [nop,nop,TS val 2274145109 ecr 3603901845], length 0
06:56:51.147895 IP server.http-alt > client.60094: Flags [.], ack 1, win 6146, options [nop,nop,TS val 3603931937 ecr 2274130161], length 0
06:56:51.243776 IP server.http-alt > client.60094: Flags [.], ack 1, win 6146, options [nop,nop,TS val 3603932033 ecr 2274130161], length 0
06:56:51.304654 IP client.60094 > server.http-alt: Flags [.], ack 28, win 569, options [nop,nop,TS val 2274145265 ecr 3603931937], length 0
06:56:58.296500 IP client.53110 > server.http-alt: Flags [.], seq 0:1, ack 1, win 511, length 1: HTTP
06:56:58.296535 IP server.http-alt > client.53110: Flags [.], ack 1, win 501, options [nop,nop,sack 1 {0:1}], length 0
06:57:01.227791 IP server.http-alt > client.60098: Flags [.], ack 1, win 501, options [nop,nop,TS val 3603942017 ecr 2274140145], length 0
06:57:01.227818 IP server.http-alt > client.58443: Flags [.], ack 1, win 501, length 0
06:57:01.286844 IP client.58443 > server.http-alt: Flags [.], ack 1, win 508, length 0
06:57:01.292055 IP client.60098 > server.http-alt: Flags [.], ack 133, win 501, options [nop,nop,TS val 2274155249 ecr 3603926831], length 0
06:57:01.707945 IP 46.174.191.28.38364 > server.http-alt: Flags [S], seq 3502016974, win 8192, options [mss 1460,nop,nop,sackOK], length 0
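
For reference, the capture above could have been produced with a command along these lines (the interface name is taken from the "listening on eth0" header; the exact flags are an assumption):

tcpdump -i eth0 port 8080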

I tried to restart headscale with 'service headscale stop' and the logs looked like this:

Oct 10 06:58:28 server headscale[482461]: 2023-10-10T06:58:28Z INF Received signal to stop, shutting down gracefully signal=terminated
Oct 10 06:58:28 server headscale[482461]: 2023-10-10T06:58:28Z INF ../../../home/runner/work/headscale/headscale/hscontrol/poll.go:33 > The long-poll handler is shutting down node=MikroTik node_key=eeff3e25824571a2657da2127ee3773c7be59d358fcf6f89fc84075e6151db2e noise=true omitPeers=false readOnly=false stream=true
Oct 10 06:58:28 server systemd[1]: Stopping headscale coordination server for Tailscale...
Oct 10 06:59:58 server systemd[1]: headscale.service: State 'stop-sigterm' timed out. Killing.
Oct 10 06:59:58 server systemd[1]: headscale.service: Killing process 482461 (headscale) with signal SIGKILL.
Oct 10 06:59:58 server systemd[1]: headscale.service: Killing process 482462 (headscale) with signal SIGKILL.
Oct 10 06:59:58 server systemd[1]: headscale.service: Killing process 482463 (headscale) with signal SIGKILL.
Oct 10 06:59:58 server systemd[1]: headscale.service: Killing process 482465 (headscale) with signal SIGKILL.
Oct 10 06:59:58 server systemd[1]: headscale.service: Main process exited, code=killed, status=9/KILL
Oct 10 06:59:58 server systemd[1]: headscale.service: Failed with result 'timeout'.
Oct 10 06:59:58 server systemd[1]: Stopped headscale coordination server for Tailscale.
Oct 10 06:59:58 server systemd[1]: headscale.service: Consumed 56.911s CPU time.

Before that, the logs were full of:
Oct 9 20:30:15 server headscale[482461]: 2023-10-09T20:30:15Z INF ../../../home/runner/work/headscale/headscale/hscontrol/poll.go:33 > Waiting for update on stream channel node=MikroTik node_key=removed_key_hash_data noise=true omitPeers=false readOnly=false stream=true
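
If it helps, next time this happens I can try to grab a goroutine dump of the wedged process before restarting it. A minimal sketch of what I have in mind, assuming headscale does not trap SIGQUIT and that it runs under systemd with logs going to journald (paths are just placeholders):

# Run while headscale is hung, before restarting it.
PID="$(pidof headscale)"
# Socket state at the time of the hang may be useful too.
ss -tnp > /tmp/headscale-sockets.txt
# The Go runtime prints a full goroutine stack dump on SIGQUIT (and then exits),
# so this both captures the stacks and terminates the stuck process.
kill -QUIT "$PID"
sleep 2
# The dump ends up in the service's journal.
journalctl -u headscale --since "10 minutes ago" > /tmp/headscale-hang.log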

@axxonadmin added the bug label Oct 11, 2023
@rjmalagon

+1. After ~20 minutes, headscale gets stuck with timeouts while updating nodes.

@rjmalagon

After carefully checking my headscale VM's stats, I found that it is very RAM sensitive.
I serve ~200+ nodes on a 2 GB RAM / 4 CPU VM.
Sometimes headscale has a very brief RAM peak, but it usually stays below 1 GB of RAM usage.
I just added a little swapfile and it worked well. I am now experimenting with zram swap compression (it seems to work as well).
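
For anyone wanting to reproduce the workaround, a minimal sketch of adding a small swapfile (the 1 GB size is arbitrary; adjust it to your VM):

# Create and enable a swapfile as headroom for the brief RAM peaks.
fallocate -l 1G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
# Keep it enabled across reboots.
echo '/swapfile none swap sw 0 0' >> /etc/fstab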

@rjmalagon

I was wrong about this; headscale sometimes gets stuck even with enough memory and CPU.
Usually it gets stuck on multiple "Waiting for update on stream channel" messages.
I set up a Caddy reverse proxy to get the connection logs, and I see many 502s on the tailscale-control-protocol upgrade when headscale stops responding.

@Nickiel12

+1. Running on a NixOS server with 24 GB of RAM (so RAM isn't the issue). Headscale randomly (from what I can tell so far) gets stuck using 3% of the CPU, takes 5 minutes for systemctl to restart it, and doesn't allow new connections.

@kradalby
Collaborator

I think this might be fixed by the tip of #1564; could you test?

@Nickiel12

Nickiel12 commented Dec 1, 2023

@kradalby I would like to test the change that you have made; however, I'm having an awful time trying to use the headscale flake. I've been using the nixpkgs version, and I don't quite have my head wrapped around how flakes and Nix all work. I saw that you use the headscale flake in your configuration, and I was wondering how your configuration uses the flake version of headscale instead of the nixpkgs version. I'm assuming it has to do with the overlay, but I tried putting overlay = [ headscale.overlay ] in my flake.nix (after adding it as an input) and it did not change what version was installed.

If I can figure out how to switch to the flake version, I will be happy to test this change.

Update: I wasn't passing the overlaid pkgs to the server host. I'm switching to the patched version of the headscale server and will let you know how it works in a few days.

@Nickiel12

I'm not sure if this is the right thread for this, but while watching systemctl status headscale during systemctl restart headscale, I noticed that it appears to close the DB (SQLite for my instance) before shutting down the other threads properly, and it took almost two minutes for headscale to restart. I don't know if this is related, but if the SQLite database can disconnect before the rest of the application shuts down, it might be causing this issue.

@kradalby added this to the v0.23.0 milestone Dec 10, 2023
@kradalby
Collaborator

0.23.0-alpha2 addresses a series of issues with node synchronisation, online status, and subnet routers. Please test this release and report back if the issue still persists.

@Nickiel12

@kradalby The issue of the headscale server not responding has gone away. I've noticed a weird issue with the app where I need to log out, change the server and save it, then log back in before I can connect. But I have not had to restart the headscale server at all since I switched to the alpha release.

@Nickiel12

Well, scratch that: now I keep getting prompted to re-register my phone when I try to connect to the server. But I'm not getting the same issue as before, where I couldn't log on at all and the headscale service would hang.

@rjmalagon

0.23.0-alpha2 addresses a series of issues with node synchronisation, online status, and subnet routers. Please test this release and report back if the issue still persists.

Hi, the stability and responsiveness are much better. Alpha 1 easily got stuck under some stress (300+ nodes) in just minutes. This new alpha is robust enough to handle the same workload for 20+ hours without issues. Thanks @kradalby for the progress and for following up on this issue.

I can share any info if needed to help with the development of v0.23.

@jwischka

I'm experiencing something similar on 0.23-alpha2 - I haven't had time to debug yet, but I've had to restart headscale twice on a moderately sized install (~30 nodes) in the last two days.

@Nickiel12

I deleted my phone node, re-registered it, and I have not had any issues connecting in two days since!

@cfouche3005

Same issue on 0.23-alpha2 after one hour. I will check whether the issue reappears.

@cfouche3005

I can confirm the issue is still present: after one hour or so, there is no response from the gRPC or HTTP API, but headscale still works (I haven't tested whether I can register new nodes).

@cfouche3005

And I can also confirm that new or disconnected devices cannot join/reconnect to the tailnet.

@cfouche3005

I have to add that this bug is very inconsistent: today it did not appear, but yesterday it was there.

@dustinblackman

dustinblackman commented Feb 7, 2024

I'm finding myself in a similar situation where Headscale randomly locks up, running the master branch from this commit with Postgres. I haven't been able to find exactly where the issue happens while debugging. I'm hoping something like #1701 may solve it.

In the meantime, I've put together a small/lazy systemd service that acts as a health check, restarting Headscale if it locks up. Going by yesterday's data, this fired 9 times with 30 nodes in a for-fun containerized testing environment.

headscale-health

#!/usr/bin/env bash

set -e

while true; do
	echo "Checking headscale health"
	timeout 5 headscale nodes list >/dev/null ||
		(echo "Failed to get nodes list, rebooting headscale" && timeout 3 systemctl restart headscale) ||
		(echo "Failed to restart, killing process" && kill -9 "$(ps aux | grep 'headscale serve' | grep -v grep | awk '{print $2}')")

	sleep 10
done

headscale-health.service

[Unit]
After=syslog.target
After=network.target
Description=headscale health monitor

[Service]
Type=simple
User=root
Group=root
ExecStart=/usr/bin/headscale-health
Restart=always
RestartSec=1

[Install]
WantedBy=multi-user.target
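
To use it, roughly (assuming the script is saved so that it matches the unit's ExecStart path):

install -m 0755 headscale-health /usr/bin/headscale-health
install -m 0644 headscale-health.service /etc/systemd/system/headscale-health.service
systemctl daemon-reload
systemctl enable --now headscale-health.service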

@kradalby
Collaborator

kradalby commented Feb 7, 2024

I'm finding myself in a similar situation where Headscale randomly locks up, running the master branch from this commit with Postgres. I haven't been able to find exactly where the issue happens while debugging. I'm hoping something like #1701 may solve it.

@TotoTheDragon Could you have a look at this? I won't have time to get around to it for a bit more.

@TotoTheDragon
Contributor

@kradalby I have been looking into this issue for the past day and agree with Dustin that #1701 is a good candidate for solving this issue and #1656. This issue was present before the commit referenced.

I do not have the infrastructure to test with 30 or so nodes to recreate the issue, but we could make a build that includes the changes from #1701, and hopefully Dustin is able to test it to see whether it makes any difference.

@dustinblackman

dustinblackman commented Feb 7, 2024

I'm down! If #1701 is considered complete in its current state, I can ship it and see how it does.

@kradalby
Collaborator

kradalby commented Feb 7, 2024

I would say that it is complete, but complete as in the tip of main: not tested sufficiently to release as a version. But I take your currently running main to mean your risk appetite is fine with that.

@dustinblackman

dustinblackman commented Feb 7, 2024

your risk appetite is fine with that.

More or less; it's the fact that the codebase is easy to read, so at least I know what I'm bringing in off main. I'll give this a shot either this week or next :)

@TotoTheDragon
Contributor

@dustinblackman Would you be able to test with the current version of main?

@dustinblackman

dustinblackman commented Feb 12, 2024

@TotoTheDragon I've been running from 83769ba for the last four days. At first it looked like all was good, but looking at the logs I'm still seeing lockups, just fewer of them. I can try the latest master later in the week.

I also have a set of scripts for a local cluster that I wrote for #1725. I can look at PRing them if you think they'd be helpful for debugging this.

@TotoTheDragon
Contributor

@dustinblackman Seeing as headscale nodes list gets a context deadline exceeded, maybe we can add a bunch of traces within the command and see where it gets stuck. This will help rule some things out.

@kradalby
Collaborator

Could you give alpha4 a spin: https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha4

@dustinblackman

dustinblackman commented Feb 16, 2024

Could you give alpha4 a spin: https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha4

@kradalby I ran this for about two hours. I saw no reboots, but I experienced issues where some newly added ephemeral nodes were unable to communicate over the network (port 443 requests), even with tailscale ping showing a direct connection. I'm wondering whether nodes are not always being notified when a new node joins the network.

I'm going to try again in a localized environment and see if I can repro it.

@kradalby
Collaborator

Thank you @dustinblackman, that's helpful; it does sound like some updates are missing. There is a debug env flag you can turn on which will dump all the mapresponses sent. If you can repro, that would potentially be helpful info, but it produces a lot of data and might not be suitable if you have a lot of nodes.

You can play around with that by setting HEADSCALE_DEBUG_DUMP_MAPRESPONSE_PATH to somewhere on your system.
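
For a systemd-managed install, a drop-in along these lines should do it (the dump directory and unit name are assumptions; make sure the user headscale runs as can write to that directory):

mkdir -p /var/lib/headscale/mapresponses
mkdir -p /etc/systemd/system/headscale.service.d
cat > /etc/systemd/system/headscale.service.d/debug.conf <<'EOF'
[Service]
Environment=HEADSCALE_DEBUG_DUMP_MAPRESPONSE_PATH=/var/lib/headscale/mapresponses
EOF
systemctl daemon-reload
systemctl restart headscale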

@kradalby
Collaborator

Could you please test if this is still the case with https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha5 ?

@dustinblackman

@kradalby No reboots again, but after 30 minutes I get several lines such as the following. Couldn't prove they were actually causing issues. I'll test further.

{"level":"error","error":"context deadline exceeded","mkey":"mkey:0bdfa4de8f7c1f14d15fff04f39993c581c6ba131278ffsessdf5b0d","origin":"poll-nodeupdate-peers-patch","hostname":"mylaptop","time":1708543963,"message":"update not sent, context cancelled"}

@dustinblackman

I'm unable to repro this in a local cluster. :(

@kradalby
Collaborator

Could you please try the newest alpha (https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha6) and report back?

@kradalby
Collaborator

I think the latest alpha should have improved this a lot, can someone experiencing this give it a try?

@dustinblackman

I'll look to give this a spin this week if I can slot it in :)

@dustinblackman

Been running this for a little over a day with no issues! Amazing work, thank you! I appreciate all the effort.

@ohdearaugustin
Collaborator

Will close this as fixed.
