
headscale server stopped answering after a day of uptime despite listening on all the ports #1572

Closed
axxonadmin opened this issue Oct 11, 2023 · 37 comments
Comments

@axxonadmin

Went to a local shop and tried to connect to my remote headscale server (v0.23.0-alpha1) that I had gotten working several days ago.

The Tailscale client (1.48.2, Android) got stuck at "connecting".
Then I logged out and tried to log in again; it got stuck at that stage too, with no output or warning.
I got home and tried to connect from my desktop computer (1.51, Windows 11) with no success (it got stuck opening listen_addr and port in the browser to authenticate).

Then I connected to my server over SSH and ran 'headscale apikeys list' and 'headscale nodes list' with no success, so I restarted the headscale server with 'service headscale stop' / 'service headscale start' and everything started working just fine.

Is there anything I can do to help you investigate this issue if it happens again?

What I've found so far:

# headscale nodes list
2023-10-10T06:50:37Z TRC DNS configuration loaded dns_config={"ExitNodeFilteredSet":null,"Nameservers":["1.1.1.1"],"Proxied":true,"Resolvers":[{"Addr":"1.1.1.1"}]}
Cannot get nodes: context deadline exceeded

# headscale apikeys list
2023-10-10T06:57:51Z TRC DNS configuration loaded dns_config={"ExitNodeFilteredSet":null,"Nameservers":["1.1.1.1"],"Proxied":true,"Resolvers":[{"Addr":"1.1.1.1"}]}
Error getting the list of keys: rpc error: code = DeadlineExceeded desc = context deadline exceeded

tcpdump on port 8080 (which is my 'listen_addr') showed something like this (the rough capture command is noted after the dump):

listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
06:56:01.067801 IP server.http-alt > client.58443: Flags [.], ack 3966225333, win 501, length 0
06:56:01.128148 IP client.58443 > server.http-alt: Flags [.], ack 1, win 508, length 0
06:56:01.323836 IP server.http-alt > client.60098: Flags [.], ack 1688718613, win 501, options [nop,nop,TS val 3603882113 ecr 2274080242], length 0
06:56:01.388132 IP client.60098 > server.http-alt: Flags [.], ack 1, win 501, options [nop,nop,TS val 2274095346 ecr 3603867310], length 0
06:56:03.627793 IP server.http-alt > client.60094: Flags [.], ack 130672375, win 6146, options [nop,nop,TS val 3603884417 ecr 2274082546], length 0
06:56:03.688819 IP client.60094 > server.http-alt: Flags [.], ack 1, win 569, options [nop,nop,TS val 2274097650 ecr 3603868977], length 0
06:56:05.597059 IP client.58442 > server.http-alt: Flags [F.], seq 2662313159, ack 1270638799, win 511, length 0
06:56:05.597098 IP server.http-alt > client.58442: Flags [.], ack 1, win 501, length 0
06:56:05.597637 IP client.58442 > server.http-alt: Flags [R.], seq 1, ack 1, win 0, length 0
06:56:13.355788 IP server.http-alt > client.53110: Flags [.], ack 3672297188, win 501, length 0
06:56:13.419916 IP client.53110 > server.http-alt: Flags [.], ack 1, win 511, length 0
06:56:15.807181 IP server.http-alt > client.60098: Flags [P.], seq 1:133, ack 1, win 501, options [nop,nop,TS val 3603896596 ecr 2274095346], length 132: HTTP
06:56:15.872518 IP client.60098 > server.http-alt: Flags [.], ack 133, win 501, options [nop,nop,TS val 2274109830 ecr 3603896596], length 0
06:56:16.171786 IP server.http-alt > client.58443: Flags [.], ack 1, win 501, length 0
06:56:16.230261 IP client.58443 > server.http-alt: Flags [.], ack 1, win 508, length 0
06:56:18.507858 IP client.60094 > server.http-alt: Flags [.], ack 1, win 569, options [nop,nop,TS val 2274112469 ecr 3603868977], length 0
06:56:18.507892 IP server.http-alt > client.60094: Flags [.], ack 1, win 6146, options [nop,nop,TS val 3603899297 ecr 2274097650], length 0
06:56:18.731791 IP server.http-alt > client.60094: Flags [.], ack 1, win 6146, options [nop,nop,TS val 3603899521 ecr 2274097650], length 0
06:56:18.792686 IP client.60094 > server.http-alt: Flags [.], ack 1, win 569, options [nop,nop,TS val 2274112754 ecr 3603899297], length 0
06:56:21.055498 IP server.http-alt > client.60094: Flags [P.], seq 1:28, ack 1, win 6146, options [nop,nop,TS val 3603901845 ecr 2274112754], length 27: HTTP
06:56:21.116522 IP client.60094 > server.http-alt: Flags [.], ack 28, win 569, options [nop,nop,TS val 2274115077 ecr 3603901845], length 0
06:56:28.216581 IP client.53110 > server.http-alt: Flags [.], seq 0:1, ack 1, win 511, length 1: HTTP
06:56:28.216616 IP server.http-alt > client.53110: Flags [.], ack 1, win 501, options [nop,nop,sack 1 {0:1}], length 0
06:56:31.007091 IP client.58443 > server.http-alt: Flags [.], seq 0:1, ack 1, win 508, length 1: HTTP
06:56:31.007111 IP server.http-alt > client.58443: Flags [.], ack 1, win 501, options [nop,nop,sack 1 {0:1}], length 0
06:56:31.019802 IP server.http-alt > client.60098: Flags [.], ack 1, win 501, options [nop,nop,TS val 3603911809 ecr 2274109830], length 0
06:56:31.084272 IP client.60098 > server.http-alt: Flags [.], ack 133, win 501, options [nop,nop,TS val 2274125041 ecr 3603896596], length 0
06:56:36.139788 IP server.http-alt > client.60094: Flags [.], ack 1, win 6146, options [nop,nop,TS val 3603916929 ecr 2274115077], length 0
06:56:36.200912 IP client.60094 > server.http-alt: Flags [.], ack 28, win 569, options [nop,nop,TS val 2274130161 ecr 3603901845], length 0
06:56:43.307830 IP server.http-alt > client.53110: Flags [.], ack 1, win 501, length 0
06:56:43.374419 IP client.53110 > server.http-alt: Flags [.], ack 1, win 511, length 0
06:56:46.041372 IP client.60098 > server.http-alt: Flags [.], ack 133, win 501, options [nop,nop,TS val 2274139999 ecr 3603896596], length 0
06:56:46.041408 IP server.http-alt > client.60098: Flags [.], ack 1, win 501, options [nop,nop,TS val 3603926831 ecr 2274125041], length 0
06:56:46.123788 IP server.http-alt > client.58443: Flags [.], ack 1, win 501, length 0
06:56:46.123809 IP server.http-alt > client.60098: Flags [.], ack 1, win 501, options [nop,nop,TS val 3603926913 ecr 2274125041], length 0
06:56:46.182664 IP client.58443 > server.http-alt: Flags [.], ack 1, win 508, length 0
06:56:46.188073 IP client.60098 > server.http-alt: Flags [.], ack 133, win 501, options [nop,nop,TS val 2274140145 ecr 3603926831], length 0
06:56:51.147862 IP client.60094 > server.http-alt: Flags [.], ack 28, win 569, options [nop,nop,TS val 2274145109 ecr 3603901845], length 0
06:56:51.147895 IP server.http-alt > client.60094: Flags [.], ack 1, win 6146, options [nop,nop,TS val 3603931937 ecr 2274130161], length 0
06:56:51.243776 IP server.http-alt > client.60094: Flags [.], ack 1, win 6146, options [nop,nop,TS val 3603932033 ecr 2274130161], length 0
06:56:51.304654 IP client.60094 > server.http-alt: Flags [.], ack 28, win 569, options [nop,nop,TS val 2274145265 ecr 3603931937], length 0
06:56:58.296500 IP client.53110 > server.http-alt: Flags [.], seq 0:1, ack 1, win 511, length 1: HTTP
06:56:58.296535 IP server.http-alt > client.53110: Flags [.], ack 1, win 501, options [nop,nop,sack 1 {0:1}], length 0
06:57:01.227791 IP server.http-alt > client.60098: Flags [.], ack 1, win 501, options [nop,nop,TS val 3603942017 ecr 2274140145], length 0
06:57:01.227818 IP server.http-alt > client.58443: Flags [.], ack 1, win 501, length 0
06:57:01.286844 IP client.58443 > server.http-alt: Flags [.], ack 1, win 508, length 0
06:57:01.292055 IP client.60098 > server.http-alt: Flags [.], ack 133, win 501, options [nop,nop,TS val 2274155249 ecr 3603926831], length 0
06:57:01.707945 IP 46.174.191.28.38364 > server.http-alt: Flags [S], seq 3502016974, win 8192, options [mss 1460,nop,nop,sackOK], length 0
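
For reference, the capture above could have been produced with a command along these lines (the interface name is taken from the "listening on eth0" header; the exact flags are an assumption):

tcpdump -i eth0 port 8080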

I tried to restart headscale with 'service headscale stop' and the logs looked like this:

Oct 10 06:58:28 server headscale[482461]: 2023-10-10T06:58:28Z INF Received signal to stop, shutting down gracefully signal=terminated
Oct 10 06:58:28 server headscale[482461]: 2023-10-10T06:58:28Z INF ../../../home/runner/work/headscale/headscale/hscontrol/poll.go:33 > The long-poll handler is shutting down node=MikroTik node_key=eeff3e25824571a2657da2127ee3773c7be59d358fcf6f89fc84075e6151db2e noise=true omitPeers=false readOnly=false stream=true
Oct 10 06:58:28 server systemd[1]: Stopping headscale coordination server for Tailscale...
Oct 10 06:59:58 server systemd[1]: headscale.service: State 'stop-sigterm' timed out. Killing.
Oct 10 06:59:58 server systemd[1]: headscale.service: Killing process 482461 (headscale) with signal SIGKILL.
Oct 10 06:59:58 server systemd[1]: headscale.service: Killing process 482462 (headscale) with signal SIGKILL.
Oct 10 06:59:58 server systemd[1]: headscale.service: Killing process 482463 (headscale) with signal SIGKILL.
Oct 10 06:59:58 server systemd[1]: headscale.service: Killing process 482465 (headscale) with signal SIGKILL.
Oct 10 06:59:58 server systemd[1]: headscale.service: Main process exited, code=killed, status=9/KILL
Oct 10 06:59:58 server systemd[1]: headscale.service: Failed with result 'timeout'.
Oct 10 06:59:58 server systemd[1]: Stopped headscale coordination server for Tailscale.
Oct 10 06:59:58 server systemd[1]: headscale.service: Consumed 56.911s CPU time.

Before that, the logs were full of:
Oct 9 20:30:15 server headscale[482461]: 2023-10-09T20:30:15Z INF ../../../home/runner/work/headscale/headscale/hscontrol/poll.go:33 > Waiting for update on stream channel node=MikroTik node_key=removed_key_hash_data noise=true omitPeers=false readOnly=false stream=true
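
If it helps, next time this happens I can try to grab a goroutine dump of the wedged process before restarting it. A minimal sketch of what I have in mind, assuming headscale does not trap SIGQUIT and that it runs under systemd with logs going to journald (paths are just placeholders):

# Run while headscale is hung, before restarting it.
PID="$(pidof headscale)"
# Socket state at the time of the hang may be useful too.
ss -tnp > /tmp/headscale-sockets.txt
# The Go runtime prints a full goroutine stack dump on SIGQUIT (and then exits),
# so this both captures the stacks and terminates the stuck process.
kill -QUIT "$PID"
sleep 2
# The dump ends up in the service's journal.
journalctl -u headscale --since "10 minutes ago" > /tmp/headscale-hang.log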

@axxonadmin added the bug label Oct 11, 2023
@rjmalagon

+1. After ~20 minutes, headscale gets stuck with timeouts while updating nodes.

@rjmalagon

After carefully checking my headscale VM's stats, I found that it is very RAM sensitive.
I serve ~200+ nodes on a 2 GB RAM / 4 CPU VM.
Sometimes headscale has a very brief RAM peak, but it usually stays below 1 GB of RAM usage.
I just added a little swapfile and it worked well. I am now experimenting with zram swap compression (it seems to work as well).
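
For anyone wanting to reproduce the workaround, a minimal sketch of adding a small swapfile (the 1 GB size is arbitrary; adjust it to your VM):

# Create and enable a swapfile as headroom for the brief RAM peaks.
fallocate -l 1G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
# Keep it enabled across reboots.
echo '/swapfile none swap sw 0 0' >> /etc/fstab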

@rjmalagon

I was wrong about this; headscale sometimes gets stuck even with enough memory and CPU.
Usually it gets stuck on multiple "Waiting for update on stream channel" messages.
I set up a Caddy reverse proxy to get the connection logs, and I see many 502s on the tailscale-control-protocol upgrade when headscale stops responding.

@Nickiel12

+1. Running on a NixOS server with 24 GB of RAM (so RAM isn't the issue). Headscale randomly (from what I can tell so far) gets stuck using 3% of the CPU, takes 5 minutes for systemctl to restart it, and doesn't allow new connections.

@kradalby
Collaborator

I think this might be fixed by the tip of #1564; could you test?

@Nickiel12

Nickiel12 commented Dec 1, 2023

@kradalby I would like to test the change that you have made; however, I'm having an awful time trying to use the headscale flake. I've been using the nixpkgs version, and I don't quite have my head wrapped around how flakes and Nix all work. I saw that you use the headscale flake in your configuration, and I was wondering how your configuration uses the flake version of headscale instead of the nixpkgs version. I'm assuming it has to do with the overlay, but I tried putting overlay = [ headscale.overlay ] in my flake.nix (after adding it as an input) and it did not change what version was installed.

If I can figure out how to switch to the flake version, I will be happy to test this change.

Update: I wasn't passing the overlaid pkgs to the server host. I'm switching to the patched version of the headscale server and will let you know how it works in a few days.

@Nickiel12

I'm not sure if this is the right thread for this, but while watching systemctl status headscale during systemctl restart headscale, I noticed that it appears to close the DB (SQLite for my instance) before shutting down the other threads properly, and it took almost two minutes for headscale to restart. I don't know if this is related, but if the SQLite database can disconnect before the rest of the application shuts down, it might be causing this issue.

@kradalby added this to the v0.23.0 milestone Dec 10, 2023
@kradalby
Collaborator

0.23.0-alpha2 addresses a series of issues with node synchronisation, online status, and subnet routers. Please test this release and report back if the issue still persists.

@Nickiel12

@kradalby The issue of the headscale server not responding has gone away. I've noticed a weird issue with the app where I need to log out, change the server and save it, then log back in before I can connect. But I have not had to restart the headscale server at all since I switched to the alpha release.

@Nickiel12

Well, scratch that: now I keep getting prompted to re-register my phone when I try to connect to the server. But I'm not getting the same issue as before, where I couldn't log on at all and the headscale service would hang.

@rjmalagon

0.23.0-alpha2 addresses a series of issues with node synchronisation, online status, and subnet routers. Please test this release and report back if the issue still persists.

Hi, the stability and responsiveness are much better. Alpha 1 easily got stuck under some stress (300+ nodes) in just minutes. This new alpha is robust enough to handle the same workload for 20+ hours without issues. Thanks @kradalby for the progress and for following up on this issue.

I can share any info if needed to help with the development of v0.23.

@jwischka

I'm experiencing something similar on 0.23-alpha2 - I haven't had time to debug yet, but I've had to restart headscale twice on a moderately sized install (~30 nodes) in the last two days.

@Nickiel12

I deleted my phone node, re-registered it, and I have not had any issues connecting in two days since!

@cfouche3005

Same issue on 0.23-alpha2 after one hour. I will check whether the issue reappears.

@cfouche3005

I can confirm the issue is still present: after one hour or so, there is no response from the gRPC or HTTP API, but headscale still works (I haven't tested whether I can register new nodes).

@cfouche3005

And I can also confirm that new or disconnected devices cannot join/reconnect to the tailnet.

@cfouche3005

I have to add that this bug is very inconsistent: today it did not appear, but yesterday it was there.

@dustinblackman

dustinblackman commented Feb 7, 2024

I'm finding myself in a similar situation where Headscale randomly locks up, running the master branch from this commit with Postgres. I haven't been able to find exactly where the issue happens while debugging. I'm hoping something like #1701 may solve it.

In the meantime, I've put together a small/lazy systemd service that acts as a health check, restarting Headscale if it locks up. Going by yesterday's data, this fired 9 times with 30 nodes in a for-fun containerized testing environment.

headscale-health

#!/usr/bin/env bash

set -e

while true; do
	echo "Checking headscale health"
	timeout 5 headscale nodes list >/dev/null ||
		(echo "Failed to get nodes list, rebooting headscale" && timeout 3 systemctl restart headscale) ||
		(echo "Failed to restart, killing process" && kill -9 "$(ps aux | grep 'headscale serve' | grep -v grep | awk '{print $2}')")

	sleep 10
done

headscale-health.service

[Unit]
After=syslog.target
After=network.target
Description=headscale health monitor

[Service]
Type=simple
User=root
Group=root
ExecStart=/usr/bin/headscale-health
Restart=always
RestartSec=1

[Install]
WantedBy=multi-user.target
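
To use it, roughly (assuming the script is saved so that it matches the unit's ExecStart path):

install -m 0755 headscale-health /usr/bin/headscale-health
install -m 0644 headscale-health.service /etc/systemd/system/headscale-health.service
systemctl daemon-reload
systemctl enable --now headscale-health.service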

@kradalby
Collaborator

kradalby commented Feb 7, 2024

I'm finding myself in a similar situation where Headscale randomly locks up, running the master branch from this commit with Postgres. I haven't been able to find exactly where the issue happens while debugging. I'm hoping something like #1701 may solve it.

@TotoTheDragon Could you have a look at this? I won't have time to get around to it for a bit more.

@TotoTheDragon
Contributor

@kradalby I have been looking into this issue for the past day and agree with Dustin that #1701 is a good candidate for solving this issue and #1656. This issue was present before the commit referenced.

I do not have the infrastructure to test with 30 or so nodes to recreate the issue, but we could make a build that includes the changes from #1701, and hopefully Dustin is able to test it to see whether it makes any difference.

@dustinblackman

dustinblackman commented Feb 7, 2024

I'm down! If #1701 is considered complete in its current state, I can ship it and see how it does.

@kradalby
Collaborator

kradalby commented Feb 7, 2024

I would say that it is complete, but complete as in the tip of main: not tested sufficiently to release as a version. But I take your currently running main to mean your risk appetite is fine with that.

@dustinblackman

dustinblackman commented Feb 7, 2024

your risk appetite is fine with that.

More or less; it's the fact that the codebase is easy to read, so at least I know what I'm bringing in off main. I'll give this a shot either this week or next :)

@TotoTheDragon
Contributor

@dustinblackman Would you be able to test with the current version of main?

@dustinblackman

dustinblackman commented Feb 12, 2024

@TotoTheDragon I've been running from 83769ba for the last four days. At first it looked like all was good, but looking at the logs I'm still seeing lockups, just fewer of them. I can try the latest master later in the week.

I also have a set of scripts for a local cluster that I wrote for #1725. I can look at PRing them if you think they'd be helpful for debugging this.

@TotoTheDragon
Contributor

@dustinblackman Seeing as headscale nodes list gets a context deadline exceeded, maybe we can add a bunch of traces within the command and see where it gets stuck. This will help rule some things out.

@kradalby
Collaborator

Could you give alpha4 a spin: https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha4

@dustinblackman

dustinblackman commented Feb 16, 2024

Could you give alpha4 a spin: https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha4

@kradalby I ran this for about two hours. I saw no reboots, but I experienced issues where some newly added ephemeral nodes were unable to communicate over the network (port 443 requests), even with tailscale ping showing a direct connection. I'm wondering whether nodes are not always being notified when a new node joins the network.

I'm going to try again in a localized environment and see if I can repro it.

@kradalby
Collaborator

Thank you @dustinblackman, that's helpful; it does sound like some updates are missing. There is a debug env flag you can turn on which will dump all the mapresponses sent. If you can repro, that would potentially be helpful info, but it produces a lot of data and might not be suitable if you have a lot of nodes.

You can play around with that by setting HEADSCALE_DEBUG_DUMP_MAPRESPONSE_PATH to somewhere on your system.
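
For a systemd-managed install, a drop-in along these lines should do it (the dump directory and unit name are assumptions; make sure the user headscale runs as can write to that directory):

mkdir -p /var/lib/headscale/mapresponses
mkdir -p /etc/systemd/system/headscale.service.d
cat > /etc/systemd/system/headscale.service.d/debug.conf <<'EOF'
[Service]
Environment=HEADSCALE_DEBUG_DUMP_MAPRESPONSE_PATH=/var/lib/headscale/mapresponses
EOF
systemctl daemon-reload
systemctl restart headscale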

@kradalby
Collaborator

Could you please test if this is still the case with https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha5 ?

@dustinblackman

@kradalby No reboots again, but after 30 minutes I get several lines such as the following. Couldn't prove they were actually causing issues. I'll test further.

{"level":"error","error":"context deadline exceeded","mkey":"mkey:0bdfa4de8f7c1f14d15fff04f39993c581c6ba131278ffsessdf5b0d","origin":"poll-nodeupdate-peers-patch","hostname":"mylaptop","time":1708543963,"message":"update not sent, context cancelled"}

@dustinblackman

I'm unable to repro this in a local cluster. :(

@kradalby
Collaborator

Could you please try the newest alpha (https://github.com/juanfont/headscale/releases/tag/v0.23.0-alpha6) and report back?

@kradalby
Collaborator

I think the latest alpha should have improved this a lot, can someone experiencing this give it a try?

@dustinblackman

I'll look to give this a spin this week if I can slot it in :)

@dustinblackman

Been running this for a little over a day with no issues! Amazing work, thank you! I appreciate all the effort.

@ohdearaugustin
Collaborator

Will close this as fixed.
