New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Empty lockfile, likely previous process crashed or storage medium failure; treating as stale #232
Comments
Never heard of something like this... Are you sure the .lock files don't ever exist? Because those files tend to get cleaned up pretty quick, or at least they should.
It is doing that already; just a few lines earlier: 79babff#diff-bb094528ba8793b4ce43d063c43a23b31ca8aefbc412cbbe8dc2188b21c36022R171-R175 If the file does not exist, I'd like more details on what's actually on the disk, the contents of the files, and preferably how to reproduce this. |
Matt, thanks for your quick reply, here is the info files and content related to the
|
btw, as i mentioned in the the log produced by |
I think in order to give you any answers and, if necessary, produce a patch, I'll need to be able to reproduce the behavior in the most minimal way possible. The full and unredacted logs could be a good start, that should at least give some insight into what's going on. Make sure to enable debug-level logging (will be very noisy) and don't redact any information, as that defeats the purpose and I can't help you then. Feel free to attach the logs to a reply here. |
this plugin provides a trojan protocol server, it is must be used with a trojan protocol client ( it will be used as proxy like let's say, socks5 ), it is not a common use case, but here is the steps of minimal reproducing (although it is already well written on the project page), i'll repeat here :
i'll try to get the debug log later. thanks for your help. |
i have another question, as you mentioned, there must be a lock file created according to the checks in 79babff#diff-bb094528ba8793b4ce43d063c43a23b31ca8aefbc412cbbe8dc2188b21c36022R171-R175 i agree with that, but why does certmagic care if it is empty or not? i mean, that |
Lock files created by the filesystem storage driver Line 350 in 8728b18
|
interesting findings: sometimes the .lock file has content, sometimes it was deleted so fast that even
lockfile=/home/caddy/.local/share/caddy/locks/trojanfb3936685511492c06bb12b60cdfb40968e1072ced30b5244b860950.lock
# there is no sleep what so ever between -f test and cat and while loop
while true ; do [[ -f $lockfile ]] && cat $lockfile ; done
|
Yeah the lockfiles don't often stay around for long. I wonder if the file is removed after opening the file but before reading its contents. That would return an empty file maybe? Is that something you could test on your system? Create a file with some contents, then close it. Then open it, delete it, then read from it. |
like this?
|
No, you'll have to open the file for reading, then delete it, then read from it. Running |
that's easy to simulate with a program language like Go, but not easy to to with existing Linux userland tools, i guess? |
the file system is ext4 with these features: tune2fs -l /dev/sda2 | grep -i features
|
Hmm, yeah, I don't know how you'd do it with just a shell. There might be something fancy you can do with bash but I don't know it. Here's a simple Go program you can use to test this though: package main
import (
"fmt"
"io"
"log"
"os"
)
func main() {
const filename = "foo.txt"
f, err := os.Open(filename)
if err != nil {
log.Fatal(err)
}
if err = os.Remove(filename); err != nil {
log.Fatal(err)
}
contents, err := io.ReadAll(f)
if err != nil {
log.Fatal(err)
}
fmt.Println(string(contents))
if err := f.Close(); err != nil {
log.Fatal(err)
}
} First create a file called "foo.txt" (if you use a different filename, just update the program as well) with something, like: Then run that program: See what happens? |
|
Ok, so that shows that when a file is deleted while opened it can still read the contents, i.e. it's not empty. There must be some other explanation. |
another loop test while true ; do
[[ -e $lockfile ]] && {
echo "=========================="
ls -la $lockfile
cat $lockfile
}
done sometimes see this: # ls sees the file, cat CAN NOT see it
==========================
-rw-r--r-- 1 caddy caddy 98 May 18 11:07 /home/caddy/.local/share/caddy/locks/trojanfb3936685511492c06bb12b60cdfb40968e1072ced30b5244b860950.lock
cat: /home/caddy/.local/share/caddy/locks/trojanfb3936685511492c06bb12b60cdfb40968e1072ced30b5244b860950.lock: No such file or directory
# both ls and cat CAN NOT see it
==========================
ls: cannot access '/home/caddy/.local/share/caddy/locks/trojanfb3936685511492c06bb12b60cdfb40968e1072ced30b5244b860950.lock': No such file or directory
cat: /home/caddy/.local/share/caddy/locks/trojanfb3936685511492c06bb12b60cdfb40968e1072ced30b5244b860950.lock: No such file or directory
may be the time window between |
if err = os.Remove(filename); err != nil {
log.Fatal(err)
}
contents, err := io.ReadAll(f)
if err != nil {
log.Fatal(err)
} in this case, it remove then read. what about read first then remove? |
We call I would expect that any read after that, if the file was successfully opened, should read data. However, at one point, in // truncate file and reset I/O offset to beginning
if err := f.Truncate(0); err != nil {
return true, err
}
if _, err := f.Seek(0, io.SeekStart); err != nil {
return true, err
} This is done so that we can replace the file contents. The |
there is a loop (caddy-trojan project) here too to update the which will involve creating a lock file i can make a patch to stop it from updating the diff --git a/listener/listener.go b/listener/listener.go
index b8116c2..88efe02 100644
--- a/listener/listener.go
+++ b/listener/listener.go
@@ -192,11 +192,11 @@ func (l *Listener) loop() {
lg.Info(fmt.Sprintf("handle trojan net.Conn from %v", c.RemoteAddr()))
}
- nr, nw, err := l.Proxy.Handle(io.Reader(c), io.Writer(c))
+ _, _, err := l.Proxy.Handle(io.Reader(c), io.Writer(c))
if err != nil {
lg.Error(fmt.Sprintf("handle net.Conn error: %v", err))
}
- up.Consume(utils.ByteSliceToString(b[:trojan.HeaderLen]), nr, nw)
+ // up.Consume(utils.ByteSliceToString(b[:trojan.HeaderLen]), nr, nw)
}(conn, l.Logger, l.Upstream)
}
} and now the content of {"up":0,"down":0} and of course, caddy no longer output stuff like |
do you think there is a |
Seems likely. I'm not sure how to replace the contents of a file atomically. I thought opening a file for writing would cause other readers to block while the writer was open, but I guess not...? |
@cattyhouse Would you mind trying with 321ed64? I'm guessing that if it's simply reading the lock file while it's still being truncated/written (not sure why that's allowed by the FS, but whatever) and thus seeing a size of 0, retrying after a brief delay should resolve the problem because by that time, the file should be fully written. If it truly is stale/abandoned, we still detect that, but it's after ~200ms instead of right away. |
thanks, i am building with this options: CGO_ENABLED=0 \
GOARCH=amd64 \
GOOS=linux \
XCADDY_GO_BUILD_FLAGS="-ldflags '-s -w'" \
xcaddy build master \
--output caddy@master_with-certmagic@master_with-caddy-trojan@master \
--with github.com/imgk/caddy-trojan@master \
--with github.com/caddyserver/certmagic@master
it looks like the version is correct
i am gonna run it and report back in one hour |
@mholt looks like the issue is still there after using certmagic master and also update the certmagic to v0.18.1-0.20230615213225-321ed6491253 in
|
@cattyhouse Ahh, dangit. Thank you for testing, and figuring that out; this is super helpful. So that means the file is still empty (but still exists) after 200ms...? Seems extremely unlikely. Either the storage medium is very slow, or something is lying. I mean, would we want to wait even longer? Right now we're waiting 100ms and trying up to 2 times (so 200ms). Should we be trying up to 5 times (500ms)? 10 (1s)? That seems ludicrous though. We're even closing and re-opening the file, so if the file is deleted in the meantime it shouldn't re-open successfully. That seems like a broken file system, but I can't believe ext4 would be broken like that. Is something else happening? 🤔 |
let me do some experiment by changing |
(THANK YOU, by the way, for going to the effort to help try my ideas. I hope we can figure this out soon. But I'll be honest I'm running out of ideas.) |
build steps
EDIT: this step is not needed : as long as we build a new certmagic into caddy, it is the only option caddy-trojan could use, disregarding whatever version in it's go.mod
CGO_ENABLED=0 \
GOARCH=amd64 \
GOOS=linux \
XCADDY_GO_BUILD_FLAGS="-ldflags '-s -w'" \
xcaddy build master --output caddy-master_certmagic-master_caddy-trojan-master \
--with github.com/imgk/caddy-trojan@master \
--with github.com/caddyserver/certmagic=./certmagic
looks like the brute |
btw, can you backport this patch (10 tries) to v1.7.2 also? as this is the version caddy v2.6.4 and caddy-trojan uses. (i am not sure if it is possible to patch a tag though) |
@cattyhouse Wow, that's really interesting. I am glad we seemed to have made progress here! Thank you for troubleshooting/debugging so thoroughly. I'm going to change the delay interval to 250ms and set the retry count to 6. Question: Any idea why your storage backend is so slow? Are you using an SSD or an HDD? It's taking about a second to create and write a file. That's 10x longer than I would have guessed. It does seem you are using the locks in a very busy part of the code (for every request, I think?) and the locks are very very short-lived. So there's lots of writes to disk. I could see why that's potentially slowing things down. Are you actually needing this to coordinate multiple instances in a cluster? Or is this all for a single machine? If it's just the same instance of Caddy then why not use a regular sync.Mutex... Anyway just wanted to get some more info so I can be confident with the patch. :) Thanks.
That is because that function,
Unfortunately tags are immutable (with respect to Go modules). This means we can't change a tag. I'll get it into the next Caddy release though: 2.7 beta 2. |
thanks for all the help!
|
Alright, gotcha gotcha. (I decided to change the count to 8; that's two seconds total.)
Wow, well there must be some bottleneck, I wonder if it's the RAID, or maybe just shared hardware in general.
I didn't realize you aren't the author, so now I really extra-appreciate your willingness to help investigate the code. :) Yes, I would highly recommend just using
For in-process locking, definitely just use sync.Mutex. Either way, glad we are able to have figured out a workaround, even if it's not absolutely great. I'm still bewildered that the FS is letting readers read while it is writing the file. |
since 79babff , caddy produces lots of logs (32285 lines so far and keep increasing) like: imgk/caddy-trojan#55 (comment)
in above case the plugin
caddy-trojan
does not create.lock
file at all.Suggestion:
maybe certmagic checks if
.lock
file exists before checkingio.EOF
?The text was updated successfully, but these errors were encountered: