Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug(transport): sendmsg: invalid argument when using QUIC on latest version #2591

Closed
distractedm1nd opened this issue Oct 2, 2023 · 22 comments
Labels
kind/bug A bug in existing code (including security flaws) kind/stale need/author-input Needs input from the original author

Comments

@distractedm1nd
Copy link

Since upgrading to v0.31.0, we encountered a new error: sendmsg: invalid argument.

I am not entirely sure yet how the entire network got into this state (15 nodes), but all nodes were logging similarly and it persisted until the network degraded into a state where the protocol was no longer being used. After restarting the nodes, the network recovered and has been stable since. My theory is that it happened when a single node got overloaded that was acting as the original source of the information to be shared throughout the network. I will gladly try to recreate the scenario with whatever debugging flags are needed to gather more information.

I saw this issue in quic-go that seems to be resolved, but may be related: quic-go/quic-go#3911. In our network both the clients and servers arch is x86_64/

Clients started logging:

failed to read status from stream: INTERNAL_ERROR (local): write udp4 0.0.0.0:2121->100.64.13.187:2121: sendmsg: invalid argument
failed to read eds from ods bytes: share: reading next car entry: INTERNAL_ERROR (local): write udp4 0.0.0.0:2121->100.64.0.86:2121: sendmsg: invalid argument
failed to read status from stream: INTERNAL_ERROR (local): write udp4 0.0.0.0:2121->100.64.13.187:2121: sendmsg: invalid argument
failed to write request to stream: received a stateless reset with token 03cb6bd662b7f8be9172d39336b039f3
failed to read eds from ods bytes: share: reading next car entry: INTERNAL_ERROR (local): write udp4 0.0.0.0:2121->100.64.0.86:2121: sendmsg: invalid argument
failed to open stream: connection failed
failed to read eds from ods bytes: share: reading next car entry: INTERNAL_ERROR (local): write udp4 0.0.0.0:2121->100.64.0.86:2121: sendmsg: invalid argument
failed to read eds from ods bytes: share: reading car file: INTERNAL_ERROR (local): write udp4 0.0.0.0:2121->100.64.0.86:2121: sendmsg: invalid argument
failed to read status from stream: INTERNAL_ERROR (local): write udp4 0.0.0.0:2121->100.64.12.233:2121: sendmsg: invalid argument
failed to read eds from ods bytes: share: reading next car entry: INTERNAL_ERROR (local): write udp4 0.0.0.0:2121->100.64.0.86:2121: sendmsg: invalid argument
failed to read status from stream: INTERNAL_ERROR (local): write udp4 0.0.0.0:2121->100.64.0.86:2121: sendmsg: invalid argument
failed to read status from stream: received a stateless reset with token de28b1cd1aae453820fab0efe4de9963

Servers of this protocol logged the same error along with a stateless reset for concurrent streams:

2023-10-02T10:41:14.672Z	WARN	shrex/eds	shrexeds/server.go:140	server: writing ods to stream	{"peer": "12D3KooWDr92PrFG1kKRJJn2VPK7uZB9TAgBv76oB3gti1sw3yEp", "hash": "360F454ACFD00261988829E4504F751AE7C8C8C282E49AF34219AF740176A488", "err": "writing ODS bytes: 1016877 bytes written, INTERNAL_ERROR (local): write udp4 0.0.0.0:2121->100.64.13.187:2121: sendmsg: invalid argument"}
2023-10-02T10:41:15.667Z	WARN	shrex/eds	shrexeds/server.go:140	server: writing ods to stream	{"peer": "12D3KooWDr92PrFG1kKRJJn2VPK7uZB9TAgBv76oB3gti1sw3yEp", "hash": "360F454ACFD00261988829E4504F751AE7C8C8C282E49AF34219AF740176A488", "err": "writing ODS bytes: 1206038 bytes written, received a stateless reset with token 2f9fb982cbcf677f84a5a04284723fbd"}
2023-10-02T10:41:16.667Z	WARN	shrex/eds	shrexeds/server.go:140	server: writing ods to stream	{"peer": "12D3KooWDr92PrFG1kKRJJn2VPK7uZB9TAgBv76oB3gti1sw3yEp", "hash": "360F454ACFD00261988829E4504F751AE7C8C8C282E49AF34219AF740176A488", "err": "writing ODS bytes: 1190481 bytes written, received a stateless reset with token 191b2b55a484ccffaebfb6bf29879cd5"}
2023-10-02T10:41:18.667Z	WARN	shrex/eds	shrexeds/server.go:140	server: writing ods to stream	{"peer": "12D3KooWDr92PrFG1kKRJJn2VPK7uZB9TAgBv76oB3gti1sw3yEp", "hash": "360F454ACFD00261988829E4504F751AE7C8C8C282E49AF34219AF740176A488", "err": "writing ODS bytes: 972566 bytes written, received a stateless reset with token 66b89f515f584e1762d8836dd2ba3d5e"}
2023-10-02T10:41:20.678Z	WARN	shrex/eds	shrexeds/server.go:140	server: writing ods to stream	{"peer": "12D3KooWDr92PrFG1kKRJJn2VPK7uZB9TAgBv76oB3gti1sw3yEp", "hash": "360F454ACFD00261988829E4504F751AE7C8C8C282E49AF34219AF740176A488", "err": "writing ODS bytes: 39633 bytes written, received a stateless reset with token c72aa0e326e0dd06d99628199d5d41a3"}

go.mod: https://github.com/celestiaorg/celestia-node/blob/main/go.mod

@sukunrt
Copy link
Member

sukunrt commented Oct 2, 2023

Do you have QUIC_GO_DISABLE_GSO env variable set to true? Is this on linux?

@distractedm1nd
Copy link
Author

distractedm1nd commented Oct 2, 2023

Yes, it is on linux with x86_64 arch. QUIC_GO_DISABLE_GSO seems to work to fix this, but need to try to fully reproduce the scenario again/need some more uptime to confirm

@sukunrt
Copy link
Member

sukunrt commented Oct 2, 2023

QUIC_GO_DISABLE_GSO seems to work to fix this

You don't have to try/change it. I just want to know when the bug happened what the state of this environment variable was.

@distractedm1nd
Copy link
Author

It was not set, so false

@marten-seemann
Copy link
Contributor

@distractedm1nd Which operating system are you using? Apparently there are some problems with FreeBSD, see quic-go/quic-go#4105 and quic-go/quic-go#4106.

@distractedm1nd
Copy link
Author

@marten-seemann We are using alpine linux for these nodes - thanks for linking the issues, I'll look into them

@marten-seemann
Copy link
Contributor

Can you try setting QUIC_GO_DISABLE_GSO=true and see if the problem still occurs?
Also (separately) try setting QUIC_GO_DISABLE_ECN=true.

This way, we can find out if GSO or ECN is the problem.

@distractedm1nd
Copy link
Author

Yes, I will be trying to recreate the scenario today. Once I can get the logs consistently again, I will experiment with the flags and post the results here

@distractedm1nd
Copy link
Author

image

We enabled QUIC_GO_DISABLE_GSO=true at the cursor and do not have any sendmsg logs on any node since then.

@distractedm1nd
Copy link
Author

Would the next step be enabling QUIC debug logs and finding out what is happening right before this occurs?

@marten-seemann
Copy link
Contributor

@distractedm1nd Can you share details about the environment you're running this on? We've had reports of GSO failures on some platforms (quic-go/quic-go#3911), although a lot less after the v0.38.0 release.

Alternatively, could you add some logging in https://github.com/quic-go/quic-go/blob/master/sys_conn_oob.go#L240, so we can see what exactly the arguments to sendmsg where when this error occurs?

@Wondertan
Copy link
Contributor

@marten-seemann, is there a way to QLOG this or do you want us to print out the inputs there?

@distractedm1nd distractedm1nd changed the title bug(webtransport): sendmsg: invalid argument when using QUIC on latest version bug(transport): sendmsg: invalid argument when using QUIC on latest version Oct 11, 2023
@marten-seemann
Copy link
Contributor

@Wondertan There's no way to qlog this at the moment. We could change the code to pass a qlogger into that function if we need to for debugging, but for now I suggest we do the easy thing and just print the inputs. This is assuming that this bug can be triggered reliably and we won't spam the log with many entries from unrelated connections.

@distractedm1nd
Copy link
Author

distractedm1nd commented Oct 12, 2023

I added logs to WritePacket right before the WriteMsgUDP and they are not being logged. I am however seeing in the logs spammed:

2023/10/12 14:42:13 server Destroying connection with error: write udp4 0.0.0.0:2121->100.64.8.28:2121: sendmsg: invalid argument

for both server and client.

Am I doing something wrong here, or do I need to add logs to a different method? I am logging with utils.DefaultLogger.Debugf.

@marten-seemann
Copy link
Contributor

Am I doing something wrong here, or do I need to add logs to a different method? I am logging with utils.DefaultLogger.Debugf.

That will only work if you run with QUIC_GO_LOG_LEVEL=debug. That will also print every packet sent on the connection, plus a number of other events. It might be easier to just use fmt.Printf in this case.

@distractedm1nd
Copy link
Author

Yes, we ran with QUIC_GO_LOG_LEVEL=debug. But I will switch to fmt.Printf and disable the environment vars to see if I can get it to log

Wondertan added a commit to celestiaorg/celestia-node that referenced this issue Oct 13, 2023
Disables QUIC by default yet allows to enable it with ENV var.
This is temporary and until libp2p/go-libp2p#2591 is investigated and fixed.

The current solution disables QUIC programmatically while we still keep writing in the config QUIC listen addresses by default. This removes the need for users to reinit their configs once we turn QUIC back by default.

This works perfectly fine on mocha.
@marten-seemann
Copy link
Contributor

@distractedm1nd Do you have any update on this?

@marten-seemann marten-seemann added need/author-input Needs input from the original author kind/bug A bug in existing code (including security flaws) labels Oct 20, 2023
@distractedm1nd
Copy link
Author

Sorry Marten, I've been out this week for my wedding and will get back to it asap next week. Last attempts didn't reproduce the issue unfortunately

@Wondertan
Copy link
Contributor

We should check if quic-go/quic-go#4127 is related

@marten-seemann
Copy link
Contributor

Please do. There’s a quic-go v0.39.2 patch release that contains fixes for both FreeBSD and Linux.

Copy link

github-actions bot commented Nov 1, 2023

Oops, seems like we needed more information for this issue, please comment with more details or this issue will be closed in 7 days.

@marten-seemann
Copy link
Contributor

Closing since this has (probably) been fixed by the latest quic-go release. Please feel free to reopen if it still happens after updating your dependencies.

Wondertan added a commit to celestiaorg/celestia-node that referenced this issue Dec 13, 2023
Reverts disabled quic. We weren't able to reproduce [the issue](libp2p/go-libp2p#2591) since we first observed it, as well as quic released changes to address this. Even if the issue is there, I think it's more or less safe as we always have TCP as fallback
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug A bug in existing code (including security flaws) kind/stale need/author-input Needs input from the original author
Projects
None yet
Development

No branches or pull requests

4 participants