-
-
Notifications
You must be signed in to change notification settings - Fork 15.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix kevent(..) failed: Invalid argument
#13503
Conversation
Motivation: While I was investating Armeria CI build failures on macOS, many `kevent(..) failed: Invalid argument` error messages were produced on the failed tests. https://ge.armeria.dev/s/h2fz4h4th66we/tests/task/:core:shadedTest/details/com.linecorp.armeria.server.HttpServerNoKeepAliveTest/shouldDisconnectWhenConnectionCloseIsIncluded(SessionProtocol)%5B4%5D?focused-execution=1&top-execution=1#L7 After adding some debug messages, I figured out that the error raised when a large number such `Integer.MAX_VALUE` is set as a timeout. The long timeout is derived from a scheduled task whose timeout is `Long.MAX_VALUE`. We set `Long.MAX_VALUE` to `DnsNameResolver` to disable a query timeout. https://github.com/netty/netty/blob/cbc995865ed1e082d5581744e9c90c6d119652b1/resolver-dns/src/main/java/io/netty/resolver/dns/DnsQueryContext.java#L282-L293 https://github.com/line/armeria/blob/2bfe424b652a16f603e2ba1ca075bf992f80f24e/core/src/test/java/com/linecorp/armeria/internal/client/dns/DefaultDnsResolverTest.java#L91-L93 Failed to find a specification for the timeout value of `kevent()`. However, there was a note in a man page that the timeout value is limited to 24 hours. https://man.freebsd.org/cgi/man.cgi?query=kevent&apropos=0&sektion=0&manpath=FreeBSD+6.1-RELEASE&format=html#end Modifications: - Limit the maximum allowed timeout for `kevent()` to 24 hours. - Skip to schedule a query timeout when the value is `Long.MAX_VALUE` Result: `kevent(..) failed: Invalid argument` no longer raises when a task is scheduled with a large value.
Please let me know if there is a good way to test these changes. |
resolver-dns/src/main/java/io/netty/resolver/dns/DnsQueryContext.java
Outdated
Show resolved
Hide resolved
transport-classes-kqueue/src/main/java/io/netty/channel/kqueue/KQueueEventLoop.java
Show resolved
Hide resolved
This is the build result run with the change in this PR. |
Motivation: The build failures on macOS had two issues. 1. A deadlock on the JaCoCo task that prevented tests from running on macOS. 2. Blocked event loops that were caused by `kevent(..) Invalid Argument`. https://github.com/netty/netty/blob/830166292be28e64ec6b43170cff40a31a442ad4/transport-classes-kqueue/src/main/java/io/netty/channel/kqueue/KQueueEventLoop.java#L399-L405 A patch for the bug was submmited to netty/netty#13503 Modifications: - Fixed DNS builder and tests to use `-1` in order to disable a query timeout. Long.MAX_VALUE could causes `kevent(..) Invalid Argument`. - Nomarilized `Long.MAX_VALUE` to `-1` - Removed a depedency between `reportTask` and `testTask.path` that caused the deadlock with the current Gradle version. The dependency looked redundant. The JaCoCo task runs without it. - Upgrade deprecated Codecov GitHub Actions plugin. The old version is no longer working. - Miscellaneous) Added debug message to `AbstractServerCall` to find out the cause of line#5035 Result: - Code coverage is correctly reported by Codecov. - You no longer see `kevent(..) Invalid Argument` when testing Armeria on macOS.
int delaySeconds = (int) min(totalDelay / 1000000000L, Integer.MAX_VALUE); | ||
return kqueueWait(delaySeconds, (int) min(totalDelay - delaySeconds * 1000000000L, Integer.MAX_VALUE)); | ||
int delaySeconds = (int) min(totalDelay / 1000000000L, KQUEUE_MAX_TIMEOUT_SECONDS); | ||
int delayNanos = (int) (totalDelay % 1000000000L); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we also need to subtract the delaySeconds ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ikhoon ping
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, it seems a typo here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't the remainder of modular operation (% 1000000000L) become the nanos' part of totalDelay
? Correct me if I'm wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrote a unit test and it passed.
@Test
void testNanos() {
final long nanoTime = System.nanoTime();
final long seconds = nanoTime / 1_000_000_000L;
final long nanos1 = nanoTime % 1_000_000_000L;
final long nanos2 = nanoTime - seconds * 1_000_000_000L;
assertThat(nanos2).isEqualTo(nanos1);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine because KQUEUE_MAX_TIMEOUT_SECONDS
is 24 hours minus 1 second, and the nanosecond component cannot exceed one second. Keeping the total under 24 hours.
int delaySeconds = (int) min(totalDelay / 1000000000L, Integer.MAX_VALUE); | ||
return kqueueWait(delaySeconds, (int) min(totalDelay - delaySeconds * 1000000000L, Integer.MAX_VALUE)); | ||
int delaySeconds = (int) min(totalDelay / 1000000000L, KQUEUE_MAX_TIMEOUT_SECONDS); | ||
int delayNanos = (int) (totalDelay % 1000000000L); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine because KQUEUE_MAX_TIMEOUT_SECONDS
is 24 hours minus 1 second, and the nanosecond component cannot exceed one second. Keeping the total under 24 hours.
Motivation: The build failures on macOS had two issues. 1. A deadlock on the JaCoCo task that prevented tests from running on macOS. `-Pcoverage` option was only used on macOS for CI builds. 2. Blocked event loops that were caused by `kevent(..) Invalid Argument`. https://github.com/netty/netty/blob/830166292be28e64ec6b43170cff40a31a442ad4/transport-classes-kqueue/src/main/java/io/netty/channel/kqueue/KQueueEventLoop.java#L399-L405 A patch for the bug was submitted to netty/netty#13503 Modifications: - Fixed DNS builder and tests to use `-1` in order to disable a query timeout. Long.MAX_VALUE could cause `kevent(..) Invalid Argument`. - Removed a dependency between `reportTask` and `testTask.path` that caused the deadlock with the current Gradle version. It seemed the dependency was redundant. The JaCoCo task now runs without it. - Upgrade deprecated Codecov GitHub Actions plugin. The old version is no longer working. - Miscellaneous) Added debug message to `AbstractServerCall` to find out the cause of #5035 Result: - Code coverage is correctly reported by Codecov. - You no longer see `kevent(..) Invalid Argument` when testing Armeria on macOS.
Motivation: While I was investigating Armeria CI build failures on macOS, many `kevent(..) failed: Invalid argument` error messages were produced on the failed tests. https://ge.armeria.dev/s/h2fz4h4th66we/tests/task/:core:shadedTest/details/com.linecorp.armeria.server.HttpServerNoKeepAliveTest/shouldDisconnectWhenConnectionCloseIsIncluded(SessionProtocol)%5B4%5D?focused-execution=1&top-execution=1#L7 After adding some debug messages, I figured out that the error raised when a large number such as `Integer.MAX_VALUE` is set as a timeout. https://github.com/netty/netty/blob/cbc995865ed1e082d5581744e9c90c6d119652b1/transport-classes-kqueue/src/main/java/io/netty/channel/kqueue/KQueueEventLoop.java#L169-L171 The long timeout is derived from a scheduled task whose timeout is `Long.MAX_VALUE`. We set `Long.MAX_VALUE` to `DnsNameResolver` in order to disable a query timeout for testing. https://github.com/netty/netty/blob/cbc995865ed1e082d5581744e9c90c6d119652b1/resolver-dns/src/main/java/io/netty/resolver/dns/DnsQueryContext.java#L282-L293 https://github.com/line/armeria/blob/2bfe424b652a16f603e2ba1ca075bf992f80f24e/core/src/test/java/com/linecorp/armeria/internal/client/dns/DefaultDnsResolverTest.java#L91-L93 Failed to find a specification for the timeout value of `kevent()`. However, there was a note on a man page that says the timeout value is limited to 24 hours. https://man.freebsd.org/cgi/man.cgi?query=kevent&apropos=0&sektion=0&manpath=FreeBSD+6.1-RELEASE&format=html#end Modifications: - Limit the maximum allowed timeout for `kevent()` to 24 hours. - Skip scheduling a query timeout when the value is `Long.MAX_VALUE` Result: `kevent(..) failed: Invalid argument` no longer raises when a task is scheduled with a large value.
Motivation:
While I was investigating Armeria CI build failures on macOS, many
kevent(..) failed: Invalid argument
error messages were produced on the failed tests.https://ge.armeria.dev/s/h2fz4h4th66we/tests/task/:core:shadedTest/details/com.linecorp.armeria.server.HttpServerNoKeepAliveTest/shouldDisconnectWhenConnectionCloseIsIncluded(SessionProtocol)%5B4%5D?focused-execution=1&top-execution=1#L7
After adding some debug messages, I figured out that the error raised when a large number such as
Integer.MAX_VALUE
is set as a timeout.netty/transport-classes-kqueue/src/main/java/io/netty/channel/kqueue/KQueueEventLoop.java
Lines 169 to 171 in cbc9958
The long timeout is derived from a scheduled task whose timeout is
Long.MAX_VALUE
. We setLong.MAX_VALUE
toDnsNameResolver
in order to disable a query timeout for testing.netty/resolver-dns/src/main/java/io/netty/resolver/dns/DnsQueryContext.java
Lines 282 to 293 in cbc9958
Failed to find a specification for the timeout value of
kevent()
. However, there was a note on a man page that says the timeout value is limited to 24 hours.https://man.freebsd.org/cgi/man.cgi?query=kevent&apropos=0&sektion=0&manpath=FreeBSD+6.1-RELEASE&format=html#end
Modifications:
kevent()
to 24 hours.Long.MAX_VALUE
Result:
kevent(..) failed: Invalid argument
no longer raises when a task is scheduled with a large value.