Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix kevent(..) failed: Invalid argument #13503

Merged
merged 4 commits into from
Jul 27, 2023

Conversation

ikhoon
Copy link
Contributor

@ikhoon ikhoon commented Jul 20, 2023

Motivation:

While I was investigating Armeria CI build failures on macOS, many kevent(..) failed: Invalid argument error messages were produced on the failed tests.
https://ge.armeria.dev/s/h2fz4h4th66we/tests/task/:core:shadedTest/details/com.linecorp.armeria.server.HttpServerNoKeepAliveTest/shouldDisconnectWhenConnectionCloseIsIncluded(SessionProtocol)%5B4%5D?focused-execution=1&top-execution=1#L7

After adding some debug messages, I figured out that the error raised when a large number such as Integer.MAX_VALUE is set as a timeout.

long totalDelay = delayNanos(System.nanoTime());
int delaySeconds = (int) min(totalDelay / 1000000000L, Integer.MAX_VALUE);
return kqueueWait(delaySeconds, (int) min(totalDelay - delaySeconds * 1000000000L, Integer.MAX_VALUE));

The long timeout is derived from a scheduled task whose timeout is Long.MAX_VALUE. We set Long.MAX_VALUE to DnsNameResolver in order to disable a query timeout for testing.
timeoutFuture = channel.eventLoop().schedule(new Runnable() {
@Override
public void run() {
if (promise.isDone()) {
// Received a response before the query times out.
return;
}
finishFailure("query '" + id + "' via " + protocol() + " timed out after " +
queryTimeoutMillis + " milliseconds", null, true);
}
}, queryTimeoutMillis, TimeUnit.MILLISECONDS);
https://github.com/line/armeria/blob/2bfe424b652a16f603e2ba1ca075bf992f80f24e/core/src/test/java/com/linecorp/armeria/internal/client/dns/DefaultDnsResolverTest.java#L91-L93

Failed to find a specification for the timeout value of kevent(). However, there was a note on a man page that says the timeout value is limited to 24 hours.
https://man.freebsd.org/cgi/man.cgi?query=kevent&apropos=0&sektion=0&manpath=FreeBSD+6.1-RELEASE&format=html#end

Modifications:

  • Limit the maximum allowed timeout for kevent() to 24 hours.
  • Skip scheduling a query timeout when the value is Long.MAX_VALUE

Result:

kevent(..) failed: Invalid argument no longer raises when a task is scheduled with a large value.

Motivation:

While I was investating Armeria CI build failures on macOS,
many `kevent(..) failed: Invalid argument` error messages were produced
on the failed tests.
https://ge.armeria.dev/s/h2fz4h4th66we/tests/task/:core:shadedTest/details/com.linecorp.armeria.server.HttpServerNoKeepAliveTest/shouldDisconnectWhenConnectionCloseIsIncluded(SessionProtocol)%5B4%5D?focused-execution=1&top-execution=1#L7

After adding some debug messages, I figured out that the error raised
when a large number such `Integer.MAX_VALUE` is set as a timeout.
The long timeout is derived from a scheduled task whose timeout is
`Long.MAX_VALUE`. We set `Long.MAX_VALUE` to `DnsNameResolver` to
disable a query timeout.
https://github.com/netty/netty/blob/cbc995865ed1e082d5581744e9c90c6d119652b1/resolver-dns/src/main/java/io/netty/resolver/dns/DnsQueryContext.java#L282-L293
https://github.com/line/armeria/blob/2bfe424b652a16f603e2ba1ca075bf992f80f24e/core/src/test/java/com/linecorp/armeria/internal/client/dns/DefaultDnsResolverTest.java#L91-L93

Failed to find a specification for the timeout value of `kevent()`.
However, there was a note in a man page that the timeout value is
limited to 24 hours.
https://man.freebsd.org/cgi/man.cgi?query=kevent&apropos=0&sektion=0&manpath=FreeBSD+6.1-RELEASE&format=html#end

Modifications:

- Limit the maximum allowed timeout for `kevent()` to 24 hours.
- Skip to schedule a query timeout when the value is `Long.MAX_VALUE`

Result:

`kevent(..) failed: Invalid argument` no longer raises when a task
is scheduled with a large value.
@ikhoon
Copy link
Contributor Author

ikhoon commented Jul 20, 2023

Please let me know if there is a good way to test these changes.
I locally tested the patched artifacts with Armeria and the error messages had gone and finished successfully.

@normanmaurer normanmaurer added this to the 4.1.96.Final milestone Jul 20, 2023
@ikhoon
Copy link
Contributor Author

ikhoon commented Jul 20, 2023

This is the build result run with the change in this PR.

ikhoon added a commit to ikhoon/armeria that referenced this pull request Jul 21, 2023
Motivation:

The build failures on macOS had two issues.

1. A deadlock on the JaCoCo task that prevented tests from running on
   macOS.
2. Blocked event loops that were caused by `kevent(..) Invalid Argument`.
   https://github.com/netty/netty/blob/830166292be28e64ec6b43170cff40a31a442ad4/transport-classes-kqueue/src/main/java/io/netty/channel/kqueue/KQueueEventLoop.java#L399-L405
   A patch for the bug was submmited to netty/netty#13503

Modifications:

- Fixed DNS builder and tests to use `-1` in order to disable a query timeout.
  Long.MAX_VALUE could causes `kevent(..) Invalid Argument`.
- Nomarilized `Long.MAX_VALUE` to `-1`
- Removed a depedency between `reportTask` and `testTask.path` that
  caused the deadlock with the current Gradle version. The dependency
  looked redundant. The JaCoCo task runs without it.
- Upgrade deprecated Codecov GitHub Actions plugin. The old version is no
  longer working.
- Miscellaneous) Added debug message to `AbstractServerCall` to
  find out the cause of line#5035

Result:

- Code coverage is correctly reported by Codecov.
- You no longer see `kevent(..) Invalid Argument` when testing Armeria
  on macOS.
int delaySeconds = (int) min(totalDelay / 1000000000L, Integer.MAX_VALUE);
return kqueueWait(delaySeconds, (int) min(totalDelay - delaySeconds * 1000000000L, Integer.MAX_VALUE));
int delaySeconds = (int) min(totalDelay / 1000000000L, KQUEUE_MAX_TIMEOUT_SECONDS);
int delayNanos = (int) (totalDelay % 1000000000L);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we also need to subtract the delaySeconds ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ikhoon ping

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, it seems a typo here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't the remainder of modular operation (% 1000000000L) become the nanos' part of totalDelay? Correct me if I'm wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrote a unit test and it passed.

@Test
void testNanos() {
    final long nanoTime = System.nanoTime();
    final long seconds = nanoTime / 1_000_000_000L;
    final long nanos1 = nanoTime % 1_000_000_000L;
    final long nanos2 = nanoTime - seconds * 1_000_000_000L;
    assertThat(nanos2).isEqualTo(nanos1);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine because KQUEUE_MAX_TIMEOUT_SECONDS is 24 hours minus 1 second, and the nanosecond component cannot exceed one second. Keeping the total under 24 hours.

int delaySeconds = (int) min(totalDelay / 1000000000L, Integer.MAX_VALUE);
return kqueueWait(delaySeconds, (int) min(totalDelay - delaySeconds * 1000000000L, Integer.MAX_VALUE));
int delaySeconds = (int) min(totalDelay / 1000000000L, KQUEUE_MAX_TIMEOUT_SECONDS);
int delayNanos = (int) (totalDelay % 1000000000L);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine because KQUEUE_MAX_TIMEOUT_SECONDS is 24 hours minus 1 second, and the nanosecond component cannot exceed one second. Keeping the total under 24 hours.

ikhoon added a commit to line/armeria that referenced this pull request Jul 27, 2023
Motivation:

The build failures on macOS had two issues.

1. A deadlock on the JaCoCo task that prevented tests from running on macOS. `-Pcoverage` option was only used on macOS for CI builds.
2. Blocked event loops that were caused by `kevent(..) Invalid Argument`.
   https://github.com/netty/netty/blob/830166292be28e64ec6b43170cff40a31a442ad4/transport-classes-kqueue/src/main/java/io/netty/channel/kqueue/KQueueEventLoop.java#L399-L405
   A patch for the bug was submitted to netty/netty#13503

Modifications:

- Fixed DNS builder and tests to use `-1` in order to disable a query timeout. Long.MAX_VALUE could cause `kevent(..) Invalid Argument`.
- Removed a dependency between `reportTask` and `testTask.path` that caused the deadlock with the current Gradle version. It seemed the dependency was redundant. The JaCoCo task now runs without it.
- Upgrade deprecated Codecov GitHub Actions plugin. The old version is no longer working.
- Miscellaneous) Added debug message to `AbstractServerCall` to find out the cause of #5035

Result:

- Code coverage is correctly reported by Codecov.
- You no longer see `kevent(..) Invalid Argument` when testing Armeria on macOS.
@normanmaurer normanmaurer merged commit 1735b42 into netty:4.1 Jul 27, 2023
14 checks passed
normanmaurer pushed a commit that referenced this pull request Jul 27, 2023
Motivation:

While I was investigating Armeria CI build failures on macOS, many
`kevent(..) failed: Invalid argument` error messages were produced on
the failed tests.

https://ge.armeria.dev/s/h2fz4h4th66we/tests/task/:core:shadedTest/details/com.linecorp.armeria.server.HttpServerNoKeepAliveTest/shouldDisconnectWhenConnectionCloseIsIncluded(SessionProtocol)%5B4%5D?focused-execution=1&top-execution=1#L7

After adding some debug messages, I figured out that the error raised
when a large number such as `Integer.MAX_VALUE` is set as a timeout.

https://github.com/netty/netty/blob/cbc995865ed1e082d5581744e9c90c6d119652b1/transport-classes-kqueue/src/main/java/io/netty/channel/kqueue/KQueueEventLoop.java#L169-L171
The long timeout is derived from a scheduled task whose timeout is
`Long.MAX_VALUE`. We set `Long.MAX_VALUE` to `DnsNameResolver` in order
to disable a query timeout for testing.

https://github.com/netty/netty/blob/cbc995865ed1e082d5581744e9c90c6d119652b1/resolver-dns/src/main/java/io/netty/resolver/dns/DnsQueryContext.java#L282-L293
https://github.com/line/armeria/blob/2bfe424b652a16f603e2ba1ca075bf992f80f24e/core/src/test/java/com/linecorp/armeria/internal/client/dns/DefaultDnsResolverTest.java#L91-L93

Failed to find a specification for the timeout value of `kevent()`.
However, there was a note on a man page that says the timeout value is
limited to 24 hours.

https://man.freebsd.org/cgi/man.cgi?query=kevent&apropos=0&sektion=0&manpath=FreeBSD+6.1-RELEASE&format=html#end

Modifications:

- Limit the maximum allowed timeout for `kevent()` to 24 hours.
- Skip scheduling a query timeout when the value is `Long.MAX_VALUE`

Result:

`kevent(..) failed: Invalid argument` no longer raises when a task is
scheduled with a large value.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants