
SpringFW 6 / SpringBoot 3.1.4 http_server_requests_seconds_count metrics incorrectly reports outcome=UNKNOWN for success requests and not reporting exception/errors for client cancel #31365

Closed
jtorkkel opened this issue Oct 5, 2023 · 22 comments
Labels: in: web (Issues in web modules: web, webmvc, webflux, websocket) · status: superseded (An issue that has been superseded by another) · theme: observability (An issue related to observability and tracing)

Comments


jtorkkel commented Oct 5, 2023

Current situation:

There are 3 main issues in the http_server_requests_seconds_count/sum/max/bucket metrics in Spring Boot 3.x / Spring Framework 6.x:

1. Successful responses are often randomly counted as status="UNKNOWN", outcome="UNKNOWN", error="none". Up to 10% of the calls may be reported incorrectly. The calls themselves work correctly, so a doOnCancel apparently races the recording of the metrics.
2. Random extra status="UNKNOWN", outcome="UNKNOWN", error="none" metrics are added that do not match the request count (probability << 1%). These are not real metrics but some kind of duplicate, false doOnCancel events, maybe a race producing double doOnCancels.
3. Cancels by the calling client (TCP close/reset) are not recorded with an exception/error label but are reported as status="UNKNOWN", outcome="UNKNOWN", error="none". The doOnCancel source does not record the exception/error label as it did in Spring Boot 2.5.

Mixing success and UNKNOWN, and reporting incorrect counts, means that http_server_requests_seconds_count/sum/max/bucket cannot be used for alerting or monitoring; the data is wrong and causes false alerts.

The metrics spring_security_filterchains_seconds_count and reactor_netty_http_server_data_sent_time_seconds_count report correct counts and status="200" for issues 1 and 2. All metrics generated by the service, such as http_client_* and the related Netty metrics, are also correct.

For the canceled case (issue 3), requests are reported correctly in spring_security_filterchains_seconds_count for the filter spring_security_reached_filter_name="AuthorizationWebFilter", but not for the filter spring_security_reached_filter_name="ServerWebExchangeReactorContextWebFilter"; reactor_netty_http_server_data_sent_time_seconds_count also seems to report only status="200".

Expected behavior:

1. Success counts with status="200", outcome="SUCCESS", error="none" should match the request count when there are no errors; successes should not be reported as status="UNKNOWN", outcome="UNKNOWN", error="none".
2. No extra/duplicate status="UNKNOWN", outcome="UNKNOWN", error="none" metrics should be recorded without matching requests.
3. Requests canceled by the client calling the Spring application should be counted as status="UNKNOWN", outcome="UNKNOWN", exception/error="CancelledServerWebExchangeException" instead of status="UNKNOWN", outcome="UNKNOWN", exception/error="none".

Tested Versions:

SpringBoot 3.1.4 (also 3.1.2)
Spring Framework 6.x matching the Spring Boot version
Windows and Linux
Java 17
We use reactive programming and the new observation framework.

How to reproduce:

For issues 1 and 2, the simplest way is to generate requests using JMeter with parallel threads for a long enough duration, for example 20 threads with 100000 requests each, then check that the metrics report 2M successes and no UNKNOWN.
Issue 3 can be tested by generating traffic in a similar way but stopping JMeter forcefully. The metrics should report the canceled calls with an exception/error label other than "none", or as success if a specific thread was not canceled. No status="UNKNOWN", outcome="UNKNOWN", error="none" should be reported.
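
For illustration, a minimal load-generator sketch that approximates the JMeter setup (plain JDK HttpClient; the endpoint URL, port and request counts are assumptions and should be scaled up to match the scenario above):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class LoadGenerator {

    public static void main(String[] args) throws InterruptedException {
        URI uri = URI.create("http://localhost:8082/customers/"); // assumed endpoint
        HttpClient client = HttpClient.newHttpClient();
        ExecutorService pool = Executors.newFixedThreadPool(20);
        for (int thread = 0; thread < 20; thread++) {
            pool.submit(() -> {
                for (int i = 0; i < 1000; i++) {
                    try {
                        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
                        client.send(request, HttpResponse.BodyHandlers.discarding());
                    }
                    catch (Exception ex) {
                        // Ignore individual failures; only the totals are compared afterwards.
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
        // Afterwards, compare the sum of http_server_requests_seconds_count samples
        // against the 20 000 requests issued here.
    }
}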

Detailed description:

In our testing we noticed that the "http_server_requests_seconds_count" metric is reported incorrectly and the results are misleading.

For example, if we make 1 request (warmup after restart) plus 2000 requests (20 threads, 100 times) from JMeter, we expect the metrics to report 2001 success responses. JMeter reports 2001 success responses as expected.

However, http_server_requests_seconds_count reports fewer than 2001 successes:

http_server_requests_seconds_count{app="service-v3",cid="",error="none",exception="none",host="hostx",method="GET",outcome="SUCCESS",springBoot="3.1.4",status="200",uri="/customers/",ver="3.1.10-SNAPSHOT",} 1868.0

And we see one or more status="UNKNOWN", outcome="UNKNOWN"

http_server_requests_seconds_count{app="service-v3",cid="",error="none",exception="none",host="hostx",method="GET",outcome="UNKNOWN",springBoot="3.1.4",status="UNKNOWN",uri="/customers/",ver="3.1.10-SNAPSHOT",} 134.0

The total count might even be larger than the request count; in our case we got 2002 responses altogether when we expected 2001 successes.

JMeter uses a connection pool and closes the 20 connections only after the test.

If we use a single thread the issue is not seen. It looks like something in Spring Framework is generating false doOnCancel signals.

There is an identical issue filed against Spring Boot, "Spring Boot (version 3.1.2) has UNKNOWN status in prometheus metrics", but apparently it is not progressing because it is a Framework issue, not a Spring Boot issue.

Other metrics in the system correctly report 2001 requests.
Spring Security and Reactor Netty metrics show 2001 requests correctly, as does the HTTP client:
spring_security_filterchains_seconds_count{app="service-v3",cid="",error="none",host="hostx",spring_security_filterchain_position="9",spring_security_filterchain_size="9",spring_security_reached_filter_name="AuthorizationWebFilter",spring_security_reached_filter_section="before",springBoot="3.1.4",ver="3.1.10-SNAPSHOT",} 2001.0
http_client_requests_seconds_count{app="service-v3",cid="",client_name="localhost",clientName="localhost",error="none",exception="none",host="hostx",method="GET",outcome="SUCCESS",springBoot="3.1.4",status="200",uri="/customers/{customer-id}",ver="3.1.10-SNAPSHOT",} 2001.0
reactor_netty_http_server_data_received_time_seconds_count{app="service-v3",cid="",host="hostx",method="GET",springBoot="3.1.4",uri="/",ver="3.1.10-SNAPSHOT",} 2001.0
reactor_netty_http_server_data_sent_time_seconds_count{app="service-v3",cid="",host="hostx",method="GET",springBoot="3.1.4",status="200",uri="/",ver="3.1.10-SNAPSHOT",} 2001.0

It seems that the issue happens in the ServerHttpObservationFilter filter, where doOnCancel gets triggered randomly, resulting in SUCCESS being reported as UNKNOWN:


	private static final Set<String> DISCONNECTED_CLIENT_EXCEPTIONS = Set.of("AbortedException",
			"ClientAbortException", "EOFException", "EofException");

	private Publisher<Void> filter(ServerWebExchange exchange, ServerRequestObservationContext observationContext, Mono<Void> call) {
		Observation observation = ServerHttpObservationDocumentation.HTTP_REACTIVE_SERVER_REQUESTS.observation(this.observationConvention,
				DEFAULT_OBSERVATION_CONVENTION, () -> observationContext, this.observationRegistry);
		observation.start();
		return call.doOnEach(signal -> {
					Throwable throwable = signal.getThrowable();
					if (throwable != null) {
						if (DISCONNECTED_CLIENT_EXCEPTIONS.contains(throwable.getClass().getSimpleName())) {
							observationContext.setConnectionAborted(true);
						}
						observationContext.setError(throwable);
					}
					onTerminalSignal(observation, exchange);
				})
				.doOnCancel(() -> {
					observationContext.setConnectionAborted(true);
					observation.stop();
				})
				.contextWrite(context -> context.put(ObservationThreadLocalAccessor.KEY, observation));
	}

In addition, we think there is a regression in exception/error label handling: doOnCancel does not set the exception/error label values.

If we generate a client error by forcefully canceling JMeter during the test (pressing stop; JMeter closes the socket or sends a reset while waiting for the response), Spring Boot 2.5 onwards consistently reported
status="200", outcome="UNKNOWN", exception="CancelledServerWebExchangeException".

Initially, the http_server metrics in Spring Boot 2 did not implement doOnCancel at all, so no metrics were seen for cancels. We raised the issue and support was added in Spring Boot 2.5; the fix was apparently made in Spring Framework.
Unfortunately the status was still incorrectly "200", as there is no standard status code for a client canceling the connection (de facto 599 could be used), and the Spring developers made the unfortunate decision to leave the status at 200.

Now in 3.1 it looks like client cancels are not detected at all, or only very seldom, and cancel errors are reported just like the incorrectly reported "UNKNOWN" ones.
I can see that cancels end up being handled in the ServerHttpObservationFilter filter as doOnCancel, but now in 3.1 no exception/error is set.

Actually, the ServerHttpObservationFilter filter has code which should catch DISCONNECTED_CLIENT_EXCEPTIONS, but apparently the lower-level stack emits a cancel signal and doOnCancel is triggered instead:

http_server_requests_seconds_count{app="service-v3",cid="",error="none",exception="none",host="hostx",method="GET",outcome="UNKNOWN",springBoot="3.1.4",status="UNKNOWN",uri="/customers/",ver="3.1.10-SNAPSHOT",}

If the service can respond fast when the request is canceled, we sometimes get these metrics, but very seldom. AbortedException probably means that the cancel is propagated, but it does not happen consistently. Sometimes the uri also changes to "UNKNOWN":

http_server_requests_seconds_count{app="service-v3",cid="",error="none",exception="none",host="hostx",method="GET",outcome="UNKNOWN",springBoot="3.1.4",status="UNKNOWN",uri="UNKNOWN",ver="3.1.10-SNAPSHOT",} 5.0
http_server_requests_seconds_count{app="service-v3",cid="",error="AbortedException",exception="AbortedException",host="hostx",method="GET",outcome="UNKNOWN",springBoot="3.1.4",status="UNKNOWN",uri="/customers/",ver="3.1.10-SNAPSHOT",} 2.0

If the service uses an HTTP client and the client call times out, then WebClientRequestException is added correctly to the server and client metrics:
http_server_requests_seconds_count{app="service-v3",cid="",error="WebClientRequestException",exception="WebClientRequestException",host="hostx",method="GET",outcome="SERVER_ERROR",springBoot="3.1.4",status="500",uri="/customers"} 1.0
http_client_requests_seconds_count{app="dbf-customerinfo-service-v3",cid="",client_name="localhost",clientName="localhost",error="WebClientRequestException",exception="WebClientRequestException",host="hostx",method="GET",outcome="UNKNOWN",springBoot="3.1.4",status="CLIENT_ERROR",uri="/customers/{customer-id}} 1.0

It sounds like DISCONNECTED_CLIENT_EXCEPTIONS only catches failures of HTTP client calls that the service itself initiates (timeouts etc.), but not a cancel that the HTTP server receives from the calling client.

I think that "CancelledServerWebExchangeException" exception/error should somehow also propagated to "doOnCancel"

All this means that the http_server metrics are now unreliable, inaccurate and random, and thus very hard or impossible to use for tracking and correlating errors.
doOnCancel should always set exception/error to something other than "none", to differentiate cancels from successes.

spring-projects-issues added the "status: waiting-for-triage" label Oct 5, 2023
bclozel (Member) commented Oct 5, 2023

Thanks for the report, but it's not clear from this issue what the main ask is.
Should those "UNKNOWN" outcomes not happen at all? What's the expected behavior and why?

bclozel added the "status: waiting-for-feedback" label Oct 5, 2023
bclozel (Member) commented Oct 5, 2023

Note the following comment: #29720 (comment)

We are getting cancel signals from the HTTP library and we are reporting those as best as we can. The response status was indeed "200", as that is what was sent by the server. It could be that the client cancelled before receiving the response, or right after, racing with the normal closure of the channel. We can only assume "UNKNOWN" as the status since we cannot know whether the response has been received.

bclozel (Member) commented Oct 5, 2023

Anyway Spring Boot 2.3 is out of support, so we can't change anything there. I'm closing this issue as a result. Thanks!

bclozel closed this as not planned Oct 5, 2023
bclozel added the "in: web" and "status: invalid" labels and removed the "status: waiting-for-feedback" and "status: waiting-for-triage" labels Oct 5, 2023
jtorkkel (Author) commented Oct 5, 2023

Sorry, typo in the version; it should be 3.1.4.

jtorkkel changed the title from "SpringBoot 2.3.4 http_server_requests_seconds_count metrics incorrectly reports outcome=UNKNOWN for success requests and not reporting exception/errors for client cancel" to "SpringBoot 3.1.4 http_server_requests_seconds_count metrics incorrectly reports outcome=UNKNOWN for success requests and not reporting exception/errors for client cancel" Oct 5, 2023
jtorkkel (Author) commented Oct 5, 2023

Should I create a new one?

jtorkkel changed the title from "SpringBoot 3.1.4 http_server_requests_seconds_count metrics incorrectly reports outcome=UNKNOWN for success requests and not reporting exception/errors for client cancel" to "SpringFW 6 / SpringBoot 3.1.4 http_server_requests_seconds_count metrics incorrectly reports outcome=UNKNOWN for success requests and not reporting exception/errors for client cancel" Oct 5, 2023
jtorkkel (Author) commented Oct 5, 2023

I created a new one, #31367, with the correct version number.

bclozel (Member) commented Oct 5, 2023

No need to create a new one.
I don't think we'll need to reopen this issue per my comments.

jtorkkel (Author) commented Oct 5, 2023

Why not reopen it? It is the latest Spring Framework 6.x and Spring Boot 3.1.4.
And why did you close the new one as a duplicate?

jtorkkel (Author) commented Oct 5, 2023

There are 3 main issues in the http_server_requests_seconds_count/sum/max/bucket metrics in Spring Boot 3.x / Spring Framework 6.x:

  1. Successful responses are randomly counted as status="UNKNOWN", outcome="UNKNOWN" --> not success, probably due to the race mentioned in the closed issue.
  2. Random status="UNKNOWN", outcome="UNKNOWN" metrics are added that do not match the request count; these are not real metrics but some kind of duplicate doOnCancel events, maybe a race producing double doOnCancels.
  3. Cancels by the calling client (TCP close/reset) are not recorded with an exception/error label but are reported as status="UNKNOWN", outcome="UNKNOWN"; the doOnCancel source does not record the exception/error label as it did in Spring Boot 2.5.

Thus successes, client cancels and random extra false doOnCancel events contribute up to 10% of the UNKNOWN metrics.

This causes false alerts when there are errors.

Thus the http_server_requests_seconds_count/sum/max/bucket metrics are random and useless.
It is unfortunate that Spring 3.1 finally has working tracing but broken metrics.

bclozel (Member) commented Oct 6, 2023

I'll review this issue on Monday.

bucket metrics are random and useless.
This is unfortunate that finally Spring 3.1 has working tracing but broken metrics.

This is an open source project and we expect from our community constructive feedback and actionable issues.

jtorkkel (Author) commented Oct 6, 2023

Thanks for looking at the issue again.

Sorry for stating that the metrics are now useless due to this regression.

We have been successfully using Spring metrics for years; they have been super reliable and accurate and can pinpoint an issue down to single requests.
This unfortunate regression will prevent such detailed analysis because successes and errors are now mixed. I just tried to highlight the importance of these metrics.

We are just migrating from 2.7 to 3.1 and are finding this kind of "regression" late, and finding the same issues reported earlier but not progressing because they were not actionable.

It sounds like the metrics are not tested systematically, and because of that these issues are not seen in automated tests. I admit that they are not always easy to reproduce, and probably require a proper end-to-end setup or a customizable mock server, with a large number of requests and test automation. Because of that, we see this issue when analyzing production or performance environment metrics against test results; we also missed these in our unit tests.

I tried to create a detailed description of the error and detailed repro steps; due to randomness it might not happen in every test setup, so changing the setup slightly might make the issue visible.

I also linked two separate issues to the same case as I think they are related, and one of them has not been reported earlier.

But the issue is real; once again, sorry for the non-constructive feedback.

snicoll added the "status: waiting-for-triage" label and removed the "status: invalid" label Oct 6, 2023
bclozel (Member) commented Oct 9, 2023

  1. success response are randomly counted as status="UNKNOWN", outcome="UNKNOWN" --> not success, probably due race mentioned in closed issue

After this comment #29720 (comment), I'm actually tempted to change our instrumentation and ignore cancellations if the response is already committed. In this case, it is very likely that the response was sent but the client closed the connection before we had a chance to complete the exchange. I've opened #31387 as an enhancement, as we can't change this behavior in a maintenance release.

It's not really possible to know for sure here, since we're setting status "UNKNOWN" if the client has disconnected. This one looks strange and I think this is a bug, I've opened #31388 (please check that one and let me know if the described behavior sounds sane).

  2. random status="UNKNOWN", outcome="UNKNOWN" metrics added that do not match the request count; those are not real metrics but some kind of duplicate doOnCancel events, maybe a race producing double doOnCancels

I don't think those are duplicate calls to doOnCancel (as in #22952), as Reactor guarantees this and we would have gotten lots of reports if it published duplicate signals. Also, this would mean that we're closing the same observation twice and publishing it twice, which is also guarded by Micrometer. Still, you can verify that by creating your own WebFilter like this:

public class CustomWebFilter implements WebFilter {

	@Override
	public Mono<Void> filter(ServerWebExchange exchange, WebFilterChain chain) {
		return chain.filter(exchange).log();
	}
}

and check whether you have consistent pairs of onSubscribe and cancel. Lots of clients and load-testing tools issue lots of requests and don't record them (they just drop them) once a limit is reached or when concurrency is too high.
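
For reference, a minimal sketch of how such a filter could be wired in (assuming a standard Spring Boot WebFlux application; the configuration class and bean names are hypothetical — WebFlux applies any WebFilter bean automatically):

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.server.WebFilter;

@Configuration
public class SignalLoggingConfig {

	// Any WebFilter bean is applied to every exchange, so this is enough to
	// log the Reactor signals (onSubscribe/onComplete/cancel) per request.
	@Bean
	public WebFilter signalLoggingWebFilter() {
		return new CustomWebFilter();
	}
}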

  3. calling client cancels (TCP close/reset) are not recorded with an exception/error label but reported as status="UNKNOWN", outcome="UNKNOWN"; the doOnCancel source does not record the exception/error label as in Spring Boot 2.5.
    Thus successes, client cancels and random extra false doOnCancel events contribute up to 10% of the unknown metrics.

Issues created above could improve the situation here. We changed things from Spring Boot 2.5 since the reported exceptions were either completely artificial (and not thrown anywhere in the code really), or they could change at any time depending on the server implementation. As for the "outcome" tag, we thought about "CANCELLED" but went with UNKNOWN at the time.

You can check whether improvements made here will be helpful by setting up a custom convention on the filter, like this one:

public class CustomServerObservationConvention extends DefaultServerRequestObservationConvention {

	private static final KeyValue HTTP_OUTCOME_UNKNOWN = KeyValue.of(ServerHttpObservationDocumentation.LowCardinalityKeyNames.OUTCOME, "UNKNOWN");

	@Override
	protected KeyValue status(ServerRequestObservationContext context) {
		return (context.getResponse() != null && context.getResponse().getStatusCode() != null) ?
				KeyValue.of(ServerHttpObservationDocumentation.LowCardinalityKeyNames.STATUS, Integer.toString(context.getResponse().getStatusCode().value())) :
				KeyValue.of(ServerHttpObservationDocumentation.LowCardinalityKeyNames.STATUS, "UNKNOWN");
	}

	@Override
	protected KeyValue outcome(ServerRequestObservationContext context) {
		if (context.isConnectionAborted() && !context.getResponse().isCommitted()) {
			return KeyValue.of(LowCardinalityKeyNames.OUTCOME, "SUCCESS");
		}
		return super.outcome(context);
	}

}
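
A sketch of how the custom convention might be plugged in (this assumes Spring Boot's WebFlux observation auto-configuration picks up a ServerRequestObservationConvention bean, which I believe it does in 3.1; otherwise the convention can be passed to the ServerHttpObservationFilter constructor directly — the configuration class name is hypothetical):

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.http.server.reactive.observation.ServerRequestObservationConvention;

@Configuration
public class ObservationConventionConfig {

	// Exposed as a bean so the auto-configured ServerHttpObservationFilter uses it
	// instead of the DefaultServerRequestObservationConvention.
	@Bean
	public ServerRequestObservationConvention serverRequestObservationConvention() {
		return new CustomServerObservationConvention();
	}
}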

Let us know:

  1. whether the custom WebFilter detects duplicate cancel signals
  2. whether the custom convention helps with the metrics

bclozel added the "status: waiting-for-feedback" label Oct 10, 2023
jtorkkel (Author) commented:

Thanks,

I tested CustomServerObservationConvention

	@Override
	protected KeyValue status(ServerRequestObservationContext context) {
		return (context.getResponse() != null && context.getResponse().getStatusCode() != null) ?
				KeyValue.of(ServerHttpObservationDocumentation.LowCardinalityKeyNames.STATUS, Integer.toString(context.getResponse().getStatusCode().value())) :
				KeyValue.of(ServerHttpObservationDocumentation.LowCardinalityKeyNames.STATUS, "UNKNOWN");
	}

With 20000 requests (20 threads, 1000 times) in 20 seconds, I can see 20013 http_server metrics recorded (and, correctly, 20000 http_client metrics).

Now status="UNKNOWN" --> "200", but outcome is still "UNKNOWN", since only the status is overwritten by CustomServerObservationConvention.

The total count has 13 extra entries; I do not understand where they come from, and the uri is correct. Something must be counted twice.

http_server_requests_seconds_count{app="service-v3",cid="",error="none",exception="none",host="localcast",method="GET",outcome="UNKNOWN",springBoot="3.1.4",status="200",uri="/customers"} 1479.0
http_server_requests_seconds_count{app="service-v3",cid="",error="none",exception="none",host="localhost",method="GET",outcome="SUCCESS",springBoot="3.1.4",status="200",uri="/customers"} 18534.0

http_client_requests_seconds_count{app="dbf-customerinfo-service-v3",cid="",client_name="localhost",clientName="localhost",error="none",exception="none",host="FINPWM18138472",method="GET",outcome="SUCCESS",springBoot="3.1.4",status="200",uri="/customers/{customer-id}/info",ver="3.1.10-SNAPSHOT",} 20000.0

I did not notice any logs from CustomWebFilter; should I see something?

spring-projects-issues added the "status: feedback-provided" label and removed the "status: waiting-for-feedback" label Oct 10, 2023
jtorkkel (Author) commented Oct 11, 2023

Update: CustomWebFilter now works.

Test 1
20 threads, 50 requests each:

3022 reactor log events, 2021 parallel, 1001 nio:

1001 request(unbounded), parallel 1000, 1 http-nio
1001 onSubscribe([Fuseable] MonoFlatMap.FlatMapMain), parallel 1000, 1 http-nio
938 onComplete (), 917 nio, parallel 21
82 cancel, always http-nio

Summary:

  • One extra http_server reactor event and metrics entry, coming after the test, from nio (1001).
  • Sometimes there is both onComplete and cancel, sometimes only cancel, mostly only onComplete. The metric is missing the status when there is a cancel.

The HTTP server records one extra metric.
8% of the metrics have status="UNKNOWN", outcome="UNKNOWN" instead of status="200", outcome="SUCCESS".
The provided CustomServerObservationConvention status override can turn status="UNKNOWN" into "200", leaving OUTCOME still at "UNKNOWN"; the same but more complex workaround could perhaps be applied to OUTCOME.

http_server_requests_seconds_count{app="dbf-customerinfo-service-v3",cid="",error="none",exception="none",host="localhost",method="GET",outcome="UNKNOWN",springBoot="3.1.4",status="200",uri="/customers"} 84.0
http_server_requests_seconds_count{app="dbf-customerinfo-service-v3",cid="",error="none",exception="none",host="localhost",method="GET",outcome="SUCCESS",springBoot="3.1.4",status="200",uri="/customers"} 917.0
reactor_netty_http_server_data_sent_time_seconds_count{app="dbf-customerinfo-service-v3",cid="",host="localhost",method="GET",springBoot="3.1.4",status="200",uri="/"} 1000.0

Up to 20 parallel requests

2023-10-11T12:49:38.115+03:00  INFO service-v3,,] 17788 --- [     parallel-4] reactor.Mono.Defer.6                     : onSubscribe([Fuseable] MonoFlatMap.FlatMapMain)
2023-10-11T12:49:38.115+03:00  INFO service-v3,,] 17788 --- [     parallel-5] reactor.Mono.Defer.3                     : onSubscribe([Fuseable] MonoFlatMap.FlatMapMain)
2023-10-11T12:49:38.115+03:00  INFO service-v3,,] 17788 --- [     parallel-1] reactor.Mono.Defer.2                     : onSubscribe([Fuseable] MonoFlatMap.FlatMapMain)
2023-10-11T12:49:38.115+03:00  INFO service-v3,,] 17788 --- [     parallel-3] reactor.Mono.Defer.1                     : onSubscribe([Fuseable] MonoFlatMap.FlatMapMain)

2023-10-11T12:49:38.118+03:00  INFO service-v3,,] 17788 --- [     parallel-5] reactor.Mono.Defer.3                     : request(unbounded)
2023-10-11T12:49:38.118+03:00  INFO service-v3,,] 17788 --- [     parallel-1] reactor.Mono.Defer.2                     : request(unbounded)
2023-10-11T12:49:38.118+03:00  INFO service-v3,,] 17788 --- [     parallel-4] reactor.Mono.Defer.6                     : request(unbounded)
2023-10-11T12:49:38.118+03:00  INFO service-v3,,] 17788 --- [     parallel-3] reactor.Mono.Defer.1                     : request(unbounded)

2023-10-11T12:49:38.273+03:00  INFO service-v3,,] 17788 --- [     parallel-3] reactor.Mono.Defer.20                    : onSubscribe([Fuseable] MonoFlatMap.FlatMapMain)
2023-10-11T12:49:38.273+03:00  INFO service-v3,,] 17788 --- [     parallel-3] reactor.Mono.Defer.20                    : request(unbounded)

And gradually completing; note the 1 s delay because the first requests warm up the JIT:
2023-10-11T12:49:39.683+03:00  INFO service-v3,,] 17788 --- [ctor-http-nio-5] reactor.Mono.Defer.2                     : onComplete()
2023-10-11T12:49:39.683+03:00  INFO service-v3,,] 17788 --- [tor-http-nio-12] reactor.Mono.Defer.3                     : onComplete()
2023-10-11T12:49:39.683+03:00  INFO service-v3,,] 17788 --- [ctor-http-nio-6] reactor.Mono.Defer.13                    : onComplete()

The one extra seems to be the last one, nio (1001), coming 50 s late!
It is not coming from an external request; the other metrics do not list extra requests.

2023-10-11T12:49:42.571+03:00  INFO service-v3,,] 17788 --- [    parallel-11] reactor.Mono.Defer.999                   : onSubscribe([Fuseable] MonoFlatMap.FlatMapMain)
2023-10-11T12:49:42.571+03:00  INFO service-v3,,] 17788 --- [    parallel-11] reactor.Mono.Defer.999                   : request(unbounded)
2023-10-11T12:49:42.579+03:00  INFO service-v3,,] 17788 --- [     parallel-1] reactor.Mono.Defer.1000                  : onSubscribe([Fuseable] MonoFlatMap.FlatMapMain)
2023-10-11T12:49:42.579+03:00  INFO service-v3,,] 17788 --- [     parallel-1] reactor.Mono.Defer.1000                  : request(unbounded)
2023-10-11T12:49:42.585+03:00  INFO service-v3,,] 17788 --- [ctor-http-nio-8] reactor.Mono.Defer.999                   : onComplete()
2023-10-11T12:49:42.591+03:00  INFO service-v3,,] 17788 --- [ctor-http-nio-3] reactor.Mono.Defer.1000                  : onComplete()
2023-10-11T12:50:31.153+03:00  INFO service-v3,,] 17788 --- [ctor-http-nio-1] reactor.Mono.Defer.1001                  : onSubscribe([Fuseable] MonoFlatMap.FlatMapMain)
2023-10-11T12:50:31.153+03:00  INFO service-v3,,] 17788 --- [ctor-http-nio-1] reactor.Mono.Defer.1001                  : request(unbounded)
2023-10-11T12:50:32.055+03:00  INFO service-v3,,] 17788 --- [ctor-http-nio-1] reactor.Mono.Defer.1001                  : onComplete()

Test 2:

Running a similar test with 10x more data (20 x 5000) results in two extra http_server metrics entries and matching nio events, one in the middle and one coming after the test. When the test time is increased, these extra nio events increase.

start

2023-10-11T11:33:37.033+03:00  INFO [service-v3,,] 17528 --- [    parallel-12] reactor.Mono.Defer.12                    : onSubscribe([Fuseable] MonoFlatMap.FlatMapMain)
2023-10-11T11:33:37.033+03:00  INFO [service-v3,,] 17528 --- [    parallel-12] reactor.Mono.Defer.12                    : request(unbounded)
2023-10-11T11:33:37.128+03:00  INFO [service-v3,,] 17528 --- [tor-http-nio-11] reactor.Mono.Defer.12                    : onComplete()


**Extra request, nio 5458 instead of parallel, in the middle, seen in the metrics as an extra entry**

2023-10-11T11:34:03.843+03:00  INFO service-v3,,] 17528 --- [ctor-http-nio-6] reactor.Mono.Defer.5458                  : onSubscribe([Fuseable] MonoFlatMap.FlatMapMain)
2023-10-11T11:34:03.843+03:00  INFO service-v3,,] 17528 --- [ctor-http-nio-6] reactor.Mono.Defer.5458                  : request(unbounded)
2023-10-11T11:34:04.150+03:00  INFO [customerinfo-v3,,] 17528 --- [ctor-http-nio-6] reactor.Mono.Defer.5458                  : onComplete()

**cancel and onComplete**
2023-10-11T11:33:44.702+03:00  INFO [service-v3,,] 17528 --- [     parallel-8] reactor.Mono.Defer.1114                  : onSubscribe([Fuseable] MonoFlatMap.FlatMapMain)
2023-10-11T11:33:44.702+03:00  INFO [service-v3,,] 17528 --- [     parallel-8] reactor.Mono.Defer.1114                  : request(unbounded)
2023-10-11T11:33:44.802+03:00  INFO [service-v3,,] 17528 --- [     parallel-3] reactor.Mono.Defer.1114                  : onComplete()
2023-10-11T11:33:44.802+03:00  INFO [service-v3,,] 17528 --- [ctor-http-nio-7] reactor.Mono.Defer.1114                  : cancel()

**cancel instead of onComplete**
2023-10-11T11:34:13.573+03:00  INFO [service-v3,,] 17528 --- [     parallel-6] reactor.Mono.Defer.10006                 : onSubscribe([Fuseable] MonoFlatMap.FlatMapMain)
2023-10-11T11:34:13.573+03:00  INFO [service-v3,,] 17528 --- [     parallel-6] reactor.Mono.Defer.10006                 : request(unbounded)
2023-10-11T11:34:13.588+03:00  INFO [service-v3,,] 17528 --- [ctor-http-nio-2] reactor.Mono.Defer.10006                 : cancel()

**normal**
2023-10-11T11:34:13.641+03:00  INFO service-v3,,] 17528 --- [     parallel-8] reactor.Mono.Defer.10012                 : onSubscribe([Fuseable] MonoFlatMap.FlatMapMain)
2023-10-11T11:34:13.641+03:00  INFO service-v3,,] 17528 --- [     parallel-8] reactor.Mono.Defer.10012                 : request(unbounded)
2023-10-11T11:34:13.649+03:00  INFO service-v3,,] 17528 --- [ctor-http-nio-2] reactor.Mono.Defer.10012                 : onComplete()

**extra request, nio 10013 instead of parallel, 13s after last requests, seen in metrics as extra entry**
2023-10-11T11:34:26.645+03:00  INFO [service-v3,,] 17528 --- [ctor-http-nio-6] reactor.Mono.Defer.10013                 : onSubscribe([Fuseable] MonoFlatMap.FlatMapMain)
2023-10-11T11:34:26.645+03:00  INFO [service-v3,,] 17528 --- [ctor-http-nio-6] reactor.Mono.Defer.10013                 : request(unbounded)
2023-10-11T11:34:26.697+03:00  INFO [service-v3,,] 17528 --- [ctor-http-nio-6] reactor.Mono.Defer.10013                 : onComplete()

Test 3 - testing an external client cancel with the override

This override, which checks isConnectionAborted and !isCommitted, successfully turns external client cancels into outcome="CANCELED".
This should be recorded as an error/exception with a "cancel" status code. I prefer status="599" as it works without needing to use outcome in dashboards; outcome="CANCELED" is also quite good in addition.

    @Override
    protected KeyValue outcome(ServerRequestObservationContext context) {
        if (context.isConnectionAborted() && !context.getResponse().isCommitted()) {
            return KeyValue.of(ServerHttpObservationDocumentation.LowCardinalityKeyNames.OUTCOME, "CANCELED");
//            return KeyValue.of(ServerHttpObservationDocumentation.LowCardinalityKeyNames.OUTCOME, "SUCCESS");
        }
        return super.outcome(context);
    }

I will comment on the newly created issues tomorrow.

bclozel (Member) commented Oct 12, 2023

I've created #31417 which should address the main problem here.

jtorkkel (Author) commented Oct 12, 2023

Thanks again.

As said, in my tests there is no client cancel. There are 20 connections and no cancels. JMeter should use an HTTP client with a 4-minute connection timeout. I will turn on more logging to get confirmation. Unfortunately no metric shows whether new sockets are opened, the way the Netty client metrics do.

A cancel by a client like JMeter was the separate issue I reported.

I made more tests and here are my findings:
; 20000 requests, 20x1000
; 7 mx/prometheus calls during the test
; total 20007 reactor onSubscribe
; --> 18721 + 1283 = 20004 http_server metrics with url="/customers" reported, 4 extra
; 6/7 url="/mx_prometheus" reported matching requests, 2 outcome SUCCESS, 4 UNKNOWN; the browser making the mx/prometheus call received the metrics correctly.

Summary of analysis:

  1. The extra nio events were due to the Prometheus metrics queries to the /mx/prometheus endpoint. I am not sure whether some of these Prometheus calls trigger duplicate http_server metrics with labels url="/mx/prometheus" and url="/customers", because

    • the 4 extra metrics reported for the API url="/customers" match the number of url="/mx/prometheus" queries which resulted in outcome="UNKNOWN",
    • however, in later tests there were extra metrics without Prometheus reporting "UNKNOWN", or at least the counts did not match.
  2. The Spring Security filters have some problems: active requests are not completed, resulting in memory leaks (each request has a timer etc.). Based on the metrics, I suspect that context handling somehow fails in ServerWebExchangeReactorContextWebFilter, which triggers the cancel but does not close the active metrics.

    • spring_security_http_secured_requests_seconds_count is 20408, not 20000 as expected
    • spring_security_filterchains_seconds_count{spring_security_reached_filter_name="ServerWebExchangeReactorContextWebFilter",spring_security_reached_filter_section="after"} is 18723, not 20000
    • spring_security_filterchains_active_seconds_active_count{spring_security_reached_filter_section="after"} is 1276, not the expected 0 after the test.
  3. I tried removing the Spring Security metrics; this did not remove the extra metrics and UNKNOWNs.

So I suspect that the root cause of the cancels is a ServerWebExchangeReactorContextWebFilter context propagation error.

The metrics below show quite clearly that something is wrong in ServerWebExchangeReactorContextWebFilter:

http_server_requests_seconds_count{app="dbf-customerinfo-service-v3",cid="",error="none",exception="none",host="localhost",method="GET",outcome="UNKNOWN",springBoot="3.1.4",status="200",
	uri="/customers",} 1283.0
http_server_requests_seconds_count{app="dbf-customerinfo-service-v3",cid="",error="none",exception="none",host="localhost",method="GET",outcome="SUCCESS",springBoot="3.1.4",status="200",
	uri="/customers",} 18721.0
reactor_netty_http_server_data_received_time_seconds_count{app="dbf-customerinfo-service-v3",cid="",host="localhost",method="GET",springBoot="3.1.4",uri="/",} 20000.0
reactor_netty_http_server_data_received_time_seconds_count{app="dbf-customerinfo-service-v3",cid="",host="localhost",method="GET",springBoot="3.1.4",uri="/mx/",} 7.0

;7th mx/prometheus query, the 7th response not sent while metrics sent
reactor_netty_http_server_data_sent_time_seconds_count{app="dbf-customerinfo-service-v3",cid="",host="localhost",method="GET",springBoot="3.1.4",status="200",uri="/mx/",} 6.0
reactor_netty_http_server_data_sent_time_seconds_count{app="dbf-customerinfo-service-v3",cid="",host="localhost",method="GET",springBoot="3.1.4",status="200",uri="/",} 20000.0

;7th mx/prometheus query, the 7th response not sent while metrics sent
mx_http_server_requests_seconds_count{app="dbf-customerinfo-service-v3",cid="",error="none",exception="none",host="localhost",method="GET",outcome="SUCCESS",springBoot="3.1.4",status="200",
	uri="/mx/prometheus",} 6.0

http_server_requests_seconds_count{app="service-v3",cid="",error="none",exception="none",host="localhost",method="GET",outcome="UNKNOWN",springBoot="3.1.4",status="200",uri="/mx/prometheus",} 4.0
http_server_requests_seconds_count{app="service-v3",cid="",error="none",exception="none",host="localhost",method="GET",outcome="SUCCESS",springBoot="3.1.4",status="200",uri="/mx/prometheus",} 2.0

spring_security_http_secured_requests_seconds_count{app="dbf-customerinfo-service-v3",cid="",error="none",host="localhost",springBoot="3.1.4",} 20408.0
spring_security_http_unsecured_requests_seconds_count{app="dbf-customerinfo-service-v3",cid="",error="none",host="localhost",springBoot="3.1.4",} 6.0

spring_security_filterchains_seconds_count{app="dbf-customerinfo-service-v3",cid="",error="none",host="localhost",spring_security_filterchain_position="0",spring_security_filterchain_size="0",
	spring_security_reached_filter_name="none",spring_security_reached_filter_section="after",springBoot="3.1.4",} 1.0
spring_security_filterchains_seconds_count{app="dbf-customerinfo-service-v3",cid="",error="none",host="localhost",spring_security_filterchain_position="9",spring_security_filterchain_size="9",
	spring_security_reached_filter_name="AuthorizationWebFilter",spring_security_reached_filter_section="before",springBoot="3.1.4",} 20000.0
spring_security_filterchains_seconds_count{app="dbf-customerinfo-service-v3",cid="",error="none",host="localhost",spring_security_filterchain_position="9",spring_security_filterchain_size="9",
	spring_security_reached_filter_name="ServerWebExchangeReactorContextWebFilter",spring_security_reached_filter_section="after",springBoot="3.1.4",} 18723.0

spring_security_filterchains_active_seconds_active_count{app="dbf-customerinfo-service-v3",cid="",host="localhost",spring_security_filterchain_position="0",spring_security_filterchain_size="0",
	spring_security_reached_filter_name="none",spring_security_reached_filter_section="before",springBoot="3.1.4",} 0.0
spring_security_filterchains_active_seconds_active_count{app="dbf-customerinfo-service-v3",cid="",host="localhost",spring_security_filterchain_position="0",spring_security_filterchain_size="0",
	spring_security_reached_filter_name="none",spring_security_reached_filter_section="after",springBoot="3.1.4",} 1276.0
; IJ console log showing only requests from 8810 onwards
2023-10-12T11:17:48.599+03:00  INFO [service-v3,,] 14056 --- [     parallel-1] reactor.Mono.Defer.8810                  : onSubscribe([Fuseable] MonoFlatMap.FlatMapMain)
2023-10-12T11:17:48.599+03:00  INFO [service-v3,,] 14056 --- [     parallel-1] reactor.Mono.Defer.8810                  : request(unbounded)
2023-10-12T11:17:48.637+03:00  INFO [service-v3,,] 14056 --- [tor-http-nio-10] reactor.Mono.Defer.8810                  : onComplete()

; 2nd mx/prometheus
2023-10-12T11:18:00.449+03:00  INFO [service-v3,,] 14056 --- [ctor-http-nio-1] reactor.Mono.Defer.13799                 : onSubscribe([Fuseable] MonoFlatMap.FlatMapMain)
2023-10-12T11:18:00.449+03:00  INFO [service-v3,,] 14056 --- [ctor-http-nio-1] reactor.Mono.Defer.13799                 : request(unbounded)
2023-10-12T11:18:00.608+03:00  INFO [service-v3,,] 14056 --- [ctor-http-nio-1] reactor.Mono.Defer.13799                 : onComplete()

; 3rd mx/prometheus
2023-10-12T11:18:05.922+03:00  INFO [service-v3,,] 14056 --- [ctor-http-nio-1] reactor.Mono.Defer.16137                 : onSubscribe([Fuseable] MonoFlatMap.FlatMapMain)
2023-10-12T11:18:05.922+03:00  INFO [service-v3,,] 14056 --- [ctor-http-nio-1] reactor.Mono.Defer.16137                 : request(unbounded)
2023-10-12T11:18:06.113+03:00  INFO [service-v3,,] 14056 --- [ctor-http-nio-1] reactor.Mono.Defer.16137                 : onComplete()

; 4th mx/prometheus
2023-10-12T11:18:10.398+03:00  INFO [service-v3,,] 14056 --- [ctor-http-nio-1] reactor.Mono.Defer.17739                 : onSubscribe([Fuseable] MonoFlatMap.FlatMapMain)
2023-10-12T11:18:10.398+03:00  INFO [service-v3,,] 14056 --- [ctor-http-nio-1] reactor.Mono.Defer.17739                 : request(unbounded)
2023-10-12T11:18:10.524+03:00  INFO [service-v3,,] 14056 --- [ctor-http-nio-1] reactor.Mono.Defer.17739                 : onComplete()

; 5th mx/prometheus
2023-10-12T11:18:14.605+03:00  INFO [service-v3,,] 14056 --- [ctor-http-nio-1] reactor.Mono.Defer.19282                 : onSubscribe([Fuseable] MonoFlatMap.FlatMapMain)
2023-10-12T11:18:14.605+03:00  INFO [service-v3,,] 14056 --- [ctor-http-nio-1] reactor.Mono.Defer.19282                 : request(unbounded)
2023-10-12T11:18:14.807+03:00  INFO [service-v3,,] 14056 --- [ctor-http-nio-1] reactor.Mono.Defer.19282                 : onComplete()

; 20000 requests
2023-10-12T11:18:16.874+03:00  INFO [service-v3,,] 14056 --- [     parallel-8] reactor.Mono.Defer.20005                 : onSubscribe([Fuseable] MonoFlatMap.FlatMapMain)
2023-10-12T11:18:16.874+03:00  INFO [service-v3,,] 14056 --- [     parallel-8] reactor.Mono.Defer.20005                 : request(unbounded)
2023-10-12T11:18:16.883+03:00  INFO [service-v3,,] 14056 --- [tor-http-nio-12] reactor.Mono.Defer.20005                 : onComplete()

; 6th mx/prometheus
2023-10-12T11:18:19.783+03:00  INFO [service-v3,,] 14056 --- [ctor-http-nio-1] reactor.Mono.Defer.20006                 : onSubscribe([Fuseable] MonoFlatMap.FlatMapMain)
2023-10-12T11:18:19.783+03:00  INFO [service-v3,,] 14056 --- [ctor-http-nio-1] reactor.Mono.Defer.20006                 : request(unbounded)
2023-10-12T11:18:19.846+03:00  INFO [service-v3,,] 14056 --- [ctor-http-nio-1] reactor.Mono.Defer.20006                 : onComplete()

; 7th mx/prometheus
2023-10-12T11:18:28.796+03:00  INFO [service-v3,,] 14056 --- [ctor-http-nio-1] reactor.Mono.Defer.20007                 : onSubscribe([Fuseable] MonoFlatMap.FlatMapMain)
2023-10-12T11:18:28.796+03:00  INFO [service-v3,,] 14056 --- [ctor-http-nio-1] reactor.Mono.Defer.20007                 : request(unbounded)
2023-10-12T11:18:28.854+03:00  INFO [service-v3,,] 14056 --- [ctor-http-nio-1] reactor.Mono.Defer.20007                 : onComplete()

bclozel (Member) commented Oct 12, 2023

Thanks for the update.

With #31417, we are going to fix the root cause of those "extra" observations being recorded. The frequency of those in my tests (and in yours it seems) is approx 1/5000, so quite far from the 10% you were mentioning initially.

With #31388, the "status" tag will not be "UNKNOWN" as much as it was before.

As for the "CANCELED" outcome or the 599 response status, I'm afraid these are opinions we can't apply to the default convention here. A custom convention like you applied is the right way to go in my opinion.

With that being said, I don't see anything remaining in this issue here.

The ServerWebExchangeReactorContextWebFilter concerns should be reported to the Spring Security project directly, as there's nothing actionable from my side.

jtorkkel (Author) commented Oct 12, 2023

Thanks,
Understood that ServerWebExchangeReactorContextWebFilter is maintained in another project.

We need to investigate ServerWebExchangeReactorContextWebFilter, which in our case is probably the reason for the 5-10% ratio. We also have some custom MDC, but after removing it we still have the issue.

I can confirm that JMeter in my test uses only 20 connections and successfully reuses them; the sockets are only closed after the test completes.

I made a short test with 20 threads, 5 requests each, with
logging.level.reactor.netty.http.server.HttpServerOperations=TRACE

; first, 20 sockets opened; the server/service runs on port 8082
2023-10-12T14:17:22.794+03:00 DEBUG [service-v3,,] 12136 --- [tor-http-nio-10] r.n.http.server.HttpServerOperations     : [0377f1e4, L:/127.0.0.1:8082 - R:/127.0.0.1:45205] New http connection, requesting read
2023-10-12T14:17:22.794+03:00 DEBUG [service-v3,,] 12136 --- [ctor-http-nio-9] r.n.http.server.HttpServerOperations     : [b75fe397, L:/127.0.0.1:8082 - R:/127.0.0.1:45204] New http connection, requesting read
2023-10-12T14:17:22.794+03:00 DEBUG [service-v3,,] 12136 --- [ctor-http-nio-8] r.n.http.server.HttpServerOperations     : [33422bfe, L:/127.0.0.1:8082 - R:/127.0.0.1:45203] New http connection, requesting read
2023-10-12T14:17:22.799+03:00 DEBUG [service-v3,,] 12136 --- [tor-http-nio-11] r.n.http.server.HttpServerOperations     : [f2f47f15, L:/127.0.0.1:8082 - R:/127.0.0.1:45206] New http connection, requesting read
2023-10-12T14:17:22.799+03:00 DEBUG [service-v3,,] 12136 --- [tor-http-nio-12] r.n.http.server.HttpServerOperations     : [a8b70624, L:/127.0.0.1:8082 - R:/127.0.0.1:45207] New http connection, requesting read
2023-10-12T14:17:22.799+03:00 DEBUG [service-v3,,] 12136 --- [ctor-http-nio-2] r.n.http.server.HttpServerOperations     : [3543eb4d, L:/127.0.0.1:8082 - R:/127.0.0.1:45209] New http connection, requesting read
2023-10-12T14:17:22.799+03:00 DEBUG [service-v3,,] 12136 --- [ctor-http-nio-1] r.n.http.server.HttpServerOperations     : [b4d22020, L:/127.0.0.1:8082 - R:/127.0.0.1:45208] New http connection, requesting read
2023-10-12T14:17:22.799+03:00 DEBUG [service-v3,,] 12136 --- [ctor-http-nio-3] r.n.http.server.HttpServerOperations     : [3b686af4, L:/127.0.0.1:8082 - R:/127.0.0.1:45210] New http connection, requesting read
2023-10-12T14:17:22.800+03:00 DEBUG [service-v3,,] 12136 --- [ctor-http-nio-4] r.n.http.server.HttpServerOperations     : [c79f1e5e, L:/127.0.0.1:8082 - R:/127.0.0.1:45211] New http connection, requesting read
2023-10-12T14:17:22.804+03:00 DEBUG [service-v3,,] 12136 --- [ctor-http-nio-5] r.n.http.server.HttpServerOperations     : [8afe348f, L:/127.0.0.1:8082 - R:/127.0.0.1:45212] New http connection, requesting read
2023-10-12T14:17:22.804+03:00 DEBUG [service-v3,,] 12136 --- [ctor-http-nio-6] r.n.http.server.HttpServerOperations     : [5f5af9da, L:/127.0.0.1:8082 - R:/127.0.0.1:45213] New http connection, requesting read
2023-10-12T14:17:22.806+03:00 DEBUG [service-v3,,] 12136 --- [ctor-http-nio-7] r.n.http.server.HttpServerOperations     : [12cf02c9, L:/127.0.0.1:8082 - R:/127.0.0.1:45214] New http connection, requesting read
2023-10-12T14:17:22.806+03:00 DEBUG [service-v3,,] 12136 --- [ctor-http-nio-8] r.n.http.server.HttpServerOperations     : [a5c1ac6e, L:/127.0.0.1:8082 - R:/127.0.0.1:45215] New http connection, requesting read
2023-10-12T14:17:22.836+03:00 DEBUG [service-v3,,] 12136 --- [ctor-http-nio-2] r.n.http.server.HttpServerOperations     : [d084560f, L:/127.0.0.1:8082 - R:/127.0.0.1:45221] New http connection, requesting read
2023-10-12T14:17:22.843+03:00 DEBUG [service-v3,,] 12136 --- [tor-http-nio-10] r.n.http.server.HttpServerOperations     : [2d6fa272, L:/127.0.0.1:8082 - R:/127.0.0.1:45217] New http connection, requesting read
2023-10-12T14:17:22.844+03:00 DEBUG [service-v3,,] 12136 --- [tor-http-nio-11] r.n.http.server.HttpServerOperations     : [399a5686, L:/127.0.0.1:8082 - R:/127.0.0.1:45218] New http connection, requesting read
2023-10-12T14:17:22.844+03:00 DEBUG [service-v3,,] 12136 --- [tor-http-nio-12] r.n.http.server.HttpServerOperations     : [135887eb, L:/127.0.0.1:8082 - R:/127.0.0.1:45219] New http connection, requesting read
2023-10-12T14:17:22.844+03:00 DEBUG [service-v3,,] 12136 --- [ctor-http-nio-1] r.n.http.server.HttpServerOperations     : [b5c663f7, L:/127.0.0.1:8082 - R:/127.0.0.1:45220] New http connection, requesting read
2023-10-12T14:17:22.845+03:00 DEBUG [service-v3,,] 12136 --- [ctor-http-nio-9] r.n.http.server.HttpServerOperations     : [36089395, L:/127.0.0.1:8082 - R:/127.0.0.1:45216] New http connection, requesting read
2023-10-12T14:17:22.850+03:00 DEBUG [service-v3,,] 12136 --- [ctor-http-nio-3] r.n.http.server.HttpServerOperations     : [e20216a0, L:/127.0.0.1:8082 - R:/127.0.0.1:45222] New http connection, requesting read

; then each socket is reused 5 times; socket 45205 shown here

2023-10-12T14:17:22.948+03:00 DEBUG [service-v3,,] 12136 --- [tor-http-nio-10] r.n.http.server.HttpServerOperations     : [0377f1e4-1, L:/127.0.0.1:8082 - R:/127.0.0.1:45205] Decreasing pending responses, now 0
2023-10-12T14:17:22.949+03:00 DEBUG [service-v3,,] 12136 --- [tor-http-nio-10] r.n.http.server.HttpServerOperations     : [0377f1e4-1, L:/127.0.0.1:8082 - R:/127.0.0.1:45205] Last HTTP response frame
2023-10-12T14:17:22.949+03:00 DEBUG [service-v3,,] 12136 --- [tor-http-nio-10] r.n.http.server.HttpServerOperations     : [0377f1e4-1, L:/127.0.0.1:8082 - R:/127.0.0.1:45205] Last HTTP packet was sent, terminating the channel
2023-10-12T14:17:22.949+03:00 DEBUG [service-v3,,] 12136 --- [tor-http-nio-10] r.n.http.server.HttpServerOperations     : [0377f1e4, L:/127.0.0.1:8082 - R:/127.0.0.1:45205] Increasing pending responses, now 1
2023-10-12T14:17:22.997+03:00 DEBUG [service-v3,,] 12136 --- [tor-http-nio-10] r.n.http.server.HttpServerOperations     : [0377f1e4-2, L:/127.0.0.1:8082 - R:/127.0.0.1:45205] Decreasing pending responses, now 0
2023-10-12T14:17:22.997+03:00 DEBUG [service-v3,,] 12136 --- [tor-http-nio-10] r.n.http.server.HttpServerOperations     : [0377f1e4-2, L:/127.0.0.1:8082 - R:/127.0.0.1:45205] Last HTTP response frame
2023-10-12T14:17:22.997+03:00 DEBUG [service-v3,,] 12136 --- [tor-http-nio-10] r.n.http.server.HttpServerOperations     : [0377f1e4-2, L:/127.0.0.1:8082 - R:/127.0.0.1:45205] Last HTTP packet was sent, terminating the channel
2023-10-12T14:17:22.998+03:00 DEBUG [service-v3,,] 12136 --- [tor-http-nio-10] r.n.http.server.HttpServerOperations     : [0377f1e4, L:/127.0.0.1:8082 - R:/127.0.0.1:45205] Increasing pending responses, now 1
2023-10-12T14:17:23.033+03:00 DEBUG [service-v3,,] 12136 --- [tor-http-nio-10] r.n.http.server.HttpServerOperations     : [0377f1e4-3, L:/127.0.0.1:8082 - R:/127.0.0.1:45205] Decreasing pending responses, now 0
2023-10-12T14:17:23.033+03:00 DEBUG [service-v3,,] 12136 --- [tor-http-nio-10] r.n.http.server.HttpServerOperations     : [0377f1e4-3, L:/127.0.0.1:8082 - R:/127.0.0.1:45205] Last HTTP packet was sent, terminating the channel
2023-10-12T14:17:23.033+03:00 DEBUG [service-v3,,] 12136 --- [tor-http-nio-10] r.n.http.server.HttpServerOperations     : [0377f1e4, L:/127.0.0.1:8082 - R:/127.0.0.1:45205] Increasing pending responses, now 1
2023-10-12T14:17:23.070+03:00 DEBUG [service-v3,,] 12136 --- [tor-http-nio-10] r.n.http.server.HttpServerOperations     : [0377f1e4-4, L:/127.0.0.1:8082 - R:/127.0.0.1:45205] Decreasing pending responses, now 0
2023-10-12T14:17:23.070+03:00 DEBUG [service-v3,,] 12136 --- [tor-http-nio-10] r.n.http.server.HttpServerOperations     : [0377f1e4-4, L:/127.0.0.1:8082 - R:/127.0.0.1:45205] Last HTTP response frame
2023-10-12T14:17:23.070+03:00 DEBUG [service-v3,,] 12136 --- [tor-http-nio-10] r.n.http.server.HttpServerOperations     : [0377f1e4-4, L:/127.0.0.1:8082 - R:/127.0.0.1:45205] Last HTTP packet was sent, terminating the channel
2023-10-12T14:17:23.070+03:00 DEBUG [service-v3,,] 12136 --- [tor-http-nio-10] r.n.http.server.HttpServerOperations     : [0377f1e4, L:/127.0.0.1:8082 - R:/127.0.0.1:45205] Increasing pending responses, now 1
2023-10-12T14:17:23.113+03:00 DEBUG [service-v3,,] 12136 --- [tor-http-nio-10] r.n.http.server.HttpServerOperations     : [0377f1e4-5, L:/127.0.0.1:8082 - R:/127.0.0.1:45205] Decreasing pending responses, now 0
2023-10-12T14:17:23.114+03:00 DEBUG [service-v3,,] 12136 --- [tor-http-nio-10] r.n.http.server.HttpServerOperations     : [0377f1e4-5, L:/127.0.0.1:8082 - R:/127.0.0.1:45205] Last HTTP response frame
2023-10-12T14:17:23.114+03:00 DEBUG [service-v3,,] 12136 --- [tor-http-nio-10] r.n.http.server.HttpServerOperations     : [0377f1e4-5, L:/127.0.0.1:8082 - R:/127.0.0.1:45205] Last HTTP packet was sent, terminating the channel

bclozel (Member) commented Oct 12, 2023

Thanks, I'm closing this issue for now, we can reopen if it turns out the issues we've opened are not enough.

bclozel closed this as not planned Oct 12, 2023
bclozel added the "status: superseded" and "theme: observability" labels and removed the "status: waiting-for-triage" and "status: feedback-provided" labels Oct 12, 2023
jtorkkel (Author) commented:

Thanks,
I think we need to raise more issues; I just do not know yet where to raise them and need your feedback.

I executed longer tests in our DEV environment, 20x5000 (100 000 requests), with a connection pool used --> no eager socket close:

  • 134 status="UNKNOWN"
  • 99884 status="200"
  • thus 17 extra http_server metrics, close to what you detected.

My close-to-10% UNKNOWN rate seems to happen on Windows, which seems to generate cancels 100x more frequently.

However, a more concerning result from the DEV test and from longer 24 h tests is that
spring_security_filterchains_active_seconds_active_count{spring_security_reached_filter_section="after"} was 121 active after 100000 requests; in a perf test it grew to 8000 when a few million requests were sent.
These remain in the Micrometer registry "forever" until the service runs out of memory or is restarted; at least, I saw 8000 in our perf test and they were not cleared for 10 days.

This means that some false cancel() (or onComplete + cancel()) is not caught by the observation LongTaskTimers when the spring_security_filterchains_active post filter is applied, and thus the observation is never stopped. Maybe cancel() closes the context --> stop cannot be called!

Note that some security needs to be enabled to activate the spring_security_filterchain observations; we use a JWT token.
I also tested the service without security, over plain HTTP, and cancel() still happens with similar frequency, but naturally the spring_security_filterchains metrics are not used for those calls and thus do not leak.

I did not fully understand whether #31417 will eliminate all the false cancels() (when the socket is not closed), which is the root cause in my opinion, or just tries to minimize the issues in the http_server metrics.

The frequency seems to be VM specific and load specific, requiring at least 10 parallel request threads in my tests.
It results in extra metrics and UNKNOWNs, but also in LongTaskTimers that are never stopped.

I can see that the http_server_requests_active_seconds_active_count count goes down to zero (or 1, as the Prometheus query itself increases it to at least 1); I do not understand why it is stopped but spring_security_filterchains is not. I would expect it to run after spring_security_filterchain.

Regarding the issues you created:
I did not notice a new issue handling outcome="UNKNOWN"; should logic similar to #31388 fix the outcome as well?

    @Override
    protected KeyValue outcome(ServerRequestObservationContext context) {
        if (context.isConnectionAborted() && !context.getResponse().isCommitted()) {
            return KeyValue.of(ServerHttpObservationDocumentation.LowCardinalityKeyNames.OUTCOME, "CANCELED");
        }
        if (context.getResponse() != null && context.getResponse().getStatusCode() != null) {
            return HttpOutcome.forStatus(context.getResponse().getStatusCode());
        }
        return HTTP_OUTCOME_UNKNOWN;
    }

And I did not notice how the exception label will be fixed; it is missing, and clientCanceled is not set (in Spring Boot 2.5 the exception was set):

    @Override
    protected KeyValue exception(ServerRequestObservationContext context) {
        Throwable error = context.getError();
        if (error != null) {
            String simpleName = error.getClass().getSimpleName();
            return KeyValue.of(ServerHttpObservationDocumentation.LowCardinalityKeyNames.EXCEPTION, StringUtils.hasText(simpleName) ?
                    simpleName : error.getClass().getName());
        }
        if (context.isConnectionAborted() && !context.getResponse().isCommitted()) {
            return KeyValue.of(ServerHttpObservationDocumentation.LowCardinalityKeyNames.EXCEPTION, "clientCanceled");
        }
        return EXCEPTION_NONE;
    }

bclozel (Member) commented Oct 13, 2023

My close-to-10% UNKNOWN rate seems to happen on Windows, which seems to generate cancels 100x more frequently.

The only difference here must be the networking stack and how the connection pools are dealing with socket events.

However, a more concerning result from the DEV test and from longer 24 h tests is that
spring_security_filterchains_active_seconds_active_count{spring_security_reached_filter_section="after"} was 121 active after 100000 requests; in a perf test it grew to 8000 when a few million requests were sent.

I've already replied to that, you should create an issue in the Spring Security project.

I did not fully understand whether #31417 will eliminate all the false cancels() (when the socket is not closed), which is the root cause in my opinion, or just tries to minimize the issues in the http_server metrics.

It fixes the fact that both "COMPLETE" and "CANCEL" signals can be received, and only records the observation once. If I'm not mistaken this was the main problem reported here. This is now fixed in the SNAPSHOTs.

Regarding the issues you created:
I did not notice a new issue handling outcome="UNKNOWN"; should logic similar to #31388 fix the outcome as well?
...

I've also replied to that already:

As for the "CANCELED" outcome or the 599 response status, I'm afraid these are opinions we can't apply to the default convention here. A custom convention like you applied is the right way to go in my opinion.

As for the last request:

And I did not notice how the exception label will be fixed; it is missing, and clientCanceled is not set (in Spring Boot 2.5 the exception was set)

I don't think we should bring that back as this exception was artificial and was only confusing users in the past. You can bring that back if you wish to with a custom convention.

jtorkkel (Author) commented Oct 14, 2023

Understood, your fixes remove the extra metrics and the related UNKNOWNs.

However, the root cause of the cancel() is still open. It also happens on Linux, as the spring_security_filterchains "after" count increases, resulting in OOM or a registry issue in Micrometer unless it limits the maximum number of long-running tasks.

I overrode the MeterObservationHandler onStart and onStop.

The normal success case shows:
spring.security.filterchains "after" onStop first, followed by http.server.requests onStop.

In the failure/cancel() --> status UNKNOWN case:
http.server.requests onStop, but NO onStop for spring.security.filterchains "after", hence the leak.

It sounds like the filters are stopped in the wrong order, and applying http.server.requests onStop triggers the cancel(), resulting in status="UNKNOWN" and spring.security.filterchains "after" never being stopped; or maybe something else leaks the cancel(), resulting in the filters being stopped in random order.

The spring_security_filterchains observations are implemented in ObservationWebFilterChainDecorator --> I will raise a Spring Security issue, as it fails to catch the cancel() or ...

// The log field and the createTags(..) / getErrorValue(..) helpers (mirroring
// Micrometer's DefaultMeterObservationHandler) are omitted from this snippet.
public class CustomDefaultMeterObservationHandler implements MeterObservationHandler<Observation.Context> {
    private final MeterRegistry meterRegistry;

    public CustomDefaultMeterObservationHandler(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
    }

    @Override
    public void onStart(Observation.Context context) {
        LongTaskTimer.Sample longTaskSample = LongTaskTimer.builder(context.getName() + ".active").tags(this.createTags(context)).register(this.meterRegistry).start();
        if (context.getName().contains("spring.security.filterchains")) {
            if (context.getLowCardinalityKeyValue("spring.security.reached.filter.section").getValue().contains("after")) {
                log.error("STARTname= spring.security.filterchains, spring.security.reached.filter.section=after");
            }
        } else if (context.getName().contains("http.server.requests")) {
            log.error("STARTname=http.server.requests");
        }
        context.put(LongTaskTimer.Sample.class, longTaskSample);
        Timer.Sample sample = Timer.start(this.meterRegistry);
        context.put(Timer.Sample.class, sample);
    }

    @Override
    public void onStop(Observation.Context context) {
        List<Tag> tags = this.createTags(context);
        tags.add(Tag.of("error", this.getErrorValue(context)));

        if (context.getName().contains("spring.security.filterchains")) {
            if (context.getLowCardinalityKeyValue("spring.security.reached.filter.section").getValue().contains("after")) {
                log.error("STOPname= spring.security.filterchains, spring.security.reached.filter.section=after");
            }
        } else if (context.getName().contains("http.server.requests")) {
            log.error("STOPname=http.server.requests");
        }

        Timer.Sample sample = (Timer.Sample) context.getRequired(Timer.Sample.class);
        sample.stop(Timer.builder(context.getName()).tags(tags).register(this.meterRegistry));
        LongTaskTimer.Sample longTaskSample = (LongTaskTimer.Sample) context.getRequired(LongTaskTimer.Sample.class);
        longTaskSample.stop();
    }
}
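
For completeness, a minimal sketch of how such a handler could be registered with the observation registry using the plain Micrometer API (class and method names are hypothetical; in a Boot application the auto-configured DefaultMeterObservationHandler would normally fill this role and should be replaced rather than duplicated, otherwise meters are recorded twice):

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;
import io.micrometer.observation.ObservationRegistry;

public class ObservationHandlerSetup {

    public static ObservationRegistry observationRegistry() {
        MeterRegistry meterRegistry = new SimpleMeterRegistry();
        ObservationRegistry observationRegistry = ObservationRegistry.create();
        // Route every observation start/stop through the custom handler above.
        observationRegistry.observationConfig()
                .observationHandler(new CustomDefaultMeterObservationHandler(meterRegistry));
        return observationRegistry;
    }
}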
