Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race condition in DataBufferUtils.readAsynchronousFileChannel resulting in FD leak (Channel remains open) #25831

Closed
gitdude1458 opened this issue Sep 29, 2020 · 2 comments
Assignees
Labels
in: core Issues in core modules (aop, beans, core, context, expression) type: bug A general bug
Milestone

Comments

@gitdude1458
Copy link

Affects: 5.2.x (but probably others too)


DataBufferUtils.readAsynchronousFileChannel is prone to file descriptor leak with certain race condition.

Channel is created using method Flux.using and provided channel supplier.

Flux<DataBuffer> flux = Flux.using(channelSupplier,
				channel -> Flux.create(sink -> {
					ReadCompletionHandler handler =
							new ReadCompletionHandler(channel, sink, position, bufferFactory, bufferSize);
					sink.onCancel(handler::cancel);
					sink.onRequest(handler::request);
				}),
				channel -> {
					// Do not close channel from here, rather wait for the current read callback
					// and then complete after releasing the DataBuffer.
				});

However, the builtin Flux cleanup method isn't used here and instead it relies on sink and channel callbacks. This is probably the core of this issue, although it seems reasonable why it was done like this (to prevent DataBuffer leaks as each DataBuffer needs to be released).

Let's presume the Flux is subscribed and something like this happens:

onSubscribe([Fuseable] FluxOnAssembly.OnAssemblyConditionalSubscriber)
request(128)
onNext(PooledUnsafeDirectByteBuf(ridx: 0, widx: 4100, cap: 8192))
cancel()

This can result in following chain of events in ReadCompletionHandler:

  • completed() method is called (but not finished) with disposed=false, passing the initial condition isNotDisposed and second condition read != -1
  • cancel() method is called while having disposed = false and reading = true, which results in:
    • Channel is not closed (because of condition if (!reading.get()) {closeChannel(channel)}
    • disposed is set to true
  • read() method is called from completed handler while disposed=true, which result in NOOP (because of condition java sink.requestedFromDownstream() > 0 && isNotDisposed && reading.compareAndSet(false, true)
  • completed() method finishes

The outcome of this chain of events is that the Channel is left open and nobody ever closes it as no further callbacks are delivered (the sink is cancelled and cancel event was delivered, the channel finished last reading operation and no further operations were performed).

The race condition is inevitable because while AtomicBooleans are used to synchronize the callbacks between NIO2 and Reactor. The outcomes are driven by 2 AtomicBooleans which in this case results in non atomic synchronization of operations if certain order of operations happen across different threads.

I would be willing to help with PR, but I don't believe there is simple fix as this looks as more fundamental design flaw in this code. While it would be easy to check if disposed==true in read() method and if yes and no reading is going to happen, then close the channel, I don't believe this will cover all possible scenarios how the methods can be executed in parallel and the resulting order of operations serialized. In my opinion the Flux cleanup handler should be used to atomically set disposed to true, also checking if reading is in progress and if not, close the channel or single atomic state needs to be used instead of 2 independent booleans.

Also it's pretty fundamental part of framework as this method (readAsynchronousFileChannel) is used by the framework itself to for example serve resources, so it may be worth having bit of discussion how this should be fixed for all possible cases.

@spring-projects-issues spring-projects-issues added the status: waiting-for-triage An issue we've not yet triaged or decided on label Sep 29, 2020
@gaoyeming
Copy link

???

@rstoyanchev rstoyanchev self-assigned this Oct 15, 2020
@rstoyanchev rstoyanchev added in: core Issues in core modules (aop, beans, core, context, expression) type: bug A general bug and removed status: waiting-for-triage An issue we've not yet triaged or decided on labels Oct 15, 2020
@rstoyanchev rstoyanchev added this to the 5.2.10 milestone Oct 15, 2020
@rstoyanchev
Copy link
Contributor

rstoyanchev commented Oct 15, 2020

Thanks for bringing this up. There was indeed a race condition between a concurrent read completion and cancellation. Arguably it is also less than ideal to have any delay in closing the channel.

I've made changes to close the channel immediately which should trigger a failed read callback to release the DataBuffer, as well as consolidated the 2 AtomicBoolean flags, and made a further improvement to stay in READING mode without switching states when there is sufficient demand.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
in: core Issues in core modules (aop, beans, core, context, expression) type: bug A general bug
Projects
None yet
Development

No branches or pull requests

4 participants