JObject.implement deadlocks #1908

Closed
knopp opened this issue Jan 16, 2025 · 19 comments · Fixed by #2032

Comments

@knopp
Contributor

knopp commented Jan 16, 2025

Java code

import android.os.Handler;
import android.os.Looper;

public class Deadlock {
    public interface Delegate {
        void perform();
    }

    public static void deadlock(Delegate delegate) {
        Handler handler = new Handler(Looper.getMainLooper());
        handler.post(delegate::perform);
    }
}

Dart code

  final delegate = Deadlock$Delegate.implement($Deadlock$Delegate(perform: () {
    print('Hello');
  }));
  Deadlock.deadlock(delegate);

Java Stack trace

_invoke:-1, PortProxyBuilder (com.github.dart_lang.jni), PortProxyBuilder.java
invoke:143, PortProxyBuilder (com.github.dart_lang.jni), PortProxyBuilder.java
2 hidden frames
run:0, Deadlock$$ExternalSyntheticLambda0 (com.superlist.super_native_dialogs), D8$$SyntheticClass
8 hidden frames

Native Stack trace

[libc.so] syscall 0x000000710b8d63cc
[libdartjni.so] wait_for dartjni.h:119
[libdartjni.so] Java_com_github_dart_1lang_jni_PortProxyBuilder__1invoke dartjni.c:459

The problem is that Java_com_github_dart_1lang_jni_PortProxyBuilder__1invoke checks Dart_CurrentIsolate_DL to determine whether the call is coming from another thread, and if that returns null it sends a message on the port and waits. However, in the case of Flutter on Android, the platform thread is the isolate thread, which means it is essentially blocking the main thread. Note that Dart_CurrentIsolate_DL returns null because the isolate has already been exited by the time the callback posted to the main looper runs.

The solution that would work in the context of Flutter is to remember the thread ID alongside the isolate and, if the thread ID matches, call Dart_EnterIsolate_DL and Dart_ExitIsolate_DL around the trampoline.

Now while this works for Flutter, I'm not sure the solution is generic enough since it makes an assumption about the isolate being "pinned" to a specific thread.
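
To illustrate the blocking pattern described above without any JNI or Dart involvement, here is a minimal plain-Java simulation. It assumes a single-threaded executor as a stand-in for the main looper / isolate thread; `BlockingCallbackDemo` and `wouldDeadlock` are hypothetical names, and a timeout is used so the demo reports the hang instead of actually hanging.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Simulation only: a single-threaded executor stands in for the Android main
// looper / isolate thread. No JNI or Dart API is involved.
final class BlockingCallbackDemo {
    /** Returns true if the simulated blocking callback timed out, i.e. would deadlock. */
    static boolean wouldDeadlock(long timeoutMillis) {
        ExecutorService mainThread = Executors.newSingleThreadExecutor();
        try {
            CompletableFuture<Boolean> outcome = new CompletableFuture<>();
            mainThread.execute(() -> {
                // "Send a message to the isolate": queue work on this same thread.
                CompletableFuture<String> reply = new CompletableFuture<>();
                mainThread.execute(() -> reply.complete("Hello"));
                try {
                    // "Wait for the reply" while occupying the only thread that can produce it.
                    reply.get(timeoutMillis, TimeUnit.MILLISECONDS);
                    outcome.complete(false);
                } catch (TimeoutException e) {
                    outcome.complete(true);
                } catch (Exception e) {
                    outcome.completeExceptionally(e);
                }
            });
            return outcome.join();
        } finally {
            mainThread.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("deadlock detected: " + wouldDeadlock(200));
    }
}
```

The reply can only be produced by the thread that is waiting for it, which is the same shape as the wait_for frame in the native stack trace.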

cc @HosseinYousefi

@knopp
Contributor Author

knopp commented Jan 16, 2025

Note that I'm having the same issue using ffi with NativeCallable.isolateLocal, but because the counterpart is C code, calling Dart_EnterIsolate_DL and Dart_ExitIsolate_DL manually is not as inconvenient as having to do that in Java.

@HosseinYousefi
Member

Now while this works for Flutter, I'm not sure the solution is generic enough since it makes an assumption about the isolate being "pinned" to a specific thread.

Yes, that's why we didn't do this before. Maybe we can only do this when we detect that we're on the main isolate of a Flutter application where the thread is indeed pinned.

cc @dcharkes @liamappelbe @mkustermann for ideas.

@liamappelbe
Contributor

Note that I'm having the same issue using ffi with NativeCallable.isolateLocal, but because the counterpart is C code, calling Dart_EnterIsolate_DL and Dart_ExitIsolate_DL manually is not as inconvenient as having to do that in Java.

Can you elaborate on this? Are you seeing a deadlock with NativeCallable.isolateLocal itself, or when waiting for a response message? I wouldn't expect NativeCallable.isolateLocal to ever deadlock.

Yes, that's why we didn't do this before. Maybe we can only do this when we detect that we're on the main isolate of a Flutter application where the thread is indeed pinned.

IIUC, jnigen does blocking callbacks similarly to ffigen, and there are 2 code paths. When the callback is coming from a random thread, it sends a message to the target isolate and waits for a reply. When the callback is coming from the same thread as the target isolate, the callback is invoked synchronously. And it sounds like the issue here is that the check that decides which code path to take is a bit unreliable on flutter.

jnigen is using Dart_CurrentIsolate_DL to do this check, and ffigen is using the current thread ID. Both have issues, since as you say, we don't pin isolates to a particular thread. Maybe the best we can do atm is to check both?

  • When the callback is created, save the current isolate ID and the current thread ID
  • When the callback is invoked
    • If the isolate ID matches, or the current isolate is null but the thread ID matches, use the synchronous code path (enter the target isolate first if the current isolate is null)
    • Otherwise use the message sending code path
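
As a rough sketch of the "check both" decision above, the following plain-Java model picks a code path from the saved and current IDs. All names here (`CallbackDispatch`, `Target`, `Path`, `choose`) are hypothetical and do not come from jnigen or ffigen; the real check would live in native code.

```java
// Hypothetical model of the "check both" proposal; not jnigen or ffigen code.
final class CallbackDispatch {
    enum Path { SYNC, ENTER_THEN_SYNC, SEND_MESSAGE_AND_WAIT }

    /** IDs saved when the callback was created. */
    static final class Target {
        final long isolateId;
        final long threadId;

        Target(long isolateId, long threadId) {
            this.isolateId = isolateId;
            this.threadId = threadId;
        }
    }

    /**
     * @param currentIsolateId ID of the isolate entered on the calling thread, or null if none
     * @param currentThreadId  ID of the calling thread
     */
    static Path choose(Target target, Long currentIsolateId, long currentThreadId) {
        if (currentIsolateId != null && currentIsolateId == target.isolateId) {
            return Path.SYNC; // already inside the target isolate
        }
        if (currentIsolateId == null && currentThreadId == target.threadId) {
            return Path.ENTER_THEN_SYNC; // same thread, but the isolate must be entered first
        }
        return Path.SEND_MESSAGE_AND_WAIT; // any other caller
    }
}
```

For the reproduction in this issue, the callback runs on the creating thread with no current isolate, so this model would select `ENTER_THEN_SYNC` rather than the message path.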

Another option would be to discard the message sending code path entirely, and just enter the target isolate and invoke the callback synchronously. In fact, this is one of the NativeCallable proposals that hasn't been implemented yet:

  • When the callback is created, save the current isolate ID
  • When the callback is invoked
    • If the isolate ID matches, call the callback synchronously
    • If the current isolate is null, enter the target isolate, call the callback, then exit the target isolate
    • If the current isolate is non-null and doesn't match, save the ID and exit the isolate, enter the target isolate, call the callback, exit the target isolate, then re-enter the original isolate
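
The enter/exit ordering in this second option can be written out as a small plain-Java model that only lists the steps. `EnterExitPlan` and `plan` are hypothetical names, and the strings merely label where calls such as Dart_EnterIsolate_DL and Dart_ExitIsolate_DL would go.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model: returns the ordered steps for invoking a callback
// whose target isolate is `target`, given the caller's current isolate.
final class EnterExitPlan {
    /** @param current ID of the isolate entered on the calling thread, or null if none */
    static List<String> plan(long target, Long current) {
        if (current != null && current == target) {
            return Arrays.asList("call"); // already in the target isolate
        }
        if (current == null) {
            return Arrays.asList("enter " + target, "call", "exit " + target);
        }
        // A different isolate is entered: leave it, run the callback, then restore it.
        return Arrays.asList(
            "exit " + current, "enter " + target, "call", "exit " + target, "enter " + current);
    }
}
```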

@knopp
Contributor Author

knopp commented Jan 17, 2025

Apologies for the confusion; perhaps I shouldn't have mixed these under the same issue. The NativeCallable.isolateLocal situation is different. It does not deadlock, it just fails when Dart_CurrentIsolate_DL returns NULL. For example, consider the following on iOS, where the platform and UI threads are merged.

// this works because dart_ffi_callback is called while isolate is active
void dart_ffi_callback(void (*isolate_local_trampoline)(void)) {   
   isolate_local_trampoline();
}

// this doesn't work, even though the trampoline is invoked on the same thread, because
// the trampoline is invoked while pumping the dispatch queue and the isolate is
// no longer active.
void dart_ffi_callback(void (*isolate_local_trampoline)(void)) {   
   dispatch_async(dispatch_get_main_queue(), ^{
     // same thread, fails.
     isolate_local_trampoline();
   });
}

// This works again. This could be done automatically by the trampoline if we saved thread Id 
// with the callback metadata, but it might be the wrong thing to do if we don't know that isolate
// is always running on a particular thread (i.e. like flutter UI thread).
void dart_ffi_callback(void (*isolate_local_trampoline)(void)) {   
   Dart_Isolate isolate = Dart_CurrentIsolate_DL();
   dispatch_async(dispatch_get_main_queue(), ^{
     Dart_EnterIsolate_DL(isolate);
     isolate_local_trampoline();
     Dart_ExitIsolate_DL(isolate);
   });
}

As far as I can tell, unlike jnigen, dart ffi trampolines never block? NativeCallable.isolateLocal simply fails if the isolate thread local is not set, and NativeCallable.listener only posts to a port and does not attempt to propagate the return value, so it never blocks.

@HosseinYousefi
Member

@liamappelbe is in the process of designing a solution for this.

@HosseinYousefi
Member

Since it's going to take a while until the fix lands on Flutter stable, I'll use a workaround of only calling Dart_EnterIsolate_DL/Dart_ExitIsolate_DL when the isolate is null but the threads match and it's the main thread.

@HosseinYousefi HosseinYousefi moved this to In Progress in JNIgen tracker Feb 11, 2025
@HosseinYousefi HosseinYousefi added this to the JNI / JNIgen 2025 Q1 milestone Feb 11, 2025
@knopp
Contributor Author

knopp commented Feb 11, 2025

Which fix is meant to land on flutter stable?

@dcharkes
Collaborator

Which fix is meant to land on flutter stable?

Flutter promising that it only runs the main isolate on the platform thread, and us being able to query that, so that we can check whether it's safe to enter the isolate when we haven't currently entered it.

https://dart-review.googlesource.com/c/sdk/+/407700

@knopp
Contributor Author

knopp commented Feb 11, 2025

Ah, the isolate ownership API. Nice, I was not aware that was in the works. Can that also be used for NativeCallable.isolateLocal to enter the isolate if needed? Which would fix #1908 (comment).

@liamappelbe
Contributor

Can that also be used for NativeCallable.isolateLocal to enter the isolate if needed? Which would fix #1908 (comment).

We won't be changing how NativeCallable.isolateLocal works. We might add another NativeCallable variant in future that works that way though. One variant under development atm is related to the shared memory multithreading proposal, where the callback would create a temporary isolate that shares memory with the target isolate. Another variant we've talked about would try to enter the target isolate, and block the caller thread until it's able to do so.

copybara-service bot pushed a commit to dart-lang/sdk that referenced this issue Feb 12, 2025
go/dart-isolate-ownership-api

Change-Id: Ia778a916de3fecec9f0aa1a5c8bc9fd7dd421267
Bug: dart-lang/native#1908
TEST=runtime/vm/dart_api_impl_test.cc
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/407700
Reviewed-by: Martin Kustermann <kustermann@google.com>
Commit-Queue: Liam Appelbe <liama@google.com>
@escamoteur

We just got bitten by this while trying to run our app with a native video player on the latest Flutter version, 3.29.0. Any idea when and how this will be fixed?

@HosseinYousefi
Member

We just got bitten by this while trying to run our app with a native video player on the latest Flutter version, 3.29.0. Any idea when and how this will be fixed?

I've started working on this now. It will be fixed by the end of the week.

github-merge-queue bot pushed a commit to flutter/flutter that referenced this issue Feb 27, 2025

The isolate ownership API was [introduced
recently](https://dart-review.googlesource.com/c/sdk/+/407700) to solve
[some deadlock bugs](dart-lang/native#1908) in
native callbacks.

A native callback is a call from native code into a Dart function.
Currently all such callbacks must run that Dart function in the isolate
that created the callback (called the target isolate). The only native
callback primitives at the moment are `NativeCallable.isolateLocal`
(blocking, but must be invoked from the same thread as the target
isolate, and the target isolate must be currently entered on that
thread) and `NativeCallable.listener` (non-blocking, can be invoked from
any thread).

To build blocking callbacks that can be called from any thread, we can
use a `NativeCallable.listener`, and use a synchronization object like a
mutex or a condition variable to block until the callback is complete.
However, if we try to do this on the thread that is currently entered in
the target isolate, we will deadlock: we invoke the listener, a message
is sent to the target isolate, and we block waiting for the message to
be handled, so we never pass control flow back to the isolate to handle
the message, and never stop waiting.

To fix this deadlock, Ffigen and Jnigen both have a mechanism that
checks if we're on the target isolate's thread first:
- If the native caller is already on the same thread as the target
  isolate, and the target isolate is entered:
  - Call the Dart function directly using `NativeCallable.isolateLocal`
    or similar
- Otherwise, if the native caller is coming from a different thread:
  - Call the Dart function asynchronously using
    `NativeCallable.listener` or similar
  - Block until the callback finishes

However, this neglects the case where we're on the target isolate's
thread, but not entered into the isolate. This case happens in Flutter
when the callback is invoked from the UI thread (or the platform thread
when thread merging is enabled), and the target isolate is the root
isolate. When the native callback is invoked, the root isolate is not
entered, so we hit the second case: we send a message to the root
isolate, and block to wait for a response. Since the root isolate is
exclusively run on the UI thread, and we're blocking the UI thread, the
message will never be handled, and we deadlock.

The isolate ownership API fixes this by allowing the embedder to inform
the VM that it will run a particular isolate exclusively on a particular
thread, using `Dart_SetCurrentThreadOwnsIsolate`. Other native code can
then query that ownership using `Dart_GetCurrentThreadOwnsIsolate`. This
lets us add a third case to our conditional:

- If the native caller is on the thread that is currently entered in the
  target isolate:
  - Call the Dart function directly using `NativeCallable.isolateLocal`
    or similar
- Otherwise, if the native caller is on the thread that owns the target
  isolate:
  - Enter the target isolate
  - Call the Dart function directly using `NativeCallable.isolateLocal`
    or similar
  - Exit the target isolate
- Otherwise, the native caller is coming from an unrelated thread:
  - Call the Dart function asynchronously using
    `NativeCallable.listener` or similar
  - Block until the callback finishes
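
As a loose plain-Java analogy of the conditional above (not the actual VM, Ffigen, or Jnigen code), the sketch below checks whether the caller is already on the owning thread before deciding to post and wait. A single-threaded executor stands in for the thread that owns the isolate. Java has no counterpart to entering an isolate, so the first two cases collapse into one direct call. `OwnerAwareCallback` and `invokeBlocking` are hypothetical names.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

// Simulation only: "owner" stands in for the thread that owns the target isolate.
final class OwnerAwareCallback {
    private final ExecutorService owner = Executors.newSingleThreadExecutor();
    private final Thread ownerThread;

    OwnerAwareCallback() {
        // Record which thread the executor runs on, loosely analogous to registering ownership.
        ownerThread = CompletableFuture.supplyAsync(Thread::currentThread, owner).join();
    }

    /** Runs the callback on the owner thread and returns its result, from any thread. */
    String invokeBlocking(Supplier<String> callback) {
        if (Thread.currentThread() == ownerThread) {
            // Already on the owning thread: call directly instead of posting and waiting.
            return callback.get();
        }
        // Unrelated thread: post to the owner and block until the callback finishes.
        return CompletableFuture.supplyAsync(callback, owner).join();
    }

    void shutdown() {
        owner.shutdown();
    }

    public static void main(String[] args) {
        OwnerAwareCallback cb = new OwnerAwareCallback();
        // The outer call is posted from the main thread; the inner call happens
        // on the owner thread and would hang without the ownership check.
        System.out.println(cb.invokeBlocking(() -> cb.invokeBlocking(() -> "Hello")));
        cb.shutdown();
    }
}
```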

**Note:** We don't need to set the ownership of VM managed threads,
because they run in a thread pool exclusively used by the VM, so there's
no way for native code to be executed on the thread (except by FFI, in
which case we're entered into the isolate anyway). We only need this for
Flutter's root isolate because work can be sent to the UI
thread/platform thread using OS specific APIs like Android's
`Looper.getMainLooper()`.
@github-project-automation github-project-automation bot moved this from In Progress to Done in JNIgen tracker Feb 27, 2025
@escamoteur

escamoteur commented Feb 27, 2025 via email

@HosseinYousefi
Member

When will there be a hot fix?

Today!

@escamoteur

escamoteur commented Feb 27, 2025 via email

@HosseinYousefi
Member

Uhm including Flutter?
On Feb 27, 2025, at 07:30 -0500, Hossein Yousefi @.*** wrote:

I landed a fix in the package only which is not dependent on the Flutter fix so you can use it in any version.

@escamoteur

escamoteur commented Feb 27, 2025 via email

@HosseinYousefi
Member

Ah, as this was filed in the language repo I was fearing that it is a problem in the language compiler itself

Not sure if I'm following, this is the native repo.

@escamoteur

escamoteur commented Feb 27, 2025 via email
