Tobira seems to freeze randomly, not responding to HTTP requests at all anymore #1129

Closed
LukasKalbertodt opened this issue Mar 4, 2024 · 3 comments · Fixed by #1141

@LukasKalbertodt
Member

LukasKalbertodt commented Mar 4, 2024

From ETH and Bern, we got reports of Tobira randomly stopping to work: no more log output is produced, and HTTP requests are no longer answered at all. The process still runs, but it seems like it's not doing anything anymore. The log does not contain any relevant entries.

Bern is running v2.5 (where it first occurred, AFAIK), while ETH is running v2.6 (and they apparently did not see the problem on v2.5). Both are running Tobira behind nginx. From ETH I heard that it could happen immediately, or after hours or days. Bern reported it failing at roughly 26.02. 17:27, 22.02. 21:10, and 10.02. 16:53.

Playing around on the ETH system right now, I could somewhat reliably trigger the problem by immediately refreshing the page after restarting Tobira. I couldn't yet identify the relevant HTTP request, but I assume it's a GraphQL request. The index.html was, I think, still delivered correctly. The problem also occurred with the -musl build of v2.6. ETH observed that the memory usage reported by systemctl status was only 784KB for a frozen Tobira, but 50MB for a normally running one. I did not see those numbers in my experiments, but it might still be useful information. The frozen process does not consume any noticeable CPU time (i.e. it is not spinning on a full core).

I now deployed a MUSL build of 97d28b2, which is just v2.6 + #1120. That PR updates Tokio and hyper, two prime candidates for causing things like this. So far I could not reproduce the problem, I think? I will see if it continues to work for a few days.

I have not yet gone through the changelogs of Tokio and hyper, but will do so soon. Other things I plan to do to debug this:

  • Get a Tobira into the frozen state and poke at it in various ways, using strace and friends to see what's happening.
  • Enable more logs/tracing from dependencies of Tobira (Tokio, hyper, ...); see the sketch below.
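
A sketch of what enabling more verbose dependency logging could look like (hypothetical; Tobira's actual logging setup differs, and how much Tokio/hyper actually emit depends on the features they are compiled with):

```rust
// Hypothetical sketch: surface trace-level logs from dependencies such as
// Tokio and hyper via `tracing_subscriber`. This is not Tobira's actual
// logging setup; it only illustrates the idea.
use tracing_subscriber::EnvFilter;

fn init_verbose_logging() {
    tracing_subscriber::fmt()
        .with_env_filter(EnvFilter::new("info,tokio=trace,hyper=trace"))
        .init();
}
```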
@oas777
Collaborator

oas777 commented Mar 4, 2024

ETH is running v2.6 (and they apparently did not see that problem on v2.5).

FWIW: I saw that problem once or twice before we upgraded, at least if it is the problem we saw at our meeting last week / discussed with Piri and Waldemar.

LukasKalbertodt added a commit to LukasKalbertodt/tobira that referenced this issue Mar 12, 2024
Fixes elan-ev#1129

The "frozen Tobira" bug was caused by the following: we had two
in-memory hashmaps that cached different things. Each entry in that
cache has a time stamp and is evicted/ignored once it is too old. The
eviction is done by a long running task, that calls `retain` on the
cache. Of course, the cache is also read. In the case of the user cache
these steps are regularly performed:
1. Does an entry for this username exists? If yes:
2. Is this entry too old or differs from the new data? If yes:
3. Write new data to database.
4. Update the timestamp in the cache entry to mark it as "brand new".

The tricky part is this: step 3 contains an `.await` point (waiting for
the DB to reply). At that point, the Tokio runtime will pause that task
and run a different task on that same thread. If that different task
happens to be the maintenance task that calls `retain` on the hashmap,
then we are in trouble: the documentation of `retain` says:

> May deadlock if called when holding any sort of reference into the map

And in [this issue](xacrimon/dashmap#233) it
is clarified that this means:

> May deadlock if called when **the current thread is** holding any sort
> of reference into the map.

That is usually not a problem in non-async multi-threaded code. But in
async code, when a lock/reference is held across an await point, it can
happen that the same thread holds a lock and also tries to call `retain`.
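
To illustrate, here is a minimal sketch of the problematic pattern
(hypothetical, not Tobira's actual code; it assumes `dashmap` and a
current-thread Tokio runtime, and it hangs forever when run, which is
exactly the deadlock described above):

```rust
use std::{sync::Arc, time::Duration};
use dashmap::DashMap;

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let cache: Arc<DashMap<String, u64>> = Arc::new(DashMap::new());
    cache.insert("alice".into(), 1);

    let reader = Arc::clone(&cache);
    tokio::spawn(async move {
        // Holds a reference into the map across the simulated DB round-trip.
        if let Some(entry) = reader.get("alice") {
            tokio::time::sleep(Duration::from_millis(100)).await; // "write to DB"
            println!("user still cached: {}", *entry);
        }
    });

    // Give the reader task a chance to acquire its guard and suspend.
    tokio::time::sleep(Duration::from_millis(10)).await;

    // "Maintenance task": `retain` blocks the only runtime thread while
    // waiting for the guard held above, which therefore can never be dropped.
    cache.retain(|_, timestamp| *timestamp > 0);
    println!("never reached");
}
```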

The solution is to not do that: avoid holding locks across await points
(good advice in general anyway). However, that isn't trivial in our
case, and much more importantly: this is a subtle mistake that is easy
to make. I don't want bugs like this cropping up accidentally again. I'm
sure there is a clippy lint for that, but it still feels wrong to me. So
I decided to use a different library that does not have this problem.
It's not yet as widely used, but it seems very promising.

The auth callback cache could also use `HashIndex`, which might be
faster. But for simplicity, I kept both as `HashMap`s for now. That
should be plenty fast.
owi92 added a commit that referenced this issue Mar 12, 2024
Fixes #1129

See the first commit description for the technical details on how this
was caused. But the gist is: I incorrectly used `DashMap`, holding locks
across await points. This causes a deadlock if the timing is right and
two specific tasks are scheduled to run on the same thread. I could have
fixed the code around the await point, but since this is a very easy and
subtle mistake to make, I decided to use a different concurrent hashmap
that can better deal with these scenarios.

The second commit also fixes how the cache deals with a cache duration
of 0 (which is supposed to disable the cache). See the commits for more
details.
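
As a rough illustration of that second point (a hypothetical sketch, not
the actual Tobira code): with a duration of 0, every entry has to count
as expired, i.e. the cache is effectively disabled.

```rust
use std::time::{Duration, Instant};

// Hypothetical helper, not Tobira's actual code: a cache duration of 0
// means "cache disabled", i.e. every entry counts as stale.
fn is_stale(inserted_at: Instant, cache_duration: Duration) -> bool {
    cache_duration.is_zero() || inserted_at.elapsed() >= cache_duration
}

fn main() {
    let now = Instant::now();
    assert!(is_stale(now, Duration::ZERO));            // cache disabled
    assert!(!is_stale(now, Duration::from_secs(60)));  // fresh entry
}
```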

On the ETH test system I deployed v2.6 plus this patch and verified that
the freeze is not happening anymore in the only situation where I could
reliably reproduce it before: starting Tobira and immediately making an
authenticated request. Since the cache_duration was set to 0, the timing
somehow worked out most of the time. Doing that does not freeze Tobira
anymore with this patch (I tried several times).

---

<details>
<summary>Some additional tests/details</summary>

The `scc` hashmap has no problem when a lock is held on the same thread
on which `retain` is called. I tested it like this:

```rust
use std::sync::Arc;
use std::time::{Duration, Instant};

use scc::HashMap;

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let start = Instant::now();

    let map = HashMap::new();
    let _ = map.insert_async("foo", 4).await;
    let _ = map.insert_async("bar", 27).await;
    let map = Arc::new(map);

    {
        let map = Arc::clone(&map);
        tokio::spawn(async move {
            println!("--- {:.2?} calling entry", start.elapsed());
            let e = map.entry_async("foo").await;
            println!("--- {:.2?} acquired entry", start.elapsed());
            tokio::time::sleep(Duration::from_millis(3000)).await;
            *e.or_insert(6).get_mut() *= 2;
            println!("--- {:.2?} done with entry", start.elapsed());
        });
    }

    {
        let map = Arc::clone(&map);
        tokio::spawn(async move {
            tokio::time::sleep(Duration::from_millis(1500)).await;
            println!("--- {:.2?} calling entry 2", start.elapsed());
            let e = map.entry_async("foo").await;
            println!("--- {:.2?} acquired entry 2", start.elapsed());
            tokio::time::sleep(Duration::from_millis(3000)).await;
            *e.or_insert(6).get_mut() *= 2;
            println!("--- {:.2?} done with entry 2", start.elapsed());
        });
    }

    tokio::time::sleep(Duration::from_millis(1000)).await;
    println!("--- {:.2?} calling retain", start.elapsed());
    map.retain_async(|_, v| *v % 2 != 0).await;
    println!("--- {:.2?} done retain", start.elapsed());
}
```

This outputs:

```
--- 31.88µs calling entry
--- 38.91µs acquired entry
--- 1.00s calling retain
--- 1.50s calling entry 2
--- 3.00s done with entry
--- 3.00s acquired entry 2
--- 6.00s done with entry 2
--- 6.00s done retain
```

Of course, a single test doesn't guarantee that this is supported by the
library, but the docs also indicate that the library can deal with these
situations: "async" is used everywhere, and these kinds of situations
regularly occur in async code.

Edit: thinking about it more, I suppose the key feature here is that
`retain_async` can yield, whereas the `retain` from dashmap could only
block when it couldn't make any progress.

</details>
@LukasKalbertodt
Member Author

See the linked PR for an in-depth explanation of the cause. To add some more useful detail:

  • This bug likely existed starting with v2.4
  • The slower your database is, the more likely it is that you run into the bug.
  • If you are on >= v2.6, setting auth.callback.cache_duration to 0 makes it a lot more likely to run into the bug.
  • We will be releasing the patch as v2.8 very soon.
