Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ContentStore locking exceptions in async code #17246

Merged
merged 4 commits into from
Oct 11, 2024

Conversation

ronaldbarendse
Copy link
Contributor

Prerequisites

  • I have added steps to test this contribution in the description below

Description

After large Deploy operations (like an environment restore) has completed and the CMS tries to update the content cache (NuCache) as part of completing the Umbraco scope, the following exceptions were thrown:

System.AggregateException: One or more errors occurred. (Write lock must be acquried.) (Exceptions were thrown by listed actions. (Object synchronization method was called from an unsynchronized block of code.) (Object synchronization method was called from an unsynchronized block of code.) (Object synchronization method was called from an unsynchronized block of code.))
   at Umbraco.Cms.Infrastructure.Scoping.Scope.TryFinally(Action[] actions)
   at Umbraco.Cms.Infrastructure.Scoping.Scope.RobustExit(Boolean completed, Boolean onException)
   at Umbraco.Cms.Infrastructure.Scoping.Scope.DisposeLastScope()
   at Umbraco.Cms.Infrastructure.Scoping.Scope.Dispose()
   at Umbraco.Deploy.Infrastructure.Core.AltScope.Dispose()
   at Umbraco.Deploy.Infrastructure.Work.WorkItems.SessionWorkItem.<>c__DisplayClass24_0.<Complete>b__0()
...
 System.InvalidOperationException: Write lock must be acquried.
   at Umbraco.Cms.Infrastructure.PublishedCache.ContentStore.EnsureLocked()
   at Umbraco.Cms.Infrastructure.PublishedCache.ContentStore.SetLocked(ContentNodeKit kit)
   at Umbraco.Cms.Infrastructure.PublishedCache.PublishedSnapshotService.NotifyLocked(IEnumerable`1 payloads, Boolean& draftChanged, Boolean& publishedChanged)
   at Umbraco.Cms.Infrastructure.PublishedCache.PublishedSnapshotService.Notify(JsonPayload[] payloads, Boolean& draftChanged, Boolean& publishedChanged)
   at Umbraco.Cms.Core.Cache.ContentCacheRefresher.NotifyPublishedSnapshotService(IPublishedSnapshotService service, AppCaches appCaches, JsonPayload[] payloads)
   at Umbraco.Cms.Core.Cache.ContentCacheRefresher.Refresh(JsonPayload[] payloads)
   at Umbraco.Cms.Infrastructure.Sync.ServerMessengerBase.DeliverLocal[TPayload](ICacheRefresher refresher, TPayload[] payload)
   at Umbraco.Cms.Infrastructure.Sync.ServerMessengerBase.Deliver[TPayload](ICacheRefresher refresher, TPayload[] payload)
   at Umbraco.Cms.Infrastructure.Sync.ServerMessengerBase.QueueRefresh[TPayload](ICacheRefresher refresher, TPayload[] payload)
   at Umbraco.Cms.Core.Cache.DistributedCache.RefreshByPayload[TPayload](Guid refresherGuid, TPayload[] payload)
   at Umbraco.Cms.Core.Cache.DistributedCache.RefreshByPayload[TPayload](Guid refresherGuid, IEnumerable`1 payloads)
   at Umbraco.Extensions.DistributedCacheExtensions.RefreshContentCache(DistributedCache dc, IEnumerable`1 changes)
   at Umbraco.Cms.Core.Cache.ContentTreeChangeDistributedCacheNotificationHandler.Handle(IEnumerable`1 entities)
   at Umbraco.Cms.Core.Cache.DistributedCacheNotificationHandlerBase`2.Handle(IEnumerable`1 notifications)
System.AggregateException: One or more errors occurred. (Could not enter monitor before timeout in content store)
   at Umbraco.Cms.Infrastructure.Scoping.Scope.TryFinally(Action[] actions)
   at Umbraco.Cms.Infrastructure.Scoping.Scope.RobustExit(Boolean completed, Boolean onException)
   at Umbraco.Cms.Infrastructure.Scoping.Scope.DisposeLastScope()
   at Umbraco.Cms.Infrastructure.Scoping.Scope.Dispose()
   at Umbraco.Deploy.Infrastructure.Core.AltScope.Dispose()
   at Umbraco.Deploy.Infrastructure.Work.WorkItems.SessionWorkItem.<>c__DisplayClass24_0.<Complete>b__0()
...
System.TimeoutException: Could not enter monitor before timeout in content store
   at Umbraco.Cms.Infrastructure.PublishedCache.ContentStore.Lock(WriteLockInfo lockInfo, Boolean forceGen)
   at Umbraco.Cms.Infrastructure.PublishedCache.ContentStore.ScopedWriteLock..ctor(ContentStore store, Boolean scoped)
   at Umbraco.Cms.Infrastructure.PublishedCache.ContentStore.<GetScopedWriteLock>b__29_0(Boolean scoped)
   at Umbraco.Cms.Core.Scoping.ScopeContextualBase.<>c__DisplayClass2_0`1.<Get>b__0()
   at Umbraco.Cms.Core.Scoping.ScopeContext.Enlist[T](String key, Func`1 creator, Action`2 action, Int32 priority)
   at Umbraco.Cms.Core.Scoping.ScopeContextualBase.Get[T](ICoreScopeProvider scopeProvider, String key, Func`2 ctor)
   at Umbraco.Cms.Infrastructure.PublishedCache.ContentStore.GetScopedWriteLock(ICoreScopeProvider scopeProvider)
   at Umbraco.Cms.Infrastructure.PublishedCache.PublishedSnapshotService.Notify(JsonPayload[] payloads, Boolean& draftChanged, Boolean& publishedChanged)
   at Umbraco.Cms.Core.Cache.ContentCacheRefresher.NotifyPublishedSnapshotService(IPublishedSnapshotService service, AppCaches appCaches, JsonPayload[] payloads)
   at Umbraco.Cms.Core.Cache.ContentCacheRefresher.Refresh(JsonPayload[] payloads)
   at Umbraco.Cms.Infrastructure.Sync.ServerMessengerBase.DeliverLocal[TPayload](ICacheRefresher refresher, TPayload[] payload)
   at Umbraco.Cms.Infrastructure.Sync.ServerMessengerBase.Deliver[TPayload](ICacheRefresher refresher, TPayload[] payload)
   at Umbraco.Cms.Infrastructure.Sync.ServerMessengerBase.QueueRefresh[TPayload](ICacheRefresher refresher, TPayload[] payload)
   at Umbraco.Cms.Core.Cache.DistributedCache.RefreshByPayload[TPayload](Guid refresherGuid, TPayload[] payload)
   at Umbraco.Cms.Core.Cache.DistributedCache.RefreshByPayload[TPayload](Guid refresherGuid, IEnumerable`1 payloads)
   at Umbraco.Extensions.DistributedCacheExtensions.RefreshContentCache(DistributedCache dc, IEnumerable`1 changes)
   at Umbraco.Cms.Core.Cache.ContentTreeChangeDistributedCacheNotificationHandler.Handle(IEnumerable`1 entities)
   at Umbraco.Cms.Core.Cache.DistributedCacheNotificationHandlerBase`2.Handle(IEnumerable`1 notifications)

This all originates from the ContentStore either not 'seeing' it already has a write lock or getting a timeout when trying to get it (essentially deadlocking). Looking at the locking implementation in this class, it uses Monitor, which has the following requirement:

Because the Monitor class has thread affinity, the thread that acquired a lock must release the lock by calling the Monitor.Exit method.

But looking at the code and stack trace, the lock/monitor is entered and exited in different methods and they can potentially be called from different threads, which is especially likely if it contains async calls in-between them (Deploy heavily uses async and does more work in a single lock, further increasing the chances of exiting on a different thread).

I've added a test showcasing one of the exceptions and will push a fix that uses SemaphoreSlim.

Sorry, something went wrong.

Unverified

This user has not yet uploaded their public signing key.
@ronaldbarendse
Copy link
Contributor Author

As expected, the test without fix results in this exception:

Failed SetLocked [462 ms]
  Error Message:
   System.Threading.SynchronizationLockException : Object synchronization method was called from an unsynchronized block of code.
  Stack Trace:
     at System.Threading.Monitor.Exit(Object obj)
   at Umbraco.Cms.Infrastructure.PublishedCache.ContentStore.Release(WriteLockInfo lockInfo, Boolean commit) in /_/src/Umbraco.PublishedCache.NuCache/ContentStore.cs:line 412
   at Umbraco.Cms.Infrastructure.PublishedCache.ContentStore.ScopedWriteLock.Release(Boolean completed) in /_/src/Umbraco.PublishedCache.NuCache/ContentStore.cs:line 311
   at Umbraco.Cms.Core.Scoping.ScopeContextualBase.Dispose() in /_/src/Umbraco.Infrastructure/Scoping/ScopeContextualBase.cs:line 76
   at Umbraco.Cms.Tests.Integration.Umbraco.Infrastructure.PublishedCache.ContentCacheTests.SetLocked() in /_/tests/Umbraco.Tests.Integration/Umbraco.Infrastructure/PublishedCache/ContentCacheTests.cs:line 75

I've applied the same change in SnapDictionary (which has more locking tests in SnapDictionaryTests that all still pass). I also noticed an unused MonitorLock class and obsoleted it.

The only remaining usage of Monitor that can potentially enter/exit on different threads is in HttpContextRequestAppCache:

protected override void EnterReadLock()
{
var locker = GetLock();
if (locker == null)
{
return;
}
Monitor.Enter(locker);
}
protected override void EnterWriteLock()
{
var locker = GetLock();
if (locker == null)
{
return;
}
Monitor.Enter(locker);
}
protected override void ExitReadLock()
{
var locker = GetLock();
if (locker == null)
{
return;
}
if (Monitor.IsEntered(locker))
{
Monitor.Exit(locker);
}
}
protected override void ExitWriteLock()
{
var locker = GetLock();
if (locker == null)
{
return;
}
if (Monitor.IsEntered(locker))
{
Monitor.Exit(locker);
}
}

ExamineIndexRebuilder also uses Monitor, but always exits within the same method and therefore doesn't have the same thread synchronization issues.

Copy link
Member

@bergmania bergmania left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good to me

@bergmania bergmania merged commit c3db345 into v13/dev Oct 11, 2024
12 of 15 checks passed
@bergmania bergmania deleted the v13/bugfix/contentstore-locking branch October 11, 2024 07:45
bergmania pushed a commit that referenced this pull request Oct 11, 2024
* Add ContentCache test

* Use SemaphoreSlim as write lock

* Apply lock imrpovements to SnapDictionary

* Obsolete unused MonitorLock

(cherry picked from commit c3db345)
@ronaldbarendse
Copy link
Contributor Author

Awesome, let's verify whether this not only fixes the test, but also the runtime exceptions!👍🏻

Looking more into HttpContextRequestAppCache, I can see the locks are all entered/exited within the same methods (and thus same threads): it's just that the EnterReadLock()/ExitReadLock() and EnterWriteLock()/ExitWriteLock() are protected methods and can potentially be called incorrectly, but that's currently not the case...

bergmania pushed a commit that referenced this pull request Oct 16, 2024
* Add ContentCache test

* Use SemaphoreSlim as write lock

* Apply lock imrpovements to SnapDictionary

* Obsolete unused MonitorLock

(cherry picked from commit c3db345)
@sauron
Copy link

sauron commented Nov 11, 2024

Hello, I wanted to touch base with you all, because, since this was applied automatically by Umbraco Cloud,
we are getting the (Recursive locks not allowed) on tasks we didn't have before while doing simple things,
like moving nodes between folders
Screenshot 2024-11-11 at 11 14 00

or while running recurring Jobs that pulls from APIs and store locally.
The RecurringJob is a simple request to an API then insert or update into Umbraco under a folder like:

var results = "pull from API and parse with the Proper class"
foreach (var item in results)
{
    IContent dbNode = null;
    IPublishedContent? found = _publishedContentQuery.Content("<parentFOlderID>").Children().Where(x => x.Value<string>("externalId") == item.Id.Trim()).FirstOrDefault();
    if(found != null){
        dbNode = ContentService.GetById(found.Id);
    } else {
        dbNode = ContentService.Create(track.Title, int.Parse(parentId), "<nodeType>");
    }
    // Set Values
    dbNode.SetValue("...." , item.Id);
    // .....

    try
    {
        ContentService.SaveAndPublish(dbNode);
    }
    catch (AggregateException ae)
    {
        Console.WriteLine("Caught aggregate exception behavior"+ae.Message);
    }
}

While using try/catch avoid raising the error, we are wondering if there is a way to "release" the node being updated/inserted or a different approach we should take to run this recurring jobs to avoid this error.

For moving content between folders, we are "workarounding" it by moving smaller groups of content.

Is this the intended behavoir? Should I open a new issue?

@ronaldbarendse
Copy link
Contributor Author

Hello, I wanted to touch base with you all, because, since this was applied automatically by Umbraco Cloud,
we are getting the (Recursive locks not allowed) on tasks we didn't have before while doing simple things,

Looking at the stack trace, the error happens in a notification handler (as notifications are published in the TryFinally() method after completing/disposing the scope). Do you have any custom handlers registered that trigger when moving content (like ContentSavedNotification or ContentMovedNotification)? And the code that's running in the recurring job, is that creating its own scope before creating/saving content, as is shown in the documentation?

@lauren-g-2025
Copy link

lauren-g-2025 commented Jan 14, 2025

We are experiencing the same issue as user sauron, starting mid-December on one of our Umbraco Cloud sites. We have two Hangfire jobs that run at the same time and now seem to interfere with each other, causing Recursive locks not allowed.

We can probably get round it by making the jobs run at different times, but it'd be good to understand the root cause here. I can reproduce the issue locally by triggering the two jobs together. Adding ICoreScopeProvider as in the documentation has not helped.

@sauron
Copy link

sauron commented Jan 14, 2025

@lauren-g-2025 After reviewing our code and verifying that we were not customizing the handler, we end up "workarounding" it.
We added a try/catch syntax around the saveAndPublish call and we also made sure that we were triggering async using the await Task.Run(() => ...

So basically:

                    try
                    {
                         await Task.Run(() => _contentService.SaveAndPublish(article));
                    }
                    catch (AggregateException ae)
                    {
                        Console.WriteLine("Caught aggregate exception behavior "+ae.Message);
                    }

But going back to the main problem it is definitively this change.
if we run the app with the previous version of Umbraco, the one that didn't include this change, we were able to run everything smoothly.

I hope this helps.

@ronaldbarendse
Copy link
Contributor Author

@sauron Can you share the stack traces of the inner exceptions of that AggregateException? The fix in this PR ensures the lock is correctly seen by other threads and the current IScope ensures only a single write lock is taken (as recursive locks are indeed not supported). So it seems like some code is executing outside of the scope that took the write lock and is also trying to write to the content store.

The write lock is enlisted in the current scope here:

// no scope context = create a non-scoped object
IScopeContext? scopeContext = scopeProvider.Context;
if (scopeContext == null)
{
return ctor(false);
}
// create & enlist the scoped object
T? w = scopeContext.Enlist(
"ScopeContextualBase_" + key,
() => ctor(true),
(completed, item) => { item?.Release(completed); });

If there's no scope, it will call the ctor() and create a new ScopedWriteLock:

public IDisposable? GetScopedWriteLock(ICoreScopeProvider scopeProvider) =>
ScopeContextualBase.Get(scopeProvider, _instanceId, scoped => new ScopedWriteLock(this, scoped));

Also, using await Task.Run() can cause other issues and in the above example only causes additional overhead and can potentially result in thread starvation. Also see this comment in the Scope.Dispose() method:

$"The {nameof(Scope)} {InstanceId} being disposed is not the Ambient {nameof(Scope)} {_scopeProvider.AmbientScope?.InstanceId.ToString() ?? "NULL"}. This typically indicates that a child {nameof(Scope)} was not disposed, or flowed to a child thread that was not awaited, or concurrent threads are accessing the same {nameof(Scope)} (Ambient context) which is not supported. If using Task.Run (or similar) as a fire and forget tasks or to run threads in parallel you must suppress execution context flow with ExecutionContext.SuppressFlow() and ExecutionContext.RestoreFlow().";

You might need to wrap the code in a using (ExecutionContext.SuppressFlow()), like the webhook background job does:

await Task.WhenAll(requests.Select(request =>
{
using (ExecutionContext.SuppressFlow())
{
return Task.Run(async () =>
{
IWebhook? webhook = await _webHookService.GetAsync(request.WebhookKey);
if (webhook is null)
{
return;
}
using HttpResponseMessage? response = await SendRequestAsync(webhook, request.EventAlias, request.RequestObject, request.RetryCount, CancellationToken.None);
if ((response?.IsSuccessStatusCode ?? false) || request.RetryCount >= _webhookSettings.MaximumRetries)
{
await _webhookRequestService.DeleteAsync(request);
}
else
{
request.RetryCount++;
await _webhookRequestService.UpdateAsync(request);
}
});
}
}));

@lauren-g-2025 Umbraco uses distributed write locks, so only a single write operation (like updating content) can be done at a time, so running 2 background jobs at the same isn't recommended for that reason alone. But otherwise, it seems like Hangfire executes jobs from the same thread, potentially causing issues with Umbraco contexts being shared between jobs, so the same SuppressFlow() needs to be added, see also: https://discuss.hangfire.io/t/use-of-asynclocal-t-in-a-job/9443

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants