
Lock wait timeout exceeded MySQL errors #9905

Closed
nickdunn opened this issue Sep 30, 2021 · 58 comments

@nickdunn

Description

Very sporadically I am getting Lock wait timeout exceeded errors from MySQL.

These appear in Craft logs like:

Next yii\db\Exception: SQLSTATE[40001]: Serialization failure: 1213 Deadlock found when trying to get lock; try restarting transaction

and

[error][yii\db\Exception] PDOException: SQLSTATE[HY000]: General error: 1205 Lock wait timeout exceeded; try restarting transaction in /var/www/craft/vendor/yiisoft/yii2/db/Command.php:1302

Running show engine innodb status shows they are consistently from queries on the structureelements table:

------------------------
LATEST DETECTED DEADLOCK
------------------------
2021-09-30 13:49:41 0x14b8
*** (1) TRANSACTION:
TRANSACTION 6366085, ACTIVE 2 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 22 lock struct(s), heap size 1136, 10 row lock(s), undo log entries 8
MySQL thread id 89160, OS thread handle 5260, query id 7402009 10.2.0.62 cswwwuatdbmysql updating
UPDATE `structureelements` SET `lft`=`lft`+2, `dateUpdated`='2021-09-30 13:49:39' WHERE (`lft` >= 884) AND (`root`=1)
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 623 page no 20 n bits 1192 index structureelements_root_idx of table `craftcms`.`structureelements` trx id 6366085 lock_mode X waiting
Record lock, heap no 2 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
 0: len 4; hex 00000001; asc     ;;
 1: len 4; hex 80000001; asc     ;;

*** (2) TRANSACTION:
TRANSACTION 6365973, ACTIVE 5 sec inserting
mysql tables in use 1, locked 1
335 lock struct(s), heap size 41168, 2975 row lock(s), undo log entries 2044
MySQL thread id 89190, OS thread handle 5304, query id 7403525 10.2.0.62 cswwwuatdbmysql update
INSERT INTO `structureelements` (`structureId`, `dateCreated`, `dateUpdated`, `uid`, `lft`, `rgt`, `level`) VALUES (20875, '2021-09-30 13:49:41', '2021-09-30 13:49:41', '58e60ef1-6f8b-4a5d-9783-cf378111c29d', 1, 2, 0)
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 623 page no 20 n bits 1192 index structureelements_root_idx of table `craftcms`.`structureelements` trx id 6365973 lock_mode X
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
 0: len 8; hex 73757072656d756d; asc supremum;;

Record lock, heap no 2 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
 0: len 4; hex 00000001; asc     ;;
 1: len 4; hex 80000001; asc     ;;

*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 623 page no 20 n bits 1192 index structureelements_root_idx of table `craftcms`.`structureelements` trx id 6365973 lock_mode X locks gap before rec insert intention waiting
Record lock, heap no 2 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
 0: len 4; hex 00000001; asc     ;;
 1: len 4; hex 80000001; asc     ;;

*** WE ROLL BACK TRANSACTION (1)

The database server is pretty chunky and performs well elsewhere. The innodb_lock_wait_timeout setting was at the 50 second default. I have just bumped it to 100, but this feels like a sticking plaster, i.e. telling it to wait a bit longer just in case!
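
For anyone poking at the same thing, the timeout can be checked and raised at runtime like this (100 is just the value I used; this only delays the error rather than fixing the contention):

SHOW VARIABLES LIKE 'innodb_lock_wait_timeout';
-- applies to new connections; use SET SESSION to change the current one
SET GLOBAL innodb_lock_wait_timeout = 100;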

#5872 suggests a large number of revisions could be the issue. I only have about 6k entries and 4k revisions, while there are 57k rows in structureelements.

I'm conscious StackOverflow is awash with useless threads on these errors and the usual reply is "refactor your code to make sure everything is in one transaction" which doesn't seem too useful.

The offending queries seem fairly innocuous to me. The issue is very sporadic — maybe 10 times in 3 months — and always seemingly when a content editor goes to save an entry.

Reckons?

Additional info

@nickdunn nickdunn added the bug label Sep 30, 2021
@brandonkelly
Member

Hey @nickdunn, do you have SEOmatic installed by chance? If so, can you please ensure that you are running its latest version? We have seen similar issues where that ended up being the culprit; it has been fixed in v3.4.14.

@lassemt

lassemt commented Oct 2, 2021

@brandonkelly I've been experiencing the same on a page recently. It's a "Pruning extra revisions" job that fails:
[Screenshot: failed "Pruning extra revisions" queue job]

Now we are running the latest version, but we noticed the issue 10 days ago after running 3.7.11 for 2 weeks. Before that we had been running 3.6.17 without any problems. So it might be a change in 3.7 causing this?

Another thing I have noticed is that this often happens after saving a specific entry. The entry contains a Matrix field with a bunch of blocks, but this entry has had its revisions pruned without a problem earlier.

System info

PHP 7.4.15
Linux 5.4.0-88-generic
MySQL 8.0.23
Craft Pro 3.7.14

Plugins

AsyncQueue 2.3.0
Calendar Links 1.0.1
Craft MailChimp 1.0.2 (custom plugin)
DigitalOcean Spaces Volume 1.1.3
Element API 2.8.3
Imgix 2.1.0
Minify 1.2.10
Redactor 2.8.8
Super Table 2.6.8
Twigpack 1.2.15

Notes

It's an issue that occurred before AsyncQueue was installed.

Timeline

Sun Jul 04 2021 Updated to 3.6.17
Tue Sep 07 2021 Updated to 3.7.11
Fri Sep 24 2021 First time we noticed the problem. Installed Async Queue plugin as a potential solution.
Sat Oct 02 2021 Experiencing the same problem. Updated to 3.7.14. Managed to recreate problem by saving entry after update.

Hope this helps. Let me know if you need me to test anything more.

Edit:
It's not hard to notice the problem, as the server stalls when it happens.

@brandonkelly
Member

@lassemt Does the job finish successfully if you restart it? Or does it consistently fail?

@lassemt

lassemt commented Oct 5, 2021

@lassemt Does the job finish successfully if you restart it? Or does it consistently fail?

@brandonkelly I didn't try restarting the queue, as most of the time I had to restart the server. I have the queue.log from when it happened, and you can see it tries to restart the job itself:

2021-10-01 17:14:33 [-][150469][-][info][craft\queue\QueueLogBehavior::beforeExec]  [272119] Pruning extra revisions (attempt: 1, pid: 461734) - Started
2021-10-01 17:15:25 [-][150469][-][error][craft\queue\QueueLogBehavior::afterError]  [272119] Pruning extra revisions (attempt: 1, pid: 461734) - Error (time: 52.052s): SQLSTATE[HY000]: General error: 1205 Lock wait timeout exceeded; try restarting transaction
The SQL being executed was: DELETE FROM `craft_structureelements` WHERE ((`lft` >= 1059) AND (`rgt` <= 1060)) AND (`root`=1)
2021-10-01 17:15:25 [-][150469][-][info][craft\queue\QueueLogBehavior::beforeExec]  [277820] Pruning extra revisions (attempt: 1, pid: 461734) - Started
2021-10-01 17:17:58 [-][150469][-][error][craft\queue\QueueLogBehavior::afterError]  [277820] Pruning extra revisions (attempt: 1, pid: 461734) - Error (time: 152.207s): SQLSTATE[HY000]: General error: 1205 Lock wait timeout exceeded; try restarting transaction
The SQL being executed was: DELETE FROM `craft_structureelements` WHERE ((`lft` >= 955) AND (`rgt` <= 956)) AND (`root`=1)
2021-10-01 17:17:58 [-][150469][-][info][craft\queue\QueueLogBehavior::beforeExec]  [283346] Pruning extra revisions (attempt: 1, pid: 461734) - Started
2021-10-01 17:22:06 [-][150469][-][error][craft\queue\QueueLogBehavior::afterError]  [283346] Pruning extra revisions (attempt: 1, pid: 461734) - Error (time: 248.894s): SQLSTATE[HY000]: General error: 1205 Lock wait timeout exceeded; try restarting transaction
The SQL being executed was: DELETE FROM `craft_structureelements` WHERE ((`lft` >= 955) AND (`rgt` <= 956)) AND (`root`=1)
2021-10-01 17:22:06 [-][150469][-][info][craft\queue\QueueLogBehavior::beforeExec]  [270997] Pruning extra revisions (attempt: 2, pid: 461734) - Started
2021-10-01 17:27:08 [-][150469][-][error][craft\queue\QueueLogBehavior::afterError]  [270997] Pruning extra revisions (attempt: 2, pid: 461734) - Error (time: 301.233s): SQLSTATE[HY000]: General error: 1205 Lock wait timeout exceeded; try restarting transaction
The SQL being executed was: DELETE FROM `craft_structureelements` WHERE ((`lft` >= 955) AND (`rgt` <= 956)) AND (`root`=1)

@brandonkelly
Member

Thanks. The first and last errors are 12 minutes apart, so that probably rules out the possibility that the lock was just legitimately tied up by a separate process.

Do you see any other [errors] further up in the logs that look like they might have had something to do with updating structure data? Or do you have anything with a semi-recent timestamp in storage/logs/phperrors.log? Seems like maybe there was a previous request that ended prematurely, and wasn’t able to release its lock on the structure.

@lassemt

lassemt commented Oct 5, 2021

@brandonkelly that makes sense. Here is a batch from queue.log from when I tried to recreate the error by manually saving the entry a couple of times. This was done right after I restarted the server, so the first event you see is the first queued job after the restart, and the last event is from right after I triggered the server restart through Digital Ocean.

2021-10-02 13:42:20 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292891] Updating search indexes (attempt: 1, pid: 1628) - Started
2021-10-02 13:42:20 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292891] Updating search indexes (attempt: 1, pid: 1628) - Done (time: 0.056s)
2021-10-02 13:42:20 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292894] Updating search indexes (attempt: 1, pid: 1628) - Started
2021-10-02 13:42:20 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292894] Updating search indexes (attempt: 1, pid: 1628) - Done (time: 0.050s)
2021-10-02 13:42:21 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292893] Pruning extra revisions (attempt: 1, pid: 1628) - Started
2021-10-02 13:42:22 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292893] Pruning extra revisions (attempt: 1, pid: 1628) - Done (time: 1.023s)
2021-10-02 13:42:48 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292896] Updating element slugs and URIs (attempt: 1, pid: 1665) - Started
2021-10-02 13:42:48 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292896] Updating element slugs and URIs (attempt: 1, pid: 1665) - Done (time: 0.039s)
2021-10-02 13:42:48 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292895] Updating search indexes (attempt: 1, pid: 1665) - Started
2021-10-02 13:42:48 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292895] Updating search indexes (attempt: 1, pid: 1665) - Done (time: 0.052s)
2021-10-02 13:42:48 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292898] Updating search indexes (attempt: 1, pid: 1665) - Started
2021-10-02 13:42:48 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292898] Updating search indexes (attempt: 1, pid: 1665) - Done (time: 0.048s)
2021-10-02 13:42:48 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292897] Pruning extra revisions (attempt: 1, pid: 1665) - Started
2021-10-02 13:42:49 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292897] Pruning extra revisions (attempt: 1, pid: 1665) - Done (time: 0.788s)
2021-10-02 13:43:06 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292900] Updating element slugs and URIs (attempt: 1, pid: 1727) - Started
2021-10-02 13:43:07 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292900] Updating element slugs and URIs (attempt: 1, pid: 1727) - Done (time: 0.037s)
2021-10-02 13:43:07 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292899] Updating search indexes (attempt: 1, pid: 1727) - Started
2021-10-02 13:43:07 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292899] Updating search indexes (attempt: 1, pid: 1727) - Done (time: 0.046s)
2021-10-02 13:43:07 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292902] Updating search indexes (attempt: 1, pid: 1727) - Started
2021-10-02 13:43:07 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292902] Updating search indexes (attempt: 1, pid: 1727) - Done (time: 0.049s)
2021-10-02 13:43:07 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292901] Pruning extra revisions (attempt: 1, pid: 1727) - Started
2021-10-02 13:43:08 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292901] Pruning extra revisions (attempt: 1, pid: 1727) - Done (time: 0.654s)
2021-10-02 13:43:24 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292904] Updating element slugs and URIs (attempt: 1, pid: 1755) - Started
2021-10-02 13:43:24 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292904] Updating element slugs and URIs (attempt: 1, pid: 1755) - Done (time: 0.036s)
2021-10-02 13:43:24 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292903] Updating search indexes (attempt: 1, pid: 1755) - Started
2021-10-02 13:43:24 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292903] Updating search indexes (attempt: 1, pid: 1755) - Done (time: 0.048s)
2021-10-02 13:43:24 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292906] Updating search indexes (attempt: 1, pid: 1755) - Started
2021-10-02 13:43:24 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292906] Updating search indexes (attempt: 1, pid: 1755) - Done (time: 0.048s)
2021-10-02 13:43:25 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292905] Pruning extra revisions (attempt: 1, pid: 1755) - Started
2021-10-02 13:43:25 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292905] Pruning extra revisions (attempt: 1, pid: 1755) - Done (time: 0.749s)
2021-10-02 13:43:43 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292908] Updating element slugs and URIs (attempt: 1, pid: 1784) - Started
2021-10-02 13:43:43 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292908] Updating element slugs and URIs (attempt: 1, pid: 1784) - Done (time: 0.040s)
2021-10-02 13:43:43 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292907] Updating search indexes (attempt: 1, pid: 1784) - Started
2021-10-02 13:43:43 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292907] Updating search indexes (attempt: 1, pid: 1784) - Done (time: 0.049s)
2021-10-02 13:43:43 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292910] Updating search indexes (attempt: 1, pid: 1784) - Started
2021-10-02 13:43:43 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292910] Updating search indexes (attempt: 1, pid: 1784) - Done (time: 0.050s)
2021-10-02 13:43:44 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292909] Pruning extra revisions (attempt: 1, pid: 1784) - Started
2021-10-02 13:43:44 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292909] Pruning extra revisions (attempt: 1, pid: 1784) - Done (time: 0.780s)
2021-10-02 13:44:29 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292912] Updating element slugs and URIs (attempt: 1, pid: 1894) - Started
2021-10-02 13:44:30 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292912] Updating element slugs and URIs (attempt: 1, pid: 1894) - Done (time: 0.440s)
2021-10-02 13:44:30 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292916] Updating element slugs and URIs (attempt: 1, pid: 1897) - Started
2021-10-02 13:44:30 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292916] Updating element slugs and URIs (attempt: 1, pid: 1897) - Done (time: 0.091s)
2021-10-02 13:44:30 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292911] Updating search indexes (attempt: 1, pid: 1894) - Started
2021-10-02 13:44:30 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292911] Updating search indexes (attempt: 1, pid: 1894) - Done (time: 0.114s)
2021-10-02 13:44:30 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292915] Updating search indexes (attempt: 1, pid: 1894) - Started
2021-10-02 13:44:31 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292915] Updating search indexes (attempt: 1, pid: 1894) - Done (time: 0.204s)
2021-10-02 13:44:30 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292914] Updating search indexes (attempt: 1, pid: 1897) - Started
2021-10-02 13:44:31 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292914] Updating search indexes (attempt: 1, pid: 1897) - Done (time: 0.351s)
2021-10-02 13:44:31 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292918] Updating search indexes (attempt: 1, pid: 1894) - Started
2021-10-02 13:44:31 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292918] Updating search indexes (attempt: 1, pid: 1894) - Done (time: 0.105s)
2021-10-02 13:44:31 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292913] Pruning extra revisions (attempt: 1, pid: 1897) - Started
2021-10-02 13:45:24 [-][-][-][error][craft\queue\QueueLogBehavior::afterError]  [292913] Pruning extra revisions (attempt: 1, pid: 1897) - Error (time: 53.096s): SQLSTATE[HY000]: General error: 1205 Lock wait timeout exceeded; try restarting transaction
The SQL being executed was: DELETE FROM `craft_structureelements` WHERE ((`lft` >= 927) AND (`rgt` <= 928)) AND (`root`=1)
2021-10-02 13:44:32 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292917] Pruning extra revisions (attempt: 1, pid: 1894) - Started
2021-10-02 13:47:45 [-][-][-][error][craft\queue\QueueLogBehavior::afterError]  [292917] Pruning extra revisions (attempt: 1, pid: 1894) - Error (time: 193.661s): SQLSTATE[HY000]: General error: 2006 MySQL server has gone away

The only error in phperrors.log close to this timestamp is at [02-Oct-2021 13:47:45 Europe/Oslo] which is related to the shutdown: "Communication link failure: 1053 Server shutdown in progress".

I was watching the queue in the admin when the above was happening and took screenshots of the 2 "Pruning extra revisions" jobs that "got stuck". The images are in the order they were in the queue overview in the admin:
[Screenshot: the first stuck "Pruning extra revisions" job]
And the second:
[Screenshot: the second stuck "Pruning extra revisions" job]

I'm not sure if it matters, but notice that the 2 jobs are related to the same SourceID, and that the one with ID 292913, which was pushed first, failed right after the one with ID 292917 was pushed to the queue. I believe it was right after the second one was pushed to the queue that the server started to run slow.

@brandonkelly
Member

@lassemt Can you try updating to Craft 3.7.15, restarting the DB server, and seeing if you can still reproduce this? Sounds like maybe 8e780c7 might help.

@lassemt

lassemt commented Oct 5, 2021

@brandonkelly I updated and restarted the server now. The same issue is still there and the logs are the same. But I noticed that it only happens when there are 2 "Pruning extra revisions" jobs for the same entry. Neither job completes before the server starts to slow down, and it is always the first one pushed to the queue that gets the Lock wait timeout exceeded error message. The second of the 2 jobs rarely shows up in queue.log.

[Screenshot: queue showing two "Pruning extra revisions" jobs for the same entry]

If there is only one "Pruning extra revisions" job for the entry, the job finishes as expected:

2021-10-06 00:46:36 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [294411] Pruning extra revisions (attempt: 1, pid: 222642) - Started
2021-10-06 00:46:37 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [294411] Pruning extra revisions (attempt: 1, pid: 222642) - Done (time: 0.915s)

I had to open the entry editor in 2 tabs and save them almost simultaneously to get 2 jobs going.

@nickdunn
Author

nickdunn commented Oct 6, 2021

Hey @nickdunn, do you have SEOmatic installed by chance? If so, can you please ensure that you are running its latest version? We have seen similar issues where that ended up being the culprit; it has been fixed in v3.4.14.

We do, and have recently updated to the latest version. We haven't seen the error since... but then again we've only seen the error a handful of times in 6 months, so it's sporadic. I don't think this was the culprit, since we've had regenerateSitemapsAutomatically set to false for a long time, so SEOmatic isn't doing anything queue-related — so I think this is different to nystudio107/craft-seomatic#989.
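
For reference, this is roughly how we have that set (a minimal sketch, assuming SEOmatic's config/seomatic.php config override file; the setting name is as above):

<?php
// config/seomatic.php -- plugin config override (sketch)
return [
    'regenerateSitemapsAutomatically' => false,
];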

@brandonkelly
Member

@lassemt Do you have multiple queue runners?

@wallacio

wallacio commented Oct 7, 2021

I'd like to chip in with a similar experience, although it's not just a single resource/element that it's happening on. We do have a fairly chunky install, with 9 sites, ~50k rows in elements and ~8k in structureelements. Max revisions is set to 5 in config, and although we're using Redis for caching, it's not being used for the queue; the default driver is being used here. I'd consider offloading to Redis if that would help or is recommended anyway. The queue runner in the offending environment is the default (triggered by the CP rather than queue/run in a cron job).

MariaDB pegged a single core (out of 4) once this behaviour started, and it had a huge impact on DB performance such that the CP was unusable (couldn't save entries reliably). Killing the DB process was the only way to resolve things. I have since run prune-revisions via the CLI and it ripped through pretty quickly.

queue.log attached for info, not that it reports anything that hasn't been seen already (what's with the order of the entries being messed up though? why aren't they in ascending time order?)

Craft 3.6.18
Plugins:
"bertoost/craft-mailjet": "1.2.3",
"born05/craft-imagehotspots": "1.2.0",
"chasegiunta/scss": "1.0.0",
"craftcms/feed-me": "4.3.6",
"craftcms/redactor": "2.8.8",
"endroid/qr-code": "^3.9",
"ether/simplemap": "3.9.1",
"mmikkel/reasons": "2.2.5",
"nthmedia/entry-gps-coordinates": "1.3.1",
"verbb/field-manager": "2.2.3",
"verbb/formie": "1.4.12",
"verbb/super-table": "2.6.8"

queue.log:
queue-snip.log

@lassemt

lassemt commented Oct 7, 2021

@lassemt Do you have multiple queue runners?

@brandonkelly I don't think so (I'm not too familiar with queues in PHP). As far as I know, I'm only running the default queue driver. I disabled the ostark/craft-async-queue plugin before updating and running the previous tests.

@nickdunn
Author

nickdunn commented Oct 7, 2021

Using Async Queue here, default 2 concurrent processes.

@brandonkelly
Member

@nickdunn I have a feeling multiple queue runners have a tendency to exacerbate these sorts of issues.

@sjcallender

Yeah, use 1 queue processor and you'll be fine. We learned this the hard way too, hence: https://twitter.com/sjcallender/status/1433238613039407105

@nickdunn
Author

Ok, we've upgraded Craft, Neo and SEOmatic to the latest versions and have dropped Async Queue to one queue processor, so we'll see if this fixes it.

@kevinmu17

Same here!

Our client's entire site went down after a few "Internal server error" messages. As the admin I could test this once, and after that the website + CP went down.
The Yii toolbar gave an error:
Lock wait timeout exceeded; try restarting transaction

There were 2 failed tasks in the queue before it threw me out that stated something about Pruning extra revisions:

2021-10-25 12:28:34 [-][1440][-][info][craft\queue\QueueLogBehavior::beforeExec]  [36509] Pruning extra revisions (attempt: 1, pid: 1930471) - Started
2021-10-25 15:09:29 [-][1440][-][error][craft\queue\QueueLogBehavior::afterError]  [36509] Pruning extra revisions (attempt: 1, pid: 1930471) - Error (time: 9655.015s): SQLSTATE[HY000]: General error: 2006 MySQL server has gone away
2021-10-25 15:09:29 [-][1440][-][error][PDOException] PDOException: SQLSTATE[HY000]: General error: 2006 MySQL server has gone away in /vendor/yiisoft/yii2/db/Transaction.php:203
Stack trace:
#0 /vendor/yiisoft/yii2/db/Transaction.php(203): PDO->rollBack()
#1 /vendor/craftcms/cms/src/services/Elements.php(1541): yii\db\Transaction->rollBack()
#2 /vendor/craftcms/cms/src/queue/jobs/PruneRevisions.php(75): craft\services\Elements->deleteElement()
#3 /vendor/yiisoft/yii2-queue/src/Queue.php(246): craft\queue\jobs\PruneRevisions->execute()
#4 /vendor/yiisoft/yii2-queue/src/cli/Queue.php(147): yii\queue\Queue->handleMessage()
#5 /vendor/craftcms/cms/src/queue/Queue.php(131): yii\queue\cli\Queue->handleMessage()
#6 [internal function]: craft\queue\Queue->craft\queue\{closure}()
#7 /vendor/yiisoft/yii2-queue/src/cli/Queue.php(117): call_user_func()
#8 /vendor/craftcms/cms/src/queue/Queue.php(140): yii\queue\cli\Queue->runWorker()
#9 /vendor/craftcms/cms/src/controllers/QueueController.php(84): craft\queue\Queue->run()
#10 [internal function]: craft\controllers\QueueController->actionRun()
#11 /vendor/yiisoft/yii2/base/InlineAction.php(57): call_user_func_array()
#12 /vendor/yiisoft/yii2/base/Controller.php(181): yii\base\InlineAction->runWithParams()
#13 /vendor/craftcms/cms/src/web/Controller.php(190): yii\base\Controller->runAction()
#14 /vendor/yiisoft/yii2/base/Module.php(534): craft\web\Controller->runAction()
#15 /vendor/craftcms/cms/src/web/Application.php(274): yii\base\Module->runAction()
#16 /vendor/craftcms/cms/src/web/Application.php(577): craft\web\Application->runAction()
#17 /vendor/craftcms/cms/src/web/Application.php(253): craft\web\Application->_processActionRequest()
#18 /vendor/yiisoft/yii2/base/Application.php(392): craft\web\Application->handleRequest()
#19 /public_html/index.php(22): yii\base\Application->run()
#20 {main}

We found this query task on our server. We think this was our little devil that crashed everything. (We manually killed it because there was no other option left)
| 11450319 | DBNAME | localhost | DBNAME | Query | 0 | Updating | UPDATE structureelements SET rgt=rgt-2, dateUpdated='2021-10-25 13:08:37' WHERE (rgt >= 512) AND (root=123) | 0.000 |

Also these:
| 11469436 | DBNAME | localhost | DBNAME | Query | 1874 | Waiting for table metadata lock | UPDATE users SET lastLoginDate='2021-10-25 12:36:15', dateUpdated='2021-10-25 12:36:15' WHERE id=1 | 0.000 |
| 11469471 | DBNAME | localhost | DBNAME | Query | 1871 | Waiting for table metadata lock | UPDATE users SET lastLoginDate='2021-10-25 12:36:18', dateUpdated='2021-10-25 12:36:18' WHERE id=1 | 0.000 |
| 11466868 | DBNAME | localhost | DBNAME | Query | 2836 | Waiting for table metadata lock | LOCK TABLES assets
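
For anyone in the same spot, this is roughly how such a blocking query can be found and killed (the thread id here is just the one from our process list above):

SHOW FULL PROCESSLIST;  -- lists running queries with their thread ids and state
KILL 11450319;          -- kills the blocking UPDATE by its Id column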

Maybe you guys can shed some light on this issue so we can maybe prevent this in the future.

  • PHP version: 7.4.23
  • Craft CMS Pro: 3.6.11.2
  • MariaDB: 10.3.31

@sjcallender

@kevinmu17 - This is the exact issue that forced us to use just one queue processor. We've been fine ever since.

@kevinmu17

@kevinmu17 - This is the exact issue that forced us to use just one queue processor. We've been fine ever since.

We didn't install any, or does craft out of the box use more than 1? :)

@sjcallender

We didn't install any, or does craft out of the box use more than 1? :)

By default, the queue is run when you visit the control panel. I'm not sure if two or more can ever be initiated this way. @brandonkelly?

I'd recommend setting 'runQueueAutomatically' => false in config/general.php and then enabling a single alternative queue worker as noted in the docs: https://craftcms.com/docs/3.x/config/config-settings.html#runqueueautomatically
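
A minimal sketch of that setup (paths and the schedule are examples only; the docs page above also covers daemonized workers):

<?php
// config/general.php
return [
    'runQueueAutomatically' => false,
];

# crontab: run a single queue worker once a minute
* * * * * /usr/bin/php /var/www/html/craft queue/run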

@angrybrad
Member

@kevinmu17 I'd suggest updating to the latest 3.7 release as a starting point and see if the behavior changes. There have been improvements made around transactions and locking that you're missing.

@nickdunn
Author

nickdunn commented Nov 3, 2021

Could #9977 be a potential fix for this?

Edit: unlikely — locally I've just run utils/repair/section-structure against my main "Page" structure and the number of structureelements rows has remained at around 55k.

@brandonkelly
Member

@nickdunn You updated to 3.7.19 first right?

@nickdunn
Author

nickdunn commented Nov 4, 2021

Ha, yes I did! The utils/repair/section-structure CLI task ran and output the tree structure of my entries.

But I'm thinking maybe 55K structureelements rows is par for the course for a few hundred entries. I have 3 Structure sections all totalling ~250 entries.

I must admit I don't fully understand what the lower-level structures and structureelements are, but I have other plugins that use things that are "ordered" (Neo blocks, Navigation, etc.), so maybe this adds rows to the table and 55K rows is expected. There are 18K rows in structures, 8K of which have dateDeleted set.
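
In case it helps anyone else dig into their own counts, a quick query like this (a sketch run directly against the DB; adjust for any table prefix) shows which structures account for the rows:

SELECT `root`, COUNT(*) AS row_count
FROM `structureelements`
GROUP BY `root`
ORDER BY row_count DESC
LIMIT 10;

-- and how many structures are soft-deleted, waiting for garbage collection
SELECT COUNT(*) FROM `structures` WHERE `dateDeleted` IS NOT NULL;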

@brandonkelly
Member

Ah… it’s due to Neo. Each individual Neo field/element combo is going to add its own additional structure.

@hiasl
Contributor

hiasl commented Jan 28, 2022

We just updated to 3.7.10.1 since we had similar problems with a version < 3.7.19, but the problem still persists, although now it no longer creates that massive DB load that previously could only be resolved by restarting the database.

Here are some facts:

  • It is related to "Pruning extra revisions" tasks; one started a day before and has made no progress at all.
  • The database is executing the same queries again and again:
SELECT *
FROM `structureelements`
WHERE ((`lft` > 44) AND (`rgt` < 45) AND (`level` <= 2)) AND (`root`=1919849)
ORDER BY `lft`

UPDATE `structureelements` SET `lft`=`lft`-2, `dateUpdated`='2022-01-28 08:05:27' WHERE (`lft` >= 46) AND (`root`=1919849)

DELETE FROM `structureelements` WHERE ((`lft` >= 44) AND (`rgt` <= 45)) AND (`root`=1919849)
  • We have failed "prune extra revision tasks" with the following error:
SQLSTATE[HY000]: General error: 1205 Lock wait timeout exceeded; try restarting transaction The SQL being executed was: DELETE FROM `structureelements` WHERE ((`lft` >= 41) AND (`rgt` <= 42)) AND (`root`=1919849)
  • I have 1 queue runner via a cronjob using flock to prevent concurrent runs (runQueueAutomatically disabled); see the example crontab sketch after this list. The queue runner shows the "prune" job being executed hanging at 33% (no changes for 24 hours). Although this job is active/reserved, other tasks are still handled (updating search indexes primarily). My first guess would be that I have multiple runners, which I cannot confirm. I mentioned this in a support ticket before: the queue sometimes loses track of its running tasks and does not close them. It has something to do with database timeouts and long-running queue tasks.
  • Next I emptied the queue and made sure no queue runner was running (ps ax, grepping for the queue tasks - nothing). But the database queries mentioned above are still happening, again and again. I cannot stop them, so they must be self-triggering or something.
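
For context, the cron entry looks roughly like this (paths and lock file name are placeholders, not our exact setup):

# flock -n exits immediately if a previous run still holds the lock, so workers never overlap
* * * * * /usr/bin/flock -n /tmp/craft-queue.lock /usr/bin/php /var/www/craft queue/run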

I know that some of my findings above sound rather confused, but this is what I am seeing.

I need to mention that we have Neo installed in version 2.8.16 and we need to update that. Will happen soon and I will keep you posted.

@metheeruetaichetcharoen-toast

@hiasl We have the exact same issue. We've tried rebooting the cluster DB and the issue still persists. Did you do anything to resolve your issue?

brandonkelly added a commit that referenced this issue Nov 18, 2022
Fixes #9905 - where a deadlock error would occur when updating a row referencing an elements table row, which may be locked by another request
@brandonkelly
Member

We finally got to the bottom of this, with much help from @aaronbushnell 🎉

The deadlock errors occurred when updating rows in the structureelements or relations tables, which referenced an elements table row that was currently being edited by a separate request.

We fixed it by dropping the foreign keys that relate these tables together, and deferring the “cascade deletes” to garbage collection.

Since that involved a new DB migration, we decided to fix it for Craft 4.4 rather than the next patch releases.

To test the fix early, you can change your craftcms/cms requirement in composer.json to "4.4.x-dev#45d78959f9f08930bec5b3f6bb7235a9bb103cc2" and run composer update. Note that reverting to Craft 4.3 is not possible due to the new 4.4 migrations though, so create a database backup before you update.
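
In composer.json that looks something like this (only the craftcms/cms constraint changes; the rest of your require block stays as-is):

{
  "require": {
    "craftcms/cms": "4.4.x-dev#45d78959f9f08930bec5b3f6bb7235a9bb103cc2"
  }
}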

@aaronbushnell
Contributor

I tried this update in my Craft 4 branch of this project and it works, @brandonkelly! Thanks so much for the work you and Brad put into this!

I presume this probably requires Craft 4 and couldn't be a fix backported to Craft 3?

@nickdunn
Author

Yay, well done!

Stupid question, but is it genuinely safe to remove these? Don't the FKs provide data integrity for those relationships? And is there a risk that any plugins rely on these FK rows and it may be a bigger breaking change?

@brandonkelly
Member

brandonkelly commented Nov 18, 2022

@aaronbushnell Due to the schema change, we wouldn’t backport the fix to v3 unless there was going to be a 3.8 release. So maybe down the road (e.g. if we have a need to bump the minimum PHP version), but not in the immediate future.

Update: We ended up releasing 3.8 alongside 4.4.

@nickdunn It can be a concern, but not in this case. Both relations and structureelements will just get joined into element queries based on elements.id matches. There’s no harm in having extra rows in there referencing nonexistent element IDs; they will just be ignored.

(Plus, elements typically get hard-deleted via garbage collection, so in most cases those extra rows will only be around for a fraction of a millisecond, between the time the elements are hard-deleted and the orphaned extra rows are deleted.)
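
Conceptually, that garbage-collection step amounts to something like this (an illustrative sketch, not the exact queries Craft runs; relations gets the same treatment via its sourceId/targetId columns):

-- drop structureelements rows whose element no longer exists
DELETE se FROM `structureelements` se
LEFT JOIN `elements` e ON e.id = se.`elementId`
WHERE se.`elementId` IS NOT NULL AND e.id IS NULL;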

@aaronbushnell
Contributor

Due to the schema change, we wouldn’t backport the fix to v3 unless there was going to be a 3.8 release. So maybe down the road (e.g. if we have a need to bump the minimum PHP version), but not in the immediate future.

That makes sense, thanks for the follow-up @brandonkelly!

brandonkelly added a commit that referenced this issue Feb 21, 2023
Fixes #9905 for Craft 3.8
@brandonkelly
Member

Craft 3.8 and 4.4 are out now with this deadlock fix!

@boboldehampsink
Contributor

@brandonkelly I'm having a lot of issues with structures being orphaned on multiple 3.8 sites, could this change have anything to do with it?

@brandonkelly
Member

@boboldehampsink This change could lead to orphaned structureelements rows that reference a now-deleted element ID, but those shouldn’t have any noticeable effect, and will get deleted during garbage collection.

If you want to send in a DB backup + Composer files, and point out the problem, we can look into it from there. Or if you can come up with steps to reproduce, post a new issue about it.

@boboldehampsink
Contributor

@brandonkelly I was already mailing with support (Oli) about this. Just sent another dump where it can be reproduced by running GC, see support mail.

@olivierbon
Member

Related spicywebau/craft-neo#725

@brandonkelly
Member

brandonkelly commented Mar 20, 2023

@boboldehampsink Is your issue specific to Neo as well?

@boboldehampsink
Contributor

@brandonkelly I do have this specific issue, but I had one other site that had non-Neo structure-related issues after GC - it just broke a Structure section. So my guess is it's not Neo but Craft (structure) specific.

@lthurston

After an upgrade from 3.7.68 -> 3.8.3 last week we've had several structures flatten, and attempts to edit them result in server errors. This is an instance with the Neo and Navigation plugins, among others, but non-plugin structures (the page tree, for instance) were also flattened. We don't know when this background process ran, so we're rolling back (spinning up a new host based on a snapshot, rolling back the db to before the upgrade and restoring a pre-upgrade release).

Stack: headless Craft, NextJS, Postgres

@brandonkelly brandonkelly reopened this Mar 20, 2023
@brandonkelly
Member

brandonkelly commented Mar 20, 2023

Thanks all! I was able to reproduce with @boboldehampsink’s DB, and tracked it down to a PostgreSQL + garbage collection bug in 3.8 + 4.4.

Craft 3.8.4 and 4.4.4 are out now with a fix. (96c56f1)

@lthurston

@brandonkelly The flattening of structures appears to be fixed, but GC in 3.8.4 seems to be removing tag, category and asset relations. I've been able to reproduce this two times now by rolling back, looking at entries, running GC, looking again.

@brandonkelly
Member

Doh, sorry about that. We were so focused on this being a structure bug, I completely forgot that we made the same GC change for relations as well.

Craft 3.8.5 and 4.4.5 are out with the same fix for relations.

@lthurston

We'll test it out and will confirm.

@lthurston

Looks good to us, thanks!

@paragonn

paragonn commented May 24, 2023

Hi, I recently upgraded a large site from Craft CMS 3.7.55.1 to Craft CMS 4.4.11 and I am receiving the following error when trying to save changes to an entry:
SQLSTATE[HY000]: General error: 1205 Lock wait timeout exceeded; try restarting transaction

I added some test blocks to a page that I had forgotten to remove. I went into the entry, deleted the test blocks, and clicked save, and that is when the error happened. I now can't make any changes to the entry without receiving the error above in the logs and an Internal Server Error in the CP.

I tried to add a block to other entries and they seem to be fine, however, I am concerned that this is not an isolated case and I could receive more of these errors.

@brandonkelly

@brandonkelly
Member

Hey @paragonn, it sounds like maybe your mutex lock didn’t clear itself for some reason. How is your mutex component configured within config/app.php?

@paragonn

paragonn commented May 24, 2023

Hey @brandonkelly -- I should also have mentioned that it is a load-balanced environment across two servers. We have a redis component configured, but no mutex. Should we add a mutex configuration?

Below is the components configuration that we have in place in the config/app.php:
$components = [
    'redis' => [
        'class' => yii\redis\Connection::class,
        'hostname' => App::env('REDIS_HOST'),
        'port' => App::env('REDIS_PORT'),
        'password' => App::env('REDIS_PASSWORD'),
    ],
    'cache' => [
        'class' => yii\redis\Cache::class,
        'defaultDuration' => 86400,
        'keyPrefix' => App::env('APP_ID') ?: 'CraftCMS',
    ],
];

@angrybrad
Member

@paragonn Yup! See https://craftcms.com/knowledge-base/configuring-load-balanced-environments#mutex-locks

Probably worth giving the rest of that page a once-over, too.
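
The shape is roughly this (a sketch based on that KB article, reusing the redis connection component you already have; double-check it against the article for your exact Craft version):

<?php
// config/app.php (sketch) -- give Craft's mutex a Redis-backed driver so locks are shared across both servers
return [
    'components' => [
        'mutex' => [
            'mutex' => [
                'class' => yii\redis\Mutex::class,
                'redis' => 'redis', // the redis connection component defined above
            ],
        ],
    ],
];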

@paragonn

Thanks @angrybrad !
