We had an issue with our busiest client, where some updates were failing with a “lock wait timeout." These were updates of a single row in a single table, and when I ran out of my own ideas, I threw it at an LLM.
The research phase
The main weakness of LLMs is that they’re unliely to say, “I don’t know. Maybe it’s not here." Hence, the first thing I did was play along and waste a week on “lock ordering” changes, so that tables would be locked in a consistent order across the main site. We (someone else with an LLM) identified one query doing a full table scan up front, which would lock everything unintentionally. But with that fixed, and other ordering changes in place, there was a 0.0% reduction in problems in production.
With my human insight, I looked outside the subsystem, and used the LLM again to analyze the access logs; this found a long-running request in the administrative backend, which was active at the time of the lock-wait errors. I rediscovered internal timing logs from the admin area, and they confirmed the activity with a narrower time window, but I still didn’t know what was wrong yet.
The final clue came when I had an LLM generate code to instrument the main site, to show lock status when receiving a timeout error. When the resulting spam was run back through the LLM, it pointed to the contract type table as the source of the failures. But, the admin page wasn’t planning to write to the type, only the contract itself. The problem was clearly that it had acquired too strong of a lock.
The actual problem
SELECT … FOR UPDATE in MySQL/InnoDB acquires a write lock on all records scanned. This means means that
selecting from t1 and joining two other tables will take write locks on all three tables. For us, blocking
reads of contract types would behave the same as locking all contracts of that type, across a much larger
number of queries for unrelated contracts.
To limit which tables have write locks acquired, MySQL extended the syntax: SELECT … FOR UPDATE OF t1 will lock only t1 for writing. Multiple tables can still be locked by comma-separating them.
There’s an additional option which may follow, the NOWAIT (fail the statement if any matched rows are
already locked) or SKIP LOCKED (do not lock or return rows which are already locked by another thread.)
Without options, the normal “wait for locks” behavior applies. Hence, the final statement shape may look
something like:
SELECT … FOR UPDATE OF t1, t2 SKIP LOCKED;
SELECT … FOR UPDATE OF t1 NOWAIT;
SELECT … FOR UPDATE OF t1;
We haven’t seen another lock wait timeout since putting this small change into production, which is anywhere from a 0 to 100% decrease in failures. We will never know exactly, because I couldn’t leave well enough alone. Next week, we’ll release another change that segments the work in the admin page. Instead of locking “up to the limit” rows at once, it will lock in blocks of 20, and stream the results.
No comments:
Post a Comment