Gunjan Sharma

System Design · Incident Report

The Night Our Database Locked Up and Took Down ₹2Cr in Transactions

· Updated

It was 2:47 AM when my phone went off.

Not a gentle buzz. Three back-to-back alerts from our monitoring system, all screaming the same thing: transaction processing had stopped. Completely. The kind of stop where your stomach drops before your brain even fully wakes up.

I was the on-call engineer that week. I grabbed my laptop off the nightstand, opened Slack in the dark, and what I saw made my blood go cold. 47 unread messages. Our CEO was already in the channel.

The context: We were running a fintech platform processing real estate investment transactions. That night, roughly ₹2 crore worth of settlement instructions were sitting in a queue, waiting. Just waiting. And the queue wasn't moving.

Here is what happened, exactly, step by step — because I wrote every detail down in a post-mortem document the next morning. I promised myself I would publish this someday.

Today is that day.

The Background

Our core transaction table was a MySQL InnoDB table with approximately 14 million rows. We had the standard indices: primary key on transaction_id, an index on user_id, an index on status, and a composite index on (created_at, status) for our most common dashboard queries.

For about 8 months, this setup had been rock solid. Response times were healthy. P95 latency for our settlement pipeline was under 400ms. We were proud of it.

Then a product manager came to me two weeks before the incident and said, "The finance team wants a live dashboard showing transaction volumes grouped by investment category, region, and date range. Can we build it?"

I said yes. I should have asked more questions.

What I Built (That Caused the Problem)

The analytics endpoint I built was straightforward on the surface. It ran a query like this:

SELECT category, region, DATE(created_at) as date, COUNT(*) as count, SUM(amount) as total FROM transactions WHERE created_at BETWEEN ? AND ? GROUP BY category, region, DATE(created_at) ORDER BY date DESC;

On a staging database of 200,000 rows, this query ran in 80ms. I tested it, it looked good, I deployed it.

What I did not think about: the finance dashboard auto-refreshed every 30 seconds. And there were 12 finance team members using it simultaneously during business hours. And our production table had 14 million rows, not 200,000.

What I definitely did not think about: InnoDB row-level locking behavior during long-running scans.

The actual production query was doing a full table scan because my date range filter did not align with the composite index. It was scanning millions of rows. Each scan was acquiring and holding shared locks. The query was taking 18-22 seconds to complete. And every 30 seconds, a new one started.

At around 11 PM, enough concurrent long-running analytics queries had stacked up that our settlement processing began timing out waiting for locks. By 2:30 AM — during a scheduled batch settlement run — the system completely deadlocked.

The Incident

I SSH'd into the database host. First thing I ran:

SHOW PROCESSLIST;

And I saw it. 23 queries in the list. 21 of them were my analytics query, in various states of "Sending data" or "Waiting for table metadata lock." The other two were our settlement processor, both stuck on "Waiting for table flush."

I ran SHOW ENGINE INNODB STATUS and the deadlock section was enormous. Locks held, locks waiting, transaction IDs — it looked like a war zone.

The immediate fix was brutal and embarrassing: I killed all the running analytics queries manually, one by one, using KILL QUERY [id]. The settlement processor immediately resumed. Transactions started flowing within 90 seconds of the kills.

But the problem wasn't solved. It would come back the moment the dashboard started refreshing again.

The Real Fix

At 4 AM, with the immediate crisis over, I sat down and thought about what should have been obvious from the start.

The analytics query should never have been running against the live transaction table at all. That is a separation of concerns violation so basic that I am still embarrassed about it.

Here's what we changed:

First, we created a read replica specifically for analytics. All dashboard queries were routed to the replica. This immediately isolated analytics traffic from the write-heavy OLTP workload on the primary.

Second, we added a proper materialized summary table. A background job runs every 5 minutes and updates a transaction_summaries table pre-aggregated by category, region, and date. The dashboard now queries this table instead — the query went from 18 seconds on the live table to 3ms on the summary table.

Third, we added query timeouts at the application level. Any analytics query that runs for more than 5 seconds is automatically killed. This is the last line of defense. You never want to rely on it, but you absolutely need it.

Fourth, and this is the one that actually surprised me when I thought about it — we changed the dashboard refresh to be on-demand rather than automatic. The finance team never asked for auto-refresh. A product manager assumed they wanted it. When we removed auto-refresh and added a manual refresh button, nobody complained. Not a single person.

The auto-refresh that caused the entire incident was a feature nobody actually needed.

What I Learned

One: Test your queries against production-scale data. Always. Staging environments are useful but they will lie to you about performance at scale. We now have a copy of production data (anonymized) in our staging environment. It changed everything about how we catch issues before deployment.

Two: Analytical and transactional workloads have fundamentally different requirements. They should not share a database, and they should not share a table. If you are running GROUP BY queries with date ranges on the same MySQL instance that is processing payments, you are one bad query away from exactly what happened to me.

Three: Understand what your database does under concurrent load. I understood SQL. I did not understand InnoDB locking mechanics deeply enough. After this incident, I spent a week reading about shared locks, exclusive locks, intention locks, gap locks, and the metadata locking system. That week of learning was the most valuable engineering education I got that entire year.

Four: Read replicas are not optional for serious production systems. They are essential. The cost of a read replica is trivial compared to the cost of analytics killing your write path.

Five: Always ask "who uses this and how often?" before you build a feature. The auto-refresh frequency was the variable that turned a moderately bad query into a production disaster. If the dashboard had refreshed every 5 minutes instead of every 30 seconds, we might never have hit the problem at scale. Understanding usage patterns is as important as understanding query performance.

The Aftermath

The total downtime for our settlement pipeline was 1 hour and 14 minutes. ₹2.1 crore in transactions were delayed. Not lost — just delayed. They all processed correctly once the locks cleared.

We wrote a thorough post-mortem. The document ran to 11 pages. It had a full timeline, a root cause analysis section, a section on immediate fixes, a section on long-term fixes, and a section — which I insisted on including — titled "What the engineer on call could have done differently." That last section had my name in it. I wrote it myself.

I think that was the most important part of the whole document. It would have been easy to frame it as "the system had a design flaw." That would have been accurate but incomplete. The system had a design flaw that I introduced, did not test adequately, and did not review carefully enough.

Taking ownership of that — in writing, in a document that the entire engineering team and the CEO read — was uncomfortable. But it was the right thing to do.

The incident made our system better, our team more careful, and me a significantly more thorough engineer. I would not trade it.

I would just prefer not to do it at 2:47 AM again.