It was 9:03 AM on a Monday morning. Our on-call Slack channel lit up with three alerts in the span of 90 seconds.
API p95 latency > 5000ms
MySQL connection pool exhausted
User-facing error rate > 12%
I was in the middle of my first coffee. By the time I opened the dashboard, response times had climbed to 14 seconds. Our portfolio summary API — the first thing every user sees when they log in — was timing out.
This is the story of a cache stampede. And more importantly, how we stopped it from ever happening again.
What Is a Cache Stampede
A cache stampede, also called a thundering herd, happens when a large number of cache keys expire at the same moment. Every request that was being served from cache suddenly hits the database at once. If your database cannot handle that volume of concurrent queries, it collapses under the load — and then every thread waiting for a response from the database starts queuing, which makes the problem worse.
It is a feedback loop. More requests wait. More threads pile up. The database slows further. Latency climbs. Users retry. Retries make the pile worse.
How We Got Here
Our platform served portfolio summary data to investors. Each user had a summary object that aggregated their investments, returns, and valuations. This was expensive to compute — it joined across 7 tables and did some currency conversion calculations.
We cached these summaries in Redis with a TTL of 3600 seconds (one hour).
The problem: our caching was done inside a nightly batch job that regenerated all user portfolio summaries and wrote them to Redis in a tight loop. The job ran at 6 AM, and it wrote roughly 11,000 keys, all with SETEX key 3600 value. Because the job ran sequentially and finished in about 12 minutes, all 11,000 keys had TTLs that were set within a 12-minute window.
At 7 AM, the first keys started expiring. By 9 AM, the bulk of them had expired in a 30-minute burst — right as user traffic ramped up from the morning logins.
Every user that hit the portfolio summary endpoint during that window found no cache. Every one of those requests hit the database. We had about 600 concurrent active sessions at peak. 600 queries against a MySQL instance that was sized for 50-100 concurrent queries.
The Diagnosis
Once I understood the error pattern, the cause was clear from the logs. I pulled up our APM tool and looked at the database query trace.
Every timeout was on the same query:
SELECT i.*, p.nav, p.change_percent
FROM investments i
JOIN property_nav p ON i.property_id = p.id
JOIN users u ON i.user_id = u.id
WHERE i.user_id = ? AND i.status = 'active'
This query took about 180ms under normal conditions. With 600 concurrent copies of it running simultaneously, the MySQL connection pool (max 100 connections) was saturated. Queries were queuing. Queue time itself was adding 8-12 seconds per request.
I checked Redis with redis-cli INFO stats and compared keyspace_hits against keyspace_misses. Hit rate had dropped from 98.7% to 3.1%.
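For readers who want to reproduce that check: Redis exposes cumulative keyspace_hits and keyspace_misses counters in the stats section of INFO, and the hit rate has to be derived from them. The sketch below is one way to do that. The function names are made up for illustration, and because the counters are cumulative since server start, a rate for a specific window needs the difference between two snapshots.

```javascript
// Parse the text returned by `INFO stats` and compute the keyspace hit rate.
// Returns null when the counters are missing or no lookups were recorded.
function hitRateFromInfo(infoText) {
  const fields = {};
  for (const line of infoText.split(/\r?\n/)) {
    // Skip blank lines and section headers such as "# Stats".
    if (line === '' || line.startsWith('#')) continue;
    const sep = line.indexOf(':');
    if (sep === -1) continue;
    fields[line.slice(0, sep)] = line.slice(sep + 1);
  }
  const hits = Number(fields.keyspace_hits);
  const misses = Number(fields.keyspace_misses);
  const total = hits + misses;
  if (!Number.isFinite(total) || total === 0) return null;
  return hits / total;
}

// Hit rate over an interval: diff two snapshots of the cumulative counters.
function hitRateBetween(before, after) {
  const hits = after.hits - before.hits;
  const misses = after.misses - before.misses;
  const total = hits + misses;
  return total === 0 ? null : hits / total;
}
```

Taking one snapshot before and one during an incident, then calling hitRateBetween, gives the rate for just that interval instead of a lifetime average.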
The fix was obvious in retrospect. The root causes were:
1. Synchronized TTL — all keys expired at the same time
2. No fallback for cache miss — all misses went directly to DB
3. No request coalescing — multiple requests for the same user's data all queued separate DB queries
Immediate Mitigation
First thing I did: manually re-prime the cache. I ran a script that pulled the most recently active user IDs from the DB and pre-populated their portfolio summaries back into Redis. This took 4 minutes and immediately brought hit rates back up to 94%. Latency dropped from 14 seconds back to under 200ms.
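The script itself is not shown here, but a re-prime of this kind could look roughly like the sketch below. Everything in it is an assumption for illustration: the compute and cacheSet callbacks stand in for the real database query and Redis write, the portfolio:<id> key format is hypothetical, and the concurrency limit is there so the re-prime does not itself overload the database.

```javascript
// Re-prime a cache for a list of user IDs with bounded concurrency.
// `compute(userId)` and `cacheSet(key, ttlSeconds, value)` are injected so the
// sketch does not assume a particular database or Redis client.
async function reprimeCache({ userIds, compute, cacheSet, ttlSeconds = 3600, concurrency = 10 }) {
  let next = 0;
  let written = 0;

  async function worker() {
    while (next < userIds.length) {
      // No await between the check and the increment, so workers never share an ID.
      const userId = userIds[next++];
      const summary = await compute(userId);
      await cacheSet(`portfolio:${userId}`, ttlSeconds, JSON.stringify(summary));
      written++;
    }
  }

  // Start at most `concurrency` workers that pull IDs from the shared cursor.
  const workerCount = Math.min(concurrency, userIds.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return written;
}
```

Ordering userIds by most recent activity, as described above, means the users most likely to log in next are warmed first.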
Once the immediate fire was out, we started the real fix.
Fix 1: TTL Jitter
Instead of setting every key with a fixed TTL of 3600 seconds, we added random jitter:
const baseTTL = 3600;
const jitter = Math.floor(Math.random() * 600); // 0-599 seconds
await redis.setex(key, baseTTL + jitter, value);
This spreads expirations over a 10-minute window instead of a single point. No more synchronized mass expiry.
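To keep the jitter logic in one place and make it testable, it can be wrapped in a small helper. This is a sketch, not the code we shipped: the ttlWithJitter name and the injectable random parameter are assumptions added for illustration.

```javascript
// Return a TTL in seconds: baseTTL plus a random integer in [0, maxJitter].
// `random` is injectable so the behaviour can be tested deterministically.
function ttlWithJitter(baseTTL, maxJitter, random = Math.random) {
  return baseTTL + Math.floor(random() * (maxJitter + 1));
}
```

A call site would then read `await redis.setex(key, ttlWithJitter(3600, 600), value);`. Note that jitter widens whatever write window already exists, so a batch written over 12 minutes ends up expiring over roughly 22 minutes with these values.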
Fix 2: Probabilistic Early Expiration (PER)
Even with jitter, a single popular key expiring still causes a spike for that user if they happen to be active. We implemented probabilistic early expiration — a technique where you probabilistically recompute a cache value before it actually expires:
function shouldRecompute(ttlRemaining, computeTime, beta = 1.0) {
  // -Math.log(Math.random()) is exponentially distributed, so the chance of
  // returning true rises as ttlRemaining shrinks. Both times must share a unit.
  return -computeTime * beta * Math.log(Math.random()) >= ttlRemaining;
}
If shouldRecompute returns true, the current request recomputes the value and refreshes the cache in the background, before the key expires. This means by the time the key actually expires, the value has already been refreshed. The cache never truly goes cold.
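It helps to see how likely an early recompute is at a given moment. Because Math.random() is uniform, the check succeeds with probability exp(-ttlRemaining / (computeTime * beta)). The sketch below evaluates that closed form. The function name is made up, and the 0.18 s compute time is borrowed from the query time quoted earlier purely as an example.

```javascript
// Probability that shouldRecompute(ttlRemaining, computeTime, beta) returns true.
// Both time arguments must use the same unit (seconds here).
function recomputeProbability(ttlRemaining, computeTime, beta = 1.0) {
  if (ttlRemaining <= 0) return 1; // already expired: always recompute
  return Math.exp(-ttlRemaining / (computeTime * beta));
}

// With a 0.18 s compute time, print the per-request chance at a few remaining TTLs.
for (const ttl of [60, 1, 0.5, 0.18, 0.05]) {
  console.log(`${ttl}s left -> ${recomputeProbability(ttl, 0.18).toExponential(2)}`);
}
```

With a compute time this short, the per-request chance stays well under 1% until about the last second before expiry, so early refresh depends on the key receiving traffic in that final stretch. Raising beta above 1 moves the refresh earlier.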
Fix 3: Request Coalescing with a Mutex
For the rare case where a cache miss still occurs, we added a Redis-based mutex to ensure only one request recomputes the value at a time. Other requests for the same key wait and read the freshly computed value when it is ready.
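const lockKey = `lock:portfolio:${userId}`;
// NX: set only if absent. EX 10: auto-expire so a crashed holder cannot block forever.
const lock = await redis.set(lockKey, '1', 'NX', 'EX', 10);
if (lock) {
  // This request won the lock, compute and cache
  try {
    const data = await computePortfolioSummary(userId);
    await redis.setex(cacheKey, baseTTL + jitter, JSON.stringify(data));
    return data;
  } finally {
    // Release the lock even if the computation throws
    await redis.del(lockKey);
  }
} else {
  // Wait for the lock holder to finish, then read from cache
  for (let attempt = 0; attempt < 10; attempt++) {
    await sleep(200);
    const cached = await redis.get(cacheKey);
    if (cached !== null) return JSON.parse(cached);
  }
  // Lock holder did not finish in time: fall back to computing directly
  return computePortfolioSummary(userId);
}
Fix 4: Decoupled Batch Job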
We also changed the batch job so it no longer wrote all keys in a tight loop. Instead, it wrote keys with randomized delays between writes, ensuring the initial TTLs were already spread out from the moment they were first created.
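A minimal sketch of that idea follows. It is not the actual job: writeEntry stands in for the real Redis write, and the function name, the maxDelayMs default, and the injectable sleep and random parameters are all assumptions for illustration.

```javascript
// Write entries one at a time, pausing a random 0..maxDelayMs between writes.
// `writeEntry`, `sleep`, and `random` are injected; all names are illustrative.
async function writeWithRandomDelays(entries, writeEntry, {
  maxDelayMs = 500,
  sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
  random = Math.random,
} = {}) {
  let totalDelayMs = 0;
  for (let i = 0; i < entries.length; i++) {
    await writeEntry(entries[i]);
    // No pause is needed after the final write.
    if (i < entries.length - 1) {
      const delayMs = Math.floor(random() * (maxDelayMs + 1));
      totalDelayMs += delayMs;
      await sleep(delayMs);
    }
  }
  return totalDelayMs;
}
```

The trade-off is job duration: with roughly 11,000 keys, an average pause of 250 ms adds about 46 minutes, so the delay range has to be chosen against how long the job is allowed to run.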
Post-Incident Numbers
One week after the fixes:
- Cache hit rate: 99.2% (up from 98.7% pre-incident, never drops during morning traffic)
- p95 latency: 87ms (down from 14,000ms during incident)
- Database concurrent connections at peak: 18 (down from 600+ during incident)
- Zero stampede events in the 4 months since
What I Learned
The thing about cache stampedes is that they are entirely predictable — if you know what to look for. A fixed TTL on a batch-written cache is a time bomb. The math is simple: if you write N keys at time T with TTL of D, those keys will all expire at approximately T+D. If T+D coincides with peak traffic, you will have an incident.
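That arithmetic is easy to turn into a check. The sketch below computes the expiry window for a batch write and tests whether it overlaps a traffic peak. Times are seconds relative to the start of the batch, the function names are made up, and the peak window is a hypothetical value chosen only to show the overlap test.

```javascript
// Expiry window for keys written over [writeStart, writeStart + writeDuration]
// with a fixed TTL. All values are seconds on the same clock.
function expiryWindow(writeStart, writeDuration, ttl) {
  return { start: writeStart + ttl, end: writeStart + writeDuration + ttl };
}

// True when two closed intervals share at least one instant.
function overlaps(a, b) {
  return a.start <= b.end && b.start <= a.end;
}

const batchExpiry = expiryWindow(0, 720, 3600);   // 12-minute write, one-hour TTL
const peakTraffic = { start: 3500, end: 5000 };   // hypothetical peak-traffic window
console.log(batchExpiry, overlaps(batchExpiry, peakTraffic));
```

The expiry window is exactly as wide as the write window, which is why a fast batch job concentrates the risk.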
The fixes are not complicated either. TTL jitter is a 2-line change. PER takes an afternoon to implement properly. Request coalescing is a day of work.
What takes the most effort is recognizing the pattern before the incident. The best time to add cache jitter is when you first design the caching layer. The second best time is now.
If your system uses Redis with batch-written fixed TTLs, go check your TTL distribution. Iterate your keys with redis-cli --scan, run TTL on a sample of them, and look at when they are set to expire. If they cluster around a single time, you have a stampede waiting to happen.
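Once you have a list of remaining TTLs, gathered for example with SCAN and TTL, a small histogram makes clustering obvious. The sketch below is one way to bucket them. The function names and the 5-minute bucket width are assumptions, and how you collect the TTLs is left to whatever client you use.

```javascript
// Bucket remaining TTLs (seconds) into fixed-width bins and count keys per bin.
// Negative values are skipped: TTL returns -1 for no expiry and -2 for a missing key.
function ttlHistogram(ttls, bucketSeconds = 300) {
  const counts = new Map();
  for (const ttl of ttls) {
    if (ttl < 0) continue;
    const bucketStart = Math.floor(ttl / bucketSeconds) * bucketSeconds;
    counts.set(bucketStart, (counts.get(bucketStart) || 0) + 1);
  }
  // Sort by bucket start so the output reads as a timeline.
  return [...counts.entries()].sort((a, b) => a[0] - b[0]);
}

// Share of keys that fall in the single fullest bucket (0 when there are no keys).
function largestBucketShare(histogram) {
  const total = histogram.reduce((sum, [, n]) => sum + n, 0);
  if (total === 0) return 0;
  return Math.max(...histogram.map(([, n]) => n)) / total;
}
```

A largestBucketShare close to 1 means nearly every key expires inside the same few minutes, which is the pattern described in this post.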