Gunjan Sharma

System Design · DevOps

The Jenkins Deploy That Broke Production for 4 Hours: A CI/CD Postmortem


The deploy started at 2:14 PM on a Tuesday. By 2:31 PM, our production API was returning 502s. By 6:09 PM, we had fully recovered. In those four hours, we made every mistake a CI/CD pipeline can allow: no rollback automation, no canary deploy, no health check gates, and a Jenkinsfile that had accumulated 9 months of untested stages.

This is the complete postmortem.

The Pipeline Before the Incident

Our Jenkinsfile at the time of the incident was a linear sequence:

1. Checkout code

2. Install dependencies (npm ci)

3. Run tests

4. Build Docker image

5. Push to ECR

6. SSH into production server

7. docker pull

8. docker stop app

9. docker run new image

10. Send Slack notification

This had been working fine for 9 months. We were deploying 3-4 times per week with no issues.

What Changed

A week before the incident, a developer added a new stage to the Jenkinsfile: a database migration step. The step was:

sh 'node migrate.js --env production'

The migration script was meant to run any pending Sequelize migrations before the new app version started. On paper, reasonable. In practice, there were three problems.

Problem 1: No transaction wrapping. The migration script ran multiple ALTER TABLE statements in sequence. If any one of them failed, the ones that had already executed stayed committed, leaving the database in a partially migrated state.

Problem 2: No version check. The script ran every migration in /migrations that had not been marked as executed in the SequelizeMeta table. After 9 months, this included 47 migration files. Most had already been run in production, but two of them had been applied manually 3 months earlier without the migration tooling (someone had run raw SQL directly). Those two files were not in SequelizeMeta, so the script tried to run them again. One of them included a CREATE TABLE IF NOT EXISTS (harmless). The other included a DROP INDEX IF EXISTS followed by a CREATE UNIQUE INDEX. On a live table with data in it, that index build was expected to take about 11 minutes. While it ran, it held a metadata lock and blocked all writes to that table during the deploy window.

Problem 3: The migration ran before health checks. We did not have a way to abort the deploy if the migration caused problems.
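The version check missing in Problem 2 can be sketched as a pre-flight comparison: list what the tooling considers pending, compare it with what the release intends to run, and refuse to continue if anything else shows up. This is a minimal sketch, not our actual migrate.js. The function name `planMigrations`, the file names, and the idea of an explicit `expectedNames` list are all assumptions made for illustration.

```javascript
// Hypothetical pre-flight check; names and data shapes are illustrative,
// not the real migrate.js.
function planMigrations(migrationFiles, executedNames, expectedNames) {
  const executed = new Set(executedNames);
  const expected = new Set(expectedNames);
  // Sequelize-style tooling treats every file missing from SequelizeMeta as pending.
  const pending = migrationFiles.filter((name) => !executed.has(name)).sort();
  // Anything pending that this release did not declare is a red flag,
  // for example a migration that was applied by hand and never recorded.
  const unexpected = pending.filter((name) => !expected.has(name));
  return { pending, unexpected, safeToRun: unexpected.length === 0 };
}

// Example: 002 was applied manually, so it is missing from the executed list.
const plan = planMigrations(
  ['001-create-users.js', '002-add-unique-index.js', '003-add-column.js'],
  ['001-create-users.js'],
  ['003-add-column.js']
);
console.log(plan.safeToRun, plan.unexpected);
```

A check like this would not have fixed the drift, but it would have stopped the pipeline before the unexpected index migration touched a live table.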

The Failure Sequence

2:14 PM - Jenkins pipeline started. Tests passed.

2:17 PM - Migration script started. 45 migrations marked as pending.

2:19 PM - Migration #23 (the CREATE UNIQUE INDEX one) started. MySQL started building the index on a 4.2 million row table.

2:22 PM - Writes to the investments table started timing out. Investors trying to submit investment orders got errors.

2:24 PM - First PagerDuty alert fired: "API error rate > 5%"

2:26 PM - Pipeline was still running the migration. Jenkins had no visibility into what the migration was doing.

2:28 PM - Pipeline stage finished. Docker container was stopped, new one started.

2:31 PM - App container started but immediately failed its health check (it could not connect to the database because the migration had locked a core table). A container restart loop began.

2:33 PM - We got paged. Logged in, saw the container restart loop.

2:40 PM - Attempted rollback: stopped the new container, started the old one from the previous image. Old container also failed — the partial migration had changed a column type that the old app code could not handle.

We were stuck. Rolling forward required a working database. Rolling back required a working database. The database migration was still in progress.

3:20 PM - The CREATE UNIQUE INDEX finally completed (it took 61 minutes, not 11, because of the concurrent lock contention). Database writes resumed.

3:24 PM - We manually ran the remaining migrations, this time with visibility.

3:31 PM - Restarted the new app container. Health checks passed.

3:45 PM - Verified all API endpoints were responding normally.

6:09 PM - Confirmed no data loss. Closed the incident.

Root Causes

1. Migration not reviewed before deploy. The migration had not been tested on a production-scale dataset.

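2. No lock timeout configured. MySQL's default lock_wait_timeout for metadata locks is 31536000 seconds (one year), so a blocked statement effectively waits forever. The default innodb_lock_wait_timeout for row locks is 50 seconds. We should have set both lock_wait_timeout and innodb_lock_wait_timeout explicitly.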

3. No canary deploy. We deployed to 100% of traffic immediately.

4. No health check gate between migration and deploy. The app started regardless of migration outcome.

5. Rollback required a compatible database. We had no backward-compatible migration strategy.

What We Changed

Fix 1: Separate migration and deploy stages with a gate.

stage('Migrate') {
    steps {
        sh 'node migrate.js --dry-run 2>&1 | tee migration-plan.txt'
        input message: 'Review migration plan. Proceed?'
        sh 'node migrate.js'
    }
}

For any migration touching tables with > 500k rows, we now require a human to review the migration plan before the deploy continues.
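Deciding when the 500k-row rule applies can itself be automated. The sketch below is a rough heuristic, not a SQL parser: it pulls table names out of ALTER TABLE and CREATE/DROP INDEX statements and compares them with row counts. The function names, the regexes, and the shape of `rowCounts` are assumptions. In practice the counts could come from information_schema.TABLES, where TABLE_ROWS is only an estimate for InnoDB.

```javascript
// Hypothetical review gate; regexes and data shapes are illustrative assumptions.
const REVIEW_THRESHOLD_ROWS = 500000; // the 500k threshold from the text

// Collect table names touched by ALTER TABLE or CREATE/DROP INDEX ... ON statements.
function tablesTouched(sql) {
  const tables = new Set();
  const patterns = [
    /\bALTER\s+TABLE\s+`?(\w+)`?/gi,
    /\b(?:CREATE|DROP)\s+(?:UNIQUE\s+)?INDEX\b[^;]*?\bON\s+`?(\w+)`?/gi,
  ];
  for (const pattern of patterns) {
    for (const match of sql.matchAll(pattern)) {
      tables.add(match[1].toLowerCase());
    }
  }
  return [...tables];
}

// Return the tables that should trigger a human review.
function tablesNeedingReview(sql, rowCounts, threshold = REVIEW_THRESHOLD_ROWS) {
  return tablesTouched(sql).filter((table) => {
    const rows = rowCounts[table];
    // A table with no known row count is treated as large, so the gate fails safe.
    return rows === undefined || rows > threshold;
  });
}

const migrationSql = `
  ALTER TABLE users ADD COLUMN nickname VARCHAR(64);
  CREATE UNIQUE INDEX idx_order_ref ON investments (order_ref);
`;
const flagged = tablesNeedingReview(migrationSql, { users: 120000, investments: 4200000 });
console.log(flagged);
```

An empty result would let the pipeline skip the manual input step, while any flagged table would force it.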

Fix 2: Online schema changes for large tables. We switched to gh-ost (GitHub's online schema change tool) for any ALTER TABLE on large tables. It creates a shadow table, syncs data incrementally, and cuts over with only a brief lock at the end rather than blocking writes for the whole change.
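As a rough illustration of how a pipeline step might drive gh-ost, the sketch below builds an argument list from a small options object. The helper name `buildGhostArgs` and the host, database, and table values are placeholders. Credentials and tuning flags are deliberately omitted, and the only flags used are --host, --database, --table, --alter, and --execute.

```javascript
// Sketch only: builds a gh-ost argument list; all values passed in are placeholders.
function buildGhostArgs({ host, database, table, alter, execute = false }) {
  const args = [
    `--host=${host}`,
    `--database=${database}`,
    `--table=${table}`,
    `--alter=${alter}`,
  ];
  // Without --execute, gh-ost does a dry run (noop) and changes nothing,
  // which makes the dry run a natural first step before the real change.
  if (execute) args.push('--execute');
  return args;
}

const dryRunArgs = buildGhostArgs({
  host: 'db-replica.internal', // hypothetical host name
  database: 'app',             // hypothetical database name
  table: 'investments',
  alter: 'ADD UNIQUE INDEX idx_order_ref (order_ref)',
});
console.log(['gh-ost', ...dryRunArgs].join(' '));
```

The array form is meant to be handed to something like `child_process.spawn('gh-ost', args)`, which avoids shell quoting problems in the ALTER clause.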

Fix 3: Zero-downtime deploy with rolling restart and health gates.

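stage('Deploy') {
    steps {
        sh 'docker pull $IMAGE'
        sh 'docker run -d --name app_new $IMAGE'
        // Poll the health endpoint for up to 30s; assuming the script exits non-zero on failure, the stage aborts here
        sh './wait-for-health.sh app_new 30'
        // The live container is named app; retire it only after the new one is healthy
        sh 'docker stop app'
        sh 'docker rm app'
        sh 'docker rename app_new app'
    }
}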
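The health gate is just a poll-until-deadline loop. The sketch below shows that logic in Node rather than shell. It is not the contents of our wait-for-health.sh, and `waitForHealth` and its option names are hypothetical. The `check` argument is any async function that resolves to true when the app is healthy. In a real pipeline it might call `fetch` against the container's health endpoint, and the caller would exit non-zero when `healthy` is false so the Jenkins stage fails.

```javascript
// Sketch of a health gate; `check` is any async function resolving true when healthy.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function waitForHealth(check, { timeoutMs = 30000, intervalMs = 1000 } = {}) {
  const deadline = Date.now() + timeoutMs;
  let attempts = 0;
  while (Date.now() < deadline) {
    attempts += 1;
    try {
      if (await check()) return { healthy: true, attempts };
    } catch (err) {
      // A thrown error (for example a refused connection) means "not ready yet".
    }
    await sleep(intervalMs);
  }
  return { healthy: false, attempts };
}

// Usage with a fake check that becomes healthy on the third poll.
let polls = 0;
const fakeCheck = async () => ++polls >= 3;
const healthResult = waitForHealth(fakeCheck, { timeoutMs: 500, intervalMs: 10 });
healthResult.then((result) => console.log(result));
```

Injecting the check keeps the timing logic testable without a running container.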

Fix 4: Backward-compatible migrations. Every migration must be deployable with both the current and previous version of the app. We introduced a migration review checklist that asks: "If we roll back the app code but not this migration, does anything break?"
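Part of that checklist can be approximated with a lint that flags statements the previous app version is unlikely to survive, such as the column type change that broke our rollback. The patterns below are a heuristic sketch, not our actual checklist. The pattern list and the function name `findBreakingChanges` are assumptions, and a regex pass like this will miss cases a human reviewer would catch.

```javascript
// Heuristic lint, not a SQL parser; the pattern list is an illustrative assumption.
const BREAKING_PATTERNS = [
  { reason: 'drop column', regex: /\bDROP\s+COLUMN\b/i },
  { reason: 'rename column', regex: /\bRENAME\s+COLUMN\b/i },
  { reason: 'change column type', regex: /\b(?:MODIFY|CHANGE)\s+(?:COLUMN\s+)?\w+/i },
  { reason: 'drop table', regex: /\bDROP\s+TABLE\b/i },
  { reason: 'rename table', regex: /\bRENAME\s+TO\b/i },
];

// Return one finding per (statement, matching pattern) pair.
function findBreakingChanges(sql) {
  return sql
    .split(';')
    .map((statement) => statement.trim())
    .filter(Boolean)
    .flatMap((statement) =>
      BREAKING_PATTERNS
        .filter(({ regex }) => regex.test(statement))
        .map(({ reason }) => ({ statement, reason }))
    );
}

const findings = findBreakingChanges(`
  ALTER TABLE investments ADD COLUMN note TEXT NULL;
  ALTER TABLE investments MODIFY COLUMN amount DECIMAL(18,4);
`);
console.log(findings);
```

Additive changes such as a new nullable column pass cleanly, which matches the expand-first style of migration the checklist is meant to encourage.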

Fix 5: Lock timeouts. These are session-level settings, so they run at the start of each migration session:

SET SESSION lock_wait_timeout = 30;

SET SESSION innodb_lock_wait_timeout = 30;

Any migration that waits more than 30 seconds to acquire a lock now fails instead of blocking indefinitely.
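Because SET SESSION only affects the connection that issues it, the timeouts and the migration statements have to share one connection. A minimal sketch of that wrapper follows. `runWithLockTimeouts` is a hypothetical helper, and `query` stands in for whatever runs a single SQL string on a single dedicated connection (not a pooled call that might pick a different connection each time).

```javascript
// Sketch: `query` is assumed to run one SQL string on one dedicated connection.
async function runWithLockTimeouts(query, migrate, seconds = 30) {
  if (!Number.isInteger(seconds) || seconds <= 0) {
    throw new RangeError('seconds must be a positive integer');
  }
  // Session variables only apply to the connection that sets them,
  // so the migration must reuse the same `query` function.
  await query(`SET SESSION lock_wait_timeout = ${seconds}`);
  await query(`SET SESSION innodb_lock_wait_timeout = ${seconds}`);
  return migrate(query);
}

// Usage with a fake query function that records what would be sent to MySQL.
const sentStatements = [];
const fakeQuery = async (sql) => { sentStatements.push(sql); };
const timeoutRun = runWithLockTimeouts(fakeQuery, async (q) => {
  await q('ALTER TABLE investments ADD COLUMN note TEXT NULL');
  return 'done';
});
timeoutRun.then((result) => console.log(result, sentStatements));
```

If a statement hits the timeout, MySQL returns an error to that session, the promise rejects, and the pipeline step can fail fast.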

Six months later, we have had zero deploy-related production incidents. The pipeline is slower — the health check gate adds 2-3 minutes. The human review for large migrations adds up to 10 minutes when required. But we have not had a 4-hour incident either.

That is a trade I will take every time.
