The bug that triggered the rewrite was subtle. A race condition in our investment processing code. Two concurrent requests for the same user — one submitting an investment, one cancelling a pending one — would occasionally overwrite each other in the database. The result: an investment status update would be silently dropped. The investment would be stuck in processing indefinitely.
We found out three weeks after the bug was introduced, during a manual audit. By then, 47 investments were in inconsistent states. We had no audit trail. No way to replay what the sequence of operations had been. We had to manually interview investors and reconcile records one by one.
That was the day we decided to rebuild.
What Event Sourcing Actually Is
Event sourcing is an architectural pattern where instead of storing the current state of an entity, you store the sequence of events that led to that state.
In a traditional system:
- An investment record has a status column. It changes from pending to processing to completed.
- If something goes wrong, you know the current status, but not how it got there.
In an event-sourced system:
- You have an investment_events table.
- Every state change is an immutable event: InvestmentCreated, InvestmentProcessingStarted, InvestmentCompleted, InvestmentFailed.
- The current state is derived by replaying all events in order.
The events are your source of truth. The derived state (the current status) is a read model that you can reconstruct from events at any time.
Our Event Schema
Each event in our system has this structure:
CREATE TABLE investment_events (
id BIGINT AUTO_INCREMENT PRIMARY KEY,
aggregate_id VARCHAR(36) NOT NULL, -- investment UUID
aggregate_version INT NOT NULL, -- optimistic concurrency version
event_type VARCHAR(100) NOT NULL, -- e.g., 'InvestmentCreated'
event_data JSON NOT NULL, -- event payload
metadata JSON, -- who, when, correlation ID
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
UNIQUE KEY uk_aggregate_version (aggregate_id, aggregate_version)
);The UNIQUE KEY on (aggregate_id, aggregate_version) is critical. This is our optimistic concurrency control. When we write an event, we specify the expected version. If another process has already written an event with that version, our insert fails with a duplicate key error. No race condition can silently corrupt state.
This is what prevents the bug that triggered the rewrite.
The Command Handler Pattern
Every state change goes through a command handler:
async function handleSubmitInvestment(command) {
// Load current aggregate state
const events = await loadEvents(command.investmentId);
const investment = replayEvents(events); // derive current state
// Validate business rules against current state
if (investment.status !== 'pending') {
throw new Error('Cannot submit investment: not in pending state');
}
// Create new event
const event = {
type: 'InvestmentSubmitted',
data: { userId: command.userId, amount: command.amount },
version: investment.version + 1 // expected next version
};
// Append event (will fail if version conflict)
await appendEvent(command.investmentId, event);
// Trigger side effects (payment, notification, etc.)
await processPayment(command);
}The replayEvents function applies each event to build the current state:
function replayEvents(events) {
return events.reduce((state, event) => {
switch (event.event_type) {
case 'InvestmentCreated':
return { ...state, status: 'pending', version: event.aggregate_version };
case 'InvestmentSubmitted':
return { ...state, status: 'processing', version: event.aggregate_version };
case 'InvestmentCompleted':
return { ...state, status: 'completed', version: event.aggregate_version };
case 'InvestmentFailed':
return { ...state, status: 'failed', reason: event.event_data.reason, version: event.aggregate_version };
default:
return state;
}
}, { status: null, version: 0 });
}
The Read Model (CQRS)
Replaying events for every read is expensive. We maintain a separate read model (CQRS — Command Query Responsibility Segregation) that is updated asynchronously whenever events are appended:
// Event processor runs as a background worker
async function processNewEvents() {
const unprocessedEvents = await getUnprocessedEvents();
for (const event of unprocessedEvents) {
await updateReadModel(event);
await markEventProcessed(event.id);
}
}
// The read model is just a standard investments table
async function updateReadModel(event) {
if (event.event_type === 'InvestmentCompleted') {
await Investment.update(
{ status: 'completed', completed_at: new Date() },
{ where: { id: event.aggregate_id } }
);
}
// ... handle other event types
}This means reads are fast (hit the read model table) and writes are consistent (go through event sourcing).
The Migration
Migrating a live production system to event sourcing took 4 months. The approach:
Phase 1 (Month 1): Write events alongside existing database writes. Dual-write mode. Both the old table and the new event table get updated. Zero user impact.
Phase 2 (Month 2): Validate that replaying events produces the same state as the existing records. Run a daily reconciliation job. Fix discrepancies in the event generation logic.
Phase 3 (Month 3): Switch reads to the event-sourced read model. Still dual-writing. Monitor for differences between old and new read paths.
Phase 4 (Month 4): Switch writes to go through command handlers only. Remove dual-write. The old direct database writes are gone.
The Cost
This migration was not cheap:
- 4 months of engineering time (2 engineers)
- Significant increase in database storage (events are append-only; we went from 2GB to 18GB for transaction data over 12 months)
- More complex debugging (you now have to understand events and projections, not just table rows)
- Longer onboarding for new engineers ("what is a projection?" is now a question you have to answer)
The Benefits
After 12 months in production with event sourcing:
1. Zero silent data corruption incidents. The optimistic concurrency control catches every race condition.
2. Full audit trail for every investment. We can tell exactly what happened, in what order, and by whom, for any investment in the system. Regulatory audits now take 30 minutes instead of 3 days.
3. Temporal queries. We can reconstruct the state of any investment at any point in time by replaying events up to that timestamp. This is invaluable for debugging.
4. Bug recovery. When we found a bug in our fee calculation logic last year, we replayed all affected investments through the corrected logic and recomputed the correct state. In the old system, that would have required manual database corrections. With event sourcing, it was a script.
5. New features without schema migrations. When we added a new investment status type, we added a new event type and updated the projection. No ALTER TABLE. No migration script. No downtime.
Was It Worth It
For a financial system handling real investor money: yes, absolutely.
For a CRUD web app: probably not. Event sourcing adds significant complexity. You pay a complexity tax on every feature you build. That tax is worth it when your data integrity requirements are high, when you need a complete audit trail, or when you need temporal queries. It is not worth it for a blog or a todo app.
If you are considering event sourcing, the question to ask is: what is the cost of silent data corruption in my system? For us, the answer was investor trust, regulatory exposure, and manual reconciliation nightmares. That made the complexity tax worthwhile.
For most systems, it is not. Know which category you are in before you start building.