Diagnosing a Memory Leak in a PM2 Cluster: How Our Node.js App Was Growing to 4GB

We had set up PM2 in cluster mode with 4 workers on a 4-core EC2 instance. Each worker was a Node.js process running our Express API. Memory usage started at about 180MB per worker after startup.

Every day, each worker grew by about 80MB. After 48 hours, they were collectively consuming 4GB+ of the 8GB available. PM2 would auto-restart the workers when they hit the memory limit, which caused brief availability gaps. Users noticed.

We had configured max_memory_restart: '1G' in our PM2 config, so workers restarted before crashing the server. But restarts every 48 hours were not acceptable for a financial platform. We had to find the leak.

Step 1: Establishing a Memory Baseline

First, I needed to confirm this was actually a leak and not just normal Node.js memory behavior. V8's garbage collector does not immediately return freed memory to the operating system; it holds onto heap space for future allocations. A growing heap is therefore not always a leak.

I added memory logging to our app:

setInterval(() => {
  const used = process.memoryUsage();
  console.log(JSON.stringify({
    rss: Math.round(used.rss / 1024 / 1024) + 'MB',
    heapUsed: Math.round(used.heapUsed / 1024 / 1024) + 'MB',
    heapTotal: Math.round(used.heapTotal / 1024 / 1024) + 'MB',
    external: Math.round(used.external / 1024 / 1024) + 'MB'
  }));
}, 60000);

I ran this for 12 hours. heapUsed grew steadily from 140MB to 280MB without ever decreasing, even during periods of zero traffic. The garbage collector was running (I could see the minor and major GC events in --trace-gc output) but it was not reclaiming this growing portion of the heap.

That confirmed a real leak: objects being retained indefinitely, preventing garbage collection.
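
One way to make "never decreasing" concrete is to compare the lowest heapUsed reading in successive time windows. The low points approximate the heap right after a major GC, so a floor that keeps rising suggests retained objects rather than garbage that has not been collected yet. The helper below is a sketch of that idea; `risingFloor`, the window size, and the sample values are illustrative assumptions, not part of our tooling.

```javascript
// Sketch: decide whether the heap "floor" rises across consecutive windows.
// `samples` is an array of heapUsed readings in MB, oldest first.
function risingFloor(samples, windowSize) {
  const floors = [];
  for (let i = 0; i + windowSize <= samples.length; i += windowSize) {
    // The minimum in a window approximates the heap just after a GC cycle.
    floors.push(Math.min(...samples.slice(i, i + windowSize)));
  }
  // At least two windows are needed to talk about a trend.
  if (floors.length < 2) return { floors, rising: false };
  const rising = floors.every((f, i) => i === 0 || f > floors[i - 1]);
  return { floors, rising };
}

// Illustrative readings (MB): a sawtooth with a rising floor vs. a stable one.
const leaking = [140, 155, 150, 162, 158, 171, 169, 183, 180];
const healthy = [140, 160, 141, 158, 140, 161, 142, 159, 141];

console.log(risingFloor(leaking, 3)); // floors: [140, 158, 169], rising: true
console.log(risingFloor(healthy, 3)); // floors: [140, 140, 141], rising: false
```

A strict "every window higher than the last" rule is deliberately simple; on noisy production data a tolerance or a fitted slope would probably be more robust.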

Step 2: Taking a Heap Snapshot

Node.js has built-in heap snapshot support via the v8 module:

const v8 = require('v8');

app.get('/debug/heap-snapshot', (req, res) => {
  const filename = `heap-${Date.now()}.heapsnapshot`;
  // writeHeapSnapshot is synchronous: it blocks this worker until the file
  // is written, then returns the filename.
  const written = v8.writeHeapSnapshot(filename);
  res.json({ snapshot: written });
});
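
Because PM2's cluster mode balances requests across workers, an HTTP endpoint captures whichever worker happens to receive the request. Putting the process ID in the filename makes it easier to keep snapshots from different workers apart. The helper below is a sketch under assumptions: `takeHeapSnapshot` and the output directory are hypothetical names, while `v8.writeHeapSnapshot` is the real Node.js call.

```javascript
const v8 = require('v8');
const fs = require('fs');
const os = require('os');
const path = require('path');

// Sketch: write a heap snapshot tagged with this worker's PID.
// `dir` is a hypothetical output directory chosen by the caller.
function takeHeapSnapshot(dir) {
  fs.mkdirSync(dir, { recursive: true });
  const filename = path.join(dir, `heap-${process.pid}-${Date.now()}.heapsnapshot`);
  // writeHeapSnapshot blocks until the file is written and returns its path.
  return v8.writeHeapSnapshot(filename);
}

const written = takeHeapSnapshot(path.join(os.tmpdir(), 'heap-snapshots'));
console.log(`snapshot written to ${written}`);
```

To compare snapshots over time, they need to come from the same worker, so the PID in the filename is what tells you which files belong together.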

I took three snapshots:

- T0: Right after a fresh worker start (heapUsed: 145MB)

- T12: After 12 hours of normal traffic (heapUsed: 248MB)

- T24: After 24 hours (heapUsed: 362MB)

I loaded the T0 and T24 snapshots into Chrome DevTools (Memory tab > Load profile) and used the comparison view to find objects that existed in T24 but not T0.
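
To give a rough idea of what the comparison view computes, the sketch below groups the nodes of two parsed snapshots by name and reports which names grew. It is a simplification under assumptions: it sums shallow (self) size rather than the retained size DevTools reports, `selfSizeByName` and `growth` are hypothetical helper names, and the two tiny snapshots are hand-made data that only mimic the `nodes`/`strings`/`node_fields` layout of a `.heapsnapshot` file.

```javascript
// Sketch: total self size per node name in a parsed .heapsnapshot object.
function selfSizeByName(snap) {
  const fields = snap.snapshot.meta.node_fields;
  const stride = fields.length;            // values stored per node
  const nameAt = fields.indexOf('name');   // index into snap.strings
  const sizeAt = fields.indexOf('self_size');
  const totals = new Map();
  for (let i = 0; i < snap.nodes.length; i += stride) {
    const name = snap.strings[snap.nodes[i + nameAt]];
    totals.set(name, (totals.get(name) || 0) + snap.nodes[i + sizeAt]);
  }
  return totals;
}

// Sketch: names whose total self size grew between two snapshots, largest first.
function growth(before, after) {
  const a = selfSizeByName(before);
  const b = selfSizeByName(after);
  return [...b]
    .map(([name, size]) => ({ name, delta: size - (a.get(name) || 0) }))
    .filter((row) => row.delta > 0)
    .sort((x, y) => y.delta - x.delta);
}

// Tiny hand-made snapshots (illustrative data, not real output).
const meta = { node_fields: ['type', 'name', 'id', 'self_size', 'edge_count'] };
const strings = ['EventEmitter', 'Array'];
const t0 = { snapshot: { meta }, strings, nodes: [3, 0, 1, 100, 0, 3, 1, 2, 50, 0] };
const t24 = {
  snapshot: { meta },
  strings,
  nodes: [3, 0, 1, 100, 0, 3, 1, 2, 50, 0, 3, 0, 3, 100, 0, 3, 0, 4, 100, 0],
};

console.log(growth(t0, t24)); // [ { name: 'EventEmitter', delta: 200 } ]
```

For real snapshots of a few hundred megabytes, DevTools remains the practical tool; a script like this is mainly useful for understanding what the comparison is doing.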

Step 3: Finding the Culprit

The comparison view showed a class called EventEmitter with a retained size of 112MB. Drilling into the instances, each one was an object with a property called _listeners containing an array of callback functions.

Following the retainer chain in the snapshot back into our codebase, I traced this to our XRPL WebSocket client initialization:

// Called every time an investment transaction was submitted
async function submitAndMonitor(tx) {
  const client = new xrpl.Client(XRPL_NODE_URL);
  await client.connect();
  await client.submitAndWait(tx, { wallet });

  // Subscribe to account updates
  client.on('transaction', (event) => {
    processIncomingTransaction(event);
  });

  // NOTE: client.disconnect() was never called
}

Every transaction submission created a new WebSocket client, registered an event listener, and never disconnected the client. Because each client's socket stayed open, the client object and every listener registered on it remained reachable and could never be collected.

With about 200 transactions per day and each client object retaining callbacks that referenced other objects in scope (including the tx parameter), the retained size grew at roughly 80MB per day, or about 400KB per leaked client. That matched the growth we had been seeing.

Step 4: The Fix

The fix was to reuse a single shared client and properly manage the connection lifecycle:

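let xrplClient = null;
let connecting = null; // in-flight connection attempt shared by concurrent callers

async function getXRPLClient() {
  if (xrplClient && xrplClient.isConnected()) {
    return xrplClient;
  }
  if (!connecting) {
    connecting = (async () => {
      const client = new xrpl.Client(XRPL_NODE_URL);
      // Single global listener, registered once per client
      client.on('transaction', (event) => {
        processIncomingTransaction(event);
      });
      await client.connect();
      xrplClient = client;
      return client;
    })().finally(() => {
      connecting = null;
    });
  }
  return connecting;
}

async function submitAndMonitor(tx) {
  const client = await getXRPLClient();
  await client.submitAndWait(tx, { wallet });
  // No new client created, no new listener registered
}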

This meant one client object, one listener, and no accumulation.
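
Where a shared client is not an option, the other way to close this kind of leak is to give each short-lived client an explicit end of life. The sketch below shows that pattern with a stand-in client so it runs on its own. `FakeClient` and `submitOnce` are hypothetical names; with xrpl.js the same shape would call `client.disconnect()` in the `finally` block.

```javascript
// Stand-in for a WebSocket client (hypothetical, for illustration only).
class FakeClient {
  constructor() { this.connected = false; }
  async connect() { this.connected = true; }
  async submitAndWait(tx) {
    if (!tx) throw new Error('empty transaction');
    return { tx, result: 'ok' };
  }
  async disconnect() { this.connected = false; }
}

// Sketch: a per-call client that is always disconnected, even on failure.
async function submitOnce(tx, client = new FakeClient()) {
  await client.connect();
  try {
    return await client.submitAndWait(tx);
  } finally {
    // Runs on success and on error, so no connection is left open.
    await client.disconnect();
  }
}

submitOnce({ id: 1 }).then((res) => console.log(res.result));
```

The trade-off is a new connection handshake on every submission, which is why we preferred the shared client for our traffic pattern.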

After deploying:

- Worker memory at T0: 148MB

- Worker memory at T12: 152MB

- Worker memory at T24: 154MB

- Worker memory at T72: 157MB

Essentially flat: 9MB of growth over 72 hours. No leak. The memory-triggered PM2 restarts stopped.

Broader Lessons About Node.js Memory Leaks

This incident taught me four common causes of Node.js memory leaks:

1. Event listeners not removed. Listeners added with emitter.on(event, listener) accumulate if emitter.off() is never called or the emitter itself is never released. This was our case.

2. Closures retaining large objects. A small closure that captures a large outer scope object prevents GC from collecting the outer object.

3. Growing caches without eviction. An in-memory Map that you add to but never remove from will grow until OOM.

4. Timers not cleared. setInterval callbacks keep their surrounding scope alive. If you never call clearInterval, the scope is never collected.
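
Cause 3 has a simple structural fix: give the cache a maximum size and evict when it is exceeded. The following is a minimal sketch that relies on a JavaScript `Map` iterating in insertion order. `BoundedCache` and the limit of 2 are illustrative assumptions; a production service would more likely use an established LRU library.

```javascript
// Sketch: a size-capped cache that evicts the least recently used entry.
class BoundedCache {
  constructor(maxEntries) {
    this.maxEntries = maxEntries;
    this.map = new Map();
  }
  get(key) {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key);
    // Re-insert so this key becomes the most recently used.
    this.map.delete(key);
    this.map.set(key, value);
    return value;
  }
  set(key, value) {
    this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.maxEntries) {
      // The first key in iteration order is the least recently used.
      this.map.delete(this.map.keys().next().value);
    }
  }
  get size() { return this.map.size; }
}

const cache = new BoundedCache(2);
cache.set('a', 1);
cache.set('b', 2);
cache.get('a');      // 'a' is now the most recently used key
cache.set('c', 3);   // exceeds the limit, so 'b' is evicted
console.log(cache.size, cache.get('b')); // 2 undefined
```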

The diagnostic pattern is always the same:

1. Establish a memory baseline with process.memoryUsage()

2. Confirm the leak is real (heap grows even during idle periods)

3. Take heap snapshots at intervals and compare

4. Look for objects with unexpected growing retained sizes

5. Trace back to the allocation site in your code

The tooling for this in Node.js is excellent, and Chrome DevTools heap analysis in particular makes retained objects easy to find once you have snapshots to compare. But you have to know how to take the snapshots and what to look for. That knowledge is not widely shared, which is why memory leaks in Node.js apps often go undiagnosed for months.