
Profiling a blocked Node.js event loop in production

by Simon Sotak on March 7, 2022

How we set up just-in-time profiling of our Node.js app when its event loop gets stuck.

App crashes because of an event loop hang

Our stack at Deepnote consists of several Node.js apps running in pods in a Kubernetes cluster. A few weeks ago, we started having issues with the pods' liveness probes failing because our Express server didn't respond to a health check in time. We suspected the culprit was the event loop getting stuck.

Debugging a stuck event loop

The question was: what was causing the event loop to get stuck? Unfortunately, we weren't able to reproduce the issue in development, so we decided to try profiling the Node.js app in production. Fortunately, the debugger can be turned on on demand, even on an already-running app, and collecting a CPU profile works even when the app's event loop is stuck!
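
For reference, turning the debugger on on demand boils down to sending the process a SIGUSR1 signal, which makes Node.js start listening for inspector connections (on port 9229 by default). A minimal sketch, assuming the app was started as node ./dist/server.js:

# find the PID of the running app (assumes the standard start command)
NODE_PID=$(pgrep -f "node ./dist/server.js")

# Node.js activates its inspector/debugger listener when it receives SIGUSR1
kill -USR1 "$NODE_PID"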

But doing this manually means an engineer would need to react to a health check failure really fast, SSH into the pod, enable the Node.js debugger, and then run the Chrome DevTools profiler through an SSH tunnel to the production instance. We were never able to complete all these steps in time.
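
For the curious, the manual workflow looks roughly like this. This is a sketch only: it uses kubectl instead of a raw SSH tunnel, the pod name my-app-pod is a placeholder, and it assumes pgrep is available in the container:

# enable the inspector on the node process inside the affected pod
kubectl exec my-app-pod -- /bin/sh -c 'kill -USR1 $(pgrep -f "node ./dist/server.js")'

# forward the inspector port (9229 by default) to the local machine
kubectl port-forward pod/my-app-pod 9229:9229

# then open chrome://inspect in Chrome, attach to the remote target, and record a CPU profile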

Can we automate CPU profiling?

So we decided to automate the CPU profiling process. We realized it should actually be quite straightforward, because when a pod's liveness probe fails, Kubernetes sends a SIGTERM signal, waits 30 seconds, and only then sends a SIGKILL signal and terminates the pod. This means we have 30 seconds to start the debugger, collect the CPU profile, and save it somewhere we can access it later.
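
Conceptually, this is what Kubernetes does to the container's main process once it decides to kill the pod (a rough sketch; MAIN_PID is a placeholder and 30 seconds is the default terminationGracePeriodSeconds):

kill -TERM "$MAIN_PID"   # graceful shutdown requested - our 30-second window starts here
sleep 30                 # terminationGracePeriodSeconds, 30s by default
kill -KILL "$MAIN_PID"   # anything still running is force-killed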

But how do we automate the Chrome DevTools profiling? Luckily, there's a nifty CLI tool called diat that does exactly that!
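
diat is distributed as an npm package, and running it by hand against an already-running process looks roughly like this (a sketch; the PID lookup assumes the standard start command):

# install the diat CLI
npm install -g diat

# collect a CPU profile of the running app (the same flags we use in the script below)
diat cpuprofile -p $(pgrep -f "node ./dist/server.js") --file /tmp/manual.cpuprofile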

Now all we need is a script that wraps our standard startup command node ./dist/server.js in logic that traps the SIGTERM signal:

#!/bin/bash

# default destination for the collected CPU profiles
CPU_PROFILE_DESTINATION_PATH=${CPU_PROFILE_DESTINATION_PATH:-/tmp}

# SIGTERM handler
_term() {
  DATE=$(date +"%Y-%m-%dT%H.%M.%S%z")
  HOSTNAME=$(hostname)

  # run the diat cli tool that will generate the .cpuprofile file
  diat cpuprofile -p "$NODE_PROCESS_PID" --file "${CPU_PROFILE_DESTINATION_PATH}/${HOSTNAME}-${DATE}.cpuprofile"

  # forward the SIGTERM to the node process and wait for it to exit
  kill -TERM "$NODE_PROCESS_PID" 2>/dev/null
  wait "$NODE_PROCESS_PID"
}

# start the actual node process in the background - this is our app
node ./dist/server.js &

# store the pid of the node process started above
NODE_PROCESS_PID=$!

echo "Node process is running with PID $NODE_PROCESS_PID, debugger is waiting for SIGTERM"
trap _term SIGTERM
wait "$NODE_PROCESS_PID"
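
To use it, this script simply becomes the container's start command in place of the bare node invocation. A quick sketch, assuming the script is saved as start.sh next to the built app (the filename is just our own convention):

chmod +x ./start.sh

# optionally point the profile destination at a persistent mount
CPU_PROFILE_DESTINATION_PATH=/tmp ./start.sh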

And that's it! Now, whenever our app stops responding to a liveness check, we automatically capture a CPU profile that we can later open in Chrome DevTools to inspect where the app was spending its time. Root-causing the issue is then usually a piece of cake.

If you liked what you saw here and you want to help us solve engineering challenges like this one, join us at Deepnote!

Note: we save the .cpuprofile file in /tmp, which in our case is an NFS mount, so the file persists after the pod is killed. Storing the profile in, say, an S3 bucket would be a nice alternative, but that's outside the scope of this post.

