
Profiling a blocked Node.js event loop in production

by Simon Sotak on March 7, 2022

How we set up just-in-time profiling of our Node.js app when its event loop gets stuck.

App crashes because of an event loop hang

Our stack at Deepnote consists of several Node.js apps running in pods in a Kubernetes cluster. A few weeks ago, we started having issues with the pods' liveness probes failing because our Express server didn't respond to a health check in time. We suspected the culprit was the event loop getting stuck.

Debugging a stuck event loop

The question was: what was causing the event loop to get stuck? Unfortunately, we weren't able to reproduce the issue in development, so we decided to try profiling the Node.js app in production. Fortunately, the debugger can be turned on on demand, even on an already-running app, and collecting a CPU profile works even when the app's event loop is stuck!
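
For reference, turning the debugger on on demand boils down to sending the process a SIGUSR1 signal, which makes Node.js start listening for inspector connections (on port 9229 by default). A minimal sketch, assuming the app was started as node ./dist/server.js:

# find the PID of the running app (assumes the standard start command)
NODE_PID=$(pgrep -f "node ./dist/server.js")

# Node.js activates its inspector/debugger listener when it receives SIGUSR1
kill -USR1 "$NODE_PID"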

But doing this manually means an engineer would need to react to a health check failure really fast, SSH into the pod, enable the Node.js debugger, and then run the Chrome DevTools profiler through an SSH tunnel to the production instance. We were never able to complete all these steps in time.
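
For the curious, the manual workflow looks roughly like this. This is a sketch only: it uses kubectl instead of a raw SSH tunnel, the pod name my-app-pod is a placeholder, and it assumes pgrep is available in the container:

# enable the inspector on the node process inside the affected pod
kubectl exec my-app-pod -- /bin/sh -c 'kill -USR1 $(pgrep -f "node ./dist/server.js")'

# forward the inspector port (9229 by default) to the local machine
kubectl port-forward pod/my-app-pod 9229:9229

# then open chrome://inspect in Chrome, attach to the remote target, and record a CPU profile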

Can we automate CPU profiling?

So we decided to automate the CPU profiling process. We realized it should actually be quite straightforward, because when a pod's liveness probe fails, Kubernetes sends a SIGTERM signal, waits 30 seconds, and only then sends a SIGKILL signal and terminates the pod. This means we have 30 seconds to start the debugger, collect the CPU profile, and save it somewhere we can access it later.
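
Conceptually, this is what Kubernetes does to the container's main process once it decides to kill the pod (a rough sketch; MAIN_PID is a placeholder and 30 seconds is the default terminationGracePeriodSeconds):

kill -TERM "$MAIN_PID"   # graceful shutdown requested - our 30-second window starts here
sleep 30                 # terminationGracePeriodSeconds, 30s by default
kill -KILL "$MAIN_PID"   # anything still running is force-killed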

But how do we automate the Chrome DevTools profiling? Luckily, there's a nifty CLI tool called diat that does exactly that!
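
diat is distributed as an npm package, and running it by hand against an already-running process looks roughly like this (a sketch; the PID lookup assumes the standard start command):

# install the diat CLI
npm install -g diat

# collect a CPU profile of the running app (the same flags we use in the script below)
diat cpuprofile -p $(pgrep -f "node ./dist/server.js") --file /tmp/manual.cpuprofile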

Now all we need is a script that wraps our standard startup command node ./dist/server.js in logic that traps the SIGTERM signal:

#!/bin/bash

# default destination for the collected CPU profiles
CPU_PROFILE_DESTINATION_PATH=${CPU_PROFILE_DESTINATION_PATH:-/tmp}

# SIGTERM handler
_term() {
  DATE=$(date +"%Y-%m-%dT%H.%M.%S%z")
  HOSTNAME=$(hostname)

  # run the diat cli tool that will generate the .cpuprofile file
  diat cpuprofile -p "$NODE_PROCESS_PID" --file "${CPU_PROFILE_DESTINATION_PATH}/${HOSTNAME}-${DATE}.cpuprofile"

  # forward the SIGTERM to the node process and wait for it to exit
  kill -TERM "$NODE_PROCESS_PID" 2>/dev/null
  wait "$NODE_PROCESS_PID"
}

# start the actual node process in the background - this is our app
node ./dist/server.js &

# store the pid of the node process started above
NODE_PROCESS_PID=$!

echo "Node process is running with PID $NODE_PROCESS_PID, debugger is waiting for SIGTERM"
trap _term SIGTERM
wait "$NODE_PROCESS_PID"
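
To use it, this script simply becomes the container's start command in place of the bare node invocation. A quick sketch, assuming the script is saved as start.sh next to the built app (the filename is just our own convention):

chmod +x ./start.sh

# optionally point the profile destination at a persistent mount
CPU_PROFILE_DESTINATION_PATH=/tmp ./start.sh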

And that's it! Now, whenever our app stops responding to a liveness check, we automatically capture a CPU profile that we can later open in Chrome DevTools to inspect where the app was spending its time. Root-causing the issue is then usually a piece of cake.

If you liked what you saw here and you want to help us solve engineering challenges like this one, join us at Deepnote!

Note: we save the .cpuprofile file in /tmp, which in our case is an NFS mount, so the file persists after the pod is killed. Storing the profile in, say, an S3 bucket would be a nice alternative, but that's outside the scope of this post.

