LinkedIn found its feed freeze in the time the CPU could not explain
- ★LinkedIn’s user feed database periodically became unavailable and then quickly recovered without clear diagnostic traces.
- ★Engineers used off-CPU profiling with eBPF to see where the system was blocked while processes were not consuming CPU.
- ★The case shows why classic CPU profiles are not enough for failures caused by waiting, locking and kernel behavior.
LinkedIn’s engineers were not simply hunting for another slow query or a saturated CPU. The signal pointed lower in the stack, toward moments when a process was not running but waiting. Classic CPU profiles can be empty or misleading in that situation, because they only sample code that is actually executing. If the real cause is lock contention, I/O waiting or kernel behavior, CPU consumption is at best a secondary clue rather than the central one.
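The gap between elapsed time and CPU time can be shown with a small, self-contained Python sketch. It is not LinkedIn's tooling and has nothing to do with their database; the function name `measure_blocked_worker` and the 0.2 second hold time are assumptions chosen for illustration. A worker thread waits on a lock that another thread holds, and the sketch compares the worker's wall-clock time with its own CPU time.

```python
import threading
import time


def measure_blocked_worker(hold_seconds=0.2):
    """Return (wall_seconds, cpu_seconds) for a thread that waits on a lock."""
    lock = threading.Lock()
    result = {}

    def worker():
        wall_start = time.perf_counter()
        cpu_start = time.thread_time()  # CPU time consumed by this thread only
        with lock:                      # blocks until the main thread releases
            pass
        result["wall"] = time.perf_counter() - wall_start
        result["cpu"] = time.thread_time() - cpu_start

    lock.acquire()                      # main thread takes the lock first
    t = threading.Thread(target=worker)
    t.start()
    time.sleep(hold_seconds)            # simulate a holder that is slow to release
    lock.release()
    t.join()
    return result["wall"], result["cpu"]


wall, cpu = measure_blocked_worker()
print(f"wall={wall:.3f}s cpu={cpu:.3f}s off-CPU (approx)={wall - cpu:.3f}s")
```

The worker spends almost all of its elapsed time blocked, so its CPU time stays close to zero. A sampling CPU profiler would record next to nothing for it, even though the caller experienced the full delay.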
That is why the tooling choice matters. eBPF makes it possible to run small, controlled programs inside the kernel to observe events that are difficult to see from the application layer. In this case, the important angle was off-CPU profiling: not where code burns processor cycles, but where time disappears while a thread or process is stalled. It is a different diagnostic model, closer to waiting-time forensics than to searching for hot functions.
Short feed database outages left few useful traces, so engineers turned to off-CPU profiling with eBPF.
For a system like LinkedIn’s feed, that distinction is not academic. A high-traffic feed stack depends on many connected components, and a short database outage can quickly become a user-visible failure. If the incident recovers before conventional logs or metric spikes capture enough detail, the team is left with evidence of the consequence, not the mechanism.
Off-CPU profiling tries to expose that mechanism. Instead of showing only active work, it maps waiting periods and the system paths behind them. Combined with kernel-level observability, that can separate an application symptom from a lower-level cause, including contention around locks. The Linux kernel’s own BPF documentation explains why this layer has become central to modern observability: it is close enough to the system to see what applications miss, while still being designed for controlled use.
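The accounting behind that idea can be sketched in plain Python, with no kernel access. This is an illustrative model only: the event tuples, thread IDs, timestamps and stack names below are hypothetical, and the code does not use the bcc or eBPF APIs. It assumes that each time a thread leaves the CPU, a timestamp and a stack are recorded, and that when the thread is scheduled again, the elapsed time is charged to that stack.

```python
from collections import defaultdict


def off_cpu_by_stack(events):
    """Sum blocked time in microseconds per stack from switch-out/switch-in events."""
    switched_out = {}            # thread id -> (timestamp_us, stack when it blocked)
    totals = defaultdict(int)    # stack -> total off-CPU microseconds
    for tid, kind, ts_us, stack in events:
        if kind == "out":
            switched_out[tid] = (ts_us, stack)
        elif kind == "in" and tid in switched_out:
            start_us, blocked_stack = switched_out.pop(tid)
            totals[blocked_stack] += ts_us - start_us
    return dict(totals)


# Hypothetical trace: two threads wait on a lock, one waits briefly on I/O.
events = [
    (101, "out", 1_000, ("handle_request", "read_page", "io_wait")),
    (102, "out", 1_200, ("handle_request", "acquire_lock", "futex_wait")),
    (101, "in", 1_900, None),
    (103, "out", 2_000, ("handle_request", "acquire_lock", "futex_wait")),
    (102, "in", 51_200, None),
    (103, "in", 52_000, None),
]

totals = off_cpu_by_stack(events)
for stack, blocked_us in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(";".join(stack), blocked_us)
# prints:
# handle_request;acquire_lock;futex_wait 100000
# handle_request;read_page;io_wait 900
```

In this made-up trace the lock path accounts for 100,000 microseconds of waiting against 900 for I/O, a difference a CPU profile would not show because neither path consumed processor time. Real tools do this bookkeeping inside the kernel so that they can catch short stalls with low overhead.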
The lesson is broader than one LinkedIn incident. Large platforms can no longer assume that every serious failure will leave a neat log entry or a visible CPU peak. Short, recurring freezes require measurement of the negative space of system behavior: waits, blocks, locks and kernel paths. Tools such as the bcc project, which uses BPF for tracing and profiling, have therefore become part of a serious SRE toolkit.
The important detail is not the novelty of eBPF by itself, but the change in the question. Instead of asking “what is consuming CPU?”, LinkedIn’s engineers had to ask “where does the time go while the system appears to be doing nothing?” For recurring freeze incidents, that is often the only question that leads toward the root cause rather than toward another tidy but useless chart.

