InfoQ: DDSketch hunts slow microservices before p99 latency burns
A fan-out chain where controlled hedging cuts tail latency. 📷 AI-generated image / TECH&SPACE
- ★Adaptive hedging targets stragglers in fan-out architectures, where multiple slow calls cumulatively inflate p99 latency.
- ★DDSketch estimates quantiles in real time, while rotating windows help the system track latency distribution drift.
- ★A token-bucket budget limits duplicate requests so the optimization does not become a load problem.
InfoQ’s article “Adaptive Hedged Requests for Reducing p99 Latency”, by Prathamesh Bhope, starts from exactly that failure mode: a few stragglers in a fan-out chain are enough to drag p99 up. Hedged requests are not a new idea: send a backup request when the first one takes too long, then use whichever response arrives first. The problem is that the naive version can create its own load incident. If a system duplicates every suspiciously slow operation without restraint, p99 may improve on a chart while the infrastructure absorbs a new wave of work.
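The basic pattern is small enough to sketch. This is an illustration only, not code from the InfoQ article: `hedged_call`, `make_request`, and `hedge_after_s` are hypothetical names, and Python's asyncio stands in for whatever RPC client a real service would use. It is deliberately the naive version, with a fixed delay and no limit on how many backups get sent.

```python
import asyncio


async def hedged_call(make_request, hedge_after_s):
    """Naive hedged request (hypothetical helper).

    make_request: zero-argument coroutine function that performs one attempt.
    hedge_after_s: fixed delay after which a backup attempt is started.
    """
    primary = asyncio.create_task(make_request())

    # Give the primary attempt a head start.
    done, _ = await asyncio.wait({primary}, timeout=hedge_after_s)
    if done:
        return primary.result()

    # The primary is a straggler: start a backup and race the two attempts.
    backup = asyncio.create_task(make_request())
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )

    # Cancel the loser so it stops consuming client-side resources.
    for task in pending:
        task.cancel()

    # Simplification: if the first finisher failed, its exception is raised
    # here even though the other attempt might still have succeeded.
    return next(iter(done)).result()
```

Note what is missing: nothing stops every slow call from doubling its load on the backend, which is the amplification risk the article sets out to control.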
This is where “adaptive” matters. Instead of relying on a fixed timeout, the mechanism uses DDSketch for real-time latency quantile estimation. That means the hedging threshold does not have to be a manually guessed number; it can be tied to the current response distribution. When the system speeds up or slows down, the threshold moves with it. That matters in production, where latency is shaped by traffic, cache hit rates, deployments, regional issues, and external dependencies rather than a clean lab curve.
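To make the quantile idea concrete, here is a simplified, DDSketch-style structure written from scratch. It is not the DDSketch library's API and not the article's code: `TinyLogSketch` and `hedge_threshold` are hypothetical names, the choice of the 95th percentile as the hedging threshold is an assumption, and the sketch omits the bucket-collapsing step that bounds memory in the real algorithm. What it does show is the core trick: values are counted in logarithmically sized buckets, so any quantile estimate stays within a chosen relative error.

```python
import math
from collections import Counter


class TinyLogSketch:
    """Simplified DDSketch-style quantile sketch for positive values."""

    def __init__(self, relative_accuracy=0.01):
        # With gamma = (1 + a) / (1 - a), bucket i covers (gamma^(i-1), gamma^i].
        self.gamma = (1 + relative_accuracy) / (1 - relative_accuracy)
        self.log_gamma = math.log(self.gamma)
        self.buckets = Counter()
        self.count = 0

    def add(self, value):
        # value must be > 0 (latencies are); log of zero or less is undefined.
        index = math.ceil(math.log(value) / self.log_gamma)
        self.buckets[index] += 1
        self.count += 1

    def quantile(self, q):
        """Estimate the q-quantile (0 <= q <= 1), or None if the sketch is empty."""
        if self.count == 0:
            return None
        rank = q * (self.count - 1)
        seen = 0
        for index in sorted(self.buckets):
            seen += self.buckets[index]
            if seen > rank:
                # Representative value of the bucket; its relative error
                # against any value in the bucket is at most relative_accuracy.
                return 2 * self.gamma ** index / (self.gamma + 1)


def hedge_threshold(sketch, q=0.95, fallback_ms=100.0, min_samples=100):
    """Hypothetical policy: hedge once a call is slower than the live q-quantile.

    Falls back to a fixed value until enough samples have been observed.
    """
    if sketch.count < min_samples:
        return fallback_ms
    return sketch.quantile(q)
```

In use, each completed call would feed its latency to `add`, and the hedging delay for the next call would come from `hedge_threshold`, so the delay follows the distribution instead of a hand-picked constant.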
InfoQ details a mechanism combining DDSketch, rotating time windows, and a token-bucket budget to reduce tail latency without uncontrolled request amplification.
Quantiles, time windows, and budget decide when a backup request is sent. 📷 AI-generated image / TECH&SPACE
The second part of the design is windowed rotation. It prevents decisions from being made against stale samples. If the latency distribution shifts, the system needs to forget yesterday’s shape quickly and react to the current one. Otherwise hedging becomes sluggish: it either sends the backup too late to help p99, or sends it too early and piles up unnecessary work.
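One way to picture window rotation is a small ring of per-window sample stores, where the oldest window is dropped each time a new one opens. The sketch below is an assumption-laden illustration, not the article's implementation: `RotatingWindows`, `window_s`, and `num_windows` are hypothetical names, and plain Python lists stand in for the per-window sketches to keep the example short. The clock is passed in explicitly so the behavior is easy to test.

```python
from collections import deque


class RotatingWindows:
    """Keeps latency samples for the last num_windows time windows only."""

    def __init__(self, window_s, num_windows, start):
        self.window_s = window_s
        # A deque with maxlen silently discards the oldest window on append.
        self.windows = deque([[]], maxlen=num_windows)
        self.window_start = start

    def _rotate(self, now):
        # Number of whole windows that have elapsed since the current one began.
        elapsed = int((now - self.window_start) // self.window_s)
        # Appending more than maxlen empty windows would change nothing.
        for _ in range(min(elapsed, self.windows.maxlen)):
            self.windows.append([])
        if elapsed > 0:
            self.window_start += elapsed * self.window_s

    def observe(self, latency, now):
        self._rotate(now)
        self.windows[-1].append(latency)

    def quantile(self, q, now):
        """q-quantile over all retained windows, or None if nothing is retained."""
        self._rotate(now)
        samples = sorted(v for window in self.windows for v in window)
        if not samples:
            return None
        return samples[min(len(samples) - 1, int(q * len(samples)))]
```

The trade-off sits in the two parameters: shorter or fewer windows react faster to a shift in latency but make the quantile estimate noisier, while longer retention is smoother but slower to forget.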
The third guardrail is a token-bucket budget. This is the damage-control layer: hedging gets a limited supply of tokens, so it can issue extra requests only while its spending stays inside the budget. The same token-bucket logic is well known from rate limiting and traffic shaping, for example the single-rate three-color marker defined in RFC 2697. Here its purpose is blunt and practical: a tail-latency optimization must not turn into a self-inflicted denial-of-service attack on the system’s own services.
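A token bucket for hedging can be very small. The version below is a generic sketch, not the article's code: `HedgeBudget`, `capacity`, and `refill_per_s` are hypothetical names, and refilling by elapsed time is an assumption (a budget could equally be refilled per primary request). Each backup request costs one token, and when the bucket is empty the caller simply keeps waiting on the primary.

```python
class HedgeBudget:
    """Token bucket that caps how many backup requests may be issued."""

    def __init__(self, capacity, refill_per_s, start=0.0):
        self.capacity = float(capacity)          # maximum burst of hedges
        self.refill_per_s = float(refill_per_s)  # sustained hedges per second
        self.tokens = float(capacity)            # start with a full bucket
        self.last = start                        # time of the last refill

    def try_spend(self, now):
        """Return True and consume one token if a hedge is allowed at `now`."""
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_s)
        self.last = max(self.last, now)

        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # budget exhausted: do not send a backup request
```

With `capacity=2` and `refill_per_s=1.0`, for instance, two hedges can fire back to back, after which at most one more is allowed per second no matter how many calls look slow.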
The article backs the design with a concrete number: a 74 percent reduction in p99 latency. That is not a cosmetic metric. P99 is where the worst user experiences accumulate, where transactions threaten SLAs, and where downstream timeouts can start to cascade. “The Tail at Scale”, the classic paper by Google’s Jeffrey Dean and Luiz André Barroso, remains a useful reference for the same lesson: as systems grow, the tail of the distribution becomes an operational problem, not a statistical footnote.
The important part is not just the backup request. It is the combination of three constraints: measure the live tail, discard stale distributions, and spend from a strict budget. That turns adaptive hedging into an operational technique rather than a hopeful trick. For large distributed systems, the takeaway is direct: a slow request is not always a failure, but passively waiting for every straggler is how p99 burns.

