There's a decision I've gotten wrong more than once: adding retry as if it were a free improvement. Configure three attempts with exponential backoff, the system looks more stable on the dashboard, done. What I wasn't watching was how many extra calls I was sending to the downstream on every failure.
This post comes from an experiment I built to measure exactly that: when retry buys real availability, when it multiplies pressure, and when it simply changes nothing because the problem isn't transient. The repo is retry-resilience-experiment, commit bdfc350, with Spring Boot 3.3.5, Java 21, Resilience4j 2.2.0, and k6 as the load generator.
My thesis is simple: retry is budget. Each extra attempt consumes user wait time, hits the real downstream, and can accelerate a degradation that was already in progress. It's not a feature you flip on and call it done.
The problem with only looking at success rate
When the downstream has simulated random failures at 35%, the difference between policies is visible. With no-retry-standard-timeout, the success rate in that run was 0.6529. With immediate-retry, it climbed to 0.955. That looks like a clear win.
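For orientation, this is roughly how such policies could be expressed with Resilience4j, which the repo lists as a dependency. The lab drives retries through its own RetryExecutor, so this is a minimal sketch with illustrative names and values, not the repo's actual configuration:

```java
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;
import java.time.Duration;

class RetryPolicies {
    // Illustrative "immediate-retry": up to 3 attempts, no wait between them.
    static final Retry IMMEDIATE = Retry.of("immediate-retry",
            RetryConfig.custom()
                    .maxAttempts(3)
                    .waitDuration(Duration.ZERO)
                    .build());

    // Illustrative "no-retry": one attempt, failures surface to the caller as-is.
    static final Retry NONE = Retry.of("no-retry",
            RetryConfig.custom().maxAttempts(1).build());
}
```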
But the number that matters is right next to it: retry_amplification_factor. With immediate-retry on random-failures it reached 1.465. That means for every user request, the system made 1.465 real calls to the downstream. In jitter-random-failures it was 1.471. The downstream received almost 47% more traffic than k6 generated.
For transient failures that might be acceptable. The downstream is failing for external reasons, retries land at different moments, and the outcome improves. But that 47% extra isn't abstract: downstream capacity has to exist to absorb it. If the service is already at its limit, that overhead is the nudge that tips it over.
The metric the repo defines as a contract for not fooling yourself is exactly that:
```java
// MetricSnapshot.java — this line exists to prevent self-deception
double retryAmplificationFactor, // downstream_calls / total_requests
```
If you only look at successRate and errorRate, you can believe you won when you actually pushed 47% more load onto a system that was already struggling.
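To make the contract concrete, here's a hypothetical reconstruction of how such a snapshot could be assembled from raw counters. The field and method names are mine, not necessarily the repo's, and the numbers in the usage comment are round figures that reproduce the published ratios, not the run's actual counts:

```java
// Hypothetical reconstruction — names are illustrative, not the repo's.
record MetricSnapshot(long totalRequests, long downstreamCalls,
                      double successRate, double retryAmplificationFactor) {

    static MetricSnapshot of(long totalRequests, long downstreamCalls, long successes) {
        return new MetricSnapshot(
                totalRequests,
                downstreamCalls,
                totalRequests == 0 ? 0.0 : (double) successes / totalRequests,
                // The honesty metric: real downstream calls per user request.
                totalRequests == 0 ? 0.0 : (double) downstreamCalls / totalRequests);
    }
}

// Round numbers reproducing the immediate-retry/random-failures ratios:
// MetricSnapshot.of(10_000, 14_650, 9_550) -> successRate 0.955, amplification 1.465
```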
progressive-degradation: where retry can accelerate the collapse
This scenario is the most interesting one methodologically, and also the one with the most important warning.
The PROGRESSIVE_DEGRADATION downstream implements this:
```java
// DownstreamScenario.java — delay grows with each real call received
case PROGRESSIVE_DEGRADATION ->
        Duration.ofMillis(Math.min(900, 80 + callNumber * 3));
```
The delay isn't external or fixed: it grows with callNumber, the counter of real calls to the downstream, and saturates at the 900 ms cap once callNumber passes roughly 274 (80 + 274 × 3 > 900). That means a policy with more retries generates more calls, and those calls accelerate the degradation. It's not the same failure for everyone: policies with retry degrade faster because they push harder.
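To isolate that mechanism, here's a standalone toy simulation of the same delay formula. It ignores timeouts and concurrency, and it assumes every attempt fails so each request spends its full attempt budget — the worst case the post describes. The point is only that attempts, not user requests, drive callNumber:

```java
// Standalone sketch — same delay formula as the lab, everything else simplified.
public class DegradationSketch {
    public static void main(String[] args) {
        for (int attemptsPerRequest : new int[] {1, 3}) { // no-retry vs 3 attempts
            long callNumber = 0;
            long delayMs = 0;
            for (int request = 0; request < 100; request++) {
                for (int attempt = 0; attempt < attemptsPerRequest; attempt++) {
                    callNumber++; // each attempt is a real downstream call
                    delayMs = Math.min(900, 80 + callNumber * 3);
                }
            }
            System.out.printf("attempts/request=%d -> calls=%d, downstream delay now %d ms%n",
                    attemptsPerRequest, callNumber, delayMs);
        }
    }
}
```

After the same 100 user requests, the single-attempt policy sees a 380 ms downstream while the three-attempt policy has already pinned it at the 900 ms cap.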
The numbers from the run show this clearly. With no-retry-standard-timeout, 7720 total requests were processed and 7720 downstream calls were initiated. With immediate-retry, total requests dropped to 2939 but downstream calls went up to 8699, with an amplification factor of 2.96. The retry policy processed fewer user requests but made more downstream calls.
To be clear: this isn't a design flaw, it's the point of the experiment. The lab documents it explicitly in docs/brief-post.md: progressive-degradation should be read as load-sensitive degradation, not as an identical external failure for all policies. If you treat it as a direct comparison between policies under the same conditions, the conclusion is framed wrong from the start.
What you can conclude: in scenarios where the degradation rate depends on the volume of calls received, retries can be an accelerant. That has a name in production: retry storm. And the lab reproduces it in a controlled way.
The percentiles that lie to you when there are timeouts
There's a technical detail that changed how I read the results, and the README documents it honestly.
The caller timeout is implemented with future.cancel(true) in the RetryExecutor:
```java
// RetryExecutor.java — cancel(true) interrupts the attempt from the caller side
try {
    future.get(policy.timeout().toMillis(), TimeUnit.MILLISECONDS);
    return new AttemptResult(true, elapsedMs(started), "ok", true);
} catch (TimeoutException timeout) {
    future.cancel(true);
    return new AttemptResult(false, elapsedMs(started), "timeout", true);
}
```
When an attempt exceeds the timeout, the latency recorded for that attempt is capped by the caller timeout: STANDARD_TIMEOUT = Duration.ofMillis(260). That's why in progressive-degradation almost all all_attempt_p95_ms and all_attempt_p99_ms values show exactly 260. It's not that the downstream responded in 260 ms: it's that the caller stopped waiting at 260 ms and recorded that as the attempt latency.
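The effect is easy to reproduce: once the caller records min(actual, cap) for every attempt, the upper percentiles collapse onto the cap. A toy illustration with made-up latencies:

```java
import java.util.Arrays;

public class CappedPercentiles {
    public static void main(String[] args) {
        // Toy numbers: what the downstream actually took...
        long[] actualMs = {120, 240, 310, 450, 700, 900};
        // ...vs what the caller records once each attempt is capped at 260 ms.
        long[] recordedMs = Arrays.stream(actualMs)
                .map(ms -> Math.min(ms, 260))
                .toArray();
        System.out.println(Arrays.toString(recordedMs)); // [120, 240, 260, 260, 260, 260]
        // Any percentile above the fraction of fast attempts reads exactly 260,
        // no matter how slow the downstream really was.
    }
}
```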
What happens after the cancel(true) in the simulated downstream isn't fully modeled. In a real system with HTTP, a database, or a queue, the downstream may keep executing work even after the client has given up. The lab counts initiated calls but can't guarantee there's no residual work post-cancellation.
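A minimal demo of why: cancel(true) only sets the worker thread's interrupt flag, and only cooperative code observes it. This self-contained toy (not the lab's code) uses a CPU-bound loop as a stand-in for a non-interruptible socket read:

```java
import java.util.concurrent.*;

public class ResidualWorkDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<?> attempt = pool.submit(() -> {
            long n = 0;
            // Never checks Thread.interrupted(): cancel(true) sets the flag,
            // but nothing here observes it, so the work keeps running.
            while (n >= 0) n++;
        });
        try {
            attempt.get(260, TimeUnit.MILLISECONDS); // caller-side timeout
        } catch (TimeoutException e) {
            attempt.cancel(true); // interrupts the worker... which ignores it
        }
        System.out.println("caller gave up; the worker loop is still spinning");
        System.exit(0); // forced exit — the residual work would otherwise keep the JVM alive
    }
}
```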
This also matters for reading successful_requests_per_second. The value of 0.95 that appears across several progressive-degradation scenarios isn't the system's maximum capacity: it's the useful work observed under that closed-loop k6 load, where a fixed set of VUs each wait for a response before issuing the next request. With a different VU configuration, a different duration, or a real network, the numbers would differ.
circuit-breaker and bulkhead: visible rejections as a protection signal
In progressive-degradation, the circuit breaker produces something that looks contradictory at first glance. The 13-circuit-breaker-progressive-degradation run has total_requests = 44777 and circuit_breaker_rejected = 44718. The error rate is 0.9987. That looks catastrophic.
But look at the downstream calls: 198. Amplification factor: 0.004. The circuit breaker almost completely stopped sending calls to the downstream. The rejections are visible to the client, but the downstream is protected.
Compare that with immediate-retry-progressive-degradation, which has downstream_calls = 8699 and keeps failing at the same rate, and the trade-off becomes obvious. The circuit breaker chooses to reject fast rather than multiply pressure on something that can no longer respond.
The bulkhead in the same run shows a different variant: bulkhead_rejected = 22122 with downstream_calls = 3668. It limits concurrency instead of opening the circuit, but the effect is similar: it reduces downstream pressure at the cost of visible rejections.
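For reference, this is roughly what those two protections look like in Resilience4j 2.x. A sketch with illustrative thresholds, not the lab's actual configuration; callDownstream is a stand-in for the real client call:

```java
import io.github.resilience4j.bulkhead.Bulkhead;
import io.github.resilience4j.bulkhead.BulkheadConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import java.time.Duration;
import java.util.function.Supplier;

class Protections {
    static Supplier<String> guard(Supplier<String> callDownstream) {
        CircuitBreaker cb = CircuitBreaker.of("downstream", CircuitBreakerConfig.custom()
                .failureRateThreshold(50)                       // % of failures that opens it
                .slidingWindowSize(50)                          // calls per evaluation window
                .waitDurationInOpenState(Duration.ofSeconds(5)) // fast-reject before half-open
                .build());

        Bulkhead bulkhead = Bulkhead.of("downstream", BulkheadConfig.custom()
                .maxConcurrentCalls(16)         // cap in-flight calls, cf. max_inflight = 16
                .maxWaitDuration(Duration.ZERO) // reject immediately instead of queueing
                .build());

        // Rejections surface as CallNotPermittedException / BulkheadFullException:
        // visible to the client, invisible to the downstream.
        return Bulkhead.decorateSupplier(bulkhead,
                CircuitBreaker.decorateSupplier(cb, callDownstream));
    }
}
```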
Those concurrency signals (max_inflight_downstream = 16 for bulkhead, 40 for most other runs) are observations, not proof of saturation. The lab renamed the metric from saturationObservation to concurrencyObservation for exactly that reason: high max_inflight doesn't prove CPU, network, or connection pool saturation. It's a signal that invites investigation, not a conclusion.
What I conclude and what I don't
This experiment is a local simulation, a single published run, against a simulated downstream with in-memory delays. The numbers don't represent production, don't represent any real provider, and don't support claiming "this policy scales to X RPS". If you want to publish exact values with strong claims, the README says it clearly: run at least three editorial runs and look for consistency, not a single pass.
What I think can be sustained:
- In transient failures, retry can improve success rate but always has an amplification factor greater than 1. That overhead exists and has to fit within the system.
- In load-sensitive degradation, more retries can accelerate the degradation because they generate more calls. This isn't universal, but the scenario is real and the experiment reproduces it.
- p95 and p99 of attempts don't tell you the real downstream latency when there are timeouts: they tell you how long the caller waited before giving up.
- Circuit breaker and bulkhead produce visible rejections that can be exactly the right decision to protect the system.
What I don't conclude: that one policy is better than another in the abstract, that these numbers apply to a different system, or that max_inflight_downstream proves saturation.
The question I'm leaving open for further exploration: how much real residual work actually remains in the downstream after a future.cancel(true) in a system with an HTTP connection pool? The lab notes it as a known limitation. In production that's exactly where the difference lies between a timeout that protects and one that only hides the problem.
The repo is at github.com/JuanTorchia/retry-resilience-experiment. If you run it and get different numbers, I want to know.