Retry with Backoff
Intent
Handle transient failures in distributed systems by automatically retrying failed operations with progressively increasing delays, improving reliability without overwhelming the failing service.
Problem
Network calls and remote service invocations can fail temporarily due to transient issues like network hiccups, brief service overloads, or momentary resource unavailability. Simply retrying immediately can worsen the problem by hammering an already struggling service, while giving up after the first failure means lost opportunities to complete valid operations once the transient issue resolves.
Real-World Analogy
Imagine trying to call a friend whose phone line is busy. If you immediately redial over and over, you’re just going to keep getting the busy signal and might even annoy them if they have call waiting. Instead, you wait a minute and try again. If it’s still busy, you wait a bit longer—maybe 5 minutes. Still busy? You wait 15 minutes. By spacing out your attempts with increasing delays, you give the line time to clear while still eventually getting through. Adding a bit of randomness (jitter) to your wait times is like varying when you call so everyone doesn’t try again at exactly the same moment.
When You Need It
- Your application depends on remote services that may experience temporary failures
- You want to improve reliability by automatically recovering from transient errors
- You need to avoid overwhelming a struggling service with rapid-fire retry attempts
UML Class Diagram
```mermaid
classDiagram
    class RetryPolicy {
        -maxAttempts: int
        -backoffStrategy: BackoffStrategy
        +execute(operation): Result
        -shouldRetry(attempt, error): boolean
    }
    class BackoffStrategy {
        <<interface>>
        +calculateDelay(attempt): duration
    }
    class ExponentialBackoff {
        -baseDelay: duration
        -multiplier: double
        -maxDelay: duration
        -jitter: boolean
        +calculateDelay(attempt): duration
        -addJitter(delay): duration
    }
    class LinearBackoff {
        -increment: duration
        -maxDelay: duration
        +calculateDelay(attempt): duration
    }
    class FixedBackoff {
        -delay: duration
        +calculateDelay(attempt): duration
    }
    class Operation {
        <<interface>>
        +execute(): Result
    }
    RetryPolicy --> BackoffStrategy
    BackoffStrategy <|-- ExponentialBackoff
    BackoffStrategy <|-- LinearBackoff
    BackoffStrategy <|-- FixedBackoff
    RetryPolicy --> Operation
```
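The diagram above maps to a small amount of code. The following is a minimal Python sketch of the strategy hierarchy; the class and parameter names mirror the diagram, while the default values and the "full jitter" choice are illustrative assumptions rather than recommendations.

```python
import random
from abc import ABC, abstractmethod


class BackoffStrategy(ABC):
    """Computes how long to wait before a given retry attempt."""

    @abstractmethod
    def calculate_delay(self, attempt: int) -> float:
        """Return the delay in seconds before the given attempt (1-based)."""


class ExponentialBackoff(BackoffStrategy):
    def __init__(self, base_delay: float = 0.5, multiplier: float = 2.0,
                 max_delay: float = 30.0, jitter: bool = True):
        self.base_delay = base_delay
        self.multiplier = multiplier
        self.max_delay = max_delay
        self.jitter = jitter

    def calculate_delay(self, attempt: int) -> float:
        # base * multiplier^(attempt - 1), capped at max_delay
        delay = min(self.base_delay * self.multiplier ** (attempt - 1), self.max_delay)
        if self.jitter:
            # "Full jitter": pick a random point between 0 and the computed delay
            delay = random.uniform(0, delay)
        return delay


class LinearBackoff(BackoffStrategy):
    def __init__(self, increment: float = 1.0, max_delay: float = 10.0):
        self.increment = increment
        self.max_delay = max_delay

    def calculate_delay(self, attempt: int) -> float:
        # Delay grows by a fixed increment per attempt, capped at max_delay
        return min(self.increment * attempt, self.max_delay)


class FixedBackoff(BackoffStrategy):
    def __init__(self, delay: float = 1.0):
        self.delay = delay

    def calculate_delay(self, attempt: int) -> float:
        # Same delay before every retry
        return self.delay
```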
Sequence Diagram
```mermaid
sequenceDiagram
    participant C as Client
    participant RP as RetryPolicy
    participant S as Service
    C->>RP: execute(operation)
    RP->>S: attempt 1
    S-->>RP: failure
    Note over RP: wait (backoff delay)
    RP->>S: attempt 2
    S-->>RP: failure
    Note over RP: wait (longer backoff)
    RP->>S: attempt 3
    S-->>RP: success
    RP-->>C: result
```
Participants
- RetryPolicy — orchestrates retry attempts and applies backoff strategy
- BackoffStrategy — defines the interface for calculating retry delays
- ExponentialBackoff — implements exponential increase in delay with optional jitter
- LinearBackoff — increases the delay by a fixed increment on each attempt, up to a maximum
- FixedBackoff — implements constant delay between retries
- Operation — the operation being retried
How It Works
- The client invokes an operation through the retry policy wrapper
- If the operation fails with a retryable error, the policy checks if max attempts have been reached
- The backoff strategy calculates an appropriate delay based on the attempt number
- The system waits for the calculated delay (optionally with added jitter to prevent thundering herd)
- The operation is retried, repeating the process until it succeeds or the maximum number of attempts is exhausted, as sketched in the code below
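A minimal Python sketch of the retry loop described above. It reuses the ExponentialBackoff class from the earlier sketch; the retryable exception types and the fetch_order_status call in the usage comment are illustrative assumptions, not part of any particular library.

```python
import time


class RetryPolicy:
    def __init__(self, max_attempts: int = 3, backoff_strategy=None,
                 retryable_errors=(ConnectionError, TimeoutError)):
        self.max_attempts = max_attempts
        self.backoff_strategy = backoff_strategy or ExponentialBackoff()
        self.retryable_errors = retryable_errors

    def execute(self, operation):
        """Run `operation` (a zero-argument callable), retrying transient failures."""
        for attempt in range(1, self.max_attempts + 1):
            try:
                return operation()
            except self.retryable_errors:
                if attempt == self.max_attempts:
                    raise  # attempts exhausted: surface the last error to the caller
                # Ask the strategy how long to wait, then back off before retrying
                time.sleep(self.backoff_strategy.calculate_delay(attempt))


# Usage (hypothetical remote call):
# policy = RetryPolicy(max_attempts=5,
#                      backoff_strategy=ExponentialBackoff(jitter=True))
# status = policy.execute(lambda: fetch_order_status(order_id))
```

Sleeping inline keeps the sketch short; real implementations typically also log each failed attempt so retries do not mask underlying problems.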
Applicability
Use when:
- You’re dealing with transient failures that are likely to resolve themselves quickly
- You need to improve reliability without requiring manual intervention
- You want to be a good citizen by not overwhelming struggling downstream services
Don’t use when:
- Failures are permanent rather than transient, such as authentication errors or invalid input (see the classification sketch after this list)
- Operations have strict latency requirements that can’t tolerate retry delays
- The operation has side effects that make it unsafe to retry without idempotency guarantees
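To illustrate the first point in the list above, retries are usually gated on a classification of the error: transient failures go back through the policy, while permanent ones fail fast. The status-code sets below are an illustrative assumption for an HTTP-style service, not a universal rule.

```python
# Hypothetical classification for an HTTP-based service call:
# retry server-side/transient statuses, fail fast on client errors.
TRANSIENT_STATUS_CODES = {429, 502, 503, 504}   # overload or temporary unavailability
PERMANENT_STATUS_CODES = {400, 401, 403, 404}   # invalid input, auth failures, missing resource


def should_retry(status_code: int) -> bool:
    return status_code in TRANSIENT_STATUS_CODES
```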
Trade-offs
Pros:
- Significantly improves reliability by automatically recovering from transient failures
- Exponential backoff with jitter prevents overwhelming recovering services
- Provides tunable behavior through configurable retry counts and delay strategies
Cons:
- Increases overall operation latency when retries are needed
- Can hide underlying problems if failures are not properly logged and monitored
- Requires careful tuning to balance quick recovery with avoiding service overload
Related Patterns
- Circuit Breaker — works together with retry; the breaker stops further attempts when the service is known to be down
- Timeout — essential companion; prevents retries from waiting indefinitely
- Idempotency — ensures retries are safe by making operations repeatable without side effects
- Bulkhead — isolates retry attempts to prevent them from consuming all resources