Picture this: an endpoint goes down for maybe ninety seconds. The retry loop queues up a few hundred events while it’s out. Endpoint comes back up, the system fires all of them at once, and the endpoint falls right over again. This goes on until someone gets paged at 2am. The retries caused more damage than the original outage did.
Retries aren’t optional if you care about delivery. But a careless retry strategy can do more harm than skipping retries entirely.
The “Just Try Again” Approach
Most people’s first pass at retry logic looks roughly like this:
if delivery fails:
    wait 5 seconds
    try again
if still fails:
    wait 5 seconds
    try again
repeat up to 5 times
Works fine on your laptop with three test endpoints.
Production is different, though. A customer’s endpoint drops for two minutes. You’ve got 500 events piling up in the queue, all with the same retry schedule. The endpoint recovers and suddenly gets slammed with 500 requests at once. Down it goes again. Now those 500 events all need a second retry, and you’ve got fresh events stacking up behind them too.
This is what distributed systems folks call the thundering herd problem. Fixed-interval retries practically guarantee it’ll happen eventually.
Exponential Backoff
Better idea: space your retries further apart each time.
Attempt 1: immediate
Attempt 2: wait 30 seconds
Attempt 3: wait 1 minute
Attempt 4: wait 5 minutes
Attempt 5: wait 30 minutes
Attempt 6: wait 2 hours
Attempt 7: wait 8 hours
Attempt 8: wait 24 hours
Early retries pick up quick blips like a network hiccup or a brief restart. Later retries give a broken endpoint actual time to recover instead of you pounding on it every five seconds while it’s down.
The math is simple enough:
delay = base_delay * (multiplier ^ attempt_number)
So with a 30-second base and a multiplier of 2 (attempt_number starts at zero, so "Attempt 1" below is attempt_number = 0):
Attempt 1: 30s
Attempt 2: 60s
Attempt 3: 120s
Attempt 4: 240s
...
You’ll want a maximum delay cap on this or the numbers get absurd. Most systems cap at 24 hours. After five to seven days of nothing but failures, the event gets shunted to a dead letter queue or marked as permanently failed. If an endpoint hasn’t come back after a week, the problem isn’t going to be solved by attempt number forty-seven.
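The cap folds directly into the formula. A minimal sketch (function name and defaults are illustrative, not from any particular library):

```python
def backoff_schedule(base_delay=30, multiplier=2, max_delay=86400, max_attempts=13):
    # Attempts numbered from zero, so the first delay equals base_delay.
    # min() applies the 24-hour cap once the exponential outgrows it.
    return [min(base_delay * multiplier ** n, max_delay) for n in range(max_attempts)]
```

With the defaults, the schedule runs 30s, 60s, 120s, and so on, until the final entries flatten out at the 86400-second cap instead of growing without bound.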
Why You Need Jitter
Exponential backoff on its own doesn’t actually solve the thundering herd, though. If 500 events all fail at the same moment and they’re all running the exact same backoff formula, they’re all going to retry at the exact same moment too. You haven’t gotten rid of the stampede. You’ve just stretched the gaps between stampedes.
Jitter fixes this by introducing randomness into when each retry fires:
import random

def retry_delay(attempt, base_delay=30, max_delay=86400):
    exponential = base_delay * (2 ** attempt)
    capped = min(exponential, max_delay)
    # Full jitter: random value between 0 and the calculated delay
    return random.uniform(0, capped)
A few flavors worth knowing about:
- Full jitter: random(0, calculated_delay). Scatters your retries across the entire delay window. Best at breaking up herds.
- Equal jitter: calculated_delay/2 + random(0, calculated_delay/2). Guarantees a minimum wait but still has randomness on top. Good middle ground.
- Decorrelated jitter: The delay for each retry is derived from the previous delay instead of the attempt number. Produces wilder spacing. Some teams prefer this approach.
Full jitter is the best default unless there’s a specific reason to pick something else. Lowest collision rate, most even distribution of retries across time.
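For comparison, the other two flavors can be sketched the same way as the full-jitter example above (parameter names are mine):

```python
import random

def equal_jitter(attempt, base_delay=30, max_delay=86400):
    # Half the capped delay is guaranteed; the other half is randomized.
    capped = min(base_delay * (2 ** attempt), max_delay)
    return capped / 2 + random.uniform(0, capped / 2)

def decorrelated_jitter(previous_delay, base_delay=30, max_delay=86400):
    # The next delay is drawn from the previous delay, not the attempt number,
    # which is what produces the wilder spacing.
    return min(max_delay, random.uniform(base_delay, previous_delay * 3))
```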
Not Every Failure Deserves a Retry
A 503 Service Unavailable and a 404 Not Found are both HTTP errors, sure. But retrying a 404 is pointless. That URL isn’t going to start working on the fourth try.
Worth retrying
- 5xx errors (500, 502, 503, 504). Server-side issues. The whole point of retries.
- Connection timeouts. Could be temporary load or a blip in the network.
- DNS resolution failures. Clear up on their own more often than you’d expect, especially during propagation.
- Connection refused. They might be mid-deploy.
Don’t bother retrying
- 4xx errors (400, 401, 403, 404, 422). The problem is with your request, not their server. Sending it again won’t change anything.
- Invalid URL. Somebody misconfigured their endpoint. That’s a human problem now.
- TLS certificate failures. No retry schedule in the world is going to renew an expired cert.
Somewhere in between
- 429 Too Many Requests. Retry this, but check for a Retry-After header first. If they’re telling you to slow down, slow down.
- 301/302 Redirects. Follow them (up to a sane limit), but it’s worth wondering whether they meant to redirect webhook traffic or whether something got misconfigured somewhere.
def should_retry(status_code, error_type):
    if error_type in ("timeout", "connection_refused", "dns_error"):
        return True
    if status_code is None:
        return True  # Network-level failure
    if status_code == 429:
        return True  # Rate limited, retry with backoff
    if 500 <= status_code < 600:
        return True  # Server error
    return False  # Client errors, redirects, etc.
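For the 429 case, "retry with backoff" should really be "retry on their schedule when they give you one." A minimal sketch of honoring Retry-After, handling only the integer-seconds form of the header (the HTTP-date form is left out here):

```python
def retry_after_seconds(headers, fallback_delay):
    # Retry-After can be an integer number of seconds or an HTTP-date.
    # Use the seconds form when present; otherwise fall back to our own backoff.
    value = headers.get("Retry-After", "")
    if value.isdigit():
        return int(value)
    return fallback_delay
```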
Rate Limiting on Top of Everything Else
Backoff and jitter handle when retries fire, but there’s also the question of raw throughput. If a customer’s endpoint handles 10 requests per second on a good day, blasting 50 retries per second at it because your backoff schedule happened to align is still going to cause problems.
Put a per-endpoint rate limit on your delivery pipeline:
For each consumer endpoint:
    max_concurrent_deliveries = 5
    max_requests_per_second = 10
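One common way to enforce a per-endpoint request rate is a token bucket. A rough sketch (class name and parameters are illustrative):

```python
import time

class TokenBucket:
    # Per-endpoint rate limiter: refills `rate` tokens per second, up to `burst`.
    def __init__(self, rate, burst):
        self.rate = rate
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens proportional to elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A delivery worker would call allow() before each attempt and requeue the event briefly when it returns False, rather than dropping it.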
And if you’re getting 429 responses with Retry-After headers, pay attention to them. That’s the consumer telling you exactly how much they can take.
Dead Letter Queues
At some point you have to stop retrying. The endpoint has been down for a week, or it straight up doesn’t exist anymore.
But stopping retries doesn’t mean you throw the event away. Failed events should land in a dead letter queue where someone can look at them, figure out what happened, and push them through manually once things are fixed on the other end.
A useful DLQ stores:
- The original event payload
- Every delivery attempt with timestamps, status codes, and error messages
- The endpoint URL and its configuration at the time
- Why the event ended up here in the first place
That last one is important. Six weeks from now a customer will open a ticket about missing events. “Event exhausted 8 retry attempts over 3 days against endpoint X, final error was connection refused” gives you something to work with. A blank entry does not.
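As a concrete shape, a DLQ record covering those four fields might look like this (field names are illustrative, not a prescribed schema):

```python
from dataclasses import dataclass

@dataclass
class DeadLetterEntry:
    event_id: str
    payload: dict           # the original event payload
    endpoint_url: str
    endpoint_config: dict   # endpoint settings at the time of delivery
    attempts: list          # each item: {"at": ..., "status": ..., "error": ...}
    failure_reason: str     # e.g. "exhausted 8 attempts, final error: connection refused"
```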
What the Big Players Do
Here’s how some of the bigger webhook senders handle retries:
| Service | Max Attempts | Retry Window | Backoff Strategy |
|---|---|---|---|
| Stripe | 16 | ~3 days | Exponential |
| GitHub | 3 | ~4 hours | Fixed intervals |
| Twilio | 2 | Immediate | None |
| Shopify | 19 | ~48 hours | Exponential |
Wide range. Twilio barely retries at all while Shopify will keep going for two days. Most of the industry seems to be converging on exponential backoff over a multi-day window. More attempts over a longer window gets you better delivery rates, but if the outage drags on, some of those events are going to be pretty stale by the time they arrive.
A Note for the Receiving Side
If you’re on the receiving end of webhooks, how your endpoint behaves has a big impact on how well retries work for you.
Respond fast. Acknowledge the event with a 200, then process it in a background job. If your handler takes 25 seconds to chew through some business logic before responding, the sender might time out and retry, and now you’re processing the same event twice.
Make your processing idempotent. You will get duplicates. It’s not a question of if. Use the event ID to deduplicate.
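A minimal sketch of dedup-by-event-ID (in production the seen-IDs set would live in a durable store such as a database table with a unique constraint, not in process memory):

```python
seen_event_ids = set()

def handle_webhook(event_id, payload, process):
    # Idempotent handler: process each event ID at most once.
    if event_id in seen_event_ids:
        return "duplicate"
    process(payload)
    seen_event_ids.add(event_id)
    return "processed"
```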
Return honest status codes. If your endpoint returns 200 when it actually choked on the payload, the sender thinks everything went fine. If it returns 500 because your validation logic rejected a field, the sender will keep retrying something that’s never going to work. Getting status codes right prevents a ton of pointless back-and-forth.
The Full Picture
All together, a solid retry strategy has:
- Exponential backoff with a reasonable base delay and a max cap
- Full jitter to scatter retries and prevent thundering herds
- Failure classification so you’re not wasting cycles retrying things that’ll never work
- Per-endpoint rate limits to avoid overwhelming recovering consumers
- Dead letter queues as a safety net for events that exhaust every retry
- Observability into what’s happening across all of it (retry counts, queue depths, delivery latency)
That’s a lot to build, and more importantly it’s a lot to maintain as your traffic scales up. Hookbridge has all of this built in. Configurable retry policies per endpoint, tunable backoff schedules, failure classification, rate limiting. You set your delivery requirements and we run the infrastructure.
Next up: Webhook Observability: Knowing What Happened and Why. Because the only thing worse than a failed delivery is a failed delivery nobody noticed.