
Designing a Retry Strategy That Won't Take Down Your Consumers

webhooks · engineering · reliability

Picture this: an endpoint goes down for maybe ninety seconds. The retry loop queues up a few hundred events while it’s out. Endpoint comes back up, the system fires all of them at once, and the endpoint falls right over again. This goes on until someone gets paged at 2am. The retries caused more damage than the original outage did.

Retries aren’t optional if you care about delivery. But a careless retry strategy can do more harm than skipping retries entirely.

The “Just Try Again” Approach

Most people’s first pass at retry logic looks roughly like this:

import time

def deliver_with_retries(send, event, max_attempts=5):
    for _ in range(max_attempts):
        if send(event):
            return True
        time.sleep(5)  # fixed 5-second wait, every single time
    return False

Works fine on your laptop with three test endpoints.

Production is different, though. A customer’s endpoint drops for two minutes. You’ve got 500 events piling up in the queue, all with the same retry schedule. The endpoint recovers and suddenly gets slammed with 500 requests at once. Down it goes again. Now those 500 events all need a second retry, and you’ve got fresh events stacking up behind them too.

This is what distributed systems folks call the thundering herd problem. Fixed-interval retries practically guarantee it’ll happen eventually.

Exponential Backoff

Better idea: space your retries further apart each time.

Attempt 1: immediate
Attempt 2: wait 30 seconds
Attempt 3: wait 1 minute
Attempt 4: wait 5 minutes
Attempt 5: wait 30 minutes
Attempt 6: wait 2 hours
Attempt 7: wait 8 hours
Attempt 8: wait 24 hours

Early retries pick up quick blips like a network hiccup or a brief restart. Later retries give a broken endpoint actual time to recover instead of you pounding on it every five seconds while it’s down.

The math is simple enough:

delay = base_delay * (multiplier ^ attempt_number)

So with a 30-second base and multiplier of 2:

Attempt 1: 30s
Attempt 2: 60s
Attempt 3: 120s
Attempt 4: 240s
...

You’ll want a maximum delay cap on this or the numbers get absurd. Most systems cap at 24 hours. After five to seven days of nothing but failures, the event gets shunted to a dead letter queue or marked as permanently failed. If an endpoint hasn’t come back after a week, the problem isn’t going to be solved by attempt number forty-seven.
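That give-up rule can be expressed as a small predicate. This is a sketch under assumed thresholds (the names `MAX_ATTEMPTS`, `MAX_AGE`, and `should_give_up` are illustrative, not from any particular system):

```python
from datetime import datetime, timedelta, timezone

# Illustrative policy: stop retrying after either a max attempt count
# or a max age since the first failure, whichever comes first.
MAX_ATTEMPTS = 10
MAX_AGE = timedelta(days=7)

def should_give_up(attempt, first_failed_at):
    """Return True when the event should go to the dead letter queue."""
    if attempt >= MAX_ATTEMPTS:
        return True
    if datetime.now(timezone.utc) - first_failed_at >= MAX_AGE:
        return True
    return False
```

Checking both count and age matters: a slow backoff schedule can burn a week on only a handful of attempts, and you usually want the age limit to win.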

Why You Need Jitter

Exponential backoff on its own doesn’t actually solve the thundering herd, though. If 500 events all fail at the same moment and they’re all running the exact same backoff formula, they’re all going to retry at the exact same moment too. You haven’t gotten rid of the stampede. You’ve just stretched the gaps between stampedes.

Jitter fixes this by introducing randomness into when each retry fires:

import random

def retry_delay(attempt, base_delay=30, max_delay=86400):
    exponential = base_delay * (2 ** attempt)
    capped = min(exponential, max_delay)
    # Full jitter: random value between 0 and the calculated delay
    return random.uniform(0, capped)

A few flavors worth knowing about:

  • Full jitter: random(0, calculated_delay). Scatters your retries across the entire delay window. Best at breaking up herds.
  • Equal jitter: calculated_delay/2 + random(0, calculated_delay/2). Guarantees a minimum wait but still has randomness on top. Good middle ground.
  • Decorrelated jitter: The delay for each retry is derived from the previous delay instead of the attempt number. Produces wilder spacing. Some teams prefer this approach.

Full jitter is the best default unless there’s a specific reason to pick something else. Lowest collision rate, most even distribution of retries across time.
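For comparison, the other two flavors can be sketched like this (parameter names and defaults are mine, mirroring the full-jitter function earlier):

```python
import random

def equal_jitter(attempt, base=30, cap=86400):
    """Half the capped delay is guaranteed; the other half is randomized."""
    capped = min(base * (2 ** attempt), cap)
    return capped / 2 + random.uniform(0, capped / 2)

def decorrelated_jitter(previous_delay, base=30, cap=86400):
    """Next delay is derived from the previous delay, not the attempt number."""
    return min(cap, random.uniform(base, previous_delay * 3))
```

Note the different signatures: decorrelated jitter is stateful, so the caller has to carry the previous delay forward between attempts.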

Not Every Failure Deserves a Retry

A 503 Service Unavailable and a 404 Not Found are both HTTP errors, sure. But retrying a 404 is pointless. That URL isn’t going to start working on the fourth try.

Worth retrying

  • 5xx errors (500, 502, 503, 504). Server-side issues. The whole point of retries.
  • Connection timeouts. Could be temporary load or a blip in the network.
  • DNS resolution failures. Clear up on their own more often than you’d expect, especially during propagation.
  • Connection refused. They might be mid-deploy.

Don’t bother retrying

  • 4xx errors (400, 401, 403, 404, 422). The problem is with your request, not their server. Sending it again won’t change anything.
  • Invalid URL. Somebody misconfigured their endpoint. That’s a human problem now.
  • TLS certificate failures. No retry schedule in the world is going to renew an expired cert.

Somewhere in between

  • 429 Too Many Requests. Retry this, but check for a Retry-After header first. If they’re telling you to slow down, slow down.
  • 301/302 Redirects. Follow them (up to a sane limit), but it’s worth wondering whether they meant to redirect webhook traffic or whether something got misconfigured somewhere.

In code, that classification looks roughly like:

def should_retry(status_code, error_type):
    if error_type in ("timeout", "connection_refused", "dns_error"):
        return True
    if status_code is None:
        return True  # Network-level failure
    if status_code == 429:
        return True  # Rate limited, retry with backoff
    if 500 <= status_code < 600:
        return True  # Server error
    return False  # Client errors, redirects, etc.
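The 429 case deserves one extra step: honor the Retry-After header when it's present. A hedged sketch of that glue (the `next_delay` name and shape are illustrative):

```python
def next_delay(status_code, headers, attempt, backoff):
    """Pick the wait before the next attempt.

    `backoff` is any function mapping attempt number to a delay in seconds,
    e.g. the jittered exponential backoff from earlier.
    """
    if status_code == 429 and "Retry-After" in headers:
        try:
            return float(headers["Retry-After"])  # delta-seconds form
        except ValueError:
            pass  # HTTP-date form of the header; fall back to the schedule
    return backoff(attempt)
```

Retry-After can also be an HTTP date rather than a number of seconds; a production implementation would parse that form too instead of ignoring it.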

Rate Limiting on Top of Everything Else

Backoff and jitter handle when retries fire, but there’s also the question of raw throughput. If a customer’s endpoint handles 10 requests per second on a good day, blasting 50 retries per second at it because your backoff schedule happened to align is still going to cause problems.

Put a per-endpoint rate limit on your delivery pipeline:

For each consumer endpoint:
    max_concurrent_deliveries = 5
    max_requests_per_second = 10

And if you’re getting 429 responses with Retry-After headers, pay attention to them. That’s the consumer telling you exactly how much they can take.
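A per-endpoint limit like the one above is often implemented as a token bucket. A minimal in-process sketch (class and parameter names are mine):

```python
import time

class TokenBucket:
    """Minimal per-endpoint rate limiter: refills `rate` tokens per second,
    holds at most `burst` tokens, and each delivery spends one token."""

    def __init__(self, rate=10, burst=10):
        self.rate = rate
        self.capacity = burst
        self.tokens = burst
        self.updated = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

In a multi-worker delivery pipeline the bucket state would live in shared storage (Redis is a common choice) rather than in process memory.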

Dead Letter Queues

At some point you have to stop retrying. The endpoint has been down for a week, or it straight up doesn’t exist anymore.

But stopping retries doesn’t mean you throw the event away. Failed events should land in a dead letter queue where someone can look at them, figure out what happened, and push them through manually once things are fixed on the other end.

A useful DLQ stores:

  • The original event payload
  • Every delivery attempt with timestamps, status codes, and error messages
  • The endpoint URL and its configuration at the time
  • Why the event ended up here in the first place

That last one is important. Six weeks from now a customer will open a ticket about missing events. “Event exhausted 8 retry attempts over 3 days against endpoint X, final error was connection refused” gives you something to work with. A blank entry does not.
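As a concrete shape, a DLQ record covering those four fields might look like this (the class and field names are assumptions, not a prescribed schema):

```python
from dataclasses import dataclass

@dataclass
class DeadLetterEntry:
    event_id: str
    payload: dict          # the original event payload
    endpoint_url: str      # endpoint URL and config at delivery time
    attempts: list         # one record per attempt: timestamp, status, error
    reason: str            # why it landed here, e.g. "exhausted 8 retries"
```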

What the Big Players Do

Here’s how some of the bigger webhook senders handle retries:

Service    Max Attempts   Retry Window   Backoff Strategy
Stripe     16             ~3 days        Exponential
GitHub     3              ~4 hours       Fixed intervals
Twilio     2              Immediate      None
Shopify    19             ~48 hours      Exponential

Wide range. Twilio barely retries at all while Shopify will keep going for two days. Most of the industry seems to be converging on exponential backoff over a multi-day window. More attempts over a longer window gets you better delivery rates, but if the outage drags on, some of those events are going to be pretty stale by the time they arrive.

A Note for the Receiving Side

If you’re on the receiving end of webhooks, how your endpoint behaves has a big impact on how well retries work for you.

Respond fast. Acknowledge the event with a 200, then process it in a background job. If your handler takes 25 seconds to chew through some business logic before responding, the sender might time out and retry, and now you’re processing the same event twice.

Make your processing idempotent. You will get duplicates. It’s not a question of if. Use the event ID to deduplicate.

Return honest status codes. If your endpoint returns 200 when it actually choked on the payload, the sender thinks everything went fine. If it returns 500 because your validation logic rejected a field, the sender will keep retrying something that’s never going to work. Getting status codes right prevents a ton of pointless back-and-forth.
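The dedup-by-event-ID advice can be sketched in a few lines. This uses an in-memory set for illustration; a real service would use a database or cache with a TTL, and `processed` here stands in for a background job queue:

```python
seen_event_ids = set()
processed = []  # stand-in for enqueueing work to a background job

def handle_webhook(event):
    """Acknowledge every delivery, but only process each event ID once."""
    if event["id"] not in seen_event_ids:
        seen_event_ids.add(event["id"])
        processed.append(event)
    return 200  # always acknowledge so the sender stops retrying
```

Returning 200 for a duplicate is deliberate: from the sender's point of view the delivery succeeded, and anything else would just trigger more retries of an event you already handled.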

The Full Picture

All together, a solid retry strategy has:

  1. Exponential backoff with a reasonable base delay and a max cap
  2. Full jitter to scatter retries and prevent thundering herds
  3. Failure classification so you’re not wasting cycles retrying things that’ll never work
  4. Per-endpoint rate limits to avoid overwhelming recovering consumers
  5. Dead letter queues as a safety net for events that exhaust every retry
  6. Observability into what’s happening across all of it (retry counts, queue depths, delivery latency)

That’s a lot to build, and more importantly it’s a lot to maintain as your traffic scales up. Hookbridge has all of this built in. Configurable retry policies per endpoint, tunable backoff schedules, failure classification, rate limiting. You set your delivery requirements and we run the infrastructure.

Next up: Webhook Observability: Knowing What Happened and Why. Because the only thing worse than a failed delivery is a failed delivery nobody noticed.