
Why Webhook Delivery Is Harder Than You Think

webhooks · engineering · reliability

Before I started Hookbridge, I built webhook delivery into a couple of different projects by hand. The first pass was always the same — serialize some JSON, POST it to a URL, get a 200, move on. It worked fine for the first few integrations. Then we’d hit a hundred endpoints and suddenly we were dealing with DNS failures, zombie connections, retry storms, and one customer whose endpoint took 45 seconds to respond while everything else backed up behind it.

Webhook delivery is a distributed systems problem wearing a simple interface as a disguise.

The Network Is Not Reliable

Every webhook you send crosses routers, load balancers, DNS resolvers, TLS handshakes, firewalls, and CDNs before it reaches your customer’s endpoint. Any one of those can fail, slow down, or just do something weird.

You’ll see DNS resolution failures when a customer’s domain can’t be resolved. You’ll see connection timeouts when a firewall silently drops your packets, and outright connection refusals when the server is overloaded or a rule changed. TLS errors from expired certs or misconfigured chains will kill the connection before it even starts. And sometimes you’ll get a clean TCP connection that just… drops mid-response.

Each failure mode needs a different response strategy. DNS issues might clear up in minutes. An expired certificate could take days to fix. Your delivery system has to handle all of them without losing events.
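To make that concrete, here’s a minimal Python sketch of a failure classifier, assuming the delivery worker catches the low-level socket and TLS exceptions itself. The delay constants are purely illustrative, not a recommended policy:

```python
import socket
import ssl

# Hypothetical classifier mapping low-level failures to a retry delay.
# The constants are illustrative, not a recommendation.
TRANSIENT_RETRY_SECONDS = 60        # DNS blips, refused connections, timeouts
SLOW_FIX_RETRY_SECONDS = 6 * 3600   # an expired cert can take days to fix

def retry_delay_for(exc: Exception) -> int:
    """Return how long to wait before retrying, based on the failure type."""
    if isinstance(exc, socket.gaierror):
        return TRANSIENT_RETRY_SECONDS          # DNS resolution failure
    if isinstance(exc, ssl.SSLCertVerificationError):
        return SLOW_FIX_RETRY_SECONDS           # bad certificate or chain
    if isinstance(exc, (ConnectionRefusedError, TimeoutError)):
        return TRANSIENT_RETRY_SECONDS          # overloaded server or firewall
    return SLOW_FIX_RETRY_SECONDS               # unknown: back off conservatively
```

The point isn’t the specific numbers — it’s that a single hard-coded retry interval can’t be right for both a DNS blip and an expired certificate.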

Their Server Is Not Your Server

When you send webhooks, you’re hitting servers you don’t own, can’t monitor, and have zero control over.

Some endpoints are painfully slow — 30 seconds or more to respond. Do you wait? At what point do you give up? Other endpoints return wrong status codes: a 200 when processing actually failed, or a 500 when the event was handled fine. Plenty of customers have rate limiters in front of their webhook endpoints, so you’ll get 429 Too Many Requests if you send too fast. And endpoints go offline all the time — deployments, outages, maintenance windows. Your system needs to keep retrying without dropping anything.
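One way to frame that tolerance is a status interpreter that decides what each response means for delivery state. This sketch follows common HTTP conventions — with the caveat, as above, that real endpoints return 200 on failure and 500 on success often enough to keep you humble:

```python
from typing import Optional, Tuple

# Hypothetical status interpreter. Semantics below are common HTTP
# conventions, not guarantees about any particular consumer.
def interpret_status(status: int,
                     retry_after: Optional[str] = None) -> Tuple[str, Optional[int]]:
    """Map an HTTP status to (outcome, retry delay in seconds or None)."""
    if 200 <= status < 300:
        return ("delivered", None)
    if status == 429:
        # Consumer-side rate limiter: honor Retry-After when it's numeric
        delay = int(retry_after) if retry_after and retry_after.isdigit() else 30
        return ("retry", delay)
    if status in (408, 500, 502, 503, 504):
        return ("retry", 60)        # likely transient server trouble
    if 400 <= status < 500:
        return ("failed", None)     # request rejected; retrying won't help
    return ("retry", 60)            # anything else: assume transient
```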

You’re not building an HTTP client. You’re building an HTTP client that has to tolerate every possible misbehavior on the receiving end.

Retries Are Necessary — And Dangerous

Failures are inevitable, so you need retries. The problem is that naive retry logic makes things worse.

The Thundering Herd

Say you have 10,000 webhooks queued and the consumer’s server goes down for 5 minutes. When it comes back up, do you slam it with all 10,000 at once? Because that’s a great way to knock it right back down.

Retry Amplification

If your first attempt fails and you retry 3 times with no backoff, one request becomes four. Multiply that across thousands of events and you’ve 4x’d your outbound traffic — and theirs.

The Retry Storm

When multiple webhook senders all retry against the same consumer endpoint at the same time, you get a cascading failure that can take down the consumer’s entire infrastructure, not just the webhook route.

The fix is exponential backoff, jitter, and rate limiting. Getting the balance right between “retry soon enough to matter” and “don’t crush the consumer” is harder than it sounds.
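A common shape for this is “full jitter”: grow the ceiling exponentially, then draw the actual delay uniformly below it. A minimal sketch, with illustrative base and cap values:

```python
import random

# "Full jitter" backoff sketch: exponential ceiling, uniform draw below it.
# Base and cap are illustrative, not a recommended policy.
BASE_SECONDS = 1.0
CAP_SECONDS = 3600.0

def backoff_delay(attempt: int) -> float:
    """Delay before retry `attempt` (0-indexed), capped and jittered."""
    ceiling = min(CAP_SECONDS, BASE_SECONDS * (2 ** attempt))
    # Spreading retries across [0, ceiling] keeps simultaneous retriers
    # from herding back onto the consumer at the same instant.
    return random.uniform(0, ceiling)
```

The jitter is what defuses the thundering herd: 10,000 queued retries land spread across the window instead of arriving as one spike.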

Ordering Is an Illusion

Events happen in order. Deliveries don’t.

Say these three events fire in sequence:

  1. order.created at 10:00:00
  2. order.updated at 10:00:01
  3. order.completed at 10:00:02

If order.created fails and gets retried, the consumer might actually receive them as:

  1. order.updated
  2. order.completed
  3. order.created (retry)

Now the consumer sees an order being created after it was completed. If their logic depends on event ordering, you’ve got a data integrity problem.

You can enforce strict ordering by serializing delivery per entity, though that kills throughput. Most systems go with at-least-once delivery and push the burden onto consumers to handle out-of-order events with timestamps or sequence numbers.
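The sequence-number approach can be sketched in a few lines on the consumer side. The field names here (`order_id`, `sequence`) are hypothetical:

```python
from typing import Dict

# Consumer-side sketch: skip events that arrive behind ones already
# applied, using a per-entity sequence number the sender attaches.
last_applied: Dict[str, int] = {}   # entity id -> highest sequence applied

def should_apply(event: dict) -> bool:
    """Apply an event only if it's newer than what we've seen for this entity."""
    entity, seq = event["order_id"], event["sequence"]
    if seq <= last_applied.get(entity, -1):
        return False                 # stale or duplicate: ignore safely
    last_applied[entity] = seq
    return True
```

Note this suits last-write-wins consumers; a consumer that needs every intermediate event would buffer out-of-order arrivals instead of dropping them.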

Exactly-Once Delivery Is a Myth

Distributed systems give you two choices: at-most-once or at-least-once delivery. Exactly-once — every event delivered precisely one time — isn’t achievable without tight coordination between sender and receiver.

Think about it: you send a webhook, the consumer processes it, and then their server crashes before it can send back a 200. From your side, that delivery failed. So you retry. The consumer has now processed the same event twice.

The practical answer is at-least-once delivery paired with consumer-side idempotency. That means every event needs a unique ID, consumers need to track what they’ve already processed, and your docs need to spell out this contract clearly.
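On the consumer side, the idempotency half of that contract is simple in outline. In production the seen-ID set would live in a durable store — a database unique constraint works well — not process memory:

```python
# Consumer-side idempotency sketch: track processed event IDs so a
# redelivered event is acknowledged but not reprocessed.
seen_event_ids: set = set()

def handle_event(event: dict) -> str:
    event_id = event["id"]          # unique per event, assigned by the sender
    if event_id in seen_event_ids:
        return "duplicate"          # still acknowledge with a 200 to stop retries
    seen_event_ids.add(event_id)
    # ... apply the event's side effects exactly once here ...
    return "processed"
```

The subtle part is the return on the duplicate path: the consumer must still acknowledge, or the sender keeps retrying forever.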

It takes work on both sides to get right.

Observability Is Critical

When deliveries fail, someone needs to know. And “it failed” doesn’t cut it — you need to know why, when, how many retries have happened, and what comes next.

These are the kinds of questions you’ll be answering:

  • Is this transient or permanent?
  • How many events are sitting in the retry queue right now?
  • Is one slow consumer dragging down delivery for everyone else?
  • What’s our p99 delivery latency?
  • When exactly did this endpoint start failing?

You need logging, metrics, and alerting that can actually answer the question when a customer opens a ticket: “did you send us that webhook or not?”
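As a rough sketch, here’s the shape of a per-attempt record — approximately the minimum you’d need stored to answer those questions. The fields are hypothetical, not a schema from any particular system:

```python
import time
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical record of one delivery attempt. Persist one of these per
# attempt and most "did you send it?" questions become queries.
@dataclass
class DeliveryAttempt:
    event_id: str
    endpoint: str
    attempt: int                    # 1-based attempt number
    status: Optional[int]           # HTTP status, or None if nothing came back
    error: Optional[str]            # classification: "dns", "tls", "timeout", ...
    latency_ms: float
    at: float = field(default_factory=time.time)
```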

Security Adds Complexity

Webhook endpoints are publicly accessible URLs. If someone discovers the URL, they can send fake events to it. So you need signing — HMAC signatures that let the consumer verify a request actually came from you and hasn’t been tampered with.

Then you need key management: rotation policies, per-customer signing keys, secure storage. You also need replay protection, because a signed webhook that gets intercepted and re-sent still carries a valid signature. Timestamps and expiration windows help, though they add yet another validation layer.
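Here’s what verification looks like on the consumer side, assuming a scheme where the sender signs `"{timestamp}.{body}"` with HMAC-SHA256 and ships the timestamp and hex signature in headers — a common pattern, but the exact scheme here is an assumption, not a spec:

```python
import hashlib
import hmac
import time
from typing import Optional

TOLERANCE_SECONDS = 300   # reject anything outside a 5-minute window (replay guard)

def verify_signature(secret: bytes, body: bytes, timestamp: str,
                     signature: str, now: Optional[float] = None) -> bool:
    """Check the signature and reject stale timestamps."""
    now = time.time() if now is None else now
    if abs(now - int(timestamp)) > TOLERANCE_SECONDS:
        return False                       # stale: possible replay
    expected = hmac.new(secret, f"{timestamp}.".encode() + body,
                        hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking the signature via timing
    return hmac.compare_digest(expected, signature)
```

Signing the timestamp along with the body is what makes the expiration window enforceable — an attacker can’t just swap in a fresh timestamp without invalidating the signature.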

None of this is optional if you’re serious about webhook delivery.

The Operational Burden

Running a webhook delivery system means running infrastructure that:

  • Maintains a queue of pending deliveries
  • Retries on a schedule with backoff
  • Tracks delivery status per event, per consumer
  • Respects consumer-specific rate limits
  • Gives you delivery logs and debugging tools
  • Monitors health and alerts on failures
  • Handles traffic spikes without dropping events

That’s a lot of infrastructure to build and keep running. And it’s all adjacent to your core product — critical enough that it can’t break, yet not the thing that actually makes your product valuable.

Why This Matters

Delivery reliability directly affects how much your customers trust your platform. A dropped webhook can mean a missed order, a broken integration, or hours of debugging on their end. When they can’t figure out what went wrong, they open a support ticket with you.

Reliable webhook delivery isn’t a nice-to-have. If your API has webhooks, it’s a requirement.

You can absolutely build all of this yourself. We did, and that’s how Hookbridge came to be — we got tired of rebuilding the same delivery infrastructure across projects and turned it into a service. If you’d rather not do that dance yourself, it might be worth a look.

Next up: Securing Webhooks with HMAC Signatures — a practical guide to signing and verifying webhooks so your consumers can trust every event they receive.