Webhook Observability: Knowing What Happened and Why

April 3, 2026 · 5 min read

webhooks, engineering, observability

“Did you send us that webhook?”

If you’ve run a webhook delivery system for any length of time, you’ve fielded this question, and the answer needs to be more specific than “probably.” You need to know what was sent, when it went out, whether it was received, and if it wasn’t, what went wrong.

Without that information, every delivery failure turns into a support ticket. Every integration issue turns into a guessing game where you’re checking logs on three different systems while the customer waits. Every outage turns into a mystery that nobody has the data to solve quickly.

The Questions Your System Should Answer

At a minimum, you want real-time answers to these:

Was this event actually delivered? Not just sent, but received and acknowledged by the consumer.
If it failed, what was the error? Network issue, timeout, a 500 from their server, expired certificate?
How many retry attempts happened, and what was the result of each one?
How long did the whole process take from event creation to successful acknowledgment?
What does this endpoint’s health look like over the past hour or day? What’s the success rate?

If any of these take more than a few seconds to look up, you’ve got gaps.

Delivery Logs

Every delivery attempt should produce a log entry. You want the event ID, the endpoint URL, which attempt number it was, when it happened, the HTTP status code that came back, how long the request took, the response body (this matters more than people think when debugging), and for network-level failures, the specific error like a timeout, DNS resolution failure, or TLS issue.

These logs need to be searchable. When a customer says “we didn’t get the webhook for order 12345,” you should be able to pull up that specific event and see every delivery attempt within seconds.
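As a rough sketch of the record described above, here is one way to model a delivery attempt and look it up by event ID in Python. The class and field names are illustrative assumptions, not Hookbridge’s schema, and the in-memory dictionary only stands in for whatever indexed log store you actually use.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class DeliveryAttempt:
    event_id: str
    endpoint_url: str
    attempt_no: int
    attempted_at: datetime
    status_code: Optional[int]     # None when no HTTP response came back
    duration_ms: Optional[int]
    response_body: Optional[str]
    network_error: Optional[str]   # e.g. "timeout", "dns", "tls"


class DeliveryLog:
    """In-memory stand-in for a searchable log store, indexed by event ID."""

    def __init__(self) -> None:
        self._by_event: dict[str, list[DeliveryAttempt]] = defaultdict(list)

    def record(self, attempt: DeliveryAttempt) -> None:
        self._by_event[attempt.event_id].append(attempt)

    def attempts_for(self, event_id: str) -> list[DeliveryAttempt]:
        # .get avoids creating empty entries for unknown event IDs
        found = self._by_event.get(event_id, [])
        return sorted(found, key=lambda a: a.attempt_no)
```

The point of indexing by event ID is that “order 12345” becomes a single lookup that returns every attempt in order, rather than a scan across log files.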

To give a concrete example, here’s what it looks like when you query delivery attempts through the Hookbridge API. This is a real response shape from GET /v1/messages/{id}/attempts:

{
  "data": [
    {
      "id": "01935abc-def0-7123-4567-890abcdef001",
      "attempt_no": 1,
      "response_status": 503,
      "response_latency_ms": 30012,
      "error_text": "HTTP 503 Service Unavailable",
      "retry_type": "slow",
      "retry_after_seconds": 60,
      "dns_ms": 14,
      "tcp_connect_ms": 32,
      "tls_handshake_ms": 48,
      "ttfb_ms": 29900,
      "transfer_ms": 4,
      "conn_reused": false,
      "resolved_ip": "203.0.113.42",
      "created_at": "2026-03-19T14:20:00Z"
    },
    {
      "id": "01935abc-def0-7123-4567-890abcdef002",
      "attempt_no": 2,
      "response_status": null,
      "response_latency_ms": null,
      "error_text": "connection timeout",
      "retry_type": "slow",
      "retry_after_seconds": 120,
      "dns_ms": 11,
      "tcp_connect_ms": null,
      "tls_handshake_ms": null,
      "ttfb_ms": null,
      "transfer_ms": null,
      "conn_reused": false,
      "resolved_ip": "203.0.113.42",
      "created_at": "2026-03-19T14:21:00Z"
    },
    {
      "id": "01935abc-def0-7123-4567-890abcdef003",
      "attempt_no": 3,
      "response_status": 200,
      "response_latency_ms": 142,
      "dns_ms": 0,
      "tcp_connect_ms": 0,
      "tls_handshake_ms": 0,
      "ttfb_ms": 137,
      "transfer_ms": 5,
      "conn_reused": true,
      "resolved_ip": "203.0.113.42",
      "created_at": "2026-03-19T14:22:03Z"
    }
  ],
  "meta": {
    "request_id": "req_xyz123",
    "has_more": false
  }
}

You can read the whole story from this response. Attempt 1 got a 503 after waiting nearly 30 seconds for the server to respond (look at that ttfb_ms). DNS resolution, the TCP connection, and the TLS handshake together took under 100ms, so the endpoint was reachable but the server was struggling. Attempt 2 resolved DNS but couldn’t even establish a TCP connection (tcp_connect_ms is null), which suggests the server was in worse shape or had been taken offline. Attempt 3 went through in 142ms, meaning the endpoint had recovered and was healthy again. It ran over a reused connection (conn_reused: true), which would explain the zero DNS, TCP, and TLS timings.

That level of detail turns a “did our webhook get delivered?” conversation into a 30-second lookup instead of a 30-minute investigation.
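The same reading of an attempt can be automated. The sketch below takes one entry from the response shape shown earlier and produces a one-line summary. The field names come from that response; the function name and the rule that a null timing marks the phase that never completed are assumptions made for illustration.

```python
from typing import Any

# Phases in the order a request passes through them, keyed by the timing
# fields from the attempts response.
PHASES = [
    ("dns_ms", "DNS resolution"),
    ("tcp_connect_ms", "TCP connect"),
    ("tls_handshake_ms", "TLS handshake"),
    ("ttfb_ms", "waiting for the first response byte"),
]


def describe_attempt(attempt: dict[str, Any]) -> str:
    n = attempt["attempt_no"]
    status = attempt.get("response_status")

    if status is not None and 200 <= status < 300:
        reused = " on a reused connection" if attempt.get("conn_reused") else ""
        return f"attempt {n}: delivered in {attempt.get('response_latency_ms')}ms{reused}"

    if status is not None:
        # An HTTP status means the network path worked; the server answered badly.
        return f"attempt {n}: endpoint reachable, returned HTTP {status} after {attempt.get('ttfb_ms')}ms"

    # No HTTP response. Assumption: the first null timing is the phase
    # where the attempt stopped.
    error = attempt.get("error_text") or "unknown error"
    for field, phase in PHASES:
        if attempt.get(field) is None:
            return f"attempt {n}: failed during {phase} ({error})"
    return f"attempt {n}: failed after connecting ({error})"
```

Run against the second attempt in the example, this would report a failure during TCP connect, which matches the manual reading.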

Metrics

Logs tell you what happened with a specific event. Metrics tell you what’s happening across your system.

The ones that matter: delivery success rate (both first-attempt and overall), delivery latency at p50/p95/p99, how often events need retries, current queue depth, and error breakdowns by type and by endpoint.
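To make a few of these concrete, here is a minimal sketch that derives success rates, retry rate, and latency percentiles from a flat list of attempts. The tuple layout and function names are assumptions, latency here means the request time of successful attempts (not end-to-end time from event creation), and queue depth and error breakdowns are left out.

```python
import math
from collections import defaultdict
from typing import Iterable, Optional

# (event_id, attempt_no, succeeded, latency_ms)
Attempt = tuple[str, int, bool, Optional[int]]


def percentile(values: list[int], pct: float) -> int:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def delivery_metrics(attempts: Iterable[Attempt]) -> dict:
    by_event = defaultdict(list)
    for event_id, attempt_no, succeeded, latency_ms in attempts:
        by_event[event_id].append((attempt_no, succeeded, latency_ms))

    total = len(by_event)
    if total == 0:
        return {}

    # Rates are per event, not per attempt.
    first_ok = sum(any(n == 1 and ok for n, ok, _ in a) for a in by_event.values())
    overall_ok = sum(any(ok for _, ok, _ in a) for a in by_event.values())
    retried = sum(len(a) > 1 for a in by_event.values())
    latencies = [ms for a in by_event.values() for _, ok, ms in a if ok and ms is not None]

    return {
        "first_attempt_success_rate": first_ok / total,
        "overall_success_rate": overall_ok / total,
        "retry_rate": retried / total,
        "latency_ms": (
            {f"p{p}": percentile(latencies, p) for p in (50, 95, 99)} if latencies else None
        ),
    }
```

In practice you would compute these per time window and per endpoint so they can be plotted as a series rather than a single number.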

Time series data is important here. You want to see trends, not just snapshots. A gradual uptick in retry rate for one particular endpoint might mean that consumer is starting to have capacity trouble. Catching that early lets you reach out before it turns into a full outage where they’re losing events.

Alerting

Metrics sitting in a dashboard that nobody looks at aren’t doing much for you. You need automated alerts for the situations that require attention:

Overall delivery success rate drops below your threshold (99% is a reasonable starting point)
Queue depth is growing, meaning events are being produced faster than delivered
A specific endpoint has been failing for an extended stretch
More events than usual are exhausting their retries and landing in the dead letter queue
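The four conditions above can be sketched as a single evaluation function. Everything here is hypothetical: the Snapshot fields, the 30-minute failing window, and the 2x dead-letter multiplier are placeholder assumptions to tune for your own traffic. Only the 99% success threshold comes from the list above.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    success_rate: float                         # overall, 0.0 to 1.0
    queue_depths: list[int]                     # recent samples, oldest first
    endpoint_failing_minutes: dict[str, float]  # endpoint -> minutes since last success
    dead_lettered: int                          # events dead-lettered in this window
    dead_lettered_baseline: float               # typical count for a window this size


def evaluate_alerts(
    s: Snapshot,
    *,
    min_success_rate: float = 0.99,
    max_failing_minutes: float = 30,
    dlq_multiplier: float = 2.0,
) -> list[str]:
    alerts = []

    if s.success_rate < min_success_rate:
        alerts.append(f"success rate {s.success_rate:.1%} is below {min_success_rate:.0%}")

    # "Growing" here means at least three samples, each larger than the last.
    depths = s.queue_depths
    if len(depths) >= 3 and all(b > a for a, b in zip(depths, depths[1:])):
        alerts.append(f"queue depth growing: {depths[0]} -> {depths[-1]}")

    for endpoint, minutes in sorted(s.endpoint_failing_minutes.items()):
        if minutes >= max_failing_minutes:
            alerts.append(f"endpoint {endpoint} failing for {minutes:.0f} minutes")

    if s.dead_lettered > dlq_multiplier * s.dead_lettered_baseline:
        alerts.append(f"dead letter queue intake elevated: {s.dead_lettered} events")

    return alerts
```

Returning a list of messages keeps the rules separate from notification routing, so the same checks can feed a pager, a chat channel, or a test.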

Give Your Customers a Dashboard

This is the part that pays for itself surprisingly fast. Your customers need visibility into their own webhook deliveries.

A good customer-facing view shows recent deliveries with status and response codes, failed deliveries with error details and the retry schedule, endpoint health over time, the ability to search events by ID or type or time range, and payload inspection so they can see exactly what was sent and what came back.

Once customers can look things up themselves, the volume of “did you send that webhook?” tickets drops off a cliff. Instead of a back-and-forth thread where you’re both trying to figure out what the payload contained, they can just pull it up directly.

Capturing Requests and Responses

For debugging, you want the full picture of each delivery attempt. On the request side: the HTTP method and URL, the headers (including signature headers), and the body. On the response side: the status code, headers, body (truncated to something reasonable), and response time.

There are some privacy and security considerations. You’ll probably want to redact sensitive headers like authorization tokens. You might need a retention limit on how long full payloads are kept around. Some customers may want to opt out of payload logging entirely. And stored payloads should be encrypted at rest.
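Two of these considerations, header redaction and body truncation, are simple enough to sketch. The header list and the 4 KB limit below are assumptions, not recommendations from any standard. Signature headers are deliberately left intact because they are useful when debugging verification failures.

```python
# Header names are compared case-insensitively. This list is an
# illustrative starting point, not an exhaustive one.
SENSITIVE_HEADERS = {
    "authorization",
    "proxy-authorization",
    "cookie",
    "set-cookie",
    "x-api-key",
}
MAX_BODY_BYTES = 4096  # assumed cap on stored body size


def redact_headers(headers: dict[str, str]) -> dict[str, str]:
    """Return a copy with sensitive header values replaced."""
    return {
        name: "[REDACTED]" if name.lower() in SENSITIVE_HEADERS else value
        for name, value in headers.items()
    }


def truncate_body(body: bytes, limit: int = MAX_BODY_BYTES) -> str:
    """Decode at most `limit` bytes, noting the original size if cut."""
    text = body[:limit].decode("utf-8", errors="replace")
    if len(body) > limit:
        text += f"... [truncated, {len(body)} bytes total]"
    return text
```

Both functions should run before the record is written, so the sensitive values never reach storage in the first place.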

Even with a limited retention window, having this data available cuts debugging time dramatically. The difference between “I think the webhook was sent correctly” and “here’s the exact request and response from that delivery attempt” is the difference between a 45-minute investigation and a 2-minute lookup.

Common Debugging Scenarios

Most webhook support questions fall into a handful of patterns, and good observability makes each one straightforward to resolve.

“We’re not receiving any webhooks.” Check the delivery logs. If attempts are being made and failing, the error will tell you what’s going on. Usually it’s the endpoint returning 4xx or 5xx, a DNS resolution failure from a misconfigured domain, an expired TLS certificate, or a firewall blocking your requests.

“We’re receiving duplicate events.” Look at the delivery logs for more than one attempt on the same event that could have reached the consumer: multiple successful deliveries, or a timed-out attempt followed by a successful retry. This usually means the consumer’s endpoint is slow to respond. The delivery system times out and fires a retry, but the original request actually went through. Both get processed.
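One way to scan logs for this pattern is sketched below. It flags events with more than one successful attempt, and also events where a response timeout was followed by a successful retry, since the consumer may have processed the request that timed out. The tuple layout and the outcome labels are hypothetical.

```python
from collections import defaultdict

# Each attempt is (event_id, attempt_no, outcome). Outcome labels are
# assumptions for this sketch: "success" (2xx), "response_timeout"
# (request sent, no reply before the deadline), or "failed" (anything else).


def find_possible_duplicates(attempts):
    by_event = defaultdict(list)
    for event_id, attempt_no, outcome in attempts:
        by_event[event_id].append((attempt_no, outcome))

    flagged = {}
    for event_id, items in by_event.items():
        items.sort()  # order by attempt number
        success_nos = [n for n, outcome in items if outcome == "success"]
        if len(success_nos) > 1:
            flagged[event_id] = "multiple successful attempts"
        elif success_nos and any(
            outcome == "response_timeout" and n < success_nos[0]
            for n, outcome in items
        ):
            flagged[event_id] = "response timeout followed by a successful retry"
    return flagged
```

A timeout during connection setup would not count here, because in that case the request never reached the consumer.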

“Events are arriving out of order.” Check the delivery timestamps. If some events needed retries, later events may have been delivered successfully while earlier ones were still working through their retry cycle. The logs will show exactly when each event actually arrived.
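A small sketch of that check: given each event’s creation time and the time of its successful delivery, list the events that were overtaken by a later-created one. The function name and tuple layout are assumptions, and the quadratic scan is only meant for the handful of events in a support investigation.

```python
from datetime import datetime, timezone


def delivered_out_of_order(events):
    """events: list of (event_id, created_at, delivered_at) datetimes.

    Returns the IDs of events that arrived after a later-created event did.
    """
    by_creation = sorted(events, key=lambda e: e[1])
    overtaken = []
    for i, (event_id, _, delivered_at) in enumerate(by_creation):
        # Did anything created after this event get delivered before it?
        if any(later < delivered_at for _, _, later in by_creation[i + 1:]):
            overtaken.append(event_id)
    return overtaken
```

An event that shows up in this list almost always has retries in its attempt history, which is the explanation to give the customer.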

“We received a webhook but the data seems wrong.” Pull up the exact payload that was delivered and compare it against the source event. More often than not, the issue is in the consumer’s parsing logic rather than the webhook content itself.

Build or Buy

You can build all of this yourself, but it’s a meaningful investment: a database or log store for delivery records, a query interface for searching and filtering, metrics collection and dashboarding, alerting rules and notification routing, a customer-facing UI for self-service debugging, and data retention policies with cleanup jobs. Each piece is its own project, and they all need to be maintained.

Hookbridge ships with all of this. The delivery attempts API shown earlier in this post is part of a larger observability surface that includes delivery logs, metrics, request and response capture, and a consumer-facing dashboard where your customers can debug their own integration issues without filing support tickets.

Next up: Idempotency Keys: Ensuring Webhooks Are Safe to Replay. When your delivery system guarantees at-least-once delivery, consumers need a reliable way to handle the duplicates that come with it.