Webhooks are deceptively simple in concept: when something happens, send an HTTP POST to a URL. In practice, building a webhook system that reliably delivers 100M+ events per day, handles endpoint failures gracefully, and maintains ordering guarantees is one of the hardest distributed systems problems in payments.

Why webhooks are hard

The fundamental challenge is that you're making an outbound HTTP request to infrastructure you don't control. Your customer's server might be down, slow, returning 500s, or behind a rate limiter. You need to handle all of these gracefully without dropping events or overwhelming recovering systems.

100M+ events per day
99.98% delivery success rate
<500ms median delivery time

Our architecture: a multi-tier queue

We use a three-tier queue architecture. Events are first published to a high-throughput Kafka topic. A fleet of delivery workers consumes from Kafka and attempts delivery. Failed deliveries are placed into a Redis-backed retry queue with exponential backoff. Events that fail repeatedly over 72 hours are moved to a dead-letter queue (DLQ) that persists in S3, with alerting and a manual replay UI for merchants.
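To make the retry tier concrete, here is a minimal sketch of an exponential-backoff schedule with jitter and a 72-hour dead-letter cutoff. The base delay, cap, and jitter fraction are illustrative assumptions, not our production values.

```python
import random

# Assumed schedule for illustration: delays double from a 30s base,
# capped at 1 hour per attempt, with jitter to avoid thundering herds.
BASE_DELAY_S = 30
MAX_DELAY_S = 3600
GIVE_UP_AFTER_S = 72 * 3600  # move the event to the DLQ after 72 hours

def next_retry_delay(attempt: int) -> float:
    """Seconds to wait before retry number `attempt` (1-indexed)."""
    delay = min(BASE_DELAY_S * (2 ** (attempt - 1)), MAX_DELAY_S)
    return delay + random.uniform(0, delay * 0.1)  # up to 10% jitter

def should_dead_letter(elapsed_s: float) -> bool:
    """True once an event has been failing long enough for the DLQ."""
    return elapsed_s >= GIVE_UP_AFTER_S
```

In production these delays become scores in the Redis-backed retry queue; workers pull events whose scheduled time has passed.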

Ordering guarantees without global locks

Strict event ordering per resource (e.g., all events for payment_id X must arrive in order) is important for merchants that maintain state machines on their end. We achieve this by partitioning the Kafka topic by resource_id. All events for a given resource land in the same partition and are consumed in order by the same worker. This gives us ordering per resource without any global coordination overhead.
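The partitioning idea can be sketched as a stable hash from resource_id to partition index. This is an illustrative stand-in for a Kafka partitioner (Kafka's default keyed partitioner achieves the same property when you produce with resource_id as the message key); the partition count is an assumption.

```python
import hashlib

NUM_PARTITIONS = 64  # assumed partition count, for illustration only

def partition_for(resource_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Stable hash of resource_id -> partition, so every event for a
    given resource lands on the same partition and is consumed in
    order by the same worker."""
    digest = hashlib.sha256(resource_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Because the mapping is deterministic, no coordination service is needed: any producer computes the same partition for the same resource.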

The gotcha: Ordering guarantees break down during retry. If event #3 fails and is retried after event #4 delivers, the merchant receives them out of order. We solve this by including a sequence_number on every event so merchants can detect and handle out-of-order delivery on their end.
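On the merchant side, sequence_number makes out-of-order delivery easy to repair: buffer events that arrive early and apply them once the gap fills. A minimal sketch (the field names `resource_id` and `sequence_number` match the payload described above; the rest is hypothetical):

```python
from typing import Callable

class OrderedEventProcessor:
    """Buffers webhook events per resource and applies them in
    sequence_number order, tolerating out-of-order (and duplicate)
    delivery. Assumes sequence numbers start at 1 per resource."""

    def __init__(self, apply: Callable[[dict], None]):
        self.apply = apply
        self.next_seq: dict[str, int] = {}            # resource -> expected seq
        self.pending: dict[str, dict[int, dict]] = {} # buffered early arrivals

    def receive(self, event: dict) -> None:
        rid, seq = event["resource_id"], event["sequence_number"]
        expected = self.next_seq.setdefault(rid, 1)
        if seq < expected:
            return  # duplicate delivery; already applied
        self.pending.setdefault(rid, {})[seq] = event
        # Drain the buffer while the next expected event is present.
        buf = self.pending[rid]
        while self.next_seq[rid] in buf:
            self.apply(buf.pop(self.next_seq[rid]))
            self.next_seq[rid] += 1
```

Feeding events 1, 3, 2 for the same resource applies them as 1, 2, 3.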

Backpressure and rate limiting

We respect the HTTP 429 (Too Many Requests) and Retry-After headers from merchant endpoints. If an endpoint returns a 429, we immediately pause delivery to that endpoint and schedule a retry at the time specified by the Retry-After header. This prevents us from hammering recovering systems and maintains merchant trust.

For endpoints that are consistently slow or failing, we implement circuit breakers. After 10 consecutive failures, an endpoint enters an open circuit state and receives reduced-frequency delivery attempts (once per minute) until it starts succeeding again.
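A minimal sketch of that breaker, assuming the thresholds stated above (10 consecutive failures to open, one probe per minute while open, any success closes it); the class and method names are hypothetical:

```python
class EndpointCircuitBreaker:
    """Opens after `threshold` consecutive failures; while open, allows
    one probe delivery per `probe_interval_s`; any success closes it."""

    def __init__(self, threshold: int = 10, probe_interval_s: float = 60.0):
        self.threshold = threshold
        self.probe_interval_s = probe_interval_s
        self.failures = 0
        self.last_probe = 0.0

    def allow_delivery(self, now: float) -> bool:
        if self.failures < self.threshold:
            return True  # circuit closed: deliver normally
        if now - self.last_probe >= self.probe_interval_s:
            self.last_probe = now  # circuit open: one probe per interval
            return True
        return False

    def record(self, success: bool) -> None:
        self.failures = 0 if success else self.failures + 1
```

Passing `now` in explicitly (rather than reading a clock inside) keeps the breaker deterministic and easy to test.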

Operational tooling

We've invested heavily in the merchant-facing DLQ management UI. Merchants can see all undelivered events, filter by type, inspect the full request and response for each failed attempt, and manually trigger replays for individual events or entire time ranges. This has dramatically reduced support tickets from merchants who had silent failures in their webhook infrastructure.

What's next

We're working on a streaming webhook mode using Server-Sent Events (SSE) as an alternative to polling for high-frequency merchants. Early benchmarks show 3× lower latency for event-dense workflows like marketplace payment splits.

