TL;DR
Our webhook-based lead intake toppled during a partner outage that sent duplicate posts and slow responses. We rebuilt routing on OpenClaw with a queue-first design, idempotent writes, and a small agent that performed validation and enrichment away from the edge. The new pipeline cut duplicate records to essentially zero, brought p95 from seven seconds to two, and turned replays into a one-click operation. If your growth stack still pushes directly from webhooks into a CRM, move to a buffered pattern before the next traffic spike.
The outage that forced a rebuild
It started on a Tuesday afternoon when a form vendor rolled out a change that turned temporary 500s into client retries. Our intake endpoint was configured to wait for downstream confirmations before returning 200, which meant any slow CRM write caused the vendor to retry, which caused us to hold the connection longer, which led to more retries. Within minutes we saw queueing at the edge, a backlog of in-flight requests, and a rising duplicate rate in the CRM.
We paused campaigns, took a snapshot of the current backlog, and switched the form to a static thank you page while we stabilized. The fix that night was tactical: return a 202 as soon as we validated the signature and stored the payload, then push the event to a lightweight queue. The next morning we made the call to rebuild routing properly instead of patching a fragile chain.
If you have never seen this failure mode, picture fifty tabs of a form submitter smashing the submit button while a partner flips between 200 and 500 responses. A direct integration turns into a feedback loop. A buffered pipeline turns that into a brief spike that drains safely.
What we started with
Our original flow looked fine on a diagram but failed under real conditions. It was a straight line: webhook intake, validation, transform, CRM write, and a synchronous response. We also did naive deduplication in the database by checking for an email in the last hour, which suppressed some duplicates but created false positives whenever users had shared inboxes.
We tracked three metrics that mattered most:
- p95 latency from intake to CRM write hovered between 4 and 6 seconds on normal days
- Duplicate creation landed around 3 percent, higher during paid bursts
- Recovery from a partner spike required manual intervention and often a batch delete
None of those met the reliability bar we wanted for paid acquisition. We needed a design that was boring, observable, and resilient to predictable chaos.
The redesign: a resilient routing service
We moved to a pattern that is common in event-driven systems and works especially well for resilient lead routing in a marketing stack. The new design introduced a managed FIFO queue, a dead letter queue, and a small worker that handled validation, enrichment, and fan-out to the CRM and analytics sinks. We also adopted idempotent writes across every outbound call and simplified the shape of our events.
Four architectural choices drove most of the improvement:
- Immediate acknowledge. The intake edge verifies the HMAC signature, enforces rate limits per sender, drops any PII that we do not need, and then pushes the event into the queue. It returns 202 within 80 to 120 milliseconds.
- At-least-once delivery with idempotent writes. We assume retries will happen and we design handlers so they can safely run twice.
- Poison pill isolation. Bad events go to the DLQ after bounded retries with backoff and jitter so they do not block healthy traffic.
- Replay as a first-class feature. A one-button replay reads from the DLQ after a fix and publishes back into the main queue.
We kept the agents small and focused. The worker validated fields, performed light enrichment, and called the CRM and analytics endpoints. It also emitted structured logs and metrics so SREs could see each stage.
Here is a simplified view of the handler contract we used for idempotent webhook processing:
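// TypeScript sketch. The declared helpers stand in for internal services;
// their exact shapes are illustrative assumptions.
import { createHash } from "node:crypto";

type LeadEvent = {
  source: string;
  sourceEventId: string;
  receivedAt: string; // RFC3339
  payload: Record<string, unknown>;
};

declare const cache: {
  exists(key: string): Promise<boolean>;
  set(key: string, value: unknown, opts: { ttl: number }): Promise<void>;
};
declare const crm: {
  createOrUpdateLead(
    lead: { email: string; firstName: string; city: string },
    opts: { idempotencyKey: string },
  ): Promise<{ id: string }>;
};
declare function geoIp(ip: string): Promise<string>;
declare function validateSchema(payload: Record<string, unknown>): void;

const sha256 = (s: string): string => createHash("sha256").update(s).digest("hex");

function idempotencyKey(e: LeadEvent): string {
  // Stable tuple: vendor id, event id, normalized email hash
  const email = String(e.payload.email ?? "").trim().toLowerCase();
  return `${e.source}:${e.sourceEventId}:${sha256(email)}`;
}

async function handle(event: LeadEvent) {
  const key = idempotencyKey(event);
  // Short-circuit events we have already written
  if (await cache.exists(key)) return { status: "duplicate" };

  // validate (expected to throw on a bad payload)
  validateSchema(event.payload);

  // enrich
  const city = await geoIp(String(event.payload.ip ?? ""));

  // write to CRM with idempotency key
  const crmRes = await crm.createOrUpdateLead(
    {
      email: String(event.payload.email),
      firstName: String(event.payload.firstName ?? ""),
      city,
    },
    { idempotencyKey: key },
  );

  // Remember the outcome for 24 hours (assuming ttl is in seconds)
  await cache.set(key, { crmId: crmRes.id }, { ttl: 24 * 3600 });
  return { status: "ok", crmId: crmRes.id };
}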
Implementation notes
Step 1: Stabilize intake at the edge
We introduced a small intake service that validated signatures, normalized headers, and wrote to the queue. The critical change was returning a 202 as soon as the write succeeded rather than waiting for the CRM. That single change removed synchronous coupling and prevented partners from retrying during slowdowns.
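The edge contract can be sketched in a few lines. This is an illustrative sketch, not the production service: it assumes the vendor signs the raw body with HMAC-SHA256 and sends a hex digest, and the `enqueue` callback, the 401 and 503 responses, and the function names are placeholders.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

type IntakeResult = { status: 202 | 401 | 503 };

// Compare the sender's hex signature against our own HMAC of the raw body.
function verifySignature(rawBody: string, signatureHex: string, secret: string): boolean {
  const expected = createHmac("sha256", secret).update(rawBody).digest();
  const given = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on a length mismatch, so check the length first.
  return given.length === expected.length && timingSafeEqual(given, expected);
}

// Acknowledge as soon as the payload is stored; never wait on the CRM here.
function intake(
  rawBody: string,
  signatureHex: string,
  secret: string,
  enqueue: (body: string) => boolean, // hypothetical queue writer, true on success
): IntakeResult {
  if (!verifySignature(rawBody, signatureHex, secret)) return { status: 401 };
  return enqueue(rawBody) ? { status: 202 } : { status: 503 };
}
```

The only work on the request path is the signature check and the queue write, which is what keeps the acknowledgement fast.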
We also used structured logging from day one and masked sensitive fields. That made debugging easier during replays and kept us aligned with our privacy posture.
Step 2: Make handlers safe to run twice
We treated every downstream call as at-least-once. The fix was to generate a deterministic key and use it as a write guard in our storage and in any third party that supported it. For systems without native idempotency, we wrote a thin wrapper that upserted based on the key and emitted a no-op if the record already existed.
The idea is simple but worth repeating. You cannot control when a partner retries or when the network flakes. You can control whether your system produces the same outcome if a message arrives twice.
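A thin wrapper of that kind might look like the sketch below. The class name and the in-memory `Map` are assumptions for illustration; a real deployment would need a durable store with a unique constraint on the key so the guard survives restarts and concurrent workers.

```typescript
type WriteResult<T> = { status: "written" | "noop"; record: T };

// Wraps a non-idempotent write so that a repeated key returns the first outcome.
class IdempotentWriter<T> {
  private readonly seen = new Map<string, T>();
  private readonly write: (input: T) => T;

  constructor(write: (input: T) => T) {
    this.write = write;
  }

  upsert(key: string, input: T): WriteResult<T> {
    const existing = this.seen.get(key);
    if (existing !== undefined) return { status: "noop", record: existing };
    const record = this.write(input);
    this.seen.set(key, record);
    return { status: "written", record };
  }
}
```

Calling `upsert` twice with the same key performs one downstream write and reports the second call as a no-op.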
Step 3: Add a dead letter queue and bounded retries
We configured the queue with a retry policy that used exponential backoff with jitter. Retrying hot without jitter creates thundering herds after brief outages. Jitter spreads retries over a window so downstreams breathe. After three attempts we moved the message to a dead letter queue. Operators got a clear view into what was failing without harming the p95 for healthy traffic.
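For readers who have not implemented this before, here is a rough sketch of bounded retries with jitter. It uses the "full jitter" variant, and the base delay, cap, and function names are assumptions; a managed queue would normally apply the policy for you. To stay self-contained, the loop records the delays it would wait instead of sleeping.

```typescript
const MAX_ATTEMPTS = 3; // after three attempts the message goes to the DLQ

// Full jitter: pick a random delay in [0, min(cap, base * 2^attempt)).
function backoffMs(
  attempt: number,
  baseMs = 200,
  capMs = 10_000,
  rand: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(rand() * ceiling);
}

type Outcome = { status: "ok" | "dead-lettered"; attempts: number; delays: number[] };

// Runs the handler up to MAX_ATTEMPTS times, noting the delay before each retry.
function processWithRetries(handler: () => void, rand: () => number = Math.random): Outcome {
  const delays: number[] = [];
  for (let attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
    try {
      handler();
      return { status: "ok", attempts: attempt + 1, delays };
    } catch {
      // No delay is needed after the final failed attempt.
      if (attempt < MAX_ATTEMPTS - 1) delays.push(backoffMs(attempt, 200, 10_000, rand));
    }
  }
  return { status: "dead-lettered", attempts: MAX_ATTEMPTS, delays };
}
```

Because each delay is drawn at random below a growing ceiling, workers that failed at the same moment do not all retry at the same moment.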
Step 4: Build a replay that operators trust
We made replay a first-class button in our console. It shows the reason the message landed in the DLQ, the last error, and a masked preview of the payload. Once the root cause is fixed, an operator can replay a single message or a batch. Replays are tagged so we can identify them in metrics during post-incident reviews.
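The core of a replay is small. The sketch below models both queues as arrays, and the type and field names (`replay`, `replayReason`) are hypothetical; it only illustrates moving selected messages back and tagging them.

```typescript
type DeadLetter = {
  id: string;
  reason: string;
  lastError: string;
  payload: Record<string, unknown>;
};
type QueueMessage = {
  id: string;
  payload: Record<string, unknown>;
  replay: boolean;
  replayReason?: string;
};

// Moves the selected messages from the DLQ back to the main queue, tagged as replays.
function replayFromDlq(
  dlq: DeadLetter[],
  mainQueue: QueueMessage[],
  ids: string[],
  replayReason: string,
): number {
  const wanted = new Set(ids);
  const selected = dlq.filter((m) => wanted.has(m.id));
  for (const m of selected) {
    mainQueue.push({ id: m.id, payload: m.payload, replay: true, replayReason });
  }
  // Keep everything that was not selected in the DLQ.
  const remaining = dlq.filter((m) => !wanted.has(m.id));
  dlq.length = 0;
  dlq.push(...remaining);
  return selected.length;
}
```

A replayed message keeps its original id, so idempotent handlers treat it like any other redelivery.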
Step 5: Keep payloads small and schemas explicit
We trimmed payloads to the fields the CRM actually needed. Smaller messages mean faster processing and lower storage. We versioned the schema and rejected messages that did not match. Clear contracts beat soft contracts when teams grow and sources multiply.
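As an illustration of an explicit, versioned contract, the sketch below rejects unknown versions and copies only known fields. The `schemaVersion` field and the `LeadV1` shape are assumptions made for the example.

```typescript
type LeadV1 = { schemaVersion: 1; email: string; firstName: string };
type ParseResult = { ok: true; lead: LeadV1 } | { ok: false; error: string };

// Accepts only schema version 1 and drops any field the CRM does not need.
function parseLead(input: unknown): ParseResult {
  if (typeof input !== "object" || input === null) {
    return { ok: false, error: "not an object" };
  }
  const { schemaVersion, email, firstName } = input as Record<string, unknown>;
  if (schemaVersion !== 1) {
    return { ok: false, error: `unsupported schemaVersion: ${String(schemaVersion)}` };
  }
  if (typeof email !== "string" || !email.includes("@")) {
    return { ok: false, error: "email missing or malformed" };
  }
  if (typeof firstName !== "string") {
    return { ok: false, error: "firstName missing" };
  }
  // Copy only the known fields so extra data never travels downstream.
  return { ok: true, lead: { schemaVersion: 1, email, firstName } };
}
```

Returning a reason string on rejection gives the DLQ something useful to group by.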
Step 6: Observe everything
We shipped counters and histograms for every stage. You cannot improve what you do not see. The dashboards tracked intake rate, queue depth, handler duration, success rate, duplicate short-circuit rate, and DLQ placements per reason. We also added an on-call runbook link next to every alert in our console so the responder could jump straight to the right play.
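To make "counters and histograms" concrete, here is a toy in-memory recorder. It is a sketch only: a real stack would use a metrics client library, and the class and metric names here are hypothetical.

```typescript
// Toy recorder for labeled counters and per-stage duration samples.
class StageMetrics {
  private readonly counters = new Map<string, number>();
  private readonly durations = new Map<string, number[]>();

  // Increment a counter such as "dlq_placements", labeled by failure reason.
  inc(name: string, label = ""): void {
    const key = `${name}|${label}`;
    this.counters.set(key, (this.counters.get(key) ?? 0) + 1);
  }

  count(name: string, label = ""): number {
    return this.counters.get(`${name}|${label}`) ?? 0;
  }

  // Record how long one stage took for one message.
  observe(stage: string, ms: number): void {
    const samples = this.durations.get(stage) ?? [];
    samples.push(ms);
    this.durations.set(stage, samples);
  }

  // Cumulative histogram: samples at or below each upper bound.
  histogram(stage: string, upperBoundsMs: number[]): number[] {
    const samples = this.durations.get(stage) ?? [];
    return upperBoundsMs.map((bound) => samples.filter((ms) => ms <= bound).length);
  }
}
```

Labeling DLQ placements by reason is what makes a "placements per reason" panel possible.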
If you want to see how this fits into the larger product, the AI marketing automation features page explains what ButterGrow does at a high level. We used the same building blocks that power customers in production. You can skim the feature set overview and then map the pieces to an agent or a worker in your own stack. We also documented similar reliability patterns in a late-night incident postmortem about how our growth agent failed in production, which provides additional context.
Results after the cutover
Here is how the numbers looked after forty-eight hours in production and again after thirty days.
| Metric | Before | After 48 hours | After 30 days |
|---|---|---|---|
| p95 intake to CRM | 7.2 s | 2.1 s | 1.9 s |
| Duplicate leads per 10k | 83 | 1 | 0 to 2 |
| DLQ placement rate | n.a. | 0.7 percent | 0.3 percent |
| Operator interventions per week | 3 | 0 | 0 to 1 |
The most surprising improvement was psychological. Once the queue drained predictably and the replay button worked, people slept better. Sales stopped worrying about missing leads during campaigns, and marketing got clearer attribution because duplicates stopped skewing reports.
What went wrong and how we fixed it
No rebuild is clean. Three problems took longer than expected.
- Bad phone number validation blocked too many messages. We used an overly strict library that failed for valid international numbers. We patched the library and moved the strictness to a later enrichment step.
- A legacy integration did not support idempotent keys. We wrote an upsert endpoint in front of it and cached outcomes locally. It added two hundred milliseconds but removed duplicates.
- A noisy partner hammered the intake URL during their own tests. We switched them to a signed sender token and set per sender rate limits. That removed the noise and gave us clear capacity planning.
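Per-sender rate limiting can be implemented in several ways. The sketch below uses a token bucket, which is an assumption for illustration; the class name, capacity, and refill rate are hypothetical.

```typescript
type Bucket = { tokens: number; updatedAt: number };

// Token bucket per sender: each request spends one token, and tokens refill over time.
class SenderRateLimiter {
  private readonly buckets = new Map<string, Bucket>();
  private readonly capacity: number;
  private readonly refillPerSec: number;

  constructor(capacity: number, refillPerSec: number) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
  }

  // Returns true if the sender may proceed at time nowMs.
  allow(sender: string, nowMs: number): boolean {
    const b = this.buckets.get(sender) ?? { tokens: this.capacity, updatedAt: nowMs };
    const elapsedSec = (nowMs - b.updatedAt) / 1000;
    const tokens = Math.min(this.capacity, b.tokens + elapsedSec * this.refillPerSec);
    const allowed = tokens >= 1;
    this.buckets.set(sender, { tokens: allowed ? tokens - 1 : tokens, updatedAt: nowMs });
    return allowed;
  }
}
```

Keying the bucket on the sender means one noisy partner exhausts only its own budget.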
Runbook snippets we actually use
We included small scripts in our on-call runbook so responders can verify or repair without hunting.
# Show top DLQ reasons in the last hour
dlq_tail | jq -r '.reason' | sort | uniq -c | sort -nr | head -n 5
# Replay a single message by id
replay_msg --id "$1" --reason "fixed schema for phone number validation"
# Estimate p95 from recent events
latency_histogram --window 15m --percentile 95
These are thin wrappers around our console and metrics APIs. The point is not to be clever but to make the right action the easy one at three in the morning.
Checklist to repeat this build
If you are starting from webhooks that write directly to a sink, here is a short plan that maps to the steps above.
- Move intake to an authenticated endpoint that validates signatures.
- Push to a managed FIFO queue and immediately return 202.
- Implement deterministic idempotency keys and short circuit duplicates.
- Configure exponential backoff with jitter and a dead letter queue.
- Build a replay tool that operators can run without code changes.
- Emit structured logs, counters, and histograms from each stage.
- Keep payloads small and schemas versioned.
If you want product context, you can scan what ButterGrow does and the AI marketing automation features that map to this pattern. For more war-room stories, you can browse the ButterGrow blog, where we keep our incident notes and developer write-ups.
Finally, if this approach feels close to what you need for a queue-based marketing automation architecture, bookmark two specialty topics for later deep dives: agent workflow design and replay safety. Both pay dividends as teams grow.
ButterGrow is a hosted assistant built on an agent framework that takes these patterns from the whiteboard into production. If you want to try this with your own forms, you can get started in minutes with the [onboarding flow](/#getting-started) and wire up your first intake route in a single afternoon.
References
- Stripe docs on idempotency keys: background on designing idempotent writes for external APIs.
- Amazon SQS dead letter queues: reference for isolating poison messages during retries.
- Exponential backoff and jitter guidance: why jitter prevents thundering herds during recovery.
Frequently Asked Questions
How did you enforce idempotency for duplicate lead events in the router?
We generated a deterministic idempotency key from a stable tuple of fields: the source, the source event id, and a normalized email hash. We wrote every successful mutation behind a unique key in a fast store and short-circuited repeats. This cut duplicate lead creation to effectively zero without adding noticeable latency.
What service level objective did you set for lead routing and how was it measured?
We targeted a 2-second p95 from intake to CRM write. We measured end to end by stamping a server-side received_at and a persisted_at on the final sink and emitting histograms to our metrics stack. This kept us honest during incident reviews and weekly regressions.
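As a worked example of that measurement, the sketch below derives latency from the two stamps and computes a nearest-rank percentile. The `Stamped` type and function names are assumptions, and production metrics stacks usually estimate percentiles from histogram buckets instead of raw samples.

```typescript
type Stamped = { received_at: string; persisted_at: string }; // RFC3339 timestamps

// End-to-end latency for one event, in milliseconds.
function latencyMs(e: Stamped): number {
  return Date.parse(e.persisted_at) - Date.parse(e.received_at);
}

// Nearest-rank percentile: the smallest sample with at least p percent of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}
```

With twenty samples, the p95 is the nineteenth smallest value, so a single slow outlier does not move it.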
Which queue technology and patterns backed the new pipeline?
We used a managed FIFO queue with a dedicated dead letter queue and a retry policy that applied exponential backoff with jitter. We treated every handler as at-least-once and relied on idempotent writes to keep state consistent even under bursts.
How did you handle webhook flapping from third-party forms and ad platforms?
We moved intake to a signed ingestion endpoint and immediately acknowledged with a 202 status after pushing the payload into the queue. We validated signatures, rate limited per sender, and decoupled processing so retries from partners did not amplify load on our CRM.
How did you test failure scenarios before going live?
We wrote chaos scripts that injected timeouts, 429s, and malformed payloads, then verified DLQ placement and alerting. We also replayed a one-week export of historical events through a non-production environment to confirm idempotency and ordering.
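A fault injector for that kind of test can be very small. The sketch below is hypothetical (the `Fault` names, the `Sink` type, and the scripted order are all assumptions): it wraps a downstream call and misbehaves on a fixed schedule so a test can assert how the pipeline reacts.

```typescript
type Fault = "timeout" | "http429" | "malformed" | "none";
type Sink = (payload: string) => { status: number };

// Wraps a sink so that the scripted fault fires on each successive call.
function withFaults(sink: Sink, script: Fault[]): Sink {
  let call = 0;
  return (payload) => {
    const fault = script[call++] ?? "none";
    if (fault === "timeout") throw new Error("injected timeout");
    if (fault === "http429") return { status: 429 };
    if (fault === "malformed") {
      // Truncate the payload so the downstream parser sees broken input.
      return sink(payload.slice(0, Math.floor(payload.length / 2)));
    }
    return sink(payload);
  };
}
```

Once the script is exhausted the wrapper passes calls through unchanged, which lets a test confirm that recovery works after the faults stop.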
How do you prevent PII from leaking into logs during debugging and replays?
We scrubbed or hashed emails and phone numbers at the logging layer and enforced structured logging fields. DLQ payloads were encrypted at rest, and we only surfaced 5-to-10-line previews with sensitive fields masked in the console.
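A masking step at the logging layer might look like the following sketch. The field list, the truncated SHA-256 prefix, and the function name are assumptions for illustration.

```typescript
import { createHash } from "node:crypto";

// Field names treated as sensitive in this sketch.
const SENSITIVE_FIELDS = new Set(["email", "phone"]);

// Returns a copy that is safe to log: sensitive strings become short hashes.
function maskForLog(fields: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(fields)) {
    if (SENSITIVE_FIELDS.has(key) && typeof value === "string") {
      const digest = createHash("sha256").update(value.trim().toLowerCase()).digest("hex");
      out[key] = `sha256:${digest.slice(0, 12)}`;
    } else {
      out[key] = value;
    }
  }
  return out;
}
```

Hashing the normalized value keeps log lines for the same lead correlatable without exposing the address itself.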