The Night Our Growth Agent Broke: A Developer Story with OpenClaw

By Maya Chen

TL;DR

Our cart recovery agent melted down one Friday when a partner API throttled us and our retries amplified the failure. We rebuilt the workflow on OpenClaw with strict idempotency at every side effect, bounded backoff, a dead letter queue, and distributed tracing. The new design cut duplicate sends to near zero, turned replays into a routine task, and gave marketing the confidence to scale volume without fear. The biggest lesson was to treat every outbound action as a ledgered event with guardrails instead of hopeful best effort.

What broke and when

At 10:42 p.m., a Slack channel that rarely moves went red. Our cart recovery sequence, which usually lifts revenue on autopilot, had started firing duplicate discount codes and repeating emails to a subset of shoppers. The incident looked small at first. By 10:47 p.m., on-call realized we had a positive feedback loop: a retry path meant to be helpful was amplifying a throttling response from a partner API.

We had recently migrated messaging to a queue backed by a lightweight worker pool. The pool had a polite three-try policy, but two gaps lined up. First, webhook deliveries from a commerce platform sometimes arrived twice with the same payload during network hiccups. Second, our handler performed side effects before writing a completion mark. While partner calls took tens of milliseconds, the gap between the side effect and the mark was too short to matter. When latency stretched to seconds, that protection window vanished: a duplicate delivery or a retry could start before the first attempt had recorded completion. The result was a pileup of duplicate side effects.
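To make the failure mode concrete, here is a minimal Python model of that ordering. It is an illustration, not the production handler: the names `old_handler` and `simulate` are hypothetical, and a generator `yield` stands in for the slow partner call so the interleaving is deterministic.

```python
def old_handler(event_id, completed, sent):
    """Old ordering: check the completion mark, act, then write the mark."""
    if event_id in completed:
        return
    yield "calling partner"  # pause point that models partner latency
    sent.append(event_id)    # side effect (email or discount code)
    completed.add(event_id)  # completion mark is written last


def run_to_end(handler):
    """Drive a paused handler until it finishes."""
    for _ in handler:
        pass


def simulate(overlap):
    """Deliver the same event twice and return how many side effects ran."""
    completed, sent = set(), []
    first = old_handler("evt-1", completed, sent)
    duplicate = old_handler("evt-1", completed, sent)
    if overlap:
        next(first)      # first delivery passes the check, waits on the partner
        next(duplicate)  # duplicate passes the same check before any mark exists
    run_to_end(first)
    run_to_end(duplicate)
    return len(sent)
```

With a fast partner call the first delivery finishes before the duplicate starts, so only one side effect lands. With a slow one, both deliveries pass the check before either writes its mark, and the side effect runs twice.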

We paused the campaign and started containment. We added a temporary check on the discount creation endpoint to block repeats for the same customer and cart window. That bandage stopped the bleeding, but the bigger question remained. Why did our safeguards degrade under stress, and how could we rebuild them so a timing shift could not trigger a cascade again?

Framing the problem like an engineer, not a firefighter

The quickest fix would have been a bigger delay or an extra if statement. We chose a different path. We wrote down the failure in plain terms so that any engineer, marketer, or SRE could reason about it.

  • Duplicate webhook deliveries are normal during network issues.
  • External APIs throttle or time out in unpredictable bursts.
  • Side effects, once executed, must never be repeated for the same logical event.
  • Retries need to be bounded so they do not become a second failure mode.
  • Operators need fast visibility to pause, replay, or shift load.

Before we touched code, we aligned with the growth team on what mattered most: customer trust and consistent spend. That clarity gave us permission to invest in engineering changes that would not show up as features in a menu, but would pay off every late night afterward.

The rebuild plan

We took the playbook approach and wrote small, composable steps that we could ship over a week, not a quarter. Our goals were simple and measurable: no duplicate side effects, deterministic retries, replay without fear, and full observability.

Step 1: Define an event contract and dedup keys

We started by formalizing an event contract. Every incoming signal carried enough context to compute a deterministic deduplication key. For cart recovery, that key was a hash of customer ID, store ID, and a five-minute time bucket. We wrote the key to a fast store with a TTL slightly longer than the retry horizon. Handlers checked and set this key in one step before sending any email or creating any discount code. This turned best effort into a ledger and enabled safe replays.

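# playbook.yaml
name: cart-recovery
on: shopify.cart.abandoned
steps:
  # Derive a deterministic key from the event fields and a five-minute bucket.
  - id: compute-dedup
    run: utils.compute_dedup_key
    with:
      fields: [customer_id, store_id, timestamp]
      bucket_minutes: 5
  # Set the key only if it is absent; the TTL outlives the retry horizon.
  - id: check-and-set
    run: store.setnx
    with:
      # Template expressions are quoted so YAML reads them as strings.
      key: "{{ steps.compute-dedup.output.key }}"
      ttl_seconds: 2700
  # Send only when this run was the one that set the key.
  - id: maybe-send-email
    if: "{{ steps.check-and-set.output.set == true }}"
    run: messaging.send_template
    with:
      template: cart_recovery_v3
      to: "{{ event.email }}"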
Step 2: Add retries with backoff and a per-entity budget

We replaced the naive three-try policy with exponential backoff and full jitter, still capped at three attempts. The handler received a per-customer retry budget so the queue could not be monopolized by one noisy entity. We wrote the policy as data to avoid magic constants hiding in code, and we surfaced budget exhaustion as a metric. For anyone searching later, this is a good reference for how to design an AI agent retry strategy that protects throughput without starving valid work.

{
  "retry": {
    "strategy": "exponential",
    "base_ms": 500,
    "max_attempts": 3,
    "jitter": "full",
    "budget_per_contact": 5
  }
}
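
One way to read that policy is sketched below in Python. It is an interpretation, not the actual retry module: the function names are hypothetical, "full" jitter is taken to mean a uniform draw between zero and the exponential ceiling, and the budget is assumed to count retries per contact without a reset window, which the policy data does not specify.

```python
import random

POLICY = {
    "strategy": "exponential",
    "base_ms": 500,
    "max_attempts": 3,
    "jitter": "full",
    "budget_per_contact": 5,
}


def backoff_delay_ms(attempt, policy, rng):
    """Full jitter: uniform in [0, base_ms * 2**(attempt - 1)], attempt is 1-based."""
    ceiling = policy["base_ms"] * (2 ** (attempt - 1))
    return rng.uniform(0, ceiling)


class RetryBudget:
    """Counts retries spent per contact (assumed semantics, no reset window)."""

    def __init__(self, per_contact):
        self._per_contact = per_contact
        self._remaining = {}

    def try_spend(self, contact_id):
        left = self._remaining.get(contact_id, self._per_contact)
        if left <= 0:
            return False
        self._remaining[contact_id] = left - 1
        return True


def next_action(attempts_made, contact_id, policy, budget, rng):
    """Decide what to do after a failed attempt: retry with a delay or dead-letter."""
    if attempts_made >= policy["max_attempts"]:
        return ("dead_letter", "max_attempts")
    if not budget.try_spend(contact_id):
        return ("dead_letter", "budget_exhausted")
    return ("retry", backoff_delay_ms(attempts_made, policy, rng))
```

Passing the random generator in as an argument keeps the delay calculation testable, and returning the dead-letter reason gives the budget-exhaustion metric something to count.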

Step 3: Introduce a dead letter queue and a safe replay tool

Any job that exceeded attempts or failed with a non-retryable code moved to a dead letter queue. We built a small replay UI that pulled one job at a time, showed its dedup key and last error, and executed the handler with the same trace context. Because every side effect checked dedup first, replays were boring, which is exactly how operational tools should feel.
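
The replay path can be sketched roughly as follows. This is a simplified Python model, not the replay tool itself: `DeadLetter`, `DeadLetterQueue`, and `replay_one` are hypothetical names, the ledger is a plain set, and the sketch assumes that a raised exception means the side effect did not happen, so the claim on the dedup key can be released.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DeadLetter:
    job_id: str
    dedup_key: str
    last_error: str
    trace_id: str
    payload: dict = field(default_factory=dict)


class DeadLetterQueue:
    def __init__(self):
        self._jobs = deque()

    def put(self, job):
        self._jobs.append(job)

    def pop_one(self):
        return self._jobs.popleft() if self._jobs else None

    def __len__(self):
        return len(self._jobs)


def replay_one(dlq, ledger, side_effect) -> Optional[dict]:
    """Replay a single job; the dedup ledger is checked before any side effect."""
    job = dlq.pop_one()
    if job is None:
        return None
    result = {"job_id": job.job_id, "trace_id": job.trace_id}
    if job.dedup_key in ledger:
        result["result"] = "noop"  # side effect already recorded, nothing to do
        return result
    ledger.add(job.dedup_key)  # claim the key before acting
    try:
        side_effect(job.payload)
    except Exception as exc:
        ledger.discard(job.dedup_key)  # assumes the side effect did not happen
        job.last_error = repr(exc)
        dlq.put(job)  # keep the job, with fresh error context, for a later replay
        result["result"] = "failed"
        return result
    result["result"] = "applied"
    return result
```

Carrying the original trace ID in every result is what lets an operator click from a replayed job back to the trace that recorded the first failure.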

We later wrote up the mechanics in more detail in our guide to idempotency, retries, and DLQs. That article goes deeper into patterns for retryable versus terminal failures and how to pick TTLs that match your business windows.

Step 4: Trace across boundaries with OpenTelemetry

We propagated a traceparent through every hop: webhook ingestion, enqueue, handler, and external API calls. A single trace answered three questions quickly. Did we receive the event more than once? Did the handler run more than once? Did the external partner respond or time out? We also added span attributes for customer ID and campaign ID to make filtering easy in the UI.
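
To show what travels between hops, here is a small hand-rolled Python sketch of the W3C `traceparent` header format (version, 32-hex trace ID, 16-hex parent span ID, flags). It is for illustration only: a production service would normally rely on the OpenTelemetry SDK's propagators rather than code like this, and the helper names are hypothetical.

```python
import re
import secrets

# version - trace-id (32 hex) - parent-id (16 hex) - trace-flags (2 hex)
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_traceparent(sampled=True):
    """Start a new trace at the first hop (webhook ingestion)."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header):
    match = TRACEPARENT_RE.match(header)
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, span_id, flags = match.groups()
    return {
        "version": version,
        "trace_id": trace_id,
        "span_id": span_id,
        "flags": flags,
    }


def child_traceparent(parent):
    """Keep the trace ID and flags, mint a new span ID for the next hop."""
    parts = parse_traceparent(parent)
    return (
        f"{parts['version']}-{parts['trace_id']}-"
        f"{secrets.token_hex(8)}-{parts['flags']}"
    )


def trace_hops(hop_names):
    """Return {hop: traceparent} where every hop shares one trace ID."""
    current = new_traceparent()
    hops = {}
    for name in hop_names:
        hops[name] = current
        current = child_traceparent(current)
    return hops
```

Because the trace ID never changes while each hop gets its own span ID, a second webhook delivery shows up as a separate trace with the same business attributes, and a second handler run shows up as an extra span inside one trace.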

This paid off the same night. During validation, a developer clicked from an alert into a trace and noticed a second webhook delivery with the same payload arriving 80 seconds later. In the old world, that would have been shrugged off as a logging blip. With trace correlation, it was a clear signal and a new test case.

Step 5: Add guardrails and policy checks before action

We added a policy step before any irreversible side effect. The policy engine evaluated consent, contact method limits, and daily frequency caps. If any condition failed, the agent wrote a structured no-op result and exited cleanly. This change aligned the system with how marketers think about frequency and preference centers instead of leaving those concerns implicit in code.
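
A policy gate of this kind might look like the Python sketch below. The field names, limit values, and reason codes are all hypothetical; the point is only the shape: every check returns a structured result, and a denial is a clean no-op rather than an exception.

```python
from dataclasses import dataclass, field


@dataclass
class ContactState:
    consented_channels: set
    sends_today: dict = field(default_factory=dict)  # channel -> count


# Hypothetical limits; real values would come from the preference center.
LIMITS = {"daily_total": 3, "per_channel": {"email": 2, "sms": 1}}


def deny(reason):
    return {"allowed": False, "result": "noop", "reason": reason}


def evaluate_policy(contact, channel, limits):
    """Run before any irreversible side effect; return a structured decision."""
    if channel not in contact.consented_channels:
        return deny("no_consent")
    channel_cap = limits["per_channel"].get(channel, 0)
    if contact.sends_today.get(channel, 0) >= channel_cap:
        return deny("channel_limit")
    if sum(contact.sends_today.values()) >= limits["daily_total"]:
        return deny("daily_cap")
    return {"allowed": True, "result": "proceed", "reason": None}
```

Recording the reason alongside the no-op is what lets a marketer see why a message was skipped without reading handler code.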

Step 6: Keep a human loop for hard edges

Automation does not mean removing judgment. For promos that were especially sensitive, we added a Slack approval step that queued a message for a human to approve in bulk. We had previously built a similar flow and documented it internally. Borrowing that pattern gave us a fast, familiar safety valve for marketing during unusual spikes or holidays.

Architecture we shipped

We resisted the temptation to introduce exotic components. The system that replaced the broken agent was intentionally boring.

  • An HTTPS ingestion endpoint validated signatures and normalized events.
  • A queue accepted jobs with the dedup key as metadata.
  • A worker executed a handler that first set a dedup key, then performed side effects.
  • A retry module made decisions from policy data, not hardcoded waits.
  • A dead letter queue stored jobs with structured error context for replay.
  • OpenTelemetry tracing connected every hop into one timeline.

The details that look fussy in code reviews were the ones that removed 2 a.m. surprises. The handler never performs a side effect before the dedup key is set. Retries are bounded, not infinite. Results are structured, not print statements. Each decision is visible in metrics and traces.

Operational playbook after the fix

Shipping code was only half the work. The other half was writing down how to run it while the coffee is still brewing.

  • Alerts watch error rates, dedup misses, and retry budgets per entity.
  • Dashboards show a map of jobs by state: queued, running, success, noop, retrying, dead letter.
  • The runbook lists common failure signatures and the first checks, like signature mismatch or partner timeouts.
  • Replay is a controlled button with guardrails. Operators can replay one job at a time or schedule a bounded batch.
  • The growth team has a simple toggle for safe mode that switches templates and caps frequency temporarily.

If you need more depth on instrumentation, the agent analytics walkthrough covers trace propagation, useful span attributes, and alert thresholds that avoid pager fatigue.

Results after 30 days

We promised the team two outcomes and measured both. First, we wanted duplicate side effects to round down to zero. Second, we wanted replays to be routine, not heroic.

  • Duplicate discount issuance fell from a peak of 1.8 percent during the incident to 0.03 percent in the first full week, then settled at 0.01 percent by day 30.
  • Mean time to first signal on partner API issues dropped from 14 minutes to 90 seconds because traces and metrics were aligned.
  • Successful replay rate within 24 hours moved from 62 percent to 98 percent because handlers became idempotent and DLQ context was rich.
  • The growth team increased send volume by 27 percent without new complaints or unsubscribes related to frequency limits.

These are not vanity numbers. They describe a system that fails predictably and recovers without guessing. That is what marketers need when budgets are real and moments are perishable.

Lessons I would not skip again

You can build a marketing agent that looks fine during demos but falls apart when the internet reminds you it is the internet. Here are the lessons I would carry into any new project.

  • Make idempotency an interface, not a comment. Handlers should implement a check and set pattern before side effects.
  • Write retry policy as data and expose it to operators. If someone has to recompile to change a wait time, you did not finish the job.
  • Route terminal failures to a durable queue with enough context for a human to decide next steps. A dead letter without clues is a dead end.
  • Trace everything that crosses a network boundary and attach business identifiers to spans. Engineers debug systems; marketers debug journeys.
  • Keep a human loop for high risk actions so your automation can pause without feeling like defeat.

Backpressure at integration boundaries

We focused on our code, but the bigger world matters too. When partner APIs slow down, the right response is graceful backpressure, not louder knocking. We set hard concurrency limits per integration and tested with injected 429 responses. This avoided the trap where a retry storm looks like traffic growth and leads to stricter throttles.

Two practices helped here. First, we used a small leaky bucket per integration to smooth bursts without hiding real load. Second, we logged per-partner latency percentiles so we could see the difference between our own slowness and theirs. That simple chart saved us from a misleading optimization when a partner's P99 temporarily doubled.
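
As a rough illustration of the first practice, here is a leaky bucket used as a meter, written in Python. It is a sketch under assumptions, not the limiter we run: the class name, capacity, and leak rate are hypothetical, and the clock is injectable so the behavior can be checked without sleeping.

```python
import time


class LeakyBucket:
    """Admit a call only if the bucket has room; the level drains over time."""

    def __init__(self, capacity, leak_per_second, clock=time.monotonic):
        self._capacity = capacity
        self._leak_per_second = leak_per_second
        self._clock = clock
        self._level = 0.0
        self._last = clock()

    def try_acquire(self):
        now = self._clock()
        # Drain whatever leaked out since the last call.
        drained = (now - self._last) * self._leak_per_second
        self._level = max(0.0, self._level - drained)
        self._last = now
        if self._level + 1 > self._capacity:
            return False  # caller should back off instead of knocking louder
        self._level += 1
        return True


# One bucket per integration keeps a slow partner from affecting the others.
buckets = {
    "partner_a": LeakyBucket(capacity=10, leak_per_second=5.0),
    "partner_b": LeakyBucket(capacity=4, leak_per_second=1.0),
}
```

A rejected `try_acquire` is a signal to re-enqueue with a delay, which keeps the pressure inside our own queue instead of pushing it onto a partner that is already throttling.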

What this means for the team

Engineering now speaks the same language as marketing when incidents happen. When the growth lead asks why a sequence paused, we can point to a policy decision, a dedup hit, or a retry budget. That clarity shortens conversations and builds trust. It also frees up time to build new journeys instead of cleaning up old ones.

We also felt the cultural shift. The on-call rotation became calmer. The incident timeline looked like a short story instead of a novel. Most importantly, we reclaimed late nights for sleep and weekends for living.

Here is the quiet win. By solving for reliability first, we also made it easier to add new channels. SMS and push notifications slotted into the same handler pattern with different side effects. That made the next quarter's roadmap feel possible, not aspirational.

The rebuilt agent now fits the boring and proud category. It does not surprise us, and it does not chase trends. It puts customer experience and operator sanity first. That is a good trade for any team that cares about durable growth.

ButterGrow already powered the team's campaigns before this incident, and leaning into its AI marketing automation features made the rebuild faster. If you are starting from scratch or modernizing something brittle, you do not have to reinvent these safety rails.

Our brand and our customers benefited from choices that seem unglamorous. Idempotent marketing webhooks at scale, stable retry policies, and a DLQ that encourages safe replays are the backbone of a dependable growth program.

Adopting these ideas helped us answer a long tail query we hear from new teammates all the time: how to make idempotent marketing webhooks without turning the codebase into a maze.

If you manage similar systems, I hope this story gives you a blueprint you can adapt to your own stack and team habits.

Building and operating these capabilities is why we use ButterGrow for production campaigns. It is the hosted OpenClaw platform our marketers know, and it integrates cleanly with the engineering patterns described here.

As you sketch your own roadmap, keep three questions near your design doc. What is the dedup key for each side effect? When do you stop retrying? How will someone on call replay safely while sipping coffee, not chugging it?

Taking time to answer those questions up front will pay off every time traffic spikes or a partner hiccups. It is also the difference between a demo and a dependable system.

You can also find more depth on related building blocks in our write ups. The agent analytics walkthrough covers tracing and metrics. The playbook on idempotency, retries, and DLQs dives into handler patterns and safe replays.

Finally, if your team is ready to modernize, you can evaluate features and patterns side by side before writing code. The overview of features is a good place to see the building blocks in one place. For adjacent topics and case studies, explore more from the ButterGrow blog.

To wrap the story, we started with a scary chart and ended with a small, sturdy system. That is progress you can feel on a Friday night.

The next time someone asks how to design an AI agent retry strategy that does not wake up the neighborhood, you will have an answer and a runbook.

ButterGrow has a straightforward onboarding path, so if you want to take the patterns from this post and put them to work, you can get started in minutes with a trial workspace and a sample playbook.

Frequently Asked Questions

How did idempotency keys stop duplicate discounts in the cart recovery workflow?

We generated a deterministic idempotency key from the customer ID, store ID, and a five-minute time bucket, then wrote it to a fast key store with a TTL. Handlers checked and set the key before performing side effects like coupon issuance or email sends. If the key already existed, the handler returned immediately, which cut duplicate discounts to near zero during retries or webhook replays.

What retry policy worked best for flaky partner APIs without spamming customers?

We used exponential backoff with jitter and a retry budget per contact. The policy allowed up to 3 attempts at growing intervals, then routed the message to a dead letter queue for review. We also added a circuit breaker that switched the agent to a safe mode when error rates crossed a threshold, which limited the blast radius during partner outages.
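
A breaker like that can be sketched in a few lines of Python. The window size, threshold, and cooldown below are hypothetical values, and the class is an illustration of the idea rather than the breaker we deployed.

```python
import time
from collections import deque


class SafeModeBreaker:
    """Switch to safe mode when the recent error rate crosses a threshold."""

    def __init__(self, window=20, min_calls=10, error_threshold=0.5,
                 cooldown_seconds=60.0, clock=time.monotonic):
        self._outcomes = deque(maxlen=window)  # True = success, False = error
        self._min_calls = min_calls
        self._threshold = error_threshold
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._opened_at = None

    def record(self, ok):
        """Record one partner call outcome and trip the breaker if needed."""
        self._outcomes.append(ok)
        if self._opened_at is None and len(self._outcomes) >= self._min_calls:
            error_rate = self._outcomes.count(False) / len(self._outcomes)
            if error_rate >= self._threshold:
                self._opened_at = self._clock()

    def mode(self):
        """Return "safe" while tripped, "normal" otherwise or after cooldown."""
        if self._opened_at is None:
            return "normal"
        if self._clock() - self._opened_at >= self._cooldown:
            self._opened_at = None
            self._outcomes.clear()  # start fresh after the cooldown
            return "normal"
        return "safe"
```

The `min_calls` floor matters: without it, a single failed call on a quiet night would be a 100 percent error rate and trip the breaker.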

How did you trace a customer journey across multiple services for debugging?

We propagated a traceparent through every hop using OpenTelemetry and stored the trace ID alongside our event IDs. This made it possible to click from an alert to a single distributed trace that showed webhook receive, queue enqueue, handler execution, and external API calls. The same trace ID was used in agent analytics dashboards for one-click correlation.

How do you safely replay failed jobs without re-sending messages or coupons?

We stored a minimal event ledger with deduplication keys for each side effect and enforced idempotency in handlers. Replays read from the dead letter queue, then executed handlers that checked the dedup store before acting. If a side effect had already occurred, the handler wrote a no-op result and advanced the state, which made replays safe.

What metrics became your north stars after the incident?

We committed to three metrics: time to first signal on failure, percentage of idempotency-protected side effects, and successful replay rate within 24 hours. These aligned engineering work with marketing outcomes, since quicker detection, better dedup coverage, and reliable replays directly reduced customer impact.

How would you design an AI agent retry strategy for high-volume ecommerce events?

Start with bounded retries using exponential backoff and jitter, add a per-entity budget so one customer cannot starve others, and implement a dead letter queue with a replay tool. Combine this with idempotent handlers and clear alerting on thresholds. This design limits noise while preserving delivery for legitimate spikes.
