Platform Updates | 11 min read

Observability 360: Tracing, Live Replay, and SLAs for workflow automation

By ButterGrow Team

TL;DR

Observability 360 is a major update to our reliability stack that adds OpenTelemetry-based tracing, structured JSON logs, live session replay for UI-driven agents, and native SLI and SLO monitors. The single biggest takeaway is that you can reproduce any failed customer journey end to end, then fix it in minutes instead of hours. It works across email, CRM syncs, storefront events, and social actions without custom glue. If you already run large-scale workflow automation, this release turns debugging and capacity planning into a first-class, repeatable practice.

What shipped in Observability 360

  • End-to-end traces that follow a single customer event through triggers, agent steps, external APIs, and data pipelines.
  • A built-in trace viewer with span filters, flame graph mode, and percentiles for latency and cost.
  • Structured logs with context helpers and export to OTLP or S3, plus field-level redaction.
  • Live session replay for browser-based agents, including network panel, console, and viewport timeline.
  • Native SLI and SLO monitors with error budget burn charts and alert routing.
  • Span links between asynchronous jobs so you can see fan-out and retries in one place.

If you are new to the product, the AI marketing automation features page gives the best overview of capabilities. If you prefer a product tour, start from ButterGrow, the managed experience built on top of our orchestration engine.

Why observability matters for agents that touch revenue

Marketing stacks now route signups, cart events, attribution, content scheduling, and billing webhooks through the same automation backbone. That means a failed span is not just a log line. It can be a missed welcome email, a broken UTM map, or a stale product feed that throttles spend. Strong observability closes the loop between the customer experience and the workflow engine that powers it.

Three operational realities shaped this release:

  1. Agent runs are nondeterministic. Retries, backoffs, and model sampling complicate root cause analysis. You need context preserved at every hop.

  2. Most growth teams depend on at least five external services for identity, messaging, commerce, and ads. Timeouts and rate limits must be visible and attributable.

  3. Reliability is now a product surface. When you can show a percentile for campaign latency or a burn-down of the month's error budget, stakeholders trust automation at higher volumes.

Tracing that tells the whole story

Traces are the backbone of Observability 360. Every run gets a root span with attributes like customer_id, segment, campaign_id, and source_event. Child spans capture operations such as fetching list members, enriching a lead, calling an LLM, rendering an email, and updating the CRM. Cross-service calls propagate context using W3C Trace Context so the journey remains intact across gateways and plugins.
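To make the propagation step concrete, here is a minimal sketch of reading and writing the `traceparent` header that W3C Trace Context defines (version 00 only). The helper names `parseTraceparent` and `formatTraceparent` are illustrative and are not part of the ButterGrow SDK, which handles this for you when auto-instrumentation is on.

```typescript
// Minimal W3C traceparent helpers (version 00 only).
// Header layout: version-traceid-parentid-flags
interface TraceContext {
  traceId: string; // 32 lowercase hex chars, must not be all zeros
  spanId: string; // 16 lowercase hex chars, must not be all zeros
  sampled: boolean; // lowest bit of the trace-flags byte
}

const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT.exec(header.trim());
  if (match === null) return null;
  const [, traceId, spanId, flags] = match;
  // All-zero IDs are explicitly invalid in the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? "01" : "00"}`;
}
```

A gateway or plugin that forwards a request would parse the incoming header, start its own span under the same trace ID, and send a new header carrying its own span ID as the parent.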

Step 1: Turn on tracing in your workspace

Flip the Observability 360 toggle in settings. Under the hood, the SDK starts an OpenTelemetry tracer provider and auto-instruments HTTP clients, queues, and database calls. You can also set environment variables to export traces to your own backend if you want to compare visualizations side by side.

# Optional: ship traces to your own collector
export OTEL_EXPORTER_OTLP_ENDPOINT="https://otel.your-collector.example"
export OTEL_TRACES_EXPORTER="otlp"
export OTEL_RESOURCE_ATTRIBUTES="service.name=buttergrow-workspace,service.version=2026.6,env=prod"

Step 2: Add context early and once

Use context helpers at the start of a run. That ensures every span created by you or by the SDK inherits the same identifiers and labels.

// TypeScript example
import { runContext, trace } from "@buttergrow/agents";

runContext.set({
  customer_id: "c_91f3",
  segment: "buyers_30d",
  campaign_id: "cmp_back_to_school",
});

const span = trace.startSpan("render_email");
span.setAttribute("template", "sale_announcement_v2");
span.end();

When jobs fan out to queues, link spans so the trace remains navigable. The viewer shows these links, which makes retries and dead-letter queues (DLQs) easy to reason about.

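// Continues the example above (trace is imported from "@buttergrow/agents").
// enqueueJob stands for your project's own queue helper.
const parent = trace.getCurrentSpan();
enqueueJob({
  name: "sync_crm",
  spanLink: parent?.spanContext(), // undefined when no span is active
});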
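As a rough illustration of why links help with retries, the sketch below groups consumer spans by the producer span they link to. The `SpanRef` and `JobSpan` shapes and the `groupByLinkedSpan` function are assumptions made for this example. They do not describe the viewer's internals or the SDK's types.

```typescript
// Hypothetical shapes for illustration only.
interface SpanRef {
  traceId: string;
  spanId: string;
}

interface JobSpan {
  name: string;
  attempt: number;
  links: SpanRef[]; // spans this job was caused by, e.g. the enqueueing span
}

// Group job spans by the span they link back to, so every retry of the
// same enqueued job ends up in one bucket.
function groupByLinkedSpan(spans: JobSpan[]): Map<string, JobSpan[]> {
  const groups = new Map<string, JobSpan[]>();
  for (const span of spans) {
    for (const link of span.links) {
      const key = `${link.traceId}:${link.spanId}`;
      const bucket = groups.get(key) ?? [];
      bucket.push(span);
      groups.set(key, bucket);
    }
  }
  return groups;
}
```

With this grouping, three attempts of sync_crm that all link to the same render step show up as one bucket of three, which is the view you want when deciding whether a retry policy is too aggressive.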

Structured logs you can actually query

Console output is not enough once you are joining runs by customer, campaign, and destination system. Observability 360 adds JSON logs with a stable schema, automatic correlation to traces, and export options. You can tail them live, build filters like campaign_id equals cmp_back_to_school and http.status_code equals 429, and save shared views for your team.

Here is a minimal example you can adopt in any handler:

{
  "timestamp": "2026-06-05T12:30:11Z",
  "level": "warn",
  "message": "destination rate limited",
  "trace_id": "44f6b099e7c64b6f8a1d3c5e7f902b46",
  "span_id": "c002c8a4d8f3a6a0",
  "customer_id": "c_91f3",
  "campaign_id": "cmp_back_to_school",
  "destination": "crm",
  "http.status_code": 429,
  "retry_in_ms": 5000
}
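Because every record shares the same flat shape, filtering reduces to exact matches on field names. The sketch below shows that idea over newline-delimited JSON, for instance an exported log file. `LogRecord`, `Filter`, and `filterLogs` are hypothetical names for this example and are not the platform's query engine.

```typescript
// Flat JSON log record; field names follow the sample record above.
interface LogRecord {
  timestamp: string;
  level: "debug" | "info" | "warn" | "error";
  message: string;
  [field: string]: string | number | boolean;
}

type Filter = Record<string, string | number | boolean>;

// True when every key in the filter matches the record exactly.
function matches(record: LogRecord, filter: Filter): boolean {
  return Object.entries(filter).every(([key, value]) => record[key] === value);
}

// Parse JSON lines, skip anything that is not a JSON object, keep matches.
function filterLogs(jsonLines: string[], filter: Filter): LogRecord[] {
  const out: LogRecord[] = [];
  for (const line of jsonLines) {
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      continue; // tolerate stray non-JSON output in the stream
    }
    if (typeof parsed !== "object" || parsed === null) continue;
    const record = parsed as LogRecord;
    if (matches(record, filter)) out.push(record);
  }
  return out;
}
```

Calling `filterLogs(lines, { campaign_id: "cmp_back_to_school", "http.status_code": 429 })` keeps only the rate-limited records for that campaign.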

For error payloads, the platform recognizes and promotes the standard problem details fields from RFC 7807 (since obsoleted by RFC 9457, which keeps the same core fields). That means type, title, and instance are searchable and link back to the failing span automatically.
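For reference, a problem details object has five standard members: type, title, status, detail, and instance. The sketch below flattens one into searchable attributes. The `problem.` attribute prefix and the `toSpanAttributes` function are assumptions for illustration. The source does not say which attribute names the platform actually uses.

```typescript
// Standard members of a problem details object (RFC 7807 / RFC 9457).
interface ProblemDetails {
  type?: string; // URI identifying the problem type; defaults to "about:blank"
  title?: string; // short human-readable summary
  status?: number; // HTTP status code
  detail?: string; // explanation specific to this occurrence
  instance?: string; // URI identifying this occurrence
}

// Flatten a problem details object into attribute key/value pairs.
// The "problem." prefix is a hypothetical naming choice.
function toSpanAttributes(problem: ProblemDetails): Record<string, string | number> {
  const attrs: Record<string, string | number> = {
    "problem.type": problem.type ?? "about:blank",
  };
  if (problem.title !== undefined) attrs["problem.title"] = problem.title;
  if (problem.status !== undefined) attrs["problem.status"] = problem.status;
  if (problem.detail !== undefined) attrs["problem.detail"] = problem.detail;
  if (problem.instance !== undefined) attrs["problem.instance"] = problem.instance;
  return attrs;
}
```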

Export options

You can export logs to your own data lake on a schedule or stream them through OTLP to a third party. Retention is workspace scoped and can be set per stream. PII redaction rules apply before export so your downstream tools only receive masked values for fields you choose to protect.
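The ordering matters here: masking happens before a record leaves the workspace. A minimal sketch of that step, assuming flat records and a set of protected field names, could look like the following. The `redact` function and the `[REDACTED]` mask are illustrative and do not reflect the platform's rule syntax.

```typescript
type LogValue = string | number | boolean;

// Return a copy of the record with protected fields masked.
// The input record is left untouched so in-workspace views keep full values.
function redact(
  record: Record<string, LogValue>,
  protectedFields: ReadonlySet<string>,
  mask = "[REDACTED]",
): Record<string, LogValue> {
  const out: Record<string, LogValue> = {};
  for (const [key, value] of Object.entries(record)) {
    out[key] = protectedFields.has(key) ? mask : value;
  }
  return out;
}
```

An exporter would call `redact` on each record immediately before serializing it, so downstream tools never see the raw value.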

Live session replay for browser-based agents

Some of the highest-impact failures live in UI automations. A page changes a selector, a modal hides a button, or an auth flow adds a step. Replay lets you watch the exact sequence of DOM mutations, network calls, and viewport changes a colleague or autonomous agent experienced. Because the timeline is synchronized with your traces, you can jump from a failing span directly to the moment the UI drifted.

Privacy is the default. Input fields such as passwords, credit cards, and emails are masked. You can add custom CSS selectors to exclude entire components, and you can scope retention by project so sensitive journeys expire on a tighter schedule.

SLIs, SLOs, and error budgets that drive action

Observability 360 includes opinionated defaults for success rate, latency, and unit cost. You can define objectives like p95 under 2 seconds for create_lead, success rate above 99.5 percent for send_email, or unit cost under 3 cents for enrich_contact. The monitor page shows current status, budget remaining, and burn rate, plus a run list of the top budget consumers.
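The arithmetic behind budget remaining and burn rate is short enough to show. This sketch uses the common definitions: the error budget is the share of events allowed to fail under the target, and the burn rate is the observed failure rate divided by the allowed failure rate. The names `SloWindow`, `BudgetStatus`, and `budgetStatus` are made up for this example, and the platform's exact formulas may differ.

```typescript
interface SloWindow {
  target: number; // e.g. 0.995 for "success rate above 99.5 percent"
  total: number; // events observed in the window
  failed: number; // events that violated the SLI
}

interface BudgetStatus {
  allowedFailures: number; // error budget expressed in events
  budgetRemaining: number; // fraction of budget left; negative once overspent
  burnRate: number; // 1.0 means spending the budget exactly as fast as the SLO allows
}

function budgetStatus({ target, total, failed }: SloWindow): BudgetStatus {
  if (!(target > 0 && target < 1)) {
    throw new RangeError("target must be strictly between 0 and 1");
  }
  if (total === 0) {
    // No traffic yet: nothing spent, nothing burning.
    return { allowedFailures: 0, budgetRemaining: 1, burnRate: 0 };
  }
  const allowedRate = 1 - target;
  const allowedFailures = allowedRate * total;
  return {
    allowedFailures,
    budgetRemaining: 1 - failed / allowedFailures,
    burnRate: failed / total / allowedRate,
  };
}
```

For example, 400 failures out of 200,000 send_email events against a 99.5 percent target gives a budget of about 1,000 failures, roughly 60 percent of it remaining, and a burn rate near 0.4. A burn rate above 1 means the budget will run out before the window ends.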

Alerts route to Slack, email, or PagerDuty. You can set separate policies for day and night schedules, and you can annotate an incident with the runbook link that explains how to mitigate a particular failure mode. This is especially useful when the on-call rotation passes to someone outside the original project team.
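A day and night policy boils down to a small decision on the time an alert fires. The sketch below assumes a weekday day shift of 09:00 to 18:00 UTC that goes to Slack, with everything else going to PagerDuty. The shift hours, the weekend rule, and the `routeAlert` function are assumptions for illustration. In the product you configure this in the alert policy rather than in code.

```typescript
type Channel = "slack" | "pagerduty";

interface DayShift {
  startHourUtc: number; // inclusive
  endHourUtc: number; // exclusive
}

// Weekday alerts inside the day shift go to Slack; nights and weekends page.
function routeAlert(
  firedAt: Date,
  shift: DayShift = { startHourUtc: 9, endHourUtc: 18 },
): Channel {
  const day = firedAt.getUTCDay(); // 0 = Sunday, 6 = Saturday
  const hour = firedAt.getUTCHours();
  const isWeekend = day === 0 || day === 6;
  const inDayShift = hour >= shift.startHourUtc && hour < shift.endHourUtc;
  return !isWeekend && inDayShift ? "slack" : "pagerduty";
}
```

Working in UTC keeps the sketch deterministic. A real policy would usually evaluate the shift in the team's local time zone.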

If your use case is analytics-heavy, the tutorial on instrumenting, monitoring, and debugging agents goes deeper on dashboards, sampling, and metrics you can share with stakeholders.

Compare the old toolbox with Observability 360

The table below summarizes what changes for teams migrating from ad hoc print statements and vendor-specific consoles to a unified view.

| Capability | Before | With Observability 360 |
| --- | --- | --- |
| Root cause analysis | Manual grepping across logs | One click from an error to the exact failing span and replay |
| Cross-service context | Lost at queue or webhook boundaries | W3C Trace Context with span links across jobs |
| Error payloads | Free-form strings | RFC 7807 problem details with searchable fields |
| UI drift debugging | Screenshots from users | Time-aligned session replay with network and console |
| Reliability targets | Best-effort alerts | SLOs with error budgets and burn charts |
| Data export | Vendor-specific formats | OTLP for traces and logs, S3 for archives |

Real outcomes from early access customers

Teams in the preview cut mean time to detect failures by 38 percent and mean time to repair by 41 percent after turning on traces and replay together. One ecommerce customer identified that 7 percent of failed cart recovery attempts were caused by a specific DOM selector that changed in a minor theme update. Another B2B team discovered a silent retry storm that added 19 percent to their messaging costs during a partner outage and fixed it by adding a single backoff rule.

Two patterns stand out:

  • Context wins. The biggest gains came from standardizing on a few context fields like customer_id, campaign_id, and segment, then wiring those into traces and logs at the start of each run.
  • Replay closes the empathy gap. When product managers could watch the exact UI drift that blocked an agent, prioritization decisions were faster and less contentious.

How to roll this out in your org

You can phase the rollout in less than a day. The steps below assume you have owner access to a workspace.

Step 1: Upgrade the SDK and agent images

Use the version shipped on June 5, 2026, or newer. That build includes auto-instrumentation for HTTP clients, the queue, and SQL. It also exposes helpers for setting context and creating span links.

Step 2: Enable tracing and structured logs

Open settings and toggle Observability 360. Decide if traces and logs should remain in the managed viewer or if you also want to export to your own backend. Leaving export off is fine for teams that prefer the built-in experience.

Step 3: Define SLIs and SLOs

Start with two or three objectives per critical flow. Common picks are success rate above 99.5 percent for lead creation, p95 latency under 2 seconds for the list API, and unit cost under 3 cents for enrich_contact. Tie alerts to Slack for day shifts and to PagerDuty for nights and weekends.
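If you want to sanity-check a latency objective against exported data before committing to it, a p95 can be computed with the nearest-rank method. This is one common percentile definition, chosen here for simplicity. The source does not state which method the monitors use, and `percentile` and `meetsLatencyObjective` are names made up for this sketch.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples: readonly number[], p: number): number {
  if (samples.length === 0) throw new RangeError("no samples");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[rank - 1];
}

// "p95 under 2 seconds" becomes meetsLatencyObjective(samplesMs, 95, 2000).
function meetsLatencyObjective(
  samplesMs: readonly number[],
  p: number,
  thresholdMs: number,
): boolean {
  return percentile(samplesMs, p) < thresholdMs;
}
```

Run it over a week of latencies for the flow in question. If the objective already fails on historical data, loosen the threshold or fix the flow before wiring the alert.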

Step 4: Add redaction rules

Audit which fields are sensitive for your use case. Mask them in both logs and replay. The default set covers common inputs such as email, password, card, and address, but you may need to add custom selectors.

Step 5: Train the team

Show two example traces and a replay in a short brown-bag session. Ask each squad to add context helpers at the top of their most-used runs and to save one or two filters that match their weekly dashboards. Point new teammates to the AI marketing automation features page for a complete view of what the product can do.

Integrations and compatibility

Traces and logs use the OpenTelemetry data model. You can forward them to any backend that supports OTLP over HTTP or gRPC. The viewer is optimized for the most common analysis tasks, but if you already use a third party for long-term storage, you can keep that in place and send a copy.

Teams that are new to the platform can get started in minutes with sample projects and seed data. If you have procurement or privacy questions, the common questions page covers data handling, roles, and retention.

For more product stories and deep dives, browse more from the ButterGrow blog and keep an eye on upcoming update notes.

Roadmap highlights

We are working on percentage-based sampling and an adaptive mode that scales sampling up automatically during incident conditions. We are also adding cross-workspace trace joins for agencies that run many tenants, plus first-party panels for queue depth and rate-limit heatmaps.

If you are ready to try these capabilities in a real project, you can start a workspace, connect a data source, and instrument your first flow in under fifteen minutes. The AI marketing automation features page shows where tracing, logs, replay, and SLOs fit inside ButterGrow.

Frequently Asked Questions

How do I enable OpenTelemetry tracing for agents and workflows in ButterGrow?

Update to the latest SDK, then toggle the Observability 360 switch in workspace settings. Set OTEL_EXPORTER_OTLP_ENDPOINT in your environment if you want to ship traces to your own backend. All spans also appear in the built-in trace viewer, so you can start without any external vendor.

What data is captured by live session replay and how is sensitive content protected?

Replay records DOM mutations, network calls, console output, and agent viewport events. Field-level redaction rules mask PII by default, and you can add custom selectors for inputs or components. Raw frames never leave your region and are retained according to the workspace policy you set.

Can I define SLIs and SLOs without switching to a separate monitoring tool?

Yes. Observability 360 provides built-in latency, success rate, and cost SLIs. You can define objectives such as p95 under 2 seconds or success rate above 99.5 percent, then tie alerts to PagerDuty, Slack, or email. Error budgets are calculated automatically and shown on the monitor page.

How are structured logs different from standard console output?

Structured logs are JSON with typed fields like trace_id, customer_id, segment, and cost. They can be filtered and joined with traces for a single timeline. You can export them to your own data lake or SIEM over OTLP or S3, and you can create retention policies per stream.

Does Observability 360 require changes to existing automations built on OpenClaw?

Most projects work out of the box because the SDK auto-instruments key operations. You will get the most value by adding context helpers at the start of a run, such as setting customer_id and campaign_id, and by upgrading to the latest workflow primitives to capture span links.

What is the recommended way to surface actionable errors to developers and analysts?

Emit problem details using the RFC 7807 format in your error paths and let the platform attach those to the parent span. That gives you consistent type, title, and instance fields that are searchable and easy to route. You can also annotate spans with runbook URLs for faster mitigation.
