Agent Reliability Drilldown: Instrument, Replay, and Fix Faster

Every time the support agent at Lumenly misfiled a ticket, engineers scrambled to find the culprit. Logs were sparse, cost reports lagged, and no one could replay the conversation that caused the incident. This tutorial builds a reliability harness from scratch so agent failures never stay a mystery.
We will build a cross-language observability stack that:
- Captures OpenTelemetry traces from both Node and Python runtimes.
- Stores NDJSON transcripts you can replay locally.
- Tracks latency, tool counts, and dollar costs.
- Adds a CLI viewer plus pytest integration to stop regressions before they ship.
Scenario: Mission Control for a Support Agent
Our Escalation Concierge is now live with real customers. The ops team needs:
- A single trace ID that connects user prompts, LLM calls, and downstream tools.
- Replay files to step through every mission turn without digging through mixed logs.
- Metrics that expose token spend, latency, and tool success rates.
- Automated tests that fail if observability drifts.
We will build this reliability lab in five layers: tracing, replays, metrics, tests, and the deployed observability stack.
Step 1: Reliability Objectives
Before touching code, define the signals you need.
| Objective | Signal | Implementation |
|---|---|---|
| Trace every turn | Trace/span ids, attributes | OpenTelemetry tracer in both runtimes |
| Repro steps fast | NDJSON transcript per mission | Recorder utility + CLI viewer |
| Control costs | Token counts, dollar estimate | Metrics emitted per completion |
| Catch regressions | Tests that assert logs exist | Pytest + Node tap tests |
| Share context | Dashboard and replay docs | Jaeger/Grafana plus runbook |
Document this table in your repo (docs/reliability-plan.md) so every agent feature has to keep these signals intact.
Step 2: Instrument Turns with OpenTelemetry
Python FastAPI agent
Install dependencies:
pip install "fastapi[standard]" opentelemetry-sdk opentelemetry-exporter-otlp opentelemetry-instrumentation-fastapi
Create telemetry/tracing.py:
# telemetry/tracing.py
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
provider = TracerProvider(resource=Resource(attributes={"service.name": "support-agent"}))
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)
Wrap each agent turn:
# app/agent.py
from telemetry.tracing import tracer
async def run_turn(payload):
    with tracer.start_as_current_span("agent.turn", attributes={"user_id": payload.user_id}) as span:
        completion = await llm.complete(payload.history)
        span.set_attribute("tokens.prompt", completion.prompt_tokens)
        span.set_attribute("tokens.completion", completion.completion_tokens)
        span.set_attribute("tool.calls", len(completion.tools or []))
        return completion
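The turn span gives you totals, but per-tool timing is usually what explains a misfiled ticket. A minimal sketch that nests one child span per tool invocation under agent.turn; call_tool and tool_registry are illustrative names, not part of the code above:
# app/tools.py — illustrative child span per tool invocation
from telemetry.tracing import tracer

async def call_tool(name, args):
    # Because agent.turn is the current span, this span nests beneath it automatically.
    with tracer.start_as_current_span(f"tool.{name}", attributes={"tool.name": name}) as span:
        result = await tool_registry[name](**args)  # tool_registry is a placeholder lookup of tool callables
        span.set_attribute("tool.status", "ok")
        return result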
Node.js Express agent
npm install @opentelemetry/api @opentelemetry/sdk-trace-node @opentelemetry/sdk-trace-base @opentelemetry/resources @opentelemetry/exporter-trace-otlp-http
// telemetry/tracing.js
import { NodeTracerProvider } from "@opentelemetry/sdk-trace-node";
import { BatchSpanProcessor } from "@opentelemetry/sdk-trace-base";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { Resource } from "@opentelemetry/resources";
import { trace } from "@opentelemetry/api";

const provider = new NodeTracerProvider({
  resource: new Resource({ "service.name": "support-agent-node" })
});
provider.addSpanProcessor(new BatchSpanProcessor(new OTLPTraceExporter({ url: "http://localhost:4318/v1/traces" })));
provider.register();

export const tracer = trace.getTracer("support-agent-node");
// app/agent.js
import { tracer } from "../telemetry/tracing.js";

export async function runTurn(payload) {
  return await tracer.startActiveSpan("agent.turn", async (span) => {
    try {
      span.setAttribute("user_id", payload.userId);
      const completion = await llm.complete(payload.history);
      span.setAttribute("tokens.prompt", completion.promptTokens);
      span.setAttribute("tokens.completion", completion.completionTokens);
      return completion;
    } finally {
      // Always end the span, even when the LLM call throws.
      span.end();
    }
  });
}
Tracer checklist:
- Emit the same correlation_id across Node and Python services (see the sketch below).
- Send spans to a local collector (Jaeger, Honeycomb, etc.) via OTLP.
- Store exporter settings in config so prod can point to managed backends.
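For the correlation_id item, one approach is to read the id from the incoming request, stamp it on the active span, and forward it on every downstream call. A Python sketch, where the X-Correlation-ID header name and the attach_correlation_id helper are assumptions rather than part of the agent above:
# app/correlation.py — illustrative correlation-id plumbing
import uuid

from opentelemetry import trace

def attach_correlation_id(headers: dict) -> str:
    """Reuse the caller's correlation id (or mint one) and record it on the current span."""
    correlation_id = headers.get("x-correlation-id") or str(uuid.uuid4())
    trace.get_current_span().set_attribute("correlation_id", correlation_id)
    return correlation_id  # forward this value in headers to the Node service and in replay events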
Step 3: Capture Mission Replays
Store every message, tool call, and model response in NDJSON so you can replay the mission step by step.
Python recorder
# telemetry/recorder.py
import json, time, uuid
from pathlib import Path
class MissionRecorder:
    def __init__(self, directory: Path):
        directory.mkdir(parents=True, exist_ok=True)
        self.file = directory / f"mission-{uuid.uuid4()}.ndjson"

    def log(self, event: dict):
        event["ts"] = time.time()
        with self.file.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(event) + "\n")
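To wire the recorder into the agent, log one event per message and per tool call from run_turn. A sketch that reuses the run_turn and llm placeholders from Step 2; the event fields shown are illustrative, not a fixed schema:
# app/agent.py — illustrative replay logging inside run_turn
from pathlib import Path

from telemetry.recorder import MissionRecorder

recorder = MissionRecorder(Path("logs"))

async def run_turn(payload):
    recorder.log({"type": "user", "summary": payload.text})
    completion = await llm.complete(payload.history)  # same placeholder LLM client as Step 2
    for tool in completion.tools or []:
        recorder.log({"type": "tool", "name": tool.name, "summary": tool.status})
    recorder.log({"type": "assistant", "summary": completion.text[:200]})
    return completion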
Node recorder
// telemetry/recorder.js
import { createWriteStream } from "node:fs";
import { randomUUID } from "node:crypto";
import { join } from "node:path";
export class MissionRecorder {
  constructor(directory) {
    this.stream = createWriteStream(join(directory, `mission-${randomUUID()}.ndjson`), { flags: "a" });
  }

  log(event) {
    const enriched = { ...event, ts: Date.now() / 1000 };
    this.stream.write(`${JSON.stringify(enriched)}\n`);
  }
}
MissionScope CLI (shared replay viewer)
# tools/missionscope.py
import json

import typer
from rich.console import Console
from rich.table import Table

app = typer.Typer()
console = Console()

@app.callback()
def main():
    """MissionScope: replay NDJSON mission transcripts."""
    # A no-op callback keeps `replay` available as an explicit subcommand.

@app.command()
def replay(path: str):
    table = Table(show_header=True, header_style="bold green")
    table.add_column("ts")
    table.add_column("type")
    table.add_column("details")
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            event = json.loads(line)
            table.add_row(str(event["ts"]), event.get("type", "-"), event.get("summary") or event.get("name", ""))
    console.print(table)

if __name__ == "__main__":
    app()
Usage:
python tools/missionscope.py replay logs/mission-123.ndjson
Guidelines:
- Emit the same correlation_id as the trace span so you can jump between tools.
- Hide PII by redacting fields before writing to disk (see the redaction sketch below).
- Store replay files in S3 or another bucket with lifecycle policies.
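Redaction can sit right in front of the recorder. A minimal sketch, where SENSITIVE_FIELDS and the redact helper are assumptions you should adapt to your own data model:
# telemetry/redact.py — illustrative PII scrubber applied before events hit disk
SENSITIVE_FIELDS = {"email", "phone", "address", "account_number"}

def redact(event: dict) -> dict:
    """Return a copy of the event with sensitive fields masked."""
    return {key: ("[REDACTED]" if key in SENSITIVE_FIELDS else value) for key, value in event.items()}
Call recorder.log(redact(event)) so raw PII never reaches the NDJSON file or the bucket.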
Step 4: Metrics, Costs, and Alerts
Instrument both runtimes with StatsD or Prometheus metrics.
Python StatsD emitter
# telemetry/metrics.py
from statsd import StatsClient
stats = StatsClient(host="localhost", port=8125, prefix="agent")
def record_turn(latency_ms, tool_count, cost_usd):
    stats.timing("turn.latency_ms", latency_ms)
    stats.gauge("turn.tool_count", tool_count)
    stats.incr("cost.usd", cost_usd)
Cost helper:
def estimate_cost(tokens_in, tokens_out, model="gpt-4o-mini"):
    pricing = {"gpt-4o-mini": {"input": 0.00015 / 1000, "output": 0.0006 / 1000}}
    rate = pricing[model]
    return tokens_in * rate["input"] + tokens_out * rate["output"]
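Wiring the two together per turn looks like this. A minimal sketch, assuming estimate_cost lives next to record_turn in telemetry/metrics.py and that the completion object exposes the token counts from Step 2 (both are assumptions about your codebase):
# app/agent.py — illustrative metrics wiring for a single turn
import time

from telemetry.metrics import estimate_cost, record_turn  # assumes both helpers live in telemetry/metrics.py

async def run_turn(payload):
    start = time.perf_counter()
    completion = await llm.complete(payload.history)  # same placeholder LLM client as Step 2
    latency_ms = (time.perf_counter() - start) * 1000
    cost = estimate_cost(completion.prompt_tokens, completion.completion_tokens)
    record_turn(latency_ms, len(completion.tools or []), cost)
    return completion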
Node Prometheus metrics
npm install prom-client
// telemetry/metrics.js
import client from "prom-client";
client.collectDefaultMetrics();
export const latencyHistogram = new client.Histogram({
  name: "agent_turn_latency_ms",
  help: "Latency per turn",
  buckets: [100, 250, 500, 1000, 2000]
});

export const toolCounter = new client.Counter({
  name: "agent_tool_calls_total",
  help: "Tool invocations",
  labelNames: ["tool", "status"]
});

export const costCounter = new client.Counter({
  name: "agent_cost_usd_total",
  help: "Estimated dollar cost"
});
Expose /metrics in Express so Prometheus or Grafana Agent can scrape it. Alert when cost per mission spikes or when tool failures exceed a threshold.
Step 5: Automated Tests and Simulations
Pytest replay test
# tests/test_replay_exists.py
import json
from pathlib import Path
def test_turn_emits_replay(tmp_path, agent):
    agent.run_turn({"user_id": "demo", "text": "check tracing"})
    replay_file = next(Path("logs").glob("mission-*.ndjson"))
    events = [json.loads(line) for line in replay_file.read_text().splitlines()]
    assert any(evt.get("type") == "tool" for evt in events)
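To assert that spans exist as well, the OpenTelemetry SDK ships an in-memory exporter you can attach to a test-only provider. A sketch that simulates the Step 2 instrumentation rather than exercising the real agent:
# tests/test_spans_exist.py — illustrative span assertion with a test-only provider
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

def test_turn_span_has_token_attributes():
    exporter = InMemorySpanExporter()
    provider = TracerProvider()
    provider.add_span_processor(SimpleSpanProcessor(exporter))
    tracer = provider.get_tracer("test")

    # Mirror the attributes the Step 2 instrumentation sets on agent.turn.
    with tracer.start_as_current_span("agent.turn") as span:
        span.set_attribute("tokens.prompt", 120)

    spans = exporter.get_finished_spans()
    assert spans and spans[0].name == "agent.turn"
    assert spans[0].attributes["tokens.prompt"] == 120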
Node Tap test for metrics
npm install tap
// test/metrics.test.js
import t from "tap";
import { latencyHistogram } from "../telemetry/metrics.js";

t.test("records latency", async (t) => {
  latencyHistogram.observe(450);
  // get() returns a promise in recent prom-client releases.
  const snapshot = await latencyHistogram.get();
  t.ok(snapshot.values.some((v) => v.value > 0), "observed latency recorded");
});
CI tips:
- Run simulated missions in CI to ensure replays and spans exist.
- Fail the build if cost or latency metrics breach thresholds.
- Store sample NDJSON files as fixtures to validate CLI replay output, as sketched below.
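For that last tip, Typer's bundled CliRunner makes the check straightforward. A sketch, where the fixture file name and event content are illustrative:
# tests/test_missionscope_cli.py — illustrative fixture-driven replay check
import json

from typer.testing import CliRunner

from tools.missionscope import app

runner = CliRunner()

def test_replay_renders_fixture(tmp_path):
    fixture = tmp_path / "mission-fixture.ndjson"
    fixture.write_text(json.dumps({"ts": 1.0, "type": "tool", "summary": "kb.search"}) + "\n")

    result = runner.invoke(app, ["replay", str(fixture)])

    assert result.exit_code == 0
    assert "kb.search" in result.output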
Step 6: Deploy the Observability Stack
Local lab:
docker run -d --name otel-collector -p 4317:4317 -p 4318:4318 otel/opentelemetry-collector:latest
docker run -d --name jaeger -p 16686:16686 jaegertracing/all-in-one
docker run -d --name statsd-exporter -p 9102:9102 prom/statsd-exporter
Wire OTLP exporters to the collector, send StatsD metrics to the exporter, and build Grafana dashboards on top of Prometheus. Capture the topology in docs/observability.md with steps to replay missions during incidents.
Step 7: Implementation Checklist
- Configure OpenTelemetry tracers (Node and Python) with shared correlation IDs.
- Persist mission transcripts as NDJSON and add the MissionScope CLI to the repo.
- Emit latency, tool, and cost metrics (StatsD or Prometheus) with alert thresholds.
- Add automated tests that verify traces, transcripts, and metrics exist.
- Deploy Jaeger + Grafana (or your chosen stack) and document replay runbooks.
You now have a full reliability harness: every agent turn is traceable, replayable, and measurable. Pair this tutorial with the API integration series so every new connector ships with the safeguards ops teams expect.