Agent Reliability Drilldown: Instrument, Replay, and Fix Faster

Every time the support agent at Lumenly misfiled a ticket, engineers scrambled to find the culprit. Logs were sparse, cost reports lagged, and no one could replay the conversation that caused the incident. This tutorial builds a reliability harness from scratch so agent failures never stay a mystery.
We will build a cross-language observability stack that:
- Captures OpenTelemetry traces from both Node and Python runtimes.
- Stores NDJSON transcripts you can replay locally.
- Tracks latency, tool counts, and dollar costs.
- Adds a CLI viewer plus pytest integration to stop regressions before they ship.
Scenario: Mission Control for a Support Agent
Our Escalation Concierge is now live with real customers. The ops team needs:
- A single trace ID that connects user prompts, LLM calls, and downstream tools.
- Replay files to step through every mission turn without digging through mixed logs.
- Metrics that expose token spend, latency, and tool success rates.
- Automated tests that fail if observability drifts.
We will build this reliability lab in five layers: tracing, replays, metrics, tests, and the deployed observability stack.
Step 1: Reliability Objectives
Before touching code, define the signals you need.
| Objective | Signal | Implementation |
|---|---|---|
| Trace every turn | Trace/span ids, attributes | OpenTelemetry tracer in both runtimes |
| Repro steps fast | NDJSON transcript per mission | Recorder utility + CLI viewer |
| Control costs | Token counts, dollar estimate | Metrics emitted per completion |
| Catch regressions | Tests that assert logs exist | Pytest + Node tap tests |
| Share context | Dashboard and replay docs | Jaeger/Grafana plus runbook |
Document this table in your repo (docs/reliability-plan.md) so every agent feature has to keep these signals intact.
Step 2: Instrument Turns with OpenTelemetry
Python FastAPI agent
Install dependencies:
pip install "fastapi[standard]" opentelemetry-sdk opentelemetry-exporter-otlp opentelemetry-instrumentation-fastapi
Create telemetry/tracing.py:
# telemetry/tracing.py
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
provider = TracerProvider(resource=Resource(attributes={"service.name": "support-agent"}))
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)
Wrap each agent turn:
# app/agent.py
from telemetry.tracing import tracer
async def run_turn(payload):
    with tracer.start_as_current_span("agent.turn", attributes={"user_id": payload.user_id}) as span:
        completion = await llm.complete(payload.history)
        span.set_attribute("tokens.prompt", completion.prompt_tokens)
        span.set_attribute("tokens.completion", completion.completion_tokens)
        span.set_attribute("tool.calls", len(completion.tools or []))
        return completion
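The turn span gives you totals, but per-tool timing is usually what explains a misfiled ticket. A minimal sketch that nests one child span per tool invocation under agent.turn; call_tool and tool_registry are illustrative names, not part of the code above:
# app/tools.py — illustrative child span per tool invocation
from telemetry.tracing import tracer

async def call_tool(name, args):
    # Because agent.turn is the current span, this span nests beneath it automatically.
    with tracer.start_as_current_span(f"tool.{name}", attributes={"tool.name": name}) as span:
        result = await tool_registry[name](**args)  # tool_registry is a placeholder lookup of tool callables
        span.set_attribute("tool.status", "ok")
        return result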
Node.js Express agent
npm install @opentelemetry/api @opentelemetry/sdk-trace-node @opentelemetry/sdk-trace-base @opentelemetry/resources @opentelemetry/exporter-trace-otlp-http
// telemetry/tracing.js
import { NodeTracerProvider } from "@opentelemetry/sdk-trace-node";
import { BatchSpanProcessor } from "@opentelemetry/sdk-trace-base";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { Resource } from "@opentelemetry/resources";
import { trace } from "@opentelemetry/api";

const provider = new NodeTracerProvider({
  resource: new Resource({ "service.name": "support-agent-node" })
});
provider.addSpanProcessor(new BatchSpanProcessor(new OTLPTraceExporter({ url: "http://localhost:4318/v1/traces" })));
provider.register();

export const tracer = trace.getTracer("support-agent-node");
// app/agent.js
import { tracer } from "../telemetry/tracing.js";

export async function runTurn(payload) {
  return await tracer.startActiveSpan("agent.turn", async (span) => {
    try {
      span.setAttribute("user_id", payload.userId);
      const completion = await llm.complete(payload.history);
      span.setAttribute("tokens.prompt", completion.promptTokens);
      span.setAttribute("tokens.completion", completion.completionTokens);
      return completion;
    } finally {
      // Always end the span, even when the LLM call throws.
      span.end();
    }
  });
}
Tracer checklist:
- Emit the same correlation_id across Node and Python services (see the sketch below).
- Send spans to a local collector (Jaeger, Honeycomb, etc.) via OTLP.
- Store exporter settings in config so prod can point to managed backends.
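For the correlation_id item, one approach is to read the id from the incoming request, stamp it on the active span, and forward it on every downstream call. A Python sketch, where the X-Correlation-ID header name and the attach_correlation_id helper are assumptions rather than part of the agent above:
# app/correlation.py — illustrative correlation-id plumbing
import uuid

from opentelemetry import trace

def attach_correlation_id(headers: dict) -> str:
    """Reuse the caller's correlation id (or mint one) and record it on the current span."""
    correlation_id = headers.get("x-correlation-id") or str(uuid.uuid4())
    trace.get_current_span().set_attribute("correlation_id", correlation_id)
    return correlation_id  # forward this value in headers to the Node service and in replay events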
Step 3: Capture Mission Replays
Store every message, tool call, and model response in NDJSON so you can replay the mission step by step.
Python recorder
# telemetry/recorder.py
import json, time, uuid
from pathlib import Path
class MissionRecorder:
    def __init__(self, directory: Path):
        directory.mkdir(parents=True, exist_ok=True)
        self.file = directory / f"mission-{uuid.uuid4()}.ndjson"

    def log(self, event: dict):
        event["ts"] = time.time()
        with self.file.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(event) + "\n")
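To wire the recorder into the agent, log one event per message and per tool call from run_turn. A sketch that reuses the run_turn and llm placeholders from Step 2; the event fields shown are illustrative, not a fixed schema:
# app/agent.py — illustrative replay logging inside run_turn
from pathlib import Path

from telemetry.recorder import MissionRecorder

recorder = MissionRecorder(Path("logs"))

async def run_turn(payload):
    recorder.log({"type": "user", "summary": payload.text})
    completion = await llm.complete(payload.history)  # same placeholder LLM client as Step 2
    for tool in completion.tools or []:
        recorder.log({"type": "tool", "name": tool.name, "summary": tool.status})
    recorder.log({"type": "assistant", "summary": completion.text[:200]})
    return completion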
Node recorder
// telemetry/recorder.js
import { createWriteStream } from "node:fs";
import { randomUUID } from "node:crypto";
import { join } from "node:path";
export class MissionRecorder {
  constructor(directory) {
    this.stream = createWriteStream(join(directory, `mission-${randomUUID()}.ndjson`), { flags: "a" });
  }

  log(event) {
    const enriched = { ...event, ts: Date.now() / 1000 };
    this.stream.write(`${JSON.stringify(enriched)}\n`);
  }
}
MissionScope CLI (shared replay viewer)
# tools/missionscope.py
import json

import typer
from rich.console import Console
from rich.table import Table

app = typer.Typer()
console = Console()

@app.callback()
def main():
    """MissionScope: replay NDJSON mission transcripts."""
    # A no-op callback keeps `replay` available as an explicit subcommand.

@app.command()
def replay(path: str):
    table = Table(show_header=True, header_style="bold green")
    table.add_column("ts")
    table.add_column("type")
    table.add_column("details")
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            event = json.loads(line)
            table.add_row(str(event["ts"]), event.get("type", "-"), event.get("summary") or event.get("name", ""))
    console.print(table)

if __name__ == "__main__":
    app()
Usage:
python tools/missionscope.py replay logs/mission-123.ndjson
Guidelines:
- Emit the same correlation_id as the trace span so you can jump between tools.
- Hide PII by redacting fields before writing to disk (see the redaction sketch below).
- Store replay files in S3 or another bucket with lifecycle policies.
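Redaction can sit right in front of the recorder. A minimal sketch, where SENSITIVE_FIELDS and the redact helper are assumptions you should adapt to your own data model:
# telemetry/redact.py — illustrative PII scrubber applied before events hit disk
SENSITIVE_FIELDS = {"email", "phone", "address", "account_number"}

def redact(event: dict) -> dict:
    """Return a copy of the event with sensitive fields masked."""
    return {key: ("[REDACTED]" if key in SENSITIVE_FIELDS else value) for key, value in event.items()}
Call recorder.log(redact(event)) so raw PII never reaches the NDJSON file or the bucket.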
Step 4: Metrics, Costs, and Alerts
Instrument both runtimes with StatsD or Prometheus metrics.
Python StatsD emitter
# telemetry/metrics.py
from statsd import StatsClient
stats = StatsClient(host="localhost", port=8125, prefix="agent")
def record_turn(latency_ms, tool_count, cost_usd):
    stats.timing("turn.latency_ms", latency_ms)
    stats.gauge("turn.tool_count", tool_count)
    stats.incr("cost.usd", cost_usd)
Cost helper:
def estimate_cost(tokens_in, tokens_out, model="gpt-4o-mini"):
    pricing = {"gpt-4o-mini": {"input": 0.00015 / 1000, "output": 0.0006 / 1000}}
    rate = pricing[model]
    return tokens_in * rate["input"] + tokens_out * rate["output"]
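Wiring the two together per turn looks like this. A minimal sketch, assuming estimate_cost lives next to record_turn in telemetry/metrics.py and that the completion object exposes the token counts from Step 2 (both are assumptions about your codebase):
# app/agent.py — illustrative metrics wiring for a single turn
import time

from telemetry.metrics import estimate_cost, record_turn  # assumes both helpers live in telemetry/metrics.py

async def run_turn(payload):
    start = time.perf_counter()
    completion = await llm.complete(payload.history)  # same placeholder LLM client as Step 2
    latency_ms = (time.perf_counter() - start) * 1000
    cost = estimate_cost(completion.prompt_tokens, completion.completion_tokens)
    record_turn(latency_ms, len(completion.tools or []), cost)
    return completion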
Node Prometheus metrics
npm install prom-client
// telemetry/metrics.js
import client from "prom-client";
client.collectDefaultMetrics();
export const latencyHistogram = new client.Histogram({
  name: "agent_turn_latency_ms",
  help: "Latency per turn",
  buckets: [100, 250, 500, 1000, 2000]
});

export const toolCounter = new client.Counter({
  name: "agent_tool_calls_total",
  help: "Tool invocations",
  labelNames: ["tool", "status"]
});

export const costCounter = new client.Counter({
  name: "agent_cost_usd_total",
  help: "Estimated dollar cost"
});
Expose /metrics in Express so Prometheus or Grafana Agent can scrape it. Alert when cost per mission spikes or when tool failures exceed a threshold.
Step 5: Automated Tests and Simulations
Pytest replay test
# tests/test_replay_exists.py
import json
from pathlib import Path
def test_turn_emits_replay(tmp_path, agent):
    agent.run_turn({"user_id": "demo", "text": "check tracing"})
    replay_file = next(Path("logs").glob("mission-*.ndjson"))
    events = [json.loads(line) for line in replay_file.read_text().splitlines()]
    assert any(evt.get("type") == "tool" for evt in events)
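To assert that spans exist as well, the OpenTelemetry SDK ships an in-memory exporter you can attach to a test-only provider. A sketch that simulates the Step 2 instrumentation rather than exercising the real agent:
# tests/test_spans_exist.py — illustrative span assertion with a test-only provider
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

def test_turn_span_has_token_attributes():
    exporter = InMemorySpanExporter()
    provider = TracerProvider()
    provider.add_span_processor(SimpleSpanProcessor(exporter))
    tracer = provider.get_tracer("test")

    # Mirror the attributes the Step 2 instrumentation sets on agent.turn.
    with tracer.start_as_current_span("agent.turn") as span:
        span.set_attribute("tokens.prompt", 120)

    spans = exporter.get_finished_spans()
    assert spans and spans[0].name == "agent.turn"
    assert spans[0].attributes["tokens.prompt"] == 120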
Node Tap test for metrics
npm install tap
// test/metrics.test.js
import t from "tap";
import { latencyHistogram } from "../telemetry/metrics.js";

t.test("records latency", async (t) => {
  latencyHistogram.observe(450);
  // get() returns a promise in recent prom-client releases.
  const snapshot = await latencyHistogram.get();
  t.ok(snapshot.values.some((v) => v.value > 0), "observed latency recorded");
});
CI tips:
- Run simulated missions in CI to ensure replays and spans exist.
- Fail the build if cost or latency metrics breach thresholds.
- Store sample NDJSON files as fixtures to validate CLI replay output, as sketched below.
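For that last tip, Typer's bundled CliRunner makes the check straightforward. A sketch, where the fixture file name and event content are illustrative:
# tests/test_missionscope_cli.py — illustrative fixture-driven replay check
import json

from typer.testing import CliRunner

from tools.missionscope import app

runner = CliRunner()

def test_replay_renders_fixture(tmp_path):
    fixture = tmp_path / "mission-fixture.ndjson"
    fixture.write_text(json.dumps({"ts": 1.0, "type": "tool", "summary": "kb.search"}) + "\n")

    result = runner.invoke(app, ["replay", str(fixture)])

    assert result.exit_code == 0
    assert "kb.search" in result.output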
Step 6: Deploy the Observability Stack
Local lab:
docker run -d --name otel-collector -p 4317:4317 -p 4318:4318 otel/opentelemetry-collector:latest
docker run -d --name jaeger -p 16686:16686 jaegertracing/all-in-one
docker run -d --name statsd-exporter -p 9102:9102 prom/statsd-exporter
Wire OTLP exporters to the collector, send StatsD metrics to the exporter, and build Grafana dashboards on top of Prometheus. Capture the topology in docs/observability.md with steps to replay missions during incidents.
Step 7: Implementation Checklist
- Configure OpenTelemetry tracers (Node and Python) with shared correlation IDs.
- Persist mission transcripts as NDJSON and add the MissionScope CLI to the repo.
- Emit latency, tool, and cost metrics (StatsD or Prometheus) with alert thresholds.
- Add automated tests that verify traces, transcripts, and metrics exist.
- Deploy Jaeger + Grafana (or your chosen stack) and document replay runbooks.
You now have a full reliability harness: every agent turn is traceable, replayable, and measurable. Pair this tutorial with the API integration series so every new connector ships with the safeguards ops teams expect.