Tags: ai-agents · reliability · load-testing · chaos · tutorial

Agent Reliability Playbook - Part 3: Run Reliability Drills Before Production

By John Babich · 2/24/2025 · 4 min read
Advanced

Once alerts exist, the best time to trigger them is before customers do. Part 3 focuses on stress testing and chaos drills so you can prove your agent meets SLOs and that your detectors from Part 2 actually fire.

Scenario: Friday Drill Before a Marketing Launch

Marketing wants to feature the Escalation Concierge in the newsletter tomorrow. Ops requires proof that:

  • The agent survives 200 concurrent incidents.
  • Tool failures trigger Sev2 alerts within 60 seconds.
  • Humans can replay any synthetic mission for audits.
  • Feature flags let you kill risky capabilities instantly.

We will create a reliability drill kit that runs every Friday.


Step 1: Define SLOs and Drill Matrix

List the behaviors you must test each week.

| SLO | Target | Probe |
| --- | --- | --- |
| Mission latency | p95 < 2.5s | Load test 200 concurrent missions |
| Detector time | Critical alert < 60s | Chaos script that forces refund errors |
| Tool success | 99% success | Inject timeouts, ensure retries recover |
| Replay availability | 100% | Sample 20 transcripts & verify NDJSON exists |

Store this matrix in docs/drill-matrix.md so executives and engineers agree on the gate criteria.
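To keep the gate criteria machine-checkable as well as human-readable, the matrix can be encoded directly in code. This is a minimal sketch: the thresholds mirror the table above, while the metric keys and the `grade` function are illustrative names, not part of any existing tooling.

```python
# drills/slo_gate.py -- illustrative gate checker; thresholds mirror the drill matrix
SLOS = {
    "p95_latency_s": 2.5,        # mission latency p95 < 2.5s
    "detector_delay_s": 60,      # critical alert < 60s
    "tool_success_rate": 0.99,   # 99% tool success
    "replay_availability": 1.0,  # 100% of sampled transcripts replayable
}

def grade(metrics: dict) -> list[str]:
    """Return the list of breached SLOs; an empty list means the drill passes."""
    breaches = []
    if metrics["p95_latency_s"] >= SLOS["p95_latency_s"]:
        breaches.append("mission latency")
    if metrics["detector_delay_s"] >= SLOS["detector_delay_s"]:
        breaches.append("detector time")
    if metrics["tool_success_rate"] < SLOS["tool_success_rate"]:
        breaches.append("tool success")
    if metrics["replay_availability"] < SLOS["replay_availability"]:
        breaches.append("replay availability")
    return breaches
```

With the thresholds in one place, both the Friday drill and the CI gate in Step 5 can agree on what "pass" means.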


Step 2: Generate Synthetic Missions

Python Locust scripts

# drills/locustfile.py
from locust import HttpUser, task, between
import uuid, random

class MissionUser(HttpUser):
    wait_time = between(0.1, 1)

    @task(3)
    def submit_incident(self):
        payload = {
            "user_id": str(uuid.uuid4()),
            "history": [
                {"role": "user", "content": "Refund order #{}".format(random.randint(1000, 9999))}
            ]
        }
        self.client.post("/agent/mission", json=payload)

    @task(1)
    def trigger_refund_edge(self):
        payload = {
            "user_id": str(uuid.uuid4()),
            "history": [{"role": "user", "content": "Refund $5000 for #{}".format(random.randint(1000, 9999))}],
            "metadata": {"force_refund_limit": True}
        }
        self.client.post("/agent/mission", json=payload)

Node Artillery profile

# drills/artillery.yml
config:
  target: "https://staging.lumenly.ai"
  phases:
    - duration: 900
      arrivalRate: 50
      rampTo: 200
      name: "Launch rehearsal"
scenarios:
  - flow:
      - post:
          url: "/agent/mission"
          json:
            user_id: "{{ $uuid }}"
            history:
              - role: "user"
                content: "Customer wants refund {{ $randomNumber(1000, 9999) }}"

Run both so staging sees traffic from two independent generators; if your agent also accepts socket traffic, extend the Artillery scenario with its websocket engine to cover that entry point too.


Step 3: Inject Failures with Feature Flags

Use a feature-flag service or environment toggles to force common issues.

Node chaos toggles

// chaos/toggles.js
let flags = {
  toolTimeout: false,
  hallucinationMode: false
};

export function enable(flag) {
  flags[flag] = true;
}

export function disable(flag) {
  flags[flag] = false;
}

export async function runTool(payload) {
  if (flags.toolTimeout) {
    // Simulate a hung tool: reject after 4s so callers exercise their timeout handling.
    await new Promise((_, reject) => setTimeout(() => reject(new Error("forced timeout")), 4000));
  }
  return realTool(payload);
}

Python chaos decorator

# chaos/decorators.py
from functools import wraps

CHAOS_FLAGS = {"force_hallucination": False}

def chaos(flag_name):
    def wrapper(func):
        @wraps(func)
        async def inner(*args, **kwargs):
            if CHAOS_FLAGS.get(flag_name):
                raise RuntimeError(f"forced chaos {flag_name}")
            return await func(*args, **kwargs)
        return inner
    return wrapper

Flip these flags during drills to validate detectors and recovery steps.
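As an illustrative drill step, the flip can be scripted end to end: call the tool normally, enable the flag, confirm the decorated tool now fails (which is what the Part 2 detectors should catch), then restore the flag. The decorator is repeated here so the snippet is self-contained, and `answer_tool` is a hypothetical stand-in for a real agent tool.

```python
# drills/chaos_demo.py -- scripted flag flip against the chaos decorator
import asyncio
from functools import wraps

CHAOS_FLAGS = {"force_hallucination": False}  # as in chaos/decorators.py

def chaos(flag_name):
    def wrapper(func):
        @wraps(func)
        async def inner(*args, **kwargs):
            if CHAOS_FLAGS.get(flag_name):
                raise RuntimeError(f"forced chaos {flag_name}")
            return await func(*args, **kwargs)
        return inner
    return wrapper

@chaos("force_hallucination")
async def answer_tool(prompt):  # hypothetical tool under test
    return f"answer for {prompt!r}"

async def drill():
    ok = await answer_tool("refund #1234")     # normal path succeeds
    CHAOS_FLAGS["force_hallucination"] = True  # flip the flag mid-drill
    try:
        await answer_tool("refund #1234")
        failed = False
    except RuntimeError:
        failed = True                          # detectors should fire on this error
    CHAOS_FLAGS["force_hallucination"] = False # always restore after the drill
    return ok, failed
```

Restoring the flag in all paths matters: a drill that leaves chaos enabled becomes a real incident.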


Step 4: Capture and Grade Results

After each drill, store metrics plus pass/fail status.

# drills/results.py
import json
from datetime import datetime, timezone

class DrillRecorder:
    def __init__(self, path="logs/drills.ndjson"):
        self.path = path

    def record(self, name, status, metrics):
        entry = {
            # timezone-aware timestamp; datetime.utcnow() is deprecated
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "name": name,
            "status": status,
            "metrics": metrics
        }
        with open(self.path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(entry) + "\n")

Log outputs such as p95_latency, detector_delay_ms, and pager_ack_seconds. Plot them in Grafana to spot regressions.
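Deriving those summary numbers from raw samples is straightforward; here is one sketch using the nearest-rank percentile method (the function names are illustrative, and real drills would pull latency samples from the load-test output):

```python
# drills/metrics.py -- summarize raw drill samples into the metrics we record
def p95(samples):
    """95th percentile by nearest-rank: sort, take the value at ceil(0.95 * n)."""
    ordered = sorted(samples)
    index = max(0, -(-len(ordered) * 95 // 100) - 1)  # ceil(n * 0.95) - 1
    return ordered[index]

def summarize(latencies_s, tool_calls):
    """latencies_s: mission latencies in seconds; tool_calls: booleans (True = success)."""
    return {
        "p95_latency_s": p95(latencies_s),
        "tool_success_rate": sum(tool_calls) / len(tool_calls),
    }
```

The resulting dict plugs straight into `DrillRecorder.record` as the `metrics` argument.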


Step 5: Automate the Drill Pipeline

  1. Create a GitHub Actions workflow that:
    • Deploys staging stack.
    • Runs Locust + Artillery.
    • Toggles chaos flags for 5 minutes.
    • Uploads drill results artifact.
  2. Fail the pipeline if any SLO breach occurs.
  3. Notify Slack with the drill report and link to the NDJSON logs.

Example Actions snippet:

- name: Run Locust
  run: |
    locust -f drills/locustfile.py --headless -u 200 -r 20 -t 15m
- name: Toggle chaos
  run: |
    curl -X POST "$CHAOS_URL/enable/toolTimeout"
    sleep 300
    curl -X POST "$CHAOS_URL/disable/toolTimeout"

Step 6: Implementation Checklist

  • Publish SLO + drill matrix to the repo.
  • Add Locust/Artillery generators referencing staging endpoints.
  • Wire chaos toggles or feature flags into risky tools.
  • Record drill metrics (latency, detector delay, pager ack).
  • Automate the drill pipeline and alert on failures.

With drills in place, you know your agent and alerting stack behave before shipping campaigns. Part 4 will close the loop with postmortems, experiment tracking, and a reliability backlog.

