Tags: ai-agents · reliability · load-testing · chaos · tutorial

Agent Reliability Playbook - Part 3: Run Reliability Drills Before Production

By John Babich · 2/24/2025 · 4 min read
Advanced

Once alerts exist, the best time to trigger them is before customers do. Part 3 focuses on stress testing and chaos drills so you can prove your agent meets SLOs and that your detectors from Part 2 actually fire.

Scenario: Friday Drill Before a Marketing Launch

Marketing wants to feature the Escalation Concierge in the newsletter tomorrow. Ops requires proof that:

  • The agent survives 200 concurrent incidents.
  • Tool failures trigger Sev2 alerts within 60 seconds.
  • Humans can replay any synthetic mission for audits.
  • Feature flags let you kill risky capabilities instantly.

We will create a reliability drill kit that runs every Friday.


Step 1: Define SLOs and Drill Matrix

List the behaviors you must test each week.

| SLO | Target | Probe |
| --- | --- | --- |
| Mission latency | p95 < 2.5s | Load test 200 concurrent missions |
| Detector time | Critical alert < 60s | Chaos script that forces refund errors |
| Tool success | 99% success | Inject timeouts, ensure retries recover |
| Replay availability | 100% | Sample 20 transcripts & verify NDJSON exists |

Store this matrix in docs/drill-matrix.md so executives and engineers agree on the gate criteria.
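To keep the gate criteria machine-checkable as well as human-readable, the matrix can be encoded directly in code. This is a minimal sketch: the thresholds mirror the table above, while the metric keys and the `grade` function are illustrative names, not part of any existing tooling.

```python
# drills/slo_gate.py -- illustrative gate checker; thresholds mirror the drill matrix
SLOS = {
    "p95_latency_s": 2.5,        # mission latency p95 < 2.5s
    "detector_delay_s": 60,      # critical alert < 60s
    "tool_success_rate": 0.99,   # 99% tool success
    "replay_availability": 1.0,  # 100% of sampled transcripts replayable
}

def grade(metrics: dict) -> list[str]:
    """Return the list of breached SLOs; an empty list means the drill passes."""
    breaches = []
    if metrics["p95_latency_s"] >= SLOS["p95_latency_s"]:
        breaches.append("mission latency")
    if metrics["detector_delay_s"] >= SLOS["detector_delay_s"]:
        breaches.append("detector time")
    if metrics["tool_success_rate"] < SLOS["tool_success_rate"]:
        breaches.append("tool success")
    if metrics["replay_availability"] < SLOS["replay_availability"]:
        breaches.append("replay availability")
    return breaches
```

With the thresholds in one place, both the Friday drill and the CI gate in Step 5 can agree on what "pass" means.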


Step 2: Generate Synthetic Missions

Python Locust scripts

# drills/locustfile.py
from locust import HttpUser, task, between
import uuid, random

class MissionUser(HttpUser):
    wait_time = between(0.1, 1)

    @task(3)
    def submit_incident(self):
        payload = {
            "user_id": str(uuid.uuid4()),
            "history": [
                {"role": "user", "content": "Refund order #{}".format(random.randint(1000, 9999))}
            ]
        }
        self.client.post("/agent/mission", json=payload)

    @task(1)
    def trigger_refund_edge(self):
        payload = {
            "user_id": str(uuid.uuid4()),
            "history": [{"role": "user", "content": "Refund $5000 for #{}".format(random.randint(1000, 9999))}],
            "metadata": {"force_refund_limit": True}
        }
        self.client.post("/agent/mission", json=payload)

Node Artillery profile

# drills/artillery.yml
config:
  target: "https://staging.lumenly.ai"
  phases:
    - duration: 900
      arrivalRate: 50
      rampTo: 200
      name: "Launch rehearsal"
scenarios:
  - flow:
      - post:
          url: "/agent/mission"
          json:
            user_id: "{{ $uuid }}"
            history:
              - role: "user"
                content: "Customer wants refund {{ $randomNumber(1000, 9999) }}"

Run both so staging sees traffic from two independent generators; if your agent also accepts socket traffic, extend the Artillery scenario with its websocket engine to cover that entry point too.


Step 3: Inject Failures with Feature Flags

Use a feature-flag service or environment toggles to force common issues.

Node chaos toggles

// chaos/toggles.js
let flags = {
  toolTimeout: false,
  hallucinationMode: false
};

export function enable(flag) {
  flags[flag] = true;
}

export function disable(flag) {
  flags[flag] = false;
}

export async function runTool(payload) {
  if (flags.toolTimeout) {
    // Simulate a hung tool: reject after 4s so callers exercise their timeout handling.
    await new Promise((_, reject) => setTimeout(() => reject(new Error("forced timeout")), 4000));
  }
  return realTool(payload);
}

Python chaos decorator

# chaos/decorators.py
from functools import wraps

CHAOS_FLAGS = {"force_hallucination": False}

def chaos(flag_name):
    def wrapper(func):
        @wraps(func)
        async def inner(*args, **kwargs):
            if CHAOS_FLAGS.get(flag_name):
                raise RuntimeError(f"forced chaos {flag_name}")
            return await func(*args, **kwargs)
        return inner
    return wrapper

Flip these flags during drills to validate detectors and recovery steps.
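As an illustrative drill step, the flip can be scripted end to end: call the tool normally, enable the flag, confirm the decorated tool now fails (which is what the Part 2 detectors should catch), then restore the flag. The decorator is repeated here so the snippet is self-contained, and `answer_tool` is a hypothetical stand-in for a real agent tool.

```python
# drills/chaos_demo.py -- scripted flag flip against the chaos decorator
import asyncio
from functools import wraps

CHAOS_FLAGS = {"force_hallucination": False}  # as in chaos/decorators.py

def chaos(flag_name):
    def wrapper(func):
        @wraps(func)
        async def inner(*args, **kwargs):
            if CHAOS_FLAGS.get(flag_name):
                raise RuntimeError(f"forced chaos {flag_name}")
            return await func(*args, **kwargs)
        return inner
    return wrapper

@chaos("force_hallucination")
async def answer_tool(prompt):  # hypothetical tool under test
    return f"answer for {prompt!r}"

async def drill():
    ok = await answer_tool("refund #1234")     # normal path succeeds
    CHAOS_FLAGS["force_hallucination"] = True  # flip the flag mid-drill
    try:
        await answer_tool("refund #1234")
        failed = False
    except RuntimeError:
        failed = True                          # detectors should fire on this error
    CHAOS_FLAGS["force_hallucination"] = False # always restore after the drill
    return ok, failed
```

Restoring the flag in all paths matters: a drill that leaves chaos enabled becomes a real incident.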


Step 4: Capture and Grade Results

After each drill, store metrics plus pass/fail status.

# drills/results.py
import json
from datetime import datetime, timezone

class DrillRecorder:
    def __init__(self, path="logs/drills.ndjson"):
        self.path = path

    def record(self, name, status, metrics):
        entry = {
            # timezone-aware timestamp; datetime.utcnow() is deprecated
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "name": name,
            "status": status,
            "metrics": metrics
        }
        with open(self.path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(entry) + "\n")

Log outputs such as p95_latency, detector_delay_ms, and pager_ack_seconds. Plot them in Grafana to spot regressions.
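Deriving those summary numbers from raw samples is straightforward; here is one sketch using the nearest-rank percentile method (the function names are illustrative, and real drills would pull latency samples from the load-test output):

```python
# drills/metrics.py -- summarize raw drill samples into the metrics we record
def p95(samples):
    """95th percentile by nearest-rank: sort, take the value at ceil(0.95 * n)."""
    ordered = sorted(samples)
    index = max(0, -(-len(ordered) * 95 // 100) - 1)  # ceil(n * 0.95) - 1
    return ordered[index]

def summarize(latencies_s, tool_calls):
    """latencies_s: mission latencies in seconds; tool_calls: booleans (True = success)."""
    return {
        "p95_latency_s": p95(latencies_s),
        "tool_success_rate": sum(tool_calls) / len(tool_calls),
    }
```

The resulting dict plugs straight into `DrillRecorder.record` as the `metrics` argument.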


Step 5: Automate the Drill Pipeline

  1. Create a GitHub Actions workflow that:
    • Deploys staging stack.
    • Runs Locust + Artillery.
    • Toggles chaos flags for 5 minutes.
    • Uploads drill results artifact.
  2. Fail the pipeline if any SLO breach occurs.
  3. Notify Slack with the drill report and link to the NDJSON logs.

Example Actions snippet:

- name: Run Locust
  run: |
    locust -f drills/locustfile.py --headless -u 200 -r 20 -t 15m
- name: Toggle chaos
  run: |
    curl -X POST "$CHAOS_URL/enable/toolTimeout"
    sleep 300
    curl -X POST "$CHAOS_URL/disable/toolTimeout"

Step 6: Implementation Checklist

  • Publish SLO + drill matrix to the repo.
  • Add Locust/Artillery generators referencing staging endpoints.
  • Wire chaos toggles or feature flags into risky tools.
  • Record drill metrics (latency, detector delay, pager ack).
  • Automate the drill pipeline and alert on failures.

With drills in place, you know your agent and alerting stack behave before shipping campaigns. Part 4 will close the loop with postmortems, experiment tracking, and a reliability backlog.

