ai-agents · reliability · load-testing · chaos · tutorial

Agent Reliability Playbook - Part 3: Run Reliability Drills Before Production

By AgentForge Hub · 2/24/2025 · 4 min read
Advanced

Once alerts exist, the best time to trigger them is before customers do. Part 3 focuses on stress testing and chaos drills so you can prove your agent meets SLOs and that your detectors from Part 2 actually fire.

Scenario: Friday Drill Before a Marketing Launch

Marketing wants to feature the Escalation Concierge in the newsletter tomorrow. Ops requires proof that:

  • The agent survives 200 concurrent incidents.
  • Tool failures trigger Sev2 alerts within 60 seconds.
  • Humans can replay any synthetic mission for audits.
  • Feature flags let you kill risky capabilities instantly.

We will create a reliability drill kit that runs every Friday.


Step 1: Define SLOs and Drill Matrix

List the behaviors you must test each week.

SLO                   Target                  Probe
Mission latency       p95 < 2.5s              Load test 200 concurrent missions
Detector time         Critical alert < 60s    Chaos script that forces refund errors
Tool success          99% success             Inject timeouts, ensure retries recover
Replay availability   100%                    Sample 20 transcripts & verify NDJSON exists

Store this matrix in docs/drill-matrix.md so executives and engineers agree on the gate criteria.
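If you also keep a machine-readable copy of the matrix, the drill pipeline in Step 5 can gate on the same numbers engineers agreed to. A minimal sketch, assuming a drills/slo.py module; the file name, keys, and helper are illustrative:

# drills/slo.py -- machine-readable copy of the drill matrix (illustrative names)
SLO_TARGETS = {
    "p95_latency_ms": 2500,        # mission latency p95 < 2.5s
    "detector_delay_ms": 60_000,   # critical alert fires within 60s
    "tool_success_rate": 0.99,     # 99% tool success after retries
    "replay_coverage": 1.0,        # 100% of sampled transcripts have NDJSON replays
}

def breaches(metrics: dict) -> list[str]:
    """Return the names of SLOs the measured metrics violate."""
    failed = []
    if metrics.get("p95_latency_ms", 0) > SLO_TARGETS["p95_latency_ms"]:
        failed.append("p95_latency_ms")
    if metrics.get("detector_delay_ms", 0) > SLO_TARGETS["detector_delay_ms"]:
        failed.append("detector_delay_ms")
    if metrics.get("tool_success_rate", 1.0) < SLO_TARGETS["tool_success_rate"]:
        failed.append("tool_success_rate")
    if metrics.get("replay_coverage", 1.0) < SLO_TARGETS["replay_coverage"]:
        failed.append("replay_coverage")
    return failed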


Step 2: Generate Synthetic Missions

Python Locust scripts

# drills/locustfile.py
from locust import HttpUser, task, between
import random
import uuid

class MissionUser(HttpUser):
    wait_time = between(0.1, 1)

    # Routine refund missions outnumber edge cases 3:1.
    @task(3)
    def submit_incident(self):
        payload = {
            "user_id": str(uuid.uuid4()),
            "history": [
                {"role": "user", "content": "Refund order #{}".format(random.randint(1000, 9999))}
            ]
        }
        self.client.post("/agent/mission", json=payload)

    # Edge case: an oversized refund that should trip the refund-limit detector.
    @task(1)
    def trigger_refund_edge(self):
        payload = {
            "user_id": str(uuid.uuid4()),
            "history": [{"role": "user", "content": "Refund $5000 for #{}".format(random.randint(1000, 9999))}],
            "metadata": {"force_refund_limit": True}
        }
        self.client.post("/agent/mission", json=payload)

Node Artillery profile

# drills/artillery.yml
config:
  target: "https://staging.lumenly.ai"
  phases:
    - duration: 900
      arrivalRate: 50
      rampTo: 200
      name: "Launch rehearsal"
scenarios:
  - flow:
      - post:
          url: "/agent/mission"
          json:
            user_id: "{{ uuid }}"
            history:
              - role: "user"
                content: "Customer wants refund {{ $randomNumber(1000, 9999) }}"

Run both so the HTTP entry point is exercised by two independent generators; if your agent also exposes a websocket entry point, add a matching scenario for it. A small runner script (sketched below) keeps the Friday invocation in one place.
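A minimal sketch of that runner, assuming the Locust and Artillery CLIs are installed and reusing the staging host from the Artillery profile; drills/run_load.py and the flag values are illustrative:

# drills/run_load.py -- invoke both load generators and report their exit status.
import subprocess
import sys

COMMANDS = {
    "locust": [
        "locust", "-f", "drills/locustfile.py", "--headless",
        "-u", "200", "-r", "20", "-t", "15m",
        "--host", "https://staging.lumenly.ai",
    ],
    "artillery": ["artillery", "run", "drills/artillery.yml"],
}

def main() -> int:
    failures = []
    for name, cmd in COMMANDS.items():
        print(f"--> running {name}")
        result = subprocess.run(cmd)
        if result.returncode != 0:
            failures.append(name)
    if failures:
        print(f"load generators failed: {', '.join(failures)}")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())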


Step 3: Inject Failures with Feature Flags

Use a feature-flag service or environment toggles to force common issues.

Node chaos toggles

// chaos/toggles.js
// In-memory chaos flags; swap for your feature-flag service if you have one.
import { realTool } from "./real-tool.js"; // adjust to wherever the real tool implementation lives

const flags = {
  toolTimeout: false,
  hallucinationMode: false
};

export function enable(flag) {
  flags[flag] = true;
}

export function disable(flag) {
  flags[flag] = false;
}

export async function runTool(payload) {
  if (flags.toolTimeout) {
    // Simulate a hung tool: wait 4s, then reject so retry and alerting paths get exercised.
    await new Promise((_, reject) => setTimeout(() => reject(new Error("forced timeout")), 4000));
  }
  return realTool(payload);
}

Python chaos decorator

# chaos/decorators.py
from functools import wraps

CHAOS_FLAGS = {"force_hallucination": False}

def chaos(flag_name):
    """Wrap an async tool so it raises whenever the named chaos flag is switched on."""
    def wrapper(func):
        @wraps(func)
        async def inner(*args, **kwargs):
            if CHAOS_FLAGS.get(flag_name):
                raise RuntimeError(f"forced chaos {flag_name}")
            return await func(*args, **kwargs)
        return inner
    return wrapper

Flip these flags during drills to validate detectors and recovery steps.
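To prove the 60-second detector SLO, time the gap between flipping a flag and the alert appearing. The sketch below assumes the chaos toggles are exposed over HTTP (matching the CI snippet in Step 5) and that your Part 2 detectors expose an active-alerts endpoint; the URLs, the /alerts/active route, and the response shape are assumptions to adapt:

# drills/measure_detector.py -- time how long a detector takes to fire once chaos is enabled.
import time
import requests

CHAOS_URL = "https://staging.lumenly.ai/chaos"          # assumed chaos toggle service
ALERTS_URL = "https://staging.lumenly.ai/alerts/active"  # assumed Part 2 alerts endpoint

def measure_detector_delay(flag: str = "toolTimeout", timeout_s: int = 300):
    """Enable a chaos flag, poll for a Sev2 alert, and return detector_delay_ms (or None)."""
    requests.post(f"{CHAOS_URL}/enable/{flag}", timeout=10)
    started = time.monotonic()
    try:
        while time.monotonic() - started < timeout_s:
            alerts = requests.get(ALERTS_URL, timeout=10).json()
            if any(a.get("severity") == "sev2" for a in alerts):
                return (time.monotonic() - started) * 1000
            time.sleep(5)
        return None  # detector never fired: the drill fails
    finally:
        requests.post(f"{CHAOS_URL}/disable/{flag}", timeout=10)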


Step 4: Capture and Grade Results

After each drill, store metrics plus pass/fail status.

# drills/results.py
import json
from datetime import datetime, timezone

class DrillRecorder:
    """Append one NDJSON line per drill so results stay diff-able and easy to plot."""

    def __init__(self, path="logs/drills.ndjson"):
        self.path = path

    def record(self, name, status, metrics):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "name": name,
            "status": status,
            "metrics": metrics
        }
        with open(self.path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(entry) + "\n")

Log outputs such as p95_latency, detector_delay_ms, and pager_ack_seconds. Plot them in Grafana to spot regressions.
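For example, a Friday run might record its grade like this; the metric values are illustrative, and the import path simply mirrors drills/results.py above:

# drills/run_friday.py -- record one drill's outcome (metric names match the ones above).
from drills.results import DrillRecorder

recorder = DrillRecorder()
metrics = {
    "p95_latency_ms": 2140,
    "detector_delay_ms": 42_000,
    "pager_ack_seconds": 95,
}
status = "pass" if metrics["p95_latency_ms"] < 2500 and metrics["detector_delay_ms"] < 60_000 else "fail"
recorder.record("friday-launch-rehearsal", status, metrics)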


Step 5: Automate the Drill Pipeline

  1. Create a GitHub Actions workflow that:
    • Deploys staging stack.
    • Runs Locust + Artillery.
    • Toggles chaos flags for 5 minutes.
    • Uploads drill results artifact.
  2. Fail the pipeline if any SLO breach occurs (a gate sketch follows the Actions snippet below).
  3. Notify Slack with the drill report and link to the NDJSON logs.

Example Actions snippet:

- name: Run Locust
  run: |
    locust -f drills/locustfile.py --headless -u 200 -r 20 -t 15m --host https://staging.lumenly.ai
- name: Toggle chaos
  run: |
    curl -X POST "$CHAOS_URL/enable/toolTimeout"
    sleep 300
    curl -X POST "$CHAOS_URL/disable/toolTimeout"
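Then add a final gate step that reads the drill log and fails the job on any breach. A minimal sketch, reusing the breaches() helper from the Step 1 matrix module and the NDJSON written by DrillRecorder; drills/gate.py is an illustrative name:

# drills/gate.py -- fail the CI job when the latest drill breached an SLO.
import json
import sys

from drills.slo import breaches

def latest_drill(path: str = "logs/drills.ndjson") -> dict:
    """Return the most recent drill entry appended by DrillRecorder."""
    with open(path, encoding="utf-8") as fh:
        return json.loads(fh.readlines()[-1])

def main() -> int:
    drill = latest_drill()
    failed = breaches(drill["metrics"])
    if drill["status"] != "pass" or failed:
        print(f"SLO breach in drill '{drill['name']}': {failed or drill['status']}")
        return 1
    print("all SLOs met, safe to ship")
    return 0

if __name__ == "__main__":
    sys.exit(main())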

Step 6: Implementation Checklist

  • Publish SLO + drill matrix to the repo.
  • Add Locust/Artillery generators referencing staging endpoints.
  • Wire chaos toggles or feature flags into risky tools.
  • Record drill metrics (latency, detector delay, pager ack).
  • Automate the drill pipeline and alert on failures.

With drills in place, you know your agent and alerting stack behave before shipping campaigns. Part 4 will close the loop with postmortems, experiment tracking, and a reliability backlog.



