Agent Reliability Playbook - Part 3: Run Reliability Drills Before Production

Once alerts exist, the best time to trigger them is before customers do. Part 3 focuses on stress testing and chaos drills so you can prove your agent meets SLOs and that your detectors from Part 2 actually fire.
Scenario: Friday Drill Before a Marketing Launch
Marketing wants to feature the Escalation Concierge in the newsletter tomorrow. Ops requires proof that:
- The agent survives 200 concurrent incidents.
- Tool failures trigger Sev2 alerts within 60 seconds.
- Humans can replay any synthetic mission for audits.
- Feature flags let you kill risky capabilities instantly.
We will create a reliability drill kit that runs every Friday.
Step 1: Define SLOs and Drill Matrix
List the behaviors you must test each week.
| SLO | Target | Probe |
|---|---|---|
| Mission latency | p95 < 2.5s | Load test 200 concurrent missions |
| Detector time | Critical alert < 60s | Chaos script that forces refund errors |
| Tool success | 99% success | Inject timeouts, ensure retries recover |
| Replay availability | 100% | Sample 20 transcripts & verify NDJSON exists |
Store this matrix in docs/drill-matrix.md so executives and engineers agree on the gate criteria.
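One way to keep the gate criteria unambiguous is to mirror the matrix in code so a drill can be graded mechanically. The sketch below is illustrative only: the `Slo` class, the `find_breaches` helper, and the metric key names (`p95_latency_s`, `detector_delay_s`, `tool_success_rate`, `replay_availability`) are assumptions, not part of the playbook's codebase. Only the targets come from the table above.

```python
import operator
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Slo:
    name: str
    metric: str  # hypothetical key in the drill metrics dict
    compare: Callable[[float, float], bool]
    target: float


# Targets mirror the drill matrix; the metric key names are assumptions.
DRILL_MATRIX = [
    Slo("Mission latency", "p95_latency_s", operator.lt, 2.5),
    Slo("Detector time", "detector_delay_s", operator.lt, 60.0),
    Slo("Tool success", "tool_success_rate", operator.ge, 0.99),
    Slo("Replay availability", "replay_availability", operator.ge, 1.0),
]


def find_breaches(metrics: dict) -> list:
    """Return the names of SLOs that are breached or have no measurement."""
    breaches = []
    for slo in DRILL_MATRIX:
        value = metrics.get(slo.metric)
        # A missing measurement counts as a breach so gaps cannot pass silently.
        if value is None or not slo.compare(value, slo.target):
            breaches.append(slo.name)
    return breaches
```

Treating a missing metric as a breach is a deliberate choice here: a drill that forgot to measure detector delay should not be reported as green.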
Step 2: Generate Synthetic Missions
Python Locust scripts
```python
# drills/locustfile.py
import random
import uuid

from locust import HttpUser, task, between


class MissionUser(HttpUser):
    # Each simulated user pauses 0.1 to 1 second between tasks.
    wait_time = between(0.1, 1)

    @task(3)
    def submit_incident(self):
        payload = {
            "user_id": str(uuid.uuid4()),
            "history": [
                {"role": "user", "content": "Refund order #{}".format(random.randint(1000, 9999))}
            ],
        }
        self.client.post("/agent/mission", json=payload)

    @task(1)
    def trigger_refund_edge(self):
        # Runs a third as often as submit_incident and forces the refund-limit path.
        payload = {
            "user_id": str(uuid.uuid4()),
            "history": [{"role": "user", "content": "Refund $5000 for #{}".format(random.randint(1000, 9999))}],
            "metadata": {"force_refund_limit": True},
        }
        self.client.post("/agent/mission", json=payload)
```
Node Artillery profile
```yaml
# drills/artillery.yml
config:
  target: "https://staging.lumenly.ai"
  phases:
    - duration: 900
      arrivalRate: 50
      rampTo: 200
      name: "Launch rehearsal"
scenarios:
  - flow:
      - post:
          url: "/agent/mission"
          json:
            user_id: "{{ uuid }}"
            history:
              - role: "user"
                content: "Customer wants refund {{ $randomNumber(1000, 9999) }}"
```
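Run both against staging. As written, both profiles send HTTP POST requests to /agent/mission, so if your agent also exposes a websocket entry point, add a separate scenario for it (Artillery ships a websocket engine for this).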
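The scenario also requires that humans can replay any synthetic mission, which is hard when every payload is random. A minimal sketch of one approach, assuming a seeded generator is acceptable for your drills: build the payloads from a seed and write them to NDJSON so the same missions can be regenerated or re-sent later. The function names and the file layout are hypothetical.

```python
import json
import random
import uuid


def make_missions(seed: int, count: int) -> list:
    """Build `count` mission payloads that are identical for a given seed."""
    rng = random.Random(seed)
    missions = []
    for _ in range(count):
        order_id = rng.randint(1000, 9999)
        missions.append({
            # Derive the UUID from seeded bits so a replay gets the same user_id.
            "user_id": str(uuid.UUID(int=rng.getrandbits(128), version=4)),
            "history": [{"role": "user", "content": f"Refund order #{order_id}"}],
        })
    return missions


def save_missions(missions: list, path: str) -> None:
    """Write one JSON payload per line so auditors can replay them later."""
    with open(path, "w", encoding="utf-8") as fh:
        for mission in missions:
            fh.write(json.dumps(mission) + "\n")
```

Recording the seed next to each drill result is enough to reconstruct the exact inputs; a load generator could then read the saved file instead of calling `random` directly.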
Step 3: Inject Failures with Feature Flags
Use a feature-flag service or environment toggles to force common issues.
Node chaos toggles
```javascript
// chaos/toggles.js
// realTool is the production tool call; import it from wherever your project defines it.
const flags = {
  toolTimeout: false,
  hallucinationMode: false
};

export function enable(flag) {
  flags[flag] = true;
}

export function disable(flag) {
  flags[flag] = false;
}

export async function runTool(payload) {
  if (flags.toolTimeout) {
    // Simulate a hung tool: wait 4 seconds, then reject.
    await new Promise((_, reject) =>
      setTimeout(() => reject(new Error("forced timeout")), 4000)
    );
  }
  return realTool(payload);
}
```
Python chaos decorator
```python
# chaos/decorators.py
from functools import wraps

CHAOS_FLAGS = {"force_hallucination": False}


def chaos(flag_name):
    """Raise instead of calling the wrapped coroutine while `flag_name` is set."""
    def wrapper(func):
        @wraps(func)
        async def inner(*args, **kwargs):
            if CHAOS_FLAGS.get(flag_name):
                raise RuntimeError(f"forced chaos {flag_name}")
            return await func(*args, **kwargs)
        return inner
    return wrapper
```
Flip these flags during drills to validate detectors and recovery steps.
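A flag left on after a drill is itself an incident, so it helps to scope each flip. The sketch below repeats the decorator so it runs on its own and adds a hypothetical `chaos_enabled` context manager that restores the previous value even if the drill raises. `draft_reply` is a stand-in for a real model call, not a function from the playbook.

```python
import asyncio
from contextlib import contextmanager
from functools import wraps

CHAOS_FLAGS = {"force_hallucination": False}


def chaos(flag_name):
    def wrapper(func):
        @wraps(func)
        async def inner(*args, **kwargs):
            if CHAOS_FLAGS.get(flag_name):
                raise RuntimeError(f"forced chaos {flag_name}")
            return await func(*args, **kwargs)
        return inner
    return wrapper


@contextmanager
def chaos_enabled(flag_name):
    """Turn a chaos flag on for the duration of a block, then restore it."""
    previous = CHAOS_FLAGS.get(flag_name, False)
    CHAOS_FLAGS[flag_name] = True
    try:
        yield
    finally:
        CHAOS_FLAGS[flag_name] = previous


@chaos("force_hallucination")
async def draft_reply(prompt):
    # Hypothetical stand-in for a real model call.
    return f"reply to: {prompt}"


async def run_drill():
    """Return (failed_under_chaos, recovered_after)."""
    with chaos_enabled("force_hallucination"):
        try:
            await draft_reply("refund status")
            failed = False
        except RuntimeError:
            failed = True
    recovered = await draft_reply("refund status") == "reply to: refund status"
    return failed, recovered
```

Calling `asyncio.run(run_drill())` exercises both halves of the drill: the forced failure while the flag is set, and normal behavior once it is restored.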
Step 4: Capture and Grade Results
After each drill, store metrics plus pass/fail status.
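```python
# drills/results.py
import json
import os
from datetime import datetime, timezone


class DrillRecorder:
    def __init__(self, path="logs/drills.ndjson"):
        self.path = path

    def record(self, name, status, metrics):
        entry = {
            # Timezone-aware UTC; datetime.utcnow() is deprecated since Python 3.12.
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "name": name,
            "status": status,
            "metrics": metrics,
        }
        # Create the log directory on first use so the append cannot fail.
        os.makedirs(os.path.dirname(self.path) or ".", exist_ok=True)
        with open(self.path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(entry) + "\n")
```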
Log outputs such as p95_latency, detector_delay_ms, and pager_ack_seconds. Plot them in Grafana to spot regressions.
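Before the numbers reach a dashboard, the drill itself can compute p95 and flag an obvious regression. This is a rough sketch under stated assumptions: it uses the nearest-rank percentile definition, compares only the two most recent entries, and treats a 10% increase as a regression. The helper names and the tolerance are illustrative choices, not values from the playbook.

```python
import json


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    # Integer ceiling of 0.95 * n, written without floats to avoid rounding drift.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def load_entries(path):
    """Read drill entries from an NDJSON file, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def regressed(entries, metric="p95_latency", tolerance=0.10):
    """True if the latest value of `metric` exceeds the previous one by more than `tolerance`."""
    values = [e["metrics"][metric] for e in entries if metric in e.get("metrics", {})]
    if len(values) < 2:
        return False  # not enough history to compare
    previous, latest = values[-2], values[-1]
    return latest > previous * (1 + tolerance)
```

Comparing against only the previous run is noisy. A longer baseline, such as the median of the last several drills, would be a reasonable refinement once enough history exists.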
Step 5: Automate the Drill Pipeline
- Create a GitHub Actions workflow that:
  - Deploys staging stack.
  - Runs Locust + Artillery.
  - Toggles chaos flags for 5 minutes.
  - Uploads drill results artifact.
- Fail the pipeline if any SLO breach occurs.
- Notify Slack with the drill report and link to the NDJSON logs.
Example Actions snippet:
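```yaml
- name: Run Locust with a chaos window
  run: |
    # Start load in the background so the chaos window overlaps live traffic.
    locust -f drills/locustfile.py --headless -u 200 -r 20 -t 15m \
      --host https://staging.lumenly.ai &
    LOCUST_PID=$!
    sleep 300
    curl -fsS -X POST "$CHAOS_URL/enable/toolTimeout"
    sleep 300
    curl -fsS -X POST "$CHAOS_URL/disable/toolTimeout"
    # Propagate Locust's exit code so a failed run fails the step.
    wait "$LOCUST_PID"
```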
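To fail the pipeline on an SLO breach, a final step can read the drill log and turn it into an exit code. The sketch below assumes the log holds only the current run's entries (a fresh CI workspace) and that `status` is the literal string `"pass"` for a passing drill. Both are assumptions, as are the `gate` and `main` names.

```python
import json
import sys


def gate(lines):
    """Return 0 if every drill entry passed, 1 otherwise (including no entries)."""
    entries = [json.loads(line) for line in lines if line.strip()]
    failed = [e.get("name", "<unnamed>") for e in entries if e.get("status") != "pass"]
    for name in failed:
        print(f"SLO breach in drill: {name}", file=sys.stderr)
    # An empty log means the drill never recorded anything, which should not pass.
    return 1 if failed or not entries else 0


def main(argv):
    """Grade the NDJSON file given as the first argument, or the default path."""
    path = argv[1] if len(argv) > 1 else "logs/drills.ndjson"
    with open(path, encoding="utf-8") as fh:
        return gate(fh)
```

Ending the script with `sys.exit(main(sys.argv))` would let a workflow step fail whenever any recorded drill did not pass.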
Step 6: Implementation Checklist
- Publish SLO + drill matrix to the repo.
- Add Locust/Artillery generators referencing staging endpoints.
- Wire chaos toggles or feature flags into risky tools.
- Record drill metrics (latency, detector delay, pager ack).
- Automate the drill pipeline and alert on failures.
With drills in place, you know your agent and alerting stack behave before shipping campaigns. Part 4 will close the loop with postmortems, experiment tracking, and a reliability backlog.



