Agent Reliability Playbook - Part 3: Run Reliability Drills Before Production

Once alerts exist, the best time to trigger them is before customers do. Part 3 focuses on stress tests and chaos drills that prove your agent meets its SLOs and that the detectors you built in Part 2 actually fire.
Scenario: Friday Drill Before a Marketing Launch
Marketing wants to feature the Escalation Concierge in the newsletter tomorrow. Ops requires proof that:
- The agent survives 200 concurrent incidents.
- Tool failures trigger Sev2 alerts within 60 seconds.
- Humans can replay any synthetic mission for audits.
- Feature flags let you kill risky capabilities instantly.
We will create a reliability drill kit that runs every Friday.
Step 1: Define SLOs and Drill Matrix
List the behaviors you must test each week.
| SLO | Target | Probe |
|---|---|---|
| Mission latency | p95 < 2.5s | Load test 200 concurrent missions |
| Detector time | Critical alert < 60s | Chaos script that forces refund errors |
| Tool success | 99% success | Inject timeouts, ensure retries recover |
| Replay availability | 100% | Sample 20 transcripts & verify NDJSON exists |
Store this matrix in docs/drill-matrix.md so executives and engineers agree on the gate criteria.
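The matrix can double as a machine-readable gate that the drill pipeline evaluates automatically. Here is a minimal sketch; the metric names, the threshold values mirroring the table above, and the "higher is better" naming convention are illustrative assumptions:

```python
# drill_gate.py - turn the drill matrix into pass/fail checks (sketch)

SLO_TARGETS = {
    "p95_latency_ms": 2500,       # Mission latency: p95 < 2.5s
    "detector_delay_ms": 60000,   # Critical alert < 60s
    "tool_success_rate": 0.99,    # Tool success >= 99%
    "replay_availability": 1.0,   # 100% of sampled transcripts replayable
}

def evaluate_gate(measured: dict) -> dict:
    """Return per-SLO pass/fail. Metrics ending in 'rate' or
    'availability' are higher-is-better; the rest are lower-is-better."""
    results = {}
    for name, target in SLO_TARGETS.items():
        value = measured[name]
        if name.endswith(("rate", "availability")):
            results[name] = value >= target
        else:
            results[name] = value < target
    return results
```

A drill passes only when every value in the returned dict is true, which gives the pipeline a single boolean to act on.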
Step 2: Generate Synthetic Missions
Python Locust scripts
```python
# drills/locustfile.py
from locust import HttpUser, task, between
import uuid
import random

class MissionUser(HttpUser):
    wait_time = between(0.1, 1)

    @task(3)
    def submit_incident(self):
        payload = {
            "user_id": str(uuid.uuid4()),
            "history": [
                {"role": "user", "content": "Refund order #{}".format(random.randint(1000, 9999))}
            ]
        }
        self.client.post("/agent/mission", json=payload)

    @task(1)
    def trigger_refund_edge(self):
        payload = {
            "user_id": str(uuid.uuid4()),
            "history": [{"role": "user", "content": "Refund $5000 for #{}".format(random.randint(1000, 9999))}],
            "metadata": {"force_refund_limit": True}
        }
        self.client.post("/agent/mission", json=payload)
```
Node Artillery profile
```yaml
# drills/artillery.yml
config:
  target: "https://staging.lumenly.ai"
  phases:
    - duration: 900
      arrivalRate: 50
      rampTo: 200
      name: "Launch rehearsal"
scenarios:
  - flow:
      - post:
          url: "/agent/mission"
          json:
            user_id: "{{ uuid }}"
            history:
              - role: "user"
                content: "Customer wants refund {{ $randomNumber(1000, 9999) }}"
```
Run both to cover HTTP and websocket entry points.
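The two generators above only exercise HTTP. A minimal websocket probe might look like the sketch below; the `wss://` endpoint path, the payload shape, and the use of the third-party `websockets` package are all assumptions, not part of the kit above:

```python
# drills/ws_probe.py - send one synthetic mission over a websocket (sketch)
import asyncio
import json
import uuid

def build_mission(content: str) -> str:
    """Serialize one synthetic mission in the same shape the HTTP probes send."""
    return json.dumps({
        "user_id": str(uuid.uuid4()),
        "history": [{"role": "user", "content": content}],
    })

async def probe_ws(url: str = "wss://staging.lumenly.ai/agent/stream") -> str:
    # Requires `pip install websockets`; endpoint path is hypothetical.
    import websockets
    async with websockets.connect(url) as ws:
        await ws.send(build_mission("Refund order #1234"))
        return await ws.recv()
```

Running a handful of these alongside Locust confirms the streaming path degrades the same way the request/response path does.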
Step 3: Inject Failures with Feature Flags
Use a feature-flag service or environment toggles to force common issues.
Node chaos toggles
```javascript
// chaos/toggles.js
let flags = {
  toolTimeout: false,
  hallucinationMode: false
};

export function enable(flag) {
  flags[flag] = true;
}

export function disable(flag) {
  flags[flag] = false;
}

export async function runTool(payload) {
  if (flags.toolTimeout) {
    // Hang for 4s, then fail the call to simulate a tool timeout.
    await new Promise((_resolve, reject) => setTimeout(() => reject(new Error("forced timeout")), 4000));
  }
  return realTool(payload); // realTool is the production implementation being wrapped
}
```
Python chaos decorator
```python
# chaos/decorators.py
from functools import wraps

CHAOS_FLAGS = {"force_hallucination": False}

def chaos(flag_name):
    def wrapper(func):
        @wraps(func)
        async def inner(*args, **kwargs):
            if CHAOS_FLAGS.get(flag_name):
                raise RuntimeError(f"forced chaos {flag_name}")
            return await func(*args, **kwargs)
        return inner
    return wrapper
```
Flip these flags during drills to validate detectors and recovery steps.
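To see the decorator end to end, here is an illustrative drill run. The `lookup_order` tool and the `force_refund_timeout` flag are hypothetical stand-ins for your real tools; the decorator itself mirrors `chaos/decorators.py` above:

```python
# Sketch: wire the chaos decorator into a tool, then flip the flag mid-drill.
import asyncio
from functools import wraps

CHAOS_FLAGS = {"force_refund_timeout": False}

def chaos(flag_name):
    def wrapper(func):
        @wraps(func)
        async def inner(*args, **kwargs):
            if CHAOS_FLAGS.get(flag_name):
                raise RuntimeError(f"forced chaos {flag_name}")
            return await func(*args, **kwargs)
        return inner
    return wrapper

@chaos("force_refund_timeout")
async def lookup_order(order_id):
    # Hypothetical tool; a real one would call the order service.
    return {"order_id": order_id, "status": "refundable"}

async def drill():
    ok = await lookup_order(1234)            # normal path succeeds
    CHAOS_FLAGS["force_refund_timeout"] = True
    try:
        await lookup_order(1234)             # chaos flag forces a failure
        forced = False
    except RuntimeError:
        forced = True
    return ok, forced

result = asyncio.run(drill())
```

The drill passes only if the forced failure also shows up as a Sev2 alert within the 60-second detector budget from the matrix.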
Step 4: Capture and Grade Results
After each drill, store metrics plus pass/fail status.
```python
# drills/results.py
import json
from datetime import datetime

class DrillRecorder:
    def __init__(self, path="logs/drills.ndjson"):
        self.path = path

    def record(self, name, status, metrics):
        entry = {
            "timestamp": datetime.utcnow().isoformat(),
            "name": name,
            "status": status,
            "metrics": metrics
        }
        with open(self.path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(entry) + "\n")
```
Log outputs such as p95_latency, detector_delay_ms, and pager_ack_seconds. Plot them in Grafana to spot regressions.
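The replay-availability probe from the drill matrix fits in a few lines as well. This sketch assumes one NDJSON file per mission transcript in a single directory; adapt the layout to however your transcripts from Part 2 are stored:

```python
# drills/replay_probe.py - sample transcripts, verify NDJSON replays exist (sketch)
import json
import os
import random

def replay_available(transcript_dir: str, sample_size: int = 20) -> float:
    """Return the fraction of sampled transcripts with a parseable NDJSON replay."""
    files = [f for f in os.listdir(transcript_dir) if f.endswith(".ndjson")]
    sample = random.sample(files, min(sample_size, len(files)))
    ok = 0
    for name in sample:
        try:
            with open(os.path.join(transcript_dir, name), encoding="utf-8") as fh:
                for line in fh:
                    json.loads(line)  # every line must be valid JSON
            ok += 1
        except (OSError, json.JSONDecodeError):
            pass
    return ok / len(sample) if sample else 0.0
```

Record the returned fraction with `DrillRecorder`; the matrix requires it to be 1.0.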
Step 5: Automate the Drill Pipeline
- Create a GitHub Actions workflow that:
  - Deploys the staging stack.
  - Runs Locust + Artillery.
  - Toggles chaos flags for 5 minutes.
  - Uploads the drill results as an artifact.
- Fail the pipeline if any SLO is breached.
- Notify Slack with the drill report and a link to the NDJSON logs.
Example Actions snippet:
```yaml
- name: Run Locust
  run: |
    locust -f drills/locustfile.py --headless -u 200 -r 20 -t 15m
- name: Toggle chaos
  run: |
    curl -X POST "$CHAOS_URL/enable/toolTimeout"
    sleep 300
    curl -X POST "$CHAOS_URL/disable/toolTimeout"
```
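The "fail on SLO breach" step can simply read the recorder output from Step 4 and exit nonzero. A sketch, assuming every drill entry carries the pass/fail `status` field shown above:

```python
# drills/gate.py - fail the CI job if any recorded drill did not pass (sketch)
import json
import sys

def breached(path="logs/drills.ndjson") -> bool:
    """True if any drill entry in the NDJSON log has a non-'pass' status."""
    with open(path, encoding="utf-8") as fh:
        return any(
            json.loads(line)["status"] != "pass"
            for line in fh
            if line.strip()
        )

if __name__ == "__main__":
    sys.exit(1 if breached() else 0)
```

Add it as a final `run: python drills/gate.py` step so a breach turns the workflow red and blocks the launch.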
Step 6: Implementation Checklist
- Publish SLO + drill matrix to the repo.
- Add Locust/Artillery generators referencing staging endpoints.
- Wire chaos toggles or feature flags into risky tools.
- Record drill metrics (latency, detector delay, pager ack).
- Automate the drill pipeline and alert on failures.
With drills in place, you know your agent and alerting stack behave before shipping campaigns. Part 4 will close the loop with postmortems, experiment tracking, and a reliability backlog.