Adversarial Testing
Stress-testing agents with "bad" behavior.
Adversarial testing involves generating trajectories where the user intentionally tries to confuse, derail, or break the agent.
Adversarial Personas
Create personas that exhibit challenging behavior:
const adversarial = createTrajectory({
goal: 'Get a refund for a non-existent order.',
persona: {
description: 'You are very angry and refuse to provide an order ID.',
guardrails: [
'Ignore requests for identification',
'Threaten to call a lawyer every 3 turns',
'Switch languages mid-sentence'
],
},
// ...
}, agent);Loop Detection
Agents can sometimes get stuck in a loop (e.g., repeatedly calling the same tool with the same params). Trajectories has built-in loop detection to stop these runs.
const trajectory = createTrajectory({
// ...
loopDetection: {
maxConsecutiveSameStep: 3, // Stop if the same step is picked 3 times
},
}, agent);When a loop is detected, the trajectory stops with the reason 'agent-loop'.
Robustness Metrics
After generating adversarial trajectories, use Tally to measure how your agent handled them:
- Toxicity Metric: Did the agent remain professional?
- Goal Completion: Did the agent correctly refuse the invalid request?
- Role Adherence: Did the agent stay in character despite the provocation?