Make your Agents
Reliable.
Traditional eval frameworks are built for prompts.
Tally is built for Agents.
Stop grading prompts. Start testing behavior. Turn "it feels better" into verifiable, regression-proof confidence.
Multi-Turn
Native Evaluation.
Grading single responses ignores the conversation. An agent can be polite, relevant, and completely fail to solve the user's problem. Tally evaluates tool usage, message history, and state transitions as first-class citizens.
Beyond Row-by-Row
Evaluate the full trajectory, not just individual responses in isolation.
Step-Level Precision
Pinpoint exactly which turn or tool call caused an evaluation failure.
Conversation Verdicts
"Did the agent actually book the flight by step 5?" — not just "was this response polite?"
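To make the idea of a conversation-level verdict concrete, here is a minimal sketch of such a check. The `Turn`, `ToolCall`, and `Verdict` types and the `toolCalledByStep` helper are hypothetical names used for illustration only; they are not Tally's API, and the sketch assumes a step means one assistant turn.

```typescript
// Hypothetical types for illustration; these are not Tally's actual API.
type ToolCall = { name: string; args: Record<string, unknown> };
type Turn = {
  role: 'user' | 'assistant';
  content: string;
  toolCalls?: ToolCall[];
};
type Verdict = { passed: boolean; step: number | null };

// Conversation-level check: was `toolName` called on or before assistant
// step `maxStep`? Steps count assistant turns only, starting at 1.
function toolCalledByStep(turns: Turn[], toolName: string, maxStep: number): Verdict {
  let step = 0;
  for (const turn of turns) {
    if (turn.role !== 'assistant') continue;
    step += 1;
    if (step > maxStep) break;
    if (turn.toolCalls?.some((call) => call.name === toolName)) {
      // Report the step so a failure (or success) can be traced to a turn.
      return { passed: true, step };
    }
  }
  return { passed: false, step: null };
}

const conversation: Turn[] = [
  { role: 'user', content: 'I need a flight to Lisbon on Friday.' },
  { role: 'assistant', content: 'Sure, which airport are you leaving from?' },
  { role: 'user', content: 'Berlin.' },
  {
    role: 'assistant',
    content: 'Booked.',
    toolCalls: [{ name: 'bookFlight', args: { from: 'BER', to: 'LIS' } }],
  },
];

const verdict = toolCalledByStep(conversation, 'bookFlight', 5);
console.log(verdict);
```

A per-response grader would score each assistant message on its own; this check looks at the whole message history and records which step satisfied the goal.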
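// Import paths are assumed from the package names on this page; adjust to your setup.
import { createTrajectory, runTrajectory } from '@tally-evals/trajectories';
import { google } from '@ai-sdk/google';

const trajectory = createTrajectory({
  goal: 'Test edge case for payment',
  persona: { description: 'Angry customer' },
  steps: {
    steps: [
      { id: 'start', instruction: 'Ask for refund' },
      { id: 'final', instruction: 'Confirm receipt' },
    ],
    start: 'start',
    terminals: ['final'],
  },
  userModel: google('gemini-2.0-flash'),
}, myAgent); // myAgent: your existing agent under test

const result = await runTrajectory(trajectory);

Don't wait for users.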
Generate your data.
Hand-writing multi-turn conversation logs is tedious and brittle. Use @tally-evals/trajectories to simulate impatient, confused, or adversarial users at scale. Works with any agent framework.
Works natively with your stack
AI SDK
Mastra
Everything you need.
Engineering primitives, not just scripts. A complete reliability stack with debugging tools built in.
TUI & Dev Server
Visualize traces, debug failures, and analyze results in a beautiful terminal interface or local dev server.
Compile-Time Safety
Catch misconfigured metrics and missing scorers before runtime. If it builds, it runs.
Decoupled Policy
Same metrics, different thresholds. Pass at 0.6 in dev, 0.8 in staging, 0.95 in prod. Zero code duplication.
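The separation of score and policy can be sketched in a few lines. The `Env` type, `thresholds` table, and `passes` function below are hypothetical illustrations, not Tally's API; only the three threshold values come from the text above.

```typescript
// Hypothetical sketch: the metric yields a score, the policy decides pass/fail.
type Env = 'dev' | 'staging' | 'prod';

// Thresholds taken from the example above; everything else is illustrative.
const thresholds: Record<Env, number> = {
  dev: 0.6,
  staging: 0.8,
  prod: 0.95,
};

function passes(score: number, env: Env): boolean {
  return score >= thresholds[env];
}

// One score, computed once, judged against three policies.
const policyScore = 0.85;
const results = {
  dev: passes(policyScore, 'dev'),
  staging: passes(policyScore, 'staging'),
  prod: passes(policyScore, 'prod'),
};
console.log(results); // { dev: true, staging: true, prod: false }
```

Because the threshold lives outside the metric, tightening the bar for production is a data change rather than a second copy of the evaluation code.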
Composable Metrics
Mix LLM-based graders with code assertions. Metrics → Scorers → Evals, all reusable TypeScript objects.
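As a rough picture of what composing graders can look like, the sketch below combines a code assertion with a stand-in for an LLM judge into one weighted score. The `Metric` type, `weightedScorer` helper, and both metrics are hypothetical and not Tally's API; the "judge" is a keyword stub so the example runs without a model call.

```typescript
// Hypothetical shapes for illustration; not Tally's actual API.
type Metric = {
  name: string;
  measure: (output: string) => Promise<number>; // score in [0, 1]
};

// Code assertion: deterministic, cheap, binary.
const mentionsRefund: Metric = {
  name: 'mentionsRefund',
  measure: async (output) => (/refund/i.test(output) ? 1 : 0),
};

// Stand-in for an LLM-based grader. A real one would call a model here;
// this stub only looks for courteous keywords.
const politenessJudge: Metric = {
  name: 'politeness',
  measure: async (output) => (/please|sorry|thank/i.test(output) ? 0.9 : 0.3),
};

// Scorer: combines several metrics into one weighted average.
function weightedScorer(parts: { metric: Metric; weight: number }[]) {
  const totalWeight = parts.reduce((sum, part) => sum + part.weight, 0);
  return async (output: string): Promise<number> => {
    const weighted = await Promise.all(
      parts.map(async (part) => part.weight * (await part.metric.measure(output))),
    );
    return weighted.reduce((sum, value) => sum + value, 0) / totalWeight;
  };
}

const scoreReply = weightedScorer([
  { metric: mentionsRefund, weight: 2 },
  { metric: politenessJudge, weight: 1 },
]);

scoreReply('Sorry about that, your refund is on its way.').then((score) => {
  console.log(score.toFixed(2));
});
```

Both metrics share one interface, so the scorer does not need to know which of them is backed by code and which by a model.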
Aggregator Engine
Statistical summaries for your entire dataset. Mean, Pass Rate, Percentiles, and custom aggregators.
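For readers who want the arithmetic spelled out, here is a minimal sketch of the three named summaries plus one custom aggregator. The function names and the `Aggregator` type are hypothetical, not Tally's API, and the percentile uses the nearest-rank definition, which is an assumption; other definitions interpolate between values.

```typescript
// Hypothetical aggregator shape; not Tally's actual API.
type Aggregator = (scores: number[]) => number;

function assertNonEmpty(scores: number[]): void {
  if (scores.length === 0) throw new Error('cannot aggregate an empty dataset');
}

const mean: Aggregator = (scores) => {
  assertNonEmpty(scores);
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
};

// Fraction of scores at or above the threshold.
function passRate(scores: number[], threshold: number): number {
  assertNonEmpty(scores);
  return scores.filter((s) => s >= threshold).length / scores.length;
}

// Nearest-rank percentile: the smallest value with at least p% of the
// data at or below it.
function percentile(scores: number[], p: number): number {
  assertNonEmpty(scores);
  const sorted = [...scores].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1]!;
}

// A custom aggregator is just another function with the same shape.
const worstCase: Aggregator = (scores) => {
  assertNonEmpty(scores);
  return Math.min(...scores);
};

const datasetScores = [0.4, 0.7, 0.8, 0.9, 1.0];
console.log({
  mean: mean(datasetScores),
  passRate: passRate(datasetScores, 0.8),
  p50: percentile(datasetScores, 50),
  p90: percentile(datasetScores, 90),
  worst: worstCase(datasetScores),
});
```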
CI/CD Ready
Lightweight and fast. Run evaluations in your PR workflow to catch regressions before your users do.
Start measuring
what matters.
From vibes to verdicts. Deploy on Friday, sleep on Saturday.
bun add @tally-evals/tally