Tally v0.1 Public Beta

Make your Agents
Reliable.

Traditional eval frameworks are built for prompts.
Tally is built for Agents.

Stop grading prompts. Start testing behavior. Turn "it feels better" into verifiable, regression-proof confidence.

Get Started
The Tally Advantage

Multi-Turn
Native Evaluation.

Grading single responses ignores the conversation. An agent can be polite and relevant and still completely fail to solve the user's problem. Tally evaluates tool usage, message history, and state transitions as first-class citizens.

Beyond Row-by-Row

Evaluate the full trajectory, not just individual responses in isolation.

Step-Level Precision

Pinpoint exactly which turn or tool call caused an evaluation failure.

Conversation Verdicts

"Did the agent actually book the flight by step 5?" — not just "was this response polite?"
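To make the difference concrete, here is a minimal sketch of a conversation-level check in plain TypeScript. The `Turn` type and the `toolSucceededByStep` helper are hypothetical names invented for this illustration, not Tally's API; the sketch only shows the idea of asking a question about the whole conversation instead of scoring one reply.

```typescript
// Hypothetical types for illustration; these are not Tally's actual API.
type Turn =
  | { role: "user" | "assistant"; content: string }
  | { role: "tool"; name: string; ok: boolean };

// Conversation-level verdict: did a successful call to `toolName`
// happen within the first `maxStep` turns?
function toolSucceededByStep(turns: Turn[], toolName: string, maxStep: number): boolean {
  return turns
    .slice(0, maxStep)
    .some((t) => t.role === "tool" && t.name === toolName && t.ok);
}

const conversation: Turn[] = [
  { role: "user", content: "Book me a flight to Lisbon." },
  { role: "assistant", content: "Sure, searching now." },
  { role: "tool", name: "searchFlights", ok: true },
  { role: "assistant", content: "Found one. Booking it." },
  { role: "tool", name: "bookFlight", ok: true },
];

// The booking lands on turn 5, so the verdict depends on the step budget.
const bookedInTime = toolSucceededByStep(conversation, "bookFlight", 5);
const bookedEarly = toolSucceededByStep(conversation, "bookFlight", 4);
```

Every assistant message here could score well on politeness or relevance, yet the verdict changes only with whether the booking tool call actually succeeded in time.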

[Illustration: conversation turns labeled U, A, and T, annotated with RELEVANCE: 0.94 and CONVERSATION VERDICT: PASS]
trajectory.ts
const trajectory = createTrajectory({
  goal: 'Test edge case for payment',
  persona: { description: 'Angry customer' },
  steps: {
    steps: [
      { id: 'start', instruction: 'Ask for refund' },
      { id: 'final', instruction: 'Confirm receipt' }
    ],
    start: 'start',
    terminals: ['final']
  },
  userModel: google('gemini-2.0-flash'),
}, myAgent);

const result = await runTrajectory(trajectory);
Generation

Don't wait for users.
Generate your data.

Hand-writing multi-turn conversation logs is tedious and brittle. Use @tally-evals/trajectories to simulate impatient, confused, or adversarial users at scale. Works with any agent framework.

User Personas
Step-Graph Paths
Stress Testing
Framework Agnostic
Learn about Trajectories

Works natively with your stack

AI SDK
Mastra

Everything you need.

Engineering primitives, not just scripts. A complete reliability stack with debugging tools built in.

Developer Experience

TUI & Dev Server

Visualize traces, debug failures, and analyze results in a beautiful terminal interface or local dev server.

Type-Safe

Compile-Time Safety

Catch misconfigured metrics and missing scorers before runtime. If it builds, it runs.

Flexible

Decoupled Policy

Same metrics, different thresholds. Pass at 0.6 in dev, 0.8 in staging, 0.95 in prod. Zero code duplication.
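A rough sketch of what separating policy from measurement can look like, using the thresholds quoted above. The `Environment` type, the `passThreshold` table, and the `verdict` function are hypothetical names for this illustration and are not taken from Tally's API.

```typescript
// Hypothetical names for illustration; not Tally's actual API.
type Environment = "dev" | "staging" | "prod";

// One policy table; the metric code never sees these numbers.
const passThreshold: Record<Environment, number> = {
  dev: 0.6,
  staging: 0.8,
  prod: 0.95,
};

// The metric produces a score; the policy decides pass or fail.
function verdict(score: number, env: Environment): "pass" | "fail" {
  return score >= passThreshold[env] ? "pass" : "fail";
}

const score = 0.85; // the same metric output in every environment
const verdicts = (["dev", "staging", "prod"] as const).map((env) => verdict(score, env));
```

A score of 0.85 passes in dev and staging and fails in prod, and tightening the bar means editing one table entry instead of the metric.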

Modular

Composable Metrics

Mix LLM-based graders with code assertions. Metrics → Scorers → Evals, all reusable TypeScript objects.
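One way to picture this composition, sketched in plain TypeScript. The `Metric` shape, `llmGraded`, `weightedScorer`, and the stub judge are all assumptions made for this example and do not reflect Tally's real types; the judge is stubbed so the sketch runs without a model call.

```typescript
// Hypothetical shapes for illustration; not Tally's actual API.
type Metric = (output: string) => Promise<number>; // score in [0, 1]

// Code assertion: deterministic, no model call.
const mentionsRefund: Metric = async (output) =>
  /refund/i.test(output) ? 1 : 0;

// LLM-based grader: the judge is injected, so a real model call can be swapped in.
const llmGraded = (judge: (prompt: string) => Promise<number>): Metric =>
  async (output) => judge(`Rate the politeness of this reply from 0 to 1:\n${output}`);

// Scorer: weighted average over any mix of metrics.
const weightedScorer = (parts: { metric: Metric; weight: number }[]): Metric =>
  async (output) => {
    const total = parts.reduce((sum, p) => sum + p.weight, 0);
    const scored = await Promise.all(
      parts.map(async (p) => p.weight * (await p.metric(output))),
    );
    return scored.reduce((sum, s) => sum + s, 0) / total;
  };

// Stub judge standing in for a model call so the sketch runs offline.
const stubJudge = async (_prompt: string): Promise<number> => 0.5;

const replyQuality = weightedScorer([
  { metric: mentionsRefund, weight: 1 },
  { metric: llmGraded(stubJudge), weight: 1 },
]);
```

Because the code assertion and the LLM grader share one shape, the scorer does not need to know which kind it is combining, and either can be reused elsewhere.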

Analytics

Aggregator Engine

Statistical summaries for your entire dataset. Mean, Pass Rate, Percentiles, and custom aggregators.
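As an illustration of the kind of summary involved, here is a small sketch of the three named aggregators in plain TypeScript. The `Aggregator` type and helper names are hypothetical, and the percentile uses the nearest-rank method as an assumption; Tally's own definitions may differ.

```typescript
// Hypothetical helper names for illustration; not Tally's actual API.
type Aggregator = (scores: number[]) => number;

const mean: Aggregator = (scores) =>
  scores.reduce((sum, s) => sum + s, 0) / scores.length;

// Fraction of scores at or above the threshold.
const passRate = (threshold: number): Aggregator => (scores) =>
  scores.filter((s) => s >= threshold).length / scores.length;

// Nearest-rank percentile (an assumed definition): the smallest score
// with at least p% of the values at or below it. NaN for empty input.
const percentile = (p: number): Aggregator => (scores) => {
  const sorted = [...scores].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1] ?? NaN;
};

const scores = [0.5, 0.75, 1.0, 0.25];
const summary = {
  mean: mean(scores),
  passRate: passRate(0.5)(scores),
  p50: percentile(50)(scores),
};
```

A custom aggregator in this sketch is just another function from a list of scores to a number, so it slots into the same summary object.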

Production

CI/CD Ready

Lightweight and fast. Run evaluations in your PR workflow to catch regressions before your users do.

Start measuring
what matters.

From vibes to verdicts. Deploy on Friday, sleep on Saturday.

bun add @tally-evals/tally