Why Tally?
Understanding the Agent Reliability Crisis and how Tally solves it.
Traditional evaluation frameworks are built for prompts. Tally is built for Agents.
Tally is framework-agnostic: it evaluates the behavior (conversations, tool calls, trajectories), regardless of which agent framework produced it. The only requirement is that you record runs into Tally’s standard Conversation shape (based on AI SDK-style model messages). That means you can evaluate agents written in any language as long as you can export the interaction history in that format.
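To make the requirement concrete, here is a minimal sketch of what a recorded run in that shape might look like. The type names below are illustrative (the exact `Conversation` type lives in Tally's SDK); the structure follows AI SDK-style model messages with roles and tool-call parts.

```typescript
// Illustrative shape only — the real type is defined by Tally's SDK.
type ToolCallPart = {
  type: "tool-call";
  toolCallId: string;
  toolName: string;
  args: Record<string, unknown>;
};

type Message =
  | { role: "system"; content: string }
  | { role: "user"; content: string }
  | { role: "assistant"; content: string | ToolCallPart[] }
  | { role: "tool"; content: { toolCallId: string; result: unknown }[] };

type Conversation = { messages: Message[] };

// A run exported from any agent framework, in any language,
// serialized into this shape:
const run: Conversation = {
  messages: [
    { role: "user", content: "What's the weather in Paris?" },
    {
      role: "assistant",
      content: [
        {
          type: "tool-call",
          toolCallId: "call_1",
          toolName: "getWeather",
          args: { city: "Paris" },
        },
      ],
    },
    { role: "tool", content: [{ toolCallId: "call_1", result: { tempC: 18 } }] },
    { role: "assistant", content: "It's 18°C in Paris right now." },
  ],
};
```

Because the interaction history is just data, the agent that produced it could be written in Python, Go, or anything else.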
The Agent Reliability Crisis
Building a chatbot that works 80% of the time is easy. Closing the remaining 20% gap is where most projects fail. Agents introduce a level of non-determinism and stateful complexity that traditional unit testing cannot handle.
The Problem: Fragmented Feedback
- The Row-by-Row Fallacy: Testing individual turns doesn't tell you if the agent actually solved the user's problem over 5 steps.
- String-Typed Fragility: Most frameworks rely on string identifiers for metrics. When you rename a metric in your code, your evaluation config breaks silently.
- The "Black Box" CI: When a test fails in CI, you get a number (e.g., 0.6). You don't see the tool calls, the thought process, or the conversation that led to that failure.
The Tally Philosophy
1. A sensible, extensible, typesafe SDK for defining evals
Tally is designed to feel like a clean TypeScript SDK, not a config language. You define evals, metrics, scorers, and verdicts in code with strong types and editor support, so your evaluation suite stays maintainable as it grows.
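As a sketch of what "typesafe" buys you over string identifiers: the helper names below (`defineMetric` and the `evalConfig` shape) are hypothetical, not Tally's actual exports, but they show the principle of referencing metrics as values rather than strings.

```typescript
// Hypothetical sketch — `defineMetric` and the eval shape are illustrative,
// not Tally's real API. The point: metrics are referenced as typed values,
// so renaming one is a compile error, never a silently broken config.
type Metric<Name extends string> = {
  name: Name;
  score: (output: string) => number;
};

function defineMetric<Name extends string>(m: Metric<Name>): Metric<Name> {
  return m;
}

const relevance = defineMetric({
  name: "relevance",
  score: (output) => (output.length > 0 ? 1 : 0),
});

// The eval references the metric object itself, not the string "relevance".
const evalConfig = {
  metrics: [relevance], // rename `relevance` and this line fails to compile
};
```

Contrast this with a string-keyed config (`metrics: ["relevance"]`), where a rename in code leaves the config pointing at nothing.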
2. Multi-Turn as a First-Class Citizen
Tally doesn't just evaluate responses; it evaluates Trajectories. By integrating with @tally-evals/trajectories, you can simulate agent-user interactions and evaluate the entire conversation history.
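To illustrate what a trajectory-level check looks like (the real @tally-evals/trajectories API may differ; the data and the check below are invented for illustration):

```typescript
// Hypothetical sketch — the idea is scoring the whole multi-turn
// trajectory, not any single response in isolation.
type Turn = { role: "user" | "assistant"; content: string };

// A simulated agent-user interaction, e.g. produced by a simulator:
const trajectory: Turn[] = [
  { role: "user", content: "Cancel my order #123." },
  { role: "assistant", content: "I found order #123. Confirm cancellation?" },
  { role: "user", content: "Yes, cancel it." },
  { role: "assistant", content: "Order #123 has been cancelled." },
];

// A trajectory-level signal: did the agent resolve the task by the end?
// A per-turn check on turn 2 alone could not answer this.
const taskResolved = trajectory
  .filter((t) => t.role === "assistant")
  .some((t) => t.content.includes("cancelled"));
```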
3. Decoupling Measure from Policy
A Metric measures a signal (e.g., Relevance). A Verdict Policy defines what "good" looks like for a specific scenario. By separating these, you can run the same metrics across different evaluation contexts—release gates, regression tests, edge-case suites, dev loops—and simply swap the verdict policy to match the required rigor. The measure stays constant; the policy adapts to the scenario.
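The separation can be sketched in a few lines. Everything here is illustrative (the policy type and thresholds are assumptions, not Tally's API): one metric produces one score, and each evaluation context applies its own verdict policy to that same score.

```typescript
// Hypothetical sketch of the measure/policy split — names and thresholds
// are illustrative, not Tally's actual API.
type VerdictPolicy = (score: number) => "pass" | "fail";

// One measurement, produced once by a (hypothetical) Relevance metric:
const relevanceScore = 0.82;

// Two contexts, two policies — the metric itself never changes:
const releaseGate: VerdictPolicy = (s) => (s >= 0.9 ? "pass" : "fail");
const devLoop: VerdictPolicy = (s) => (s >= 0.7 ? "pass" : "fail");

const gateVerdict = releaseGate(relevanceScore); // stricter bar: fails
const devVerdict = devLoop(relevanceScore);      // looser bar: passes
```

Swapping the policy, rather than re-tuning the metric, is what lets the same suite serve release gates and fast dev loops.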
4. Closing the Loop
Evaluation is only useful if it leads to a fix.
- TUI: Browse failed runs directly in your terminal.
- Dev Server: A visual showcase of every tool call and LLM response.
- CI Integration: Structured reports designed to gate your deployment pipeline.
The Ecosystem
Tally is part of a broader ecosystem designed for the full agent development lifecycle:
- Producer: @tally-evals/trajectories generates synthetic multi-turn data.
- Processor: @tally-evals/tally runs the evaluations and calculates the signal.
- Interface: tally-cli provides the TUI and Dev Server to visualize the results.
Ready to build reliable agents? Get Started.