Tally

Why Tally?

Understanding the Agent Reliability Crisis and how Tally solves it.

Traditional evaluation frameworks are built for prompts. Tally is built for Agents.

Tally is framework-agnostic: it evaluates behavior (conversations, tool calls, trajectories) regardless of which agent framework produced it. The only requirement is that you record runs in Tally’s standard Conversation shape (based on AI SDK-style model messages). That means you can evaluate agents written in any language, as long as you can export the interaction history in that format.
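To make the idea concrete, here is a minimal sketch of what an AI SDK-style message history might look like. The exact Conversation shape is defined by Tally; the field names below (`id`, `messages`, `toolCalls`, and so on) are illustrative assumptions, not the library's actual types.

```typescript
// Hypothetical sketch of an AI SDK-style message history. Field names
// are assumptions for illustration, not Tally's actual Conversation type.
type ToolCall = {
  toolName: string;
  args: Record<string, unknown>;
};

type ModelMessage =
  | { role: "system"; content: string }
  | { role: "user"; content: string }
  | { role: "assistant"; content: string; toolCalls?: ToolCall[] }
  | { role: "tool"; toolName: string; result: unknown };

type Conversation = {
  id: string;
  messages: ModelMessage[];
};

// A run exported from any framework (in any language) just needs to
// serialize its interaction history into this kind of shape.
const run: Conversation = {
  id: "run-001",
  messages: [
    { role: "user", content: "What's the weather in Paris?" },
    {
      role: "assistant",
      content: "Let me check.",
      toolCalls: [{ toolName: "getWeather", args: { city: "Paris" } }],
    },
    { role: "tool", toolName: "getWeather", result: { tempC: 18 } },
    { role: "assistant", content: "It's 18°C in Paris." },
  ],
};
```

Because the shape is plain data, an agent written in Python or Go can emit the same structure as JSON and still be evaluated by Tally.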

The Agent Reliability Crisis

Building a chatbot that works 80% of the time is easy. Closing the remaining 20% gap is where most projects fail. Agents introduce a level of non-determinism and stateful complexity that traditional unit testing cannot handle.

The Problem: Fragmented Feedback

  1. The Row-by-Row Fallacy: Testing individual turns doesn't tell you if the agent actually solved the user's problem over 5 steps.
  2. String-Typed Fragility: Most frameworks rely on string identifiers for metrics. When you rename a metric in your code, your evaluation config breaks silently.
  3. The "Black Box" CI: When a test fails in CI, you get a number (e.g., 0.6). You don't see the tool calls, the thought process, or the conversation that led to that failure.
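The string-typed fragility in point 2 is easiest to see side by side. The sketch below contrasts a string-keyed config with a value-based one; the metric and its fields are illustrative, not Tally's API.

```typescript
// Why string identifiers break silently: renaming the metric in code
// leaves a string-keyed config pointing at nothing, and nothing
// complains until runtime (if then).
const stringConfig = { metrics: ["relevance"] }; // rename/typo => silent break

// With the metric as a typed value, every usage site is checked by the
// compiler, so a rename is a compile error instead of a silent miss.
const relevance = {
  name: "relevance",
  score: (answer: string) => (answer.length > 0 ? 1 : 0),
};

const typedConfig = { metrics: [relevance] };

const score = typedConfig.metrics[0].score("Paris is the capital of France.");
```

This is the failure mode Tally's typesafe SDK (described below) is designed to rule out.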

The Tally Philosophy

1. A sensible, extensible, typesafe SDK for defining evals

Tally is designed to feel like a clean TypeScript SDK, not a config language. You define evals, metrics, scorers, and verdicts in code with strong types and editor support, so your evaluation suite stays maintainable as it grows.
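As a rough sketch of the code-first style, here is what a typed eval definition could look like. `Metric`, `Eval`, and `defineEval` are illustrative names invented for this example, not Tally's actual API.

```typescript
// A minimal, self-contained sketch of a typesafe eval definition.
// All names here are assumptions for illustration.
type Metric<Input> = {
  name: string;
  score: (input: Input) => number; // normalized 0..1
};

type EvalResult = { metric: string; score: number };

function defineEval<Input>(
  name: string,
  metrics: Metric<Input>[],
): { name: string; run: (input: Input) => EvalResult[] } {
  return {
    name,
    run: (input) =>
      metrics.map((m) => ({ metric: m.name, score: m.score(input) })),
  };
}

// Metrics are plain typed values, so renames and signature changes are
// caught by the compiler rather than failing silently at runtime.
const conciseness: Metric<string> = {
  name: "conciseness",
  score: (answer) => (answer.length < 200 ? 1 : 0),
};

const answerEval = defineEval("answer-quality", [conciseness]);
const results = answerEval.run("Short answer.");
```

The point of the sketch is the ergonomics: evals are ordinary values with full editor support, not entries in a config file.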

2. Multi-Turn as a First-Class Citizen

Tally doesn't just evaluate responses; it evaluates Trajectories. By integrating with @tally-evals/trajectories, you can simulate agent-user interactions and evaluate the entire conversation history.
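A trajectory-level metric looks different from a per-turn metric: it takes the whole history as input. The sketch below uses a simplified step shape and a made-up metric; neither is Tally's actual API.

```typescript
// Sketch of a trajectory-level metric: it scores the entire interaction
// history, not a single response. The Step shape is a simplified
// stand-in, not Tally's actual Conversation type.
type Step = { role: "user" | "assistant" | "tool"; content: string };

// Illustrative metric: did the agent reach a final answer within a
// budget of assistant turns?
function taskCompletedWithin(trajectory: Step[], maxSteps: number): number {
  const assistantTurns = trajectory.filter((s) => s.role === "assistant").length;
  const last = trajectory[trajectory.length - 1];
  const resolved = last !== undefined && last.role === "assistant";
  return resolved && assistantTurns <= maxSteps ? 1 : 0;
}

const trajectory: Step[] = [
  { role: "user", content: "Book a table for two tonight." },
  { role: "assistant", content: "Searching restaurants..." },
  { role: "tool", content: '{"found": true}' },
  { role: "assistant", content: "Booked for 7pm." },
];

const trajectoryScore = taskCompletedWithin(trajectory, 5);
```

A per-turn check could rate every individual response highly and still miss that the overall task was never finished; only a trajectory-level view catches that.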

3. Decoupling Measure from Policy

A Metric measures a signal (e.g., Relevance). A Verdict Policy defines what "good" looks like for a specific scenario. By separating these, you can run the same metrics across different evaluation contexts—release gates, regression tests, edge-case suites, dev loops—and simply swap the verdict policy to match the required rigor. The measure stays constant; the policy adapts to the scenario.
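The separation can be sketched in a few lines: one measurement, interpreted by different verdict policies depending on context. The policy names and thresholds below are illustrative assumptions.

```typescript
// Sketch of decoupling measure from policy. Names and thresholds are
// illustrative, not Tally's actual API.
type VerdictPolicy = (score: number) => "pass" | "fail";

// One measurement from a relevance metric...
const relevanceScore = 0.82;

// ...judged under different policies for different contexts.
const releaseGate: VerdictPolicy = (s) => (s >= 0.9 ? "pass" : "fail");
const devLoop: VerdictPolicy = (s) => (s >= 0.7 ? "pass" : "fail");

const gateVerdict = releaseGate(relevanceScore); // "fail": too weak to ship
const devVerdict = devLoop(relevanceScore);      // "pass": fine for iteration
```

The measurement never changes; only the threshold for "good" does, so the same metric code serves release gates and dev loops alike.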

4. Closing the Loop

Evaluation is only useful if it leads to a fix.

  • TUI: Browse failed runs directly in your terminal.
  • Dev Server: A visual showcase of every tool call and LLM response.
  • CI Integration: Structured reports designed to gate your deployment pipeline.

The Ecosystem

Tally is part of a broader ecosystem designed for the full agent development lifecycle:

  • Producer: @tally-evals/trajectories generates synthetic multi-turn data.
  • Processor: @tally-evals/tally runs the evaluations and calculates the signal.
  • Interface: tally-cli provides the TUI and Dev Server to visualize the results.

Ready to build reliable agents? Get Started.
