Concepts
The mental model and core architecture of Tally.
Mental Model
Tally is primarily an Evaluation Engine. Rather than shipping a rigid set of fixed tests, it provides the infrastructure to define and run your own domain-specific evaluation measures.
To promote maximum reusability and composability, Tally enforces a strict separation between measurement, combination, and decision-making.
Building Blocks
Tally is composed of fundamental building blocks that you combine into an evaluation pipeline.
| Block | Analogy | Purpose |
|---|---|---|
| `MetricDef` | Ruler & Blueprint | Defines what to measure (e.g., an LLM prompt or a regex check). It is the definition/blueprint for a metric. |
| `Metric` | Measurement | The actual result produced by a `MetricDef` (can be numeric, boolean, ordinal, etc.). |
| `Scorer` | Normalizer/Combiner | Combines multiple metric results into a unified score (usually 0–1) based on weights. |
| `Eval` | Decision Rule | Combines a `Metric` or `Scorer` with a Verdict Policy to determine whether the result "passes" or "fails." |
| `Tally` | Orchestrator | The main entry point that takes data and evals, then runs the evaluation pipeline. |
| `Report` | Scorecard | The type-safe output with per-target results, summaries, and the View API for assertions. |
The Composability Trio
The power of Tally comes from how Metrics, Scorers, and Evals separate their responsibilities.
1. Metrics (Measurement)
Focuses exclusively on capturing raw domain data. A metric only cares about measuring one specific quality (e.g., response time, keyword presence, or LLM-graded relevance).
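As a sketch, a code-based metric can be thought of as a pure function from a sample to one raw value. The `Sample` shape and the `keywordPresence` helper below are illustrative assumptions, not Tally's actual API:

```typescript
// Hypothetical sketch: a metric measures exactly one quality of a
// sample and returns a raw value (here, a boolean for keyword presence).
type Sample = { response: string };

// Returns a checker for whether the response mentions a given keyword.
const keywordPresence =
  (keyword: string) =>
  (sample: Sample): boolean =>
    sample.response.toLowerCase().includes(keyword.toLowerCase());

const hasRefund = keywordPresence("refund");
console.log(hasRefund({ response: "A refund was issued." })); // true
console.log(hasRefund({ response: "No action was taken." })); // false
```

Note that the metric makes no decision and applies no weighting; it only measures.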
2. Scorers (Combination)
Focuses exclusively on how to weigh and normalize multiple measurements. Because scorers are distinct from metrics, you can reuse the same metric in different scoring contexts.
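The combination step can be illustrated with a plain weighted average. This is a minimal sketch of the idea behind a scorer (such as the `createWeightedAverageScorer` factory mentioned in the API reference), not its actual implementation:

```typescript
// Hypothetical sketch: a scorer combines several normalized metric
// results (each already in 0–1) into one score according to weights.
type WeightedMetric = { value: number; weight: number };

function weightedAverage(metrics: WeightedMetric[]): number {
  const totalWeight = metrics.reduce((sum, m) => sum + m.weight, 0);
  if (totalWeight === 0) return 0;
  return metrics.reduce((sum, m) => sum + m.value * m.weight, 0) / totalWeight;
}

// The same metric results can be reused under different weightings.
const results = [
  { value: 0.9, weight: 2 }, // e.g. relevance, weighted double
  { value: 0.5, weight: 1 }, // e.g. brevity
];
console.log(weightedAverage(results)); // ≈ 0.767
```

Because the scorer only sees `{ value, weight }` pairs, any metric that produces a normalized value can participate in any scoring context.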
3. Evals (Decision)
Focuses exclusively on the business rules. An eval takes the output of your measures and applies a Verdict Policy. You can use the same scoring measures but apply different verdict thresholds depending on the evaluation scenario.
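A verdict policy can be sketched as a pure decision rule from a score to pass/fail. The `atLeast` helper below is a hypothetical illustration of the concept, not Tally's API:

```typescript
// Hypothetical sketch: a verdict policy turns a score into a decision.
type Verdict = "pass" | "fail";
type VerdictPolicy = (score: number) => Verdict;

// Builds a simple threshold policy.
const atLeast =
  (threshold: number): VerdictPolicy =>
  (score) => (score >= threshold ? "pass" : "fail");

// The same score yields different verdicts in different scenarios.
const score = 0.75;
console.log(atLeast(0.7)(score)); // "pass" — a lenient smoke-test gate
console.log(atLeast(0.9)(score)); // "fail" — a strict release gate
```

Keeping the decision rule separate from measurement and scoring is what lets one scorer serve several evaluation scenarios.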
Core Architecture
Tally acts as the orchestrator, wiring these building blocks together and running them against your data.
```mermaid
graph TD
    Data[Conversation Data] --> Tally[createTally]
    Evals[Evals Array] --> Tally
    subgraph "Composition Layer"
        Tally --> Pipeline[Evaluation Pipeline]
        Pipeline --> EvalExec[Eval Execution]
        EvalExec --> Verdicts[Verdict Policies]
        EvalExec --> Logic[Metric or Scorer]
        Logic --> RawMetrics[MetricDefs]
    end
    Verdicts --> Results[Per-Target Results]
    Results --> Report[TallyRunReport]
    Report --> View[View API]
```
Type-Safe by Design
Tally is built with TypeScript-first type safety:
- **Eval names are literal types:** When you pass evals to `createTally`, the report knows exactly which evals exist.
- **View API provides autocomplete:** Access results with `view.step(0)['Answer Relevance']` and get type errors for typos.
- **No string-based lookups:** Everything is typed from definition to report access.
```typescript
const tally = createTally({
  data: [conversation],
  evals: [relevanceEval, completenessEval], // Names are inferred
});

const report = await tally.run();
const view = report.view();

view.step(0)['Answer Relevance']; // ✅ Autocomplete works
view.step(0)['Typo'];             // ❌ Type error
```
API Reference
For detailed type definitions and factory functions, see:
- Metrics API — `defineBaseMetric`, `defineSingleTurnCode`, `defineSingleTurnLLM`, etc.
- Scorers API — `defineScorer`, `createWeightedAverageScorer`, `defineInput`
- Evals API — `defineSingleTurnEval`, `defineMultiTurnEval`, `defineScorerEval`
- Reports API — `TallyRunReport`, `TallyRunArtifact`, View API