Concepts
The mental model and core architecture of Tally.
Mental Model
Tally is primarily an Evaluation Engine. Rather than shipping a rigid set of fixed tests, it provides the infrastructure to define and run your own domain-specific evaluation measures.
To promote maximum reusability and composability, Tally enforces a strict separation between measurement, combination, and decision-making.
Building Blocks
Tally is composed of fundamental building blocks that you combine into an evaluation pipeline.
| Block | Analogy | Purpose |
|---|---|---|
| `MetricDef` | Ruler & Blueprint | Defines what to measure (e.g., an LLM prompt or a regex check). It is the definition/blueprint for a metric. |
| `Metric` | Measurement | The actual result produced by a `MetricDef` (can be numeric, boolean, ordinal, etc.). |
| `Scorer` | Normalizer/Combiner | Combines multiple metric results into a unified score (usually 0–1) based on weights. |
| `Eval` | Decision Rule | Combines a `Metric` or `Scorer` with a Verdict Policy to determine whether the result "passes" or "fails." |
| `Tally` | Orchestrator | The main entry point that takes data and evals, then runs the evaluation pipeline. |
| `Report` | Scorecard | The type-safe output with per-target results, summaries, and the View API for assertions. |
The Composability Trio
The power of Tally comes from how Metrics, Scorers, and Evals separate their responsibilities.
1. Metrics (Measurement)
Focuses exclusively on capturing raw domain data. A metric only cares about measuring one specific quality (e.g., response time, keyword presence, or LLM-graded relevance).
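As a sketch, a code-based metric can be thought of as a pure function from a sample to one raw value. The `Sample` shape and the `keywordPresence` helper below are illustrative assumptions, not Tally's actual API:

```typescript
// Hypothetical sketch: a metric measures exactly one quality of a
// sample and returns a raw value (here, a boolean for keyword presence).
type Sample = { response: string };

// Returns a checker for whether the response mentions a given keyword.
const keywordPresence =
  (keyword: string) =>
  (sample: Sample): boolean =>
    sample.response.toLowerCase().includes(keyword.toLowerCase());

const hasRefund = keywordPresence("refund");
console.log(hasRefund({ response: "A refund was issued." })); // true
console.log(hasRefund({ response: "No action was taken." })); // false
```

Note that the metric makes no decision and applies no weighting; it only measures.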
2. Scorers (Combination)
Focuses exclusively on how to weigh and normalize multiple measurements. Because scorers are distinct from metrics, you can reuse the same metric in different scoring contexts.
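The combination step can be illustrated with a plain weighted average. This is a minimal sketch of the idea behind a scorer (such as the `createWeightedAverageScorer` factory mentioned in the API reference), not its actual implementation:

```typescript
// Hypothetical sketch: a scorer combines several normalized metric
// results (each already in 0–1) into one score according to weights.
type WeightedMetric = { value: number; weight: number };

function weightedAverage(metrics: WeightedMetric[]): number {
  const totalWeight = metrics.reduce((sum, m) => sum + m.weight, 0);
  if (totalWeight === 0) return 0;
  return metrics.reduce((sum, m) => sum + m.value * m.weight, 0) / totalWeight;
}

// The same metric results can be reused under different weightings.
const results = [
  { value: 0.9, weight: 2 }, // e.g. relevance, weighted double
  { value: 0.5, weight: 1 }, // e.g. brevity
];
console.log(weightedAverage(results)); // ≈ 0.767
```

Because the scorer only sees `{ value, weight }` pairs, any metric that produces a normalized value can participate in any scoring context.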
3. Evals (Decision)
Focuses exclusively on the business rules. An eval takes the output of your measures and applies a Verdict Policy. You can use the same scoring measures but apply different verdict thresholds depending on the evaluation scenario.
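A verdict policy can be sketched as a pure decision rule from a score to pass/fail. The `atLeast` helper below is a hypothetical illustration of the concept, not Tally's API:

```typescript
// Hypothetical sketch: a verdict policy turns a score into a decision.
type Verdict = "pass" | "fail";
type VerdictPolicy = (score: number) => Verdict;

// Builds a simple threshold policy.
const atLeast =
  (threshold: number): VerdictPolicy =>
  (score) => (score >= threshold ? "pass" : "fail");

// The same score yields different verdicts in different scenarios.
const score = 0.75;
console.log(atLeast(0.7)(score)); // "pass" — a lenient smoke-test gate
console.log(atLeast(0.9)(score)); // "fail" — a strict release gate
```

Keeping the decision rule separate from measurement and scoring is what lets one scorer serve several evaluation scenarios.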
Core Architecture
Tally acts as the orchestrator, wiring these building blocks together and running them against your data.
```mermaid
graph TD
    Data[Conversation Data] --> Tally[createTally]
    Evals[Evals Array] --> Tally
    subgraph "Composition Layer"
        Tally --> Pipeline[Evaluation Pipeline]
        Pipeline --> EvalExec[Eval Execution]
        EvalExec --> Verdicts[Verdict Policies]
        EvalExec --> Logic[Metric or Scorer]
        Logic --> RawMetrics[MetricDefs]
    end
    Verdicts --> Results[Per-Target Results]
    Results --> Report[TallyRunReport]
    Report --> View[View API]
```
Type-Safe by Design
Tally is built with TypeScript-first type safety:
- **Eval names are literal types:** When you pass evals to `createTally`, the report knows exactly which evals exist.
- **View API provides autocomplete:** Access results with `view.step(0)['Answer Relevance']` and get type errors for typos.
- **No string-based lookups:** Everything is typed from definition to report access.
```typescript
const tally = createTally({
  data: [conversation],
  evals: [relevanceEval, completenessEval], // Names are inferred
});

const report = await tally.run();
const view = report.view();

view.step(0)['Answer Relevance']; // ✅ Autocomplete works
view.step(0)['Typo'];             // ❌ Type error
```
API Reference
For detailed type definitions and factory functions, see:
- Metrics API — `defineBaseMetric`, `defineSingleTurnCode`, `defineSingleTurnLLM`, etc.
- Scorers API — `defineScorer`, `createWeightedAverageScorer`, `defineInput`
- Evals API — `defineSingleTurnEval`, `defineMultiTurnEval`, `defineScorerEval`
- Reports API — `TallyRunReport`, `TallyRunArtifact`, View API