Metrics

Overview

Measuring agent behavior with LLM-based and code-based metrics.

Metrics are the signals you measure about your agent. They can be fully deterministic (code checks) or model-judged (LLM-based), and they're designed to be reused across different evaluation scenarios—release gates, regression tests, edge-case suites, dev loops, and more.


Blueprint vs. Result

It is important to understand the distinction between a definition and its output:

  • MetricDef — the blueprint. It defines the "how" (code or prompt), the metadata, and the expected value type.
  • Metric — the result: the actual value (e.g., 4.5 or true) produced when a MetricDef is run against data.
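The split can be sketched with hypothetical, simplified shapes (the real Tally types carry more structure, such as scope, implementation, and normalization):

```typescript
// Hypothetical, simplified shapes for illustration only.
interface MetricDef {
  name: string;
  valueType: 'number' | 'boolean' | 'string';
  description?: string;
}

interface Metric {
  def: MetricDef;                   // the blueprint that produced this result
  value: number | boolean | string; // e.g., 4.5 or true
}

const def: MetricDef = { name: 'helpfulness', valueType: 'number' };
const result: Metric = { def, value: 4.5 };
console.log(result.value); // 4.5
```

One blueprint can produce many results: run the same MetricDef against a whole dataset and you get one Metric per target.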

Anatomy of a Metric

Every metric in Tally fits into a few practical categories. These categories aren't just taxonomy — they determine how you run, score, and debug evaluations.

1. Value Type

Metrics can produce any domain-relevant value:

  • Numeric: continuous values like token usage, latency, similarity scores, or judge scores.
  • Boolean: binary checks like "is the output valid JSON?"
  • Ordinal: categorical results mapped from a rubric (e.g., { Poor, Fair, Good, Excellent }).

In Tally, metric values are represented as a scalar type:

type MetricScalar = number | boolean | string;

Example:

import { defineBaseMetric } from '@tally-evals/tally';

// Likert-style ordinal rating produced by an LLM judge or rubric
// (store as an ordinal category; map to scores via normalization when needed)
const concisenessLikert = defineBaseMetric({
  name: 'concisenessLikert',
  valueType: 'string',
  description: 'Conciseness rating as an ordinal category (e.g., VeryVerbose → VeryConcise)',
});
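To make an ordinal category comparable to numeric scores, each category is mapped to a position on its scale. A rough sketch of the idea behind an ordinal normalizer (Tally ships createOrdinalNormalizer for this; the Likert order below is a hypothetical example):

```typescript
// Hypothetical Likert order for the conciseness rating above.
const scale = ['VeryVerbose', 'Verbose', 'Balanced', 'Concise', 'VeryConcise'] as const;

// Map a category to a 0–1 score by its position on the scale.
function ordinalToScore(value: string): number {
  const i = scale.indexOf(value as (typeof scale)[number]);
  if (i < 0) throw new Error(`Unknown category: ${value}`);
  return i / (scale.length - 1);
}

console.log(ordinalToScore('Balanced')); // 0.5
console.log(ordinalToScore('VeryConcise')); // 1
```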

2. Scope

Determines what the metric is measuring:

  • Single-turn: evaluates a single interaction (input + output).
  • Multi-turn: analyzes the entire conversation history (drift, retention, goal completion).

Example (single-turn vs multi-turn):

import { defineSingleTurnCode, defineMultiTurnLLM, defineBaseMetric } from '@tally-evals/tally';

// Single-turn: "Did this step produce a non-empty assistant reply?"
const hasAssistantReply = defineSingleTurnCode({
  base: defineBaseMetric({
    name: 'hasAssistantReply',
    valueType: 'boolean',
  }),
  compute: ({ selected }) =>
    selected.output.some(
      (m) => m.role === 'assistant' && m.content.length > 0,
    ),
});

// Multi-turn: "Did the agent stay on topic across the whole conversation?"
const stayedOnTopic = defineMultiTurnLLM({
  base: defineBaseMetric({
    name: 'stayedOnTopic',
    valueType: 'number',
  }),
  runOnContainer: async (conversation) => ({
    transcript: conversation.steps
      .map(
        (s) =>
          `User: ${s.input.content}\nAssistant: ${s.output
            .map((m) => m.content)
            .join(' ')}`,
      )
      .join('\n\n'),
  }),
  provider: myModel,
  prompt: {
    instruction: 'Score topic adherence 0-5.\n\n{{transcript}}',
    variables: [] as const,
  },
});

3. Implementation

The engine that powers the measurement:

  • Code-based: deterministic logic written in TypeScript. Best for technical validation.
  • LLM-based: evaluated by a language model using a prompt/rubric. Best for semantic judgment.

Example (code-based vs LLM-based options):

import { defineSingleTurnCode, defineSingleTurnLLM, defineBaseMetric } from '@tally-evals/tally';
import { createMinMaxNormalizer } from '@tally-evals/tally/normalization';

// Code-based: optional preProcessor, cacheable, metadata
const keywordPresent = defineSingleTurnCode({
  base: defineBaseMetric({
    name: 'keywordPresent',
    valueType: 'boolean',
  }),
  preProcessor: async (selected) => ({
    text: selected.output.map((m) => m.content).join(' '),
  }),
  compute: ({ data }) => (data as { text: string }).text.includes('refund'),
  cacheable: true,
  metadata: {
    keyword: 'refund',
  },
});

// LLM-based: provider, prompt, rubric, normalization
const helpfulness = defineSingleTurnLLM({
  base: defineBaseMetric({
    name: 'helpfulness',
    valueType: 'number',
  }),
  provider: myModel,
  prompt: {
    instruction:
      'Rate helpfulness 1-5.\n\nQuery: {{input}}\nResponse: {{output}}',
    variables: [] as const,
  },
  rubric: {
    criteria: '1=unhelpful, 5=very helpful',
    scale: '1-5',
  },
  normalization: {
    normalizer: createMinMaxNormalizer({
      min: 1,
      max: 5,
      clip: true,
    }),
  },
});

The Metric Lifecycle

  1. Define: Create a MetricDef using the factory APIs.
  2. Execute: Tally runs the definition against your conversation data.
  3. Normalize (Optional): If you plan to combine a metric inside a Scorer, you'll typically attach a Normalizer so the metric maps to a comparable 0–1 score. It's optional because some metrics already produce 0–1, and some workflows don't use scorers.
  4. Aggregate (Optional): For single-turn metrics, attach Aggregators to compute summary statistics (mean, percentiles, etc.) across all targets. Tally provides sensible defaults based on metric value type.
  5. Compose: Use Scorers to combine multiple signals, then use Evals to apply a verdict policy.

Normalization converts a raw metric value (number/boolean/ordinal) into a comparable 0–1 score. In the current API, normalization lives on the metric definition.

If you're using scorers, you typically want every input metric to have a sensible normalization strategy (or provide normalizerOverride per scorer input).

See the full reference: Normalizers.
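The arithmetic behind a min–max normalizer is simple; here is a sketch for intuition (not Tally's implementation):

```typescript
// Map a raw value from [min, max] onto 0–1, optionally clipping out-of-range
// values: the same idea as createMinMaxNormalizer({ min, max, clip }).
function minMaxNormalize(value: number, min: number, max: number, clip = true): number {
  const score = (value - min) / (max - min);
  return clip ? Math.min(1, Math.max(0, score)) : score;
}

console.log(minMaxNormalize(4, 1, 5)); // 0.75: a judge score of 4 on a 1–5 scale
console.log(minMaxNormalize(7, 1, 5)); // 1: out-of-range values are clipped
```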


Aggregations (optional)

Aggregations compute summary statistics across all evaluation targets. Single-turn metrics can carry aggregator definitions that the pipeline uses to summarize results.

  • Where it applies: SingleTurnMetricDef (step/item-level metrics)
  • Where to read the output: report.evalSummaries.get(evalName)?.aggregations

Default Aggregators by Value Type

Tally provides sensible default aggregators based on the metric's valueType:

  • number — mean, p50, p75, p90
  • boolean — mean, p50, p75, p90, trueRate
  • ordinal — mean, p50, p75, p90, distribution

Use getDefaultAggregators(valueType) to retrieve the defaults, or define custom aggregators for bespoke summaries (e.g., per-customer-tier rollups).
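For intuition, the statistics these aggregators compute look roughly like this (hypothetical sketches, not Tally's implementations):

```typescript
// mean: average of numeric results across all targets.
const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;

// trueRate: fraction of boolean results that are true.
const trueRate = (xs: boolean[]) => mean(xs.map((x) => (x ? 1 : 0)));

// Nearest-rank percentile over a sorted copy (p50 is the median).
const percentile = (xs: number[], p: number) => {
  const sorted = [...xs].sort((a, b) => a - b);
  return sorted[Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1)];
};

console.log(mean([2, 4, 6]));               // 4
console.log(trueRate([true, true, false])); // 2 of 3
console.log(percentile([1, 2, 3, 4], 50));  // 2
```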


Same Measures, Different Policies

In practice, you want to reuse the same metrics across different evaluation scenarios:

  • Release gates: strict verdict policies for production readiness.
  • Regression tests: moderate policies to catch quality regressions.
  • Edge-case suites: scenario-specific thresholds for known failure modes.
  • Dev loops: looser policies for fast iteration and feedback.

The measure stays constant; the verdict policy adapts to the context. This is how you avoid rewriting evaluation logic for every test suite.
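As a sketch of the idea (hypothetical thresholds; Tally's Evals and verdict policies are the real mechanism):

```typescript
// One normalized measure, judged against different policy thresholds.
const verdict = (score: number, threshold: number): 'pass' | 'fail' =>
  score >= threshold ? 'pass' : 'fail';

const helpfulness = 0.82; // the same normalized metric result everywhere

console.log(verdict(helpfulness, 0.9));  // 'fail': strict release gate
console.log(verdict(helpfulness, 0.75)); // 'pass': regression test
console.log(verdict(helpfulness, 0.5));  // 'pass': dev loop
```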

API Reference

  • Metric Factories — defineBaseMetric, defineSingleTurnCode, defineSingleTurnLLM, withNormalization
  • Built-in Metrics — pre-built metrics for common use cases
  • Normalizers — createMinMaxNormalizer, createBooleanNormalizer, createOrdinalNormalizer
  • Aggregators — createMeanAggregator, createTrueRateAggregator, etc.
