Tally

Evals

Define evals, select targets, and compute verdicts.

The Eval API is the primary user-facing API for building evaluations.

Import

import {
  defineSingleTurnEval,
  defineMultiTurnEval,
  defineScorerEval,
  runAllTargets,
  runSelectedSteps,
  runSelectedItems,
  booleanVerdict,
  thresholdVerdict,
  rangeVerdict,
  ordinalVerdict,
  customVerdict,
} from 'tally';

defineSingleTurnEval()

Creates an eval that applies a verdict policy to a single-turn metric. Use this to evaluate individual conversation steps or dataset items with pass/fail criteria.

name:string

Eval name.

description?:string

Optional description for this eval.

metric:SingleTurnMetricDef

Single-turn metric definition to evaluate.

verdict?:VerdictPolicyFor<TMetricValue>

Optional verdict policy. Its type is inferred from the metric value type.

autoNormalize?:AutoNormalizer

Optional auto-normalization for boolean, ordinal, or number metrics, selected by kind.

AutoNormalizer
kind:'boolean' | 'ordinal' | 'number'

Auto-normalization mode.

trueScore?:number

Score for true (boolean mode).

falseScore?:number

Score for false (boolean mode).

weights?:Record<string | number, number>

Ordinal weight map (ordinal mode).

context?:EvaluationContext

Optional evaluation context (run policy for single-turn).

EvaluationContext
singleTurn?:SingleTurnRunPolicy

Single-turn run policy.

metadata?:Record<string, unknown>

Optional metadata for this run.

metadata?:Record<string, unknown>

Optional metadata for this eval.
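As a rough illustration of how the AutoNormalizer fields above could turn a raw metric value into a numeric score, here is a minimal self-contained sketch. The normalize function is hypothetical, and the defaults it uses (1 for true, 0 for false, 0 for a category with no weight) are assumptions made for this example, not Tally's documented behavior.

```typescript
// Local stand-in mirroring the documented AutoNormalizer fields.
type AutoNormalizer = {
  kind: 'boolean' | 'ordinal' | 'number';
  trueScore?: number;
  falseScore?: number;
  weights?: Record<string | number, number>;
};

// Hypothetical helper: maps a raw metric value to a numeric score.
// The fallback values (1, 0, 0) are assumptions for illustration only.
function normalize(raw: boolean | string | number, n: AutoNormalizer): number {
  if (n.kind === 'boolean') {
    return raw === true ? (n.trueScore ?? 1) : (n.falseScore ?? 0);
  }
  if (n.kind === 'ordinal') {
    return n.weights?.[raw as string | number] ?? 0;
  }
  // 'number' mode: pass the value through as a number.
  return Number(raw);
}

const booleanScore = normalize(false, { kind: 'boolean', trueScore: 1, falseScore: 0.25 });
const ordinalScore = normalize('good', {
  kind: 'ordinal',
  weights: { poor: 0, good: 0.5, great: 1 },
});
```

Here booleanScore takes the falseScore value and ordinalScore takes the weight registered for 'good'.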

defineMultiTurnEval()

Creates an eval that applies a verdict policy to a multi-turn metric. Use this to evaluate entire conversations with pass/fail criteria for goals like completion or role adherence.

name:string

Eval name.

description?:string

Optional description for this eval.

metric:MultiTurnMetricDef

Multi-turn metric definition to evaluate.

verdict?:VerdictPolicyFor<TMetricValue>

Optional verdict policy. Its type is inferred from the metric value type.

autoNormalize?:AutoNormalizer

Optional auto-normalization for boolean, ordinal, or number metrics, selected by kind.

AutoNormalizer
kind:'boolean' | 'ordinal' | 'number'

Auto-normalization mode.

trueScore?:number

Score for true (boolean mode).

falseScore?:number

Score for false (boolean mode).

weights?:Record<string | number, number>

Ordinal weight map (ordinal mode).

context?:EvaluationContext

Optional evaluation context (run policy for single-turn).

EvaluationContext
singleTurn?:SingleTurnRunPolicy

Single-turn run policy.

metadata?:Record<string, unknown>

Optional metadata for this run.

metadata?:Record<string, unknown>

Optional metadata for this eval.

defineScorerEval()

Creates an eval that applies a verdict policy to a scorer's combined output. Use this when quality depends on multiple weighted factors combined into a single score.

name:string

Eval name.

description?:string

Optional description for this eval.

scorer:Scorer

Scorer definition to combine input metrics.

verdict?:VerdictPolicyFor<number>

Optional verdict policy for the derived score.

context?:EvaluationContext

Optional evaluation context (run policy for single-turn).

EvaluationContext
singleTurn?:SingleTurnRunPolicy

Single-turn run policy.

metadata?:Record<string, unknown>

Optional metadata for this run.

metadata?:Record<string, unknown>

Optional metadata for this eval.
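To make "multiple weighted factors combined into a single score" concrete, the sketch below computes a weighted average of normalized metric scores. The WeightedScore type and combineWeighted function are hypothetical, and a weighted average is only one plausible combination rule. The Scorer type itself is not described on this page.

```typescript
// Hypothetical input: a normalized metric score paired with its weight.
type WeightedScore = { score: number; weight: number };

// Weighted average of the inputs; returns 0 when the total weight is 0.
function combineWeighted(inputs: WeightedScore[]): number {
  const totalWeight = inputs.reduce((sum, i) => sum + i.weight, 0);
  if (totalWeight === 0) return 0;
  const weightedSum = inputs.reduce((sum, i) => sum + i.score * i.weight, 0);
  return weightedSum / totalWeight;
}

const combined = combineWeighted([
  { score: 1.0, weight: 3 }, // e.g. a relevance metric
  { score: 0.5, weight: 1 }, // e.g. a tone metric
]);
// (1.0 * 3 + 0.5 * 1) / 4 = 0.875
```

A verdict policy typed for numbers, such as a threshold, would then be applied to the combined value.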

Target selection policies

runAllTargets()

Runs the eval on all available targets (all steps in a conversation, or all items in a dataset). This is the default behavior.

No options.

runSelectedSteps()

Runs the eval only on specific conversation steps by index. Use this to focus evaluation on particular turns (e.g., only the first and last step).

steps:number[]

Step indices to evaluate.

runSelectedItems()

Runs the eval only on specific dataset items by index. Use this to focus evaluation on particular items in a dataset.

indices:number[]

Dataset item indices to evaluate.
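The three policies can be pictured as index filters over the available targets. The sketch below is a local model, not the library's code: the RunPolicy shape, its kind tags, and selectTargets are hypothetical, and it assumes zero-based indices and silently ignores out-of-range ones.

```typescript
// Local stand-ins for the three documented policies; the `kind` tags are hypothetical.
type RunPolicy =
  | { kind: 'all' }
  | { kind: 'selectedSteps'; steps: number[] }
  | { kind: 'selectedItems'; indices: number[] };

// Returns the targets the policy selects, in their original order.
// Assumes zero-based indices; out-of-range indices select nothing.
function selectTargets<T>(targets: T[], policy: RunPolicy): T[] {
  if (policy.kind === 'all') return targets;
  const wanted = policy.kind === 'selectedSteps' ? policy.steps : policy.indices;
  return targets.filter((_, index) => wanted.indexOf(index) !== -1);
}

const conversationSteps = ['greeting', 'question', 'answer', 'goodbye'];

// Evaluate only the first and last step of the conversation.
const firstAndLast = selectTargets(conversationSteps, {
  kind: 'selectedSteps',
  steps: [0, conversationSteps.length - 1],
});
```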

Verdict helpers

booleanVerdict()

Creates a verdict policy for boolean metrics. Passes when the raw value matches the specified boolean.

passWhen:boolean

Pass when raw value matches this boolean.

thresholdVerdict()

Creates a verdict policy that passes when the normalized score meets or exceeds a threshold. This is the most common way to set a quality bar.

passAt:number

Pass when score >= passAt.
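The comparison is inclusive, so a score exactly equal to passAt passes. The following sketch models that rule with a hypothetical local function, thresholdCheck, rather than calling the library.

```typescript
type Verdict = 'pass' | 'fail';

// Hypothetical stand-in for a threshold policy: pass when score >= passAt.
function thresholdCheck(score: number, passAt: number): Verdict {
  return score >= passAt ? 'pass' : 'fail';
}

// The boundary value 0.7 passes because the comparison is inclusive.
const thresholdResults = [0.69, 0.7, 0.95].map((s) => thresholdCheck(s, 0.7));
```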

rangeVerdict()

Creates a verdict policy that passes when the score falls within a specified range. This is useful for balanced checks (e.g., "neither too brief nor too verbose").

min?:number

Optional minimum pass threshold.

max?:number

Optional maximum pass threshold.
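Because both bounds are optional, a range policy can be two-sided or one-sided. The sketch below models this with a hypothetical inRange function. It treats the bounds as inclusive and an omitted bound as unbounded, both of which are assumptions, since this page does not state how the edges are handled.

```typescript
type RangeOptions = { min?: number; max?: number };

// Hypothetical range check. Inclusive bounds and "omitted means unbounded"
// are assumptions for illustration, not documented behavior.
function inRange(score: number, { min, max }: RangeOptions): 'pass' | 'fail' {
  const aboveMin = min === undefined || score >= min;
  const belowMax = max === undefined || score <= max;
  return aboveMin && belowMax ? 'pass' : 'fail';
}

const balanced = inRange(0.5, { min: 0.3, max: 0.7 }); // inside both bounds
const tooVerbose = inRange(0.9, { min: 0.3, max: 0.7 }); // above max
const lowerBoundOnly = inRange(0.9, { min: 0.3 }); // no upper bound applied
```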

ordinalVerdict()

Creates a verdict policy for ordinal (string) metrics. Passes when the raw value is one of the allowed categories.

passWhenIn:readonly string[]

Allowed ordinal values for pass.
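An ordinal policy amounts to a membership test on the raw category. The sketch below uses a hypothetical local function, ordinalCheck, and made-up category names to show the idea.

```typescript
// Hypothetical membership check mirroring the documented passWhenIn option.
function ordinalCheck(rawValue: string, passWhenIn: readonly string[]): 'pass' | 'fail' {
  return passWhenIn.indexOf(rawValue) !== -1 ? 'pass' : 'fail';
}

// Example categories are invented for this sketch.
const acceptable = ['good', 'excellent'] as const;
const goodResult = ordinalCheck('good', acceptable);
const poorResult = ordinalCheck('poor', acceptable);
```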

customVerdict()

Creates a verdict policy with custom logic. Use this when your pass/fail rules are complex and can't be expressed with built-in policies.

verdict:(score: Score, rawValue: T) => 'pass' | 'fail' | 'unknown'

Custom verdict function.
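A function with the documented shape might look like the sketch below. It is illustrative only: Score is modeled as a plain number, which is an assumption about the library's type, and the length-based rule and the name lengthAwareVerdict are invented for this example.

```typescript
// Assumption: Score is modeled as a plain number; the library's type may differ.
type Score = number;
type Verdict = 'pass' | 'fail' | 'unknown';

// Hypothetical rule for a metric whose raw value is a response string:
// empty responses cannot be judged, and short ones need a higher score to pass.
const lengthAwareVerdict = (score: Score, rawValue: string): Verdict => {
  if (rawValue.trim().length === 0) return 'unknown';
  const passAt = rawValue.length < 20 ? 0.9 : 0.7;
  return score >= passAt ? 'pass' : 'fail';
};
```

Returning 'unknown' lets the function signal that a target could not be judged instead of forcing a pass or a fail.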
