Evals

The Eval API is the primary user-facing API for building evaluations.

Import

import {
  defineSingleTurnEval,
  defineMultiTurnEval,
  defineScorerEval,
  runAllTargets,
  runSelectedSteps,
  runSelectedItems,
  booleanVerdict,
  thresholdVerdict,
  rangeVerdict,
  ordinalVerdict,
  customVerdict,
} from 'tally';

`defineSingleTurnEval()`

Creates an eval that applies a verdict policy to a single-turn metric. Use this to evaluate individual conversation steps or dataset items with pass/fail criteria.

name: string

Eval name.

description?: string

Optional description for this eval.

metric: SingleTurnMetricDef

Single-turn metric definition to evaluate.

verdict?: VerdictPolicyFor<TMetricValue>

Optional verdict policy inferred from the metric value type.

autoNormalize?: AutoNormalizer

Optional auto-normalization for boolean/ordinal metrics.

AutoNormalizer fields:

  kind: 'boolean' | 'ordinal' | 'number'

  Auto-normalization mode.

  trueScore?: number

  Score for true (boolean mode).

  falseScore?: number

  Score for false (boolean mode).

  weights?: Record<string | number, number>

  Ordinal weight map (ordinal mode).

context?: EvaluationContext

Optional evaluation context (run policy for single-turn).

EvaluationContext fields:

  singleTurn?: SingleTurnRunPolicy

  Single-turn run policy.

  metadata?: Record<string, unknown>

  Optional metadata for this run.

metadata?: Record<string, unknown>

Optional metadata for this eval.
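
To make the auto-normalization fields more concrete, here is a minimal sketch of how a raw metric value could be mapped to a score. It is not Tally's implementation: `AutoNormalizerSketch` and `normalizeSketch` are hypothetical local names, the defaults of 1 for true and 0 for false are assumptions, and so is the pass-through behavior for `'number'`.

```typescript
// Hypothetical local model of the AutoNormalizer fields listed above.
// Nothing here is imported from 'tally'.
type AutoNormalizerSketch = {
  kind: 'boolean' | 'ordinal' | 'number';
  trueScore?: number;
  falseScore?: number;
  weights?: Record<string | number, number>;
};

// Maps a raw metric value to a numeric score according to the chosen mode.
function normalizeSketch(
  raw: boolean | string | number,
  normalizer: AutoNormalizerSketch,
): number {
  switch (normalizer.kind) {
    case 'boolean': {
      if (typeof raw !== 'boolean') {
        throw new Error('boolean mode expects a boolean raw value');
      }
      // Assumed defaults: true -> 1, false -> 0.
      return raw ? (normalizer.trueScore ?? 1) : (normalizer.falseScore ?? 0);
    }
    case 'ordinal': {
      if (typeof raw === 'boolean') {
        throw new Error('ordinal mode expects a string or number category');
      }
      const weight = normalizer.weights?.[raw];
      if (weight === undefined) {
        // Assumption: a category without a weight is treated as an error.
        throw new Error(`no weight configured for category "${raw}"`);
      }
      return weight;
    }
    case 'number': {
      if (typeof raw !== 'number') {
        throw new Error('number mode expects a numeric raw value');
      }
      // Assumption: numeric values are passed through unchanged.
      return raw;
    }
  }
}

// A boolean metric with the assumed defaults, and an ordinal metric with explicit weights.
const booleanScore = normalizeSketch(true, { kind: 'boolean' });
const ordinalScore = normalizeSketch('fair', {
  kind: 'ordinal',
  weights: { poor: 0, fair: 0.5, good: 1 },
});
console.log(booleanScore, ordinalScore);
```

The point of the sketch is only that `trueScore`/`falseScore` apply in boolean mode and `weights` applies in ordinal mode; the library's actual defaults and error handling may differ.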

`defineMultiTurnEval()`

Creates an eval that applies a verdict policy to a multi-turn metric. Use this to evaluate entire conversations with pass/fail criteria for goals like completion or role adherence.

name: string

Eval name.

description?: string

Optional description for this eval.

metric: MultiTurnMetricDef

Multi-turn metric definition to evaluate.

verdict?: VerdictPolicyFor<TMetricValue>

Optional verdict policy inferred from the metric value type.

autoNormalize?: AutoNormalizer

Optional auto-normalization for boolean/ordinal metrics.

AutoNormalizer fields:

  kind: 'boolean' | 'ordinal' | 'number'

  Auto-normalization mode.

  trueScore?: number

  Score for true (boolean mode).

  falseScore?: number

  Score for false (boolean mode).

  weights?: Record<string | number, number>

  Ordinal weight map (ordinal mode).

context?: EvaluationContext

Optional evaluation context (run policy for single-turn).

EvaluationContext fields:

  singleTurn?: SingleTurnRunPolicy

  Single-turn run policy.

  metadata?: Record<string, unknown>

  Optional metadata for this run.

metadata?: Record<string, unknown>

Optional metadata for this eval.
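
The difference between the single-turn and multi-turn granularity can be illustrated with a small, library-independent sketch. `TurnSketch`, `replyIsNonEmpty`, and `goalCompleted` are hypothetical names invented for this example, and the "goal" check is a deliberately naive stand-in for a real multi-turn metric.

```typescript
// Hypothetical conversation step; not a type exported by 'tally'.
type TurnSketch = { role: 'user' | 'assistant'; text: string };

// Single-turn style: the metric looks at one step at a time.
function replyIsNonEmpty(step: TurnSketch): boolean {
  return step.text.trim().length > 0;
}

// Multi-turn style: the metric looks at the whole conversation at once.
// Naive assumption: the goal is met if any assistant turn mentions "confirmed".
function goalCompleted(steps: TurnSketch[]): boolean {
  return steps.some(
    (step) =>
      step.role === 'assistant' &&
      step.text.toLowerCase().indexOf('confirmed') !== -1,
  );
}

const conversation: TurnSketch[] = [
  { role: 'user', text: 'Please book a table for two.' },
  { role: 'assistant', text: 'Sure, for what time?' },
  { role: 'assistant', text: 'Your booking is confirmed for 7pm.' },
];

// One value per step versus one value for the entire conversation.
const perStepValues = conversation.map(replyIsNonEmpty);
const conversationValue = goalCompleted(conversation);
console.log(perStepValues.length, conversationValue);
```

A single-turn eval would attach a verdict to each entry of `perStepValues`, while a multi-turn eval would attach one verdict to `conversationValue`.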

`defineScorerEval()`

Creates an eval that applies a verdict policy to a scorer's combined output. Use this when quality depends on multiple weighted factors combined into a single score.

name: string

Eval name.

description?: string

Optional description for this eval.

scorer: Scorer

Scorer definition to combine input metrics.

verdict?: VerdictPolicyFor<number>

Optional verdict policy for the derived score.

context?: EvaluationContext

Optional evaluation context (run policy for single-turn).

EvaluationContext fields:

  singleTurn?: SingleTurnRunPolicy

  Single-turn run policy.

  metadata?: Record<string, unknown>

  Optional metadata for this run.

metadata?: Record<string, unknown>

Optional metadata for this eval.
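
As an illustration of "multiple weighted factors combined into a single score", the sketch below computes a weighted average and then applies a threshold to it. This is only one plausible way a scorer could combine inputs; `WeightedInputSketch` and `weightedAverageSketch` are hypothetical names, and the actual `Scorer` definition is not shown in this reference.

```typescript
// Hypothetical input to a scorer: a normalized score plus a relative weight.
type WeightedInputSketch = { name: string; score: number; weight: number };

// Combines the inputs into one score using a weighted average.
function weightedAverageSketch(inputs: WeightedInputSketch[]): number {
  const totalWeight = inputs.reduce((sum, input) => sum + input.weight, 0);
  if (totalWeight <= 0) {
    throw new Error('weights must sum to a positive number');
  }
  const weightedSum = inputs.reduce(
    (sum, input) => sum + input.score * input.weight,
    0,
  );
  return weightedSum / totalWeight;
}

const combinedScore = weightedAverageSketch([
  { name: 'relevance', score: 1, weight: 3 },
  { name: 'tone', score: 0.5, weight: 1 },
]);

// A verdict on the derived score, here a simple "score >= 0.75" bar.
const scorerPasses = combinedScore >= 0.75;
console.log(combinedScore, scorerPasses); // 0.875 true
```

Because the verdict is applied to the derived number, the policy type for a scorer eval is `VerdictPolicyFor<number>` rather than one inferred from a raw metric value.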

Target selection policies

`runAllTargets()`

Runs the eval on all available targets (all steps in a conversation, or all items in a dataset). This is the default behavior.

No options.

`runSelectedSteps()`

Runs the eval only on specific conversation steps by index. Use this to focus evaluation on particular turns (e.g., only the first and last step).

steps: number[]

Step indices to evaluate.

`runSelectedItems()`

Runs the eval only on specific dataset items by index. Use this to focus evaluation on particular items in a dataset.

indices: number[]

Dataset item indices to evaluate.
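
The three selection policies above all reduce to "which targets does the eval run on". A minimal sketch of that idea, assuming 0-based indices and assuming that out-of-range indices are silently ignored (neither is confirmed by this reference), might look as follows. `TargetPolicySketch` and `selectTargetsSketch` are hypothetical names.

```typescript
// Hypothetical model of a target selection policy; not the type used by 'tally'.
type TargetPolicySketch =
  | { run: 'all' }
  | { run: 'selected'; indices: number[] };

// Returns the targets (conversation steps or dataset items) the eval would run on.
function selectTargetsSketch<T>(targets: T[], policy: TargetPolicySketch): T[] {
  if (policy.run === 'all') {
    return targets.slice();
  }
  // Assumption: indices are 0-based; indices that match no target are ignored.
  const wanted = policy.indices;
  return targets.filter((_, index) => wanted.indexOf(index) !== -1);
}

const steps = ['greeting', 'clarification', 'answer', 'farewell'];

// Only the first and last step, as in the runSelectedSteps() example above.
const firstAndLast = selectTargetsSketch(steps, { run: 'selected', indices: [0, 3] });
const everything = selectTargetsSketch(steps, { run: 'all' });
console.log(firstAndLast, everything.length);
```

In this sketch the result keeps the original target order regardless of the order in which indices are listed.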

Verdict helpers

`booleanVerdict()`

Creates a verdict policy for boolean metrics. Passes when the raw value matches the specified boolean.

passWhen: boolean

Pass when raw value matches this boolean.

`thresholdVerdict()`

Creates a verdict policy that passes when the normalized score meets or exceeds a threshold. The most common way to set a quality bar.

passAt: number

Pass when score >= passAt.

`rangeVerdict()`

Creates a verdict policy that passes when the score falls within a specified range. Useful for balanced checks (e.g., "neither too brief nor too verbose").

min?: number

Optional minimum pass threshold.

max?: number

Optional maximum pass threshold.
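
The pass rules for the two score-based policies, `thresholdVerdict()` and `rangeVerdict()`, can be written out as plain predicates. This is a sketch of the semantics only: the function names are hypothetical, and treating both range bounds as inclusive (and an omitted bound as unbounded) is an assumption.

```typescript
// Threshold rule as documented: pass when score >= passAt.
function passesThresholdSketch(score: number, passAt: number): boolean {
  return score >= passAt;
}

// Range rule. Assumptions: bounds are inclusive, and a missing bound means "no limit".
function passesRangeSketch(
  score: number,
  range: { min?: number; max?: number },
): boolean {
  const aboveMin = range.min === undefined || score >= range.min;
  const belowMax = range.max === undefined || score <= range.max;
  return aboveMin && belowMax;
}

// A score exactly at the threshold passes.
console.log(passesThresholdSketch(0.7, 0.7)); // true

// "Neither too brief nor too verbose": only mid-range scores pass.
console.log(passesRangeSketch(0.5, { min: 0.25, max: 0.75 })); // true
console.log(passesRangeSketch(0.9, { min: 0.25, max: 0.75 })); // false
```

With only `min` set, the range rule in this sketch behaves like the threshold rule.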

`ordinalVerdict()`

Creates a verdict policy for ordinal (string) metrics. Passes when the raw value is one of the allowed categories.

passWhenIn: readonly string[]

Allowed ordinal values for pass.
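
Unlike the score-based policies, `booleanVerdict()` and `ordinalVerdict()` are described as comparing the raw metric value. The following sketch spells out those two rules with hypothetical helper names; the example metrics in the usage lines are made up for illustration.

```typescript
// Boolean rule: pass when the raw value matches the configured boolean.
function passesBooleanSketch(rawValue: boolean, passWhen: boolean): boolean {
  return rawValue === passWhen;
}

// Ordinal rule: pass when the raw category is in the allowed list.
function passesOrdinalSketch(
  rawValue: string,
  passWhenIn: readonly string[],
): boolean {
  return passWhenIn.indexOf(rawValue) !== -1;
}

// A hypothetical "contains PII" metric where false is the passing value.
console.log(passesBooleanSketch(false, false)); // true

// A hypothetical quality label where only the top two categories pass.
console.log(passesOrdinalSketch('good', ['good', 'excellent'])); // true
console.log(passesOrdinalSketch('poor', ['good', 'excellent'])); // false
```

Setting `passWhen` to `false` is what lets a "bad thing detected" metric pass when nothing is detected.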

`customVerdict()`

Creates a verdict policy with custom logic. Use this when your pass/fail rules are complex and can't be expressed with built-in policies.

verdict: (score: Score, rawValue: T) => 'pass' | 'fail' | 'unknown'

Custom verdict function.
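
Here is a sketch of the kind of function that could be passed as `verdict`, combining the normalized score with the raw value and returning `'unknown'` when no judgment is possible. `Score` is modelled as a plain number here, which is an assumption, and the latency metric and its 2000 ms limit are invented for the example.

```typescript
type VerdictSketch = 'pass' | 'fail' | 'unknown';

// Assumption: Score is represented as a plain number in this sketch.
type ScoreSketch = number;

// Hypothetical metric whose raw value is a latency in milliseconds,
// or null when the latency could not be measured.
function latencyAwareVerdict(
  score: ScoreSketch,
  rawValue: number | null,
): VerdictSketch {
  if (rawValue === null || isNaN(score)) {
    // No measurement or no usable score: decline to judge.
    return 'unknown';
  }
  if (rawValue > 2000) {
    // Hard limit on the raw value, regardless of the normalized score.
    return 'fail';
  }
  return score >= 0.7 ? 'pass' : 'fail';
}

console.log(latencyAwareVerdict(0.9, 350)); // pass
console.log(latencyAwareVerdict(0.9, 5000)); // fail
console.log(latencyAwareVerdict(0.9, null)); // unknown
```

A rule like this needs both arguments, which is why it cannot be expressed with a threshold or range policy alone.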
