Evals
Define evals, select targets, and compute verdicts.
The Eval API is the primary user-facing API for building evaluations.
Import
import {
  defineSingleTurnEval,
  defineMultiTurnEval,
  defineScorerEval,
  runAllTargets,
  runSelectedSteps,
  runSelectedItems,
  booleanVerdict,
  thresholdVerdict,
  rangeVerdict,
  ordinalVerdict,
  customVerdict,
} from 'tally';
defineSingleTurnEval()
Creates an eval that applies a verdict policy to a single-turn metric. Use this to evaluate individual conversation steps or dataset items with pass/fail criteria.
Eval name.
Optional description for this eval.
Single-turn metric definition to evaluate.
Optional verdict policy inferred from the metric value type.
Optional auto-normalization for boolean/ordinal metrics.
Auto-normalization mode.
Score for true (boolean mode).
Score for false (boolean mode).
Ordinal weight map (ordinal mode).
Optional evaluation context (run policy for single-turn).
Single-turn run policy.
Optional metadata for this run.
Optional metadata for this eval.
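To make the auto-normalization options above concrete, here is a minimal, self-contained sketch of how a boolean or ordinal raw value could be mapped to a numeric score. It does not call the library. The type `AutoNormalize`, the function `normalizeRaw`, and the field names `mode`, `trueScore`, `falseScore`, and `weights` are hypothetical stand-ins chosen for illustration, and throwing on an unknown ordinal value is an assumption.

```typescript
// Hypothetical shapes for illustration only; these are not the library's types.
type AutoNormalize =
  | { mode: 'boolean'; trueScore: number; falseScore: number }
  | { mode: 'ordinal'; weights: Record<string, number> };

// Map a raw boolean or ordinal value to a numeric score.
function normalizeRaw(raw: boolean | string, cfg: AutoNormalize): number {
  if (cfg.mode === 'boolean') {
    if (typeof raw !== 'boolean') {
      throw new TypeError('boolean mode expects a boolean raw value');
    }
    return raw ? cfg.trueScore : cfg.falseScore;
  }
  if (typeof raw !== 'string') {
    throw new TypeError('ordinal mode expects a string raw value');
  }
  const weight = cfg.weights[raw];
  // Assumption: an ordinal value missing from the weight map is an error.
  if (weight === undefined) {
    throw new RangeError(`no weight for ordinal value "${raw}"`);
  }
  return weight;
}

const boolScore = normalizeRaw(true, { mode: 'boolean', trueScore: 1, falseScore: 0 });
const ordScore = normalizeRaw('fair', {
  mode: 'ordinal',
  weights: { poor: 0, fair: 0.5, good: 1 },
});
```

With these inputs, `boolScore` is `1` and `ordScore` is `0.5`. A verdict policy can then be applied to the resulting score.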
defineMultiTurnEval()
Creates an eval that applies a verdict policy to a multi-turn metric. Use this to evaluate entire conversations with pass/fail criteria for goals like completion or role adherence.
Eval name.
Optional description for this eval.
Multi-turn metric definition to evaluate.
Optional verdict policy inferred from the metric value type.
Optional auto-normalization for boolean/ordinal metrics.
Auto-normalization mode.
Score for true (boolean mode).
Score for false (boolean mode).
Ordinal weight map (ordinal mode).
Optional evaluation context (run policy for single-turn).
Single-turn run policy.
Optional metadata for this run.
Optional metadata for this eval.
defineScorerEval()
Creates an eval that applies a verdict policy to a scorer's combined output. Use this when quality depends on multiple weighted factors combined into a single score.
Eval name.
Optional description for this eval.
Scorer definition to combine input metrics.
Optional verdict policy for the derived score.
Optional evaluation context (run policy for single-turn).
Single-turn run policy.
Optional metadata for this run.
Optional metadata for this eval.
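The idea of combining weighted factors into one score and then applying a verdict can be sketched without the library. The example below assumes a weighted average, which is only one possible combination rule. `WeightedInput` and `combineWeighted` are hypothetical names, not part of the documented API.

```typescript
// Hypothetical input shape: one normalized score per metric, plus a weight.
interface WeightedInput {
  name: string;
  score: number; // normalized score for this metric
  weight: number; // relative importance
}

// Assumption: the scorer combines its inputs as a weighted average.
function combineWeighted(inputs: WeightedInput[]): number {
  const totalWeight = inputs.reduce((sum, input) => sum + input.weight, 0);
  if (totalWeight <= 0) {
    throw new RangeError('weights must sum to a positive number');
  }
  const weightedSum = inputs.reduce((sum, input) => sum + input.score * input.weight, 0);
  return weightedSum / totalWeight;
}

const combined = combineWeighted([
  { name: 'relevance', score: 0.9, weight: 2 },
  { name: 'completeness', score: 0.6, weight: 1 },
]);

// Apply a threshold-style verdict to the derived score.
const passed = combined >= 0.75;
```

Here the combined score is approximately 0.8, so a threshold of 0.75 passes. A different weighting would shift the result without changing the verdict logic.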
Target selection policies
runAllTargets()
Runs the eval on all available targets (all steps in a conversation, or all items in a dataset). This is the default behavior.
No options.
runSelectedSteps()
Runs the eval only on specific conversation steps by index. Use this to focus evaluation on particular turns (e.g., only the first and last step).
Step indices to evaluate.
runSelectedItems()
Runs the eval only on specific dataset items by index. Use this to focus evaluation on particular items in a dataset.
Dataset item indices to evaluate.
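The three selection policies above amount to either keeping every target or picking targets by index. The following sketch illustrates that behavior with plain arrays. `TargetPolicy` and `selectTargets` are hypothetical names, and silently skipping out-of-range indices is an assumption rather than documented behavior.

```typescript
// Hypothetical policy shape for illustration; not the library's type.
type TargetPolicy =
  | { kind: 'all' }
  | { kind: 'selected'; indices: number[] };

// Return the targets (conversation steps or dataset items) the eval would run on.
function selectTargets<T>(targets: T[], policy: TargetPolicy): T[] {
  if (policy.kind === 'all') {
    return [...targets];
  }
  // Assumption: indices outside the array are skipped rather than rejected.
  return policy.indices
    .filter((i) => Number.isInteger(i) && i >= 0 && i < targets.length)
    .map((i) => targets[i]);
}

const steps = ['greeting', 'clarification', 'answer', 'farewell'];

const everyStep = selectTargets(steps, { kind: 'all' });
const firstAndLast = selectTargets(steps, {
  kind: 'selected',
  indices: [0, steps.length - 1],
});
```

`everyStep` contains all four steps, while `firstAndLast` contains only `'greeting'` and `'farewell'`.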
Verdict helpers
booleanVerdict()
Creates a verdict policy for boolean metrics. Passes when the raw value matches the specified boolean.
Pass when raw value matches this boolean.
thresholdVerdict()
Creates a verdict policy that passes when the normalized score meets or exceeds a threshold. This is the most common way to set a quality bar.
Pass when score >= passAt.
rangeVerdict()
Creates a verdict policy that passes when the score falls within a specified range. Useful for balanced checks (e.g., "neither too brief nor too verbose").
Optional minimum pass threshold.
Optional maximum pass threshold.
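The pass conditions for the threshold and range policies can be written out as plain predicates. This sketch mirrors the rules stated above and does not call the library. `passesThreshold` and `passesRange` are hypothetical helper names, and treating both range bounds as inclusive is an assumption.

```typescript
// Threshold rule as documented: pass when score >= passAt.
function passesThreshold(score: number, passAt: number): boolean {
  return score >= passAt;
}

// Range rule: both bounds are optional.
// Assumption: bounds are inclusive; a missing bound is treated as unbounded.
function passesRange(score: number, min?: number, max?: number): boolean {
  if (min !== undefined && score < min) return false;
  if (max !== undefined && score > max) return false;
  return true;
}

const meetsBar = passesThreshold(0.8, 0.8); // true: equal to the threshold passes
const belowBar = passesThreshold(0.79, 0.8); // false
const balanced = passesRange(0.5, 0.3, 0.7); // true: inside the range
const tooVerbose = passesRange(0.9, 0.3, 0.7); // false: above the maximum
const onlyMinimum = passesRange(0.9, 0.3); // true: no maximum given
```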
ordinalVerdict()
Creates a verdict policy for ordinal (string) metrics. Passes when the raw value is one of the allowed categories.
Allowed ordinal values for pass.
customVerdict()
Creates a verdict policy with custom logic. Use this when your pass/fail rules are complex and can't be expressed with built-in policies.
Custom verdict function.
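The boolean, ordinal, and custom policies all decide on the raw value rather than a numeric cutoff alone. The sketch below expresses each rule as a plain function. All names here (`passesBoolean`, `passesOrdinal`, `VerdictFn`, `strictVerdict`) are hypothetical, and the custom function's `(score, raw)` signature in particular is an assumption, since the source does not state what arguments a custom verdict function receives.

```typescript
// Boolean rule: pass when the raw value matches the expected boolean.
function passesBoolean(raw: boolean, passWhen: boolean): boolean {
  return raw === passWhen;
}

// Ordinal rule: pass when the raw value is one of the allowed categories.
function passesOrdinal(raw: string, allowed: readonly string[]): boolean {
  return allowed.includes(raw);
}

// Assumption: a custom verdict function sees the normalized score and the raw value.
type VerdictFn = (score: number, raw: unknown) => boolean;

// Example custom rule: require a high score and a non-empty string answer.
const strictVerdict: VerdictFn = (score, raw) =>
  score >= 0.7 && typeof raw === 'string' && raw.trim().length > 0;

const noHallucination = passesBoolean(false, false); // true: raw matches expected
const acceptableTone = passesOrdinal('neutral', ['neutral', 'friendly']); // true
const rejectedTone = passesOrdinal('hostile', ['neutral', 'friendly']); // false
const customPass = strictVerdict(0.9, 'Paris'); // true
const customFail = strictVerdict(0.9, '   '); // false: empty answer
```

A custom function is worth the extra code only when a rule combines several conditions, as `strictVerdict` does. Single-condition rules fit the built-in policies.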