Tally

Metric

Create metrics bottom-up with plain-object factory APIs.

Factory APIs are the preferred way to create metrics. These functions provide type-safe construction of metric definitions.

Import

import {
  defineBaseMetric,
  withNormalization,
  withMetadata,
  defineSingleTurnCode,
  defineSingleTurnLLM,
  defineMultiTurnCode,
  defineMultiTurnLLM,
} from '@tally-evals/tally';

defineBaseMetric()

Creates the foundational metric definition with name, value type, and optional metadata. This is the starting point for all metrics.

name:string

Unique metric name (used as identifier in reports).

valueType:'number' | 'boolean' | 'string'

The type of value this metric produces.

description?:string

Human-readable description for UI/reporting.

metadata?:Record<string, unknown>

Custom metadata attached to the metric.

normalization?:MetricNormalization<T>

Normalization configuration for converting raw values to 0–1 scores.

MetricNormalization<T>
normalizer:NormalizerSpec<T> | NormalizeToScore<T>

Normalizer specification object or custom function.

calibrate?:TNormContext | ((args: { dataset: readonly unknown[]; rawValues: readonly T[] }) => TNormContext | Promise<TNormContext>)

Static context or async function to derive calibration data from the dataset.
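For example, calibrate can be a function that derives context from the raw values observed across the dataset. A minimal sketch, assuming a { min, max } context shape paired with a min/max-style normalizer (the context shape is an assumption, not part of the documented API):

```typescript
// Sketch of a calibrate function (signature from MetricNormalization<T> above).
// The { min, max } context shape is an assumption chosen to pair with a
// min/max-style normalizer.
const calibrate = ({
  rawValues,
}: {
  dataset: readonly unknown[];
  rawValues: readonly number[];
}) => ({
  min: Math.min(...rawValues),
  max: Math.max(...rawValues),
});
```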


withNormalization()

Attaches a normalization strategy to an existing metric definition. Use this to convert raw values to a 0–1 score for consistent comparison and scoring.

metric:BaseMetricDef<T> | MetricDef<T, MetricContainer>

The metric or base metric to attach normalization to.

normalizer:NormalizerSpec<T> | NormalizeToScore<T>

Normalizer specification or custom function. See Normalizers reference for available specs.

calibrate?:TNormContext | ((args: { dataset: readonly unknown[]; rawValues: readonly T[] }) => TNormContext | Promise<TNormContext>)

Static calibration context or async function to derive context from raw values.
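Besides a spec object, normalizer accepts a custom function. A sketch of a hand-rolled normalizer, assuming NormalizeToScore<number> maps a raw value to a 0–1 score (check the Normalizers reference for the exact signature):

```typescript
// Hand-rolled normalizer: map a 0-5 rubric score onto [0, 1], clipping
// out-of-range values. The (raw) => score signature is an assumption based
// on the NormalizeToScore<T> name.
const toUnitScore = (raw: number): number => Math.min(Math.max(raw / 5, 0), 1);
```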


withMetadata()

Adds or merges metadata into an existing metric definition. Useful for tagging metrics with custom properties for reporting or debugging.

metric:BaseMetricDef<T> | MetricDef<T, MetricContainer>

The metric or base metric to annotate.

metadata:Record<string, unknown>

Metadata to merge into the metric definition.
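A shallow, spread-style merge where incoming keys win is a reasonable mental model for the behavior described above (an assumption about the merge semantics, not a documented guarantee):

```typescript
// Illustration of shallow-merge semantics (assumed): keys from the new
// metadata object overwrite keys already on the metric definition.
const existing = { team: 'evals', version: 1 };
const incoming = { version: 2, reviewed: true };
const merged = { ...existing, ...incoming };
```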


defineSingleTurnCode()

Creates a single-turn metric that uses TypeScript code to compute values. Best for deterministic checks like JSON validation, keyword presence, or response length.

base:BaseMetricDef<T>

Base metric definition (from defineBaseMetric).

compute:(args: { data: unknown; metadata?: Record<string, unknown> }) => T

Function that computes the metric value from preprocessed data.

preProcessor?:(selected: ConversationStep | DatasetItem) => Promise<unknown> | unknown

Optional function to prepare the target before compute. The default preProcessor exposes { input, output }.

dependencies?:BaseMetricDef[]

Other metrics this metric depends on (for execution ordering).

cacheable?:boolean

Whether results can be cached for identical inputs.

normalization?:MetricNormalization<T>

Normalization configuration (overrides base if provided).

metadata?:Record<string, unknown>

Additional metadata (merged with base metadata).

aggregators?:CompatibleAggregator<T>[]

Custom aggregators for summarizing results. Default aggregators are added based on valueType.
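To make the compute shape concrete, here is a sketch of a compute function for a hypothetical "isValidJson" metric (valueType: 'boolean'). It assumes the default preProcessor, so data is treated as { input, output }; the surrounding defineSingleTurnCode call is omitted to keep the logic self-contained:

```typescript
// compute for a hypothetical "isValidJson" single-turn code metric.
// With the default preProcessor, data is assumed to be { input, output }.
const computeIsValidJson = ({
  data,
}: {
  data: unknown;
  metadata?: Record<string, unknown>;
}): boolean => {
  const { output } = data as { input: string; output: string };
  try {
    JSON.parse(output);
    return true;
  } catch {
    return false;
  }
};
```

This function would be passed as compute alongside a base metric whose valueType is 'boolean'.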


defineSingleTurnLLM()

Creates a single-turn metric that uses an LLM to evaluate quality. Best for semantic judgments like relevance, tone, helpfulness, or toxicity detection.

base:BaseMetricDef<T>

Base metric definition (from defineBaseMetric).

provider:LanguageModel | (() => LanguageModel)

AI SDK language model instance or factory function.

prompt:PromptTemplate<TVars>

Prompt template for LLM evaluation.

PromptTemplate<TVars>
instruction:string

Template string with {{variable}} placeholders.

variables?:readonly string[]

List of variable names used in the template.

examples?:Array<{ input: Record<string, unknown>; expectedOutput: string }>

Few-shot examples for the LLM.

preProcessor?:(selected: ConversationStep | DatasetItem) => Promise<unknown> | unknown

Optional function to prepare the target before LLM evaluation.

rubric?:object

Optional rubric to guide LLM scoring consistency.

Rubric
criteria:string

Scoring criteria description.

scale?:string

Scale description (e.g., "1-5").

examples?:Array<{ score: number; reasoning: string }>

Example scores with reasoning.

postProcessing?:object

Optional post-processing of the LLM output.

PostProcessing
normalize?:boolean

Whether to normalize the output.

transform?:(rawOutput: string) => T

Transform raw LLM string to metric value.

normalization?:MetricNormalization<T>

Normalization configuration (overrides base if provided).

metadata?:Record<string, unknown>

Additional metadata (merged with base metadata).

aggregators?:CompatibleAggregator<T>[]

Custom aggregators for summarizing results. Default aggregators are added based on valueType.
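postProcessing.transform is where a raw LLM string becomes a typed metric value. A sketch that pulls the first number out of a free-form reply (the reply format shown in the comment is an assumption for illustration):

```typescript
// Extract the first numeric score from a raw LLM reply such as
// "Score: 4 - mostly relevant". Throws if no number is present so malformed
// replies fail loudly instead of silently scoring 0.
const transform = (rawOutput: string): number => {
  const match = rawOutput.match(/-?\d+(?:\.\d+)?/);
  if (!match) throw new Error(`No numeric score in LLM output: ${rawOutput}`);
  return Number(match[0]);
};
```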


defineMultiTurnCode()

Creates a multi-turn metric that uses TypeScript code to analyze entire conversations. Best for deterministic checks across conversation history.

base:BaseMetricDef<T>

Base metric definition (from defineBaseMetric).

runOnContainer:(container: Conversation) => Promise<unknown> | unknown

Prepares the conversation for downstream compute. Returns a serializable payload.

compute:(args: { data: unknown; metadata?: Record<string, unknown> }) => T

Function that computes the metric value from the prepared data.

dependencies?:BaseMetricDef[]

Other metrics this metric depends on.

cacheable?:boolean

Whether results can be cached.

normalization?:MetricNormalization<T>

Normalization configuration.

metadata?:Record<string, unknown>

Additional metadata.
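A sketch of the runOnContainer/compute pair for a turn-count metric. The real Conversation type comes from the library; the messages shape below is a stand-in assumed for illustration:

```typescript
// Stand-in for the library's Conversation container (assumed shape).
type Convo = { messages: Array<{ role: string; content: string }> };

// runOnContainer: boil the conversation down to a serializable payload.
const runOnContainer = (container: Convo) => ({
  assistantTurns: container.messages.filter((m) => m.role === 'assistant').length,
});

// compute: read the metric value back out of the prepared payload.
const compute = ({ data }: { data: unknown }): number =>
  (data as { assistantTurns: number }).assistantTurns;
```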


defineMultiTurnLLM()

Creates a multi-turn metric that uses an LLM to evaluate entire conversations. Best for assessing goal completion, role adherence, topic drift, and other multi-step behaviors.

base:BaseMetricDef<T>

Base metric definition (from defineBaseMetric).

runOnContainer:(container: Conversation) => Promise<unknown> | unknown

Prepares the conversation for LLM evaluation. Returns data to include in the prompt.

provider:LanguageModel | (() => LanguageModel)

AI SDK language model instance or factory function.

prompt:PromptTemplate<TVars>

Prompt template for LLM evaluation.

PromptTemplate<TVars>
instruction:string

Template string with {{variable}} placeholders.

variables?:readonly string[]

List of variable names used in the template.

examples?:Array<{ input: Record<string, unknown>; expectedOutput: string }>

Few-shot examples for the LLM.

rubric?:object

Optional rubric to guide LLM scoring.

Rubric
criteria:string

Scoring criteria description.

scale?:string

Scale description.

examples?:Array<{ score: number; reasoning: string }>

Example scores with reasoning.

postProcessing?:object

Optional post-processing of the LLM output.

PostProcessing
normalize?:boolean

Whether to normalize the output.

transform?:(rawOutput: string) => T

Transform raw LLM string to metric value.

normalization?:MetricNormalization<T>

Normalization configuration.

metadata?:Record<string, unknown>

Additional metadata.
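For LLM evaluation, the runOnContainer payload typically feeds a prompt variable. A sketch that flattens a conversation into a transcript string (again assuming a simple messages shape; the real Conversation type is the library's):

```typescript
// Stand-in for the library's Conversation container (assumed shape).
type Convo = { messages: Array<{ role: string; content: string }> };

// Flatten the conversation into a transcript that a {{transcript}} prompt
// variable could consume.
const toTranscript = (container: Convo): string =>
  container.messages.map((m) => `${m.role}: ${m.content}`).join('\n');
```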


Example

import { defineBaseMetric, defineSingleTurnLLM, withNormalization } from '@tally-evals/tally';
import { createMinMaxNormalizer } from '@tally-evals/tally/normalization';
import { google } from '@ai-sdk/google';

// Create base metric
const base = defineBaseMetric({
  name: 'answerRelevance',
  valueType: 'number',
  description: 'Measures how relevant the answer is to the query',
});

// Create LLM-based single-turn metric with normalization
const answerRelevance = defineSingleTurnLLM({
  base: withNormalization({
    metric: base,
    normalizer: createMinMaxNormalizer({ min: 0, max: 5, clip: true }),
  }),
  provider: google('models/gemini-2.5-flash-lite'),
  prompt: {
    instruction: `Score the relevance of the response to the query on a scale of 0-5.
    
Query: {{input}}
Response: {{output}}`,
    variables: ['input', 'output'] as const,
  },
  rubric: {
    criteria: '0 = completely irrelevant, 5 = perfectly relevant',
    scale: '0-5',
    examples: [
      { score: 5, reasoning: 'Directly answers the question with accurate information.' },
      { score: 0, reasoning: 'Response has nothing to do with the query.' },
    ],
  },
});
