Tally

Metric

Create metrics bottom-up with plain-object factory APIs.

Factory APIs are the preferred way to create metrics. These functions provide type-safe construction of metric definitions.

Import

import {
  defineBaseMetric,
  withNormalization,
  withMetadata,
  defineSingleTurnCode,
  defineSingleTurnLLM,
  defineMultiTurnCode,
  defineMultiTurnLLM,
} from '@tally-evals/tally';

defineBaseMetric()

Creates the foundational metric definition with name, value type, and optional metadata. This is the starting point for all metrics.

name:string

Unique metric name (used as identifier in reports).

valueType:'number' | 'boolean' | 'string'

The type of value this metric produces.

description?:string

Human-readable description for UI/reporting.

metadata?:Record<string, unknown>

Custom metadata attached to the metric.

normalization?:MetricNormalization<T>

Normalization configuration for converting raw values to 0–1 scores.

MetricNormalization<T>
normalizer:NormalizerSpec<T> | NormalizeToScore<T>

Normalizer specification object or custom function.

calibrate?:TNormContext | ((args: { dataset: readonly unknown[]; rawValues: readonly T[] }) => TNormContext | Promise<TNormContext>)

Static context or async function to derive calibration data from the dataset.
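For example, calibrate can be a function that derives context from the raw values observed across the dataset. A minimal sketch, assuming a { min, max } context shape paired with a min/max-style normalizer (the context shape is an assumption, not part of the documented API):

```typescript
// Sketch of a calibrate function (signature from MetricNormalization<T> above).
// The { min, max } context shape is an assumption chosen to pair with a
// min/max-style normalizer.
const calibrate = ({
  rawValues,
}: {
  dataset: readonly unknown[];
  rawValues: readonly number[];
}) => ({
  min: Math.min(...rawValues),
  max: Math.max(...rawValues),
});
```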


withNormalization()

Attaches a normalization strategy to an existing metric definition. Use this to convert raw values to a 0–1 score for consistent comparison and scoring.

metric:BaseMetricDef<T> | MetricDef<T, MetricContainer>

The metric or base metric to attach normalization to.

normalizer:NormalizerSpec<T> | NormalizeToScore<T>

Normalizer specification or custom function. See Normalizers reference for available specs.

calibrate?:TNormContext | ((args: { dataset: readonly unknown[]; rawValues: readonly T[] }) => TNormContext | Promise<TNormContext>)

Static calibration context or async function to derive context from raw values.
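Besides a spec object, normalizer accepts a custom function. A sketch of a hand-rolled normalizer, assuming NormalizeToScore<number> maps a raw value to a 0–1 score (check the Normalizers reference for the exact signature):

```typescript
// Hand-rolled normalizer: map a 0-5 rubric score onto [0, 1], clipping
// out-of-range values. The (raw) => score signature is an assumption based
// on the NormalizeToScore<T> name.
const toUnitScore = (raw: number): number => Math.min(Math.max(raw / 5, 0), 1);
```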


withMetadata()

Adds or merges metadata into an existing metric definition. Useful for tagging metrics with custom properties for reporting or debugging.

metric:BaseMetricDef<T> | MetricDef<T, MetricContainer>

The metric or base metric to annotate.

metadata:Record<string, unknown>

Metadata to merge into the metric definition.
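A shallow, spread-style merge where incoming keys win is a reasonable mental model for the behavior described above (an assumption about the merge semantics, not a documented guarantee):

```typescript
// Illustration of shallow-merge semantics (assumed): keys from the new
// metadata object overwrite keys already on the metric definition.
const existing = { team: 'evals', version: 1 };
const incoming = { version: 2, reviewed: true };
const merged = { ...existing, ...incoming };
```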


defineSingleTurnCode()

Creates a single-turn metric that uses TypeScript code to compute values. Best for deterministic checks like JSON validation, keyword presence, or response length.

base:BaseMetricDef<T>

Base metric definition (from defineBaseMetric).

compute:(args: { data: unknown; metadata?: Record<string, unknown> }) => T

Function that computes the metric value from preprocessed data.

preProcessor?:(selected: ConversationStep | DatasetItem) => Promise<unknown> | unknown

Optional function to prepare the target before compute. The default preProcessor exposes { input, output }.

dependencies?:BaseMetricDef[]

Other metrics this metric depends on (for execution ordering).

cacheable?:boolean

Whether results can be cached for identical inputs.

normalization?:MetricNormalization<T>

Normalization configuration (overrides base if provided).

metadata?:Record<string, unknown>

Additional metadata (merged with base metadata).

aggregators?:CompatibleAggregator<T>[]

Custom aggregators for summarizing results. Default aggregators are added based on valueType.
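To make the compute shape concrete, here is a sketch of a compute function for a hypothetical "isValidJson" metric (valueType: 'boolean'). It assumes the default preProcessor, so data is treated as { input, output }; the surrounding defineSingleTurnCode call is omitted to keep the logic self-contained:

```typescript
// compute for a hypothetical "isValidJson" single-turn code metric.
// With the default preProcessor, data is assumed to be { input, output }.
const computeIsValidJson = ({
  data,
}: {
  data: unknown;
  metadata?: Record<string, unknown>;
}): boolean => {
  const { output } = data as { input: string; output: string };
  try {
    JSON.parse(output);
    return true;
  } catch {
    return false;
  }
};
```

This function would be passed as compute alongside a base metric whose valueType is 'boolean'.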


defineSingleTurnLLM()

Creates a single-turn metric that uses an LLM to evaluate quality. Best for semantic judgments like relevance, tone, helpfulness, or toxicity detection.

base:BaseMetricDef<T>

Base metric definition (from defineBaseMetric).

provider:LanguageModel | (() => LanguageModel)

AI SDK language model instance or factory function.

prompt:PromptTemplate<TVars>

Prompt template for LLM evaluation.

PromptTemplate<TVars>
instruction:string

Template string with {{variable}} placeholders.

variables?:readonly string[]

List of variable names used in the template.

examples?:Array<{ input: Record<string, unknown>; expectedOutput: string }>

Few-shot examples for the LLM.

preProcessor?:(selected: ConversationStep | DatasetItem) => Promise<unknown> | unknown

Optional function to prepare the target before LLM evaluation.

rubric?:object

Optional rubric to guide LLM scoring consistency.

Rubric
criteria:string

Scoring criteria description.

scale?:string

Scale description (e.g., "1-5").

examples?:Array<{ score: number; reasoning: string }>

Example scores with reasoning.

postProcessing?:object

Optional post-processing of the LLM output.

PostProcessing
normalize?:boolean

Whether to normalize the output.

transform?:(rawOutput: string) => T

Transform raw LLM string to metric value.

normalization?:MetricNormalization<T>

Normalization configuration (overrides base if provided).

metadata?:Record<string, unknown>

Additional metadata (merged with base metadata).

aggregators?:CompatibleAggregator<T>[]

Custom aggregators for summarizing results. Default aggregators are added based on valueType.
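postProcessing.transform is where a raw LLM string becomes a typed metric value. A sketch that pulls the first number out of a free-form reply (the reply format shown in the comment is an assumption for illustration):

```typescript
// Extract the first numeric score from a raw LLM reply such as
// "Score: 4 - mostly relevant". Throws if no number is present so malformed
// replies fail loudly instead of silently scoring 0.
const transform = (rawOutput: string): number => {
  const match = rawOutput.match(/-?\d+(?:\.\d+)?/);
  if (!match) throw new Error(`No numeric score in LLM output: ${rawOutput}`);
  return Number(match[0]);
};
```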


defineMultiTurnCode()

Creates a multi-turn metric that uses TypeScript code to analyze entire conversations. Best for deterministic checks across conversation history.

base:BaseMetricDef<T>

Base metric definition (from defineBaseMetric).

runOnContainer:(container: Conversation) => Promise<unknown> | unknown

Prepares the conversation for downstream compute. Returns a serializable payload.

compute:(args: { data: unknown; metadata?: Record<string, unknown> }) => T

Function that computes the metric value from the prepared data.

dependencies?:BaseMetricDef[]

Other metrics this metric depends on.

cacheable?:boolean

Whether results can be cached.

normalization?:MetricNormalization<T>

Normalization configuration.

metadata?:Record<string, unknown>

Additional metadata.
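A sketch of the runOnContainer/compute pair for a turn-count metric. The real Conversation type comes from the library; the messages shape below is a stand-in assumed for illustration:

```typescript
// Stand-in for the library's Conversation container (assumed shape).
type Convo = { messages: Array<{ role: string; content: string }> };

// runOnContainer: boil the conversation down to a serializable payload.
const runOnContainer = (container: Convo) => ({
  assistantTurns: container.messages.filter((m) => m.role === 'assistant').length,
});

// compute: read the metric value back out of the prepared payload.
const compute = ({ data }: { data: unknown }): number =>
  (data as { assistantTurns: number }).assistantTurns;
```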


defineMultiTurnLLM()

Creates a multi-turn metric that uses an LLM to evaluate entire conversations. Best for assessing goal completion, role adherence, topic drift, and other multi-step behaviors.

base:BaseMetricDef<T>

Base metric definition (from defineBaseMetric).

runOnContainer:(container: Conversation) => Promise<unknown> | unknown

Prepares the conversation for LLM evaluation. Returns data to include in the prompt.

provider:LanguageModel | (() => LanguageModel)

AI SDK language model instance or factory function.

prompt:PromptTemplate<TVars>

Prompt template for LLM evaluation.

PromptTemplate<TVars>
instruction:string

Template string with {{variable}} placeholders.

variables?:readonly string[]

List of variable names used in the template.

examples?:Array<{ input: Record<string, unknown>; expectedOutput: string }>

Few-shot examples for the LLM.

rubric?:object

Optional rubric to guide LLM scoring.

Rubric
criteria:string

Scoring criteria description.

scale?:string

Scale description.

examples?:Array<{ score: number; reasoning: string }>

Example scores with reasoning.

postProcessing?:object

Optional post-processing of the LLM output.

PostProcessing
normalize?:boolean

Whether to normalize the output.

transform?:(rawOutput: string) => T

Transform raw LLM string to metric value.

normalization?:MetricNormalization<T>

Normalization configuration.

metadata?:Record<string, unknown>

Additional metadata.
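For LLM evaluation, the runOnContainer payload typically feeds a prompt variable. A sketch that flattens a conversation into a transcript string (again assuming a simple messages shape; the real Conversation type is the library's):

```typescript
// Stand-in for the library's Conversation container (assumed shape).
type Convo = { messages: Array<{ role: string; content: string }> };

// Flatten the conversation into a transcript that a {{transcript}} prompt
// variable could consume.
const toTranscript = (container: Convo): string =>
  container.messages.map((m) => `${m.role}: ${m.content}`).join('\n');
```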


Example

import { defineBaseMetric, defineSingleTurnLLM, withNormalization } from '@tally-evals/tally';
import { createMinMaxNormalizer } from '@tally-evals/tally/normalization';
import { google } from '@ai-sdk/google';

// Create base metric
const base = defineBaseMetric({
  name: 'answerRelevance',
  valueType: 'number',
  description: 'Measures how relevant the answer is to the query',
});

// Create LLM-based single-turn metric with normalization
const answerRelevance = defineSingleTurnLLM({
  base: withNormalization({
    metric: base,
    normalizer: createMinMaxNormalizer({ min: 0, max: 5, clip: true }),
  }),
  provider: google('models/gemini-2.5-flash-lite'),
  prompt: {
    instruction: `Score the relevance of the response to the query on a scale of 0-5.
    
Query: {{input}}
Response: {{output}}`,
    variables: ['input', 'output'] as const,
  },
  rubric: {
    criteria: '0 = completely irrelevant, 5 = perfectly relevant',
    scale: '0-5',
    examples: [
      { score: 5, reasoning: 'Directly answers the question with accurate information.' },
      { score: 0, reasoning: 'Response has nothing to do with the query.' },
    ],
  },
});
