Tally

Builtin Metrics

Prebuilt metric definitions for common evaluation needs.

Prebuilt metrics are exported from the @tally-evals/tally/metrics entrypoint. They provide ready-to-use evaluation capabilities for common use cases.

Import

import {
  // Single-turn metrics
  createAnswerRelevanceMetric,
  createAnswerSimilarityMetric,
  createCompletenessMetric,
  createToxicityMetric,
  createToolCallAccuracyMetric,
  // Multi-turn metrics
  createRoleAdherenceMetric,
  createGoalCompletionMetric,
  createTopicAdherenceMetric,
} from '@tally-evals/tally/metrics';

Single-Turn Metrics

createAnswerRelevanceMetric()

Measures how relevant the assistant's response is to the user's query. Uses LLM-as-judge to score relevance on a 0–5 scale, normalized to 0–1.

provider: LanguageModel

AI SDK language model for relevance analysis.

partialWeight?: number (default: 0.3)

Weight for "unsure" statements in the rubric.

aggregators?: NumericAggregatorDef[]

Custom numeric aggregators. Defaults include percentiles.
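
The scoring rubric is internal to the library, but the description above suggests a statement-level LLM-as-judge pattern. A minimal sketch of how partialWeight could enter the score, assuming the judge labels each statement in the response as relevant, unsure, or irrelevant (Verdict and scoreRelevance are illustrative names, not part of the API):

```typescript
// Illustrative only: one common LLM-as-judge pattern labels each statement
// in the response, then averages, crediting "unsure" at a partial weight.
type Verdict = 'yes' | 'unsure' | 'no';

function scoreRelevance(verdicts: Verdict[], partialWeight = 0.3): number {
  if (verdicts.length === 0) return 0;
  const credit = verdicts.reduce((sum, v) => {
    if (v === 'yes') return sum + 1;                // fully relevant statement
    if (v === 'unsure') return sum + partialWeight; // partial credit
    return sum;                                     // irrelevant statement
  }, 0);
  return credit / verdicts.length; // normalized to 0–1
}
```

With the default partialWeight of 0.3, a response judged ['yes', 'unsure', 'no'] would score (1 + 0.3 + 0) / 3 ≈ 0.43.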


createAnswerSimilarityMetric()

Measures semantic similarity between the response and a target/expected answer. Can use embeddings or keyword matching.

embeddingModel?: EmbeddingModel

Optional embedding model for semantic similarity.

targetResponse?: string

Target response to compare against. Can also be provided via metadata.targetResponse.

minKeywords?: number (default: 1)

Minimum matching keywords required (keyword mode).
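
When no embeddingModel is supplied, the description above implies a keyword-matching fallback. A minimal sketch of what that mode might compute, assuming case-insensitive word overlap against targetResponse (keywordSimilarity, its scoring, and the pass/fail gate on minKeywords are illustrative assumptions, not the library's internals):

```typescript
// Illustrative only: score by the fraction of target keywords found in the
// response; pass when at least minKeywords of them matched.
function keywordSimilarity(
  response: string,
  targetResponse: string,
  minKeywords = 1,
): { score: number; pass: boolean } {
  const keywords = targetResponse.toLowerCase().split(/\W+/).filter(Boolean);
  const responseWords = new Set(
    response.toLowerCase().split(/\W+/).filter(Boolean),
  );
  const matched = keywords.filter((k) => responseWords.has(k)).length;
  return {
    score: keywords.length === 0 ? 0 : matched / keywords.length,
    pass: matched >= minKeywords,
  };
}
```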


createCompletenessMetric()

Measures whether the response fully addresses all aspects of the query. Uses LLM-as-judge to score on a 0–5 scale, normalized to 0–1.

provider: LanguageModel

AI SDK language model for completeness analysis.

expectedPoints?: string[]

Expected key points/topics to check for coverage.

aggregators?: NumericAggregatorDef[]

Custom numeric aggregators.
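
The 0–5 judge scale normalized to 0–1 is stated above; how expectedPoints feed into the score is not, so the coverage helper below is an assumption (both function names are illustrative, not part of the API):

```typescript
// Clamp the raw judge score to the stated 0–5 rubric, then scale to 0–1.
function normalizeJudgeScore(raw: number): number {
  return Math.min(Math.max(raw, 0), 5) / 5;
}

// Assumption: with expectedPoints supplied, completeness could be framed as
// the fraction of expected points the judge marks as covered.
function pointCoverage(covered: boolean[]): number {
  if (covered.length === 0) return 1; // assumption: nothing expected → complete
  return covered.filter(Boolean).length / covered.length;
}
```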


createToxicityMetric()

Detects harmful, offensive, or inappropriate content in responses. Returns a toxicity score where lower is better.

provider: LanguageModel

AI SDK language model for toxicity detection.

categories?: Array<'hate' | 'harassment' | 'violence' | 'self-harm' | 'sexual' | 'profanity'>

Toxicity categories to emphasize during evaluation.
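
How per-category scores combine into the overall toxicity score is not specified; one plausible aggregation, sketched here as an assumption, is to let the worst category dominate (overallToxicity is an illustrative name, not part of the API):

```typescript
// Category union mirrors the categories parameter above.
type ToxicityCategory =
  | 'hate' | 'harassment' | 'violence' | 'self-harm' | 'sexual' | 'profanity';

// Assumption: each category is scored 0–1 (higher = more toxic) and the
// overall score is the maximum, so one severe category cannot be averaged away.
function overallToxicity(
  scores: Partial<Record<ToxicityCategory, number>>,
): number {
  const values = Object.values(scores).filter(
    (v): v is number => typeof v === 'number',
  );
  return values.length === 0 ? 0 : Math.max(...values);
}
```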


createToolCallAccuracyMetric()

Validates that the agent made correct tool calls with proper arguments. Checks tool names, argument schemas, and optionally call order.

expectedToolCalls: Array<{ toolName: string; argsSchema?: z.ZodSchema }>

Expected tool calls to validate against.

ExpectedToolCall
toolName: string

Name of the expected tool.

argsSchema?: z.ZodSchema

Zod schema for validating tool arguments.

toolCallOrder?: string[]

Expected order of tool calls (array of tool names).

strictMode?: boolean (default: false)

If true, sequence must be exact with no extra calls.
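
The matching logic is internal, but the checks described above (tool names, argument validation, call order, strict mode) can be sketched roughly as follows. To stay self-contained, this sketch swaps the zod argsSchema for a plain predicate; checkToolCalls and its in-order matching are assumptions, not the library's implementation:

```typescript
interface ExpectedToolCall {
  toolName: string;
  // Stand-in for argsSchema: in the real API this is a zod schema.
  validateArgs?: (args: unknown) => boolean;
}

interface ActualToolCall {
  toolName: string;
  args: unknown;
}

// Every expected call must appear in order with valid arguments; in strict
// mode the actual sequence may contain no extra calls.
function checkToolCalls(
  expected: ExpectedToolCall[],
  actual: ActualToolCall[],
  strictMode = false,
): boolean {
  if (strictMode && expected.length !== actual.length) return false;
  let cursor = 0;
  for (const exp of expected) {
    const idx = actual.findIndex(
      (call, i) =>
        i >= cursor &&
        call.toolName === exp.toolName &&
        (!exp.validateArgs || exp.validateArgs(call.args)),
    );
    if (idx === -1) return false;
    cursor = idx + 1; // later expected calls must match after this one
  }
  return true;
}
```

Because the search cursor only moves forward, out-of-order calls fail; with strict mode on, the length check plus in-order matching forces an exact sequence.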


Multi-Turn Metrics

createRoleAdherenceMetric()

Measures how well the assistant maintains its assigned role/persona across the entire conversation. Multi-turn metric.

expectedRole: string

Description of the role the assistant should adhere to.

provider: LanguageModel

AI SDK language model for role adherence analysis.

checkConsistency?: boolean (default: true)

Whether to evaluate consistency across all turns.


createGoalCompletionMetric()

Measures whether the conversation achieved its intended goal. Evaluates progress and final outcome across all turns. Multi-turn metric.

goal: string

Description of the goal to evaluate against.

provider: LanguageModel

AI SDK language model for goal completion analysis.

checkPartialCompletion?: boolean (default: true)

Whether to reward partial progress toward the goal.

considerEfficiency?: boolean (default: false)

Whether to consider efficiency (fewer turns) in scoring.
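
How efficiency affects the score is not documented; as a loose assumption, enabling considerEfficiency might discount the completion score once the conversation runs past some turn budget. All names and constants below are hypothetical:

```typescript
// Assumption: a mild per-turn penalty beyond a budget, capped so efficiency
// can dent but never dominate the goal-completion score.
function efficiencyAdjusted(base: number, turns: number, budget = 6): number {
  if (turns <= budget) return base;
  const penalty = Math.min(0.25, 0.05 * (turns - budget));
  return Math.max(0, base - penalty);
}
```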


createTopicAdherenceMetric()

Measures whether the assistant stays on topic throughout the conversation. Detects off-topic tangents and topic drift. Multi-turn metric.

topics: string[]

Topics the assistant should adhere to.

provider: LanguageModel

AI SDK language model for topic adherence analysis.

allowTopicTransitions?: boolean (default: true)

If true, allows natural transitions between related topics.

strictMode?: boolean (default: false)

If true, penalizes deviations more strictly.
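
A plausible shape for the score, sketched as an assumption: judge each turn against the topic list, then average, with transitions forgiven unless strictMode tightens the rubric (TurnVerdict and topicAdherenceScore are illustrative names, not part of the API):

```typescript
// Assumption: the judge labels each assistant turn, and the score is the
// mean per-turn credit. Transitions earn full credit by default, half credit
// when transitions are disallowed, and none in strict mode.
type TurnVerdict = 'on-topic' | 'transition' | 'off-topic';

function topicAdherenceScore(
  verdicts: TurnVerdict[],
  opts: { allowTopicTransitions?: boolean; strictMode?: boolean } = {},
): number {
  const { allowTopicTransitions = true, strictMode = false } = opts;
  if (verdicts.length === 0) return 1; // assumption: empty conversation passes
  const transitionCredit = strictMode ? 0 : allowTopicTransitions ? 1 : 0.5;
  const total = verdicts.reduce((sum, v) => {
    if (v === 'on-topic') return sum + 1;
    if (v === 'transition') return sum + transitionCredit;
    return sum; // off-topic turns earn nothing
  }, 0);
  return total / verdicts.length;
}
```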


Example

import { google } from '@ai-sdk/google';
import {
  createAnswerRelevanceMetric,
  createRoleAdherenceMetric,
} from '@tally-evals/tally/metrics';
import { defineSingleTurnEval, defineMultiTurnEval, thresholdVerdict } from '@tally-evals/tally';

const model = google('models/gemini-2.5-flash-lite');

// Single-turn metric
const relevance = createAnswerRelevanceMetric({ provider: model });
const relevanceEval = defineSingleTurnEval({
  name: 'Answer Relevance',
  metric: relevance,
  verdict: thresholdVerdict(0.7),
});

// Multi-turn metric
const roleAdherence = createRoleAdherenceMetric({
  expectedRole: 'helpful customer support agent',
  provider: model,
});
const roleEval = defineMultiTurnEval({
  name: 'Role Adherence',
  metric: roleAdherence,
  verdict: thresholdVerdict(0.8),
});
