Builtin Metrics
Prebuilt metric definitions for common evaluation needs.
Prebuilt metrics are exported from the @tally-evals/tally/metrics entrypoint. They provide ready-to-use evaluation capabilities for common use cases.
Import
```ts
import {
  // Single-turn metrics
  createAnswerRelevanceMetric,
  createAnswerSimilarityMetric,
  createCompletenessMetric,
  createToxicityMetric,
  createToolCallAccuracyMetric,
  // Multi-turn metrics
  createRoleAdherenceMetric,
  createGoalCompletionMetric,
  createTopicAdherenceMetric,
} from '@tally-evals/tally/metrics';
```

Single-Turn Metrics
createAnswerRelevanceMetric()
Measures how relevant the assistant's response is to the user's query. Uses LLM-as-judge to score relevance on a 0–5 scale, normalized to 0–1.
Options:
- AI SDK language model for relevance analysis.
- Weight for "unsure" statements in the rubric.
- Custom numeric aggregators. Defaults include percentiles.
createAnswerSimilarityMetric()
Measures semantic similarity between the response and a target/expected answer. Can use embeddings or keyword matching.
Options:
- Optional embedding model for semantic similarity.
- Target response to compare against. Can also be provided via metadata.targetResponse.
- Minimum matching keywords required (keyword mode).
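A sketch of configuring the metric in both modes. The `targetResponse` key follows the metadata field named above; `embeddingModel` and `minKeywordMatches` are assumed option names, so verify them against the package's exported types:

```typescript
import { google } from '@ai-sdk/google';
import { createAnswerSimilarityMetric } from '@tally-evals/tally/metrics';

// Semantic mode: an embedding model drives similarity scoring.
const semantic = createAnswerSimilarityMetric({
  embeddingModel: google.textEmbeddingModel('text-embedding-004'),
  targetResponse: 'The capital of France is Paris.',
});

// Keyword mode: without an embedding model, matching falls back to keywords.
const keyword = createAnswerSimilarityMetric({
  targetResponse: 'The capital of France is Paris.',
  minKeywordMatches: 2,
});
```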
createCompletenessMetric()
Measures whether the response fully addresses all aspects of the query. Uses LLM-as-judge to score on a 0–5 scale, normalized to 0–1.
Options:
- AI SDK language model for completeness analysis.
- Expected key points/topics to check for coverage.
- Custom numeric aggregators.
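A minimal configuration sketch. `provider` follows the pattern in the Example section below; `keyPoints` is an assumed name for the expected key points option:

```typescript
import { google } from '@ai-sdk/google';
import { createCompletenessMetric } from '@tally-evals/tally/metrics';

// `keyPoints` is an assumed option name; check the package's types.
const completeness = createCompletenessMetric({
  provider: google('models/gemini-2.5-flash-lite'),
  keyPoints: ['refund window', 'return shipping', 'original packaging'],
});
```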
createToxicityMetric()
Detects harmful, offensive, or inappropriate content in responses. Returns a toxicity score where lower is better.
Options:
- AI SDK language model for toxicity detection.
- Toxicity categories to emphasize during evaluation.
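A configuration sketch. Because lower scores are better here, a pass/fail verdict would typically cap the score rather than require a minimum. `categories` is an assumed name for the category-emphasis option:

```typescript
import { google } from '@ai-sdk/google';
import { createToxicityMetric } from '@tally-evals/tally/metrics';

// `categories` is an assumed option name; verify against the package's types.
const toxicity = createToxicityMetric({
  provider: google('models/gemini-2.5-flash-lite'),
  categories: ['hate', 'harassment', 'profanity'],
});
```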
createToolCallAccuracyMetric()
Validates that the agent made correct tool calls with proper arguments. Checks tool names, argument schemas, and optionally call order.
Options:
- Expected tool calls to validate against. Each expected call specifies:
  - Name of the expected tool.
  - Zod schema for validating tool arguments.
- Expected order of tool calls (array of tool names).
- If true, sequence must be exact with no extra calls.
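A sketch of a two-call expectation with argument schemas and strict ordering. The option keys (`expectedToolCalls`, `toolName`, `argsSchema`, `expectedOrder`, `strictOrder`) are assumptions, so check the exported types before relying on them:

```typescript
import { z } from 'zod';
import { createToolCallAccuracyMetric } from '@tally-evals/tally/metrics';

// Option names below are assumptions; verify against the package's types.
const toolAccuracy = createToolCallAccuracyMetric({
  expectedToolCalls: [
    {
      toolName: 'lookupOrder',
      argsSchema: z.object({ orderId: z.string() }),
    },
    {
      toolName: 'issueRefund',
      argsSchema: z.object({ orderId: z.string(), amount: z.number().positive() }),
    },
  ],
  expectedOrder: ['lookupOrder', 'issueRefund'],
  strictOrder: true, // exact sequence, no extra calls allowed
});
```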
Multi-Turn Metrics
createRoleAdherenceMetric()
Measures how well the assistant maintains its assigned role/persona across the entire conversation. Multi-turn metric.
Options:
- Description of the role the assistant should adhere to.
- AI SDK language model for role adherence analysis.
- Whether to evaluate consistency across all turns.
createGoalCompletionMetric()
Measures whether the conversation achieved its intended goal. Evaluates progress and final outcome across all turns. Multi-turn metric.
Options:
- Description of the goal to evaluate against.
- AI SDK language model for goal completion analysis.
- Whether to reward partial progress toward the goal.
- Whether to consider efficiency (fewer turns) in scoring.
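A configuration sketch. `goal`, `allowPartialProgress`, and `considerEfficiency` are assumed option names for the options listed above; verify them against the package's exported types:

```typescript
import { google } from '@ai-sdk/google';
import { createGoalCompletionMetric } from '@tally-evals/tally/metrics';

const model = google('models/gemini-2.5-flash-lite');

// Option names below are assumptions; check the package's types.
const goalCompletion = createGoalCompletionMetric({
  goal: "Resolve the customer's billing dispute without escalation",
  provider: model,
  allowPartialProgress: true, // reward partial progress toward the goal
  considerEfficiency: true,   // fewer turns scores higher
});
```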
createTopicAdherenceMetric()
Measures whether the assistant stays on topic throughout the conversation. Detects off-topic tangents and topic drift. Multi-turn metric.
Options:
- Topics the assistant should adhere to.
- AI SDK language model for topic adherence analysis.
- If true, allows natural transitions between related topics.
- If true, penalizes deviations more strictly.
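A configuration sketch. `topics`, `allowTopicTransitions`, and `strict` are assumed option names for the options listed above, so confirm them against the package's types:

```typescript
import { google } from '@ai-sdk/google';
import { createTopicAdherenceMetric } from '@tally-evals/tally/metrics';

// Option names below are assumptions; check the package's exported types.
const topicAdherence = createTopicAdherenceMetric({
  topics: ['billing', 'subscriptions', 'refunds'],
  provider: google('models/gemini-2.5-flash-lite'),
  allowTopicTransitions: true, // permit natural moves between related topics
  strict: false,               // do not penalize minor deviations harshly
});
```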
Example
```ts
import { google } from '@ai-sdk/google';
import {
  createAnswerRelevanceMetric,
  createRoleAdherenceMetric,
} from '@tally-evals/tally/metrics';
import {
  defineSingleTurnEval,
  defineMultiTurnEval,
  thresholdVerdict,
} from '@tally-evals/tally';

const model = google('models/gemini-2.5-flash-lite');

// Single-turn metric
const relevance = createAnswerRelevanceMetric({ provider: model });
const relevanceEval = defineSingleTurnEval({
  name: 'Answer Relevance',
  metric: relevance,
  verdict: thresholdVerdict(0.7),
});

// Multi-turn metric
const roleAdherence = createRoleAdherenceMetric({
  expectedRole: 'helpful customer support agent',
  provider: model,
});
const roleEval = defineMultiTurnEval({
  name: 'Role Adherence',
  metric: roleAdherence,
  verdict: thresholdVerdict(0.8),
});
```