Built-in Metrics
Pre-built metrics provided by Tally.
Tally ships with a small set of pre-built metrics for common agent evaluation needs. LLM-based metrics rely on a model's judgment; code-based metrics are deterministic.
All built-in metrics are MetricDefs you can reuse across scorers and evals.
Single-Turn Metrics
These metrics evaluate individual steps or turns in a conversation.
| Metric factory | Type | Output | Description |
|---|---|---|---|
| createAnswerRelevanceMetric | LLM | number (0–5, normalized to 0–1) | How relevant is the response to the user's input? |
| createCompletenessMetric | LLM | number | Does the response address all parts of the user's query? |
| createToxicityMetric | LLM | number | Checks for harmful or offensive content in the response. |
| createAnswerSimilarityMetric | Code | number (0–1) | Measures similarity to a target answer (keyword-based; optional embeddings). |
| createToolCallAccuracyMetric | Code | number (0–1) | Measures tool call correctness (presence/args/order) against expectations. |
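
The table mentions two numeric conventions: a 0–5 judgment normalized to 0–1, and a keyword-based similarity in the 0–1 range. The sketch below shows one plausible way to compute each. It is an illustration under stated assumptions, not Tally's implementation: `normalizeScore` and `keywordSimilarity` are hypothetical helper names, the normalization is assumed to be a simple division by the maximum, and the similarity is assumed to be a Jaccard overlap of lowercase keywords.

```typescript
// Illustrative only: these helpers are hypothetical and NOT part of Tally's API.

/** Map a raw rubric score on [0, max] to [0, 1], clamping out-of-range input. */
function normalizeScore(raw: number, max: number = 5): number {
  if (!Number.isFinite(raw) || max <= 0) {
    throw new RangeError('raw must be finite and max must be positive');
  }
  const clamped = Math.min(Math.max(raw, 0), max);
  return clamped / max;
}

/** Lowercase the text and collect its alphanumeric tokens into a set. */
function keywords(text: string): Set<string> {
  const tokens: string[] = text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  return new Set<string>(tokens);
}

/** Jaccard overlap of the two keyword sets: |A ∩ B| / |A ∪ B|, always in [0, 1]. */
function keywordSimilarity(answer: string, target: string): number {
  const a = keywords(answer);
  const b = keywords(target);
  if (a.size === 0 && b.size === 0) return 1; // assumption: two empty texts count as identical
  let shared = 0;
  a.forEach((word) => {
    if (b.has(word)) shared += 1;
  });
  return shared / (a.size + b.size - shared);
}

console.log(normalizeScore(4)); // 0.8
console.log(keywordSimilarity('Flights to Paris', 'paris flights')); // 2 shared keywords out of 3 distinct
```

A pure keyword overlap ignores word order and synonyms, which is presumably why the table lists embeddings as an optional alternative.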
Multi-Turn Metrics
These metrics evaluate the conversation as a whole rather than individual turns.
| Metric factory | Type | Output | Description |
|---|---|---|---|
| createRoleAdherenceMetric | LLM | number | Does the agent maintain its assigned persona throughout? |
| createGoalCompletionMetric | LLM | number | Did the agent successfully achieve the user's goal? |
| createTopicAdherenceMetric | LLM | number | Did the agent stay on topic across multiple turns? |
Usage Example
import { z } from 'zod';
import { google } from '@ai-sdk/google';
import {
  createAnswerRelevanceMetric,
  createCompletenessMetric,
  createToolCallAccuracyMetric,
  createRoleAdherenceMetric,
  createTopicAdherenceMetric,
} from '@tally-evals/tally/metrics';

// Imagine a "travel planner" agent that should call `searchFlights({ origin, destination, departDate })`
const provider = google('models/gemini-2.0-flash');

// LLM-judged single-turn signals
const relevance = createAnswerRelevanceMetric({
  provider,
  partialWeight: 0.3,
});

const completeness = createCompletenessMetric({
  provider,
  expectedPoints: [
    'Ask for missing constraints (budget, dates, cabin class) if needed',
    'Recommend a concrete itinerary option',
    'Explain tradeoffs (price vs duration vs layovers)',
    'Confirm before booking / purchasing',
  ],
});

// Deterministic single-turn signal: tool calling correctness
const toolCallAccuracy = createToolCallAccuracyMetric({
  expectedToolCalls: [
    {
      toolName: 'searchFlights',
      argsSchema: z.object({
        origin: z.string(),
        destination: z.string(),
        departDate: z.string(),
      }),
    },
  ],
  toolCallOrder: ['searchFlights'],
  strictMode: true,
});

// Multi-turn signals (conversation-level)
const roleAdherence = createRoleAdherenceMetric({
  expectedRole: 'a helpful travel planner who asks clarifying questions and avoids hallucinating prices',
  provider,
  checkConsistency: true,
});

const topicAdherence = createTopicAdherenceMetric({
  topics: ['travel planning', 'flights', 'itinerary', 'budget', 'dates', 'layovers', 'airlines'],
  provider,
  allowTopicTransitions: true,
  strictMode: false,
});
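
To make the deterministic tool-call signal more concrete, here is a rough, dependency-free sketch of what a presence/args/order check could look like for the `searchFlights` expectation used in the example. Everything in it is an assumption for illustration: the `ToolCall` and `ExpectedToolCall` shapes, the `requiredStringArgs` stand-in for a schema, and the scoring rule are hypothetical and do not describe how `createToolCallAccuracyMetric` actually computes its score.

```typescript
// Illustrative only: types and scoring are assumptions, not Tally's implementation.

interface ToolCall {
  toolName: string;
  args: Record<string, unknown>;
}

interface ExpectedToolCall {
  toolName: string;
  /** Hypothetical stand-in for a schema: argument names that must be strings. */
  requiredStringArgs: string[];
}

/** True if every required argument is present on the call and is a string. */
function argsMatch(call: ToolCall, expected: ExpectedToolCall): boolean {
  return expected.requiredStringArgs.every((name) => typeof call.args[name] === 'string');
}

/** True if `order` appears as a subsequence of the actual tool names. */
function respectsOrder(actualNames: string[], order: string[]): boolean {
  let next = 0;
  for (let i = 0; i < actualNames.length && next < order.length; i++) {
    if (actualNames[i] === order[next]) next += 1;
  }
  return next === order.length;
}

/** Score in [0, 1]: fraction of expected calls found with valid args; 0 if the order is violated. */
function toolCallScore(actual: ToolCall[], expected: ExpectedToolCall[], order: string[]): number {
  if (expected.length === 0) return 1;
  if (!respectsOrder(actual.map((c) => c.toolName), order)) return 0;
  const matched = expected.filter((exp) =>
    actual.some((call) => call.toolName === exp.toolName && argsMatch(call, exp)),
  ).length;
  return matched / expected.length;
}

const expectedCalls: ExpectedToolCall[] = [
  { toolName: 'searchFlights', requiredStringArgs: ['origin', 'destination', 'departDate'] },
];

const goodStep: ToolCall[] = [
  { toolName: 'searchFlights', args: { origin: 'SFO', destination: 'CDG', departDate: '2025-06-01' } },
];
const badStep: ToolCall[] = [
  { toolName: 'searchFlights', args: { origin: 'SFO' } }, // missing destination and departDate
];

console.log(toolCallScore(goodStep, expectedCalls, ['searchFlights'])); // 1
console.log(toolCallScore(badStep, expectedCalls, ['searchFlights'])); // 0
```

Because a check of this kind involves no model call, it returns the same score for the same step every time, which is what makes a code-based metric deterministic.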