Built-in Metrics

Tally ships with a small set of pre-built metrics for common agent evaluation needs. Some are LLM-based (scored by a judge model) and others are code-based (computed deterministically).

All built-in metrics are MetricDefs you can reuse across scorers and evals.

Single-Turn Metrics

These metrics evaluate individual steps or turns in a conversation.

Metric factory	Type	Output	Description
`createAnswerRelevanceMetric`	LLM	number (0–5, normalized to 0–1)	How relevant is the response to the user's input?
`createCompletenessMetric`	LLM	number	Does the response address all parts of the user's query?
`createToxicityMetric`	LLM	number	Checks for harmful or offensive content in the response.
`createAnswerSimilarityMetric`	Code	number (0–1)	Measures similarity to a target answer (keyword-based; optional embeddings).
`createToolCallAccuracyMetric`	Code	number (0–1)	Measures tool call correctness (presence/args/order) against expectations.
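
To make the Output column more concrete, here is a minimal sketch of the two ideas it mentions: mapping a 0–5 judge score onto 0–1, and a keyword-based similarity score in the 0–1 range. The helper names (`normalizeJudgeScore`, `keywords`, `keywordSimilarity`) are hypothetical and are not part of Tally's API; the built-in metrics may tokenize, weight, or clamp differently.

```typescript
// Hypothetical helpers for illustration only; not Tally's internal implementation.

/** Map a raw judge score on a 0..max scale onto 0..1, clamping out-of-range values. */
function normalizeJudgeScore(raw: number, max = 5): number {
  const clamped = Math.min(Math.max(raw, 0), max);
  return clamped / max;
}

/** Lowercase the text, split on non-alphanumeric characters, and drop duplicates. */
function keywords(text: string): string[] {
  const words = text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((word) => word.length > 0);
  return words.filter((word, index) => words.indexOf(word) === index);
}

/** Fraction of the target's keywords that also appear in the response (0..1). */
function keywordSimilarity(response: string, target: string): number {
  const targetWords = keywords(target);
  if (targetWords.length === 0) return 0; // nothing to compare against
  const responseWords = keywords(response);
  const hits = targetWords.filter((word) => responseWords.indexOf(word) !== -1);
  return hits.length / targetWords.length;
}
```

Under these assumptions, a judge score of 4 normalizes to 0.8, and a response that mentions one of two target keywords scores 0.5.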

Multi-Turn Metrics

These metrics evaluate the conversation as a whole rather than individual turns.

Metric factory	Type	Output	Description
`createRoleAdherenceMetric`	LLM	number	Does the agent maintain its assigned persona throughout?
`createGoalCompletionMetric`	LLM	number	Did the agent successfully achieve the user's goal?
`createTopicAdherenceMetric`	LLM	number	Did the agent stay on topic across multiple turns?

Usage Example

import { z } from 'zod';
import { google } from '@ai-sdk/google';
import {
  createAnswerRelevanceMetric,
  createCompletenessMetric,
  createToolCallAccuracyMetric,
  createRoleAdherenceMetric,
  createTopicAdherenceMetric,
} from '@tally-evals/tally/metrics';

// Imagine a "travel planner" agent that should call `searchFlights({ origin, destination, departDate })`
const provider = google('models/gemini-2.0-flash');

// LLM-judged single-turn signals
const relevance = createAnswerRelevanceMetric({
  provider,
  partialWeight: 0.3,
});

const completeness = createCompletenessMetric({
  provider,
  expectedPoints: [
    'Ask for missing constraints (budget, dates, cabin class) if needed',
    'Recommend a concrete itinerary option',
    'Explain tradeoffs (price vs duration vs layovers)',
    'Confirm before booking / purchasing',
  ],
});

// Deterministic single-turn signal: tool calling correctness
const toolCallAccuracy = createToolCallAccuracyMetric({
  expectedToolCalls: [
    {
      toolName: 'searchFlights',
      argsSchema: z.object({
        origin: z.string(),
        destination: z.string(),
        departDate: z.string(),
      }),
    },
  ],
  toolCallOrder: ['searchFlights'],
  strictMode: true,
});

// Multi-turn signals (conversation-level)
const roleAdherence = createRoleAdherenceMetric({
  expectedRole: 'a helpful travel planner who asks clarifying questions and avoids hallucinating prices',
  provider,
  checkConsistency: true,
});

const topicAdherence = createTopicAdherenceMetric({
  topics: ['travel planning', 'flights', 'itinerary', 'budget', 'dates', 'layovers', 'airlines'],
  provider,
  allowTopicTransitions: true,
  strictMode: false,
});
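
For intuition about what a deterministic tool-call check involves (presence, arguments, order), the following is a rough sketch under stated assumptions. The `ToolCall` and `ExpectedToolCall` types, the `validateArgs` callback (standing in for a schema check such as the Zod `argsSchema` above), and the scoring rule are all hypothetical; they do not describe how `createToolCallAccuracyMetric` is implemented or how it weighs partial matches.

```typescript
// Hypothetical types and scoring rule for illustration; Tally's data model may differ.
interface ToolCall {
  toolName: string;
  args: Record<string, unknown>;
}

interface ExpectedToolCall {
  toolName: string;
  // Stand-in for a schema check; returns true when the arguments are acceptable.
  validateArgs?: (args: Record<string, unknown>) => boolean;
}

/** True when every name in `needle` appears in `haystack` in the same relative order. */
function isSubsequence(needle: string[], haystack: string[]): boolean {
  let matched = 0;
  for (const item of haystack) {
    if (matched < needle.length && item === needle[matched]) matched += 1;
  }
  return matched === needle.length;
}

/**
 * Score in 0..1: the fraction of expected calls that were made with valid arguments.
 * Assumption of this sketch: an order violation zeroes the score.
 */
function toolCallAccuracy(
  actual: ToolCall[],
  expected: ExpectedToolCall[],
  expectedOrder?: string[],
): number {
  if (expected.length === 0) return 1;
  const actualNames = actual.map((call) => call.toolName);
  if (expectedOrder && !isSubsequence(expectedOrder, actualNames)) return 0;

  let satisfied = 0;
  for (const exp of expected) {
    const ok = actual.some(
      (call) =>
        call.toolName === exp.toolName &&
        (exp.validateArgs ? exp.validateArgs(call.args) : true),
    );
    if (ok) satisfied += 1;
  }
  return satisfied / expected.length;
}

// Example check mirroring the `searchFlights` expectation from the usage example.
const flightArgsAreValid = (args: Record<string, unknown>): boolean =>
  typeof args.origin === 'string' &&
  typeof args.destination === 'string' &&
  typeof args.departDate === 'string';
```

Treating an order violation as a hard failure is one possible design choice; a softer variant could subtract a penalty instead, which may suit agents where ordering is only a preference.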
