Tally
Metrics

Built-in Metrics

Pre-built metrics provided by Tally.

Tally ships with a small set of pre-built metrics for common agent evaluation needs. Some are LLM-based (judgment) and some are code-based (deterministic).

All built-in metrics are MetricDefs you can reuse across scorers and evals.

Single-Turn Metrics

These metrics evaluate individual steps or turns in a conversation.

Metric factoryTypeOutputDescriptionCode
createAnswerRelevanceMetricLLMnumber (0–5, normalized to 0–1)How relevant is the response to the user's input?
createCompletenessMetricLLMnumberDoes the response address all parts of the user's query?
createToxicityMetricLLMnumberChecks for harmful or offensive content in the response.
createAnswerSimilarityMetricCodenumber (0–1)Measures similarity to a target answer (keyword-based; optional embeddings).
createToolCallAccuracyMetricCodenumber (0–1)Measures tool call correctness (presence/args/order) against expectations.

Multi-Turn Metrics

These metrics evaluate the entire conversation as a whole.

Metric factoryTypeOutputDescriptionCode
createRoleAdherenceMetricLLMnumberDoes the agent maintain its assigned persona throughout?
createGoalCompletionMetricLLMnumberDid the agent successfully achieve the user's goal?
createTopicAdherenceMetricLLMnumberDid the agent stay on topic across multiple turns?

Usage Example

import { z } from 'zod';
import { google } from '@ai-sdk/google';
import {
  createAnswerRelevanceMetric,
  createCompletenessMetric,
  createToolCallAccuracyMetric,
  createRoleAdherenceMetric,
  createTopicAdherenceMetric,
} from '@tally-evals/tally/metrics';

// Imagine a "travel planner" agent that should call `searchFlights({ origin, destination, departDate })`
const provider = google('models/gemini-2.0-flash');

// LLM-judged single-turn signals
const relevance = createAnswerRelevanceMetric({
  provider,
  partialWeight: 0.3,
});

const completeness = createCompletenessMetric({
  provider,
  expectedPoints: [
    'Ask for missing constraints (budget, dates, cabin class) if needed',
    'Recommend a concrete itinerary option',
    'Explain tradeoffs (price vs duration vs layovers)',
    'Confirm before booking / purchasing',
  ],
});

// Deterministic single-turn signal: tool calling correctness
const toolCallAccuracy = createToolCallAccuracyMetric({
  expectedToolCalls: [
    {
      toolName: 'searchFlights',
      argsSchema: z.object({
        origin: z.string(),
        destination: z.string(),
        departDate: z.string(),
      }),
    },
  ],
  toolCallOrder: ['searchFlights'],
  strictMode: true,
});

// Multi-turn signals (conversation-level)
const roleAdherence = createRoleAdherenceMetric({
  expectedRole: 'a helpful travel planner who asks clarifying questions and avoids hallucinating prices',
  provider,
  checkConsistency: true,
});

const topicAdherence = createTopicAdherenceMetric({
  topics: ['travel planning', 'flights', 'itinerary', 'budget', 'dates', 'layovers', 'airlines'],
  provider,
  allowTopicTransitions: true,
  strictMode: false,
});

On this page