Overview
Measuring agent behavior with LLM and code-based metrics.
Metrics are the signals you measure about your agent. They can be fully deterministic (code checks) or model-judged (LLM-based), and they're designed to be reused across different evaluation scenarios—release gates, regression tests, edge-case suites, dev loops, and more.
Blueprint vs. Result
It is important to understand the distinction between a definition and its output:
| Concept | Description |
|---|---|
| `MetricDef` | The blueprint. It defines the "how" (code or prompt), the metadata, and the expected value type. |
| `Metric` | The result. This is the actual value (e.g., `4.5` or `true`) produced when a `MetricDef` is run against data. |
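One way to picture the distinction is with plain type sketches. These are hypothetical shapes for illustration only, not Tally's actual type definitions:

```ts
// Hypothetical shapes for illustration only — not Tally's actual types.
type Scalar = number | boolean | string;

// The blueprint: describes how to measure, but holds no value.
interface MetricDefSketch {
  name: string;
  valueType: 'number' | 'boolean' | 'string';
  compute: (output: string) => Scalar;
}

// The result: a concrete value produced by running a definition.
interface MetricSketch {
  name: string;
  value: Scalar;
}

function runMetric(def: MetricDefSketch, output: string): MetricSketch {
  return { name: def.name, value: def.compute(output) };
}

const isValidJson: MetricDefSketch = {
  name: 'isValidJson',
  valueType: 'boolean',
  compute: (output) => {
    try { JSON.parse(output); return true; } catch { return false; }
  },
};

const result = runMetric(isValidJson, '{"ok": true}');
// result.value === true
```

The same blueprint can be run against many outputs, producing one result per output.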
Anatomy of a Metric
Every metric in Tally fits into a few practical categories. These categories aren't just taxonomy — they determine how a metric runs, how it's scored, and how you debug evaluations.
1. Value Type
Metrics can produce any domain-relevant value:
- Numeric: continuous values like token usage, latency, similarity scores, or judge scores.
- Boolean: binary checks like "is the output valid JSON?"
- Ordinal: categorical results mapped from a rubric (e.g., { Poor, Fair, Good, Excellent }).
In Tally, metric values are represented as a scalar type:
```ts
type MetricScalar = number | boolean | string;
```

Example:

```ts
import { defineBaseMetric } from '@tally-evals/tally';

// Likert-style ordinal rating produced by an LLM judge or rubric
// (store as an ordinal category; map to scores via normalization when needed)
const concisenessLikert = defineBaseMetric({
  name: 'concisenessLikert',
  valueType: 'string',
  description: 'Conciseness rating as an ordinal category (e.g., VeryVerbose → VeryConcise)',
});
```

2. Scope
Determines what the metric is measuring:
- Single-turn: evaluates a single interaction (input + output).
- Multi-turn: analyzes the entire conversation history (drift, retention, goal completion).
Example (single-turn vs multi-turn):
```ts
import { defineSingleTurnCode, defineMultiTurnLLM, defineBaseMetric } from '@tally-evals/tally';

// Single-turn: "Does this step include a tool call?"
const hasToolCall = defineSingleTurnCode({
  base: defineBaseMetric({
    name: 'hasToolCall',
    valueType: 'boolean',
  }),
  compute: ({ selected }) =>
    selected.output.some(
      (m) => m.role === 'assistant' && m.content.length > 0,
    ),
});

// Multi-turn: "Did the agent stay on topic across the whole conversation?"
const stayedOnTopic = defineMultiTurnLLM({
  base: defineBaseMetric({
    name: 'stayedOnTopic',
    valueType: 'number',
  }),
  runOnContainer: async (conversation) => ({
    transcript: conversation.steps
      .map(
        (s) =>
          `User: ${s.input.content}\nAssistant: ${s.output
            .map((m) => m.content)
            .join(' ')}`,
      )
      .join('\n\n'),
  }),
  provider: myModel,
  prompt: {
    instruction: 'Score topic adherence 0-5.\n\n{{transcript}}',
    variables: [] as const,
  },
});
```

3. Implementation
The engine that powers the measurement:
- Code-based: deterministic logic written in TypeScript. Best for technical validation.
- LLM-based: evaluated by a language model using a prompt/rubric. Best for semantic judgment.
Example (code-based vs LLM-based options):
```ts
import { defineSingleTurnCode, defineSingleTurnLLM, defineBaseMetric } from '@tally-evals/tally';
import { createMinMaxNormalizer } from '@tally-evals/tally/normalization';

// Code-based: optional preProcessor, cacheable, metadata
const keywordPresent = defineSingleTurnCode({
  base: defineBaseMetric({
    name: 'keywordPresent',
    valueType: 'boolean',
  }),
  preProcessor: async (selected) => ({
    text: selected.output.map((m) => m.content).join(' '),
  }),
  compute: ({ data }) => (data as { text: string }).text.includes('refund'),
  cacheable: true,
  metadata: {
    keyword: 'refund',
  },
});

// LLM-based: provider, prompt, rubric, normalization
const helpfulness = defineSingleTurnLLM({
  base: defineBaseMetric({
    name: 'helpfulness',
    valueType: 'number',
  }),
  provider: myModel,
  prompt: {
    instruction:
      'Rate helpfulness 1-5.\n\nQuery: {{input}}\nResponse: {{output}}',
    variables: [] as const,
  },
  rubric: {
    criteria: '1=unhelpful, 5=very helpful',
    scale: '1-5',
  },
  normalization: {
    normalizer: createMinMaxNormalizer({
      min: 1,
      max: 5,
      clip: true,
    }),
  },
});
```

The Metric Lifecycle
- Define: Create a `MetricDef` using the factory APIs.
- Execute: Tally runs the definition against your conversation data.
- Normalize (Optional): If you plan to combine a metric inside a Scorer, you'll typically attach a Normalizer so the metric maps to a comparable 0–1 score. It's optional because some metrics already produce 0–1, and some workflows don't use scorers.
- Aggregate (Optional): For single-turn metrics, attach Aggregators to compute summary statistics (mean, percentiles, etc.) across all targets. Tally provides sensible defaults based on metric value type.
- Compose: Use Scorers to combine multiple signals, then use Evals to apply a verdict policy.
Normalization (recommended)
Normalization converts a raw metric value (number/boolean/ordinal) into a comparable 0–1 score. In the current API, normalization lives on the metric definition.
If you're using scorers, you typically want every input metric to have a sensible normalization strategy (or provide `normalizerOverride` per scorer input).
See the full reference: Normalizers.
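The core idea can be sketched with standalone functions. These are plain illustrations of min-max and boolean normalization, not Tally's `createMinMaxNormalizer` implementation:

```ts
// Standalone sketch of the normalization idea — not Tally's implementation.

// Map a raw numeric value on [min, max] onto [0, 1], optionally clipping.
function minMaxNormalize(value: number, min: number, max: number, clip = true): number {
  const score = (value - min) / (max - min);
  return clip ? Math.min(1, Math.max(0, score)) : score;
}

// Booleans become 0 or 1 so they can be combined with numeric metrics.
function booleanNormalize(value: boolean): number {
  return value ? 1 : 0;
}

minMaxNormalize(4, 1, 5); // a 1-5 judge score → 0.75
minMaxNormalize(7, 1, 5); // out-of-range value clipped → 1
booleanNormalize(true);   // → 1
```

Once every metric lands on the same 0–1 scale, a scorer can weight and combine them meaningfully.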
Aggregations (optional)
Aggregations compute summary statistics across all evaluation targets. Single-turn metrics can carry aggregator definitions that the pipeline uses to summarize results.
- Where it applies: `SingleTurnMetricDef` (step/item-level metrics)
- Where to read the output: `report.evalSummaries.get(evalName)?.aggregations`
Default Aggregators by Value Type
Tally provides sensible default aggregators based on the metric's `valueType`:
| Value Type | Default Aggregators |
|---|---|
| number | mean, p50, p75, p90 |
| boolean | mean, p50, p75, p90, trueRate |
| ordinal | mean, p50, p75, p90, distribution |
Use `getDefaultAggregators(valueType)` to retrieve the defaults, or define custom aggregators for bespoke summaries (e.g., per-customer-tier rollups).
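A sketch of what these defaults compute over a batch of results. These are plain functions for illustration; the percentile convention here is nearest-rank, which may differ from Tally's:

```ts
// Standalone sketches of the default aggregations — illustrative only.

function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Nearest-rank percentile over a sorted copy of the values.
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

// For boolean metrics: fraction of targets where the check passed.
function trueRate(values: boolean[]): number {
  return values.filter(Boolean).length / values.length;
}

const latencies = [120, 80, 200, 150, 90];
mean(latencies);            // 128
percentile(latencies, 50);  // 120
trueRate([true, true, false, true]); // 0.75
```

Percentiles like p90 are usually more informative than the mean for latency-style metrics, since they surface tail behavior.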
Same Measures, Different Policies
In practice, you want to reuse the same metrics across different evaluation scenarios:
- Release gates: strict verdict policies for production readiness.
- Regression tests: moderate policies to catch quality regressions.
- Edge-case suites: scenario-specific thresholds for known failure modes.
- Dev loops: looser policies for fast iteration and feedback.
The measure stays constant; the verdict policy adapts to the context. This is how you avoid rewriting evaluation logic for every test suite.
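The pattern can be sketched as one metric score checked against per-context thresholds. The policy names and threshold values below are hypothetical, chosen only to illustrate the separation:

```ts
// Hypothetical verdict policies — thresholds are illustrative, not Tally defaults.
type VerdictPolicy = { name: string; minScore: number };

const policies: VerdictPolicy[] = [
  { name: 'release-gate', minScore: 0.9 },  // strict: production readiness
  { name: 'regression',   minScore: 0.75 }, // moderate: catch quality drops
  { name: 'dev-loop',     minScore: 0.5 },  // loose: fast iteration
];

// The measure is computed once; each policy applies its own verdict.
function verdict(score: number, policy: VerdictPolicy): 'pass' | 'fail' {
  return score >= policy.minScore ? 'pass' : 'fail';
}

const helpfulnessScore = 0.8; // same normalized metric value everywhere
const verdicts = policies.map((p) => `${p.name}: ${verdict(helpfulnessScore, p)}`);
// → ['release-gate: fail', 'regression: pass', 'dev-loop: pass']
```

The metric and its normalization never change; only the threshold (and, in Tally, the Eval's verdict policy) varies per suite.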
API Reference
- Metric Factories — `defineBaseMetric`, `defineSingleTurnCode`, `defineSingleTurnLLM`, `withNormalization`
- Built-in Metrics — Pre-built metrics for common use cases
- Normalizers — `createMinMaxNormalizer`, `createBooleanNormalizer`, `createOrdinalNormalizer`
- Aggregators — `createMeanAggregator`, `createTrueRateAggregator`, etc.