Tally

Run Report

Understanding the output of a Tally evaluation run.

After calling await tally.run(), you receive a TallyRunReport—a type-safe record of everything that happened during evaluation. The report captures measurements, normalized scores, verdicts, and aggregated summaries, all accessible through a strongly-typed API.

Type-Safe Access with the View API

The primary way to access report data is through the View API. Call report.view() to get a TargetRunView that provides:

  • Eval name autocomplete — Keys are literal types derived from your evals
  • Typed measurements — rawValue is typed to your metric's value type
  • Typed policies — Verdict policies match the metric's value type

const report = await tally.run();
const view = report.view();

The view provides the following methods:

  • step(index) → StepResults — All single-turn eval results for a specific step
  • steps() → Generator<StepResultsWithIndex> — Iterate over all steps with their results
  • conversation() → ConversationResults — Multi-turn evals and scalar scorer results
  • summary() → SummaryResults — Aggregated summaries by eval name
  • stepCount → number — Total number of steps in the conversation

Accessing Step Results

Use step(index) to get a StepResults object containing all single-turn eval results for that step:

const view = report.view();

// Get results for step 0
const step0 = view.step(0);

// Access specific eval results with type-safe keys
const relevance = step0['Answer Relevance'];
if (relevance) {
  // StepEvalResult contains measurement and optional outcome
  console.log(`Score: ${relevance.measurement.score}`);      // number (0-1)
  console.log(`Raw: ${relevance.measurement.rawValue}`);     // typed to metric valueType
  console.log(`Verdict: ${relevance.outcome?.verdict}`);     // 'pass' | 'fail' | 'unknown'
}

Iterating Over All Steps

Use the steps() generator to iterate over all steps with their results:

for (const step of view.steps()) {
  console.log(`Step ${step.index}:`);
  
  for (const [evalName, result] of Object.entries(step)) {
    if (evalName === 'index') continue;
    console.log(`  ${evalName}: ${result.measurement.score}`);
  }
}
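If you need an aggregation that summary() doesn't provide (say, a per-eval mean computed your own way), the same iteration pattern can feed a reducer. A self-contained sketch, with hand-rolled step data standing in for view.steps() and simplified stand-in types (the real ones come from the library):

```typescript
// Simplified stand-ins for the shapes yielded by view.steps()
// (hypothetical; the real types are exported by the library).
type StepEvalResult = { measurement: { score: number } };
type StepRow = { index: number } & Record<string, StepEvalResult | number>;

// Reduce step results to a per-eval mean score.
function meanScores(steps: StepRow[]): Record<string, number> {
  const sums: Record<string, { total: number; n: number }> = {};
  for (const step of steps) {
    for (const [evalName, result] of Object.entries(step)) {
      if (evalName === 'index') continue; // skip the step index field
      const { score } = (result as StepEvalResult).measurement;
      const bucket = (sums[evalName] ??= { total: 0, n: 0 });
      bucket.total += score;
      bucket.n += 1;
    }
  }
  return Object.fromEntries(
    Object.entries(sums).map(([name, { total, n }]) => [name, total / n]),
  );
}

const means = meanScores([
  { index: 0, 'Answer Relevance': { measurement: { score: 0.75 } } },
  { index: 1, 'Answer Relevance': { measurement: { score: 0.25 } } },
]);
console.log(means['Answer Relevance']); // 0.5
```

With a real report, you would collect the rows via `[...view.steps()]` and pass them straight into the reducer.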

Accessing Conversation-Level Results

Use conversation() to get ConversationResults—multi-turn evals and scalar scorers:

const conversationResults = view.conversation();

// Multi-turn evals (ConversationEvalResult)
const roleAdherence = conversationResults['Role Adherence'];
if (roleAdherence) {
  console.log(`Role Adherence: ${roleAdherence.measurement.score}`);
  console.log(`Reasoning: ${roleAdherence.measurement.reasoning}`);
}

// Scalar scorer results (also ConversationEvalResult)
const overallQuality = conversationResults['Overall Quality'];
if (overallQuality) {
  console.log(`Overall Quality: ${overallQuality.measurement.score}`);
  console.log(`Verdict: ${overallQuality.outcome?.verdict}`);
}

Accessing Summaries

Use summary() to get SummaryResults—aggregated statistics for each eval across the run:

const summaries = view.summary();

if (summaries) {
  for (const [evalName, summary] of Object.entries(summaries)) {
    // EvalSummary contains count, aggregations, and verdictSummary
    console.log(`${evalName}:`);
    console.log(`  Kind: ${summary.kind}`);       // 'singleTurn' | 'multiTurn' | 'scorer'
    console.log(`  Count: ${summary.count}`);     // number of data points
    
    // Score aggregations (always numeric: mean, percentiles)
    if (summary.aggregations?.score) {
      const agg = summary.aggregations.score;
      console.log(`  Mean: ${agg.Mean}`);
      console.log(`  P90: ${agg.P90}`);
    }
    
    // VerdictSummary with pass/fail rates and counts
    if (summary.verdictSummary) {
      const vs = summary.verdictSummary;
      console.log(`  Pass rate: ${(vs.passRate * 100).toFixed(1)}%`);
      console.log(`  Pass/Fail/Unknown: ${vs.passCount}/${vs.failCount}/${vs.unknownCount}`);
    }
  }
}
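When emitting summaries to logs or CI output, a tiny formatter keeps the shape in one place. A self-contained sketch with hand-written stand-ins for the EvalSummary fields used above (the real types live in the library):

```typescript
// Simplified stand-ins for the summary shapes (hypothetical types).
type VerdictSummary = {
  passRate: number;
  passCount: number;
  failCount: number;
  unknownCount: number;
};
type EvalSummaryLike = { kind: string; count: number; verdictSummary?: VerdictSummary };

// Render one eval's summary as a single log line.
function summaryLine(name: string, s: EvalSummaryLike): string {
  const base = `${name} [${s.kind}] n=${s.count}`;
  if (!s.verdictSummary) return base;
  const v = s.verdictSummary;
  return `${base} pass=${(v.passRate * 100).toFixed(1)}% (${v.passCount}/${v.failCount}/${v.unknownCount})`;
}

console.log(
  summaryLine('Answer Relevance', {
    kind: 'singleTurn',
    count: 10,
    verdictSummary: { passRate: 0.9, passCount: 9, failCount: 1, unknownCount: 0 },
  }),
);
// Answer Relevance [singleTurn] n=10 pass=90.0% (9/1/0)
```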

Test Assertions

The View API is designed for test assertions. Use it to verify evaluation results in your test suite:

import { expect, test } from 'vitest';

test('agent meets quality thresholds', async () => {
  const report = await tally.run();
  const view = report.view();
  
  // Assert step-level results
  const step0 = view.step(0);
  expect(step0['Answer Relevance']?.outcome?.verdict).toBe('pass');
  
  // Assert conversation-level results
  const conv = view.conversation();
  expect(conv['Role Adherence']?.measurement.score).toBeGreaterThan(0.8);
  
  // Assert summary pass rates
  const summaries = view.summary();
  expect(summaries?.['Answer Relevance']?.verdictSummary?.passRate).toBeGreaterThan(0.95);
});

CI Integration

Block deployments when quality drops below a threshold:

const report = await tally.run();
const summaries = report.view().summary();

const relevanceSummary = summaries?.['Answer Relevance'];
if (relevanceSummary?.verdictSummary) {
  const passRate = relevanceSummary.verdictSummary.passRate;
  if (passRate < 0.95) {
    process.exitCode = 1;
    console.error(`Pass rate ${(passRate * 100).toFixed(1)}% is below 95% threshold`);
  }
}
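Extracting the threshold check into a pure function makes the gate itself unit-testable, independent of a live run. A minimal sketch (the `gate` helper is hypothetical, not part of the library):

```typescript
// Pure CI gate: compares a pass rate against a threshold and
// produces a result plus a human-readable message.
function gate(passRate: number, threshold: number): { ok: boolean; message: string } {
  const pct = (passRate * 100).toFixed(1);
  return passRate >= threshold
    ? { ok: true, message: `Pass rate ${pct}% meets threshold` }
    : {
        ok: false,
        message: `Pass rate ${pct}% is below ${(threshold * 100).toFixed(0)}% threshold`,
      };
}

const result = gate(0.92, 0.95);
console.log(result.message); // Pass rate 92.0% is below 95% threshold
// In a CI script: if (!result.ok) process.exitCode = 1;
```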

Persisting Reports

For storage, CI pipelines, or the Tally viewer, convert the report to an artifact:

const artifact = report.toArtifact();

// Save to disk
import { writeFileSync } from 'fs';
writeFileSync('run.json', JSON.stringify(artifact, null, 2));

The artifact is a schema-stable JSON structure with ISO timestamps and a schemaVersion field for forward compatibility. You can later create a view from a loaded artifact:

import { readFileSync } from 'fs';
import { createTargetRunView } from '@tally-evals/tally';

const artifact = JSON.parse(readFileSync('run.json', 'utf-8'));
const view = createTargetRunView(artifact);
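Since the artifact carries a schemaVersion field, it can be worth guarding on it before building a view from loaded data. A sketch using an in-memory roundtrip; the version value and the SUPPORTED_VERSION constant here are hypothetical — check the actual schemaVersion your artifacts carry:

```typescript
// Hypothetical supported version; the real value depends on your tally release.
const SUPPORTED_VERSION = 1;

// Stand-in artifact: a real one comes from report.toArtifact().
const artifact = { schemaVersion: 1, startedAt: '2025-01-01T00:00:00.000Z' };

// Simulate a save/load roundtrip through JSON.
const loaded = JSON.parse(JSON.stringify(artifact));

if (loaded.schemaVersion !== SUPPORTED_VERSION) {
  throw new Error(`Unsupported artifact schemaVersion: ${loaded.schemaVersion}`);
}
console.log(loaded.startedAt); // the ISO timestamp survives the roundtrip
```

Failing fast here gives a clearer error than letting a view constructor choke on an unexpected shape.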

Definition Lookups

The view also provides methods to look up definitions used in the run:

const view = report.view();

// Look up definitions by name
const metricDef = view.metric('answerRelevance');
const evalDef = view.eval('Answer Relevance');
const scorerDef = view.scorer('Overall Quality');

// Get the metric definition for an eval
const metricForEval = view.metricForEval('Answer Relevance');

API Reference

For the complete type definitions, see the Reports API Reference.
