Tally

Reports

SDK report objects, View API, and the schema-stable run artifact.

Tally exposes two report representations and a type-safe View API:

  • TallyRunReport: SDK-facing in-memory object returned from await tally.run().
  • TargetRunView: Type-safe accessor for report data, created via report.view().
  • TallyRunArtifact: Schema-stable, serializable artifact produced by report.toArtifact().

Import

import type {
  TallyRunReport,
  TallyRunArtifact,
  TargetRunView,
  StepResults,
  StepResultsWithIndex,
  ConversationResults,
  SummaryResults,
  StepEvalResult,
  ConversationEvalResult,
  Measurement,
  EvalOutcome,
  EvalSummary,
  VerdictSummary,
} from '@tally-evals/tally';

View API

The View API provides ergonomic, type-safe access to report data. Call report.view() to get a TargetRunView.

TargetRunView

Type-safe view over run results with eval name autocomplete.

stepCount:number

Total number of steps in the conversation.

defs:RunDefs

All definitions (metrics, evals, scorers) used in the run.

step(index):(index: number) => StepResults

Get all single-turn eval results for a specific step.

steps():() => Generator<StepResultsWithIndex>

Iterate over all steps with their results.

conversation():() => ConversationResults

Get multi-turn evals and scalar scorer results.

summary():() => SummaryResults | undefined

Get aggregated summaries by eval name.

metric(name):(name: string) => MetricDefSnap | undefined

Look up a metric definition by name.

eval(name):(name: string) => EvalDefSnap | undefined

Look up an eval definition by name.

scorer(name):(name: string) => ScorerDefSnap | undefined

Look up a scorer definition by name.

metricForEval(name):(name: string) => MetricDefSnap | undefined

Get the metric definition for an eval.
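The lookup helpers can be read as simple reads against defs. The sketch below assumes that metricForEval(name) follows the eval's metric ref into defs.metrics. The local types are trimmed stand-ins for the documented snapshot shapes, and the helper name resolveMetricForEval and the sample names are hypothetical, not the library's implementation.

```typescript
// Trimmed stand-ins for the documented snapshot shapes (assumption: only the
// fields needed for the lookup are modelled here).
type MetricDefSnap = {
  name: string;
  scope: 'single' | 'multi';
  valueType: 'number' | 'boolean' | 'string' | 'ordinal';
};
type EvalDefSnap = {
  name: string;
  kind: 'singleTurn' | 'multiTurn' | 'scorer';
  metric: string; // metric ref, a key of defs.metrics
};
type RunDefs = {
  metrics: Record<string, MetricDefSnap>;
  evals: Record<string, EvalDefSnap>;
};

// Hypothetical helper: resolve eval name -> eval def -> metric def.
function resolveMetricForEval(defs: RunDefs, evalName: string): MetricDefSnap | undefined {
  const evalDef: EvalDefSnap | undefined = defs.evals[evalName];
  if (evalDef === undefined) return undefined;
  return defs.metrics[evalDef.metric];
}

// Invented sample definitions, for illustration only.
const sampleDefs: RunDefs = {
  metrics: {
    relevance: { name: 'relevance', scope: 'single', valueType: 'number' },
  },
  evals: {
    'Answer Relevance': { name: 'Answer Relevance', kind: 'singleTurn', metric: 'relevance' },
  },
};

const foundMetric = resolveMetricForEval(sampleDefs, 'Answer Relevance');
const missingMetric = resolveMetricForEval(sampleDefs, 'No Such Eval');
```

Like the documented lookups, the sketch returns undefined for an unknown name, so callers should guard the result before reading fields from it.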


StepResults

Object returned by view.step(index). Keys are eval names with type-safe autocomplete.

[evalName]:StepEvalResult

Result for a single-turn eval at this step. Keys are literal eval names.
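Because outcome is optional on every result, code that reads a verdict from a step needs a guard. The following is a minimal sketch over a hand-built object standing in for the value returned by view.step(0). The eval names, the trimmed types, and the helper verdictOrNone are assumptions made for illustration.

```typescript
type Verdict = 'pass' | 'fail' | 'unknown';

// Trimmed stand-in for StepEvalResult (only the fields used below).
type StepEvalResult = {
  evalRef: string;
  measurement: { metricRef: string; score?: number };
  outcome?: { verdict: Verdict };
};

// In the real API the keys are literal eval names; a plain Record is used
// here so the sketch stays self-contained.
type StepResults = Record<string, StepEvalResult>;

// Hand-built stand-in for the object returned by view.step(0).
const stepZero: StepResults = {
  'Answer Relevance': {
    evalRef: 'Answer Relevance',
    measurement: { metricRef: 'relevance', score: 0.9 },
    outcome: { verdict: 'pass' },
  },
  'Response Length': {
    evalRef: 'Response Length',
    measurement: { metricRef: 'length', score: 0.4 },
    // No outcome: this eval has no verdict policy in the example.
  },
};

// Hypothetical helper: return the verdict, or 'none' when the eval is absent
// or carries no outcome.
function verdictOrNone(results: StepResults, evalName: string): Verdict | 'none' {
  const result: StepEvalResult | undefined = results[evalName];
  return result?.outcome?.verdict ?? 'none';
}
```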


StepResultsWithIndex

Yielded by view.steps() generator. Same as StepResults plus an index property.

index:number

Step index (0-based).

[evalName]:StepEvalResult

Result for a single-turn eval at this step.
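The generator form suits a single pass over the conversation, for example to collect the indices of low-scoring steps. In the sketch below, stubSteps is a hypothetical stand-in for view.steps(), the eval key relevance is invented, and the types are trimmed to the fields used.

```typescript
// Trimmed stand-ins for the documented shapes (assumption).
type StepEvalResult = {
  evalRef: string;
  measurement: { metricRef: string; score?: number };
};
type StepResultsWithIndex = { index: number; relevance: StepEvalResult };

// Hypothetical stand-in for view.steps(): yields one entry per step.
function* stubSteps(): Generator<StepResultsWithIndex> {
  const scores: Array<number | undefined> = [0.9, 0.3, undefined];
  for (let i = 0; i < scores.length; i++) {
    const score = scores[i];
    yield {
      index: i,
      relevance: {
        evalRef: 'relevance',
        measurement:
          score === undefined
            ? { metricRef: 'relevance' }
            : { metricRef: 'relevance', score },
      },
    };
  }
}

// Collect the 0-based indices of steps whose score is below the threshold.
// Steps without a score are skipped rather than treated as failures.
function lowScoreStepIndices(
  steps: Iterable<StepResultsWithIndex>,
  threshold: number,
): number[] {
  const indices: number[] = [];
  for (const step of steps) {
    const score = step.relevance.measurement.score;
    if (score !== undefined && score < threshold) indices.push(step.index);
  }
  return indices;
}

const lowSteps = lowScoreStepIndices(stubSteps(), 0.5);
```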


ConversationResults

Object returned by view.conversation(). Contains multi-turn evals and scalar scorers.

[evalName]:ConversationEvalResult

Result for a multi-turn eval or scalar scorer. Keys are literal eval names.


SummaryResults

Object returned by view.summary(). Contains aggregated statistics by eval name.

[evalName]:EvalSummary

Summary for an eval including aggregations and verdict rates.


StepEvalResult

Result for a single step evaluation.

evalRef:string

Name of the eval.

measurement:Measurement

The measurement (score, rawValue, metadata).

outcome?:EvalOutcome

Verdict outcome (if verdict policy was defined).


ConversationEvalResult

Result for a conversation-level evaluation.

evalRef:string

Name of the eval.

measurement:Measurement

The measurement (score, rawValue, metadata).

outcome?:EvalOutcome

Verdict outcome (if verdict policy was defined).


Measurement

The measured value from a metric execution.

metricRef:string

Reference to the metric in defs.metrics.

score?:number

Score normalized to the 0-1 range.

rawValue?:number | boolean | string | null

Original metric value before normalization.

confidence?:number

LLM confidence (0-1) if applicable.

reasoning?:string

LLM reasoning if applicable.

executionTimeMs?:number

Execution time in milliseconds.

timestamp?:string

ISO timestamp of measurement.

metadata?:Record<string, unknown>

Additional metadata.
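Every field except metricRef is optional, and rawValue can additionally be null, so display code has to handle three cases: present, null, and absent. The following sketch uses a local copy of the documented field list. The helper name describeMeasurement and the output format are illustrative assumptions.

```typescript
// Local copy of the documented Measurement fields.
type Measurement = {
  metricRef: string;
  score?: number;
  rawValue?: number | boolean | string | null;
  confidence?: number;
  reasoning?: string;
  executionTimeMs?: number;
  timestamp?: string;
  metadata?: Record<string, unknown>;
};

// Hypothetical formatter: absent fields print as "n/a", while a null
// rawValue prints as "null" so the two cases stay distinguishable.
function describeMeasurement(m: Measurement): string {
  const score = m.score === undefined ? 'n/a' : m.score.toFixed(2);
  const raw = m.rawValue === undefined ? 'n/a' : String(m.rawValue);
  return `${m.metricRef}: score=${score} raw=${raw}`;
}

const withScore = describeMeasurement({ metricRef: 'relevance', score: 0.8, rawValue: 4 });
const withNullRaw = describeMeasurement({ metricRef: 'tone', rawValue: null });
```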


EvalOutcome

Verdict outcome from applying a verdict policy.

verdict:'pass' | 'fail' | 'unknown'

The final verdict.

policy:VerdictPolicyInfo

The policy used to compute the verdict.

observed?:{ rawValue?: MetricScalar; score?: number }

The observed values used for the verdict.
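Since verdict is a closed union of three strings, a switch with a never check makes the compiler flag any case that is left unhandled. This is a sketch only: the type is trimmed to the verdict field, and the helper verdictLabel and its labels are invented.

```typescript
// Trimmed stand-in for EvalOutcome (policy and observed omitted).
type EvalOutcome = { verdict: 'pass' | 'fail' | 'unknown' };

// Hypothetical helper: map an optional outcome to a display label.
function verdictLabel(outcome: EvalOutcome | undefined): string {
  // outcome is absent when no verdict policy was defined for the eval.
  if (outcome === undefined) return 'no verdict policy';
  switch (outcome.verdict) {
    case 'pass':
      return 'PASS';
    case 'fail':
      return 'FAIL';
    case 'unknown':
      return 'UNKNOWN';
    default: {
      // Compile-time exhaustiveness check: this assignment stops
      // type-checking if a new verdict value is added to the union.
      const unhandled: never = outcome.verdict;
      return unhandled;
    }
  }
}
```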


EvalSummary

Aggregated summary for an eval across all targets.

eval:string

Eval name.

kind:'singleTurn' | 'multiTurn' | 'scorer'

Eval kind.

count:number

Number of data points.

aggregations?:{ score: Record<string, number>; raw?: Record<string, number | Record<string, number>> }

Aggregated statistics (mean, percentiles, etc.).

verdictSummary?:VerdictSummary

Pass/fail/unknown rates and counts.


VerdictSummary

Pass/fail/unknown statistics across all targets.

passRate:number

Proportion of passing verdicts (0-1).

failRate:number

Proportion of failing verdicts (0-1).

unknownRate:number

Proportion of unknown verdicts (0-1).

passCount:number

Number of passing verdicts.

failCount:number

Number of failing verdicts.

unknownCount:number

Number of unknown verdicts.

totalCount:number

Total number of verdicts.
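The sketch below shows how the rate and count fields plausibly relate. It assumes that each rate is its count divided by totalCount and that an empty input yields rates of 0; neither is confirmed by this reference. The function summarizeVerdicts is hypothetical and is not how Tally computes its summaries.

```typescript
type Verdict = 'pass' | 'fail' | 'unknown';

// Local copy of the documented VerdictSummary fields.
type VerdictSummary = {
  passRate: number;
  failRate: number;
  unknownRate: number;
  passCount: number;
  failCount: number;
  unknownCount: number;
  totalCount: number;
};

// Hypothetical reconstruction of a summary from a list of verdicts.
function summarizeVerdicts(verdicts: ReadonlyArray<Verdict>): VerdictSummary {
  let passCount = 0;
  let failCount = 0;
  let unknownCount = 0;
  for (const v of verdicts) {
    if (v === 'pass') passCount++;
    else if (v === 'fail') failCount++;
    else unknownCount++;
  }
  const totalCount = verdicts.length;
  // Assumption: rate = count / totalCount, and 0 when there are no verdicts.
  const rate = (count: number): number => (totalCount === 0 ? 0 : count / totalCount);
  return {
    passRate: rate(passCount),
    failRate: rate(failCount),
    unknownRate: rate(unknownCount),
    passCount,
    failCount,
    unknownCount,
    totalCount,
  };
}

const demoSummary = summarizeVerdicts(['pass', 'pass', 'fail', 'unknown']);
```

Under these assumptions a CI gate reduces to a comparison such as demoSummary.passRate >= 0.5.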


TallyRunReport

Returned from await tally.run().

runId:string

Run identifier.

createdAt:Date

Creation time (in-memory Date).

defs:RunDefs

Deduped definitions referenced by the run.

RunDefs
metrics:Record<MetricName, MetricDefSnap>

Metric definition snapshots.

MetricDefSnap
name:MetricName

Metric name.

scope:'single' | 'multi'

Metric scope.

valueType:'number' | 'boolean' | 'string' | 'ordinal'

Metric value type.

description?:string

Optional description.

metadata?:Record<string, unknown>

Optional metadata.

llm?:{ provider?: Record<string, unknown>; prompt?: { instruction: string; variables?: readonly string[] }; rubric?: Record<string, unknown> }

Optional LLM snapshot (when metric is llm-based).

llm
provider?:Record<string, unknown>

Provider info snapshot.

prompt?:{ instruction: string; variables?: readonly string[] }

Prompt snapshot.

rubric?:Record<string, unknown>

Rubric snapshot.

aggregators?:Array<{ kind: string; name: string; description?: string; config?: unknown }>

Optional aggregator snapshots (attached on single-turn metrics).

normalization?:MetricNormalizationSnap

Optional normalization snapshot.

MetricNormalizationSnap
normalizer:NormalizerSpecSnap

Serializable normalizer snapshot.

calibrate?:unknown | { note: "not-serializable" }

Calibration snapshot. A calibration function cannot be serialized, so it is recorded as { note: "not-serializable" }.
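Because calibrate is typed as unknown, consumers need a runtime check to tell the not-serializable marker apart from real calibration data. This is a minimal sketch: the guard isNotSerializableNote, the helper describeCalibration, and the sample calibration object are all invented for illustration.

```typescript
type NotSerializableNote = { note: 'not-serializable' };

// Type guard for the marker recorded in place of a calibration function.
function isNotSerializableNote(value: unknown): value is NotSerializableNote {
  return (
    typeof value === 'object' &&
    value !== null &&
    (value as { note?: unknown }).note === 'not-serializable'
  );
}

// Hypothetical helper: describe the calibrate field of a normalization snapshot.
function describeCalibration(calibrate: unknown): string {
  if (calibrate === undefined) return 'no calibration';
  if (isNotSerializableNote(calibrate)) {
    return 'calibration function (not captured in the snapshot)';
  }
  return `calibration data: ${JSON.stringify(calibrate)}`;
}

const fromFunction = describeCalibration({ note: 'not-serializable' });
const fromData = describeCalibration({ min: 0, max: 10 }); // invented sample data
```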

evals:Record<EvalName, EvalDefSnap>

Eval definition snapshots.

EvalDefSnap
name:EvalName

Eval name.

kind:'singleTurn' | 'multiTurn' | 'scorer'

Eval kind.

outputShape:'seriesByStepIndex' | 'scalar'

Stored output shape.

metric:MetricName

Metric ref for this eval.

scorerRef?:ScorerName

Optional scorer ref (scorer evals).

verdict?:VerdictPolicyInfo

Optional verdict policy info.

description?:string

Optional description.

metadata?:Record<string, unknown>

Optional metadata.

scorers:Record<ScorerName, ScorerDefSnap>

Scorer definition snapshots.

ScorerDefSnap
name:ScorerName

Scorer name.

description?:string

Optional description.

metadata?:Record<string, unknown>

Optional metadata.

inputs:readonly ScorerInputSnap[]

Serializable scorer inputs.

fallbackScore?:Score

Optional fallback score.

combine?:{ kind: ScorerCombineKind; note?: string }

Optional combine strategy snapshot.

result:ConversationResult

Per-eval results + optional summaries.

ConversationResult
stepCount:number

Number of steps in the conversation.

singleTurn:Record<EvalName, SingleTurnEvalSeries>

Single-turn eval results indexed by step (null = not evaluated).

SingleTurnEvalSeries
byStepIndex:Array<StepEvalResult | null>

The array index equals the step index; null means the step was not evaluated or not selected.

StepEvalResult
evalRef:EvalName

Eval name reference.

measurement:Measurement

Observed measurement (raw + score + debug info).

Measurement
metricRef:MetricName

Reference into `defs.metrics`.

score?:Score

Normalized score (0..1) when applicable.

rawValue?:MetricScalar | null

Raw value when available.

confidence?:number

Optional LLM confidence.

reasoning?:string

Optional LLM reasoning.

executionTimeMs?:number

Optional execution timing.

timestamp?:string

Optional ISO timestamp.

metadata?:Record<string, unknown>

Optional metadata.

outcome?:EvalOutcome

Optional verdict outcome (policy + verdict).

EvalOutcome
verdict:'pass' | 'fail' | 'unknown'

Final verdict.

policy:VerdictPolicyInfo

Serializable policy information.

observed?:{ rawValue?: MetricScalar | null; score?: Score }

Optional copy of observed values used for decision.
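The byStepIndex arrays under singleTurn can contain null entries and results without a score, so any aggregate has to skip both. The sketch below uses trimmed stand-in types. The helpers evaluatedIndices and meanScore are hypothetical and do not represent the library's aggregators.

```typescript
// Trimmed stand-ins for the documented shapes (assumption).
type StepEvalResult = {
  evalRef: string;
  measurement: { metricRef: string; score?: number };
};
type SingleTurnEvalSeries = { byStepIndex: Array<StepEvalResult | null> };

// Indices of steps that were actually evaluated (non-null entries).
function evaluatedIndices(series: SingleTurnEvalSeries): number[] {
  const indices: number[] = [];
  series.byStepIndex.forEach((result, index) => {
    if (result !== null) indices.push(index);
  });
  return indices;
}

// Mean over the scores that exist; undefined when no step has a score.
function meanScore(series: SingleTurnEvalSeries): number | undefined {
  let sum = 0;
  let count = 0;
  for (const result of series.byStepIndex) {
    if (result !== null && result.measurement.score !== undefined) {
      sum += result.measurement.score;
      count++;
    }
  }
  return count === 0 ? undefined : sum / count;
}

// Invented sample: step 1 was not evaluated, step 3 has no score.
const demoSeries: SingleTurnEvalSeries = {
  byStepIndex: [
    { evalRef: 'relevance', measurement: { metricRef: 'relevance', score: 0.5 } },
    null,
    { evalRef: 'relevance', measurement: { metricRef: 'relevance', score: 1 } },
    { evalRef: 'relevance', measurement: { metricRef: 'relevance' } },
  ],
};

const demoEvaluated = evaluatedIndices(demoSeries);
const demoMean = meanScore(demoSeries);
```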

multiTurn:Record<EvalName, ConversationEvalResult>

Multi-turn eval results (one per conversation).

ConversationEvalResult
evalRef:EvalName

Eval name reference.

measurement:Measurement

Observed measurement (raw + score + debug info).

outcome?:EvalOutcome

Optional verdict outcome.

scorers:Record<EvalName, { shape: 'seriesByStepIndex'; series: SingleTurnEvalSeries } | { shape: 'scalar'; result: ConversationEvalResult }>

Scorer eval results (explicitly declared as series vs scalar).
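The shape field acts as a discriminant, so checking it narrows each scorer entry to either the series or the scalar branch. The following sketch uses trimmed stand-in types, and the helper scorerScores is hypothetical.

```typescript
// Trimmed stand-ins for the documented shapes (assumption).
type EvalResultLike = {
  evalRef: string;
  measurement: { metricRef: string; score?: number };
};
type ScorerEntry =
  | { shape: 'seriesByStepIndex'; series: { byStepIndex: Array<EvalResultLike | null> } }
  | { shape: 'scalar'; result: EvalResultLike };

// Hypothetical helper: collect every available score from a scorer entry,
// whichever shape it was stored in.
function scorerScores(entry: ScorerEntry): number[] {
  if (entry.shape === 'scalar') {
    // Narrowed to the scalar branch: `result` is available here.
    const score = entry.result.measurement.score;
    return score === undefined ? [] : [score];
  }
  // Narrowed to the series branch: `series` is available here.
  const scores: number[] = [];
  for (const result of entry.series.byStepIndex) {
    if (result !== null && result.measurement.score !== undefined) {
      scores.push(result.measurement.score);
    }
  }
  return scores;
}

// Invented sample entries.
const scalarEntry: ScorerEntry = {
  shape: 'scalar',
  result: { evalRef: 'Overall Quality', measurement: { metricRef: 'quality', score: 0.7 } },
};
const seriesEntry: ScorerEntry = {
  shape: 'seriesByStepIndex',
  series: {
    byStepIndex: [
      { evalRef: 'Step Quality', measurement: { metricRef: 'quality', score: 0.2 } },
      null,
      { evalRef: 'Step Quality', measurement: { metricRef: 'quality', score: 0.4 } },
    ],
  },
};

const scalarScores = scorerScores(scalarEntry);
const seriesScores = scorerScores(seriesEntry);
```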

summaries?:Summaries

Optional summary rollups (aggregations + verdict summary) keyed by eval.

Summaries
byEval:Record<EvalName, EvalSummarySnap>

Summary per eval.

EvalSummarySnap
eval:EvalName

Eval name reference.

kind:'singleTurn' | 'multiTurn' | 'scorer'

Eval kind.

count:number

Number of targets contributing to summary.

aggregations?:{ score: Aggregations; raw?: Aggregations }

Optional aggregations (score + optional raw).

verdictSummary?:VerdictSummary

Optional verdict rollup (counts + rates).

VerdictSummary
passRate:Score

Pass rate.

failRate:Score

Fail rate.

unknownRate:Score

Unknown rate.

passCount:number

Pass count.

failCount:number

Fail count.

unknownCount:number

Unknown count.

totalCount:number

Total evaluated count.

metadata?:Record<string, unknown>

Optional metadata.

view:() => TargetRunView

Create an ergonomic view for assertions and inspection.

toArtifact:() => TallyRunArtifact

Convert to a schema-stable artifact for persistence/UI tooling.

TallyRunArtifact

Serializable shape for storage and viewer/TUI tooling. Avoids Map in persisted fields.

schemaVersion:1

Artifact schema version.

runId:string

Run identifier.

createdAt:string

ISO timestamp.

defs:RunDefs

Deduped metric/eval/scorer definitions.

RunDefs
metrics:Record<MetricName, MetricDefSnap>

Metric definition snapshots.

MetricDefSnap
name:MetricName

Metric name.

scope:'single' | 'multi'

Metric scope.

valueType:'number' | 'boolean' | 'string' | 'ordinal'

Metric value type.

description?:string

Optional description.

metadata?:Record<string, unknown>

Optional metadata.

llm?:{ provider?: Record<string, unknown>; prompt?: { instruction: string; variables?: readonly string[] }; rubric?: Record<string, unknown> }

Optional LLM snapshot (when metric is llm-based).

aggregators?:Array<{ kind: string; name: string; description?: string; config?: unknown }>

Optional aggregator snapshots (attached on single-turn metrics).

normalization?:MetricNormalizationSnap

Optional normalization snapshot.

evals:Record<EvalName, EvalDefSnap>

Eval definition snapshots.

EvalDefSnap
name:EvalName

Eval name.

kind:'singleTurn' | 'multiTurn' | 'scorer'

Eval kind.

outputShape:'seriesByStepIndex' | 'scalar'

Stored output shape.

metric:MetricName

Metric ref for this eval.

scorerRef?:ScorerName

Optional scorer ref (scorer evals).

verdict?:VerdictPolicyInfo

Optional verdict policy info.

description?:string

Optional description.

metadata?:Record<string, unknown>

Optional metadata.

scorers:Record<ScorerName, ScorerDefSnap>

Scorer definition snapshots.

ScorerDefSnap
name:ScorerName

Scorer name.

description?:string

Optional description.

metadata?:Record<string, unknown>

Optional metadata.

inputs:readonly ScorerInputSnap[]

Serializable scorer inputs.

fallbackScore?:Score

Optional fallback score.

combine?:{ kind: ScorerCombineKind; note?: string }

Optional combine strategy snapshot.

result:ConversationResult

Per-eval results + optional summaries.

Same nested shape as `TallyRunReport.result` (see above).

metadata?:Record<string, unknown>

Optional metadata.
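Since the artifact stores createdAt as an ISO string and carries a schemaVersion, a loader can validate the envelope before trusting the rest of the payload. The sketch below assumes the artifact is persisted with JSON.stringify. It models only the top-level scalar fields (defs and result are omitted), and parseArtifactEnvelope is a hypothetical helper, not part of the SDK.

```typescript
// Trimmed stand-in for TallyRunArtifact: defs and result are omitted.
type ArtifactEnvelope = {
  schemaVersion: 1;
  runId: string;
  createdAt: string; // ISO timestamp
};

// Hypothetical loader: parse JSON and validate the envelope fields.
function parseArtifactEnvelope(json: string): ArtifactEnvelope {
  const parsed: unknown = JSON.parse(json);
  if (typeof parsed !== 'object' || parsed === null) {
    throw new Error('artifact must be a JSON object');
  }
  const candidate = parsed as Record<string, unknown>;
  if (candidate['schemaVersion'] !== 1) {
    throw new Error(`unsupported schemaVersion: ${String(candidate['schemaVersion'])}`);
  }
  const runId = candidate['runId'];
  const createdAt = candidate['createdAt'];
  if (typeof runId !== 'string' || typeof createdAt !== 'string') {
    throw new Error('runId and createdAt must be strings');
  }
  return { schemaVersion: 1, runId, createdAt };
}

// Round trip: an in-memory Date becomes an ISO string on the way out and can
// be rebuilt into a Date after loading. The run id is invented.
const createdAtDate = new Date('2024-01-02T03:04:05.000Z');
const demoJson = JSON.stringify({
  schemaVersion: 1,
  runId: 'run-123',
  createdAt: createdAtDate.toISOString(),
});
const restoredEnvelope = parseArtifactEnvelope(demoJson);
const restoredDate = new Date(restoredEnvelope.createdAt);
```

Rejecting unknown schemaVersion values up front keeps a viewer from misreading an artifact written by a later schema.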
