Reports
SDK report objects, View API, and the schema-stable run artifact.
Tally exposes two report representations and a type-safe View API:
TallyRunReport: SDK-facing in-memory object returned from await tally.run().
TargetRunView: Type-safe accessor for report data, created via report.view().
TallyRunArtifact: Schema-stable, serializable artifact produced by report.toArtifact().
Import
import type {
  TallyRunReport,
  TallyRunArtifact,
  TargetRunView,
  StepResults,
  StepResultsWithIndex,
  ConversationResults,
  SummaryResults,
  StepEvalResult,
  ConversationEvalResult,
  Measurement,
  EvalOutcome,
  EvalSummary,
  VerdictSummary,
} from '@tally-evals/tally';
View API
The View API provides ergonomic, type-safe access to report data. Call report.view() to get a TargetRunView.
TargetRunView
Type-safe view over run results with eval name autocomplete.
Total number of steps in the conversation.
All definitions (metrics, evals, scorers) used in the run.
view.step(index): Get all single-turn eval results for a specific step.
view.steps(): Iterate over all steps with their results.
view.conversation(): Get multi-turn evals and scalar scorer results.
view.summary(): Get aggregated summaries by eval name.
Look up a metric definition by name.
Look up an eval definition by name.
Look up a scorer definition by name.
Get the metric definition for an eval.
StepResults
Object returned by view.step(index). Keys are eval names with type-safe autocomplete.
Result for a single-turn eval at this step. Keys are literal eval names.
StepResultsWithIndex
Yielded by view.steps() generator. Same as StepResults plus an index property.
Step index (0-based).
Result for a single-turn eval at this step.
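The relationship between StepResults and StepResultsWithIndex can be sketched without the library. In this minimal sketch, the eval names (relevance, toxicity), the field names evalName, measurement, outcome and verdict, and the helpers steps and failingSteps are all hypothetical stand-ins rather than Tally's actual implementation. Only the idea of a generator that yields each step's results plus a 0-based index comes from the descriptions above.

```typescript
// Hypothetical stand-ins for the documented shapes; the real types are
// exported from '@tally-evals/tally' and may differ in field names.
type Verdict = "pass" | "fail" | "unknown";

interface SketchStepEvalResult {
  evalName: string; // assumed field name
  measurement: { score?: number; rawValue?: unknown }; // score is 0-1
  outcome?: { verdict: Verdict }; // present only when a verdict policy exists
}

// Keys are literal eval names, so a misspelled key fails to compile.
type SketchEvalName = "relevance" | "toxicity";
type SketchStepResults = Record<SketchEvalName, SketchStepEvalResult>;
type SketchStepResultsWithIndex = SketchStepResults & { index: number };

// Mimics a steps() generator: yields each step's results plus its 0-based index.
function* steps(
  perStep: SketchStepResults[],
): Generator<SketchStepResultsWithIndex> {
  let index = 0;
  for (const results of perStep) {
    yield { ...results, index };
    index += 1;
  }
}

// Collects the indices of steps where the named eval produced a "fail" verdict.
function failingSteps(
  perStep: SketchStepResults[],
  name: SketchEvalName,
): number[] {
  const failed: number[] = [];
  for (const step of steps(perStep)) {
    if (step[name].outcome?.verdict === "fail") failed.push(step.index);
  }
  return failed;
}

// Made-up sample data for illustration only.
const sampleSteps: SketchStepResults[] = [
  {
    relevance: { evalName: "relevance", measurement: { score: 0.9, rawValue: 4.5 }, outcome: { verdict: "pass" } },
    toxicity: { evalName: "toxicity", measurement: { score: 0.1 }, outcome: { verdict: "pass" } },
  },
  {
    relevance: { evalName: "relevance", measurement: { score: 0.3, rawValue: 1.5 }, outcome: { verdict: "fail" } },
    toxicity: { evalName: "toxicity", measurement: { score: 0.2 } }, // no verdict policy
  },
];

const failedRelevanceSteps = failingSteps(sampleSteps, "relevance");
console.log(failedRelevanceSteps);
```

Because the keys are a union of string literals, an editor can autocomplete step.relevance and reject unknown names, which is the behaviour the View API descriptions attribute to eval names.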
ConversationResults
Object returned by view.conversation(). Contains multi-turn evals and scalar scorers.
Result for a multi-turn eval or scalar scorer. Keys are literal eval names.
SummaryResults
Object returned by view.summary(). Contains aggregated statistics by eval name.
Summary for an eval including aggregations and verdict rates.
StepEvalResult
Result for a single step evaluation.
Name of the eval.
The measurement (score, rawValue, metadata).
Verdict outcome (if verdict policy was defined).
ConversationEvalResult
Result for a conversation-level evaluation.
Name of the eval.
The measurement (score, rawValue, metadata).
Verdict outcome (if verdict policy was defined).
Measurement
The measured value from a metric execution.
Reference to the metric in defs.metrics.
Normalized score in the range 0-1.
Original metric value before normalization.
LLM confidence (0-1) if applicable.
LLM reasoning if applicable.
Execution time in milliseconds.
ISO timestamp of measurement.
Additional metadata.
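How a raw metric value becomes a 0-1 score depends on the metric's normalizer, which this page does not specify. As one illustration only, a min-max normalizer with clamping could look like the sketch below. The function name minMaxNormalize and its parameters are hypothetical and not part of Tally's API.

```typescript
// Hypothetical min-max normalizer: maps a raw value from [min, max] onto [0, 1]
// and clamps anything outside that range. Tally's own normalizers may differ.
function minMaxNormalize(raw: number, min: number, max: number): number {
  if (max <= min) {
    throw new RangeError("max must be greater than min");
  }
  const scaled = (raw - min) / (max - min);
  return Math.min(1, Math.max(0, scaled));
}

// A raw rating of 4 on an assumed 1-5 scale.
const normalizedRating = minMaxNormalize(4, 1, 5);
console.log(normalizedRating); // 0.75
```

Keeping both the raw value and the normalized score in a measurement, as described above, lets a reader trace a score back to the value the metric actually produced.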
EvalOutcome
Verdict outcome from applying a verdict policy.
The final verdict.
The policy used to compute the verdict.
The observed values used for the verdict.
EvalSummary
Aggregated summary for an eval across all targets.
Eval name.
Eval kind.
Number of data points.
Aggregated statistics (mean, percentiles, etc.).
Pass/fail/unknown rates and counts.
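To make "aggregated statistics (mean, percentiles, etc.)" concrete, here is a minimal sketch of two such aggregations over a list of scores. The helper names and the nearest-rank percentile method are assumptions chosen for illustration; this page does not state which percentile definition Tally uses.

```typescript
// Arithmetic mean of the scores; NaN for an empty list (an assumed convention).
function mean(scores: number[]): number {
  if (scores.length === 0) return NaN;
  let sum = 0;
  for (const s of scores) sum += s;
  return sum / scores.length;
}

// Nearest-rank percentile (assumed method): the smallest value such that at
// least p percent of the data is less than or equal to it.
function percentile(scores: number[], p: number): number {
  if (scores.length === 0) return NaN;
  const sorted = scores.slice().sort((a, b) => a - b); // copy, then sort ascending
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)] ?? NaN;
}

// Made-up per-step scores for one eval.
const sampleScores = [0.5, 1, 0.25, 0.75];
console.log(mean(sampleScores)); // 0.625
console.log(percentile(sampleScores, 50)); // 0.5
```

The count in an EvalSummary would correspond to sampleScores.length here, that is, the number of data points that fed the aggregations.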
VerdictSummary
Pass/fail statistics across all targets.
Proportion of passing verdicts (0-1).
Proportion of failing verdicts (0-1).
Proportion of unknown verdicts (0-1).
Number of passing verdicts.
Number of failing verdicts.
Number of unknown verdicts.
Total number of verdicts.
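The rates and counts in a VerdictSummary are related by simple arithmetic, sketched below. The field names (passRate, passCount, and so on), the helper summarizeVerdicts, and the choice to report a rate of 0 when there are no verdicts are assumptions for illustration; the sketch also assumes each rate is computed against the total number of verdicts, unknown ones included.

```typescript
type Verdict = "pass" | "fail" | "unknown";

// Hypothetical field names mirroring the seven values described above.
interface SketchVerdictSummary {
  passRate: number;
  failRate: number;
  unknownRate: number;
  passCount: number;
  failCount: number;
  unknownCount: number;
  totalCount: number;
}

// Counts each verdict, then divides by the total to get proportions in 0-1.
function summarizeVerdicts(verdicts: Verdict[]): SketchVerdictSummary {
  const counts = { pass: 0, fail: 0, unknown: 0 };
  for (const v of verdicts) counts[v] += 1;

  const total = verdicts.length;
  // Assumed convention: with no verdicts, every rate is 0 rather than NaN.
  const rate = (n: number): number => (total === 0 ? 0 : n / total);

  return {
    passRate: rate(counts.pass),
    failRate: rate(counts.fail),
    unknownRate: rate(counts.unknown),
    passCount: counts.pass,
    failCount: counts.fail,
    unknownCount: counts.unknown,
    totalCount: total,
  };
}

const sampleSummary = summarizeVerdicts(["pass", "pass", "fail", "unknown"]);
console.log(sampleSummary.passRate, sampleSummary.failRate, sampleSummary.unknownRate); // 0.5 0.25 0.25
```

Under these assumptions the three rates always sum to 1 whenever at least one verdict exists, which gives a quick sanity check when asserting on a summary in tests.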
TallyRunReport
Returned from await tally.run().
Run identifier.
Creation time (in-memory Date).
Deduped definitions referenced by the run.
Metric definition snapshots.
Metric name.
Metric scope.
Metric value type.
Optional description.
Optional metadata.
Optional LLM snapshot (when metric is llm-based).
Provider info snapshot.
Prompt snapshot.
Rubric snapshot.
Optional aggregator snapshots (attached on single-turn metrics).
Optional normalization snapshot.
Serializable normalizer snapshot.
Calibration snapshot (a calibration function is replaced by a note that it is not serializable).
Eval definition snapshots.
Eval name.
Eval kind.
Stored output shape.
Metric ref for this eval.
Optional scorer ref (scorer evals).
Optional verdict policy info.
Optional description.
Optional metadata.
Scorer definition snapshots.
Scorer name.
Optional description.
Optional metadata.
Serializable scorer inputs.
Optional fallback score.
Optional combine strategy snapshot.
Per-eval results + optional summaries.
Number of steps in the conversation.
Single-turn eval results indexed by step (null = not evaluated).
Array index == step index; null means not evaluated / not selected.
Eval name reference.
Observed measurement (raw + score + debug info).
Reference into `defs.metrics`.
Normalized score (0..1) when applicable.
Raw value when available.
Optional LLM confidence.
Optional LLM reasoning.
Optional execution timing.
Optional ISO timestamp.
Optional metadata.
Optional verdict outcome (policy + verdict).
Final verdict.
Serializable policy information.
Optional copy of observed values used for decision.
Multi-turn eval results (one per conversation).
Eval name reference.
Observed measurement (raw + score + debug info).
Optional verdict outcome.
Scorer eval results (explicitly declared as series vs scalar).
Optional summary rollups (aggregations + verdict summary) keyed by eval.
Summary per eval.
Eval name reference.
Eval kind.
Number of targets contributing to summary.
Optional aggregations (score + optional raw).
Optional verdict rollup (counts + rates).
Pass rate.
Fail rate.
Unknown rate.
Pass count.
Fail count.
Unknown count.
Total evaluated count.
Optional metadata.
Create an ergonomic view for assertions and inspection.
Convert to a schema-stable artifact for persistence/UI tooling.
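The single-turn results described above are stored so that the array index equals the step index, with null marking a step that was not evaluated or not selected. A minimal sketch of reading such a series while preserving step indices follows; the type names, the evalRef and score fields, and the helper evaluatedSteps are hypothetical and only mirror the description.

```typescript
// Hypothetical entry shape; the real per-step result carries more fields.
interface SketchSeriesEntry {
  evalRef: string; // assumed name for the eval name reference
  score?: number;
}

// Array index == step index; null means the step was not evaluated.
type SketchSeries = Array<SketchSeriesEntry | null>;

// Returns only the evaluated steps, each paired with its original step index.
function evaluatedSteps(
  series: SketchSeries,
): Array<{ stepIndex: number; entry: SketchSeriesEntry }> {
  const evaluated: Array<{ stepIndex: number; entry: SketchSeriesEntry }> = [];
  series.forEach((entry, stepIndex) => {
    if (entry !== null) evaluated.push({ stepIndex, entry });
  });
  return evaluated;
}

// Made-up series: step 1 was skipped.
const sampleSeries: SketchSeries = [
  { evalRef: "relevance", score: 0.9 },
  null,
  { evalRef: "relevance", score: 0.4 },
];

const evaluatedIndices = evaluatedSteps(sampleSeries).map((e) => e.stepIndex);
console.log(evaluatedIndices); // [ 0, 2 ]
```

Dropping the null entries with filter alone would compact the array and lose the original positions, which is why the sketch records stepIndex before discarding anything.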
TallyRunArtifact
Serializable shape for storage and viewer/TUI tooling. Avoids Map in persisted fields.
Artifact schema version.
Run identifier.
ISO timestamp.
Deduped metric/eval/scorer definitions.
Metric definition snapshots.
Metric name.
Metric scope.
Metric value type.
Optional description.
Optional metadata.
Optional LLM snapshot (when metric is llm-based).
Optional aggregator snapshots (attached on single-turn metrics).
Optional normalization snapshot.
Eval definition snapshots.
Eval name.
Eval kind.
Stored output shape.
Metric ref for this eval.
Optional scorer ref (scorer evals).
Optional verdict policy info.
Optional description.
Optional metadata.
Scorer definition snapshots.
Scorer name.
Optional description.
Optional metadata.
Serializable scorer inputs.
Optional fallback score.
Optional combine strategy snapshot.
Per-eval results + optional summaries.
Same nested shape as `TallyRunReport.result` (see above).
Optional metadata.
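The artifact's two visible differences from the in-memory report, an ISO timestamp instead of a Date and no Map in persisted fields, both matter for JSON serialization. The sketch below illustrates why, under the assumption that some in-memory field is a Map; the interfaces, field names, schemaVersion value, and the helper toPersisted are hypothetical and are not the implementation of report.toArtifact().

```typescript
// Hypothetical in-memory shape: a Date and a Map, neither of which survives
// a JSON round trip unchanged.
interface SketchInMemoryRun {
  runId: string;
  createdAt: Date;
  summaries: Map<string, { count: number }>;
}

// Hypothetical persisted shape: only JSON-friendly values.
interface SketchPersistedRun {
  schemaVersion: number;
  runId: string;
  createdAt: string; // ISO timestamp
  summaries: Record<string, { count: number }>;
}

function toPersisted(run: SketchInMemoryRun): SketchPersistedRun {
  const summaries: Record<string, { count: number }> = {};
  run.summaries.forEach((value, key) => {
    summaries[key] = value;
  });
  return {
    schemaVersion: 1, // assumed version number
    runId: run.runId,
    createdAt: run.createdAt.toISOString(),
    summaries,
  };
}

const sampleRun: SketchInMemoryRun = {
  runId: "run-1",
  createdAt: new Date("2024-01-01T00:00:00.000Z"),
  summaries: new Map([["relevance", { count: 3 }]]),
};

// JSON.stringify serializes a Map as an empty object, silently losing its entries.
const naiveJson = JSON.stringify({ summaries: sampleRun.summaries });
console.log(naiveJson); // {"summaries":{}}

// The plain-object form round-trips through JSON intact.
const restored = JSON.parse(JSON.stringify(toPersisted(sampleRun))) as SketchPersistedRun;
console.log(restored.summaries.relevance.count); // 3
```

A version field such as the artifact schema version lets viewer or TUI tooling detect which shape it is reading before interpreting the rest of the payload.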