Eval Set Format

Eval sets provide a repeatable way to organize the inputs and metadata used during evaluation.

What an eval set is

An eval set typically describes:

the items or examples being evaluated
metadata associated with those items
expected outputs, labels, or references when available
which evaluators or metrics should be applied

Keeping these parts together in one definition gives teams a stable structure for comparing results over time.
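As an illustration, the parts listed above could be modeled roughly as follows. This is a minimal sketch, not a prescribed schema: the class names (EvalItem, EvalSet), the field names, and the evaluator names are all hypothetical and chosen only to mirror the list.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class EvalItem:
    """One example to evaluate (hypothetical shape, for illustration only)."""
    id: str                                   # identifier for the item
    input: Any                                # the prompt, query, or task input
    metadata: dict[str, Any] = field(default_factory=dict)
    expected: Optional[Any] = None            # reference output or label, if one exists


@dataclass
class EvalSet:
    """A named collection of items plus the evaluators to apply to them."""
    name: str
    items: list[EvalItem]
    evaluators: list[str] = field(default_factory=list)


# Example data is invented for the sketch.
example_set = EvalSet(
    name="support-faq",
    items=[
        EvalItem(
            id="faq-001",
            input="How do I reset my password?",
            metadata={"topic": "account"},
            expected="Use the reset link on the sign-in page.",
        ),
        # No expected output: this item would be judged without a reference.
        EvalItem(
            id="faq-002",
            input="Summarize the refund policy.",
            metadata={"topic": "billing"},
        ),
    ],
    evaluators=["exact_match", "helpfulness"],
)
```

Leaving expected as optional reflects the "when available" wording above: items without a reference can still be part of the set.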

Why it matters

A clear eval set format helps you:

keep evaluation runs consistent
compare changes across model or agent versions
connect trace-derived behavior to dataset-level expectations
share evaluation definitions across local, CI, and Kubernetes environments

Practical guidance

When designing an eval set:

keep identifiers stable
store expected outputs or labels only when the task genuinely has a reference answer
attach metadata that is useful for slicing results later
avoid overloading one eval set with many unrelated behaviors, since aggregate results become hard to interpret
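Two of these points, stable identifiers and metadata for slicing, lend themselves to simple checks. The sketch below assumes items are plain dictionaries with an "id" key and an optional "metadata" dictionary; that layout and the helper names (duplicate_ids, slice_by) are assumptions made for illustration, not part of any defined format.

```python
from collections import Counter, defaultdict


def duplicate_ids(items):
    """Return the sorted list of identifiers that appear more than once."""
    counts = Counter(item["id"] for item in items)
    return sorted(item_id for item_id, n in counts.items() if n > 1)


def slice_by(items, key):
    """Group item ids by one metadata key; items lacking the key go under 'unknown'."""
    groups = defaultdict(list)
    for item in items:
        value = item.get("metadata", {}).get(key, "unknown")
        groups[value].append(item["id"])
    return dict(groups)


# Example data is invented for the sketch.
items = [
    {"id": "a", "metadata": {"lang": "en"}},
    {"id": "b", "metadata": {"lang": "de"}},
    {"id": "a", "metadata": {"lang": "en"}},  # repeated id, flagged below
    {"id": "c"},                              # no metadata at all
]

dupes = duplicate_ids(items)
by_lang = slice_by(items, "lang")
```

A check like duplicate_ids could run before each evaluation so that results from different runs can be matched item by item.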