Eval Set Format
Eval sets provide a repeatable way to organize the inputs and metadata used during evaluation.
What an eval set is
An eval set typically describes:
- the items or examples being evaluated
- metadata associated with those items
- expected outputs, labels, or references when available
- which evaluators or metrics should be applied
Together, these fields give teams a stable structure for comparing results over time.
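The four parts listed above can be pictured as a small data structure. The sketch below is only an illustration: the class names `EvalItem` and `EvalSet`, every field name, and the `exact_match` evaluator name are assumptions made for this example, not a schema defined by this document.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class EvalItem:
    # Hypothetical field names; a real eval set schema may differ.
    id: str                          # identifier for the item
    input: Any                       # the example being evaluated
    expected: Optional[Any] = None   # expected output, label, or reference, when available
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass
class EvalSet:
    name: str
    items: list[EvalItem]
    # Names of evaluators or metrics to apply (hypothetical names).
    evaluators: list[str] = field(default_factory=list)


eval_set = EvalSet(
    name="support-faq",
    items=[
        EvalItem(
            id="faq-001",
            input="How do I reset my password?",
            expected="Use the reset link on the sign-in page.",
            metadata={"topic": "account"},
        ),
        # No expected output: this item has no reference available.
        EvalItem(
            id="faq-002",
            input="Summarize this ticket.",
            metadata={"topic": "summarization"},
        ),
    ],
    evaluators=["exact_match"],
)
```

Making `expected` optional reflects that references are present only "when available"; items without one can still be scored by evaluators that do not need a reference.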
Why it matters
A clear eval set format helps you:
- keep evaluation runs consistent
- compare changes across model or agent versions
- connect behavior observed in traces to the expectations defined at the dataset level
- share evaluation definitions across local, CI, and Kubernetes environments
Practical guidance
When designing an eval set:
- keep item identifiers stable across runs so results can be matched between versions
- store expected outputs or labels only when they are genuinely part of the task
- attach metadata that is useful for slicing results later
- avoid overloading one eval set with too many unrelated behaviors
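Two of the points above, stable identifiers and metadata for slicing, lend themselves to a short sketch. The helpers `check_unique_ids` and `slice_by`, the dictionary layout of each item, and the `topic` metadata key are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def check_unique_ids(items):
    """Raise ValueError if two items share an id (hypothetical helper)."""
    seen = set()
    for item in items:
        if item["id"] in seen:
            raise ValueError(f"duplicate item id: {item['id']}")
        seen.add(item["id"])


def slice_by(items, scores, key):
    """Average score per value of one metadata key.

    `scores` maps item id -> float. Items missing the key are grouped
    under "unknown".
    """
    groups = defaultdict(list)
    for item in items:
        value = item.get("metadata", {}).get(key, "unknown")
        groups[value].append(scores[item["id"]])
    return {value: sum(vals) / len(vals) for value, vals in groups.items()}


# Illustrative items and scores, not real results.
items = [
    {"id": "faq-001", "metadata": {"topic": "account"}},
    {"id": "faq-002", "metadata": {"topic": "account"}},
    {"id": "faq-003", "metadata": {"topic": "billing"}},
]
scores = {"faq-001": 1.0, "faq-002": 0.0, "faq-003": 1.0}

check_unique_ids(items)
by_topic = slice_by(items, scores, "topic")
```

Because scores are keyed by item id, the same lookup keeps working across runs only if the ids do not change, which is the practical reason for keeping them stable.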