OpenAI Evals API Backend

agentevals includes an initial option to delegate evals to OpenAI’s Evals API, so judging can run on OpenAI’s infrastructure instead of inside your own code.

What this means

Instead of keeping all judging logic inside agentevals, you can use it to:

  • extract and normalize data from traces
  • organize evaluation inputs
  • send evaluation work to OpenAI’s Evals API backend
  • collect and review the resulting outputs in your agentevals workflow (a sketch follows this list)
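
A minimal sketch of the local extract-and-normalize half of this split. The trace shape and the normalize_trace helper are illustrative, not part of agentevals' API; the "item" wrapper matches the JSONL format OpenAI's Evals API expects for uploaded data:

    # Normalize agent traces locally; the judging itself is delegated later.
    # `traces` and `normalize_trace` are illustrative, not agentevals' API.
    import json

    traces = [  # stand-in for traces collected from your agent runs
        {"inputs": {"question": "What is 2 + 2?"}, "outputs": {"answer": "4"}},
    ]

    def normalize_trace(trace: dict) -> dict:
        # Keep only what the judge needs: the user input and the final answer.
        return {"item": {
            "input": trace["inputs"]["question"],
            "output": trace["outputs"]["answer"],
        }}

    with open("traces.jsonl", "w") as f:
        for trace in traces:
            f.write(json.dumps(normalize_trace(trace)) + "\n")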

When to use it

This backend is a good fit when:

  • you want external rubric-based or model-based judging
  • you already use OpenAI tooling in your evaluation stack
  • you want agentevals to stay focused on trace extraction and orchestration

When not to use it

Prefer Custom Evaluators when:

  • the logic should remain fully local and code-defined
  • you need deterministic checks with no external dependency
  • you want complete control over evaluator implementation details (see the example after this list)
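
For contrast, a fully local, deterministic check needs no external backend at all. This sketch assumes the common (outputs, reference_outputs) evaluator signature and key/score result shape; agentevals' exact custom-evaluator interface may differ:

    # A fully local, deterministic custom evaluator: no model call, no
    # external dependency. The signature and result shape are assumptions,
    # following a common evaluator convention.
    def exact_tool_order_evaluator(outputs: list, reference_outputs: list) -> dict:
        """Pass only if the agent called tools in exactly the expected order."""
        called = [m["name"] for m in outputs if m.get("type") == "tool_call"]
        expected = [m["name"] for m in reference_outputs if m.get("type") == "tool_call"]
        return {
            "key": "exact_tool_order",
            "score": called == expected,  # deterministic: same inputs, same verdict
        }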

Requirements

To use this backend, you will need to provide:

  • an OpenAI API key
  • an eval definition compatible with your delegated workflow
  • trace data that contains enough context for judging (all three requirements are wired together in the sketch below)
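
A sketch of how these requirements fit together using the OpenAI Python SDK. The grader fields follow OpenAI's published Evals API examples, but the eval name, grader prompt, and labels here are illustrative; verify field names against current OpenAI documentation:

    import os
    from openai import OpenAI

    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])  # requirement 1: API key

    # Requirement 2: an eval definition. Here, a model-based grader that
    # labels each normalized trace item as correct or incorrect.
    eval_obj = client.evals.create(
        name="agent-answer-quality",  # illustrative name
        data_source_config={
            "type": "custom",
            "item_schema": {
                "type": "object",
                "properties": {
                    "input": {"type": "string"},
                    "output": {"type": "string"},
                },
                "required": ["input", "output"],
            },
        },
        testing_criteria=[
            {
                "type": "label_model",
                "name": "answer_grader",
                "model": "gpt-4o-mini",
                "input": [
                    {"role": "developer", "content": "Label the answer correct or incorrect."},
                    {"role": "user", "content": "Question: {{ item.input }}\nAnswer: {{ item.output }}"},
                ],
                "labels": ["correct", "incorrect"],
                "passing_labels": ["correct"],
            }
        ],
    )
    # Requirement 3, trace data with enough context, is the JSONL file you
    # upload when creating a run; see the end-to-end sketch below.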

Conceptual workflow

  1. collect traces from agent runs
  2. map them into the eval structure agentevals expects
  3. configure the OpenAI backend
  4. delegate evaluation to OpenAI’s Evals API
  5. review results through agentevals outputs and UI (see the end-to-end sketch below)
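
An end-to-end sketch of the five steps, continuing from the earlier sketches (eval_obj and traces.jsonl). The run status values and report_url field follow OpenAI's Evals API run objects:

    import time
    from openai import OpenAI

    client = OpenAI()

    # Steps 1-2: traces were collected and mapped into traces.jsonl above.
    file = client.files.create(file=open("traces.jsonl", "rb"), purpose="evals")

    # Steps 3-4: point a run at the eval definition and delegate judging.
    run = client.evals.runs.create(
        eval_obj.id,
        name="first-delegated-run",
        data_source={
            "type": "jsonl",
            "source": {"type": "file_id", "id": file.id},
        },
    )

    # Step 5: wait for OpenAI to finish, then review the results.
    while True:
        run = client.evals.runs.retrieve(run.id, eval_id=eval_obj.id)
        if run.status in ("completed", "failed"):
            break
        time.sleep(5)
    print(run.status, run.report_url)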

Recommendation

Start with a single, narrow delegated eval, confirm that the judgment shape matches your expectations, and then expand to broader use cases.