Ship Agents Reliably

Benchmark your agents before they hit production. AgentEvals scores agent performance and inference quality directly from OpenTelemetry traces: no re-runs, no guesswork.

Why AgentEvals?

Evaluate agent behavior from real traces, not synthetic replays.

🔍

Trace-Based Evaluation

Parse OTLP streams and Jaeger JSON traces to evaluate agent behavior directly from production or test telemetry data.
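As a sketch of what trace-based evaluation consumes, the snippet below pulls span names out of a Jaeger JSON export in start order. The field names follow Jaeger's standard export format; the helper itself is illustrative, not part of AgentEvals.

```python
import json

def span_names(jaeger_export: str) -> list[str]:
    """Extract operation names from a Jaeger JSON export, in start order.

    Illustrative helper only; AgentEvals does its own OTLP/Jaeger parsing.
    """
    doc = json.loads(jaeger_export)
    spans = [s for trace in doc["data"] for s in trace["spans"]]
    spans.sort(key=lambda s: s["startTime"])
    return [s["operationName"] for s in spans]

# A minimal Jaeger-style export with two spans
export = json.dumps({"data": [{"traceID": "abc", "spans": [
    {"operationName": "call_llm", "startTime": 1},
    {"operationName": "tool:search", "startTime": 2},
]}]})
print(span_names(export))  # ['call_llm', 'tool:search']
```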

No Re-Running Required

Score agent behavior from existing traces. No need to replay expensive LLM calls or wait for agent re-execution.

🎯

Golden Eval Sets

Define expected behaviors as golden eval sets and score traces against them using ADK's evaluation framework.

📊

Trajectory Matching

Compare agent trajectories with strict, unordered, subset, or superset matching modes for flexible evaluation.
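The four modes boil down to sequence and multiset comparisons over tool-call lists. This hypothetical `match_trajectory` helper illustrates the semantics; AgentEvals' actual scoring may weight partial matches differently.

```python
from collections import Counter

def match_trajectory(actual: list[str], expected: list[str], mode: str = "strict") -> bool:
    """Compare an agent's tool-call trajectory against an expected one.

    Hypothetical helper illustrating the four matching modes.
    """
    if mode == "strict":      # same calls, same order
        return actual == expected
    if mode == "unordered":   # same calls, any order (multiset equality)
        return Counter(actual) == Counter(expected)
    if mode == "subset":      # every actual call appears in expected
        return not (Counter(actual) - Counter(expected))
    if mode == "superset":    # every expected call appears in actual
        return not (Counter(expected) - Counter(actual))
    raise ValueError(f"unknown mode: {mode}")

print(match_trajectory(["search", "answer"], ["answer", "search"], "unordered"))  # True
```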

🤖

LLM-as-Judge

Use LLM-powered evaluation for nuanced scoring of agent behavior without requiring reference trajectories.
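A minimal LLM-as-judge loop looks like the sketch below: build a rubric prompt from the trace, send it to any chat model, and parse a numeric verdict. The prompt wording and the `SCORE:` convention are assumptions made for illustration, not AgentEvals' actual judge protocol; only prompt building and parsing are shown.

```python
import re

def judge_prompt(task: str, trace_summary: str) -> str:
    """Build a rubric prompt for an LLM judge (illustrative wording)."""
    return (
        "You are grading an AI agent's behavior.\n"
        f"Task: {task}\n"
        f"Agent trace:\n{trace_summary}\n"
        "Reply with a line 'SCORE: <0.0-1.0>' and a short justification."
    )

def parse_score(reply: str) -> float:
    """Pull the numeric score out of the judge's reply; 0.0 if missing."""
    m = re.search(r"SCORE:\s*([01](?:\.\d+)?)", reply)
    return float(m.group(1)) if m else 0.0

print(parse_score("SCORE: 0.8\nThe agent called the right tool."))  # 0.8
```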

🛠

CI/CD Integration

Run evaluations in your pipeline with the CLI. Gate deployments on agent behavior quality scores.
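Gating can be as simple as comparing reported metrics against thresholds and failing the pipeline step otherwise. The metric name and results shape below are assumptions for illustration; check them against your actual agentevals output.

```python
import sys

def gate(scores: dict, thresholds: dict) -> bool:
    """True only when every gated metric meets its threshold."""
    return all(scores.get(name, 0.0) >= floor for name, floor in thresholds.items())

# Hypothetical values, e.g. parsed from the CLI's output
scores = {"tool_trajectory_avg_score": 0.92}
thresholds = {"tool_trajectory_avg_score": 0.85}  # deployment bar

if __name__ == "__main__":
    ok = gate(scores, thresholds)
    print("PASS" if ok else "FAIL")
    sys.exit(0 if ok else 1)  # nonzero exit fails the pipeline step
```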

🧩

Custom Evaluators

Write your own scoring logic in Python, JavaScript, or any language. Share evaluators through the community registry.

How It Works

Three steps from traces to scores.

1

Collect Traces

Instrument your agent with OpenTelemetry or export Jaeger JSON traces from your observability platform.

2

Define Eval Sets

Create golden evaluation sets that describe expected agent behaviors, tool calls, and trajectories.
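A golden eval case pairs an input with the behavior you expect: the tools the agent should call and a reference response. The structure below is purely illustrative; the real schema comes from ADK's evaluation framework, so use a shipped sample such as `samples/eval_set_helm.json` as your starting point.

```python
import json

# Illustrative golden eval case; field names here are assumptions,
# not the ADK evaluation schema.
eval_case = {
    "query": "List the pods in the default namespace",
    "expected_tool_use": [
        {"tool_name": "kubectl_get",
         "tool_input": {"resource": "pods", "namespace": "default"}},
    ],
    "reference": "A table of pods in the default namespace.",
}

print(json.dumps({"eval_cases": [eval_case]}, indent=2))
```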

3

Score & Report

Run evaluations via CLI or Web UI. Get detailed scores and pass/fail results.

Two Ways to Evaluate

Choose the interface that fits your workflow.

CLI

Script evaluations and integrate into CI/CD pipelines. Pipe in traces, get scores out. Built for automation.

🖥

Web UI

Visually inspect traces and interactively evaluate agent behavior. Browse results, compare runs, and drill into details.

Build Your Own Evaluators

Write custom scoring logic in Python, JavaScript, or any language. Share it with the community through our evaluator registry.
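At its core, a custom evaluator is a function from trace data to a score. The example below is a hypothetical metric sketch, not part of AgentEvals, and the span shape and registration mechanism are assumptions; consult the registry docs for the real interface.

```python
def no_retry_storms(spans: list[dict], max_repeats: int = 3) -> float:
    """Custom metric sketch: penalize an agent that hammers the same tool.

    Returns 1.0 when no tool is called more than `max_repeats` times in a
    row, scaling down toward 0.0 as the longest streak grows.
    (Hypothetical evaluator, not part of AgentEvals.)
    """
    longest = streak = 0
    prev = None
    for span in spans:
        name = span.get("operationName")
        streak = streak + 1 if name == prev else 1
        longest = max(longest, streak)
        prev = name
    return 1.0 if longest <= max_repeats else max_repeats / longest

spans = [{"operationName": n} for n in ["search", "search", "search", "search", "answer"]]
print(no_retry_storms(spans))  # 0.75
```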

Get Started

Up and running in seconds.

terminal
# Install from release wheel
pip install agentevals-<version>-py3-none-any.whl

# Run an evaluation against a trace
agentevals run samples/helm.json \
  --eval-set samples/eval_set_helm.json \
  -m tool_trajectory_avg_score

# Start the web UI
agentevals serve

Start Evaluating Your Agents

Open source. Trace-driven. No re-runs needed.