Ship Agents Reliably

Benchmark your agents before they hit production. AgentEvals scores agent performance and inference quality directly from OpenTelemetry traces: no re-runs, no guesswork.

Why AgentEvals?

Evaluate agent behavior from real traces, not synthetic replays.

🔍

Trace-Based Evaluation

Parse OTLP streams and Jaeger JSON traces to evaluate agent behavior directly from production or test telemetry data.
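As a sketch of what trace-based evaluation consumes, the snippet below pulls span names out of a Jaeger JSON export in start order. The field names follow Jaeger's standard export format; the helper itself is illustrative, not part of AgentEvals.

```python
import json

def span_names(jaeger_export: str) -> list[str]:
    """Extract operation names from a Jaeger JSON export, in start order.

    Illustrative helper only; AgentEvals does its own OTLP/Jaeger parsing.
    """
    doc = json.loads(jaeger_export)
    spans = [s for trace in doc["data"] for s in trace["spans"]]
    spans.sort(key=lambda s: s["startTime"])
    return [s["operationName"] for s in spans]

# A minimal Jaeger-style export with two spans
export = json.dumps({"data": [{"traceID": "abc", "spans": [
    {"operationName": "call_llm", "startTime": 1},
    {"operationName": "tool:search", "startTime": 2},
]}]})
print(span_names(export))  # ['call_llm', 'tool:search']
```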

No Re-Running Required

Score agent behavior from existing traces. No need to replay expensive LLM calls or wait for agent re-execution.

🎯

Golden Eval Sets

Define expected behaviors as golden eval sets and score traces against them using ADK's evaluation framework.

📊

Trajectory Matching

Compare agent trajectories with strict, unordered, subset, or superset matching modes for flexible evaluation.
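The four modes boil down to sequence and multiset comparisons over tool-call lists. This hypothetical `match_trajectory` helper illustrates the semantics; AgentEvals' actual scoring may weight partial matches differently.

```python
from collections import Counter

def match_trajectory(actual: list[str], expected: list[str], mode: str = "strict") -> bool:
    """Compare an agent's tool-call trajectory against an expected one.

    Hypothetical helper illustrating the four matching modes.
    """
    if mode == "strict":      # same calls, same order
        return actual == expected
    if mode == "unordered":   # same calls, any order (multiset equality)
        return Counter(actual) == Counter(expected)
    if mode == "subset":      # every actual call appears in expected
        return not (Counter(actual) - Counter(expected))
    if mode == "superset":    # every expected call appears in actual
        return not (Counter(expected) - Counter(actual))
    raise ValueError(f"unknown mode: {mode}")

print(match_trajectory(["search", "answer"], ["answer", "search"], "unordered"))  # True
```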

🤖

LLM-as-Judge

Use LLM-powered evaluation for nuanced scoring of agent behavior without requiring reference trajectories.
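A minimal LLM-as-judge loop looks like the sketch below: build a rubric prompt from the trace, send it to any chat model, and parse a numeric verdict. The prompt wording and the `SCORE:` convention are assumptions made for illustration, not AgentEvals' actual judge protocol; only prompt building and parsing are shown.

```python
import re

def judge_prompt(task: str, trace_summary: str) -> str:
    """Build a rubric prompt for an LLM judge (illustrative wording)."""
    return (
        "You are grading an AI agent's behavior.\n"
        f"Task: {task}\n"
        f"Agent trace:\n{trace_summary}\n"
        "Reply with a line 'SCORE: <0.0-1.0>' and a short justification."
    )

def parse_score(reply: str) -> float:
    """Pull the numeric score out of the judge's reply; 0.0 if missing."""
    m = re.search(r"SCORE:\s*([01](?:\.\d+)?)", reply)
    return float(m.group(1)) if m else 0.0

print(parse_score("SCORE: 0.8\nThe agent called the right tool."))  # 0.8
```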

🛠

CI/CD Integration

Run evaluations in your pipeline with the CLI. Gate deployments on agent behavior quality scores.
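Gating can be as simple as comparing reported metrics against thresholds and failing the pipeline step otherwise. The metric name and results shape below are assumptions for illustration; check them against your actual agentevals output.

```python
import sys

def gate(scores: dict, thresholds: dict) -> bool:
    """True only when every gated metric meets its threshold."""
    return all(scores.get(name, 0.0) >= floor for name, floor in thresholds.items())

# Hypothetical values, e.g. parsed from the CLI's output
scores = {"tool_trajectory_avg_score": 0.92}
thresholds = {"tool_trajectory_avg_score": 0.85}  # deployment bar

if __name__ == "__main__":
    ok = gate(scores, thresholds)
    print("PASS" if ok else "FAIL")
    sys.exit(0 if ok else 1)  # nonzero exit fails the pipeline step
```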

🧩

Custom Evaluators

Write your own scoring logic in Python, JavaScript, or any language. Share evaluators through the community registry.

How It Works

Three steps from traces to scores.

1

Collect Traces

Instrument your agent with OpenTelemetry or export Jaeger JSON traces from your observability platform.

2

Define Eval Sets

Create golden evaluation sets that describe expected agent behaviors, tool calls, and trajectories.
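A golden eval case pairs an input with the behavior you expect: the tools the agent should call and a reference response. The structure below is purely illustrative; the real schema comes from ADK's evaluation framework, so use a shipped sample such as `samples/eval_set_helm.json` as your starting point.

```python
import json

# Illustrative golden eval case; field names here are assumptions,
# not the ADK evaluation schema.
eval_case = {
    "query": "List the pods in the default namespace",
    "expected_tool_use": [
        {"tool_name": "kubectl_get",
         "tool_input": {"resource": "pods", "namespace": "default"}},
    ],
    "reference": "A table of pods in the default namespace.",
}

print(json.dumps({"eval_cases": [eval_case]}, indent=2))
```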

3

Score & Report

Run evaluations via CLI or Web UI. Get detailed scores and pass/fail results.

Two Ways to Evaluate

Choose the interface that fits your workflow.

CLI

Script evaluations and integrate into CI/CD pipelines. Pipe in traces, get scores out. Built for automation.

🖥

Web UI

Visually inspect traces and interactively evaluate agent behavior. Browse results, compare runs, and drill into details.

Build Your Own Evaluators

Write custom scoring logic in Python, JavaScript, or any language. Share it with the community through our evaluator registry.
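At its core, a custom evaluator is a function from trace data to a score. The example below is a hypothetical metric sketch, not part of AgentEvals, and the span shape and registration mechanism are assumptions; consult the registry docs for the real interface.

```python
def no_retry_storms(spans: list[dict], max_repeats: int = 3) -> float:
    """Custom metric sketch: penalize an agent that hammers the same tool.

    Returns 1.0 when no tool is called more than `max_repeats` times in a
    row, scaling down toward 0.0 as the longest streak grows.
    (Hypothetical evaluator, not part of AgentEvals.)
    """
    longest = streak = 0
    prev = None
    for span in spans:
        name = span.get("operationName")
        streak = streak + 1 if name == prev else 1
        longest = max(longest, streak)
        prev = name
    return 1.0 if longest <= max_repeats else max_repeats / longest

spans = [{"operationName": n} for n in ["search", "search", "search", "search", "answer"]]
print(no_retry_storms(spans))  # 0.75
```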

Get Started

Up and running in seconds.

terminal
# Install from release wheel
pip install agentevals-<version>-py3-none-any.whl

# Run an evaluation against a trace
agentevals run samples/helm.json \
  --eval-set samples/eval_set_helm.json \
  -m tool_trajectory_avg_score

# Start the web UI
agentevals serve

Start Evaluating Your Agents

Open source. Trace-driven. No re-runs needed.