
Benchmark your agents before they hit production. AgentEvals scores performance and inference quality from OpenTelemetry traces — no re-runs, no guesswork.
Evaluate agent behavior from real traces, not synthetic replays.
Parse OTLP streams and Jaeger JSON traces to evaluate agent behavior directly from production or test telemetry data.
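To make the Jaeger side concrete, here is a minimal sketch of pulling a tool-call trajectory out of a Jaeger JSON export. It assumes Jaeger's standard export layout (`{"data": [{"spans": [...]}]}` with `operationName`, `startTime`, and a `tags` list per span); the `tool.name` tag key is illustrative, since the attributes an agent emits depend on how it is instrumented.

```python
import json

def tool_calls_from_jaeger(trace_json: str) -> list[str]:
    """Extract the ordered tool-call names from a Jaeger JSON export.

    Assumes Jaeger's export shape: {"data": [{"spans": [...]}]}. The
    "tool.name" tag key is an illustrative convention, not a standard.
    """
    doc = json.loads(trace_json)
    calls = []
    for trace in doc.get("data", []):
        # Sort spans by start time so the trajectory reflects execution order.
        for span in sorted(trace.get("spans", []), key=lambda s: s.get("startTime", 0)):
            tags = {t["key"]: t["value"] for t in span.get("tags", [])}
            if "tool.name" in tags:
                calls.append(tags["tool.name"])
    return calls

# A small hand-written trace in Jaeger's export shape:
sample = json.dumps({
    "data": [{"traceID": "abc", "spans": [
        {"operationName": "call_tool", "startTime": 2,
         "tags": [{"key": "tool.name", "type": "string", "value": "web_search"}]},
        {"operationName": "call_tool", "startTime": 1,
         "tags": [{"key": "tool.name", "type": "string", "value": "load_memory"}]},
    ]}]
})
print(tool_calls_from_jaeger(sample))  # ['load_memory', 'web_search']
```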
Score agent behavior from existing traces. No need to replay expensive LLM calls or wait for agent re-execution.
Define expected behaviors as golden eval sets and score traces against them using ADK's evaluation framework.
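A golden eval case pairs a query with the behavior you expect. The exact on-disk schema of an eval-set file (e.g. `samples/eval_set_helm.json`) is not shown here, so the field names below are a hypothetical sketch of what such a case might contain, not the real format:

```python
# Hypothetical golden eval case; field names are illustrative only.
golden_case = {
    "eval_id": "helm_upgrade_smoke",
    "query": "Upgrade the nginx release to chart version 1.2.3",
    "expected_tool_trajectory": [
        {"tool": "helm_list", "args": {"namespace": "default"}},
        {"tool": "helm_upgrade", "args": {"release": "nginx", "version": "1.2.3"}},
    ],
    "reference_response": "Upgraded release nginx to chart 1.2.3.",
}

# An eval set groups related cases under a shared id.
eval_set = {"eval_set_id": "helm_smoke", "cases": [golden_case]}
print(len(eval_set["cases"]))  # 1
```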
Compare agent trajectories with strict, unordered, subset, or superset matching modes for flexible evaluation.
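One plausible reading of the four matching modes, sketched in plain Python (this is not the library's actual implementation, just the set semantics the mode names suggest):

```python
from collections import Counter

def trajectory_match(expected: list[str], actual: list[str], mode: str = "strict") -> bool:
    """Sketch of trajectory matching modes:
    - strict:    same tool calls, same order
    - unordered: same tool calls (with multiplicity), any order
    - subset:    every expected call appears somewhere in the actual trajectory
    - superset:  every actual call was among the expected ones
    """
    if mode == "strict":
        return expected == actual
    if mode == "unordered":
        return Counter(expected) == Counter(actual)
    if mode == "subset":
        return not Counter(expected) - Counter(actual)
    if mode == "superset":
        return not Counter(actual) - Counter(expected)
    raise ValueError(f"unknown mode: {mode}")

print(trajectory_match(["search", "fetch"], ["search", "fetch"]))               # True
print(trajectory_match(["fetch", "search"], ["search", "fetch"], "unordered"))  # True
print(trajectory_match(["search"], ["search", "fetch"], "subset"))              # True
print(trajectory_match(["search"], ["search", "fetch"], "superset"))            # False
```

Subset matching tolerates extra exploratory tool calls; superset matching catches an agent that calls tools you never sanctioned.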
Use LLM-powered evaluation for nuanced scoring of agent behavior without requiring reference trajectories.
Run evaluations in your pipeline with the CLI. Gate deployments on agent behavior quality scores.
Write your own scoring logic in Python, JavaScript, or any language. Share evaluators through the community registry.
Three steps from traces to scores.
Instrument your agent with OpenTelemetry or export Jaeger JSON traces from your observability platform.
Create golden evaluation sets that describe expected agent behaviors, tool calls, and trajectories.
Run evaluations via CLI or Web UI. Get detailed scores and pass/fail results.
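The scoring in step 3 can be sketched as follows, using the `tool_trajectory_avg_score` metric named in the quick-start. This is one plausible interpretation (each case scores 1.0 on an exact trajectory match, 0.0 otherwise, averaged across cases); the real metric may match or weight differently.

```python
def tool_trajectory_avg_score(cases: list[tuple[list[str], list[str]]]) -> float:
    """Average of per-case trajectory matches: 1.0 when the actual tool
    trajectory exactly equals the golden one, else 0.0. A sketch of the
    metric's likely shape, not its actual implementation.
    """
    if not cases:
        return 0.0
    return sum(1.0 for expected, actual in cases if expected == actual) / len(cases)

score = tool_trajectory_avg_score([
    (["helm_list", "helm_upgrade"], ["helm_list", "helm_upgrade"]),  # match
    (["helm_list"], ["helm_list", "helm_rollback"]),                 # mismatch
])
print(score)  # 0.5
```

A CI gate then reduces to a threshold check on this score, e.g. fail the pipeline when it drops below 1.0.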
Choose the interface that fits your workflow.
Script evaluations and integrate into CI/CD pipelines. Pipe in traces, get scores out. Built for automation.
Visually inspect traces and interactively evaluate agent behavior. Browse results, compare runs, and drill into details.
Write custom scoring logic in Python, JavaScript, or any language. Share it with the community through our evaluator registry.
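A custom evaluator can be as small as one function from spans to a score. The plug-in interface and registry API are not documented here, so the signature and the span-dict shape below are assumptions for illustration:

```python
# Hypothetical custom evaluator: the real plug-in signature and span schema
# may differ; this only illustrates the spans-in, score-out shape.
def no_failed_tool_calls(spans: list[dict]) -> float:
    """Return the fraction of tool-call spans that completed without error;
    1.0 when the trace contains no tool calls at all."""
    tool_spans = [s for s in spans if s.get("kind") == "tool_call"]
    if not tool_spans:
        return 1.0  # nothing to penalize
    ok = sum(1 for s in tool_spans if not s.get("error"))
    return ok / len(tool_spans)

print(no_failed_tool_calls([
    {"kind": "tool_call", "name": "search", "error": False},
    {"kind": "tool_call", "name": "fetch", "error": True},
]))  # 0.5
```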
Up and running in seconds.
# Install from release wheel
pip install agentevals-<version>-py3-none-any.whl

# Run an evaluation against a trace
agentevals run samples/helm.json \
  --eval-set samples/eval_set_helm.json \
  -m tool_trajectory_avg_score

# Start the web UI
agentevals serve
Open source. Trace-driven. No re-runs needed.