Integrations & Use Cases
AgentEvals can be used in several ways depending on your workflow: evaluate agents with zero code via OpenTelemetry (OTel), programmatically via the SDK, or in CI pipelines with the CLI.
For detailed, working examples covering all integration patterns, see the examples directory in the repository.
Zero-Code (Recommended)
Point any OTel-instrumented agent at the receiver. No SDK, no code changes:
# Terminal 1 — start the agentevals server
uv run agentevals serve --dev
# Terminal 2 — run your agent with OTel pointing to agentevals
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
export OTEL_RESOURCE_ATTRIBUTES="agentevals.session_name=my-agent"
python your_agent.py
Traces stream to the UI in real time. Works with LangChain, Strands, Google ADK, or any framework that emits OTel spans (both http/protobuf and http/json are supported).
Sessions are auto-created and grouped by agentevals.session_name. Set agentevals.eval_set_id to associate traces with an eval set.
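The same environment variables can be composed when launching the agent from Python, e.g. in a test harness. A minimal sketch, mirroring the shell example above; the `nightly` eval set id is a hypothetical placeholder, and the launch line is commented out as illustration:

```python
import os
import subprocess

# Resource attributes picked up by agentevals; eval_set_id is optional
# and attaches the traces to an eval set.
attrs = {
    "agentevals.session_name": "my-agent",
    "agentevals.eval_set_id": "nightly",  # hypothetical eval set id
}

# Child environment with OTel pointed at the agentevals receiver.
env = dict(
    os.environ,
    OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4318",
    OTEL_RESOURCE_ATTRIBUTES=",".join(f"{k}={v}" for k, v in attrs.items()),
)

# subprocess.run(["python", "your_agent.py"], env=env, check=True)
print(env["OTEL_RESOURCE_ATTRIBUTES"])
```

Building the attribute string from a dict keeps the comma-separated `key=value` format correct as attributes are added or removed.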
See examples/zero-code-examples/ for working examples with different frameworks.
AgentEvals SDK
For programmatic control of the session lifecycle, or to use the decorator API:
from agentevals import AgentEvals
app = AgentEvals()
with app.session(eval_set_id="my-eval"):
    agent.invoke("Roll a 20-sided die for me")
Requires pip install "agentevals[streaming]". See examples/sdk_example/ for framework-specific patterns.
CLI & CI/CD
The CLI is built for scripting and CI pipelines.
Commands
# Single trace
uv run agentevals run samples/helm.json \
  --eval-set samples/eval_set_helm.json \
  -m tool_trajectory_avg_score

# Multiple traces
uv run agentevals run samples/helm.json samples/k8s.json \
  --eval-set samples/eval_set_helm.json \
  -m tool_trajectory_avg_score

# JSON output for programmatic processing
uv run agentevals run samples/helm.json \
  --eval-set samples/eval_set_helm.json \
  --output json
# List available evaluators (builtin + community)
uv run agentevals evaluator list
# List only builtin evaluators
uv run agentevals evaluator list --source builtin
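The `--output json` mode makes it easy to gate a pipeline on the score. A minimal sketch of such a gate; it assumes only that the JSON output contains an `overall_score` field (as used in the GitHub Actions example below), and the 0.85 threshold is illustrative:

```python
import json
from pathlib import Path

THRESHOLD = 0.85  # illustrative pass bar; tune per project


def gate(results_path: str, threshold: float = THRESHOLD) -> bool:
    """Return True when overall_score clears the threshold."""
    results = json.loads(Path(results_path).read_text())
    score = results["overall_score"]
    print(f"overall_score={score} (threshold {threshold})")
    return score >= threshold
```

In CI, `sys.exit(0 if gate("results.json") else 1)` turns this into a pass/fail step.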
GitHub Actions Example
# .github/workflows/agent-eval.yml
name: Agent Evaluation

on:
  pull_request:
    paths:
      - 'agent/**'
      - 'evals/**'

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'

      - name: Install agentevals
        run: pip install agentevals-*.whl

      - name: Run agent and capture trace
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python run_agent.py --capture-trace ./traces/pr-run.json

      - name: Evaluate agent behavior
        run: |
          agentevals run ./traces/pr-run.json \
            --eval-set ./evals/golden.json \
            -m tool_trajectory_avg_score \
            --output json > results.json

      - name: Check results
        run: |
          python -c "
          import json, sys
          results = json.load(open('results.json'))
          if results['overall_score'] < 0.85:
              print(f'Score {results[\"overall_score\"]} below threshold')
              sys.exit(1)
          "