Browse and use community-contributed evaluators, or submit your own.
- Scores whether each final response contains a configured substring (case-sensitive or case-insensitive)
- Scores whether each final response exactly matches a configured expected string
- Scores whether each final response parses as JSON (with optional markdown code-fence extraction)
- Scores the similarity of each response to a reference string using normalized Levenshtein distance
- Example evaluator that returns a random score between 0 and 1
- Scores whether each final response matches a configured regular expression
- Checks that responses are non-empty, meet a minimum length, and don't just echo back the user input
- Verifies that each invocation made at least a minimum number of tool calls
- Scores whether tool calls match an expected list of tool names (order-sensitive or as a multiset)
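The Levenshtein-based evaluator above can be sketched as follows. This is a minimal illustration of the scoring idea, not the marketplace implementation: edit distance is normalized by the longer string's length so the score lands in [0, 1].

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                 # deletion
                curr[j - 1] + 1,             # insertion
                prev[j - 1] + (ca != cb),    # substitution (free if chars match)
            ))
        prev = curr
    return prev[-1]


def similarity(response: str, reference: str) -> float:
    """Normalized similarity: 1.0 for identical strings, 0.0 for maximally distant."""
    if not response and not reference:
        return 1.0
    dist = levenshtein(response, reference)
    return 1.0 - dist / max(len(response), len(reference))
```

For example, `similarity("kitten", "kitten")` returns `1.0`, while `"kitten"` vs. `"sitting"` has edit distance 3 and scores `1 - 3/7`.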
```yaml
evaluators:
  - name: response_quality
    type: remote
    source: github
    ref: evaluators/response_quality/response_quality.py
    threshold: 0.7
    executor: local
    config:
      min_response_length: 20
```

```bash
agentevals run traces/my_trace.json \
  --config eval_config.yaml \
  --eval-set eval_set.json
```
Evaluators are downloaded automatically and cached in `~/.cache/agentevals/evaluators/`.
Evaluators are standalone scoring programs that read EvalInput JSON from stdin and write EvalResult JSON to stdout. Scaffold one in seconds:

```bash
pip install agentevals-cli
agentevals evaluator init my_evaluator
```
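The stdin/stdout contract can be sketched with a toy evaluator. This is a hedged illustration: the field names used here (`response`, `score`, `reason`) are assumptions for the sketch, not the actual EvalInput/EvalResult schema, which the scaffolded template defines.

```python
import json
import sys


def score_input(eval_input: dict) -> dict:
    """Toy scoring rule: non-empty responses pass.

    The "response"/"score"/"reason" field names are illustrative
    assumptions, not the real EvalInput/EvalResult schema.
    """
    response = eval_input.get("response", "")
    score = 1.0 if response.strip() else 0.0
    return {"score": score, "reason": "non-empty check"}


if __name__ == "__main__":
    # Contract from the docs: EvalInput JSON on stdin, EvalResult JSON on stdout.
    json.dump(score_input(json.load(sys.stdin)), sys.stdout)
```

Invoked as `echo '{"response": "hello"}' | python my_evaluator.py`, this sketch would emit a result object with a score of 1.0.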