Custom Evaluators

Beyond the built-in metrics, you can write your own evaluators in Python, JavaScript, or any language. An evaluator is any program that reads JSON from stdin and writes a JSON result with a score to stdout.

For the comprehensive guide, see custom-evaluators.md in the repository.

Scaffold an Evaluator

agentevals evaluator init my_evaluator

This creates a directory with boilerplate and a manifest:

my_evaluator/
├── my_evaluator.py     # your scoring logic
└── evaluator.yaml      # metadata manifest
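The manifest describes the evaluator to the runner. Its exact schema is defined by the tool; as a rough sketch (all field names here are assumptions, check the generated file for the real ones):

```yaml
# Hypothetical evaluator.yaml contents; field names are assumptions.
name: my_evaluator
description: Scores each invocation against custom criteria.
entrypoint: my_evaluator.py
```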

You can also list supported runtimes and generate config snippets:

agentevals evaluator runtimes              # show supported languages
agentevals evaluator config my_evaluator \
  --path ./evaluators/my_evaluator.py      # generate config snippet

Implement Scoring Logic

Your function receives an EvalInput with the agent’s invocations and returns an EvalResult with a score between 0.0 and 1.0.

from agentevals_evaluator_sdk import EvalInput, EvalResult, evaluator

@evaluator
def my_evaluator(input: EvalInput) -> EvalResult:
    scores = []
    for inv in input.invocations:
        # Your scoring logic here
        score = 1.0
        scores.append(score)

    return EvalResult(
        score=sum(scores) / len(scores) if scores else 0.0,
        per_invocation_scores=scores,
    )

if __name__ == "__main__":
    my_evaluator.run()

Install the SDK standalone with pip install agentevals-evaluator-sdk (no heavy dependencies).
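To make the "your scoring logic here" step concrete, here is a hypothetical per-invocation check that rewards responses above a minimum length. It operates on plain dicts rather than the SDK's EvalInput type, and the "response" field name is an assumption for illustration:

```python
# Hypothetical scoring logic: reward invocations whose response meets a
# minimum length. Plain dicts stand in for the SDK's invocation objects;
# the "response" field name is an assumption.

def score_invocation(invocation: dict, min_length: int = 20) -> float:
    """Return 1.0 if the response is long enough, scaled down otherwise."""
    response = invocation.get("response", "")
    if len(response) >= min_length:
        return 1.0
    return len(response) / min_length


def aggregate(scores: list[float]) -> float:
    """Mean score, falling back to 0.0 when there are no invocations."""
    return sum(scores) / len(scores) if scores else 0.0
```

Inside the evaluator above, this would replace the `score = 1.0` placeholder: `scores = [score_invocation(inv) for inv in input.invocations]`.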

Reference in Eval Config

# eval_config.yaml
evaluators:
  - name: tool_trajectory_avg_score
    type: builtin

  - name: my_evaluator
    type: code
    path: ./evaluators/my_evaluator.py
    threshold: 0.7

Then run the eval with your config:

agentevals run trace.json --config eval_config.yaml --eval-set eval_set.json

Community Evaluators

Community evaluators can be referenced directly from the shared evaluators repository using type: remote:

evaluators:
  - name: response_quality
    type: remote
    source: github
    ref: evaluators/response_quality/response_quality.py
    threshold: 0.7
    config:
      min_response_length: 20

Browse available community evaluators on the Evaluators page, or contribute your own.

Supported Languages

Evaluators can be written in any language that reads JSON from stdin and writes JSON to stdout.

Language      Extension   SDK available
Python        .py         pip install agentevals-evaluator-sdk
JavaScript    .js         No SDK yet; read stdin, write stdout
TypeScript    .ts         No SDK yet; read stdin, write stdout
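For languages without an SDK, the stdin/stdout protocol can be sketched in a few lines. This minimal Python example assumes the input JSON carries an "invocations" array and the output carries "score" and "per_invocation_scores" fields, mirroring the SDK example above; the exact wire format is defined by the SDK and may differ:

```python
import json
import sys


def evaluate(payload: dict) -> dict:
    """Score every invocation 1.0 and return the mean.

    The "invocations", "score", and "per_invocation_scores" field names
    are assumptions for illustration; check the SDK's EvalInput and
    EvalResult schemas for the exact wire format.
    """
    scores = [1.0 for _ in payload.get("invocations", [])]
    return {
        "score": sum(scores) / len(scores) if scores else 0.0,
        "per_invocation_scores": scores,
    }


if __name__ == "__main__":
    # Read one input JSON document from stdin, write one result to stdout.
    json.dump(evaluate(json.load(sys.stdin)), sys.stdout)
```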

Further Reading