Open-source agent testing runner

Catch the policy regressions your unit tests miss.

One prompt edit can make an agent create a $700 refund before manager approval, skip a fraud escalation, or call tools in the wrong order. Wendell Runner turns those failures into committed tests you can run locally and use to block merges in CI.


$700 refund created before manager approval

Escalation skipped when fraud language appears

Tool call happens before required order lookup

CI passes code while agent behavior regresses

Refund Agent Regression: 1 failed, 4 passed

$ wendell test
FAIL refund_over_limit_requires_escalation
  rule      refund_limit_requires_approval
  expected  refunds.create not to be called
  actual    event[2] refunds.create amount=700


PyPI package

Install with uv or pipx. No git checkout required.

MIT CLI

Public source, license, and releases on GitHub.

Local suites

`wendell test` needs no account or hosted dashboard.

CI evidence

Exit codes plus JSON, JUnit, and GitHub summaries.

What breaks

The failures that slip through normal app tests.

Your API tests can prove a tool works. Wendell tests whether the agent followed the business workflow before it touched the tool.

A prompt tweak changes the order of operations.

The refund tool fires before the agent gets manager approval.

Assert `refunds.create` is not called and `escalations.create` appears first.

A policy branch disappears during a refactor.

Fraud language no longer triggers Risk Operations escalation.

Commit the risky scenario once and run it on every pull request.

The app test suite passes, but the workflow is broken.

Unit tests prove the tool works; they do not prove the agent used it correctly.

Return a red/green CI result with the violated rule and exact tool trace.
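The checks in the first card can be pictured as plain operations on an ordered tool trace. This is a rough sketch, not Wendell's implementation: the helper functions, their semantics when a tool is absent, and the trace shape (a list of `name`/`args` dicts, borrowed from the adapter output shown further down this page) are assumptions for illustration.

```python
from typing import Any

ToolCall = dict[str, Any]


def tool_not_called(trace: list[ToolCall], name: str) -> bool:
    """True when no call in the trace uses the given tool name."""
    return all(call["name"] != name for call in trace)


def tool_called_before(trace: list[ToolCall], first: str, second: str) -> bool:
    """True when `first` is called and its first call precedes any call to `second`.

    Assumption: if `second` never appears, the ordering requirement is satisfied.
    """
    names = [call["name"] for call in trace]
    if first not in names:
        return False
    if second not in names:
        return True
    return names.index(first) < names.index(second)


# Hypothetical trace from a regressed agent: the refund fires with no escalation.
bad_trace = [
    {"name": "orders.lookup", "args": {}},
    {"name": "refunds.create", "args": {"amount": 700}},
]

# Hypothetical trace from a compliant agent.
good_trace = [
    {"name": "orders.lookup", "args": {}},
    {"name": "escalations.create", "args": {}},
]

print(tool_not_called(bad_trace, "refunds.create"))  # False
print(tool_called_before(bad_trace, "escalations.create", "refunds.create"))  # False
print(tool_not_called(good_trace, "refunds.create"))  # True
print(tool_called_before(good_trace, "escalations.create", "refunds.create"))  # True
```

Both checks are deterministic: the same trace always gives the same verdict, which is what makes the result safe to gate a pull request on.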

How it works

Turn every escaped workflow bug into a regression test.

Keep playbooks, prompts, credentials, and your real agent runtime in your repo. Wendell reads committed scenarios and produces CI-native evidence, and you do not need a hosted dashboard to get started.

Commit a suite

Readable JSON scenarios live beside your agent code, prompts, policies, and fixtures.

Point at your agent

`agent_command` runs Python, TypeScript, Node, or any process that speaks stdin/stdout JSON.

Run in CI

The runner returns real exit codes and writes JSON, JUnit, or GitHub summaries.

Ship with evidence

Failures show the violated rule, tool calls, missing events, and the exact scenario.

Use it locally

The shareable loop is one config file and one command.

Start with deterministic assertions. Add generation later. The first win should be a failing refund, escalation, or policy scenario you can reproduce on a pull request.

Install

uv tool install --force wendell

Run

wendell test

Write your first assertion

wendell.toml

Local runner config

project = "refund-agent"
mode = "blocking"
suite = "tests/wendell/refunds.json"
agent_command = "node agents/refund-agent.mjs"
upload_traces = false
reporters = ["default", "json", "junit", "github"]

[output_file]
json = "wendell-results/results.json"
junit = "wendell-results/junit.xml"

[gates]
suite_min_score = 1.0
scenario_min_score = 1.0
critical_failures_allowed = 0
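One way to read the `[gates]` table is as three thresholds that must all hold before the run exits cleanly. The sketch below assumes semantics this page does not spell out: scores are fractions between 0.0 and 1.0, the suite score is the mean of scenario scores, and a failed gate maps to exit code 1. `ScenarioResult` and `gates_pass` are hypothetical names, not part of the Wendell CLI.

```python
from dataclasses import dataclass


@dataclass
class ScenarioResult:
    name: str
    score: float  # assumed: fraction of assertions that passed, 0.0 to 1.0
    critical_failures: int = 0  # assumed: failed assertions marked critical


def gates_pass(
    results: list[ScenarioResult],
    suite_min_score: float = 1.0,
    scenario_min_score: float = 1.0,
    critical_failures_allowed: int = 0,
) -> bool:
    """Return True only when every gate threshold is met."""
    if not results:
        return False  # assumption: an empty suite should not pass silently
    suite_score = sum(r.score for r in results) / len(results)
    every_scenario_ok = all(r.score >= scenario_min_score for r in results)
    total_critical = sum(r.critical_failures for r in results)
    return (
        suite_score >= suite_min_score
        and every_scenario_ok
        and total_critical <= critical_failures_allowed
    )


# Four passing scenarios and one critical failure, as in "1 failed, 4 passed".
results = [ScenarioResult(f"scenario_{i}", 1.0) for i in range(4)]
results.append(
    ScenarioResult("refund_over_limit_requires_escalation", 0.0, critical_failures=1)
)

exit_code = 0 if gates_pass(results) else 1
print(exit_code)  # 1
```

With every threshold at its strictest value, as in the config above, a single failing scenario is enough to turn the build red.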

tool_called

tool_not_called

tool_called_before

message_contains

message_not_contains

json_path_equals
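For intuition, the message and JSON-path assertion types can be sketched as small predicates over an adapter result. This is an illustration under stated assumptions, not the runner's code: matching is case-sensitive here, and the path syntax (dot-separated keys with integer list indexes) is a hypothetical stand-in for whatever syntax Wendell actually accepts.

```python
from typing import Any


def message_contains(result: dict[str, Any], text: str) -> bool:
    """True when the agent's final message includes the given text."""
    return text in result["message"]


def message_not_contains(result: dict[str, Any], text: str) -> bool:
    """True when the agent's final message does not include the given text."""
    return not message_contains(result, text)


def resolve_path(document: Any, path: str) -> Any:
    """Walk a dotted path such as "tool_calls.1.name" (assumed syntax)."""
    current = document
    for part in path.split("."):
        if isinstance(current, list):
            current = current[int(part)]  # list segments are integer indexes
        else:
            current = current[part]
    return current


def json_path_equals(result: dict[str, Any], path: str, expected: Any) -> bool:
    """True when the value at `path` exists and equals `expected`."""
    try:
        return resolve_path(result, path) == expected
    except (KeyError, IndexError, ValueError, TypeError):
        return False  # a missing or malformed path counts as a failed assertion


# Same shape as the adapter output shown on this page.
result = {
    "message": "Escalated for manager approval.",
    "tool_calls": [
        {"name": "orders.lookup", "args": {}},
        {"name": "escalations.create", "args": {}},
    ],
}

print(message_contains(result, "manager approval"))  # True
print(message_not_contains(result, "Refund issued"))  # True
print(json_path_equals(result, "tool_calls.1.name", "escalations.create"))  # True
print(json_path_equals(result, "tool_calls.5.name", "refunds.create"))  # False
```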

Python and TypeScript

Any agent stack can speak the adapter contract.

The runner invokes any process that reads JSON from stdin and writes a result to stdout. Python, TypeScript, Node, and other runtimes all plug into the same `agent_command` field.

Python adapter

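import json
import sys

# The runner sends its JSON payload on stdin; a real agent would act on it.
payload = json.loads(sys.stdin.read())

# Reply on stdout with the final message and the ordered tool calls.
print(json.dumps({
  "message": "Escalated for manager approval.",
  "tool_calls": [
    {"name": "orders.lookup", "args": {}},
    {"name": "escalations.create", "args": {}}
  ]
}))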

TypeScript adapter

import { readFileSync } from "node:fs";

type ToolCall = { name: string; args: Record<string, unknown> };
type AgentResult = { message: string; tool_calls: ToolCall[] };

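// Read the runner's JSON payload from stdin (file descriptor 0).
// This stub only checks that it parses; a real agent would act on it.
JSON.parse(readFileSync(0, "utf8"));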

const result: AgentResult = {
  message: "Escalated for manager approval.",
  tool_calls: [
    { name: "orders.lookup", args: {} },
    { name: "escalations.create", args: {} }
  ]
};

process.stdout.write(JSON.stringify(result));
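Because the contract is just JSON in and JSON out, an adapter can be smoke-tested without the runner by piping a payload into it. The sketch below does that with Python's standard `subprocess` module. The `run_adapter` helper and the payload fields are hypothetical, since this page does not document what the runner actually sends, and the stand-in adapter is inlined only to keep the example self-contained.

```python
import json
import subprocess
import sys

# Stand-in adapter, inlined so the example runs on its own. In a real project
# the command would be whatever `agent_command` points at.
ADAPTER_SOURCE = """
import json, sys
payload = json.loads(sys.stdin.read())
print(json.dumps({
    "message": "Escalated for manager approval.",
    "tool_calls": [
        {"name": "orders.lookup", "args": {}},
        {"name": "escalations.create", "args": {}},
    ],
}))
"""


def run_adapter(command: list[str], payload: dict) -> dict:
    """Send `payload` as JSON on stdin and parse the JSON the process prints."""
    completed = subprocess.run(
        command,
        input=json.dumps(payload),
        capture_output=True,
        text=True,
        check=True,  # raise if the adapter exits with a non-zero status
        timeout=30,
    )
    return json.loads(completed.stdout)


# Hypothetical payload; the real runner's payload shape may differ.
result = run_adapter(
    [sys.executable, "-c", ADAPTER_SOURCE],
    {"input": "I want a $700 refund"},
)
print([call["name"] for call in result["tool_calls"]])
# ['orders.lookup', 'escalations.create']
```

Swapping the command for `["node", "agents/refund-agent.mjs"]` would exercise the adapter named in the config above the same way.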

CI-ready reports

Designed for pull requests.

Reporters follow the pattern developers already expect from test runners: terminal output for humans, structured files for automation, and summaries for GitHub Actions.

default

Readable red/green terminal output

json

Structured result payloads for custom automation

junit

CI-native test reports and annotations

github

Job summaries on pull request runs through `GITHUB_STEP_SUMMARY`
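For context on the `github` reporter: GitHub Actions exposes `GITHUB_STEP_SUMMARY` as an environment variable holding a file path, and any Markdown appended to that file shows up on the job's summary page. The sketch below shows that mechanism in isolation. The `render_summary` layout and both function names are hypothetical, not the format Wendell emits.

```python
import os


def render_summary(
    suite: str, passed: int, failed: int, failures: list[tuple[str, str]]
) -> list[str]:
    """Build Markdown lines: a heading, the counts, and a table of failures."""
    lines = [f"### {suite}", "", f"{failed} failed, {passed} passed", ""]
    if failures:
        lines += ["| Scenario | Violated rule |", "| --- | --- |"]
        lines += [f"| `{scenario}` | `{rule}` |" for scenario, rule in failures]
    return lines


def append_step_summary(lines: list[str]) -> bool:
    """Append lines to the job summary file; return False outside GitHub Actions."""
    summary_path = os.environ.get("GITHUB_STEP_SUMMARY")
    if not summary_path:
        return False
    with open(summary_path, "a", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
    return True


summary = render_summary(
    "Refund Agent Regression",
    passed=4,
    failed=1,
    failures=[
        ("refund_over_limit_requires_escalation", "refund_limit_requires_approval")
    ],
)
print("\n".join(summary))
append_step_summary(summary)  # does nothing when the variable is unset
```

Appending rather than overwriting matters here, since other steps in the same job may write to the summary file too.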

Share the runner.

Open-source regression testing for agents: local-first, CI-ready, and friendly to any adapter that can speak stdin/stdout JSON.
