GitHub - edonadei/caliper: Local-first eval harness for Claude Code, Codex and Pi skills

Source: https://github.com/edonadei/caliper

Caliper — Reliability testing for agent skills


Run your skill _k_ times, get a pass@k score you can track and compare, and prove the skill beats the base agent. Works with the agent you already use: Claude Code, Codex, or Pi.

    npx skills@latest add edonadei/caliper

Run each task with and without the skill, and Caliper shows you the difference:

```
ID      Task                              k (3)  pass@k
task-1  Writes a conventional commit msg  3/3    100%  PASS
task-2  Generates a valid config file     2/3     96%  PASS

With skill   98%  ###################-
No skill     55%  ###########---------
Delta       +43%  up
```

* * *

Agent skills are hard to test. A skill that works on your machine, on this prompt, today, might fail tomorrow after a model update or a one-line prompt edit. Caliper makes reliability measurable: define what success looks like, run the skill repeatedly, and get a pass@k score you can track over time.

Use Caliper to answer questions like:

- Did my prompt edit actually improve the skill?

- Is the skill doing the work, or would the base agent pass without it?

- Does it still pass the workflows it passed last week?

- Which agent — Claude Code, Codex, or Pi — runs this skill more reliably?

* * *

Quick start

Path A: Agentic (let your agent drive)

**1. Install the skills**

    npx skills@latest add edonadei/caliper

**2. Generate a spec interactively**

In your agent (Claude Code or Codex):

``` /grill-skill ./my-skill/SKILL.md ```

`grill-skill` reads your `SKILL.md`, interviews you, and writes a 3-task `.eval.yaml` (happy path, edge case, adversarial).

**3. Run and measure**

``` /evaluate-skill run my-skill.eval.yaml --k 3 --baseline ```

Browse past runs:

```
/evaluate-skill list
/evaluate-skill report my-skill
```

Path B: CLI (run it yourself)

**1. Install the CLI**

    pipx install caliper-eval  # requires Python 3.10+

**2. Write a spec**

    # my-skill.eval.yaml
    skill:
      path: ./SKILL.md
      backend: claude-code

    judge:
      backend: claude-code

    tasks:
      # Autorater: the LLM judge reads the transcript and decides
      - name: Writes a conventional commit message
        prompt: "Summarize the staged git diff as a commit message."
        expect: >
          The response is a conventional-commit message: a concise subject
          line under 72 characters, followed by a body explaining why the
          change was made, not just what changed.

      # Script execution: a deterministic Python assertion
      - name: Generates a valid config file
        cleanup: rm -f /tmp/app.config.json
        prompt: "Generate a config at /tmp/app.config.json with a 'port' of 8080."
        assert: |
          import json
          from pathlib import Path
          data = json.loads(Path("/tmp/app.config.json").read_text())
          assert data["port"] == 8080

`expect:` is graded by the judge LLM; `assert:` runs locally as Python. Use either or both.

**3. Run it**

    caliper run my-skill.eval.yaml --k 3 --baseline

**4. Read the output**

```
CALIPER - my-skill - k=3 - claude-code

ID      Task                              k (3)  pass@k
task-1  Writes a conventional commit msg  3/3    100%  PASS
task-2  Generates a valid config file     2/3     96%  PASS

With skill   98%  ###################-
No skill     55%  ###########---------
Delta       +43%  up

Results saved to .caliper/results/my-skill/2026-06-19T14-23-01Z.json
```

Not sure what to put in a spec?


The **Eval Starter Pack** has four copy-paste templates, each catching a real agent failure (false success, tool misuse, runaway loops, prompt regressions). Every template runs green as-is against a bundled example, then points at your own skill by editing two or three commented lines.

* * *

How it works

```
.eval.yaml spec
      │
      ▼
   Harness ──── runs your skill against the agent (Claude Code / Codex / Pi)
      │
      ▼
    Judge  ──── LLM autorater and/or deterministic Python assertions
      │
      ▼
pass@k score + saved transcript
```

Each attempt runs in an isolated temporary home with no session history. Results are saved as JSON you can inspect and diff later.
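The isolation described above can be approximated with the standard library. This is only a sketch, not Caliper's implementation: `run_isolated` is a hypothetical helper, and it assumes that "isolated temporary home" means pointing `HOME` at a fresh directory for each attempt (POSIX-style environments).

```python
import os
import subprocess
import sys
import tempfile


def run_isolated(argv: list[str], timeout: int = 120) -> tuple[str, str]:
    """Run argv with HOME pointing at a fresh temporary directory.

    Hypothetical helper: returns (temporary home path, captured stdout).
    The directory is deleted as soon as the attempt finishes.
    """
    with tempfile.TemporaryDirectory(prefix="attempt-") as home:
        env = {**os.environ, "HOME": home}  # everything else is inherited
        done = subprocess.run(
            argv, env=env, capture_output=True, text=True, timeout=timeout
        )
        return home, done.stdout


if __name__ == "__main__":
    # Each call sees a different, empty home directory.
    probe = [sys.executable, "-c", "import os; print(os.environ['HOME'])"]
    for _ in range(2):
        home, out = run_isolated(probe)
        print(out.strip() == home)
```

Because the directory is created per call and removed afterwards, nothing written by one attempt can leak into the next.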

* * *

Agent skills

The repo ships two agent skills. Install both with:

    npx skills@latest add edonadei/caliper

`evaluate-skill` — run and manage evals

Create, validate, run, and summarize evals from inside your normal workflow, with no separate terminal needed. The skill installs Caliper automatically if it's missing.

Or, if you already have Caliper installed and want to wire up the skill manually:

    caliper install-skill claude-code
    caliper install-skill codex

Preview without writing files:

    caliper install-skill claude-code --dry-run

Then use it in Claude Code:

```
/evaluate-skill run my-skill.eval.yaml --k 3
/evaluate-skill validate my-skill.eval.yaml
```

Or in Codex:

``` Use the evaluate-skill skill to run my-skill.eval.yaml with k=3 and summarize the result. ```

`grill-skill` — create evals interactively


Don't have evals yet? `grill-skill` guides you through creating them. It reads your `SKILL.md`, interviews you about what good behavior looks like, and generates a 3-task spec (happy path, edge case, adversarial). Then it runs the eval in a loop: k=1 to validate the spec, k=3 to measure reliability, and a baseline run before you commit.

``` /grill-skill ./my-skill/SKILL.md ```

No path needed if you're already in the skill's directory:

``` /grill-skill ```

If an `.eval.yaml` already exists next to your skill, `grill-skill` reads the existing tasks and interviews you about gaps instead of starting from scratch.

* * *

Core concepts

| Term | What it is |
| --- | --- |
| **Spec** | A `.eval.yaml` file that describes the skill, judge, and tasks to run |
| **Backend** | The agent that executes the skill (`claude-code`, `codex`, `pi`, `claude-api`, `openai-api`) |
| **Judge** | What decides pass/fail: an LLM reading the transcript (`expect:`), Python assertions (`assert:`), or both |
| **pass@k** | Reliability score: run k times, measure how often the skill succeeds |
| **Baseline** | Re-run the same tasks without the skill to prove the skill is doing the work |
| **Attempt** | One isolated run of a single task: fresh temporary home, no session history |

* * *

Choosing a backend

| Backend | Requires | Best for |
| --- | --- | --- |
| `claude-code` | Claude Code CLI installed and authenticated | Testing Claude Code slash-command skills |
| `codex` | Codex CLI installed (`npm install -g @openai/codex`) | Testing Codex skills |
| `pi` | pi CLI installed (`npm install -g @earendil-works/pi-coding-agent`) and authenticated | Testing pi skills (agentskills.io) |
| `claude-api` | `ANTHROPIC_API_KEY` env var | API-backed agents, no CLI needed |
| `openai-api` | `OPENAI_API_KEY` env var | OpenAI API agents |

The agent backend and judge backend are independent — you can test a Codex skill with a Claude judge, or any other combination.

Claude Code setup

Install and authenticate the `claude` CLI. `backend: claude-code` uses your existing Claude Code auth, so no extra configuration is needed.

For `backend: claude-api`:

    export ANTHROPIC_API_KEY=...

Codex setup

    npm install -g @openai/codex
    codex login

`backend: codex` calls `codex exec`. It does not fall back to the OpenAI API. If the Codex desktop app is installed, Caliper prefers the app-bundled binary over `codex` on `PATH`. Set `CODEX_CLI_PATH` to force a specific binary.
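The lookup order in the paragraph above could be sketched as follows. This is an illustration only: `resolve_codex` is a hypothetical name, and `app_binary` is a parameter because the location of the app-bundled binary is not stated here.

```python
import os
import shutil
from pathlib import Path
from typing import Optional


def resolve_codex(app_binary: Optional[Path] = None) -> Optional[str]:
    """Pick a Codex binary: explicit override, then app bundle, then PATH."""
    forced = os.environ.get("CODEX_CLI_PATH")
    if forced:
        return forced  # an explicit override always wins
    if app_binary is not None and app_binary.is_file():
        return str(app_binary)  # app-bundled binary, if the caller knows its path
    return shutil.which("codex")  # None when codex is not on PATH
```

Returning `None` rather than raising leaves it to the caller to decide how to report a missing CLI.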

For `backend: openai-api`:

    export OPENAI_API_KEY=...

pi setup

    npm install -g @earendil-works/pi-coding-agent
    pi  # then authenticate (e.g. /login for a subscription provider, or set the provider API key)

`backend: pi` runs `pi --print --mode json` and loads the skill natively via pi's `--skill` flag (the agentskills.io standard). It reuses your `~/.pi/agent` auth and settings — the spec's `model:` overrides pi's configured default when set. Set `PI_CLI_PATH` to force a specific binary. Note: pi's built-in default provider is `google`, so a spec with no `model:` relies on your pi config to resolve a provider you are authenticated for.

Check installed CLI versions:

    caliper update-cli --check

* * *

Recommended workflow

1. Create a spec for one behavior you care about.
2. Run with `--k 1` while iterating on the spec.
3. Add `assert:` for facts an LLM judge might guess wrong (files, JSON, command output).
4. Move to `--k 3` or higher once the task is stable.
5. Add `--baseline` to prove the skill is making a difference.
6. Commit the spec alongside the skill so contributors can run the same eval.

``` /evaluate-skill run my-skill.eval.yaml --k 3 --baseline --verbose ```

* * *

Spec format

    skill:
      path: ./SKILL.md            # path to the skill file (optional for baseline-only runs)
      backend: claude-code        # claude-code | codex | pi | claude-api | openai-api
      model: <model-name>         # optional model override

    judge:
      backend: claude-code        # claude-code | codex | pi | claude-api | openai-api
      model: <model-name>         # optional model override

    sandbox:
      extra_path:
        - ./bin                   # prepended to PATH inside each attempt
      forbidden_files:
        - ".*\\.eval\\.yaml$"     # prevents agent from reading the spec
        - "./.caliper/.*"         # prevents agent from reading saved results

    tasks:
      - name: Short task name
        setup: <shell command>    # optional, runs before each attempt
        cleanup: <shell command>  # optional, always runs after each attempt
        prompt: <prompt sent to the agent>
        expect: <natural-language success condition>
        assert: |
          # optional inline Python assertion
          assert True

      - name: Task with external assertion script
        prompt: "Generate a report"
        assert: ./assertions/check_report.py

Each task needs at least one of `expect` or `assert`. Task IDs are assigned automatically as `task-001`, `task-002`, and so on.
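The two rules in the previous paragraph are easy to picture in code. The sketch below is an assumption about how such a check could look, not Caliper's validator: `assign_ids` is a hypothetical function that works on tasks already loaded into plain dictionaries.

```python
def assign_ids(tasks: list[dict]) -> list[dict]:
    """Validate tasks and attach sequential IDs (task-001, task-002, ...)."""
    checked = []
    for index, task in enumerate(tasks, start=1):
        # A task with neither key has nothing to be judged against.
        if not (task.get("expect") or task.get("assert")):
            raise ValueError(f"task {task.get('name')!r} needs 'expect' or 'assert'")
        checked.append({"id": f"task-{index:03d}", **task})
    return checked


tasks = [
    {"name": "Short task name", "prompt": "...", "expect": "..."},
    {
        "name": "Task with external assertion script",
        "prompt": "Generate a report",
        "assert": "./assertions/check_report.py",
    },
]
for task in assign_ids(tasks):
    print(task["id"], task["name"])
```

A task that defines only `prompt` raises `ValueError` before anything runs, which is the kind of failure `caliper validate` is meant to surface early.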

* * *

Judging

LLM autorater (`expect:`)


The judge backend reads the full attempt transcript and decides whether the `expect` condition was met. When the backend captures tool-call traces (Claude Code, Codex), those traces are included, so the judge can verify things like "the agent used tool X" without relying on the final text alone.

    judge:
      backend: claude-code

Deterministic assertions (`assert:`)

Python assertions run locally. Use these for facts the LLM judge might guess wrong:

- file exists / exact file contents

- JSON / schema validity

- command output

- images or screenshots

- repository state

For example:

    tasks:
      - name: Writes an output file
        cleanup: rm -f /tmp/out.txt
        prompt: "Write hello world to /tmp/out.txt"
        assert: |
          from pathlib import Path
          path = Path("/tmp/out.txt")
          assert path.exists(), "Output file was not created"
          assert path.read_text().strip() == "hello world"

When both `expect` and `assert` are present, both must pass.
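For the "command output" case in the list above, an assertion can run a command and compare what comes back. The snippet below is plain Python that could be adapted as an `assert:` body. The command is a stand-in chosen so the example runs anywhere, and the working directory Caliper uses for assertions is not something this sketch assumes.

```python
import subprocess
import sys

# Stand-in command: replace with whatever the agent was asked to make work.
cmd = [sys.executable, "-c", "print('hello world')"]

done = subprocess.run(cmd, capture_output=True, text=True, timeout=30)

# Failure messages carry the evidence, so a failed attempt is easy to diagnose.
assert done.returncode == 0, f"command failed: {done.stderr.strip()}"
assert done.stdout.strip() == "hello world", f"unexpected output: {done.stdout!r}"
```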

* * *

CLI reference

| Command | Description |
| --- | --- |
| `caliper run <spec>` | Run an evaluation spec |
| `caliper new [name]` | Create a new spec with the interactive wizard |
| `caliper validate <spec>` | Validate a spec file |
| `caliper list [spec]` | List specs and saved runs |
| `caliper report <spec-or-result>` | Re-render saved results |
| `caliper install-skill <backend>` | Install the bundled evaluate-skill into Claude Code or Codex |
| `caliper update-cli [backend]` | Check or update installed agent CLI versions |

`caliper run` flags

| Flag | Default | Description |
| --- | --- | --- |
| `--k INT` | `3` | Attempts per task |
| `--baseline` | off | Also run each task without the skill |
| `--workers INT` | `4` | Parallel task workers |
| `--timeout INT` | `120` | Seconds per attempt |
| `--model TARGET` | none | Override skill backend and/or model (see below) |
| `--judge-model TARGET` | none | Override judge backend and/or model (see below) |
| `--verbose` | off | Show per-attempt judge reasoning |
| `--output PATH` | none | Also save results JSON to a specific path |

#### `--model` and `--judge-model` syntax

Both flags accept a `backend:model` compound value, a bare backend name, or a bare model name:

    # Override backend and model together
    caliper run my-skill.eval.yaml --model claude-api:claude-sonnet-4-6

    # Override backend only (model stays unset / from spec)
    caliper run my-skill.eval.yaml --model codex

    # Override model only (backend stays from spec)
    caliper run my-skill.eval.yaml --model claude-sonnet-4-6

    # Override judge independently
    caliper run my-skill.eval.yaml --model claude-api:claude-sonnet-4-6 --judge-model claude-api:claude-haiku-4-5-20251001

Accepted backends: `claude-code`, `codex`, `pi`, `claude-api`, `openai-api` (aliases: `claude`, `anthropic`, `openai`).

The spec file is never modified — overrides apply only to the current run.
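The three accepted forms can be told apart with a few lines of parsing. This is a sketch under assumptions rather than Caliper's own parser: `Target` and `parse_target` are hypothetical names, and the aliases are left out because the README does not say which backend each alias maps to.

```python
from typing import NamedTuple, Optional

BACKENDS = {"claude-code", "codex", "pi", "claude-api", "openai-api"}


class Target(NamedTuple):
    backend: Optional[str]  # None means "keep the spec's backend"
    model: Optional[str]    # None means "keep the spec's model"


def parse_target(value: str) -> Target:
    """Split a --model / --judge-model value into (backend, model)."""
    if ":" in value:
        backend, model = value.split(":", 1)
        if backend not in BACKENDS:
            raise ValueError(f"unknown backend: {backend!r}")
        return Target(backend, model)
    if value in BACKENDS:
        return Target(value, None)  # bare backend name
    return Target(None, value)      # anything else is treated as a model name


for raw in ("claude-api:claude-sonnet-4-6", "codex", "claude-sonnet-4-6"):
    print(raw, "->", parse_target(raw))
```

Treating any unrecognized bare value as a model name keeps the parser independent of the list of valid models, which changes far more often than the list of backends.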

* * *

Scoring

For each task:

``` pass@k = 1 - (1 - successes / k) ^ k ```

The aggregate score is the average task pass@k. With `--baseline`, Caliper runs the same tasks without the skill and reports the delta.
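The formula is small enough to check by hand. The sketch below reproduces the numbers from the sample output shown earlier; `pass_at_k` and `aggregate` are illustrative names, and the 55% baseline is copied from that sample rather than computed.

```python
def pass_at_k(successes: int, k: int) -> float:
    """pass@k = 1 - (1 - successes / k) ** k, as defined above."""
    if k <= 0 or not 0 <= successes <= k:
        raise ValueError("need k > 0 and 0 <= successes <= k")
    return 1 - (1 - successes / k) ** k


def aggregate(scores: list[float]) -> float:
    """Aggregate score: the mean of the per-task pass@k values."""
    return sum(scores) / len(scores)


with_skill = [pass_at_k(3, 3), pass_at_k(2, 3)]  # task-1: 3/3, task-2: 2/3
baseline = 0.55                                  # taken from the sample output

print(f"{with_skill[1]:.0%}")                       # 96%
print(f"{aggregate(with_skill):.0%}")               # 98%
print(f"{aggregate(with_skill) - baseline:+.0%}")   # +43%
```

Two successes out of three score 96% rather than 67% because the formula estimates the chance that at least one of k attempts succeeds, given the observed per-attempt success rate.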

* * *

Project layout

```
caliper/
  commands/          CLI command implementations
  harness/           Agent execution backends (Claude Code, Codex, API)
  judge/             LLM and script judging implementations
  schema/            Eval spec and result models
  runner.py          Evaluation orchestration
skills/
  evaluate-skill/    Agent skill for running Caliper from Claude Code or Codex
  grill-skill/       Agent skill for creating and iterating on evals interactively
tests/               Pytest coverage for harnesses, judges, and runner behavior
```

* * *

Contributing

Good first areas:

- add example evals for real skills

- improve backend error messages

- add deterministic assertion helpers

- expand tests for harness and judge behavior

- improve result reporting and summaries

- document common setup problems for Claude Code and Codex

Before opening a pull request:

    pip install -e ".[dev,openai]"
    pytest
    ruff check .
    caliper validate skills/evaluate-skill/evaluate-skill.eval.yaml

When changing behavior, include a test or an eval fixture that demonstrates the expected outcome. Keep backend-specific logic isolated to the relevant module under `caliper/harness/` or `caliper/judge/`.

* * *

Troubleshooting

**`codex judge failed: model ... is not supported`**

The model name is not available to your Codex account. Use a model that `codex exec --model <name>` accepts.

**`codex CLI not found`**

    npm install -g @openai/codex

**`claude` command not found**

Install and authenticate Claude Code, or switch the backend to `codex`, `pi`, `claude-api`, or `openai-api`.

**A task passes only because of `assert:`**

When a task has only `assert:`, no LLM judge runs. Add `expect:` if you also want an LLM to evaluate the transcript.