GitHub - patrickiel/pantomime: Natural-language end-to-end GUI testing for Windows. Write test cases in plain English; a perception layer sees the screen and a

Source: https://github.com/patrickiel/pantomime

![Image 1: Pantomime](https://github.com/patrickiel/pantomime/blob/main/design/icon.svg)

_Natural-language end-to-end testing for anything on screen._

![Image 2: Python 3.12](https://camo.githubusercontent.com/b566d396d2e22b5a6a8b05f1168f96e80fb8e42d45d17cd178a039e81a2b047a/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f707974686f6e2d332e31322d626c75652e737667)![Image 3: Platform: Windows](https://camo.githubusercontent.com/e61e443ae29ec2ea8c44468f1a37136f30082accadf2537719f78fe632985795/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f706c6174666f726d2d57696e646f77732d3030373844362e737667)![Image 4: License: MIT](https://github.com/patrickiel/pantomime/blob/main/LICENSE)![Image 5: PRs welcome](https://github.com/patrickiel/pantomime/blob/main/CONTRIBUTING.md)

Pantomime is an end-to-end GUI testing framework where test cases are written in plain **natural language**. You describe what a user would do ("Type the username, click Sign in, confirm the welcome screen"), and Pantomime carries it out on the real screen.

screencast.mp4Video 3

Under the hood, local perception "sees" the screen (Windows UI Automation, plus a local OpenCV and OCR grounder for screens UI Automation cannot read), a reasoning model plans and judges each step, and one content-agnostic driver acts out the clicks and keystrokes on a screen region without knowing or caring what is inside it.

This README has two parts:

- **Writing and running tests** is for anyone who authors tests. Start here.

- **How it works** and the sections after it explain the internals and how to develop Pantomime itself.

- * *

Writing and running tests

[](https://github.com/patrickiel/pantomime#writing-and-running-tests) This section is the complete guide for an end user who writes tests. It takes you from an empty project to a passing run you can step through in a browser.

1. Requirements

[](https://github.com/patrickiel/pantomime#1-requirements)

- **Windows.** Pantomime drives the desktop through Windows UI Automation, win32 input, and Windows OCR. It is Windows-only.

- **Python 3.12**, managed by uv. Install uv with `powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"` (uv fetches Python 3.12 for you).

- **An API key for a reasoning model**, used to plan steps and judge natural-language checks. One exception: `--dry-run` plans without a key.

2. Install the `panto` command

[](https://github.com/patrickiel/pantomime#2-install-the-panto-command) Pantomime is a framework you install into your own project; your tests are YAML files that live in your repo next to your app. It is not published to PyPI yet, so first clone this repository:

git clone https://github.com/patrickiel/pantomime.git

Then install the `panto` command from that local checkout (point the path at wherever you cloned it):

uv tool install --editable "C:\path\to\pantomime"

If the shell cannot find `panto` afterward, run `uv tool update-shell` and open a new terminal. For a one-off run without installing, use `uvx --from "C:\path\to\pantomime" panto ...`.

> Working **inside this repository** instead? Run `uv sync` once, then prefix every command with `uv run` (for example `uv run panto run tests`). The rest of this guide writes `panto` for brevity.

3. Scaffold a project

[](https://github.com/patrickiel/pantomime#3-scaffold-a-project) From the root of your project:

panto init # scaffolds into ./tests (pass a different dir to override)

`init` is append-only and never overwrites existing files (pass `--force` to replace them). It creates these things:

Pantomime's own files are grouped under a `config/` folder so the project root stays clean:

| Path | Purpose | | --- | --- | | `tests/example.yaml` | A valid starter test that exercises every field. Edit it for your app. | | `config/pantomime.yaml` | Your project config: name, models, grounding, the `api_key`, and the `vars`/`secrets` tests use. **Gitignored** (holds the key). | | `config/pantomime.example.yaml` | Committed template with placeholders. A fresh clone copies it to `config/pantomime.yaml`. | | `.gitignore` | Updated (append-only) to ignore `runs/` and `config/pantomime.yaml`. |

The schemas are not scaffolded into your project. They are generated from Pantomime's models and hosted in this repo's `schemas/v1/` folder, so every project on the same version references the same files.

**Editor support.** The generated test and config start with a modeline pointing at the hosted schema. It is the same URL for every YAML file, regardless of folder depth:

yaml-language-server: $schema=https://raw.githubusercontent.com/patrickiel/pantomime/main/schemas/v1/testcase.schema.json

With that line and the VS Code YAML extension, you get live validation, autocomplete, and hover docs while editing. The same schemas are enforced when a test or the config is loaded, so a misspelled field fails fast. Regenerate the hosted copies any time with `panto schema -o schemas/v1/testcase.schema.json` and `panto schema --config -o schemas/v1/pantomime.schema.json`.

Everything a project needs lives in **`config/pantomime.yaml`**: a name, which models to use, grounding, pricing, the reasoning `api_key`, and the `vars`/`secrets` your tests reference. Because it holds the key and secrets it is **gitignored**; the committed `config/pantomime.example.yaml` is the shared template a fresh clone copies. `panto init` asks for a project name and seeds `config/pantomime.yaml` for you — open it and fill in your values:

config/pantomime.yaml (gitignored)

name: my-app # shown in the web debugger header api_key: sk-... provider: deepseek # required: reasoning backend (deepseek | anthropic | openai) planner_model: deepseek-v4-pro judge_model: deepseek-v4-flash effort: medium # required: low | medium | high | xhigh | max grounding_enabled: true # required: local OpenCV/OCR grounder (false = UIA-only) pricing: # required: price every model the planner/judge use deepseek-v4-pro: { input_per_mtok: 0.435, output_per_mtok: 0.87 } deepseek-v4-flash: { input_per_mtok: 0.07, output_per_mtok: 0.14 } vars: USER: demo_user # referenced in tests as ${VAR:USER} secrets: DEMO_PASSWORD: hunter2 # referenced in tests as ${SECRET:DEMO_PASSWORD}

Set your reasoning provider's `api_key` (the provider is swappable; see Configuration). Pantomime loads `config/pantomime.yaml` automatically on every command, walking up from the working directory; built-in defaults fill in anything you omit, except `provider`, `planner_model`, `judge_model`, `effort`, `grounding_enabled`, and `pricing`, which have no default: the YAML alone chooses the backend, which model runs, how hard it thinks, whether the local grounder is on, and what it costs, so a run refuses to start until they are set.

4. Write a test

[](https://github.com/patrickiel/pantomime#4-write-a-test) A test is a single YAML file. The smallest valid test needs only an `id`, a `title`, and either a list of `steps` or a free-text `flow`:

test: id: open-app steps:

- action: "Double-click the 'My App' icon on the desktop."

There is intentionally no field for the application or platform. Launching the app is just another step. Here is a fuller example showing every common field:

yaml-language-server: $schema=https://raw.githubusercontent.com/patrickiel/pantomime/main/schemas/v1/testcase.schema.json

test: id: login-happy-path description: > A registered user signs in and lands on the welcome screen. tags: [smoke, auth] priority: high

Scope the run to a window whose title contains this text.

The CLI --window flag overrides it. Omit to use the foreground window.

window: Login

Test data. Reference values as ${data.NAME}. Pull public values with

${VAR:NAME} (from the vars: map in pantomime.yaml) and secrets with

${SECRET:NAME} (from the secrets: map) so neither sits in the test file.

data: user: "${VAR:USER}" password: "${SECRET:DEMO_PASSWORD}"

preconditions:

- "The 'Login' window is open and showing the Username and Password fields."

steps:

- action: "Type the username '${data.user}' into the 'Username' field."

expect: "The 'Username' field shows 'demo_user'."

- action: "Type the password into the 'Password' field."

- action: "Click the 'Sign in' button."

expect: "The screen has navigated past the login form to a welcome message." timeout_s: 15 retries: 2

assertions:

- "absent:Invalid credentials"

- "The welcome message indicates the user is logged in."

teardown:

- "Click the 'Sign out' button to return to the login form."

**Fields**

Required:

| Field | Type | Notes | | --- | --- | --- | | `id` | string | Unique id; also joins to run history. | | `title` | string | Human-readable name. | | `steps`_or_`flow` | list _or_ string | Exactly one. `steps` is structured; `flow` is one free-text scenario. Defining neither, or both, is an error. |

Optional:

| Field | Type | Notes | | --- | --- | --- | | `description` | string | Free text. | | `tags` | list of strings | For grouping. | | `priority` | string | Free-form, for example `high`. | | `window` | string | Target a window by title **substring**. | | `region` | `[x, y, w, h]` | Absolute screen rectangle. Omit for the whole screen / foreground window. | | `data` | map of string to string | Values referenced as `${data.NAME}`. | | `preconditions` | list of strings | Checked before the steps run. | | `assertions` | list | Final hard checks (see below). | | `teardown` | list of strings | Always runs, even after a failure. |

Each item under `steps` defines **either** a natural-language `action`**or** one deterministic action field (`key` / `type` / `wait` / `scroll`) — exactly one:

| Field | Type | Notes | | --- | --- | --- | | `action` | string | The instruction, in natural language (driven by the model). | | `key` | string | Deterministic: press a key combo, e.g. `Tab`, `ctrl+a`. | | `type` | string | Deterministic: type literal text into the focused field. Supports `${data.x}` / `${SECRET:NAME}`. | | `wait` | number | Deterministic: pause this many seconds. | | `scroll` | string | Deterministic: scroll `up` or `down` (optional `amount:` companion). | | `expect` | string | Optional. A check run right after the action. | | `region` | `[x, y, w, h]` | Optional. Restrict this step to a rectangle. | | `timeout_s` | int | Optional. Per-step time budget. | | `retries` | int | Optional. Retries for the whole step attempt. |

The deterministic fields run **without any model call or screen sense** — the action is fixed up front and executed straight away, so a mechanical step like a keystroke is fast and free. Use them for unambiguous mechanical actions; use `action` when the agent needs to look at the screen to decide what to do (for example, clicking a button it has to locate). An `expect` check still runs after a deterministic action.

The schema is strict: unknown or misspelled fields are rejected.

5. Expectations and variables

[](https://github.com/patrickiel/pantomime#5-expectations-and-variables) This is the heart of natural-language authoring.

**Expectations** (used in a step's `expect` and in `assertions`). Three prefixes are checked deterministically, for free and instantly:

- `contains:<text>` passes if the text appears anywhere on screen (case-insensitive).

- `absent:<text>` passes if the text is **not** on screen.

- `prop:<Role>:<Name>` passes if an accessibility element with that role exists (name is an optional substring), for example `prop:Button:Sign in`.

Anything else is treated as a natural-language expectation and judged by the reasoning model against what is on screen, for example `"The welcome message indicates the user is logged in."` (Fuzzy expectations need a configured model; they fail if none is available.)

**Variables.** Three reference forms are supported, and only these three:

- `${data.NAME}` substitutes a value from the test's own `data` map.

- `${VAR:NAME}` substitutes a **public** value from the `vars:` map in `config/pantomime.yaml`. Not a secret: it is shown in logs and prompts like ordinary data. Use it for non-sensitive per-user values such as a username, so they stay out of the test file.

- `${SECRET:NAME}` substitutes a secret from the `secrets:` map in `config/pantomime.yaml`.

Secrets are redacted everywhere they could leak: logs, the text sent to the model, and saved screenshots. They are resolved to plaintext only at the instant they are typed. Put every password or token in a secret and reference it with `${SECRET:...}`.

6. Validate and preview before running

[](https://github.com/patrickiel/pantomime#6-validate-and-preview-before-running) Check a test (or a whole folder) without running it. The summary is printed with secrets redacted:

panto validate tests # a file or a directory of *.yaml / *.yml

See exactly what the planner will see. `sense` prints the current ScreenState, the list of elements with their `eN` ids, plus a footer with the element count and target region:

panto sense --window Login # target a window by title substring panto sense --vision # also capture a (redacted) screenshot

> With no `--window`, no `--region`, and no `window:` field, `sense` and `run` target the **foreground window**, which is usually your terminal. Always point them at your app.

7. Run a test

[](https://github.com/patrickiel/pantomime#7-run-a-test)

panto run examples/login/tests/login.yaml --window Login

- `panto run` accepts a single file or a directory (a suite of all `*.yaml` / `*.yml` files, searched recursively).

- `--window <text>` targets a window by title substring and **overrides** the test's `window:` field.

- `--dry-run` plans the first action without clicking or typing, and needs no API key. Good for sanity-checking a new test.

- Press `Ctrl-C` to cancel; the run stops cleanly at the next safe point.

**Try it with the bundled demo.** See `examples/login/README.md` to run the sample Login app and its test.

8. Read the results

[](https://github.com/patrickiel/pantomime#8-read-the-results) After each test, the CLI prints a plain-language `PASSED` / `FAILED` explanation: per step, what was expected, what was observed, and the last action taken, plus a token and estimated-cost line.

Exit codes (the CI contract):

| Code | Meaning | | --- | --- | | `0` | All tests passed. | | `1` | A test ran but failed. | | `2` | Parse error (`INVALID:`) or target/region error (`TARGET ERROR:`). | | `130` | Canceled. |

Every run also writes artifacts to `runs/<test-id>/<timestamp>/`:

- `result.json` is the full machine-readable result (steps, actions, verdicts, token usage, and a USD cost estimate).

- `junit.xml` is a standard JUnit report, the integration point for any CI (Jenkins, GitLab, GitHub Actions, and others).

- `step1.png`, `step2.png`, ... are screenshots captured at each check, with secret fields blacked out.

9. Step through a run in the web debugger

[](https://github.com/patrickiel/pantomime#9-step-through-a-run-in-the-web-debugger) A small web app lets you launch, stop, and replay tests from the browser instead of the terminal. From inside your project:

panto runner # serves http://localhost:5173 and opens it

`panto runner` resolves the project from the current directory (the one whose `config/pantomime.yaml` it finds by walking up) and starts the debugger scoped to just that project: one runner instance per project, no shared state. Add `--port` to change the port or `--no-open` to skip opening a browser tab. The first launch needs the runner's dependencies installed (`cd runner; pnpm install`).

It lists the project's tests, runs or dry-runs them (it spawns the same `panto run` under the hood), and lets you replay each run beat by beat: the Sense, Plan, Act, Verify timeline with the redacted screenshot the loop saw at each step. Each test keeps a history of past runs with status, duration, tokens, and cost. Navigate steps with the arrow keys or `j` / `k`. The project's name (from `config/pantomime.yaml`) shows in the header, and its runs live in `runs/pantomime.db`. Recording is on by default, and the timeline can follow a run live.

- * *

How it works

[](https://github.com/patrickiel/pantomime#how-it-works) Every goal (a precondition, a step, a prose flow, or a teardown line) runs a four-beat **Sense, Plan, Act, Verify** loop:

1. **Sense** captures the target region and lists the on-screen elements (id, role, name, text, box) as a `ScreenState`. Windows UI Automation provides the elements; for screens it cannot read (Tkinter fields, custom-drawn UIs), a local **OpenCV and OCR grounder** adds the input boxes and visible text. It is on by default. 2. **Plan** sends the goal and the ScreenState to the reasoning model, which calls a single `act` tool with one action, referencing an element **by id**. 3. **Act** resolves that id to a box and clicks its center, types, or presses keys. The model never emits pixel coordinates. This is the _grounding contract_: it points at elements, the harness does the geometry. 4. **Verify** runs a deterministic check (`contains:` / `absent:` / `prop:`) or, for a natural-language expectation, asks the model to judge the screen.

The loop re-senses after every action and repeats until the model reports done (or an action budget is reached), then verifies. Assertions are verify-only. Teardown always runs.

Perception and the driver

[](https://github.com/patrickiel/pantomime#perception-and-the-driver)

Pantomime perceives and acts on one rectangular screen region and is content-agnostic: it does not know whether a native app, a browser, or a remote desktop is inside the rectangle. Element boxes are region-local, so they map directly onto the screenshots. The driver captures pixels and sends mouse and keyboard input, doing the single region-local to absolute coordinate conversion in one place. Password fields are masked in the text and blacked out in any screenshot before it leaves the machine.

Reasoning

[](https://github.com/patrickiel/pantomime#reasoning)

Planning and judging both call a reasoning model through a configurable `provider`: `deepseek` or `anthropic` (or any Anthropic-compatible endpoint via the `anthropic` SDK), or `openai` (the OpenAI SDK behind a thin adapter; install with `pantomime[openai]`). The model is text-only by default: it reasons over the structured element list, not a screenshot. The provider and model ids are configuration, not hard-wired, and are meant to be swapped. Token usage from every call is accumulated and turned into an estimated USD cost.

- * *

Configuration

[](https://github.com/patrickiel/pantomime#configuration)

All configuration lives in **`config/pantomime.yaml`** (the `config/` folder is discovered by walking up from the working directory). Built-in defaults fill in anything you omit, so you only set what you want to change. It is gitignored because it holds `api_key` and `secrets`; commit `config/pantomime.example.yaml` (placeholders) as the shared template.

| Key | Purpose | | --- | --- | | `name` | Project name, shown in the web debugger header (defaults to the directory name). | | `api_key` | API key for the reasoning model. Required to run (not for `--dry-run`). | | `provider` | Reasoning backend: `deepseek`, `anthropic`, or `openai`. Required (no default). | | `base_url` | Override the endpoint URL (a self-hosted proxy or gateway). Defaults to the provider's. | | `planner_model` | Planner model id. Required (no default); must have a `pricing` entry. | | `judge_model` | Judge model id. Required (no default); must have a `pricing` entry. | | `effort` | Reasoning effort, from low to max. Required (no default). | | `grounding_enabled` | Toggle the OpenCV and OCR grounder (`true`/`false`). Required (no default). | | `pricing` | USD per 1M tokens, keyed by model id (`input_per_mtok` / `output_per_mtok`). Required: every model the planner/judge use must be priced, so each role is billed at its own rate. | | `runs_dir` | Where run artifacts are written. Relative paths are anchored to the project root (the parent of `config/`), so runs always land under the project regardless of the working directory. | | `db_path` | The debugger SQLite database. | | `debug_record` | Record the debugger database (on by default). | | `vars` | Map of public values for `${VAR:NAME}` references (shown in logs). | | `secrets` | Map of secret values for `${SECRET:NAME}` references (redacted everywhere). |

- * *

CLI reference

[](https://github.com/patrickiel/pantomime#cli-reference)

``` panto init [dir] [--force] Scaffold a starter project (default dir: tests). panto schema [-o path] Emit the test-format JSON Schema (stdout, or to a file). panto validate <path> Parse and validate a file or directory; print a redacted summary. panto sense [--region x,y,w,h] Print the current ScreenState. [--window text] [--vision] panto run <path> [--window text] Execute a file or directory of tests. [--dry-run] ```

- * *

Developing Pantomime

[](https://github.com/patrickiel/pantomime#developing-pantomime) Work inside the repository with uv:

uv sync # install dependencies into a managed virtualenv uv run pytest # run the unit and component suite (offline, no display or API key)

The web debugger has its own toolchain:

cd runner pnpm install pnpm dev # http://localhost:5173 pnpm check # type-check pnpm db:studio # browse the SQLite database in Drizzle Studio

The debugger is a thin reader and launcher around the Python CLI: the Python side writes every beat to the SQLite database, and the SvelteKit app reads it through Drizzle. Keep those two schemas in step.

- * *

Troubleshooting

[](https://github.com/patrickiel/pantomime#troubleshooting) | Symptom | Fix | | --- | --- | | `panto: command not found` | Run `uv tool update-shell` and open a new terminal, or use `uv run panto` inside this repo. | | It acts on your terminal | Pass `--window <title>` (or set `window:` in the test). With no target, it uses the foreground window. | | Fields are not detected | Leave grounding on (`grounding_enabled: true` in `config/pantomime.yaml`). Use `panto sense` to see what is detected. | | Clicks land slightly off | Set the display to 100% scaling for the run. | | A fuzzy expectation fails with "no judge available" | Set your reasoning API key, or rewrite the check as `contains:` / `absent:` / `prop:`. |