p.enthalabs

GitHub - experientiallabs/world-model-harness: World-model-as-a-harness for simulating AI agent environments

`wmh` makes it easy to go from agent traces to faithful replication of your production environment where your agents run.

Basically, an LLM pretends to be a virtual machine executing instructions — but it's 5x faster than a real sandbox.

Just:

git clone https://github.com/experientiallabs/world-model-harness cd world-model-harness uv sync uv run wmh build

The `build` command opens a wizard that walks you through creating your own world model from your traces.

Below is a comparison running 8 SWE-bench tasks: real sandboxes on the left, a world model acting as the sandbox on the right.

![Image 1: world-model-harness demo](https://github.com/experientiallabs/world-model-harness/blob/main/assets/demo.gif)

How it works

[](https://github.com/experientiallabs/world-model-harness#how-it-works)

A frontier LLM acts as the _environment_ your agent steps against, reconstructed from your own OpenTelemetry traces. Inspired by **Qwen-AgentWorld** (LLM-as-environment), **GEPA** (reflective prompt evolution), and **DreamGym** (retrieval over a trace replay buffer) — but with **zero training**: we get there with prompt optimization on a frontier model.

1. **Build** from your OTel traces: ingest → normalize → split train/held-out → index a replay buffer → evolve the env prompt with GEPA against the held-out split. 2. **Serve**: agents call `WorldModel.step(action)` (in-process or via the local HTTP backend). Each step retrieves the most similar past `(state, action) → observation` examples and predicts the next observation.

Try it

[](https://github.com/experientiallabs/world-model-harness#try-it)

uv run wmh examples list # swe-bench, tau-bench, terminal-tasks uv run wmh eval list # eval suites shipped with the examples uv run wmh eval run tau-bench # replay + score reconstruction fidelity uv run wmh play # step into the environment yourself uv run wmh serve # local HTTP backend on :8000

Example-local prebuilt models live under `examples/<task>/models/`; pass `--root examples/<task>` to `wmh list`, `wmh demo`, `wmh play`, or `wmh serve` to use one without rebuilding.

Use it as an API

[](https://github.com/experientiallabs/world-model-harness#use-it-as-an-api)

from wmh import Action, ActionKind from wmh.config.store import WorldModelStore from wmh.engine.loader import load_world_model

model_dir = WorldModelStore(".wmh").resolve("airline") wm, _provider = load_world_model(model_dir)

session = wm.new_session(task="check out the cart") obs = wm.step(session.id, Action(kind=ActionKind.TOOL_CALL, name="add_to_cart", arguments={"sku": "A1"})) print(obs.content)

Or over HTTP (same code path), namespaced by model name: `GET /world_models`, then `POST /world_models/{name}/sessions` and `POST /world_models/{name}/sessions/{id}/step`.

Providers

[](https://github.com/experientiallabs/world-model-harness#providers) One interface, four backends, verified on startup. Credentials are read from the environment:

| Provider | Model | Env vars | | --- | --- | --- | | Anthropic | Claude Opus | `ANTHROPIC_API_KEY` | | AWS Bedrock | Claude Opus | `AWS_REGION`, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY` | | Azure OpenAI | GPT | `AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_ENDPOINT` | | OpenAI | GPT | `OPENAI_API_KEY` |

Development

[](https://github.com/experientiallabs/world-model-harness#development)

Managed with uv; linting/formatting with ruff; type checking with ty. Conventions live in AGENTS.md.

uv sync --extra dev # env + dev tools uv run ruff check . # lint uv run ruff format . # format uv run ty check # type check uv run pytest -q # tests

Usage telemetry

[](https://github.com/experientiallabs/world-model-harness#usage-telemetry) `wmh` uses anonymous usage telemetry to track the volume of usage. Telemetry is strictly metadata. It never includes prompts, traces, actions, observations, file paths, model names, provider credentials, or raw user content.

Telemetry is enabled by default. To opt out for a project:

uv run wmh config telemetry disable

This writes `.wmh/settings.toml`. You can re-enable it with `uv run wmh config telemetry enable`, check the current setting with `uv run wmh config telemetry status`, or disable it for a process with `DO_NOT_TRACK=1` or `WMH_TELEMETRY=0`.