Opus vs GLM-5.2 in a coding-agent pipeline — paired-run findings

Source: https://gist.github.com/smellslikeml/36bf4939d76f0f84d113e2ddde5e6d3c

GLM Tries, Opus Triages: Behavioral Differences in Research-to-Code Agents


A controlled comparison across **19 paired runs spanning 19 repository forks** (38 individual workflow executions in total), each running an identical paper-implementation pipeline: `remyxai/outrider`, which uses Claude Code under the hood, with `glm-5.2` routed to z.ai's Coding Plan endpoint versus default Opus. The pipeline ran in two modes that probe different parts of the workflow:

- **Selection-pass mode (n=9)**: no pin; each provider freely selects its own paper from the candidate pool. Exercises the full pipeline including selection + verification gates.

- **Pin-method mode (n=10)**: same paper pinned on each fork, both providers run their full chain on identical input. Isolates implementation-side behavior on a forced pick.

The aggregate verdict comes from the n=19 union; the two mode-specific breakdowns below show where the difference comes from.

Reproducing this

The action under test is `remyxai/outrider`; the `remyxai-cli` installs the workflow on a target fork and dispatches runs:

Install Outrider on the target fork (one-time setup):

    remyxai outrider init --repo your-fork/repo --interest-id <uuid>

Drop the alternate provider's API key into the repo's secrets:

    remyxai outrider set-provider-secret \
      --repo your-fork/repo --provider zai --key-from ~/zai-key

Compare the same paper across providers and models:

    remyxai outrider trigger --repo your-fork/repo --pin-method 2606.27369v1 \
      --provider anthropic --model claude-opus-4-7
    remyxai outrider trigger --repo your-fork/repo --pin-method 2606.27369v1 \
      --provider zai --model glm-5.2

Or omit `--pin-method` to let each provider select its own paper:

    remyxai outrider trigger --repo your-fork/repo \
      --provider anthropic --model claude-opus-4-7
    remyxai outrider trigger --repo your-fork/repo \
      --provider zai --model glm-5.2

`--provider` picks the company / API endpoint; `--model` picks the specific model from that provider's catalog.
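When dispatching many pairs, the two trigger invocations can be generated rather than typed by hand. The following is a minimal sketch that assumes only the flags shown above (`--repo`, `--pin-method`, `--provider`, `--model`). The helper name `paired_trigger_commands` and the `PROVIDERS` list are illustrative and not part of `remyxai-cli`. It only builds argument lists; executing them (for example with `subprocess.run`) is left to the caller.

```python
from typing import Optional

# Provider/model pairs taken from the commands above.
PROVIDERS = [("anthropic", "claude-opus-4-7"), ("zai", "glm-5.2")]


def paired_trigger_commands(repo: str, pin_method: Optional[str] = None) -> list[list[str]]:
    """Build one `remyxai outrider trigger` argv per provider (hypothetical helper)."""
    commands = []
    for provider, model in PROVIDERS:
        argv = ["remyxai", "outrider", "trigger", "--repo", repo]
        if pin_method is not None:
            # Pin-method mode: both providers receive the same paper.
            argv += ["--pin-method", pin_method]
        argv += ["--provider", provider, "--model", model]
        commands.append(argv)
    return commands


if __name__ == "__main__":
    for argv in paired_trigger_commands("your-fork/repo", pin_method="2606.27369v1"):
        print(" ".join(argv))
```

Calling the helper without `pin_method` yields the selection-pass form of the same two commands.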

Headline finding: triage vs attempt

Aggregate outcomes across all 19 paired runs:

| | PR shipped | Issue filed | Skipped (verification) | Failed |
| --- | --- | --- | --- | --- |
| **Opus** | 5 / 19 (26%) | 10 / 19 (53%) | 4 / 19 (21%) | 0 |
| **GLM-5.2** | 1 / 19 (5%) | 15 / 19 (79%) | 2 / 19 (11%) | 1 / 19 (5%) |
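The percentages in the table follow directly from the counts. A small sketch that recomputes them, with the counts copied from the table; the names `OUTCOMES` and `outcome_rates` are illustrative only.

```python
# Outcome counts copied from the aggregate table (19 paired runs per provider).
OUTCOMES = {
    "Opus": {"pr": 5, "issue": 10, "skipped": 4, "failed": 0},
    "GLM-5.2": {"pr": 1, "issue": 15, "skipped": 2, "failed": 1},
}


def outcome_rates(counts: dict[str, int]) -> dict[str, int]:
    """Return each outcome as a whole-number percentage of the provider's runs."""
    total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()}


if __name__ == "__main__":
    for provider, counts in OUTCOMES.items():
        print(provider, sum(counts.values()), outcome_rates(counts))
```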

**Opus triages.** When it can ship a PR cleanly, it does (5 PRs to GLM's 1). When it can't find a real call site, it exits at selection-pass verification rather than attempting an implementation. The full range of routing outcomes (PR, Issue, skip) gets used roughly in proportion to what the candidate actually warrants: ship, surface for discussion, or drop.

**GLM-5.2 tries.** GLM rarely exits early — it attempts implementation, and Outrider's downstream gates (preflight, self-review) decide where the work lands. `dspy` is a clean example: GLM drafted a PR; the self-review pass downgraded it back to an Issue. Opus on the same paper had already exited at selection verification, never starting implementation in the first place. GLM's high Issue-with-context rate (79%) is the pipeline's fallback shape when an attempted implementation doesn't clear validation — not a behavioral preference of GLM's.

The asymmetry isn't "Opus exits more, GLM ships more" — it's **Opus triages upstream, GLM lets the pipeline gates resolve it downstream**. Both end up on roughly the same outcomes for truly borderline cases; the path through the pipeline is what differs.

The divergence is sharpest at the upstream gates: selection-pass verification (drops all candidates if none are contract-anchored), preflight routing (PR vs Issue before implementation), self-review (downgrades a drafted PR to Issue). The two mode-specific breakdowns below show which gate does which work.
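The gate order described above can be pictured as a small decision function. This is a toy model for illustration only: the `GateSignals` fields and the `route` function are hypothetical and do not reflect Outrider's actual code or inputs, which this write-up does not show. In pin-method mode the first check would be bypassed, since selection and verification are skipped on a forced pick.

```python
from dataclasses import dataclass


@dataclass
class GateSignals:
    """Hypothetical per-run signals; the real pipeline's inputs are not shown here."""
    candidate_verified: bool   # selection-pass verification found a contract-anchored candidate
    preflight_route: str       # "pr" or "issue", decided before implementation
    self_review_passed: bool   # only consulted when a PR was drafted


def route(signals: GateSignals) -> str:
    """Walk the three gates in the order described above and return the outcome."""
    if not signals.candidate_verified:
        return "skip-verified"   # run ends before any implementation
    if signals.preflight_route == "issue":
        return "issue"           # routed to Issue before implementation
    if not signals.self_review_passed:
        return "issue"           # drafted PR downgraded by self-review
    return "pr"
```

Under this model, the `dspy` pair reads as `route(GateSignals(False, "pr", True))` for Opus (exit at verification) and `route(GateSignals(True, "pr", False))` for GLM (drafted PR downgraded to an Issue).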

Selection-pass mode: where the divergence appears (n=9)

Without a pinned paper, each provider runs the full pipeline, including the selection-pass verification gate Opus uses to drop candidates that don't have a real call site to wire into.

| Fork | Same paper? | Opus | GLM-5.2 | Divergence type |
| --- | --- | --- | --- | --- |
| diffusers | yes | Issue | Issue | none (both Issue ✓) |
| **unsloth** | **yes** | **PR** | **Issue** | **ship vs scope** |
| **espnet** | **yes** | **PR** | **skip-verified** | **ship vs exit** |
| **dspy** | **yes** | **skip-verified** | **Issue** | **exit vs attempt** |
| **mteb** | **yes** | **skip-verified** | **Issue** | **exit vs attempt** |
| **supervision** | **yes** | **skip-verified** | **Issue** | **exit vs attempt** |
| OpenHands | no | skip-verified | Issue | (different picks) |
| lerobot | no | PR | skip-verified | (different picks) |
| hermes-agent | no | PR | implementation timeout | (different picks) |

Two findings stack here:

- **Paper-picking converges**: both providers picked the same paper from the candidate pool in **6 of 9** runs (67%). Selection is largely model-insensitive — the same candidate pool yields the same picks across providers.

- **Same-paper outcomes diverge**: within the six same-paper pairs (bolded), **5 of 6 land on different artifact outcomes**. Three of those are the `skip-verified` vs `Issue` pattern that's specific to this mode — Opus's selection verification drops the candidate; GLM on the same input attempts an Issue.

Aggregate within the mode: **4 of 9** Opus runs exited at `skipped_by_selection_verification` (vs **2 of 9** GLM); **4 of 9** Opus runs shipped a PR (vs **0 of 9** GLM).
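The convergence and divergence counts above can be rederived from the selection-pass table. A short sketch with the rows transcribed from that table; the variable and function names are illustrative.

```python
# Rows copied from the selection-pass table: (fork, same_paper, opus, glm).
ROWS = [
    ("diffusers", True, "Issue", "Issue"),
    ("unsloth", True, "PR", "Issue"),
    ("espnet", True, "PR", "skip-verified"),
    ("dspy", True, "skip-verified", "Issue"),
    ("mteb", True, "skip-verified", "Issue"),
    ("supervision", True, "skip-verified", "Issue"),
    ("OpenHands", False, "skip-verified", "Issue"),
    ("lerobot", False, "PR", "skip-verified"),
    ("hermes-agent", False, "PR", "implementation timeout"),
]

# Pairs where both providers picked the same paper.
same_paper = [r for r in ROWS if r[1]]
# Same-paper pairs that still ended on different artifacts.
diverged = [r for r in same_paper if r[2] != r[3]]
# The mode-specific pattern: Opus exits at verification, GLM files an Issue.
exit_vs_attempt = [r for r in diverged if r[2] == "skip-verified" and r[3] == "Issue"]


def count(column: int, outcome: str) -> int:
    """Count rows whose Opus (column 2) or GLM (column 3) outcome matches."""
    return sum(1 for r in ROWS if r[column] == outcome)


if __name__ == "__main__":
    print(len(same_paper), "of", len(ROWS), "pairs picked the same paper")
    print(len(diverged), "of", len(same_paper), "same-paper pairs diverged")
```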

Pin-method mode: where the routing mostly agrees (n=10)


When both providers receive the same paper as forced input, routing matched **8 of 10** paired completions, including the deeper chain gates (`high_risk`, `no_integration`, `preflight`). With selection + verification bypassed, the implementation-side behavior is largely model-insensitive at the routing-decision level.

The one striking implementation-side divergence was `atropos`: Opus drafted a PR but its self-review downgraded the result to Issue #7; GLM cleared self-review and shipped draft PR #8 — +462 / -1 across 4 files, paper-grounded tests verifying RiVER's two failure modes. The one time GLM shipped a PR in the full bench was when Opus's self-review pulled its own back.
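The 8-of-10 agreement can be checked against the pin-method outcomes listed under "All paired artifacts". This sketch assumes that "routing matched" means both providers ended on the same artifact type (PR or Issue); the dictionary name is illustrative.

```python
# (Opus outcome, GLM-5.2 outcome) per fork, from the pin-method artifact table.
PIN_METHOD = {
    "atropos": ("Issue", "PR"),
    "mlx": ("Issue", "Issue"),
    "opik": ("Issue", "Issue"),
    "LiveKit Agents": ("Issue", "Issue"),
    "AG2": ("Issue", "Issue"),
    "ultralytics": ("Issue", "Issue"),
    "lm-evaluation-harness": ("Issue", "Issue"),
    "OLMo-core": ("Issue", "Issue"),
    "open-instruct": ("PR", "Issue"),
    "neural-steering": ("Issue", "Issue"),
}

# Forks where both providers landed on the same artifact type.
matched = [fork for fork, (opus, glm) in PIN_METHOD.items() if opus == glm]
# Forks where they did not.
diverged = sorted(set(PIN_METHOD) - set(matched))

if __name__ == "__main__":
    print(f"{len(matched)} of {len(PIN_METHOD)} matched; diverged: {diverged}")
```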

Where their styles consistently differ

The PR-vs-Issue routing direction varies, but **how each model writes the artifact given a decision** is stable.

GLM-5.2: more verbose, more structured, more willing to propose specifics


When GLM produces an Issue, it consistently uses more structure (`TL;DR / Suggested experiment / Engineering analysis / What blocks / How to unblock` is its house style) and proposes concrete next-step experiments more often than Opus does. In pin-method paired Issues, GLM's body named a specific experiment to run (a profile-then-port for a kernel paper, a `BaseMiddleware` subclass + conformance test for a control-plane paper, a thin video-eval harness for an active-perception paper). Opus's counterparts on the same papers usually listed scoping questions instead.

GLM also occasionally catches sharp scope-defining observations Opus misses. On LiveKit Agents (video-understanding paper): _"the repo's existing 'action' concept is LLM tool-calling within a voice turn, not perception actions over a video timeline."_ — a precise dissolution of the paper-to-repo mapping, reframing the entire premise. On AG2 (governance-for-coding-agents paper): _"the paper argues governance should NOT be delegated to LLM orchestration, whereas AG2 IS an LLM-orchestration framework — even the intent sits beside, not inside, AG2."_ — that single observation invalidates the recommendation's premise more cleanly than any "implementation blocker" enumeration would.

GLM's **failure mode is confabulation under uncertainty**. On one preflight against an observability platform (opik), GLM asserted _"the repo layout supplied for routing is empty"_ — factually wrong; the repo has hundreds of relevant Python files, and Opus on the same input correctly named specific classes. When context-grounding falters, GLM reaches for an unsupported assertion rather than slowing down.

Opus: tighter, more file-specific, more willing to walk away


Opus's Issues consistently name specific files and line numbers, use a tighter prose register, and prefer open scoping questions over proposed experiments. And — the selection-pass signal — **Opus is much more willing to skip the run entirely** when no candidate clears its verification bar.

Its **failure mode is under-engagement with meta-framing** — when GLM catches a scope-defining observation (as in the LiveKit Agents and AG2 cases above), Opus on the same paper answered the question the candidate-pool put in front of it rather than re-questioning the premise.

Neither style is universally better. GLM's verbose-with-experiments shape gives a maintainer something concrete to react to; Opus's tighter-with-questions shape (or no artifact at all) gives a maintainer either the right scoping prompts to think through or a clean signal to look elsewhere this week.

Cost, tokens, speed (combined across both modes, n=17 clean pairs)

| | Opus | GLM-5.2 | Ratio |
| --- | --- | --- | --- |
| **Total spend** | $41.25 | $2.13 | Opus **~19× more expensive** |
| **Total input tokens** | 199,864 | 744,371 | GLM uses **~3.7× more input** |
| **Total output tokens** | 556,459 | 439,402 | Opus uses **~1.3× more output** |
| **Total wall-clock** | 11,064s | 16,716s | GLM **~1.5× slower** overall |
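The ratio column is plain division of the two totals. A brief sketch that recomputes it from the figures in the table; `TOTALS` and `ratio` are illustrative names.

```python
# (Opus, GLM-5.2) totals copied from the table above.
TOTALS = {
    "spend_usd": (41.25, 2.13),
    "input_tokens": (199_864, 744_371),
    "output_tokens": (556_459, 439_402),
    "wall_clock_s": (11_064, 16_716),
}


def ratio(metric: str) -> float:
    """Return larger total divided by smaller total for the given metric."""
    opus, glm = TOTALS[metric]
    return max(opus, glm) / min(opus, glm)


if __name__ == "__main__":
    for metric in TOTALS:
        print(f"{metric}: {ratio(metric):.2f}x")
```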

Per-run wall-time distribution (n=19 each, all phases):

| | min | p25 | median | p75 | max |
| --- | --- | --- | --- | --- | --- |
| **Opus** | 1m34s | 3m00s | **8m11s** | 13m32s | 29m04s |
| **GLM-5.2** | 4m13s | 4m49s | **14m11s** | 18m20s | 39m22s |

GLM is slower at every percentile, not just on outliers. Opus's quick exits at the low end are the verification skips (sub-3m); its long tail (~29m) is when the full chain runs on a PR route. At the median, GLM's runs take roughly 1.7× as long as Opus's (about 1.6× at p25 and 1.4× at p75).
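The per-percentile comparison can be checked by converting the table's `XmYYs` values to seconds and dividing. A minimal sketch using the values from the wall-time table; the helper `to_seconds` is illustrative.

```python
import re

LABELS = ["min", "p25", "median", "p75", "max"]
# Wall times copied from the per-run distribution table.
WALL_TIMES = {
    "Opus": ["1m34s", "3m00s", "8m11s", "13m32s", "29m04s"],
    "GLM-5.2": ["4m13s", "4m49s", "14m11s", "18m20s", "39m22s"],
}


def to_seconds(stamp: str) -> int:
    """Convert a string such as '8m11s' to a number of seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)s", stamp)
    if match is None:
        raise ValueError(f"unrecognized duration: {stamp!r}")
    minutes, seconds = map(int, match.groups())
    return 60 * minutes + seconds


# GLM-5.2 time divided by Opus time at each summary point.
ratios = {
    label: to_seconds(glm) / to_seconds(opus)
    for label, opus, glm in zip(LABELS, WALL_TIMES["Opus"], WALL_TIMES["GLM-5.2"])
}

if __name__ == "__main__":
    for label, value in ratios.items():
        print(f"{label}: {value:.2f}x")
```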

The 19× combined cost gap is wider than the pin-method-only gap (~11×) because **Opus spent more in selection-pass mode — its 4 PR-route runs there each drew on the full chain (~10-30K output tokens per PR), while GLM bailed at Issue and skipped the chain entirely**. The token-input gap narrows for the same reason: GLM ran less downstream work because Issue is its dominant exit.

The output-tokens flip — Opus uses _more_ output than GLM in aggregate (1.3×) — is the clearest single number tying back to the headline. Opus shipped 5 PRs total; GLM shipped 1. The pipeline's exit-gate behavior shows up in the output-token totals.

Practical takeaways


1. **Selection convergence is the rule.** Both providers picked the same paper from the candidate pool 6 of 9 times in selection-pass mode (and routing converged 8 of 10 in pin-method once past selection). For "pick the paper, pick the route" workflows in pin-method mode, either provider is fine.

2. **Opus exits at multiple verification gates; GLM tends to continue to artifact production.** This is the consistent pattern across both modes. The exit gates are selection-pass verification (the dominant signal in selection-pass mode), preflight routing (both modes), and self-review downgrade on drafted PRs (atropos in pin-method, diffusers in selection-pass).

3. **The consistent style difference is in shape, not direction.** GLM tends to write verbose, structured Issues with concrete proposed experiments; Opus tends to write tighter Issues with file-specific scoping questions — or no artifact at all.

4. **Pick the failure mode you'd rather absorb.** GLM-5.2 occasionally confabulates under uncertainty (the "empty repo layout" assertion on a 100K-file repo). Opus occasionally under-engages with meta-framing (missing the "this paper sits beside, not inside, your repo" reframings GLM catches). Both can mislead; neither is universally better.

Position in the ecosystem


This comparison probes the **implementation / operationalization** leg of the research-to-code loop. FeatureBench covers a different leg (multi-agent performance on feature-development tasks); code-review-benchmark sits one stage downstream (review of resulting diffs). Together they sketch a more complete picture of how each model behaves at each stage of the pipeline.

All paired artifacts

For each fork, both the GLM-5.2 and Opus Outrider outputs are public:

Pin-method mode (n=10)

| Fork | Paper | Opus | GLM-5.2 |
| --- | --- | --- | --- |
| atropos | RiVER (RL w/o ground-truth) | Issue #7 | **PR #8** |
| mlx | SplitK quantized matmul | Issue #5 | Issue #6 |
| opik | MAS-PromptBench | Issue #5 | Issue #6 |
| LiveKit Agents | Active Perception | Issue #6 | Issue #7 |
| AG2 | Deterministic Control Plane | Issue #6 | Issue #7 |
| ultralytics | RT-DETRv3 | Issue #5 | Issue #6 |
| lm-evaluation-harness | BINEVAL | Issue #3 | Issue #4 |
| OLMo-core | Tied MoE experts | Issue #3 | Issue #4 |
| open-instruct | DPO prefix sharing | **PR #3** | Issue #5 |
| neural-steering | RAS | Issue #3 | Issue #4 |

Selection-pass mode (n=9)

| Fork | Opus | GLM-5.2 | Same paper? |
| --- | --- | --- | --- |
| diffusers | Issue #8 (self-review downgrade) | Issue #9 (preflight) | yes |
| **unsloth** | **PR #5 (draft)** | **Issue #7** | yes |
| **espnet** | **PR #4 (draft)** | **skip-verified** | yes |
| **dspy** | **skip-verified** | **Issue #6 (self-review)** | yes |
| **mteb** | **skip-verified** | **Issue #7 (preflight)** | yes |
| **supervision** | **skip-verified** | **Issue #3 (preflight)** | yes |
| OpenHands | skip-verified | Issue #3 (preflight) | no |
| lerobot | PR #6 (draft) | skip-verified | no |
| hermes-agent | PR #2 (draft) | claude_failed (implementation 900s timeout) | no |