
Show HN: OSS Agent I built topped the TerminalBench on Gemini-3-flash-preview


Scored 65.2% vs Google's official 47.8%, and the existing top closed-source agent, Junie CLI, at 64.3%.

Since there are a lot of reports of deliberate cheating on TerminalBench 2.0 lately (https://debugml.github.io/cheating-agents/), I would like to clarify a few things:

1. Absolutely no {agents/skills}.md files were inserted at any point. No cheating mechanisms whatsoever

2. The CLI agent was run in a leaderboard-compliant way (no modification of resources or timeouts)

3. The full Terminal-Bench run was done using the fully open-source version of the agent; there is no difference between what is on GitHub and what was run.

I was originally going to wait for it to land on the leaderboard, but it has been 8 days and the maintainers unfortunately have not responded (there is a large backlog of pull requests on their HF), so I decided to post anyway.

HF PR: https://huggingface.co/datasets/harborframework/terminal-ben...

It is astounding how much the harness matters, based on this and other experiments I have done.

Comments

Sorry, I couldn't really figure out if this was a harness, a fine-tuned model, or both. Can we use Qwen with this, for example? Is the performance expected to be better in that case?
The model was the default gemini-3-flash-preview.

Harness was https://www.npmjs.com/package/dirac-cli

Since Dirac is a heavily modified fork of Cline, it supports every model Cline supports, including Qwen and all popular open/closed models

As a matter of fact, I am trying to run Terminal-Bench 2.0 using some OSS models at the moment, but the slow inference speeds are causing tasks to time out

Interesting things Dirac does:

1. Uses an optimized version of Hash-Anchored edits for file editing (https://dirac.run/posts/hash-anchors-myers-diff-single-token); a rough sketch of the idea follows this list

2. Utilizes the language's AST to decide what to fetch into context, entirely avoiding large code-file reads

3. Batches all operations. Does a large number of reads/edits simultaneously (you can see a video demo for deepseek-v4-flash here: https://www.reddit.com/r/LocalLLaMA/comments/1suhdki/tested_...)

4. Allows the model to execute code to analyze things on the fly, so it can simply write a bash/python/perl script to accomplish things where appropriate

5. A lot of context curation and opportunistic context updates, i.e. putting into context anything you are certain the model would ask for next
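
To make point 1 concrete, here is a rough sketch of the anchor-addressed idea (the anchor format and helpers below are illustrative only, not Dirac's actual implementation; the linked post describes the real design):

```typescript
// Illustrative sketch: lines are shown to the model tagged with short, stable
// anchors derived from their content, and edits target anchors instead of line
// numbers (which drift as the file changes during a long agentic session).
interface AnchoredEdit {
  anchor: string;   // e.g. "k7" - a short token tied to one specific line
  newText: string;  // replacement text for the anchored line
}

// Render a file with per-line anchors for the model to reference.
function renderWithAnchors(lines: string[], anchors: string[]): string {
  return lines.map((line, i) => `${anchors[i]}| ${line}`).join("\n");
}

// Apply a batch of edits by anchor lookup; a stale anchor fails loudly instead
// of silently editing the wrong line.
function applyEdits(lines: string[], anchors: string[], edits: AnchoredEdit[]): string[] {
  const indexByAnchor = new Map(anchors.map((a, i) => [a, i] as const));
  for (const edit of edits) {
    const idx = indexByAnchor.get(edit.anchor);
    if (idx === undefined) throw new Error(`unknown or stale anchor: ${edit.anchor}`);
    lines[idx] = edit.newText;
  }
  return lines;
}
```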

I always wondered why ASTs were not a bigger part of both editing and scoping changes/parsing code. I thought I read an article where they said grep was just as effective. It kinda made sense for the case they were talking about.
Grep is effective for the most part, except in situations where you have huge codebases and the thing you're looking for is used in too many places, both as a symbol and as a non-symbol.

Another annoying thing about plain grep is that LLMs often end up pulling in bundled packages when using it, where one line is large enough to ruin the context window

> Grep is effective for the most part

It's very effective in well-written and well-designed codebases, where concepts tend to be well enough formed that they aren't named the same as everything else, so grepping for symbols gives you good search results.

Projects where the god object or core concepts have generic names like "Tree", "Node", or other things that are used everywhere tend to be next to impossible to search with grep and friends.

It's not intuitive to humans, even after learning parsing theory. I can do basic name refactorings. I've even written Neovim plugins to do one specific thing with the AST (DFS down and delete one subtree, which I understand). Those are fine.

I would not be comfortable doing an on-the-fly "rewrite all subtrees that match this pattern" kind of edit.

It seems like a tool that's good for LLMs, though.

"rewrite all subtrees that match this pattern" works really well in jetbrains, they call it structure search-and-replace.
I happened to have written both a tool and a blog post about the topic. It's more about the different technical approaches you have for solving the problem, but it might still interest you :)

https://www.context-master.dev/blog/deterministic-semantic-c...

Let me know what you think

This is interesting - I have been working on the same thing, building contextual data, LSP-style.

I saw the tools page, where, if I understand right, `get-symbol-context` is actually the main useful tool you provide? The others seem more like metadata that's easy to get already (?), but that tool provides the extra info.

I had been working on exposing mine as more high-level, i.e. multiple APIs to query different kinds of metadata about symbols, types, etc. But I am still not sure of the best approach; my thinking was about not overloading the AI with too many different tools. They accumulate quickly.

I definitely share the same sentiment. I don't want to overload the LLM with many tools. Better to have a few opinionated and flexible ones, but yeah, keeping the balance is hard.

I would say the main two tools are get-symbol-context and get-repository-overview. The latter is actually the more complex and sophisticated one. I'm running some graph algorithms to rank the symbols in terms of relative importance based on centrality metrics, i.e. how well connected they are in the symbol graph.

The idea behind that is to allow the LLM to infer the general structure and architecture of the project with just one tool call.
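
As a rough illustration of the centrality idea (this is a toy sketch with made-up symbol names, not the tool's actual scoring; real centrality metrics like PageRank or betweenness would replace the plain degree count):

```typescript
// Toy sketch: rank symbols by degree centrality in a symbol-reference graph.
// Each edge (from, to) means "symbol `from` references symbol `to`".
type Edge = readonly [from: string, to: string];

function rankByDegree(edges: Edge[]): Array<{ symbol: string; degree: number }> {
  const degree = new Map<string, number>();
  for (const [from, to] of edges) {
    degree.set(from, (degree.get(from) ?? 0) + 1);
    degree.set(to, (degree.get(to) ?? 0) + 1);
  }
  return [...degree.entries()]
    .map(([symbol, d]) => ({ symbol, degree: d }))
    .sort((a, b) => b.degree - a.degree);
}

// A symbol referenced from many places floats to the top of the overview,
// which is exactly the "architecturally central" signal the LLM needs first.
console.log(rankByDegree([
  ["main", "Server"], ["Server", "Router"], ["Router", "Handler"],
  ["Tests", "Server"], ["CLI", "Server"],
]));
```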

I guess you could reach something similar if you had a good Agents.md or docs detailing that for your project, but this was more meant to get there on the fly.

The symbol-context tool is basically a graph query tool (without a DSL or Cypher support yet), but yeah, here the question is also whether it makes more sense to give the AI the ability to run Cypher queries itself or to abstract that away behind a thinner API.

The main underlying factor of my tool is, however, the graph that I'm building and the metadata that can be extracted from it (connections, type of connection, etc.) :)

What's the metadata you have in mind?

Metadata: I feel like LSP focuses on human-style things (like locating a symbol), which are useful but not necessarily exactly what an LLM needs. Instead I want to do things like show the inheritance chain. Is a virtual method overriding something, or being overridden later? What is the class/polymorphic situation? My feeling is that this will help understand the shape, plus help with some bugs.

So a query on a symbol would:

* Return its type declaration, not (just) its location (and I'm considering some kind of summary version where it pulls in the ancestors too, so you directly see everything it has available, not just the actual declaration, because leaf nodes in inheritance often don't add much and the key behaviour is elsewhere)

* Return info about inheritance, the shape of how this modifies other code and how other code modifies it.

With variations when the symbol is a variable, a type, etc. I'm currently using tree-sitter for this, to bypass LSP and (for the language I'm working on) build a full symbol table and more, to get something closer to the LSP info you mention in your blog but not limited to what LSP makes available. I don't want to rely on an LSP server; I think first-class support per language is better. It's probably possible to generate this with a set of LSP calls, perhaps, but it might take some heuristics and guesswork and... :/
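
For concreteness, the tree-sitter side of this looks roughly like the following (a minimal sketch using the web-tree-sitter bindings; the grammar file name and query are illustrative, and resolving inheritance properly still needs the symbol table on top):

```typescript
import Parser from "web-tree-sitter";

// Minimal sketch: list class declarations in a TypeScript file via a
// tree-sitter query. Extending the query to class_heritage / extends_clause
// nodes is how you start collecting the inheritance edges discussed above.
async function listClasses(source: string): Promise<void> {
  await Parser.init();
  const lang = await Parser.Language.load("tree-sitter-typescript.wasm"); // assumed local grammar file
  const parser = new Parser();
  parser.setLanguage(lang);

  const tree = parser.parse(source);
  const query = lang.query("(class_declaration name: (type_identifier) @name)");

  for (const capture of query.captures(tree.rootNode)) {
    const { row } = capture.node.startPosition;
    console.log(`class ${capture.node.text} declared on line ${row + 1}`);
  }
}
```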

I do have a graph of file-level dependencies, but not yet a graph of what calls what at the symbol or type or method level. And while I build an index of all symbols I haven't yet sorted that by count.

I get the sense we're thinking along similar lines, with slightly different approaches?

Edit: if you would like to chat on this, I'm up for it! You can find me at my username at gmail (easy to lose emails there due to volume and spam!) or my profile has my website which has my LinkedIn (horribly, more reliable :D)

That sounds great, thanks for sharing your thoughts!

It sure sounds like we have similar things in mind. I basically try to build a proper graph representation of the code at runtime, so all caller/callee relationships plus type inheritance chains, etc. This is basically what I call a semantic code graph in the blog post.

From the things I tried with tree-sitter, I think I would have a hard time achieving the same, because by nature tree-sitter can only make educated guesses about real connections and will run into problems if things are named ambiguously.

But yeah, will definitely reach out and am looking forward to chatting :) Hope I find the time during this week!

Has anybody thought about encoding AST tokens as LLM tokens, similar to how different words can have different meanings and that's reflected in their embedding?
Language keywords are almost certainly individual tokens already. But I think you mean more than that: basically replacing identifiers with special tokens as well. It's worth a shot, but there are some practical problems.

The immediate downside is that mapping variable names to tokens and back would probably require indexing the whole codebase. You'd need a 1:1 mapping for every name that was in scope, and you'd probably need to be clever about disambiguating names that come in and out of scope.
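
A toy version of the mapping problem being described (purely illustrative; a real version would have to be scope-aware and live at the tokenizer level rather than as a naive string rewrite):

```typescript
// Toy sketch: give every in-scope identifier a stable placeholder token and
// keep the reverse map around, which is the indexing burden mentioned above.
function encodeIdentifiers(source: string, names: string[]) {
  const toToken = new Map<string, string>();
  const toName = new Map<string, string>();
  names.forEach((name, i) => {
    toToken.set(name, `<ID_${i}>`);
    toName.set(`<ID_${i}>`, name);
  });

  let encoded = source;
  for (const [name, token] of toToken) {
    encoded = encoded.split(name).join(token); // naive replacement, not scope-aware
  }
  return { encoded, toName };
}

const { encoded } = encodeIdentifiers(
  "const userCount = fetchUsers().length;",
  ["userCount", "fetchUsers"],
);
console.log(encoded); // "const <ID_0> = <ID_1>().length;"
```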

I just realized that the fact that LLMs work so well for me in Clojure might be partly because of the clojure-mcp tools. They provide structural browsing and editing.
I think we should use ASTs more, not for performance, but for easier code review.

Changes that are primarily code refactorings, like breaking up a large module into a bunch of smaller ones or renaming a commonly used class, are extremely tedious to review, both in LLM-generated diffs and in human-written PRs. You still have to do it: LLMs have a habit of mangling comments when moving code across files, while for a human, an unassuming "rename FooAPIClient to LegacyFooAPIClient" PR is the best place to leave a backdoor when taking over a developer's account. Nevertheless, many developers just LGTM changes like this because of the tedium involved in reviewing them.

If one could express such changes as a simple AST-wrangling script in a domain-specific language, which would then be executed in a trusted environment after being reviewed, that would decrease the review burden considerably.
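
For example, the FooAPIClient rename above could be expressed as a short jscodeshift-style codemod (a sketch, assuming a JavaScript/TypeScript codebase; the reviewer reads and trusts the transform rather than a huge mechanical diff):

```typescript
// rename.ts - reviewable AST-level rename, applied by a deterministic tool
// (e.g. `jscodeshift -t rename.ts src/`), so the resulting diff needs no
// line-by-line review.
import type { Transform } from "jscodeshift";

const transform: Transform = (file, api) => {
  const j = api.jscodeshift;
  return j(file.source)
    .find(j.Identifier, { name: "FooAPIClient" })
    .forEach((path) => {
      path.node.name = "LegacyFooAPIClient";
    })
    .toSource();
};

export default transform;
```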

I believe that with agentic development, the most important constraint we have is human time. Making the LLM better and faster won't help us much if the human still needs to spend a majority of their time reading code. We should do what we can to give us less code to read, without losing confidence in the changes that the LLM makes.

...I've said this a few times, and sometimes I get downvoted for it, sometimes I don't... This is what happens when you only hire CS people with no real-world engineering experience. Sure, they can build ML models, but I see how they improve on them after years, and it's always some really old "lesson learned" from elsewhere in the industry. There are a thousand projects that make things like Claude Code use fewer tokens and edit more efficiently, and nobody at Anthropic or Codex implements a single one of these approaches.

It screams inexperience building real software. If I were Anthropic, I'd hire devs for Claude Code who aren't just AI builders but tool builders, who care about UX and systems.

Building ML model training and serving infrastructure is real-world engineering. Nevermind the user-facing apps and supporting services.

> Sure, they can build ML models, but I see how they improve on them after years, and it's always some really old "lesson learned" from elsewhere in the industry. There are a thousand projects that make things like Claude Code use fewer tokens and edit more efficiently, and nobody at Anthropic or Codex implements a single one of these approaches.

They have fully internalized the bitter lesson; the result is that they get better returns from improving the next model than from squeezing performance out of the current one.

> Building ML model training and serving infrastructure is real-world engineering. Nevermind the user-facing apps and supporting services.

Looking at Anthropic's status info for the last 90 days only serves to prove that they aren't hiring the right people for the right roles.

> They have fully internalized the bitter lesson; the result is that they get better returns from improving the next model than from squeezing performance out of the current one.

Sure, but there are so many things they could be doing that don't require tweaking the model directly to improve it. The community builds all sorts of tools that improve Claude Code directly, and yet nobody at Anthropic takes any initiative in those directions. It feels like either they don't care about building user-facing software or they don't have any UX experience.

> Utilizes the language's AST to decide what to fetch into context,

Does that mean it's only going to work with certain languages for which it has parsers available?

Yes
Did you consider incorporating ast-grep or gritql?

Congratulations, great work.

Can't speak for OP, but I tried providing ast-grep in the execution context of an execute_bash tool, and even with pretty aggressive steering most models just don't seem to use it much. More expensive/SOTA models or higher reasoning increases the chances, but lowers speed and raises cost. Maybe due to training bias for exploration tasks?
Yes, I've tried this passive approach too and didn't dig much further after that. I thought maybe they'd figured out something more intentional in the prompting to enable these kinds of approaches.
I have a hunch that model proficiency with a given CLI tool very much correlates with how many Stack Overflow answers and blog entries there are providing examples for it...
My sense is that we're at a tipping point where instruction following is getting good enough to disrupt these old habits
Not really, but interested in trying them out for a future version, especially gritql.
Is there a complete list of the tools somewhere? I'm interested in how you chose to expose the AST specifically. In my own harness attempts I wanted to keep the number of tools absolutely minimal and briefly experimented with including an AST lib to use via an execute_python tool (plus some examples in the system prompt). Results were mixed though, with most models preferring ripgrep.
> Batches all operations. Does a large number of reads/edits simultaneously...

I wasn't sure what this meant, so I looked at the source. It seems to refer to tool APIs being designed around taking multiple targets as a list parameter, instead of hoping the model makes appropriately parallel tool calls. (This matches my experience, by the way: models are reluctant to make a large number of parallel calls simultaneously, and this seems more pronounced with weaker models.)

I think Anthropic may have mentioned this first; this pattern is also something my custom agent's tools are designed around, and I'm pretty sure I picked it up from them.
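
For anyone who hasn't seen the pattern, a generic illustration of such a batched tool definition (loosely following the JSON-schema shape Anthropic-style tool definitions use; this is not Dirac's actual schema):

```typescript
// One call, many targets: the tool takes a list, so the model doesn't need to
// emit N parallel single-target calls (which weaker models are reluctant to do).
const readFilesTool = {
  name: "read_files",
  description: "Read several files from the workspace in a single call.",
  input_schema: {
    type: "object",
    properties: {
      paths: {
        type: "array",
        items: { type: "string" },
        description: "Workspace-relative paths to read, all at once.",
      },
    },
    required: ["paths"],
  },
} as const;
```
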
Anchor-based editing requires injecting new anchors into the context, and Dirac does so via a diff. So how is this more efficient (token-wise) than search and replace, even at a single token per hash? Also, code is read more than written, so these just add up. I experimented once with stable anchors, albeit longer than a single token, and found it a downgrade.

My conclusion is that the efficiency Dirac sees comes mainly from showing the file skeleton by default

I'm not sure one way or the other, but I've been using a related tool called Tilth by another poster here. It doesn't do anchor-based editing, but it does do syntax-aware search and will e.g. report the line range for function definitions, provide file outlines with line numbers on a file-name match, etc.

https://github.com/jahala/tilth

ohh this is really nice :) testing it
I have six patches that I will upstream at some point; the main bug/surprise is that the .gitignore behavior is not what's documented, but even without that it seems to work quite well.
This seems really good...going to test it :)
> My conclusion is that the efficiency Dirac sees comes mainly from showing the file skeleton by default

How hard do you think it would be to bring this optimization to oh-my-pi and OpenCode? I am testing Dirac and it's very cool, but the tooling isn't there yet compared to oh-my-pi in terms of UX.

Would love some more feedback on this. Where do you think the major gaps are?
Thinking back, I might have jumped the gun here. I can't objectively evaluate UX without spending more time with the tool. I'll try to daily drive it a bit before I can form an opinion.
It would be really cool to do a causality investigation to determine which one of these boosts it so much / quantify how much each matters. Who knows, they may all interact in a sum-is-greater-than-parts way that only improves the score when shipped altogether.
How are the two-token anchors chosen when the initial 1700 single-token anchors run out? I'm assuming just a two-token combination from the 1700.
That's correct
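
In other words, roughly (a toy sketch; the actual pool contents and ordering are whatever the implementation picks):

```typescript
// Toy sketch: hand out single-token anchors first, then fall back to ordered
// pairs drawn from the same pool once the singles are exhausted.
function* anchorStream(pool: string[]): Generator<string> {
  for (const a of pool) yield a;                             // ~1700 singles
  for (const a of pool) for (const b of pool) yield a + b;   // then ~1700^2 pairs
}
```
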
Instead of burning tokens on SOTA models, why not use a dirt-cheap specialised model for file editing?

Where the SOTA model just gets a cheaper model to make the edits, and it does so.

Yeah, I also believe there are plenty of efficiency gains available by using different models for different tasks. Reasoning models such as Opus should only be used for the main planning and decision flows, while sub-operations (exploring, applying edits, etc.) could be delegated to smaller and cheaper models. You also end up with a much smaller context for the main big model.
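
A minimal sketch of that kind of routing (the model names and the `complete` helper below are placeholders, not any particular provider's API):

```typescript
// Placeholder sketch: the expensive model plans, a cheap model applies edits.
// `complete(model, prompt)` stands in for whatever provider client you use.
declare function complete(model: string, prompt: string): Promise<string>;

async function editFile(task: string, fileContents: string): Promise<string> {
  // The big reasoning model decides *what* to change, briefly and precisely.
  const plan = await complete(
    "big-reasoning-model",
    `Describe, precisely and briefly, the edits needed for: ${task}`,
  );

  // A small, cheap model turns the plan into the actual updated file.
  return complete(
    "small-edit-model",
    `Apply this edit plan and return the full updated file.\nPlan:\n${plan}\n\nFile:\n${fileContents}`,
  );
}
```
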
No CLI? Only VSCode extension?
CLI too (you can't run tbench without a CLI, as it runs in an isolated Docker env): `npm install -g dirac-cli`
Can't OpenCode reach the same result just by developing this as a feature or plug-in, like anchored edits?
Sure. Dirac is just a fork of the Cline harness, and obviously OpenCode could take the same techniques and implement them. I don't know how difficult it would be to implement them in OpenCode, but given that Dirac and OpenCode are both open source, a future version of OpenCode could always be a rebranded Dirac (I'm sure there are ways to implement Dirac's techniques without completely replacing OpenCode's underlying codebase, but this illustrates that, at the extreme, they could simply take Dirac in its entirety to get the same results).
Very interesting! I've often thought static analysis could really help agents (I wrote this last summer: https://martinalderson.com/posts/claude-code-static-analysis...), but despite being hyped about LSPs in Claude Code, they turned out to be very underwhelming (for many of the reasons they can be annoying in a "real" IDE, i.e. static analysis starts firing mid-edit and complaining, and cached analysis gets stuck).

Curious to know if this has been an issue with your AST approach on larger projects?

The hash-based line anchoring is very interesting too (though I see far, far fewer editing errors on Opus 4.5+).

I've often thought that even if model progress stopped today, we'd still have _years_ of improvements through harness iteration.

Wrt LSP, it uses the default LSP mechanism of the IDE provider.

For ASTs, it uses tree-sitter WASMs (shipped with the package) and maintains queries (https://github.com/dirac-run/dirac/tree/master/src/services/...)

To keep performance fast, it stores the symbols DB (using SQLite) in the workspace directory and incrementally updates it based on timestamps. It then uses this DB to resolve symbol queries.
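
A stripped-down sketch of the timestamp-gated reindexing (using better-sqlite3 and Node's fs; the table layout and names here are illustrative, not Dirac's actual schema):

```typescript
import Database from "better-sqlite3";
import { statSync } from "node:fs";

// Only re-parse files whose mtime is newer than what the symbol DB last saw;
// everything else is served straight from the on-disk index.
const db = new Database(".symbols.sqlite");
db.exec(`
  CREATE TABLE IF NOT EXISTS files   (path TEXT PRIMARY KEY, mtime_ms REAL);
  CREATE TABLE IF NOT EXISTS symbols (name TEXT, path TEXT, line INTEGER);
`);

function needsReindex(path: string): boolean {
  const row = db.prepare("SELECT mtime_ms FROM files WHERE path = ?").get(path) as
    { mtime_ms: number } | undefined;
  return !row || statSync(path).mtimeMs > row.mtime_ms;
}

function markIndexed(path: string): void {
  db.prepare(
    "INSERT INTO files (path, mtime_ms) VALUES (?, ?) " +
      "ON CONFLICT(path) DO UPDATE SET mtime_ms = excluded.mtime_ms",
  ).run(path, statSync(path).mtimeMs);
}
```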

Yes, I understand, but do you not have issues where it drifts out of date and confuses the agents (especially on longer-running tasks)?

Even "full" Visual Studio and ReSharper have issues with this. E.g., you start editing file x, IntelliSense runs and says there are loads of errors... because you haven't finished editing yet.

Same issue from the other side: when a human is editing, the LSP fires mid-keystroke and shows bogus errors for a second, whatever. With an agent doing 5 edits in a row, the symbol DB is always behind by one edit, so the next lookup pulls stale references. You can re-index synchronously after each edit, but that kills the batching speed.
It does a before/after comparison: fetch the LSP error state, apply all edits, fetch it again, diff.
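
Roughly like this (a sketch; the types and the diagnostics fetch are placeholders for the LSP plumbing):

```typescript
// Only surface diagnostics that are *new* after the batch of edits, so
// pre-existing errors in the workspace don't get blamed on the agent's change.
interface Diagnostic { file: string; line: number; message: string }
declare function fetchDiagnostics(): Promise<Diagnostic[]>; // placeholder LSP call
declare function applyAllEdits(): Promise<void>;            // placeholder edit batch

async function newErrorsAfterEdits(): Promise<Diagnostic[]> {
  const before = await fetchDiagnostics();
  const seen = new Set(before.map((d) => `${d.file}:${d.message}`));
  await applyAllEdits();
  const after = await fetchDiagnostics();
  return after.filter((d) => !seen.has(`${d.file}:${d.message}`));
}
```
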
Interesting. Would love a comparison to pi.dev (Not Ohmypi)

How does this perform in day to day coding tasks, outside of benchmarks?

https://github.com/dirac-run/dirac#-evals

The README has an eval of 8 tasks over 7 agents (including both pi and omp). pi-mono costs the second lowest across the 8 tasks (after Dirac) but occasionally produces incomplete changes.

Interestingly, the 2 tasks where pi missed some changes were both tasks that benefited from AST symbol understanding (e.g. find all instances of things that refer to this symbol and change them). Since pi relies on bash-type tooling, it missed some occurrences.

Going to assume you didn't capture the data, but could you add time taken to completion for each, if you have it?
Re: bash-type tooling: it doesn't mean an agent cannot use ASTs; using the tree-sitter CLI this should be perfectly possible.
I assume these benchmarks were done without any modifications to the default open-sourced harness. The tree-sitter CLI would be an extra plugin for pi-mono, but I'd be equally curious whether it would accomplish the task.
If I understand correctly, this is a heavily improved Cline fork? Does that mean features such as plan and act mode are also still there?
Yes, plan+act mode is one thing I loved about Cline!
Starred it, will try it later. One question though, to make it simpler for me: in what tasks does this shine, and how do you improve the score? I already use some skills to cut down CC costs, like caveman, rtk cli and a few others; I just want to understand.
I did limited testing using Sonnet on CC vs Sonnet on Dirac. I could not confirm the costs however
how well does it do on frontier models like Opus 4.6?
I have only done functionality testing, no benchmark testing, on Opus (I decided to pay my rent instead)
I keep trying to use dirac-cli with codex and it won't work: Error: Codex API error: Codex API request failed: 400.

Any ideas?

Assuming you logged in with OAuth, I am guessing you are trying to use gpt-5.5?

In my tests, it worked using gpt-5.4 for me and I assumed gpt-5.5 is not available to me because I am on the free plan

Do you have the subscription that allows 5.5? If so, I can look into what changed in the API. Sorry, I rarely use OpenAI, so it is a bit of an untrodden path

Yes I'm on ChatGPT Pro (OAuth) and I'm trying to use gpt-5.5-xhigh.

That was the issue, 5.4 works just fine.

Support for service: priority (GPT /fast mode) would also be cool!

Will fix this soon. Please feel free to create a github issue in the meantime.
I haven't tried it, but I'm curious why you decided to implement a whole new harness instead of just writing extensions in pi. From what I've done with pi so far, the extension API is quite extensive. Hash-anchored edits, for example, can definitely be implemented in pi. Anyhow, thank you for showing us your project; I will be checking it out later. Cheers!
One afternoon a few months ago I was very frustrated with how slow Cline was being, so I decided to look under the hood. Decided to make a couple of changes. Got sucked in. About 70k lines of changes, another 40k lines of deletions, and two months later, here we are.
The best kind of project. I'm trying this today. I've been happily using OpenCode so far.
I've been looking into local LLMs and new harnesses recently. How good is pi compared to OpenCode? I'm seeing that it's a lot better? What are the best models and customizations to fully utilize it?
I am a bit confused. What languages does it help with? You mention AST manipulation, so I am assuming it's not universally applicable, e.g. to Rust?
An AST (Abstract Syntax Tree) is a structured representation of the code that the agent uses to search and navigate more precisely, so it helps wherever a parser is available for the language.
It's really interesting how much the AI harness seems to matter. Going from 48% via Google's official results to 65% is a huge jump. I feel like I'm constantly seeing results that compare models and rarely seeing results that compare harnesses.

Is there a leaderboard out there comparing harness results using the same models?

I really wish there was! I even thought of creating one, but it would be a conflict of interest.
We probably want to compare the Cartesian product of model + harness.
Maybe the future isn't a human-like centralized intelligence but an octopus-like decentralized intelligence where more focus is placed on making the harness itself "smart"
That would be counter to AI company goals. They want the harness to be dumb and the models to be smart so they can sell models.
Not really. Anthropic, for example, sells both the harness and the models as a unified kit via Claude Code; it is in their best interest to make sure both parts work as well as possible, via reinforcement learning on previous usage as well as new model performance increases.
But harnesses are not a moat. They wouldn't have to subsidize their own harness massively if that were the case. Anyone can write a good harness.
It's not true that anyone can write a good harness, because the LLM providers have information, like prompts, that they can RL-train on, which someone writing their own harness would not have. Therefore a good proprietary harness is a moat.
That doesn't answer why Claude subsidizes their own harness and bans people from using subsidized inference on openclaw etc.
Yes it does? They want people to be locked into the Claude Code product.
Why do they have to "lock" them in if it's clearly superior to alternatives that merely use their API?
Because it's a way to make more money in the future. I feel like you're not really getting the difference between what a business does for profit and its technical decisions.