Ornith-1.0: Self-scaffolding LLMs for agentic coding

deep-reinforce.com

Comments

I added this to a benchmark I've been doing of how well agents find security bugs, specifically security bugs originally found by Mythos. It performs poorly with only read/grep/ls tools, but in a follow-up test with a full shell and Python, it doubled its findings (still a poor showing, but it does at least indicate it is doing what it says on the tin: making tools to help it solve problems). It also did worse than Qwen AgentWorld, another recent post-train of Qwen 3.6 MoE intended for agentic use.

https://swelljoe.com/post/will-it-mythos/

Good to know. Thanks for the research!
It would be really interesting to see how the Qwen 3.6 35B model compares to the 27B on your benchmark.
I'd have expected this to get more HN attention. Qwen 3.6 35B capability in a 9B model is a bonkers claim.
I thought so too when I read the headline, but I expect it's basically Qwen3.5-9B.
It looks like they're comparing Ornith 9B to Qwen 3.5 35B, not Qwen 3.6. I guess it kind of makes sense since it's a finetune of 3.5, but I totally missed it until I looked closely.

In my brief tests, Ornith 35B performed quite well. It won't replace DeepSeek V4 Flash for me, but if it was fast and cheap enough it might.

I don't remember being super impressed with Ornith 9B, but I could see it being on par with Qwen 3.5 35B.

Instead of training the model to directly answer questions, we trained the model to always write and execute the code that would solve the question?

If that is the case, isn't this just a fancy way to perform prompt optimization?