Ornith-1.0: Self-scaffolding LLMs for agentic coding

79 points · 3 visible top comments · 2026-06-28 17:58:04 UTC

deep-reinforce.com

Comments

SwellJoe · 2026-06-28 20:01:10 UTC

I added this to a benchmark I've been doing of how well agents find security bugs, specifically security bugs originally found by Mythos. It performs poorly with only read/grep/ls tools, but in a follow-up test with a full shell and Python, it doubled its findings (still a poor showing, but it does at least indicate it is doing what it says on the tin: making tools to help it solve problems). It also did worse than Qwen AgentWorld, another recent post-train of Qwen 3.6 MoE intended for agentic use.

https://swelljoe.com/post/will-it-mythos/

kordlessagain · 2026-06-29 04:33:02 UTC

Good to know. Thanks for the research!

hedgehog · 2026-06-29 23:22:49 UTC

It would be really interesting to see how the Qwen 3.6 35B model compares to the 27B on your benchmark.

Balinares · 2026-06-29 12:11:50 UTC

I'd have expected this to get more HN attention. Qwen 3.6 35B capability in a 9B model is a bonkers claim.

chid · 2026-06-29 13:12:28 UTC

I thought so too when I read the headline, but I expect it's basically Qwen3.5-9B.

juliangoldsmith · 2026-06-29 17:38:32 UTC

It looks like they're comparing Ornith 9B to Qwen 3.5 35B, not Qwen 3.6. I guess it kind of makes sense since it's a finetune of 3.5, but I totally missed it until I looked closely.

In my brief tests, Ornith 35B performed quite well. It won't replace DeepSeek V4 Flash for me, but if it was fast and cheap enough it might.

I don't remember being super impressed with Ornith 9B, but I could see it being on par with Qwen 3.5 35B.

nzach · 2026-06-29 12:18:07 UTC

Instead of training the model to directly answer questions, we trained the model to always write and execute the code that would solve the question?

If that is the case, isn't this just a fancy way to perform prompt optimization?