Claude Sonnet 5 – benchmark results

39 points · 7 visible top comments · 2026-06-30 20:09:25 UTC

artificialanalysis.ai

Comments

iLoveOncall · 2026-06-30 20:52:36 UTC

Half of the data is missing and the rest is inconsistent between different graphs and sections. Is the benchmark having Sonnet 5 generate the page and seeing how many hallucinations it has?

Tiberium · 2026-06-30 20:52:48 UTC

Seems like the model is incredibly inefficient at max reasoning, and even at high/xhigh it uses far more tokens than other models, including Gemini 3.5 Flash, GLM 5.2 and so on. GPT 5.5's efficiency in tokens is still unmatched.

trentor · 2026-06-30 21:01:27 UTC

Same with Opus: nothing above medium gives a reasonable improvement for the tokens spent.

atemerev · 2026-06-30 21:01:49 UTC

Yet another mediocre model. Mostly irrelevant among open weights alternatives. Fable wen.

butterisgood · 2026-06-30 21:20:14 UTC

I used Sonnet 5 today to evaluate work I’m doing on an experimental programming language with an interesting concurrency model.

I asked it to try to figure out why one of the examples wasn’t working.

It read the implementation of the compiler and the runtime, found the bug, fixed it, and fixed the example. The only thing I had to do manually was suggest a less silly name for a particular function.

I would use Sonnet 5 for coding … seems alright!

lucamark · 2026-06-30 21:23:40 UTC

Agreed. It is a mediocre model, expensive without being a frontier model.

datakan · 2026-06-30 21:03:39 UTC

I'm so sick of Anthropic's usage caps and how their model devours tokens.

system2 · 2026-06-30 21:07:47 UTC

It starts with NVIDIA artificially slowing the release of its tech. If GPUs were cheaper, we would have better models from many other companies, and competition would take care of these greedy tactics.

lucamark · 2026-06-30 21:27:54 UTC

Remember that such models are available to us thanks to NVIDIA GPUs.

system2 · 2026-06-30 23:15:59 UTC

Thank you NVIDIA for artificially making everything scarce so we pay 5x for everything today.

CSMastermind · 2026-06-30 21:06:13 UTC

Using Fable, pretty much every request hit some gate they had for no discernible reason. These provider-level rejections should be incorporated into benchmarks as 0s on the tasks since that's the experience you'll actually get using the model.
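To make the scoring proposal above concrete, here is a minimal sketch of how counting rejections as zeros changes an average benchmark score. Everything in it is an assumption for illustration: the `mean_score` helper, the use of `None` to mark a provider-level rejection, and the sample results are hypothetical and do not come from any real benchmark or model.

```python
from typing import Optional, Sequence


def mean_score(scores: Sequence[Optional[float]], count_rejections_as_zero: bool) -> float:
    """Average per-task scores; None marks a provider-level rejection (hypothetical convention)."""
    if count_rejections_as_zero:
        # Treat every rejected task as a failed task.
        effective = [0.0 if s is None else s for s in scores]
    else:
        # Drop rejected tasks, so only completed tasks count.
        effective = [s for s in scores if s is not None]
    if not effective:
        return 0.0
    return sum(effective) / len(effective)


# Made-up results for five tasks: 1.0 = solved, 0.0 = failed, None = rejected by the provider.
results = [1.0, None, 1.0, 0.0, None]

print("rejections skipped:", mean_score(results, count_rejections_as_zero=False))
print("rejections as zeros:", mean_score(results, count_rejections_as_zero=True))
```

With these made-up numbers, skipping the two rejected tasks averages over three tasks, while scoring them as zeros averages over all five, so the second figure is lower.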

cjk · 2026-06-30 21:23:49 UTC

I have heard this from a bunch of folks, but that was not my experience. For the couple days I was able to use it, I didn't hit a single gate, and I was using it pretty extensively (but not for anything security-related).

lucamark · 2026-06-30 21:26:31 UTC

I never had rejections in the short time Fable was available.

UltraSane · 2026-06-30 21:59:53 UTC

It used Opus for every biology-related question I asked it.

olejorgenb · 2026-06-30 22:28:13 UTC

Even Opus refuses to discuss microbiology for more than around 15 turns, in my experience.

nsingh2 · 2026-06-30 21:08:36 UTC

Cost per task is shockingly high. More expensive than Opus 4.8, second only to Fable.

Cost-per-task data is only available for max effort, though, so it might just be very inefficient at that effort level.

DrProtic · 2026-06-30 21:12:25 UTC

I feel like they repackaged Opus, slightly nerfed it, and reduced price per token.

A release just to have a headline while the Fable situation is getting resolved.