Other than the worst naming I have ever seen (Sol / Terra / Luna), the pricing is still expensive:
> GPT‑5.6 is priced per 1M tokens across three model sizes:
> Sol is $5 input / $30 output;
> Terra is $2.50 input / $15 output
> Luna is $1 input / $6 output.
The OpenAI casino has never been more ready to take your money on gambling even more tokens.
minimaxir · 2026-06-26 17:15:21 UTC
Note that GPT 5.5 currently is $5 input / $30 output (short context), so Sol is in the same class, while Terra, if the benchmarks are as claimed, is indeed a half-price GPT 5.5 at comparable performance.
andrethegiant · 2026-06-26 17:15:56 UTC
What don't you like about the naming?
lwansbrough · 2026-06-26 17:20:51 UTC
I feel like going with Space + Latin is LLM-level creativity.
Can't buy cheaper as a selling point when Deepseek is basically free when hitting cache? Unsubsidized too, cloudflare and digital ocean can be the model provider for similar pricing.
Stitch4223 · 2026-06-26 17:18:41 UTC
With the $200/month plan I’ve never run into any limits or issues. The product can be used every day for extensive sessions and development. What is everyone doing that makes them talk about tokens versus dollars?
minimaxir · 2026-06-26 17:20:14 UTC
If you've never hit the limits, why not do the $100/mo plan?
nsingh2 · 2026-06-26 17:33:01 UTC
From my own experience and what's on their checkout page, $100 is 5x base usage and $200 is 20x. If $100 were 10x, I personally would drop down. They want people to go to the highest tier.
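A quick sketch of the arithmetic behind that comparison, taking the 5x and 20x multipliers from the comment at face value (they are the commenter's figures, not verified here). The function name is made up for illustration.

```python
# Usage multiplier per dollar for each plan, using the figures quoted above.
# The multipliers are the commenter's claims; the 10x case is their hypothetical.

def usage_per_dollar(monthly_price: float, usage_multiplier: float) -> float:
    """Return how many 'base usage' units one dollar buys per month."""
    return usage_multiplier / monthly_price

plans = {
    "$100 plan (5x)": (100, 5),
    "$200 plan (20x)": (200, 20),
    "hypothetical $100 plan (10x)": (100, 10),
}

for name, (price, multiplier) in plans.items():
    print(f"{name}: {usage_per_dollar(price, multiplier):.3f}x base usage per dollar")
```

On these numbers the $200 tier buys twice as much usage per dollar as the $100 tier, and a 10x $100 tier would have matched it, which is the point being made.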
aeonik · 2026-06-26 21:08:35 UTC
You can hit limits with $100 if you use it all day.
You can do it easily if you use it in fast mode.
I bet you could hit the limits of the $200/month using fast mode if you were using multiple sessions at the same time all day on fast mode.
The OpenAI tiers seem pretty well tuned.
I used to use the plus ($20/month), and that was good for a few sessions every once in a while.
But now that I'm using it to configure my network, monitoring, maintenance, I'm using it every day and I'm on the $100 plan. And I do pretty consistently hit the limits, but it's easy to pace myself.
I'm thinking about upgrading to $200/month though. It would be nice not to have to ration it.
ai_slop_hater · 2026-06-26 17:26:22 UTC
I ran out of usage using GPT-5.5 and had to buy a second subscription. I now switched to GPT-5.4 which is basically 2x usage.
fph · 2026-06-26 18:51:22 UTC
But let's put it in perspective: what you're paying them is more than the average salary in many poorer countries.
Stitch4223 · 2026-06-26 20:53:43 UTC
Fair. From a business perspective said amount is very reasonable in Europe / USA. For personal use it’s already different. Sometimes the answer is simple, thanks.
kingstnap · 2026-06-26 21:25:25 UTC
Don't forget this.
> For GPT‑5.6 and later models, cache writes are billed at 1.25x the model’s uncached input rate
Charging for cache writes is cringe and literally only Anthropic did it. Anyway this does mean the "real" prices are +25% on top of what you wrote there.
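A rough sketch of what the 1.25x multiplier does to a single request, using the per-1M prices quoted upthread. It assumes every input token in the request is written to cache and ignores cache reads entirely, since no read rate is given in this thread. The function name and the token counts are illustrative.

```python
# Per-1M-token list prices quoted in the thread (USD).
PRICES = {
    "Sol": {"input": 5.00, "output": 30.00},
    "Terra": {"input": 2.50, "output": 15.00},
    "Luna": {"input": 1.00, "output": 6.00},
}

# From the quoted pricing note: cache writes cost 1.25x the uncached input rate.
CACHE_WRITE_MULTIPLIER = 1.25

def request_cost(model: str, input_tokens: int, output_tokens: int,
                 cache_write: bool = False) -> float:
    """Cost in USD of one request.

    Assumption: if cache_write is True, ALL input tokens are billed at the
    cache-write rate. Cache reads are not modeled.
    """
    price = PRICES[model]
    input_rate = price["input"] * (CACHE_WRITE_MULTIPLIER if cache_write else 1.0)
    return (input_tokens * input_rate + output_tokens * price["output"]) / 1_000_000

plain = request_cost("Sol", 100_000, 10_000)
cached = request_cost("Sol", 100_000, 10_000, cache_write=True)
print(f"uncached: ${plain:.3f}, with cache write: ${cached:.3f}, "
      f"increase: {100 * (cached / plain - 1):.1f}%")
```

Under these assumptions the surcharge applies only to the input side, so the increase on the whole bill is below 25% whenever there is any output, and it shrinks as the output share grows.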
loufe · 2026-06-26 17:13:07 UTC
"Next generation model"
If it was the next generation, why isn't it a major version change..?
ryangst_1 · 2026-06-26 17:17:14 UTC
LLM devs can't do version control
psychoslave · 2026-06-26 17:18:13 UTC
Semantic is passé, word models moved to the next generation.
dominotw · 2026-06-26 17:19:00 UTC
vibe versioning
cruffle_duffle · 2026-06-26 17:32:29 UTC
To be fair, versioning has always been vibes based.
appplication · 2026-06-26 17:19:59 UTC
Honestly LLMs are the ideal candidate for CalVer. It’s not like there’s any real API so there’s no backwards compatibility to maintain.
Even Apple adopted and standardized on it for their latest platform releases.
andy12_ · 2026-06-26 17:45:27 UTC
I think it makes more sense to make it so that major versions are different pretraining runs, and minor versions are simply the same pretraining run that was finetuned to different degrees. But it seems that that isn't cool anymore.
Kiro · 2026-06-26 20:13:03 UTC
LLM versioning is entirely feelings driven. The ideal versioning is probably just names.
kaizenite · 2026-06-26 17:23:51 UTC
Because if it sucks, they can just default to "It was a minor version change anyways"
goldenarm · 2026-06-26 17:51:30 UTC
They could hold the GPT-6 name for the IPO
GTP · 2026-06-26 17:53:59 UTC
Some assume it was to try to slip under the radar and avoid being limited by the government as they did with Fable.
therepanic · 2026-06-26 17:59:28 UTC
By all appearances, they did not succeed in doing so.
HarHarVeryFunny · 2026-06-26 18:07:45 UTC
AFAIK there is no difference between "generation" and "version". Version naming/numbering depends on how good it turns out to be, and competition. If the competition releases something then you need to push something out too.
Calling it 5.6 creates the least possible expectations, and therefore more potential for positive feedback.
The Sol/Terra/Luna naming is interesting. I wonder what Anthropic are considering for their next models? "Terminator", "Armageddon"?
wincy · 2026-06-26 18:26:42 UTC
You gotta check out the new ChatGPT 6.3 Betelgeuse bro
rolph · 2026-06-26 19:28:21 UTC
Heliopause
cyral · 2026-06-26 19:10:27 UTC
If they called it 6.0 and it wasn't AGI, you'd see a lot of complaining here too
tasuki · 2026-06-26 19:39:29 UTC
What is AGI? (I know what the shortcut expands to, I'm curious about your definition. Don't the current models fit?)
ChrisLTD · 2026-06-26 17:13:19 UTC
If it's a new generation why isn't it GPT-6?
win311fwg · 2026-06-26 17:20:29 UTC
It does not introduce incompatibilities with earlier 5.x models? Frontier models are at a point now that there will never be a need for another major version bump, aside from those chasing marketing gimmicks. They are smart enough to adapt.
ChrisLTD · 2026-06-26 17:26:05 UTC
What would it mean to be incompatible with the other 5.x models?
paxys · 2026-06-26 17:31:53 UTC
New request/response schema, new capabilities, or really anything that would break your existing workflows if you changed “5.5” to “5.6” in your application.
There have been many leaps forward in the past - tool calling, reasoning, agentic loops etc. 5.6 doesn’t have any of this. More intelligence doesn’t necessarily warrant a major version bump.
jurgenburgen · 2026-06-26 17:32:56 UTC
Only speaks Klingon
peab · 2026-06-26 17:27:36 UTC
not true. multimodality is still far from being solved
malnourish · 2026-06-26 17:27:46 UTC
A major bump will be warranted if/when we can truly separate prompt from data.
win311fwg · 2026-06-26 17:33:07 UTC
That is a different product line. It may be recorded as a version bump for marketing purposes, as already mentioned, but semantically begins at 0.
charcircuit · 2026-06-26 19:42:36 UTC
Why would incompatibilities have anything to do with a major version bump?
alcasa · 2026-06-26 17:24:27 UTC
They forgot how to do pretraining.
cleaning · 2026-06-26 17:47:11 UTC
5.5 was a new pretraining run.
paxys · 2026-06-26 19:17:01 UTC
Given the expectations everyone has created GPT-6 has to pretty much be AGI.
tasuki · 2026-06-26 19:36:32 UTC
What is your definition of AGI that the current LLMs don't fit?
paxys · 2026-06-26 19:47:43 UTC
As the old saying goes, I’ll know it when I see it. The current 5.x generation isn’t it.
gordonhart · 2026-06-26 19:53:54 UTC
Autonomously Generating Income (which is why it will never be released to the general public)
koolala · 2026-06-27 05:24:04 UTC
Hopefully it stands for AC Generation Improvements. If it prioritizes income it will bleed the planet dry. It needs to solve how expensive our cost is on the planet first or its entire existence was a mistake.
ThrowawayTestr · 2026-06-26 22:45:55 UTC
When it understands why 6 7 is funny
isomorphic_duck · 2026-06-26 23:01:09 UTC
Continual Learning? Why is this even a question? Isn’t it a well-known glaring issue with the current models? They cannot learn/adapt to new skills (in any permanent sense) once they are deployed.
FromTheFirstIn · 2026-06-26 23:19:51 UTC
You’d have to really stretch the definition of AGI to make the current models fit
LordDragonfang · 2026-06-27 02:53:32 UTC
The definition has already been stretched to not fit the previous models. There is no meaningful, static definition that significantly predates current capabilities.
There's a reason why ai xrisk doomers had to come up with the term ASI.
I would seriously suggest that everyone take a look at the wikipedia page for AGI from the month before ChatGPT was released, compare it to the current version, and not come to that conclusion.
The first sentence is “understand or learn any intellectual task that a human can.” Whatever you think of the benefits of LLMs, they don’t understand and they can only learn during the training period and with very minor adjustments in post training. So, no I don’t think any of these models are generally intelligent.
LordDragonfang · 2026-06-27 07:41:54 UTC
> they don’t understand
I have not seen any instance of this frequently-made assertion which is at all justified. It seems to rely on a definition of "understand" which is more about spirituality than actual observable evidence (they clearly can comprehend even complex tasks well enough to execute on them, and if you won't call that "understanding", you're playing word games rather than stating an objective fact).
Likewise, agents can literally come to a greater understanding of a problem through trial and error, and there are plenty of mechanisms to retain that knowledge. If you don't want to call that "learning", you're just making a choice to define it in a way more restrictive than how we use it for humans, and intentionally making communication more difficult.
mellosouls · 2026-06-27 10:11:44 UTC
> It seems to rely on a definition of "understand" which is more about spirituality than actual observable evidence
"Understanding" has enough philosophical leeway in its use to allow at least the possibility of sentience as a prerequisite.
This is where the discussion about LLM capabilities becomes genuinely difficult, and dismissing that difficulty as "word games" or "spirituality vs evidence" is not helpful.
LordDragonfang · 2026-06-29 04:56:57 UTC
Considering that "sentience" has enough "philosophical leeway" that it's just as reasonable to assert that LLMs are sentient (and at extremes, that they have been sentient for years) -- especially if we are, as you suggest, supposed to include any philosophically possible definition -- I don't think that's a meaningful rebuttal. If no one can agree on whether it's sentient, it's bad faith to choose a fringe definition that hands off its definition to such a nebulous term.
In fact, I'd argue that statements about what "is" and "is not" sentient rely on even more spirituality and word games for anything that isn't a terran tetrapod.
For a meaningful -- "helpful" -- discussion on such things, one has to assume that everyone is choosing a definition which is closer to the median usage and relies on not being totally subjective. Furthermore, given the breadth of options, it should be assumed to be a definition which permits the form of the question to be meaningful, rather than begging the question -- if your definition is tautological enough that non-biological entities can't have understanding, you're just expressing dogma rather than having a discussion.
Anything else is bad faith, or assuming bad faith on the part of the participants.
mellosouls · 2026-06-29 10:37:32 UTC
> it's bad faith to choose a fringe definition that hands off its definition to such a nebulous term.
I do not think it is at all unreasonable or "fringe" to regard understanding as involving intentionality: ie a directedness of thought toward the object-relations being "grasped". That may not be the only possible conception of understanding but it is a mainstream philosophical idea.
> In fact, I'd argue that statements about what "is" and "is not" sentient rely on even more spirituality and word games for anything that isn't a terran tetrapod.
Then you seem to be confusing "hard to understand" with "meaningless".
> you're just expressing dogma rather than having a discussion.
> Anything else is bad faith, or assuming bad faith on the part of the participants
Have a think about that (repeated) tone before responding.
Fwiw I am a long-time believer in consciousness being fully realisable in machines; I think the jury is still out on LLMs.
FromTheFirstIn · 2026-06-27 12:07:09 UTC
Agents are always combining the same underlying weights to their inputs, relying on the same maps of semi-semantic space and the relationships between those that it was leaning towards at training time. The fact that it’s successful in making lots of people have an Eliza effect doesn’t make it understand something. It’s simulating understanding based on an enormous corpus of text, much of which is people working through things or sharing an understanding of something. Unless you believe that all intellectual activity is about finding the space between words you shouldn’t believe LLMs have any chance at understanding anything.
knollimar · 2026-06-27 13:57:05 UTC
The "it's not X it's Y" where Y and X are the same indicates a lack of understanding.
LordDragonfang · 2026-06-29 04:32:43 UTC
Considering the number of humans I've seen make statements that fit that description (about AI, no less!), I don't think that's a strong argument against it.
FromTheFirstIn · 2026-06-30 00:51:54 UTC
Would you say those humans understand what they’re talking about?
LordDragonfang · 2026-06-30 15:42:24 UTC
Touché
mellosouls · 2026-06-27 10:04:06 UTC
From that same page:
> Various criteria for intelligence have been proposed (most famously the Turing test) but to date, there is no definition that satisfies everyone
0x696C6961 · 2026-06-26 23:35:18 UTC
Always one goalpost away from what we have.
UltraSane · 2026-06-27 02:07:54 UTC
AGI should be able to do every job a human can do using a computer at least as well as the average human.
LordDragonfang · 2026-06-27 02:52:10 UTC
That's already been true for a while, you're overestimating the average human. They just have different failure modes.
UltraSane · 2026-06-29 17:14:38 UTC
It isn't even close to true. The biggest problem is that human performance improves over time.
AA-Briefcase is a new benchmark for testing models on realistic knowledge work tasks in complex projects built by industry experts. Models are evaluated on multi-week knowledge work projects, each with many linked tasks and thousands of input source files. AA-Briefcase combines rubric and pairwise grading to evaluate verifiable task success, analytical quality, and presentation quality, giving a holistic view of overall agentic capability in knowledge work.
Tasks with many messy input files, conflicting information, and complex deliverables remain difficult for all models. Under a strict all-or-nothing grading scheme per task, Claude Fable 5 leads overall, but achieves a perfect task score on only 3% of tasks. On 31 of 91 tasks, no model scores above 50%.
Davidzheng · 2026-06-27 02:53:40 UTC
And what is it worse at than an average human today that can be done on a computer?
UltraSane · 2026-06-27 04:14:24 UTC
almost everything? AGI has to be able to completely replace a human in any information worker role indefinitely.
virgildotcodes · 2026-06-27 07:58:31 UTC
I think you're speeding past the word "average" in the sentence. I'd argue that current frontier models already exceed the abilities of average humans across the majority of tasks you can do on a computer, although you might be able to argue that they tend to be a bit slower?
That latter part is debatable though - have you seen a non-technical person try to figure out something new on a computer?
UltraSane · 2026-06-27 08:23:33 UTC
"I'd argue that current frontier models already exceed the abilities of average humans" for things that fit in their context window, sure, but LLMs can't learn over time the way humans can. One example is that LLMs are very good at writing a few thousand lines of code but they absolutely cannot write coherent million-line codebases. By average human I meant the average skill level for the job. AGI would need to be able to pass an interview and get hired and then perform well enough to not get fired.
Davidzheng · 2026-06-27 10:21:28 UTC
Yeah it's not true that for every job, it is better than median worker of that job. But it is conceivable that for almost all jobs it is already better than the median human (not just workers of that job).
isomorphic_duck · 2026-06-27 14:19:10 UTC
You have to understand that the median human is terrible at (almost) everything. Humans, the only examples of general intelligence we know, are economically valuable precisely because they can train themselves to specialise at a (relatively) narrow task over time. You don’t measure how good a coding model is by how well it programs relative to Doctors, or how well it can prove theorems relative to baristas, or how well it can write coherent novels relative to programmers. That would be a dumb metric.
tasuki · 2026-06-27 19:23:18 UTC
> Humans, the only examples of general intelligence we know
Our intelligence only seems "general" to us, because we're viewing it through our own eyes. Our "intelligence" is specialized to our survival, and we're terrible at most tasks outside that scope.
isomorphic_duck · 2026-06-28 11:48:09 UTC
We operate and think about subjects like Higher Topos Theory, Information Geometry and Algebraic Topology, which are several layers of abstractions removed from anything that can be termed as a skill “specialised to our survival”.
Davidzheng · 2026-06-27 10:22:53 UTC
But in any case, I think more than 10% of information workers today can be replaced by current-generation models indefinitely.
ChrisLTD · 2026-06-27 15:23:04 UTC
It's decent at rote coding tasks, but I haven't seen these things be reliable enough outside of that specific task to make the claim that it can do the work of any information worker.
leumon · 2026-06-26 17:13:43 UTC
> We plan to make them more broadly available to people using ChatGPT, Codex, and the API soon.
I hope this means that Fable will also get released again.
lanthissa · 2026-06-26 17:27:05 UTC
Why would it? If you're the US gov, Sam & Greg are your good boys giving you 25m,
and Dario's your naughty boy who you don't agree with politically.
Set 5.6 free, keep Fable chained, and Anthropic instantly sees rev loss and has to cave.
osti · 2026-06-26 17:14:02 UTC
Sol? Looks like OpenAI is jealous of Anthropic's good model naming ability and wants to emulate it.
dominotw · 2026-06-26 17:21:42 UTC
sol has no soul
taytus · 2026-06-26 17:25:27 UTC
It's missing u
alcasa · 2026-06-26 17:25:43 UTC
They should have used fighter jet codenames instead. The MiG-15 one has a nice ring to it.
arizen · 2026-06-26 17:59:39 UTC
Sol Goodman
MrCheeze · 2026-06-26 17:53:17 UTC
TBF, they did it first with ada/babbage/curie/davinci. "Sol" is a much weaker branding, though.
ddp26 · 2026-06-26 17:14:06 UTC
I'm going to pre-register my prediction that GPT-5.6 Sol is significantly behind Claude Fable 5, as evaluated by general consensus once time has passed for people to get familiar with both.
hmate9 · 2026-06-26 17:15:06 UTC
What is this prediction based on?
gpm · 2026-06-26 17:16:46 UTC
I suspect the same just based on their versioning scheme fwiw.
Comments
https://news.ycombinator.com/item?id=48678789
https://news.ycombinator.com/item?id=48683021
Edit: yeah. https://claude.ai/share/06fefe02-4299-44da-8c5a-42607f54ca77
https://en.wikipedia.org/w/index.php?title=Artificial_genera...
https://www.linkedin.com/pulse/announcing-aa-briefcase-bench...