GPT-5.5 Codex reasoning-token clustering at 516/1034/1552 may be leading to degraded performance on complex tasks
Issue #30364 · Open
Labels: bug, model-behavior, rate-limits
Description

opened on Jun 27, 2026
Summary
I found an aggregate pattern in Codex `token_count` metadata: `gpt-5.5` responses disproportionately land at exactly `reasoning_output_tokens = 516`, with additional fixed-boundary spikes around `1034` and `1552`.
This appears model-specific and coincides with lower overall reasoning-token intensity, which may help explain degraded performance on complex/high-stakes Codex tasks.
This is related to #29353, which reported a task-level reproduction where `gpt-5.5` runs ending at exactly 516 reasoning tokens returned the wrong answer. This issue adds aggregate evidence across a larger Feb-Jun window.
I am not claiming this proves hidden chain-of-thought truncation. The narrower claim is that Codex telemetry shows a GPT-5.5-specific fixed-token clustering anomaly that looks consistent with thresholded reasoning-budget behavior.
Environment
- Product: Codex
- Model most implicated: `gpt-5.5`
- Data source: Codex `token_count` metadata
- Time window analyzed: Feb 1-Jun 27, 2026 UTC
- Related issue: #29353 (gpt-5.5 xhigh sometimes short-circuits with reasoning_output_tokens=516 and wrong final_answer in Codex Desktop)
Evidence
| Metric | Value |
| --- | ---: |
| Response-level token records analyzed | 390,195 |
| Sessions represented | 865 |
| Exact `reasoning_output_tokens = 516` events | 3,363 |
| GPT-5.5 share of all responses | 19.3% |
| GPT-5.5 share of exact-516 events | 82.0% |
| GPT-5.5 exact-516 / >=516 ratio | 44.0% |
| Non-GPT-5.5 exact-516 / >=516 ratio | 1.3% |
Model-level result:
| Model | Response records | Exact 516 / >=516 |
| --- | ---: | ---: |
| `gpt-5.5` | 75,401 | 44.0% |
| `gpt-5.4` | 25,214 | 19.8% |
| `gpt-5.2` | 247,575 | 0.34% |
| `gpt-5.3-codex` | 13,333 | 0.0% |
| `gpt-5.3-codex-spark` | 26,179 | 0.0% |
Monthly exact-516 clustering increased sharply:
| Month | Exact 516 / >=516 |
| --- | ---: |
| Feb 2026 | 0.11% |
| Mar 2026 | 2.45% |
| Apr 2026 | 4.25% |
| May 2026 | 53.30% |
| Jun 2026 | 35.84% |
At the same time, overall reasoning-token intensity decreased:
| Month | Mean reasoning tokens | P90 reasoning tokens |
| --- | ---: | ---: |
| Feb 2026 | 268.1 | 772 |
| Mar 2026 | 256.8 | 723 |
| Apr 2026 | 228.7 | 669 |
| May 2026 | 106.9 | 344 |
| Jun 2026 | 168.5 | 515 |
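For anyone reproducing the monthly mean and P90 figures from their own logs, a minimal sketch follows. It assumes each record has already been reduced to a `(month, reasoning_output_tokens)` pair and uses the nearest-rank percentile definition; the percentile method behind the table is not stated here, so results may differ slightly. The function names and sample records are hypothetical.

```python
import math
from collections import defaultdict


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def monthly_intensity(records):
    """Map month -> (mean, p90) of reasoning tokens.

    `records` is a hypothetical iterable of (month, reasoning_output_tokens) pairs.
    """
    by_month = defaultdict(list)
    for month, tokens in records:
        by_month[month].append(tokens)
    return {
        month: (sum(vals) / len(vals), nearest_rank_percentile(vals, 90))
        for month, vals in by_month.items()
    }


# Illustrative data only, not taken from the telemetry in this issue.
sample = [("2026-05", t) for t in [0, 0, 40, 80, 120, 200, 300, 516, 516, 516]]
print(monthly_intensity(sample))
```

Note that a month can show a low mean while still having many responses pinned at exactly 516, which is why the clustering ratio and the intensity figures are reported separately.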
Why this looks suspicious
The anomaly is not simply higher reasoning-token usage overall. Mean and P90 reasoning-token intensity fell from February-April to May-June, while exact-516 clustering rose sharply.
The clustering is also not evenly distributed across models. `gpt-5.5` accounts for only 19.3% of responses but 82.0% of exact-516 events. Its exact-516 / >=516 ratio is about 33.6x higher than the non-GPT-5.5 baseline.
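The over-representation can be checked from the rounded figures in the Evidence tables. The sketch below uses only those published percentages; the variable names are illustrative. Dividing the rounded ratios gives roughly 33.8x rather than 33.6x, and the small gap is presumably because the reported factor was computed from unrounded values (that explanation is an assumption).

```python
# Rounded figures copied from the Evidence tables in this issue.
gpt55_share_of_responses = 0.193   # gpt-5.5 share of all responses
gpt55_share_of_exact_516 = 0.820   # gpt-5.5 share of exact-516 events
gpt55_ratio = 0.440                # exact-516 / >=516 for gpt-5.5
baseline_ratio = 0.013             # exact-516 / >=516 for all other models

# How over-represented gpt-5.5 is among exact-516 events relative to its traffic share.
over_representation = gpt55_share_of_exact_516 / gpt55_share_of_responses

# How much higher the gpt-5.5 clustering ratio is than the non-gpt-5.5 baseline.
relative_ratio = gpt55_ratio / baseline_ratio

print(f"share over-representation: {over_representation:.2f}x")  # 4.25x
print(f"ratio vs. baseline:        {relative_ratio:.1f}x")       # 33.8x
```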
The fixed values are also notable: `516`, `1034`, and `1552` look like repeated threshold boundaries rather than a naturally varying reasoning-token distribution.
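One way to see why these values read as threshold boundaries is that they are evenly spaced. The arithmetic below uses only the three values reported in this issue. Reading them as multiples of a fixed step is an observation about the numbers, not a confirmed description of any internal mechanism.

```python
spikes = [516, 1034, 1552]

# Gap between consecutive spike values.
gaps = [b - a for a, b in zip(spikes, spikes[1:])]
print(gaps)  # [518, 518]

# Offset of each spike from the k-th multiple of that gap (k = 1, 2, 3).
step = gaps[0]
offsets = [value - step * (k + 1) for k, value in enumerate(spikes)]
print(offsets)  # [-2, -2, -2]
```

In other words, each observed spike sits at `518 * k - 2` for `k = 1, 2, 3`.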
Expected behavior
Reasoning-token counts for complex Codex tasks should vary naturally with task complexity and should not disproportionately cluster at exact fixed values for one model family.
Actual behavior
`gpt-5.5` responses cluster heavily at exactly 516 reasoning tokens, with related spikes around 1034 and 1552. This pattern is much weaker or absent in several other models.
Ask
Could the Codex team investigate whether `gpt-5.5` has a reasoning-budget, routing, truncation, fallback, or scheduler behavior that causes responses to terminate around 516/1034/1552 reasoning tokens?
If this is expected behavior, it would be useful to know whether exact 516 indicates a normal stopping point, a budget cap, a degraded tier, or another internal threshold.
Useful internal validation checks:
1. Query `token_count` events with `reasoning_output_tokens` by model.
2. Compare exact-value counts for `0`, `516`, `1034`, and `1552`.
3. Compute `count(reasoning_output_tokens = 516) / count(reasoning_output_tokens >= 516)` by model and day.
4. Compare `gpt-5.5` against `gpt-5.2`, `gpt-5.4`, and Codex-specific variants.
5. Replay matched complex tasks across GPT-5.2 and GPT-5.5 with quality evals, especially separating exact-516 responses from longer-reasoning responses.
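Checks 1 to 3 can be sketched as follows. The record layout (`model`, `day`, `reasoning_output_tokens`) is a hypothetical flattening of the `token_count` events, not the actual Codex log schema, and the sample rows are made up for illustration.

```python
from collections import defaultdict

SPIKE = 516


def exact_spike_ratio(records, spike=SPIKE):
    """Return {(model, day): exact / at_or_above} for groups with any record >= spike.

    `records` is a hypothetical iterable of dicts with the keys
    "model", "day", and "reasoning_output_tokens".
    """
    exact = defaultdict(int)
    at_or_above = defaultdict(int)
    for rec in records:
        key = (rec["model"], rec["day"])
        tokens = rec["reasoning_output_tokens"]
        if tokens >= spike:
            at_or_above[key] += 1
            if tokens == spike:
                exact[key] += 1
    return {key: exact[key] / total for key, total in at_or_above.items()}


# Illustrative rows only, not real telemetry.
sample = [
    {"model": "gpt-5.5", "day": "2026-06-01", "reasoning_output_tokens": 516},
    {"model": "gpt-5.5", "day": "2026-06-01", "reasoning_output_tokens": 516},
    {"model": "gpt-5.5", "day": "2026-06-01", "reasoning_output_tokens": 900},
    {"model": "gpt-5.5", "day": "2026-06-01", "reasoning_output_tokens": 120},
    {"model": "gpt-5.2", "day": "2026-06-01", "reasoning_output_tokens": 700},
    {"model": "gpt-5.2", "day": "2026-06-01", "reasoning_output_tokens": 2048},
]

for key, ratio in sorted(exact_spike_ratio(sample).items()):
    print(key, f"{ratio:.2%}")
```

Passing `spike=1034` or `spike=1552` repeats the same check for the other two values. Groups with no records at or above the threshold are omitted rather than reported as zero, which avoids dividing by zero.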
Activity

github-actions commented on Jun 27, 2026

Potential duplicates detected. Please review them and close your issue if it is a duplicate.
_Powered by Codex Action_
revantmalani commented on Jun 28, 2026

I've been facing the same issue and am very frustrated as well
bluecat1997 commented on Jun 28, 2026

I'm hitting the same problem and hope OpenAI responds!
bluecat1997 commented on Jun 28, 2026

> Potential duplicates detected. Please review them and close your issue if it is a duplicate.
>
> * gpt-5.5 xhigh sometimes short-circuits with reasoning_output_tokens=516 and wrong final_answer in Codex Desktop #29353
>
> _Powered by Codex Action_

This is a much more data-driven report than the previous one.
vguptaa45 commented on Jun 28, 2026

Author
> > Potential duplicates detected. Please review them and close your issue if it is a duplicate.
> >
> > * gpt-5.5 xhigh sometimes short-circuits with reasoning_output_tokens=516 and wrong final_answer in Codex Desktop #29353
> >
> > _Powered by Codex Action_
>
> This is a much more data driven report than the previous one

I agree; the previous one was closed for no reason. I hope this one gets their attention.
Lionel233 commented on Jun 28, 2026

Exactly — this matches what I found, and it's clearly not an isolated case. I shared my initial finding on Reddit earlier (Half of Your High-Stakes Codex Requests May Be Silently Downgraded by Truncated Reasoning), and it's great that you've now dug deeper with model-specific and monthly data.
I've added a link to this GitHub issue in that Reddit post, so readers can cross-reference and upvote here.
Thanks for the thorough testing!
loner2403 commented on Jun 28, 2026

Same issue
partment commented on Jun 28, 2026

Same issue
Suvmaker commented on Jun 28, 2026

same problem
lujunjiehhh commented on Jun 28, 2026

Same issue
haowang02 commented on Jun 28, 2026

same issue
owiofwm2i commented on Jun 28, 2026

> > > Potential duplicates detected. Please review them and close your issue if it is a duplicate.
> > >
> > > * gpt-5.5 xhigh sometimes short-circuits with reasoning_output_tokens=516 and wrong final_answer in Codex Desktop #29353
> > >
> > > _Powered by Codex Action_
> >
> > This is a much more data driven report than the previous one
>
> I agree, the previous one was closed for no reason. I hope this takes their attention

You should post this on the official OpenAI forums too:
Those are more likely than GitHub to be seen by people who actually have the authority to investigate or escalate model-quality issues internally.
MioQuispe commented on Jun 30, 2026

It's completely unusable! It cannot be that I pay $200 a month and select XHigh, yet it gets outperformed by a free-tier Chinese model...
I don't want any resets. They are collecting dust because of this.
Stop handing them out if you can't handle the capacity.
I'd rather have 50% less usage if it meant the model actually spent any time thinking at all.
DanielMulec commented on Jun 30, 2026

Same issue

white54503 commented on Jun 30, 2026

Same issue.
92645417d9e5c763259dbebc306e3e commented on Jun 30, 2026

An effective mitigation is to replace the system-prompt instruction "Share intermediary updates in `commentary` channel." with:
- Do not send optional commentary.
- You do not need to use the commentary channel to report progress to me.
- Use tools normally.
- Put user-facing text in final only.
Alternatively, use the gpt-5.2-codex prompt directly.
insanowsky commented on Jun 30, 2026

> an effective mitigation measure is to modify the system prompt (Share intermediary updates in `commentary` channel.) to
>
> * Do not send optional commentary.
> * You do not need to use the commentary channel to report progress to me.
> * Use tools normally.
> * Put user-facing text in final only.
>
> or directly use the gpt-5.2-codex prompt

The system prompt does not control the juice value.

mentioned this on Jul 1, 2026
- 📊 AI CLI Tools Community Daily Report 2026-07-01 zx0828/big_model_radar#204
016 commented on Jul 1, 2026

Same issue, it's for $200?
MaShouo commented on Jul 1, 2026

Same issue. Editing the system prompt relieves it a bit.

mentioned this in 3 issues on Jul 2, 2026
- 📰 Hacker News AI Digest 2026-07-02 kakapez/agents-radar#589
- 📰 Hacker News AI Community Daily Report 2026-07-02 96loveslife/big_model_radar#94
- 📰 Hacker News AI Community Daily Report 2026-07-02 litang9/big_model_radar#153
haowang02 commented on Jul 2, 2026

Because of this issue, I haven't been able to use Codex for any real work for quite a while now.
If this doesn't get fixed, I don't see any reason to keep paying for the subscription.
It's pretty disappointing that GPT-5.5 xhigh is now delivering a worse experience than some budget open-source models.
**This is outright fraud!!!**
vguptaa45 commented on Jul 2, 2026

Author
> Because of this issue, I haven't been able to use Codex for any real work for quite a while now.
>
> If this doesn't get fixed, I don't see any reason to keep paying for the subscription.
>
> It's pretty disappointing that GPT-5.5 xhigh is now delivering a worse experience than some budget open-source models.

YES, LITERALLY THIS. gpt-5.5 xhigh regularly thinking for only 30 seconds is abysmal. I'm holding out for gpt-5.6 to see if this issue is resolved; otherwise I'll move my team too.
MaShouo commented on Jul 2, 2026

The same test question achieves 100% accuracy when asked in Opencode, but in Codex the accuracy drops to nearly 0%. I don't know what the OpenAI team is doing. This is clearly not an issue of model intelligence: a single system prompt can ruin an entire model.