Claude Code Drama: 6,852 Sessions Prove Performance Collapse

Welcome to Hello Engineer, your weekly guide to becoming a better software engineer! No fluff. Pure engineering insights.

The Smoking Gun: A Senior Engineer’s Data Autopsy

Stella Laurenzo knows how to file a bug report. The AMD Senior Director of AI dropped a GitHub issue on April 2, 2026 that read less like a complaint and more like a criminal investigation. She brought telemetry from 6,852 Claude Code sessions, tracked 17,871 thinking blocks, and analyzed 234,760 tool calls across three months of stable internal engineering work.

Her conclusion? Claude Code systematically degraded between January and March 2026.

The numbers are brutal:

Median visible thinking length: 2,200 characters in January → 600 characters in March (73% collapse)
API calls per task: Up to 80x more retries from February to March
Files read before editing: 6.6 files → 2.0 files (not enough to understand dependencies)
Early stopping patterns: Near zero → 10 per day after March 8
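The percentages behind those figures come from a plain relative change calculation. Here is a minimal sketch that uses only the numbers quoted above; the function name is illustrative and not part of Laurenzo's analysis.

```python
def relative_change(before: float, after: float) -> float:
    """Return the change from `before` to `after` as a fraction of `before`."""
    if before == 0:
        raise ValueError("relative change is undefined for a zero baseline")
    return (after - before) / before


# Figures quoted from the list above.
thinking = relative_change(2200, 600)   # median visible thinking length, in characters
files_read = relative_change(6.6, 2.0)  # files read before editing

print(f"thinking length: {thinking:.0%}")  # -73%
print(f"files read: {files_read:.0%}")     # -70%
```

By the same arithmetic, the drop in files read before editing is about 70%, nearly as steep as the drop in visible thinking.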

Key insight: 600 characters is barely enough to articulate a file-reading strategy, let alone plan a multi-file refactor across a 50,000-line codebase.

This is not “the model feels slower.” This is structural workflow breakdown.

The Viral Benchmark That Got Debunked

While Laurenzo’s data was mounting on GitHub, a different claim went viral on X. BridgeMind, the team behind BridgeBench, posted on April 12 that Claude Opus 4.6 fell from 83.3% accuracy (ranked #2) to 68.3% accuracy (ranked #10) in a hallucination benchmark retest.

The post exploded.

Then Paul Calcraft, an independent AI researcher, tore it apart. The original high score? Only 6 benchmark tasks. The new retest? 30 tasks. On the 6 overlapping tasks, performance barely moved: 87.6% → 85.4%. The headline 15 point drop (83.3% → 68.3%) collapsed under scrutiny.
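Calcraft's check boils down to one rule: compare two runs only on the tasks they share. The sketch below shows the idea with invented task names and scores. It is not the BridgeBench data, and `compare_on_overlap` is a hypothetical helper.

```python
from statistics import mean


def compare_on_overlap(run_a: dict[str, float], run_b: dict[str, float]) -> tuple[float, float, int]:
    """Mean score of each run, restricted to tasks present in both runs."""
    shared = sorted(run_a.keys() & run_b.keys())
    if not shared:
        raise ValueError("the two runs share no tasks")
    return mean(run_a[t] for t in shared), mean(run_b[t] for t in shared), len(shared)


# Hypothetical scores, invented for illustration only.
original = {"t1": 0.90, "t2": 0.80}
retest = {"t1": 0.88, "t2": 0.80, "t3": 0.40, "t4": 0.50}

# Naive comparison: average over whatever tasks each run happened to include.
naive_a, naive_b = mean(original.values()), mean(retest.values())

# Fair comparison: average over the shared tasks only.
fair_a, fair_b, n = compare_on_overlap(original, retest)

print(f"naive: {naive_a:.3f} -> {naive_b:.3f}")
print(f"overlap ({n} tasks): {fair_a:.3f} -> {fair_b:.3f}")
```

With these made-up numbers the naive averages suggest a large drop, while the shared tasks barely move. That is the same shape as the retest dispute.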

The debunk did not stop the controversy. It made it worse. If the most viral benchmark claim was bad science, how much of the rest could be trusted?

What Anthropic Actually Changed (And Why Users Are Furious)

Boris Cherny, Claude Code lead, responded to Laurenzo’s GitHub issue with specifics. Anthropic made three deliberate product changes between February and March 2026:

Adaptive thinking by default (February 9): Claude decides reasoning depth per task instead of using a fixed budget
UI-only thinking redaction (February 12): Intermediate thinking hidden to reduce latency, but “does not impact thinking itself”
Effort level dropped to “medium” (March 3): Default changed from high to effort level 85

The official line: these changes were meant to balance intelligence, latency, and cost.

The developer experience: broken workflows, wasted tokens, and a model that felt “dumber than Sonnet 3.5.”

Here is the gap. Cherny says you can manually type /effort high in the Claude Code terminal to restore extended reasoning. But Pro users on Cowork and Claude Desktop cannot change the default. They are locked into medium effort with no escape hatch.

The trust issue: Anthropic knows when it changes serving parameters. Users do not. When the model suddenly starts hitting early stopping patterns and burning 80x more API calls, users have no official changelog to reference. They are left running their own benchmarks and filing GitHub issues.

The Prompt Caching Bug That Inflated Costs 10x to 20x

While the reasoning depth drama unfolded, a different group of developers discovered something worse. A community member reverse-engineered the Claude Code binary using Ghidra and a MITM proxy and found two independent bugs that broke prompt caching.

Users reported costs inflating 10x to 20x without warning. Some confirmed that downgrading to Claude Code version 2.1.34 made the issue disappear.

Thariq Shihipar, an Anthropic engineer, responded on X: “Actively looking into this in particular.”

The timing was catastrophic. The prompt caching bug hit simultaneously with:

Session limit cuts during peak hours (announced March 27)
End of a promotional period that had doubled usage limits
Multiple Opus 4.6 service incidents logged on the status page

From a user perspective, it looked like Anthropic was throttling the model to save compute. From Anthropic’s perspective, they were tuning defaults and fixing infrastructure bugs.

Neither narrative is wrong. Both are happening at the same time.

The OpenAI Memo That Added Fuel to the Fire

On April 10, CNBC reported on an internal OpenAI memo in which the company's revenue chief claimed Anthropic made a “strategic misstep” by not securing enough compute capacity. The memo alleged Anthropic was “operating on a meaningfully smaller curve” than competitors.

Anthropic declined to comment on the claims.

The memo landed in the middle of the degradation controversy. Developers who were already suspicious about throttling now had an external data point suggesting Anthropic might be compute constrained.

Shihipar publicly denied on X that Anthropic degrades models to manage demand. But the company had already announced stricter session limits during peak hours, affecting around 7% of users. That established two facts:

Anthropic faces surging demand
Anthropic is actively rationing usage during busy periods

For many developers, that context made it easier to believe other, less visible tradeoffs might also be in play.

What the Independent Benchmarks Actually Show

Marginlab, an independent third-party organization, has been running daily SWE-Bench-Pro evaluations of Claude Code with Opus 4.6 since the degradation complaints began.

Their findings as of April 10, 2026:

Historical baseline: 56% pass rate
Current daily evaluation: 50% pass rate (6 percentage point drop)
Sample size: 50 test cases per day
Statistical significance: Not yet reached, but the trend is concerning
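Some quick arithmetic shows why a 6 point gap on 50 test cases falls short of significance. The sketch below rests on two assumptions: the 56% baseline is treated as a fixed, known rate, and one day's 50 cases as a simple binomial sample, tested with a normal approximation. Marginlab's actual statistical method is not described here and may differ.

```python
import math


def one_sample_z(p_observed: float, p_baseline: float, n: int) -> float:
    """z statistic for an observed pass rate against a fixed baseline rate."""
    standard_error = math.sqrt(p_baseline * (1 - p_baseline) / n)
    return (p_observed - p_baseline) / standard_error


def two_sided_p(z: float) -> float:
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# Figures from the list above: 56% baseline, 50% current, 50 cases per day.
z = one_sample_z(p_observed=0.50, p_baseline=0.56, n=50)
p = two_sided_p(z)
print(f"z = {z:.2f}, p = {p:.2f}")
```

Under those assumptions the result is a z of about -0.85 and a p-value near 0.39, nowhere close to the usual 0.05 threshold. In raw terms, 50% versus 56% of 50 cases is 25 passes versus 28, a difference of three tests.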

Marginlab transparently notes their methodology: “We benchmark in Claude Code CLI with the SOTA model (currently Opus 4.6) directly, no custom harnesses.”

Artificial Analysis reported that Opus 4.6 used 30% to 60% more tokens than Opus 4.5 on their GDPval AA benchmark, making it the most costly model they had tested.

LiveBench slightly favors Opus 4.6 over Opus 4.5 overall, which argues against a simple “the model got worse everywhere” narrative.

The pattern: Opus 4.6 is not universally degraded, but it is burning more tokens and showing degraded performance on sustained, multi-step coding workflows.

The Context Window Illusion

One of the quieter complaints involves Opus 4.6’s advertised 1 million token context window, which became generally available on March 13, 2026.

A detailed GitHub bug report found that during heavy Claude Code sessions, the model’s performance degraded well before hitting 50% of that window:

At 20% usage: Circular reasoning and forgotten decisions appeared
At 40% usage: Context compression kicked in, wiping scrollback history
At 48% usage: The model itself told the user it was not being effective and recommended starting a fresh session

If the effective high-quality context is roughly 400,000 tokens (the 40% mark where compression kicked in), advertising 1 million feels misleading.
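To put those percentages in token terms, here is a small sketch that converts each reported usage level into a token count. It assumes the full advertised window of 1,000,000 tokens; the constant and helper names are illustrative.

```python
CONTEXT_WINDOW_TOKENS = 1_000_000  # advertised window size

# Usage levels and symptoms as described in the bug report summary above.
reported = [
    (20, "circular reasoning and forgotten decisions"),
    (40, "context compression, scrollback wiped"),
    (48, "model recommends a fresh session"),
]


def tokens_at(percent: int, window: int = CONTEXT_WINDOW_TOKENS) -> int:
    """Token count corresponding to a whole-number percentage of the window."""
    return window * percent // 100


for percent, symptom in reported:
    print(f"{percent:>3}% = {tokens_at(percent):>9,} tokens: {symptom}")
```

By this arithmetic the first symptoms appear at about 200,000 tokens, a fifth of the advertised figure.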

The user asked: “Should this be communicated rather than marketing the full figure?”

Anthropic has not publicly answered that question.

What This Means For Engineers

The Claude Code degradation saga is not just a bug report. It is a consumer rights crisis hiding in plain sight. When you pay a subscription fee for a specific AI model, do you have any right to actually receive that model’s advertised capabilities?

Here are the takeaways for engineers working with AI coding agents:

Trust the logs, not the marketing. Laurenzo’s forensic approach shows the right way to document degradation. Track session metrics, reasoning depth, API retry counts, and file read patterns over time. When something feels wrong, the data will either confirm it or rule it out.
Demand transparency on serving parameters. Adaptive thinking, effort levels, prompt caching TTLs, and context compression thresholds directly impact your workflows. These should be documented in release notes, not discovered through reverse engineering.
Build escape hatches into your tooling. If your entire dev workflow depends on one AI provider, you are exposed to silent downgrades. Keep fallback models ready. Test them regularly. Know your migration path.
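The first takeaway can start very small. The sketch below groups session records by month, takes the median of each metric, and flags months that fall well below the first month. The `SessionRecord` fields, the 50% threshold, and the sample data are all assumptions for illustration. This is not a real Claude Code log schema and not Laurenzo's pipeline.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from statistics import median


@dataclass(frozen=True)
class SessionRecord:
    """One logged session. Hypothetical fields, not a real Claude Code log schema."""
    day: date
    thinking_chars: int
    files_read_before_edit: int
    api_calls: int


METRICS = ("thinking_chars", "files_read_before_edit", "api_calls")


def monthly_medians(records):
    """Return {(year, month): {metric: median}} in chronological order."""
    by_month = defaultdict(list)
    for record in records:
        by_month[(record.day.year, record.day.month)].append(record)
    return {
        month: {m: median(getattr(r, m) for r in rows) for m in METRICS}
        for month, rows in sorted(by_month.items())
    }


def flag_drops(medians, metric, threshold=0.5):
    """Months where `metric` fell more than `threshold` below the first month."""
    months = list(medians)
    baseline = medians[months[0]][metric]
    return [m for m in months[1:] if medians[m][metric] < baseline * (1 - threshold)]


# Invented sample data, for illustration only.
records = [
    SessionRecord(date(2026, 1, 12), 2000, 7, 40),
    SessionRecord(date(2026, 1, 20), 2400, 6, 50),
    SessionRecord(date(2026, 3, 10), 500, 2, 900),
    SessionRecord(date(2026, 3, 18), 700, 2, 1100),
]

medians = monthly_medians(records)
print(medians)
print(flag_drops(medians, "thinking_chars"))
```

Medians are used rather than means so that a handful of unusually long or short sessions does not swing the monthly figure.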

See you next week with more exciting content!

Signing Off, Scortier