Claude’s C Compiler makes a mess of benchmarks
Claude’s C Compiler can build the Linux kernel, but on SPEC CPU2017 it often cuts performance by 70%+ and even crashes some runs.

Chips and Cheese published a sharp April 2026 test of Claude’s AI-written C compiler, and the numbers are hard to ignore. On the SPEC CPU2017 suite, the compiler often turned ordinary C programs into code that ran more than 70% slower than GCC, and one build even crashed on Arm’s Cortex X925.
That makes the article a useful reality check for anyone treating AI coding tools as if they are ready to replace mature systems software. Claude’s C Compiler, or CCC, can compile the Linux kernel, but that fact alone says very little about how well it handles real workloads, compiler diagnostics, or code generation quality.
What CCC actually did to a simple benchmark
The first test in the article is a tiny dependent-load microbenchmark, the kind of thing hardware folks use to measure memory latency. The source is simple: each array access feeds the next one, so the compiler cannot hide the dependency chain. GCC and CCC both produced working code, but CCC generated a much longer instruction sequence.

That extra length mattered. On x86-64, CCC split operations into separate shift, add, and load instructions, then added extra register shuffling. On aarch64, it did the same sort of thing while also moving values through the stack more than necessary. The result was not a little noise around the edges. It changed the measured latency.
The article’s measurements show that Zen 5 and Intel’s Lion Cove each took about two extra cycles on this benchmark when compiled with CCC. Arm’s newer cores took a bigger hit, around 6 to 7 cycles, while smaller cores and in-order designs suffered even more because they have less execution width to hide the extra work.
- CCC expands a short dependency chain into roughly nine instructions in some cases
- Zen 5 and Lion Cove absorb much of the overhead through out-of-order execution
- Arm cores without strong move elimination take a larger latency hit
- In-order cores get hit the hardest because they cannot hide extra instruction count
That’s the interesting part: CCC is not just “AI wrote slower code.” It exposes how much modern CPUs depend on features such as move elimination, store forwarding, decode width, and deep reorder buffers. When a compiler emits clumsy code, those microarchitectural details stop being background trivia and become the whole story.
When the compiler ignores type errors
Chester Lam also used a deliberately broken C example to show how CCC handles type problems. GCC refused to build the code and emitted the usual type mismatch errors. CCC, by contrast, produced an executable.
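The article doesn't reproduce its exact snippet, but a minimal example of the kind of constraint violation a conforming C compiler must diagnose looks like this (this sketch is my own, not the article's test case; GCC refuses to build it, while per the article CCC produced a binary from its equivalent):

```c
struct point { int x, y; };

int main(void) {
    struct point p = 42;  /* error: invalid initializer (scalar for a struct) */
    int *q = p;           /* error: incompatible types (struct used as pointer) */
    return *q;
}
```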
That sounds funny until you remember why compilers have diagnostics in the first place. Type checking is one of the few things that keeps large C codebases from collapsing into undefined behavior and hard-to-debug runtime failures. A compiler that happily accepts bad code may feel convenient in the moment, but it pushes the cost downstream.
“Our society has gone all-in on AI.” — Chester Lam, Chips and Cheese
That line is written in the article’s satirical voice, but the technical point underneath it is serious. If an AI compiler is willing to paper over mistakes instead of flagging them, then its apparent productivity gain comes with a hidden bill: later failures, worse portability, and more time spent chasing bugs that should have been caught at compile time.
This is also where the article is more useful than a simple “AI bad” take. It shows that compiler quality is not one number. You have to evaluate code generation, diagnostics, optimization behavior, and architecture-specific tuning together. CCC only looks impressive if you stop the evaluation at “it compiled.”
SPEC CPU2017 gives the harsher verdict
The real stress test came from SPEC CPU2017, where the article ran eight C-only workloads on three cores: Arm Cortex X925 in Nvidia’s GB10, AMD Zen 5 in a Ryzen 7 9800X3D, and Intel Lion Cove in a Core Ultra 9 285K. Boost was disabled on the desktop chips to keep the comparison cleaner.

The results were ugly. CCC’s 502.gcc build crashed with a segmentation fault on Cortex X925. On x86-64, when the run completed, the CCC build of 502.gcc landed at 23.6% of GCC’s performance on Lion Cove and 27.1% on Zen 5. Across the eight workloads, the average regression was a bit worse than 70%.
That average hides some variation, but not enough to change the big picture. CCC hit its best relative showing on 505.mcf, where the slowdown was still below 35% compared with GCC. In several other workloads, the drop was much larger, especially on Arm.
- Tested cores: Arm Cortex X925, AMD Zen 5, Intel Lion Cove
- Workloads: 8 C-only SPEC CPU2017 benchmarks
- Worst public result in the article: 502.gcc crashing on Cortex X925
- Best relative CCC result: 505.mcf, still under 35% slower than GCC
- Average CCC regression: a little over 70%
There’s also a useful comparison buried in the numbers. With GCC, Cortex X925 was competitive with Zen 5 and Lion Cove in several workloads. With CCC, that balance disappeared. Arm lost in nearly every test except one, and even there Lion Cove stayed ahead. In other words, the compiler changed the ranking of the chips, which is a reminder that benchmark results are never just about silicon.
One especially telling detail is the article’s IPC discussion. CCC sometimes produced very high instruction-per-cycle numbers, including a 6.09 IPC average on Arrow Lake for 525.x264. That sounds impressive until you remember that IPC is only useful if the instructions are doing useful work. CCC often got higher IPC by emitting far more instructions overall.
Why the hardware reacted so differently
The article’s top-down analysis explains why the same bad compiler output hurt some cores more than others. Zen 5 and Lion Cove have strong out-of-order machinery, wide execution engines, and better tolerance for ugly dependency chains. Arm’s Cortex X925 has less forgiving behavior when code leans on register moves and stack traffic.
Zen 5 also has a large op cache that can feed its front end efficiently, but the article shows one workload, 500.perlbench, where CCC’s bloated instruction footprint pushed it into a decoder bottleneck. That is a neat example of a rare case where front-end design details suddenly matter in a very visible way.
For readers who care about compiler quality, the practical lesson is simple: a compiler can be “correct” and still be a bad fit for real workloads. If it inflates instruction count, weakens locality, or mishandles stack traffic, the CPU has to pay for every one of those mistakes.
- Zen 5’s op cache usually hides front-end pain, but not always
- Lion Cove has a wide decoder and strong out-of-order resources
- Cortex X925 appears more sensitive to move and stack-heavy code
- CCC often increases instruction count far more than branch count
That last point matters because it explains why branch prediction was not the main story here. The code got bigger, but not always branchier. So the bottleneck shifted toward retirement, backend execution, and memory behavior instead of classic control-flow problems.
For an AI-generated compiler, this is the kind of failure mode that should worry people most. It is easy to imagine an AI tool that passes a smoke test, builds a toy program, and even compiles a kernel. It is much harder to get one that consistently respects architecture-specific tradeoffs across a benchmark suite built to stress the details.
What this says about AI coding tools
The article is satire, but it lands because the technical evidence is real enough to be uncomfortable. AI coding systems are getting better at producing plausible code, yet compilers are among the least forgiving places to fake competence. One bad optimization pass can turn into a measurable slowdown across an entire benchmark suite.
There’s also a broader product lesson for teams building AI tools. If your system writes code, the bar is not “it usually works.” The bar is “it preserves correctness, catches bad inputs, and does not destroy performance on real hardware.” That is a much harder target, especially in low-level software where every extra move instruction can matter.
So the practical takeaway is not to dismiss AI tools, but to test them in the ugliest places first. If a compiler cannot survive a benchmark like SPEC CPU2017 without massive regressions, then kernel builds and happy-path demos are not enough proof.
My prediction: AI-written compilers and code generators will keep improving, but the first teams to trust them in production will still need human review, benchmark gates, and architecture-specific validation. The real question is not whether AI can generate code. It is whether it can keep pace with the hardware details that still decide who wins a benchmark run.