Shannon Scaling Law explains LLM overtraining

OraCore Editors

Back to home

[RSCH] May 25, 20268 min readOraCore Editors

Shannon Scaling Law explains LLM overtraining

A Shannon-based scaling law explains why LLMs can get worse as compute rises under noise.

noise Shannon theory quantization overtraining LLM scaling laws

Share LinkedIn

Shannon Scaling Law explains LLM overtraining

A Shannon-based scaling law explains why LLMs can get worse as compute rises under noise.

Research org: Unspecified in arXiv abstract
Core data: pooled R² = 0.847
Breakthrough: Models training as information transmission over a noisy channel

LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws tries to fix a real gap in how we think about model scaling: the old story says bigger models and more tokens should keep improving performance, but that breaks down when noise starts to dominate. For engineers, that matters because the failure mode is not just “diminishing returns.” The paper is arguing that once signal-to-noise ratio slips below a threshold, more scale can push a model into a worse regime instead of a better one.

The paper’s core claim is practical: LLM training can be viewed as information transmission, with model parameters acting like channel bandwidth and training tokens acting like signal power. That lets the authors borrow from the Shannon-Hartley theorem and define a Shannon capacity for LLMs. In plain English, they are saying there is a limit to how much useful learning signal the system can carry before intrinsic noise takes over.

What problem this paper is trying to fix

Get the latest AI news in your inbox

Weekly picks of model releases, tools, and deep dives — no spam, unsubscribe anytime.

No spam. Unsubscribe at any time.

Classic scaling laws are usually monotonic power laws. They assume that if you keep adding compute, parameters, or data, the loss keeps improving in a smooth direction. But the abstract points out that this picture does not explain non-monotonic behavior such as catastrophic overtraining and quantization-induced degradation, where performance gets worse even though compute goes up.

That is the kind of failure mode developers actually run into when they push training or deployment past a comfortable zone. A model can look fine during early scaling, then suddenly become unstable, overfit, or degrade after a perturbation like quantization. The paper is trying to give those cases a single theoretical frame instead of treating them as unrelated edge cases.

The important shift here is that the authors are not just saying “noise matters.” They are saying noise changes the shape of the scaling curve itself. Under their formulation, increasing model size or data without keeping enough SNR does not guarantee better performance; it can create a transition from monotonic improvement to U-shaped degradation.

How the method works in plain English

The Shannon Scaling Law maps familiar training quantities onto communication theory. Model parameters become bandwidth, and training tokens become signal power. Once that mapping is in place, the training process looks less like a simple optimization curve and more like a noisy transmission problem with a capacity limit.

That framing is useful because it makes the interaction between learning signal and intrinsic noise explicit. Instead of assuming scale always helps, the theory asks whether the model is still operating inside its usable capacity. If not, the extra scale can amplify noise rather than extract more signal.

In the abstract’s terms, this is what allows the law to capture both monotonic and non-monotonic behavior. It is designed to explain why some runs improve smoothly while others show a loss basin or a U-shaped curve. The paper claims the formulation can also account for perturbation-driven degradation, including Gaussian noise and quantization.

What the paper actually shows

The authors validate the theory on Pythia and OLMo2, using perturbations and task settings that include Gaussian noise, quantization, and supervised fine-tuning on math, QA, and code tasks. The abstract does not provide the full benchmark table, so there are no task-by-task accuracy numbers here to quote. What it does say is that the new law consistently outperforms classical scaling laws and recent perturbation-aware laws.

One concrete result stands out: fitted on Pythia models up to 6.9B parameters and up to 180B tokens, the model extrapolates to the unseen 12B Pythia model up to 307B tokens with pooled R² = 0.847. That is the kind of result that matters if you care about forecasting training curves before you spend the full budget.

The abstract also says the Shannon Scaling Law captures loss basins that prior approaches miss. That suggests it is not just fitting a trend line better; it is modeling the shape of degradation more faithfully when perturbations or overtraining kick in. Still, the source material gives only the headline metrics, so any deeper comparison would need the full paper.

Why developers should care

If you train or fine-tune models, this paper is basically a warning label for scale-at-all-costs thinking. It suggests that bigger checkpoints and longer runs are not inherently safer or better if the signal-to-noise ratio is not preserved. That has obvious implications for overtraining, quantization, and any workflow where you push a model through perturbations after training.

For teams doing capacity planning, the practical value is in prediction. A law that can extrapolate from smaller models and shorter token budgets to larger, unseen settings can help estimate where a run might stop paying off. Even if you do not adopt the exact theory, the paper reinforces a useful engineering habit: evaluate scaling under noise, not just in the cleanest possible setting.

For deployment, the quantization angle is especially relevant. Quantization is often treated as a mostly separate compression problem, but this paper places it in the same scaling conversation as training noise and overtraining. That makes it easier to reason about why a model that looked good in full precision can degrade after compression.

What is still unclear

The abstract is strong on theory and headline validation, but it leaves several practical questions open. It does not give the full benchmark breakdown, so we do not know exactly how much better the new law performs on each task or perturbation. It also does not say how the theory behaves across other model families beyond Pythia and OLMo2.

There is also a broader caution with any scaling law: a good fit does not automatically mean a universal operational rule. The paper claims a unified framework, but the source material only shows validation on a limited set of models, perturbations, and tasks. Developers should read it as a better lens for thinking about failure modes, not as a guarantee that every noisy training run will follow the same curve.

Still, the main takeaway is clear. This paper argues that LLM scaling is not just a matter of “more is better.” It is a capacity problem, and once noise overwhelms signal, the curve can bend the wrong way.

Bottom line

The useful contribution here is not another generic scaling law. It is a Shannon-style explanation for why model performance can collapse or turn U-shaped under noise, even as compute rises. If you build, fine-tune, quantize, or forecast LLMs, that is a framing worth keeping in your toolbox.

It reframes LLM training as noisy information transmission.
It explains non-monotonic failures like overtraining and quantization degradation.
It extrapolates well on Pythia in the reported setting, with pooled R² = 0.847.

// Related Articles

Shannon Scaling Law explains LLM overtraining

What problem this paper is trying to fix

Get the latest AI news in your inbox

How the method works in plain English

What the paper actually shows

Why developers should care

What is still unclear

Bottom line

CRDTs keep replicas in sync without locks

Post-Deterministic Systems for Autonomous Infra

Causal methods for measuring task learnability

RL Training That Hands Off Control Gradually

OmniGameArena benchmarks VLM game agents better

TurboQuant cuts KV cache memory 6x in Google tests