TurboQuant, EDEN, and the citation fight
TurboQuant’s KV-cache quantization claims are under fire: EDEN’s authors say the paper repackages older ideas with weaker scale choices and shaky benchmarks.

TurboQuant entered the conversation with a bold claim: 6x compression for KV-cache quantization. But the debate around it quickly moved away from compression ratios and into a more uncomfortable question for ML research: who actually did the underlying work first?
On Hacker News, the authors behind a new note argued that TurboQuant is a restricted version of DRIVE and EDEN, with weaker scale choices and less accurate results. That is a strong accusation, and it matters because this is not a tiny corner of the field. KV-cache compression affects inference cost, latency, and memory pressure in large language models.
What TurboQuant actually claims
TurboQuant is about compressing the KV cache used during transformer inference. That cache stores past key and value vectors so the model can attend to earlier tokens without recomputing them. The tradeoff is simple: the longer the context, the more memory the cache consumes, and the more there is to gain from quantizing it.
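To put a number on that, here is a back-of-the-envelope sizing sketch. Every shape below is an illustrative assumption (a hypothetical 32-layer model with grouped KV heads and a 128k context), not a figure from the TurboQuant paper:

```python
# Rough KV-cache sizing; all shape numbers are illustrative assumptions,
# not the TurboQuant evaluation setup.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value):
    # Factor of 2: one tensor of keys and one of values per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

fp16 = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                      seq_len=128_000, bytes_per_value=2)
print(f"fp16 cache: {fp16 / 1e9:.1f} GB")              # ~16.8 GB
print(f"at the claimed 6x: {fp16 / 6 / 1e9:.1f} GB")   # ~2.8 GB
```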

The controversy starts because TurboQuant’s core method does not appear to be a clean new quantizer. In the HN thread, the EDEN authors say the paper uses an older quantization recipe, then fixes the scale parameter in a way that is easier to describe but worse in practice. They also say the paper mixes a biased multi-bit step with an unbiased 1-bit residual step, which creates extra error compared with using EDEN directly.
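To see what biased versus unbiased means here, the sketch below contrasts the two rounding styles in a toy residual pipeline: a deterministic round-to-nearest pass (biased for values off the grid) followed by a 1-bit stochastic pass whose expected output equals its input. This is only the general pattern under simple assumptions, not TurboQuant’s or EDEN’s actual construction:

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_deterministic(x, scale, bits):
    # Round-to-nearest on a uniform grid: low variance, but biased
    # for any value that does not land exactly on a grid point.
    levels = 2 ** (bits - 1) - 1
    q = np.clip(np.round(x / scale * levels), -levels, levels)
    return q / levels * scale

def quantize_sign_unbiased(x, scale):
    # 1-bit stochastic quantizer: E[output] == x whenever |x| <= scale,
    # the unbiasedness property at the center of the EDEN critique.
    p = np.clip((x / scale + 1) / 2, 0, 1)  # probability of outputting +scale
    return np.where(rng.random(x.shape) < p, scale, -scale)

x = rng.standard_normal(8)
coarse = quantize_deterministic(x, scale=3.0, bits=2)
residual = x - coarse
fine = quantize_sign_unbiased(residual, scale=np.abs(residual).max())
print("mean reconstruction error:", np.abs(x - (coarse + fine)).mean())
```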
Here are the main claims being debated:
- TurboQuant is framed as new, but critics say it is a restricted EDEN variant.
- The scale is fixed in TurboQuant, while EDEN derived optimal scale settings.
- The residual quantization step is weaker than the unbiased EDEN setup.
- The paper’s headline compression story may overstate how much is actually novel.
That last point is where the argument gets spicy. If a paper presents a familiar method with a new application, that can still be useful work. But if the paper reads like a fresh algorithm while borrowing most of its machinery from earlier papers, the citation trail matters a lot more.
And this is where the HN discussion became unusually specific. One commenter noted that the TurboQuant paper cites HIGGS and Cache Me If You Must, while another pointed out that the older EDEN papers already cover the same post-rotation quantization ideas more directly. The dispute is no longer about whether the method works at all. It is about how much of it was already known.
Why the prior work matters
To understand the criticism, you need the short version of the lineage. DRIVE introduced post-rotation distribution-aware quantization in 2021, and EDEN extended that idea to more bit widths and scale settings. The HN thread says those papers already gave the right derivations for choosing scales, while TurboQuant used a simpler but weaker version.
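To make “post-rotation distribution-aware quantization” concrete, here is a minimal sketch of the family idea only: rotate a vector by a random orthogonal matrix so its coordinates look roughly i.i.d. Gaussian, then apply one scalar quantizer with a distribution-derived scale. The plain QR-based rotation is a stand-in assumption; the actual papers use fast structured transforms:

```python
import numpy as np

rng = np.random.default_rng(1)

def random_rotation(d):
    # QR of a Gaussian matrix yields a uniformly random orthogonal matrix.
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))  # sign fix for a Haar-distributed Q

d = 64
R = random_rotation(d)
x = rng.standard_normal(d) * np.linspace(0.1, 5.0, d)  # badly scaled input

rotated = R @ x
# After rotation the coordinates are close to i.i.d. Gaussian, so a single
# distribution-aware scale can serve the whole vector.
scale = rotated.std()
q = np.round(rotated / scale)      # simple scalar quantization
x_hat = R.T @ (q * scale)          # dequantize, then undo the rotation
print("relative error:", np.linalg.norm(x - x_hat) / np.linalg.norm(x))
```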
That matters because quantization papers can look similar on the surface while differing a lot in the details. A fixed scale can be easier to implement. A derived optimal scale can improve mean squared error. A biased quantizer can behave differently from an unbiased one. Once you start chaining these choices together, the error budget changes quickly.
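The scale point in particular is easy to see numerically. The toy comparison below assumes Gaussian data and a plain round-to-nearest quantizer, and contrasts a fixed scale with a scale swept to minimize mean squared error; the sweep is a crude stand-in for the closed-form derivations the critics attribute to EDEN:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(100_000)

def mse_at_scale(x, scale, bits=2):
    levels = 2 ** (bits - 1) - 1
    q = np.clip(np.round(x / scale * levels), -levels, levels)
    return np.mean((x - q / levels * scale) ** 2)

fixed = mse_at_scale(x, scale=1.0)  # the "easy to describe" fixed choice
# Crude stand-in for a derived optimum: sweep candidate scales, keep the best.
best = min(mse_at_scale(x, s) for s in np.linspace(0.5, 4.0, 200))
print(f"fixed-scale MSE: {fixed:.4f}, tuned-scale MSE: {best:.4f}")
```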
“We were the first to introduce post-rotation distribution-aware quantization in 2021. This was later implemented in many fields, including federated learning, vector retrieval, databases, inference engines, and KV-cache.”
That quote from the Hacker News discussion captures the real issue: credit. In research, being first is not just a vanity metric. It affects who gets cited, who gets invited to speak, and which papers become the base layer for later systems work.
The same thread also points to a broader pattern that ML folks know too well. A method can be rediscovered in a product blog, a benchmark repo, or an implementation note, then relabeled as if it arrived from nowhere. When that happens, the original paper often gets less attention than the derivative version with better packaging.
There is also a technical reason to care. If TurboQuant is really a weaker version of EDEN, then anyone choosing it for production inference may be leaving performance on the table. In a memory-sensitive system, a small quantization penalty can turn into higher latency, lower throughput, or both.
Comparing the numbers and the benchmarks
The strongest criticism in the thread is not philosophical. It is numerical. One commenter said TurboQuant’s “6x compression” headline is hard to compare with earlier KV-cache baselines like KIVI, and that the paper’s RaBitQ comparison used a single-core CPU for the baseline but an A100 GPU for TurboQuant. That is not a fair benchmark setup.

Here is the comparison as described in the discussion and linked note:
- TurboQuant headline: 6x compression for KV-cache.
- EDEN comparison: the note says 2-bit EDEN beats 3-bit TurboQuant in some settings.
- Accuracy claim: the note says unbiased EDEN is often more accurate than TurboQuant by more than a full bit of precision.
- Benchmark setup concern: one baseline reportedly ran on a single CPU core while TurboQuant used an A100 GPU.
Those numbers are enough to change how you read the paper. If a method wins on a benchmark because it gets a better device, a better implementation, or a friendlier comparison target, the headline result stops being useful for engineers.
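A fairer setup does not need much machinery. Below is a minimal sketch of an apples-to-apples harness, where `quantize_a` and `quantize_b` are hypothetical stand-ins for the methods under test; the point is that both see identical data, the same process, and the same device:

```python
import time
import numpy as np

rng = np.random.default_rng(3)

def quantize_a(x, bits=2):
    # Hypothetical stand-in: fixed-scale round-to-nearest.
    levels = 2 ** (bits - 1) - 1
    s = 1.0
    return np.clip(np.round(x / s * levels), -levels, levels) / levels * s

def quantize_b(x, bits=2):
    # Hypothetical stand-in: data-dependent scale, same grid.
    levels = 2 ** (bits - 1) - 1
    s = 2.0 * float(x.std())
    return np.clip(np.round(x / s * levels), -levels, levels) / levels * s

def bench(fn, x, warmup=3, iters=20):
    # Same data, same process, same device for every method: the only
    # variable left free is the method itself.
    for _ in range(warmup):
        fn(x)
    start = time.perf_counter()
    for _ in range(iters):
        out = fn(x)
    elapsed = (time.perf_counter() - start) / iters
    err = float(np.linalg.norm(x - out) / np.linalg.norm(x))
    return elapsed, err

kv_block = rng.standard_normal((4096, 128)).astype(np.float32)
for name, fn in [("method_a", quantize_a), ("method_b", quantize_b)]:
    t, err = bench(fn, kv_block)
    print(f"{name}: {t * 1e3:.2f} ms/iter, relative error {err:.4f}")
```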
The OpenReview thread linked in the HN comments adds another layer: reproducibility concerns. If reported accuracy cannot be reproduced cleanly, then even a nice compression ratio is only half a result. Engineers need methods they can test, not just methods that look good in a PDF.
This is also where the vLLM implementation note becomes interesting. The docs for TurboQuant in vLLM describe the technique as a scalar case of HIGGS-style quantization applied to KV-cache compression. That framing suggests the idea is already part of a larger family of methods, which makes the “new invention” story even harder to defend.
What this says about AI research right now
This story is bigger than one paper. It shows how easily a method can be recast when it moves from theory papers into systems code, benchmark repos, or a blog post with a catchy name. The original math may be old, the implementation may be useful, and the packaging may be what gets attention.
That creates a weird incentive structure. If you are a researcher, you want your work to be cited correctly. If you are an engineer, you want the method that actually performs best under real constraints. If you are a startup or infrastructure team, you want the version that is easiest to ship. Those goals overlap, but they are not the same thing.
My read is simple: TurboQuant may still be useful as an implementation story, but it should be discussed as part of the EDEN/DRIVE lineage, not as a fresh break from prior work. If the HN critique holds up under review, the paper will be remembered less for its compression ratio and more as a case study in weak attribution and shaky benchmarking.
For teams working on KV-cache compression today, the practical move is to check the older papers first, then compare on the same hardware, same bit width, and same accuracy target. If 2-bit EDEN really beats 3-bit TurboQuant in your setup, that is the result that should drive the decision. The next question is whether the community starts treating citation hygiene as part of model quality, or keeps separating the math from the paper trail.