5 KV cache takeaways for llama.cpp users
5 takeaways from TurboQuant: under-3-bit KV cache compression, memory savings, and the tradeoffs llama.cpp users should watch.

TurboQuant shows how KV cache compression could cut memory use with little quality loss.
Google Research’s TurboQuant claim is simple to state and hard to ignore: the KV cache can be compressed to under 3 bits per value with near-zero accuracy loss. In the llama.cpp discussion, one corrected benchmark also showed q4_0 reducing KV memory by 72% relative to f16 on a DGX Spark test setup.
1. TurboQuant may shrink KV cache far beyond today’s common formats
The headline claim is that TurboQuant can compress the KV cache to under 3 bits while keeping accuracy losses close to zero. That matters because KV cache growth is one of the main reasons long-context inference gets expensive in memory.

In practical terms, the discussion frames TurboQuant as a possible next step beyond the familiar fp16, q8_0, and q4_0 cache options. If the paper’s results hold up in real deployments, model serving could keep more context in memory without the usual cost spike.
- Claimed target: under 3 bits per KV value
- Reported accuracy impact: near zero
- Primary benefit: lower memory pressure at long context
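To see why bits per value matter at long context, here is a rough sizing sketch. The model shape below is hypothetical and chosen only for round numbers, and the 3-bit figure simply plugs in the paper's claimed ceiling while ignoring any scale or metadata overhead a real format would carry.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bits_per_value):
    """Approximate KV cache size: one K and one V vector per layer, head, and position."""
    n_values = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # 2 = keys + values
    return n_values * bits_per_value / 8

# Hypothetical model shape (an assumption for illustration, not from the thread).
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=131072)

for label, bits in [("f16", 16), ("3 bits (claimed ceiling, no overhead)", 3)]:
    gib = kv_cache_bytes(**shape, bits_per_value=bits) / 2**30
    print(f"{label}: {gib:.2f} GiB")
# f16: 16.00 GiB
# 3 bits (claimed ceiling, no overhead): 3.00 GiB
```

The cache grows linearly with context length, so the bits-per-value factor is the only lever that shrinks it without shortening the context or changing the model.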
2. Memory savings are already visible in current cache quantization
Even before TurboQuant lands in mainstream tooling, the discussion includes corrected measurements that show why KV quantization matters. On a DGX Spark GB10 setup, q4_0 cut KV buffer use from 768 MiB to 216 MiB, while q8_0 landed at 408 MiB.
Those numbers are useful because they give a concrete baseline for what cache quantization can buy today. For teams tuning inference on limited GPU memory, the difference between fp16 and q4_0 can decide whether a long-context workload fits at all.
- f16 KV buffer: 768 MiB
- q8_0 KV buffer: 408 MiB, or 47% less KV memory
- q4_0 KV buffer: 216 MiB, or 72% less KV memory
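The reported buffer sizes line up with simple block arithmetic. The sketch below assumes the usual ggml layout for q8_0 and q4_0 (32 values per block plus one 16-bit scale), which works out to 8.5 and 4.5 bits per value; the function names are illustrative only.

```python
BLOCK_SIZE = 32   # values per quantization block (assumed ggml layout)
SCALE_BITS = 16   # one 16-bit scale stored per block

def bits_per_value(quant_bits):
    """Effective bits per value once the per-block scale is included."""
    return (BLOCK_SIZE * quant_bits + SCALE_BITS) / BLOCK_SIZE

def kv_buffer_mib(f16_mib, quant_bits):
    """Scale an f16 buffer size by the ratio of effective bits to 16."""
    return f16_mib * bits_per_value(quant_bits) / 16

F16_MIB = 768  # f16 KV buffer reported in the thread
for name, qbits in [("q8_0", 8), ("q4_0", 4)]:
    mib = kv_buffer_mib(F16_MIB, qbits)
    saving = 100 * (1 - mib / F16_MIB)
    print(f"{name}: {bits_per_value(qbits)} bits/value, {mib:.0f} MiB, {saving:.0f}% less")
# q8_0: 8.5 bits/value, 408 MiB, 47% less
# q4_0: 4.5 bits/value, 216 MiB, 72% less
```

The per-block scale is why q4_0 saves 72% rather than a clean 75%.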
3. Prompt throughput is not the whole story
One corrected benchmark in the thread shows prompt throughput stayed the same across cache types, even at 110K context. That is a useful reminder that prefill and decode behave differently, and that a cache change may not affect every stage equally.

The more important slowdown appeared during generation at long context, where q4_0 fell behind fp16 by 36.8% at 110K. The thread argues that per-token dequantization is the bottleneck, which is exactly the kind of overhead TurboQuant aims to remove.
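A throughput gap is easier to reason about as time per token. This small sketch uses the 110K generation figures reported in the thread (38.0 tok/s for f16, 24.0 tok/s for q4_0); attributing the whole difference to dequantization is the thread's argument, not something the arithmetic proves.

```python
def ms_per_token(tokens_per_second):
    """Convert decode throughput into average latency per generated token."""
    return 1000.0 / tokens_per_second

f16_tps, q4_0_tps = 38.0, 24.0  # 110K-context generation tok/s from the thread

delta_pct = 100 * (q4_0_tps - f16_tps) / f16_tps
extra_ms = ms_per_token(q4_0_tps) - ms_per_token(f16_tps)

print(f"throughput delta: {delta_pct:.1f}%")
print(f"extra time per generated token: {extra_ms:.1f} ms")
# throughput delta: -36.8%
# extra time per generated token: 15.4 ms
```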
110K context generation tok/s on the corrected test
- f16: 38.0
- q4_0: 24.0
- Delta: -36.8%
4. The llama.cpp ecosystem is already testing variants
The discussion is not just about one paper. It also mentions NVIDIA’s KVTC work, MLX developer interest, and a forked implementation in TheTom’s llama-cpp-turboquant repo. That suggests the idea is moving from theory toward implementation experiments.
Several comments also point to CUDA, HIP/ROCm, InnerQ, and prefill optimizations in different branches and forks. For readers tracking production readiness, the key signal is that the community is already comparing code paths, bug fixes, and block-size choices rather than only discussing the paper.
- Google Research blog and paper introduced the method
- llama.cpp discussion collected implementation interest
- Forks are testing CUDA, ROCm, and prefill changes
5. Benchmarks need careful methodology
The thread includes a useful correction from a benchmark author who first reported a dramatic collapse in prompt throughput, then later found the measurement was wrong. The corrected result: prompt throughput was unchanged, and the earlier apparent memory paradox came from measuring process RSS instead of reading GPU memory reporting.
That correction is a good warning for anyone evaluating KV cache work. If you are testing TurboQuant or any cache format, you need to measure the right memory source, separate prefill from decode, and check for silent request failures before drawing conclusions.
- Use nvidia-smi plus internal KV buffer reporting for GPU memory
- Measure prefill and decode separately
- Verify that failed requests are excluded from throughput calculations
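The last two checks can be sketched as a small aggregation step. Everything here is hypothetical: the `RequestResult` record, its field names, and the sample numbers are made up for illustration and do not come from the thread or from any llama.cpp tooling.

```python
from dataclasses import dataclass

@dataclass
class RequestResult:
    """One benchmark request (hypothetical record shape)."""
    ok: bool
    prompt_tokens: int
    prefill_seconds: float
    generated_tokens: int
    decode_seconds: float

def throughput(results):
    """Report prefill and decode tok/s separately, counting only successful requests."""
    good = [r for r in results if r.ok]
    if not good:
        raise ValueError("no successful requests to measure")
    prefill_tps = sum(r.prompt_tokens for r in good) / sum(r.prefill_seconds for r in good)
    decode_tps = sum(r.generated_tokens for r in good) / sum(r.decode_seconds for r in good)
    return {
        "prefill_tps": prefill_tps,
        "decode_tps": decode_tps,
        "failed": len(results) - len(good),
    }

# Made-up sample: the failed request returned quickly and would inflate
# throughput if it were averaged in.
sample = [
    RequestResult(True, 1000, 2.0, 100, 4.0),
    RequestResult(False, 1000, 0.1, 0, 0.0),
    RequestResult(True, 3000, 6.0, 300, 12.0),
]
print(throughput(sample))
# {'prefill_tps': 500.0, 'decode_tps': 25.0, 'failed': 1}
```

Reporting the failure count next to the throughput numbers makes a silent failure visible instead of letting it distort the averages.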
How to decide
If you care most about serving longer contexts on limited memory, TurboQuant is the item to watch. If you need something practical right now, the corrected q4_0 and q8_0 numbers in the thread show that existing cache quantization already delivers large memory savings.
If you are benchmarking or maintaining inference code, the safest takeaway is to treat KV cache as a separate performance axis. Memory, prompt speed, and decode speed can move in different directions, so the right choice depends on which bottleneck hurts your workload most.