TurboQuant vs FP8: vLLM’s first broad test
vLLM found FP8 KV-cache quantization beats TurboQuant on speed, while TurboQuant’s strongest variants hurt accuracy.

vLLM published a broad benchmark of TurboQuant on May 11, 2026, and the headline is simple: the memory savings look good, but FP8 KV-cache still wins for most production serving setups. The team tested four TurboQuant variants across four models, five benchmarks, and both dense and MoE architectures, then compared them with BF16 and FP8 baselines.
The most useful part of the post is that it moves the discussion away from small-model demos and into workloads that resemble actual serving traffic. That matters because KV-cache quantization only becomes interesting when context grows, requests pile up, and memory pressure starts to shape the whole inference stack.
| Method | KV-cache capacity | Latency impact | Throughput impact | Accuracy signal |
|---|---|---|---|---|
| FP8 | 2x | Negligible | Near BF16 | Matches baseline |
| TurboQuant k8v4 | 2.4x | 10% to 68% slower | 75% to 80% of BF16 | Near baseline |
| TurboQuant 4bit-nc | 2.3x to 3.7x | Measurable slowdown | About 75% of BF16 | Moderate drop |
| TurboQuant k3v4-nc / 3bit-nc | Higher than FP8 | Largest slowdown | 66% to 73% of BF16 | Clear drop |
Why TurboQuant got attention
TurboQuant compresses the KV-cache to 3 to 4 bits, then dequantizes it back to BF16 before the attention computation. That is very different from FP8, which stores the KV-cache in FP8 and also runs attention with FP8 Tensor Core operations. In plain English, TurboQuant saves memory aggressively, but it pays for those savings by doing extra work during inference.
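To make the difference concrete, here is a minimal PyTorch sketch of the two storage strategies. The 4-bit path below is an illustrative absmax scheme, not TurboQuant's actual algorithm, and the tensor shapes and per-token scaling are assumptions for illustration only.

```python
import torch

def quantize_kv_int4(kv: torch.Tensor):
    """Illustrative 4-bit absmax quantization (NOT TurboQuant's actual scheme)."""
    # Per-token, per-head scale; symmetric int4 range used here: -7..7.
    scale = kv.abs().amax(dim=-1, keepdim=True).clamp(min=1e-6) / 7.0
    q = torch.clamp(torch.round(kv.float() / scale), -7, 7).to(torch.int8)
    return q, scale  # real 4-bit storage would also pack two values per byte

def dequantize_kv_int4(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # The extra step on the critical path: expand back to BF16 *before* attention.
    return (q.float() * scale).to(torch.bfloat16)

kv = torch.randn(1024, 8, 128, dtype=torch.bfloat16)  # [tokens, kv_heads, head_dim]

# FP8-style path: store in FP8; matching attention kernels can consume it directly.
kv_fp8 = kv.to(torch.float8_e4m3fn)

# TurboQuant-style path: ~4 bits of storage, plus a dequant pass on every read.
q, scale = quantize_kv_int4(kv)
kv_bf16_again = dequantize_kv_int4(q, scale)
```

The dequantize call is where the tradeoff lives: it runs on every attention read, so its cost scales with context length and batch size.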

That tradeoff is why the vLLM team framed this as a study, not a product pitch. A method can look excellent on a slide deck and still lose badly once you measure latency, throughput, and accuracy on long prompts or hard reasoning tasks.
- TurboQuant variants tested: k8v4, 4bit-nc, k3v4-nc, 3bit-nc
- Baselines: BF16 and FP8
- Models: MiniMax-M2.7, Llama-3.3-70B-Instruct, Qwen3-30B-A3B-Instruct-2507, Qwen3-30B-A3B-Thinking-2507
- Benchmarks: openai/mrcr, AIME25, GPQA:Diamond, MATH500, LiveCodeBench-v6
What the accuracy numbers actually show
The accuracy story is mixed, but the pattern is consistent. FP8 and TurboQuant k8v4 stay close to the unquantized baseline on the retrieval and reasoning workloads. TurboQuant 4bit-nc loses some accuracy, but it remains in the range where a deployment team might still consider it if memory pressure is severe.
The aggressive variants are where things fall apart. On long-context retrieval with Llama-3.3-70B-Instruct at 128k context, the post reports BF16 holding 98% average accuracy recovery and 4bit-nc reaching 96%, while k3v4-nc and 3bit-nc dropped by about 20 points. On reasoning, the same pattern held, with AIME25 and LiveCodeBench-v6 taking the biggest hit.
"FP8 via --kv-cache-dtype fp8 remains the best default for KV-cache quantization." — vLLM blog, May 11, 2026
That quote is blunt, and it is also the most practical conclusion in the post. If your goal is to keep accuracy intact while reducing memory use, FP8 is the safe default. If your goal is to squeeze every last byte out of the cache, TurboQuant starts to look like a niche tool rather than a general recommendation.
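Acting on that default is close to a one-line change. The CLI flag comes straight from the post; the Python form assumes `kv_cache_dtype` is exposed on `vllm.LLM` the same way it is on the server command line, and the model choice and `tensor_parallel_size` value are illustrative assumptions.

```python
from vllm import LLM, SamplingParams

# Server form, per the post's quote: vllm serve <model> --kv-cache-dtype fp8
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    kv_cache_dtype="fp8",       # store the KV-cache in FP8 instead of BF16
    tensor_parallel_size=4,     # assumption: a multi-GPU setup for a 70B model
)
outputs = llm.generate(
    ["Summarize KV-cache quantization tradeoffs in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```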
- Long-context retrieval was tested up to each model’s maximum supported length
- Accuracy was reported as average pass@1 across 5 repetitions (see the scoring sketch after this list)
- TurboQuant k3v4-nc and 3bit-nc showed about 20-point drops on the hardest long-context cases
- On MiniMax-M2.7, aggressive TurboQuant variants dropped accuracy by up to about 8 points on AIME25 and LiveCodeBench-v6
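For readers who want to sanity-check the scoring, both metrics reduce to simple averages and ratios. This is a generic reconstruction, not vLLM's harness code; in particular, the post does not spell out the reference run behind "accuracy recovery", so the `recovery` helper here is deliberately generic.

```python
def avg_pass_at_1(runs: list[list[bool]]) -> float:
    """Mean pass@1: runs[r][i] is whether problem i passed on repetition r."""
    per_run = [sum(run) / len(run) for run in runs]
    return sum(per_run) / len(per_run)  # averaged over the 5 repetitions

def recovery(score: float, reference: float) -> float:
    """Score as a percentage of a reference run (the post's exact reference
    for 'accuracy recovery' is not stated, so treat this as a generic ratio)."""
    return 100.0 * score / reference
```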
Speed is where TurboQuant loses the argument
The speed results are even less flattering. vLLM measured latency with 1,024 input tokens and 256 output tokens, then swept batch sizes of 1, 8, 32, and 64. FP8 had negligible overhead across the board. TurboQuant did not.

On Qwen3-30B-A3B-Instruct-2507, TurboQuant overhead ranged from roughly 10% to 60%. On Llama-3.3-70B-Instruct, it ranged from about 10% to 68%. The larger model also showed a worrying trend: overhead increased with batch size, which is the opposite of what serving teams want when traffic grows.
Throughput told the same story. FP8 matched BF16 throughput on both models, while TurboQuant variants came in below baseline. For Qwen3-30B, throughput ranged from 80% of BF16 with k8v4 to 73% with 3bit-nc. For Llama-3.3-70B, the range was 75% down to 66%.
That is the key takeaway for operators: lower KV-cache storage cost does not automatically translate into faster serving. Once you add dequantization cost back into the path, the math changes.
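The memory side of that math is easy to check. The sketch below uses Llama-3 70B-class dimensions (80 layers, 8 KV heads via GQA, head dimension 128) as assumptions, and the 4.5 bits-per-element figure is a rough allowance for scale overhead, not a number from the post.

```python
def kv_cache_gib(tokens: int, layers: int = 80, kv_heads: int = 8,
                 head_dim: int = 128, bits_per_elem: float = 16.0) -> float:
    """K and V together: 2 * layers * kv_heads * head_dim elements per token."""
    elems = 2 * layers * kv_heads * head_dim * tokens
    return elems * bits_per_elem / 8 / 2**30

ctx = 128 * 1024  # one 128k-token sequence
for name, bits in [("BF16", 16.0), ("FP8", 8.0), ("~4-bit + scales", 4.5)]:
    print(f"{name:>16}: {kv_cache_gib(ctx, bits_per_elem=bits):5.1f} GiB")
# Roughly: BF16 ~40 GiB, FP8 ~20 GiB, ~4-bit ~11 GiB per sequence
```

Those tens of gigabytes per long sequence are why the capacity gains are real; the benchmark shows they are simply not free.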
- Latency tests used 10 warmup iterations and 30 measured iterations (see the timing sketch after this list)
- Throughput tests used 200 prompts across 256/256, 1024/512, and 4096/256 token pairs
- vLLM version used: 0.20.2, commit 6ec9bbec3
- FP8 matched BF16 on latency and throughput, while TurboQuant variants consistently fell behind
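A minimal version of that measurement loop looks like the following. The prompt shape, iteration counts, and batch sweep come from the post; the use of `time.perf_counter`, the `"token " * 1024` stand-in for a 1,024-token input, and the helper name are assumptions of this sketch.

```python
import time
from statistics import mean

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-30B-A3B-Instruct-2507", kv_cache_dtype="fp8")
params = SamplingParams(max_tokens=256, ignore_eos=True)  # fixed 256-token decode

def bench(batch_size: int, warmup: int = 10, iters: int = 30) -> float:
    prompts = ["token " * 1024] * batch_size  # rough stand-in for 1,024-token inputs
    for _ in range(warmup):                   # 10 warmup iterations, as in the post
        llm.generate(prompts, params)
    latencies = []
    for _ in range(iters):                    # 30 measured iterations
        start = time.perf_counter()
        llm.generate(prompts, params)
        latencies.append(time.perf_counter() - start)
    return mean(latencies)

for bs in (1, 8, 32, 64):                     # the post's batch-size sweep
    print(f"batch {bs:2d}: {bench(bs):6.2f} s per batch")
```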
What this means for real deployments
The most practical reading of the study is that TurboQuant is a memory tool first and a performance tool second. If you are serving a model where KV-cache memory is the bottleneck and you can tolerate slower inference, TurboQuant 4bit-nc may be worth a pilot. If you care about latency, throughput, and accuracy at the same time, FP8 is the cleaner answer.
There is also a hardware angle here. FP8 works well because it maps to native Tensor Core behavior on modern NVIDIA GPUs, while TurboQuant has to unpack low-bit storage before attention runs. That extra unpacking step is exactly where the lost time goes.
vLLM’s recommendation is more useful than a generic benchmark chart because it tells teams where to spend effort next. For most production systems, the next experiment should be FP8 first, then a narrow TurboQuant test only if memory pressure remains the real blocker. If you are planning a deployment on H100 hardware, the question is not whether TurboQuant saves memory. It does. The real question is whether those savings are worth slower serving and weaker accuracy on the workloads that matter most.
For related reading, see OraCore’s coverage of FP8 KV-cache in vLLM and KV-cache optimization strategies.
Bottom line
vLLM’s first broad study says TurboQuant is useful only when memory pressure is severe enough to justify slower inference, and even then FP8 remains the default most teams should try first. The next question for the community is whether future low-bit KV-cache methods can keep TurboQuant’s memory gains while removing the dequantization tax that hurts real serving workloads.