Tag
quantization
Quantization compresses model weights, activations, or the KV cache into lower-bit formats to reduce memory use and inference cost. Recent work covered here spans hybrid 4-bit formats and lower-bit methods for LLM inference that target memory bottlenecks while keeping accuracy loss small.
7 articles

5 TurboQuant lessons for vector search teams
Five takeaways on Qdrant TurboQuant: how rotation changes compression, where recall holds up, and when safer quantizers are the better fit.

Shannon Scaling Law explains LLM overtraining
A Shannon-based scaling law explains why LLMs can get worse as compute rises under noise.

TurboQuant shows how 4-bit beats guesswork
I break down TurboQuant’s quantization study into a practical playbook for choosing 8-bit, 4-bit, PTQ, or QAT.

TurboQuant turns vLLM KV cache into 3-bit storage
I break down TurboQuant’s vLLM cache compression and give you a copy-ready setup for 3-bit KV cache and fallback paths.

TurboQuant cuts memory use 6x without accuracy loss
Google Research’s TurboQuant claims 6x less memory and 8x faster inference with no accuracy loss, jolting AI inference economics.

TurboQuant Explained: Why Google’s New Paper Matters
Google’s TurboQuant paper targets KV cache bottlenecks with lower-bit quantization, aiming to cut LLM memory use and inference costs.

IF4: Smarter 4-Bit Quantization That Adapts to Your Data
MIT researchers propose a hybrid data format that switches between floating-point and integer representations, improving accuracy in 4-bit neural network quantization.