Tag

quantization

Quantization compresses model weights, activations, or the KV cache into lower-bit formats to reduce memory use and inference cost. Recent work ranges from hybrid 4-bit formats to lower-bit KV cache compression for LLM inference, each targeting a specific bottleneck while limiting accuracy loss.
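As a rough illustration of what "lower-bit formats" means in practice, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It is a generic textbook scheme, not the method used by TurboQuant, IF4, or any article listed on this page; the function names `quantize_symmetric` and `dequantize` and the sample weights are hypothetical.

```python
def quantize_symmetric(values, bits=4):
    """Map floats to signed integer codes in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1              # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero input.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from integer codes."""
    return [c * scale for c in codes]


# Hypothetical sample weights, chosen only for illustration.
weights = [0.70, -0.32, 0.11, 0.0]
codes, scale = quantize_symmetric(weights, bits=4)
restored = dequantize(codes, scale)

# Rounding to the nearest code bounds the per-value error by about scale / 2.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, round(scale, 4), round(max_error, 4))
```

Each value is stored as a small integer plus one shared floating-point scale, which is where the memory savings come from. Dropping from 8 to 4 bits halves storage but doubles the coarseness of the grid, which is the accuracy trade-off the articles below examine.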

7 articles

Industry News/May 31

5 TurboQuant lessons for vector search teams

5 takeaways on Qdrant TurboQuant: how rotation changes compression, where recall holds up, and when safer quantizers fit better.

Research/May 25

Shannon Scaling Law explains LLM overtraining

A Shannon-based scaling law explains why LLMs can get worse as compute rises under noise.

Research/May 20

TurboQuant shows how 4-bit beats guesswork

I break down TurboQuant’s quantization study into a practical playbook for choosing 8-bit, 4-bit, PTQ, or QAT.

Tools & Apps/May 20

TurboQuant turns vLLM KV cache into 3-bit storage

I break down TurboQuant’s vLLM cache compression and give you a copy-ready setup for 3-bit KV cache and fallback paths.

Research/Apr 3

TurboQuant cuts memory use 6x without accuracy loss

Google Research’s TurboQuant claims 6x less memory and 8x faster inference with no accuracy loss, jolting AI inference economics.

Research/Apr 3

TurboQuant Explained: Why Google’s New Paper Matters

Google’s TurboQuant paper targets KV cache bottlenecks with lower-bit quantization, aiming to cut LLM memory use and inference costs.

Research/Mar 31

IF4: Smarter 4-Bit Quantization That Adapts to Your Data

MIT researchers propose a hybrid data format that switches between floating-point and integer representations, improving accuracy in 4-bit neural network quantization.