What is Quantization? — AI Glossary 2026

Definition

Reducing the numerical precision of model weights (e.g., from 32-bit float to 4-bit integer) to shrink model size and speed up inference with minimal accuracy loss. Enables running large models on consumer hardware. Key for local deployments.

Related Terms

QLoRA (Quantized LoRA)

Combines 4-bit quantization with LoRA fine-tuning, enabling fine-tuning of 65B+ parameter models on a single consumer GPU. Published by Tim Dettmers et al. (2023). Made democratized fine-tuning of large models practical.

Distillation

Training a small "student" model to mimic the behavior of a larger "teacher" model. Produces compact models that retain much of the teacher's capability at a fraction of the compute cost. Used by DeepSeek-R1-Zero and many production models.

Inference

The process of running a trained model to generate predictions or outputs — as opposed to training (updating weights). Inference efficiency (speed, cost, latency) is the primary concern for production deployments.

Articles about Quantization

TurboQuant cuts KV cache memory 6x in Google tests

Gemma 4 12B: Specs, Benchmarks & How to Run It Locally

Tether’s TurboQuant cuts AI memory use 5x

Why Tether Is Right to Push Local AI Memory Into Everyday Devices

Apple’s Gemini deal turns cloud AI into local AI

Definition

Related Terms

Articles about Quantization

All Terms