Quantization
TechniqueDefinition
Reducing the numerical precision of model weights (e.g., from 32-bit float to 4-bit integer) to shrink model size and speed up inference with minimal accuracy loss. Enables running large models on consumer hardware. Key for local deployments.
Related Terms
QLoRA (Quantized LoRA)
Combines 4-bit quantization with LoRA fine-tuning, enabling fine-tuning of 65B+ parameter models on a single consumer GPU. Published by Tim Dettmers et al. (2023). Made democratized fine-tuning of large models practical.
Distillation
Training a small "student" model to mimic the behavior of a larger "teacher" model. Produces compact models that retain much of the teacher's capability at a fraction of the compute cost. Used by DeepSeek-R1-Zero and many production models.
Inference
The process of running a trained model to generate predictions or outputs — as opposed to training (updating weights). Inference efficiency (speed, cost, latency) is the primary concern for production deployments.