Inference
The process of running a trained model to generate predictions or outputs, as opposed to training (updating weights). Inference efficiency (latency, throughput, cost) is the primary concern for production deployments.
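As a minimal sketch of the distinction (assuming PyTorch and the Hugging Face transformers library; the gpt2 checkpoint is just a small public example), inference is a forward pass with gradient tracking disabled:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal LM checkpoint works the same way; gpt2 is just small and public.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()  # inference mode: disables dropout and other training behavior

inputs = tokenizer("The capital of France is", return_tensors="pt")

# torch.no_grad() skips gradient bookkeeping -- no weights are updated.
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=10)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Skipping gradient computation is what makes inference cheaper than training, and it is where optimizations like batching, KV caching, and quantization apply.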
Related Terms
Quantization
Reducing the numerical precision of model weights (e.g., from 32-bit float to 4-bit integer) to shrink model size and speed up inference with minimal accuracy loss. Enables running large models on consumer hardware. Key for local deployments.
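A toy sketch of the core idea (plain NumPy, symmetric per-tensor int8 quantization; production systems such as GPTQ or bitsandbytes use more elaborate schemes):

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor quantization: map float32 weights onto int8."""
    scale = np.abs(weights).max() / 127.0  # one scale factor per tensor
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights for use in matrix multiplies."""
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_int8(w)

# int8 storage is 4x smaller than float32; reconstruction error stays small.
print("max abs error:", np.abs(w - dequantize(q, scale)).max())
```

The same idea extends to 4-bit formats, which trade a little more error for another 2x reduction in size.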
Distillation
Training a small "student" model to mimic the behavior of a larger "teacher" model. Produces compact models that retain much of the teacher's capability at a fraction of the compute cost. Used to create the DeepSeek-R1 distilled variants (e.g., DeepSeek-R1-Distill-Qwen) and many production models.
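A minimal sketch of the classic soft-target loss (the Hinton-style KL term, in PyTorch; the shapes and temperature are illustrative):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_soft_student = F.log_softmax(student_logits / t, dim=-1)
    # The t^2 factor rescales gradients back to the hard-label loss scale.
    return F.kl_div(log_soft_student, soft_teacher,
                    reduction="batchmean") * t * t

# Toy batch: 8 examples over a 100-token vocabulary.
teacher_logits = torch.randn(8, 100)  # frozen teacher outputs
student_logits = torch.randn(8, 100, requires_grad=True)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()  # only the student receives gradients
```

In practice this term is usually mixed with an ordinary cross-entropy loss on ground-truth labels.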
Context Window
The maximum number of tokens a model can process in a single call — including both the input (prompt) and output (completion). Larger windows allow processing entire codebases, books, or long conversations. Measured in tokens, not characters.
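A sketch of the budgeting this implies (using the tiktoken tokenizer library; the 8,192-token window and 1,024-token reserve are illustrative figures, not properties of any particular model):

```python
import tiktoken

CONTEXT_WINDOW = 8192   # illustrative limit; actual size varies by model
MAX_COMPLETION = 1024   # tokens reserved for the model's output

enc = tiktoken.get_encoding("cl100k_base")
prompt = "Summarize the following document: ..."

prompt_tokens = len(enc.encode(prompt))
budget = CONTEXT_WINDOW - MAX_COMPLETION

if prompt_tokens > budget:
    # Prompt and completion share one window, so the input must be trimmed.
    truncated = enc.decode(enc.encode(prompt)[:budget])
    print(f"Prompt trimmed from {prompt_tokens} to {budget} tokens")
else:
    print(f"{prompt_tokens} prompt tokens; {budget - prompt_tokens} to spare")
```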