Inference
The process of running a trained model to generate predictions or outputs, as opposed to training (updating weights). Inference efficiency (latency, throughput, cost) is the primary concern for production deployments.
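As a minimal sketch of the distinction (assuming PyTorch and the Hugging Face transformers library; the gpt2 checkpoint is just a small public example), inference is a forward pass with gradient tracking disabled:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal LM checkpoint works the same way; gpt2 is just small and public.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()  # inference mode: disables dropout and other training behavior

inputs = tokenizer("The capital of France is", return_tensors="pt")

# torch.no_grad() skips gradient bookkeeping -- no weights are updated.
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=10)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Skipping gradient computation is what makes inference cheaper than training, and it is where optimizations like batching, KV caching, and quantization apply.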
Related Terms
Quantization
Reducing the numerical precision of model weights (e.g., from 32-bit float to 4-bit integer) to shrink model size and speed up inference with minimal accuracy loss. Enables running large models on consumer hardware. Key for local deployments.
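A toy sketch of the core idea (plain NumPy, symmetric per-tensor int8 quantization; production systems such as GPTQ or bitsandbytes use more elaborate schemes):

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor quantization: map float32 weights onto int8."""
    scale = np.abs(weights).max() / 127.0  # one scale factor per tensor
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights for use in matrix multiplies."""
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_int8(w)

# int8 storage is 4x smaller than float32; reconstruction error stays small.
print("max abs error:", np.abs(w - dequantize(q, scale)).max())
```

The same idea extends to 4-bit formats, which trade a little more error for another 2x reduction in size.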
Distillation
Training a small "student" model to mimic the behavior of a larger "teacher" model. Produces compact models that retain much of the teacher's capability at a fraction of the compute cost. Used to create the DeepSeek-R1 distilled variants (e.g., DeepSeek-R1-Distill-Qwen) and many production models.
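A minimal sketch of the classic soft-target loss (the Hinton-style KL term, in PyTorch; the shapes and temperature are illustrative):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_soft_student = F.log_softmax(student_logits / t, dim=-1)
    # The t^2 factor rescales gradients back to the hard-label loss scale.
    return F.kl_div(log_soft_student, soft_teacher,
                    reduction="batchmean") * t * t

# Toy batch: 8 examples over a 100-token vocabulary.
teacher_logits = torch.randn(8, 100)  # frozen teacher outputs
student_logits = torch.randn(8, 100, requires_grad=True)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()  # only the student receives gradients
```

In practice this term is usually mixed with an ordinary cross-entropy loss on ground-truth labels.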
Context Window
The maximum number of tokens a model can process in a single call — including both the input (prompt) and output (completion). Larger windows allow processing entire codebases, books, or long conversations. Measured in tokens, not characters.
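A sketch of the budgeting this implies (using the tiktoken tokenizer library; the 8,192-token window and 1,024-token reserve are illustrative figures, not properties of any particular model):

```python
import tiktoken

CONTEXT_WINDOW = 8192   # illustrative limit; actual size varies by model
MAX_COMPLETION = 1024   # tokens reserved for the model's output

enc = tiktoken.get_encoding("cl100k_base")
prompt = "Summarize the following document: ..."

prompt_tokens = len(enc.encode(prompt))
budget = CONTEXT_WINDOW - MAX_COMPLETION

if prompt_tokens > budget:
    # Prompt and completion share one window, so the input must be trimmed.
    truncated = enc.decode(enc.encode(prompt)[:budget])
    print(f"Prompt trimmed from {prompt_tokens} to {budget} tokens")
else:
    print(f"{prompt_tokens} prompt tokens; {budget - prompt_tokens} to spare")
```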