AI Glossary
Plain English definitions for AI & ML terms — searchable and linkable.
Agent
Concepts: An AI system that can autonomously plan and execute multi-step tasks by calling tools, browsing the web, writing code, and interacting with external services — all without continuous human intervention.
Attention Mechanism
Techniques: The core innovation in Transformers that allows a model to weigh the importance of different tokens in a sequence when generating each output token. Self-attention lets every token "attend" to every other token, capturing long-range dependencies.
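A minimal sketch of scaled dot-product self-attention in NumPy (toy dimensions and random weights, purely illustrative — real models use many heads and learned projections):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of token vectors X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # project tokens to queries/keys/values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # every token scores every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax -> attention weights
    return weights @ V                        # each output is a weighted mix of values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                   # 5 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)                              # same shape as the input sequence
```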
Chain-of-Thought
Techniques: A prompting technique where the model is instructed (or naturally encouraged) to output intermediate reasoning steps before a final answer. Dramatically improves performance on multi-step math, logic, and coding problems.
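In the simplest ("zero-shot") form, chain-of-thought can be elicited with a single trailing cue. A hypothetical prompt, with the question invented for illustration:

```python
question = ("A cafe sells coffee for $3 and muffins for $2. "
            "If I buy 2 coffees and 3 muffins, what do I pay?")

# The trailing cue nudges the model to show intermediate steps before the answer.
prompt = f"Q: {question}\nA: Let's think step by step."
print(prompt)
```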
Context Window
Concepts: The maximum number of tokens a model can process in a single call — including both the input (prompt) and output (completion). Larger windows allow processing entire codebases, books, or long conversations. Measured in tokens, not characters.
Diffusion Model
Models: A generative model that learns to reverse a gradual noising process. Starting from pure noise, the model iteratively denoises to produce images, audio, or video. Powers Stable Diffusion, DALL-E 3, Midjourney, and Sora.
Distillation
Techniques: Training a small "student" model to mimic the behavior of a larger "teacher" model. Produces compact models that retain much of the teacher's capability at a fraction of the compute cost. Used for DeepSeek-R1's distilled variants and many production models.
DPO (Direct Preference Optimization)
Techniques: An alignment training method that optimizes the model directly on human preference pairs (preferred vs. rejected responses) without needing a separate reward model. Simpler and more stable than RLHF, increasingly preferred for instruction tuning.
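The DPO loss for a single preference pair can be sketched directly from sequence log-probabilities under the policy and a frozen reference model (the numeric log-probs below are made up for illustration):

```python
import math

def dpo_loss(pi_logp_chosen, pi_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid of the scaled margin
    between how much the policy (vs. the reference) favors the chosen response."""
    margin = beta * ((pi_logp_chosen - ref_logp_chosen)
                     - (pi_logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy favors the chosen response more than the reference did -> low loss.
low = dpo_loss(-10.0, -14.0, -12.0, -12.0)
# Policy favors the rejected response -> high loss.
high = dpo_loss(-14.0, -10.0, -12.0, -12.0)
```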
Few-shot Prompting
Techniques: Providing the model with a small number of input-output examples (shots) in the prompt before asking it to complete a new example. Helps the model understand the desired format, style, or task without fine-tuning.
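A few-shot prompt is just string assembly; a hypothetical two-shot translation task:

```python
# Two "shots" demonstrating an English-to-French task, followed by the new query.
examples = [("cheese", "fromage"), ("dog", "chien")]
query = "book"

shots = "\n".join(f"English: {en}\nFrench: {fr}" for en, fr in examples)
prompt = f"{shots}\nEnglish: {query}\nFrench:"   # model completes the translation
print(prompt)
```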
Fine-tuning
Techniques: Continuing to train a pre-trained model on a domain-specific or task-specific dataset to specialize its behavior. Ranges from full fine-tuning (updating all weights) to parameter-efficient methods like LoRA and QLoRA.
Function Calling
Tools: A structured capability where the model outputs a JSON object describing which function to call and with what arguments, rather than plain text. The calling application executes the function and feeds the result back. Standard in GPT-4, Claude, and Gemini.
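A sketch of the application side of the loop, with a hypothetical `get_weather` tool and a hard-coded stand-in for the model's structured reply (the schema style loosely follows what most chat APIs use, but is not any vendor's exact format):

```python
import json

# Hypothetical tool schema advertised to the model.
tools = [{
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

# Pretend the model replied with this structured call instead of plain text.
model_output = '{"name": "get_weather", "arguments": {"city": "Paris"}}'

def get_weather(city):
    return f"18°C and cloudy in {city}"   # stub; a real app would call an API

# The application parses the call, dispatches it, and would feed the
# result back to the model for the final natural-language answer.
call = json.loads(model_output)
result = {"get_weather": get_weather}[call["name"]](**call["arguments"])
print(result)
```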
GAN (Generative Adversarial Network)
Models: An architecture with two networks — a generator that creates synthetic data and a discriminator that tries to distinguish real from fake. Training as an adversarial game pushes the generator toward photorealistic output. Largely superseded by diffusion models for images.
GRPO (Group Relative Policy Optimization)
Techniques: A reinforcement learning algorithm from DeepSeek that improves upon PPO by comparing multiple sampled responses within a group rather than relying on a separate critic. Used to train DeepSeek-R1's reasoning capabilities.
LLM (Large Language Model)
Models: A neural network trained on massive text corpora to predict the next token, resulting in emergent abilities like reasoning, coding, and language understanding. Examples include GPT-4, Claude, Gemini, and Llama. Parameter counts range from billions to trillions.
LoRA (Low-Rank Adaptation)
Techniques: A parameter-efficient fine-tuning technique that adds small trainable rank-decomposition matrices to frozen model layers. Achieves near full fine-tuning performance while training less than 1% of parameters. Industry standard for adapting LLMs.
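The core idea fits in a few lines of NumPy: the frozen weight `W` is augmented with a low-rank update `B @ A`, and only `A` and `B` are trained (toy dimensions here; at real model scale the extra parameters are a tiny fraction of `W`):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 16, 2, 4                  # hidden size, LoRA rank, scaling factor

W = rng.normal(size=(d, d))             # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01      # trainable, rank r
B = np.zeros((d, r))                    # trainable, zero-init so training starts
                                        # exactly at the pretrained behavior

def lora_forward(x):
    # Base path plus the scaled low-rank update.
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.normal(size=(3, d))
# With B = 0 the adapted model matches the frozen model exactly.
print(np.allclose(lora_forward(x), x @ W.T))
```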
MCP (Model Context Protocol)
Tools: An open standard by Anthropic for connecting AI assistants to external data sources and tools. Defines a common interface so any MCP-compatible client (Claude, Cursor, etc.) can plug into any MCP-compatible server (databases, APIs, filesystems).
Multimodal
Concepts: A model capable of processing and generating multiple types of data — text, images, audio, and video — within a single unified architecture. Examples include GPT-4o, Gemini, Claude (vision), and Sora.
QLoRA (Quantized LoRA)
Techniques: Combines 4-bit quantization with LoRA fine-tuning, enabling fine-tuning of 65B-parameter models on a single 48 GB GPU. Published by Tim Dettmers et al. (2023). Made fine-tuning of large models far more accessible.
Quantization
Techniques: Reducing the numerical precision of model weights (e.g., from 32-bit float to 4-bit integer) to shrink model size and speed up inference with minimal accuracy loss. Enables running large models on consumer hardware. Key for local deployments.
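A minimal sketch of symmetric int8 quantization (real schemes quantize per-channel or per-block and go down to 4 bits, but the round-trip idea is the same):

```python
import numpy as np

def quantize_int8(w):
    """Map float weights onto int8 with a single shared scale."""
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)
q, scale = quantize_int8(w)

error = np.abs(dequantize(q, scale) - w).max()
print(q.nbytes, w.nbytes)               # int8 storage is 4x smaller than float32
```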
RAG (Retrieval-Augmented Generation)
Techniques: An architecture that enhances LLM outputs by first retrieving relevant documents from a knowledge base (via vector search) and injecting them into the prompt. Grounds the model in external, up-to-date facts without requiring retraining.
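The retrieve-then-prompt pattern in miniature. The toy corpus and word-overlap scoring below stand in for a real vector database and embedding model; only the prompt-assembly shape is the point:

```python
import re

# Toy corpus standing in for an indexed knowledge base.
docs = [
    "The Eiffel Tower is in Paris.",
    "Python was created by Guido van Rossum.",
    "The Great Wall is in China.",
]

def tokens(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(query, k=1):
    # Word overlap as a stand-in for cosine similarity over embeddings.
    return sorted(docs, key=lambda d: -len(tokens(query) & tokens(d)))[:k]

question = "Who created Python?"
context = "\n".join(retrieve(question))

# Retrieved passages are injected ahead of the question.
prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
print(prompt)
```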
RLHF (Reinforcement Learning from Human Feedback)
Techniques: Training LLMs using human preference signals: human raters compare model outputs, a reward model is trained on these preferences, then the LLM is fine-tuned via RL to maximize the reward. Used to align ChatGPT, Claude, and similar assistants.
Temperature
Concepts: A sampling hyperparameter controlling output randomness. At temperature 0, the model always picks the most probable next token (deterministic). Higher values increase diversity and creativity. Values above 1.0 introduce significant noise.
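Temperature divides the logits before the softmax, so higher values flatten the distribution; a sketch with made-up logits:

```python
import numpy as np

def sample_probs(logits, temperature):
    """Next-token probabilities after temperature scaling."""
    logits = np.asarray(logits, dtype=float)
    if temperature == 0:
        p = np.zeros_like(logits)
        p[np.argmax(logits)] = 1.0      # greedy: all mass on the top token
        return p
    z = logits / temperature            # divide logits, then softmax
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

logits = [2.0, 1.0, 0.5]
p0 = sample_probs(logits, 0)            # deterministic
p1 = sample_probs(logits, 1.0)          # moderately peaked
p2 = sample_probs(logits, 2.0)          # flatter: more diversity
```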
Tokenizer
Tools: The component that converts raw text into tokens (integer IDs) that the model processes. Most modern LLMs use Byte-Pair Encoding (BPE) or similar subword algorithms. Token count determines API cost and must fit within the context window limit.
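The merge step at the heart of BPE can be sketched in a few lines. The merge rules below are hypothetical; a real tokenizer learns thousands of them from corpus statistics and then maps the resulting subwords to integer IDs:

```python
def bpe_tokenize(word, merges):
    """Apply an ordered list of BPE merge rules to a word split into characters."""
    tokens = list(word)
    for a, b in merges:                 # rules learned during tokenizer training
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == a and tokens[i + 1] == b:
                tokens[i:i + 2] = [a + b]   # fuse the adjacent pair in place
            else:
                i += 1
    return tokens

merges = [("l", "o"), ("lo", "w"), ("e", "r")]   # hypothetical learned merges
print(bpe_tokenize("lower", merges))             # -> ['low', 'er']
```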
Tool Use
Concepts: The ability of an LLM to invoke external tools — web search, code execution, calculators, APIs — during inference. The model decides when and how to call tools, receives the result, and incorporates it into its response.
Top-p (Nucleus Sampling)
Concepts: A sampling strategy where the model selects the next token from the smallest set of candidates whose cumulative probability exceeds p. Balances diversity and coherence more adaptively than fixed top-k sampling. Often used alongside temperature.
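A minimal sketch of the nucleus filter with a made-up four-token distribution: sort by probability, keep the smallest prefix whose cumulative mass exceeds p, zero out the rest, and renormalize:

```python
import numpy as np

def top_p_filter(probs, p=0.9):
    """Zero out tokens outside the nucleus, then renormalize."""
    order = np.argsort(probs)[::-1]             # tokens sorted by probability
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1        # smallest prefix with cum mass > p
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

probs = np.array([0.5, 0.3, 0.15, 0.05])
nucleus = top_p_filter(probs, p=0.7)            # keeps the 0.5 and 0.3 tokens
print(nucleus)
```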
Transformer
Models: The neural network architecture introduced in "Attention Is All You Need" (2017) that replaced recurrent networks for sequence modeling. Based entirely on self-attention and feed-forward layers. Foundation of virtually all modern LLMs.