What is DPO (Direct Preference Optimization)? — AI Glossary 2026

Definition

An alignment training method that optimizes the model directly on human preference pairs (a preferred and a rejected response to the same prompt) without training a separate reward model or running a reinforcement learning loop. Generally simpler to implement and more stable to train than RLHF, and increasingly used to align instruction-tuned models.
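To make "optimizes directly on preference pairs" concrete, here is a minimal sketch of the per-pair DPO loss in plain Python. It assumes the summed token log-probabilities of each response under the trained policy and under a frozen reference model have already been computed; the function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative choices, not taken from any particular library.

```python
import math

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss (hypothetical helper for illustration).

    Each argument is the summed log-probability of a full response.
    beta scales how strongly the policy is pushed away from the reference.
    """
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return softplus(-margin)

# Policy identical to the reference: the margin is 0 and the loss is log(2).
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy has shifted probability toward the preferred response: loss drops.
improved = dpo_loss(-4.0, -9.0, -5.0, -7.0)
print(baseline, improved)
```

No reward model appears anywhere in the computation: the log-ratio against the reference model plays that role implicitly, which is where the simplification over RLHF comes from.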

Related Terms

RLHF (Reinforcement Learning from Human Feedback)

Training LLMs using human preference signals: human raters compare model outputs, a reward model is trained on these preferences, then the LLM is fine-tuned via RL to maximize the reward. Used to align ChatGPT, Claude, and similar assistants.
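The reward-model step can be sketched as follows. This assumes a Bradley-Terry style pairwise objective, which is a common choice but not the only one; the scalar rewards here stand in for the outputs of a hypothetical reward model scoring two responses to the same prompt, and both function names are illustrative.

```python
import math

def preference_probability(reward_chosen, reward_rejected):
    # Modeled probability that raters prefer the "chosen" response:
    # sigmoid of the reward difference, written in a stable form.
    diff = reward_chosen - reward_rejected
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)

def reward_model_loss(reward_chosen, reward_rejected):
    # Negative log-likelihood of the human preference label.
    return -math.log(preference_probability(reward_chosen, reward_rejected))

# The loss is small when the reward model already ranks the pair correctly
# and large when it ranks the pair the wrong way round.
print(reward_model_loss(2.0, 0.0), reward_model_loss(0.0, 2.0))
```

Once trained this way, the reward model scores fresh samples from the LLM, and the RL stage updates the LLM to raise those scores.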

GRPO (Group Relative Policy Optimization)

A reinforcement learning algorithm from DeepSeek that improves upon PPO by comparing multiple sampled responses within a group rather than relying on a separate critic. Used to train DeepSeek-R1's reasoning capabilities.
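The "comparing responses within a group" idea can be illustrated with a short sketch. It assumes the commonly described form in which each response's reward is normalized by the mean and standard deviation of its group; the function name `group_relative_advantages` and the `eps` stabilizer are illustrative assumptions, and the policy-gradient update that would consume these advantages is omitted.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each sampled response relative to its own group.

    rewards: scalar rewards for several responses sampled for ONE prompt.
    eps: small constant (assumed) that avoids division by zero when all
         rewards in the group are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one prompt, scored 1.0 if correct, 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Because the group mean serves as the baseline, no separately trained critic network is needed to estimate how good a response is.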

Fine-tuning

Continuing to train a pre-trained model on a domain-specific or task-specific dataset to specialize its behavior. Ranges from full fine-tuning (updating all weights) to parameter-efficient methods like LoRA and QLoRA.
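A rough sense of what "parameter-efficient" means can be had from a parameter count. The sketch below assumes the standard LoRA formulation, in which a frozen weight matrix of shape d_out x d_in is augmented with a trainable low-rank product B·A (B is d_out x r, A is r x d_in); the function names and the example dimensions are illustrative.

```python
def full_finetune_params(d_in, d_out):
    # Full fine-tuning updates every entry of the weight matrix.
    return d_in * d_out

def lora_params(d_in, d_out, rank):
    # LoRA trains only the two low-rank factors:
    # A has rank * d_in entries, B has d_out * rank entries.
    return rank * (d_in + d_out)

# Hypothetical 4096 x 4096 projection layer adapted with rank 8.
full = full_finetune_params(4096, 4096)
lora = lora_params(4096, 4096, rank=8)
print(full, lora, lora / full)
```

QLoRA applies the same low-rank update on top of a quantized frozen base model, so the trainable parameter count per adapted layer follows the same formula.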
