RLHF (Reinforcement Learning from Human Feedback)
Technique
Definition
Training LLMs using human preference signals: human raters compare model outputs, a reward model is trained on these preferences, and the LLM is then fine-tuned with reinforcement learning (typically PPO) to maximize that reward. Used to align ChatGPT, Claude, and similar assistants.
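A minimal PyTorch sketch of the two RLHF stages. The models, tensors, and hyperparameters are illustrative placeholders, not the exact setup used for any production assistant, and the policy objective is a simplified REINFORCE-style surrogate rather than full PPO.

```python
import torch
import torch.nn.functional as F

# Stage 1: train a reward model on human preference pairs.
def reward_model_loss(reward_model, chosen_ids, rejected_ids):
    # Bradley-Terry objective: the preferred response should score higher.
    r_chosen = reward_model(chosen_ids)      # shape: (batch,)
    r_rejected = reward_model(rejected_ids)  # shape: (batch,)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Stage 2: fine-tune the policy against the frozen reward model.
def rlhf_policy_loss(policy_logprob, ref_logprob, reward, kl_coef=0.1):
    # Reward is shaped with a KL penalty toward the reference model so the
    # policy does not drift too far; PPO adds clipping and a value baseline.
    kl = policy_logprob - ref_logprob
    shaped_reward = reward - kl_coef * kl
    return -(shaped_reward.detach() * policy_logprob).mean()
```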
Related Terms
DPO (Direct Preference Optimization)
An alignment training method that optimizes the model directly on human preference pairs (preferred vs. rejected responses) without training a separate reward model or running an RL loop. Simpler and more stable than RLHF, and increasingly common as a lighter-weight choice for preference alignment in post-training pipelines.
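A minimal sketch of the DPO loss, assuming per-sequence log-probabilities have already been computed for the chosen and rejected responses under both the policy being trained and a frozen reference model.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Implicit reward of each response: beta * (log pi_theta - log pi_ref).
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Push the preferred response's implicit reward above the rejected one's.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```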
GRPO (Group Relative Policy Optimization)
A reinforcement learning algorithm from DeepSeek that improves on PPO by estimating advantages from multiple responses sampled for the same prompt, scoring each against its group's average instead of relying on a separate value-function critic. Used to train DeepSeek-R1's reasoning capabilities.
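A minimal sketch of GRPO's group-relative advantage for a single prompt, assuming scalar rewards and summed token log-probabilities for each sampled response; the clipped ratio and KL term of the full algorithm are omitted.

```python
import torch

def grpo_loss(group_logprobs, group_rewards, eps=1e-6):
    # group_logprobs, group_rewards: shape (group_size,) for one prompt.
    # Each response's advantage is its reward relative to the group,
    # replacing the learned critic PPO would use as a baseline.
    advantages = (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)
    return -(advantages.detach() * group_logprobs).mean()
```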
Fine-tuning
Continuing to train a pre-trained model on a domain-specific or task-specific dataset to specialize its behavior. Ranges from full fine-tuning (updating all weights) to parameter-efficient methods like LoRA and QLoRA.
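A minimal sketch of a LoRA adapter in plain PyTorch: the pretrained weight stays frozen and only the low-rank A and B matrices are trained. Real workflows typically use a library such as Hugging Face PEFT rather than hand-rolling this.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen base projection plus the trainable low-rank update.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```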