Tag: policy optimization (2 articles)

Research/Apr 21
Why Bounded Ratio RL Replaces PPO's Clipped Objective
BRRL gives PPO a cleaner theory, with BPO and GBPO aiming for more stable policy updates in control and LLM fine-tuning.

Research/Apr 16
PreRL: Training LLMs in pre-train space
PreRL shifts reinforcement learning from P(y|x) to P(y), using reward-driven updates in pre-train space to improve reasoning and exploration.