Tag
1 articles
PreRL shifts reinforcement learning from P(y|x) to P(y), using reward-driven updates in pre-train space to improve reasoning and exploration.