Tag
reinforcement learning
Reinforcement learning studies how agents learn to make decisions from feedback over time, and it underpins robot control, long-horizon agent training, and LLM fine-tuning. Recent work spans PPO variants, safe continual RL, stability analysis, and planning under changing environments.
9 articles

AlphaGRPO teaches multimodal models to self-correct
AlphaGRPO adds verifiable reward signals to multimodal models so they can reason, refine outputs, and improve generation without cold-start training.

Synthetic computers for long-horizon agent training
A method for building synthetic user computers at scale, then simulating month-long productivity tasks to train and evaluate agents.

Safe Continual RL for Changing Real-World Systems
This paper studies how to keep RL controllers safe while they adapt to non-stationary systems, and shows why existing methods still fall short.

Why Bounded Ratio RL Replaces PPO's Clipped Objective
BRRL puts PPO-style updates on cleaner theoretical footing, with BPO and GBPO aiming for more stable policy updates in control and LLM fine-tuning.
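For context, the clipped surrogate that BRRL targets is the standard PPO objective; the sketch below shows that baseline only (the BRRL/BPO/GBPO formulations themselves are in the paper). This is a minimal NumPy illustration, not the paper's implementation.

```python
import numpy as np

def ppo_clipped_objective(logp_new, logp_old, advantages, eps=0.2):
    """Standard PPO clipped surrogate (the objective BRRL aims to replace)."""
    # Probability ratio r = pi_new(a|s) / pi_old(a|s), computed in log space.
    ratio = np.exp(logp_new - logp_old)
    # PPO hard-clips the ratio to [1 - eps, 1 + eps] ...
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # ... and takes the pessimistic (elementwise minimum) bound.
    return np.minimum(ratio * advantages, clipped * advantages).mean()

obj = ppo_clipped_objective(
    logp_new=np.log([1.5, 0.7]),   # new-policy log-probs
    logp_old=np.log([1.0, 1.0]),   # old-policy log-probs
    advantages=np.array([1.0, -1.0]),
)
# ratios [1.5, 0.7] clip to [1.2, 0.8]; mean of (1.2, -0.8) is 0.2
```

The hard clip zeroes the gradient outside the trust region, which is one source of the instability that bounded-ratio formulations seek to smooth out.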

Why LLMs Generalize on Maps but Fail on Scale
A synthetic shortest-path setup shows LLMs transfer across maps but break as problems grow longer, because their recursive reasoning becomes unstable.

PreRL: Training LLMs in pre-train space
PreRL shifts reinforcement learning from P(y|x) to P(y), using reward-driven updates in pre-train space to improve reasoning and exploration.

Physics Simulators as RL Data for LLM Reasoning
Researchers train LLMs on synthetic physics from simulators and report zero-shot gains on IPhO problems, showing a new path beyond web QA data.

Act Wisely: Teaching Agents When Not to Call Tools
A new training scheme, HDPO, aims to cut blind tool use in multimodal agents by separating accuracy from tool efficiency.

Five AI Infra Frontiers Bessemer Expects for 2026
Bessemer’s 2026 AI infra roadmap points to memory, continual learning, RL, inference, and world models as the next big build areas.