Tag

reinforcement learning

Reinforcement learning studies how agents learn to make decisions from feedback over time, and it underpins robot control, long-horizon agent training, and LLM fine-tuning. Recent work spans PPO variants, safe continual RL, stability analysis, and planning under changing environments.

9 articles

AlphaGRPO teaches multimodal models to self-correct
Research/May 13

AlphaGRPO adds verifiable reward signals to multimodal models so they can reason, refine outputs, and improve generation without cold-start training.

Synthetic computers for long-horizon agent training
Research/May 1

A method for building synthetic user computers at scale, then simulating month-long productivity tasks to train and evaluate agents.

Safe Continual RL for Changing Real-World Systems
Research/Apr 22

This paper studies how to keep RL controllers safe while they adapt to non-stationary systems, and shows why existing methods still fall short.

Why Bounded Ratio RL Replaces PPO's Clipped Objective
Research/Apr 21

BRRL gives PPO a cleaner theory, with BPO and GBPO aiming for more stable policy updates in control and LLM fine-tuning.

Why LLMs Generalize on Maps but Fail on Scale
Research/Apr 17

A synthetic shortest-path setup shows LLMs transfer across maps but break as problems grow longer, because recursive reasoning becomes unstable at scale.

PreRL: Training LLMs in pre-train space
Research/Apr 16

PreRL shifts reinforcement learning from P(y|x) to P(y), using reward-driven updates in pre-train space to improve reasoning and exploration.

Physics Simulators as RL Data for LLM Reasoning
Research/Apr 14

Researchers train LLMs on synthetic physics from simulators and report zero-shot gains on IPhO problems, showing a new path beyond web QA data.

Act Wisely: Teaching Agents When Not to Call Tools
Research/Apr 10

A new training scheme, HDPO, aims to curb blind tool use in multimodal agents by optimizing accuracy and tool efficiency as separate objectives.

Five AI Infra Frontiers Bessemer Expects for 2026
Industry News/Apr 3

Bessemer’s 2026 AI infra roadmap points to memory, continual learning, RL, inference, and world models as the next big build areas.