Tag

RLVR

RLVR, or reinforcement learning with verifiable rewards, trains models on tasks where success can be checked objectively: math proofs, coding problems checked by unit tests, or rule-based outputs. It matters because reward design here shapes cold-start behavior, exploration, and training stability.
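As a rough illustration of what "checked objectively" can mean, here is a minimal sketch of a rule-based reward for a numeric-answer task. The function name `exact_match_reward` and the "last number in the completion is the answer" rule are assumptions made for this example, not part of any particular RLVR implementation.

```python
import re


def exact_match_reward(completion: str, reference: str) -> float:
    """Hypothetical verifiable reward: 1.0 if the final number matches, else 0.0.

    Assumption for illustration: the model's answer is the last number
    that appears in its completion.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not numbers:
        # No number in the output means there is nothing to verify.
        return 0.0
    return 1.0 if float(numbers[-1]) == float(reference) else 0.0


print(exact_match_reward("The answer is 42.", "42"))
print(exact_match_reward("I think 7 + 5 = 13", "12"))
```

A binary reward like this is cheap to compute and hard to game, but it returns zero whenever the model never produces a correct answer, which is one way a cold-start stall can arise.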

1 article

Research/Apr 29

Tsallis loss for faster reasoning-model training

A Tsallis-loss continuum may help reasoning models escape cold-start stalls faster than RLVR, with tradeoffs between speed, noise, and stability.