[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"tag-reinforcement-learning":3},{"tag":4,"articles":11},{"id":5,"name":6,"slug":7,"article_count":8,"description_zh":9,"description_en":10},"d52d08ae-f7f9-4625-ada6-d32a7bcd1036","reinforcement learning","reinforcement-learning",15,"強化學習研究如何讓模型在回饋訊號下逐步學會決策，常見於機器人控制、長期代理訓練與 LLM 微調。這個主題也涵蓋 PPO、BRRL、持續學習與安全約束等方法，重點在穩定更新、長期規劃與部署風險。","Reinforcement learning studies how models learn decisions from feedback over time, and it underpins robot control, long-horizon agent training, and LLM fine-tuning. Recent work spans PPO variants, safe continual RL, stability, and planning under changing environments.",[12,21,28,35,42,49,56,63,70],{"id":13,"slug":14,"title":15,"summary":16,"category":17,"image_url":18,"cover_image":18,"language":19,"created_at":20},"4a7fe7e7-0731-47ec-96a5-2758c5bfd8f9","alphagrpo-self-reflective-multimodal-generation-en","AlphaGRPO teaches multimodal models to self-correct","AlphaGRPO adds verifiable reward signals to multimodal models so they can reason, refine outputs, and improve generation without cold-start training.","research","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778652656972-4yog.png","en","2026-05-13T06:10:34.985001+00:00",{"id":22,"slug":23,"title":24,"summary":25,"category":17,"image_url":26,"cover_image":26,"language":19,"created_at":27},"14c7a767-8a49-4a9f-9531-3ea654444daf","synthetic-computers-long-horizon-agent-training-en","Synthetic computers for long-horizon agent training","A method for building synthetic user computers at scale, then simulating month-long productivity tasks to train and evaluate agents.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1777620239939-k8q2.png","2026-05-01T06:30:47.562935+00:00",{"id":29,"slug":30,"title":31,"summary":32,"category":17,"image_url":33,"cover_image":33,"language":19,"created_at":34},"89d74343-03a7-4325-88e0-14029dab320d","safe-continual-rl-changing-environments-en","Safe Continual RL for Changing Real-World Systems","This paper studies how to keep RL controllers safe while they adapt to non-stationary systems—and shows why existing methods still fall short.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1776838195882-6v8v.png","2026-04-22T06:09:33.432376+00:00",{"id":36,"slug":37,"title":38,"summary":39,"category":17,"image_url":40,"cover_image":40,"language":19,"created_at":41},"19f116fd-02dd-4a7d-9638-75a3bb70cae2","bounded-ratio-reinforcement-learning-ppo-en","Why Bounded Ratio RL Replaces PPO's Clipped Objective","BRRL gives PPO a cleaner theory, with BPO and GBPO aiming for more stable policy updates in control and LLM fine-tuning.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1776751796218-p4in.png","2026-04-21T06:09:40.318224+00:00",{"id":43,"slug":44,"title":45,"summary":46,"category":17,"image_url":47,"cover_image":47,"language":19,"created_at":48},"443c85ce-62b3-4336-ad93-7a8a1538d271","llm-generalization-shortest-path-scale-en","Why LLMs Generalize on Maps but Fail on Scale","A synthetic shortest-path setup shows LLMs transfer across maps, but break when problems get longer because recursive reasoning gets 
unstable.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1776406022431-jsmd.png","2026-04-17T06:06:34.142981+00:00",{"id":50,"slug":51,"title":52,"summary":53,"category":17,"image_url":54,"cover_image":54,"language":19,"created_at":55},"d1bbd868-15d4-459c-9e2b-2626c779b4ef","prerl-training-llms-in-pre-train-space-en","PreRL: Training LLMs in pre-train space","PreRL shifts reinforcement learning from P(y|x) to P(y), using reward-driven updates in pre-train space to improve reasoning and exploration.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1776319621187-aig1.png","2026-04-16T06:06:38.24406+00:00",{"id":57,"slug":58,"title":59,"summary":60,"category":17,"image_url":61,"cover_image":61,"language":19,"created_at":62},"8a95a2d8-eb3a-442c-b9c4-c835c79d75c5","physics-simulators-rl-llm-reasoning-en","Physics Simulators as RL Data for LLM Reasoning","Researchers train LLMs on synthetic physics from simulators and report zero-shot gains on IPhO problems, showing a new path beyond web QA data.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1776146992039-q2sc.png","2026-04-14T06:09:33.23692+00:00",{"id":64,"slug":65,"title":66,"summary":67,"category":17,"image_url":68,"cover_image":68,"language":19,"created_at":69},"3cefc37f-e116-4597-a5cb-55bfb3fc4aa4","act-wisely-tool-use-agentic-multimodal-models-en","Act Wisely: Teaching Agents When Not to Call Tools","A new training scheme, HDPO, aims to cut blind tool use in multimodal agents by separating accuracy from tool efficiency.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775801032138-7jih.png","2026-04-10T06:03:34.728615+00:00",{"id":71,"slug":72,"title":73,"summary":74,"category":75,"image_url":76,"cover_image":76,"language":19,"created_at":77},"15c2f00f-4c48-4580-a13e-74626eb520f7","five-ai-infra-frontiers-bessemer-2026-en","Five AI Infra Frontiers Bessemer Expects for 2026","Bessemer’s 2026 AI infra roadmap points to memory, continual learning, RL, inference, and world models as the next big build areas.","industry","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775164380914-xfye.png","2026-04-02T21:12:40.223864+00:00"]