[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-omniagent-active-perception-video-understanding-en":3,"article-related-omniagent-active-perception-video-understanding-en":30,"series-research-0e33a353-6482-43dc-a0d7-646b9b1a2a2a":77},{"id":4,"slug":5,"title":6,"content":7,"summary":8,"source":9,"source_url":10,"author":11,"image_url":12,"cover_image":12,"category":13,"language":14,"translated_content":11,"related_article_id":15,"keywords":16,"key_takeaways":22,"views":26,"created_at":27,"published_at":28,"topic_cluster_id":29},"0e33a353-6482-43dc-a0d7-646b9b1a2a2a","omniagent-active-perception-video-understanding-en","OmniAgent brings active perception to video understanding","\u003Cp data-speakable=\"summary\">OmniAgent turns long-video understanding into an active reasoning loop that scales with turns, not video length.\u003C\u002Fp>\u003Cul>\u003Cli>\u003Cstrong>Research org\u003C\u002Fstrong>: Unspecified in arXiv abstract\u003C\u002Fli>\u003Cli>\u003Cstrong>Core data\u003C\u002Fstrong>: 50.5% vs. 47.3% on LVBench\u003C\u002Fli>\u003Cli>\u003Cstrong>Breakthrough\u003C\u002Fstrong>: POMDP-based Observation-Thought-Action cycle with persistent textual memory\u003C\u002Fli>\u003C\u002Ful>\u003Cp>\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2606.19341\">Native Active Perception as Reasoning for Omni-Modal Understanding\u003C\u002Fa> argues that long-video models should stop treating every frame as equally important. Instead of passively scanning an entire video from start to finish, OmniAgent decides when to observe, think, and act, then stores the useful audio-visual evidence in text memory so later reasoning does not have to keep dragging the full video context around.\u003C\u002Fp>\u003Cp>That matters because the usual “watch-it-all” approach gets expensive fast as videos get longer. 
This paper is trying to fix a very practical bottleneck: if context cost scales with duration, then long-form video understanding becomes harder to deploy, harder to scale, and harder to make interactive. The authors’ answer is to make perception itself part of the reasoning loop.\u003C\u002Fp>\u003Ch2>What problem the paper is fixing\u003C\u002Fh2>\u003Cp>Traditional long-video systems are passive. They process frames uniformly, even when a user’s question only needs a few moments of the clip. The abstract says this creates unnecessary compute cost, and it also means the model’s context burden grows with video length. In other words, longer input does not just mean more data; it means the reasoning system has to carry more baggage.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781762581923-hx7i.png\" alt=\"OmniAgent brings active perception to video understanding\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>The paper also points out that some interactive frameworks already exist, but they still rely on a global pre-scan. That means they may be more selective than a naive baseline, but they do not fully break the dependence on video length. OmniAgent is positioned as the first native omni-modal \u003Ca href=\"\u002Ftag\u002Fagent\">agent\u003C\u002Fa> that makes selective perception the default behavior rather than an add-on.\u003C\u002Fp>\u003Cp>For developers, the distinction is important. A model that can only answer after ingesting everything is fine for offline analysis, but it is awkward for products that need fast iteration, lower \u003Ca href=\"\u002Ftag\u002Finference\">inference\u003C\u002Fa> cost, or step-by-step interactive exploration of a long clip. 
The paper is pushing toward a design where the model asks for more evidence only when it needs it.\u003C\u002Fp>\u003Ch2>How OmniAgent works in plain English\u003C\u002Fh2>\u003Cp>The core idea is to model video understanding as a POMDP, or partially observable Markov decision process. That sounds formal, but the practical meaning is simple: the agent does not assume it sees the whole truth at once. It repeatedly cycles through Observation, Thought, and Action, choosing what to inspect next based on what it already knows.\u003C\u002Fp>\u003Cp>Instead of keeping raw video context alive throughout the whole conversation, OmniAgent selectively distills audio-visual cues into a persistent textual memory. That memory becomes the working state for reasoning. The result is a system that can separate reasoning complexity from raw video duration, which is the main architectural claim in the abstract.\u003C\u002Fp>\u003Cp>To make this work, the authors introduce two training components. The first is Agentic Supervised Fine-Tuning, which bootstraps native active perception using best-of-N trajectory synthesis plus dual-stage quality control. The second is Agentic \u003Ca href=\"\u002Ftag\u002Freinforcement-learning\">Reinforcement Learning\u003C\u002Fa> with TAURA, short for Turn-aware Adaptive Uncertainty Rescaled Advantage, which uses turn-level entropy to guide credit assignment toward the turns where important discoveries happen.\u003C\u002Fp>\u003Cp>In practical terms, the training story is about teaching the agent not just to answer questions, but to choose good questions for itself. That is the difference between a model that passively consumes context and one that can actively search for the evidence it needs.\u003C\u002Fp>\u003Ch2>What the paper actually shows\u003C\u002Fh2>\u003Cp>The abstract reports results across ten benchmarks, including VideoMME and LVBench. 
It says OmniAgent achieves state-of-the-art performance among open-source models, but the abstract does not list the full \u003Ca href=\"\u002Ftag\u002Fbenchmark\">benchmark\u003C\u002Fa> table, so those details are not available in the source text provided here.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781762581490-qyip.png\" alt=\"OmniAgent brings active perception to video understanding\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>One concrete number does stand out: on LVBench, the 7B OmniAgent outperforms Qwen2.5-VL-72B, a model that is 10 times larger, with 50.5% versus 47.3%. That is the clearest signal in the abstract that active perception can matter more than raw parameter count for this kind of task.\u003C\u002Fp>\u003Cp>The paper also claims positive test-time scaling. That means performance improves as the number of reasoning turns increases. This is a useful property because it suggests the agent can trade extra deliberation for better answers in a controlled way, rather than being locked into a single fixed pass over the video.\u003C\u002Fp>\u003Cp>That said, the abstract stops short of giving the exact cost profile of those extra turns. It also does not provide latency, memory usage, or throughput numbers in the source notes here, so the practical compute tradeoff remains something readers would need to inspect in the full paper.\u003C\u002Fp>\u003Ch2>Why developers should care\u003C\u002Fh2>\u003Cp>If you are building video assistants, search tools, moderation systems, or any workflow where the question may only depend on a small part of a long clip, this paper points to a different architecture choice. 
Instead of scaling context windows and hoping brute force is enough, you can design the model to actively gather evidence.\u003C\u002Fp>\u003Cp>That opens the door to systems that are more efficient on long content and potentially easier to control. A persistent textual memory can also be easier to inspect than a giant raw multimodal context, which may help with debugging and traceability. The paper does not prove those operational benefits directly, but the design clearly aims in that direction.\u003C\u002Fp>\u003Cul>\u003Cli>Active perception can reduce unnecessary full-video processing.\u003C\u002Fli>\u003Cli>Turn-based reasoning may improve long-video accuracy without matching model size.\u003C\u002Fli>\u003Cli>Text memory can make multimodal reasoning more inspectable than raw frame stuffing.\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>Limits and open questions\u003C\u002Fh2>\u003Cp>The biggest limitation in the abstract is that it gives only a high-level view of the method. We know the training setup names, the POMDP framing, and the memory design, but not the implementation details that would matter for reproduction, such as action space design, memory schema, or how the agent chooses what to observe next.\u003C\u002Fp>\u003Cp>The abstract also does not tell us how expensive test-time scaling is, only that performance improves with more reasoning turns. For real deployments, that tradeoff is crucial. An agent that gets better when it thinks longer is useful, but only if the extra thinking is affordable in the product context.\u003C\u002Fp>\u003Cp>Finally, the benchmark story is promising but still bounded by the abstract. We have one explicit comparison on LVBench and a claim of ten-benchmark state-of-the-art performance among open-source models, but not the full set of numbers here. 
So the safest reading is that OmniAgent looks like a strong step toward active multimodal reasoning, not a finished answer to every long-video problem.\u003C\u002Fp>\u003Cp>Even with those caveats, the paper’s direction is clear: long-video understanding may work better when the model behaves less like a recorder and more like an investigator. For engineers, that is a useful mental shift, because it changes the optimization target from “process everything” to “ask for the right evidence at the right time.”\u003C\u002Fp>","OmniAgent turns long-video understanding into an active reasoning loop that scales with turns, not video length.","arxiv.org","https:\u002F\u002Farxiv.org\u002Fabs\u002F2606.19341",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781762581923-hx7i.png","research","en","66aaa847-deb1-4cd3-a60f-f23f5e00868e",[17,18,19,20,21],"video understanding","active perception","multimodal agents","POMDP","reinforcement learning",[23,24,25],"OmniAgent reframes long-video understanding as an active Observation-Thought-Action loop.","The model uses persistent textual memory to avoid carrying full video context through reasoning.","On LVBench, the 7B agent scores 50.5% versus 47.3% for Qwen2.5-VL-72B.",0,"2026-06-18T06:02:32.210704+00:00","2026-06-18T06:02:32.201+00:00","3103988e-c4fe-45e3-98ab-846500c9d507",{"tags":31,"relatedLang":36,"relatedPosts":40},[32,34],{"name":19,"slug":33},"multimodal-agents",{"name":21,"slug":35},"reinforcement-learning",{"id":15,"slug":37,"title":38,"language":39},"omniagent-active-perception-video-understanding-zh","OmniAgent讓長影片先想再看","zh",[41,47,53,59,65,71],{"id":42,"slug":43,"title":44,"cover_image":45,"image_url":45,"created_at":46,"category":13},"d7f11606-750d-42ea-87b8-23a761269509","locus-local-ordinance-corpus-us-en","LOCUS opens U.S. 
local law for legal AI","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781764376812-ikxd.png","2026-06-18T06:32:30.210741+00:00",{"id":48,"slug":49,"title":50,"cover_image":51,"image_url":51,"created_at":52,"category":13},"03e7168c-77a8-40ea-924b-96f86204d88e","turing-rl-user-simulator-rewards-en","Turing-RL trains user simulators by fooling judges","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781763480946-dpwl.png","2026-06-18T06:17:31.584257+00:00",{"id":54,"slug":55,"title":56,"cover_image":57,"image_url":57,"created_at":58,"category":13},"596a6b3f-d7c0-46ef-9a88-1915a6e3f238","arxiv-ai-papers-agents-memory-data-en","ArXiv AI papers push agents, memory, and data","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781685183085-978g.png","2026-06-17T08:32:37.121772+00:00",{"id":60,"slug":61,"title":62,"cover_image":63,"image_url":63,"created_at":64,"category":13},"d910529d-15c0-498a-a930-85e14c6ef748","reprorepo-github-issues-reproducibility-audits-en","ReproRepo scales reproducibility audits with GitHub issues","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781678880894-uawp.png","2026-06-17T06:47:35.608681+00:00",{"id":66,"slug":67,"title":68,"cover_image":69,"image_url":69,"created_at":70,"category":13},"434fbb0a-e925-43f3-9c3d-a3fbd187acdc","variable-width-transformers-cut-wasted-capacity-en","Variable-Width Transformers cut wasted 
capacity","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781677980601-tp4b.png","2026-06-17T06:32:32.993101+00:00",{"id":72,"slug":73,"title":74,"cover_image":75,"image_url":75,"created_at":76,"category":13},"2f8d825d-5520-4fb6-b1dc-a309b0193f3e","veritas-robot-policy-visual-verification-en","VERITAS lets robots verify and improve at runtime","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781677086468-mhbq.png","2026-06-17T06:17:38.067708+00:00",[78,83,88,93,98,103,108,113,118,123],{"id":79,"slug":80,"title":81,"created_at":82},"a2715e72-1fe8-41b3-abb1-d0cf1f710189","ai-predictions-2026-big-changes-en","AI Predictions for 2026: Brace for Big Changes","2026-03-26T01:25:07.788356+00:00",{"id":84,"slug":85,"title":86,"created_at":87},"8404bd7b-4c2f-4109-9ec4-baf29d88af2b","ml-papers-of-the-week-github-research-desk-en","ML Papers of the Week Turns GitHub Into a Research Desk","2026-03-27T01:11:39.480259+00:00",{"id":89,"slug":90,"title":91,"created_at":92},"87897a94-8065-4464-a016-1f23e89e17cc","ai-ml-conferences-to-watch-in-2026-en","AI\u002FML Conferences to Watch in 2026","2026-03-27T01:51:54.184108+00:00",{"id":94,"slug":95,"title":96,"created_at":97},"6f1987cf-25f3-47a4-b3e6-db0997695be8","openclaw-agents-manipulated-self-sabotage-en","OpenClaw Agents Can Be Manipulated Into Failure","2026-03-28T03:03:18.899465+00:00",{"id":99,"slug":100,"title":101,"created_at":102},"a53571ad-735a-4178-9f93-cb09b699d99c","vega-driving-language-instructions-en","Vega: Driving with Natural Language Instructions","2026-03-28T14:54:04.698882+00:00",{"id":104,"slug":105,"title":106,"created_at":107},"a34581d6-f36e-46da-88bb-582fb3e7425c","personalizing-autonomous-driving-styles-en","Drive My Way: Personalizing Autonomous Driving 
Styles","2026-03-28T14:54:26.148181+00:00",{"id":109,"slug":110,"title":111,"created_at":112},"2bc1ad7f-26ce-4f02-9885-803b35fd229d","training-knowledge-bases-writeback-rag-en","Training Knowledge Bases with WriteBack-RAG","2026-03-28T14:54:45.643433+00:00",{"id":114,"slug":115,"title":116,"created_at":117},"71adc507-3c54-4605-bbe2-c966acd6187e","packforcing-long-video-generation-en","PackForcing: Efficient Long-Video Generation Method","2026-03-28T14:55:02.646943+00:00",{"id":119,"slug":120,"title":121,"created_at":122},"675942ef-b9ec-4c5f-a997-381250b6eacb","pixelsmile-facial-expression-editing-en","PixelSmile Framework Enhances Facial Expression Editing","2026-03-28T14:55:20.633463+00:00",{"id":124,"slug":125,"title":126,"created_at":127},"6954fa2b-8b66-4839-884b-e46f89fa1bc3","adaptive-block-scaled-data-types-en","IF4: Smarter 4-Bit Quantization That Adapts to Your Data","2026-03-31T06:00:36.65963+00:00"]