[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-adacodec-predictive-visual-code-video-mllms-en":3,"article-related-adacodec-predictive-visual-code-video-mllms-en":30,"series-research-a455fdc4-fe0d-41d8-a1f5-b77d7c869c6a":82},{"id":4,"slug":5,"title":6,"content":7,"summary":8,"source":9,"source_url":10,"author":11,"image_url":12,"cover_image":12,"category":13,"language":14,"translated_content":11,"related_article_id":15,"keywords":16,"key_takeaways":22,"views":26,"created_at":27,"published_at":28,"topic_cluster_id":29},"a455fdc4-fe0d-41d8-a1f5-b77d7c869c6a","adacodec-predictive-visual-code-video-mllms-en","AdaCodec cuts video tokens with predictive visual codes","\u003Cp data-speakable=\"summary\">AdaCodec compresses video for MLLMs by encoding only unpredictable frames and inter-frame changes.\u003C\u002Fp>\u003Cul>\u003Cli>\u003Cstrong>Research org\u003C\u002Fstrong>: Unspecified in arXiv abstract\u003C\u002Fli>\u003Cli>\u003Cstrong>Core data\u003C\u002Fstrong>: 32k tokens vs 224k baseline\u003C\u002Fli>\u003Cli>\u003Cstrong>Breakthrough\u003C\u002Fstrong>: Predictive visual code with reference frames and compact P-tokens\u003C\u002Fli>\u003C\u002Ful>\u003Cp>Video models are still wasting a lot of bandwidth on redundancy. If adjacent frames mostly repeat the same objects, background, and layout, then encoding every sampled frame as if it were a brand-new RGB image is expensive and often unnecessary.\u003C\u002Fp>\u003Cp>\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2606.02569\">AdaCodec: A Predictive Visual Code for Video MLLMs\u003C\u002Fa> takes that observation seriously and turns it into a new video interface for multimodal large language models. Instead of always sending full visual tokens, the model sends a reference frame only when the scene is hard to predict from prior context, and otherwise sends a compact description of what changed.\u003C\u002Fp>\u003Ch2>What problem this paper is trying to fix\u003C\u002Fh2>\u003Cp>The paper is aimed at a very practical inefficiency in video MLLMs: repeated visual tokens. In current systems, each sampled frame is usually encoded independently as an RGB image, even when most of the visual content is already present in earlier frames. That means the model keeps paying for the same information over and over.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780382005109-ib57.png\" alt=\"AdaCodec cuts video tokens with predictive visual codes\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>This matters because video is naturally temporally redundant. Most clips do not change completely from frame to frame. For developers building video understanding systems, that redundancy translates into higher \u003Ca href=\"\u002Ftag\u002Ftoken\">token\u003C\u002Fa> budgets, slower \u003Ca href=\"\u002Ftag\u002Finference\">inference\u003C\u002Fa>, and less room for longer context.\u003C\u002Fp>\u003Cp>The abstract frames the issue as a mismatch between how video actually behaves and how video MLLMs consume it. The paper argues for a more direct interface: transmit a full frame only when the model cannot predict the scene well from prior context, and otherwise transmit a compact representation of the inter-frame changes.\u003C\u002Fp>\u003Ch2>How AdaCodec works in plain English\u003C\u002Fh2>\u003Cp>AdaCodec is the paper’s name for this predictive interface. The core idea is simple: use a full reference frame when necessary, and use compact change tokens when not.\u003C\u002Fp>\u003Cp>The paper says AdaCodec spends full visual tokens on a reference frame only when its conditional predictive cost is high. When the scene is predictable, it encodes inter-frame changes instead. Those changes include motion and prediction residuals, which are packaged as compact P-tokens.\u003C\u002Fp>\u003Cp>In other words, AdaCodec is not trying to represent every frame from scratch. It tries to represent what the next frame adds. That is the key shift: from “encode the image” to “encode the difference from what we already know.”\u003C\u002Fp>\u003Cp>For engineers, that distinction is important. A system that can preserve useful video information while spending fewer tokens on redundant content can potentially fit more video into the same budget, or reduce latency for the same amount of input.\u003C\u002Fp>\u003Ch2>What the paper actually shows\u003C\u002Fh2>\u003Cp>The abstract reports results across eleven benchmarks. It says AdaCodec improves over the Qwen3-VL-8B per-frame RGB baseline at a matched visual-token budget. That is the main comparison point in the paper as presented here.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780381977370-c8yg.png\" alt=\"AdaCodec cuts video tokens with predictive visual codes\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>The most concrete efficiency number in the abstract is the token budget reduction: even at one-seventh the budget, AdaCodec with 32k tokens surpasses the 224k baseline on all long-video benchmarks. That is a strong claim about how much redundancy the method can remove while still preserving performance.\u003C\u002Fp>\u003Cp>On five general-video benchmarks, the paper says AdaCodec raises the average score while also reducing time-to-first-token from 9.26 seconds to 1.62 seconds. That latency drop is likely to matter as much as the accuracy gains for interactive systems, where users feel startup delay immediately.\u003C\u002Fp>\u003Cp>One thing the abstract does not provide is the full \u003Ca href=\"\u002Ftag\u002Fbenchmark\">benchmark\u003C\u002Fa> table, so it does not tell us the exact score deltas on each of the eleven tasks. It also does not specify the details of the benchmark suite beyond distinguishing long-video and general-video evaluations. So the safest reading is that AdaCodec appears to improve both efficiency and quality, but the exact margins depend on the benchmark.\u003C\u002Fp>\u003Ch2>Why developers should care\u003C\u002Fh2>\u003Cp>If you are building a video assistant, a surveillance analyzer, a meeting summarizer, or any other video MLLM product, token efficiency is not a cosmetic detail. It affects how much video you can process, how quickly the model starts responding, and how expensive each request becomes.\u003C\u002Fp>\u003Cp>AdaCodec points toward a design that could be more practical than “treat every frame like a fresh image.” By using predictive coding, it tries to align the video input pipeline with the actual structure of video data: mostly stable, with localized changes.\u003C\u002Fp>\u003Cp>That also suggests a broader engineering lesson. For sequence models, a better representation is often not a bigger model, but a smarter interface to the data. AdaCodec is essentially an input-side optimization for video MLLMs, and the abstract claims that this alone can deliver both better scores and lower latency.\u003C\u002Fp>\u003Ch2>What’s still unclear\u003C\u002Fh2>\u003Cp>The abstract is promising, but it leaves open several questions that matter in practice. It does not explain how the predictive cost is computed in detail, how P-tokens are formed internally, or how sensitive the method is to different kinds of motion and scene cuts.\u003C\u002Fp>\u003Cp>It also does not tell us whether AdaCodec requires special training data, whether it generalizes across model families, or how it behaves on videos with rapid scene changes, camera shake, or heavy occlusion. Those are exactly the cases where predictive compression can become harder.\u003C\u002Fp>\u003Cp>So the right takeaway is not that video MLLMs are solved. It is that the paper offers a concrete, token-efficient alternative to per-frame RGB encoding, and the reported results suggest that a predictive visual code can be a better fit for redundant video than the status quo.\u003C\u002Fp>\u003Ch2>The bottom line\u003C\u002Fh2>\u003Cp>AdaCodec reframes video understanding as a prediction problem: send full visual context only when needed, and otherwise send compact change information. According to the abstract, that approach improves performance at matched budgets, cuts token use sharply, and reduces time-to-first-token in at least some settings.\u003C\u002Fp>\u003Cp>For developers, the appeal is straightforward. If the results hold up beyond the reported benchmarks, predictive visual coding could make video MLLMs cheaper, faster, and more scalable without forcing a tradeoff between efficiency and quality.\u003C\u002Fp>\u003Cul>\u003Cli>It targets redundancy in per-frame RGB video encoding.\u003C\u002Fli>\u003Cli>It uses reference frames plus compact P-tokens for changes.\u003C\u002Fli>\u003Cli>It reports better results at lower token budgets and lower latency.\u003C\u002Fli>\u003C\u002Ful>","AdaCodec compresses video for MLLMs by encoding only unpredictable frames and inter-frame changes.","arxiv.org","https:\u002F\u002Farxiv.org\u002Fabs\u002F2606.02569",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780382005109-ib57.png","research","en","3479bdee-21fb-4fda-9572-9394caba01b0",[17,18,19,20,21],"video MLLMs","token efficiency","predictive coding","visual tokens","latency",[23,24,25],"AdaCodec encodes video as predictions and changes instead of repeating full frames.","The abstract reports gains on eleven benchmarks at matched visual-token budgets.","It also cuts time-to-first-token from 9.26s to 1.62s on five general-video benchmarks.",2,"2026-06-02T06:32:28.7818+00:00","2026-06-02T06:32:28.771+00:00","3103988e-c4fe-45e3-98ab-846500c9d507",{"tags":31,"relatedLang":41,"relatedPosts":45},[32,34,36,38,40],{"name":18,"slug":33},"token-efficiency",{"name":19,"slug":35},"predictive-coding",{"name":20,"slug":37},"visual-tokens",{"name":17,"slug":39},"video-mllms",{"name":21,"slug":21},{"id":15,"slug":42,"title":43,"language":44},"adacodec-predictive-visual-code-video-mllms-zh","AdaCodec 用預測碼壓縮影片 token","zh",[46,52,58,64,70,76],{"id":47,"slug":48,"title":49,"cover_image":50,"image_url":50,"created_at":51,"category":13},"1770f0e4-4b10-459d-bb9b-be13075b1a3d","persona-pruner-lightweight-role-playing-models-en","Persona-Pruner trims models for role-playing","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781505171903-58bv.png","2026-06-15T06:32:25.55966+00:00",{"id":53,"slug":54,"title":55,"cover_image":56,"image_url":56,"created_at":57,"category":13},"2a85882b-ba8c-44c8-809e-e19691776f37","clinhallu-medical-mllm-hallucination-benchmark-en","ClinHallu maps where medical MLLMs hallucinate","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781504273229-o70v.png","2026-06-15T06:17:23.262119+00:00",{"id":59,"slug":60,"title":61,"cover_image":62,"image_url":62,"created_at":63,"category":13},"32895cbf-48cf-4030-9c82-aa9c5bc313ec","gaze-heads-steering-vlms-attention-en","Gaze Heads: Steering VLMs by Redirecting Attention","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781503375905-dvse.png","2026-06-15T06:02:26.879998+00:00",{"id":65,"slug":66,"title":67,"cover_image":68,"image_url":68,"created_at":69,"category":13},"e891adc0-af64-41c7-bb41-d75e6506d388","ai-benchmarks-2026-evaluations-limits-en","AI Benchmarks 2026: Top Evaluations and Limits","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781381870944-h208.png","2026-06-13T20:17:26.361723+00:00",{"id":71,"slug":72,"title":73,"cover_image":74,"image_url":74,"created_at":75,"category":13},"b1779b30-e9e3-4406-aa29-d44e94f7ca67","art-fine-tunes-multimodal-llms-via-pixels-en","ART fine-tunes multimodal LLMs via pixels","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781266683694-z93k.png","2026-06-12T12:17:32.187899+00:00",{"id":77,"slug":78,"title":79,"cover_image":80,"image_url":80,"created_at":81,"category":13},"763f2b17-41e2-4685-a9eb-9eb285383747","taxonomy-rwa-tokenization-blockchain-infrastructure-en","A Practical Taxonomy for RWA Tokenization","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781259482218-p7ji.png","2026-06-12T10:17:30.894151+00:00",[83,88,93,98,103,108,113,118,123,128],{"id":84,"slug":85,"title":86,"created_at":87},"a2715e72-1fe8-41b3-abb1-d0cf1f710189","ai-predictions-2026-big-changes-en","AI Predictions for 2026: Brace for Big Changes","2026-03-26T01:25:07.788356+00:00",{"id":89,"slug":90,"title":91,"created_at":92},"8404bd7b-4c2f-4109-9ec4-baf29d88af2b","ml-papers-of-the-week-github-research-desk-en","ML Papers of the Week Turns GitHub Into a Research Desk","2026-03-27T01:11:39.480259+00:00",{"id":94,"slug":95,"title":96,"created_at":97},"87897a94-8065-4464-a016-1f23e89e17cc","ai-ml-conferences-to-watch-in-2026-en","AI\u002FML Conferences to Watch in 2026","2026-03-27T01:51:54.184108+00:00",{"id":99,"slug":100,"title":101,"created_at":102},"6f1987cf-25f3-47a4-b3e6-db0997695be8","openclaw-agents-manipulated-self-sabotage-en","OpenClaw Agents Can Be Manipulated Into Failure","2026-03-28T03:03:18.899465+00:00",{"id":104,"slug":105,"title":106,"created_at":107},"a53571ad-735a-4178-9f93-cb09b699d99c","vega-driving-language-instructions-en","Vega: Driving with Natural Language Instructions","2026-03-28T14:54:04.698882+00:00",{"id":109,"slug":110,"title":111,"created_at":112},"a34581d6-f36e-46da-88bb-582fb3e7425c","personalizing-autonomous-driving-styles-en","Drive My Way: Personalizing Autonomous Driving Styles","2026-03-28T14:54:26.148181+00:00",{"id":114,"slug":115,"title":116,"created_at":117},"2bc1ad7f-26ce-4f02-9885-803b35fd229d","training-knowledge-bases-writeback-rag-en","Training Knowledge Bases with WriteBack-RAG","2026-03-28T14:54:45.643433+00:00",{"id":119,"slug":120,"title":121,"created_at":122},"71adc507-3c54-4605-bbe2-c966acd6187e","packforcing-long-video-generation-en","PackForcing: Efficient Long-Video Generation Method","2026-03-28T14:55:02.646943+00:00",{"id":124,"slug":125,"title":126,"created_at":127},"675942ef-b9ec-4c5f-a997-381250b6eacb","pixelsmile-facial-expression-editing-en","PixelSmile Framework Enhances Facial Expression Editing","2026-03-28T14:55:20.633463+00:00",{"id":129,"slug":130,"title":131,"created_at":132},"6954fa2b-8b66-4839-884b-e46f89fa1bc3","adaptive-block-scaled-data-types-en","IF4: Smarter 4-Bit Quantization That Adapts to Your Data","2026-03-31T06:00:36.65963+00:00"]