[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-arxiv-ai-papers-agents-memory-data-en":3,"article-related-arxiv-ai-papers-agents-memory-data-en":30,"series-research-596a6b3f-d7c0-46ef-9a88-1915a6e3f238":77},{"id":4,"slug":5,"title":6,"content":7,"summary":8,"source":9,"source_url":10,"author":11,"image_url":12,"cover_image":12,"category":13,"language":14,"translated_content":11,"related_article_id":15,"keywords":16,"key_takeaways":22,"views":26,"created_at":27,"published_at":28,"topic_cluster_id":29},"596a6b3f-d7c0-46ef-9a88-1915a6e3f238","arxiv-ai-papers-agents-memory-data-en","ArXiv AI papers push agents, memory, and data","\u003Cp data-speakable=\"summary\">New arXiv papers show \u003Ca href=\"\u002Ftag\u002Fai-agents\">AI agents\u003C\u002Fa> getting better at planning, memory, and domain-specific reasoning.\u003C\u002Fp>\u003Cp>ArXiv’s \u003Ca href=\"https:\u002F\u002Fpapers.cool\u002Farxiv\u002Fcs.AI\" target=\"_blank\" rel=\"noopener\">Artificial Intelligence\u003C\u002Fa> feed on \u003Ca href=\"https:\u002F\u002Fpapers.cool\" target=\"_blank\" rel=\"noopener\">papers.cool\u003C\u002Fa> lists 214 papers for June 17, 2026, and the strongest theme is easy to spot: agents are moving from static response generators to systems that remember, plan, and act. Several papers also lean hard into data infrastructure, which matters just as much as model design when training data gets scarce.\u003C\u002Fp>\u003Ctable>\u003Cthead>\u003Ctr>\u003Cth>Paper\u003C\u002Fth>\u003Cth>Key numbers\u003C\u002Fth>\u003Cth>What it changes\u003C\u002Fth>\u003C\u002Ftr>\u003C\u002Fthead>\u003Ctbody>\u003Ctr>\u003Ctd>EvolveNav\u003C\u002Ftd>\u003Ctd>10.1% success-rate gain\u003C\u002Ftd>\u003Ctd>Test-time learning for zero-shot navigation\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>SEFD\u003C\u002Ftd>\u003Ctd>152B tokens, 18.5M filings, 550B-token archive estimate\u003C\u002Ftd>\u003Ctd>Open long-context data for financial modeling\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>DRFLOW\u003C\u002Ftd>\u003Ctd>100 tasks, 1,246 workflow steps, 3,900+ sources\u003C\u002Ftd>\u003Ctd>Benchmark for personalized workflows\u003C\u002Ftd>\u003C\u002Ftr>\u003C\u002Ftbody>\u003C\u002Ftable>\u003Ch2>Agents are starting to plan before they act\u003C\u002Fh2>\u003Cp>The most interesting paper in the batch is \u003Ca href=\"https:\u002F\u002Fpapers.cool\u002Farxiv\u002Fcs.AI\" target=\"_blank\" rel=\"noopener\">EvolveNav\u003C\u002Fa>, which attacks zero-shot object-goal navigation. The setup is simple to state and hard to solve: an embodied \u003Ca href=\"\u002Ftag\u002Fagent\">agent\u003C\u002Fa> has to find an object it has never been trained for, using only what it can infer at test time.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781685183085-978g.png\" alt=\"ArXiv AI papers push agents, memory, and data\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>Most prior systems lean on foundation models with fixed priors, then spend a lot of time correcting avoidable mistakes. EvolveNav tries a different route. It builds an agentic rule memory from past trajectories, uses upper confidence bound retrieval to pick rules, and adds a preflection module that predicts likely outcomes before the agent moves.\u003C\u002Fp>\u003Cp>The result is practical rather than flashy. The paper reports a 10.1% improvement in success rate, plus fewer unnecessary steps. That matters because in embodied AI, wasted motion is often the real cost, whether the agent is a robot in a house or a simulated explorer in a maze.\u003C\u002Fp>\u003Cul>\u003Cli>Rule memory turns past trajectories into reusable action knowledge.\u003C\u002Fli>\u003Cli>UCB retrieval balances semantic match with historical success.\u003C\u002Fli>\u003Cli>Preflection reduces blind exploration before the next action.\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>Benchmarks are getting more specific, and that is a good thing\u003C\u002Fh2>\u003Cp>Two papers in this set are really about measurement. \u003Ca href=\"https:\u002F\u002Fpapers.cool\u002Farxiv\u002Fcs.AI\" target=\"_blank\" rel=\"noopener\">DRFLOW\u003C\u002Fa> asks a different question from the usual deep-research \u003Ca href=\"\u002Ftag\u002Fbenchmark\">benchmark\u003C\u002Fa>: can an agent recover the actual workflow a user needs, step by step, from scattered sources?\u003C\u002Fp>\u003Cp>That shift matters because many enterprise tasks are procedural, not summarization tasks. DRFLOW includes 100 tasks across five domains, 1,246 reference workflow steps, and more than 3,900 sources. The authors also define seven diagnostic metrics that test grounding, step recovery, ordering, condition handling, and personalization.\u003C\u002Fp>\u003Cblockquote>“The challenge is not to generate a report, but to identify the correct action-step sequence for the user’s task.” — Md Tawkat Islam Khondaker and coauthors, DRFLOW\u003C\u002Fblockquote>\u003Cp>The benchmark result is telling: the reference agent, DRFLOW-Agent, improves over strong baselines by up to 10.02% average F1, but the paper still says there is a lot of room left. That is usually a sign the benchmark is measuring something real rather than something already solved.\u003C\u002Fp>\u003Cp>Another useful comparison is how these papers define progress. EvolveNav optimizes behavior in the world. DRFLOW optimizes planning over documents and sources. Both are agent papers, but they test different failure modes: one is about physical exploration, the other about workflow recovery.\u003C\u002Fp>\u003Cul>\u003Cli>DRFLOW: 100 tasks, 5 domains, 7 diagnostic metrics.\u003C\u002Fli>\u003Cli>DRFLOW-Agent: up to 10.02% average F1 improvement over strong baselines.\u003C\u002Fli>\u003Cli>EvolveNav: 10.1% success-rate gain with fewer unnecessary steps.\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>Data is becoming the bottleneck, so new corpora matter\u003C\u002Fh2>\u003Cp>The most strategic paper in the batch may be \u003Ca href=\"https:\u002F\u002Fpapers.cool\u002Farxiv\u002Fcs.AI\" target=\"_blank\" rel=\"noopener\">The Stanford EDGAR Filings Dataset\u003C\u002Fa>. It treats SEC filings as a training resource, reconstructing them into layout-faithful MultiMarkdown for long-context pretraining and evaluation.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781685178123-gr1e.png\" alt=\"ArXiv AI papers push agents, memory, and data\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>That is a smart response to a very real problem: good public web text is getting harder to find in bulk, and a lot of the remaining long-context corpora are proprietary, synthetic, or too narrow. EDGAR filings are dense, audited, and full of structure that \u003Ca href=\"\u002Fnews\u002Flanguage-models-value-axis-en\">language models\u003C\u002Fa> usually struggle to preserve.\u003C\u002Fp>\u003Cp>The scale is the headline. The authors release SEFD-v1 as a 152B-\u003Ca href=\"\u002Ftag\u002Ftoken\">token\u003C\u002Fa> snapshot, describe a larger 18.5M-filing archive, and estimate that archive at 550B tokens. They also report less than 0.1% overlap with Common Crawl-derived corpora, which makes the dataset useful for pretraining without simply recycling the same internet text again.\u003C\u002Fp>\u003Cp>They also introduce two benchmarks: EDGAR-Forecast for filing-grounded numerical forecasting after model knowledge cutoffs, and EDGAR-OCR for transcription of complex financial tables. That pairing is smart because it tests both reasoning and document fidelity, which is where many models still wobble.\u003C\u002Fp>\u003Cp>For teams building finance-focused models, the message is clear: better data can matter as much as another round of tuning. If the corpus is cleaner and more structured, the model gets a better shot at learning long-context behavior that holds up in practice.\u003C\u002Fp>\u003Ch2>Agentic AI is spreading into medicine, robotics, and simulation\u003C\u002Fh2>\u003Cp>The rest of the batch reinforces the same pattern. \u003Ca href=\"https:\u002F\u002Fpapers.cool\u002Farxiv\u002Fcs.AI\" target=\"_blank\" rel=\"noopener\">WEQA\u003C\u002Fa> combines language models with wearable-health tools and reports 24% better accuracy than \u003Ca href=\"\u002Ftag\u002Fllm\">LLM\u003C\u002Fa> and agentic baselines, plus a blinded study with 12 medical experts and 8 users that found stronger usefulness and clinical soundness.\u003C\u002Fp>\u003Cp>\u003Ca href=\"https:\u002F\u002Fpapers.cool\u002Farxiv\u002Fcs.AI\" target=\"_blank\" rel=\"noopener\">LEADS\u003C\u002Fa> applies an LLM agent to cardiac electrophysiology digital twins, using structured action spaces to discover hybrid models that stay physically grounded and numerically stable. That is a nice example of where agents make sense: not as free-form writers, but as guided search systems inside a scientific workflow.\u003C\u002Fp>\u003Cp>Then there is \u003Ca href=\"https:\u002F\u002Fpapers.cool\u002Farxiv\u002Fcs.AI\" target=\"_blank\" rel=\"noopener\">Fixed-Point Reasoners\u003C\u002Fa>, which uses fixed-point convergence as a halting mechanism in looped Transformers. The paper targets Sudoku, Maze, state-tracking, and ARC-AGI, which is a good reminder that algorithmic reasoning is still one of the cleanest places to test whether a model can actually think in steps.\u003C\u002Fp>\u003Cp>One more paper, \u003Ca href=\"https:\u002F\u002Fpapers.cool\u002Farxiv\u002Fcs.AI\" target=\"_blank\" rel=\"noopener\">Memory as a Wasting Asset\u003C\u002Fa>, pulls the conversation in a different direction. It prices flash endurance for embodied agents, showing that memory writes have a real lifetime cost on hardware with limited program\u002Ferase cycles. The paper says the endurance budget is dormant on premium 3,000-P\u002FE TLC at datasheet prices, but binding on commodity QLC\u002FeMMC around 1,000 P\u002FE, which is exactly the kind of hardware detail AI teams ignore until deployment starts eating budget.\u003C\u002Fp>\u003Cp>That mix of papers points to a broader shift: agent research is no longer just about getting a chat model to sound helpful. It is about memory that changes over time, benchmarks that measure actual workflows, and data sources that can support longer context windows without collapsing into noise.\u003C\u002Fp>\u003Cp>If there is a prediction worth making from this batch, it is this: the next wave of agent papers will be judged less by clever prompts and more by whether they improve task completion, data efficiency, and test-time adaptation. The models that matter will be the ones that can remember, plan, and justify their actions without wasting steps or tokens.\u003C\u002Fp>","This arXiv AI batch centers on agentic reasoning, long-context data, and benchmark design across navigation, workflows, and health.","papers.cool","https:\u002F\u002Fpapers.cool\u002Farxiv\u002Fcs.AI",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781685183085-978g.png","research","en","f0501097-0bec-43ec-b310-56fc442ab53c",[17,18,19,20,21],"arXiv","agentic AI","benchmarks","long-context data","embodied agents",[23,24,25],"EvolveNav reports a 10.1% success-rate gain on zero-shot object-goal navigation.","SEFD releases a 152B-token EDGAR snapshot and a larger 550B-token filing archive estimate.","DRFLOW includes 100 tasks, 1,246 workflow steps, and more than 3,900 sources.",0,"2026-06-17T08:32:37.121772+00:00","2026-06-17T08:32:37.106+00:00","3103988e-c4fe-45e3-98ab-846500c9d507",{"tags":31,"relatedLang":36,"relatedPosts":40},[32,34],{"name":17,"slug":33},"arxiv",{"name":18,"slug":35},"agentic-ai",{"id":15,"slug":37,"title":38,"language":39},"arxiv-ai-papers-agents-memory-data-zh","ArXiv這批 AI 論文都在補三件事","zh",[41,47,53,59,65,71],{"id":42,"slug":43,"title":44,"cover_image":45,"image_url":45,"created_at":46,"category":13},"d910529d-15c0-498a-a930-85e14c6ef748","reprorepo-github-issues-reproducibility-audits-en","ReproRepo scales reproducibility audits with GitHub issues","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781678880894-uawp.png","2026-06-17T06:47:35.608681+00:00",{"id":48,"slug":49,"title":50,"cover_image":51,"image_url":51,"created_at":52,"category":13},"434fbb0a-e925-43f3-9c3d-a3fbd187acdc","variable-width-transformers-cut-wasted-capacity-en","Variable-Width Transformers cut wasted capacity","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781677980601-tp4b.png","2026-06-17T06:32:32.993101+00:00",{"id":54,"slug":55,"title":56,"cover_image":57,"image_url":57,"created_at":58,"category":13},"2f8d825d-5520-4fb6-b1dc-a309b0193f3e","veritas-robot-policy-visual-verification-en","VERITAS lets robots verify and improve at runtime","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781677086468-mhbq.png","2026-06-17T06:17:38.067708+00:00",{"id":60,"slug":61,"title":62,"cover_image":63,"image_url":63,"created_at":64,"category":13},"d1c56a9f-a495-46df-b7f7-3a6036031e56","phase-noise-information-aging-massive-mimo-en","Phase noise makes massive MIMO information age","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781641074734-76ux.png","2026-06-16T20:17:28.34729+00:00",{"id":66,"slug":67,"title":68,"cover_image":69,"image_url":69,"created_at":70,"category":13},"29c4b64b-1ff6-4e8f-a478-a43cc9507809","ai-model-benchmarks-gpt-55-claude-gemini-grok-en","18 AI benchmarks now rank GPT-5.5, Claude, Gemini","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781636573742-hzva.png","2026-06-16T19:02:23.681596+00:00",{"id":72,"slug":73,"title":74,"cover_image":75,"image_url":75,"created_at":76,"category":13},"99c24ad4-5a05-4bd8-a1fc-1c9676530a3a","exact-posterior-scores-inverse-problems-en","Exact posterior scores for inverse problems","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781591573015-t209.png","2026-06-16T06:32:32.175258+00:00",[78,83,88,93,98,103,108,113,118,123],{"id":79,"slug":80,"title":81,"created_at":82},"a2715e72-1fe8-41b3-abb1-d0cf1f710189","ai-predictions-2026-big-changes-en","AI Predictions for 2026: Brace for Big Changes","2026-03-26T01:25:07.788356+00:00",{"id":84,"slug":85,"title":86,"created_at":87},"8404bd7b-4c2f-4109-9ec4-baf29d88af2b","ml-papers-of-the-week-github-research-desk-en","ML Papers of the Week Turns GitHub Into a Research Desk","2026-03-27T01:11:39.480259+00:00",{"id":89,"slug":90,"title":91,"created_at":92},"87897a94-8065-4464-a016-1f23e89e17cc","ai-ml-conferences-to-watch-in-2026-en","AI\u002FML Conferences to Watch in 2026","2026-03-27T01:51:54.184108+00:00",{"id":94,"slug":95,"title":96,"created_at":97},"6f1987cf-25f3-47a4-b3e6-db0997695be8","openclaw-agents-manipulated-self-sabotage-en","OpenClaw Agents Can Be Manipulated Into Failure","2026-03-28T03:03:18.899465+00:00",{"id":99,"slug":100,"title":101,"created_at":102},"a53571ad-735a-4178-9f93-cb09b699d99c","vega-driving-language-instructions-en","Vega: Driving with Natural Language Instructions","2026-03-28T14:54:04.698882+00:00",{"id":104,"slug":105,"title":106,"created_at":107},"a34581d6-f36e-46da-88bb-582fb3e7425c","personalizing-autonomous-driving-styles-en","Drive My Way: Personalizing Autonomous Driving Styles","2026-03-28T14:54:26.148181+00:00",{"id":109,"slug":110,"title":111,"created_at":112},"2bc1ad7f-26ce-4f02-9885-803b35fd229d","training-knowledge-bases-writeback-rag-en","Training Knowledge Bases with WriteBack-RAG","2026-03-28T14:54:45.643433+00:00",{"id":114,"slug":115,"title":116,"created_at":117},"71adc507-3c54-4605-bbe2-c966acd6187e","packforcing-long-video-generation-en","PackForcing: Efficient Long-Video Generation Method","2026-03-28T14:55:02.646943+00:00",{"id":119,"slug":120,"title":121,"created_at":122},"675942ef-b9ec-4c5f-a997-381250b6eacb","pixelsmile-facial-expression-editing-en","PixelSmile Framework Enhances Facial Expression Editing","2026-03-28T14:55:20.633463+00:00",{"id":124,"slug":125,"title":126,"created_at":127},"6954fa2b-8b66-4839-884b-e46f89fa1bc3","adaptive-block-scaled-data-types-en","IF4: Smarter 4-Bit Quantization That Adapts to Your Data","2026-03-31T06:00:36.65963+00:00"]