[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-ai-agents-token-spending-coding-tasks-en":3,"tags-ai-agents-token-spending-coding-tasks-en":31,"related-lang-ai-agents-token-spending-coding-tasks-en":43,"related-posts-ai-agents-token-spending-coding-tasks-en":47,"series-research-904270f5-c35d-4938-915f-99b405511466":84},{"id":4,"title":5,"content":6,"summary":7,"source":8,"source_url":9,"author":10,"image_url":11,"keywords":12,"language":18,"translated_content":10,"views":19,"is_premium":20,"created_at":21,"updated_at":22,"cover_image":11,"published_at":23,"rewrite_status":24,"rewrite_error":10,"rewritten_from_id":25,"slug":26,"category":27,"related_article_id":28,"status":29,"google_indexed_at":30,"x_posted_at":10,"tweet_text":10,"title_rewritten_at":10,"title_original":10,"key_takeaways":10,"topic_cluster_id":10,"embedding":10,"is_canonical_seed":20},"904270f5-c35d-4938-915f-99b405511466","AI Coding Agents Burn 1000x More Tokens Than Chat","\u003Cp>AI agents are getting better at coding tasks, but they are also getting expensive in ways that are easy to miss until the bill arrives. This paper, \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.22750\">How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks\u003C\u002Fa>, looks at where those tokens go, which models are more efficient, and whether models can predict their own cost before they start working.\u003C\u002Fp>\u003Cp>The short version: agentic coding is not just another LLM workload. The paper argues it is unusually token-hungry, highly variable from run to run, and still difficult for frontier models to estimate ahead of time.\u003C\u002Fp>\u003Ch2>What problem this paper is trying to fix\u003C\u002Fh2>\u003Cp>As AI agents move from demos into real workflows, token usage becomes a practical systems problem, not just an accounting detail. 
If an agent loops through tools, reads a lot of context, and retries tasks, the cost can climb fast even when the final answer is not better.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1777270014409-t6t5.png\" alt=\"How AI Agents Burn Tokens in Coding Tasks\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>That creates three questions developers and platform teams actually care about: where the tokens are being spent, which models are cheapest to use for the same task, and whether a system can estimate cost before execution so you can budget or route requests intelligently.\u003C\u002Fp>\u003Cp>This paper focuses on agentic coding tasks, which are a good stress test because they combine long context, tool use, and iterative reasoning. The authors study trajectories from eight frontier LLMs on SWE-bench Verified and examine both token consumption patterns and models’ ability to predict their own token costs before running the task.\u003C\u002Fp>\u003Ch2>How the method works in plain English\u003C\u002Fh2>\u003Cp>The study is observational and comparative rather than a new algorithm proposal. The authors analyze agent trajectories from eight frontier models on the same coding benchmark, then compare how many tokens each run consumes and how that cost relates to task outcome.\u003C\u002Fp>\u003Cp>They also test whether models can estimate their own token usage before task execution. In other words, instead of only asking “did the agent solve the issue?”, they ask “how much did the agent spend to try?” and “could the model have warned us ahead of time?”\u003C\u002Fp>\u003Cp>That framing matters because token cost in agentic systems is often dominated by repeated input context, not just generated output. 
The paper explicitly separates these dynamics and looks at how input and output tokens contribute to total spend.\u003C\u002Fp>\u003Cp>It also compares perceived difficulty with actual computational effort. The authors use human expert ratings of task difficulty and check how well those ratings line up with token usage in practice.\u003C\u002Fp>\u003Ch2>What the paper actually shows\u003C\u002Fh2>\u003Cp>The headline result is that agentic coding tasks are extremely expensive relative to other coding workloads. The paper says they consume 1000x more tokens than code reasoning and code chat, with input tokens rather than output tokens driving the overall cost.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1777270025582-5fwd.png\" alt=\"How AI Agents Burn Tokens in Coding Tasks\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>That is a useful reminder for anyone building agents: the expensive part is often not the final answer generation. It is the repeated reading, re-reading, and context accumulation that happens before the model ever produces a patch.\u003C\u002Fp>\u003Cp>Token usage is also highly variable. The same task can differ by up to 30x in total tokens across runs, which suggests that cost is not a stable property of the benchmark task alone. The paper describes this behavior as inherently stochastic.\u003C\u002Fp>\u003Cp>More tokens do not necessarily mean better results. The authors report that accuracy often peaks at intermediate cost and then saturates at higher costs. In other words, there is not a simple “spend more, get better” relationship.\u003C\u002Fp>\u003Cp>Model choice matters too. On the same tasks, Kimi-K2 and Claude-Sonnet-4.5, on average, consume over 1.5 million more tokens than GPT-5. 
The paper presents this as a meaningful gap in token efficiency across frontier systems.\u003C\u002Fp>\u003Cp>The human side of the story is also imperfect. Task difficulty rated by human experts only weakly aligns with actual token costs, which suggests that people are not very good at predicting how much compute an agent will burn just by looking at the task.\u003C\u002Fp>\u003Cp>Finally, the paper finds that frontier models are not reliable forecasters of their own token usage. Their self-predictions show weak-to-moderate correlations, up to 0.39, and they systematically underestimate real token costs.\u003C\u002Fp>\u003Ch2>Why developers should care\u003C\u002Fh2>\u003Cp>If you are building agentic coding workflows, this paper points to a simple operational lesson: token cost is a first-class metric. Two runs that look similar in task scope can have wildly different bills, and the model’s own estimate may not save you from that variance.\u003C\u002Fp>\u003Cp>That has direct implications for product design and infrastructure planning. You may need guardrails such as token budgets, routing rules, retry limits, or model selection policies that take cost into account before execution starts.\u003C\u002Fp>\u003Cp>It also suggests that benchmark success alone is not enough. A model that solves tasks at a similar rate but consumes far fewer tokens may be a much better fit for production, especially if you are paying per token or operating under latency constraints.\u003C\u002Fp>\u003Cp>For teams building agent platforms, the paper also highlights a mismatch between human intuition and machine behavior. 
If expert task difficulty does not strongly predict token spend, then manual estimation is a weak basis for cost forecasting.\u003C\u002Fp>\u003Ch2>Limitations and open questions\u003C\u002Fh2>\u003Cp>The abstract is clear about the study’s scope, but it does not provide full benchmark breakdowns, detailed per-model cost tables, or implementation specifics. So while the findings are directionally strong, evaluating the exact experimental setup requires the full paper.\u003C\u002Fp>\u003Cp>It is also worth keeping in mind that the study is centered on SWE-bench Verified and eight frontier LLMs. That makes it highly relevant for agentic coding, but it does not automatically generalize to every tool-using workflow or every model family.\u003C\u002Fp>\u003Cp>Another open question is how these token patterns change with different prompting strategies, toolchains, or context-management approaches. The paper shows that input tokens dominate and variability is high, but it does not claim to solve the underlying efficiency problem.\u003C\u002Fp>\u003Cp>Still, the practical takeaway is strong: if you are shipping AI agents, you should treat token consumption as something to measure, predict, and optimize, not just a side effect of getting the job done.\u003C\u002Fp>\u003Cul>\u003Cli>Agentic coding can be far more expensive than regular code chat or reasoning.\u003C\u002Fli>\u003Cli>Input tokens are the main cost driver.\u003C\u002Fli>\u003Cli>Token use varies massively across runs.\u003C\u002Fli>\u003Cli>Higher spend does not guarantee higher accuracy.\u003C\u002Fli>\u003Cli>Models are still poor at predicting their own cost.\u003C\u002Fli>\u003C\u002Ful>\u003Cp>For developers, that means the next wave of agent optimization may be less about squeezing out a few more benchmark points and more about making the same capability cheaper, more predictable, and easier to budget.\u003C\u002Fp>","A study of SWE-bench Verified shows agentic coding 
can consume 1000x more tokens than chat, with costs driven by inputs and hard to predict.","arxiv.org","https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.22750",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1777270014409-t6t5.png",[13,14,15,16,17],"AI agents","token consumption","agentic coding","SWE-bench","LLM efficiency","en",1,false,"2026-04-27T06:06:38.046891+00:00","2026-05-01T09:58:59.221017+00:00","2026-04-27T06:06:38.027+00:00","done","5d671e82-10f0-4cf3-bef9-c9be08f90387","ai-agents-token-spending-coding-tasks-en","research","b2725e14-d169-4ef3-9b57-0cc23a7e9338","published","2026-04-27T09:00:07.272+00:00",[32,35,37,39,41],{"name":33,"slug":34},"SWE-Bench","swe-bench",{"name":17,"slug":36},"llm-efficiency",{"name":15,"slug":38},"agentic-coding",{"name":14,"slug":40},"token-consumption",{"name":13,"slug":42},"ai-agents",{"id":28,"slug":44,"title":45,"language":46},"ai-agents-token-spending-coding-tasks-zh","AI 代理寫程式：token 比 chat 多燒 1000 倍","zh",[48,54,60,66,72,78],{"id":49,"slug":50,"title":51,"cover_image":52,"image_url":52,"created_at":53,"category":27},"94994abd-e24d-4fd1-b941-942d03d19acf","turboquant-seo-shift-small-sites-en","TurboQuant and the SEO Shift for Small Sites","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778840455122-jfce.png","2026-05-15T10:20:28.134545+00:00",{"id":55,"slug":56,"title":57,"cover_image":58,"image_url":58,"created_at":59,"category":27},"670a7f69-911f-41e8-a18b-7d3491253a19","turboquant-vllm-comparison-fp8-kv-cache-en","TurboQuant vs FP8: vLLM’s first broad 
test","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778839858405-b5ao.png","2026-05-15T10:10:37.219158+00:00",{"id":61,"slug":62,"title":63,"cover_image":64,"image_url":64,"created_at":65,"category":27},"5aef1c57-961f-49f7-8277-f83f7336799a","llmbda-calculus-agent-safety-rules-en","LLMbda calculus gives agents safety rules","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778825459914-obkf.png","2026-05-15T06:10:36.242145+00:00",{"id":67,"slug":68,"title":69,"cover_image":70,"image_url":70,"created_at":71,"category":27},"712a0357-f7cd-48f2-adde-c2691da0815f","low-complexity-beamspace-denoiser-mmwave-mimo-en","A simpler beamspace denoiser for mmWave MIMO","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778814646705-e7mx.png","2026-05-15T03:10:31.764301+00:00",{"id":73,"slug":74,"title":75,"cover_image":76,"image_url":76,"created_at":77,"category":27},"f595f949-6ea1-4b0e-a632-f1832ef26e36","ai-benchmark-wins-cyber-scare-defenders-en","Why AI benchmark wins in cyber should scare defenders","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778807444539-gz7f.png","2026-05-15T01:10:30.04579+00:00",{"id":79,"slug":80,"title":81,"cover_image":82,"image_url":82,"created_at":83,"category":27},"3ad202d1-9e5f-49c5-8383-02fcf1a23cf2","why-linux-security-needs-patch-wave-mindset-en","Why Linux security needs a patch-wave mindset","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778741441493-ikl6.png","2026-05-14T06:50:25.906256+00:00",[85,90,95,100,105,110,115,120,125,130],{"id":86,"slug":87,"title":88,"created_at":89},"a2715e72-1fe8-41b3-abb1-d0cf1f710189","ai-predictions-2026-big-changes-en","AI Predictions for 
2026: Brace for Big Changes","2026-03-26T01:25:07.788356+00:00",{"id":91,"slug":92,"title":93,"created_at":94},"8404bd7b-4c2f-4109-9ec4-baf29d88af2b","ml-papers-of-the-week-github-research-desk-en","ML Papers of the Week Turns GitHub Into a Research Desk","2026-03-27T01:11:39.480259+00:00",{"id":96,"slug":97,"title":98,"created_at":99},"87897a94-8065-4464-a016-1f23e89e17cc","ai-ml-conferences-to-watch-in-2026-en","AI\u002FML Conferences to Watch in 2026","2026-03-27T01:51:54.184108+00:00",{"id":101,"slug":102,"title":103,"created_at":104},"6f1987cf-25f3-47a4-b3e6-db0997695be8","openclaw-agents-manipulated-self-sabotage-en","OpenClaw Agents Can Be Manipulated Into Failure","2026-03-28T03:03:18.899465+00:00",{"id":106,"slug":107,"title":108,"created_at":109},"a53571ad-735a-4178-9f93-cb09b699d99c","vega-driving-language-instructions-en","Vega: Driving with Natural Language Instructions","2026-03-28T14:54:04.698882+00:00",{"id":111,"slug":112,"title":113,"created_at":114},"a34581d6-f36e-46da-88bb-582fb3e7425c","personalizing-autonomous-driving-styles-en","Drive My Way: Personalizing Autonomous Driving Styles","2026-03-28T14:54:26.148181+00:00",{"id":116,"slug":117,"title":118,"created_at":119},"2bc1ad7f-26ce-4f02-9885-803b35fd229d","training-knowledge-bases-writeback-rag-en","Training Knowledge Bases with WriteBack-RAG","2026-03-28T14:54:45.643433+00:00",{"id":121,"slug":122,"title":123,"created_at":124},"71adc507-3c54-4605-bbe2-c966acd6187e","packforcing-long-video-generation-en","PackForcing: Efficient Long-Video Generation Method","2026-03-28T14:55:02.646943+00:00",{"id":126,"slug":127,"title":128,"created_at":129},"675942ef-b9ec-4c5f-a997-381250b6eacb","pixelsmile-facial-expression-editing-en","PixelSmile Framework Enhances Facial Expression Editing","2026-03-28T14:55:20.633463+00:00",{"id":131,"slug":132,"title":133,"created_at":134},"6954fa2b-8b66-4839-884b-e46f89fa1bc3","adaptive-block-scaled-data-types-en","IF4: Smarter 4-Bit Quantization That Adapts to Your 
Data","2026-03-31T06:00:36.65963+00:00"]