[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-model-triage-coding-tests-cost-win-en":3,"article-related-model-triage-coding-tests-cost-win-en":30,"series-tools-95a3ce84-1732-4bce-a705-4957ca6f06af":75},{"id":4,"slug":5,"title":6,"content":7,"summary":8,"source":9,"source_url":10,"author":11,"image_url":12,"cover_image":12,"category":13,"language":14,"translated_content":11,"related_article_id":15,"keywords":16,"key_takeaways":22,"views":26,"created_at":27,"published_at":28,"topic_cluster_id":29},"95a3ce84-1732-4bce-a705-4957ca6f06af","model-triage-coding-tests-cost-win-en","Model triage turns coding tests into a cost win","\u003Cp data-speakable=\"summary\">A copy-ready model triage workflow for cheaper coding tasks.\u003C\u002Fp>\u003Cp>I’ve been using \u003Ca href=\"\u002Ftag\u002Fai-coding\">AI coding\u003C\u002Fa> assistants long enough to know when the bill starts lying to me. The output looks fine, the demos look slick, and then you check the usage log and realize you just paid a stupid amount for a task that should have been boring. That’s the part that keeps nagging at me. Not whether the model can solve the problem, but whether I picked a model that made any economic sense in the first place.\u003C\u002Fp>\u003Cp>The annoying thing is that this failure mode doesn’t feel like a failure. The expensive model answers fast, sounds confident, and usually gets you close enough that you stop thinking. Then you notice the cost. Or worse, you notice the cost only after you’ve built a habit around always reaching for the biggest model in the room. I’ve done that. It’s lazy, and it gets expensive fast.\u003C\u002Fp>\u003Cp>What finally clicked for me is that model choice is not a one-time architecture decision. It’s a triage problem. Some tasks deserve a heavyweight model. A lot of tasks don’t. 
And if you’re building agentic workflows, that difference is the whole game.\u003C\u002Fp>\u003Cp>The article that pushed me to write this was \u003Ca href=\"https:\u002F\u002Fthenewstack.io\u002Fclaude-fable-cost-model-triage\u002F\">The New Stack’s piece on Claude Fable and GPT-5.5 cost differences\u003C\u002Fa>. It’s not a theory piece. It’s a practical reminder that the same coding test can swing wildly in cost depending on which model you point at it.\u003C\u002Fp>\u003Cp>That’s the real takeaway: model triage is becoming a developer skill, not a procurement footnote. If I can route the easy stuff to a cheaper model and keep the expensive one for the hard calls, I get better throughput without turning my budget into confetti.\u003C\u002Fp>\u003Ch2>Stop treating model choice like a vibe\u003C\u002Fh2>\u003Cblockquote>“Claude Fable cost $9 in one coding test. GPT-5.5 cost $1.50.”\u003C\u002Fblockquote>\u003Cp>What this actually means is that identical-looking work can have very different token and tool costs depending on the model you choose. I’m not reading that as “\u003Ca href=\"\u002Ftag\u002Fclaude\">Claude\u003C\u002Fa> bad, GPT good.” That’s too shallow and usually wrong by the next \u003Ca href=\"\u002Ftag\u002Fbenchmark\">benchmark\u003C\u002Fa> cycle. I’m reading it as a warning that defaulting to the fanciest model is a tax you pay for not classifying the task.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781840906662-fpo6.png\" alt=\"Model triage turns coding tests into a cost win\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>I’ve seen teams do this with \u003Ca href=\"\u002Ftag\u002Fcode-review\">code review\u003C\u002Fa>, test generation, refactors, and agent planning. They send everything to the same high-end model because it feels safer. It also feels easier. 
But easy is expensive. If the task is mostly pattern matching, rewriting, or extracting structure, you don’t need to burn premium \u003Ca href=\"\u002Ftag\u002Finference\">inference\u003C\u002Fa> on it.\u003C\u002Fp>\u003Cp>How to apply it: start by labeling your AI tasks into three buckets. First, low-risk mechanical work like formatting, summarizing, and boilerplate generation. Second, medium-risk work like code edits, test writing, and dependency analysis. Third, high-risk work like architectural decisions, bug diagnosis, and multi-step planning. The first bucket should almost never touch your most expensive model. The second bucket should use the cheapest model that still clears your quality bar. The third bucket is where you pay for reasoning.\u003C\u002Fp>\u003Cp>That sounds obvious until you wire it into a real product. Then you realize your “assistant” is mostly doing bucket-one and bucket-two work, and you’ve been paying bucket-three rates the whole time.\u003C\u002Fp>\u003Cp>I ran into this while testing a code assistant that was great at making small changes but terrible at knowing when to stop talking. The model was not the problem. My routing was. Everything got the same treatment, so every task inherited the cost of the worst-case path.\u003C\u002Fp>\u003Ch2>Model triage is just good routing with better branding\u003C\u002Fh2>\u003Cp>When people say “model triage,” I think they mean a simple thing: don’t spend expensive intelligence on cheap problems. That’s it. The phrase sounds fancy, but the practice is very old. We already do this with compilers, caches, queues, and background jobs. The AI version just makes the waste visible because the meter is always running.\u003C\u002Fp>\u003Cp>This is where I think a lot of agent builders get sloppy. They design the agent around one model and then bolt on tools later. That’s backwards. 
The agent should ask, “What kind of work is this?” before it asks, “Which model should I use?” If you don’t do that, the agent becomes a one-size-fits-all slot machine.\u003C\u002Fp>\u003Cp>How to apply it: add a lightweight classifier step before the main model call. It can be rules-based, model-based, or a hybrid. I prefer rules first, because they’re cheap and predictable. If the request includes “summarize,” “extract,” “rewrite,” or “format,” send it low. If it includes “find the bug,” “compare approaches,” or “plan the implementation,” send it higher. If the task is ambiguous, ask a clarifying question instead of guessing with an expensive model.\u003C\u002Fp>\u003Cul>\u003Cli>Cheap model: extraction, rewriting, classification, simple code edits.\u003C\u002Fli>\u003Cli>Mid-tier model: test generation, refactors, dependency tracing, tool use.\u003C\u002Fli>\u003Cli>Premium model: architecture, debugging across systems, long-horizon planning.\u003C\u002Fli>\u003C\u002Ful>\u003Cp>I like this because it turns model selection into an explicit policy instead of an emotional reflex. And once it’s a policy, you can measure it, tune it, and argue about it like adults.\u003C\u002Fp>\u003Cp>There’s also a hidden benefit: routing forces you to define what “good enough” means. That’s uncomfortable, but useful. A lot of teams say they need the best model when what they really mean is they don’t have a success metric yet.\u003C\u002Fp>\u003Ch2>The cheapest model is the one that gets rejected least\u003C\u002Fh2>\u003Cp>Here’s the part nobody likes admitting: the best model for a task is often the one that fails in the least annoying way. Not the smartest. Not the most eloquent. 
The one that gives you useful output without creating cleanup work.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781840898322-z183.png\" alt=\"Model triage turns coding tests into a cost win\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>I’ve had cheaper models generate code that was slightly clunkier but easier to review. That matters. If a premium model writes beautiful nonsense, I still have to inspect every line. If a smaller model writes plain, boring, correct-ish code, I can move faster. That’s a better trade for a lot of day-to-day work.\u003C\u002Fp>\u003Cp>What this actually means is that quality is not a single axis. You should think about correctness, review cost, latency, and spend together. A model that costs less but doubles review time is not cheaper. A model that costs more but cuts follow-up prompts in half might actually win. This is why raw benchmark talk can be so annoying. Benchmarks don’t pay your cloud bill or your engineer’s attention budget.\u003C\u002Fp>\u003Cp>I ran into this with unit test generation. The expensive model wrote more elegant tests, but the cheaper one wrote tests I could scan faster and fix faster. The cheaper output wasn’t prettier. It was more operationally useful.\u003C\u002Fp>\u003Cp>How to apply it: measure total task cost, not just inference cost. That includes retries, human review time, and tool calls. If you can’t measure all three yet, at least track how often a model’s first answer gets accepted without edits. 
That’s a brutally honest signal.\u003C\u002Fp>\u003Cul>\u003Cli>Track acceptance rate on first pass.\u003C\u002Fli>\u003Cli>Track number of follow-up prompts per task.\u003C\u002Fli>\u003Cli>Track human edit time after model output.\u003C\u002Fli>\u003Cli>Track dollar cost per completed task, not per call.\u003C\u002Fli>\u003C\u002Ful>\u003Cp>Once you do that, model choice stops being ideology and starts being operations.\u003C\u002Fp>\u003Ch2>Agent workflows need a budget, not just a brain\u003C\u002Fh2>\u003Cp>The New Stack story lands because it’s not really about one model being cheaper than another. It’s about the fact that agent workflows now have economic behavior. If an agent can choose tools, call APIs, and loop through reasoning steps, then cost grows with every unnecessary move. That means your agent needs budget awareness baked in from the start.\u003C\u002Fp>\u003Cp>If you’ve ever watched an agent spin its wheels, you know the smell. It keeps thinking, keeps calling tools, keeps “trying alternatives,” and somehow the task gets more expensive while the answer gets only marginally better. That’s not intelligence. That’s an unbounded bill with a nice UI.\u003C\u002Fp>\u003Cp>What this actually means is that every agent should have guardrails. Not just safety guardrails, but spend guardrails. Limit the number of attempts. Limit tool calls. Escalate only when the lower tier fails. Stop when the marginal gain is not worth the extra cost.\u003C\u002Fp>\u003Cp>I’ve found this especially important in coding agents. They love to wander. Give them a vague bug report and they’ll inspect half the repo, rewrite some unrelated file, and then ask for permission to continue. That’s where triage saves you. Cheap model first for interpretation. Better model only if the problem survives the first pass.\u003C\u002Fp>\u003Cp>How to apply it: build a two-step agent pipeline. Step one classifies the task and estimates difficulty. Step two chooses the model and tool budget. 
If the task is low confidence or high impact, escalate. If it’s routine, cap the spend and move on. This is not fancy orchestration. It’s just refusing to let every request become a research project.\u003C\u002Fp>\u003Cp>And yes, you can make this boring. Boring is good. Boring means the invoice doesn’t surprise you.\u003C\u002Fp>\u003Ch2>Use triage to separate signal from ego\u003C\u002Fh2>\u003Cp>One reason developers overuse premium models is ego. I don’t mean that as an insult. I mean it literally. It feels nicer to say “the best model handled it” than “the cheap model was enough.” But your job is not to impress the API. Your job is to ship useful software without setting money on fire.\u003C\u002Fp>\u003Cp>There’s also a status problem. Teams like to talk about which model they’re using as if the label itself proves sophistication. It doesn’t. Sophistication is picking the right model for the job and knowing when to stop.\u003C\u002Fp>\u003Cp>How to apply it: write down a routing policy that your team can actually follow. Keep it short. Put it next to the code, not in a slide deck nobody reads. If your policy says the expensive model is only for ambiguous, high-impact, or multi-step reasoning tasks, then enforce that. If someone wants to override it, make them explain why.\u003C\u002Fp>\u003Cp>That sounds bureaucratic, but it’s cheaper than discovering six weeks later that your “assistant” has been charging premium rates to rename variables.\u003C\u002Fp>\u003Cp>I like to think of this as model humility. Not because models are weak, but because they’re expensive in exactly the places where laziness hurts most. If you can get the same result from a smaller model, you should. If you can’t, escalate with intent.\u003C\u002Fp>\u003Ch2>The pattern I’d actually ship\u003C\u002Fh2>\u003Cp>If I were building this into a product today, I’d keep it simple. I wouldn’t start with a giant router service or some overdesigned policy engine. 
I’d start with a cheap classifier, a clear escalation path, and a log of what happened. That gets you 80% of the value without creating a new platform project.\u003C\u002Fp>\u003Cp>What this actually means is:\u003C\u002Fp>\u003Cul>\u003Cli>Classify the request.\u003C\u002Fli>\u003Cli>Pick the cheapest model that can probably do the job.\u003C\u002Fli>\u003Cli>Escalate only on failure, ambiguity, or high impact.\u003C\u002Fli>\u003Cli>Record cost, retries, and human edits.\u003C\u002Fli>\u003Cli>Review the routing rules every week.\u003C\u002Fli>\u003C\u002Ful>\u003Cp>I’ve seen enough AI systems now to know the failure is rarely “the model is too small.” More often it’s “we never asked whether this task deserved the expensive path.” That’s a routing problem, not an intelligence problem.\u003C\u002Fp>\u003Cp>And once you start thinking this way, you see it everywhere. Support bots. Code assistants. Document generators. Internal copilots. Most of them are doing a mix of easy and hard work, and most of them are probably overpaying for the easy part.\u003C\u002Fp>\u003Cp>The nice thing is that triage is easy to test. Route some tasks down. Route some up. Compare acceptance rate, latency, and spend. If the cheaper path holds up, keep it. 
If it falls apart, you learned something useful instead of just burning money in silence.\u003C\u002Fp>\u003Ch2>The template you can copy\u003C\u002Fh2>\u003Cpre>\u003Ccode># Model triage policy for AI coding tasks\n\n## Goal\nUse the cheapest model that can complete the task correctly.\nEscalate only when the task is ambiguous, high impact, or fails on the first pass.\n\n## Task buckets\n\n### Bucket 1: Low-cost mechanical work\nUse the cheapest acceptable model.\nExamples:\n- Summarizing code or docs\n- Rewriting text\n- Extracting fields\n- Formatting JSON or markdown\n- Simple code edits\n\n### Bucket 2: Medium-complexity work\nUse a mid-tier model first.\nEscalate if needed.\nExamples:\n- Writing unit tests\n- Small refactors\n- Dependency tracing\n- Tool-assisted lookup\n- Bug isolation in one file or module\n\n### Bucket 3: High-complexity or high-risk work\nUse the strongest model available.\nExamples:\n- Architecture decisions\n- Cross-service debugging\n- Long-horizon planning\n- Security-sensitive changes\n- Tasks with unclear requirements\n\n## Routing rules\n\n1. If the request is obviously mechanical, route to Bucket 1.\n2. If the request needs reasoning but is bounded, route to Bucket 2.\n3. If the request is ambiguous, user-facing, or costly to get wrong, route to Bucket 3.\n4. If Bucket 1 fails twice, escalate to Bucket 2.\n5. If Bucket 2 fails twice or the task remains ambiguous, escalate to Bucket 3.\n6. 
If the task is still unclear, ask a clarifying question instead of guessing.\n\n## Cost guardrails\n\n- Max 2 retries before escalation\n- Max tool calls per task: 3\n- Max total tokens per task: set a budget per bucket\n- Stop when the next attempt is unlikely to improve the result materially\n\n## Metrics to log\n\n- Model used\n- Task bucket\n- Number of retries\n- Number of tool calls\n- Human edit time\n- First-pass acceptance rate\n- Total cost per completed task\n\n## Weekly review questions\n\n- Which tasks were over-routed to expensive models?\n- Which cheap-model tasks required too many fixes?\n- Did escalation happen for the right reasons?\n- Are we paying premium rates for routine work?\n- Which routing rules should be tightened?\n\n## Example implementation sketch\n\nif task.is_mechanical():\n    model = cheap_model\nelif task.is_bounded_reasoning():\n    model = mid_model\nelse:\n    model = premium_model  # ambiguous, high-risk, or anything unclassified\n\nfailures = 0\nresult = run_model(model, task)\n\nwhile result.failed:\n    failures += 1\n    if failures >= 2:\n        escalate_one_level()  # routing rules 4 and 5: two failures, then move up\n        break\n    result = run_model(model, task)\n\n## Team rule\nIf you override the router, write down why.\n\n## Success definition\nThe system should lower cost without reducing first-pass usefulness or increasing human cleanup time.\u003C\u002Fcode>\u003C\u002Fpre>\u003Cp>This template is intentionally plain. That’s the point. You do not need a cathedral here. You need a policy that keeps your AI spend from drifting upward every time someone says, “Let’s just use the best model.”\u003C\u002Fp>\u003Cp>Source attribution: I based this breakdown on \u003Ca href=\"https:\u002F\u002Fthenewstack.io\u002Fclaude-fable-cost-model-triage\u002F\">The New Stack article\u003C\u002Fa> about Claude Fable and GPT-5.5 cost differences. 
The routing policy, bucket definitions, and template above are my own synthesis, not copied from the source.\u003C\u002Fp>","A practical breakdown of model triage, with a copy-ready workflow for picking cheaper models without wrecking code quality.","thenewstack.io","https:\u002F\u002Fthenewstack.io\u002Fclaude-fable-cost-model-triage\u002F",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781840906662-fpo6.png","tools","en","60a23c5e-d9df-4186-a30e-5d2c123a0ed6",[17,18,19,20,21],"model triage","AI coding","cost optimization","agent routing","LLM workflows",[23,24,25],"Model selection should be treated as routing, not vibes.","Cheaper models win when the task is mechanical or bounded.","Track total task cost, not just inference cost.",0,"2026-06-19T03:47:52.260391+00:00","2026-06-19T03:47:52.254+00:00","e4b90b62-2711-4565-95f6-99074b613dde",{"tags":31,"relatedLang":34,"relatedPosts":38},[32],{"name":18,"slug":33},"ai-coding",{"id":15,"slug":35,"title":36,"language":37},"model-triage-coding-tests-cost-win-zh","模型分流把測試成本壓下來","zh",[39,45,51,57,63,69],{"id":40,"slug":41,"title":42,"cover_image":43,"image_url":43,"created_at":44,"category":13},"f9ee1fee-7ac0-4072-a330-dbe682e03b84","renesas-acquires-altium-pcb-design-tool-update-en","Renesas fully acquires Altium, PCB design tools updated","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781859772738-3319.png","2026-06-19T09:02:23.631252+00:00",{"id":46,"slug":47,"title":48,"cover_image":49,"image_url":49,"created_at":50,"category":13},"c7baab44-71c8-4905-9a7d-a54a98e6cc45","rust-forum-week-25-turns-ideas-into-shipping-work-en","Rust forum week 25 turns ideas into shipping 
work","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781857111323-iib7.png","2026-06-19T08:18:05.668091+00:00",{"id":52,"slug":53,"title":54,"cover_image":55,"image_url":55,"created_at":56,"category":13},"014be76a-746c-4892-b144-90c05a0c61c6","claude-code-rust-native-terminal-interface-en","Claude Code Rust trims TUI overhead to one binary","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781854432173-8t6o.png","2026-06-19T07:33:30.328578+00:00",{"id":58,"slug":59,"title":60,"cover_image":61,"image_url":61,"created_at":62,"category":13},"ae4915a0-e313-438e-b724-e04e07331683","open-source-tools-vibe-coding-cybersecurity-en","Open source tools that make vibe coding safer","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781852617883-ajan.png","2026-06-19T07:03:09.073748+00:00",{"id":64,"slug":65,"title":66,"cover_image":67,"image_url":67,"created_at":68,"category":13},"cb08c71e-096a-4508-b172-4698b9a607cc","fine-tuning-llms-locally-sft-lora-dpo-en","Fine-Tuning LLMs Locally: SFT, LoRA, DPO","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781839068257-3o35.png","2026-06-19T03:17:22.225063+00:00",{"id":70,"slug":71,"title":72,"cover_image":73,"image_url":73,"created_at":74,"category":13},"cc36e220-9a33-4580-928b-ff7d4c2549ef","vercel-eve-agents-as-directories-en","Vercel’s eve turns agents into directories","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781828295099-qmhc.png","2026-06-19T00:17:45.889297+00:00",[76,81,86,91,96,101,106,111,116,121],{"id":77,"slug":78,"title":79,"created_at":80},"8008f1a9-7a00-4bad-88c9-3eedc9c6b4b1","surepath-ai-mcp-policy-controls-en","SurePath AI's New MCP Policy Controls 
Enhance AI Security","2026-03-26T01:26:52.222015+00:00",{"id":82,"slug":83,"title":84,"created_at":85},"27e39a8f-b65d-4f7b-a875-859e2b210156","mcp-standard-ai-tools-2026-en","MCP Standard in 2026: Integrating AI Tools","2026-03-26T01:27:43.127519+00:00",{"id":87,"slug":88,"title":89,"created_at":90},"165f9a19-c92d-46ba-b3f0-7125f662921d","rag-2026-transforming-enterprise-ai-en","How RAG in 2026 is Transforming Enterprise AI","2026-03-26T01:28:11.485236+00:00",{"id":92,"slug":93,"title":94,"created_at":95},"6a2a8e6e-b956-49d8-be12-cc47bdc132b2","mastering-ai-prompts-2026-guide-en","Mastering AI Prompts: A 2026 Guide for Developers","2026-03-26T01:29:07.835148+00:00",{"id":97,"slug":98,"title":99,"created_at":100},"3ab2c67e-4664-4c67-a013-687a2f605814","garry-tan-open-sources-claude-code-toolkit-en","Garry Tan Open-Sources a Claude Code Toolkit","2026-03-26T08:26:20.245934+00:00",{"id":102,"slug":103,"title":104,"created_at":105},"66a7cbf8-7e76-41d4-9bbf-eaca9761bf69","github-ai-projects-to-watch-in-2026-en","20 GitHub AI Projects to Watch in 2026","2026-03-26T08:28:09.752027+00:00",{"id":107,"slug":108,"title":109,"created_at":110},"9f332fda-eace-448a-a292-2283951eee71","practical-github-guide-learning-ml-2026-en","A Practical GitHub Guide to Learning ML in 2026","2026-03-27T01:16:50.125678+00:00",{"id":112,"slug":113,"title":114,"created_at":115},"1b1f637d-0f4d-42bd-974b-07b53829144d","aiml-2026-student-ai-ml-lab-repo-review-en","AIML-2026 Is a Bare-Bones Student Lab Repo","2026-03-27T01:21:51.661231+00:00",{"id":117,"slug":118,"title":119,"created_at":120},"6d1bf3f6-e191-4d30-b55b-8a0722fa6afe","ai-trending-github-repos-and-research-feeds-en","AI Trending Tracks Repos and Research Feeds","2026-03-27T01:31:35.709532+00:00",{"id":122,"slug":123,"title":124,"created_at":125},"010539a1-4c3a-4bd3-937a-26616422ee0d","awesome-ai-for-science-research-tools-map-en","Awesome AI for Science Is Becoming a Real Research Map","2026-03-27T01:46:50.89513+00:00"]