[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-marginlab-claude-code-opus-46-tracker-en":3,"tags-marginlab-claude-code-opus-46-tracker-en":30,"related-lang-marginlab-claude-code-opus-46-tracker-en":42,"related-posts-marginlab-claude-code-opus-46-tracker-en":46,"series-ai-agent-1e86831a-5448-4953-b598-edd58f6f58d6":83},{"id":4,"title":5,"content":6,"summary":7,"source":8,"source_url":9,"author":10,"image_url":11,"keywords":12,"language":18,"translated_content":10,"views":19,"is_premium":20,"created_at":21,"updated_at":21,"cover_image":11,"published_at":22,"rewrite_status":23,"rewrite_error":10,"rewritten_from_id":24,"slug":25,"category":26,"related_article_id":27,"status":28,"google_indexed_at":29,"x_posted_at":10,"tweet_text":10,"title_rewritten_at":10,"title_original":10,"key_takeaways":10,"topic_cluster_id":10,"embedding":10,"is_canonical_seed":20},"1e86831a-5448-4953-b598-edd58f6f58d6","Marginlab Tracks Claude Code Opus 4.6 Drift","\u003Cp>\u003Ca href=\"https:\u002F\u002Fmarginlab.ai\u002Ftrackers\u002Fclaude-code\u002F\" target=\"_blank\" rel=\"noopener\">Marginlab\u003C\u002Fa> has built a daily tracker for \u003Ca href=\"https:\u002F\u002Fwww.anthropic.com\u002Fclaude-code\" target=\"_blank\" rel=\"noopener\">Claude Code\u003C\u002Fa> running \u003Ca href=\"https:\u002F\u002Fwww.anthropic.com\u002Fnews\u002Fclaude-4-6\" target=\"_blank\" rel=\"noopener\">Opus 4.6\u003C\u002Fa>, and the setup is refreshingly concrete: 50 benchmark cases every day, with weekly and monthly rollups for a steadier read. The point is simple and useful for anyone watching agent quality closely: catch statistically significant drops before they become user complaints.\u003C\u002Fp>\u003Cp>This matters because \u003Ca href=\"\u002Fnews\u002Fglm-5-zai-flagship-coding-agents-en\">coding agents\u003C\u002Fa> can look fine on a single demo and still drift in real use. 
Marginlab’s tracker focuses on a curated subset of \u003Ca href=\"https:\u002F\u002Fwww.swebench.com\u002F\" target=\"_blank\" rel=\"noopener\">SWE-Bench-Pro\u003C\u002Fa>, runs directly in Claude Code CLI, and avoids custom harnesses so the numbers reflect what users actually experience.\u003C\u002Fp>\u003Ch2>What the tracker measures every day\u003C\u002Fh2>\u003Cp>The dashboard is built around pass rate, but it also exposes the operational signals that usually explain performance changes. That includes input tokens, output tokens, API cost, average runtime per instance, and total tool calls during the daily run.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775147267640-s3yy.png\" alt=\"Marginlab Tracks Claude Code Opus 4.6 Drift\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>That mix is smart. If pass rate slips while tool calls spike, you may be looking at a model that is thrashing through more actions to solve the same tasks. If runtime rises without a matching token jump, the issue could be in execution behavior, backend latency, or the agent loop itself.\u003C\u002Fp>\u003Cp>Marginlab also shows a degradation status panel with thresholds tied to sample size. With only 50 eval cases, the tracker says a change of about ±13.8% is needed to clear the p &lt; 0.05 bar. At 350 cases, that shrinks to ±4.8%. 
At 1,400 cases, the threshold drops to ±2.3%.\u003C\u002Fp>\u003Cul>\u003Cli>Daily sample size: 50 SWE-Bench-Pro cases\u003C\u002Fli>\u003Cli>Weekly and monthly aggregates for tighter confidence bands\u003C\u002Fli>\u003Cli>Pass-rate testing modeled with Bernoulli trials\u003C\u002Fli>\u003Cli>95% confidence intervals shown for daily, weekly, and monthly results\u003C\u002Fli>\u003Cli>Direct Claude Code CLI runs, no custom harness layer\u003C\u002Fli>\u003C\u002Ful>\u003Cp>That last point is easy to miss, but it is the most important design choice on the page. If a benchmark uses a heavy wrapper, it can hide regressions or create fake ones. Marginlab’s method tries to measure the same path a real developer would use.\u003C\u002Fp>\u003Ch2>Why this tracker exists now\u003C\u002Fh2>\u003Cp>Marginlab says the tracker was built in response to Anthropic’s September 2025 postmortem on Claude degradations. That postmortem made one thing clear: model quality can move in the wrong direction after release, and teams need a way to notice quickly.\u003C\u002Fp>\u003Cp>The company also says it is an independent third party with no affiliation to frontier model providers. That matters because benchmark dashboards often blur the line between marketing and measurement. Here, the stated goal is narrower and more credible: detect statistically significant degradation in Claude Code on SWE tasks, not celebrate every new release.\u003C\u002Fp>\u003Cp>The methodology is also meant to be contamination-resistant. In practice, that means the benchmark subset is curated to reduce the chance that training data leakage makes the results look better than they should. For coding agents, that kind of care is the difference between useful signal and dashboard theater.\u003C\u002Fp>\u003Cblockquote>“We want to offer a resource to detect such degradations in the future.” — Marginlab, tracker methodology\u003C\u002Fblockquote>\u003Cp>That quote gets to the heart of the project. 
This is less about leaderboard bragging rights and more about early warning. If a model update changes the way Claude Code reasons, calls tools, or handles long-running tasks, a daily tracker can surface the drift before it becomes invisible background noise.\u003C\u002Fp>\u003Ch2>How the numbers compare in practice\u003C\u002Fh2>\u003Cp>The most interesting part of the page is how it frames sample size against statistical confidence. Small daily runs are noisy by design, so the dashboard does not pretend a one-day dip is always meaningful. Instead, it compares daily, weekly, and monthly windows so users can separate random variation from a real performance slide.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775147267083-fugz.png\" alt=\"Marginlab Tracks Claude Code Opus 4.6 Drift\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>That approach is especially useful in agent benchmarks, where tool use and runtime can swing from day to day. A model may solve the same issue with more steps, or spend more time on retries, without changing the final pass rate much. Watching those side metrics gives you a better read on whether the agent is becoming less efficient even before accuracy drops.\u003C\u002Fp>\u003Cul>\u003Cli>50 cases: about ±13.8% change needed for statistical significance\u003C\u002Fli>\u003Cli>350 cases: about ±4.8% change needed for statistical significance\u003C\u002Fli>\u003Cli>1,400 cases: about ±2.3% change needed for statistical significance\u003C\u002Fli>\u003Cli>Daily runs are paired with weekly and monthly aggregation\u003C\u002Fli>\u003Cli>Reported metrics include pass rate, runtime, tool calls, token usage, and API cost\u003C\u002Fli>\u003C\u002Ful>\u003Cp>That is a practical tradeoff. 
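\u003C\u002Fp>\u003Cp>Those thresholds behave like a textbook normal approximation to a Bernoulli pass rate. As a rough sketch (my own illustration, not Marginlab’s published method; the function name, baseline rate, and z value below are assumptions), the smallest detectable shift at p &lt; 0.05 is about z × sqrt(p(1−p)\u002Fn):\u003C\u002Fp>

```python
# Rough sketch of a Wald-style detectability threshold for a Bernoulli pass rate.
# Assumptions, not from Marginlab: a two-sided z of 1.96 and a baseline rate p;
# the real dashboard may use a different test or plug in the observed pass rate.
import math

def detectable_change(n, p=0.5, z=1.96):
    # Smallest pass-rate shift, in percentage points, significant at the 5% level
    return 100 * z * math.sqrt(p * (1 - p) / n)

for n in (50, 350, 1400):
    print(n, round(detectable_change(n), 1))
```

\u003Cp>With p = 0.5 this gives roughly ±13.9, ±5.2, and ±2.6 points, the same ballpark as the dashboard’s ±13.8%, ±4.8%, and ±2.3%; the published figures presumably use a different baseline rate or test. Note that 350 and 1,400 line up with 7 and 28 daily runs of 50 cases, which is why the weekly and monthly rollups tighten the bands.\u003C\u002Fp>\u003Cp>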
A daily tracker gives fast feedback, while longer windows reduce the chance of overreacting to noise. For \u003Ca href=\"\u002Fnews\u002Fmlops-explained-how-ml-teams-ship-models-en\">teams shipping\u003C\u002Fa> agentic products, that combination is more valuable than a single headline score.\u003C\u002Fp>\u003Cp>It also makes the tracker useful beyond Claude Code itself. Any team working on coding agents can compare their own internal monitoring against Marginlab’s setup and ask a blunt question: are we measuring the model, the harness, or both?\u003C\u002Fp>\u003Ch2>What developers should watch next\u003C\u002Fh2>\u003Cp>If you rely on Claude Code for real work, the tracker is worth bookmarking. The most actionable signal is not a one-off pass-rate dip; it is a pattern that shows up across daily runs and is echoed in runtime, tool calls, or token consumption. That is where quality issues usually become visible first.\u003C\u002Fp>\u003Cp>There is also a broader lesson here for the agent world. As models get better, the hard part shifts from proving they can solve a benchmark to proving they keep doing it week after week. A daily degradation monitor is a small idea, but it addresses a very real operational problem.\u003C\u002Fp>\u003Cp>My read is that more teams will copy this pattern in 2026: independent trackers, direct CLI execution, fixed task subsets, and statistical alarms instead of vibe-based judgment. If Marginlab keeps publishing these numbers consistently, it may become the reference point people check after every major \u003Ca href=\"\u002Fnews\u002Fanthropic-april-2026-claude-code-update-en\">Claude Code update\u003C\u002Fa>.\u003C\u002Fp>\u003Cp>The practical takeaway is simple: if your product depends on coding agents, treat benchmark drift like uptime. The question is no longer whether a model can score well once. 
The question is whether it can hold that score after the next release, the next backend change, and the next quiet update to the agent loop.\u003C\u002Fp>","Marginlab’s daily tracker watches Claude Code Opus 4.6 on 50 SWE-Bench-Pro tasks and flags statistically significant drops.","marginlab.ai","https:\u002F\u002Fmarginlab.ai\u002Ftrackers\u002Fclaude-code\u002F",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775147267640-s3yy.png",[13,14,15,16,17],"Claude Code","Opus 4.6","SWE-Bench-Pro","benchmark drift","AI agents","en",0,false,"2026-04-02T16:27:31.350256+00:00","2026-04-02T16:27:31.322+00:00","done","7b9c08cf-92ae-426b-b6ac-5a5bb0fff51c","marginlab-claude-code-opus-46-tracker-en","ai-agent","62a923b4-173d-465e-93f0-071226ba6119","published","2026-04-08T09:00:50.386+00:00",[31,33,35,38,40],{"name":13,"slug":32},"claude-code",{"name":16,"slug":34},"benchmark-drift",{"name":36,"slug":37},"SWE-bench Pro","swe-bench-pro",{"name":14,"slug":39},"opus-46",{"name":17,"slug":41},"ai-agents",{"id":27,"slug":43,"title":44,"language":45},"marginlab-claude-code-opus-46-tracker-zh","Marginlab 盯上 Claude Code 漂移","zh",[47,53,59,65,71,77],{"id":48,"slug":49,"title":50,"cover_image":51,"image_url":51,"created_at":52,"category":26},"c5d4bc11-1f4d-438c-b644-a8498826e1ab","claude-agent-dreaming-outcomes-multiagent-en","Claude给Agent加了“做梦”功能","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778868649463-f5qv.png","2026-05-15T18:10:25.29539+00:00",{"id":54,"slug":55,"title":56,"cover_image":57,"image_url":57,"created_at":58,"category":26},"fda44d24-7baf-4d91-a7f9-bbfecae20a27","switch-ai-outputs-markdown-to-html-en","How to Switch AI Outputs from Markdown to 
HTML","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778743249827-wmsr.png","2026-05-14T07:20:22.631724+00:00",{"id":60,"slug":61,"title":62,"cover_image":63,"image_url":63,"created_at":64,"category":26},"064275f5-4282-47c3-8e4a-60fe8ac99246","anthropic-cat-wu-proactive-ai-assistants-en","Anthropic’s Cat Wu on proactive AI assistants","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778735465548-a92i.png","2026-05-14T05:10:31.723441+00:00",{"id":66,"slug":67,"title":68,"cover_image":69,"image_url":69,"created_at":70,"category":26},"423ac8ad-2886-42a9-8dd8-78e5d43a1574","how-to-run-hermes-agent-on-discord-en","How to Run Hermes Agent on Discord","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778724656141-i30t.png","2026-05-14T02:10:35.727086+00:00",{"id":72,"slug":73,"title":74,"cover_image":75,"image_url":75,"created_at":76,"category":26},"776a562c-99a6-4a6b-93a0-9af40300f3f2","why-ragflow-is-the-right-open-source-rag-engine-to-self-host-en","Why RAGFlow is the right open-source RAG engine to self-host","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778674254587-0pxn.png","2026-05-13T12:10:25.721583+00:00",{"id":78,"slug":79,"title":80,"cover_image":81,"image_url":81,"created_at":82,"category":26},"322ec8bc-61d3-4c80-bb9e-a19941e137c6","how-to-add-temporal-rag-in-production-en","How to Add Temporal RAG in 
Production","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778667085221-0mox.png","2026-05-13T10:10:31.619892+00:00",[84,89,94,99,104,109,114,119,124,129],{"id":85,"slug":86,"title":87,"created_at":88},"03db8de8-8dc2-4ac1-9cf7-898782efbb1f","anthropic-claude-ai-agent-task-automation-en","Anthropic's Claude AI Agent: A New Era of Task Automation","2026-03-25T16:25:06.513026+00:00",{"id":90,"slug":91,"title":92,"created_at":93},"045d1abc-190d-4594-8c95-91e2a26f0c5a","googles-2026-ai-agent-report-decoded-en","Google’s 2026 AI Agent Report, Decoded","2026-03-26T11:15:23.046616+00:00",{"id":95,"slug":96,"title":97,"created_at":98},"e64aba21-254b-4f93-aa21-837484bb52ec","kimi-k25-review-stronger-still-not-legend-en","Kimi K2.5 review: stronger, still not a legend","2026-03-27T07:15:55.385951+00:00",{"id":100,"slug":101,"title":102,"created_at":103},"30dfb781-a1b2-4add-aebe-b3df40247c37","claude-code-controls-mac-desktop-en","Claude Code now controls your Mac desktop","2026-03-28T03:01:59.384091+00:00",{"id":105,"slug":106,"title":107,"created_at":108},"254405b6-7833-4800-8e13-f5196deefbe6","cloudflare-100x-faster-ai-agent-sandbox-en","Cloudflare’s 100x Faster AI Agent Sandbox","2026-03-28T03:09:44.356437+00:00",{"id":110,"slug":111,"title":112,"created_at":113},"04f29b7f-9b91-4306-89a7-97d725e6e1ba","openai-backs-isara-agent-swarm-bet-en","OpenAI backs Isara’s agent-swarm bet","2026-03-28T03:15:27.849766+00:00",{"id":115,"slug":116,"title":117,"created_at":118},"3b0bf479-e4ae-4703-9666-721a7e0cdb91","openai-plan-automated-ai-researcher-en","OpenAI’s plan for an automated AI researcher","2026-03-28T03:17:42.312819+00:00",{"id":120,"slug":121,"title":122,"created_at":123},"fe91bce0-b85d-4efa-a207-24ae9939c29f","harness-engineering-ai-agent-reliability-2026","Harness Engineering: From Bridle to Operating System, The Missing Link in AI Agent 
Reliability","2026-03-31T06:36:55.648751+00:00",{"id":125,"slug":126,"title":127,"created_at":128},"67dc66da-ca46-4aa5-970b-e997a39fe109","openai-codex-plugin-claude-code-en","OpenAI puts Codex inside Claude Code","2026-04-01T09:21:55.381386+00:00",{"id":130,"slug":131,"title":132,"created_at":133},"7a09007d-820f-43b3-8607-8ad1bfcb94c8","mcp-explained-from-prompts-to-production-en","MCP Explained: From Prompts to Production","2026-04-01T09:24:40.089177+00:00"]