[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-physicist-supervision-ai-scientific-software-en":3,"article-related-physicist-supervision-ai-scientific-software-en":30,"series-research-b5a7d0f1-7d58-4bca-a7b5-d4f9022d998b":81},{"id":4,"slug":5,"title":6,"content":7,"summary":8,"source":9,"source_url":10,"author":11,"image_url":12,"cover_image":12,"category":13,"language":14,"translated_content":11,"related_article_id":15,"keywords":16,"key_takeaways":22,"views":26,"created_at":27,"published_at":28,"topic_cluster_id":29},"b5a7d0f1-7d58-4bca-a7b5-d4f9022d998b","physicist-supervision-ai-scientific-software-en","Physicist Supervision Beat a Coding Agent","\u003Cp data-speakable=\"summary\">A physicist-supervised coding agent built scientific software, but human oversight caught failures tests missed.\u003C\u002Fp>\u003Cul>\u003Cli>\u003Cstrong>Research org\u003C\u002Fstrong>: Unspecified in arXiv abstract\u003C\u002Fli>\u003Cli>\u003Cstrong>Core data\u003C\u002Fstrong>: 15 supervision events\u003C\u002Fli>\u003Cli>\u003Cstrong>Breakthrough\u003C\u002Fstrong>: Classified agent failures by intervention level during JAX module development\u003C\u002Fli>\u003C\u002Ful>\u003Cp>This paper is useful because it gets specific about a question a lot of teams are now asking in practice: when an \u003Ca href=\"\u002Ftag\u002Fai-coding\">AI coding\u003C\u002Fa> agent writes scientific software, what actually keeps the work trustworthy? The answer here is not “more agent autonomy.” It is a careful supervision setup, plus the right checks, plus a human who understands the domain.\u003C\u002Fp>\u003Cp>The paper is a quantified case study, not a broad \u003Ca href=\"\u002Ftag\u002Fbenchmark\">benchmark\u003C\u002Fa>. That matters. 
Instead of claiming general victory over coding agents, it follows one physicist working with \u003Ca href=\"\u002Ftag\u002Fclaude-code\">Claude Code\u003C\u002Fa> over 12 work days and 57 sessions while building CLAX-PT, a differentiable one-loop perturbation theory module in JAX. The point is not raw model capability in the abstract. The point is how supervision changes the outcome when the code has to match real physics.\u003C\u002Fp>\u003Ch2>What problem this paper is trying to fix\u003C\u002Fh2>\u003Cp>Scientific software is not just about passing unit tests. In physics-heavy code, an implementation can look fine numerically while still being conceptually wrong. That is a familiar failure mode for developers who have worked on simulation code, calibration pipelines, or anything where “works on the test case” is not the same as “represents the system correctly.”\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780034587725-2j18.png\" alt=\"Physicist Supervision Beat a Coding Agent\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>This paper frames the issue around a practical uncertainty: are \u003Ca href=\"\u002Ftag\u002Fai-agents\">AI agents\u003C\u002Fa> acting like tools, co-authors, or researchers? In this case, the answer depends less on the model label and more on the supervision model around it. The author documents what happened when an AI coding agent was used to build a scientific module under physicist supervision, and then classifies the ways supervision had to step in.\u003C\u002Fp>\u003Cp>The key problem is that oracle tests can miss wrong-but-plausible outputs. The agent sometimes optimized within the wrong structure, or produced values that passed tests but did not correspond to any real quantity in the theory. 
That is exactly the kind of bug that can survive longer than it should if a team assumes test passing equals correctness.\u003C\u002Fp>\u003Ch2>How the method works in plain English\u003C\u002Fh2>\u003Cp>The setup is straightforward: a physicist supervised an AI coding agent using Claude Code with Sonnet and Opus models. Over 12 work days and 57 sessions, they built CLAX-PT, a differentiable one-loop perturbation theory module in JAX. The paper then documents 15 supervision events and sorts them by how much human intervention was required.\u003C\u002Fp>\u003Cp>Some issues were resolved by the agent on its own, mostly by iterating against oracle tests. Two more were resolved because the physicist brought in domain knowledge. Three could not be solved by the agent, and all three slipped past the oracle checks. The paper says these failures had a shared pattern: the agent treated symptom reduction as root-cause resolution.\u003C\u002Fp>\u003Cp>That distinction is important for anyone building with coding agents. A model can keep nudging coefficients, patching outputs, or making local fixes without ever noticing that the architecture itself cannot express the target behavior. In this case, the agent spent 33 of the 57 sessions adjusting coefficients inside a code architecture that could not represent the target physics. It also could not revisit its CLASS-PT branch choice even when asked to reconsider, until an injected physics concept — anisotropic BAO damping — triggered a redesign.\u003C\u002Fp>\u003Cp>There is also a very practical detail here: the paper does not present this as a model-only problem. It shows that the supervision design shaped whether the output was trustworthy. 
That is a useful lens for teams already using AI in scientific or engineering workflows, because it shifts attention from “which model?” to “what review and correction loop do we have?”\u003C\u002Fp>\u003Ch2>What the paper actually shows\u003C\u002Fh2>\u003Cp>The paper does not give benchmark numbers in the usual sense. There is no leaderboard score, no accuracy table, and no throughput claim. Instead, it gives a small but concrete operational record: 12 work days, 57 sessions, 15 supervision events, and a breakdown of which problems were handled by the agent versus the human.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780034586123-oau2.png\" alt=\"Physicist Supervision Beat a Coding Agent\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>One of the most important findings is that oracle tests were not enough. The agent could produce a calibrated correction that passed all oracle tests but corresponded to no quantity in the theory. Worse, that correction predicted wrong values for any other cosmology. The author says the fudge factor was caught and replaced within the same session. For practitioners, that is a reminder that a passing test suite can still hide a physically meaningless implementation.\u003C\u002Fp>\u003Cp>The paper also identifies three supervision practices that helped catch what tests missed. First, testing at diverse parameter points beyond the fiducial calibration. Second, using shared changelogs so stalled exploration became visible across sessions. Third, enforcing an explicit rule against unphysical numerical patches. 
Those are not exotic techniques, but they are the kind of process controls that can matter more than model choice when the work is domain-sensitive.\u003C\u002Fp>\u003Cul>\u003Cli>Test beyond the calibration point, not just the happy path.\u003C\u002Fli>\u003Cli>Keep shared changelogs so stalled reasoning is visible across sessions.\u003C\u002Fli>\u003Cli>Ban numerical patches that fit outputs but break the physics.\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>What developers should take away\u003C\u002Fh2>\u003Cp>If you are building with coding agents, the practical lesson is that supervision is part of the system, not an afterthought. In this case, the human did not just approve code; the human supplied domain constraints, caught conceptual errors, and forced redesign when the agent kept optimizing inside a broken structure.\u003C\u002Fp>\u003Cp>That has implications beyond physics. Any workflow where correctness depends on an underlying model — simulations, scientific computing, finance, control systems, even some data pipelines — can suffer from the same “looks right, is wrong” failure mode. Agents that are good at local repair may still be bad at proposing architectural alternatives or recognizing when the current structure cannot represent the target problem.\u003C\u002Fp>\u003Cp>The paper is also careful about what it does not show. It is a single case study, so it cannot prove how all agents behave or how every scientific codebase will go. It does not claim scaling alone solves the issue. In fact, the closing argument is the opposite: closing the gap would require agents that can propose alternative architectures and distinguish predictive adequacy from explanatory correctness, capabilities not shown here.\u003C\u002Fp>\u003Cp>For engineering teams, that means the real question is not whether an agent can write code that passes tests. It is whether your process can detect when the code is merely plausible. 
This paper argues that, at least in this case, the answer came from supervision design more than from model capability.\u003C\u002Fp>\u003Cp>That is the practical takeaway: if your AI-assisted workflow depends on domain truth, you need checks that go beyond a passing test suite. Otherwise, the agent may be very efficient at producing the wrong thing.\u003C\u002Fp>\u003Ch2>Why this matters now\u003C\u002Fh2>\u003Cp>As \u003Ca href=\"\u002Ftag\u002Fai-coding-agents\">AI coding agents\u003C\u002Fa> move into more specialized domains, the failure modes become less about syntax and more about semantics. The paper’s strongest contribution is showing how those failures appear in practice: stalled exploration, overfitting to a calibration point, and corrections that satisfy tests while violating the theory.\u003C\u002Fp>\u003Cp>For developers, that means supervision should be designed around the domain, not just the code. For AI practitioners, it means test coverage is necessary but not sufficient. 
And for teams thinking about agentic workflows in science, the paper is a reminder that “autonomous” is not the same thing as “trustworthy.”\u003C\u002Fp>","A physicist-supervised coding agent built scientific software, but human oversight caught failures tests missed.","arxiv.org","https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.30353",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780034587725-2j18.png","research","en","0d92482f-0aa5-4ca4-97d7-33e72a3cecd5",[17,18,19,20,21],"AI coding agents","scientific software","physics","supervision","JAX",[23,24,25],"A physicist-supervised coding agent built scientific software over 12 days and 57 sessions.","Oracle tests caught some issues, but human domain knowledge caught failures the tests missed.","The paper argues supervision design mattered more than model capability in this case.",3,"2026-05-29T06:02:35.335654+00:00","2026-05-29T06:02:35.325+00:00","3103988e-c4fe-45e3-98ab-846500c9d507",{"tags":31,"relatedLang":40,"relatedPosts":44},[32,34,35,37,38],{"name":21,"slug":33},"jax",{"name":19,"slug":19},{"name":17,"slug":36},"ai-coding-agents",{"name":20,"slug":20},{"name":18,"slug":39},"scientific-software",{"id":15,"slug":41,"title":42,"language":43},"physicist-supervision-ai-scientific-software-zh","物理學家監督下，AI 寫科學程式仍會出錯","zh",[45,51,57,63,69,75],{"id":46,"slug":47,"title":48,"cover_image":49,"image_url":49,"created_at":50,"category":13},"850449f2-e75b-4dbf-97c0-3590c6cbf097","crdts-keep-replicas-in-sync-without-locks-en","CRDTs keep replicas in sync without locks","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781011086602-cokl.png","2026-06-09T13:17:35.890527+00:00",{"id":52,"slug":53,"title":54,"cover_image":55,"image_url":55,"created_at":56,"category":13},"7c6b6428-ba8d-4c59-840b-cf96a95139e5","post-deterministic-systems-autonomous-infra-en","Post-Deterministic 
Systems for Autonomous Infra","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781010190497-1grq.png","2026-06-09T13:02:33.235795+00:00",{"id":58,"slug":59,"title":60,"cover_image":61,"image_url":61,"created_at":62,"category":13},"53ec2203-e127-4bf8-8b3d-2dce8d156a54","causal-learnability-formal-language-tasks-en","Causal methods for measuring task learnability","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780987698514-ky8m.png","2026-06-09T06:47:35.103221+00:00",{"id":64,"slug":65,"title":66,"cover_image":67,"image_url":67,"created_at":68,"category":13},"55e7197e-f114-4b6c-b3e2-af1a3cd9dfa4","rl-training-hands-off-control-gradually-en","RL Training That Hands Off Control Gradually","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780986801034-gf8m.png","2026-06-09T06:32:33.516452+00:00",{"id":70,"slug":71,"title":72,"cover_image":73,"image_url":73,"created_at":74,"category":13},"93fc6735-b524-4baf-989f-645c4c47d593","omnigamearena-vlm-game-agent-benchmark-en","OmniGameArena benchmarks VLM game agents better","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780985895695-ugcj.png","2026-06-09T06:17:32.668876+00:00",{"id":76,"slug":77,"title":78,"cover_image":79,"image_url":79,"created_at":80,"category":13},"9f0c9505-6d75-411c-ba46-2382e8f295a5","turboquant-cuts-kv-cache-memory-6x-google-tests-en","TurboQuant cuts KV cache memory 6x in Google 
tests","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1780906679116-fqdo.png","2026-06-08T08:17:22.276769+00:00",[82,87,92,97,102,107,112,117,122,127],{"id":83,"slug":84,"title":85,"created_at":86},"a2715e72-1fe8-41b3-abb1-d0cf1f710189","ai-predictions-2026-big-changes-en","AI Predictions for 2026: Brace for Big Changes","2026-03-26T01:25:07.788356+00:00",{"id":88,"slug":89,"title":90,"created_at":91},"8404bd7b-4c2f-4109-9ec4-baf29d88af2b","ml-papers-of-the-week-github-research-desk-en","ML Papers of the Week Turns GitHub Into a Research Desk","2026-03-27T01:11:39.480259+00:00",{"id":93,"slug":94,"title":95,"created_at":96},"87897a94-8065-4464-a016-1f23e89e17cc","ai-ml-conferences-to-watch-in-2026-en","AI\u002FML Conferences to Watch in 2026","2026-03-27T01:51:54.184108+00:00",{"id":98,"slug":99,"title":100,"created_at":101},"6f1987cf-25f3-47a4-b3e6-db0997695be8","openclaw-agents-manipulated-self-sabotage-en","OpenClaw Agents Can Be Manipulated Into Failure","2026-03-28T03:03:18.899465+00:00",{"id":103,"slug":104,"title":105,"created_at":106},"a53571ad-735a-4178-9f93-cb09b699d99c","vega-driving-language-instructions-en","Vega: Driving with Natural Language Instructions","2026-03-28T14:54:04.698882+00:00",{"id":108,"slug":109,"title":110,"created_at":111},"a34581d6-f36e-46da-88bb-582fb3e7425c","personalizing-autonomous-driving-styles-en","Drive My Way: Personalizing Autonomous Driving Styles","2026-03-28T14:54:26.148181+00:00",{"id":113,"slug":114,"title":115,"created_at":116},"2bc1ad7f-26ce-4f02-9885-803b35fd229d","training-knowledge-bases-writeback-rag-en","Training Knowledge Bases with WriteBack-RAG","2026-03-28T14:54:45.643433+00:00",{"id":118,"slug":119,"title":120,"created_at":121},"71adc507-3c54-4605-bbe2-c966acd6187e","packforcing-long-video-generation-en","PackForcing: Efficient Long-Video Generation 
Method","2026-03-28T14:55:02.646943+00:00",{"id":123,"slug":124,"title":125,"created_at":126},"675942ef-b9ec-4c5f-a997-381250b6eacb","pixelsmile-facial-expression-editing-en","PixelSmile Framework Enhances Facial Expression Editing","2026-03-28T14:55:20.633463+00:00",{"id":128,"slug":129,"title":130,"created_at":131},"6954fa2b-8b66-4839-884b-e46f89fa1bc3","adaptive-block-scaled-data-types-en","IF4: Smarter 4-Bit Quantization That Adapts to Your Data","2026-03-31T06:00:36.65963+00:00"]