[RSCH] 8 min read | OraCore Editors

Physicist Supervision Beat a Coding Agent

A physicist-supervised coding agent built scientific software, but human oversight caught failures tests missed.

  • Research org: Unspecified in arXiv abstract
  • Core data: 15 supervision events
  • Breakthrough: Classified agent failures by intervention level during JAX module development

This paper is useful because it gets specific about a question a lot of teams are now asking in practice: when an AI coding agent writes scientific software, what actually keeps the work trustworthy? The answer here is not “more agent autonomy.” It is a careful supervision setup, plus the right checks, plus a human who understands the domain.

The paper is a quantified case study, not a broad benchmark. That matters. Instead of claiming general victory over coding agents, it follows one physicist working with Claude Code over 12 work days and 57 sessions while building CLAX-PT, a differentiable one-loop perturbation theory module in JAX. The point is not raw model capability in the abstract. The point is how supervision changes the outcome when the code has to match real physics.

What problem this paper is trying to fix

Scientific software is not just about passing unit tests. In physics-heavy code, an implementation can look fine numerically while still being conceptually wrong. That is a familiar failure mode for developers who have worked on simulation code, calibration pipelines, or anything where “works on the test case” is not the same as “represents the system correctly.”

This paper frames the issue around a practical uncertainty: are AI agents acting like tools, co-authors, or researchers? In this case, the answer depends less on the model label and more on the supervision model around it. The author documents what happened when an AI coding agent was used to build a scientific module under physicist supervision, and then classifies the ways supervision had to step in.

The key problem is that oracle tests can miss wrong-but-plausible outputs. The agent sometimes optimized within the wrong structure, or produced values that passed tests but did not correspond to any real quantity in the theory. That is exactly the kind of bug that can survive longer than it should if a team assumes test passing equals correctness.

How the method works in plain English

The setup is straightforward: a physicist supervised an AI coding agent using Claude Code with Sonnet and Opus models. Over 12 work days and 57 sessions, they built CLAX-PT, a differentiable one-loop perturbation theory module in JAX. The paper then documents 15 supervision events and sorts them by how much human intervention was required.

Some issues were resolved by the agent on its own, mostly by iterating against oracle tests. Two more were resolved because the physicist brought in domain knowledge. Three could not be solved by the agent, and all three slipped past the oracle checks. The paper says these failures had a shared pattern: the agent treated symptom reduction as root-cause resolution.

That distinction is important for anyone building with coding agents. A model can keep nudging coefficients, patching outputs, or making local fixes without ever noticing that the architecture itself cannot express the target behavior. In this case, the agent spent 33 of the 57 sessions (about 58%) adjusting coefficients inside a code architecture that could not represent the target physics. It also did not revisit its CLASS-PT branch choice, even when asked to reconsider, until the physicist injected a physics concept, anisotropic BAO damping, which triggered a redesign.
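The "wrong architecture" problem can be illustrated with a toy analogy. This is not the paper's physics or code: the quadratic target, the single-coefficient models, and every function name below are assumptions made for illustration. The sketch fits one coefficient by least squares and shows that a model family missing the right term hits an error floor that no amount of coefficient tuning removes.

```python
# Toy analogy only: the target is quadratic, but one "architecture"
# offers nothing except a linear term to tune.

def fit_single_coefficient(basis, xs, ys):
    """Closed-form least-squares fit of y ~ a * basis(x)."""
    num = sum(basis(x) * y for x, y in zip(xs, ys))
    den = sum(basis(x) ** 2 for x in xs)
    return num / den

def rms_residual(basis, a, xs, ys):
    """Root-mean-square misfit of the model a * basis(x)."""
    total = sum((a * basis(x) - y) ** 2 for x, y in zip(xs, ys))
    return (total / len(xs)) ** 0.5

def linear(x):
    return x

def quadratic(x):
    return x ** 2

xs = [1.0, 2.0, 3.0, 4.0]
ys = [x ** 2 for x in xs]  # hypothetical target behaviour

# Best possible coefficient inside each architecture.
a_lin = fit_single_coefficient(linear, xs, ys)
a_quad = fit_single_coefficient(quadratic, xs, ys)

# The linear family keeps a nonzero error floor even at its optimum;
# the quadratic family can represent the target exactly.
linear_floor = rms_residual(linear, a_lin, xs, ys)
quadratic_floor = rms_residual(quadratic, a_quad, xs, ys)

print(f"linear floor: {linear_floor:.3f}, quadratic floor: {quadratic_floor:.3f}")
```

In this toy, every further tweak to the linear coefficient can only move away from the floor, never below it. Progress requires changing the model family, which loosely mirrors the redesign the article describes.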

There is also a very practical detail here: the paper does not present this as a model-only problem. It shows that the supervision design shaped whether the output was trustworthy. That is a useful lens for teams already using AI in scientific or engineering workflows, because it shifts attention from “which model?” to “what review and correction loop do we have?”

What the paper actually shows

The paper does not give benchmark numbers in the usual sense. There is no leaderboard score, no accuracy table, and no throughput claim. Instead, it gives a small but concrete operational record: 12 work days, 57 sessions, 15 supervision events, and a breakdown of which problems were handled by the agent versus the human.

One of the most important findings is that oracle tests were not enough. The agent could produce a calibrated correction that passed all oracle tests but corresponded to no quantity in the theory. Worse, that correction predicted wrong values for any other cosmology. The author says the fudge factor was caught and replaced within the same session. For practitioners, that is a reminder that a passing test suite can still hide a physically meaningless implementation.
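A minimal sketch shows how a calibrated correction can pass an oracle test and still be wrong everywhere else. Everything here is hypothetical: `true_theory` stands in for a reference calculation, `wrong_model` for a structurally incorrect implementation, and `omega` for a single cosmological parameter. None of it is taken from the paper's code.

```python
# Toy illustration of a fudge factor tuned at one fiducial point.

def true_theory(omega):
    """Stand-in for the reference (oracle) calculation."""
    return omega ** 2

def wrong_model(omega):
    """Stand-in for an implementation with the wrong functional form."""
    return omega

FIDUCIAL = 2.0

# The "calibrated correction": chosen so the wrong model agrees with the
# oracle at the fiducial point. It corresponds to nothing in the theory.
fudge = true_theory(FIDUCIAL) / wrong_model(FIDUCIAL)

def patched_model(omega):
    return fudge * wrong_model(omega)

def relative_error(omega):
    reference = true_theory(omega)
    return abs(patched_model(omega) - reference) / abs(reference)

def failing_points(points, tol=1e-3):
    """Return the parameter points where the patched model misses the oracle."""
    return [p for p in points if relative_error(p) > tol]

# Checking only the calibration point reports no failures...
oracle_only = failing_points([FIDUCIAL])

# ...while checking other parameter values exposes the patch.
diverse = failing_points([1.0, FIDUCIAL, 3.0, 4.0])

print("fiducial-only failures:", oracle_only)
print("diverse-point failures:", diverse)
```

The patched model is exact at the one point it was tuned on and wrong at every other point checked, which is why a test suite built only around the fiducial cosmology cannot distinguish it from a correct implementation.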

The paper also identifies three supervision practices that helped catch what tests missed. The first is testing at diverse parameter points beyond the fiducial calibration. The second is keeping shared changelogs, so that stalled exploration becomes visible across sessions. The third is enforcing an explicit rule against unphysical numerical patches. These are not exotic techniques, but they are the kind of process controls that can matter more than model choice when the work is domain-sensitive.

  • Test beyond the calibration point, not just the happy path.
  • Keep shared changelogs so stalled reasoning is visible across sessions.
  • Ban numerical patches that fit outputs but break the physics.
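The changelog practice can also be sketched in code. The article does not describe the format of the paper's changelogs, so the record fields, the `change_kind` labels, and the thresholds below are all assumptions. The idea is simply that a structured per-session log lets a reviewer, or a script, flag long runs of the same kind of change that barely move the error.

```python
from dataclasses import dataclass

@dataclass
class SessionEntry:
    session: int
    change_kind: str      # hypothetical label, e.g. "coefficient-tweak"
    max_rel_error: float  # worst misfit against the reference after the session

def stalled_runs(log, min_sessions=3, min_gain=0.10):
    """Return (first, last) session numbers for runs where the same kind of
    change was repeated at least `min_sessions` times while the worst error
    improved by less than `min_gain` (as a fraction) over the whole run."""
    runs, start = [], 0
    for i in range(1, len(log) + 1):
        run_ended = i == len(log) or log[i].change_kind != log[start].change_kind
        if not run_ended:
            continue
        first, last = log[start], log[i - 1]
        if first.max_rel_error > 0:
            gain = (first.max_rel_error - last.max_rel_error) / first.max_rel_error
        else:
            gain = 1.0  # nothing left to gain, so do not flag the run
        if i - start >= min_sessions and gain < min_gain:
            runs.append((first.session, last.session))
        start = i
    return runs

# Hypothetical log: four sessions of coefficient tweaks, then a redesign.
log = [
    SessionEntry(1, "coefficient-tweak", 0.20),
    SessionEntry(2, "coefficient-tweak", 0.19),
    SessionEntry(3, "coefficient-tweak", 0.195),
    SessionEntry(4, "coefficient-tweak", 0.19),
    SessionEntry(5, "redesign", 0.02),
    SessionEntry(6, "coefficient-tweak", 0.01),
]

stalled = stalled_runs(log)
print("stalled runs:", stalled)
```

A flag like this decides nothing on its own. It only makes a stall visible early enough for a human with domain knowledge to ask whether the current structure can represent the target at all.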

What developers should take away

If you are building with coding agents, the practical lesson is that supervision is part of the system, not an afterthought. In this case, the human did not just approve code; the human supplied domain constraints, caught conceptual errors, and forced redesign when the agent kept optimizing inside a broken structure.

That has implications beyond physics. Any workflow where correctness depends on an underlying model — simulations, scientific computing, finance, control systems, even some data pipelines — can suffer from the same “looks right, is wrong” failure mode. Agents that are good at local repair may still be bad at proposing architectural alternatives or recognizing when the current structure cannot represent the target problem.

The paper is also careful about what it does not show. It is a single case study, so it cannot prove how all agents behave or how every scientific codebase will fare. It does not claim that scaling alone solves the issue. In fact, the closing argument is the opposite: closing the gap would require agents that can propose alternative architectures and distinguish predictive adequacy from explanatory correctness, and neither capability was demonstrated here.

For engineering teams, that means the real question is not whether an agent can write code that passes tests. It is whether your process can detect when the code is merely plausible. This paper argues that, at least in this case, the answer came from supervision design more than from model capability.

That is the practical takeaway: if your AI-assisted workflow depends on domain truth, you need checks that go beyond local correctness. Otherwise, the agent may be very efficient at producing the wrong thing.

Why this matters now

As AI coding agents move into more specialized domains, the failure modes become less about syntax and more about semantics. The paper’s strongest contribution is showing how those failures appear in the wild: stalled exploration, overfitting to a calibration point, and corrections that satisfy tests while violating the theory.

For developers, that means supervision should be designed around the domain, not just the code. For AI practitioners, it means test coverage is necessary but not sufficient. And for teams thinking about agentic workflows in science, the paper is a reminder that “autonomous” is not the same thing as “trustworthy.”