Why Prompt Engineering Isn’t Engineering

Prompt design is mostly heuristic, not formal engineering. The evidence shows weak standards, shaky testing, and a lot of guesswork.

Microsoft says prompt design is “more of an art than a science,” and that line matters. If a practice depends on taste, iteration, and a lot of trial and error, calling it engineering gets messy fast.

This matters because the stakes are getting higher. Teams are already using prompts to ship products, automate support, and make decisions, yet the field still lacks the kind of methods, tests, and standards engineers usually rely on.

What engineering actually means

Before anyone can argue about prompt engineering, it helps to define engineering in the boring, useful way: a discipline grounded in science, formal methods, repeatable design, and accountability. Bodies like ABET, IEEE, and the National Society of Professional Engineers all point to the same core ideas. Engineering is about building systems under constraints, with measurable outcomes and real consequences.

That legal and ethical weight is not abstract. In Germany, misusing the title “engineer” can lead to prison time. In Canada, the penalties can reach $25,000. In many US states, “Professional Engineer” is a protected title tied to licensure and exams. The title exists because bad engineering can hurt people.

Prompt work does not sit in that same bucket yet. It looks more like a mix of writing, experimentation, and product tuning. That does not make it useless. It just means the label matters.

  • Engineering relies on formal methods, testable claims, and repeatable results.
  • Prompt guides usually say things like “be specific” or “provide context.”
  • Microsoft’s own docs describe prompt design as “more of an art than a science.”
  • Protected engineering titles carry legal consequences in several countries.

There is also a historical pattern worth remembering. Software engineering was named at the NATO conference in 1968, and even then participants admitted the term described a need more than a reality. Civil engineering took centuries to formalize. New fields can earn the title over time, but they do that by building methods, standards, and testing culture.

That is the real question for prompt engineering: is it on that path, or is it just a convenient label for a set of heuristics? Right now, the evidence points to heuristics.

What the vendor guides actually say

The easiest place to check for engineering-grade guidance is the vendors themselves. If OpenAI, Anthropic, Google, and Microsoft Azure AI are publishing prompt advice, you would expect clear criteria, measurable thresholds, and testable procedures. Instead, most of the advice is qualitative.

Across roughly 25 recommendations in those guides, only about four include anything close to quantifiable criteria. The rest are variations on the same themes: be specific, give context, use examples, and iterate. That is useful advice for beginners, but it is not a method.

“Prompt engineering is more of an art than a science.” — Microsoft, Azure OpenAI Service documentation

That quote is honest, and it also gives the game away. Engineering needs more than advice that sounds right. It needs definitions that hold up under scrutiny. What counts as “specific enough”? How much context is enough? What is the acceptance test? What failure rate is acceptable? The guides rarely answer those questions.
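
To make that concrete, here is a minimal sketch of what a quantified acceptance criterion could look like. The numbers and the `accept` helper are invented for illustration; the point is the shape: named, testable thresholds instead of "be specific."

```python
# A hypothetical, quantified acceptance criterion for a prompt. Every value
# here is illustrative, not a standard.
ACCEPTANCE = {
    "trials": 100,          # sampled runs per golden case
    "min_pass_rate": 0.95,  # fraction of runs that must satisfy the rubric
    "max_latency_s": 2.0,   # per-response latency budget
}

def accept(pass_count: int, trials: int) -> bool:
    """True if the observed pass rate clears the stated bar."""
    return trials > 0 and pass_count / trials >= ACCEPTANCE["min_pass_rate"]

print(accept(pass_count=97, trials=100))  # True, since 0.97 >= 0.95
```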

The advice pipeline also gets fuzzier the farther it travels from research. Academic papers may include methods, sample sizes, and measured outcomes. Vendor docs compress that into developer-friendly guidance. Blog posts and social media then compress it again into “top tips.” By the time most people see it, the method has been stripped out.

  • OpenAI, Anthropic, Google, and Microsoft all publish prompt guidance.
  • Most recommendations are qualitative rather than measurable.
  • Microsoft explicitly calls prompt design “more of an art than a science.”
  • Natural language requirements remain ambiguous without formal syntax.

That last point matters because engineering already solved this problem in other places. The Internet Engineering Task Force created RFC 2119 to define words like MUST, SHOULD, and MAY. The whole point was to reduce ambiguity in requirements. If prompt work were a mature engineering discipline, you would expect it to adopt tools like that more widely.
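
As a sketch of what borrowing that tool could look like, here is a prompt spec written with RFC 2119 keywords, plus a trivial lint that rejects any requirement with no defined keyword. The requirement strings are invented; only the keywords (a subset of RFC 2119's) are real.

```python
# RFC 2119-style requirement language applied to a hypothetical prompt spec.
RFC2119_KEYWORDS = {"MUST", "MUST NOT", "SHOULD", "SHOULD NOT", "MAY"}

PROMPT_REQUIREMENTS = [
    "The response MUST be valid JSON matching the published schema.",
    "The response MUST NOT include personally identifying information.",
    "The response SHOULD cite a source for every factual claim.",
    "The response MAY include a confidence estimate.",
]

def is_unambiguous(requirement: str) -> bool:
    """A trivial lint: every requirement needs a defined keyword."""
    return any(keyword in requirement for keyword in RFC2119_KEYWORDS)

for req in PROMPT_REQUIREMENTS:
    assert is_unambiguous(req), f"No RFC 2119 keyword in: {req}"
```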

The testing problem nobody can hand-wave away

Engineering without testing is just opinion with a spreadsheet. In software, we have spent decades building test-driven development, continuous integration, coverage analysis, fuzzing, and property-based testing. Those systems work because the software is usually deterministic: same input, same output.
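
That determinism assumption is easy to see in code. A conventional unit test like this toy example settles the question in a single run, because the function cannot answer differently tomorrow:

```python
# Deterministic software: same input, same output, so one test run is decisive.
def slugify(title: str) -> str:
    return title.lower().replace(" ", "-")

# This assertion either always passes or always fails; no repeated trials,
# no confidence intervals.
assert slugify("Prompt Engineering") == "prompt-engineering"
```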

Large language models are different. The same prompt can produce different answers on different runs. That means prompt testing has to be statistical, not just anecdotal. You need golden datasets, repeated trials, confidence intervals, and regression baselines. That is a much harder problem, and it requires more statistical literacy than most prompt guides admit.
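
Here is a minimal sketch of what that looks like in practice, assuming a hypothetical `call_model` client and a toy golden set: run each case repeatedly, estimate the pass rate with a Wilson score interval, and gate on the lower bound rather than a single lucky run.

```python
import math
import random

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a pass rate estimated from repeated trials."""
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    margin = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return (max(0.0, center - margin), min(1.0, center + margin))

def call_model(prompt: str, case_input: str) -> str:
    """Stand-in for a real model client; nondeterministic on purpose."""
    return case_input.upper() if random.random() < 0.9 else "garbage"

golden_set = [
    {"input": "hello", "expected": "HELLO"},
    {"input": "world", "expected": "WORLD"},
]

successes = trials = 0
for case in golden_set:
    for _ in range(50):  # repeated trials: the same prompt can answer differently
        output = call_model("Uppercase the input.", case["input"])
        successes += int(output == case["expected"])
        trials += 1

low, high = wilson_interval(successes, trials)
print(f"pass rate in [{low:.2f}, {high:.2f}] over {trials} trials")

# Regression gate: promote the new prompt only if the LOWER bound of the
# interval clears the baseline, not because one run looked good.
BASELINE_PASS_RATE = 0.85
print("ship" if low >= BASELINE_PASS_RATE else "hold")
```

Gating on the lower bound is the conservative choice: it forces either more trials or a genuinely better prompt before anything ships.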

There are tools trying to fill the gap. Promptfoo, Helicone, LangSmith, and DeepEval all help teams evaluate prompts and model outputs. That is progress, but it is still early. The tooling is uneven, and the methods are not yet standardized across the industry.

Regulated industries are already warning about the gap. The FAA has said that rigorous safety assurance methods must be developed for AI systems in aviation. The Federal Reserve’s SR 11-7 guidance says adaptive AI models may lose effectiveness over time. Those are not academic footnotes. They are institutions that deal with systems where failure has a price tag.

  • Traditional software testing assumes deterministic behavior.
  • LLM evaluation needs repeated trials and statistical thresholds.
  • Promptfoo, Helicone, LangSmith, and DeepEval are still maturing.
  • The FAA and Federal Reserve both warn that current assurance methods are incomplete.

There is also a lifecycle problem. Writing the prompt is treated like the finish line, when it is really the start of maintenance. Prompts drift. Models change. Vendor updates alter behavior. If the prompt is a production artifact, it needs versioning, regression tests, and retirement plans. That is normal engineering thinking, and it is still missing from a lot of prompt work.
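
A minimal sketch of that lifecycle thinking, with invented field names: every edit to the prompt text or the model pin produces a new version ID, and promotion is gated on a regression result.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class PromptVersion:
    text: str
    model: str                 # pin the model: vendor updates alter behavior
    status: str = "candidate"  # lifecycle: candidate -> active -> retired

    @property
    def version_id(self) -> str:
        """Content hash: any edit to the text or the model pin is a new version."""
        return hashlib.sha256(f"{self.model}:{self.text}".encode()).hexdigest()[:12]

registry: dict[str, PromptVersion] = {}

def promote(candidate: PromptVersion, regression_passed: bool) -> None:
    """Gate promotion on the regression suite, and retire the old active prompt."""
    if not regression_passed:
        raise ValueError(f"{candidate.version_id}: failed regression, not promoting")
    for version in registry.values():
        if version.status == "active":
            version.status = "retired"
    candidate.status = "active"
    registry[candidate.version_id] = candidate

v2 = PromptVersion("Summarize the ticket in two sentences.", model="example-model-2024-06")
promote(v2, regression_passed=True)
print(v2.version_id, v2.status)  # prints the content hash and "active"
```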

Some prompt advice is actively bad

The strongest reason to stop calling this engineering is that some of the most popular advice fails in controlled studies. Research from Wharton’s Generative AI Lab presented at EMNLP 2024 found that expert persona prompting can reduce factual accuracy. The same research found that chain-of-thought prompting can hurt performance on reasoning models. Those are not edge cases. They are techniques repeated in almost every prompt guide online.

That should make people uncomfortable. If a widely recommended technique makes output worse in documented tests, then the field has a quality-control problem. Good engineering disciplines do not keep bad practices in circulation for years because they sound smart.

There is a deeper lesson here too. Prompt effectiveness depends heavily on the model, the task, and the evaluation method. A trick that helps one system may harm another. That makes the whole practice feel less like engineering and more like tuning a musical instrument by ear while the instrument keeps changing shape.

For teams building with AI, the practical takeaway is simple: treat prompts as experiments, not doctrine. Measure output quality. Keep version history. Test against real cases. Retire techniques that do not survive evaluation, even if they are popular on social media.
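
That experiment mindset can be as simple as an A/B check on the same golden set; `score_on_golden_set` here is a toy stand-in for a real evaluation harness, and both prompts are invented for illustration.

```python
import random

def score_on_golden_set(prompt: str, trials: int = 200) -> float:
    """Toy stand-in for a real harness: run the prompt over golden cases
    repeatedly and return the measured pass rate."""
    random.seed(len(prompt))  # deterministic toy scores so the demo is reproducible
    return sum(random.random() < 0.8 for _ in range(trials)) / trials

BASE = "Answer the question and cite a source."
PERSONA = "You are a world-class expert. " + BASE  # the popular technique on trial

base_score = score_on_golden_set(BASE)
persona_score = score_on_golden_set(PERSONA)

# The technique keeps its place only if it wins on measured output.
winner = PERSONA if persona_score > base_score else BASE
print(f"base={base_score:.2f} persona={persona_score:.2f} keeping={winner!r}")
```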

What to do instead

Prompt work is real work. It can improve systems, reduce error, and make models more useful. But the word “engineering” should be reserved for practices with formal methods, repeatable testing, and standards that outlive one product cycle.

My prediction is simple: the teams that win with AI will stop treating prompts like magic phrases and start treating them like testable artifacts. The ones that keep relying on vibes will keep shipping inconsistent systems. If you are building with LLMs today, ask one question before you call the work engineering: what is the acceptance test?

That question will tell you a lot.