Model-Generated Agent Skills: What Actually Works

OraCore Editors

[RSCH] May 25, 2026 · 8 min read · OraCore Editors

A systematic study shows model-generated agent skills help on average, but can also transfer badly.

Research org: Unspecified in arXiv abstract
Core data: Five agentic task domains
Breakthrough: Utility-grounded evaluation across extraction and consumption stages

Language agents are getting better not just by scaling models, but by reusing skills: structured procedures distilled from past experience. This paper looks at the part people often skip over: whether those skills actually survive the trip from one model to another, and whether the way they are extracted matters as much as the skill itself.

That matters for anyone building agents in production. If a skill library improves one model but hurts another, or works in one domain but not another, then “just reuse the skill” is not a safe assumption. The paper’s main value is that it treats skills like a real software dependency: something that needs testing, profiling, and a clear understanding of where it breaks.

What problem this paper is trying to fix

The paper starts from a practical gap. Domain-level, model-generated skills are attractive because they promise fast adaptation without hand-crafting procedures for every task. But the field has focused more on inventing extraction methods than on understanding the full lifecycle of a skill: how experience is generated, how a skill is extracted from that experience, and how a target agent consumes it.

That missing lifecycle view is the core issue here. The authors say there has been no comprehensive study spanning all three stages, which makes it hard to answer basic questions: Do these skills actually work? When do they work? What makes them succeed or fail? For engineers, that’s the difference between a reusable automation primitive and a brittle prompt artifact.

The paper also frames the problem in terms of transfer. A skill is only useful if it helps the consumer model do better on the target task. If the extractor and consumer behave differently, a skill can look good in isolation and still fail in deployment. The authors explicitly call out negative transfer as a real risk, not a corner case.

How the method works in plain English

To study this systematically, the authors build a utility-grounded evaluation framework. In plain English, that means they judge skills by whether they actually help downstream agents, rather than by whether the extracted skill text looks polished or plausible.
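As a rough illustration of what "utility-grounded" could mean in code, the sketch below scores a skill by the change in a consumer agent's success rate with and without it. The function name, the 0/1 outcome format, and the numbers are all assumptions made for this example; the abstract does not describe the paper's actual metric or data layout.

```python
from statistics import mean


def skill_utility(with_skill, without_skill):
    """Return the change in success rate when a skill is supplied.

    Both arguments are lists of 0/1 task outcomes for the same consumer
    agent on the same tasks. This layout is hypothetical, not the paper's.
    """
    if not with_skill or not without_skill:
        raise ValueError("need at least one outcome per condition")
    return mean(with_skill) - mean(without_skill)


# Placeholder outcomes invented for illustration; not results from the paper.
baseline_outcomes = [1, 0, 0, 1, 0]
skill_outcomes = [1, 1, 0, 1, 1]

delta = skill_utility(skill_outcomes, baseline_outcomes)
print(f"utility delta: {delta:+.2f}")
```

A positive delta means the skill helped this consumer on these tasks; a negative delta is the negative transfer the article describes. Note that nothing in this score looks at the skill text itself, which is the point of grounding evaluation in utility.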

The framework covers extractors and target agents across five diverse agentic task domains. That gives the study a broader view than a single benchmark or one model family. Instead of asking whether one extractor is “best,” the paper compares how different models behave when they create skills and when they consume them.

The lifecycle framing matters. The authors examine experience generation first, then skill extraction, then skill consumption. They also dig into what the experience contains, what properties useful skills tend to have, and how the same skill transfers across different consumers. That lets them separate “good at writing skills” from “good at using skills,” which are not the same capability.

One of the more useful parts of the setup is that it tries to explain failure instead of just reporting success rates. The paper is not only asking whether a skill helps; it is asking why it helps, why it fails, and whether those patterns are tied to the source experience or the consuming model.

What the paper actually shows

The headline result is straightforward: model-generated skills are beneficial on average, but they also show non-trivial negative transfer. In other words, the average effect is positive, but there are enough bad transfers that you cannot treat skill reuse as universally safe.

The authors also find that extractors and targets do not behave uniformly. A model can be a strong extractor yet a weak consumer, or the reverse. That means skill quality is not just a property of the skill itself; it depends on who wrote it and who is reading it.
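One way to see the extractor/consumer asymmetry in your own experiments is to keep utility deltas in a matrix keyed by (extractor, consumer) pair and summarize it along each axis. The sketch below is a minimal version under stated assumptions: the model names and delta values are placeholders invented for the example, and `summarize` is a hypothetical helper rather than anything from the paper.

```python
from collections import defaultdict

# utility[(extractor, consumer)] = change in success rate.
# Placeholder values for illustration only; not results from the paper.
utility = {
    ("model_a", "model_a"): 0.10,
    ("model_a", "model_b"): -0.05,
    ("model_b", "model_a"): 0.02,
    ("model_b", "model_b"): 0.08,
}


def summarize(utility):
    """Summarize a utility matrix overall and along each axis."""
    by_extractor = defaultdict(list)
    by_consumer = defaultdict(list)
    for (extractor, consumer), delta in utility.items():
        by_extractor[extractor].append(delta)
        by_consumer[consumer].append(delta)

    # Pairs where supplying the skill made the consumer worse.
    negative_pairs = [pair for pair, delta in utility.items() if delta < 0]
    return {
        "mean_utility": sum(utility.values()) / len(utility),
        "negative_transfer_rate": len(negative_pairs) / len(utility),
        "negative_pairs": negative_pairs,
        "extractor_mean": {k: sum(v) / len(v) for k, v in by_extractor.items()},
        "consumer_mean": {k: sum(v) / len(v) for k, v in by_consumer.items()},
    }


report = summarize(utility)
print(report)
```

With these placeholder numbers the mean utility is positive while one of the four pairs still regresses, which is the shape of result the article describes: a good average can hide bad individual transfers.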

Another important finding is that skill utility is independent of model scale or baseline task strength. So bigger models are not automatically better extractors or consumers, and a model that already performs well on the base task is not necessarily the best at handling skills. That is a useful warning for teams that assume capability transfers cleanly across the stack.

The paper then uses its deeper analysis to connect experience composition with skill quality. The authors say they dissect how experience is assembled, what features characterize useful skills, and how skills transfer across consumers. The abstract does not give benchmark scores or exact numeric deltas, so there are no published performance numbers to quote here.

What the paper does provide is a concrete follow-up: a meta-skill that guides extraction toward features tied to actual utility. According to the abstract, this meta-skill consistently improves skill quality across domains and substantially reduces negative transfer. That is the most actionable result in the paper, because it turns the analysis into a method rather than leaving it as diagnosis.

Why developers should care

If you are building agent systems, this paper argues for treating skills as a managed asset, not a free lunch. A skill extracted from one model or one environment may not be equally helpful elsewhere, even if it looks domain-relevant on paper.

That has direct implications for agent pipelines. You need evaluation that measures downstream utility, not just extraction quality. You also need to expect asymmetry: the model that is best at producing a skill may not be the model that benefits most from using it.

The paper’s findings also suggest a practical design pattern. Instead of extracting every plausible procedure, focus on the features linked to actual utility. The meta-skill idea points toward a more selective extraction process, where the system learns what kinds of experiences are worth turning into reusable procedures.

There are still open questions. The abstract does not specify the exact domains, the exact extractor and consumer models, or benchmark numbers. It also does not tell us how far the meta-skill generalizes beyond the five studied domains. So the right takeaway is not “problem solved,” but “we now have evidence that skill reuse needs stronger filtering and evaluation.”

What this means in practice

For teams shipping agents, the main lesson is to test skills the way you would test any other reusable component: across consumers, across domains, and with failure cases included. A skill that helps one model may quietly hurt another, and the paper shows that this kind of negative transfer is real.
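Testing a skill like any other reusable component could look something like a regression gate: measure its utility delta for every consumer and domain you care about, and refuse to ship it where it regresses. The sketch below assumes you already have those deltas; `admit_skill`, the `tolerance` parameter, and the consumer and domain names are hypothetical and not taken from the paper.

```python
def admit_skill(deltas, tolerance=0.0):
    """Decide whether a skill is safe to add to a shared library.

    deltas maps (consumer, domain) -> utility delta (success-rate change).
    Returns (admitted, regressions), where regressions lists every cell
    whose delta falls below -tolerance.
    """
    if not deltas:
        raise ValueError("no measurements: refuse to admit an untested skill")
    regressions = {cell: d for cell, d in deltas.items() if d < -tolerance}
    return (not regressions, regressions)


# Placeholder measurements invented for illustration only.
measured = {
    ("consumer_x", "web"): 0.06,
    ("consumer_x", "coding"): 0.01,
    ("consumer_y", "web"): -0.04,
    ("consumer_y", "coding"): 0.03,
}

admitted, regressions = admit_skill(measured, tolerance=0.02)
print(admitted, regressions)
```

A stricter or looser policy is a design choice. For instance, instead of rejecting the skill outright you might enable it only for the consumers and domains where it did not regress.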

For researchers, the paper is a useful reminder that extraction quality and consumption quality are separate axes. If your method looks strong in one stage, that does not prove it is strong across the lifecycle. The utility-grounded framework gives a cleaner way to evaluate that distinction.

And for anyone trying to build a skill library, the meta-skill result is the most practical hint here: optimize extraction toward utility-linked features, not just toward surface-level completeness. That is a narrower target, but it is probably closer to what makes reusable agent skills worth keeping.

Model-generated skills help on average, but negative transfer is a real deployment risk.
Extractor quality and consumer quality are not the same thing.
A utility-grounded evaluation and meta-skill can improve skill reuse across domains.