5 shifts in LLMs from the last six months

OraCore Editors

May 19, 2026 · 5 min read


5 shifts explain why LLMs changed fast over six months: better coding agents, stronger open models, and new local workflows.


Five shifts explain how LLMs changed fast over six months.

In one six-month stretch, the “best model” title passed through five different releases, and coding agents crossed a real usability threshold.

1. Coding agents got good enough for daily work


The biggest change was not a single model release but a quality jump in agentic coding. By November 2025, tools like Codex and Claude Code were benefiting from reinforcement learning against verifiable rewards, meaning outcomes that can be checked automatically, such as whether a test suite passes. The result was obvious in practice: agents stopped feeling like demos and started feeling like helpers.

That shift mattered because it changed the default workflow. Instead of spending most of your time correcting broken output, you could hand off real tasks and expect usable code back. The bar was not perfection, but “mostly works” was enough to make these tools part of everyday development.

Before: often works, but with lots of cleanup
After: mostly works, good enough for real tasks
Best fit: coding, refactors, test writing, small feature work
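
To make "verifiable rewards" concrete, here is a minimal sketch of the general idea: a reward that is computed by running candidate code against known test cases rather than by asking a human or another model for an opinion. It is an illustration only, not a description of how any provider trains its models. The function name `verifiable_reward` and the toy `add` task are hypothetical.

```python
def verifiable_reward(candidate_source, func_name, cases):
    """Return the fraction of test cases that the candidate code passes."""
    namespace = {}
    try:
        # Assumption: trusted toy input. A real system would sandbox this step.
        exec(candidate_source, namespace)
        func = namespace[func_name]
    except Exception:
        return 0.0  # code that does not load, or lacks the function, earns nothing

    passed = 0
    for args, expected in cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash counts as a failed case
    return passed / len(cases) if cases else 0.0


# Hypothetical task: implement add(a, b).
cases = [((2, 3), 5), ((0, 0), 0), ((-1, 1), 0)]
good = "def add(a, b):\n    return a + b\n"
buggy = "def add(a, b):\n    return a - b\n"

print(verifiable_reward(good, "add", cases))   # 1.0
print(verifiable_reward(buggy, "add", cases))  # lower, since most cases fail
```

The point of the sketch is that the score needs no judgment call: the code either passes the checks or it does not, which is what makes the signal usable at scale.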

2. The “best model” title kept changing

Another notable pattern was how often the top spot moved among the major providers. Over a few months, the crown passed from Claude Sonnet 4.5 to GPT-5.1, then Gemini 3, then GPT-5.1 Codex Max, and then back to Anthropic with Claude Opus 4.5.

That churn says less about a single winner and more about how tight the race has become. For practical users, the lesson is to test on your own tasks rather than trust a static ranking. A model that is best for code generation may not be best for image prompts, agent planning, or long-running workflows.

Claude Sonnet 4.5: early leader in the period
GPT-5.1 and GPT-5.1 Codex Max: strong mid-period contenders
Gemini 3: especially strong on Simon Willison’s pelican test
Claude Opus 4.5: regained the lead for many practitioners
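
Testing on your own tasks can be as simple as a pass-rate table. The sketch below assumes each model is wrapped as a plain function that takes a prompt and returns text; the two stand-in models here are toy functions, not real API clients, and every name (`score_models`, `model_a`, `model_b`) is hypothetical.

```python
def score_models(models, tasks):
    """Return each model's pass rate over a list of (prompt, check) pairs."""
    if not tasks:
        raise ValueError("need at least one task to score against")
    scores = {}
    for name, model in models.items():
        passed = sum(1 for prompt, check in tasks if check(model(prompt)))
        scores[name] = passed / len(tasks)
    return scores


# Stand-in "models": replace these with wrappers around whatever you actually use.
def model_a(prompt):
    return prompt.upper()


def model_b(prompt):
    return prompt[::-1]


# Each task pairs a prompt with a check that decides whether the answer is acceptable.
tasks = [
    ("abc", lambda out: out == "ABC"),
    ("hi", lambda out: out == "HI"),
    ("level", lambda out: out == "level"),
]

scores = score_models({"model_a": model_a, "model_b": model_b}, tasks)
best = max(scores, key=scores.get)
print(best)  # model_a
```

Because the task list is yours, the ranking reflects your workload rather than a public leaderboard, and it is cheap to rerun whenever a new release ships.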

3. Local and open-weight models got far more capable

The open-model side moved fast too. Models such as Gemma 4, GLM-5.1, and Qwen3.6-35B-A3B showed that local or self-hosted options could do far more than people expected, even if they still lagged the frontier models on the hardest tasks.

What changed is the quality-to-size ratio. A 20.9GB model running on a laptop could exceed older expectations of what “local” meant, while a huge 1.5TB model could produce striking results if you had the hardware. The open-weight story is no longer about compromise alone; it is about choosing the right tradeoff for your setup.

Gemma 4: strongest open-weight release the author had seen from a US company
GLM-5.1: very large, hardware-hungry, but powerful
Qwen3.6-35B-A3B: laptop-friendly relative to its capability
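
The hardware tradeoff can be sketched as a simple fit check. The two sizes below are the ones quoted above; the generic labels, the `fits` helper, and the 1.2x headroom factor for runtime overhead are assumptions made for illustration, not measured requirements for any specific model.

```python
from dataclasses import dataclass


@dataclass
class LocalModel:
    label: str
    size_gb: float  # size of the model weights


def fits(model, memory_gb, headroom=1.2):
    """True if the weights plus an assumed safety margin fit in memory."""
    return model.size_gb * headroom <= memory_gb


def runnable(models, memory_gb):
    """Return the labels of the models that fit in the given memory budget."""
    return [m.label for m in models if fits(m, memory_gb)]


candidates = [
    LocalModel("laptop-class model", 20.9),  # the 20.9GB figure from the text
    LocalModel("very large model", 1500.0),  # the 1.5TB figure from the text
]

print(runnable(candidates, 32))  # ['laptop-class model']
print(runnable(candidates, 16))  # []
```

Real memory needs also depend on context length and quantization, so treat a check like this as a first filter before actually trying a model on your machine.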

4. New personal AI assistants became a thing

What began as an obscure repo called Warelay turned into OpenClaw, and by February it was drawing huge attention. The broader category also picked up a name: “Claws,” meaning personal AI assistants built around these new agentic patterns.

This matters because it shows how quickly the market moved from single-shot chat to persistent helpers. People were even buying Mac minis just to run them. The idea is simple: keep a small, dedicated machine around for your assistant, then let it handle ongoing tasks without taking over your main computer.

Warelay: original repo name
OpenClaw: final name that caught on
Claws: the emerging generic term
Common setup: a Mac mini as a home for the assistant

5. Benchmark demos got weirder, and more useful

Simon Willison’s pelican-riding-a-bicycle test, which asks a model to generate an SVG of a pelican riding a bicycle, became a running way to compare models because it is absurd in exactly the right way. It is hard to draw, easy to recognize, and unlikely to be optimized directly by any lab. That made it a surprisingly good proxy for how models handled tricky prompts.

The same period also produced playful but revealing demos, including micro-javascript, a JavaScript interpreter written in Python that runs under Pyodide (Python compiled to WebAssembly) inside the browser. These examples are funny, but they also show how far experimentation had spread. People were no longer just asking models questions; they were building strange stacks to see what the systems could actually do.

browser → JavaScript → WebAssembly → Pyodide → Python → micro-javascript

Pelican test: quick visual check for model quality
Micro-javascript: a hobby project that proved a point
Takeaway: the tooling got good enough to support weird experiments

How to decide

If you want reliable coding help, start with the strongest agentic tools from OpenAI or Anthropic and test them on your own repo. If you care about privacy, cost control, or offline use, look at open-weight model families such as Gemma, GLM, and Qwen. If you are building products, the main lesson is that the center of gravity has moved from “can it do the task?” to “which model fits this task, machine, and budget best?”

For most readers, the practical answer is to keep one frontier model and one local model in rotation. That gives you a fast path for hard tasks and a cheap path for everyday work.
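
One way to picture the two-model rotation is as a small routing rule. This is a sketch under assumed criteria: the `Task` fields, the `pick_model` name, and the rule that privacy outranks difficulty are all hypothetical choices, and your own rule may weigh cost or latency instead.

```python
from dataclasses import dataclass


@dataclass
class Task:
    description: str
    hard: bool = False     # e.g. multi-file refactors or long agent runs
    private: bool = False  # data that must not leave the machine


def pick_model(task):
    """Return which of the two models in rotation should take the task."""
    if task.private:
        return "local"  # in this sketch, privacy outranks capability
    return "frontier" if task.hard else "local"


print(pick_model(Task("rename a variable")))                        # local
print(pick_model(Task("refactor the billing module", hard=True)))   # frontier
print(pick_model(Task("summarize customer records", hard=True, private=True)))  # local
```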
