OraCore Editors

Gemini 3.1 Pro: Google’s new top model in numbers

Gemini 3.1 Pro posts 77.1% on ARC-AGI-2, 94.3% on GPQA Diamond, and a 1M-token context window, while keeping Gemini 3 pricing.

Google DeepMind’s Gemini 3.1 Pro arrives with a very specific flex: 77.1% on ARC-AGI-2, 94.3% on GPQA Diamond, and 80.6% on SWE-Bench Verified. It also ships with a 1,048,576-token context window, which is the kind of number that changes how teams think about long documents, codebases, and agent workflows.

The model launched on February 19, 2026 and keeps the same listed pricing as Gemini 3 Pro: $2 per 1M input tokens and $12 per 1M output tokens. That matters because the jump here is not just about scores; it is about getting more capability without paying a new tax for it.

What Gemini 3.1 Pro actually changes

The headline features are easy to repeat, but the practical effect is more interesting. A 1M-token context window means teams can stuff in an entire codebase, long research packs, or a huge batch of source material without chopping it into awkward chunks. Google says the model can also handle up to 65,536 output tokens, which reduces the usual problem of truncated answers during long technical work.
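
As a rough illustration of what that looks like in practice, here is a minimal sketch of a single long-context call through the google-genai Python SDK. The model identifier gemini-3.1-pro and the my_repo directory are assumptions for the example; the output cap is the figure Google cites.

```python
# Minimal sketch of a single long-context call through the google-genai SDK.
# The model id "gemini-3.1-pro" and the my_repo/ directory are assumptions;
# the output cap matches the 65,536-token figure cited above.
from pathlib import Path

from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

# Concatenate an entire source tree instead of chunking it across calls.
corpus = "\n\n".join(p.read_text() for p in Path("my_repo").rglob("*.py"))

response = client.models.generate_content(
    model="gemini-3.1-pro",  # assumed identifier, not confirmed
    contents=f"Summarize the architecture and flag risky modules:\n\n{corpus}",
    config=types.GenerateContentConfig(max_output_tokens=65_536),
)
print(response.text)
```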

Gemini 3.1 Pro also adds what Google calls native SVG and 3D code rendering. In plain English, it can generate visual assets and structured code from text prompts without the clumsy back-and-forth that older models often need. That makes it more useful for product mockups, UI experiments, and quick visual explanations inside developer tools.
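
For example, a single prompt can ask for a finished SVG and write it straight to disk. This is a hypothetical sketch along the same lines, again assuming the gemini-3.1-pro identifier and that the asset comes back as plain text in the response.

```python
# Hypothetical sketch: asking for a finished SVG asset in one prompt.
# Same assumed "gemini-3.1-pro" identifier as above.
from pathlib import Path

from google import genai

client = genai.Client()
response = client.models.generate_content(
    model="gemini-3.1-pro",  # assumed identifier
    contents=(
        "Return only a self-contained SVG document, no markdown fences: "
        "a 256x256 dashboard icon with three bar-chart bars on a dark background."
    ),
)
Path("icon.svg").write_text(response.text)
```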

The model uses three thinking levels: Low, Medium, and High. That sounds simple, but it is a useful control knob. Quick classification jobs do not need the same compute budget as a messy debugging session or a multi-step research task.
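
A rough sketch of how that knob could be set per request is below. It assumes the level is exposed through the SDK's thinking config, as in recent Gemini releases; the exact field name and accepted values for Gemini 3.1 Pro are not confirmed here.

```python
# Sketch of per-request thinking levels. The thinking_level field mirrors how
# recent Gemini releases expose this knob in the google-genai SDK; the exact
# name and accepted values for Gemini 3.1 Pro are an assumption here.
from google import genai
from google.genai import types

client = genai.Client()

def ask(prompt: str, level: str) -> str:
    response = client.models.generate_content(
        model="gemini-3.1-pro",  # assumed identifier
        contents=prompt,
        config=types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(thinking_level=level),
        ),
    )
    return response.text

# A cheap classification call versus an expensive multi-step debugging question.
label = ask("Label this ticket as bug, feature request, or question: ...", "low")
analysis = ask("Explain why this race condition only appears under load: ...", "high")
```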

  • ARC-AGI-2: 77.1%, more than double Gemini 3 Pro’s score, according to the product page
  • GPQA Diamond: 94.3%, a strong sign for science and research-style prompts
  • SWE-Bench Verified: 80.6%, aimed at real-world software fixes
  • Context window: 1,048,576 input tokens
  • Output length: up to 65,536 tokens
  • Pricing: $2 input / $12 output per 1M tokens

Why the benchmark mix matters

Benchmarks only tell part of the story, but this mix is unusually revealing. ARC-AGI-2 tests abstract reasoning, GPQA Diamond checks graduate-level science knowledge, and SWE-Bench Verified measures whether a model can fix real repository issues. Put together, they show a model that is not just good at one narrow skill.

Google also highlights 2887 Elo on LiveCodeBench Pro, 69.2% on MCP Atlas, and 85.9% on BrowseComp. Those numbers point to stronger performance in coding contests, tool coordination, and autonomous web research. For teams building agents, that combination matters more than a single flashy score.

There is also a pricing angle. Gemini 3.1 Pro’s input price is far lower than Claude Opus 4.6’s at the prices shown on the page, and the output price is lower too. If you are running long-context workloads, the economics can matter as much as raw quality; a quick cost sketch follows the list below.

  • LiveCodeBench Pro: 2887 Elo
  • MCP Atlas: 69.2% for tool coordination
  • BrowseComp: 85.9% for autonomous web research
  • Input price: $2.00 per 1M tokens vs Claude Opus 4.6 at $15.00
  • Output price: $12.00 per 1M tokens vs Claude Opus 4.6 at $75.00
  • Comparison note: the page says Gemini 3.1 Pro leads GPT-5.4 on ARC-AGI-2 and GPQA Diamond, while GPT-5.4 leads on OSWorld and some software tasks
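
To make the economics concrete, here is a quick back-of-the-envelope comparison using the listed prices; the workload size is invented purely for illustration.

```python
# Back-of-the-envelope cost comparison using the per-1M-token prices listed above.
# The workload size (800K input, 20K output per run) is invented for illustration.
PRICES = {  # model: (input $/1M tokens, output $/1M tokens)
    "Gemini 3.1 Pro": (2.00, 12.00),
    "Claude Opus 4.6": (15.00, 75.00),
}

input_tokens, output_tokens = 800_000, 20_000

for model, (inp, out) in PRICES.items():
    cost = input_tokens / 1e6 * inp + output_tokens / 1e6 * out
    print(f"{model}: ${cost:.2f} per run")
# Gemini 3.1 Pro: $1.84 per run
# Claude Opus 4.6: $13.50 per run
```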

What Google is claiming, and what it means in practice

The product page says Gemini 3.1 Pro is the “most capable AI model” in the Gemini line, and the numbers support that claim better than most marketing pages do. The model is positioned as a step up in reasoning, coding, multimodal understanding, and agentic work, with the strongest gains in abstract reasoning and long-context use.

Google DeepMind has been clear for years that it wants Gemini to handle more of the messy middle of knowledge work. In a 2024 interview with The Verge, DeepMind CEO Demis Hassabis said:

“The ultimate goal is to build a universal assistant.”

That line fits this release well. A model that can read huge inputs, reason through them, and produce usable code or visuals is closer to that goal than a chat model that only answers short prompts well.

There is a catch, though. Big benchmark gains do not automatically translate to better work on every task. The page itself shows Gemini 3.1 Pro losing ground to GPT-5.4 on some office and computer-use benchmarks, and Claude Opus 4.6 edges it on SWE-Bench Verified by a tiny margin. So the right read is not “best at everything.” It is “very strong where long context, reasoning, and tool use matter most.”

How it compares with the other big names

When you line up the numbers, the picture gets clearer. Gemini 3.1 Pro looks especially strong in reasoning and cost efficiency, while competitors still hold some advantages in computer use and certain software tasks. That split is useful for teams choosing a model for a specific workload rather than a generic chatbot.

Here is the short version of the comparison on the page:

  • ARC-AGI-2: Gemini 3.1 Pro at 77.1%, Claude Opus 4.6 at 68.8%, GPT-5.4 at 73.3%
  • GPQA Diamond: Gemini 3.1 Pro at 94.3%, Claude Opus 4.6 at 91.3%, GPT-5.4 at 92.8%
  • SWE-Bench Verified: Gemini 3.1 Pro at 80.6%, Claude Opus 4.6 at 80.8%, GPT-5.2 at 80.0%
  • Terminal-Bench 2.0: Gemini 3.1 Pro at 68.5%, Claude Opus 4.6 at 65.4%, GPT-5.4 at 75.1%
  • OSWorld: GPT-5.4 at 75.0%, Claude Opus 4.6 at 72.7%, GPT-5.2 at 47.3%
  • GDPval: GPT-5.4 at 83.0%, Claude Opus 4.6 at 78.0%, GPT-5.2 at 70.9%

That spread tells you where to place your bets. If your work depends on giant inputs, agent coordination, and high-end reasoning, Gemini 3.1 Pro looks very attractive. If your team lives inside computer-use workflows or office-style tasks, the comparison is more balanced and may still favor another model in some cases.

For developers, the most interesting part may be the price-performance combo. A model that can process a million tokens, render SVG, and score well on coding benchmarks at Gemini 3 pricing is the kind of tool that changes how people prototype, debug, and automate. It may also push more teams toward fewer prompt splits and more end-to-end workflows.

Bottom line for developers and teams

Gemini 3.1 Pro is not interesting because it is “new.” It is interesting because it makes a credible case for being a practical default model for long-context coding, research, and agent work. The 1M-token window, 65K output cap, and strong benchmark set make it one of the few models that can credibly sit in the center of a serious AI workflow.

If Google keeps these prices stable and exposes the model cleanly through the official Google AI and Vertex AI stack, the next question is simple: will teams start defaulting to a single giant prompt instead of splitting work across smaller calls? That is the real test, and it will show up in production logs long before it shows up in a benchmark table.

For a deeper look at how large-context models are changing agent design, see our related coverage on long-context AI models for developers. The next wave of model adoption will probably be decided less by chat quality and more by how much real work one prompt can finish.