Xiaomi’s MiMo AI Push Targets Agentic Software
Xiaomi’s MiMo-V2-Pro, Omni, and TTS models pair 1T+ parameters with low pricing, aiming squarely at agentic AI workloads.

Xiaomi just put a very large number on the table: more than 1 trillion parameters for MiMo-V2-Pro. The bigger surprise is not the scale, though. It is the price, with input tokens listed at $1 per million and output tokens at $3 per million, which undercuts several premium frontier models by a wide margin.
That matters because the AI race is moving from chatbots to agents that can click, read, reason, and act. Xiaomi’s new stack, which also includes MiMo-V2-Omni and MiMo-V2-TTS, is aimed at exactly that kind of software worker.
What Xiaomi actually launched
The launch is broader than a single model release. Xiaomi is packaging language reasoning, multimodal perception, and speech synthesis into one family of systems that can support agentic workflows across apps, browsers, and eventually physical devices.
MiMo-V2-Pro is the headline model. It uses a Mixture-of-Experts design, claims more than 1 trillion total parameters, and activates 42 billion parameters per request. Xiaomi says it supports a context window of up to 1 million tokens, which is the sort of range you need for long-running agent tasks, codebases, or document-heavy enterprise work.
MiMo-V2-Omni extends that idea into text, image, audio, and video. MiMo-V2-TTS handles speech generation with emotional control and nonverbal cues like laughter and hesitation.
- MiMo-V2-Pro: 1T+ total parameters, 42B active per request
- Context window: up to 1 million tokens
- MiMo-V2-Pro pricing: $1 input, $3 output per million tokens
- MiMo-V2-Omni: text, vision, audio, and video in one model
- MiMo-V2-TTS: trained on more than 100 million hours of speech data
Why agentic AI is the real story
The interesting shift here is not raw model size. It is the move toward agents that can do work instead of just answering prompts. That means planning a sequence, calling tools, checking results, correcting mistakes, and continuing without a human babysitting every step.
That is the direction the whole field is moving in. OpenAI, Anthropic, and Google DeepMind have all been pushing models that can reason over longer contexts and interact with software. Xiaomi is now trying to win on a mix of scale, multimodality, and price.
The company’s pitch is practical: let the model read a browser page, compare products, write a document, generate audio, and hand off the result. In other words, build AI that behaves more like a junior operator than a text box.
“AI is the new electricity.” — Andrew Ng
That quote gets used a lot because it still fits. Xiaomi is betting that AI will matter less as a standalone app and more as an invisible layer inside products, services, and workflows.
How MiMo-V2-Pro compares on price and scope
Pricing is where Xiaomi gets aggressive. The company is not trying to outspend the biggest labs. It is trying to make a model that developers can actually afford to run at scale.
According to the launch details, MiMo-V2-Pro costs $1 per million input tokens and $3 per million output tokens. That is far below the pricing tier of several premium models that developers use for coding and agent tasks.
Here is the comparison Xiaomi is implicitly making:
- MiMo-V2-Pro: $1 input, $3 output per million tokens
- Claude Sonnet: $3 input, $15 output per million tokens
- Claude Opus: $5 input, $25 output per million tokens
That gap changes the economics of experimentation. A startup building an agent that runs dozens of model calls per task will care a lot more about token pricing than a demo user chatting once a day.
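To make that concrete, here is a rough back-of-the-envelope cost comparison using the per-million-token prices listed above. The number of calls and the token counts per call are illustrative assumptions about a typical agent task, not benchmark figures.

```python
# Per-million-token prices from the launch comparison: (input $, output $).
PRICES = {
    "MiMo-V2-Pro": (1.00, 3.00),
    "Claude Sonnet": (3.00, 15.00),
    "Claude Opus": (5.00, 25.00),
}

def task_cost(model, calls=30, in_tokens=4_000, out_tokens=800):
    """Estimated cost of one agent task that makes `calls` model calls,
    each consuming `in_tokens` input and producing `out_tokens` output."""
    price_in, price_out = PRICES[model]
    return calls * (in_tokens * price_in + out_tokens * price_out) / 1_000_000

for model in PRICES:
    print(f"{model}: ${task_cost(model):.3f} per task")
```

Under these assumptions a single multi-call agent task costs about $0.19 on MiMo-V2-Pro versus $0.72 on Sonnet-tier pricing and $1.20 on Opus-tier pricing, and that multiplier compounds across thousands of tasks per day.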
Xiaomi also says MiMo-V2-Pro performs near the top tier on coding and agent benchmarks, placing it close to Claude Opus 4.6 on those tasks. That is a serious claim, and one that will only matter if the performance holds up outside curated tests.
Omni and TTS make the stack more useful
MiMo-V2-Omni is the part that makes Xiaomi’s plan feel less like a lab demo and more like a product strategy. A model that can process text, images, audio, and video can work across interfaces that humans already use every day.
That opens up concrete use cases: checking dashcam footage for hazards, reading a browser page and filling out forms, separating speakers in a meeting recording, or scanning long video streams for important moments. Xiaomi also claims Omni can handle continuous long-audio analysis beyond 10 hours, which is useful for support centers, meetings, and media review.
MiMo-V2-TTS fills another gap. If agents are going to talk to people all day, the voice layer has to sound natural. Xiaomi says users can describe voice style in plain language, and the system can handle dialects, singing, laughter, and hesitation. That is a meaningful step beyond the fixed emotion presets most TTS tools still use.
- Omni can analyze images, video, and long audio streams
- Omni supports native audio-video joint reasoning
- TTS adds emotional control through text instructions
- TTS includes dialects, tones, and paralinguistic sounds
- Combined, the stack targets software agents and voice agents
The interesting part is how these pieces fit together. A browser agent needs language reasoning. A customer support agent needs voice. A robotics system eventually needs both, plus perception from cameras and microphones. Xiaomi is building toward that chain.
What the market should watch next
Xiaomi’s move matters because it shows how fast the agent market is splitting into layers. Some companies will sell premium reasoning models. Others will sell cheaper models that developers can run often. Xiaomi is clearly aiming for the second group while still chasing top-tier capability.
There is also a branding twist. The model first appeared anonymously on OpenRouter under the codename Hunter Alpha, and users speculated it was DeepSeek V4 before Xiaomi revealed the real source. That kind of mystery launch only happens when a model is good enough to create its own rumor mill.
For developers, the practical question is simple: will MiMo integrate cleanly into agent frameworks and hold up under real workloads? Xiaomi says it is working with tools such as Cline, Blackbox AI, and Kilo Code, which suggests the company understands that distribution matters as much as raw model quality.
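Xiaomi has not published API documentation in the material above, but the model's anonymous appearance on OpenRouter suggests it is served through an OpenAI-compatible chat endpoint. Under that assumption, dropping it into an existing agent loop would look something like the sketch below; the endpoint URL, model ID, and key are placeholders, not confirmed values.

```python
import json
import urllib.request

# Placeholder values -- swap in the real gateway, model ID, and key.
API_URL = "https://example-gateway/v1/chat/completions"
API_KEY = "YOUR_KEY"

def build_request(prompt, model="mimo-v2-pro"):
    """Build an OpenAI-style chat-completions payload (assumed format)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def ask(prompt, model="mimo-v2-pro"):
    """Send one chat request and return the assistant's reply text."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_request(prompt, model)).encode(),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

If the endpoint really is OpenAI-compatible, frameworks like the ones Xiaomi names would mostly need a base URL and model string changed, which is exactly why that compatibility layer matters for adoption.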
My read: if MiMo-V2-Pro really holds near-frontier coding performance at these prices, the strongest adoption will come from teams building internal agents, not consumer chat apps. The next test is whether Xiaomi can turn that technical win into a developer habit. If it can, the company may become a much bigger AI player than its phone business alone would suggest.