Xiaomi’s MiMo AI Push Targets Agentic Software
Xiaomi’s MiMo-V2-Pro, Omni, and TTS models pair 1T+ parameters with low pricing, aiming squarely at agentic AI workloads.

Xiaomi just put a very large number on the table: more than 1 trillion parameters for MiMo-V2-Pro. The bigger surprise is not the scale, though. It is the price, with input tokens listed at $1 per million and output tokens at $3 per million, which undercuts several premium frontier models by a wide margin.
That matters because the AI race is moving from chatbots to agents that can click, read, reason, and act. Xiaomi’s new stack, which also includes MiMo-V2-Omni and MiMo-V2-TTS, is aimed at exactly that kind of software worker.
What Xiaomi actually launched
The launch is broader than a single model release. Xiaomi is packaging language reasoning, multimodal perception, and speech synthesis into one family of systems that can support agentic workflows across apps, browsers, and eventually physical devices.
MiMo-V2-Pro is the headline model. It uses a Mixture-of-Experts design, claims more than 1 trillion total parameters, and activates 42 billion parameters per request. Xiaomi says it supports a context window of up to 1 million tokens, which is the sort of range you need for long-running agent tasks, codebases, or document-heavy enterprise work.
MiMo-V2-Omni extends that idea into text, image, audio, and video. MiMo-V2-TTS handles speech generation with emotional control and nonverbal cues like laughter and hesitation.
- MiMo-V2-Pro: 1T+ total parameters, 42B active per request
- Context window: up to 1 million tokens
- MiMo-V2-Pro pricing: $1 input, $3 output per million tokens
- MiMo-V2-Omni: text, vision, audio, and video in one model
- MiMo-V2-TTS: trained on more than 100 million hours of speech data
Why agentic AI is the real story
The interesting shift here is not raw model size. It is the move toward agents that can do work instead of just answering prompts. That means planning a sequence, calling tools, checking results, correcting mistakes, and continuing without a human babysitting every step.
That is the direction the whole field is moving in. OpenAI, Anthropic, and Google DeepMind have all been pushing models that can reason over longer contexts and interact with software. Xiaomi is now trying to win on a mix of scale, multimodality, and price.
The company’s pitch is practical: let the model read a browser page, compare products, write a document, generate audio, and hand off the result. In other words, build AI that behaves more like a junior operator than a text box.
“AI is the new electricity.” — Andrew Ng
That quote gets used a lot because it still fits. Xiaomi is betting that AI will matter less as a standalone app and more as an invisible layer inside products, services, and workflows.
How MiMo-V2-Pro compares on price and scope
Pricing is where Xiaomi gets aggressive. The company is not trying to outspend the biggest labs. It is trying to make a model that developers can actually afford to run at scale.
According to the launch details, MiMo-V2-Pro costs $1 per million input tokens and $3 per million output tokens. That is far below the pricing tier of several premium models that developers use for coding and agent tasks.
Here is the comparison Xiaomi is implicitly making:
- MiMo-V2-Pro: $1 input, $3 output per million tokens
- Claude Sonnet: $3 input, $15 output per million tokens
- Claude Opus: $5 input, $25 output per million tokens
That gap changes the economics of experimentation. A startup building an agent that runs dozens of model calls per task will care a lot more about token pricing than a demo user chatting once a day.
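To make that concrete, here is a rough back-of-the-envelope cost comparison using the per-million-token prices listed above. The number of calls and the token counts per call are illustrative assumptions about a typical agent task, not benchmark figures.

```python
# Per-million-token prices from the launch comparison: (input $, output $).
PRICES = {
    "MiMo-V2-Pro": (1.00, 3.00),
    "Claude Sonnet": (3.00, 15.00),
    "Claude Opus": (5.00, 25.00),
}

def task_cost(model, calls=30, in_tokens=4_000, out_tokens=800):
    """Estimated cost of one agent task that makes `calls` model calls,
    each consuming `in_tokens` input and producing `out_tokens` output."""
    price_in, price_out = PRICES[model]
    return calls * (in_tokens * price_in + out_tokens * price_out) / 1_000_000

for model in PRICES:
    print(f"{model}: ${task_cost(model):.3f} per task")
```

Under these assumptions a single multi-call agent task costs about $0.19 on MiMo-V2-Pro versus $0.72 on Sonnet-tier pricing and $1.20 on Opus-tier pricing, and that multiplier compounds across thousands of tasks per day.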
Xiaomi also says MiMo-V2-Pro performs near the top tier on coding and agent benchmarks, placing it close to Claude Opus 4.6 on those tasks. That is a serious claim, and one that will only matter if the performance holds up outside curated tests.
Omni and TTS make the stack more useful
MiMo-V2-Omni is the part that makes Xiaomi’s plan feel less like a lab demo and more like a product strategy. A model that can process text, images, audio, and video can work across interfaces that humans already use every day.
That opens up concrete use cases: checking dashcam footage for hazards, reading a browser page and filling out forms, separating speakers in a meeting recording, or scanning long video streams for important moments. Xiaomi also claims Omni can handle continuous long-audio analysis beyond 10 hours, which is useful for support centers, meetings, and media review.
MiMo-V2-TTS fills another gap. If agents are going to talk to people all day, the voice layer has to sound natural. Xiaomi says users can describe voice style in plain language, and the system can handle dialects, singing, laughter, and hesitation. That is a meaningful step beyond the fixed emotion presets most TTS tools still use.
- Omni can analyze images, video, and long audio streams
- Omni supports native audio-video joint reasoning
- TTS adds emotional control through text instructions
- TTS includes dialects, tones, and paralinguistic sounds
- Combined, the stack targets software agents and voice agents
The interesting part is how these pieces fit together. A browser agent needs language reasoning. A customer support agent needs voice. A robotics system eventually needs both, plus perception from cameras and microphones. Xiaomi is building toward that chain.
What the market should watch next
Xiaomi’s move matters because it shows how fast the agent market is splitting into layers. Some companies will sell premium reasoning models. Others will sell cheaper models that developers can run often. Xiaomi is clearly aiming for the second group while still chasing top-tier capability.
There is also a branding twist. The model first appeared anonymously on OpenRouter under the codename Hunter Alpha, and users speculated it was DeepSeek V4 before Xiaomi revealed the real source. That kind of mystery launch only happens when a model is good enough to create its own rumor mill.
For developers, the practical question is simple: will MiMo integrate cleanly into agent frameworks and hold up under real workloads? Xiaomi says it is working with tools such as Cline, Blackbox AI, and Kilo Code, which suggests the company understands that distribution matters as much as raw model quality.
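Xiaomi has not published API documentation in the material above, but the model's anonymous appearance on OpenRouter suggests it is served through an OpenAI-compatible chat endpoint. Under that assumption, dropping it into an existing agent loop would look something like the sketch below; the endpoint URL, model ID, and key are placeholders, not confirmed values.

```python
import json
import urllib.request

# Placeholder values -- swap in the real gateway, model ID, and key.
API_URL = "https://example-gateway/v1/chat/completions"
API_KEY = "YOUR_KEY"

def build_request(prompt, model="mimo-v2-pro"):
    """Build an OpenAI-style chat-completions payload (assumed format)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def ask(prompt, model="mimo-v2-pro"):
    """Send one chat request and return the assistant's reply text."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_request(prompt, model)).encode(),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

If the endpoint really is OpenAI-compatible, frameworks like the ones Xiaomi names would mostly need a base URL and model string changed, which is exactly why that compatibility layer matters for adoption.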
My read: if MiMo-V2-Pro really holds near-frontier coding performance at these prices, the strongest adoption will come from teams building internal agents, not consumer chat apps. The next test is whether Xiaomi can turn that technical win into a developer habit. If it can, the company may become a much bigger AI player than its phone business alone would suggest.