Xiaomi’s MiMo trio targets agents, robots, and voice

OraCore Editors

[MODEL] March 28, 2026, 9 min read

Xiaomi released three MiMo models for agents, multimodal tasks, and speech. MiMo-V2-Pro nears Claude Opus 4.6 on key benchmarks.

Xiaomi has shipped three MiMo AI models in one shot: a large language model, a multimodal model, and a speech synthesis model. That matters because the company is not treating AI as a single chat box; it is building a stack for agents that can read, see, speak, click, and eventually act in the physical world.

The headline numbers are hard to ignore. MiMo-V2-Pro has more than 1 trillion total parameters, of which 42 billion are active per token, and Xiaomi says it can handle context windows of up to 1 million tokens. It is also priced far below Anthropic’s top models, which makes this launch as much about economics as raw capability.

Xiaomi is building an agent stack, not a chatbot

The three models each cover a different layer of the same product idea. MiMo-V2-Pro handles reasoning, coding, and agent workflows. MiMo-V2-Omni adds image, video, and audio understanding plus tool use. MiMo-V2-TTS turns text into speech with emotional control and singing support. Put together, they look less like separate releases and more like a full platform for software agents.

That approach makes sense for Xiaomi. The company sells phones, home devices, cars, and other consumer electronics, so an AI system that can move across screens, microphones, and cameras is a natural fit. If a model can inspect a dashcam feed, answer a voice prompt, and fill out a browser form, it can slot into far more products than a plain text assistant.

MiMo-V2-Pro: over 1 trillion total parameters, 42 billion active, 1 million-token context
MiMo-V2-Omni: image, video, and audio encoders in one model
MiMo-V2-TTS: trained on more than 100 million hours of speech data
Launch pricing for MiMo-V2-Pro: $1 per million input tokens and $3 per million output tokens

There is also a strategic angle here. Xiaomi is not waiting for a single model to do everything well. It is splitting the job into specialized systems and connecting them with agent frameworks. That is a more practical move than chasing one giant model that tries to do every task at once.

MiMo-V2-Pro is Xiaomi’s sharpest shot at premium coding

MiMo-V2-Pro is the model most likely to get developers’ attention first. Xiaomi says it ranks third globally on both PinchBench and ClawEval, and it scores 78 percent on SWE-bench Verified. That puts it behind both Claude Sonnet 4.6 at 79.6 and Claude Opus 4.6 at 80.8, but within three points of each.

It also ran under the codename “Hunter Alpha” on OpenRouter before the launch, where it climbed the daily rankings and drew a lot of guesses about its origin. Xiaomi says the model processed more than 1 trillion tokens during that period, and coding was the most common use case. That is a useful signal: developers will test a model where it feels expensive to fail.

“We believe the path to general intelligence runs through the real world. A model that only reads text lives in a library. A model that sees, hears, reasons, and acts lives in the world.”

That quote from Xiaomi’s MiMo team tells you what the company wants this family to become. It is aiming for agents that do work, not just answer questions. In practice, that means browser control, code execution, and eventually robotics integration.

On pricing, Xiaomi is being aggressive. The company says MiMo-V2-Pro costs $1 per million input tokens and $3 per million output tokens for contexts up to 256,000 tokens. By comparison, Claude Sonnet 4.6 is listed at $3 and $15, while Claude Opus 4.6 is $5 and $25. Xiaomi is also waiving cache write costs for now, which lowers the barrier for experimentation even more.
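To make the price gap concrete, here is a minimal sketch that applies the listed per-million-token prices to one job. The workload size (200,000 input tokens and 20,000 output tokens) is a hypothetical chosen to stay inside the 256,000-token tier, and the function name `job_cost` is illustrative only. The sketch ignores caching charges and any pricing for longer contexts.

```python
# Listed prices in USD per million tokens: (input, output), as quoted in the article.
PRICES = {
    "MiMo-V2-Pro": (1.0, 3.0),
    "Claude Sonnet 4.6": (3.0, 15.0),
    "Claude Opus 4.6": (5.0, 25.0),
}


def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the cost in USD of one job at the listed prices (no caching)."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: a long coding prompt with a moderate reply.
    for model in PRICES:
        print(f"{model}: ${job_cost(model, 200_000, 20_000):.2f}")
```

Under these assumptions the same job costs roughly $0.26 on MiMo-V2-Pro, $0.90 on Claude Sonnet 4.6, and $1.50 on Claude Opus 4.6. The ratio shifts with the mix of input and output tokens, since the output price gap is wider than the input gap.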

MiMo-V2-Pro: 78 on SWE-bench Verified
Claude Opus 4.6: 80.8 on SWE-bench Verified
Claude Sonnet 4.6: 79.6 on SWE-bench Verified
ClawEval agent score: 81 for MiMo-V2-Pro versus 81.5 for Claude Opus 4.6
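As a quick check on how close these results are, the sketch below computes the point gaps on SWE-bench Verified from the scores listed above. The helper name `gap_to` is illustrative, and the figures are the vendor-reported numbers quoted in this article, not independent measurements.

```python
# SWE-bench Verified scores (percent) as quoted in the article.
SWE_BENCH_VERIFIED = {
    "MiMo-V2-Pro": 78.0,
    "Claude Opus 4.6": 80.8,
    "Claude Sonnet 4.6": 79.6,
}


def gap_to(baseline: str, model: str = "MiMo-V2-Pro") -> float:
    """Return how many points `model` trails `baseline`, rounded to one decimal."""
    return round(SWE_BENCH_VERIFIED[baseline] - SWE_BENCH_VERIFIED[model], 1)


if __name__ == "__main__":
    for rival in ("Claude Opus 4.6", "Claude Sonnet 4.6"):
        print(f"MiMo-V2-Pro trails {rival} by {gap_to(rival)} points")
```

On these numbers MiMo-V2-Pro trails Claude Opus 4.6 by 2.8 points and Claude Sonnet 4.6 by 1.6 points.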

MiMo-V2-Omni is where Xiaomi gets practical

MiMo-V2-Omni is the more interesting model if you care about devices and automation. Xiaomi says it can see, hear, and act in a single system, with structured tool calls, function execution, and autonomous UI control built in. That is the kind of model you would want if you were trying to move from demos to products that actually do useful work.

The benchmark split is worth paying attention to. Xiaomi says Omni beats Claude Opus 4.6 on audio and image tasks, and it outperforms both Gemini 3 Pro and GPT-5.2 on MM-BrowserComp, a web navigation benchmark. But on ClawEval, the agent benchmark, it scores 54.8, well behind Claude Opus 4.6 at 66.3 and GPT-5.2 at 59.6. In plain English: its perception is strong, but its agent behavior still has room to improve.

Xiaomi’s demos show the kind of work the model is aimed at. In one example, the model watched dashcam footage and flagged pedestrians, oncoming vehicles, and traffic bottlenecks as hazards. In another, it opened a browser, searched for product reviews on Xiaohongshu, compared prices on JD.com, negotiated a discount with customer support, and finished a purchase. A separate demo had it create multimedia content, debug the code behind it, and publish the result to TikTok through the browser.

Those demos may sound flashy, but they point to real product categories: in-car assistants, shopping agents, and content tools. Xiaomi already has hardware in all of those categories, so the company has a direct path from model capability to shipped features.

MiMo-V2-TTS could matter more than the flashier models

MiMo-V2-TTS is the quietest release here, but it may be the most consumer-friendly. Xiaomi says it was trained on over 100 million hours of speech data and can generate speech with fine-grained emotional control. Instead of choosing a preset voice style from a menu, users describe the tone in plain language.

That means prompts like “sleepy, just woken up, slightly hoarse” or “angry, but trying to stay calm” can shape the output directly. The model also produces coughs, hesitations, sighs, and laughter as part of the generation process. Xiaomi says it is the only commercially available TTS API that natively handles both speech and singing in the same model.

For consumer devices, that is a big deal. A voice assistant that can sound natural under stress, whisper at the right moment, or sing without a separate pipeline has a much better chance of feeling alive in a phone or smart speaker. It also hints at where Xiaomi may be headed next: assistants that are less like command terminals and more like interactive characters.

The company says the model also reads typography as a cue for emphasis and rhythm, so capital letters and repeated characters change the delivery. That is the kind of detail that makes voice systems feel less synthetic and more controllable.

Xiaomi has the pieces, but agent reliability is the test

China’s AI scene is getting crowded fast, and Xiaomi is now in the same conversation as Zhipu AI, Moonshot AI, and Alibaba’s Qwen team. Zhipu’s GLM-5, Moonshot’s Kimi K2.5, and Alibaba’s Qwen 3.5 line all push hard on coding and agent work. Xiaomi’s answer is different: it is trying to bundle reasoning, perception, speech, and device control into one product story.

That story will only matter if the agents are dependable. Browser control is easy to demo and hard to trust. Shopping flows break. UI layouts change. Speech systems drift into uncanny territory. The benchmarks show Xiaomi can compete, sometimes very closely, but they also show that general agent behavior is still uneven across the family.

MiMo-V2-Pro is strongest on coding and text-based agent tasks
MiMo-V2-Omni is strongest on multimodal perception and browser navigation
MiMo-V2-TTS is the most consumer-ready piece for phones and smart devices
Xiaomi is already positioned to ship these models into hardware, not just APIs

My take: the most important question is not whether MiMo can beat one benchmark by a point or two. It is whether Xiaomi can make these models reliable enough for phones, cars, and home devices without turning every action into a debugging session. If the company can close that gap, the next Xiaomi assistant may feel less like a voice feature and more like a worker that actually gets things done.
