Self-host MiniMax M3 on GPU cloud
MiniMax M3 brings 229.9B MoE weights, 1M context, and multimodal input, but it needs serious GPU memory to run.

MiniMax M3 is a 229.9B-parameter open-weight model that can run 1M-token multimodal workloads.
Spheron published a deployment guide for MiniMax M3 just 11 days after its June 1, 2026 release, and the numbers explain why teams are paying attention. The model combines 229.9B total parameters, 9.8B active parameters per token, a 1,048,576-token context window, and native image and video understanding in one checkpoint.
| Metric | MiniMax M3 | What it means |
|---|---|---|
| Release date | June 1, 2026 | Fresh open-weight frontier model |
| Total parameters | 229.9B | Full weights must live in VRAM |
| Active parameters | 9.8B per token | Lower per-token compute than dense giants |
| Context length | 1,048,576 tokens | 1M-token prompts and chats |
| SWE-Bench Pro | 59.0% | Strong software engineering score |
MiniMax M3 is built for long, messy work
M3 is not a chat toy with a flashy context number attached. It is an open-weight Mixture-of-Experts model with 256 fine-grained experts, and that matters because only 9.8B parameters activate per token while the full 229.9B parameter set stays available in memory.

That design lets the model keep inference costs closer to a mid-sized dense model while still carrying a huge parameter bank. For teams doing agentic coding, long research sessions, or multimodal analysis, the appeal is obvious: one model can inspect code, reason over long histories, and read images or video without switching endpoints.
MiniMax also published a 59.0% score on SWE-Bench Pro, a benchmark that tests whether a model can actually fix software bugs across multiple steps. That is a better signal for real developer use than a single-turn code completion score.
- 229.9B total parameters
- 9.8B active parameters per token
- 256 experts in the MoE stack
- 59.0% on SWE-Bench Pro
- 1,048,576-token context window
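
To make the total-versus-active split concrete, here is a minimal arithmetic sketch using only the two parameter counts quoted above. It says nothing about how the 256 experts are sized or routed, which this article does not specify.

```python
# Parameter counts as quoted in the article.
TOTAL_PARAMS = 229.9e9
ACTIVE_PARAMS = 9.8e9


def active_fraction(total: float, active: float) -> float:
    """Share of the weights that take part in computing a single token."""
    return active / total


frac = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
print(f"Active share per token: {frac:.1%}")
print(f"Resident weights per active weight: {1 / frac:.1f}x")
```

The ratio is the reason the model computes like a much smaller network while still demanding memory for the whole checkpoint.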
MSA is the reason 1M context is even possible
The real trick in M3 is MiniMax Sparse Attention, or MSA. Standard full attention gets expensive fast because compute grows quadratically with context length. At 1M tokens, that becomes a wall for ordinary long-context serving setups.
MiniMax says MSA delivers more than 9x prefill speedup and more than 15x decode speedup at 1M context versus its earlier M2 model. It also cuts per-token compute to about one-twentieth of M2 at the same context length. That is the difference between a model that looks impressive in a demo and one that can sit behind a real product.
“Sparse attention is the key to long-context efficiency.” — Tri Dao, FlashAttention-2 paper
That quote is not about MiniMax M3 specifically, but it captures the same engineering idea: attention has to get smarter if you want long context without absurd compute bills. M3 pushes that idea further by pairing sparse attention with a very large context window and native multimodal input.
For developers, the practical result is simple. Full codebases, long chat histories, legal files, and research threads can stay in one request instead of being chopped into chunks. That reduces retrieval glue code and makes agent loops easier to reason about.
- MSA prefill speedup: more than 9x
- MSA decode speedup: more than 15x
- Per-token compute at 1M context: about 1/20 of M2
- Context window: 1,048,576 tokens
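
The quadratic wall is easy to see with a few lines of arithmetic. The sketch below models plain full attention only, assuming cost proportional to the square of the token count and ignoring constants; it does not model MSA, whose internals are not described here.

```python
def full_attention_cost_ratio(tokens_a: int, tokens_b: int) -> float:
    """Relative cost of full attention at tokens_a versus tokens_b.

    Assumes cost scales with n^2 (every token attends to every other token).
    """
    return (tokens_a / tokens_b) ** 2


ONE_M = 1_048_576
for shorter in (131_072, 262_144):
    ratio = full_attention_cost_ratio(ONE_M, shorter)
    print(f"1M vs {shorter:,} tokens: {ratio:.0f}x the attention cost")
```

Under that assumption, growing the window 8x from 128K to 1M multiplies attention cost by 64, which is the gap sparse attention schemes are meant to close.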
The GPU bill depends on precision and context
Self-hosting M3 is mostly a memory problem, then a cost problem. Only 9.8B parameters fire per token, but the router can select different experts from one token to the next, so you cannot just keep the active experts in VRAM and page the rest from CPU without paying a steep latency penalty.

Spheron’s guide gives a useful snapshot of the hardware math. In BF16, M3 needs about 460 GB of VRAM. FP8 cuts that to about 230 GB. AWQ INT4 drops the footprint to about 115 GB, which opens the door to smaller cards for lighter workloads.
Context length adds another layer. KV cache memory still grows with the number of tokens, even when MSA reduces attention compute. At 1M context, FP8 KV cache alone is about 120 GB. Added to roughly 230 GB of FP8 weights, that is about 350 GB, more than the 282 GB that two 141 GB H200s provide, which is why a 2x H200 setup is not enough for the full window.
| Precision | Model VRAM | Typical GPU setup | 1M-context fit? |
|---|---|---|---|
| BF16 | ~460 GB | 4x H200 SXM5 or 6x H100 SXM5 | Yes |
| FP8 | ~230 GB | 4x H200 SXM5 or 8x H100 SXM5 | Yes |
| AWQ INT4 | ~115 GB | 1x H200 SXM5 or 2x H100 SXM5 | Only for smaller contexts |
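
The footprint figures can be reproduced with back-of-the-envelope arithmetic. This is a rough sketch, not Spheron's method: it assumes weights take a fixed number of bytes per parameter, that KV cache grows linearly from zero to the quoted 120 GB at 1M tokens, and that cards offer their nominal capacity (141 GB for H200, 80 GB for H100). Activations, quantization metadata, and runtime overhead are ignored, so real headroom will be smaller.

```python
PARAMS = 229.9e9
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "INT4": 0.5}
GPU_VRAM_GB = {"H200": 141, "H100": 80}  # nominal per-card capacity
KV_GB_AT_1M_FP8 = 120.0                  # figure quoted in the article
FULL_CONTEXT = 1_048_576


def weight_gb(precision: str) -> float:
    """Weights-only footprint in GB for a given precision."""
    return PARAMS * BYTES_PER_PARAM[precision] / 1e9


def kv_cache_gb(tokens: int) -> float:
    """FP8 KV cache size, assuming linear growth with token count."""
    return KV_GB_AT_1M_FP8 * tokens / FULL_CONTEXT


def headroom_gb(gpu: str, count: int, precision: str, tokens: int) -> float:
    """VRAM left after weights and KV cache; negative means it does not fit."""
    return GPU_VRAM_GB[gpu] * count - weight_gb(precision) - kv_cache_gb(tokens)


for precision in BYTES_PER_PARAM:
    print(f"{precision}: ~{weight_gb(precision):.0f} GB of weights")

for gpu, count, tokens in [("H200", 2, 262_144), ("H200", 2, FULL_CONTEXT),
                           ("H200", 4, FULL_CONTEXT), ("H100", 8, FULL_CONTEXT)]:
    spare = headroom_gb(gpu, count, "FP8", tokens)
    print(f"{count}x {gpu} FP8 at {tokens:,} tokens: {spare:+.1f} GB headroom")
```

A negative headroom value is the signal to add cards or shorten the context; a small positive one still needs testing, because the ignored overheads eat into it.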
Spheron’s published pricing on June 12, 2026 puts 2x H200 SXM5 FP8 at $3.64 per hour on spot and $9.68 per hour on demand. A 4x H100 SXM5 FP8 setup costs $5.72 per hour on spot and $15.68 per hour on demand. That makes H200 the cleaner pick for FP8 serving, while H100 becomes attractive when you want more cards and can tolerate a higher hourly rate.
The context math matters more than the raw GPU count. A 2x H200 FP8 node can handle M3 at around 256K to 300K context comfortably, but true 1M-context serving needs at least 4x H200 or 8x H100. If your product only needs 128K or 256K context, you can save a lot by staying below the full 1M target.
- BF16 model memory: ~460 GB
- FP8 model memory: ~230 GB
- AWQ INT4 model memory: ~115 GB
- 2x H200 spot price: $3.64/hr
- 4x H100 spot price: $5.72/hr
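
For budgeting, it helps to turn the hourly rates into per-GPU and monthly numbers. The sketch below uses only the four prices quoted above; the 730-hour month is an assumption for an always-on node, and spot interruptions are not modeled.

```python
# Hourly node prices quoted in the article (USD, June 12, 2026).
PRICES = {
    "2x H200 SXM5": {"gpus": 2, "spot": 3.64, "on_demand": 9.68},
    "4x H100 SXM5": {"gpus": 4, "spot": 5.72, "on_demand": 15.68},
}
HOURS_PER_MONTH = 730  # assumption: average month, node never stopped


def per_gpu_hour(node: str, tier: str) -> float:
    """Hourly price divided by the number of cards in the node."""
    return PRICES[node][tier] / PRICES[node]["gpus"]


def monthly_cost(node: str, tier: str, hours: float = HOURS_PER_MONTH) -> float:
    """Cost of keeping the whole node up for the given number of hours."""
    return PRICES[node][tier] * hours


for node in PRICES:
    for tier in ("spot", "on_demand"):
        print(f"{node} {tier}: ${per_gpu_hour(node, tier):.2f}/GPU-hr, "
              f"${monthly_cost(node, tier):,.0f}/month")
```

The H100 node is cheaper per card but more expensive per hour as a whole, which matches the tradeoff described above.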
vLLM is the practical serving path
vLLM is the right tool for most teams that want an OpenAI-compatible endpoint without building their own server stack. Spheron’s guide uses it because it supports tensor parallelism, expert parallelism, and FP8 KV cache handling, all of which matter for M3.
The deployment flow is straightforward: provision the GPU node on Spheron, install CUDA 12.4+ and the needed Python packages, download the model from Hugging Face, then launch the server with the right parallelism and cache flags. Before downloading, confirm that you can access the model repository and that you have accepted its license terms.
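
The launch step might look roughly like the sketch below, which only assembles and prints a `vllm serve` command rather than running it. The model ID is a hypothetical placeholder, not a confirmed repository name, and the parallelism and context values are illustrative. The flags shown exist in recent vLLM releases, but whether a given release supports M3 is something to verify, not assume.

```python
import shlex

# Hypothetical placeholder; substitute the real Hugging Face repository name.
MODEL_ID = "MiniMaxAI/MiniMax-M3"


def build_vllm_command(model_id: str, tensor_parallel: int, max_model_len: int,
                       expert_parallel: bool = True, port: int = 8000) -> list[str]:
    """Assemble a `vllm serve` invocation as an argument list."""
    cmd = [
        "vllm", "serve", model_id,
        "--tensor-parallel-size", str(tensor_parallel),  # shard across GPUs
        "--max-model-len", str(max_model_len),           # cap the context window
        "--kv-cache-dtype", "fp8",                       # store KV cache in FP8
        "--port", str(port),
    ]
    if expert_parallel:
        cmd.append("--enable-expert-parallel")           # spread MoE experts
    return cmd


# Illustrative values: a 2-GPU node serving a 256K window.
command = build_vllm_command(MODEL_ID, tensor_parallel=2, max_model_len=262_144)
print(shlex.join(command))
```

Building the command as a list keeps it easy to pass to `subprocess.run` later and makes each setting reviewable before it reaches a paid GPU node.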
One detail worth watching is framework support. MSA requires explicit backend support, so you cannot assume any random vLLM release will work. Pin the version that adds MiniMax M3 support, then test context length, throughput, and KV cache behavior before putting it behind production traffic.
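
One way to run that pre-production check is a small probe that sends progressively longer prompts to the OpenAI-compatible completions route and times each call. This is a sketch under assumptions: the model ID is a hypothetical placeholder, the filler prompt only approximates token counts (the real count depends on the tokenizer), and `send_probe` needs a running server, so it is defined but not called here.

```python
import json
import time
import urllib.request

MODEL_ID = "MiniMaxAI/MiniMax-M3"  # hypothetical placeholder name


def build_probe(model: str, approx_tokens: int, max_new_tokens: int = 16) -> dict:
    """Build a completions payload with a prompt of roughly approx_tokens words."""
    filler = "lorem " * approx_tokens  # about one token per word; tokenizer-dependent
    return {
        "model": model,
        "prompt": filler + "\nSummarize the text above in one sentence.",
        "max_tokens": max_new_tokens,
        "temperature": 0.0,
    }


def send_probe(base_url: str, payload: dict, timeout_s: float = 600.0) -> float:
    """POST the payload to /v1/completions and return wall-clock seconds."""
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        response.read()
    return time.perf_counter() - start


PROBE_SIZES = [8_000, 64_000, 128_000]
probes = [build_probe(MODEL_ID, size) for size in PROBE_SIZES]
for size, payload in zip(PROBE_SIZES, probes):
    print(f"~{size:,} tokens -> {len(payload['prompt']):,} characters")
```

Against a live server, calling `send_probe("http://localhost:8000", probes[0])` for each payload (the URL is an assumption) gives a quick latency curve; watching GPU memory while the larger probes run shows whether KV cache behaves as planned.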
For teams already using SGLang, the same basic memory rules apply. The serving stack matters less than the GPU budget once you push past 128K context.
What this means for teams shipping real products
M3 changes the conversation around open-weight models because it combines three things that usually arrive separately: frontier-level coding performance, native multimodal input, and a context window large enough to hold a serious working set. That combination is useful for coding agents, document analysis, and research copilots that need to keep a lot of state in memory.
The catch is that the model is still expensive to run well. If you want the full 1M-token window, you are buying multi-GPU infrastructure, not a single-card experiment. If your workload tops out at 128K or 256K context, the economics improve a lot and the model becomes easier to justify.
The more interesting question is whether teams will actually need the full million tokens in production. My bet is that most will not, at least at first. They will use M3 for 128K to 256K tasks, then reserve the full context window for debugging, codebase-wide reasoning, and long document synthesis where the extra memory really pays for itself.
If you are planning a deployment, the first decision is simple: do you need multimodal long context, or do you just want a strong open model with decent coding ability? If the answer includes both, M3 is worth the GPU bill. If not, a smaller model may give you better economics and less operational friction.