Self-host MiniMax M3 on GPU cloud

OraCore Editors

Back to home

[MODEL] June 18, 20268 min readOraCore Editors

Self-host MiniMax M3 on GPU cloud

MiniMax M3 brings 229.9B MoE weights, 1M context, and multimodal output, but it needs serious GPU memory to run.

vLLM long context

Share LinkedIn

MiniMax M3 is a 229.9B-parameter open-weight model that can run 1M-token multimodal workloads.

Spheron published a deployment guide for MiniMax M3 just 11 days after its June 1, 2026 release, and the numbers explain why teams are paying attention. The model combines 229.9B total parameters, 9.8B active parameters per token, a 1,048,576-token context window, and native image and video understanding in one checkpoint.

Metric	MiniMax M3	What it means
Release date	June 1, 2026	Fresh open-weight frontier model
Total parameters	229.9B	Full weights must live in VRAM
Active parameters	9.8B per token	Lower per-token compute than dense giants
Context length	1,048,576 tokens	1M-token prompts and chats
SWE-Bench Pro	59.0%	Strong software engineering score

MiniMax M3 is built for long, messy work

Get the latest AI news in your inbox

Weekly picks of model releases, tools, and deep dives — no spam, unsubscribe anytime.

No spam. Unsubscribe at any time.

M3 is not a chat toy with a flashy context number attached. It is an open-weight Mixture-of-Experts model with 256 fine-grained experts, and that matters because only 9.8B parameters activate per token while the full 229.9B parameter set stays available in memory.

That design lets the model keep inference costs closer to a mid-sized dense model while still carrying a huge parameter bank. For teams doing agentic coding, long research sessions, or multimodal analysis, the appeal is obvious: one model can inspect code, reason over long histories, and read images or video without switching endpoints.

MiniMax also published a 59.0% score on SWE-Bench Pro, a benchmark that tests whether a model can actually fix software bugs across multiple steps. That is a better signal for real developer use than a single-turn code completion score.

229.9B total parameters
9.8B active parameters per token
256 experts in the MoE stack
59.0% on SWE-Bench Pro
1,048,576-token context window

MSA is the reason 1M context is even possible

The real trick in M3 is MiniMax Sparse Attention, or MSA. Standard full attention gets expensive fast because compute grows quadratically with context length. At 1M tokens, that becomes a wall for ordinary long-context serving setups.

MiniMax says MSA delivers more than 9x prefill speedup and more than 15x decode speedup at 1M context versus its earlier M2 model. It also cuts per-token compute to about one-twentieth of M2 at the same context length. That is the difference between a model that looks impressive in a demo and one that can sit behind a real product.

“Sparse attention is the key to long-context efficiency.” — Tri Dao, FlashAttention-2 paper

That quote is not about MiniMax M3 specifically, but it captures the same engineering idea: attention has to get smarter if you want long context without absurd compute bills. M3 pushes that idea further by pairing sparse attention with a very large context window and native multimodal input.

For developers, the practical result is simple. Full codebases, long chat histories, legal files, and research threads can stay in one request instead of being chopped into chunks. That reduces retrieval glue code and makes agent loops easier to reason about.

MSA prefill speedup: more than 9x
MSA decode speedup: more than 15x
Per-token compute at 1M context: about 1/20 of M2
Context window: 1,048,576 tokens

The GPU bill depends on precision and context

Self-hosting M3 is mostly a memory problem, then a cost problem. Because it is an MoE model, you cannot just keep the active experts in VRAM and page the rest from CPU without paying a steep latency penalty.

Spheron’s guide gives a useful snapshot of the hardware math. In BF16, M3 needs about 460 GB of VRAM. FP8 cuts that to about 230 GB. AWQ INT4 drops the footprint to about 115 GB, which opens the door to smaller cards for lighter workloads.

Context length adds another layer. KV cache memory still grows with the number of tokens, even when MSA reduces attention compute. At 1M context, FP8 KV cache alone is about 120 GB, which is why a 2x H200 setup is not enough for the full window.

Precision	Model VRAM	Typical GPU setup	1M-context fit?
BF16	~460 GB	4x H200 SXM5 or 6x H100 SXM5	Yes
FP8	~230 GB	4x H200 SXM5 or 8x H100 SXM5	Yes
AWQ INT4	~115 GB	1x H200 SXM5 or 2x H100 SXM5	Only for smaller contexts

Spheron’s published pricing on June 12, 2026 puts 2x H200 SXM5 FP8 at $3.64 per hour on spot and $9.68 per hour on demand. A 4x H100 SXM5 FP8 setup costs $5.72 per hour on spot and $15.68 per hour on demand. That makes H200 the cleaner pick for FP8 serving, while H100 becomes attractive when you want more cards and can tolerate a higher hourly rate.

The context math matters more than the raw GPU count. A 2x H200 FP8 node can handle M3 at around 256K to 300K context comfortably, but true 1M-context serving needs at least 4x H200 or 8x H100. If your product only needs 128K or 256K context, you can save a lot by staying below the full 1M target.

BF16 model memory: ~460 GB
FP8 model memory: ~230 GB
AWQ INT4 model memory: ~115 GB
2x H200 spot price: $3.64/hr
4x H100 spot price: $5.72/hr

vLLM is the practical serving path

vLLM is the right tool for most teams that want an OpenAI-compatible endpoint without building their own server stack. Spheron’s guide uses it because it supports tensor parallelism, expert parallelism, and FP8 KV cache handling, all of which matter for M3.

The deployment flow is straightforward: provision the GPU node on Spheron, install CUDA 12.4+ and the needed Python packages, download the model from Hugging Face, then launch the server with the right parallelism and cache flags. If you are serving multimodal requests, you also need to confirm that the model repository is public and that the license terms are accepted.

One detail worth watching is framework support. MSA requires explicit backend support, so you cannot assume any random vLLM release will work. Pin the version that adds MiniMax M3 support, then test context length, throughput, and KV cache behavior before putting it behind production traffic.

For teams already using SGLang, the same basic memory rules apply. The serving stack matters less than the GPU budget once you push past 128K context.

What this means for teams shipping real products

M3 changes the conversation around open-weight models because it combines three things that usually arrive separately: frontier-level coding performance, native multimodal input, and a context window large enough to hold a serious working set. That combination is useful for coding agents, document analysis, and research copilots that need to keep a lot of state in memory.

The catch is that the model is still expensive to run well. If you want the full 1M-token window, you are buying multi-GPU infrastructure, not a single-card experiment. If your workload tops out at 128K or 256K context, the economics improve a lot and the model becomes easier to justify.

The more interesting question is whether teams will actually need the full million tokens in production. My bet is that most will not, at least at first. They will use M3 for 128K to 256K tasks, then reserve the full context window for debugging, codebase-wide reasoning, and long document synthesis where the extra memory really pays for itself.

If you are planning a deployment, the first decision is simple: do you need multimodal long context, or do you just want a strong open model with decent coding ability? If the answer includes both, M3 is worth the GPU bill. If not, a smaller model may give you better economics and less operational friction.

// Related Articles

Self-host MiniMax M3 on GPU cloud

MiniMax M3 is built for long, messy work

Get the latest AI news in your inbox

MSA is the reason 1M context is even possible

The GPU bill depends on precision and context

vLLM is the practical serving path

What this means for teams shipping real products

Devin pricing in June 2026: plans, limits, tradeoffs

Apple’s Gemini-backed AI is still its own thing

Gemma 4 brings 256K context to open models

Kimi K2.7 Code 该优先上 API 和 Kimi Code，而不是等生态成熟

Kingdom Hearts IV confirmed for Switch 2 launch

Gemini 3.5 Live Translate rolls out in 70+ languages