NVIDIA Sets New MLPerf Inference Records
Blackwell Ultra hit new MLPerf Inference v6.0 highs, with GB300 NVL72 gaining 2.7x on DeepSeek-R1 server tests and 1.5x on Llama 3.1 405B.

NVIDIA says its GB300 NVL72 system improved by 2.7x on DeepSeek-R1 server inference in MLPerf v6.0 compared with its MLPerf v5.1 result, and by 1.5x on Llama 3.1 405B server inference. Those are the kinds of numbers that matter when your business bills by the token, not by the benchmark slide.
The bigger story is that MLPerf Inference v6.0 added tougher workloads, and NVIDIA submitted on all of them. The new additions included DeepSeek-R1 Interactive, Qwen3-VL-235B-A22B, GPT-OSS-120B, WAN-2.2-T2V-A14B, and DLRMv3.
That matters because inference is where AI systems make money or burn it. Training gets the headlines, but production throughput decides how many users a cluster can serve, how fast responses arrive, and how much each generated token costs.
What changed in MLPerf v6.0
MLPerf Inference keeps expanding to match what people actually deploy. This round added multimodal vision-language, video generation, interactive reasoning, and a new recommendation benchmark with heavier compute demand than the older DLRM-DCNv2 test.

NVIDIA’s results were broad, but the headline numbers are easy to summarize: the company says its platform delivered the highest throughput across the newly added workloads and scenarios, while also posting large gains on returning LLM tests. That combination is more interesting than any single score, because it shows software improvements compounding on the same hardware.
The company also says it now has 291 cumulative MLPerf training and inference wins since 2018. That is a brag, sure, but it is also a signal that its stack is tuned for repeated benchmark cycles, not one-off demos.
- DeepSeek-R1 server throughput: 2,494,310 tokens/sec
- GPT-OSS-120B server throughput: 1,096,770 tokens/sec
- Qwen3-VL offline throughput: 79 samples/sec
- DLRMv3 offline throughput: 104,637 samples/sec
- GB300 NVL72 DeepSeek-R1 server gain: 2.77x vs. MLPerf v5.1
Why software moved the numbers
NVIDIA’s pitch is simple: the GPU matters, but the software stack decides how much of that GPU you actually use. In this round, the company points to TensorRT-LLM updates, the NVIDIA Dynamo serving framework, and a set of model-specific optimizations that pushed more tokens through the same hardware footprint.
That is where the interesting engineering lives. Faster kernels reduce overhead. Kernel fusion cuts the number of launches. Optimized attention data parallelism balances context requests across ranks. For interactive workloads, disaggregated serving splits prefill and decode so each phase can be tuned separately.
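To make the prefill/decode split concrete, here is a minimal sketch of the disaggregated-serving idea, assuming two independently tuned worker pools. This is not Dynamo's actual API; `Request`, `WorkerPool`, and the handoff hook are hypothetical names for illustration.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Request:
    prompt_tokens: int     # context length the prefill phase must process
    max_new_tokens: int    # decode budget for the generation phase

@dataclass
class WorkerPool:
    name: str
    queue: List[Request] = field(default_factory=list)

# Prefill is compute-bound (one big pass over the prompt), decode is
# memory-bandwidth-bound (many small steps), so each pool can batch
# and schedule for its own bottleneck instead of sharing one compromise.
prefill_pool = WorkerPool("prefill")
decode_pool = WorkerPool("decode")

def admit(request: Request) -> None:
    """New requests go to the prefill pool to build their KV cache."""
    prefill_pool.queue.append(request)

def on_prefill_done(request: Request) -> None:
    """Hand off to the decode pool once the KV cache exists; a real
    system transfers or shares the cache between pools at this point."""
    decode_pool.queue.append(request)
```

The payoff is that a latency-sensitive decode fleet never waits behind a long prompt's prefill, which is exactly what interactive scenarios punish.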
The company also used Wide Expert Parallel, Multi-Token Prediction, and KV-aware routing for the DeepSeek-R1 Interactive benchmark. Those are not marketing flourishes; they are practical ways to reduce bottlenecks when a mixture-of-experts model gets hit with small batches and latency-sensitive traffic.
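KV-aware routing is easier to picture with a toy router. The sketch below sends each request to the worker whose cached prefixes overlap the prompt the most, so repeated contexts skip redundant prefill work; the data structures and tie-breaking here are assumptions for illustration, not the production scheduler.

```python
from typing import Dict, List, Tuple

def shared_prefix_len(a: Tuple[int, ...], b: Tuple[int, ...]) -> int:
    """Number of leading tokens two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(prompt: Tuple[int, ...],
          worker_caches: Dict[str, List[Tuple[int, ...]]]) -> str:
    """Pick the worker holding the KV cache with the longest matching
    prefix. Assumes at least one worker; a real router would also weigh
    current load and memory pressure, not just cache overlap."""
    best_worker, best_score = None, -1
    for worker, cached_prefixes in worker_caches.items():
        score = max((shared_prefix_len(prompt, p) for p in cached_prefixes),
                    default=0)
        if score > best_score:
            best_worker, best_score = worker, score
    return best_worker
```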
“If you can make one thing 10 percent better, that’s great. If you can make 10 things 1 percent better, that’s much more powerful.” — Jensen Huang, NVIDIA GTC 2024 keynote
That quote fits this release well. The gains here come from many small engineering decisions stacked together, not from one magical hardware trick.
The numbers that tell the real story
Per-GPU gains are the cleanest way to see how much the software work changed the outcome. NVIDIA’s own table shows that GB300 NVL72 improved on both DeepSeek-R1 and Llama 3.1 405B between MLPerf v5.1 and v6.0.

For the DeepSeek-R1 server scenario, throughput rose from 2,907 tokens/sec/GPU to 8,064 tokens/sec/GPU. On the offline scenario, it climbed from 5,842 to 9,821 tokens/sec/GPU. Llama 3.1 405B, a dense model launched nearly two years ago, also improved meaningfully even though it is a less dramatic benchmark for modern MoE tricks; a quick check of the multipliers follows the list below.
- DeepSeek-R1 server: 2,907 to 8,064 tokens/sec/GPU
- DeepSeek-R1 offline: 5,842 to 9,821 tokens/sec/GPU
- Llama 3.1 405B server: 170 to 259 tokens/sec/GPU
- Llama 3.1 405B offline: 224 to 271 tokens/sec/GPU
- DeepSeek-R1 server gain: 2.77x
- Llama 3.1 405B server gain: 1.52x
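The multipliers fall straight out of that table by simple division:

```python
# Per-GPU server/offline throughput from the v5.1 -> v6.0 comparison
# above (tokens/sec/GPU).
results = {
    "DeepSeek-R1 server":     (2907, 8064),
    "DeepSeek-R1 offline":    (5842, 9821),
    "Llama 3.1 405B server":  (170, 259),
    "Llama 3.1 405B offline": (224, 271),
}

for name, (v51, v60) in results.items():
    print(f"{name}: {v60 / v51:.2f}x")
# DeepSeek-R1 server:     2.77x
# DeepSeek-R1 offline:    1.68x
# Llama 3.1 405B server:  1.52x
# Llama 3.1 405B offline: 1.21x
```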
Those numbers suggest something important about inference economics. Even on an older dense model, NVIDIA is still finding extra headroom. That means cloud operators can sometimes get more output from the same rack without buying fresh silicon, which is exactly the kind of improvement finance teams care about.
The scale-out result is just as notable. NVIDIA says four GB300 NVL72 systems connected with Quantum-X800 InfiniBand and 288 Blackwell Ultra GPUs set a system-level throughput record in MLPerf Inference v6.0. That is the sort of configuration that matters for large AI factories, where the cluster is the product.
What this says about the market
These results also show how much inference has become an ecosystem sport. NVIDIA says 14 partners submitted on its platform this round, including ASUS, Cisco, CoreWeave, Dell Technologies, Supermicro, and Lenovo.
That partner list matters because inference is no longer a single-vendor story. The best results often come from matching hardware, networking, serving software, and model-specific tuning. A cloud provider wants one answer, an enterprise wants another, and a model provider wants a third. The common thread is throughput per watt and throughput per dollar.
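Throughput per dollar reduces to one line of arithmetic once you fix a rack price and a sustained token rate. The inputs below are hypothetical placeholders, not NVIDIA or partner pricing:

```python
def cost_per_million_tokens(rack_cost_per_hour: float,
                            tokens_per_sec: float) -> float:
    """Dollars per million generated tokens at sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return rack_cost_per_hour / tokens_per_hour * 1_000_000

# Hypothetical example: a rack billed at $300/hour sustaining
# 500,000 tokens/sec works out to roughly $0.17 per million tokens.
print(f"${cost_per_million_tokens(300.0, 500_000):.2f} per 1M tokens")
```

Run the same arithmetic before and after a software-only gain like the ones above and the per-token cost drops with no change to the hardware line item.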
It also helps explain why NVIDIA keeps emphasizing open-source pieces like TensorRT-LLM, Dynamo, and vLLM. The company wants its stack to look like the default path for serious deployment work, not a closed box that only shines in demos.
If you care about AI infrastructure, the takeaway is simple: benchmark wins still matter, but the best signal is sustained improvement on the same systems. NVIDIA is showing that a well-tuned stack can keep paying off long after launch day. The next question is whether competitors can match those gains on real workloads, or whether the gap keeps widening when the models get larger and the traffic gets messier.
For teams building AI products today, the practical move is to watch MLPerf more closely, especially the server and interactive scenarios. Those are the tests that look most like production, and they are where the cost of each token starts to define the business.
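For teams doing that, the numbers to pull out of server and interactive results are time-to-first-token (TTFT) and time-per-output-token (TPOT) at high percentiles, since those are the latency constraints the scenarios enforce. A minimal sketch, assuming you already have request-level timing logs:

```python
def p99(values):
    """Rough 99th-percentile by nearest-rank over sorted latencies."""
    ordered = sorted(values)
    return ordered[int(0.99 * (len(ordered) - 1))]

# Hypothetical logs: (ttft_seconds, total_seconds, tokens_generated).
logs = [(0.21, 3.4, 128), (0.18, 2.9, 110), (0.45, 5.1, 160)]

ttft = [t for t, _, _ in logs]
# TPOT: decode time spread over the tokens after the first one.
tpot = [(total - t) / max(n - 1, 1) for t, total, n in logs]

print(f"p99 TTFT: {p99(ttft):.2f}s, p99 TPOT: {p99(tpot) * 1000:.1f}ms")
```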