[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-nvidia-mlperf-software-inference-benchmarks-zh":3,"tags-nvidia-mlperf-software-inference-benchmarks-zh":33,"related-lang-nvidia-mlperf-software-inference-benchmarks-zh":49,"related-posts-nvidia-mlperf-software-inference-benchmarks-zh":53,"series-research-0b5979a7-dbb3-438f-b8a1-68de0f838df0":90},{"id":4,"title":5,"content":6,"summary":7,"source":8,"source_url":9,"author":10,"image_url":11,"keywords":12,"language":21,"translated_content":10,"views":22,"is_premium":23,"created_at":24,"updated_at":24,"cover_image":11,"published_at":25,"rewrite_status":26,"rewrite_error":10,"rewritten_from_id":27,"slug":28,"category":29,"related_article_id":30,"status":31,"google_indexed_at":32,"x_posted_at":10,"tweet_text":10,"title_rewritten_at":10,"title_original":10,"key_takeaways":10,"topic_cluster_id":10,"embedding":10,"is_canonical_seed":23},"0b5979a7-dbb3-438f-b8a1-68de0f838df0","Nvidia’s MLPerf Gains Show Software Still Matters","\u003Cp>\u003Ca href=\"https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002F\" target=\"_blank\" rel=\"noopener\">Nvidia\u003C\u002Fa> is telling the same story again. It doesn't just sell GPUs; it sells an entire AI platform. The new results at \u003Ca href=\"https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fgtc\u002F\" target=\"_blank\" rel=\"noopener\">GTC\u003C\u002Fa> and in \u003Ca href=\"https:\u002F\u002Fmlcommons.org\u002Fbenchmarks\u002Finference\u002F\" target=\"_blank\" rel=\"noopener\">MLPerf Inference\u003C\u002Fa> make that point bluntly.\u003C\u002Fp>\u003Cp>The numbers grab the most attention. Nvidia says the \u003Ca href=\"https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fdata-center\u002Fgb300-nvl72\u002F\" target=\"_blank\" rel=\"noopener\">GB300 NVL72\u003C\u002Fa> improved by up to 2.77x over the previous round on the DeepSeek-R1 server test. Interactive DeepSeek-R1 also reached 250,634 tokens per second at a cost of $0.30 per million tokens. Put plainly, these are the numbers cloud providers and enterprises will use to run the math.\u003C\u002Fp>\u003Cp>The most interesting part this time is not how fast the chip is, but how much of the load the software carries. Hardware, models, scheduling, and kernel optimizations move together to pull inference efficiency up. That should resonate with developers in Taiwan too, because we often stare at GPU 
specs and forget that the software stack is the last mile.\u003C\u002Fp>\u003Ch2>MLPerf v6.0 looks more like today's AI traffic\u003C\u002Fh2>\u003Cp>\u003Ca href=\"https:\u002F\u002Fmlcommons.org\u002F\" target=\"_blank\" rel=\"noopener\">MLCommons\u003C\u002Fa> has updated \u003Ca href=\"https:\u002F\u002Fmlcommons.org\u002Fbenchmarks\u002Finference\u002F\" target=\"_blank\" rel=\"noopener\">MLPerf Inference v6.0\u003C\u002Fa>, with the focus on inference, reasoning models, and multimodal workloads. That makes sense, because AI servers in 2026 long ago stopped running only simple chatbots.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775185790112-2r4u.png\" alt=\"Nvidia’s MLPerf Gains Show Software Still Matters\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>Traffic today is messier. Some users ask questions, some write code, some upload images and video. Each kind of request affects token generation speed, time to first response, and memory allocation. Older benchmarks tended to flatten those differences, so the results looked good but weren't representative.\u003C\u002Fp>\u003Cp>Nvidia product lead Dave Salvatore put it directly. MLCommons updated the test suite with workloads such as DeepSeek-R1 Interactive, GPT-OSS-120B, and Qwen3-VL-235B-A22B. None of these are toy models. They push latency, throughput, and memory pressure up together, forcing systems to show what they can really do.\u003C\u002Fp>\u003Cul>\u003Cli>DeepSeek-R1 Interactive measures token generation speed and time to first token.\u003C\u002Fli>\u003Cli>GPT-OSS-120B is an MoE reasoning model.\u003C\u002Fli>\u003Cli>Qwen3-VL-235B-A22B tests multimodal vision-language capability.\u003C\u002Fli>\u003Cli>v6.0 covers offline, server, and interactive scenarios.\u003C\u002Fli>\u003C\u002Ful>\u003Cp>This design is closer to the real world, because in practice an API service never runs just one mode. It might serve low-latency interactive traffic during the day, batch summarization at night, and multimodal queries the day after. What you want is overall efficiency, not a single pretty score.\u003C\u002Fp>\u003Cp>I think this also reflects a simple reality. Training matters, but inference is what really burns money. Once models go into large-scale production, servers, networking, memory, and scheduling all become cost sinks. MLPerf v6.0 essentially lays that out in the open.\u003C\u002Fp>\u003Ch2>Nvidia's software stack has real substance\u003C\u002Fh2>\u003Cp>What Nvidia most wants to say this time is not how fierce a single GPU is, but how the whole software stack wrings everything out of the hardware. It points to \u003Ca href=\"https:\u002F\u002Fdeveloper.nvidia.com\u002Fdynamo\" target=\"_blank\" rel=\"noopener\">Dynamo\u003C\u002Fa>, \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTensorRT-LLM\" target=\"_blank\" rel=\"noopener\">TensorRT-LLM\u003C\u002Fa>, and CUDA-level optimizations. The names sound like pure engineering, but the scoreboard speaks for itself.\u003C\u002Fp>\u003Cp>Dynamo is a distributed inference framework. The core idea is to split prefill and decode and run them on different GPUs. The benefit is very practical: prefill is compute-hungry, decode is latency-sensitive, and mixing them often skews how resources get used.\u003C\u002Fp>\u003Cp>TensorRT-LLM is another layer of acceleration. It uses parallelization, multi-token prediction, and kernel fusion. Put simply, it skips unnecessary steps so the GPU idles less. These optimizations never show up on marketing posters, but they show up in benchmark scores all the time.\u003C\u002Fp>\u003Cblockquote>“Increases in token generation or increases in performance basically generate more revenue, they reduce costs, they get you more value from the same infrastructure,” Salvatore said.\u003C\u002Fblockquote>\u003Cp>That is a grounded statement. Higher token throughput means the same set of servers can take more requests. For cloud providers and enterprises this is not an academic question; it's a billing question. Every bit shaved off inference cost is money saved directly.\u003C\u002Fp>\u003Cp>Nvidia also notes that it collaborates with open-source inference tools such as \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\" target=\"_blank\" rel=\"noopener\">SGLang\u003C\u002Fa> and \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention\" target=\"_blank\" rel=\"noopener\">FlashAttention\u003C\u002Fa>. That matters, because developers today don't want closed black boxes; they want toolchains they can tune, modify, and integrate.\u003C\u002Fp>\u003Cp>Put bluntly, Nvidia isn't just trying to sell hardware. It wants you to feel that choosing Nvidia means choosing a complete inference stack that can go straight to production. It's a clever play, and a realistic one.\u003C\u002Fp>\u003Ch2>The numbers grow fast, and the competition gets fiercer\u003C\u002Fh2>\u003Cp>The hardest comparison is still the results themselves. Nvidia says that on MLPerf v6.0, versus v5.1, the GB300 NVL72 improved Llama 3.1 405B offline by 1.21x and DeepSeek-R1 server by 2.77x. For a system already sitting at the top of the market, that is no minor touch-up.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775185799612-6nlm.png\" alt=\"Nvidia’s MLPerf Gains Show Software Still Matters\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>Then look at the interactive result. DeepSeek-R1 Interactive ran at 250,634 tokens per second at a cost of $0.30 per million tokens. Numbers like these are ideal for holding up against cloud price lists. You may never buy the same system, but you will certainly use them to question your vendor.\u003C\u002Fp>\u003Cp>More interesting still, Nvidia was not playing alone. Fourteen partners submitted results this round, including \u003Ca href=\"https:\u002F\u002Fwww.dell.com\u002Fen-us\" target=\"_blank\" rel=\"noopener\">Dell Technologies\u003C\u002Fa>, \u003Ca href=\"https:\u002F\u002Fwww.hpe.com\u002Fus\u002Fen\u002Fhome.html\" target=\"_blank\" rel=\"noopener\">HPE\u003C\u002Fa>, and \u003Ca href=\"https:\u002F\u002Fcloud.google.com\u002F\" target=\"_blank\" rel=\"noopener\">Google Cloud\u003C\u002Fa>. That means the optimizations aren't confined to a lab; they land on systems from different vendors.\u003C\u002Fp>\u003Cul>\u003Cli>DeepSeek-R1 server: 2.77x improvement.\u003C\u002Fli>\u003Cli>Llama 3.1 405B offline: 1.21x improvement.\u003C\u002Fli>\u003Cli>DeepSeek-R1 Interactive: 250,634 tokens\u002Fs.\u003C\u002Fli>\u003Cli>DeepSeek-R1 Interactive: $0.30 per million tokens.\u003C\u002Fli>\u003C\u002Ful>\u003Cp>Looking at the competition, \u003Ca href=\"https:\u002F\u002Fwww.amd.com\u002Fen\u002Fproducts\u002Faccelerators\u002Finstinct.html\" target=\"_blank\" rel=\"noopener\">AMD Instinct\u003C\u002Fa> takes a different road, pushing price-performance and an open ecosystem, while \u003Ca href=\"https:\u002F\u002Fwww.intel.com\u002Fcontent\u002Fwww\u002Fus\u002Fen\u002Fproducts\u002Fdetails\u002Fdiscrete-gpus\u002Fdata-center\u002Fgaudi.html\" target=\"_blank\" rel=\"noopener\">Intel Gaudi\u003C\u002Fa> keeps hammering on inference cost. Nvidia's advantage now is not just fast hardware but mature software integration, which makes it hard to displace in enterprise procurement.\u003C\u002Fp>\u003Cp>But I wouldn't overstate it. Benchmarks tell good stories; real traffic is messier. Once you're in production, model versions, prompt lengths, context windows, and cache hit rates all shape the outcome. In other words, pretty scores don't mean every customer can reproduce them.\u003C\u002Fp>\u003Ch2>Behind this is a shift in the whole AI server market\u003C\u002Fh2>\u003Cp>People used to buy AI servers by counting GPUs first. That no longer works. Models keep getting bigger, and inference costs keep getting more sensitive. No matter how many cards you have, if scheduling, communication, and memory allocation are done badly, you will still hit a wall.\u003C\u002Fp>\u003Cp>That is also why Nvidia keeps stressing co-design. Only when hardware, software, and models are designed together can token cost come down. The logic is the same for cloud operators, SaaS teams, and in-house enterprise AI platforms. Everyone is calculating the same thing: how much a million tokens actually costs.\u003C\u002Fp>\u003Cp>In the broader industry picture, inference has become the main battleground. Training is a one-off megaproject; inference burns money every day. As long as AI assistants, coding agents, retrieval-augmented generation, and enterprise knowledge bases keep expanding, inference infrastructure will keep eating the budget.\u003C\u002Fp>\u003Cp>So the significance of this MLPerf round is not only that Nvidia's scores look good. It is a reminder that software optimization is still worth real money. For development teams in Taiwan this is practical too: you may not be able to afford the top cards, but you can first get batching, quantization, speculative decoding, and caching under control, and drive costs down from there.\u003C\u002Fp>\u003Ch2>The conclusion is simple: don't just look at the chip\u003C\u002Fh2>\u003Cp>If you are looking at AI infrastructure for 2026, my advice is direct. Don't just ask how many GPUs. Ask how prefill and decode are split, whether TensorRT-LLM is in place, whether Dynamo can plug in, and how the whole software stack pushes token cost down.\u003C\u002Fp>\u003Cp>The most important message in Nvidia's MLPerf results isn't the 2.77x. It is that, once again, software is still worth a lot in the AI 
server market. The vendors who win next won't necessarily be the ones with the fiercest chips, but the ones who can serve each token most cheaply.\u003C\u002Fp>\u003Cp>So here is the question. When you evaluate an AI platform today, are you still looking at peak compute, or have you started looking at cost per million tokens?\u003C\u002Fp>","Nvidia posts up to a 2.77x inference gain on MLPerf v6.0. The GB300 NVL72 results show that software optimizations such as Dynamo and TensorRT-LLM now matter as much as the GPU hardware.","www.nextplatform.com","https:\u002F\u002Fwww.nextplatform.com\u002Fai\u002F2026\u002F04\u002F02\u002Fnvidia-software-pushes-mlperf-inference-benchmarks-to-new-highs\u002F5214205",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775185790112-2r4u.png",[13,14,15,16,17,18,19,20],"Nvidia","MLPerf","推論","TensorRT-LLM","Dynamo","GB300 NVL72","AI 伺服器","token 成本","zh",0,false,"2026-04-03T03:09:34.300263+00:00","2026-04-03T03:09:34.187+00:00","done","9ddd9ea3-aa75-4cef-b478-931ec93b6291","nvidia-mlperf-software-inference-benchmarks-zh","research","a15782d7-4678-4415-9a0b-4c642e46b022","published","2026-04-07T07:41:11.903+00:00",[34,36,38,40,42,43,45,47],{"name":14,"slug":35},"mlperf",{"name":20,"slug":37},"token-成本",{"name":17,"slug":39},"dynamo",{"name":13,"slug":41},"nvidia",{"name":15,"slug":15},{"name":18,"slug":44},"gb300-nvl72",{"name":16,"slug":46},"tensorrt-llm",{"name":19,"slug":48},"ai-伺服器",{"id":30,"slug":50,"title":51,"language":52},"nvidia-mlperf-software-inference-benchmarks-en","Nvidia’s MLPerf Gains Show Software Still Matters","en",[54,60,66,72,78,84],{"id":55,"slug":56,"title":57,"cover_image":58,"image_url":58,"created_at":59,"category":29},"667b72b6-e821-4d68-80a1-e03340bc85f1","turboquant-seo-shift-small-sites-zh","TurboQuant 與小站 SEO 變化","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778840440690-kcw9.png","2026-05-15T10:20:27.319472+00:00",{"id":61,"slug":62,"title":63,"cover_image":64,"image_url":64,"created_at":65,"category":29},"381fb6c6-6da7-4444-831f-8c5eed8d685c","turboquant-vllm-comparison-fp8-kv-cache-zh","TurboQuant 與 FP8 
實測結果","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778839867551-4v9g.png","2026-05-15T10:10:36.034569+00:00",{"id":67,"slug":68,"title":69,"cover_image":70,"image_url":70,"created_at":71,"category":29},"c15f45ee-a548-4dbf-8152-91de159c1a11","llmbda-calculus-agent-safety-rules-zh","LLMbda 演算替 AI 代理人立安全規則","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778825503412-mlbf.png","2026-05-15T06:10:34.832664+00:00",{"id":73,"slug":74,"title":75,"cover_image":76,"image_url":76,"created_at":77,"category":29},"0c02225c-d6ff-44f8-bc92-884c8921c4a3","low-complexity-beamspace-denoiser-mmwave-mimo-zh","更簡單的毫米波波束域去噪器","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778814650361-xtc2.png","2026-05-15T03:10:30.06639+00:00",{"id":79,"slug":80,"title":81,"cover_image":82,"image_url":82,"created_at":83,"category":29},"9d27f967-62cc-433f-8cdb-9300937ade13","ai-benchmark-wins-cyber-scare-defenders-zh","為什麼 AI 基準賽在資安領域的勝利，應該讓防守方警醒","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778807450006-nofx.png","2026-05-15T01:10:29.379041+00:00",{"id":85,"slug":86,"title":87,"cover_image":88,"image_url":88,"created_at":89,"category":29},"bc402dc6-5da6-46fc-9d66-d09cb215f72b","why-linux-security-needs-patch-wave-mindset-zh","為什麼 Linux 安全需要「補丁浪潮」思維","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778741449813-s2wn.png","2026-05-14T06:50:24.052583+00:00",[91,96,101,106,111,116,121,126,131,136],{"id":92,"slug":93,"title":94,"created_at":95},"f18dbadb-8c59-4723-84a4-6ad22746c77a","deepmind-bets-on-continuous-learning-ai-2026-zh","DeepMind 押注 2026 連續學習 
AI","2026-03-26T08:16:02.367355+00:00",{"id":97,"slug":98,"title":99,"created_at":100},"f4a106cb-02a6-4508-8f39-9720a0a93cee","ml-papers-of-the-week-github-research-desk-zh","每週 ML 論文清單，為何紅到 GitHub","2026-03-27T01:11:39.284175+00:00",{"id":102,"slug":103,"title":104,"created_at":105},"c4f807ca-4e5f-47f1-a48c-961cf3fc44dc","ai-ml-conferences-to-watch-in-2026-zh","2026 AI 研討會投稿時程整理","2026-03-27T01:51:53.874432+00:00",{"id":107,"slug":108,"title":109,"created_at":110},"9f50561b-aebd-46ba-94a8-363198aa7091","openclaw-agents-manipulated-self-sabotage-zh","OpenClaw Agent 會自己搞砸自己","2026-03-28T03:03:18.786425+00:00",{"id":112,"slug":113,"title":114,"created_at":115},"11f22e92-7066-4978-a544-31f5f2156ec6","vega-learning-to-drive-with-natural-language-instructions-zh","Vega：使用自然語言指示進行自駕車控制","2026-03-28T14:54:04.847912+00:00",{"id":117,"slug":118,"title":119,"created_at":120},"a4c7cfec-8d0e-4fec-93cf-1b9699a530b8","drive-my-way-en-zh","Drive My Way：個性化自駕車風格的實現","2026-03-28T14:54:26.207495+00:00",{"id":122,"slug":123,"title":124,"created_at":125},"dec02f89-fd39-41ba-8e4d-11ede93a536d","training-knowledge-bases-with-writeback-rag-zh","用 WriteBack-RAG 強化知識庫提升檢索效能","2026-03-28T14:54:45.775606+00:00",{"id":127,"slug":128,"title":129,"created_at":130},"3886be5c-a137-40cc-b9e2-0bf18430c002","packforcing-efficient-long-video-generation-method-zh","PackForcing：短影片訓練也能生成長影片","2026-03-28T14:55:02.688141+00:00",{"id":132,"slug":133,"title":134,"created_at":135},"72b90667-d930-4cc9-8ced-aaa0f8968d44","pixelsmile-toward-fine-grained-facial-expression-editing-zh","PixelSmile：提升精細臉部表情編輯的新方法","2026-03-28T14:55:20.678181+00:00",{"id":137,"slug":138,"title":139,"created_at":140},"cf046742-efb2-4753-aef9-caed5da5e32e","adaptive-block-scaled-data-types-zh","IF4：神經網路量化的聰明選擇","2026-03-31T06:00:36.990273+00:00"]