[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-5-kv-cache-takeaways-for-llamacpp-users-zh":3,"article-related-5-kv-cache-takeaways-for-llamacpp-users-zh":33,"series-industry-e62c3870-f6fe-45e1-8628-082b86195d31":86},{"id":4,"slug":5,"title":6,"content":7,"summary":8,"source":9,"source_url":10,"author":11,"image_url":12,"cover_image":12,"category":13,"language":14,"translated_content":11,"related_article_id":15,"keywords":16,"key_takeaways":25,"views":29,"created_at":30,"published_at":31,"topic_cluster_id":32},"e62c3870-f6fe-45e1-8628-082b86195d31","5-kv-cache-takeaways-for-llamacpp-users-zh","5 KV cache takeaways for llama.cpp users","\u003Cp data-speakable=\"summary\">This article rounds up five \u003Ca href=\"\u002Ftag\u002Fkv-cache\">KV cache\u003C\u002Fa> takeaways for llama.cpp users, to help you judge how much memory you can save, how speed changes, and whether to switch to a new format now.\u003C\u002Fp>\u003Cp>If you are tuning long-context inference, these five points help you decide three things: whether to follow \u003Ca href=\"\u002Ftag\u002Fturboquant\">TurboQuant\u003C\u002Fa>, whether the existing q4_0 or q8_0 formats are worth adopting first, and how to measure so that you do not misread the results. For anyone deploying models, the KV cache is not just a technical detail; it directly determines whether the context fits in GPU memory.\u003C\u002Fp>\u003Ctable>\u003Cthead>\u003Ctr>\u003Cth>Format\u003C\u002Fth>\u003Cth>KV buffer\u003C\u002Fth>\u003Cth>Savings vs. f16\u003C\u002Fth>\u003Cth>Generation speed at 110K context\u003C\u002Fth>\u003C\u002Ftr>\u003C\u002Fthead>\u003Ctbody>\u003Ctr>\u003Ctd>f16\u003C\u002Ftd>\u003Ctd>768 MiB\u003C\u002Ftd>\u003Ctd>Baseline\u003C\u002Ftd>\u003Ctd>38.0 tok\u002Fs\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>q8_0\u003C\u002Ftd>\u003Ctd>408 MiB\u003C\u002Ftd>\u003Ctd>47%\u003C\u002Ftd>\u003Ctd>Not reported\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>q4_0\u003C\u002Ftd>\u003Ctd>216 MiB\u003C\u002Ftd>\u003Ctd>72%\u003C\u002Ftd>\u003Ctd>24.0 tok\u002Fs\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>TurboQuant\u003C\u002Ftd>\u003Ctd>Not measured (target: under 3 bits per value)\u003C\u002Ftd>\u003Ctd>Not measured (aims higher)\u003C\u002Ftd>\u003Ctd>Not measured (claims near-zero accuracy loss)\u003C\u002Ftd>\u003C\u002Ftr>\u003C\u002Ftbody>\u003C\u002Ftable>\u003Ch2>1. 
TurboQuant could shrink the KV cache further\u003C\u002Fh2>\u003Cp>\u003Ca href=\"\u002Ftag\u002Fgoogle\">Google\u003C\u002Fa> Research's \u003Ca href=\"https:\u002F\u002Fresearch.google\u002Fblog\u002Fturboquant\u002F\">TurboQuant\u003C\u002Fa> aims to compress the KV cache to under 3 bits per value while keeping accuracy loss close to zero. This matters because long-context inference is usually bottlenecked by memory rather than compute.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1779285255441-f432.png\" alt=\"5 KV cache takeaways for llama.cpp users\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>If this result holds up in practice, inference frameworks such as llama.cpp could keep more context in the same amount of GPU memory, which helps multi-turn chat, document summarization, and coding assistants.\u003C\u002Fp>\u003Cul>\u003Cli>Goal: under 3 bits per KV value\u003C\u002Fli>\u003Cli>Claim: near-zero accuracy loss\u003C\u002Fli>\u003Cli>Impact: less memory pressure for long contexts\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>2. Existing quantization formats already save a lot of memory\u003C\u002Fh2>\u003Cp>Even without waiting for a new method, the corrected numbers in the llama.cpp discussion thread are already a useful reference. In the DGX Spark GB10 test, the f16 KV buffer is 768 MiB, q8_0 brings it to 408 MiB, and q4_0 lowers it to 216 MiB.\u003C\u002Fp>\u003Cp>This means KV cache quantization is not a minor theoretical optimization; it can directly decide whether a workload runs at all. On machines with limited GPU memory, the difference between q4_0 and q8_0 is often the difference between serving long contexts and not serving them.\u003C\u002Fp>\u003Cul>\u003Cli>f16: 768 MiB\u003C\u002Fli>\u003Cli>q8_0: 408 MiB, about 47% saved\u003C\u002Fli>\u003Cli>q4_0: 216 MiB, about 72% saved\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>3. 
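Generation speed and prompt speed do not necessarily change together\u003C\u002Fh2>\u003Cp>One key correction: at 110K context, prompt processing speed did not actually change with the cache type. This is a reminder that prefill and decode are two separate stages and should not be folded into a single metric.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1779285255880-cujx.png\" alt=\"5 KV cache takeaways for llama.cpp users\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>The real gap appears in the generation stage. The corrected test shows q4_0 generating 36.8% slower than f16 at 110K context. The community's reading is that the per-\u003Ca href=\"\u002Ftag\u002Ftoken\">token\u003C\u002Fa> dequantization cost has become the bottleneck, which is exactly what TurboQuant aims to address.\u003C\u002Fp>\u003Ccode>Generation speed at 110K context (corrected)\nf16 = 38.0 tok\u002Fs\nq4_0 = 24.0 tok\u002Fs\ndifference = -36.8%\u003C\u002Fcode>\u003Ch2>4. The community has already started on different implementation branches\u003C\u002Fh2>\u003Cp>This is more than paper hype. The discussion already includes \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FTheTom\u002Fllama-cpp-turboquant\">TheTom's llama-cpp-turboquant\u003C\u002Fa> fork, and people also mention \u003Ca href=\"\u002Ftag\u002Fnvidia\">NVIDIA\u003C\u002Fa>'s KTVC, interest from MLX developers, and separate lines of work on CUDA, HIP\u002FROCm, and prefill optimization.\u003C\u002Fp>\u003Cp>This shows the KV cache has no single answer; it is a series of implementation choices. For anyone evaluating production readiness, the question has shifted from \"is there a method\" to \"which route matures first, which block size is more stable, and which branch has fewer bugs\".\u003C\u002Fp>\u003Cul>\u003Cli>Google Research proposed TurboQuant first\u003C\u002Fli>\u003Cli>The llama.cpp discussion thread has started validating whether it can be implemented\u003C\u002Fli>\u003Cli>Community branches are already testing CUDA, ROCm, and prefill optimizations\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>5. 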
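How you measure directly shapes how you read the results\u003C\u002Fh2>\u003Cp>The thread carries a practical lesson: at first someone believed prompt throughput had collapsed, and it was later corrected as a measurement error. The real problem was not the speed metric itself but the use of RSS, which is not an accurate enough source for memory figures.\u003C\u002Fp>\u003Cp>If you plan to test TurboQuant, q4_0, or any other cache format yourself, look at GPU memory, prefill, and decode separately, and confirm that failed requests are not counted toward throughput. Otherwise it is easy to read the effect of quantization backwards.\u003C\u002Fp>\u003Cul>\u003Cli>For memory, check both nvidia-smi and the internal KV buffer size\u003C\u002Fli>\u003Cli>Measure prefill and decode separately\u003C\u002Fli>\u003Cli>Exclude failed requests from throughput\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>How to choose\u003C\u002Fh2>\u003Cp>If your main concern right now is whether a long context fits in GPU memory, the existing q4_0 and q8_0 formats are the most practical place to start, because their memory savings are already clear. If you are doing a forward-looking evaluation, TurboQuant is worth tracking, because it has a chance of pushing the KV cache down another step.\u003C\u002Fp>\u003Cp>If your job is maintaining an inference service or writing benchmarks, the most important thing is not to look at a single speed number but to first define whether you are solving for memory, prompt speed, or generation speed. The choice of KV cache format always comes back to where your bottleneck is.\u003C\u002Fp>","Five takeaways that explain KV cache compression, memory savings, and performance trade-offs in llama.cpp, so you can decide whether to follow the new method or start with the existing formats.","github.com","https:\u002F\u002Fgithub.com\u002Fggml-org\u002Fllama.cpp\u002Fdiscussions\u002F20969",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1779285255441-f432.png","industry","zh","bfbd028b-4704-4de5-8f54-55625836952f",[17,18,19,20,21,22,23,24],"llama.cpp","KV cache","TurboQuant","q4_0","q8_0","long context","memory quantization","inference performance",[26,27,28],"TurboQuant suggests the KV cache could shrink further, to under 3 bits per value, with the goal of sharply cutting memory use.","The existing q8_0 and q4_0 formats already reduce the KV buffer noticeably in testing; q4_0 saves about 72%.","Measuring the KV cache cannot rely on a single throughput number; prefill, decode, and the source of GPU memory figures must be looked at separately.",16,"2026-05-20T13:53:42.308292+00:00","2026-05-20T13:53:42.247+00:00","caa87b65-9bbc-46fe-bba8-4f4158dd2d8b",{"tags":34,"relatedLang":45,"relatedPosts":49},[35,37,39,41,43],{"name":21,"slug":36},"q80",{"name":18,"slug":38},"kv-cache",{"name":20,"slug":40},"q40",{"name":17,"slug":42},"llamacpp",{"name":19,"slug":44},"turboquant",{"id":15,"slug":46,"title":47,"language":48},"5-kv-cache-takeaways-for-llamacpp-users-en","5 KV cache takeaways for llama.cpp users","en",[50,56,62,68,74,80],{"id":51,"slug":52,"title":53,"cover_image":54,"image_url":54,"created_at":55,"category":13},"0d604500-3a70-40ec-a70e-370f972a66ab","korea-nvidia-talks-ai-factory-push-zh","韓國與 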
Nvidia 對話，重點是 AI 工廠","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781057871797-7uxx.png","2026-06-10T02:17:21.099824+00:00",{"id":57,"slug":58,"title":59,"cover_image":60,"image_url":60,"created_at":61,"category":13},"173b8876-1867-4e0b-948f-27891d6b6364","openai-should-not-rush-its-ipo-just-to-win-the-ai-race-zh","OpenAI 不該為了搶 AI 賽道而急著 IPO","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781053365610-1hko.png","2026-06-10T01:02:19.886627+00:00",{"id":63,"slug":64,"title":65,"cover_image":66,"image_url":66,"created_at":67,"category":13},"3d7ff80a-4045-4b66-9e21-b6a8eb3b6f6d","openai-europe-privacy-policy-zh","OpenAI 歐洲隱私政策更新重點","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781052479369-yomr.png","2026-06-10T00:47:31.176745+00:00",{"id":69,"slug":70,"title":71,"cover_image":72,"image_url":72,"created_at":73,"category":13},"69002c63-177a-4723-9e63-d28506f08edd","openai-ads-sensitive-chats-policy-zh","OpenAI把廣告擋在敏感對話外是對的","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781051578409-en02.png","2026-06-10T00:32:23.404084+00:00",{"id":75,"slug":76,"title":77,"cover_image":78,"image_url":78,"created_at":79,"category":13},"ea98a8c9-ebe1-4258-8a2b-b0d82b25deed","ai-bootlegs-streaming-royalties-stick-figure-zh","AI bootlegs 正在抽走串流版稅","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781050681742-3rdh.png","2026-06-10T00:17:31.017287+00:00",{"id":81,"slug":82,"title":83,"cover_image":84,"image_url":84,"created_at":85,"category":13},"20d0b5fc-a363-481d-86b2-e30276a49e92","amd-microsoft-windows-ml-acceleration-zh","AMD 與 Microsoft 把 Windows ML 推進 GPU 與 
N…","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1781047980407-vd5p.png","2026-06-09T23:32:31.304436+00:00",[87,92,97,102,107,112,117,122,127,132],{"id":88,"slug":89,"title":90,"created_at":91},"ee073da7-28b3-4752-a319-5a501459fb87","ai-in-2026-what-actually-matters-now-zh","2026 AI 真正重要的事","2026-03-26T07:09:12.008134+00:00",{"id":93,"slug":94,"title":95,"created_at":96},"83bd1795-8548-44c9-9a7e-de50a0923f71","trump-ai-framework-power-speech-state-preemption-zh","川普 AI 框架瞄準電力、言論與州權","2026-03-26T07:12:18.695466+00:00",{"id":98,"slug":99,"title":100,"created_at":101},"ea6be18b-c903-4e54-97b7-5f7447a612e0","nvidia-gtc-2026-big-ai-announcements-zh","NVIDIA GTC 2026 重點拆解","2026-03-26T07:14:26.62638+00:00",{"id":103,"slug":104,"title":105,"created_at":106},"4bcec76f-4c36-4daa-909f-54cd702f7c93","claude-users-spreading-out-and-getting-better-zh","Claude 用戶更分散，也更會用","2026-03-26T07:22:52.325888+00:00",{"id":108,"slug":109,"title":110,"created_at":111},"bd903b15-2473-4178-9789-b7557816e535","openclaw-raises-hard-question-for-ai-models-zh","OpenClaw 逼問 AI 模型價值","2026-03-26T07:24:54.707486+00:00",{"id":113,"slug":114,"title":115,"created_at":116},"eeac6b9e-ad9d-4831-8eec-8bba3f9bca6a","gap-google-gemini-checkout-fashion-search-zh","Gap 把結帳搬進 Gemini","2026-03-26T07:28:23.937768+00:00",{"id":118,"slug":119,"title":120,"created_at":121},"0740e53f-605d-4d57-8601-c10beb126f3c","google-pushes-gemini-transition-to-march-2026-zh","Google 把 Gemini 轉換延到 2026 年 3…","2026-03-26T07:30:12.825269+00:00",{"id":123,"slug":124,"title":125,"created_at":126},"e660d801-2421-4529-8fa9-86b82b066990","metas-llama-4-benchmark-scandal-gets-worse-zh","Meta Llama 4 分數風波又擴大","2026-03-26T07:34:21.156421+00:00",{"id":128,"slug":129,"title":130,"created_at":131},"183f9e7c-e143-40bb-a6d5-67ba84a3a8bc","accenture-mistral-ai-sovereign-enterprise-deal-zh","Accenture 攜手 Mistral AI 賣主權 
AI","2026-03-26T07:38:14.818906+00:00",{"id":133,"slug":134,"title":135,"created_at":136},"191d9b1b-768a-478c-978c-dd7431a38149","mistral-ai-faces-its-hardest-year-yet-zh","Mistral AI 迎來最硬的一年","2026-03-26T07:40:23.716374+00:00"]