[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-sebastian-raschka-llm-architecture-gallery-zh":3,"tags-sebastian-raschka-llm-architecture-gallery-zh":35,"related-lang-sebastian-raschka-llm-architecture-gallery-zh":50,"related-posts-sebastian-raschka-llm-architecture-gallery-zh":54,"series-research-e7d8242f-edab-4282-8317-9a27fec3cb91":91},{"id":4,"title":5,"content":6,"summary":7,"source":8,"source_url":9,"author":10,"image_url":11,"keywords":12,"language":23,"translated_content":10,"views":24,"is_premium":25,"created_at":26,"updated_at":26,"cover_image":11,"published_at":27,"rewrite_status":28,"rewrite_error":10,"rewritten_from_id":29,"slug":30,"category":31,"related_article_id":32,"status":33,"google_indexed_at":34,"x_posted_at":10,"tweet_text":10,"title_rewritten_at":10,"title_original":10,"key_takeaways":10,"topic_cluster_id":10,"embedding":10,"is_canonical_seed":25},"e7d8242f-edab-4282-8317-9a27fec3cb91","Sebastian Raschka 的 LLM 架構圖鑑","\u003Cp>\u003Ca href=\"https:\u002F\u002Fsebastianraschka.com\u002Fllm-architecture-gallery\u002F\" target=\"_blank\" rel=\"noopener\">Sebastian Raschka’s LLM Architecture Gallery\u003C\u002Fa> 很像工程師的作弊表。它把 30 多個語言模型攤開來看。從 \u003Ca href=\"https:\u002F\u002Fopenai.com\u002Fresearch\u002Fgpt-2\" target=\"_blank\" rel=\"noopener\">GPT-2\u003C\u002Fa> 到 \u003Ca href=\"https:\u002F\u002Fwww.llama.com\u002F\" target=\"_blank\" rel=\"noopener\">Llama 4\u003C\u002Fa>，每個模型都有層數、上下文長度、注意力型態，還有 KV cache 數字。\u003C\u002Fp>\u003Cp>這頁最猛的地方，是它不講空話。你只要看幾個欄位，就知道模型在伺服器上會多吃資源。像 \u003Ca href=\"https:\u002F\u002Fwww.llama.com\u002Fllama3\u002F\" target=\"_blank\" rel=\"noopener\">Llama 3\u003C\u002Fa> 8B 用 32 層，bf16 下每個 token 只要 128 KiB KV cache。\u003Ca href=\"https:\u002F\u002Fallenai.org\u002Folmo\" target=\"_blank\" rel=\"noopener\">OLMo 2\u003C\u002Fa> 7B 也是 32 層，但每個 token 要 512 KiB。差了 4 倍，這種差距不是小事。\u003C\u002Fp>\u003Ch2>這頁到底在幹嘛\u003C\u002Fh2>\u003Cp>講白了，這是一個模型架構資料庫。不是宣傳頁，也不是 benchmark 排行榜。它把架構圖、設定檔、技術報告連在一起，讓你能追到原始資料。這對做軟體的人很重要，因為很多成本問題，都藏在看起來很無聊的細節裡。\u003C\u002Fp>\u003Cp>像是 attention 用什麼形式、layer norm 放哪裡、layer 數多少、context 開多長。這些東西不會直接出現在行銷文案裡。可是它們會直接影響推論延遲、顯存壓力，還有一台卡能塞幾個 session。\u003C\u002Fp>\u003Cp>Raschka 也把他自己的比較文章串進來。像 \u003Ca href=\"https:\u002F\u002Fsebastianraschka.com\u002Fblog\u002F2024\u002Fthe-big-llm-architecture-comparison.html\" target=\"_blank\" rel=\"noopener\">The Big LLM Architecture Comparison\u003C\u002Fa>、\u003Ca href=\"https:\u002F\u002Fsebastianraschka.com\u002Fblog\u002F2024\u002Ffrom-gpt2-to-gpt-oss.html\" target=\"_blank\" rel=\"noopener\">From GPT-2 to gpt-oss\u003C\u002Fa>，還有 \u003Ca href=\"https:\u002F\u002Fsebastianraschka.com\u002Fblog\u002F2025\u002Ffrom-deepseek-v3-to-v3-2.html\" target=\"_blank\" rel=\"noopener\">From DeepSeek V3 to V3.2\u003C\u002Fa>。你可以把它當成一個入口，直接跳去看原始脈絡。\u003C\u002Fp>\u003Cul>\u003Cli>GPT-2 XL：15 億參數，1,024 token context，48 層 MHA，300 KiB KV cache\u003C\u002Fli>\u003Cli>Llama 3 8B：80 億參數，8,192 token context，32 層 GQA，128 KiB KV cache\u003C\u002Fli>\u003Cli>OLMo 2 7B：70 億參數，4,096 token context，32 層 MHA，512 KiB KV cache\u003C\u002Fli>\u003Cli>DeepSeek V3：6710 億總參數，370 億 active，61 層，68.6 KiB KV cache\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>架構差異，真的會影響部署\u003C\u002Fh2>\u003Cp>很多人看模型，先看參數量。說真的，這只看一半。真正決定你伺服器會不會爆掉的，常常是 cache 和 attention。Dense 模型比較好理解，但不一定好跑。MoE 模型參數很多，可是 active compute 可能低很多。\u003C\u002Fp>\u003Cp>像 \u003Ca href=\"https:\u002F\u002Fwww.deepseek.com\u002F\" target=\"_blank\" rel=\"noopener\">DeepSeek\u003C\u002Fa> V3 和 \u003Ca href=\"https:\u002F\u002Fwww.llama.com\u002Fllama4\u002F\" target=\"blank\" rel=\"noopener\">Llama 4 Maverick\u003C\u002Fa> 這類 MoE 架構，就是把容量分散到多個 ex\u003Ca href=\"\u002Fnews\u002Fopenai-122b-raise-ipo-expectations-zh\">pe\u003C\u002Fa>rt。這樣做的好處很直接。總參數可以很大，但每次只喚醒一部分，推論成本不一定跟著爆。\u003C\u002Fp>\u003Cp>注意力設計也很有戲。有人用標準 multi-head attention。有人用 grou\u003Ca href=\"\u002Fnews\u002Fopenai-closes-122bn-round-ipo-looms-zh\">pe\u003C\u002Fa>d-query attention。有人加 QK-Norm。有人把長上下文切成 chunk，再混一點 full attention。Raschka 把這些設計放在同一頁，差異一眼就看得出來。\u003C\u002Fp>\u003Cblockquote>“The best way to understand a model is to look at its architecture.” — Sebastian Raschka, \u003Ca href=\"https:\u002F\u002Fsebastianraschka.com\u002Fblog\u002F2024\u002Fthe-big-llm-architecture-comparison.html\" target=\"_blank\" rel=\"noopener\">The Big LLM Architecture Comparison\u003C\u002Fa>\u003C\u002Fblockquote>\u003Cp>這句話很直白，也很對。Benchmark 只告訴你結果。架構會告訴你，這模型為什麼能跑出這個結果。\u003C\u002Fp>\u003Cp>我覺得這頁還有一個加分點。它不是一次性圖表。它把來源、版本、差異都整理起來。頁面也有 issue tracker，錯了可以回報到 \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Frasbt\u002Fllm-architecture-gallery\u002Fissues\" target=\"_blank\" rel=\"noopener\">Architecture Gallery issue tracker\u003C\u002Fa>。在 LLM 世界，規格常常改很快。這種維護很實際。\u003C\u002Fp>\u003Cul>\u003Cli>Llama 4 Maverick：4000 億總參數，170 億 active，1,000,000 token context，36 chunked + 12 full GQA layers\u003C\u002Fli>\u003Cli>Qwen3 235B-A22B：2350 億總參數，220 億 active，94 層，188 KiB KV cache\u003C\u002Fli>\u003Cli>Gemma 3 27B：270 億參數，128,000 token context，52 個 sliding-window + 10 個 global layers\u003C\u002Fli>\u003Cli>Mistral Small 3.1：240 億參數，128,000 token context，40 層 GQA，160 KiB KV cache\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>為什麼比對工具比海報更有用\u003C\u002Fh2>\u003Cp>這頁也有海報版，還能在 \u003Ca href=\"https:\u002F\u002Fwww.redbubble.com\u002F\" target=\"_blank\" rel=\"noopener\">Redbubble\u003C\u002Fa> 買到，或去 \u003Ca href=\"https:\u002F\u002Fgumroad.com\u002F\" target=\"_blank\" rel=\"noopener\">Gumroad\u003C\u002Fa> 找可列印版本。拿來掛牆上很帥，這點我不否認。但真正有用的是比較\u003Ca href=\"\u002Fnews\u002Fai-coding-tool-prices-2026-free-vs-paid-zh\">工具\u003C\u002Fa>。牆上海報是裝飾。比較工具才是工程師會一直開著的東西。\u003C\u002Fp>\u003Cp>因為很多模型在參數大小上差不多，部署成本卻差超多。Llama 3 8B 每個 token 只要 128 KiB cache。OLMo 2 7B 卻要 512 KiB。這不是小差異。這會直接影響 batch size、吞吐量、延遲，還有你到底能不能在同一張卡上多開幾個 request。\u003C\u002Fp>\u003Cp>更大的模型差異更明顯。DeepSeek V3 有 671B total parameters，但 active 只有 37B。這種配置很適合拿來討論 serving 策略。你不能只說它大。你要問的是，實際推論時到底啟動多少參數。\u003C\u002Fp>\u003Cp>Llama 4 Maverick 更誇張。它把 context 拉到 1,000,000 token。這種數字很容易讓人喊哇塞，但工程師會先問另一件事：長上下文到底要多少記憶體，吞吐量會掉多少。這才是重點。\u003C\u002Fp>\u003Cul>\u003Cli>Dense 8B 與 7B 模型，cache 差 4 倍\u003C\u002Fli>\u003Cli>DeepSeek V3 的 active 參數遠低於 total 參數\u003C\u002Fli>\u003Cli>1,000,000 token context 會改變 serving 方式\u003C\u002Fli>\u003Cli>GQA 通常比傳統 MHA 更省記憶體\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>這頁放在產業脈絡裡怎麼看\u003C\u002Fh2>\u003Cp>LLM 這幾年很像從比誰大，變成比誰會省。早期大家在意參數量。後來大家開始看 context。現在更現實。大家在意的是，跑一次要多少顯存，能不能撐住長對話，API 成本會不會炸掉。\u003C\u002Fp>\u003Cp>這也解釋了為什麼架構圖會越來越重要。當模型數量一多，單看排行榜很容易失真。你可能以為兩個模型差不多。結果一個是 dense，一個是 MoE。或是一個用 128K context，另一個只有 4K。部署上的麻煩完全不同。\u003C\u002Fp>\u003Cp>對台灣團隊來說，這種資料很實用。很多新創和內部工具，不一定有超大 GPU 叢集。你更需要知道，哪個模型比較省 cache，哪個 attention 比較穩，哪個 stack 比較適合本地伺服器。這種時候，架構比行銷更誠實。\u003C\u002Fp>\u003Cp>如果你想對照不同模型的新聞解讀，也可以看 OraCore 先前的整理，例如 \u003Ca href=\"\u002Fnews\u002Fllama-4-maverick-architecture-notes\">Llama 4 Maverick architecture notes\u003C\u002Fa>，還有 \u003Ca href=\"\u002Fnews\u002Fdeepseek-v3-2-what-changed\">DeepSeek V3.2 breakdown\u003C\u002Fa>。這些內容跟 Raschka 的圖鑑放在一起看，會更有感。\u003C\u002Fp>\u003Ch2>工程師該怎麼用這份圖鑑\u003C\u002Fh2>\u003Cp>如果你在做 LLM 產品，我會建議你直接把這頁存書籤。真的。你在選模型時，先看 layer、cache、attention，再看 benchmark。順序別顛倒。因為 benchmark 很容易讓人高潮，架構才會決定你能不能上線。\u003C\u002Fp>\u003Cp>如果你在學 LLM，這頁也很適合拿來對照。你可以從 GPT-2 看起，接著看 Llama 3、OLMo 2，再看 DeepSeek 和 Qwen。你會很快發現，模型演進不是只靠參數變大。很多時候，差異來自更好的 attention、更聰明的 cache 設計，還有更務實的 serving 思維。\u003C\u002Fp>\u003Cp>我的判斷很簡單。接下來幾個月，大家會更常用這種架構圖鑑來做模型選型。不是因為它很潮。是因為它真的省時間，也真的能少踩坑。你如果要選一個模型上線，先看這頁，再看文件，通常比較不會翻車。\u003C\u002Fp>\u003C\u002Fcontent>","Raschka 的 LLM Architecture Gallery 把 GPT-2、Llama 3、OLMo 2、DeepSeek、Qwen 等模型的層數、注意力與 KV cache 數字攤開來比，工程師一眼就能看出部署差異。","sebastianraschka.com","https:\u002F\u002Fsebastianraschka.com\u002Fllm-architecture-gallery\u002F",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775121663540-srg4.png",[13,14,15,16,17,18,19,20,21,22],"LLM architecture","Sebastian Raschka","KV cache","attention","GPT-2","Llama 3","DeepSeek","Qwen","模型架構","人工智慧","zh",2,false,"2026-04-02T07:27:33.561537+00:00","2026-04-02T07:27:33.502+00:00","done","552ae8f9-ee2f-45e3-973a-510680935bae","sebastian-raschka-llm-architecture-gallery-zh","research","cdcfe76f-c9bf-44ac-98d9-e9041d414d6c","published","2026-04-08T09:00:54.352+00:00",[36,38,40,41,43,45,47,49],{"name":14,"slug":37},"sebastian-raschka",{"name":20,"slug":39},"qwen",{"name":22,"slug":22},{"name":15,"slug":42},"kv-cache",{"name":13,"slug":44},"llm-architecture",{"name":18,"slug":46},"llama-3",{"name":19,"slug":48},"deepseek",{"name":21,"slug":21},{"id":32,"slug":51,"title":52,"language":53},"sebastian-raschka-llm-architecture-gallery-en","Sebastian Raschka’s LLM Architecture Gallery","en",[55,61,67,73,79,85],{"id":56,"slug":57,"title":58,"cover_image":59,"image_url":59,"created_at":60,"category":31},"667b72b6-e821-4d68-80a1-e03340bc85f1","turboquant-seo-shift-small-sites-zh","TurboQuant 與小站 SEO 變化","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778840440690-kcw9.png","2026-05-15T10:20:27.319472+00:00",{"id":62,"slug":63,"title":64,"cover_image":65,"image_url":65,"created_at":66,"category":31},"381fb6c6-6da7-4444-831f-8c5eed8d685c","turboquant-vllm-comparison-fp8-kv-cache-zh","TurboQuant 與 FP8 實測結果","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778839867551-4v9g.png","2026-05-15T10:10:36.034569+00:00",{"id":68,"slug":69,"title":70,"cover_image":71,"image_url":71,"created_at":72,"category":31},"c15f45ee-a548-4dbf-8152-91de159c1a11","llmbda-calculus-agent-safety-rules-zh","LLMbda 演算替 AI 代理人立安全規則","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778825503412-mlbf.png","2026-05-15T06:10:34.832664+00:00",{"id":74,"slug":75,"title":76,"cover_image":77,"image_url":77,"created_at":78,"category":31},"0c02225c-d6ff-44f8-bc92-884c8921c4a3","low-complexity-beamspace-denoiser-mmwave-mimo-zh","更簡單的毫米波波束域去噪器","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778814650361-xtc2.png","2026-05-15T03:10:30.06639+00:00",{"id":80,"slug":81,"title":82,"cover_image":83,"image_url":83,"created_at":84,"category":31},"9d27f967-62cc-433f-8cdb-9300937ade13","ai-benchmark-wins-cyber-scare-defenders-zh","為什麼 AI 基準賽在資安領域的勝利，應該讓防守方警醒","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778807450006-nofx.png","2026-05-15T01:10:29.379041+00:00",{"id":86,"slug":87,"title":88,"cover_image":89,"image_url":89,"created_at":90,"category":31},"bc402dc6-5da6-46fc-9d66-d09cb215f72b","why-linux-security-needs-patch-wave-mindset-zh","為什麼 Linux 安全需要「補丁浪潮」思維","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778741449813-s2wn.png","2026-05-14T06:50:24.052583+00:00",[92,97,102,107,112,117,122,127,132,137],{"id":93,"slug":94,"title":95,"created_at":96},"f18dbadb-8c59-4723-84a4-6ad22746c77a","deepmind-bets-on-continuous-learning-ai-2026-zh","DeepMind 押注 2026 連續學習 AI","2026-03-26T08:16:02.367355+00:00",{"id":98,"slug":99,"title":100,"created_at":101},"f4a106cb-02a6-4508-8f39-9720a0a93cee","ml-papers-of-the-week-github-research-desk-zh","每週 ML 論文清單，為何紅到 GitHub","2026-03-27T01:11:39.284175+00:00",{"id":103,"slug":104,"title":105,"created_at":106},"c4f807ca-4e5f-47f1-a48c-961cf3fc44dc","ai-ml-conferences-to-watch-in-2026-zh","2026 AI 研討會投稿時程整理","2026-03-27T01:51:53.874432+00:00",{"id":108,"slug":109,"title":110,"created_at":111},"9f50561b-aebd-46ba-94a8-363198aa7091","openclaw-agents-manipulated-self-sabotage-zh","OpenClaw Agent 會自己搞砸自己","2026-03-28T03:03:18.786425+00:00",{"id":113,"slug":114,"title":115,"created_at":116},"11f22e92-7066-4978-a544-31f5f2156ec6","vega-learning-to-drive-with-natural-language-instructions-zh","Vega：使用自然語言指示進行自駕車控制","2026-03-28T14:54:04.847912+00:00",{"id":118,"slug":119,"title":120,"created_at":121},"a4c7cfec-8d0e-4fec-93cf-1b9699a530b8","drive-my-way-en-zh","Drive My Way：個性化自駕車風格的實現","2026-03-28T14:54:26.207495+00:00",{"id":123,"slug":124,"title":125,"created_at":126},"dec02f89-fd39-41ba-8e4d-11ede93a536d","training-knowledge-bases-with-writeback-rag-zh","用 WriteBack-RAG 強化知識庫提升檢索效能","2026-03-28T14:54:45.775606+00:00",{"id":128,"slug":129,"title":130,"created_at":131},"3886be5c-a137-40cc-b9e2-0bf18430c002","packforcing-efficient-long-video-generation-method-zh","PackForcing：短影片訓練也能生成長影片","2026-03-28T14:55:02.688141+00:00",{"id":133,"slug":134,"title":135,"created_at":136},"72b90667-d930-4cc9-8ced-aaa0f8968d44","pixelsmile-toward-fine-grained-facial-expression-editing-zh","PixelSmile：提升精細臉部表情編輯的新方法","2026-03-28T14:55:20.678181+00:00",{"id":138,"slug":139,"title":140,"created_at":141},"cf046742-efb2-4753-aef9-caed5da5e32e","adaptive-block-scaled-data-types-zh","IF4：神經網路量化的聰明選擇","2026-03-31T06:00:36.990273+00:00"]