[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-arc-prize-leaderboard-cost-performance-zh":3,"tags-arc-prize-leaderboard-cost-performance-zh":33,"related-lang-arc-prize-leaderboard-cost-performance-zh":49,"related-posts-arc-prize-leaderboard-cost-performance-zh":53,"series-research-ffa8459f-678e-40b9-a513-dee6b02800bc":90},{"id":4,"title":5,"content":6,"summary":7,"source":8,"source_url":9,"author":10,"image_url":11,"keywords":12,"language":21,"translated_content":10,"views":22,"is_premium":23,"created_at":24,"updated_at":24,"cover_image":11,"published_at":25,"rewrite_status":26,"rewrite_error":10,"rewritten_from_id":27,"slug":28,"category":29,"related_article_id":30,"status":31,"google_indexed_at":32,"x_posted_at":10,"tweet_text":10,"title_rewritten_at":10,"title_original":10,"key_takeaways":10,"topic_cluster_id":10,"embedding":10,"is_canonical_seed":23},"ffa8459f-678e-40b9-a513-dee6b02800bc","ARC 排行榜把成本攤開來看","\u003Cp>AI 圈很愛看分數。可是 \u003Ca href=\"https:\u002F\u002Farcprize.org\u002Fleaderboard\" target=\"_blank\" rel=\"noopener\">ARC Prize leaderboard\u003C\u002Fa> 直接把成本攤開。網站寫得很白，只有跑一次低於 10,000 美元的系統才會上榜。這種做法很像把獎盃牆換成帳單牆，誰燒錢，誰有效率，一眼就看懂。\u003C\u002Fp>\u003Cp>更有意思的是，\u003Ca href=\"https:\u002F\u002Farcprize.org\u002F\" target=\"_blank\" rel=\"noopener\">ARC Prize\u003C\u002Fa> 不再只考靜態題目。\u003Ca href=\"https:\u002F\u002Farcprize.org\u002Fleaderboard\" target=\"_blank\" rel=\"noopener\">ARC-AGI-1\u003C\u002Fa> 和 \u003Ca href=\"https:\u002F\u002Farcprize.org\u002Fleaderboard\" target=\"_blank\" rel=\"noopener\">ARC-AGI-2\u003C\u002Fa> 比的是固定題型下的推理。\u003Ca href=\"https:\u002F\u002Farcprize.org\u002Fleaderboard\" target=\"_blank\" rel=\"noopener\">ARC-AGI-3\u003C\u002Fa> 則把 agent 丟進互動環境。講白了，問題從「模型會不會答」變成「它能不能快速學會規則」。\u003C\u002Fp>\u003Cp>這差很多。因為真實軟體環境裡，任務常常會變。資料格式會變。API 
會改。工具會掛。你不只要答對，還要能適應。這也是 ARC 排行榜比很多 benchmark 更像工程現場的原因。\u003C\u002Fp>\u003Ch2>ARC 排行榜到底在量什麼\u003C\u002Fh2>\u003Cp>ARC 的圖表不是單純排名。它把每個系統的成本和表現放在同一張散點圖上。你可以把它想成，每個點都在回答一個很現實的問題：這個模型每做一題，燒掉多少算力，換回多少分數。\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775143855363-h1lx.png\" alt=\"ARC 排行榜把成本攤開來看\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>這種量法很直接，也很殘酷。很多 AI demo 看起來很猛，但一旦算進推理時間、重試次數、工具呼叫和 Token 消耗，成本就會炸開。分數高不代表能上線。分數高也不代表能長期跑在伺服器上。\u003C\u002Fp>\u003Cp>ARC Prize 還把不同類型的系統分開看。像 \u003Ca href=\"https:\u002F\u002Fopenai.com\u002Findex\u002Fgpt-4-5\u002F\" target=\"_blank\" rel=\"noopener\">GPT-4.5\u003C\u002Fa> 和 \u003Ca href=\"https:\u002F\u002Fwww.anthropic.com\u002Fnews\u002Fclaude-3-7-sonnet\" target=\"_blank\" rel=\"noopener\">Claude 3.7 Sonnet\u003C\u002Fa> 這類 base LLM，通常是單次推理。另一類是 reasoning system，會拉長思考時間。還有 Kaggle 系統，因為它們是在極小預算下拚命擠分數。\u003C\u002Fp>\u003Cp>這樣切開來看，才不會把不同玩法混成一團。說真的，把 50 美元預算的競賽方法，跟 10,000 美元等級的推理系統放一起比，本來就很怪。ARC 至少有努力把規則講清楚。\u003C\u002Fp>\u003Cul>\u003Cli>上榜門檻是單次運行低於 10,000 美元。\u003C\u002Fli>\u003Cli>Kaggle 組別只有 50 美元算力預算。\u003C\u002Fli>\u003Cli>ARC-AGI-3 改成互動式任務。\u003C\u002Fli>\u003Cli>部分結果還是 preview 或 provisional。\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>為什麼 ARC-AGI-3 讓人更在意\u003C\u002Fh2>\u003Cp>ARC-AGI-3 最有意思的地方，是它把題目從靜態推理拉進互動。這不再是一次性考試。模型要先觀察，再調整，再繼續試。這種模式比較像 agent，而不是傳統聊天機器人。\u003C\u002Fp>\u003Cp>這個改動很重要。因為很多 LLM 在固定題目上看起來很強，可是一進入真實任務就開始亂猜。它可能要多輪試錯，還要一直呼叫工具。每多一次動作，成本就往上疊。Token 也跟著燒。\u003C\u002Fp>\u003Cp>ARC Prize 把這個代價畫出來，算是很誠實。你可以看到某些系統不是不能解題，而是解題方式太貴。這對 AI 團隊是壓力，也是提醒。真正能部署的系統，不是只會做題，而是能控制成本。\u003C\u002Fp>\u003Cblockquote>“True intelligence isn't just about solving problems, but solving them efficiently with minimal resources.”\u003C\u002Fblockquote>\u003Cp>這句引述來自 ARC Prize。意思很直白。只會靠狂燒算力解題，不代表夠聰明。它可能只是預算比較大。這也是我覺得 ARC 題目比很多 benchmark 
更有意思的地方。\u003C\u002Fp>\u003Cp>因為它逼你面對一個老問題。模型分數高，跟產品能不能賣，根本不是同一件事。你在 demo 場上看到的漂亮曲線，常常是伺服器和成本團隊在背後幫你扛。\u003C\u002Fp>\u003Ch2>不同系統類型怎麼比\u003C\u002Fh2>\u003Cp>ARC 的排行榜不是只看誰第一。它更像在看不同策略的取捨。reasoning system 通常會隨著思考時間增加而進步，但 ARC 的說明也提到，這種提升常會慢慢趨平。講白了，就是多想一點有用，但不是無限有用。\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775143857580-63r9.png\" alt=\"ARC 排行榜把成本攤開來看\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>這點很像現實世界的 AI 服務。你把推理時間拉長，答案有時會更好。可是一旦延遲變高，使用者就開始罵。更別說雲端成本、GPU 排程、API 延遲，全部都會一起上來。\u003C\u002Fp>\u003Cp>base LLM 的價值，則在於它告訴你模型原始能力到哪裡。它沒有長鏈推理的加成，也沒有太多外掛技巧。這種結果很適合拿來看底子。Kaggle 系統則是另一種極端，專門把有限預算榨到乾。\u003C\u002Fp>\u003Cp>如果你是做產品的人，這張圖其實很好懂。高分但貴，適合 demo。中高分但便宜，才像能上線。低分但超便宜，可能是某些場景的實用解。ARC 的好處，就是把這些差異攤平給你看。\u003C\u002Fp>\u003Cul>\u003Cli>reasoning system 會隨思考時間增加而進步。\u003C\u002Fli>\u003Cli>base LLM 反映單次推理的原始能力。\u003C\u002Fli>\u003Cli>Kaggle 系統是固定預算下的極限優化。\u003C\u002Fli>\u003Cli>有些結果只算 preview，不該當成最終答案。\u003C\u002Fli>\u003C\u002Ful>\u003Cp>如果拿現有大廠來看，\u003Ca href=\"https:\u002F\u002Fopenai.com\u002F\" target=\"_blank\" rel=\"noopener\">OpenAI\u003C\u002Fa> 和 \u003Ca href=\"https:\u002F\u002Fwww.anthropic.com\u002F\" target=\"_blank\" rel=\"noopener\">Anthropic\u003C\u002Fa> 都很愛談 reasoning 能力。這沒錯。但 ARC 逼大家多看一個欄位：成本。沒有成本，能力常常只是幻覺。\u003C\u002Fp>\u003Cp>我覺得這對台灣開發者特別有感。很多團隊現在都在串 API 做 agent。你如果只看成功率，不看每次任務花多少 Token，月底帳單會直接教你做人。\u003C\u002Fp>\u003Ch2>這對 AI agent 開發有什麼意思\u003C\u002Fh2>\u003Cp>ARC-AGI-3 的方向，會直接影響 agent 設計。以前大家常比誰的模型答得準。現在更像在比誰能更快學會任務，還能少走冤枉路。這種能力很接近實際產品需求。\u003C\u002Fp>\u003Cp>例如客服 agent、資料整理 agent、研究助理 agent，都不會只做一次輸出。它們要讀文件、查資料、呼叫工具、修正結果。每一步都會消耗時間和算力。任務一複雜，成本就會跳。\u003C\u002Fp>\u003Cp>所以 ARC 排行榜的價值，不只是展示誰分數高。它也在提醒大家，agent 不能只追求多輪思考。更好的設計，應該是更少重試、更少廢話、更少無效 Token。這才像能在真實伺服器上活下來的系統。\u003C\u002Fp>\u003Cp>這裡可以順手看一下產業脈絡。現在很多團隊都在往 agent framework 
靠攏，像是工具調用、記憶管理、工作流編排。可是框架再多，如果成本控制沒做好，最後還是會卡在營運面。技術債會變成雲端帳單。\u003C\u002Fp>\u003Ch2>這股潮流背後的產業壓力\u003C\u002Fh2>\u003Cp>AI 產業這兩年很愛比大模型參數、上下文長度、工具數量。可是真正落地後，大家最先問的常常不是準不準，而是貴不貴。這很現實，也很正常。因為企業買的是服務，不是論文。\u003C\u002Fp>\u003Cp>ARC 這種榜單會受歡迎，不是因為它比較會炒作。是因為它把成本變成主角。這對模型供應商很麻煩，因為只靠「更大、更強」這套說法，已經不夠了。你還得證明每個 Token 都花得值得。\u003C\u002Fp>\u003Cp>我也覺得這會影響 benchmark 生態。接下來，單看 accuracy 的榜單可能會越來越沒說服力。大家會開始問延遲、成本、失敗率、重試次數，甚至是互動過程中的 sample efficiency。這些才是產品團隊真的在意的指標。\u003C\u002Fp>\u003Cp>如果你是做開發工具、SaaS，或是內部知識庫系統，這種轉變特別重要。因為你不會想把一個 90 分的模型，放進一個每月燒掉幾十萬的 workflow。高分很爽。帳單更真實。\u003C\u002Fp>\u003Ch2>接下來該看什麼\u003C\u002Fh2>\u003Cp>我的判斷很簡單。接下來幾個月，大家會更在意模型的「每分成本」。不是只看誰拿最高分，而是誰能用更少資源拿到接近的結果。這會直接影響 agent、推理服務和雲端部署策略。\u003C\u002Fp>\u003Cp>如果 ARC-AGI-3 持續把互動能力和成本綁在一起，AI 團隊就很難再只靠跑分說故事。下一個值得追的點，不是單一分數，而是分數、延遲、重試、Token 消耗的整體組合。你如果在做產品，現在就該開始記這些數字。\u003C\u002Fp>\u003Cp>講白了，這種榜單不是叫你迷信 ARC。它是在提醒你，AI 的價值不是免費的。下次你看到某個模型分數很漂亮，先問一句：它花了多少錢？如果答案太難看，那分數再高也只是漂亮數字而已。\u003C\u002Fp>","ARC Prize 排行榜把成本和分數放在同一張圖上，ARC-AGI-3 也把任務拉進互動環境。這篇看它怎麼逼 AI 團隊正視算力、Token 和實際可部署性。","arcprize.org","https:\u002F\u002Farcprize.org\u002Fleaderboard",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775143855363-h1lx.png",[13,14,15,16,17,18,19,20],"ARC Prize","ARC leaderboard","AI benchmark","LLM cost","AI 
agent","Token成本","推理模型","ARC-AGI-3","zh",1,false,"2026-04-02T15:30:39.292235+00:00","2026-04-02T15:30:39.121+00:00","done","245f25e6-76bb-4e86-88d5-0d80485ad8e0","arc-prize-leaderboard-cost-performance-zh","research","7a6580cb-935a-456c-a22d-45bab79f41c9","published","2026-04-08T09:00:51.097+00:00",[34,36,38,40,42,43,45,47],{"name":16,"slug":35},"llm-cost",{"name":20,"slug":37},"arc-agi-3",{"name":18,"slug":39},"token成本",{"name":13,"slug":41},"arc-prize",{"name":19,"slug":19},{"name":17,"slug":44},"ai-agent",{"name":15,"slug":46},"ai-benchmark",{"name":14,"slug":48},"arc-leaderboard",{"id":30,"slug":50,"title":51,"language":52},"arc-prize-leaderboard-cost-performance-en","ARC Prize leaderboard shows cost still matters","en",[54,60,66,72,78,84],{"id":55,"slug":56,"title":57,"cover_image":58,"image_url":58,"created_at":59,"category":29},"667b72b6-e821-4d68-80a1-e03340bc85f1","turboquant-seo-shift-small-sites-zh","TurboQuant 與小站 SEO 變化","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778840440690-kcw9.png","2026-05-15T10:20:27.319472+00:00",{"id":61,"slug":62,"title":63,"cover_image":64,"image_url":64,"created_at":65,"category":29},"381fb6c6-6da7-4444-831f-8c5eed8d685c","turboquant-vllm-comparison-fp8-kv-cache-zh","TurboQuant 與 FP8 實測結果","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778839867551-4v9g.png","2026-05-15T10:10:36.034569+00:00",{"id":67,"slug":68,"title":69,"cover_image":70,"image_url":70,"created_at":71,"category":29},"c15f45ee-a548-4dbf-8152-91de159c1a11","llmbda-calculus-agent-safety-rules-zh","LLMbda 演算替 AI 
代理人立安全規則","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778825503412-mlbf.png","2026-05-15T06:10:34.832664+00:00",{"id":73,"slug":74,"title":75,"cover_image":76,"image_url":76,"created_at":77,"category":29},"0c02225c-d6ff-44f8-bc92-884c8921c4a3","low-complexity-beamspace-denoiser-mmwave-mimo-zh","更簡單的毫米波波束域去噪器","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778814650361-xtc2.png","2026-05-15T03:10:30.06639+00:00",{"id":79,"slug":80,"title":81,"cover_image":82,"image_url":82,"created_at":83,"category":29},"9d27f967-62cc-433f-8cdb-9300937ade13","ai-benchmark-wins-cyber-scare-defenders-zh","為什麼 AI 基準賽在資安領域的勝利，應該讓防守方警醒","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778807450006-nofx.png","2026-05-15T01:10:29.379041+00:00",{"id":85,"slug":86,"title":87,"cover_image":88,"image_url":88,"created_at":89,"category":29},"bc402dc6-5da6-46fc-9d66-d09cb215f72b","why-linux-security-needs-patch-wave-mindset-zh","為什麼 Linux 安全需要「補丁浪潮」思維","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778741449813-s2wn.png","2026-05-14T06:50:24.052583+00:00",[91,96,101,106,111,116,121,126,131,136],{"id":92,"slug":93,"title":94,"created_at":95},"f18dbadb-8c59-4723-84a4-6ad22746c77a","deepmind-bets-on-continuous-learning-ai-2026-zh","DeepMind 押注 2026 連續學習 AI","2026-03-26T08:16:02.367355+00:00",{"id":97,"slug":98,"title":99,"created_at":100},"f4a106cb-02a6-4508-8f39-9720a0a93cee","ml-papers-of-the-week-github-research-desk-zh","每週 ML 論文清單，為何紅到 GitHub","2026-03-27T01:11:39.284175+00:00",{"id":102,"slug":103,"title":104,"created_at":105},"c4f807ca-4e5f-47f1-a48c-961cf3fc44dc","ai-ml-conferences-to-watch-in-2026-zh","2026 AI 
研討會投稿時程整理","2026-03-27T01:51:53.874432+00:00",{"id":107,"slug":108,"title":109,"created_at":110},"9f50561b-aebd-46ba-94a8-363198aa7091","openclaw-agents-manipulated-self-sabotage-zh","OpenClaw Agent 會自己搞砸自己","2026-03-28T03:03:18.786425+00:00",{"id":112,"slug":113,"title":114,"created_at":115},"11f22e92-7066-4978-a544-31f5f2156ec6","vega-learning-to-drive-with-natural-language-instructions-zh","Vega：使用自然語言指示進行自駕車控制","2026-03-28T14:54:04.847912+00:00",{"id":117,"slug":118,"title":119,"created_at":120},"a4c7cfec-8d0e-4fec-93cf-1b9699a530b8","drive-my-way-en-zh","Drive My Way：個性化自駕車風格的實現","2026-03-28T14:54:26.207495+00:00",{"id":122,"slug":123,"title":124,"created_at":125},"dec02f89-fd39-41ba-8e4d-11ede93a536d","training-knowledge-bases-with-writeback-rag-zh","用 WriteBack-RAG 強化知識庫提升檢索效能","2026-03-28T14:54:45.775606+00:00",{"id":127,"slug":128,"title":129,"created_at":130},"3886be5c-a137-40cc-b9e2-0bf18430c002","packforcing-efficient-long-video-generation-method-zh","PackForcing：短影片訓練也能生成長影片","2026-03-28T14:55:02.688141+00:00",{"id":132,"slug":133,"title":134,"created_at":135},"72b90667-d930-4cc9-8ced-aaa0f8968d44","pixelsmile-toward-fine-grained-facial-expression-editing-zh","PixelSmile：提升精細臉部表情編輯的新方法","2026-03-28T14:55:20.678181+00:00",{"id":137,"slug":138,"title":139,"created_at":140},"cf046742-efb2-4753-aef9-caed5da5e32e","adaptive-block-scaled-data-types-zh","IF4：神經網路量化的聰明選擇","2026-03-31T06:00:36.990273+00:00"]