[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-cuda-asinf-accuracy-no-performance-hit-zh":3,"tags-cuda-asinf-accuracy-no-performance-hit-zh":32,"related-lang-cuda-asinf-accuracy-no-performance-hit-zh":45,"related-posts-cuda-asinf-accuracy-no-performance-hit-zh":49,"series-tools-83e2a967-1919-4771-857f-37fb8d4cfd00":86},{"id":4,"title":5,"content":6,"summary":7,"source":8,"source_url":9,"author":10,"image_url":11,"keywords":12,"language":20,"translated_content":10,"views":21,"is_premium":22,"created_at":23,"updated_at":23,"cover_image":11,"published_at":24,"rewrite_status":25,"rewrite_error":10,"rewritten_from_id":26,"slug":27,"category":28,"related_article_id":29,"status":30,"google_indexed_at":31,"x_posted_at":10,"tweet_text":10,"title_rewritten_at":10,"title_original":10,"key_takeaways":10,"topic_cluster_id":10,"embedding":10,"is_canonical_seed":22},"83e2a967-1919-4771-857f-37fb8d4cfd00","CUDA asinf() 更準，速度沒掉","\u003Cp>GPU 上的三角函式，常常很現實。多 1、2 條指令，整個 kernel 就可能變味。這次在 \u003Ca href=\"https:\u002F\u002Fforums.developer.nvidia.com\u002F\" target=\"_blank\" rel=\"noopener\">NVIDIA Developer Forums\u003C\u002Fa> 上，有人把 \u003Ca href=\"https:\u002F\u002Fdocs.nvidia.com\u002Fcuda\u002F\" target=\"_blank\" rel=\"noopener\">CUDA\u003C\u002Fa> 的 \u003Ccode>asinf()\u003C\u002Fcode> 拿來重做，目標很直白：準度更好，效能別掉。\u003C\u002Fp>\u003Cp>更狠的是，CUDA 12.8 原生 \u003Ccode>asinf()\u003C\u002Fcode> 編譯後是 26 條指令。這代表你想贏它，不能靠嘴砲。你得在同樣級距內，把誤差壓得更漂亮。講白了，這就是 GPU 數學工程的硬仗。\u003C\u002Fp>\u003Cp>我覺得這種題目很有意思。因為它不是在玩花俏演算法。它是在碰實際開發會遇到的痛點。你要的是能塞進現有 kernel 的版本，不是紙上談兵的漂亮公式。\u003C\u002Fp>\u003Ch2>為什麼 GPU 數學這麼難搞\u003C\u002Fh2>\u003Cp>在 GPU 上，函式不是單獨存在。它會被一整批 thread 重複呼叫。只要一個 \u003Ccode>asinf()\u003C\u002Fcode> 多幾條指令，吞吐量就可能被拖到。這在模擬、渲染、訊號處理，還有前處理資料時都很常見。\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg 
src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775142948311-udy5.png\" alt=\"CUDA asinf() 更準，速度沒掉\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>問題是，精度和速度常常互相拉扯。你想把誤差壓低，通常就得多做幾步近似修正。你想快，就可能得接受粗一點的結果。這次的重點，正是想把這條線往前推一點。\u003C\u002Fp>\u003Cp>CUDA 的標準數學函式本來就有做過硬體優化。要在這種基準上再改進，難度不低。尤其 \u003Ccode>asinf()\u003C\u002Fcode> 這種反三角函式，輸入靠近 -1 或 1 時，數值敏感度會上來，誤差很容易被放大。\u003C\u002Fp>\u003Cul>\u003Cli>CUDA 12.8 原生 \u003Ccode>asinf()\u003C\u002Fcode>：26 條指令\u003C\u002Fli>\u003Cli>目標：提高精度，別增加明顯成本\u003C\u002Fli>\u003Cli>適用場景：大量重複呼叫的 GPU kernel\u003C\u002Fli>\u003Cli>風險：邊界輸入的誤差會被放大\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>這次改的是哪個痛點\u003C\u002Fh2>\u003Cp>\u003Ccode>asinf()\u003C\u002Fcode> 看起來很單純。其實它很挑輸入。靠近區間邊界時，arcsine 的斜率變化很大。這表示一點點近似誤差，可能在輸出端變得很明顯。對做數值運算的人來說，這種地方最容易出事。\u003C\u002Fp>\u003Cp>這篇討論的出發點，和之前的 \u003Ccode>acosf()\u003C\u002Fcode> 優化很像。先找出內建函式的誤差弱點，再用更細的近似策略補上。這種做法很務實。它不是追求理論上最漂亮，而是追求在真實 GPU 上比較好用。\u003C\u002Fp>\u003Cp>重點還有一個。它不是只看精度。它同時盯著指令數。因為在 CUDA 世界裡，指令數很誠實。你多寫一點，編譯器和硬體通常都會讓你付帳。這也是為什麼 26 條指令這個基準很重要。\u003C\u002Fp>\u003Cblockquote>“The built-in implementation of CUDA 12.8 served as my baseline. 
It compiles to 26 instructions ...”\u003C\u002Fblockquote>\u003Cp>這句話很乾脆。它把比較基準講清楚了。不是拿舊版本、不是拿 debug build、也不是拿一個慢到不行的參考實作。它直接對準 NVIDIA 現成版本。\u003C\u002Fp>\u003Cp>如果你想看原始討論，來源在 \u003Ca href=\"https:\u002F\u002Fforums.developer.nvidia.com\u002Ft\u002Fimplementation-of-asinf-with-improved-accuracy-and-without-negative-performance-impact\u002F365423\" target=\"_blank\" rel=\"noopener\">NVIDIA Developer Forums\u003C\u002Fa>。相關背景也可以搭配 OraCore 的 \u003Ca href=\"\u002Fnews\u002Fcuda-12-8-math-updates\" target=\"_blank\" rel=\"noopener\">CUDA 12.8 math updates\u003C\u002Fa> 一起看。\u003C\u002Fp>\u003Ch2>跟原生版本比，差在哪裡\u003C\u002Fh2>\u003Cp>這類優化最怕一件事。你以為自己贏了，結果只是把誤差從 A 換成 B。真正有價值的比較，必須在同一顆 GPU、同一個編譯器條件下做。這樣才知道差異是不是實際存在。\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775142961548-rnqy.png\" alt=\"CUDA asinf() 更準，速度沒掉\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>原生 \u003Ccode>asinf()\u003C\u002Fcode> 已經很強。它能維持 26 條指令，代表 NVIDIA 早就把很多細節磨過了。你要在這個基準上改善，通常得靠更精細的分段近似，或更好的誤差修正策略。\u003C\u002Fp>\u003Cp>我覺得這類工作最有價值的地方，不是單次結果，而是方法論。先找 vendor baseline。再看誤差分佈。最後才決定要不要換掉內建函式。這種流程，比看到一個漂亮數字就興奮來得可靠多了。\u003C\u002Fp>\u003Cul>\u003Cli>原生版本已經高度優化，不是隨便就能超過\u003C\u002Fli>\u003Cli>比較重點是同硬體、同編譯條件\u003C\u002Fli>\u003Cli>邊界區間的誤差最值得盯\u003C\u002Fli>\u003Cli>能直接塞進既有 kernel，實用性才高\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>這件事放到產業裡怎麼看\u003C\u002Fh2>\u003Cp>GPU 數學優化，通常不會上新聞首頁。可是它真的會影響產品。做 3D、科學計算、影像管線、ML 前處理的人，都可能碰到這種函式。你平常看不到它，但它會藏在熱點裡偷吃效能。\u003C\u002Fp>\u003Cp>這也解釋了為什麼很多團隊會自己寫近似函式。不是因為官方版本爛。是因為不同工作負載，容忍的誤差不同。像有些圖學管線，能接受一點誤差換吞吐量；但某些物理模擬，就得把誤差壓得更死。\u003C\u002Fp>\u003Cp>這裡可以順手對比一下。NVIDIA 
的原生數學庫，優勢在穩定和硬體貼合。自寫近似函式，優勢在可控。前者像現成工具箱。後者像自己改扳手。哪個好，要看你手上的工作。\u003C\u002Fp>\u003Cul>\u003Cli>原生函式：穩定、好用、貼近硬體\u003C\u002Fli>\u003Cli>自寫近似：可調整誤差與成本\u003C\u002Fli>\u003Cli>適合大量重複呼叫的熱點函式\u003C\u002Fli>\u003Cli>數值工作越敏感，越需要自己量測\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>背景再往前看一點\u003C\u002Fh2>\u003Cp>這種討論其實不是新鮮事。從 CPU 時代開始，數學函式就一直在精度和速度之間拉扯。到了 GPU，這個問題更明顯。因為一個 kernel 可能同時跑上千個 thread，任何微小成本都會被放大。\u003C\u002Fp>\u003Cp>另一個背景是，現代編譯器和硬體已經很會優化。這代表你不能再用「我自己寫一定比較快」這種老派想法。很多時候，內建版本就是很強。你要贏它，得拿出明確證據，不然只是自嗨。\u003C\u002Fp>\u003Cp>也因為這樣，這次的案例才值得看。它沒有亂吹。它直接把目標鎖在 26 條指令這個硬門檻上。這種做法很工程，也很誠實。對開發者來說，這比空談精度有用多了。\u003C\u002Fp>\u003Ch2>你可以怎麼用這個思路\u003C\u002Fh2>\u003Cp>如果你自己在寫 CUDA，我會建議先看熱點。先找出哪些函式被呼叫最多。再看它們是不是剛好落在 \u003Ccode>asinf()\u003C\u002Fcode>、\u003Ccode>acosf()\u003C\u002Fcode> 這種高敏感區。不要一開始就改整包，先動最痛的地方。\u003C\u002Fp>\u003Cp>接著，自己做測試。量誤差。量指令數。量 kernel 時間。三個都要看。少一個，你就很容易被假象騙到。尤其是資料量一大，單次函式差一點點，最後都會變成真金白銀的成本。\u003C\u002Fp>\u003Cp>我自己的看法是，這類優化會越來越實際。不是因為大家突然愛研究數學。是因為 GPU 算力很貴，誰都不想把時間浪費在不必要的近似誤差上。你如果能把準度拉高，還不多花指令，這種成果很難不讓人心動。\u003C\u002Fp>\u003Cp>下一步最值得看的，不是這個版本本身，而是它能不能在更多 GPU、更多輸入分佈、更多編譯設定下維持表現。你要是正在做 CUDA 專案，現在就該把熱函式列出來，重新量一次。別猜，直接測。\u003C\u002Fp>","NVIDIA Developer Forums 上有人替 CUDA 12.8 的 asinf() 做精度優化，指令數仍維持 26 條。這篇看它怎麼在 GPU 數學裡，硬拚準度與效能。","forums.developer.nvidia.com","https:\u002F\u002Fforums.developer.nvidia.com\u002Ft\u002Fimplementation-of-asinf-with-improved-accuracy-and-without-negative-performance-impact\u002F365423",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775142948311-udy5.png",[13,14,15,16,17,18,19],"CUDA","asinf","GPU math","NVIDIA","數值精度","效能優化","CUDA 
12.8","zh",1,false,"2026-04-02T15:15:32.933149+00:00","2026-04-02T15:15:32.901+00:00","done","27646ed5-071b-4a9a-8c8f-97c3fc036891","cuda-asinf-accuracy-no-performance-hit-zh","tools","5dda57f2-dfb7-4970-98ec-2e6ad298dd8c","published","2026-04-08T09:00:51.431+00:00",[33,34,37,39,40,42,44],{"name":18,"slug":18},{"name":35,"slug":36},"Nvidia","nvidia",{"name":13,"slug":38},"cuda",{"name":14,"slug":14},{"name":15,"slug":41},"gpu-math",{"name":19,"slug":43},"cuda-128",{"name":17,"slug":17},{"id":29,"slug":46,"title":47,"language":48},"cuda-asinf-accuracy-no-performance-hit-en","CUDA asinf() Gets More Accurate Without Slowing Down","en",[50,56,62,68,74,80],{"id":51,"slug":52,"title":53,"cover_image":54,"image_url":54,"created_at":55,"category":28},"d058a76f-6548-4135-8970-f3a97f255446","why-gemini-api-pricing-is-cheaper-than-it-looks-zh","為什麼 Gemini API 定價其實比看起來更便宜","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778869845081-j4m7.png","2026-05-15T18:30:25.797639+00:00",{"id":57,"slug":58,"title":59,"cover_image":60,"image_url":60,"created_at":61,"category":28},"68e4be16-dc38-4524-a6ea-5ebe22a6c4fb","why-vidhub-huiyuan-hutong-bushi-quan-shebei-tongyong-zh","為什麼 VidHub 會員互通不是「買一次全設備通用」","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778789450987-advz.png","2026-05-14T20:10:24.048988+00:00",{"id":63,"slug":64,"title":65,"cover_image":66,"image_url":66,"created_at":67,"category":28},"7a1e174f-746b-4e82-a0e3-b2475ab39747","why-buns-zig-to-rust-experiment-is-right-zh","為什麼 Bun 的 Zig-to-Rust 
實驗是對的","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778767879127-5dna.png","2026-05-14T14:10:26.886397+00:00",{"id":69,"slug":70,"title":71,"cover_image":72,"image_url":72,"created_at":73,"category":28},"e742fc73-5a65-4db3-ad17-88c99262ceb7","why-openai-api-pricing-is-product-strategy-zh","為什麼 OpenAI API 定價是產品策略，不是註腳","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778749859485-chvz.png","2026-05-14T09:10:26.003818+00:00",{"id":75,"slug":76,"title":77,"cover_image":78,"image_url":78,"created_at":79,"category":28},"c757c5d8-eda9-45dc-9020-4b002f4d6237","why-claude-code-prompt-design-beats-ide-copilots-zh","為什麼 Claude Code 的提示設計贏過 IDE Copilot","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778742645084-dao9.png","2026-05-14T07:10:29.371901+00:00",{"id":81,"slug":82,"title":83,"cover_image":84,"image_url":84,"created_at":85,"category":28},"4adef3ab-9f07-4970-91cf-77b8b581b348","why-databricks-model-serving-is-right-default-zh","為什麼 Databricks Model Serving 是生產推論的正確預設","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778692245329-a2wt.png","2026-05-13T17:10:30.659153+00:00",[87,92,97,102,107,112,117,122,127,132],{"id":88,"slug":89,"title":90,"created_at":91},"de769291-4574-4c46-a76d-772bd99e6ec9","googles-biggest-gemini-launches-in-2026-zh","Google 2026 最大 Gemini 盤點","2026-03-26T07:26:39.21072+00:00",{"id":93,"slug":94,"title":95,"created_at":96},"855cd52f-6fab-46cc-a7c1-42195e8a0de4","surepath-real-time-mcp-policy-controls-zh","SurePath 推出即時 MCP 政策控管","2026-03-26T07:57:40.77233+00:00",{"id":98,"slug":99,"title":100,"created_at":101},"9b19ab54-edef-4dbd-9ce4-a51e4bae4ebb","mcp-in-2026-the-ai-tool-layer-teams-use-zh","2026 年 MCP：團隊真的在用的 AI 
工具層","2026-03-26T08:01:46.589694+00:00",{"id":103,"slug":104,"title":105,"created_at":106},"af9c46c3-7a28-410b-9f04-32b3de30a68c","prompting-in-2026-what-actually-works-zh","2026 提示工程，真正有用的是什麼","2026-03-26T08:08:12.453028+00:00",{"id":108,"slug":109,"title":110,"created_at":111},"05553086-6ed0-4758-81fd-6cab24b575e0","garry-tan-open-sources-claude-code-toolkit-zh","Garry Tan 開源 Claude Code 工具包","2026-03-26T08:26:20.068737+00:00",{"id":113,"slug":114,"title":115,"created_at":116},"042a73a2-18a2-433d-9e8f-9802b9559aac","github-ai-projects-to-watch-in-2026-zh","2026 必看 20 個 GitHub AI 專案","2026-03-26T08:28:09.619964+00:00",{"id":118,"slug":119,"title":120,"created_at":121},"a5f94120-ac0d-4483-9a8b-63590071ac6a","claude-code-vs-cursor-2026-zh","Claude Code 與 Cursor 深度對比：202…","2026-03-26T13:27:14.279193+00:00",{"id":123,"slug":124,"title":125,"created_at":126},"0975afa1-e0c7-4130-a20d-d890eaed995e","practical-github-guide-learning-ml-2026-zh","2026 機器學習入門 GitHub 實用指南","2026-03-27T01:16:49.712576+00:00",{"id":128,"slug":129,"title":130,"created_at":131},"bfdb467a-290f-4a80-b3a9-6f081afb6dff","aiml-2026-student-ai-ml-lab-repo-review-zh","AIML-2026：像課綱的學生實驗 Repo","2026-03-27T01:21:51.467798+00:00",{"id":133,"slug":134,"title":135,"created_at":136},"80cabc3e-09fc-4ff5-8f07-b8d68f5ae545","ai-trending-github-repos-and-research-feeds-zh","AI Trending：把 AI 資源收成一張表","2026-03-27T01:31:35.262183+00:00"]