[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-cuda-asinf-accuracy-no-performance-hit-en":3,"tags-cuda-asinf-accuracy-no-performance-hit-en":30,"related-lang-cuda-asinf-accuracy-no-performance-hit-en":39,"related-posts-cuda-asinf-accuracy-no-performance-hit-en":43,"series-tools-5dda57f2-dfb7-4970-98ec-2e6ad298dd8c":80},{"id":4,"title":5,"content":6,"summary":7,"source":8,"source_url":9,"author":10,"image_url":11,"keywords":12,"language":18,"translated_content":10,"views":19,"is_premium":20,"created_at":21,"updated_at":21,"cover_image":11,"published_at":22,"rewrite_status":23,"rewrite_error":10,"rewritten_from_id":24,"slug":25,"category":26,"related_article_id":27,"status":28,"google_indexed_at":29,"x_posted_at":10,"tweet_text":10,"title_rewritten_at":10,"title_original":10,"key_takeaways":10,"topic_cluster_id":10,"embedding":10,"is_canonical_seed":20},"5dda57f2-dfb7-4970-98ec-2e6ad298dd8c","CUDA asinf() Gets More Accurate Without Slowing Down","\u003Cp>A developer on the \u003Ca href=\"https:\u002F\u002Fforums.developer.nvidia.com\u002F\" target=\"_blank\" rel=\"noopener\">NVIDIA Developer Forums\u003C\u002Fa> just published a fresh take on \u003Ca href=\"https:\u002F\u002Fdocs.nvidia.com\u002Fcuda\u002F\" target=\"_blank\" rel=\"noopener\">CUDA\u003C\u002Fa> math: an accuracy-focused \u003Ccode>asinf()\u003C\u002Fcode> implementation that aims to beat the built-in version without paying a performance penalty. The baseline matters here because CUDA 12.8’s native \u003Ccode>asinf()\u003C\u002Fcode> compiles to 26 instructions, so any improvement has to earn its keep.\u003C\u002Fp>\u003Cp>That is a narrow target, and that is what makes this post interesting. GPU math work usually forces a trade-off between speed and precision, especially for transcendental functions like arcsine. 
This effort tries to keep the instruction count in the same neighborhood while tightening the approximation where it matters.\u003C\u002Fp>\u003Ch2>Why this kind of work matters on GPUs\u003C\u002Fh2>\u003Cp>On a GPU, a few extra instructions can ripple through an entire kernel. If a math function gets called millions or billions of times, even a small change in code generation can affect throughput, latency, and occupancy. That is why developers pay close attention to built-in functions such as \u003Ccode>asinf()\u003C\u002Fcode>, especially in compute-heavy workloads like simulation, rendering, signal processing, and machine learning preprocessing.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775142952141-rcb7.png\" alt=\"CUDA asinf() Gets More Accurate Without Slowing Down\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>The interesting part of this post is the constraint set. The goal was not to write a custom math routine that is merely more accurate in the abstract. The goal was to improve accuracy while avoiding a performance penalty compared with the CUDA 12.8 builtin. That makes the work more practical, because a faster but less accurate approximation is often useless in production, and a more accurate function that slows kernels down can be just as hard to justify.\u003C\u002Fp>\u003Cp>CUDA’s standard math library is already tuned for the hardware, so beating it in both accuracy and efficiency is a high bar. 
The author’s benchmark point, 26 instructions for the built-in implementation, gives readers a concrete reference instead of a vague claim about speed.\u003C\u002Fp>\u003Cul>\u003Cli>Baseline: CUDA 12.8 built-in \u003Ccode>asinf()\u003C\u002Fcode>\u003C\u002Fli>\u003Cli>Instruction count: 26 instructions\u003C\u002Fli>\u003Cli>Goal: higher accuracy with no negative performance impact\u003C\u002Fli>\u003Cli>Scope: single-precision arcsine for CUDA code\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>The accuracy problem behind \u003Ccode>asinf()\u003C\u002Fcode>\u003C\u002Fh2>\u003Cp>\u003Ccode>asinf()\u003C\u002Fcode> looks simple on paper, but it is one of those functions where edge cases matter a lot. Inputs near -1 and 1 are tricky because the derivative of arcsine grows without bound as the input approaches ±1, so small input errors are amplified into larger output errors. That is exactly the sort of region where a better approximation can pay off.\u003C\u002Fp>\u003Cp>The forum post follows an earlier success story with an accuracy-optimized \u003Ccode>acosf()\u003C\u002Fcode> implementation, then applies the same mindset to \u003Ccode>asinf()\u003C\u002Fcode>. That pairing makes sense mathematically because the two functions satisfy the identity asin(x) + acos(x) = π\u002F2, so improvements in one often suggest a reusable strategy for the other.\u003C\u002Fp>\u003Cp>There is also a practical reason GPU developers care about this. In many kernels, transcendental functions are not the dominant cost by themselves, but they become painful when repeated in tight loops. If a revised implementation can stay close to the built-in instruction budget while reducing approximation error, it can make downstream numerical code easier to trust.\u003C\u002Fp>\u003Cblockquote>“The built-in implementation of CUDA 12.8 served as my baseline. It compiles to 26 instructions ...”\u003C\u002Fblockquote>\u003Cp>That quote matters because it defines the benchmark honestly. The author is not comparing against a straw man or a slow debug build. 
The reference point is the shipping CUDA implementation, which is exactly what developers care about when they are deciding whether to swap in custom math.\u003C\u002Fp>\u003Cp>For readers who want more context on CUDA math behavior, OraCore has covered adjacent GPU tooling topics in \u003Ca href=\"\u002Fnews\u002Fcuda-12-8-math-updates\" target=\"_blank\" rel=\"noopener\">CUDA 12.8 math updates\u003C\u002Fa> and \u003Ca href=\"\u002Fnews\u002Fgpu-kernel-optimization-notes\" target=\"_blank\" rel=\"noopener\">GPU kernel optimization notes\u003C\u002Fa>.\u003C\u002Fp>\u003Ch2>What makes the comparison interesting\u003C\u002Fh2>\u003Cp>The post’s value is in the comparison model. A custom approximation only matters if it can be measured against the vendor implementation on the same hardware and under the same compiler assumptions. In this case, the built-in function is the baseline, and the author is trying to improve the numerical result without increasing the cost in instructions.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775142960644-eebj.png\" alt=\"CUDA asinf() Gets More Accurate Without Slowing Down\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>That is the sort of benchmark GPU programmers actually need. If a new implementation adds a few instructions but cuts error sharply, some workloads will still take it. If it preserves the 26-instruction footprint and improves accuracy, the case gets much stronger. It means the function may fit into existing kernels without forcing a redesign.\u003C\u002Fp>\u003Cp>This is also why the post is worth reading even if you do not care about \u003Ccode>asinf()\u003C\u002Fcode> specifically. 
The method reflects a broader pattern in performance engineering: start from the vendor baseline, measure the real cost, then optimize the weak spots without assuming the compiler will save you.\u003C\u002Fp>\u003Cul>\u003Cli>Vendor baseline is already highly optimized for CUDA hardware\u003C\u002Fli>\u003Cli>Any improvement has to justify itself against a 26-instruction implementation\u003C\u002Fli>\u003Cli>Accuracy gains matter most near the function’s sensitive input range\u003C\u002Fli>\u003Cli>Custom math is most valuable when it drops into existing kernels cleanly\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>What developers should take away\u003C\u002Fh2>\u003Cp>The main lesson here is simple: GPU math still has room for careful, targeted improvement. The fact that a developer can revisit a standard function like \u003Ccode>asinf()\u003C\u002Fcode> and find a path to better accuracy without a clear performance hit says something useful about the state of CUDA programming. Vendor libraries are strong, but they are not the end of the story.\u003C\u002Fp>\u003Cp>For teams that ship numerical code on NVIDIA hardware, this is a reminder to inspect hot functions instead of assuming the default implementation is always the best fit. If your workload depends heavily on arcsine, or on a family of inverse trig functions, a custom approximation may be worth testing against your own error budget and kernel profile.\u003C\u002Fp>\u003Cp>The bigger question is whether these hand-tuned math routines will become more common in production CUDA code as developers get more comfortable with profiling and approximation theory. 
If the next round of benchmarks shows the same pattern for related functions, more teams will start treating vendor math as a starting point rather than a final answer.\u003C\u002Fp>\u003Cp>For now, the actionable takeaway is clear: profile your kernels, check where transcendental functions sit in the instruction mix, and compare custom approximations against the CUDA baseline before assuming there is no room to improve.\u003C\u002Fp>\u003Cp>If you want to read the original discussion, the source thread is on the \u003Ca href=\"https:\u002F\u002Fforums.developer.nvidia.com\u002Ft\u002Fimplementation-of-asinf-with-improved-accuracy-and-without-negative-performance-impact\u002F365423\" target=\"_blank\" rel=\"noopener\">NVIDIA Developer Forums\u003C\u002Fa>.\u003C\u002Fp>\u003Cp>The next test worth watching is simple: can this approach hold up across more inputs, more GPUs, and more compiler settings, or does the win fade once real workloads get involved?\u003C\u002Fp>","A developer tuned asinf() for CUDA 12.8 and kept the 26-instruction baseline while improving accuracy, a rare win for GPU math.","forums.developer.nvidia.com","https:\u002F\u002Fforums.developer.nvidia.com\u002Ft\u002Fimplementation-of-asinf-with-improved-accuracy-and-without-negative-performance-impact\u002F365423",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775142952141-rcb7.png",[13,14,15,16,17],"CUDA","asinf","GPU math","numerical 
accuracy","performance","en",0,false,"2026-04-02T15:15:33.15066+00:00","2026-04-02T15:15:33.12+00:00","done","27646ed5-071b-4a9a-8c8f-97c3fc036891","cuda-asinf-accuracy-no-performance-hit-en","tools","83e2a967-1919-4771-857f-37fb8d4cfd00","published","2026-04-08T09:00:51.397+00:00",[31,33,34,35,37],{"name":13,"slug":32},"cuda",{"name":17,"slug":17},{"name":14,"slug":14},{"name":15,"slug":36},"gpu-math",{"name":16,"slug":38},"numerical-accuracy",{"id":27,"slug":40,"title":41,"language":42},"cuda-asinf-accuracy-no-performance-hit-zh","CUDA asinf() 更準，速度沒掉","zh",[44,50,56,62,68,74],{"id":45,"slug":46,"title":47,"cover_image":48,"image_url":48,"created_at":49,"category":26},"a6c1d84d-0d9c-4a5a-9ca0-960fbfc1412e","why-gemini-api-pricing-is-cheaper-than-it-looks-en","Why Gemini API pricing is cheaper than it looks","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778869846824-s2r1.png","2026-05-15T18:30:26.595941+00:00",{"id":51,"slug":52,"title":53,"cover_image":54,"image_url":54,"created_at":55,"category":26},"8b02abfa-eb16-4853-8b15-63d302c7b587","why-vidhub-huiyuan-hutong-bushi-quan-shebei-tongyong-en","Why VidHub 会员互通不是“买一次全设备通用”","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778789439875-uceq.png","2026-05-14T20:10:26.046635+00:00",{"id":57,"slug":58,"title":59,"cover_image":60,"image_url":60,"created_at":61,"category":26},"abe54a57-7461-4659-b2a0-99918dfd2a33","why-buns-zig-to-rust-experiment-is-right-en","Why Bun’s Zig-to-Rust experiment is the right move","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778767895201-5745.png","2026-05-14T14:10:29.298057+00:00",{"id":63,"slug":64,"title":65,"cover_image":66,"image_url":66,"created_at":67,"category":26},"f0015918-251b-43d7-95af-032d2139f3f6","why-openai-api-pricing-is-product-strategy-en","Why 
OpenAI API pricing is a product strategy, not a footnote","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778749841805-uyhg.png","2026-05-14T09:10:27.921211+00:00",{"id":69,"slug":70,"title":71,"cover_image":72,"image_url":72,"created_at":73,"category":26},"7096dab0-6d27-42d9-b951-7545a5dddf33","why-claude-code-prompt-design-beats-ide-copilots-en","Why Claude Code’s prompt design beats IDE copilots","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778742651754-3kxk.png","2026-05-14T07:10:30.953808+00:00",{"id":75,"slug":76,"title":77,"cover_image":78,"image_url":78,"created_at":79,"category":26},"1f1bff1e-0ebc-4fa7-a078-64dc4b552548","why-databricks-model-serving-is-right-default-en","Why Databricks Model Serving is the right default for production infe…","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778692290314-gopj.png","2026-05-13T17:10:32.167576+00:00",[81,86,91,96,101,106,111,116,121,126],{"id":82,"slug":83,"title":84,"created_at":85},"8008f1a9-7a00-4bad-88c9-3eedc9c6b4b1","surepath-ai-mcp-policy-controls-en","SurePath AI's New MCP Policy Controls Enhance AI Security","2026-03-26T01:26:52.222015+00:00",{"id":87,"slug":88,"title":89,"created_at":90},"27e39a8f-b65d-4f7b-a875-859e2b210156","mcp-standard-ai-tools-2026-en","MCP Standard in 2026: Integrating AI Tools","2026-03-26T01:27:43.127519+00:00",{"id":92,"slug":93,"title":94,"created_at":95},"165f9a19-c92d-46ba-b3f0-7125f662921d","rag-2026-transforming-enterprise-ai-en","How RAG in 2026 is Transforming Enterprise AI","2026-03-26T01:28:11.485236+00:00",{"id":97,"slug":98,"title":99,"created_at":100},"6a2a8e6e-b956-49d8-be12-cc47bdc132b2","mastering-ai-prompts-2026-guide-en","Mastering AI Prompts: A 2026 Guide for 
Developers","2026-03-26T01:29:07.835148+00:00",{"id":102,"slug":103,"title":104,"created_at":105},"d6653030-ee6d-4043-898d-d2de0388545b","evolving-world-prompt-engineering-en","The Evolving World of Prompt Engineering","2026-03-26T01:29:42.061205+00:00",{"id":107,"slug":108,"title":109,"created_at":110},"3ab2c67e-4664-4c67-a013-687a2f605814","garry-tan-open-sources-claude-code-toolkit-en","Garry Tan Open-Sources a Claude Code Toolkit","2026-03-26T08:26:20.245934+00:00",{"id":112,"slug":113,"title":114,"created_at":115},"66a7cbf8-7e76-41d4-9bbf-eaca9761bf69","github-ai-projects-to-watch-in-2026-en","20 GitHub AI Projects to Watch in 2026","2026-03-26T08:28:09.752027+00:00",{"id":117,"slug":118,"title":119,"created_at":120},"231306b3-1594-45b2-af81-bb80e41182f2","claude-code-vs-cursor-2026-en","Claude Code vs Cursor in 2026","2026-03-26T13:27:14.177468+00:00",{"id":122,"slug":123,"title":124,"created_at":125},"9f332fda-eace-448a-a292-2283951eee71","practical-github-guide-learning-ml-2026-en","A Practical GitHub Guide to Learning ML in 2026","2026-03-27T01:16:50.125678+00:00",{"id":127,"slug":128,"title":129,"created_at":130},"1b1f637d-0f4d-42bd-974b-07b53829144d","aiml-2026-student-ai-ml-lab-repo-review-en","AIML-2026 Is a Bare-Bones Student Lab Repo","2026-03-27T01:21:51.661231+00:00"]