[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-cuda-in-2025-why-gpus-still-win-en":3,"tags-cuda-in-2025-why-gpus-still-win-en":30,"related-lang-cuda-in-2025-why-gpus-still-win-en":42,"related-posts-cuda-in-2025-why-gpus-still-win-en":46,"series-tools-e05a606a-88b9-45cd-8c3e-7ad0b30b7b5d":83},{"id":4,"title":5,"content":6,"summary":7,"source":8,"source_url":9,"author":10,"image_url":11,"keywords":12,"language":18,"translated_content":10,"views":19,"is_premium":20,"created_at":21,"updated_at":21,"cover_image":11,"published_at":22,"rewrite_status":23,"rewrite_error":10,"rewritten_from_id":24,"slug":25,"category":26,"related_article_id":27,"status":28,"google_indexed_at":29,"x_posted_at":10,"tweet_text":10,"title_rewritten_at":10,"title_original":10,"key_takeaways":10,"topic_cluster_id":10,"embedding":10,"is_canonical_seed":20},"e05a606a-88b9-45cd-8c3e-7ad0b30b7b5d","CUDA in 2025: Why GPUs Still Win","\u003Cp>CUDA is 18 years old now, and it \u003Ca href=\"\u002Fnews\u002Farc-prize-leaderboard-cost-performance-en\">still matters\u003C\u002Fa> because the numbers are hard to ignore: NVIDIA says there are hundreds of millions of CUDA-enabled GPUs in use, and modern clusters can throw tens of thousands of GPU cores at a single workload. That is why the same software stack shows up in weather models, protein simulation, and LLM training.\u003C\u002Fp>\u003Cp>At its core, \u003Ca href=\"https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-zone\" target=\"_blank\" rel=\"noopener\">CUDA\u003C\u002Fa> is NVIDIA’s programming model for running general-purpose code on GPUs. If you have ever watched a task shrink from hours to minutes after moving from a CPU to a GPU, you already understand the appeal.\u003C\u002Fp>\u003Cp>What makes CUDA interesting in 2025 is not that it is new. 
It is that it has become the default assumption for a huge chunk of accelerated computing, from \u003Ca href=\"https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fdata-center\u002F\" target=\"_blank\" rel=\"noopener\">NVIDIA\u003C\u002Fa> data center hardware to the libraries inside \u003Ca href=\"https:\u002F\u002Fpytorch.org\" target=\"_blank\" rel=\"noopener\">PyTorch\u003C\u002Fa> and \u003Ca href=\"https:\u002F\u002Fwww.tensorflow.org\" target=\"_blank\" rel=\"noopener\">TensorFlow\u003C\u002Fa>.\u003C\u002Fp>\u003Ch2>How CUDA got here\u003C\u002Fh2>\u003Cp>CUDA launched publicly in 2007, after NVIDIA spent years turning GPU hardware into something developers could program directly instead of abusing graphics APIs for compute. Before that, general-purpose GPU work meant awkward hacks through OpenGL or DirectX shaders. CUDA gave developers a cleaner model: write code in C or C++, launch kernels on the GPU, and let thousands of threads chew through data in parallel.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775149432831-x799.png\" alt=\"CUDA in 2025: Why GPUs Still Win\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>The timing mattered. In 2007, CPUs were still improving, but they were not getting enough extra cores fast enough to satisfy scientific computing and later deep learning. GPUs were already built for parallel math, and CUDA made that power accessible without forcing developers to rewrite everything in graphics terms.\u003C\u002Fp>\u003Cp>That early bet paid off because NVIDIA kept shipping new toolkit versions, new compiler support, and new libraries instead of treating CUDA as a one-off launch. 
The platform became sticky for a simple reason: once your code depends on CUDA libraries, switching away gets expensive.\u003C\u002Fp>\u003Cul>\u003Cli>First public CUDA release: 2007\u003C\u002Fli>\u003Cli>Initial hardware support: GeForce 8 series\u003C\u002Fli>\u003Cli>Modern toolkit support: CUDA 13.0\u003C\u002Fli>\u003Cli>Current architecture support includes Hopper and Blackwell\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>What CUDA actually does under the hood\u003C\u002Fh2>\u003Cp>CUDA is a heterogeneous model. The CPU is the host, the GPU is the device, and the work gets split between them. The CPU handles orchestration, while the GPU handles the parts that can be broken into many small pieces and run at the same time.\u003C\u002Fp>\u003Cp>The important unit is the kernel, a function that runs on the GPU across many threads. Those threads are grouped into blocks, and blocks are grouped into grids. That structure matters because it lets developers control how work is split across the hardware instead of hoping the runtime figures it out magically.\u003C\u002Fp>\u003Cp>Memory behavior matters just as much as raw compute. CUDA has global memory, shared memory, constant memory, texture memory, and unified memory. Global memory is large but slower. Shared memory is much faster but limited to threads inside one block. Unified memory makes life easier by presenting one address space to both CPU and GPU, although it does not make bad memory access patterns disappear.\u003C\u002Fp>\u003Cblockquote>“The GPU is a very different kind of processor than the CPU. It is optimized for throughput, not latency.” — Ian Buck, NVIDIA developer conference talk on CUDA and GPU computing\u003C\u002Fblockquote>\u003Cp>That quote gets to the heart of CUDA better than any marketing copy. CUDA works when the problem has enough parallel work to keep the GPU busy. 
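The thread hierarchy described above (threads inside blocks, blocks inside a grid) comes down to simple index arithmetic. Here is a CPU-only Python sketch of how a 1-D launch covers N elements; the function names are illustrative, not CUDA API, and the nested loop stands in for what the GPU actually runs in parallel.

```python
# Sketch of the index arithmetic behind a 1-D CUDA launch, run on the CPU
# for illustration. In real CUDA C++, each thread computes
#   i = blockIdx.x * blockDim.x + threadIdx.x
# and guards against i >= n before touching memory.

def launch_config(n, threads_per_block=256):
    """Return the grid size that covers n elements (ceiling division)."""
    blocks = (n + threads_per_block - 1) // threads_per_block
    return blocks, threads_per_block

def simulate_saxpy(a, x, y):
    """CPU stand-in for a saxpy kernel: each 'thread' does y[i] = a*x[i] + y[i]."""
    n = len(x)
    blocks, tpb = launch_config(n)
    out = list(y)
    for block_idx in range(blocks):
        for thread_idx in range(tpb):
            i = block_idx * tpb + thread_idx   # global thread index
            if i < n:                          # bounds guard, as in a kernel
                out[i] = a * x[i] + out[i]
    return out
```

The ceiling division is why kernels need the bounds guard: the last block usually overshoots the array, and its extra threads must do nothing.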
If your workload is mostly serial, the GPU will not save you.\u003C\u002Fp>\u003Cp>And that is why CUDA programming is still partly an engineering discipline and partly a performance puzzle. The fastest code is usually the code that moves data the least, keeps memory access coalesced, and avoids branch divergence inside warps.\u003C\u002Fp>\u003Ch2>Where CUDA is winning in the real world\u003C\u002Fh2>\u003Cp>The strongest evidence for CUDA’s reach is not in benchmark slides. It is in the software researchers and engineers actually use. In molecular dynamics, \u003Ca href=\"https:\u002F\u002Fwww.gromacs.org\" target=\"_blank\" rel=\"noopener\">GROMACS\u003C\u002Fa> uses CUDA to simulate biomolecules at scales involving millions of particles. In weather forecasting, the \u003Ca href=\"https:\u002F\u002Fwww.mmm.ucar.edu\u002Fmodels\u002Fwrf\" target=\"_blank\" rel=\"noopener\">Weather Research and Forecasting (WRF)\u003C\u002Fa> model has GPU implementations that can deliver up to 10x speedups in numerical computation.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775149435202-ov6l.png\" alt=\"CUDA in 2025: Why GPUs Still Win\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>AI is even more dependent on CUDA. Training large neural networks depends on matrix math, and CUDA libraries like \u003Ca href=\"https:\u002F\u002Fdeveloper.nvidia.com\u002Fcublas\" target=\"_blank\" rel=\"noopener\">cuBLAS\u003C\u002Fa> and \u003Ca href=\"https:\u002F\u002Fdeveloper.nvidia.com\u002Fcudnn\" target=\"_blank\" rel=\"noopener\">cuDNN\u003C\u002Fa> do a lot of the heavy lifting behind the scenes. That is one reason GPU training became the default path for modern deep learning.\u003C\u002Fp>\u003Cp>CUDA also shows up in domains that do not get as much attention in AI headlines. 
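On the coalescing point from the performance discussion above: what matters is which byte addresses the 32 threads of a warp touch in the same step. Contiguous addresses fit into a few wide memory transactions; strided ones do not. A small sketch, with illustrative helper names and a simplified 128-byte-segment model of global memory:

```python
# Why coalescing matters: count the memory segments one 32-thread warp
# touches at once. This is a simplified model (fixed 128-byte segments,
# 4-byte elements), purely for illustration.

WARP_SIZE = 32

def warp_addresses(base, stride, elem_bytes=4):
    """Byte addresses read by threads 0..31 of a warp accessing x[tid * stride]."""
    return [base + tid * stride * elem_bytes for tid in range(WARP_SIZE)]

def transactions_needed(addresses, segment_bytes=128):
    """Count distinct 128-byte segments the warp's addresses fall into."""
    return len({addr // segment_bytes for addr in addresses})
```

With stride 1 the whole warp lands in one 128-byte segment (one transaction); with stride 32 each thread hits its own segment (32 transactions), a 32x difference in memory traffic for the same amount of useful data.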
Finance teams use it for risk analysis, genomics pipelines use it for sequence work, and autonomous systems use it for real-time perception. The common thread is simple: lots of math, lots of data, and a need to finish before the answer goes stale.\u003C\u002Fp>\u003Cul>\u003Cli>GROMACS uses CUDA for biomolecular simulation at million-particle scale\u003C\u002Fli>\u003Cli>WRF GPU implementations can reach up to 10x speedups\u003C\u002Fli>\u003Cli>CUDA underpins training and inference in major deep learning frameworks\u003C\u002Fli>\u003Cli>Python users can access CUDA through \u003Ca href=\"https:\u002F\u002Fnumba.pydata.org\" target=\"_blank\" rel=\"noopener\">Numba\u003C\u002Fa> and \u003Ca href=\"https:\u002F\u002Fcupy.dev\" target=\"_blank\" rel=\"noopener\">CuPy\u003C\u002Fa>\u003C\u002Fli>\u003C\u002Ful>\u003Ch2>CUDA versus the alternatives\u003C\u002Fh2>\u003Cp>CUDA’s biggest advantage is maturity. It has the deepest library stack, the broadest developer adoption, and the clearest path from prototype to production on NVIDIA hardware. That matters because performance work is expensive, and engineers prefer a path with fewer unknowns.\u003C\u002Fp>\u003Cp>But CUDA is not the only option. \u003Ca href=\"https:\u002F\u002Fwww.khronos.org\u002Fopencl\u002F\" target=\"_blank\" rel=\"noopener\">OpenCL\u003C\u002Fa> is more portable across vendors, \u003Ca href=\"https:\u002F\u002Fwww.intel.com\u002Fcontent\u002Fwww\u002Fus\u002Fen\u002Fdeveloper\u002Ftools\u002Foneapi\u002Foverview.html\" target=\"_blank\" rel=\"noopener\">Intel oneAPI\u003C\u002Fa> targets Intel’s hardware and software stack, and \u003Ca href=\"https:\u002F\u002Frocm.docs.amd.com\" target=\"_blank\" rel=\"noopener\">AMD ROCm\u003C\u002Fa> gives AMD a serious answer for GPU compute. 
The tradeoff is clear: broader portability usually means less polish, fewer battle-tested libraries, or more porting work.\u003C\u002Fp>\u003Cp>Here is the comparison that matters in practice:\u003C\u002Fp>\u003Cul>\u003Cli>CUDA: strongest ecosystem on NVIDIA GPUs, widest library support, highest adoption in AI\u003C\u002Fli>\u003Cli>OpenCL: vendor-neutral, useful when hardware portability matters more than peak NVIDIA performance\u003C\u002Fli>\u003Cli>Intel oneAPI: best fit for Intel-focused shops and mixed CPU\u002FGPU workflows\u003C\u002Fli>\u003Cli>AMD ROCm: the main route for AMD GPU acceleration, especially in research and some AI deployments\u003C\u002Fli>\u003C\u002Ful>\u003Cp>For most teams, the decision is not philosophical. It is about where the hardware budget is going. If the cluster is NVIDIA-based, CUDA is the path of least resistance. If procurement is mixed, the portability story gets more important, and the developer experience gets harder.\u003C\u002Fp>\u003Cp>There is also a business reality here: CUDA creates lock-in, and NVIDIA knows it. The lock-in is not just at the API level. It is in the training materials, the code samples, the libraries, and the habits of entire engineering teams.\u003C\u002Fp>\u003Ch2>What to watch next\u003C\u002Fh2>\u003Cp>CUDA is not going away, but its role is changing. The biggest question is how much of modern AI and HPC will stay tied to NVIDIA-specific tooling as more vendors push their own stacks and more teams ask for portability. The answer will depend on whether the convenience of CUDA keeps outweighing the pain of being tied to one hardware family.\u003C\u002Fp>\u003Cp>For developers, the practical takeaway is straightforward: if your workload is parallel, memory-heavy, and already lives on NVIDIA GPUs, CUDA is still the fastest route to real speedups. 
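When procurement is mixed, one common hedge is a thin backend-selection layer, which works because CuPy mirrors a large part of NumPy's API. A minimal sketch, assuming only that NumPy is installed; `pick_array_backend` is an illustrative helper, not a library API:

```python
# Minimal portability layer: resolve the array backend once at startup,
# then write all numeric code against the alias `xp`. On a machine with
# CUDA and CuPy installed, the same code runs on the GPU; otherwise it
# falls back to NumPy on the CPU.

def pick_array_backend():
    """Prefer CuPy (CUDA) when importable, else fall back to NumPy."""
    try:
        import cupy as xp          # GPU path, only if CUDA + CuPy exist
    except ImportError:
        import numpy as xp         # CPU fallback
    return xp

xp = pick_array_backend()

# Downstream code is written once against `xp`:
result = xp.sqrt(xp.arange(4.0)).sum()
```

The limitation is real, though: anything touching streams, pinned memory, or custom kernels leaks through this abstraction, which is why such layers stay thin.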
If you are starting a new platform strategy, you should decide early whether NVIDIA-first optimization is worth the lock-in.\u003C\u002Fp>\u003Cp>My bet is that CUDA will keep dominating high-performance AI and scientific computing for the next few years, while more teams quietly build portability layers above it. The real question is whether your codebase should speak CUDA directly, or whether it should treat CUDA as an implementation detail behind a thinner abstraction.\u003C\u002Fp>\u003Cp>If you want a related read on GPU software stacks, see our coverage of \u003Ca href=\"\u002Fnews\u002Fwhat-llm-inference-actually-costs\" target=\"_self\">LLM inference costs\u003C\u002Fa> and how hardware choices shape deployment budgets.\u003C\u002Fp>","CUDA powers NVIDIA GPUs across AI, science, and simulation, with up to 10x weather-model speedups and deep learning gains in the thousands.","grokipedia.com","https:\u002F\u002Fgrokipedia.com\u002Fpage\u002FCUDA",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775149432831-x799.png",[13,14,15,16,17],"CUDA","NVIDIA","GPU computing","deep learning","HPC","en",2,false,"2026-04-02T17:03:38.270396+00:00","2026-04-02T17:03:38.176+00:00","done","dd1605a3-17b6-48f2-ac19-1116e2be9fab","cuda-in-2025-why-gpus-still-win-en","tools","e97caa94-b5de-452f-ae23-ac5c2b2854b3","published","2026-04-08T09:00:50.312+00:00",[31,34,36,38,40],{"name":32,"slug":33},"Nvidia","nvidia",{"name":13,"slug":35},"cuda",{"name":16,"slug":37},"deep-learning",{"name":17,"slug":39},"hpc",{"name":15,"slug":41},"gpu-computing",{"id":27,"slug":43,"title":44,"language":45},"cuda-in-2025-why-gpus-still-win-zh","2025 年 CUDA 為何還是強","zh",[47,53,59,65,71,77],{"id":48,"slug":49,"title":50,"cover_image":51,"image_url":51,"created_at":52,"category":26},"a6c1d84d-0d9c-4a5a-9ca0-960fbfc1412e","why-gemini-api-pricing-is-cheaper-than-it-looks-en","Why Gemini API pricing is cheaper than it 
looks","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778869846824-s2r1.png","2026-05-15T18:30:26.595941+00:00",{"id":54,"slug":55,"title":56,"cover_image":57,"image_url":57,"created_at":58,"category":26},"8b02abfa-eb16-4853-8b15-63d302c7b587","why-vidhub-huiyuan-hutong-bushi-quan-shebei-tongyong-en","Why VidHub 会员互通不是“买一次全设备通用”","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778789439875-uceq.png","2026-05-14T20:10:26.046635+00:00",{"id":60,"slug":61,"title":62,"cover_image":63,"image_url":63,"created_at":64,"category":26},"abe54a57-7461-4659-b2a0-99918dfd2a33","why-buns-zig-to-rust-experiment-is-right-en","Why Bun’s Zig-to-Rust experiment is the right move","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778767895201-5745.png","2026-05-14T14:10:29.298057+00:00",{"id":66,"slug":67,"title":68,"cover_image":69,"image_url":69,"created_at":70,"category":26},"f0015918-251b-43d7-95af-032d2139f3f6","why-openai-api-pricing-is-product-strategy-en","Why OpenAI API pricing is a product strategy, not a footnote","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778749841805-uyhg.png","2026-05-14T09:10:27.921211+00:00",{"id":72,"slug":73,"title":74,"cover_image":75,"image_url":75,"created_at":76,"category":26},"7096dab0-6d27-42d9-b951-7545a5dddf33","why-claude-code-prompt-design-beats-ide-copilots-en","Why Claude Code’s prompt design beats IDE 
copilots","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778742651754-3kxk.png","2026-05-14T07:10:30.953808+00:00",{"id":78,"slug":79,"title":80,"cover_image":81,"image_url":81,"created_at":82,"category":26},"1f1bff1e-0ebc-4fa7-a078-64dc4b552548","why-databricks-model-serving-is-right-default-en","Why Databricks Model Serving is the right default for production infe…","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778692290314-gopj.png","2026-05-13T17:10:32.167576+00:00",[84,89,94,99,104,109,114,119,124,129],{"id":85,"slug":86,"title":87,"created_at":88},"8008f1a9-7a00-4bad-88c9-3eedc9c6b4b1","surepath-ai-mcp-policy-controls-en","SurePath AI's New MCP Policy Controls Enhance AI Security","2026-03-26T01:26:52.222015+00:00",{"id":90,"slug":91,"title":92,"created_at":93},"27e39a8f-b65d-4f7b-a875-859e2b210156","mcp-standard-ai-tools-2026-en","MCP Standard in 2026: Integrating AI Tools","2026-03-26T01:27:43.127519+00:00",{"id":95,"slug":96,"title":97,"created_at":98},"165f9a19-c92d-46ba-b3f0-7125f662921d","rag-2026-transforming-enterprise-ai-en","How RAG in 2026 is Transforming Enterprise AI","2026-03-26T01:28:11.485236+00:00",{"id":100,"slug":101,"title":102,"created_at":103},"6a2a8e6e-b956-49d8-be12-cc47bdc132b2","mastering-ai-prompts-2026-guide-en","Mastering AI Prompts: A 2026 Guide for Developers","2026-03-26T01:29:07.835148+00:00",{"id":105,"slug":106,"title":107,"created_at":108},"d6653030-ee6d-4043-898d-d2de0388545b","evolving-world-prompt-engineering-en","The Evolving World of Prompt Engineering","2026-03-26T01:29:42.061205+00:00",{"id":110,"slug":111,"title":112,"created_at":113},"3ab2c67e-4664-4c67-a013-687a2f605814","garry-tan-open-sources-claude-code-toolkit-en","Garry Tan Open-Sources a Claude Code 
Toolkit","2026-03-26T08:26:20.245934+00:00",{"id":115,"slug":116,"title":117,"created_at":118},"66a7cbf8-7e76-41d4-9bbf-eaca9761bf69","github-ai-projects-to-watch-in-2026-en","20 GitHub AI Projects to Watch in 2026","2026-03-26T08:28:09.752027+00:00",{"id":120,"slug":121,"title":122,"created_at":123},"231306b3-1594-45b2-af81-bb80e41182f2","claude-code-vs-cursor-2026-en","Claude Code vs Cursor in 2026","2026-03-26T13:27:14.177468+00:00",{"id":125,"slug":126,"title":127,"created_at":128},"9f332fda-eace-448a-a292-2283951eee71","practical-github-guide-learning-ml-2026-en","A Practical GitHub Guide to Learning ML in 2026","2026-03-27T01:16:50.125678+00:00",{"id":130,"slug":131,"title":132,"created_at":133},"1b1f637d-0f4d-42bd-974b-07b53829144d","aiml-2026-student-ai-ml-lab-repo-review-en","AIML-2026 Is a Bare-Bones Student Lab Repo","2026-03-27T01:21:51.661231+00:00"]