[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-cutile-jl-v0-3-cuda-jl-support-faster-startup-en":3,"tags-cutile-jl-v0-3-cuda-jl-support-faster-startup-en":34,"related-lang-cutile-jl-v0-3-cuda-jl-support-faster-startup-en":45,"related-posts-cutile-jl-v0-3-cuda-jl-support-faster-startup-en":49,"series-tools-b102dcec-143f-41ab-a0f5-f1364f86552a":86},{"id":4,"title":5,"content":6,"summary":7,"source":8,"source_url":9,"author":10,"image_url":11,"keywords":12,"language":18,"translated_content":10,"views":19,"is_premium":20,"created_at":21,"updated_at":21,"cover_image":11,"published_at":22,"rewrite_status":23,"rewrite_error":10,"rewritten_from_id":24,"slug":25,"category":26,"related_article_id":27,"status":28,"google_indexed_at":29,"x_posted_at":10,"tweet_text":10,"title_rewritten_at":10,"title_original":10,"key_takeaways":30,"topic_cluster_id":10,"embedding":10,"is_canonical_seed":20},"b102dcec-143f-41ab-a0f5-f1364f86552a","cuTile.jl v0.3 adds CUDA.jl support and faster startup","\u003Cp data-speakable=\"summary\">cuTile.jl v0.3 adds \u003Ca href=\"\u002Ftag\u002Fcuda\">CUDA\u003C\u002Fa>.jl integration, faster startup, and \u003Ca href=\"\u002Ftag\u002Fgpu\">GPU\u003C\u002Fa> random number generation.\u003C\u002Fp>\u003Cp>\u003Ca href=\"https:\u002F\u002Fdiscourse.julialang.org\u002Ft\u002Fann-cutile-jl-v0-3-webinar\u002F136988\" target=\"_blank\" rel=\"noopener\">cuTile.jl\u003C\u002Fa> v0.3 landed with a tighter link to \u003Ca href=\"https:\u002F\u002Fcuda.juliagpu.org\u002Fstable\u002F\" target=\"_blank\" rel=\"noopener\">CUDA.jl\u003C\u002Fa>, and the author says launching a tile kernel is now as simple as \u003Ccode>@cuda backend=cuTile ...\u003C\u002Fcode>. The package also claims parity or better results versus \u003Ca href=\"\u002Ftag\u002Fnvidia\">NVIDIA\u003C\u002Fa>’s cuTile Python on every benchmark it ships, plus much lower latency for first-time execution.\u003C\u002Fp>\u003Ctable>\u003Cthead>\u003Ctr>\u003Cth>Feature\u003C\u002Fth>\u003Cth>v0.3 detail\u003C\u002Fth>\u003Cth>Reported number\u003C\u002Fth>\u003C\u002Ftr>\u003C\u002Fthead>\u003Ctbody>\u003Ctr>\u003Ctd>Kernel launch\u003C\u002Ftd>\u003Ctd>CUDA.jl integration\u003C\u002Ftd>\u003Ctd>\u003Ccode>@cuda backend=cuTile\u003C\u002Fcode>\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>Latency\u003C\u002Ftd>\u003Ctd>TTFX for a trivial kernel\u003C\u002Ftd>\u003Ctd>~1.8s\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>Benchmark claim\u003C\u002Ftd>\u003Ctd>Compared with NVIDIA cuTile Python\u003C\u002Ftd>\u003Ctd>Matches or outperforms on every shipped test\u003C\u002Ftd>\u003C\u002Ftr>\u003Ctr>\u003Ctd>Webinar\u003C\u002Ftd>\u003Ctd>Joint session with Andy Terrel\u003C\u002Ftd>\u003Ctd>May 12, 2026 at 1 PM ET\u003C\u002Ftd>\u003C\u002Ftr>\u003C\u002Ftbody>\u003C\u002Ftable>\u003Ch2>What changed in v0.3\u003C\u002Fh2>\u003Cp>The headline change is integration with CUDA.jl, which matters because it lowers the friction for Julia GPU users who already know the CUDA.jl \u003Ca href=\"\u002Ftag\u002Fapi\">API\u003C\u002Fa>. 
That sounds like a small ergonomics update, but it changes how quickly people can try the package in real code. GPU tooling often loses users in the first ten minutes, when setup feels different from the rest of their stack. By keeping the launch syntax close to standard Julia GPU code, cuTile.jl reduces that initial tax.

- Kernel launch now uses `@cuda backend=cuTile`
- The package keeps working inside the CUDA.jl programming model
- TTFX for a trivial kernel is reported at about 1.8 seconds

## Why the performance claims matter

The author says cuTile.jl v0.3 now matches or beats [NVIDIA’s cuTile Python](https://github.com/NVIDIA/cutile) on every benchmark shipped with the release. That is a strong claim, but the more interesting part is the latency story. Time-to-first-execution, or TTFX, is one of the biggest annoyances in Julia GPU work, especially when you are testing small kernels.

For a trivial kernel, the reported TTFX is about 1.8 seconds on the author’s system. That puts cuTile.jl in the same ballpark as regular CUDA.jl kernels, which is a practical win for anyone iterating on GPU code. If a tile-based API can keep performance high without making startup worse, then it becomes easier to justify using it in day-to-day work.
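The ~1.8 s figure is the author’s measurement on his machine. If you want to sanity-check it on your own hardware, a rough approach is to time the very first launch of a trivial kernel in a fresh Julia session; the no-op kernel and the use of `@elapsed` below are our own sketch, not something from the announcement.

```julia
# Rough TTFX check: time the very first launch of a trivial kernel in a
# fresh Julia session, so compilation cost is included in the measurement.
using CUDA
import cuTile

function noop_kernel()
    return                        # trivial body, only there to be compiled
end

ttfx = @elapsed @cuda backend=cuTile noop_kernel()
println("first tile-kernel launch took $(round(ttfx; digits=2)) s")
```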
> “We now match or outperform NVIDIA’s cuTile Python on every benchmark we ship.”
>
> maleadt, cuTile.jl v0.3 announcement on Discourse

The benchmark language matters because GPU programmers tend to be skeptical of API changes that promise cleaner code at the expense of speed. Here, the release note points in the opposite direction: cleaner integration and better numbers together. That combination is what makes people pay attention.

## Random numbers and slicing make the API more usable

v0.3 also adds random number generation on both the host side and inside kernels. The announcement says performance matches or beats [cuRAND](https://docs.nvidia.com/cuda/curand/) and the newer GPUArrays.jl generator. For scientific computing and simulation work, that matters because random numbers are part of the workload, not a side feature.

The other notable addition is array slicing. With `@view A[i:j, :]`, users can produce a sub-range `TileArray` and pass it to `ct.load` or `ct.store` (sketched after the list below). That makes tiled GPU code easier to compose with Julia’s normal array idioms, which is exactly where a lot of GPU packages either feel native or feel bolted on.

- Host-level random number generation is included
- In-kernel random number generation is included
- Performance is reported to match or exceed cuRAND and GPUArrays.jl’s generator
- `@view`-based slicing now produces `TileArray` sub-ranges
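Here is a rough sense of how that composes, again as a sketch only: the announcement confirms that a `@view A[i:j, :]` sub-range is a valid argument to `ct.load`/`ct.store`, but the kernel structure, the `ct.bid` index helper, the single-argument load signature, and whether the view is taken inside the kernel are assumptions of ours.

```julia
# Sketch of slicing with @view and feeding the sub-range to ct.load/ct.store.
# The view-as-argument idea is from the announcement; the kernel shape and
# exact signatures below are assumptions.
using CUDA
import cuTile
const ct = cuTile                                  # assumed alias

function copy_rows(dst, src, rows_per_block)
    b = ct.bid(1)                                  # hypothetical block-index helper
    lo = (b - 1) * rows_per_block + 1
    hi = lo + rows_per_block - 1
    tile = ct.load(@view(src[lo:hi, :]))           # load the sliced sub-range as a tile
    ct.store(@view(dst[lo:hi, :]), tile)           # store it into the matching slice
    return
end

src = CUDA.rand(Float32, 1024, 64)                 # host-side data via CUDA.jl's generator
dst = similar(src)
@cuda backend=cuTile blocks=8 copy_rows(dst, src, 128)
```

Even if the real signatures differ, the composition point stands: slices produced with Julia’s own `@view` idiom are meant to flow straight into the tile load and store calls.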
## A webinar is part of the rollout

The release is paired with a webinar on May 12, 2026 at 1 PM ET, hosted by [JuliaHub](https://juliahub.com/). The session features Andy Terrel from [NVIDIA](https://www.nvidia.com/en-us/) and the package author, and it will cover CUDA Tile’s design, how cuTile.jl sits on top of it, and worked examples.

That matters because tile-based GPU programming is still a niche topic for many Julia users. A live walkthrough can do what release notes cannot: show how the abstractions fit together, where the sharp edges are, and what the mental model looks like when the code is actually running.

If you want the deeper technical write-up, the announcement points to the JuliaGPU article on [cuTile.jl 0.3](https://juliagpu.org/post/2026-05-05-cutile-0.3/). There is also a related OraCore.dev piece on tile-based GPU work in Julia at [/news/block-tile-based-gpu-programming-not-scratch](/news/block-tile-based-gpu-programming-not-scratch).

## What this means for Julia GPU users

cuTile.jl v0.3 is not trying to win by adding another exotic API. It is trying to make tile-based GPU programming fit into the Julia tooling people already use, while keeping launch time and benchmark results competitive. That is a more credible strategy than asking developers to accept a slower path for a cleaner abstraction.

The real test is adoption. If the CUDA.jl integration lowers the setup cost and the random-number and slicing support hold up in real projects, cuTile.jl could become a practical option for users who want tile-oriented GPU code without leaving Julia’s usual workflow. The next question is whether package authors start building on it in libraries, not just in examples.

My bet: the webinar will matter less for the announcement itself and more for whether it gives people enough confidence to try cuTile.jl in existing CUDA.jl projects. If that happens, the package has a clear path from interesting demo to everyday tool.
copilots","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778742651754-3kxk.png","2026-05-14T07:10:30.953808+00:00",{"id":81,"slug":82,"title":83,"cover_image":84,"image_url":84,"created_at":85,"category":26},"1f1bff1e-0ebc-4fa7-a078-64dc4b552548","why-databricks-model-serving-is-right-default-en","Why Databricks Model Serving is the right default for production infe…","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778692290314-gopj.png","2026-05-13T17:10:32.167576+00:00",[87,92,97,102,107,112,117,122,127,132],{"id":88,"slug":89,"title":90,"created_at":91},"8008f1a9-7a00-4bad-88c9-3eedc9c6b4b1","surepath-ai-mcp-policy-controls-en","SurePath AI's New MCP Policy Controls Enhance AI Security","2026-03-26T01:26:52.222015+00:00",{"id":93,"slug":94,"title":95,"created_at":96},"27e39a8f-b65d-4f7b-a875-859e2b210156","mcp-standard-ai-tools-2026-en","MCP Standard in 2026: Integrating AI Tools","2026-03-26T01:27:43.127519+00:00",{"id":98,"slug":99,"title":100,"created_at":101},"165f9a19-c92d-46ba-b3f0-7125f662921d","rag-2026-transforming-enterprise-ai-en","How RAG in 2026 is Transforming Enterprise AI","2026-03-26T01:28:11.485236+00:00",{"id":103,"slug":104,"title":105,"created_at":106},"6a2a8e6e-b956-49d8-be12-cc47bdc132b2","mastering-ai-prompts-2026-guide-en","Mastering AI Prompts: A 2026 Guide for Developers","2026-03-26T01:29:07.835148+00:00",{"id":108,"slug":109,"title":110,"created_at":111},"d6653030-ee6d-4043-898d-d2de0388545b","evolving-world-prompt-engineering-en","The Evolving World of Prompt Engineering","2026-03-26T01:29:42.061205+00:00",{"id":113,"slug":114,"title":115,"created_at":116},"3ab2c67e-4664-4c67-a013-687a2f605814","garry-tan-open-sources-claude-code-toolkit-en","Garry Tan Open-Sources a Claude Code Toolkit","2026-03-26T08:26:20.245934+00:00",{"id":118,"slug":119,"title":120,"created_at":121},"66a7cbf8-7e76-41d4-9bbf-eaca9761bf69","github-ai-projects-to-watch-in-2026-en","20 GitHub AI Projects to Watch in 2026","2026-03-26T08:28:09.752027+00:00",{"id":123,"slug":124,"title":125,"created_at":126},"231306b3-1594-45b2-af81-bb80e41182f2","claude-code-vs-cursor-2026-en","Claude Code vs Cursor in 2026","2026-03-26T13:27:14.177468+00:00",{"id":128,"slug":129,"title":130,"created_at":131},"9f332fda-eace-448a-a292-2283951eee71","practical-github-guide-learning-ml-2026-en","A Practical GitHub Guide to Learning ML in 2026","2026-03-27T01:16:50.125678+00:00",{"id":133,"slug":134,"title":135,"created_at":136},"1b1f637d-0f4d-42bd-974b-07b53829144d","aiml-2026-student-ai-ml-lab-repo-review-en","AIML-2026 Is a Bare-Bones Student Lab Repo","2026-03-27T01:21:51.661231+00:00"]