cuTile.jl v0.3 adds CUDA.jl support and faster startup

cuTile.jl v0.3 adds CUDA.jl integration, faster startup, and GPU random number generation.
cuTile.jl v0.3 landed with a tighter link to CUDA.jl, and the author says launching a tile kernel is now as simple as `@cuda backend=cuTile ...`. The package also claims parity or better results versus NVIDIA’s cuTile Python on every benchmark it ships, plus much lower latency for first-time execution.
| Feature | v0.3 detail | Reported number |
|---|---|---|
| Kernel launch | CUDA.jl integration | `@cuda backend=cuTile` |
| Latency | TTFX for a trivial kernel | ~1.8s |
| Benchmark claim | Compared with NVIDIA cuTile Python | Matches or outperforms on every shipped test |
| Webinar | Joint session with Andy Terrel | May 12, 2026 at 1 PM ET |
## What changed in v0.3
The headline change is integration with CUDA.jl, which matters because it lowers the friction for Julia GPU users who already know the CUDA.jl API. Instead of treating cuTile as a separate workflow, the package now plugs into the familiar @cuda path and switches execution through the backend flag.

That sounds like a small ergonomics update, but it changes how quickly people can try the package in real code. GPU tooling often loses users in the first ten minutes, when setup feels different from the rest of their stack. By keeping the launch syntax close to standard Julia GPU code, cuTile.jl reduces that initial tax.
- Kernel launch now uses `@cuda backend=cuTile`
- The package keeps working inside the CUDA.jl programming model
- TTFX for a trivial kernel is reported at about 1.8 seconds
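The launch path above can be sketched as follows. Only the `@cuda backend=cuTile` flag and the `ct.load`/`ct.store` calls come from the announcement; the kernel body, the `ct` alias, and the argument handling are illustrative assumptions.

```julia
using CUDA, cuTile
const ct = cuTile  # hypothetical alias matching the `ct.` prefix in the announcement

# Hypothetical tile kernel: copy the tile this block is responsible
# for from `src` into `dst`. The exact signature is an assumption.
function copy_kernel(dst, src)
    t = ct.load(src)   # load a tile
    ct.store(dst, t)   # write it back out
    return
end

A = CUDA.rand(Float32, 1024, 1024)
B = similar(A)

# The v0.3 launch path: the familiar @cuda macro from CUDA.jl,
# switched to tile execution via the `backend` keyword.
@cuda backend=cuTile copy_kernel(B, A)
```

The point of the design is visible in the last line: nothing about the launch differs from ordinary CUDA.jl code except the `backend` keyword.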
## Why the performance claims matter
The author says cuTile.jl v0.3 now matches or beats NVIDIA’s cuTile Python on every benchmark shipped with the release. That is a strong claim, but the more interesting part is the latency story. Time to first execution, or TTFX, is one of the biggest annoyances in Julia GPU work, especially when you are testing small kernels.
For a trivial kernel, the reported TTFX is about 1.8 seconds on the author’s system. That puts cuTile.jl in the same ballpark as regular CUDA.jl kernels, which is a practical win for anyone iterating on GPU code. If a tile-based API can keep performance high without making startup worse, then it becomes easier to justify using it in day-to-day work.
> “We now match or outperform NVIDIA’s cuTile Python on every benchmark we ship.”
>
> maleadt, cuTile.jl v0.3 announcement on Discourse
The benchmark language matters because GPU programmers tend to be skeptical of API changes that promise cleaner code at the expense of speed. Here, the release note points in the opposite direction: cleaner integration and better numbers together. That combination is what makes people pay attention.
## Random numbers and slicing make the API more usable
v0.3 also adds random number generation on both the host side and inside kernels. The announcement says performance matches or beats cuRAND and the newer GPUArrays.jl generator. For scientific computing and simulation work, that matters because random numbers are part of the workload, not a side feature.

The other notable addition is array slicing. With `@view A[i:j, :]`, users can produce a sub-range `TileArray` and pass it to `ct.load` or `ct.store`. That makes tiled GPU code easier to compose with Julia’s normal array idioms, which is exactly where a lot of GPU packages either feel native or feel bolted on.
- Host-level random number generation is included
- In-kernel random number generation is included
- Performance is reported to match or exceed cuRAND and GPUArrays.jl’s generator
- `@view`-based slicing now produces `TileArray` sub-ranges
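A minimal sketch of how the slicing addition composes with the rest of the API. Again, only `@view`-based sub-ranges, `ct.load`/`ct.store`, and the `backend=cuTile` flag come from the release notes; the `double_kernel` name and body are assumptions for illustration.

```julia
using CUDA, cuTile
const ct = cuTile  # hypothetical alias matching the `ct.` prefix in the announcement

# Hypothetical kernel that doubles the tile it loads.
function double_kernel(dst, src)
    t = ct.load(src)
    ct.store(dst, 2f0 .* t)
    return
end

# Host-side random initialization; v0.3 reports RNG performance
# matching or beating cuRAND and the GPUArrays.jl generator.
A = CUDA.rand(Float32, 256, 256)
B = CUDA.zeros(Float32, 128, 256)

# @view produces a sub-range TileArray that can be passed straight
# into a tile kernel, just like a normal Julia array view.
sub = @view A[1:128, :]
@cuda backend=cuTile double_kernel(B, sub)
```

Being able to hand a plain `@view` to a tile kernel is what makes the API feel native rather than bolted on: the slicing idiom is the one Julia users already write for CPU arrays.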
## A webinar is part of the rollout
The release is paired with a webinar on May 12, 2026 at 1 PM ET, hosted by JuliaHub. The session features Andy Terrel from NVIDIA and the package author, and it will cover CUDA Tile’s design, how cuTile.jl sits on top of it, and worked examples.
That matters because tile-based GPU programming is still a niche topic for many Julia users. A live walkthrough can do what release notes cannot: show how the abstractions fit together, where the sharp edges are, and what the mental model looks like when the code is actually running.
If you want the deeper technical write-up, the announcement points to the JuliaGPU article on cuTile.jl 0.3. There is also a related OraCore.dev piece on tile-based GPU work in Julia at /news/block-tile-based-gpu-programming-not-scratch.
## What this means for Julia GPU users
cuTile.jl v0.3 is not trying to win by adding another exotic API. It is trying to make tile-based GPU programming fit into the Julia tooling people already use, while keeping launch time and benchmark results competitive. That is a more credible strategy than asking developers to accept a slower path for a cleaner abstraction.
The real test is adoption. If the CUDA.jl integration lowers the setup cost and the random-number and slicing support hold up in real projects, cuTile.jl could become a practical option for users who want tile-oriented GPU code without leaving Julia’s usual workflow. The next question is whether package authors start building on it in libraries, not just in examples.
My bet: the webinar will matter less for the announcement itself and more for whether it gives people enough confidence to try cuTile.jl in existing CUDA.jl projects. If that happens, the package has a clear path from interesting demo to everyday tool.