cuTile.jl v0.3 adds CUDA.jl support and faster startup

cuTile.jl v0.3 adds CUDA.jl integration, faster startup, and GPU random number generation.
cuTile.jl v0.3 landed with a tighter link to CUDA.jl, and the author says launching a tile kernel is now as simple as `@cuda backend=cuTile ...`. The package also claims parity or better results versus NVIDIA’s cuTile Python on every benchmark it ships, plus much lower latency for first-time execution.
| Feature | v0.3 detail | Reported number |
|---|---|---|
| Kernel launch | CUDA.jl integration | `@cuda backend=cuTile` |
| Latency | TTFX for a trivial kernel | ~1.8s |
| Benchmark claim | Compared with NVIDIA cuTile Python | Matches or outperforms on every shipped test |
| Webinar | Joint session with Andy Terrel | May 12, 2026 at 1 PM ET |
## What changed in v0.3
The headline change is integration with CUDA.jl, which matters because it lowers the friction for Julia GPU users who already know the CUDA.jl API. Instead of treating cuTile as a separate workflow, the package now plugs into the familiar @cuda path and switches execution through the backend flag.

That sounds like a small ergonomics update, but it changes how quickly people can try the package in real code. GPU tooling often loses users in the first ten minutes, when setup feels different from the rest of their stack. By keeping the launch syntax close to standard Julia GPU code, cuTile.jl reduces that initial tax.
- Kernel launch now uses `@cuda backend=cuTile`
- The package keeps working inside the CUDA.jl programming model
- TTFX for a trivial kernel is reported at about 1.8 seconds
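The launch path above can be sketched as follows. Only the `@cuda backend=cuTile` flag and the `ct.load`/`ct.store` calls come from the announcement; the kernel body, the `ct` alias, and the argument handling are illustrative assumptions.

```julia
using CUDA, cuTile
const ct = cuTile  # hypothetical alias matching the `ct.` prefix in the announcement

# Hypothetical tile kernel: copy the tile this block is responsible
# for from `src` into `dst`. The exact signature is an assumption.
function copy_kernel(dst, src)
    t = ct.load(src)   # load a tile
    ct.store(dst, t)   # write it back out
    return
end

A = CUDA.rand(Float32, 1024, 1024)
B = similar(A)

# The v0.3 launch path: the familiar @cuda macro from CUDA.jl,
# switched to tile execution via the `backend` keyword.
@cuda backend=cuTile copy_kernel(B, A)
```

The point of the design is visible in the last line: nothing about the launch differs from ordinary CUDA.jl code except the `backend` keyword.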
## Why the performance claims matter
The author says cuTile.jl v0.3 now matches or beats NVIDIA’s cuTile Python on every benchmark shipped with the release. That is a strong claim, but the more interesting part is the latency story. Time to first execution, or TTFX, is one of the biggest annoyances in Julia GPU work, especially when you are testing small kernels.
For a trivial kernel, the reported TTFX is about 1.8 seconds on the author’s system. That puts cuTile.jl in the same ballpark as regular CUDA.jl kernels, which is a practical win for anyone iterating on GPU code. If a tile-based API can keep performance high without making startup worse, then it becomes easier to justify using it in day-to-day work.
> “We now match or outperform NVIDIA’s cuTile Python on every benchmark we ship.”
>
> maleadt, cuTile.jl v0.3 announcement on Discourse
The benchmark language matters because GPU programmers tend to be skeptical of API changes that promise cleaner code at the expense of speed. Here, the release note points in the opposite direction: cleaner integration and better numbers together. That combination is what makes people pay attention.
## Random numbers and slicing make the API more usable
v0.3 also adds random number generation on both the host side and inside kernels. The announcement says performance matches or beats cuRAND and the newer GPUArrays.jl generator. For scientific computing and simulation work, that matters because random numbers are part of the workload, not a side feature.

The other notable addition is array slicing. With `@view A[i:j, :]`, users can produce a sub-range `TileArray` and pass it to `ct.load` or `ct.store`. That makes tiled GPU code easier to compose with Julia’s normal array idioms, which is exactly where a lot of GPU packages either feel native or feel bolted on.
- Host-level random number generation is included
- In-kernel random number generation is included
- Performance is reported to match or exceed cuRAND and GPUArrays.jl’s generator
- `@view`-based slicing now produces `TileArray` sub-ranges
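A minimal sketch of how the slicing addition composes with the rest of the API. Again, only `@view`-based sub-ranges, `ct.load`/`ct.store`, and the `backend=cuTile` flag come from the release notes; the `double_kernel` name and body are assumptions for illustration.

```julia
using CUDA, cuTile
const ct = cuTile  # hypothetical alias matching the `ct.` prefix in the announcement

# Hypothetical kernel that doubles the tile it loads.
function double_kernel(dst, src)
    t = ct.load(src)
    ct.store(dst, 2f0 .* t)
    return
end

# Host-side random initialization; v0.3 reports RNG performance
# matching or beating cuRAND and the GPUArrays.jl generator.
A = CUDA.rand(Float32, 256, 256)
B = CUDA.zeros(Float32, 128, 256)

# @view produces a sub-range TileArray that can be passed straight
# into a tile kernel, just like a normal Julia array view.
sub = @view A[1:128, :]
@cuda backend=cuTile double_kernel(B, sub)
```

Being able to hand a plain `@view` to a tile kernel is what makes the API feel native rather than bolted on: the slicing idiom is the one Julia users already write for CPU arrays.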
## A webinar is part of the rollout
The release is paired with a webinar on May 12, 2026 at 1 PM ET, hosted by JuliaHub. The session features Andy Terrel from NVIDIA and the package author, and it will cover CUDA Tile’s design, how cuTile.jl sits on top of it, and worked examples.
That matters because tile-based GPU programming is still a niche topic for many Julia users. A live walkthrough can do what release notes cannot: show how the abstractions fit together, where the sharp edges are, and what the mental model looks like when the code is actually running.
If you want the deeper technical write-up, the announcement points to the JuliaGPU article on cuTile.jl 0.3. There is also a related OraCore.dev piece on tile-based GPU work in Julia at /news/block-tile-based-gpu-programming-not-scratch.
## What this means for Julia GPU users
cuTile.jl v0.3 is not trying to win by adding another exotic API. It is trying to make tile-based GPU programming fit into the Julia tooling people already use, while keeping launch time and benchmark results competitive. That is a more credible strategy than asking developers to accept a slower path for a cleaner abstraction.
The real test is adoption. If the CUDA.jl integration lowers the setup cost and the random-number and slicing support hold up in real projects, cuTile.jl could become a practical option for users who want tile-oriented GPU code without leaving Julia’s usual workflow. The next question is whether package authors start building on it in libraries, not just in examples.
My bet: the webinar will matter less for the announcement itself and more for whether it gives people enough confidence to try cuTile.jl in existing CUDA.jl projects. If that happens, the package has a clear path from interesting demo to everyday tool.