[TOOLS] 6 min readOraCore Editors

How to Build Rust GPU Kernels with cuda-oxide

Set up cuda-oxide to compile Rust GPU kernels into PTX and run them on NVIDIA GPUs.

Share LinkedIn
How to Build Rust GPU Kernels with cuda-oxide

Set up cuda-oxide to compile Rust GPU kernels into PTX and run them on NVIDIA GPUs.

If you are a Rust developer who wants to write SIMT GPU kernels without switching to C++ or a DSL, this guide shows you how to get NVIDIA’s experimental cuda-oxide GitHub repository and the companion docs working end to end. By the end, you will have a local build that emits a host binary plus PTX, runs the vecadd example, and lets you inspect the compilation pipeline.

This is a practical setup guide, so each step ends with a concrete result you can verify. You will install the right Rust nightly, LLVM, and Clang versions, confirm your CUDA toolchain, and then run your first kernel build.

Before you start

Get the latest AI news in your inbox

Weekly picks of model releases, tools, and deep dives — no spam, unsubscribe anytime.

No spam. Unsubscribe at any time.

  • Linux only, tested on Ubuntu 24.04
  • Rust nightly pinned by the repo, currently nightly-2026-04-03
  • rust-src and rustc-dev Rust components
  • CUDA Toolkit 12.x+
  • LLVM 21+ with NVPTX backend support
  • Clang 21 or libclang-common-21-dev
  • Git
  • NVIDIA GPU with a working CUDA driver

Step 1: Verify your CUDA and GPU toolchain

Goal: confirm your machine has a compatible NVIDIA stack before you install the compiler backend. cuda-oxide depends on a CUDA-capable Linux environment, and the article notes that LLVM 21 is required for newer GPU features such as Hopper and Blackwell support.

How to Build Rust GPU Kernels with cuda-oxide
nvcc --version
nvidia-smi
llc-21 --version | grep nvptx

You should see a CUDA Toolkit 12.x version from nvcc, your GPU listed in nvidia-smi, and an NVPTX line in llc-21 output. If NVPTX is missing, stop here and install an LLVM build that includes the GPU backend.

Step 2: Install the pinned Rust nightly

Goal: match the repository’s compiler expectations so the custom codegen backend can hook into rustc internals. cuda-oxide uses nightly Rust plus rust-src and rustc-dev, and the repo pins nightly-2026-04-03 for reproducible builds.

How to Build Rust GPU Kernels with cuda-oxide
rustup toolchain install nightly-2026-04-03
rustup component add rust-src rustc-dev --toolchain nightly-2026-04-03
rustup show

You should see the nightly-2026-04-03 toolchain installed and selected. If rustc-dev is missing, the backend will not be able to use compiler APIs, so fix that before moving on.

Step 3: Install LLVM 21 and Clang 21

Goal: provide the external llc binary and the libclang headers that cuda-oxide needs during build and code generation. The backend emits textual LLVM IR and then hands it to llc, while bindgen needs Clang’s resource headers to find standard includes.

sudo apt update
sudo apt install llvm-21 clang-21 libclang-common-21-dev
llc-21 --version | grep nvptx
clang-21 --version

You should see both LLVM 21 and Clang 21 installed, plus NVPTX in llc-21 output. If bindgen later complains about stddef.h not found, it usually means you installed only a runtime libclang package instead of the full Clang headers.

Step 4: Clone cuda-oxide and install cargo-oxide

Goal: get the repo, the cargo subcommand, and the codegen backend in place so Cargo can route kernel builds through cuda-oxide. The project uses cargo oxide build, run, debug, and pipeline commands to drive the full Rust to PTX flow.

git clone https://github.com/NVlabs/cuda-oxide.git
cd cuda-oxide
cargo install --git https://github.com/NVlabs/cuda-oxide.git cargo-oxide
cargo oxide doctor

You should see cargo-oxide install cleanly, then a doctor report that checks Rust, CUDA, LLVM, Clang, and the backend binary. Any red item in doctor output should be fixed before you try to compile a kernel.

Step 5: Run the vecadd kernel example

Goal: build and execute the sample GPU kernel so you know the whole pipeline works. vecadd adds 1,024 f32 values on the GPU and validates the result on the host, which makes it an ideal first smoke test.

cargo oxide run vecadd

You should see a success message that all 1,024 elements are correct. That confirms the host binary launched, the device code compiled to PTX, and the CUDA driver loaded the generated kernel.

Step 6: Inspect the Rust to PTX pipeline

Goal: verify that cuda-oxide is really compiling through Stable MIR, Pliron, LLVM IR, and PTX rather than hiding the steps. This is useful when you are debugging compiler behavior or validating what the backend emits at each stage.

cargo oxide pipeline vecadd
cargo oxide debug vecadd --tui

You should see a stage-by-stage trace from Rust MIR through dialect-mir, mem2reg, dialect-llvm, LLVM IR, and finally PTX output. The debug command should open a TUI for cuda-gdb, which confirms the project can hand off device debugging when needed.

MetricBefore/BaselineAfter/Result
Compilation targetRust source plus separate CUDA flowSingle-source Rust emits host binary and PTX
Kernel build pathC++/CUDA or a DSL wrapperRust kernel code compiled directly to PTX
Toolchain requirementMixed Rust, C++, and CUDA toolingRust nightly, LLVM 21, Clang 21, CUDA Toolkit 12.x+

Common mistakes

  • Using an older LLVM build. Fix: install LLVM 21 or later, and confirm llc-21 --version shows NVPTX support.
  • Installing only libclang runtime packages. Fix: install clang-21 or libclang-common-21-dev so bindgen can find the resource headers.
  • Skipping rustc-dev or the pinned nightly. Fix: switch to nightly-2026-04-03 and add rust-src plus rustc-dev with rustup.

What's next

Once vecadd works, try writing your own #[kernel] function, then use cargo oxide pipeline on it to inspect how monomorphized Rust, device intrinsics, and barrier semantics survive the trip to PTX. From there, you can explore the cuda-device crate, the compiler backend source, and the project’s coordination with rust-cuda for broader Rust GPU development.