Nx.Vulkan

A GPU tensor backend for Nx that runs on anything with a Vulkan driver — including FreeBSD, where CUDA and Metal don't exist.

✓ Linux + NVIDIA RTX 3060 Ti      (proprietary driver)
✓ FreeBSD + NVIDIA GT 750M        (NVIDIA legacy driver)
✓ FreeBSD + NVIDIA GT 650M        (NVIDIA legacy driver)

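Two backends live in this repo:

  1. Nx.Vulkan.VulkanoBackend: the current backend, built on the Rust vulkano crate.
  2. Nx.Vulkan.Backend: the legacy backend, a C++ shim over spirit.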

What works today

| Capability | VulkanoBackend | spirit (legacy) |
|---|---|---|
| Buffer alloc / upload / download | Arc-managed | raw pointer |
| Elementwise binary (add/sub/mul/div/pow/max/min) | ✓ f32 + f64 | ✓ f32 |
| Elementwise unary (exp/log/sqrt/sigmoid/tanh/abs/neg/floor/ceil/sign) | ✓ f32 + f64 | ✓ f32 |
| Reductions (sum, reduce_max, reduce_min, axis + leading + trailing) | ✓ f32 + f64 | ✓ f32 |
| Shape / movement (reshape, squeeze, transpose-2D) | | |
| Matmul (rank-2 × rank-2) | ✓ f32 | ✓ f32 |
| Slice / as_type / general dot axes | host-fallback | host-fallback |
| Chain shader synthesis (Mission II) | dispatches generated SPV | dispatches generated SPV |
| Nx.Defn.grad (autograd) | works automatically | works automatically |
| Axon training step | ✓ validated (1e-8 grad match vs BinaryBackend) | partial |
| eXMC NUTS sampler integration | ✓ regime log_p byte-identical | ✓ chain-shader path |
| Scholar linear regression | ✓ coefficients match to 2e-6 (SVD via host-fallback) | partial |
| Long-running workloads (5000+ dispatches) | ✓ pipeline cache | ✗ stale-buffer crash class |

The chain-shader dispatch path (leapfrog_chain_synth) is shared between both backends — same SPV cache, same content-addressing, same runtime synthesis pipeline. The difference is the backend that handles general Nx tensor ops outside that dispatch.
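To make "content-addressing" concrete, here is a minimal sketch of a content-addressed shader cache. It is not the nx_vulkan implementation: the module name SpvCacheSketch, the choice of SHA-256, and the assumption that the key is derived from the SPV bytes themselves are all hypothetical. The point is only that identical bytes always resolve to the same file, so either backend can reuse what the other wrote.

```elixir
defmodule SpvCacheSketch do
  @moduledoc "Hypothetical content-addressed cache; not the nx_vulkan code."

  # The key is the SHA-256 of the shader bytes (an assumption for this sketch),
  # so identical shaders always map to the same file name.
  def key(spv_bytes) when is_binary(spv_bytes) do
    :crypto.hash(:sha256, spv_bytes) |> Base.encode16(case: :lower)
  end

  def path(cache_dir, spv_bytes), do: Path.join(cache_dir, key(spv_bytes) <> ".spv")

  # Writes the shader only if no file with this content hash exists yet.
  def put(cache_dir, spv_bytes) do
    file = path(cache_dir, spv_bytes)
    File.mkdir_p!(cache_dir)
    status = if File.exists?(file), do: :hit, else: :miss
    if status == :miss, do: File.write!(file, spv_bytes)
    {status, file}
  end
end

# Placeholder bytes: the SPIR-V magic number plus padding, not a valid module.
shader = <<0x07230203::little-32, 0, 0, 0>>
cache_dir = Path.join(System.tmp_dir!(), "spv_cache_sketch_#{System.unique_integer([:positive])}")

{first_status, file} = SpvCacheSketch.put(cache_dir, shader)
{second_status, same_file} = SpvCacheSketch.put(cache_dir, shader)
IO.inspect({first_status, second_status, file == same_file})
```

The first put is a miss and writes the file; the second is a hit on the same path, which is the property a cache shared by both backends relies on.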

Why two backends

The spirit backend reached production first — chain-shader synthesis, runtime SPV compilation, content-addressed disk cache, and a long-lived Nx.Vulkan.Node GenServer. Then a use-after-free in the C++ FFI layer crashed the live trader three minutes after every restart. The failure surfaced as Nx.Vulkan.Native.byte_size raising :badarg on a stale VkBuf* pointer — a classic FFI ownership leak the C++ type system cannot detect. The vulkano backend grew from a spike that proved the migration was mechanical: same SPV bytes in, byte-identical chain tensors out, perf within ten percent on the bench target.

The two coexist while we backfill the long tail of ops. Long-term, the spirit path retires.

See docs/VULKANO_BACKEND_ROADMAP.md for the full stage breakdown. The full story is in The Backend That Didn't Need to Know.

Benchmarks (May 2026)

Square matmul, milliseconds per dispatch, median of 50–200 iterations:

| size | BinaryBackend (super-io) | BinaryBackend (mac-247) | vulkano (super-io) | vulkano (mac-247) | spirit (mac-247) |
|---|---|---|---|---|---|
| 16×16 | 2.76 | 2.51 | 1.18 | 1.06 | 1.16 |
| 64×64 | 130.76 | 158.45 | 7.07 | 7.92 | 7.56 |
| 256×256 | 20,097 | 13,891 | 149.19 | 136.10 | 141.73 |
| 1024×1024 | n/a (hours) | n/a (hours) | 2,323 | 2,843 | 2,845 |

Two observations:

  1. vulkano and spirit agree within 5% at 64×64 and above on mac-247, the one host where both completed; at 16×16 the gap is about 9% (1.06 ms vs 1.16 ms). The C++ path doesn't buy back its maintenance cost.
  2. The Vulkan path beats BinaryBackend by 98–135× at 256×256: spirit 98× and vulkano 102× on mac-247 (GT 650M), vulkano 135× on super-io. The GT 650M is more than a decade old; what changes is moving the loop off the BEAM scheduler.

The C++ spirit path crashed on Linux super-io mid-bench with a memory-supervisor high-watermark warning — same fragility class that motivated the migration in the first place. The vulkano path completed cleanly on both hosts. Full bench script: examples/full_bench.exs.
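The ratios behind these observations can be recomputed directly from the 256×256 row of the table. The snippet below is plain arithmetic on the published medians; the variable names are illustrative only.

```elixir
# Median ms per dispatch at 256×256, copied from the benchmark table.
timings = %{
  "super-io" => %{bin: 20_097.0, vulkano: 149.19},
  "mac-247" => %{bin: 13_891.0, vulkano: 136.10, spirit: 141.73}
}

# Speedup of the vulkano path over BinaryBackend on each host.
speedups = Map.new(timings, fn {host, t} -> {host, t.bin / t.vulkano} end)

for {host, s} <- speedups do
  IO.puts("#{host}: vulkano is #{Float.round(s, 1)}x faster than BinaryBackend")
end

# Relative gap between the two Vulkan paths on the host where both ran.
mac = timings["mac-247"]
gap_pct = (mac.spirit / mac.vulkano - 1.0) * 100
IO.puts("mac-247: spirit is #{Float.round(gap_pct, 1)}% slower than vulkano")
```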

Position vs EXLA and EMLX

Three GPU backends exist for Nx today. Each won a different platform first.

| | EXLA | EMLX | Nx.Vulkan.VulkanoBackend |
|---|---|---|---|
| Backing API | Google XLA | Apple MLX (Metal) | Khronos Vulkan via vulkano (Rust) |
| Maturity | Years; production | Released 2024 | Released 2026 |
| Linux + NVIDIA CUDA | ✓ canonical | | ✓ via Vulkan |
| macOS + Apple Silicon | | ✓ canonical | ✓ via MoltenVK |
| FreeBSD + NVIDIA | | | ✓ only path |
| Windows / WSL2 | partial via TF | | ✓ (Vulkan ships on Windows) |
| Op coverage | full Nx surface (~200) | full Nx surface | 24 native, rest via host fallback |
| Nx.Defn.grad (autograd) | full | full | ✓ free (graph transformation) |
| fp64 compute | full | none (Metal limit) | ✓ binary/unary/reduce |
| Production use | Google scale | Apple devices | eXMC trader on mac-247 |

The autograd insight

Nx.Defn.grad is a graph transformation that runs at compile time on the Nx.Defn.Expr AST. For every forward op in the graph, it inserts the corresponding backward op expressed in terms of more forward ops. The backend never sees a "backward op" — it just keeps executing forward primitives. Forward op coverage IS gradient coverage when running through Nx.Defn.Evaluator.

That means VulkanoBackend supports gradients for any function expressible in its 24 native ops + host-fallback long tail. No backward callbacks were written. Validated by running a complete Axon training step (Dense → sigmoid → Dense → MSE → Nx.Defn.value_and_grad) on Nx.Vulkan.VulkanoBackend, with gradient sum agreeing to 1e-8 against the BinaryBackend reference.
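A minimal sketch of that idea, using only public Nx APIs: the module below defines a forward pass and asks Nx.Defn for its gradient, without defining any backward op. The module name GradSketch and the toy data are hypothetical. As written it runs on the default BinaryBackend; the assumption is that moving the input tensors to Nx.Vulkan.VulkanoBackend with Nx.backend_transfer (as in the Quickstart) is the only change needed to run the same graph on the GPU.

```elixir
Mix.install([{:nx, "~> 0.10"}])

defmodule GradSketch do
  import Nx.Defn

  # Forward pass only: sigmoid-activated linear model with an MSE loss.
  defn loss(w, x, y) do
    pred = Nx.sigmoid(Nx.dot(x, w))
    Nx.mean(Nx.pow(pred - y, 2))
  end

  # value_and_grad/2 rewrites the expression graph into more forward ops;
  # nothing in this module describes a backward pass.
  defn loss_and_grad(w, x, y) do
    value_and_grad(w, fn w -> loss(w, x, y) end)
  end
end

# Toy data, chosen only for illustration.
x = Nx.tensor([[1.0, 2.0], [3.0, 4.0]])
y = Nx.tensor([0.0, 1.0])
w = Nx.tensor([0.1, -0.2])

{loss, grad} = GradSketch.loss_and_grad(w, x, y)
IO.inspect(loss, label: "loss")
IO.inspect(grad, label: "dloss/dw")
```

The gradient has the same shape as w, and every op the backend executes while producing it (dot, sigmoid, subtract, multiply, mean and friends) is an ordinary forward primitive.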

What's missing

Op coverage — the long tail. Convolutions, FFTs, sort, scatter, Nx.LinAlg.solve/qr/svd, complex types, sparse ops. Most of these have host-fallback paths that work today but are slow. Native shaders for each are 50–100 LOC of vulkano apiece. Estimated effort to reach feature parity with EXLA: 6–12 months of focused work, parallelisable.

Nx.Defn custom compiler. Today we run through Nx.Defn.Evaluator, which dispatches ops one at a time. EXLA compiles whole graphs to optimised HLO. A custom Defn compiler that batches dispatches, fuses elementwise chains, and caches compiled graphs would close most of the remaining perf gap. Estimated effort: 3–6 months.
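The gain from fusing elementwise chains can be illustrated on the CPU with plain lists. This is a conceptual sketch, not a Defn compiler: FusionSketch is a hypothetical name, a list stands in for a GPU buffer, and one full pass over the list stands in for one dispatch.

```elixir
defmodule FusionSketch do
  # Op-at-a-time: every op walks the whole buffer, one "dispatch" each.
  def unfused(data, ops) do
    Enum.reduce(ops, {data, 0}, fn op, {buf, dispatches} ->
      {Enum.map(buf, op), dispatches + 1}
    end)
  end

  # Fused: compose the ops into a single function and walk the buffer once.
  def fused(data, ops) do
    kernel =
      Enum.reduce(ops, fn x -> x end, fn op, acc ->
        fn x -> op.(acc.(x)) end
      end)

    {Enum.map(data, kernel), 1}
  end
end

ops = [&:math.exp/1, &(&1 * 2.0), &:math.tanh/1]
data = [0.0, 0.5, 1.0]

{slow, slow_dispatches} = FusionSketch.unfused(data, ops)
{fast, fast_dispatches} = FusionSketch.fused(data, ops)
IO.inspect({slow == fast, slow_dispatches, fast_dispatches})
```

Both variants apply the same operations to each element in the same order, so the results are identical; only the number of passes differs (three versus one here).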

Persistent buffer pool. Buffers are currently allocated per call through vulkano's StandardMemoryAllocator. This works, but costs a millisecond per dispatch that an explicit pool could reclaim. Planned for mid-2026.
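As a rough picture of what an explicit pool does, the sketch below keeps released buffers in a map keyed by byte size and hands them back on the next request of the same size. Everything here is hypothetical: BufferPoolSketch is not part of nx_vulkan, and a tagged tuple stands in for a real GPU buffer handle.

```elixir
defmodule BufferPoolSketch do
  # pool :: %{byte_size => [buffer]}
  def new, do: %{}

  # Reuse a free buffer of the requested size if one exists, else allocate.
  def checkout(pool, size) do
    case Map.get(pool, size, []) do
      [buf | rest] -> {:hit, buf, Map.put(pool, size, rest)}
      [] -> {:miss, allocate(size), pool}
    end
  end

  # Return a buffer to the free list for its size.
  def checkin(pool, size, buf), do: Map.update(pool, size, [buf], &[buf | &1])

  # Stand-in for a real device allocation.
  defp allocate(size), do: {:buffer, make_ref(), size}
end

pool = BufferPoolSketch.new()
{first, buf, pool} = BufferPoolSketch.checkout(pool, 4096)
pool = BufferPoolSketch.checkin(pool, 4096, buf)
{second, reused, pool} = BufferPoolSketch.checkout(pool, 4096)
{third, _other, _pool} = BufferPoolSketch.checkout(pool, 4096)
IO.inspect({first, second, third, reused == buf})
```

The second checkout reuses the buffer released by the first, which is where the per-dispatch allocation cost would be saved; the third finds the free list empty and allocates again.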

f64 matmul. matmul.spv is f32-only. f64 dot products fall back to host, which is slow for large tensors. Estimated effort: 2 weeks to add the f64 variant.

Scholar — linalg fast paths. Linear regression (normal equation + SVD) now smoke-tests cleanly via a host-fallback block/4 callback that routes Nx.Block.LinAlg.SVD/QR/solve/cholesky through BinaryBackend. Coefficients match to 2e-6. The fallback works for any Scholar algorithm whose linalg uses Nx.Block; native SVD/QR shaders would speed things up but aren't blocking correctness. 2-4 weeks to add the most-used linalg shaders natively.

Quickstart

As a backend in your project

# mix.exs
def deps do
  [
    {:nx, "~> 0.10"},
    {:nx_vulkan, git: "https://github.com/borodark/nx_vulkan"}
  ]
end

# Build a tensor, transfer to GPU, do work
x_bin = Nx.tensor([1.0, 2.0, 3.0, 4.0], type: :f32)
x_vk  = Nx.backend_transfer(x_bin, Nx.Vulkan.VulkanoBackend)

y_vk  = Nx.sigmoid(x_vk)
y_bin = Nx.backend_transfer(y_vk, Nx.BinaryBackend)
IO.inspect(Nx.to_list(y_bin))
# [0.7310585975646973, 0.8807970881462097, 0.9525741338729858, 0.9820137619972229]
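
The printed values can be sanity-checked without a GPU. The snippet below recomputes the sigmoid in f64 with Erlang's :math module and compares it against the f32 output quoted above; the 1.0e-6 tolerance is an assumption chosen to sit comfortably above f32 rounding error.

```elixir
# Values printed by the VulkanoBackend run above (f32).
gpu = [0.7310585975646973, 0.8807970881462097, 0.9525741338729858, 0.9820137619972229]

# Reference sigmoid in f64 on the CPU.
sigmoid = fn x -> 1.0 / (1.0 + :math.exp(-x)) end
cpu = Enum.map([1.0, 2.0, 3.0, 4.0], sigmoid)

# f32 carries about 7 significant digits, so 1.0e-6 is a loose tolerance.
max_err =
  Enum.zip(gpu, cpu)
  |> Enum.map(fn {g, c} -> abs(g - c) end)
  |> Enum.max()

IO.puts("max abs error: #{max_err}")
true = max_err < 1.0e-6
```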

Try the Axon training example

git clone https://github.com/borodark/nx_vulkan
cd nx_vulkan
mix deps.get && mix compile
elixir examples/axon_training_loop.exs

Runs a 100-step Dense(4→32, tanh)→Dense(1) regression with manual SGD and compares loss trajectories on BinaryBackend vs VulkanoBackend. The script reports PASS on both Linux and FreeBSD.

Try the full bench

mix run examples/full_bench.exs

Covers per-op, end-to-end, and robustness benchmarks across every backend Nx can find, and auto-detects EXLA availability. Runs in ~10 minutes on the RTX 3060 Ti, ~15 on the GT 650M.

Why FreeBSD matters

Nx today has three GPU backends. Two of them — EXLA and EMLX — explicitly do not run on FreeBSD. If you have NVIDIA hardware on FreeBSD, Vulkan is the only path. mac-248 (FreeBSD 15.0 / GT 750M) and mac-247 (FreeBSD 15.0 / GT 650M Mac Edition) are the canonical bring-up boxes; every commit gets verified there alongside the Linux dev host.

The companion blog series:

Architecture

   ┌─────────────────────────────────────────────────────────┐
   │  Nx layer                                                │
   │  • Nx.Vulkan.VulkanoBackend  (current)                   │
   │  • Nx.Vulkan.Backend         (legacy, C++ path)          │
   └──────────────┬─────────────────────────┬─────────────────┘
                  │                         │
   ┌──────────────▼──────────┐  ┌──────────▼──────────────────┐
   │  Nx.Vulkan.NativeV       │  │  Nx.Vulkan.Native            │
   │  (Rustler crate          │  │  (Rustler crate              │
   │   nx_vulkan_vulkano)     │  │   nx_vulkan_native)          │
   │  • Arc<Buffer> resources │  │  • C++ shim NIFs             │
   │  • pipeline cache        │  │  • opaque VkBuf* pointers    │
   │  • specialisation        │  │                              │
   └──────────┬───────────────┘  └─────────┬────────────────────┘
              │                            │
              │                       ┌────▼─────────┐
              │                       │  C++ shim    │
              │                       │  (legacy)    │
              │                       └────┬─────────┘
              │                            │
              │                       ┌────▼─────────┐
              │                       │   spirit     │
              │                       │   (vendored) │
              │                       └────┬─────────┘
              │                            │
              └──────────┬─────────────────┘
                         ▼
              ┌─────────────────────────┐
              │  Vulkan driver (loader) │
              └─────────────────────────┘
                         │
              ┌──────────▼──────────────┐
              │  priv/shaders/*.spv      │
              │  • elementwise_binary    │
              │  • elementwise_unary     │
              │  • reduce_axis           │
              │  • matmul                │
              │  • transpose             │
              │  • synthesised chain     │
              │    shaders (Mission II)  │
              │  • 9 hand-written leap-  │
              │    frog families         │
              └──────────────────────────┘

The SPV catalog under priv/shaders/ is shared by both backends. The synthesis pipeline that produces new chain shaders at runtime (Nx.Vulkan.Synthesis, Nx.Vulkan.ShaderTemplate, Nx.Vulkan.ChainShaderSpecs) lives in the Elixir layer and is backend-agnostic.

Old spirit-era infrastructure that survives unchanged:

Building

Prerequisites

Build

mix deps.get
mix compile

Vulkano compiles in ~30 s on Linux and ~3 min 18 s on FreeBSD 15.0 (mostly dependency compilation). The spirit/C++ path compiles in parallel.

Rust toolchain pin

rust-toolchain.toml pins rustc to 1.85. The reason is in the file's comment; bump when upstream rustler emits a corrected rustler-sys signature.

Status

Phase 3 in progress (May 2026): vulkano backend covers stages 1–8 of the roadmap.

| Feature | Status |
|---|---|
| Vulkano buffer lifecycle (alloc/upload/download/free) | ✓ |
| 24 native compute ops via specialised SPVs | ✓ |
| f64 shader paths (binary/unary/reduce) | ✓ |
| Pipeline cache (correctness + perf) | ✓ |
| Cross-host validation (Linux + 2× FreeBSD) | ✓ |
| Axon training step end-to-end | ✓ |
| eXMC regime log_p (f64) byte-identical | ✓ |
| Autograd via Nx.Defn.grad | ✓ |
| Persistent buffer pool | mid-2026 |
| f64 matmul | mid-2026 |
| Scholar linear regression (coefs match to 2e-6) | ✓ |
| Scholar native linalg shaders (SVD/QR/cholesky/solve) | mid-2026 |
| Custom Nx.Defn compiler | 2026 H2 |
| Conv / FFT / sort / scatter | 2026 H2–Q4 |

Plan history is in PLAN_GPU_NODE.md (Phase 1–2 era) and docs/VULKANO_BACKEND_ROADMAP.md (Phase 3+). Per-workstream notes in research/gpu_node/.

Sibling: zed

zed is the declarative ZFS + Elixir deploy tool that orchestrates BEAM nodes. nx_vulkan is consumed inside deployed BEAM nodes — not as a zed dependency. See specs/nx-vulkan-execution.md in the zed repo for the integration story.

License

Apache 2.0. Same as Spirit and Nx.