Nx.Vulkan
A GPU tensor backend for Nx that runs on anything with a Vulkan driver — including FreeBSD, where CUDA and Metal don't exist.
✓ Linux + NVIDIA RTX 3060 Ti (proprietary driver)
✓ FreeBSD + NVIDIA GT 750M (NVIDIA legacy driver)
✓ FreeBSD + NVIDIA GT 650M (NVIDIA legacy driver)

Two backends live in this repo:
- Nx.Vulkan.VulkanoBackend: pure Rust, vulkano-backed, the current primary. 24 ops run natively, with host fallback for the long tail. Validated on Axon training (forward + autograd + SGD), the eXMC regime-model log-posterior at f64 precision, and Scholar linear regression (coefficients match to 2e-6).
- Nx.Vulkan.Backend: C++, spirit-backed, the predecessor. The chain-shader synthesis pipeline (Synthesis, ShaderTemplate, ChainShaderSpecs) still ships here and produces SPV blobs that either backend can dispatch.
What works today
| Capability | VulkanoBackend | spirit (legacy) |
|---|---|---|
| Buffer alloc / upload / download | Arc-managed | raw pointer |
| Elementwise binary (add/sub/mul/div/pow/max/min) | ✓ f32 + f64 | ✓ f32 |
| Elementwise unary (exp/log/sqrt/sigmoid/tanh/abs/neg/floor/ceil/sign) | ✓ f32 + f64 | ✓ f32 |
| Reductions (sum, reduce_max, reduce_min, axis + leading + trailing) | ✓ f32 + f64 | ✓ f32 |
| Shape / movement (reshape, squeeze, transpose-2D) | ✓ | ✓ |
| Matmul (rank-2 × rank-2) | ✓ f32 | ✓ f32 |
| Slice / as_type / general dot axes | host-fallback | host-fallback |
| Chain shader synthesis (Mission II) | dispatches generated SPV | dispatches generated SPV |
| Nx.Defn.grad (autograd) | works automatically | works automatically |
| Axon training step | ✓ validated (1e-8 grad match vs BinaryBackend) | partial |
| eXMC NUTS sampler integration | ✓ regime log_p byte-identical | ✓ chain-shader path |
| Scholar linear regression | ✓ coefficients match to 2e-6 (SVD via host-fallback) | partial |
| Long-running workloads (5000+ dispatches) | ✓ pipeline cache | ✗ stale-buffer crash class |
The chain-shader dispatch path (leapfrog_chain_synth) is shared between both backends — same SPV cache, same content-addressing, same runtime synthesis pipeline. The difference is the backend that handles general Nx tensor ops outside that dispatch.
Why two backends
The spirit backend reached production first — chain-shader synthesis, runtime SPV compilation, content-addressed disk cache, and a long-lived Nx.Vulkan.Node GenServer. Then a use-after-free in the C++ FFI layer crashed the live trader three minutes after every restart. The failure surfaced as Nx.Vulkan.Native.byte_size raising :badarg on a stale VkBuf* pointer — a classic FFI ownership leak the C++ type system cannot detect. The vulkano backend grew from a spike that proved the migration was mechanical: same SPV bytes in, byte-identical chain tensors out, perf within ten percent on the bench target.
The two coexist while we backfill the long tail of ops. Long-term, the spirit path retires.
See docs/VULKANO_BACKEND_ROADMAP.md for the full stage breakdown. The full story is in The Backend That Didn't Need to Know.
Benchmarks (May 2026)
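Square matmul, milliseconds per dispatch, median of 50–200 iterations. In the column headers, bin is Nx.BinaryBackend, super-io is the Linux host, and mac-247 is the FreeBSD + GT 650M host: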
| size | bin (super-io) | bin (mac-247) | vulkano (super-io) | vulkano (mac-247) | spirit (mac-247) |
|---|---|---|---|---|---|
| 16×16 | 2.76 | 2.51 | 1.18 | 1.06 | 1.16 |
| 64×64 | 130.76 | 158.45 | 7.07 | 7.92 | 7.56 |
| 256×256 | 20,097 | 13,891 | 149.19 | 136.10 | 141.73 |
| 1024×1024 | n/a (hours) | n/a (hours) | 2,323 | 2,843 | 2,845 |
Two observations:
- vulkano and spirit agree within 5% from 64×64 upward on mac-247, the only host where both completed, and within 10% at 16×16 (1.06 ms vs 1.16 ms). The C++ path doesn't buy back its maintenance cost.
- The Vulkan path beats BinaryBackend by roughly 100–135× at 256×256: 13,891 / 136.10 ≈ 102× on mac-247 (GT 650M) and 20,097 / 149.19 ≈ 135× on super-io. The GT 650M is from 2013; what changes is moving the loop off the BEAM scheduler.
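The ratios behind these two observations can be recomputed from the table. The sketch below is plain Elixir with the timings copied by hand. The module name BenchRatios, and the choice to express the vulkano/spirit gap relative to the spirit timing, are assumptions made for illustration only.

```elixir
defmodule BenchRatios do
  # Median milliseconds per dispatch on mac-247, copied from the table above:
  # {size, vulkano ms, spirit ms}. spirit has no super-io column.
  @mac247 [
    {"16x16", 1.06, 1.16},
    {"64x64", 7.92, 7.56},
    {"256x256", 136.10, 141.73},
    {"1024x1024", 2843.0, 2845.0}
  ]

  # {host, BinaryBackend ms, vulkano ms} at 256x256.
  @at_256 [{"super-io", 20_097.0, 149.19}, {"mac-247", 13_891.0, 136.10}]

  # Percentage gap between vulkano and spirit, relative to the spirit timing.
  def backend_gaps do
    for {size, vulkano, spirit} <- @mac247 do
      {size, Float.round(abs(vulkano - spirit) / spirit * 100, 1)}
    end
  end

  # How many times faster vulkano is than BinaryBackend on each host.
  def speedups do
    for {host, bin, vulkano} <- @at_256 do
      {host, Float.round(bin / vulkano, 1)}
    end
  end
end

IO.inspect(BenchRatios.backend_gaps(), label: "vulkano vs spirit gap (%)")
IO.inspect(BenchRatios.speedups(), label: "BinaryBackend / vulkano at 256x256")
```

On the table's numbers this works out to about 135× on super-io and 102× on mac-247 at 256×256. The vulkano/spirit gap is about 8.6% at 16×16 and stays below 5% from 64×64 upward.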
The C++ spirit path crashed on Linux super-io mid-bench with a memory-supervisor high-watermark warning — same fragility class that motivated the migration in the first place. The vulkano path completed cleanly on both hosts. Full bench script: examples/full_bench.exs.
Position vs EXLA and EMLX
Three GPU backends exist for Nx today. Each won a different platform first.
| | EXLA | EMLX | Nx.Vulkan.VulkanoBackend |
|---|---|---|---|
| Backing API | Google XLA | Apple MLX (Metal) | Khronos Vulkan via vulkano (Rust) |
| Maturity | Years; production | Released 2024 | Released 2026 |
| Linux + NVIDIA CUDA | ✓ canonical | ✗ | ✓ via Vulkan |
| macOS + Apple Silicon | ✗ | ✓ canonical | ✓ via MoltenVK |
| FreeBSD + NVIDIA | ✗ | ✗ | ✓ only path |
| Windows / WSL2 | partial via TF | ✗ | ✓ (Vulkan ships on Windows) |
| Op coverage | full Nx surface (~200) | full Nx surface | 24 native, rest via host fallback |
| Nx.Defn.grad (autograd) | full | full | ✓ free (graph transformation) |
| fp64 compute | full | none (Metal limit) | ✓ binary/unary/reduce |
| Production use | Google scale | Apple devices | eXMC trader on mac-247 |
The autograd insight
Nx.Defn.grad is a graph transformation that runs at compile time on the Nx.Defn.Expr AST. For every forward op in the graph, it inserts the corresponding backward op expressed in terms of more forward ops. The backend never sees a "backward op" — it just keeps executing forward primitives. Forward op coverage IS gradient coverage when running through Nx.Defn.Evaluator.
That means VulkanoBackend supports gradients for any function expressible in its 24 native ops + host-fallback long tail. No backward callbacks were written. Validated by running a complete Axon training step (Dense → sigmoid → Dense → MSE → Nx.Defn.value_and_grad) on Nx.Vulkan.VulkanoBackend, with gradient sum agreeing to 1e-8 against the BinaryBackend reference.
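A minimal sketch of that property, assuming only the public Nx.Defn.value_and_grad/1 API and the default Nx.Defn.Evaluator. It runs on Nx.BinaryBackend so that it works without a GPU; moving the input to Nx.Vulkan.VulkanoBackend with Nx.backend_transfer/2, as in the Quickstart, should exercise the same path, but that is not shown here. The loss function is a made-up example, not one from the repo.

```elixir
Mix.install([{:nx, "~> 0.10"}])

# A forward function built only from forward primitives (multiply, sum).
loss = fn x -> x |> Nx.multiply(x) |> Nx.sum() end

x = Nx.tensor([1.0, 2.0, 3.0], type: :f32)

# value_and_grad/1 returns a new function. The gradient graph it builds is
# expressed in forward ops, so the backend only ever executes forward ops.
{value, grad} = Nx.Defn.value_and_grad(loss).(x)

IO.inspect(Nx.to_number(value), label: "sum(x * x)")
IO.inspect(Nx.to_list(grad), label: "d/dx sum(x * x) = 2x")
```

For this input the value is 14.0 and the gradient is [2.0, 4.0, 6.0]. No backward callback is defined anywhere in the script.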
What's missing
Op coverage — the long tail. Convolutions, FFTs, sort, scatter, Nx.LinAlg.solve/qr/svd, complex types, sparse ops. Most of these have host-fallback paths that work today but are slow. Native shaders for each are 50–100 LOC of vulkano apiece. Estimated effort to reach feature parity with EXLA: 6–12 months of focused work, parallelisable.
Nx.Defn custom compiler. Today we run through Nx.Defn.Evaluator, which dispatches ops one at a time. EXLA compiles whole graphs to optimised HLO. A custom Defn compiler that batches dispatches, fuses elementwise chains, and caches compiled graphs would close most of the remaining perf gap. Estimated effort: 3–6 months.
Persistent buffer pool. Currently per-call buffer allocation through vulkano's StandardMemoryAllocator. Works but costs a millisecond per dispatch that an explicit pool could reclaim. Mid-2026 work.
f64 matmul. matmul.spv is f32-only. f64 dot products fall back to host, which is slow for large tensors. 2 weeks to add the f64 variant.
Scholar — linalg fast paths. Linear regression (normal equation + SVD) now smoke-tests cleanly via a host-fallback block/4 callback that routes Nx.Block.LinAlg.SVD/QR/solve/cholesky through BinaryBackend. Coefficients match to 2e-6. The fallback works for any Scholar algorithm whose linalg uses Nx.Block; native SVD/QR shaders would speed things up but aren't blocking correctness. 2-4 weeks to add the most-used linalg shaders natively.
Quickstart
As a backend in your project
# mix.exs
def deps do
[
{:nx, "~> 0.10"},
{:nx_vulkan, git: "https://github.com/borodark/nx_vulkan"}
]
end

# Build a tensor, transfer to GPU, do work
x_bin = Nx.tensor([1.0, 2.0, 3.0, 4.0], type: :f32)
x_vk = Nx.backend_transfer(x_bin, Nx.Vulkan.VulkanoBackend)
y_vk = Nx.sigmoid(x_vk)
y_bin = Nx.backend_transfer(y_vk, Nx.BinaryBackend)
IO.inspect(Nx.to_list(y_bin))
# [0.7310585975646973, 0.8807970881462097, 0.9525741338729858, 0.9820137619972229]

Try the Axon training example
git clone https://github.com/borodark/nx_vulkan
cd nx_vulkan
mix deps.get && mix compile
elixir examples/axon_training_loop.exs
Runs a 100-step Dense(4→32, tanh)→Dense(1) regression with manual SGD. Compares loss trajectories on BinaryBackend vs VulkanoBackend. PASS verdict on both Linux + FreeBSD.
Try the full bench
mix run examples/full_bench.exs

Per-op + end-to-end + robustness across every backend Nx can find. Auto-detects EXLA availability. Runs in ~10 minutes on RTX 3060 Ti, ~15 on GT 650M.
Why FreeBSD matters
Nx today has three GPU backends. Two of them — EXLA and EMLX — explicitly do not run on FreeBSD. If you have NVIDIA hardware on FreeBSD, Vulkan is the only path. mac-248 (FreeBSD 15.0 / GT 750M) and mac-247 (FreeBSD 15.0 / GT 650M Mac Edition) are the canonical bring-up boxes; every commit gets verified there alongside the Linux dev host.
The companion blog series:
- The Backend That Didn't Need to Know — the C++→vulkano migration; descriptor pool debugging; autograd was free
- The GPU That Doesn't Need CUDA — the FreeBSD Vulkan story (spirit-era)
- A Walkable Path Under the Mountain — eXMC + zed integration
Architecture
┌──────────────────────────────────────────────────────────┐
│ Nx layer                                                 │
│ • Nx.Vulkan.VulkanoBackend (current)                     │
│ • Nx.Vulkan.Backend (legacy, C++ path)                   │
└─────────────┬───────────────────────────┬────────────────┘
              │                           │
┌─────────────▼────────────┐  ┌───────────▼────────────────┐
│ Nx.Vulkan.NativeV        │  │ Nx.Vulkan.Native           │
│ (Rustler crate           │  │ (Rustler crate             │
│  nx_vulkan_vulkano)      │  │  nx_vulkan_native)         │
│ • Arc<Buffer> resources  │  │ • C++ shim NIFs            │
│ • pipeline cache         │  │ • opaque VkBuf* pointers   │
│ • specialisation         │  │                            │
└─────────────┬────────────┘  └───────────┬────────────────┘
              │                           │
              │                    ┌──────▼───────┐
              │                    │ C++ shim     │
              │                    │ (legacy)     │
              │                    └──────┬───────┘
              │                           │
              │                    ┌──────▼───────┐
              │                    │ spirit       │
              │                    │ (vendored)   │
              │                    └──────┬───────┘
              │                           │
              └─────────────┬─────────────┘
                            ▼
              ┌──────────────────────────┐
              │ Vulkan driver (loader)   │
              └─────────────┬────────────┘
                            │
              ┌─────────────▼────────────┐
              │ priv/shaders/*.spv       │
              │ • elementwise_binary     │
              │ • elementwise_unary      │
              │ • reduce_axis            │
              │ • matmul                 │
              │ • transpose              │
              │ • synthesised chain      │
              │   shaders (Mission II)   │
              │ • 9 hand-written         │
              │   leapfrog families      │
              └──────────────────────────┘
The SPV catalog under priv/shaders/ is shared by both backends. The synthesis pipeline that produces new chain shaders at runtime (Nx.Vulkan.Synthesis, Nx.Vulkan.ShaderTemplate, Nx.Vulkan.ChainShaderSpecs) lives in the Elixir layer and is backend-agnostic.
Old spirit-era infrastructure that survives unchanged:
- Nx.Vulkan.Node: long-lived named GenServer that owns the vkPipelineCache blob and serialises dispatch via with_node/2. Used by the legacy backend; the new backend doesn't require it but cooperates with it.
- Nx.Vulkan.PipelineCache: disk-persistent vkPipelineCache with UUID validation. Survives BEAM restarts.
- Runtime chain shader synthesis: render a FamilySpec, hand it to Synthesis.compile/1, get a content-addressed SPV path back. ~150 ms cold, 5 ms cache hit. Both backends consume the output.
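To illustrate what "content-addressed" means for the SPV cache, here is a minimal sketch in plain Elixir. It is not the code behind Synthesis.compile/1. The module SpvCacheSketch, the SHA-256 key, the .spv file naming, and the fake compile function are all hypothetical stand-ins chosen for the example.

```elixir
defmodule SpvCacheSketch do
  # Hypothetical helper, not the real Nx.Vulkan.Synthesis code.
  # The cache key is the SHA-256 of the rendered shader source, so identical
  # source always maps to the same file name.
  def path_for(cache_dir, source) when is_binary(source) do
    digest = :crypto.hash(:sha256, source) |> Base.encode16(case: :lower)
    Path.join(cache_dir, digest <> ".spv")
  end

  # Returns {:hit, path} when the blob already exists. Otherwise it runs
  # compile_fun (a stand-in for the real shader compiler), stores the result,
  # and returns {:miss, path}.
  def fetch(cache_dir, source, compile_fun) when is_function(compile_fun, 1) do
    path = path_for(cache_dir, source)

    if File.exists?(path) do
      {:hit, path}
    else
      File.mkdir_p!(cache_dir)
      File.write!(path, compile_fun.(source))
      {:miss, path}
    end
  end
end

dir = Path.join(System.tmp_dir!(), "spv_cache_sketch_#{System.unique_integer([:positive])}")
fake_compile = fn source -> "SPV:" <> source end

# First call compiles and stores; second call with the same source is a hit.
{:miss, path} = SpvCacheSketch.fetch(dir, "void main() {}", fake_compile)
{:hit, ^path} = SpvCacheSketch.fetch(dir, "void main() {}", fake_compile)
File.rm_rf!(dir)
```

Because the path is derived from the source bytes alone, any process that renders the same shader text arrives at the same file, which is consistent with both backends sharing one cache.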
Building
Prerequisites
- Erlang/OTP 26+, Elixir 1.17+
- Rust 1.78+
- C++ compiler (only needed for the legacy spirit backend; vulkano is pure Rust)
- Vulkan SDK + glslangValidator:
  - Debian/Ubuntu: apt install libvulkan-dev vulkan-tools glslang-tools
  - FreeBSD: pkg install vulkan-loader vulkan-headers vulkan-tools glslang shaderc
Build
mix deps.get
mix compile

Vulkano compiles in ~30s on Linux, ~3:18 on FreeBSD 15.0 (mostly dependency compilation). The spirit/C++ path compiles in parallel.
Rust toolchain pin
rust-toolchain.toml pins rustc to 1.85. The reason is in the file's comment; bump when upstream rustler emits a corrected rustler-sys signature.
Status
Phase 3 in progress (May 2026): vulkano backend covers stages 1–8 of the roadmap.
| Feature | Status |
|---|---|
| Vulkano buffer lifecycle (alloc/upload/download/free) | ✓ |
| 24 native compute ops via specialised SPVs | ✓ |
| f64 shader paths (binary/unary/reduce) | ✓ |
| Pipeline cache (correctness + perf) | ✓ |
| Cross-host validation (Linux + 2× FreeBSD) | ✓ |
| Axon training step end-to-end | ✓ |
| eXMC regime log_p (f64) byte-identical | ✓ |
| Autograd via Nx.Defn.grad | ✓ |
| Persistent buffer pool | mid-2026 |
| f64 matmul | mid-2026 |
| Scholar linear regression (coefs match to 2e-6) | ✓ |
| Scholar native linalg shaders (SVD/QR/cholesky/solve) | mid-2026 |
| Custom Nx.Defn compiler | 2026 H2 |
| Conv / FFT / sort / scatter | 2026 H2–Q4 |
Plan history is in PLAN_GPU_NODE.md (Phase 1–2 era) and docs/VULKANO_BACKEND_ROADMAP.md (Phase 3+). Per-workstream notes in research/gpu_node/.
Sibling: zed
zed is the declarative ZFS + Elixir deploy tool that orchestrates BEAM nodes. nx_vulkan is consumed inside deployed BEAM nodes — not as a zed dependency. See specs/nx-vulkan-execution.md in the zed repo for the integration story.
License
Apache 2.0. Same as Spirit and Nx.