[!IMPORTANT] viva_tensor IS NOT A WRAPPER. It is a production-grade FP8 LLM inference engine written from scratch: hand-tuned CUDA kernels, blocked W8A16 GEMV, full-token CUDA Graphs, and a public
ModelHandle API, all driven from Gleam on the BEAM. On the TinyLlama-1.1B decode benchmark below, it is faster than Ollama on the same hardware (448 vs 352 tok/s).
🎯 Overview
A tensor library for Gleam on the BEAM. Provides a pure-Gleam tensor API
for portability, an inference API for FP8 / INT4-2:4 / INT8-2:4 sparse
linear layers, and a public LLM ModelHandle API for Llama-family
HuggingFace checkpoints.
The library works fully in pure BEAM (slow but portable) and transparently upgrades to the native CUDA path when the NIF is loaded.
| Property | Value |
|---|---|
| Language | Pure Gleam (type-safe functional) |
| Runtime | BEAM / OTP 27+ |
| Native backend | CUDA 12 + CUTLASS + cuSPARSELt (SM89 / Ada) |
| Tests | 792 passing |
| Decode | 448 tok/s TinyLlama-1.1B (vs Ollama 352) |
| Public API | viva_tensor.load_model / viva_tensor.generate |
⚡ Quick Start
git clone https://github.com/gabrielmaialva33/viva_tensor.git && cd viva_tensor
gleam deps download
# Optional: native CUDA backend (RTX 4090 / Ada SM89)
make cutlass-libs # CUTLASS + cuSPARSELt static archives
make zig # the NIF shared object
gleam test # 792 tests, all pass with NIF loaded
Generate text in 4 lines of Gleam
import viva_tensor as t
let assert Ok(model) = t.load_model("tmp/tinyllama/model.safetensors")
let assert Ok(result) = t.generate(model, "Hello", t.default_generate_opts())
result.text
Prerequisites
| Tool | Version | Required for |
| :--- | :--- | :--- |
| Gleam | `>= 1.14` | Build / pure-Gleam |
| Erlang/OTP | `>= 27` | BEAM runtime |
| CUDA toolkit | `>= 12.0` | Native inference path |
| NVIDIA GPU | Ada+ (SM89) | FP8 / Tensor Cores |
| `make` + `zig` + `clang` | recent | NIF build pipeline |

The pure-Gleam path needs only Gleam + Erlang/OTP.
🏗️ Architecture
┌────────────────────────────────────────────────────────┐
│                 Gleam application code                 │
│      viva_tensor.load_model / .generate / .Tensor      │
└───────────────────────────┬────────────────────────────┘
                            │
┌───────────────────────────▼────────────────────────────┐
│          Erlang public API (viva_tensor_llm)           │
│   SafeTensors loader · BPE tokenizer · sampling · KV   │
└───────────────────────────┬────────────────────────────┘
                            │
┌───────────────────────────▼────────────────────────────┐
│        NIF dispatch (viva_tensor_zig.so + .erl)        │
│ PackedWeight · EmbeddingTable · KvCache · ModelHandle  │
└─────────┬─────────────────────────────┬────────────────┘
          │                             │
┌─────────▼──────────┐        ┌─────────▼────────────────┐
│ Pure-Gleam tensors │        │ CUDA kernels             │
│ (no GPU needed)    │        │ W8A16 GEMV · FlashAttn   │
│                    │        │ Full-token CUDA Graph    │
│                    │        │ CUTLASS FP8/INT4 sparse  │
└────────────────────┘        └──────────────────────────┘
Core Modules
| Module | Description |
| :--- | :--- |
| `viva_tensor` | Public Gleam API: tensors, prepack, linear, LLM |
| `viva_tensor_llm` | `load_model` / `generate` on an opaque `ModelHandle` |
| `viva_tensor_zig` | NIF dispatch (Erlang stubs) |
| `viva_tensor_safetensors_ffi` | HF SafeTensors loader, sharded support, BF16/F16 |
| `viva_tensor_tokenizer_ffi` | SentencePiece + byte-level BPE (GPT-2/Llama-3) |
| `zig_src/cuda_block_forward.cu` | RMSNorm, RoPE, GQA flash attn, SiLU, residual |
| `zig_src/nif_forward_block.c` | Decode-step orchestration, CUDA Graph capture |
| `zig_src/cuda_fp8_cutlass.cu` | CUTLASS FP8 dense GEMM |
| `zig_src/nif_prepack_int_sparse.c` | INT4 / INT8 2:4 sparse weight prepack |

Performance
All numbers were measured on an RTX 4090 (Ada SM89) with an Intel i9-13900K (32 threads @ 5.80 GHz) and are reproducible via bench/harness.
Text generation: TinyLlama-1.1B-Chat (FP8 W8A16)
| Runtime | Decode speed |
|---|---|
| viva_tensor (best run) | 448 tok/s |
| Ollama local baseline (same model) | 352 tok/s |
| viva_tensor.generate (warm) | 2.31 ms/token |
| viva_tensor.generate Llama-3.2-1B-Instruct | 2.47 ms/token |
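
The table above mixes two units: tok/s for the best run and ms/token for the warm `generate` path. A minimal sketch of the conversion between them, in plain Gleam using only `gleam_stdlib`. The function names here are illustrative and are not part of viva_tensor:

```gleam
import gleam/float
import gleam/io

/// Per-token latency in milliseconds for a given throughput in tokens/second.
pub fn ms_per_token(tok_per_s: Float) -> Float {
  1000.0 /. tok_per_s
}

/// Throughput in tokens/second for a given per-token latency in milliseconds.
pub fn tokens_per_second(latency_ms: Float) -> Float {
  1000.0 /. latency_ms
}

/// Ratio of two throughputs; above 1.0 means `ours` is faster than `baseline`.
pub fn speedup(ours: Float, baseline: Float) -> Float {
  ours /. baseline
}

pub fn main() {
  // Inputs are the figures from the table above.
  io.println("best run, ms/token: " <> float.to_string(ms_per_token(448.0)))
  io.println("warm path, tok/s:   " <> float.to_string(tokens_per_second(2.31)))
  io.println("vs Ollama:          " <> float.to_string(speedup(448.0, 352.0)))
}
```

By this arithmetic, 448 tok/s is about 2.23 ms/token, 2.31 ms/token is about 433 tok/s, and 448 vs 352 tok/s is roughly a 1.27x speedup.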
Validated models
| Model | Status | Path | Notes |
|---|---|---|---|
| TinyLlama-1.1B-Chat | ✅ validated | single safetensors | byte-identical baseline, 2.31 ms/tok |
| Llama-3.2-1B-Instruct (unsloth) | ✅ validated | single safetensors | tied embeddings, byte-level BPE, 2.47 ms/tok |
| NousResearch/Llama-2-7b-chat-hf | ✅ validated | sharded F16 (13.5 GB) | head_dim=128 dynamic path, 113 ms/tok |
| Phi-2 | ⚠️ partial | sharded folder | sharded discovery OK, Phi arch ≠ Llama |
Quantized GEMM kernels (RTX 4090)
| Kernel | Peak performance | Backend |
|---|---|---|
| INT8 2:4 sparse (cuSPARSELt) | 1320 TOPS | cuSPARSELt |
| INT4 2:4 sparse (CUTLASS Sm80) | 1854 TOPS | CUTLASS |
| FP8 dense (CUTLASS E4M3 W8A8) | ~660 TFLOPS | CUTLASS |
| FP8 W8A16 blocked GEMV (custom) | decode-optimized | hand-tuned CUDA |
Full methodology + raw numbers in bench/results/matmul_showdown.md.
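
For readers new to the "2:4 sparse" label in the table: it means at most two non-zero values in every group of four consecutive weights, which is the pattern sparse Tensor Cores accelerate. A minimal sketch of magnitude-based 2:4 pruning in plain Gleam follows. It is a generic illustration, not the logic in `zig_src/nif_prepack_int_sparse.c`; `prune_2_4` and `prune_group` are hypothetical names, and choosing survivors by magnitude is an assumption:

```gleam
import gleam/float
import gleam/list
import gleam/order

/// Zero the two smallest-magnitude values in every group of four weights.
/// A trailing group shorter than four keeps at most two values as well.
pub fn prune_2_4(row: List(Float)) -> List(Float) {
  row
  |> list.sized_chunk(into: 4)
  |> list.flat_map(prune_group)
}

fn prune_group(group: List(Float)) -> List(Float) {
  // Pair every weight with its position inside the group.
  let indexed = list.index_map(group, fn(w, i) { #(i, w) })

  // Positions of the two largest magnitudes.
  let kept =
    indexed
    |> list.sort(by: fn(a: #(Int, Float), b: #(Int, Float)) {
      order.negate(float.compare(
        float.absolute_value(a.1),
        float.absolute_value(b.1),
      ))
    })
    |> list.take(2)
    |> list.map(fn(pair: #(Int, Float)) { pair.0 })

  // Keep those two weights in place and zero the rest.
  list.map(indexed, fn(pair: #(Int, Float)) {
    case list.contains(kept, pair.0) {
      True -> pair.1
      False -> 0.0
    }
  })
}
```

The output has the same length and ordering as the input, so a real prepack step could then compress each group to two values plus position metadata.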
🧬 Design Principles
| Principle | Description |
|---|---|
| Honest numerics | Argmax tokens stay byte-identical to the HF fp32 reference |
| Pure-Gleam fallback | Every API works without CUDA, just slower |
| Owned device memory | PackedWeight, EmbeddingTable, KvCache are Erlang resources |
| Single-token by default | Decode is batch=1 first; batched prefill is future work |
| No magic kernels | Every .cu file is human-written, benchmarked, and committed |
🛠️ Public API
High-level: LLM inference
import viva_tensor as t

pub fn main() {
  let assert Ok(model) = t.load_model("tmp/tinyllama/model.safetensors")
  let opts =
    t.GenerateOpts(
      max_new_tokens: 50,
      temperature: 0.0, // argmax: deterministic
      top_k: t.TopKInfinity,
      top_p: 1.0,
      seed: 42,
      stop_on_eos: True,
    )
  let assert Ok(result) = t.generate(model, "Hello", opts)
  result.text
}
Reproducible sampling
let sampling_opts =
  t.GenerateOpts(
    max_new_tokens: 30,
    temperature: 0.8,
    top_k: t.TopK(40),
    top_p: 0.95,
    seed: 42,
    stop_on_eos: True,
  )
The same seed yields the same token sequence across machines.
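
For orientation, `top_k` and `top_p` conventionally name two filters applied to the next-token distribution before a token is drawn. The sketch below shows generic top-k and nucleus (top-p) filtering in plain Gleam. It is not viva_tensor's implementation: `Candidate`, `top_k`, and `top_p` are hypothetical names, and the order in which the library applies temperature, top-k, and top-p is an assumption.

```gleam
import gleam/float
import gleam/int
import gleam/io
import gleam/list
import gleam/order
import gleam/string

/// A candidate token: its id and its probability after softmax.
pub type Candidate {
  Candidate(token_id: Int, prob: Float)
}

/// Keep the `k` most probable candidates, sorted from most to least likely.
pub fn top_k(candidates: List(Candidate), k: Int) -> List(Candidate) {
  candidates
  |> list.sort(by: fn(a: Candidate, b: Candidate) {
    order.negate(float.compare(a.prob, b.prob))
  })
  |> list.take(k)
}

/// Keep the shortest prefix of an already sorted list whose cumulative
/// probability reaches `p`.
pub fn top_p(sorted: List(Candidate), p: Float) -> List(Candidate) {
  let #(kept, _) =
    list.fold(sorted, #([], 0.0), fn(acc, c: Candidate) {
      let #(kept, total) = acc
      case total >=. p {
        True -> acc
        False -> #([c, ..kept], total +. c.prob)
      }
    })
  list.reverse(kept)
}

pub fn main() {
  let probs = [Candidate(7, 0.1), Candidate(11, 0.6), Candidate(42, 0.3)]
  probs
  |> top_k(3)
  |> top_p(0.8)
  |> list.map(fn(c: Candidate) { int.to_string(c.token_id) })
  |> string.join(", ")
  |> io.println
}
```

With these inputs the filters keep tokens 11 and 42. A sampler would then renormalize the surviving probabilities and draw from them with a seeded random number generator, which is what makes a fixed `seed` reproducible.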
Low-level: quantized linears
let assert Ok(packed) = t.prepack_fp8_weight_blocked(weight, 16)
let assert Ok(output) = t.linear_fp8_w8a16(input, packed, bias)
Prepack once, then run the linear forward many times: the FP8 weight and its scales
live on the device for the lifetime of the PackedWeight resource.
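
To make "FP8 weight + scales" concrete, here is a minimal sketch of how per-block scales are commonly derived for blocked FP8 quantization. It is an assumption that the second argument of `prepack_fp8_weight_blocked` is a block size along the row and that the format is E4M3; `block_scales` and `fp8_e4m3_max` are hypothetical names, and the real prepack runs in native code:

```gleam
import gleam/float
import gleam/io
import gleam/list
import gleam/string

/// Largest finite magnitude representable in FP8 E4M3.
pub const fp8_e4m3_max = 448.0

/// One scale per block of `block_size` consecutive weights in a row.
/// `block_size` must be positive. Each weight would be stored as
/// `w /. scale` rounded to FP8, and dequantized as `fp8_value *. scale`.
pub fn block_scales(row: List(Float), block_size: Int) -> List(Float) {
  row
  |> list.sized_chunk(into: block_size)
  |> list.map(fn(block) {
    // Largest magnitude in the block maps onto the largest FP8 value.
    let max_abs =
      list.fold(block, 0.0, fn(acc, w) {
        float.max(acc, float.absolute_value(w))
      })
    max_abs /. fp8_e4m3_max
  })
}

pub fn main() {
  [448.0, -1.0, 2.0, -896.0]
  |> block_scales(2)
  |> list.map(float.to_string)
  |> string.join(", ")
  |> io.println
}
```

Smaller blocks track local weight magnitudes more closely at the cost of storing more scales. An all-zero block yields a scale of 0.0, which a real implementation would need to guard against before dividing.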
🗺️ Roadmap
| Phase | Status |
|---|---|
| Pure-Gleam tensors | ✅ |
| CUDA backend (CUTLASS + cuSPARSELt) | ✅ |
| FP8 / INT4-2:4 / INT8-2:4 sparse kernels | ✅ |
| Public ModelHandle API | ✅ |
| Sharded SafeTensors loader | ✅ |
| Byte-level BPE + SentencePiece tokenizers | ✅ |
| Weight-tied embeddings | ✅ |
| Full-token CUDA Graph capture | ✅ |
| Reproducible temperature/top-k/top-p sampling | ✅ |
| FP16 weight dtype (Llama-2-7B) | π |
| Batched prefill | ⏳ |
| Speculative decoding | ⏳ |
| Hopper SM90 / Blackwell FP4 / NVFP4 | ⏳ |
🤝 Contributing
git checkout -b feature/your-feature
make cutlass-libs && make zig
gleam test # 792 should pass with NIF loaded
make test-no-nif # 791 should pass without NIF
See CONTRIBUTING.md for guidelines.
Documentation
| Language | Link |
|---|---|
| 🇧🇷 Português | docs/pt-br/ |
| 🇺🇸 English | docs/en/ |
| 🇨🇳 中文 | docs/zh-cn/ |
Guides
- Getting started: install, first run.
- LLM inference end-to-end: load → tokenize → decode → sample.
- FFI architecture: Gleam → Erlang → C/CUDA boundaries.
- Project structure: repo layout.
API reference
- LLM ModelHandle: `load_model`, `generate`, tested models.
- Inference API: prepack + linear FP8 / INT-sparse.
- Tensor API: pure-Gleam tensor surface.
Technical paper
- Honest paper: what works, what doesn't, why.
What's new in 2.2.102
- Public LLM API. `viva_tensor.load_model(path)` accepts a HuggingFace Llama-family checkpoint (single file, sharded, or folder) and returns an opaque `ModelHandle`. `viva_tensor.generate(model, prompt, opts)` drives deterministic argmax or seeded temperature/top-k/top-p sampling.
- Fast FP8 W8A16 decode. Hand-tuned `vt_w8a16_mmv_blocked_k16` GEMV with `uint4` vectorized loads, full-token CUDA Graph capture with `cudaGraphExecUpdate`, persistent device-resident KV caches, and a cuBLASLt plan cache.
- Multi-model validated. TinyLlama-1.1B and Llama-3.2-1B-Instruct both pass the byte-identical check through the same public API.
- Sharded SafeTensors. Loads a single `.safetensors` file, an HF `model.safetensors.index.json`, or any folder containing either.
- Byte-level BPE. GPT-2 / Llama-3 byte-encoded vocabularies decode back to readable text; SentencePiece (`▁`) still works as before.
- Tied embeddings. Detects `tie_word_embeddings` from `config.json` and reuses `embed_tokens` as `lm_head` when set.

The full evolution from a 63 tok/s baseline to 448 tok/s, across 14 rounds
of optimization, is documented in CHANGELOG.md.