[![Gleam](https://img.shields.io/badge/Gleam-FFAFF3?style=for-the-badge&logo=gleam&logoColor=black)](https://gleam.run/) [![BEAM](https://img.shields.io/badge/BEAM-A90533?style=for-the-badge&logo=erlang&logoColor=white)](https://www.erlang.org/) [![OTP](https://img.shields.io/badge/OTP_27+-4B275F?style=for-the-badge)](https://www.erlang.org/doc/design_principles/des_princ) [![CUDA](https://img.shields.io/badge/CUDA_12+-76B900?style=for-the-badge&logo=nvidia&logoColor=white)](https://developer.nvidia.com/cuda-toolkit) [![SM89](https://img.shields.io/badge/RTX_4090-Ada_SM89-76B900?style=for-the-badge)](https://www.nvidia.com/en-us/geforce/graphics-cards/40-series/rtx-4090/) [![Tests](https://img.shields.io/badge/tests-792_passing-00875A?style=for-the-badge)](./test) [![Version](https://img.shields.io/badge/version-2.2.102-CD5C5C?style=for-the-badge)](./CHANGELOG.md) [![License](https://img.shields.io/badge/license-MIT-228B22?style=for-the-badge)](./LICENSE)

**[🇧🇷 Português](docs/pt-br/README.md)** · **[🇺🇸 English](docs/en/README.md)** · **[🇨🇳 中文](docs/zh-cn/README.md)**

---

*"Tensors speak Gleam. Kernels burn silicon. The BEAM holds the soul."*

> [!IMPORTANT]
> viva_tensor IS NOT A WRAPPER. It is a production-grade FP8 LLM inference engine written from scratch: hand-tuned CUDA kernels, blocked W8A16 GEMV, full-token CUDA Graphs, and a public ModelHandle API, all driven from Gleam on the BEAM.

In the benchmarks below, it decodes TinyLlama-1.1B at 448 tok/s (best run) versus 352 tok/s for Ollama on the same hardware.


🎯 Overview

A tensor library for Gleam on the BEAM. Provides a pure-Gleam tensor API for portability, an inference API for FP8 / INT4-2:4 / INT8-2:4 sparse linear layers, and a public LLM ModelHandle API for Llama-family HuggingFace checkpoints.

The library works fully in pure BEAM (slow but portable) and transparently upgrades to the native CUDA path when the NIF is loaded.

| Property       | Value                                              |
| :------------- | :------------------------------------------------- |
| Language       | Pure Gleam (type-safe functional)                  |
| Runtime        | BEAM / OTP 27+                                     |
| Native backend | CUDA 12 + CUTLASS + cuSPARSELt (SM89 / Ada)        |
| Tests          | 792 passing                                        |
| Decode         | 448 tok/s TinyLlama-1.1B (vs Ollama 352)           |
| Public API     | `viva_tensor.load_model` / `viva_tensor.generate`  |

⚡ Quick Start

git clone https://github.com/gabrielmaialva33/viva_tensor.git && cd viva_tensor
gleam deps download

# Optional: native CUDA backend (RTX 4090 / Ada SM89)
make cutlass-libs    # CUTLASS + cuSPARSELt static archives
make zig             # the NIF shared object

gleam test           # 792 tests, all pass with NIF loaded

Generate text in 4 lines of Gleam

import viva_tensor as t

let assert Ok(model) = t.load_model("tmp/tinyllama/model.safetensors")
let assert Ok(result) = t.generate(model, "Hello", t.default_generate_opts())
result.text

📋 Prerequisites

| Tool                      | Version     | Required for          |
| :------------------------ | :---------- | :-------------------- |
| Gleam                     | `>= 1.14`   | Build / pure-Gleam    |
| Erlang/OTP                | `>= 27`     | BEAM runtime          |
| CUDA toolkit              | `>= 12.0`   | Native inference path |
| NVIDIA GPU                | Ada+ (SM89) | FP8 / Tensor Cores    |
| `make` + `zig` + `clang`  | recent      | NIF build pipeline    |

The pure-Gleam path needs only Gleam + Erlang/OTP.

🏗️ Architecture

   ┌─────────────────────────────────────────────────────────┐
   │                  Gleam application code                 │
   │       viva_tensor.load_model / .generate / .Tensor      │
   └────────────────────────┬────────────────────────────────┘
                            │
   ┌────────────────────────▼────────────────────────────────┐
   │            Erlang public API (viva_tensor_llm)          │
   │  SafeTensors loader · BPE tokenizer · sampling · KV     │
   └────────────────────────┬────────────────────────────────┘
                            │
   ┌────────────────────────▼────────────────────────────────┐
   │          NIF dispatch (viva_tensor_zig.so + .erl)       │
   │   PackedWeight · EmbeddingTable · KvCache · ModelHandle │
   └─────────┬──────────────────────────────┬────────────────┘
             │                              │
   ┌─────────▼──────────┐         ┌─────────▼────────────────┐
   │ Pure-Gleam tensors │         │      CUDA kernels        │
   │   (no GPU needed)  │         │ W8A16 GEMV · FlashAttn   │
   │                    │         │ Full-token CUDA Graph    │
   │                    │         │ CUTLASS FP8/INT4 sparse  │
   └────────────────────┘         └──────────────────────────┘

📋 Core Modules

| Module                              | Description                                       |
| :---------------------------------- | :------------------------------------------------ |
| `viva_tensor`                       | Public Gleam API: tensors, prepack, linear, LLM   |
| `viva_tensor_llm`                   | `load_model` / `generate` (opaque `ModelHandle`)  |
| `viva_tensor_zig`                   | NIF dispatch (Erlang stubs)                       |
| `viva_tensor_safetensors_ffi`       | HF SafeTensors loader, sharded support, BF16/F16  |
| `viva_tensor_tokenizer_ffi`         | SentencePiece + byte-level BPE (GPT-2/Llama-3)    |
| `zig_src/cuda_block_forward.cu`     | RMSNorm, RoPE, GQA flash attn, SiLU, residual     |
| `zig_src/nif_forward_block.c`       | Decode-step orchestration, CUDA Graph capture     |
| `zig_src/cuda_fp8_cutlass.cu`       | CUTLASS FP8 dense GEMM                            |
| `zig_src/nif_prepack_int_sparse.c`  | INT4 / INT8 2:4 sparse weight prepack             |
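
As a reference point for one of the ops listed for `cuda_block_forward.cu`, here is a minimal pure-Gleam sketch of RMSNorm using its textbook definition. It is not the library's kernel: the function name `rms_norm` and its list-based signature are assumptions made for illustration, and viva_tensor's own tensor types are not used.

```gleam
import gleam/float
import gleam/int
import gleam/list

/// Textbook RMSNorm: y_i = x_i / sqrt(mean(x^2) + eps) * weight_i.
/// Hypothetical helper for illustration; not part of viva_tensor.
pub fn rms_norm(
  x: List(Float),
  weight: List(Float),
  eps: Float,
) -> List(Float) {
  let n = int.to_float(list.length(x))
  let mean_sq = list.fold(x, 0.0, fn(acc, v) { acc +. v *. v }) /. n
  // square_root only fails for negative input, which cannot occur here
  // when eps >= 0, so the Error branch is a defensive fallback.
  let rms = case float.square_root(mean_sq +. eps) {
    Ok(r) -> r
    Error(_) -> 1.0
  }
  // Normalize each element, then apply the learned per-channel weight.
  list.map2(x, weight, fn(v, w) { v /. rms *. w })
}

pub fn main() {
  echo rms_norm([2.0, 2.0], [1.0, 1.0], 0.0)
}
```

A CPU reference like this is mainly useful for checking a GPU kernel's output element by element within a tolerance.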

📊 Performance

All numbers were measured on an RTX 4090 (Ada SM89) with an Intel i9-13900K (32 threads @ 5.80 GHz). They are reproducible via the `bench/` harness.

Text generation: TinyLlama-1.1B-Chat (FP8 W8A16)

| Runtime                                        | Decode speed  |
| :--------------------------------------------- | :------------ |
| viva_tensor (best run)                         | 448 tok/s     |
| Ollama local baseline (same model)             | 352 tok/s     |
| `viva_tensor.generate` (warm)                  | 2.31 ms/token |
| `viva_tensor.generate`, Llama-3.2-1B-Instruct  | 2.47 ms/token |
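
The table mixes two units, tok/s and ms/token. The small sketch below converts between them so the rows can be compared directly; `tokens_per_second` and `speedup` are hypothetical helper names, not part of the library.

```gleam
import gleam/float
import gleam/io

/// Convert a per-token latency in milliseconds to tokens per second.
pub fn tokens_per_second(ms_per_token: Float) -> Float {
  1000.0 /. ms_per_token
}

/// Ratio of two throughputs, e.g. viva_tensor vs. a baseline.
pub fn speedup(ours: Float, baseline: Float) -> Float {
  ours /. baseline
}

pub fn main() {
  // 2.31 ms/token is roughly 433 tok/s; 2.47 ms/token is roughly 405 tok/s.
  io.println(float.to_string(tokens_per_second(2.31)))
  io.println(float.to_string(tokens_per_second(2.47)))
  // 448 tok/s against 352 tok/s is roughly a 1.27x speedup.
  io.println(float.to_string(speedup(448.0, 352.0)))
}
```

Read this way, the warm `generate` figure (about 433 tok/s) sits slightly below the 448 tok/s best run.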

Validated models

| Model                            | Status       | Path                 | Notes                                          |
| :------------------------------- | :----------- | :------------------- | :--------------------------------------------- |
| TinyLlama-1.1B-Chat              | ✅ validated | single safetensors   | byte-identical baseline, 2.31 ms/tok           |
| Llama-3.2-1B-Instruct (unsloth)  | ✅ validated | single safetensors   | tied embeddings, byte-level BPE, 2.47 ms/tok   |
| NousResearch/Llama-2-7b-chat-hf  | ✅ validated | sharded F16 (13.5GB) | head_dim=128 dynamic path, 113 ms/tok          |
| Phi-2                            | ⚠️ partial   | sharded folder       | sharded discovery OK, Phi arch ≠ Llama         |

Quantized GEMM kernels (RTX 4090)

| Kernel                           | Peak performance | Backend         |
| :------------------------------- | :--------------- | :-------------- |
| INT8 2:4 sparse (cuSPARSELt)     | 1320 TOPS        | cuSPARSELt      |
| INT4 2:4 sparse (CUTLASS Sm80)   | 1854 TOPS        | CUTLASS         |
| FP8 dense (CUTLASS E4M3 W8A8)    | ~660 TFLOPS      | CUTLASS         |
| FP8 W8A16 blocked GEMV (custom)  | decode-optimized | hand-tuned CUDA |

Full methodology and raw numbers are in bench/results/matmul_showdown.md.
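
For readers new to the "2:4 sparse" label used above: in every group of four consecutive weights, at most two are non-zero. The sketch below applies magnitude-based 2:4 pruning to a flat list in pure Gleam. It only illustrates the pattern; the names `prune_2_4` and `prune_group` are assumptions, and the library's actual prepack step (in `nif_prepack_int_sparse.c`) may select and store values differently.

```gleam
import gleam/float
import gleam/list

/// Keep the two largest-magnitude values of one group, zero the rest.
/// Positions are preserved. Hypothetical helper for illustration.
fn prune_group(group: List(Float)) -> List(Float) {
  // Indices of the two entries with the largest absolute value.
  let keep =
    group
    |> list.index_map(fn(x, i) { #(i, float.absolute_value(x)) })
    |> list.sort(fn(a: #(Int, Float), b: #(Int, Float)) {
      float.compare(b.1, a.1)
    })
    |> list.take(2)
    |> list.map(fn(pair: #(Int, Float)) { pair.0 })

  list.index_map(group, fn(x, i) {
    case list.contains(keep, i) {
      True -> x
      False -> 0.0
    }
  })
}

/// Apply the 2:4 pattern to a flat weight list, four values at a time.
pub fn prune_2_4(weights: List(Float)) -> List(Float) {
  weights
  |> list.sized_chunk(into: 4)
  |> list.flat_map(prune_group)
}

pub fn main() {
  echo prune_2_4([0.1, -0.9, 0.5, 0.2, 3.0, 0.0, -1.0, 2.0])
}
```

Because exactly half of each group is zero, sparse Tensor Core paths can skip those multiplications, which is where the higher peak figures for the sparse kernels come from.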


🧬 Design Principles

| Principle               | Description                                                         |
| :---------------------- | :------------------------------------------------------------------ |
| Honest numerics         | argmax tokens stay byte-identical to HF reference fp32              |
| Pure-Gleam fallback     | Every API works without CUDA, just slower                           |
| Owned device memory     | `PackedWeight`, `EmbeddingTable`, `KvCache` are Erlang resources    |
| Single-token by default | Decode is batch=1 first; batched prefill is future work             |
| No magic kernels        | Every `.cu` file is human-written, benchmarked, and committed       |

🛠️ Public API

High-level: LLM inference

import viva_tensor as t

pub fn main() {
  let assert Ok(model) = t.load_model("tmp/tinyllama/model.safetensors")

  let opts =
    t.GenerateOpts(
      max_new_tokens: 50,
      temperature: 0.0,           // argmax: deterministic
      top_k: t.TopKInfinity,
      top_p: 1.0,
      seed: 42,
      stop_on_eos: True,
    )

  let assert Ok(result) = t.generate(model, "Hello", opts)
  result.text
}

Reproducible sampling

let sampling_opts =
  t.GenerateOpts(
    max_new_tokens: 30,
    temperature: 0.8,
    top_k: t.TopK(40),
    top_p: 0.95,
    seed: 42,
    stop_on_eos: True,
  )

Same seed → same token sequence across machines.
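
Seeded sampling is reproducible because the sampler is a pure function of the probability distribution and the seed. The sketch below shows the idea with a simple linear congruential generator and inverse-CDF sampling. It is not the library's RNG or sampler: every name here is hypothetical, and the LCG constants are the common textbook ones.

```gleam
import gleam/int
import gleam/list

/// LCG modulus (2^31). Illustrative only; not the library's generator.
const modulus = 2_147_483_648

/// One LCG step. Seeds are assumed to be non-negative.
fn next_state(state: Int) -> Int {
  { 1_103_515_245 * state + 12_345 } % modulus
}

/// Walk the cumulative distribution until the draw `u` is covered.
fn pick(probs: List(Float), u: Float, index: Int) -> Int {
  case probs {
    // Rounding can leave a tiny remainder: fall back to the last index.
    [] -> int.max(index - 1, 0)
    [p, ..rest] ->
      case u <. p {
        True -> index
        False -> pick(rest, u -. p, index + 1)
      }
  }
}

fn sample_loop(
  probs: List(Float),
  state: Int,
  remaining: Int,
  acc: List(Int),
) -> List(Int) {
  case remaining <= 0 {
    True -> list.reverse(acc)
    False -> {
      let state = next_state(state)
      // Uniform draw in [0, 1) derived only from the generator state.
      let u = int.to_float(state) /. int.to_float(modulus)
      sample_loop(probs, state, remaining - 1, [pick(probs, u, 0), ..acc])
    }
  }
}

/// Draw `count` indices from `probs`, fully determined by `seed`.
pub fn sample_tokens(probs: List(Float), seed: Int, count: Int) -> List(Int) {
  sample_loop(probs, seed, count, [])
}

pub fn main() {
  // Two calls with the same seed print the same index sequence.
  echo sample_tokens([0.5, 0.3, 0.2], 42, 8)
  echo sample_tokens([0.5, 0.3, 0.2], 42, 8)
}
```

In a real model the distribution itself comes from floating-point logits, so bit-identical sequences also depend on the forward pass producing the same logits on each run.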

Low-level: quantized linears

let assert Ok(packed) = t.prepack_fp8_weight_blocked(weight, 16)
let assert Ok(output) = t.linear_fp8_w8a16(input, packed, bias)

Prepack once, then run linear forwards many times: the FP8 weight and scales live on the device for the lifetime of the `PackedWeight` resource.
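
To make "FP8 weight + scales" concrete, here is a hedged pure-Gleam sketch of per-block scale computation for FP8 E4M3, whose largest finite magnitude is 448. It assumes the second argument to `prepack_fp8_weight_blocked` (16 in the example above) is a block size, which the source does not state. The helper name `block_scales` is hypothetical, and the real prepack path may use a different layout or scale rule.

```gleam
import gleam/float
import gleam/list

/// Largest finite magnitude representable in FP8 E4M3.
const e4m3_max = 448.0

/// One scale per block, chosen so the block's absolute maximum maps
/// to e4m3_max. Hypothetical helper; block_size is an assumed meaning
/// of the prepack argument.
pub fn block_scales(weights: List(Float), block_size: Int) -> List(Float) {
  weights
  |> list.sized_chunk(into: block_size)
  |> list.map(fn(block) {
    let absmax =
      list.fold(block, 0.0, fn(acc, w) {
        float.max(acc, float.absolute_value(w))
      })
    // An all-zero block yields a scale of 0.0; a real implementation
    // would need to guard against dividing by it.
    absmax /. e4m3_max
  })
}

pub fn main() {
  // Two blocks of size 2: absmax 448.0 and 2.0.
  echo block_scales([448.0, -224.0, 1.0, 2.0], 2)
}
```

Dividing each weight by its block's scale before rounding to FP8 keeps small-valued blocks from losing precision to a single large outlier elsewhere in the matrix.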


🗺️ Roadmap

| Phase                                          | Status |
| :--------------------------------------------- | :----- |
| Pure-Gleam tensors                             | ✅     |
| CUDA backend (CUTLASS + cuSPARSELt)            | ✅     |
| FP8 / INT4-2:4 / INT8-2:4 sparse kernels       | ✅     |
| Public ModelHandle API                         | ✅     |
| Sharded SafeTensors loader                     | ✅     |
| Byte-level BPE + SentencePiece tokenizers      | ✅     |
| Weight-tied embeddings                         | ✅     |
| Full-token CUDA Graph capture                  | ✅     |
| Reproducible temperature/top-k/top-p sampling  | ✅     |
| FP16 weight dtype (Llama-2-7B)                 | 🔄     |
| Batched prefill                                | ⏳     |
| Speculative decoding                           | ⏳     |
| Hopper SM90 / Blackwell FP4 / NVFP4            | ⏳     |

🤝 Contributing

git checkout -b feature/your-feature
make cutlass-libs && make zig
gleam test          # 792 should pass with NIF loaded
make test-no-nif    # 791 should pass without NIF

See CONTRIBUTING.md for guidelines.


📚 Documentation

| Language      | Link         |
| :------------ | :----------- |
| 🇧🇷 Português  | docs/pt-br/  |
| 🇺🇸 English    | docs/en/     |
| 🇨🇳 中文       | docs/zh-cn/  |

Guides

API reference

Technical paper


📜 What's new in 2.2.102

The full evolution from the 63 tok/s baseline to 448 tok/s, across 14 rounds of optimization, is documented in CHANGELOG.md.


**Star if you believe BEAM can do LLM inference ⭐**

[![GitHub stars](https://img.shields.io/github/stars/gabrielmaialva33/viva_tensor?style=social)](https://github.com/gabrielmaialva33/viva_tensor)

*Created by Gabriel Maia · MIT License*