
# viva_tensor

[Gleam](https://gleam.run/) · [Tests](./test) · [License](./LICENSE)
**The fastest tensor library on the BEAM**
## Performance

### GPU Tensor Cores (RTX 4090)

| Backend | Throughput | % of Peak |
|---|---|---|
| FP8 E4M3 (CUTLASS) | 660 TOPS | 100% |
| INT8 Dense (IMMA) | 604 TOPS | 92% |
| FP16 Dense (cublasGemmEx) | 284 TFLOPS | 86% |
| FP32/TF32 (cuBLAS) | 84.5 TFLOPS | 102% |
| Fused GEMM+ReLU | 162 TFLOPS | free activation |
### GPU 2:4 Structured Sparsity

| Backend | Throughput | % of Peak |
|---|---|---|
| INT4 Sparse (CUTLASS) | 1854 TOPS | 70% |
| INT8 Sparse (cuSPARSELt) | 1094 TOPS | 83% |
| INT8 Sparse (CUTLASS) | 841 TOPS | 64% |
| FP8 Sparse (cuSPARSELt) | 702 TOPS | 53% |
| FP16 Sparse (cuSPARSELt) | 355 TFLOPS | 53% |
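In the 2:4 pattern, every contiguous group of four values keeps at most two nonzeros, which the GPU's sparse Tensor Cores exploit to skip half the multiply-accumulates. As a rough illustration of the pruning step only (this is not viva_tensor's API; `prune_2_4` is a hypothetical name):

```gleam
import gleam/float
import gleam/list
import gleam/result

// Hypothetical sketch, not part of viva_tensor: enforce the 2:4 pattern by
// keeping the 2 largest-magnitude values in each contiguous group of 4.
// prune_2_4([0.9, -0.1, 0.4, 0.05]) == [0.9, 0.0, 0.4, 0.0]
pub fn prune_2_4(weights: List(Float)) -> List(Float) {
  weights
  |> list.sized_chunk(into: 4)
  |> list.map(keep_top_2)
  |> list.flatten
}

fn keep_top_2(group: List(Float)) -> List(Float) {
  // Magnitude of the 2nd-largest element; ties may keep extra values.
  let threshold =
    group
    |> list.map(float.absolute_value)
    |> list.sort(by: float.compare)
    |> list.reverse
    |> list.take(2)
    |> list.last
    |> result.unwrap(0.0)
  list.map(group, fn(w) {
    case float.absolute_value(w) >=. threshold {
      True -> w
      False -> 0.0
    }
  })
}
```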
### CPU (Intel MKL)

| Size | viva_tensor | PyTorch | NumPy | vs PyTorch |
|---|---|---|---|---|
| 5000x5000 | 931 GFLOPS | 620 GFLOPS | 368 GFLOPS | +50% |
Xeon 24-core (AVX2), MKL dgemm FP64, compact affinity, MADV_HUGEPAGE.
All numbers verified with CUDA events and IQR outlier removal.
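For reference, these throughput figures follow from the standard GEMM operation count: an M×N×K matmul performs 2·M·N·K floating-point operations, so throughput is that count divided by the measured time. A minimal sketch of the arithmetic (illustrative only, not library code):

```gleam
import gleam/int

// Illustrative only: how a GEMM throughput figure is derived.
// A square N x N matmul performs 2 * N^3 floating-point operations.
pub fn gflops(n: Int, elapsed_us: Int) -> Float {
  let n_f = int.to_float(n)
  let ops = 2.0 *. n_f *. n_f *. n_f
  let seconds = int.to_float(elapsed_us) /. 1_000_000.0
  ops /. seconds /. 1_000_000_000.0
}
// e.g. a 5000x5000 run taking ~268_600 us corresponds to ~931 GFLOPS,
// matching the CPU row above.
```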
## Install

```sh
gleam add viva_tensor
```
## Architecture

```mermaid
graph TB
    subgraph "Gleam Layer (44 modules, 67K lines)"
        A[viva_tensor API]
        B[core/ - tensor, ops, shape, ffi]
        C[quant/ - INT8, NF4, AWQ]
        D[nn/ - autograd, layers, flash_attention]
    end
    subgraph "Erlang Layer"
        E[viva_tensor_zig.erl - NIF wrapper]
    end
    subgraph "Native Layer (13K+ lines C/CUDA)"
        F[nif_entry.c - dispatch]
        G[nif_cpu_ops.c - AVX2 SIMD]
        H[nif_cuda_fp32/fp16/int8.c - Tensor Cores]
        I[nif_sparse.c - 2:4 sparsity]
        J[nif_specialized.c - fused GEMM]
    end
    subgraph "Backend Libraries"
        K[Intel MKL]
        L[CUDA cuBLAS/cuBLASLt]
        M[cuSPARSELt]
        N[CUTLASS]
    end
    A --> B & C & D
    B --> E
    E --> F
    F --> G & H & I & J
    G --> K
    H --> L
    I --> M & N
    J --> L
    style A fill:#FFAFF3
    style K fill:#0071C5,color:#fff
    style L fill:#76B900,color:#fff
    style M fill:#76B900,color:#fff
    style N fill:#76B900,color:#fff
```
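The hop from Gleam to native code goes through the `viva_tensor_zig.erl` wrapper shown above. A minimal sketch of what such an FFI boundary looks like in Gleam (the function name and signature here are assumptions for illustration, not the library's actual externals):

```gleam
// Hypothetical sketch of the FFI boundary -- not viva_tensor's real
// externals. Gleam declares an external that resolves to the Erlang NIF
// wrapper module, which in turn dispatches into nif_entry.c.
@external(erlang, "viva_tensor_zig", "nif_matmul")
fn nif_matmul(a: BitArray, b: BitArray, m: Int, n: Int, k: Int) -> BitArray
```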
## Quick Start

```gleam
import viva_tensor as t

pub fn main() {
  // Create tensors
  let a = t.zeros([1000, 1000])
  let b = t.random_uniform([1000, 1000])

  // Matrix multiplication (auto-selects the best backend)
  let c = t.matmul(a, b)

  // Activations
  let activated = t.relu(c) |> t.sigmoid
  activated
}
```
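Per the architecture diagram above, the backend selection in `t.matmul` resolves to Intel MKL on CPU-only builds and to the cuBLAS/cuSPARSELt/CUTLASS paths when the NIF is built with CUDA support.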
## Features

```mermaid
mindmap
  root((viva_tensor))
    Core Ops
      add/sub/mul/div
      matmul/transpose
      sum/mean/max/min
      dot/outer/broadcast
    GPU Backends
      FP32/TF32 cuBLAS
      FP16 Tensor Cores
      INT8 IMMA
      FP8 E4M3 CUTLASS
    Sparsity
      INT4 2:4 CUTLASS
      INT8 2:4 cuSPARSELt
      FP8/FP16 Sparse
    Quantization
      INT8 4x compress
      NF4 7.5x compress
      AWQ 7.7x compress
    Neural Networks
      autograd
      linear layers
      flash attention
      fused GEMM+act
    CNN
      conv2d
      max/avg pool2d
      global_avg_pool2d
```
## Quantization

| Method | Compression | Quality | Use Case |
|---|---|---|---|
| INT8 | 4x | 96% | Inference |
| NF4 | 7.5x | 99% | QLoRA Fine-tuning |
| AWQ | 7.7x | 97% | Edge Deployment |
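The compression ratios follow from bits per weight relative to FP32. As a back-of-envelope sketch (not library code; the 64-element block and 16-bit scale are assumptions that happen to reproduce the NF4 figure):

```gleam
import gleam/int

// Back-of-envelope only, not viva_tensor code: compression vs 32-bit
// weights for a 4-bit format with one `scale_bits` scale per block.
pub fn ratio_4bit(block_size: Int, scale_bits: Int) -> Float {
  let meta = int.to_float(scale_bits) /. int.to_float(block_size)
  32.0 /. { 4.0 +. meta }
}
// ratio_4bit(64, 16) ~= 7.5, matching the NF4 row;
// INT8 is simply 32 / 8 = 4x before scale metadata.
```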
## Build

```sh
# Pure Gleam (no native deps)
make build && make test

# With NIF acceleration (Intel MKL + CUDA)
make zig && make build

# Full build
make build-all
```
## Requirements

- Gleam 1.14.0+
- OTP 27+
- Zig 0.14+ (for the NIF build)
- Intel MKL (CPU BLAS)
- CUDA 13+ with cuBLAS and cuBLASLt (GPU)
- cuSPARSELt 0.8.1+ (sparse ops)
- CUTLASS 4.3+ (FP8, INT4 sparse)
## GPU Benchmark Suite

```sh
# Individual benchmarks (Erlang escripts)
./bench/bench_gpu_peak.erl      # FP32/TF32
./bench/bench_fp16_imma.erl     # FP16 Tensor Cores
./bench/bench_int8_imma.erl     # INT8 IMMA
./bench/bench_fp8_peak.erl      # FP8 E4M3
./bench/bench_sparse_peak.erl   # 2:4 Sparsity
./bench/bench_fused_peak.erl    # Fused GEMM+activation
./bench/bench_batched_peak.erl  # Batched GEMM
```
```mermaid
flowchart LR
G[Gleam] --> Z[Zig NIF] --> M[Intel MKL]
Z --> C[CUDA Tensor Cores]
Z --> S[cuSPARSELt]
Z --> CU[CUTLASS]
style G fill:#FFAFF3,color:#000
style Z fill:#F7A41D,color:#000
style M fill:#0071C5,color:#fff
style C fill:#76B900,color:#fff
style S fill:#76B900,color:#fff
style CU fill:#76B900,color:#fff
```
**Built with love for the BEAM**
