LlamaCppEx
Elixir bindings for llama.cpp — run LLMs locally with Metal, CUDA, Vulkan, or CPU acceleration.
Built with C++ NIFs using fine for ergonomic resource management and elixir_make for the build system.
Features
- Load and run GGUF models directly from Elixir
- HuggingFace Hub integration — search, list, and download GGUF models
- GPU acceleration: Metal (macOS), CUDA (NVIDIA), Vulkan, or CPU
-
Streaming token generation via lazy
Stream -
Jinja chat templates with
enable_thinkingsupport (Qwen3, Qwen3.5, etc.) - RAII resource management — models, contexts, and samplers are garbage collected by the BEAM
- Configurable sampling: temperature, top-k, top-p, min-p, repetition penalty, frequency & presence penalty
- Embedding generation with L2 normalization
- Grammar-constrained generation (GBNF)
- Structured output via JSON Schema (auto-converted to GBNF grammar)
- Optional Ecto schema to JSON Schema conversion
- Continuous batching server for concurrent inference
- Prefix caching — same-slot KV cache reuse for multi-turn chat (1.23x faster)
- Pluggable batching strategies — DecodeMaximal, PrefillPriority, Balanced
- Pre-tokenized API — tokenize outside the GenServer for lower contention
- Telemetry integration for observability
Installation
Add llama_cpp_ex to your list of dependencies in mix.exs:
def deps do
[
{:llama_cpp_ex, "~> 0.6.7"}
]
endPrerequisites
- C++17 compiler (GCC, Clang, or MSVC)
- CMake 3.14+
- Git (for the llama.cpp submodule)
Backend Selection
mix compile # Auto-detect (Metal on macOS, CUDA if nvcc found, else CPU)
LLAMA_BACKEND=metal mix compile # Apple Silicon GPU
LLAMA_BACKEND=cuda mix compile # NVIDIA GPU
LLAMA_BACKEND=vulkan mix compile # Vulkan
LLAMA_BACKEND=cpu mix compile # CPU onlyPower users can pass arbitrary CMake flags:
LLAMA_CMAKE_ARGS="-DGGML_CUDA_FORCE_CUBLAS=ON" mix compileQuick Start
# Initialize the backend (once per application)
:ok = LlamaCppEx.init()
# Load a GGUF model (use n_gpu_layers: -1 to offload all layers to GPU)
{:ok, model} = LlamaCppEx.load_model("path/to/model.gguf", n_gpu_layers: -1)
# Generate text
{:ok, text} = LlamaCppEx.generate(model, "Once upon a time", max_tokens: 200, temp: 0.8)
# Stream tokens
model
|> LlamaCppEx.stream("Tell me a story", max_tokens: 500)
|> Enum.each(&IO.write/1)
# Chat with template
{:ok, reply} = LlamaCppEx.chat(model, [
%{role: "system", content: "You are a helpful assistant."},
%{role: "user", content: "What is Elixir?"}
], max_tokens: 200)
# Chat with thinking disabled (Qwen3/3.5 and similar models)
{:ok, reply} = LlamaCppEx.chat(model, [
%{role: "user", content: "What is 2+2?"}
], max_tokens: 64, enable_thinking: false)
# Stream a chat response
model
|> LlamaCppEx.stream_chat([
%{role: "user", content: "Explain pattern matching in Elixir."}
], max_tokens: 500)
|> Enum.each(&IO.write/1)HuggingFace Hub
Download GGUF models directly from HuggingFace Hub. Requires the optional :req dependency:
{:req, "~> 0.5"}# Search for GGUF models
{:ok, models} = LlamaCppEx.Hub.search("qwen3 gguf", limit: 5)
# List GGUF files in a repository
{:ok, files} = LlamaCppEx.Hub.list_gguf_files("Qwen/Qwen3-0.6B-GGUF")
# Download (cached locally in ~/.cache/llama_cpp_ex/models/)
{:ok, path} = LlamaCppEx.Hub.download("Qwen/Qwen3-0.6B-GGUF", "Qwen3-0.6B-Q8_0.gguf")
# Or download + load in one step
{:ok, model} = LlamaCppEx.load_model_from_hub(
"Qwen/Qwen3-0.6B-GGUF", "Qwen3-0.6B-Q8_0.gguf",
n_gpu_layers: -1
)
For private/gated models, set HF_TOKEN or pass token: "hf_...". Set LLAMA_OFFLINE=1 for offline-only cached access.
Structured Output (JSON Schema)
Constrain model output to valid JSON matching a schema. Pass :json_schema to any generate or chat function — the schema is automatically converted to a GBNF grammar via llama.cpp's built-in converter.
schema = %{
"type" => "object",
"properties" => %{
"name" => %{"type" => "string"},
"age" => %{"type" => "integer"},
"hobbies" => %{"type" => "array", "items" => %{"type" => "string"}}
},
"required" => ["name", "age", "hobbies"],
"additionalProperties" => false
}
# Works with generate
{:ok, json} = LlamaCppEx.generate(model, "Generate a person:",
json_schema: schema, temp: 0.0)
# => "{\"name\": \"Alice\", \"age\": 30, \"hobbies\": [\"reading\", \"hiking\"]}"
# Works with chat
{:ok, json} = LlamaCppEx.chat(model, [
%{role: "user", content: "Generate a person named Bob who is 25."}
], json_schema: schema, temp: 0.0)
# Works with streaming
model
|> LlamaCppEx.stream("Generate a person:", json_schema: schema, temp: 0.0)
|> Enum.each(&IO.write/1)
# Works with chat completions
{:ok, completion} = LlamaCppEx.chat_completion(model, [
%{role: "user", content: "Generate a person."}
], json_schema: schema, temp: 0.0)Tip: Set
"additionalProperties" => falsein your schema to produce a tighter grammar that avoids potential issues with the grammar sampler.
Manual Grammar Conversion
You can also convert the schema to GBNF manually for more control:
{:ok, gbnf} = LlamaCppEx.Grammar.from_json_schema(schema)
IO.puts(gbnf)
# root ::= "{" space name-kv "," space age-kv "," space hobbies-kv "}" space
# ...
# Use the grammar directly
{:ok, json} = LlamaCppEx.generate(model, "Generate a person:", grammar: gbnf, temp: 0.0)Ecto Schema Integration
Convert Ecto schema modules to JSON Schema automatically (requires {:ecto, "~> 3.0"} — optional dependency):
defmodule MyApp.Person do
use Ecto.Schema
embedded_schema do
field :name, :string
field :age, :integer
field :active, :boolean
field :tags, {:array, :string}
end
end
# Ecto schema -> JSON Schema -> constrained generation
schema = LlamaCppEx.Schema.to_json_schema(MyApp.Person)
# => %{"type" => "object", "properties" => %{"name" => %{"type" => "string"}, ...}, ...}
{:ok, json} = LlamaCppEx.chat(model, [
%{role: "user", content: "Generate a person."}
], json_schema: schema, temp: 0.0)
Supported Ecto types: :string, :integer, :float, :decimal, :boolean, :map, {:array, inner}, :date, :utc_datetime, :naive_datetime, and embedded schemas (embeds_one/embeds_many). Fields :id, :inserted_at, and :updated_at are excluded automatically.
Lower-level API
For fine-grained control over the inference pipeline:
# Tokenize
{:ok, tokens} = LlamaCppEx.Tokenizer.encode(model, "Hello world")
{:ok, text} = LlamaCppEx.Tokenizer.decode(model, tokens)
# Create context and sampler separately
{:ok, ctx} = LlamaCppEx.Context.create(model, n_ctx: 4096)
{:ok, sampler} = LlamaCppEx.Sampler.create(model, temp: 0.7, top_p: 0.9)
# Run generation with your own context
{:ok, tokens} = LlamaCppEx.Tokenizer.encode(model, "The answer is")
{:ok, text} = LlamaCppEx.Context.generate(ctx, sampler, tokens, max_tokens: 100)
# Model introspection
LlamaCppEx.Model.desc(model) # "llama 7B Q4_K - Medium"
LlamaCppEx.Model.n_params(model) # 6_738_415_616
LlamaCppEx.Model.chat_template(model) # "<|im_start|>..."
LlamaCppEx.Tokenizer.vocab_size(model) # 32000Server (Continuous Batching)
For concurrent inference, LlamaCppEx.Server manages a shared model/context with a slot pool and continuous batching:
{:ok, server} = LlamaCppEx.Server.start_link(
model_path: "model.gguf",
n_gpu_layers: -1,
n_parallel: 4,
n_ctx: 8192
)
# Synchronous
{:ok, text} = LlamaCppEx.Server.generate(server, "Once upon a time", max_tokens: 100)
# Streaming
LlamaCppEx.Server.stream(server, "Tell me a story", max_tokens: 200)
|> Enum.each(&IO.write/1)Multiple callers are batched into a single forward pass per tick, improving throughput under load.
Prefix Caching
The server caches KV state between requests on the same slot. Multi-turn chat benefits automatically — the system prompt and prior turns aren't recomputed:
{:ok, server} = LlamaCppEx.Server.start_link(
model_path: "model.gguf",
n_parallel: 4,
cache_prompt: true # opt-in (default: false)
)Benchmark: 1.23x faster for multi-turn conversations (487ms vs 597ms per 4-turn exchange).
Batching Strategies
Choose how the token budget is split between generation and prompt processing:
# Default: generation latency optimized
batch_strategy: LlamaCppEx.Server.Strategy.DecodeMaximal
# Throughput optimized (batch processing)
batch_strategy: LlamaCppEx.Server.Strategy.PrefillPriority
# Fair split (mixed workloads)
batch_strategy: LlamaCppEx.Server.Strategy.BalancedPre-Tokenized API
Tokenize outside the GenServer to reduce contention under concurrent load:
model = LlamaCppEx.Server.get_model(server)
{:ok, tokens} = LlamaCppEx.Tokenizer.encode(model, prompt)
{:ok, text} = LlamaCppEx.Server.generate_tokens(server, tokens, max_tokens: 100)See Performance Guide for detailed tuning advice.
Benchmarks
Measured on Apple M4 Max (64 GB), Metal backend (n_gpu_layers: -1).
Single-model generation speed
| Model | Quantization | Tokens/sec |
|---|---|---|
| Llama 3.2 3B Instruct | Q4_K_XL | 125.6 |
| Ministral 3 3B Reasoning | Q4_K_XL | 113.0 |
| Ministral 3 3B Instruct | Q4_K_XL | 104.3 |
| GPT-OSS 20B | Q4_K_XL | 79.4 |
| Qwen3.5-35B-A3B | Q6_K | 56.0 |
| Qwen3.5-27B | Q4_K_XL | 17.5 |
Single-sequence generation (Qwen3-4B Q4_K_M)
| Prompt | 32 tokens | 128 tokens |
|---|---|---|
| short (6 tok) | 0.31s (3.19 ips) | 1.01s (0.98 ips) |
| medium (100 tok) | 0.36s (2.79 ips) | 1.06s (0.94 ips) |
| long (500 tok) | 0.65s (1.53 ips) | 1.29s (0.77 ips) |
Continuous batching throughput (Qwen3-4B Q4_K_M)
max_tokens: 32, prompt: "short"
──────────────────────────────────────────────────────────────────────────────
Concurrency Wall time Total tok/s Per-req tok/s Speedup Avg batch
1 318ms 100.6 100.6 1.00x 1.1
2 440ms 145.5 72.7 1.45x 2.2
4 824ms 155.3 38.8 1.54x 4.5Run benchmarks yourself:
MIX_ENV=bench mix deps.get
LLAMA_MODEL_PATH=path/to/model.gguf MIX_ENV=bench mix run bench/single_generate.exs
LLAMA_MODEL_PATH=path/to/model.gguf MIX_ENV=bench mix run bench/server_concurrent.exsRunning Qwen3.5-35B-A3B
Qwen3.5-35B-A3B is a Mixture-of-Experts model with 35B total parameters but only 3B active per token. It supports 256K context and both thinking (CoT) and non-thinking modes.
Hardware requirements
| Quantization | RAM / VRAM | File size |
|---|---|---|
| Q4_K_M | ~20 GB | ~19 GB |
| Q8_0 | ~37 GB | ~36 GB |
| BF16 | ~70 GB | ~67 GB |
Download
# Install the HuggingFace CLI if needed: pip install huggingface-hub
huggingface-cli download Qwen/Qwen3.5-35B-A3B-GGUF Qwen3.5-35B-A3B-Q4_K_M.gguf --local-dir models/Thinking mode (general)
:ok = LlamaCppEx.init()
{:ok, model} = LlamaCppEx.load_model("models/Qwen3.5-35B-A3B-Q4_K_M.gguf", n_gpu_layers: -1)
# Qwen3.5 recommended: temp 1.0, top_p 0.95, top_k 20, presence_penalty 1.5
{:ok, reply} = LlamaCppEx.chat(model, [
%{role: "user", content: "Explain the birthday paradox."}
], max_tokens: 2048, temp: 1.0, top_p: 0.95, top_k: 20, min_p: 0.0, penalty_present: 1.5)Thinking mode (math/code)
# For math and code, lower temperature without presence penalty
{:ok, reply} = LlamaCppEx.chat(model, [
%{role: "user", content: "Write a function to find the longest palindromic substring."}
], max_tokens: 4096, temp: 0.6, top_p: 0.95, top_k: 20, min_p: 0.0)Non-thinking mode
# Disable thinking via enable_thinking option (uses Jinja chat template kwargs)
{:ok, reply} = LlamaCppEx.chat(model, [
%{role: "user", content: "What is the capital of France?"}
], max_tokens: 256, enable_thinking: false, temp: 0.7, top_p: 0.8, top_k: 20, min_p: 0.0, penalty_present: 1.5)Streaming with Server
{:ok, server} = LlamaCppEx.Server.start_link(
model_path: "models/Qwen3.5-35B-A3B-Q4_K_M.gguf",
n_gpu_layers: -1,
n_parallel: 2,
n_ctx: 16384,
temp: 1.0, top_p: 0.95, top_k: 20, min_p: 0.0, penalty_present: 1.5
)
LlamaCppEx.Server.stream(server, "Explain monads in simple terms", max_tokens: 1024)
|> Enum.each(&IO.write/1)Qwen3.5 enable_thinking benchmarks
Measured on MacBook Pro, Apple M4 Max (16-core, 64 GB), Metal backend, n_gpu_layers: -1, 512 output tokens, temp: 0.6.
| Metric | Qwen3.5-27B (Q4_K_XL) | Qwen3.5-35B-A3B (Q6_K) |
|---|---|---|
| Think ON / Think OFF | Think ON / Think OFF | |
| Prompt tokens | 65 / 66 | 65 / 66 |
| Output tokens | 512 / 512 | 512 / 512 |
| TTFT | 599 ms / 573 ms | 554 ms / 191 ms |
| Prompt eval | 108.5 / 115.2 t/s | 117.3 / 345.5 t/s |
| Gen speed | 17.5 / 17.3 t/s | 56.0 / 56.0 t/s |
| Total time | 29.77 / 30.10 s | 9.69 / 9.33 s |
The MoE model (35B-A3B) is ~3.2x faster at generation since only 3B parameters are active per token despite the 35B total. Thinking mode only affects the prompt template, not inference speed.
Examples
The examples/ directory contains runnable scripts demonstrating key features:
# Basic text generation
LLAMA_MODEL_PATH=/path/to/model.gguf mix run examples/basic_generation.exs
# Streaming tokens to terminal
LLAMA_MODEL_PATH=/path/to/model.gguf mix run examples/streaming.exs
# Interactive multi-turn chat
LLAMA_MODEL_PATH=/path/to/model.gguf mix run examples/chat.exs
# JSON Schema constrained generation + Ecto integration
LLAMA_MODEL_PATH=/path/to/model.gguf mix run examples/structured_output.exs
# Embedding generation and cosine similarity
LLAMA_EMBEDDING_MODEL_PATH=/path/to/embedding-model.gguf mix run examples/embeddings.exs
# Continuous batching server with concurrent requests
LLAMA_MODEL_PATH=/path/to/model.gguf mix run examples/server.exsArchitecture
Elixir API (lib/)
│
LlamaCppEx.NIF (@on_load, stubs)
│
C++ NIF layer (c_src/) — fine.hpp for RAII + type encoding
│
llama.cpp static libs (vendor/llama.cpp, built via CMake)
│
Hardware (CPU / Metal / CUDA / Vulkan)License
Apache License 2.0 — see LICENSE.
llama.cpp is licensed under the MIT License.