erllama

erllama is a native Erlang/OTP wrapper around llama.cpp with a token-exact, multi-tier, supervised KV cache. It turns a multi-second prefill into a millisecond restore, and lets you keep more warm state than fits in RAM by demoting cold-but-popular prefixes to the disk tier.

If you have ever waited five seconds for a chat assistant to acknowledge "hello" — that's prompt prefill. erllama caches the work so the second turn, the third turn, and every subsequent agent sharing the same system prompt skip it.

A 30-second taste

1> {ok, _} = application:ensure_all_started(erllama).
2> Path = "/srv/models/tinyllama-1.1b-chat.Q4_K_M.gguf".
3> {ok, Bin} = file:read_file(Path).
4> {ok, M} = erllama:load_model(#{
       backend     => erllama_model_llama,
       model_path  => Path,
       fingerprint => crypto:hash(sha256, Bin)
   }).
{ok, <<"erllama_model_2375">>}

5> {ok, Reply, _} = erllama:complete(M, <<"Once upon a time">>).
%% ~3 s on a CPU box. Prompt prefill, async cold save fired.

6> {ok, Reply2, _} = erllama:complete(M, <<"Once upon a time">>).
%% ~10 ms. Cache hit; KV state restored, one decode for fresh logits.

7> {ok, _, _} = erllama:complete(M, <<"Once upon a time, in a quiet village">>).
%% ~50 ms. Longest-prefix walk found the cached row even though
%% the new prompt is longer.

load_model/1 returns a binary model_id that is also the registered name for the underlying gen_statem. Pass it to complete/2,3, unload/1, etc.
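
For example (a minimal sketch; the empty map stands in for per-call options like the parent_key shown later):

{ok, _Reply, _Tokens} = erllama:complete(M, <<"hello">>, #{}),
ok = erllama:unload(M).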

That is the whole pitch. The cache is on by default, runs under its own supervisor, and never returns approximate matches.

What you get

Installation

erllama targets Erlang/OTP 28 with rebar3 3.25+.

Add to rebar.config:

{deps, [
    {erllama, "~> 0.1"}
]}.

Then in your supervision tree, wait for the application to start before loading models:

ok = application:ensure_started(erllama).
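
A minimal sketch of doing this from your own application callback (my_app and my_sup are placeholder names for your own modules):

-module(my_app).
-behaviour(application).
-export([start/2, stop/1]).

%% Start erllama and its dependencies before the local supervisor,
%% so children that call erllama:load_model/1 find it running.
start(_Type, _Args) ->
    {ok, _} = application:ensure_all_started(erllama),
    my_sup:start_link().

stop(_State) ->
    ok.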

The first compile builds vendored llama.cpp (~3 minutes on a fast machine). Subsequent builds are cached. See requirements for the toolchain.

Documentation

Guide            What it covers
Loading a model  Every option to erllama:load_model/1,2, with examples and pitfalls.
Caching          Tiers, save reasons, lookup paths, watermarks. The operator's manual.
Configuration    Full sys.config and per-model option reference.
Building         Platform-specific build notes (Linux, macOS, FreeBSD), CUDA/Metal toggles, common build issues.
Examples         Drop-in patterns for one-shot completion, stateless HTTP servers, multi-turn sessions, concurrent agents, cache inspection.

For the API reference (erllama, erllama_cache, erllama_scheduler, erllama_nif), see the generated module docs on HexDocs or run rebar3 ex_doc locally.

For the design rationale behind the cache:

Many models in one BEAM

Each loaded model is its own supervised gen_statem under erllama_model_sup. The cache is process-wide and segregates rows by fingerprint, so the only thing two models share is the byte budget.

{ok, _} = erllama:load_model(<<"tiny">>, TinyConfig).
{ok, _} = erllama:load_model(<<"big">>,  BigConfig).

{ok, R1, _} = erllama:complete(<<"tiny">>, <<"summarise: ...">>).
{ok, R2, _} = erllama:complete(<<"big">>,  <<"deep analysis of: ...">>).

ok = erllama:unload(<<"tiny">>).
Capability                        How
N models in one BEAM              load_model/2 per binary id; each is one gen_statem
No cross-model collisions         Cache key includes the model fingerprint
Hot-swap a model                  unload/1 then load_model/2; the cache survives (sketch below)
Per-model policy                  policy => #{...} on the load; merges over app-env defaults
Per-model tier                    tier_srv => MyDisk, tier => disk per model
Shared-prefix hits across agents  Longest-prefix walk on every cold prompt
Concurrent saves bounded          Single writer pool with a leak-proof semaphore
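
Hot-swapping, per the table row above, is just the two calls in sequence; a sketch where BigConfig2 stands in for the replacement configuration:

%% The id stays stable for callers; rows cached under the old
%% fingerprint remain in their tiers until evicted.
ok = erllama:unload(<<"big">>),
{ok, _} = erllama:load_model(<<"big">>, BigConfig2).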

Tested end-to-end in test/erllama_SUITE.erl:concurrent_complete_under_writer_cap — four models with distinct fingerprints running parallel completions under one writer cap.

A slightly longer example

A real load with all the cache parameters. The disk tier requires a running erllama_cache_disk_srv started by the operator; the RAM tier (erllama_cache_ram) starts automatically with the application.

{ok, _} = erllama_cache_disk_srv:start_link(my_disk, "/var/lib/erllama/kvc"),
{ok, Bin} = file:read_file("/srv/models/llama-3.1-8b.Q4_K_M.gguf"),
Fp = crypto:hash(sha256, Bin),
CtxHash = crypto:hash(sha256, term_to_binary({8192, 4096})),

{ok, M} = erllama:load_model(#{
    backend          => erllama_model_llama,
    model_path       => "/srv/models/llama-3.1-8b.Q4_K_M.gguf",
    model_opts       => #{n_gpu_layers => 99},
    context_opts     => #{n_ctx => 8192, n_batch => 4096},
    fingerprint      => Fp,
    fingerprint_mode => safe,
    quant_type       => q4_k_m,
    quant_bits       => 4,
    ctx_params_hash  => CtxHash,
    context_size     => 8192,
    tier_srv         => my_disk,
    tier             => disk,
    policy           => #{
        boundary_trim_tokens   => 32,
        boundary_align_tokens  => 256,
        session_resume_wait_ms => 500
    }
}).

Stateless OpenAI/Anthropic-shaped server:

handle_completion(ModelId, Prompt) ->
    {ok, Reply, _Tokens} = erllama:complete(ModelId, Prompt),
    Reply.

No parent_key. The cache walks the new prompt backward by the configured stride and finds the longest cached prefix. If the new prompt is yesterday's conversation plus one fresh turn, the walk hits.
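
A rough sketch of the effect, reusing the handler above (prompts are illustrative):

P1 = <<"User: hello\nAssistant:">>,
_R1 = handle_completion(ModelId, P1),   %% cold: full prefill, cold save fired
%% The client resends the whole transcript plus one fresh turn.
P2 = <<"User: hello\nAssistant: hi there.\nUser: and in French?\nAssistant:">>,
_R2 = handle_completion(ModelId, P2).   %% warm: the shared head is already cached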

Stateful Erlang-native multi-turn: the session layer threads parent_key between turns. The previous turn's finish-save key is the parent of the next call. It is held by the calling session process, not retrieved from the cache.

%% First turn: cold prefill. The model fires an async finish save
%% whose key is sha256(fingerprint || quant || ctx_params || tokens).
{ok, R1, Tokens1} = erllama:complete(M, Prompt1),
ParentKey = erllama_cache_key:make(#{
    fingerprint => Fp,
    quant_type  => q4_k_m,
    ctx_params_hash => CtxHash,
    tokens      => Tokens1
}),

%% Second turn: pass ParentKey to skip the longest-prefix walk.
{ok, R2, _} = erllama:complete(M, Prompt2, #{parent_key => ParentKey}).

Inspect cache state from a shell:

1> erllama_cache:get_counters().
#{hits_exact => 142, hits_resume => 17, hits_longest_prefix => 89,
  misses => 12, saves_cold => 12, saves_continued => 67,
  saves_finish => 31, evictions => 3, ...}

2> erllama_cache_meta_srv:dump().
%% List of raw ETS rows:
%%   {Key, Tier, Size, LastUsedNs, Refcount, Status, HeaderBin,
%%    Location, TokensRef, Hits}
[{<<_:256>>, disk, 8388608, 1737..., 0, available, _, _, _, 4}, ...]

Requirements

Architecture at a glance

erllama_sup
├── erllama_cache_sup
│   ├── erllama_cache_meta_srv      sole writer; meta + LRU + reservations
│   ├── erllama_cache_ram           RAM tier (ETS slabs)
│   ├── erllama_cache_ramfile_srv   ram_file tier
│   ├── erllama_cache_disk_srv      disk tier (plain read/write I/O)
│   └── erllama_cache_writer        writer pool, leak-proof semaphore
├── erllama_model_sup               simple_one_for_one for dynamic models
└── erllama_scheduler               memory-pressure poller (off by default)

Inside a request:

  1. erllama:complete/2 enters the per-model gen_statem.
  2. prefilling — tokenize, then either hit the cache and kv_unpack (warm) or run llama_decode over the prompt (cold). Cold path fires an async cold save at the trimmed-prefix boundary.
  3. generating — token-by-token greedy llama_decode. Every continued_interval tokens, fire an async continued save.
  4. idle — fire an async finish save for the full prompt + reply. The KV state becomes evictable.
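
The three save reasons show up in the counters from the shell example earlier; a rough sketch (saves are asynchronous, so the deltas may trail the call briefly):

Before = erllama_cache:get_counters(),
{ok, _Reply, _Tokens} = erllama:complete(M, <<"Summarise this report: ...">>),
After = erllama_cache:get_counters(),
%% Cold run: saves_cold should grow by one, saves_continued by zero or
%% more, and saves_finish by one once the model is back in idle.
[{K, maps:get(K, After) - maps:get(K, Before)}
 || K <- [saves_cold, saves_continued, saves_finish]].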

For the publish protocol, the reservation state machine, and the exception-safe NIF wrappers, see internals/publish-protocol.md and internals/nif-safety.md.

Status

Pre-release. Core cache, scheduler, and NIF: 166 EUnit + 11 PropEr + 7 stub Common Test cases. End-to-end CT suite gated on LLAMA_TEST_MODEL (6 cases, passing locally with TinyLlama 1.1B Q4_K_M).

See CHANGELOG.md for the release notes.

Contributing

The contributor guide is AGENTS.md. The short version:

rebar3 fmt          # auto-format (always run first)
rebar3 compile      # warnings_as_errors
rebar3 eunit        # unit tests
rebar3 proper       # property tests
rebar3 ct           # Common Test (without a real model)
rebar3 lint         # Elvis
rebar3 dialyzer     # static analysis
rebar3 xref         # cross-reference

End-to-end against a real GGUF:

LLAMA_TEST_MODEL=/path/to/tinyllama-1.1b-chat.gguf \
    rebar3 ct --suite=test/erllama_real_model_SUITE

Bumping the vendored llama.cpp: see UPDATE_LLAMA.md.

Coming next: erllama_cluster

A separate OTP application is in development to coordinate a fleet of erllama nodes into a single inference cluster. Each node continues to run erllama as a standalone library — local model loading, local KV cache, local inference. The cluster layer sits on top and decides which node serves which request.

Three distribution strategies, all in v1:

Transport is QUIC: Erlang distribution carried over erlang_quic, a pure Erlang QUIC implementation with no C NIF in the protocol path. Circuit breakers are kept per {Node, ModelId} and driven by nodeup/nodedown rather than application-level pings. A globally registered scheduler handles cluster-wide GPU budgeting and on-demand model placement, with local fallback schedulers elected by pg quorum on network partition.

Repository: https://github.com/erllama/erllama_cluster (under construction).

Acknowledgements

Same idea as antirez/ds4.

License

MIT. Copyright (c) 2026 Benoit Chesneau. See LICENSE.

The vendored c_src/llama.cpp/ retains its upstream MIT license; see c_src/llama.cpp/LICENSE.