erllama

erllama is a native Erlang/OTP wrapper around llama.cpp with a token-exact, multi-tier, supervised KV cache. It turns a multi-second prefill into a millisecond restore, and lets you keep more warm state than fits in RAM by demoting cold-but-popular prefixes to the disk tier.

If you have ever waited five seconds for a chat assistant to acknowledge "hello" — that's prompt prefill. erllama caches the work so the second turn, the third turn, and every subsequent agent sharing the same system prompt skip it.

A 30-second taste

1> {ok, _} = application:ensure_all_started(erllama).
2> Path = "/srv/models/tinyllama-1.1b-chat.Q4_K_M.gguf".
3> {ok, Bin} = file:read_file(Path).
4> {ok, M} = erllama:load_model(#{
       backend     => erllama_model_llama,
       model_path  => Path,
       fingerprint => crypto:hash(sha256, Bin)
   }).
{ok, <<"erllama_model_2375">>}

5> {ok, Reply, _} = erllama:complete(M, <<"Once upon a time">>).
%% ~3 s on a CPU box. Prompt prefill, async cold save fired.

6> {ok, Reply2, _} = erllama:complete(M, <<"Once upon a time">>).
%% ~10 ms. Cache hit; KV state restored, one decode for fresh logits.

7> {ok, _, _} = erllama:complete(M, <<"Once upon a time, in a quiet village">>).
%% ~50 ms. Longest-prefix walk found the cached row even though
%% the new prompt is longer.

load_model/1 returns a binary model_id that is also the registered name for the underlying gen_statem. Pass it to complete/2,3, unload/1, etc.
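
For example (a minimal sketch; the empty map stands in for per-call options like the parent_key shown later):

{ok, _Reply, _Tokens} = erllama:complete(M, <<"hello">>, #{}),
ok = erllama:unload(M).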

That is the whole pitch. The cache is on by default, runs under its own supervisor, and never returns approximate matches.

What you get

Installation

erllama targets Erlang/OTP 28 with rebar3 3.25+.

Add to rebar.config:

{deps, [
    {erllama, "~> 0.1"}
]}.

Then in your supervision tree, wait for the application to start before loading models:

ok = application:ensure_started(erllama).
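
A minimal sketch of doing this from your own application callback (my_app and my_sup are placeholder names for your own modules):

-module(my_app).
-behaviour(application).
-export([start/2, stop/1]).

%% Start erllama and its dependencies before the local supervisor,
%% so children that call erllama:load_model/1 find it running.
start(_Type, _Args) ->
    {ok, _} = application:ensure_all_started(erllama),
    my_sup:start_link().

stop(_State) ->
    ok.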

The first compile builds vendored llama.cpp (~3 minutes on a fast machine). Subsequent builds are cached. See requirements for the toolchain.

Documentation

Guide            What it covers
Loading a model  Every option to erllama:load_model/1,2, with examples and pitfalls.
Caching          Tiers, save reasons, lookup paths, watermarks. The operator's manual.
Configuration    Full sys.config and per-model option reference.
Building         Platform-specific build notes (Linux, macOS, FreeBSD), CUDA/Metal toggles, common build issues.
Examples         Drop-in patterns for one-shot completion, stateless HTTP servers, multi-turn sessions, concurrent agents, cache inspection.

For the API reference (erllama, erllama_cache, erllama_scheduler, erllama_nif), see the generated module docs on HexDocs or run rebar3 ex_doc locally.

For the design rationale behind the cache:

Many models in one BEAM

Each loaded model is its own supervised gen_statem under erllama_model_sup. The cache is process-wide and segregates rows by fingerprint, so the only thing two models share is the byte budget.

{ok, _} = erllama:load_model(<<"tiny">>, TinyConfig).
{ok, _} = erllama:load_model(<<"big">>,  BigConfig).

{ok, R1, _} = erllama:complete(<<"tiny">>, <<"summarise: ...">>).
{ok, R2, _} = erllama:complete(<<"big">>,  <<"deep analysis of: ...">>).

ok = erllama:unload(<<"tiny">>).
Capability                        How
N models in one BEAM              load_model/2 per binary id; each is one gen_statem
No cross-model collisions         Cache key includes the model fingerprint
Hot-swap a model                  unload/1 then load_model/2; the cache survives (sketch below)
Per-model policy                  policy => #{...} on the load; merges over app-env defaults
Per-model tier                    tier_srv => MyDisk, tier => disk per model
Shared-prefix hits across agents  Longest-prefix walk on every cold prompt
Concurrent saves bounded          Single writer pool with a leak-proof semaphore
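
Hot-swapping, per the table row above, is just the two calls in sequence; a sketch where BigConfig2 stands in for the replacement configuration:

%% The id stays stable for callers; rows cached under the old
%% fingerprint remain in their tiers until evicted.
ok = erllama:unload(<<"big">>),
{ok, _} = erllama:load_model(<<"big">>, BigConfig2).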

Tested end-to-end in test/erllama_SUITE.erl:concurrent_complete_under_writer_cap — four models with distinct fingerprints running parallel completions under one writer cap.

A slightly longer example

A real load with all the cache parameters. The disk tier requires a running erllama_cache_disk_srv started by the operator; the RAM tier (erllama_cache_ram) starts automatically with the application.

{ok, _} = erllama_cache_disk_srv:start_link(my_disk, "/var/lib/erllama/kvc"),
{ok, Bin} = file:read_file("/srv/models/llama-3.1-8b.Q4_K_M.gguf"),
Fp = crypto:hash(sha256, Bin),
CtxHash = crypto:hash(sha256, term_to_binary({8192, 4096})),

{ok, M} = erllama:load_model(#{
    backend          => erllama_model_llama,
    model_path       => "/srv/models/llama-3.1-8b.Q4_K_M.gguf",
    model_opts       => #{n_gpu_layers => 99},
    context_opts     => #{n_ctx => 8192, n_batch => 4096},
    fingerprint      => Fp,
    fingerprint_mode => safe,
    quant_type       => q4_k_m,
    quant_bits       => 4,
    ctx_params_hash  => CtxHash,
    context_size     => 8192,
    tier_srv         => my_disk,
    tier             => disk,
    policy           => #{
        boundary_trim_tokens   => 32,
        boundary_align_tokens  => 256,
        session_resume_wait_ms => 500
    }
}).

Stateless OpenAI/Anthropic-shaped server:

handle_completion(ModelId, Prompt) ->
    {ok, Reply, _Tokens} = erllama:complete(ModelId, Prompt),
    Reply.

No parent_key. The cache walks the new prompt backward by the configured stride and finds the longest cached prefix. If the new prompt is yesterday's conversation plus one fresh turn, the walk hits.
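
A rough sketch of the effect, reusing the handler above (prompts are illustrative):

P1 = <<"User: hello\nAssistant:">>,
_R1 = handle_completion(ModelId, P1),   %% cold: full prefill, cold save fired
%% The client resends the whole transcript plus one fresh turn.
P2 = <<"User: hello\nAssistant: hi there.\nUser: and in French?\nAssistant:">>,
_R2 = handle_completion(ModelId, P2).   %% warm: the shared head is already cached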

Stateful Erlang-native multi-turn: the session layer threads parent_key between turns. The previous turn's finish-save key is the parent of the next call. It is held by the calling session process, not retrieved from the cache.

%% First turn: cold prefill. The model fires an async finish save
%% whose key is sha256(fingerprint || quant || ctx_params || tokens).
{ok, R1, Tokens1} = erllama:complete(M, Prompt1),
ParentKey = erllama_cache_key:make(#{
    fingerprint => Fp,
    quant_type  => q4_k_m,
    ctx_params_hash => CtxHash,
    tokens      => Tokens1
}),

%% Second turn: pass ParentKey to skip the longest-prefix walk.
{ok, R2, _} = erllama:complete(M, Prompt2, #{parent_key => ParentKey}).

Inspect cache state from a shell:

1> erllama_cache:get_counters().
#{hits_exact => 142, hits_resume => 17, hits_longest_prefix => 89,
  misses => 12, saves_cold => 12, saves_continued => 67,
  saves_finish => 31, evictions => 3, ...}

2> erllama_cache_meta_srv:dump().
%% List of raw ETS rows:
%%   {Key, Tier, Size, LastUsedNs, Refcount, Status, HeaderBin,
%%    Location, TokensRef, Hits}
[{<<_:256>>, disk, 8388608, 1737..., 0, available, _, _, _, 4}, ...]

Requirements

Architecture at a glance

erllama_sup
├── erllama_cache_sup
│   ├── erllama_cache_meta_srv      sole writer; meta + LRU + reservations
│   ├── erllama_cache_ram           RAM tier (ETS slabs)
│   ├── erllama_cache_ramfile_srv   ram_file tier
│   ├── erllama_cache_disk_srv      disk tier (plain read/write I/O)
│   └── erllama_cache_writer        writer pool, leak-proof semaphore
├── erllama_model_sup               simple_one_for_one for dynamic models
└── erllama_scheduler               memory-pressure poller (off by default)

Inside a request:

  1. erllama:complete/2 enters the per-model gen_statem.
  2. prefilling — tokenize, then either hit the cache and kv_unpack (warm) or run llama_decode over the prompt (cold). Cold path fires an async cold save at the trimmed-prefix boundary.
  3. generating — token-by-token greedy llama_decode. Every continued_interval tokens, fire an async continued save.
  4. idle — fire an async finish save for the full prompt + reply. The KV state becomes evictable.
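
The three save reasons show up in the counters from the shell example earlier; a rough sketch (saves are asynchronous, so the deltas may trail the call briefly):

Before = erllama_cache:get_counters(),
{ok, _Reply, _Tokens} = erllama:complete(M, <<"Summarise this report: ...">>),
After = erllama_cache:get_counters(),
%% Cold run: saves_cold should grow by one, saves_continued by zero or
%% more, and saves_finish by one once the model is back in idle.
[{K, maps:get(K, After) - maps:get(K, Before)}
 || K <- [saves_cold, saves_continued, saves_finish]].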

For the publish protocol, the reservation state machine, and the exception-safe NIF wrappers, see internals/publish-protocol.md and internals/nif-safety.md.

Status

Pre-release. Core cache, scheduler, and NIF: 166 EUnit + 11 PropEr + 7 stub Common Test cases. End-to-end CT suite gated on LLAMA_TEST_MODEL (6 cases, passing locally with TinyLlama 1.1B Q4_K_M).

See CHANGELOG.md for the release notes.

Contributing

The contributor guide is AGENTS.md. The short version:

rebar3 fmt          # auto-format (always run first)
rebar3 compile      # warnings_as_errors
rebar3 eunit        # unit tests
rebar3 proper       # property tests
rebar3 ct           # Common Test (without a real model)
rebar3 lint         # Elvis
rebar3 dialyzer     # static analysis
rebar3 xref         # cross-reference

End-to-end against a real GGUF:

LLAMA_TEST_MODEL=/path/to/tinyllama-1.1b-chat.gguf \
    rebar3 ct --suite=test/erllama_real_model_SUITE

Bumping the vendored llama.cpp: see UPDATE_LLAMA.md.

Coming next: erllama_cluster

A separate OTP application is in development to coordinate a fleet of erllama nodes into a single inference cluster. Each node continues to run erllama as a standalone library — local model loading, local KV cache, local inference. The cluster layer sits on top and decides which node serves which request.

Three distribution strategies, all in v1:

Transport is QUIC: Erlang distribution carried over erlang_quic, a pure Erlang QUIC implementation with no C NIF in the protocol path. Circuit breakers are kept per {Node, ModelId} and driven by nodeup/nodedown rather than application-level pings. A globally registered scheduler handles cluster-wide GPU budgeting and on-demand model placement, with local fallback schedulers elected by pg quorum on network partition.

Repository: https://github.com/erllama/erllama_cluster (under construction).

Acknowledgements

Same idea as antirez/ds4.

License

MIT. Copyright (c) 2026 Benoit Chesneau. See LICENSE.

The vendored c_src/llama.cpp/ retains its upstream MIT license; see c_src/llama.cpp/LICENSE.