erllama

Run llama.cpp from Erlang. Keep prompts warm. Stay inside OTP.

erllama is a native Erlang/OTP runtime for llama.cpp with supervised model processes, OpenAI-shaped completion APIs, and a token-exact KV cache that turns repeated prompt prefill from seconds into milliseconds.

If your app sends the same system prompt, agent scaffold, or conversation prefix again and again, erllama saves the model state once and restores it on the next request. No fuzzy matching. No hidden session server. Just exact tokens, exact cache keys, and OTP supervision around the whole path.

Why erllama?

Quick taste

1> {ok, _} = application:ensure_all_started(erllama).
2> Path = "/srv/models/tinyllama-1.1b-chat.Q4_K_M.gguf".
3> {ok, Bin} = file:read_file(Path).
4> {ok, Model} = erllama:load_model(#{
       backend => erllama_model_llama,
       model_path => Path,
       fingerprint => crypto:hash(sha256, Bin)
   }).
{ok, <<"erllama_model_2375">>}
5> {ok, #{reply := Reply, finish_key := Key}} =
       erllama:complete(Model, <<"Once upon a time">>).
%% First call: cold prefill, async save.
6> {ok, #{reply := Reply2}} =
       erllama:complete(Model, <<"Once upon a time">>).
%% Same prompt: KV cache restore.
7> {ok, #{reply := Reply3}} =
       erllama:complete(Model,
                        <<"Once upon a time, in a quiet village">>).
%% Longer prompt: longest cached prefix wins.
8> {ok, #{reply := Reply4}} =
       erllama:complete(Model, <<"and they lived happily ever after">>,
                        #{parent_key => Key}).
%% Stateful resume from the previous finish save.
%% Stateful resume from the previous finish save.

load_model/1 returns a binary model id. Pass it to complete/2,3, infer/4, tokenize/2, unload/1, and the rest of the public API.
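
The Quick taste computes the fingerprint with file:read_file/1, which holds the whole GGUF in memory at once. For multi-gigabyte models a chunked hash may be preferable. The module below is a minimal sketch of that idea and is not part of erllama: the module name, function name, and chunk size are all hypothetical. It assumes the fingerprint only needs to be the same SHA-256 digest that crypto:hash(sha256, Bin) produces.

```erlang
-module(fingerprint_sketch).
-export([sha256_file/1]).

%% Hypothetical helper, not part of erllama.
%% Hashes a file in 1 MiB chunks so the model never sits in memory whole.
-define(CHUNK, 1048576).

-spec sha256_file(file:name_all()) -> {ok, binary()} | {error, term()}.
sha256_file(Path) ->
    case file:open(Path, [read, raw, binary]) of
        {ok, Fd} ->
            try
                hash_loop(Fd, crypto:hash_init(sha256))
            after
                file:close(Fd)
            end;
        {error, _} = Err ->
            Err
    end.

%% Feed each chunk into the running hash context until end of file.
hash_loop(Fd, Ctx) ->
    case file:read(Fd, ?CHUNK) of
        {ok, Data} -> hash_loop(Fd, crypto:hash_update(Ctx, Data));
        eof -> {ok, crypto:hash_final(Ctx)};
        {error, _} = Err -> Err
    end.
```

With this helper, {ok, Fp} = fingerprint_sketch:sha256_file(Path) yields the same 32-byte binary as the read_file version, and Fp can be passed as the fingerprint value.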

Install

erllama targets Erlang/OTP 28 and rebar3 3.25+.

Add it to rebar.config:

{deps, [
    {erllama, "~> 0.5"}
]}.

Then start the application before loading models:

{ok, _} = application:ensure_all_started(erllama).

The first compile builds the vendored llama.cpp. See Building for platform notes and CUDA/Metal options.

Common patterns

Stateless HTTP completion

OpenAI/Anthropic-shaped servers usually resend the whole conversation on each turn. That is fine. erllama walks the prompt backward and restores the longest exact prefix it has already saved.

handle_completion(ModelId, Prompt) ->
    {ok, #{reply := Reply}} =
        erllama:complete(ModelId, Prompt, #{response_tokens => 256}),
    Reply.
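
To make the backward walk concrete, here is a toy model of it. This is only an illustration under simplifying assumptions: the cache is a plain map keyed by exact token lists, whereas erllama's real keys and storage tiers differ. The module and function names are hypothetical.

```erlang
-module(prefix_sketch).
-export([longest_prefix/2]).

%% Illustrative only, not erllama's implementation.
%% Cache maps an exact token list to whatever state was saved for it.
-spec longest_prefix([integer()], #{[integer()] => term()}) ->
    {hit, pos_integer(), term()} | miss.
longest_prefix(Tokens, Cache) ->
    walk(length(Tokens), Tokens, Cache).

%% Try the full prompt first, then drop one trailing token at a time.
walk(0, _Tokens, _Cache) ->
    miss;
walk(N, Tokens, Cache) ->
    Prefix = lists:sublist(Tokens, N),
    case maps:find(Prefix, Cache) of
        {ok, State} -> {hit, N, State};
        error -> walk(N - 1, Tokens, Cache)
    end.
```

A prefix only hits when every token matches, so a prompt that differs in its first token misses even if the rest is identical. The sketch rebuilds each prefix from scratch, which is quadratic in prompt length. That is fine for illustration but not something to copy into a hot path.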

Stateful Erlang session

If your session process already tracks turns, keep the returned finish_key and pass it as parent_key on the next request. That skips the longest-prefix walk and resumes directly from the saved row.

{ok, #{reply := R1, finish_key := K1}} =
    erllama:complete(ModelId, Prompt1),
{ok, #{reply := R2, finish_key := K2}} =
    erllama:complete(ModelId, Prompt2, #{parent_key => K1}).
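
A session process can thread the key through its own state. The following is a minimal sketch, not an erllama module: session_sketch, new/1, and turn/2 are hypothetical names. The completion call is injected as a fun so the example runs without a model. It also assumes that complete/3 accepts an empty options map on the first turn, which this README does not confirm.

```erlang
-module(session_sketch).
-export([new/1, turn/2]).

%% Hypothetical session wrapper, not part of erllama.
%% CompleteFun is expected to behave like
%%   fun(Prompt, Opts) -> erllama:complete(ModelId, Prompt, Opts) end
%% and to return {ok, #{reply := _, finish_key := _}} on success.
new(CompleteFun) when is_function(CompleteFun, 2) ->
    #{complete => CompleteFun, parent_key => undefined}.

%% Run one turn, passing the previous finish key as parent_key if there is one.
turn(Prompt, #{complete := Complete, parent_key := Parent} = Session) ->
    Opts =
        case Parent of
            undefined -> #{};
            _ -> #{parent_key => Parent}
        end,
    case Complete(Prompt, Opts) of
        {ok, #{reply := Reply, finish_key := Key}} ->
            {ok, Reply, Session#{parent_key := Key}};
        Other ->
            %% Keep the old key so a failed turn does not lose the session.
            {error, Other, Session}
    end.
```

In a real application the fun would close over the model id, and the returned session map would live in a gen_server or gen_statem state.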

Many models in one BEAM

Each loaded model is its own supervised process. The cache is shared, but rows are segregated by model fingerprint.

{ok, _} = erllama:load_model(<<"tiny">>, TinyConfig),
{ok, _} = erllama:load_model(<<"big">>, BigConfig),
{ok, #{reply := R1}} = erllama:complete(<<"tiny">>, <<"summarise: ...">>),
{ok, #{reply := R2}} = erllama:complete(<<"big">>, <<"deep analysis: ...">>),
ok = erllama:unload(<<"tiny">>).

Inspect live state

1> erllama_cache:get_counters().
#{hits_exact => 142, hits_resume => 17, hits_longest_prefix => 89,
misses => 12, saves_cold => 12, saves_finish => 31, ...}
2> erllama:phase(<<"big">>).
generating
3> erllama:pending_len(<<"big">>).
3
4> erllama:last_cache_hit(<<"big">>).
#{kind => partial, prefix_len => 1024}
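
The counters map is easy to fold into a single health number. The helper below is a hypothetical sketch, not an erllama function. It uses only the key names visible in the output above and assumes that total lookups equal the three hit counters plus misses. The elided part of the map may hold counters that change that picture.

```erlang
-module(counters_sketch).
-export([hit_rate/1]).

%% Hypothetical helper, not part of erllama.
%% Returns hits / (hits + misses), or undefined when nothing was looked up.
-spec hit_rate(map()) -> float() | undefined.
hit_rate(Counters) ->
    HitKeys = [hits_exact, hits_resume, hits_longest_prefix],
    %% Missing keys count as zero so partial maps still work.
    Hits = lists:sum([maps:get(K, Counters, 0) || K <- HitKeys]),
    Total = Hits + maps:get(misses, Counters, 0),
    case Total of
        0 -> undefined;
        _ -> Hits / Total
    end.
```

For the sample output above, that is 248 hits out of 260 lookups, or roughly 0.95.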

Documentation

| Need | Read |
| --- | --- |
| Load a model | Loading a model |
| Configure cache tiers and save policy | Caching |
| Configure sys.config and per-model options | Configuration |
| Build from source | Building |
| Copy working snippets | Examples |
| Stream tool calls while preserving cache hits | Tool calls |
| Understand cache design tradeoffs | Cache design |
| Understand crash-safe save publication | Publish protocol |
| Understand request admission and decode flow | Request lifecycle |
| Understand NIF lifetime safety | NIF safety |

API reference for erllama, erllama_cache, erllama_scheduler, and erllama_nif is published on HexDocs. You can also build it locally:

rebar3 ex_doc

Architecture

erllama_sup
├── erllama_cache_sup
│   ├── erllama_cache_meta_srv
│   ├── erllama_cache_ram
│   └── erllama_cache_writer
├── erllama_registry
├── erllama_inflight
├── erllama_model_sup
│   └── erllama_model        one supervised gen_statem per loaded model
└── erllama_scheduler        memory-pressure poller, off by default

Disk and ram_file tier servers do not appear in this tree. The operator starts them, one per root directory, and loaded models then reference them through tier_srv and tier.

The important invariant is simple: cache hits are token-exact. A key is derived from the model fingerprint, quantization, context shape, and full token list. erllama may find a shorter saved prefix for a longer prompt, but it never returns an approximate match.
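
One way to picture a token-exact key is a hash over a length-prefixed encoding of those four inputs. The sketch below is an assumption-laden illustration, not erllama's actual key layout. The module name, the {NCtx, NBatch} context shape, the binary quantization tag, and the 32-bit little-endian token encoding are all hypothetical choices made for the example.

```erlang
-module(cache_key_sketch).
-export([key/4]).

%% Illustrative only: erllama's real key derivation is not shown in this README.
%% Every field is length-prefixed or fixed-width, so two different inputs
%% can never serialize to the same byte sequence.
-spec key(binary(), binary(), {non_neg_integer(), non_neg_integer()},
          [non_neg_integer()]) -> binary().
key(Fingerprint, Quant, {NCtx, NBatch}, Tokens)
  when is_binary(Fingerprint), is_binary(Quant), is_list(Tokens) ->
    TokenBytes = << <<T:32/little>> || T <- Tokens >>,
    crypto:hash(sha256, [
        <<(byte_size(Fingerprint)):32/little>>, Fingerprint,
        <<(byte_size(Quant)):32/little>>, Quant,
        <<NCtx:32/little, NBatch:32/little>>,
        <<(length(Tokens)):32/little>>, TokenBytes
    ]).
```

Changing a single token, the quantization tag, or the context shape yields a different key. A scheme like this can therefore only hit on exact matches.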

Requirements

Status

erllama is pre-release. The cache, scheduler, and NIF safety wrappers have unit, property, and Common Test coverage. The real-model Common Test suite is gated by LLAMA_TEST_MODEL so normal CI can run without a GGUF file.

See CHANGELOG.md for release notes.

Contributing

The contributor guide is AGENTS.md. The short version:

rebar3 fmt
rebar3 compile
rebar3 eunit
rebar3 proper
rebar3 ct
rebar3 lint
rebar3 dialyzer
rebar3 xref

Run the real-model suite when you have a GGUF available:

LLAMA_TEST_MODEL=/path/to/tinyllama-1.1b-chat.Q4_K_M.gguf \
    rebar3 ct --suite=test/erllama_real_model_SUITE

Bumping the vendored llama.cpp is covered in UPDATE_LLAMA.md.

erllama_cluster is planned as a separate OTP application for routing, cache-aware placement, speculative decoding, and distributed inference across erllama nodes.

Repository: https://github.com/erllama/erllama_cluster

Acknowledgements

Same idea as antirez/ds4.

License

MIT. Copyright (c) 2026 Benoit Chesneau. See LICENSE.

The vendored c_src/llama.cpp/ retains its upstream MIT license; see c_src/llama.cpp/LICENSE.