DGen

dgen_server brings the gen_server programming model to a distributed system. It uses the same callback interface — init/1, handle_call/3, handle_cast/2 — but unlike a gen_server, it is not a single Erlang process. A dgen_server is a virtual entity that exists as long as its state does in FoundationDB, independent of any running process. What makes it distributed rather than merely durable is its message queue: multiple consumers on any number of nodes can consume from the same queue simultaneously, with serialization and exactly-once delivery guaranteed by the backend.

Motivation

I love gen_server. There are only two things stopping me from writing my entire app with them:

  1. Durability: The state is lost when the process goes down.
  2. High availability: The functionality is unavailable when the process goes down.

Let's try to solve this with a distributed system, and find out if an app actually can be written with only gen_servers.

Getting Started

Erlang

The simplest distributed server is just a regular gen_server with the dgen_server behaviour:

-module(counter).
-behavior(dgen_server).

-export([start/1, increment/1, value/1]).
-export([init/1, handle_call/3]).

start(Tenant) ->
    dgen_server:start(?MODULE, [], [{tenant, Tenant}]).

increment(Pid) ->
    dgen_server:call(Pid, increment).

value(Pid) ->
    dgen_server:call(Pid, value).

init([]) ->
    {ok, 0}.

handle_call(increment, _From, State) ->
    {reply, ok, State + 1};
handle_call(value, _From, State) ->
    {reply, State, State}.

Start it inside a FoundationDB directory, and the state persists across restarts:

Tenant = dgen_erlfdb:sandbox_open(<<"demo">>, <<"counter">>),
{ok, Pid} = counter:start(Tenant),
counter:increment(Pid),
counter:increment(Pid),
2 = counter:value(Pid),

%% Restart the process
dgen_server:stop(Pid),
{ok, Pid2} = counter:start(Tenant),
2 = counter:value(Pid2).  %% State persisted!

Elixir

The simplest distributed server is just a regular GenServer with use DGenServer:

defmodule Counter do
  use DGenServer

  def start(tenant), do: DGenServer.start(__MODULE__, [], tenant: tenant)

  def increment(pid), do: DGenServer.cast(pid, :increment)
  def value(pid), do: DGenServer.call(pid, :value)

  @impl true
  def init([]), do: {:ok, 0}

  @impl true
  def handle_call(:value, _from, state), do: {:reply, state, state}

  @impl true
  def handle_cast(:increment, state), do: {:noreply, state + 1}
end

Start it inside a FoundationDB directory, and the state persists across restarts:

tenant = :dgen_erlfdb.sandbox_open("demo", "counter")
{:ok, pid} = Counter.start(tenant)
Counter.increment(pid)
Counter.increment(pid)
2 = Counter.value(pid)

# Restart the process
DGenServer.stop(pid)
{:ok, pid2} = Counter.start(tenant)
2 = Counter.value(pid2)  # State persisted!

Installation

DGen can be installed by adding dgen to your list of dependencies in mix.exs:

def deps do
  [
    {:dgen, "~> 0.1.0"}
  ]
end

The docs can be found at https://hexdocs.pm/dgen.

API Contract

Message Processing

dgen_server provides different message paths with different guarantees:

Standard messages (call, cast):

Priority messages (priority_call, priority_cast, handle_info):

Actions

Callbacks may return {reply, Reply, State, Actions} or {noreply, State, Actions} where Actions is a list of 1-arity functions. These functions:
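
For illustration, the four-tuple return might look like this (a hypothetical sketch: the `Saver` module and the action's body are assumptions; only the return shape is taken from the contract above, and the argument passed to the 1-arity function is not specified here):

```elixir
defmodule Saver do
  use DGenServer

  @impl true
  def init(_), do: {:ok, %{}}

  @impl true
  def handle_call(:save, _from, state) do
    # A 1-arity action function; the shape of its argument is not
    # documented here, so it is ignored.
    log = fn _ -> IO.puts("save handled") end
    {:reply, :ok, state, [log]}
  end
end
```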

Locking

A callback may return {lock, State} to enter locked mode. When locked:

Use locking for long-running operations that would exceed FoundationDB's five-second transaction limit, such as calling external APIs or performing extended computations.
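
A sketch of entering locked mode (hypothetical module; the continuation that eventually unlocks, e.g. via handle_locked, is not shown because its signature is not documented in this section):

```elixir
defmodule SlowSync do
  use DGenServer

  @impl true
  def init(_), do: {:ok, %{pending: nil}}

  @impl true
  def handle_cast({:sync, url}, state) do
    # Returning {:lock, state} leaves the transactional path, so the
    # slow external call can run without hitting the transaction
    # time limit.
    {:lock, %{state | pending: url}}
  end
end
```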

Persisted State

Encoder/Decoder

State is persisted to the key-value store using a structured encoding scheme that optimizes for partial updates. Three encoding types are supported:

  1. Assigns map: Maps with all atom keys are split across separate keys, one per entry. No ordering guarantees.

     #{
         mykey => <<"my value">>,
         otherkey => 42
     }
  2. Components list: Lists where every item is a map with an atom id key containing a binary value. Each item is stored separately with ordering maintained via fractional indexing in the storage key.

     [
         #{id => <<"item1">>, value => 1},
         #{id => <<"item2">>, value => 2}
     ]
  3. Term: All other terms use term_to_binary and are chunked into 100KB values.

     {this, is, <<"some">>, term, 4.5, #{3 => 2}}
  Nesting: In addition, the encoder handles nested structures recursively. For example, an assigns map containing a components list will nest both encodings in the key path.

     #{
         mykey => <<"my value">>,
         mylist => [
             #{id => <<"item1">>, value => 1},
             #{id => <<"item2">>, value => 2}
         ]
     }

When writing updates, diffs are generated by comparing the old and new state.

If the encoding type changes between updates, the old keys are cleared and the new encoding is written in full.
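
The partial-update idea for an assigns map can be pictured in plain Elixir (illustrative only, not the library's actual diff code): changed or added keys are written, and removed keys are cleared.

```elixir
old = %{count: 1, name: "a", tags: [:x]}
new = %{count: 2, name: "a"}

# Keys whose values changed (or were added) are rewritten...
writes = for {k, v} <- new, Map.get(old, k) != v, into: %{}, do: {k, v}
# ...and keys no longer present are cleared.
clears = Map.keys(old) -- Map.keys(new)

# writes == %{count: 2}; clears == [:tags]
```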

Caching

Each consumer process can maintain an in-memory cache of the state paired with its versionstamp. On subsequent messages, if the cached versionstamp matches the current database version, the state is reused without a read operation. This eliminates redundant reads when processing multiple messages in sequence.

The cache is invalidated when the process detects that another consumer has modified the state.
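
The versionstamp check can be pictured like this (illustrative only, not dgen's internals):

```elixir
defmodule CacheCheck do
  # Reuse the cached state only when the cached versionstamp matches the
  # current database versionstamp; otherwise force a re-read.
  def fetch(%{version: v, state: state}, v), do: {:hit, state}
  def fetch(_cache, _db_version), do: :miss
end

CacheCheck.fetch(%{version: 7, state: 2}, 7)  # {:hit, 2}
CacheCheck.fetch(%{version: 7, state: 2}, 9)  # :miss
```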

Crashing

DGenServer has well-defined behavior during crashes.

Key guarantee: Standard call and cast messages are processed exactly-once under normal operation. In the event of a crash before the transaction commits, the message will be retried — so in crash scenarios, delivery is at-least-once. When dead_letter_threshold is set, retries are bounded by that limit. Design your callbacks to be idempotent when possible.
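
One way to get idempotency is an application-level message id (a hypothetical pattern, not a dgen feature), so a redelivered message becomes a no-op:

```elixir
defmodule Ledger do
  use DGenServer

  @impl true
  def init(_), do: {:ok, %{balance: 0, applied: MapSet.new()}}

  @impl true
  def handle_cast({:credit, msg_id, amount}, state) do
    if MapSet.member?(state.applied, msg_id) do
      # Duplicate delivery (e.g. a retry after a crash): do nothing.
      {:noreply, state}
    else
      {:noreply,
       %{state | balance: state.balance + amount,
                 applied: MapSet.put(state.applied, msg_id)}}
    end
  end
end
```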

During init/1:

During transactional callbacks (handle_call, handle_cast, handle_info):

Dead-letter queue:

A poison message is a queue message that consistently crashes consumers. To enable bounded retries, set dead_letter_threshold to a positive integer. Each message envelope carries an attempt counter; when the counter reaches the threshold, the message is moved to the dead-letter queue (DLQ) in FoundationDB rather than being retried:

Dead-lettering is disabled by default (dead_letter_threshold: infinity). Enable it with the start option:

Erlang

dgen_server:start(MyMod, [], [{tenant, Tenant}, {dead_letter_threshold, 3}])

Elixir

DGenServer.start_link(MyMod, [], tenant: tenant, dead_letter_threshold: 3)

Coordinating with the supervisor's restart intensity:

Each consumer crash counts as one restart from the supervisor's perspective. OTP supervisors enforce a restart intensity — a maximum number of restarts allowed within a sliding time window (intensity / period in Erlang supervisor flags; max_restarts / max_seconds in Elixir, defaulting to 3 restarts in 5 seconds). If the supervisor reaches this limit before dead_letter_threshold is hit, the supervisor itself terminates rather than the message being dead-lettered.

To ensure dead-lettering takes effect, configure the supervisor so that it tolerates at least dead_letter_threshold restarts within the window. A practical approach is to set max_restarts >= dead_letter_threshold with a max_seconds value long enough to cover the expected crash-restart cycle time:

Erlang

%% Allow up to 5 restarts in 60 seconds — enough headroom for a threshold of 3.
{ok, _} = supervisor:start_link({local, my_sup}, my_sup, []),

%% In the supervisor's init/1:
SupFlags = #{strategy => one_for_one, intensity => 5, period => 60},

Elixir

# Allow up to 5 restarts in 60 seconds — enough headroom for a threshold of 3.
Supervisor.start_link(children,
  strategy: :one_for_one,
  max_restarts: 5,
  max_seconds: 60
)

With dead_letter_threshold: infinity (the default), poison messages produce an unbounded crash loop. The supervisor will eventually exhaust its restart intensity and terminate, which is standard OTP crash-loop behavior. Set a finite threshold to bound the loop and keep the supervisor alive.

Inspecting and managing the dead-letter queue:

Dead-lettered messages accumulate in FoundationDB and are not automatically cleared. An operator can inspect and manage them from a remote shell using functions in dgen_queue. The Quid for a server is obtained with dgen_server:get_quid/1 (Erlang) or :dgen_server.get_quid/1 (Elixir), passing the tuid the server was started with.

Erlang

Quid = dgen_server:get_quid(Tuid),

%% Inspect — returns [{Key, Envelope, AttemptCount, TimestampMs}]
Entries = dgen_queue:peek_dlq(Tenant, Quid),

%% Count without decoding
dgen_queue:dlq_length(Tenant, Quid),

%% Requeue one entry (resets attempt count to 0, atomic with DLQ delete)
{Key, _Envelope, _N, _Ts} = hd(Entries),
dgen_queue:requeue_dlq_entry(Tenant, Quid, Key),  %% ok | {error, not_found}

%% Discard one entry permanently
dgen_queue:delete_dlq_entry(Tenant, Key),

%% Discard all entries for the queue
dgen_queue:purge_dlq(Tenant, Quid).

Elixir

quid = :dgen_server.get_quid(tuid)

# Inspect — returns [{key, envelope, attempt_count, timestamp_ms}]
entries = :dgen_queue.peek_dlq(tenant, quid)

# Count without decoding
:dgen_queue.dlq_length(tenant, quid)

# Requeue one entry (resets attempt count to 0, atomic with DLQ delete)
{key, _envelope, _n, _ts} = hd(entries)
:dgen_queue.requeue_dlq_entry(tenant, quid, key)  # :ok | {:error, :not_found}

# Discard one entry permanently
:dgen_queue.delete_dlq_entry(tenant, key)

# Discard all entries for the queue
:dgen_queue.purge_dlq(tenant, quid)

requeue_dlq_entry/3 is atomic: it pushes the envelope back onto the main queue with the attempt count reset to zero and deletes the DLQ entry in a single FDB transaction. If the root cause of the crashes has been fixed and the server has been redeployed, requeueing allows the message to be retried from a clean state.

During handle_locked:

During action execution:

Supervisor restart: