EKV

Eventually consistent, durable KV store for Elixir with opt-in per-key linearizable CAS and zero runtime dependencies.

Data survives node restarts, node death, and network partitions. Member nodes replicate directly across all connected Erlang nodes using delta sync via per-shard oplogs. Storage is backed by SQLite, vendored and compiled as a NIF, so there are no external runtime dependencies.

Installation

def deps do
  [
    {:ekv, "~> 0.2.0"}
  ]
end

EKV uses SQLite as the storage layer. Precompiled NIF binaries are available for common platforms. If a precompiled binary isn't available for your system, the NIF compiles from source (requires a C compiler).

Usage

Add EKV to your supervision tree:

children = [
  {EKV, name: :my_kv, data_dir: "data/ekv/my_kv"}
]

Or start a stateless client that routes to members by region preference:

children = [
  {EKV,
   name: :my_kv_client,
   mode: :client,
   region: "ord",
   region_routing: ["iad", "dfw", "lhr"],
   wait_for_route: :timer.seconds(10),
   wait_for_quorum: :timer.seconds(10)}
]

Then use the API:

# Put / Get / Delete
EKV.put(:my_kv, "user/1", %{name: "Alice", role: :admin})
EKV.get(:my_kv, "user/1")
#=> %{name: "Alice", role: :admin}

EKV.delete(:my_kv, "user/1")
EKV.get(:my_kv, "user/1")
#=> nil

# TTL
EKV.put(:my_kv, "session/abc", token, ttl: :timer.minutes(30))

# Prefix scans
EKV.put(:my_kv, "user/1", %{name: "Alice"})
EKV.put(:my_kv, "user/2", %{name: "Bob"})

EKV.scan(:my_kv, "user/") |> Enum.to_list()
#=> [
#=>   {"user/1", %{name: "Alice"}, {ts, origin_node}},
#=>   {"user/2", %{name: "Bob"}, {ts, origin_node}}
#=> ]

EKV.keys(:my_kv, "user/")
#=> [{"user/1", {ts, origin_node}}, {"user/2", {ts, origin_node}}]

# Subscribe to a key
EKV.subscribe(:my_kv, "room/1")
EKV.put(:my_kv, "room/1", %{title: "Elixir"})

# The subscriber process receives:
#=> {:ekv, [%EKV.Event{type: :put, key: "room/1", value: %{title: "Elixir"}}], %{name: :my_kv}}

# Subscribe to a prefix
EKV.subscribe(:my_kv, "room/")

# The subscriber process receives batched events:
#=> {:ekv, [
#=>   %EKV.Event{type: :put, key: "room/1", value: %{title: "Elixir"}},
#=>   %EKV.Event{type: :put, key: "room/2", value: %{title: "Phoenix"}}
#=> ], %{name: :my_kv}}

EKV.unsubscribe(:my_kv, "room/")

Values can be any Erlang term (stored via :erlang.term_to_binary/1). Keys are strings.

Options

Option (default) -- Description

  :name (required) -- Atom identifying this EKV instance.
  :mode (:member) -- :member stores/replicates data and participates in CAS quorum. :client is stateless and routes requests to members.
  :region ("default") -- Region label exposed by members and used by clients for routing preference.
  :region_routing (nil) -- Client mode only. Ordered list of preferred member regions.
  :wait_for_route (false) -- Client mode only. Optional startup gate. Blocks startup until the first reachable member in :region_routing order is selected.
  :data_dir (required in :member mode) -- Directory for SQLite database files.
  :cluster_size (nil) -- Member mode only. Required for CAS (if_vsn, consistent: true, update/4). Total number of logical voting members.
  :node_id (auto-generated and persisted) -- Member mode only. Stable logical member id used in ballots, persisted replay origins, quorum accounting, and blue-green overlap. If omitted, EKV generates one on first boot and persists it to the shard DBs.
  :shards (8) -- Member mode only. Number of shards (each is an independent GenServer + SQLite db).
  :tombstone_ttl (604_800_000 -- 7 days) -- Member mode only. How long tombstones are retained, in milliseconds.
  :gc_interval (300_000 -- 5 min) -- Member mode only. GC tick interval in milliseconds.
  :member_progress_retention_ttl (min(:tombstone_ttl, 21_600_000) -- 6 hours by default) -- Member mode only. How long disconnected members keep anchoring replay retention before their kv_member_progress rows may be pruned. This is the main guard against partitions turning into full syncs after a GC; 0 restores the old immediate-prune behavior.
  :wait_for_quorum (false) -- Optional startup gate. In member mode, waits for this EKV member to reach CAS quorum. In client mode, waits for the selected backend member to report CAS quorum reachable.
  :anti_entropy_interval (30_000 -- 30 sec) -- Member mode only. Periodic background repair for already-connected members. Re-runs the normal HWM-driven delta/full sync path to heal missed replication without waiting for reconnect. Must be a positive timeout in ms.
  :delta_sync_log_min_entries (8) -- Member mode only. Suppresses per-delta info logs for successful terminal delta syncs smaller than this many entries. :verbose logging still prints all deltas.
  :delta_sync_storm_window (60_000 -- 60 sec) -- Member mode only. Rolling per-shard window used to aggregate delta sync activity for storm detection.
  :delta_sync_storm_threshold (100) -- Member mode only. When a shard sends at least this many delta syncs inside one storm window, EKV emits a single aggregated warning for that window. false/nil disables storm warnings.
  :wire_compression_threshold (262_144 -- 256 KB) -- Optional byte threshold for member-to-member wire compression of large replicated value payloads. false/nil disables it. Large LWW replication and CAS accept/commit payloads compress on the wire only; values remain uncompressed on disk and on reads.
  :shutdown_barrier (false) -- Optional graceful-shutdown barrier. Keeps EKV serving during coordinated shutdown for up to the configured timeout so members can finish final writes and replication.
  :allow_stale_startup (false) -- Member mode only. Dangerous recovery override. If true, EKV trusts on-disk data even when stale-db detection would normally refuse startup. Intended only for explicit disaster recovery / full cold-cluster restore cases.
  :blue_green (false) -- Member mode only. Enable blue-green deployment handoff for shared-volume replacement nodes.
  :log (:info) -- :info, false (silent), or :verbose.
  :partition_ttl_policy (:quarantine) -- Member mode only. Policy when a member identity reconnects after being disconnected longer than tombstone_ttl. :quarantine blocks replication with that member until operator intervention. :ignore disables that quarantine and allows reconnect/sync anyway.
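
Several of these options combine in a typical member child spec. As a sketch (the specific values here are illustrative, not recommendations):

```elixir
# Hypothetical 3-member deployment combining options from the table above.
children = [
  {EKV,
   name: :my_kv,
   data_dir: "/data/ekv/my_kv",
   cluster_size: 3,                          # required for CAS writes
   tombstone_ttl: :timer.hours(24 * 7),      # keep tombstones for 7 days
   anti_entropy_interval: :timer.seconds(30),
   wait_for_quorum: :timer.seconds(10)}      # block startup until CAS quorum
]
```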

Client mode

Client mode keeps the EKV API but does not start SQLite, replication, GC, or blue-green machinery on that node.
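
Assuming the :my_kv_client instance from the Usage section above, calls on a client node look identical to calls on a member; the client transparently routes them:

```elixir
# Same API as member mode; requests route to a member chosen
# per :region_routing preference.
EKV.put(:my_kv_client, "user/1", %{name: "Alice"})
EKV.get(:my_kv_client, "user/1")
#=> %{name: "Alice"}
```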

How It Works

Storage

Each shard has a single SQLite database (WAL mode) as its sole storage layer — no data is held in memory so your dataset is not bound by available system memory. Normal writes go through the shard GenServer and atomically update current state plus retained replay history in a single NIF call. Replay rows use a deduplicated kv_keyrefs dictionary so kv_oplog does not repeat full key strings on every version, and full sync rebuilds kv without seeding replay history on the receiver. Reads go directly to SQLite via per-scheduler read connections stored in persistent_term.

Data survives restarts automatically since SQLite is the source of truth.

Replication

Every write is broadcast to the counterpart shard on all connected members. Member discovery is self-contained by monitoring connected Erlang nodes going up and down. Client routing, client subscriptions, and shutdown coordination are separate and use an EKV-instance-specific :pg scope.

Note: node connection is left up to the user -- either explicitly (Node.connect/1, sys.config) or via a library such as DNSCluster or libcluster.

When a node connects (or reconnects), each shard pair exchanges a handshake. Based on high-water marks (HWMs), the pair decides whether the peer is already caught up, can be healed with a delta sync from the oplog, or needs a full sync.

Connected members also re-run that same handshake periodically by default (anti_entropy_interval) so a member that missed a prior update eventually repairs itself without waiting for a reconnect.

Conflict resolution

Last-Writer-Wins with nanosecond timestamps. Ties are broken deterministically by comparing persisted origin strings, so all nodes converge to the same result without coordination. In member mode, this origin is the stable node_id rather than the transient blue/green Erlang node name.

A delete is just an entry with deleted_at set. Same LWW rules apply -- a put with a higher timestamp beats a delete, and vice versa.
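
The merge rule above can be sketched as a pure function. This is illustrative only, not EKV's internal code; the {value, ts_ns, origin} tuple shape is an assumption for the sketch:

```elixir
# Sketch of LWW merge: higher timestamp wins; on a timestamp tie the
# origin strings break it deterministically, so every node converges.
defmodule LWWSketch do
  def merge({_v1, ts1, o1} = a, {_v2, ts2, o2} = b) do
    cond do
      ts1 > ts2 -> a
      ts2 > ts1 -> b
      o1 >= o2 -> a
      true -> b
    end
  end
end

LWWSketch.merge({"alice", 100, "node_a"}, {"bob", 100, "node_b"})
#=> {"bob", 100, "node_b"}
```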

Consistency Modes - LWW vs CAS (Compare-And-Swap)

EKV supports two write modes:

  1. LWW (default) -- writes apply locally and replicate asynchronously; conflicts resolve by last-writer-wins as described above.
  2. CAS -- put with if_vsn: or consistent: true, delete with if_vsn:, and update go through a quorum round for per-key linearizable compare-and-swap (requires :cluster_size).

Use consistent mode for key-ownership style data, such as lock/ownership keyspaces.

CAS write error semantics

CAS writes (put with if_vsn: or consistent: true, delete with if_vsn:, update) can return:

  - {:ok, ...} when the write commits
  - {:error, :conflict} when the expected version no longer matches current state
  - {:error, :unavailable} when a CAS quorum cannot be reached
  - {:error, :unconfirmed} when an accept round ran but the final outcome is ambiguous

On :unconfirmed, resolve with EKV.get(name, key, consistent: true) before taking follow-up actions.

You can opt in to internal resolution per call with resolve_unconfirmed: true on CAS write APIs. In that mode, EKV performs one barrier read when an ambiguous accept outcome occurs and returns resolved current-state outcomes ({:ok, ...} or {:error, :conflict}) when possible, or {:error, :unavailable} if the resolution read itself cannot complete.
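
Handling these outcomes might look like the following sketch. The key name, expected_vsn, and the exact success-tuple shape are illustrative assumptions, not part of the documented API:

```elixir
# Sketch: CAS write with manual resolution of an ambiguous outcome.
case EKV.put(:my_kv, "owner/job-42", node(), if_vsn: expected_vsn) do
  {:ok, _vsn} ->
    :acquired
  {:error, :conflict} ->
    :lost_race
  {:error, :unavailable} ->
    :retry_later
  {:error, :unconfirmed} ->
    # Ambiguous accept outcome: resolve with a barrier read before acting.
    case EKV.get(:my_kv, "owner/job-42", consistent: true) do
      owner when owner == node() -> :acquired
      _ -> :lost_race
    end
end
```

Passing resolve_unconfirmed: true instead folds that barrier read into the call itself, as described above.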

Mixed-mode note

Transition rule per key:

Keep lock/ownership keyspaces CAS-write-only after transition.

Garbage collection

A periodic GC timer runs three phases per tick:

  1. Observe TTL expiry -- emits :expired events; expired LWW rows become tombstones, expired CAS rows stay lazy/local
  2. Purge old tombstones / long-expired CAS rows -- hard-deletes data older than the retention window from SQLite
  3. Truncate oplog -- removes oplog entries below the minimum member HWM

Stale database protection

If a node goes away longer than tombstone_ttl and comes back with an old database on disk, other members will have already GC'd the tombstones for entries deleted during the absence. EKV detects this by checking a last_active_at timestamp stored in the database. If the database is too stale, EKV fails startup by default instead of trusting that on-disk state. Operators can then wipe that node's data dir so it rebuilds from members, or explicitly set allow_stale_startup: true when they intend to trust the old on-disk cluster state.
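
The recovery override is just a child-spec flag, per the options table. A hedged sketch of an explicit disaster-recovery restart:

```elixir
# Dangerous: trust stale on-disk state anyway. Only for explicit
# disaster recovery / full cold-cluster restore; normally omit this.
children = [
  {EKV, name: :my_kv, data_dir: "data/ekv/my_kv", allow_stale_startup: true}
]
```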

Each shard DB also persists a named schema_version in kv_meta. Fresh databases stamp the current version on first open. Initialized shard DBs with missing or mismatched schema_version fail startup closed so EKV does not silently boot incompatible on-disk state.

Fresh shard DBs also enable SQLite auto_vacuum=INCREMENTAL. This only applies at creation time; EKV does not rewrite existing shard DBs on normal startup just to change SQLite vacuum mode.

Long live-partition protection

A different edge case is when nodes stay up but are partitioned longer than tombstone_ttl. In that window, one side can purge delete tombstones before reconnect.

For shorter outages, oplog retention is anchored independently by member_progress_retention_ttl (default: min(tombstone_ttl, 6 hours)). If a disconnected member rejoins within that window, GC keeps its replay cursor so heal can usually stay on delta instead of falling back to a full sync.

With the default partition_ttl_policy: :quarantine, EKV detects reconnects after a downtime longer than tombstone_ttl and quarantines that member pair instead of syncing potentially unsafe state. Replication stays blocked for that member until an operator rebuilds one side.

Down-since markers are persisted in kv_meta, keyed by node_id when available (fallback: node name), so restart does not clear quarantine history. This also means node-name churn does not bypass quarantine when node_id is stable.

Fallback name-based markers are bounded: EKV prunes very old entries and caps the retained set per shard to avoid unbounded growth over long periods.

Multiple instances

Each EKV instance (identified by :name) is fully independent -- its own SQLite files, shard GenServers, member mesh, and scoped :pg control plane for routing, subscriptions, and shutdown coordination. To isolate replication between groups of nodes, start separate EKV instances with different names on the nodes that should form each group.

# Only nodes in the US region start this:
{EKV, name: :us_data, data_dir: "/data/ekv/us"}

# Only nodes in the EU region start this:
{EKV, name: :eu_data, data_dir: "/data/ekv/eu"}

License

MIT