# Approx
Probabilistic data structures for Elixir. Pure Elixir, zero runtime dependencies, immutable, mergeable, and serializable.
Approx gives you battle-tested sketches and samplers for the problems that come up in every data-intensive Elixir application — deduplication, frequency counting, cardinality estimation, percentile tracking, similarity detection, and sampling — all in constant or sub-linear space.
## Installation

Add `approx` to your dependencies in `mix.exs`:

```elixir
def deps do
  [
    {:approx, "~> 0.1.0"}
  ]
end
```

## What's Included
| Structure | Answers the question... | Module |
|---|---|---|
| Bloom Filter | "Have I seen this before?" (no false negatives) | Approx.BloomFilter |
| Cuckoo Filter | "Have I seen this before?" (supports deletion) | Approx.CuckooFilter |
| Count-Min Sketch | "How many times has X appeared?" | Approx.CountMinSketch |
| Top-K | "What are the K most frequent items?" | Approx.TopK |
| HyperLogLog | "How many distinct items have I seen?" | Approx.HyperLogLog |
| Reservoir Sampling | "Give me a uniform random sample of K items" | Approx.Reservoir |
| MinHash | "How similar are these two sets?" | Approx.MinHash |
| t-digest | "What is the p99 latency?" | Approx.TDigest |
## Used in Production
| Structure | Used by |
|---|---|
| Bloom Filter | Google Bigtable skips disk reads for non-existent rows, Chrome checks URLs against the Safe Browsing blocklist locally, Medium deduplicates article recommendations per user, PostgreSQL filters non-matching rows during hash joins, Cassandra skips SSTables that don't contain a partition key |
| Cuckoo Filter | Cache invalidation systems track cached keys and delete them on eviction, dynamic firewalls block/unblock IPs without rebuilding the filter, session revocation stores revoked JWT IDs and removes them after expiry |
| Count-Min Sketch | Twitter/X counts per-hashtag frequency to rank trending topics, Cisco routers estimate per-flow byte counts at line rate, Apache Spark Structured Streaming runs approximate frequency aggregations on unbounded streams |
| Top-K | Search engines rank autocomplete suggestions by query frequency, CDNs like Akamai track most-requested URLs for cache-warming at edge PoPs, analytics dashboards stream events to display "most visited pages" in real time |
| HyperLogLog | Redis PFADD/PFCOUNT counts unique items in ~12 KB per key, BigQuery's APPROX_COUNT_DISTINCT avoids full-table distinct scans, Presto/Trino merges per-worker sketches for approx_distinct(), Elasticsearch counts distinct values on billion-entry fields |
| Reservoir Sampling | Datadog/New Relic agents keep a fixed-size sample of traces per interval to cap ingestion volume, load balancers sample requests for debugging without logging everything, Apache Spark merges per-partition reservoirs for approximate quantiles |
| MinHash | Google computes shingle-set signatures during crawling to detect near-duplicate pages, Spotify compares playlist track-set signatures for recommendations, ad platforms estimate audience overlap between campaigns without joining raw user lists |
| t-digest | Elasticsearch builds per-shard digests and merges them to return p50/p95/p99 without sorting, Datadog merges per-host latency digests for service-level percentiles, Apache Druid stores serialized digests per time chunk for fast range queries, Prometheus client libraries compute summary quantiles internally |
## Examples

### Bloom Filter — deduplicate webhook deliveries
Prevent processing the same webhook twice. The filter uses a few KB of memory instead of storing every event ID in a database or ETS table.
```elixir
bf = Approx.BloomFilter.new(1_000_000, 0.001)

# On each incoming webhook
event_id = "evt_abc123"

# Return the updated filter alongside the result — rebinding `bf` inside
# the `else` branch alone would discard the new filter after the `if`.
{result, bf} =
  if Approx.BloomFilter.member?(bf, event_id) do
    # Already processed — skip
    {:duplicate, bf}
  else
    # Process the webhook, then record the event ID
    {:ok, Approx.BloomFilter.add(bf, event_id)}
  end
```

### Cuckoo Filter — manage a blocklist with removals
Like a Bloom filter, but you can remove entries — useful for a dynamic IP blocklist where addresses are blocked and later unblocked.
```elixir
cf = Approx.CuckooFilter.new(100_000)

# Block an IP
{:ok, cf} = Approx.CuckooFilter.add(cf, "192.168.1.42")
Approx.CuckooFilter.member?(cf, "192.168.1.42") # => true

# Unblock it later
{:ok, cf} = Approx.CuckooFilter.delete(cf, "192.168.1.42")
Approx.CuckooFilter.member?(cf, "192.168.1.42") # => false
```

### Count-Min Sketch — rate limiting by API key
Track how many requests each API key has made without storing per-key counters for millions of keys.
```elixir
cms = Approx.CountMinSketch.new(0.001, 0.01)

# On each API request
api_key = "key_xyz789"
cms = Approx.CountMinSketch.add(cms, api_key)

if Approx.CountMinSketch.count(cms, api_key) > 1000 do
  # Rate limit exceeded
  {:error, :rate_limited}
else
  {:ok, :allowed}
end
```

### Top-K — find the most popular pages in real time
Track the top 10 most viewed pages across your application without storing every page view.
```elixir
tk = Approx.TopK.new(10)

# On each page view
tk = Approx.TopK.add(tk, "/docs/getting-started")
tk = Approx.TopK.add(tk, "/pricing")
tk = Approx.TopK.add(tk, "/docs/getting-started")

Approx.TopK.top(tk)
# => [{"/docs/getting-started", 2}, {"/pricing", 1}]
```

### HyperLogLog — count unique visitors across nodes
Each web server maintains its own HyperLogLog. At the end of the day, merge them for a global distinct count using only ~16 KB of memory.
```elixir
hll_node_a = Approx.HyperLogLog.new(14)
hll_node_a = Enum.reduce(user_ids_node_a, hll_node_a, &Approx.HyperLogLog.add(&2, &1))

hll_node_b = Approx.HyperLogLog.new(14)
hll_node_b = Enum.reduce(user_ids_node_b, hll_node_b, &Approx.HyperLogLog.add(&2, &1))

{:ok, merged} = Approx.HyperLogLog.merge(hll_node_a, hll_node_b)
Approx.HyperLogLog.count(merged)
# => ~2_500_000 unique visitors (estimated, <1% error)
```

### Reservoir Sampling — sample logs for debugging
Keep a random sample of 100 log entries from a stream of millions, without knowing the stream length in advance. Every entry has an equal chance of being in the sample.
```elixir
reservoir = Approx.Reservoir.new(100)

# In a log processing pipeline
reservoir = Enum.reduce(log_entries, reservoir, &Approx.Reservoir.add(&2, &1))

sampled = Approx.Reservoir.sample(reservoir)
# => 100 uniformly random log entries from the full stream
```

### MinHash — detect near-duplicate articles
Compute compact signatures of article word sets, then estimate similarity in constant time — no need to compare full texts.
```elixir
mh = Approx.MinHash.new(128, seed: 42)

words_a = article_a |> String.split() |> MapSet.new()
words_b = article_b |> String.split() |> MapSet.new()

sig_a = Approx.MinHash.signature(mh, words_a)
sig_b = Approx.MinHash.signature(mh, words_b)

Approx.MinHash.similarity(sig_a, sig_b)
# => 0.82 (82% similar — likely a near-duplicate)
```

### t-digest — monitor API latency SLAs
Track latency percentiles with high accuracy at the tails — exactly where SLA violations happen. Two digests from different time windows can be merged.
```elixir
td = Approx.TDigest.new()

# Record latencies in milliseconds
td = Enum.reduce(request_latencies_ms, td, &Approx.TDigest.add(&2, &1))

Approx.TDigest.percentile(td, 0.50)  # => p50 (median)
Approx.TDigest.percentile(td, 0.95)  # => p95
Approx.TDigest.percentile(td, 0.99)  # => p99
Approx.TDigest.percentile(td, 0.999) # => p99.9

# Merge hourly digests into a daily summary
daily = Approx.TDigest.merge(hour_1_digest, hour_2_digest)
```

## Key Features
- Pure Elixir — no NIFs, no ports, no runtime dependencies
- Mergeable — combine structures from distributed nodes with `merge/2`
- Serializable — most structures support `to_binary/1` / `from_binary/1`
- Tested — 600+ tests including statistical accuracy bounds and round-trip serialization
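As a sketch of the serialization round-trip using the `to_binary/1` / `from_binary/1` names listed above (the exact return shape of `from_binary/1` is an assumption — check the hexdocs for each module):

```elixir
# Build and populate a filter, persist it to disk, then restore it.
bf = Approx.BloomFilter.new(1_000_000, 0.001)
bf = Approx.BloomFilter.add(bf, "evt_abc123")

# Serialize to a binary and write it out (e.g. between deploys)
File.write!("bloom.bin", Approx.BloomFilter.to_binary(bf))

# Later, restore the filter — {:ok, _} shape assumed here
{:ok, restored} = Approx.BloomFilter.from_binary(File.read!("bloom.bin"))

# Bloom filters have no false negatives, so a previously added
# item is still reported as a member after the round trip
Approx.BloomFilter.member?(restored, "evt_abc123")
```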
## Documentation

Full API docs with examples on every function: [hexdocs.pm/approx](https://hexdocs.pm/approx)

## License

MIT — see [LICENSE](LICENSE) for details.