# Approx
Probabilistic data structures for Elixir. Pure Elixir, zero runtime dependencies, immutable, mergeable, and serializable.
Approx gives you battle-tested sketches and samplers for the problems that come up in every data-intensive Elixir application — deduplication, frequency counting, cardinality estimation, percentile tracking, similarity detection, and sampling — all in constant or sub-linear space.
## Installation

Add `approx` to your dependencies in `mix.exs`:

```elixir
def deps do
  [
    {:approx, "~> 0.1.0"}
  ]
end
```

## What's Included
| Structure | Answers the question... | Module |
|---|---|---|
| Bloom Filter | "Have I seen this before?" (no false negatives) | Approx.BloomFilter |
| Cuckoo Filter | "Have I seen this before?" (supports deletion) | Approx.CuckooFilter |
| Count-Min Sketch | "How many times has X appeared?" | Approx.CountMinSketch |
| Top-K | "What are the K most frequent items?" | Approx.TopK |
| HyperLogLog | "How many distinct items have I seen?" | Approx.HyperLogLog |
| Reservoir Sampling | "Give me a uniform random sample of K items" | Approx.Reservoir |
| MinHash | "How similar are these two sets?" | Approx.MinHash |
| t-digest | "What is the p99 latency?" | Approx.TDigest |
## Used in Production
| Structure | Used by |
|---|---|
| Bloom Filter | Google Bigtable skips disk reads for non-existent rows, Chrome checks URLs against the Safe Browsing blocklist locally, Medium deduplicates article recommendations per user, PostgreSQL filters non-matching rows during hash joins, Cassandra skips SSTables that don't contain a partition key |
| Cuckoo Filter | Cache invalidation systems track cached keys and delete them on eviction, dynamic firewalls block/unblock IPs without rebuilding the filter, session revocation stores revoked JWT IDs and removes them after expiry |
| Count-Min Sketch | Twitter/X counts per-hashtag frequency to rank trending topics, Cisco routers estimate per-flow byte counts at line rate, Apache Spark Structured Streaming runs approximate frequency aggregations on unbounded streams |
| Top-K | Search engines rank autocomplete suggestions by query frequency, CDNs like Akamai track most-requested URLs for cache-warming at edge PoPs, analytics dashboards stream events to display "most visited pages" in real time |
| HyperLogLog | Redis PFADD/PFCOUNT counts unique items in ~12 KB per key, BigQuery's APPROX_COUNT_DISTINCT avoids full-table distinct scans, Presto/Trino merges per-worker sketches for approx_distinct(), Elasticsearch counts distinct values on billion-entry fields |
| Reservoir Sampling | Datadog/New Relic agents keep a fixed-size sample of traces per interval to cap ingestion volume, load balancers sample requests for debugging without logging everything, Apache Spark merges per-partition reservoirs for approximate quantiles |
| MinHash | Google computes shingle-set signatures during crawling to detect near-duplicate pages, Spotify compares playlist track-set signatures for recommendations, ad platforms estimate audience overlap between campaigns without joining raw user lists |
| t-digest | Elasticsearch builds per-shard digests and merges them to return p50/p95/p99 without sorting, Datadog merges per-host latency digests for service-level percentiles, Apache Druid stores serialized digests per time chunk for fast range queries, Prometheus client libraries compute summary quantiles internally |
## Examples

### Bloom Filter — deduplicate webhook deliveries
Prevent processing the same webhook twice. The filter uses a few KB of memory instead of storing every event ID in a database or ETS table.
```elixir
bf = Approx.BloomFilter.new(1_000_000, 0.001)

# On each incoming webhook
event_id = "evt_abc123"

# Return the updated filter alongside the result — rebinding `bf` inside
# the `else` branch alone would discard the new filter after the `if`.
{result, bf} =
  if Approx.BloomFilter.member?(bf, event_id) do
    # Already processed — skip
    {:duplicate, bf}
  else
    # Process the webhook, then record the event ID
    {:ok, Approx.BloomFilter.add(bf, event_id)}
  end
```

### Cuckoo Filter — manage a blocklist with removals
Like a Bloom filter, but you can remove entries — useful for a dynamic IP blocklist where addresses are blocked and later unblocked.
```elixir
cf = Approx.CuckooFilter.new(100_000)

# Block an IP
{:ok, cf} = Approx.CuckooFilter.add(cf, "192.168.1.42")
Approx.CuckooFilter.member?(cf, "192.168.1.42") # => true

# Unblock it later
{:ok, cf} = Approx.CuckooFilter.delete(cf, "192.168.1.42")
Approx.CuckooFilter.member?(cf, "192.168.1.42") # => false
```

### Count-Min Sketch — rate limiting by API key
Track how many requests each API key has made without storing per-key counters for millions of keys.
```elixir
cms = Approx.CountMinSketch.new(0.001, 0.01)

# On each API request
api_key = "key_xyz789"
cms = Approx.CountMinSketch.add(cms, api_key)

if Approx.CountMinSketch.count(cms, api_key) > 1000 do
  # Rate limit exceeded
  {:error, :rate_limited}
else
  {:ok, :allowed}
end
```

### Top-K — find the most popular pages in real time
Track the top 10 most viewed pages across your application without storing every page view.
```elixir
tk = Approx.TopK.new(10)

# On each page view
tk = Approx.TopK.add(tk, "/docs/getting-started")
tk = Approx.TopK.add(tk, "/pricing")
tk = Approx.TopK.add(tk, "/docs/getting-started")

Approx.TopK.top(tk)
# => [{"/docs/getting-started", 2}, {"/pricing", 1}]
```

### HyperLogLog — count unique visitors across nodes
Each web server maintains its own HyperLogLog. At the end of the day, merge them for a global distinct count using only ~16 KB of memory.
```elixir
hll_node_a = Approx.HyperLogLog.new(14)
hll_node_a = Enum.reduce(user_ids_node_a, hll_node_a, &Approx.HyperLogLog.add(&2, &1))

hll_node_b = Approx.HyperLogLog.new(14)
hll_node_b = Enum.reduce(user_ids_node_b, hll_node_b, &Approx.HyperLogLog.add(&2, &1))

{:ok, merged} = Approx.HyperLogLog.merge(hll_node_a, hll_node_b)
Approx.HyperLogLog.count(merged)
# => ~2_500_000 unique visitors (estimated, <1% error)
```

### Reservoir Sampling — sample logs for debugging
Keep a random sample of 100 log entries from a stream of millions, without knowing the stream length in advance. Every entry has an equal chance of being in the sample.
```elixir
reservoir = Approx.Reservoir.new(100)

# In a log processing pipeline
reservoir = Enum.reduce(log_entries, reservoir, &Approx.Reservoir.add(&2, &1))

sampled = Approx.Reservoir.sample(reservoir)
# => 100 uniformly random log entries from the full stream
```

### MinHash — detect near-duplicate articles
Compute compact signatures of article word sets, then estimate similarity in constant time — no need to compare full texts.
```elixir
mh = Approx.MinHash.new(128, seed: 42)

words_a = article_a |> String.split() |> MapSet.new()
words_b = article_b |> String.split() |> MapSet.new()

sig_a = Approx.MinHash.signature(mh, words_a)
sig_b = Approx.MinHash.signature(mh, words_b)

Approx.MinHash.similarity(sig_a, sig_b)
# => 0.82 (82% similar — likely a near-duplicate)
```

### t-digest — monitor API latency SLAs
Track latency percentiles with high accuracy at the tails — exactly where SLA violations happen. Two digests from different time windows can be merged.
```elixir
td = Approx.TDigest.new()

# Record latencies in milliseconds
td = Enum.reduce(request_latencies_ms, td, &Approx.TDigest.add(&2, &1))

Approx.TDigest.percentile(td, 0.50)  # => p50 (median)
Approx.TDigest.percentile(td, 0.95)  # => p95
Approx.TDigest.percentile(td, 0.99)  # => p99
Approx.TDigest.percentile(td, 0.999) # => p99.9

# Merge hourly digests into a daily summary
daily = Approx.TDigest.merge(hour_1_digest, hour_2_digest)
```

## Key Features
- Pure Elixir — no NIFs, no ports, no runtime dependencies
- Mergeable — combine structures from distributed nodes with `merge/2`
- Serializable — most structures support `to_binary/1` / `from_binary/1`
- Tested — 600+ tests including statistical accuracy bounds and round-trip serialization
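As a sketch of the serialization round-trip using the `to_binary/1` / `from_binary/1` names listed above (the exact return shape of `from_binary/1` is an assumption — check the hexdocs for each module):

```elixir
# Build and populate a filter, persist it to disk, then restore it.
bf = Approx.BloomFilter.new(1_000_000, 0.001)
bf = Approx.BloomFilter.add(bf, "evt_abc123")

# Serialize to a binary and write it out (e.g. between deploys)
File.write!("bloom.bin", Approx.BloomFilter.to_binary(bf))

# Later, restore the filter — {:ok, _} shape assumed here
{:ok, restored} = Approx.BloomFilter.from_binary(File.read!("bloom.bin"))

# Bloom filters have no false negatives, so a previously added
# item is still reported as a member after the round trip
Approx.BloomFilter.member?(restored, "evt_abc123")
```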
## Documentation

Full API docs with examples on every function: [hexdocs.pm/approx](https://hexdocs.pm/approx)

## License

MIT — see [LICENSE](LICENSE) for details.