ExDataSketch

Production-grade streaming data sketching algorithms for Elixir.

ExDataSketch provides probabilistic data structures for approximate counting, frequency estimation, quantile computation, heavy-hitter detection, membership testing with deletion, and set reconciliation on streaming data. Stream-native integration with Elixir's Collectable, GenStage, Broadway, and Flow, plus :telemetry/OpenTelemetry instrumentation, persistence backends (ETS, DETS, CubDB, Mnesia, Ecto), and nine production-oriented Livebooks.

CIHex versionHex docsLicenseCoverage Status

Supported Algorithms

Algorithm Purpose Status
HyperLogLog (HLL) Cardinality estimation Implemented (Pure + Rust)
Count-Min Sketch (CMS) Frequency estimation Implemented (Pure + Rust)
Theta Sketch Set operations on cardinalities Implemented (Pure + Rust)
KLL Quantiles Rank and quantile estimation Implemented (Pure + Rust)
DDSketch Relative-error quantile estimation Implemented (Pure + Rust)
FrequentItems (SpaceSaving) Heavy-hitter / top-k detection Implemented (Pure + Rust)
Bloom Filter Probabilistic membership testing Implemented (Pure + Rust)
Cuckoo Filter Membership testing with deletion Implemented (Pure + Rust)
Quotient Filter Membership with deletion and merge Implemented (Pure + Rust)
CQF (Counting Quotient) Multiset membership with counting Implemented (Pure + Rust)
XorFilter Static immutable membership testing Implemented (Pure + Rust)
IBLT Set reconciliation Implemented (Pure + Rust)
REQ Sketch Relative-error quantile estimation Implemented (Pure)
Misra-Gries Deterministic heavy-hitter detection Implemented (Pure)
UltraLogLog (ULL) Improved cardinality estimation Implemented (Pure + Rust)

Capability Matrix

Structure insert delete merge count serialize static reconciliation
Bloom yes -- yes yes* yes -- --
Cuckoo yes yes -- yes yes -- --
Quotient yes yes yes yes yes -- --
CQF yes yes yes yes yes -- --
XorFilter -- -- -- yes yes yes --
IBLT yes yes yes yes yes -- yes

*Bloom count is a popcount-based cardinality estimate.

When to Choose

Installation

Add ex_data_sketch to your list of dependencies in mix.exs:

def deps do
  [
    {:ex_data_sketch, "~> 0.9.0"}
  ]
end

Quick Start

# HLL: count distinct elements
hll = ExDataSketch.HLL.new() |> ExDataSketch.HLL.update_many(1..100_000)
ExDataSketch.HLL.estimate(hll)  # ~100_000

# Stream API: build from lazy enumerables
sketch = 1..100_000 |> Stream.map(&to_string/1) |> ExDataSketch.Stream.hll(p: 14)

# Collectable: Enum.into works with any mergeable sketch
sketch = Enum.into(1..50_000, ExDataSketch.ULL.new(p: 14))

# Partitioned parallel reduction
sketch = ExDataSketch.Stream.reduce_partitioned(1..1_000_000, ExDataSketch.HLL, partitions: 8, p: 14)

# Persistence: save and load from ETS
:ets.new(:sketches, [:set, :public, :named_table])
ExDataSketch.Storage.ETS.save(sketch, :sketches, "daily:2024-01-15")
{:ok, loaded} = ExDataSketch.Storage.ETS.load(ExDataSketch.HLL, :sketches, "daily:2024-01-15")

# KLL: quantile estimation
kll = ExDataSketch.KLL.new() |> ExDataSketch.KLL.update_many(1..100_000)
ExDataSketch.KLL.quantile(kll, 0.5)   # approximate median (~50_000)
ExDataSketch.KLL.quantile(kll, 0.99)  # 99th percentile (~99_000)

# Bloom: membership testing
bloom = ExDataSketch.Bloom.new(capacity: 100_000)
bloom = ExDataSketch.Bloom.put_many(bloom, 1..50_000)
ExDataSketch.Bloom.member?(bloom, 42)      # true
ExDataSketch.Bloom.member?(bloom, 99_999)  # false (probably)

See the Quick Start Guide for more examples.

Livebooks

Nine production-oriented Livebooks demonstrate real-world patterns, from basic stream consumption to distributed merge semantics and AI workload analytics. See Livebooks Guide for the recommended reading order and what each Livebook teaches.

Documentation

Full documentation is available at HexDocs.

Architecture

Compatibility and Stability

The following guarantees apply within the v0.x release series:

Not guaranteed:

Development

# Get dependencies
mix deps.get

# Run tests with coverage
mix test --cover

# Run lints
mix lint

# Run benchmarks
mix bench

# Generate docs
mix docs

Roadmap

Version Focus Status
v0.1.0 Core sketches (HLL, CMS, Theta) + Rust NIFs Released
v0.2.0 KLL quantiles Released
v0.2.1 DDSketch relative-error quantiles Released
v0.3.0 FrequentItems (SpaceSaving) Released
v0.4.0 Bloom filter (membership testing) Released
v0.5.0 Advanced membership filters (Cuckoo, Quotient, CQF, XorFilter, IBLT, FilterChain) Released
v0.6.0 REQ sketch, Misra-Gries, XXHash3, Rust NIF parity for all membership filters Released
v0.7.0 ULL (UltraLogLog) -- improved cardinality estimation with Pure + Rust NIF Released
v0.7.1 NIF batch hashing, hash customization, quotient filter fix, merge safety Released
v0.8.0 Deterministic Foundations -- pluggable hash registry (XXHash3 + Murmur3), binary stability and corruption detection, HLL hot-path optimization, precompiled NIFs, property-based validation Released
v0.9.0 Streaming Integrations -- Stream/Collectable API, Broadway/GenStage/Flow integration, persistence (ETS/DETS/CubDB/Mnesia/Ecto), telemetry + OpenTelemetry, ULL accuracy fix, v1 serialization escape hatch Released
v0.10.0 Apache Interoperability -- full cross-language KLL and HLL exchange, golden binary corpus, validation suite Planned
v0.11.0 New Sketch Families -- CPC (Compressed Probabilistic Counting), Tuple Sketch (weighted distinct counting) Planned
v0.12.0 Similarity & Sampling -- MinHash, Weighted MinHash, VarOpt sampling Planned
v1.0.0 Stable Binary Contract -- locked EXSK format, full benchmark suite, Nx / Arrow ecosystem integrations Planned

See guides/roadmap.md for the next-release preview. The long-form strategic roadmap is in plans/next_steps.md.

License

MIT License. See LICENSE for details.