ExDataSketch
Production-grade streaming data sketching algorithms for Elixir.
ExDataSketch provides probabilistic data structures for approximate counting, frequency estimation, and quantile computation on streaming data. All sketch state is stored as Elixir-owned binaries, enabling straightforward serialization, distribution, and persistence.
Supported Algorithms
| Algorithm | Purpose | Status |
|---|---|---|
| HyperLogLog (HLL) | Cardinality estimation | Implemented (Pure + Rust) |
| Count-Min Sketch (CMS) | Frequency estimation | Implemented (Pure + Rust) |
| Theta Sketch | Set operations on cardinalities | Implemented (Pure + Rust) |
| KLL Quantiles | Rank and quantile estimation | Implemented (Pure + Rust) |
Installation
Add ex_data_sketch to your list of dependencies in mix.exs:
def deps do
  [
    {:ex_data_sketch, "~> 0.2.0"}
  ]
end
Quick Start
# HLL: count distinct elements
hll = ExDataSketch.HLL.new() |> ExDataSketch.HLL.update_many(1..100_000)
ExDataSketch.HLL.estimate(hll) # ~100_000
# KLL: quantile estimation
kll = ExDataSketch.KLL.new() |> ExDataSketch.KLL.update_many(1..100_000)
ExDataSketch.KLL.quantile(kll, 0.5) # approximate median (~50_000)
ExDataSketch.KLL.quantile(kll, 0.99) # 99th percentile (~99_000)
See the Quick Start Guide for more examples.
Documentation
Full documentation is available at HexDocs.
Architecture
- Binary state: All sketch state is stored as canonical Elixir binaries, with no opaque NIF resources.
- Backend system: Pure Elixir reference implementation with optional Rust NIF acceleration. The Rust backend falls back to Pure automatically when unavailable.
- Serialization: ExDataSketch-native format (EXSK) for all sketches, plus Apache DataSketches CompactSketch interop for Theta.
- Deterministic hashing: Stable 64-bit hash (ExDataSketch.Hash) for reproducible results.
- Backend parity: Both backends produce byte-identical serialized output for the same inputs.
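To make the "binary state" and "deterministic hashing" points concrete, here is a minimal toy sketch in plain Elixir. It is not ExDataSketch's implementation or its EXSK format: the `ToyHLL` module, its register layout, and the use of MD5 as a stand-in for `ExDataSketch.Hash` are all assumptions made for illustration. It only shows why keeping state in an ordinary binary makes comparison, encoding, and persistence trivial.

```elixir
defmodule ToyHLL do
  @moduledoc """
  Hypothetical toy HyperLogLog whose entire state is one binary.
  Illustrative only; this is not ExDataSketch's implementation or format.
  """

  # 2^10 = 1024 one-byte registers.
  @m 1024

  @doc "Fresh state: 1024 zeroed registers in a plain binary."
  def new, do: :binary.copy(<<0>>, @m)

  @doc "Fold one item into the state and return the new binary."
  def update(state, item) do
    # Assumption: MD5 over the external term format stands in for a stable
    # 64-bit hash. The first 10 bits pick a register, the next 54 give the rank.
    <<idx::10, w::54, _::binary>> = :erlang.md5(:erlang.term_to_binary(item))
    rank = leading_zeros(w) + 1
    <<head::binary-size(idx), old, tail::binary>> = state
    <<head::binary, max(old, rank), tail::binary>>
  end

  @doc "Fold every item of an enumerable into the state."
  def update_many(state, items), do: Enum.reduce(items, state, &update(&2, &1))

  @doc "Standard HLL estimate with the small-range (linear counting) correction."
  def estimate(state) do
    regs = :binary.bin_to_list(state)
    sum = Enum.reduce(regs, 0.0, fn r, acc -> acc + :math.pow(2.0, -r) end)
    raw = 0.7213 / (1 + 1.079 / @m) * @m * @m / sum
    zeros = Enum.count(regs, &(&1 == 0))
    if raw <= 2.5 * @m and zeros > 0, do: @m * :math.log(@m / zeros), else: raw
  end

  # Leading zeros of w viewed as a 54-bit unsigned integer.
  defp leading_zeros(0), do: 54
  defp leading_zeros(w), do: 54 - length(Integer.digits(w, 2))
end

a = ToyHLL.new() |> ToyHLL.update_many(1..10_000)
b = ToyHLL.new() |> ToyHLL.update_many(1..10_000)

IO.inspect(a == b, label: "same input, same bytes")
IO.inspect(a |> Base.encode64() |> Base.decode64!() == a, label: "survives a round trip")
IO.inspect(ToyHLL.estimate(a), label: "estimate (roughly 10_000)")
```

Because the state is just bytes, equality is a plain `==`, and any transport that carries binaries (a file, a message between nodes, a database column) can carry a sketch without extra machinery.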
Compatibility and Stability
The following guarantees apply within the v0.x release series:
- EXSK serialization: The ExDataSketch-native binary format is stable. Binaries produced by any v0.x release can be deserialized by any other v0.x release.
- Pure vs Rust parity: Given identical inputs, both backends produce byte-identical serialized state and identical estimates.
- Deterministic output: The same input sequence always produces the same sketch state and estimate, regardless of backend.
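A quick way to observe the determinism guarantee is to build two sketches from the same input sequence and compare their estimates. The script below is a minimal sketch that uses only the calls shown in the Quick Start (`HLL.new/0`, `HLL.update_many/2`, `HLL.estimate/1`); the file name `determinism_check.exs` is hypothetical, and `Mix.install/1` assumes network access to Hex and that the script runs outside a Mix project.

```elixir
# determinism_check.exs (hypothetical file name)
# Run with: elixir determinism_check.exs
Mix.install([{:ex_data_sketch, "~> 0.2.0"}])

alias ExDataSketch.HLL

# Feed the same input sequence into two independently created sketches.
first = HLL.new() |> HLL.update_many(1..100_000)
second = HLL.new() |> HLL.update_many(1..100_000)

# Per the guarantee above, the two estimates should be exactly equal.
IO.inspect(HLL.estimate(first) == HLL.estimate(second), label: "estimates match")
```

This checks determinism within whichever backend is active; comparing Pure against Rust would additionally require selecting a backend explicitly, which is not shown here.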
Not guaranteed:
- Cross-language interop: Only Theta supports Apache DataSketches CompactSketch format. HLL and CMS DataSketches interop is not implemented.
- Performance stability: Benchmark results may vary across hardware and OTP versions.
- EXSK format across major versions: The binary format may change in future major releases.
Development
# Get dependencies
mix deps.get
# Run tests with coverage
mix test --cover
# Run lints
mix lint
# Run benchmarks
mix bench
# Generate docs
mix docs
License
MIT License. See LICENSE for details.