UltraLogLog

Space-efficient distinct counting for the BEAM, ported from Ertl 2024 (VLDB).

Hex.pmHex DocsLicense

UltraLogLog is a probabilistic data structure for approximate distinct counting — same constant memory, constant-time inserts, and associative merge as HyperLogLog, but 24–28% less memory at the same accuracy. This package is a paper-faithful Elixir port, cross-validated bit-for-bit against the Hash4j Java reference (v0.17.0) on every estimator.

The algorithm comes from:

Otmar Ertl. UltraLogLog: A Practical and More Space-Efficient Alternative to HyperLogLog for Approximate Distinct Counting. PVLDB 17(7), 2024. [VLDB PDF]]paper · [[arXiv extended]

Quick start

Add to mix.exs:

def deps do
  [{:ultra_log_log, "~> 0.1.0"}]
end

Count distinct values in a stream:

ull = UltraLogLog.new(precision: 12)        # 4 KB, ~1.2% standard error
ull = UltraLogLog.add(ull, "session-abc")
ull = UltraLogLog.add(ull, "session-xyz")
{:ok, count} = UltraLogLog.cardinality(ull)
# count ≈ 2.0

Merge two sketches — merge/2 is commutative, associative, and idempotent:

a = UltraLogLog.new(precision: 12) |> UltraLogLog.add("x")
b = UltraLogLog.new(precision: 12) |> UltraLogLog.add("y")
merged = UltraLogLog.merge(a, b)
{:ok, count} = UltraLogLog.cardinality(merged)
# count ≈ 2.0

Trade memory for accuracy by raising the precision:

small = UltraLogLog.new(precision: 10)   #  1 KB, ~3.1% error
big   = UltraLogLog.new(precision: 14)   # 16 KB, ~0.6% error

Why UltraLogLog

HyperLogLog has been the standard answer for "how many distinct things have I seen?" since 2007: constant memory, constant-time inserts, and mergeable across shards or nodes. UltraLogLog keeps all of that.

What UltraLogLog adds is information density. Where HLL packs 6-bit registers, ULL uses byte-aligned 8-bit ones — two extra bits per slot, recovered many times over by a 2024 estimator family that extracts more signal from each register. The net is 24–28% less memory at the same standard error, depending on which estimator you choose. The byte alignment is also a BEAM win: registers map directly to plain binaries, no bit-packing on the hot path.

When not to use it: small N (just use a MapSet), or any case that requires exact counts. ULL is an approximate counter; the smallest useful precision (p=10, 1 KB) carries ~3.1% relative standard error.

Estimators

Estimator Storage factor Rel. std. error Use when
:fgra (default) 4.895 0.782/√m Default. Single-pass, no iteration.
:mle 4.631 0.761/√m Tightest bound; secant solver runs ~5 iterations per query.
:martingale 3.466 0.658/√m Single-stream sketch never merged; returns {:error, :invalidated_by_merge} after merge/2.
{:ok, count} = UltraLogLog.cardinality(ull, estimator: :mle)

Merge is a CRDT operation. Element-wise register merge under the ULL partial order is commutative, associative, and idempotent — so distributed cardinality is trivial: shards independently maintain sketches, merge on demand, and the answer is exactly as if every insert had hit a single sketch. No coordinator, no quorum, no consensus.

Empirical validation

Every estimator is cross-checked against Hash4j v0.17.0's reference implementation on the same 16 register snapshots (4 precisions × 4 checkpoints), then exercised statistically over 450 random trials. Full reports live under docs/measurements/ in the repository.

FGRA vs Hash4j (16 fixtures)

This is IEEE 754 noise — the implementations agree to the last bit of double precision.

MLE vs Hash4j (16 fixtures)

Most fixtures bit-identical; the worst case is one trailing-bit flip in floating-point accumulation. The secant solver converges in 3–6 iterations on every fixture and across 100 additional random sketches per precision (300 total).

Martingale vs Hash4j (16 fixtures)

Bit-exact across all 16 fixtures, including the 100k-insert accumulation cases. The estimator shares Hash4j's branch-free integer formulation for the per-register state-change probability, so floating-point divergence has nowhere to come from.

Statistical correctness (15 cells × 30 trials, 450 estimates each)

All three estimators meet the paper's theoretical bounds with significant headroom on every (p, N) cell:

Estimator Worst bias Bound Worst stddev ratio Bound
:fgra +0.411% (p=10, N=10⁴) ±1.339% 1.134 (p=14, N=10⁵) 1.5
:mle +0.404% (p=10, N=10⁶) ±1.302% 1.034 (p=14, N=10⁵) 1.5
:martingale +0.566% (p=10, N=10⁶) ±1.127% 1.070 (p=14, N=100) 1.5

Bias bounds are 3σ around the theoretical relative standard error; stddev bounds allow up to 50% above the theoretical figure. Run the suite locally with:

REPORT=1 mix test --include statistical

See fgra-v0.1.txt, mle-v0.1.txt, and martingale-v0.1.txt in the repository for the full per-cell tables.

Precision and memory

Precision p allocates 2^p 8-bit registers — state size is exactly 2^p bytes:

p=10 →  1 KB,  ~3.1% error
p=12 →  4 KB,  ~1.2% error   (default)
p=14 → 16 KB,  ~0.6% error
p=16 → 64 KB,  ~0.3% error

Pick the smallest precision whose error fits your use case. p=12 is a reasonable default.

Status and roadmap

Planned items are not promised timelines. Watch the repository or CHANGELOG.md for releases.

Citation

If you use UltraLogLog in academic work, please cite the underlying paper:

@article{ertl2024ultraloglog,
  author  = {Otmar Ertl},
  title   = {UltraLogLog: A Practical and More Space-Efficient
             Alternative to HyperLogLog for Approximate Distinct Counting},
  journal = {Proceedings of the VLDB Endowment},
  volume  = {17},
  number  = {7},
  year    = {2024},
  pages   = {1655--1668},
  url     = {https://www.vldb.org/pvldb/vol17/p1655-ertl.pdf}
}

If this package is useful in your production work, a star on the GitHub repository is appreciated.

Acknowledgements

License

Apache 2.0.