UltraLogLog
Space-efficient distinct counting for the BEAM, ported from Ertl 2024 (VLDB).
UltraLogLog is a probabilistic data structure for approximate distinct counting — same constant memory, constant-time inserts, and associative merge as HyperLogLog, but 24–28% less memory at the same accuracy. This package is a paper-faithful Elixir port, cross-validated bit-for-bit against the Hash4j Java reference (v0.17.0) on every estimator. It is listed as the Elixir implementation in the paper's official reproducibility repository.
The algorithm comes from:
Otmar Ertl. UltraLogLog: A Practical and More Space-Efficient Alternative to HyperLogLog for Approximate Distinct Counting. PVLDB 17(7), 2024. [VLDB PDF] · [arXiv extended]
Quick start
Add to mix.exs:
def deps do
[{:ultra_log_log, "~> 0.2.0"}]
end
Count distinct values in a stream:
ull = UltraLogLog.new(precision: 12) # 4 KB, ~1.2% standard error
ull = UltraLogLog.add(ull, "session-abc")
ull = UltraLogLog.add(ull, "session-xyz")
{:ok, count} = UltraLogLog.cardinality(ull)
# count ≈ 2.0
Merge two sketches — merge/2 is commutative, associative, and idempotent:
a = UltraLogLog.new(precision: 12) |> UltraLogLog.add("x")
b = UltraLogLog.new(precision: 12) |> UltraLogLog.add("y")
merged = UltraLogLog.merge(a, b)
{:ok, count} = UltraLogLog.cardinality(merged)
# count ≈ 2.0
Trade memory for accuracy by raising the precision:
small = UltraLogLog.new(precision: 10) # 1 KB, ~3.1% error
big = UltraLogLog.new(precision: 14) # 16 KB, ~0.6% error
Why UltraLogLog
HyperLogLog has been the standard answer for "how many distinct things have I seen?" since 2007: constant memory, constant-time inserts, and mergeable across shards or nodes. UltraLogLog keeps all of that.
What UltraLogLog adds is information density. Where HLL packs 6-bit registers, ULL uses byte-aligned 8-bit ones — two extra bits per slot, recovered many times over by a 2024 estimator family that extracts more signal from each register. The net is 24–28% less memory at the same standard error, depending on which estimator you choose. The byte alignment is also a BEAM win: registers map directly to plain binaries, no bit-packing on the hot path.
When not to use it: small N (just use a MapSet), or any case that
requires exact counts. ULL is an approximate counter; the smallest
useful precision (p=10, 1 KB) carries ~3.1% relative standard error.
Estimators
| Estimator | Storage factor | Rel. std. error | Use when |
|---|---|---|---|
:fgra (default) | 4.895 | 0.782/√m | Default. Single-pass, no iteration. |
:mle | 4.631 | 0.761/√m | Tightest bound; secant solver runs ~5 iterations per query. |
:martingale | 3.466 | 0.658/√m | Single-stream sketch never merged; returns {:error, :invalidated_by_merge} after merge/2. |
{:ok, count} = UltraLogLog.cardinality(ull, estimator: :mle)
Merge is a CRDT operation. Element-wise register merge under the ULL partial order is commutative, associative, and idempotent — so distributed cardinality is trivial: shards independently maintain sketches, merge on demand, and the answer is exactly as if every insert had hit a single sketch. No coordinator, no quorum, no consensus.
Concurrent inserts
UltraLogLog.Concurrent is a lock-free, :atomics-backed variant of
the sketch, safe to insert into from any process or scheduler — no
GenServer, no lock, no message passing. The insert path is a CAS loop
over a per-register 64-bit :atomics cell.
{:ok, c} = UltraLogLog.Concurrent.new(precision: 12)
# Inserts from many processes in parallel — all hit the same sketch
1..1_000
|> Enum.chunk_every(100)
|> Enum.map(fn chunk ->
Task.async(fn -> Enum.each(chunk, &UltraLogLog.Concurrent.add(c, &1)) end)
end)
|> Task.await_many(:infinity)
# Snapshot to the immutable %UltraLogLog{} for estimation, merge, or
# serialization. FGRA and MLE work normally on the snapshot.
sketch = UltraLogLog.Concurrent.snapshot(c)
{:ok, count} = UltraLogLog.cardinality(sketch)
snapshot/1 reads each :atomics cell independently rather than
taking a global lock; if writers are active during the snapshot, cells
may reflect different moments in time. This is safe by construction —
register merge is monotone in the UltraLogLog partial order, so a
"torn" snapshot is always a valid intermediate sketch, never an
invalid one. For a globally consistent snapshot, quiesce writers
first (e.g. Task.await_many/2 on insert tasks, as the example above
does).
Use the immutable UltraLogLog for value semantics — snapshots,
serialization round-trips, explicit CRDT merge. Use
UltraLogLog.Concurrent when many processes need to feed a single
shared sketch at high insert rates.
Empirical validation and benchmarks
Every estimator is cross-checked against Hash4j v0.17.0's reference
implementation on the same 16 register snapshots (4 precisions ×
4 checkpoints), then exercised statistically over 450 random trials.
Full reports live under
docs/measurements/
in the repository.
FGRA vs Hash4j (16 fixtures)
- max relative error: 8.4e-16
- mean: 2.6e-16
This is IEEE 754 noise — the implementations agree to the last bit of double precision.
MLE vs Hash4j (16 fixtures)
- max relative error: 1.3e-16
- mean: 8.4e-18
- secant iterations: mean 4.06, max 6
Most fixtures bit-identical; the worst case is one trailing-bit flip in floating-point accumulation. The secant solver converges in 3–6 iterations on every fixture and across 100 additional random sketches per precision (300 total).
Martingale vs Hash4j (16 fixtures)
- max relative error: 0.0
- mean: 0.0
Bit-exact across all 16 fixtures, including the 100k-insert accumulation cases. The estimator shares Hash4j's branch-free integer formulation for the per-register state-change probability, so floating-point divergence has nowhere to come from.
Statistical correctness (15 cells × 30 trials, 450 estimates each)
All three estimators meet the paper's theoretical bounds with significant headroom on every (p, N) cell:
| Estimator | Worst bias | Bound | Worst stddev ratio | Bound |
|---|---|---|---|---|
:fgra | +0.411% (p=10, N=10⁴) | ±1.339% | 1.134 (p=14, N=10⁵) | 1.5 |
:mle | +0.404% (p=10, N=10⁶) | ±1.302% | 1.034 (p=14, N=10⁵) | 1.5 |
:martingale | +0.566% (p=10, N=10⁶) | ±1.127% | 1.070 (p=14, N=100) | 1.5 |
Bias bounds are 3σ around the theoretical relative standard error; stddev bounds allow up to 50% above the theoretical figure. Run the suite locally with:
REPORT=1 mix test --include statistical
See fgra-v0.1.txt, mle-v0.1.txt, and
martingale-v0.1.txt in the repository for the full
per-cell tables.
Concurrent insert scaling
UltraLogLog.Concurrent uses a lock-free CAS loop over :atomics.
The benchmark suite measures total throughput at increasing process
counts on a 10-core Apple M5 with 100k pre-hashed inserts at p=12:
| Processes | ips | Scaling |
|---|---|---|
| 1 | 24.8 | 1.00× |
| 2 | 43.4 | 1.75× |
| 4 | 69.7 | 2.81× |
| 8 | 82.1 | 3.31× |
| 16 | 105.3 | 4.24× |
| 32 | 107.9 | 4.35× |
The curve scales monotonically and flattens past core count, as
expected for a CPU-bound workload. The high-process-count flattening
is partly Task.async spawn/await overhead in the benchmark harness;
we'd expect long-lived inserter processes feeding a single shared
sketch to scale better, though that scenario isn't measured here.
Single-process comparison: the concurrent path runs at ~25.4 ips
against ~23.6 ips for the immutable path (~1.08× faster), the
opposite of the expected atomics overhead. The immutable add/2
rebuilds a binary on every register-changing insert (an O(m)
head/byte/tail copy); the concurrent path does two constant-time
:atomics operations.
Full Benchee output: concurrent-v0.2.txt in the
repository.
Precision and memory
Precision p allocates 2^p 8-bit registers — state size is exactly
2^p bytes:
p=10 → 1 KB, ~3.1% error
p=12 → 4 KB, ~1.2% error (default)
p=14 → 16 KB, ~0.6% error
p=16 → 64 KB, ~0.3% error
Pick the smallest precision whose error fits your use case. p=12 is a
reasonable default.
Status and roadmap
- v0.1 — immutable sketch, FGRA / MLE / martingale estimators, merge, binary serialization, downsize (precision-preserving only), full validation against Hash4j v0.17.0.
- v0.2 (current) —
UltraLogLog.Concurrent: lock-free,:atomics-backed insert path with a CAS retry loop;snapshot/1conversion to the immutable form; Benchee benchmark suite. - v0.3 (planned) — sharded inserts via
PartitionSupervisorand cluster-wide merge over distributed Erlang. A higher-quality 64-bit hash (likely xxhash3 via NIF) is under consideration on the same timeline. - v0.4 (potential) — ExaLogLog, the 2024 follow-up to ULL for exa-scale cardinalities.
Planned items are not promised timelines. Watch the repository or
CHANGELOG.md for releases.
Citation
If you use UltraLogLog in academic work, please cite the underlying paper:
@article{ertl2024ultraloglog,
author = {Otmar Ertl},
title = {UltraLogLog: A Practical and More Space-Efficient
Alternative to HyperLogLog for Approximate Distinct Counting},
journal = {Proceedings of the VLDB Endowment},
volume = {17},
number = {7},
year = {2024},
pages = {1655--1668},
doi = {10.14778/3654621.3654632},
url = {https://www.vldb.org/pvldb/vol17/p1655-ertl.pdf}
}
If this package is useful in your production work, a star on the GitHub repository is appreciated.
Acknowledgements
- Otmar Ertl (@oertl) and Dynatrace Research for both the paper and the Hash4j Java reference, which served as ground truth for every byte of register encoding and every digit of estimator output.
- The Hash4j contributors at Dynatrace — their public source is what made paper-faithful porting realistic on a reasonable timeline.