etrie

Build StatusGitHubHex.pm

A Fast and Memory-Efficient HAT-Trie Implementation Based on Tessil hat-trie (NIF based). Learn more about the HAT-Trie data structure here.

Implementation Details

Notes

Quick Start

Ensure you have at least Erlang OTP 19.0, as enif_binary_to_term and enif_term_to_binary are not available in previous versions.

Compile:

rebar3 compile

Simple usage:

{ok, T} = etrie:new(),
ok = etrie:insert(T, <<"one">>, 1),
ok = etrie:insert(T, <<"two">>, 2),
ok = etrie:insert(T, <<"three">>, 3),
ok = etrie:insert(T, <<"four">>, 4),
{ok,[{<<"four">>,4}, {<<"two">>,2}, {<<"three">>,3}, {<<"one">>,1}]} = etrie:to_list(T),
{ok, 4} = etrie:lookup(T, <<"four">>),
{ok,[{<<"two">>,2},{<<"three">>,3}]} = etrie:lookup_range(T, <<"t">>),
{ok, <<"three">>, 3} = etrie:longest_prefix(T, <<"three tables">>).

API

Benchmark

The benchmarking process is facilitated by the Erlang module (benchmarks/benchmark.erl), which reads a designated dataset line by line. Each line in the dataset represents a key in the trie, and its associated value is the line number. The dataset is utilized to perform various operations such as insertions, reads, and deletions, with the elapsed time measured for each operation.

The dataset used represents all the titles from the main namespace of the Wikipedia archive.

To run the benchmark:

make benchmark MODULE=etrie DATASET=/Downloads/enwiki-latest-all-titles-in-ns0 LIMIT=1000000

Where:

Results

Unfortunately the okeuday trie implementation (ok_btrie and ok_trie) it's pretty slow and I had to limit the number of items to 2 millions.

Results are:

etrie:insert Time -> operations: 2000000 elapsed: 881.1570 ms, 440.5785 ns/key
etrie:lookup Time -> operations: 2000000 elapsed: 454.7810 ms, 227.3905 ns/key
etrie:remove Time -> operations: 2000000 elapsed: 532.2440 ms, 266.1220 ns/key

ok_btrie:insert Time -> operations: 2000000 elapsed: 60761.0590 ms, 30380.5295 ns/key
ok_btrie:lookup Time -> operations: 2000000 elapsed: 5925.6150 ms, 2962.8075 ns/key
ok_btrie:remove Time -> operations: 2000000 elapsed: 38510.2600 ms, 19255.1300 ns/key

ok_trie:insert Time -> operations: 2000000 elapsed: 61013.2730 ms, 30506.6365 ns/key
ok_trie:lookup Time -> operations: 2000000 elapsed: 2283.5370 ms, 1141.7685 ns/key
ok_trie:remove Time -> operations: 2000000 elapsed: 35662.8770 ms, 17831.4385 ns/key

Memory usage

I faced difficulty in finding an appropriate method to programmatically measure the memory consumption specifically utilized by the trie itself across different libraries. The challenge stems from the fact that the memory allocation by the etrie NIFs is not exclusively performed with enif_* methods. Consequently, Erlang lacks detailed statistics regarding the memory allocated directly through native C++ allocation methods. To gain an understanding of memory allocation, I manually observed the memory usage of the beam.smp process using the Activity Monitor on OSX.

Of course the memory reported contains many other things beside the one used by the trie itself but because of the big gap it's clear that etrie uses less memory.

Results are:

etrie: 680 MB
ok_btrie: 4.67 GB
ok_trie: 5.42 GB

Note: The test employs the line index as a value, which is an integer and stored by etrie as a native data type. For other values lie tuples, lists, (see Implementation Details section) there might be additional overhead caused by term_to_binary encoding.

Tests

In order to run the integrity tests run rebar3 eunit from project root.