Bloomy 🌸

High-performance Bloom Filter library for Elixir, powered by Nx

Bloomy provides probabilistic data structures for efficient set membership testing with minimal memory usage. Built on Nx tensors for blazing-fast performance with optional EXLA/GPU acceleration.


Features

- Four filter types: standard, counting (supports deletion), scalable (grows automatically), and learned (ML-enhanced)
- Batch operations: add_all, batch_member?, from_list
- Set operations: union, intersection, and similarity metrics
- Binary serialization and file persistence, with optional compression
- Optional EXLA backend for CPU/GPU acceleration

Installation

Add bloomy to your dependencies in mix.exs:

def deps do
  [
    {:bloomy, "~> 0.1.0"},
    {:nx, "~> 0.10.0"},
    {:exla, "~> 0.10.0"}  # Optional but recommended for performance
  ]
end

Quick Start

# Create a bloom filter for 10,000 items with 1% false positive rate
filter = Bloomy.new(10_000, false_positive_rate: 0.01)

# Add items
filter = filter
  |> Bloomy.add("user@example.com")
  |> Bloomy.add("alice@example.com")
  |> Bloomy.add("bob@example.com")

# Check membership
Bloomy.member?(filter, "user@example.com")  # => true
Bloomy.member?(filter, "other@example.com") # => false

# Get statistics
Bloomy.info(filter)
# => %{
#   type: :standard,
#   capacity: 10000,
#   items_count: 3,
#   fill_ratio: 0.0021,
#   false_positive_rate: 0.01,
#   ...
# }

Bloom Filter Types

Standard Bloom Filter

Classic space-efficient probabilistic set membership test.

filter = Bloomy.new(10_000)
filter = Bloomy.add(filter, "apple")
Bloomy.member?(filter, "apple")  # => true

Counting Bloom Filter

Supports deletion operations using counters instead of bits.

filter = Bloomy.new(10_000, type: :counting)

filter = Bloomy.add(filter, "item")
Bloomy.member?(filter, "item")  # => true

filter = Bloomy.remove(filter, "item")
Bloomy.member?(filter, "item")  # => false
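
Conceptually, a counting filter replaces each bit with a small counter: adding an item increments its k counters, removing decrements them, and membership requires all k counters to be nonzero. The toy sketch below (in Python purely for illustration; Bloomy's actual implementation uses Nx tensors and configurable counter widths, and this double-hashing scheme is an assumption) shows the idea:

```python
import hashlib

class ToyCountingBloom:
    """Toy counting bloom filter: counters instead of bits enable removal."""

    def __init__(self, m=1024, k=4):
        self.m, self.k, self.counters = m, k, [0] * m

    def _positions(self, item):
        # Derive k positions via double hashing: (h1 + i*h2) mod m
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.counters[pos] += 1

    def remove(self, item):
        # Only safe for items that were actually added; decrementing
        # counters for an absent item can corrupt other entries.
        for pos in self._positions(item):
            if self.counters[pos] > 0:
                self.counters[pos] -= 1

    def member(self, item):
        return all(self.counters[pos] > 0 for pos in self._positions(item))

cbf = ToyCountingBloom()
cbf.add("item")
print(cbf.member("item"))   # True
cbf.remove("item")
print(cbf.member("item"))   # False
```

Note the caveat in remove: deleting an item that was never added can silently decrement counters belonging to other items, which is why counting filters should only remove known members.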

Scalable Bloom Filter

Automatically grows to maintain target false positive rate.

filter = Bloomy.new(1_000, type: :scalable)

# Add millions of items - it will automatically scale!
filter = Enum.reduce(1..1_000_000, filter, fn i, f ->
  Bloomy.add(f, "item_#{i}")
end)

info = Bloomy.info(filter)
info.slices_count  # => Multiple slices created automatically
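
Under the hood, a scalable filter adds a new slice whenever the current one fills, and tightens each slice's error budget by the error tightening ratio r, so slice i targets p0 * r^i and the compound false positive rate stays bounded by the geometric sum p0 / (1 - r). This illustrative Python snippet (not Bloomy's API) demonstrates the bound:

```python
def compound_fpr(p0, r, slices):
    """Compound false positive rate across slices tightened by ratio r."""
    # P(no slice false-positives) is the product of per-slice (1 - p0 * r^i);
    # the compound FPR is its complement.
    prob_clean = 1.0
    for i in range(slices):
        prob_clean *= 1 - p0 * r**i
    return 1 - prob_clean

# With p0 = 0.01 and r = 0.5, the compound rate stays below p0 / (1 - r) = 0.02
print(compound_fpr(0.01, 0.5, 10))
```

This is why the error_tightening_ratio option (shown in the Configuration Options examples) matters: a smaller ratio gives a tighter overall bound at the cost of larger later slices.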

Learned Bloom Filter

ML-enhanced filter with lower false positive rates.

filter = Bloomy.new(10_000, type: :learned)

# Train with positive and negative examples
training_data = %{
  positive: ["valid_user_1", "valid_user_2", "valid_user_3"],
  negative: ["spam_1", "spam_2", "spam_3"]
}

filter = Bloomy.train(filter, training_data)
filter = Bloomy.add(filter, "valid_user_1")

Bloomy.member?(filter, "valid_user_1")  # => true (uses ML model + backup filter)

Performance: Using EXLA Backend

Why EXLA?

EXLA provides significant performance improvements (typically 2-20x faster) by JIT-compiling tensor operations through the XLA compiler, vectorizing work across CPU cores, and optionally offloading computation to the GPU.

Option 1: Set Default Backend (Recommended)

In your config/config.exs:

# Use EXLA with CPU
config :nx, default_backend: EXLA.Backend

# Or use EXLA with GPU (if CUDA is available)
config :nx, default_backend: {EXLA.Backend, client: :cuda}

# Or use EXLA with specific device
config :nx, default_backend: {EXLA.Backend, client: :cuda, device_id: 0}

Option 2: Per-Filter Backend

Specify backend when creating filters:

# Use EXLA backend for this filter
filter = Bloomy.new(100_000, backend: EXLA.Backend)

# Use BinaryBackend (default, no compilation)
filter = Bloomy.new(100_000, backend: Nx.BinaryBackend)

Option 3: Runtime Backend Selection

# Set backend at runtime
Nx.default_backend(EXLA.Backend)

# All filters created after this will use EXLA
filter = Bloomy.new(100_000)

GPU Acceleration

To use GPU acceleration (requires CUDA):

# In config/config.exs
config :nx, default_backend: {EXLA.Backend, client: :cuda}

# Or at runtime
Nx.default_backend({EXLA.Backend, client: :cuda})

# Create filter - will automatically use GPU
filter = Bloomy.new(1_000_000)

Backend Performance Comparison

Run the backend comparison benchmark:

mix run benchmarks/backend_comparison.exs

Typical performance gains fall within the 2-20x range cited above, depending on batch size, hardware, and whether a GPU is available.

Operations

Batch Operations

# Add multiple items efficiently
filter = Bloomy.add_all(filter, ["apple", "banana", "orange"])

# Batch membership testing
results = Bloomy.batch_member?(filter, ["apple", "grape", "banana"])
# => %{"apple" => true, "grape" => false, "banana" => true}

# Create from list
filter = Bloomy.from_list(["a", "b", "c", "d"])

Merge Operations

# Union (merge) filters
f1 = Bloomy.new(10_000) |> Bloomy.add("apple")
f2 = Bloomy.new(10_000) |> Bloomy.add("banana")

merged = Bloomy.union(f1, f2)
Bloomy.member?(merged, "apple")   # => true
Bloomy.member?(merged, "banana")  # => true

# Union multiple filters
filters = [filter1, filter2, filter3]
merged = Bloomy.union_all(filters)

# Intersection
intersected = Bloomy.intersect_all([filter1, filter2])

Similarity Metrics

# Calculate Jaccard similarity
similarity = Bloomy.jaccard_similarity(filter1, filter2)
# => 0.75 (75% similar)

# Calculate overlap coefficient
overlap = Bloomy.Operations.overlap_coefficient(filter1, filter2)
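
Because two filters with identical parameters hash the same item to the same positions, Jaccard similarity can be estimated directly from their bit arrays as popcount(A AND B) / popcount(A OR B). A language-neutral sketch in Python (Bloomy performs the equivalent on Nx tensors):

```python
def jaccard_estimate(bits_a, bits_b):
    """Estimate Jaccard similarity from two equal-length bit arrays."""
    inter = sum(a & b for a, b in zip(bits_a, bits_b))  # popcount(A AND B)
    union = sum(a | b for a, b in zip(bits_a, bits_b))  # popcount(A OR B)
    return inter / union if union else 1.0  # two empty filters are identical

a = [1, 1, 0, 1, 0, 0, 1, 0]
b = [1, 0, 0, 1, 0, 1, 1, 0]
print(jaccard_estimate(a, b))  # 3 shared set bits / 5 total set bits = 0.6
```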

Persistence

Binary Serialization

# Serialize to binary
binary = Bloomy.to_binary(filter)

# With compression (recommended for large filters)
binary = Bloomy.to_binary(filter, compress: true)

# Deserialize
{:ok, loaded_filter} = Bloomy.from_binary(binary)

File I/O

# Save to file
:ok = Bloomy.save(filter, "my_filter.bloom")

# Save with compression
:ok = Bloomy.save(filter, "my_filter.bloom", compress: true)

# Load from file
{:ok, filter} = Bloomy.load("my_filter.bloom")

Benchmarking

Run the included benchmarks to see performance on your system:

# Basic operations (add, member?, batch operations)
mix run benchmarks/basic_operations.exs

# Compare filter types (Standard, Counting, Scalable)
mix run benchmarks/filter_comparison.exs

# Merge operations (union, intersection, similarity)
mix run benchmarks/operations.exs

# Serialization and persistence
mix run benchmarks/serialization.exs

# Backend comparison (BinaryBackend vs EXLA)
mix run benchmarks/backend_comparison.exs

Configuration Options

Creating Filters

Bloomy.new(capacity, opts)

Common Options:

- :type - filter type (:standard, :counting, :scalable, or :learned); defaults to :standard
- :false_positive_rate - target false positive rate (e.g. 0.01)
- :backend - Nx backend to use (e.g. EXLA.Backend or Nx.BinaryBackend)

Counting Filter Options:

- :counter_width - bits per counter (e.g. 16)

Scalable Filter Options:

- :growth_factor - capacity multiplier for each new slice
- :error_tightening_ratio - per-slice false positive tightening ratio

Learned Filter Options:

Examples

# Standard with custom false positive rate
filter = Bloomy.new(10_000, false_positive_rate: 0.001)

# Counting with 16-bit counters and EXLA
filter = Bloomy.new(10_000,
  type: :counting,
  counter_width: 16,
  backend: EXLA.Backend
)

# Scalable with custom growth
filter = Bloomy.new(1_000,
  type: :scalable,
  growth_factor: 3,
  error_tightening_ratio: 0.5
)

Use Cases

Web Application - User Session Tracking

# Track active sessions
defmodule SessionTracker do
  use GenServer

  def start_link(_) do
    GenServer.start_link(__MODULE__, nil, name: __MODULE__)
  end

  def init(_) do
    # 1M sessions, 0.1% false positive rate, use EXLA for speed
    filter = Bloomy.new(1_000_000,
      false_positive_rate: 0.001,
      backend: EXLA.Backend
    )
    {:ok, filter}
  end

  def track_session(session_id) do
    GenServer.cast(__MODULE__, {:track, session_id})
  end

  def active?(session_id) do
    GenServer.call(__MODULE__, {:check, session_id})
  end

  def handle_cast({:track, session_id}, filter) do
    {:noreply, Bloomy.add(filter, session_id)}
  end

  def handle_call({:check, session_id}, _from, filter) do
    {:reply, Bloomy.member?(filter, session_id), filter}
  end
end

Distributed System - Seen Items Cache

# Track which items have been processed across nodes
defmodule DistributedCache do
  def merge_caches(node_filters) do
    # Merge the filters collected from all nodes into one
    merged = Bloomy.union_all(node_filters)

    # Distribute back to all nodes
    distribute_filter(merged)
  end

  def sync_filter(local_filter) do
    # Save filter to shared storage
    Bloomy.save(local_filter, "/shared/cache.bloom", compress: true)
  end

  def load_filter do
    case Bloomy.load("/shared/cache.bloom") do
      {:ok, filter} -> filter
      {:error, _} -> Bloomy.new(1_000_000)
    end
  end
end

Data Pipeline - Deduplication

defmodule Deduplicator do
  def process_stream(stream) do
    initial_filter = Bloomy.new(10_000_000, type: :scalable)

    Stream.transform(stream, initial_filter, fn item, filter ->
      if Bloomy.member?(filter, item.id) do
        # Skip duplicate
        {[], filter}
      else
        # New item, add to filter and pass through
        {[item], Bloomy.add(filter, item.id)}
      end
    end)
  end
end

Performance Tips

  1. Use EXLA Backend - Significant performance improvement (2-20x faster)

    config :nx, default_backend: EXLA.Backend
  2. Choose Appropriate Capacity - Overestimate slightly for better performance

    # If expecting ~10K items, use 15K capacity
    filter = Bloomy.new(15_000)
  3. Batch Operations - Use add_all and batch_member? for multiple items

    # Fast: single batch operation
    filter = Bloomy.add_all(filter, items)
    
    # Slower: many individual operations
    filter = Enum.reduce(items, filter, &Bloomy.add(&2, &1))
  4. Use Scalable for Unknown Sizes - Automatically adapts to data volume

    filter = Bloomy.new(1_000, type: :scalable)
  5. Serialize with Compression - Reduces storage/network overhead

    Bloomy.save(filter, "cache.bloom", compress: true)

How It Works

Bloom Filters

A Bloom filter is a space-efficient probabilistic data structure that tests whether an element is a member of a set:

- membership queries can return false positives, but never false negatives
- items can be added but not removed (the counting variant lifts this restriction)
- memory use is fixed up front and independent of item size
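
The mechanics fit in a few lines. This toy Python sketch (illustration only; the double-hashing scheme here is an assumption, not necessarily what Bloomy uses) sets k bit positions on add and reports membership only when all k are set:

```python
import hashlib

class ToyBloom:
    """Toy bloom filter: k bit positions per item, no removal."""

    def __init__(self, m=1024, k=7):
        self.m, self.k, self.bits = m, k, bytearray(m)  # one byte per "bit" for clarity

    def _positions(self, item):
        # Derive k positions via double hashing: (h1 + i*h2) mod m
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def member(self, item):
        # Every added item always passes (no false negatives); an absent item
        # passes only if all k of its positions happen to be set already.
        return all(self.bits[pos] for pos in self._positions(item))

bf = ToyBloom()
bf.add("apple")
print(bf.member("apple"))  # True
```

False positives arise exactly when an absent item's k positions were all set by other items, which is why the fill ratio reported by Bloomy.info drives the error rate.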

Mathematical Foundation

For a filter with m bits and k hash functions holding n items, the false positive probability is approximately:

p ≈ (1 - e^(-kn/m))^k

which is minimized by choosing m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2.

Where:

- m - number of bits in the filter
- n - number of inserted items
- k - number of hash functions
- p - target false positive rate
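
These formulas are easy to check numerically. The sketch below (Python, purely illustrative) computes the optimal parameters for the Quick Start example of 10,000 items at a 1% false positive rate:

```python
import math

def optimal_parameters(n, p):
    """Optimal bit count m and hash count k for n items at FP rate p."""
    m = math.ceil(-n * math.log(p) / math.log(2) ** 2)  # m = -n ln(p) / (ln 2)^2
    k = round(m / n * math.log(2))                      # k = (m / n) ln 2
    return m, k

m, k = optimal_parameters(10_000, 0.01)
print(m, k)  # => 95851 bits (~11.7 KB) and k = 7 hash functions
```

Note how cheap this is: roughly 9.6 bits per item regardless of how large the items themselves are.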

Implementation Details

API Reference

Main Functions

- Bloomy.new - create a filter
- Bloomy.add, Bloomy.add_all, Bloomy.from_list - insert items
- Bloomy.member?, Bloomy.batch_member? - test membership
- Bloomy.remove - delete items (counting filters only)
- Bloomy.info - inspect filter statistics

Operations

- Bloomy.union, Bloomy.union_all - merge filters
- Bloomy.intersect_all - intersect filters
- Bloomy.jaccard_similarity, Bloomy.Operations.overlap_coefficient - similarity metrics

Persistence

- Bloomy.to_binary, Bloomy.from_binary - binary serialization
- Bloomy.save, Bloomy.load - file I/O

ML (Learned Filters)

- Bloomy.train - train the learned filter's model

Testing

Run the test suite:

mix test

Run the integration tests:

mix run test_bloomy.exs

License

MIT License

Acknowledgments

Resources


Made with ❤️ and Elixir