ChromEx

The open-source embedding database for Elixir

ChromEx is an idiomatic Elixir client for Chroma, the AI-native open-source embedding database. ChromEx embeds Chroma's Rust implementation directly via Rustler NIFs for maximum performance.

ChromEx makes it easy to build LLM applications with embeddings. It handles tokenization, embedding generation, and indexing automatically.

Features

Installation

Add chromex to your mix.exs dependencies:

def deps do
  [
    {:chromex, "~> 0.1.5"}
  ]
end

Prerequisites

Required:

Note for Livebook users: Livebook environments may not have Rust available by default. If you encounter compilation errors, you'll need to either:

Building

The project automatically fetches and compiles the Chroma Rust source during the build process:

mix deps.get
mix compile

Quick Start

# Create a collection
{:ok, collection} = ChromEx.Collection.create("my_collection")

# Add documents with automatic embedding generation
ChromEx.Collection.add(collection,
  ids: ["id1", "id2", "id3"],
  documents: [
    "This is a document about cats",
    "This is a document about dogs",
    "This is a document about birds"
  ],
  metadatas: [
    %{topic: "pets", animal: "cat"},
    %{topic: "pets", animal: "dog"},
    %{topic: "birds", animal: "bird"}
  ]
)

# Query with natural language (auto-generates query embedding)
{:ok, results} = ChromEx.Collection.query(collection,
  query_texts: ["Tell me about cats"],
  n_results: 2
)

IO.inspect(results["documents"])
# => [["This is a document about cats", "This is a document about dogs"]]

IO.inspect(results["metadatas"])
# => [[%{"topic" => "pets", "animal" => "cat"}, %{"topic" => "pets", "animal" => "dog"}]]

IO.inspect(results["distances"])
# => [[0.45, 1.23]]
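
The result above is column-oriented: one outer list entry per query text, with documents, metadatas, and distances aligned by position. As a minimal sketch, assuming the string keys shown above, a small helper can zip those columns into one map per hit. QueryResults.hits/1 is a hypothetical name for illustration and is not part of ChromEx.

```elixir
defmodule QueryResults do
  @moduledoc "Hypothetical helper for illustration; not part of ChromEx."

  # Turns the column-oriented result map into a list (one entry per query)
  # of lists of hit maps, preserving the order returned by the query.
  def hits(%{"documents" => docs, "metadatas" => metas, "distances" => dists}) do
    [docs, metas, dists]
    |> Enum.zip()
    |> Enum.map(fn {query_docs, query_metas, query_dists} ->
      [query_docs, query_metas, query_dists]
      |> Enum.zip()
      |> Enum.map(fn {doc, meta, dist} ->
        %{document: doc, metadata: meta, distance: dist}
      end)
    end)
  end
end

# Sample data copied from the Quick Start output above.
sample = %{
  "documents" => [["This is a document about cats", "This is a document about dogs"]],
  "metadatas" => [[%{"animal" => "cat"}, %{"animal" => "dog"}]],
  "distances" => [[0.45, 1.23]]
}

[[best | _rest]] = QueryResults.hits(sample)
IO.inspect(best.document)
# => "This is a document about cats"
```

Keeping each hit's fields in one map makes it easier to filter or sort hits without tracking three parallel lists.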

Usage

Auto-Embedding Generation

ChromEx automatically generates embeddings using the same ONNX model as Python ChromaDB (all-MiniLM-L6-v2). Just provide documents and ChromEx handles the rest:

{:ok, collection} = ChromEx.Collection.create("my_docs")

# Add documents - embeddings generated automatically
ChromEx.Collection.add(collection,
  ids: ["doc1", "doc2"],
  documents: ["Machine learning in production", "Deep learning architectures"]
)

# Query with text - query embedding generated automatically
{:ok, results} = ChromEx.Collection.query(collection,
  query_texts: ["AI deployment"],
  n_results: 5
)

Using Pre-computed Embeddings

If you have embeddings from OpenAI, Cohere, or custom models, you can provide them directly. This is useful in environments where Rust/ONNX isn't available (like some Livebook setups):

# Generate embeddings from an external service
embeddings = YourEmbeddingService.generate(["First doc", "Second doc"])

ChromEx.Collection.add(collection,
  ids: ["id1", "id2"],
  embeddings: embeddings,  # Provide embeddings directly
  documents: ["First doc", "Second doc"],
  metadatas: [%{source: "web"}, %{source: "api"}]
)

# Query with pre-computed embeddings
# generate/1 returns a list of embeddings, one per input text
query_embeddings = YourEmbeddingService.generate(["search query"])

{:ok, results} = ChromEx.Collection.query(collection,
  query_embeddings: query_embeddings,
  n_results: 10
)
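
When embeddings come from an external service, a shape mismatch is an easy mistake to make. The following is a minimal sketch of a pre-flight check that could run before calling add; EmbeddingCheck and its error atoms are hypothetical and not part of ChromEx, and how ChromEx itself reports such mismatches is not covered here.

```elixir
defmodule EmbeddingCheck do
  @moduledoc "Hypothetical pre-flight check for illustration; not part of ChromEx."

  # Returns :ok when there is exactly one embedding per ID and every
  # embedding is a non-empty list of numbers with the same dimension.
  def validate(ids, embeddings) when is_list(ids) and is_list(embeddings) do
    cond do
      length(ids) != length(embeddings) -> {:error, :count_mismatch}
      not Enum.all?(embeddings, &vector?/1) -> {:error, :invalid_vector}
      not uniform?(embeddings) -> {:error, :dimension_mismatch}
      true -> :ok
    end
  end

  # A vector is a non-empty list made up only of numbers.
  defp vector?(v), do: is_list(v) and v != [] and Enum.all?(v, &is_number/1)

  # True when all vectors share one length (an empty batch counts as uniform).
  defp uniform?(embeddings) do
    embeddings |> Enum.map(&length/1) |> Enum.uniq() |> length() <= 1
  end
end

IO.inspect(EmbeddingCheck.validate(["id1", "id2"], [[0.1, 0.2], [0.3, 0.4]]))
# => :ok
```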

Metadata Filtering

Chroma uses a structured query language for metadata filtering with operators like $and, $or, $eq, $ne, $gt, $gte, $lt, $lte, $in, $nin.

# Add documents with rich metadata
ChromEx.Collection.add(collection,
  ids: ["id1", "id2", "id3"],
  documents: ["Doc A", "Doc B", "Doc C"],
  metadatas: [
    %{source: "web", year: 2024, category: "tech"},
    %{source: "api", year: 2023, category: "science"},
    %{source: "web", year: 2024, category: "science"}
  ]
)

# Query with single condition
{:ok, results} = ChromEx.Collection.query(collection,
  query_texts: ["search term"],
  where: %{"year" => 2024},
  n_results: 5
)

# Query with multiple conditions using $and
{:ok, results} = ChromEx.Collection.query(collection,
  query_texts: ["search term"],
  where: %{"$and" => [%{"year" => 2024}, %{"source" => "web"}]},
  n_results: 5
)

# Query with $or operator
{:ok, results} = ChromEx.Collection.query(collection,
  query_texts: ["search term"],
  where: %{"$or" => [%{"year" => 2024}, %{"year" => 2023}]},
  n_results: 5
)

# Query with comparison operators
{:ok, results} = ChromEx.Collection.query(collection,
  query_texts: ["search term"],
  where: %{"year" => %{"$gte" => 2023}},
  n_results: 5
)

# Query with $in operator
{:ok, results} = ChromEx.Collection.query(collection,
  query_texts: ["search term"],
  where: %{"source" => %{"$in" => ["web", "api"]}},
  n_results: 5
)
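
Because where filters are plain nested maps, they can be composed with ordinary functions. The following is a minimal sketch of a builder that produces the same map shapes used in the queries above; the Where module and its function names are hypothetical and not part of ChromEx.

```elixir
defmodule Where do
  @moduledoc "Hypothetical filter builder for illustration; not part of ChromEx."

  # Each function returns a plain map in the filter syntax shown above.
  def eq(field, value), do: %{field => value}
  def gte(field, value), do: %{field => %{"$gte" => value}}
  def one_of(field, values) when is_list(values), do: %{field => %{"$in" => values}}

  # Combinators wrap a list of clauses in $and / $or.
  def all(clauses) when is_list(clauses), do: %{"$and" => clauses}
  def any(clauses) when is_list(clauses), do: %{"$or" => clauses}
end

filter = Where.all([Where.gte("year", 2023), Where.one_of("source", ["web", "api"])])
IO.inspect(filter)
# => %{"$and" => [%{"year" => %{"$gte" => 2023}}, %{"source" => %{"$in" => ["web", "api"]}}]}
```

The resulting map can be passed as the where: option in the same way as the hand-written filters.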

Collection Management

# Create or get collection
{:ok, collection} = ChromEx.Collection.create("my_collection", get_or_create: true)

# List all collections
{:ok, collections} = ChromEx.Collection.list()

# Count documents in collection
{:ok, count} = ChromEx.Collection.count(collection)

# Delete collection
ChromEx.Collection.delete("my_collection")

Document Operations

# Get documents by IDs
{:ok, docs} = ChromEx.Collection.get_documents(collection, ids: ["id1", "id2"])

# Update documents (both styles work)
ChromEx.Collection.update_documents(collection, ["id1"],
  documents: ["Updated content"],
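  # Chroma metadata values are strings, numbers, or booleans, so store the timestamp as an ISO 8601 string
  metadatas: [%{updated_at: DateTime.to_iso8601(DateTime.utc_now())}]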
)

# Or with keyword-based IDs
ChromEx.Collection.update_documents(collection,
  ids: ["id1"],
  documents: ["Updated content"]
)

# Upsert (insert or update)
ChromEx.Collection.upsert(collection, ["id1", "id2"],
  documents: ["New or updated doc 1", "New or updated doc 2"]
)

# Or with keyword-based IDs
ChromEx.Collection.upsert(collection,
  ids: ["id1", "id2"],
  documents: ["New or updated doc 1", "New or updated doc 2"]
)

# Delete documents
ChromEx.Collection.delete_documents(collection, ids: ["id1"])

# Delete by metadata
ChromEx.Collection.delete_documents(collection, where: %{"source" => "old"})

Bang (!) Variants

All functions have bang variants that raise on error:

collection = ChromEx.Collection.create!("my_collection")
ChromEx.Collection.add!(collection, ids: ["id1"], documents: ["doc"])
results = ChromEx.Collection.query!(collection, query_texts: ["search"])
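
For readers new to the convention, a bang variant typically unwraps the tagged tuple returned by its non-bang counterpart and raises otherwise. The following is a generic sketch of that pattern, not ChromEx's implementation; BangSketch is a hypothetical module, and the use of RuntimeError is an assumption, since the exception type ChromEx raises is not stated here.

```elixir
defmodule BangSketch do
  @moduledoc "Generic illustration of the bang convention; not part of ChromEx."

  # Returns the wrapped value on success.
  def unwrap!({:ok, value}), do: value

  # Some operations return a bare :ok with no value.
  def unwrap!(:ok), do: :ok

  # Raises on failure (RuntimeError is an assumption made for this sketch).
  def unwrap!({:error, reason}) do
    raise RuntimeError, "operation failed: #{inspect(reason)}"
  end
end

IO.inspect(BangSketch.unwrap!({:ok, 42}))
# => 42
```

Bang variants suit scripts and pipelines where a failure should stop execution; the tuple-returning forms suit code that needs to handle errors explicitly.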

Idiomatic Elixir Features

ChromEx supports multiple calling styles for ergonomic Elixir code:

# Keyword list style
ChromEx.Collection.add(collection,
  ids: ["id1", "id2"],
  documents: ["Doc 1", "Doc 2"],
  metadatas: [%{key: "value"}, [key: "value"]]  # Maps or keyword lists
)

# IDs as second argument
ChromEx.Collection.add(collection, ["id1", "id2"],
  documents: ["Doc 1", "Doc 2"]
)

# All as keywords
ChromEx.Collection.add(collection,
  ids: ["id1"],
  documents: ["Doc 1"]
)
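
Metadata can be given as maps or keyword lists with atom keys, while query results come back with string keys. As a minimal sketch of how such input could be brought to one shape, the hypothetical helper below converts either form to a string-keyed map. It illustrates the idea only and is not ChromEx's actual implementation.

```elixir
defmodule MetadataSketch do
  @moduledoc "Hypothetical normalizer for illustration; not part of ChromEx."

  # Accepts a map or a keyword list and returns a map with string keys.
  # Values are passed through unchanged.
  def normalize(metadata) when is_map(metadata) or is_list(metadata) do
    Map.new(metadata, fn {key, value} -> {to_string(key), value} end)
  end
end

IO.inspect(MetadataSketch.normalize(key: "value"))
# => %{"key" => "value"}

IO.inspect(MetadataSketch.normalize(%{key: "value"}))
# => %{"key" => "value"}
```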

Multi-tenancy

# First, create the database in the tenant
{:ok, _db} = ChromEx.Database.create("production", tenant: "acme_corp")

# Then create collection in that tenant/database
{:ok, collection} = ChromEx.Collection.create("my_collection",
  tenant: "acme_corp",
  database: "production"
)

# All operations are scoped to the tenant/database
ChromEx.Collection.add(collection, ids: ["id1"], documents: ["Doc"])

Database Management

# Create database
{:ok, db} = ChromEx.Database.create("production", tenant: "acme_corp")

# Get database
{:ok, db} = ChromEx.Database.get("production", tenant: "acme_corp")

# Delete database
ChromEx.Database.delete("production", tenant: "acme_corp")

Configuration

Configure ChromEx in your application:

# config/config.exs
config :chromex,
  allow_reset: false,
  persist_path: "./chroma_data",
  hnsw_cache_size_mb: 1000,
  # Pool size for parallel embedding generation (defaults to CPU cores)
  embedding_pool_size: 8

Or pass the options directly when starting the processes in your supervision tree:

# In your application.ex
children = [
  {ChromEx.Client, [
    allow_reset: false,
    persist_path: "./chroma_data",
    hnsw_cache_size_mb: 1000
  ]},
  {ChromEx.EmbeddingsPool, [pool_size: 8]}
]

Architecture

ChromEx consists of three layers:

  1. Native Layer (ChromEx.Native) - Rustler NIFs interfacing with Chroma Rust code
  2. Domain Layer (ChromEx.Collection, ChromEx.Database) - Idiomatic Elixir APIs
  3. Facade Layer (ChromEx) - Top-level convenience functions
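
The three layers can be pictured with a small self-contained sketch. All modules below are hypothetical stand-ins under the LayerSketch namespace: the native layer is faked with a plain function where a real one would load a NIF, and none of this is ChromEx's actual code.

```elixir
defmodule LayerSketch.Native do
  # Stand-in for the native layer. A real NIF module would load compiled
  # Rust code; this one just returns a fixed value.
  def count(_handle), do: {:ok, 0}
end

defmodule LayerSketch.Collection do
  # Domain layer: accepts an Elixir-friendly argument (here a map holding
  # a handle) and calls down into the native layer.
  def count(%{handle: handle}), do: LayerSketch.Native.count(handle)
end

defmodule LayerSketch do
  # Facade layer: re-exports domain functions at the top level.
  defdelegate count(collection), to: LayerSketch.Collection
end

IO.inspect(LayerSketch.count(%{handle: :fake}))
# => {:ok, 0}
```

Keeping the native calls behind a domain layer means callers only deal with Elixir data structures and tagged tuples.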

The build process:

API Reference

ChromEx

Core client functions:

ChromEx.Collection

Collection and document operations:

ChromEx.Database

Database management:

ChromEx.Embeddings

Embedding generation (used automatically by Collection operations):

Performance

ChromEx uses the same embedding model as Python ChromaDB (all-MiniLM-L6-v2 via ONNX), providing:

The embedding pool allows multiple embedding requests to be processed concurrently, with each worker maintaining its own model instance for maximum parallelism. Pool size automatically defaults to the number of CPU cores but can be configured via :embedding_pool_size.
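
The default described above can be sketched as follows. This is an illustration under stated assumptions, not ChromEx's pool: PoolSketch is a hypothetical module, System.schedulers_online/0 stands in for "number of CPU cores", and Task.async_stream is used for bounded concurrency in place of long-lived workers that each hold a model instance.

```elixir
defmodule PoolSketch do
  @moduledoc "Hypothetical illustration of pool sizing; not part of ChromEx."

  # Reads :embedding_pool_size from the :chromex application environment,
  # falling back to the number of schedulers online (normally one per core).
  def pool_size do
    Application.get_env(:chromex, :embedding_pool_size, System.schedulers_online())
  end

  # Applies `embed_fun` to each text with at most pool_size/0 tasks running
  # at once, returning results in input order.
  def embed_all(texts, embed_fun) when is_function(embed_fun, 1) do
    texts
    |> Task.async_stream(embed_fun, max_concurrency: pool_size(), ordered: true)
    |> Enum.map(fn {:ok, embedding} -> embedding end)
  end
end

# A toy "embedding" function so the sketch runs without a model.
toy_embed = fn text -> [String.length(text) * 1.0] end

IO.inspect(PoolSketch.embed_all(["a", "bb", "ccc"], toy_embed))
# => [[1.0], [2.0], [3.0]]
```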

Benchmark results show comparable performance to Python for embedding generation and query operations.

Comparison with Python ChromaDB

ChromEx aims for API compatibility with Python ChromaDB:

| Python | Elixir |
| --- | --- |
| client.create_collection("name") | ChromEx.Collection.create("name") |
| collection.add(ids=[...], documents=[...]) | ChromEx.Collection.add(collection, ids: [...], documents: [...]) |
| collection.query(query_texts=["..."], n_results=5) | ChromEx.Collection.query(collection, query_texts: ["..."], n_results: 5) |
| collection.get(ids=[...]) | ChromEx.Collection.get_documents(collection, ids: [...]) |
| collection.update(ids=[...], documents=[...]) | ChromEx.Collection.update_documents(collection, ids: [...], documents: [...]) |
| collection.delete(ids=[...]) | ChromEx.Collection.delete_documents(collection, ids: [...]) |

License

ChromEx is licensed under the MIT License. See LICENSE for details.

This project uses the Chroma vector database, which is licensed under the Apache License 2.0. See LICENSE-CHROMA for Chroma's license terms.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Resources