# Chunkex
A powerful Elixir tool for intelligently chunking your project's source code to support Local RAG (Retrieval-Augmented Generation) systems. Chunkex extracts functions from your Elixir codebase and creates structured chunks optimized for embedding and retrieval.
## Overview
Chunkex is designed to bridge the gap between your Elixir codebase and AI agents by creating semantic chunks that can be:
- Embedded into vector representations by embedding models
- Stored in vector databases for efficient retrieval
- Retrieved through hybrid search to provide precise context to AI agents
- Used as JIT context instead of loading entire files
This approach enables your AI agents to access relevant code snippets with high precision, improving response quality while reducing token usage.
## Features
- Function-level chunking: Extracts individual functions (`def`, `defp`, `defmacro`) as semantic units
- Precise location tracking: Maintains file paths and line ranges for each chunk
- Rich metadata: Includes language identification, SHA hashes, and structured text
- Mix task integration: Easy-to-use command-line interface
- Error resilient: Gracefully handles syntax errors and malformed files
- JSONL output: Structured format optimized for embedding pipelines
## Installation
Add `chunkex` to your list of dependencies in `mix.exs`:

```elixir
def deps do
  [
    {:chunkex, "~> 0.1.0"}
  ]
end
```

Then run `mix deps.get` to install the dependency.
## Usage
### Command Line Interface
The easiest way to use Chunkex is through the Mix task:
```shell
# Chunk the current project
mix chunkex run

# Chunk a specific directory
mix chunkex run /path/to/your/project
```

### Programmatic Usage
You can also use Chunkex directly in your Elixir code:
```elixir
# Chunk the current directory
Chunkex.chunk()

# Chunk a specific directory
Chunkex.chunk("/path/to/your/project")
```

## Output Format
Chunkex generates a `tmp/chunks.jsonl` file containing one JSON object per line. Each chunk includes:
```json
{
  "path": "lib/my_module.ex",
  "line_start": 15,
  "line_end": 25,
  "lang": "elixir",
  "text": "  def calculate_total(items) do\n    items |> Enum.sum()\n  end",
  "sha": "a1b2c3d4e5f6..."
}
```

### Field Descriptions
- `path`: Relative path to the source file
- `line_start`: Starting line number of the function
- `line_end`: Ending line number of the function
- `lang`: Language identifier (always `"elixir"`)
- `text`: The actual function code with proper indentation
- `sha`: SHA-256 hash of the source file for change detection
## Integration with RAG Systems
### 1. Embedding Pipeline
```elixir
# Example: process chunks for embedding
chunks =
  File.stream!("tmp/chunks.jsonl")
  |> Stream.map(&Jason.decode!/1)
  |> Enum.map(fn chunk ->
    %{
      id: "#{chunk["path"]}:#{chunk["line_start"]}",
      content: chunk["text"],
      metadata: %{
        file: chunk["path"],
        line_start: chunk["line_start"],
        line_end: chunk["line_end"],
        sha: chunk["sha"]
      }
    }
  end)
```

### 2. Vector Storage
Store the embedded chunks in your preferred vector database (Pinecone, Weaviate, Chroma, etc.) with the metadata for hybrid search.
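The record shape most vector stores expect can be sketched as follows; `embed/1` here is a placeholder that returns a dummy vector (swap in your real embedding call, e.g. via Bumblebee or an API client), and `to_record/1` is an illustrative helper, not part of Chunkex:

```elixir
defmodule UpsertSketch do
  # Placeholder embedding — replace with a real model call.
  def embed(_text), do: List.duplicate(0.0, 4)

  # Turn one decoded chunk into an upsert record: a stable id,
  # the embedding vector, and the metadata needed for hybrid search.
  def to_record(chunk) do
    %{
      id: "#{chunk["path"]}:#{chunk["line_start"]}",
      vector: embed(chunk["text"]),
      metadata: Map.take(chunk, ["path", "line_start", "line_end", "lang", "sha"])
    }
  end
end
```

Keeping `path`, line range, and `sha` in the metadata lets you filter searches by file and detect stale entries later.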
### 3. Retrieval for AI Agents
When your AI agent needs context about specific functionality:
```elixir
# Retrieve relevant chunks based on a query (pseudocode — adapt the
# similarity_search call to your vector database client)
relevant_chunks =
  vector_db.similarity_search(
    query: "How to calculate totals?",
    filter: %{lang: "elixir"},
    limit: 5
  )

# Use chunks as JIT context instead of entire files
context =
  relevant_chunks
  |> Enum.map(& &1.content)
  |> Enum.join("\n\n")
```

## Benefits for Local RAG
### Precision over Recall
- Function-level chunks provide focused context
- Reduces noise from irrelevant code sections
- Improves AI agent response accuracy
### Token Efficiency
- Smaller, targeted chunks reduce token usage
- Enables more context within token limits
- Cost-effective for API-based AI services
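The token savings can be made concrete with the common "roughly 4 characters per token" heuristic (an assumption — real tokenizers vary by model). The strings below stand in for a full module's source versus the single function chunk Chunkex would return:

```elixir
# whole_file stands in for the full source of a module; chunk_text is
# the one relevant function a chunk would carry.
whole_file = String.duplicate("# unrelated code...\n", 200)
chunk_text = "def calculate_total(items) do\n  Enum.sum(items)\nend"

# Rough token estimate: ~4 bytes per token (heuristic, not exact).
estimate = fn text -> div(byte_size(text), 4) end

estimate.(whole_file)  # ~1000 tokens to send the entire file
estimate.(chunk_text)  # ~12 tokens for just the relevant chunk
```

Sending only the matching chunks keeps prompts one to two orders of magnitude smaller when the query touches a single function.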
### Incremental Updates
- SHA hashes enable change detection
- Update only modified chunks in your vector store
- Efficient synchronization with codebase changes
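A minimal sketch of SHA-based change detection, assuming the stored `sha` field is the lowercase hex SHA-256 of the whole source file (as described under Field Descriptions). `stale_chunks/1` and its input list are illustrative names, not part of Chunkex's API:

```elixir
defmodule ChangeDetection do
  # Recompute a file's hash the same way the chunk metadata records it:
  # SHA-256 of the file contents, hex-encoded in lowercase.
  def file_sha(path) do
    :crypto.hash(:sha256, File.read!(path))
    |> Base.encode16(case: :lower)
  end

  # Return the chunks whose source file has changed since embedding,
  # i.e. whose recorded sha no longer matches the file on disk.
  def stale_chunks(stored_chunks) do
    Enum.filter(stored_chunks, fn chunk ->
      file_sha(chunk["path"]) != chunk["sha"]
    end)
  end
end
```

Only the chunks returned by `stale_chunks/1` need to be re-chunked, re-embedded, and upserted; everything else in the vector store stays untouched.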
### Semantic Structure
- Functions represent logical code units
- Natural boundaries for embedding models
- Better semantic understanding by AI agents
## Development
### Running Tests
```shell
mix test
```

### Building Documentation
```shell
mix docs
```

## Contributing
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Roadmap
- Support for additional Elixir constructs (modules, typespecs, etc.)
- Configurable chunking strategies
- Integration with popular vector databases
- Real-time file watching for incremental updates
*Built To Save the Tokens*