Chunkex

Chunkex is an Elixir tool that chunks your project's source code for local RAG (Retrieval-Augmented Generation) systems. It extracts functions from your Elixir codebase and emits structured chunks optimized for embedding and retrieval.

Overview

Chunkex is designed to bridge the gap between your Elixir codebase and AI agents by creating semantic chunks that can be embedded, stored in a vector database, and retrieved on demand.

This approach enables your AI agents to access relevant code snippets with high precision, improving response quality while reducing token usage.

Features

Installation

Add chunkex to your list of dependencies in mix.exs:

def deps do
  [
    {:chunkex, "~> 0.1.0"}
  ]
end

Then run mix deps.get to install the dependency.

Usage

Command Line Interface

The easiest way to use Chunkex is through the Mix task:

# Chunk the current project
mix chunkex run

# Chunk a specific directory
mix chunkex run /path/to/your/project

Programmatic Usage

You can also use Chunkex directly in your Elixir code:

# Chunk the current directory
Chunkex.chunk()

# Chunk a specific directory
Chunkex.chunk("/path/to/your/project")

Output Format

Chunkex writes a tmp/chunks.jsonl file containing one JSON object per line. Each object has the following fields (pretty-printed here for readability):

{
  "path": "lib/my_module.ex",
  "line_start": 15,
  "line_end": 25,
  "lang": "elixir",
  "text": "  def calculate_total(items) do\n    items |> Enum.sum()\n  end",
  "sha": "a1b2c3d4e5f6..."
}
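Once a line has been decoded, it can help to give the chunk a fixed shape. The following is a minimal sketch, not part of Chunkex: `ChunkRecord`, `from_map/1`, and `line_count/1` are hypothetical names, and the sketch assumes only the six fields shown in the example above.

```elixir
defmodule ChunkRecord do
  # Hypothetical struct mirroring the fields of one chunks.jsonl line.
  @keys ~w(path line_start line_end lang text sha)
  defstruct [:path, :line_start, :line_end, :lang, :text, :sha]

  # Builds a struct from a decoded JSON map with string keys.
  # Returns {:error, {:missing, keys}} when any field is absent.
  def from_map(map) when is_map(map) do
    case Enum.reject(@keys, &Map.has_key?(map, &1)) do
      [] ->
        fields =
          Enum.map(@keys, fn key ->
            {String.to_existing_atom(key), Map.fetch!(map, key)}
          end)

        {:ok, struct!(__MODULE__, fields)}

      missing ->
        {:error, {:missing, missing}}
    end
  end

  # Number of source lines the chunk covers (the range is inclusive).
  def line_count(%__MODULE__{line_start: first, line_end: last}) do
    last - first + 1
  end
end
```

For the example chunk above, `ChunkRecord.line_count/1` returns 11 (lines 15 through 25).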

Field Descriptions

Integration with RAG Systems

1. Embedding Pipeline

# Example: turn chunks.jsonl into records ready for embedding.
# Requires the Jason library for JSON decoding.
chunks =
  "tmp/chunks.jsonl"
  |> File.stream!()
  |> Stream.map(&Jason.decode!/1)
  |> Enum.map(fn chunk ->
    %{
      # File path plus start line gives a human-readable ID
      id: "#{chunk["path"]}:#{chunk["line_start"]}",
      content: chunk["text"],
      metadata: %{
        file: chunk["path"],
        line_start: chunk["line_start"],
        line_end: chunk["line_end"],
        sha: chunk["sha"]
      }
    }
  end)
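Because every chunk carries a sha, a pipeline can avoid re-embedding chunks it has already seen. The sketch below assumes that the sha changes whenever the chunk text changes, which this README does not state explicitly. `ChunkDiff` and `partition/2` are hypothetical names, and how the set of known shas is persisted between runs is left to you.

```elixir
defmodule ChunkDiff do
  # Splits decoded chunks (maps with string keys) into those whose sha is
  # already known and those that still need to be embedded.
  # `known_shas` is a MapSet of sha strings recorded on a previous run.
  def partition(chunks, known_shas) do
    {unchanged, changed} =
      Enum.split_with(chunks, fn chunk ->
        MapSet.member?(known_shas, chunk["sha"])
      end)

    %{changed: changed, unchanged: unchanged}
  end
end
```

Only the `changed` list would then be sent to the embedding model.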

2. Vector Storage

Store the embedded chunks, together with their metadata, in your preferred vector database (Pinecone, Weaviate, Chroma, etc.) so that queries can combine similarity search with metadata filters.
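To make the storage step concrete without committing to a particular database, here is a small in-memory stand-in that ranks entries by cosine similarity after applying a metadata filter. It is an illustration only: `InMemoryIndex` is a hypothetical module, the `vector` field is assumed to hold the embedding as a list of floats, and a real vector database would use an approximate index rather than a full scan.

```elixir
defmodule InMemoryIndex do
  # Cosine similarity between two equal-length lists of numbers.
  def cosine(a, b) do
    dot = a |> Enum.zip(b) |> Enum.reduce(0.0, fn {x, y}, acc -> acc + x * y end)
    denom = norm(a) * norm(b)
    if denom == 0.0, do: 0.0, else: dot / denom
  end

  # Returns up to `limit` entries, most similar first.
  # entries: list of %{vector: [float], content: binary, metadata: map}
  # filter:  map of metadata key/value pairs that must all match exactly
  def search(entries, query_vector, filter, limit) do
    entries
    |> Enum.filter(fn entry ->
      Enum.all?(filter, fn {key, value} -> Map.get(entry.metadata, key) == value end)
    end)
    |> Enum.sort_by(fn entry -> cosine(entry.vector, query_vector) end, :desc)
    |> Enum.take(limit)
  end

  # Euclidean length of a vector.
  defp norm(v) do
    v |> Enum.reduce(0.0, fn x, acc -> acc + x * x end) |> :math.sqrt()
  end
end
```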

3. Retrieval for AI Agents

When your AI agent needs context about specific functionality:

# Retrieve relevant chunks for a query.
# VectorDB is a placeholder for your vector database client, not a real module.
relevant_chunks =
  VectorDB.similarity_search(
    query: "How to calculate totals?",
    filter: %{lang: "elixir"},
    limit: 5
  )

# Use the chunks as just-in-time context instead of entire files
context = Enum.map_join(relevant_chunks, "\n\n", & &1.content)
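If token usage matters, the context can be capped and annotated with where each chunk came from. The following sketch assumes the record shape built in the embedding step (`content` plus `metadata` with `file`, `line_start`, and `line_end`). `ContextBuilder` is a hypothetical name, and character count is used only as a rough stand-in for tokens.

```elixir
defmodule ContextBuilder do
  # Joins chunk contents, each preceded by a "# file:start-end" header, and
  # stops at the first chunk that would push the total past `max_chars`.
  # The "\n\n" separators are not counted against the budget.
  def build(chunks, max_chars) do
    chunks
    |> Enum.map(&format/1)
    |> Enum.reduce_while([], fn block, acc ->
      used = acc |> Enum.map(&String.length/1) |> Enum.sum()

      if used + String.length(block) <= max_chars do
        {:cont, [block | acc]}
      else
        {:halt, acc}
      end
    end)
    |> Enum.reverse()
    |> Enum.join("\n\n")
  end

  # Renders one chunk with a header line identifying its source location.
  defp format(%{content: content, metadata: meta}) do
    "# #{meta.file}:#{meta.line_start}-#{meta.line_end}\n#{content}"
  end
end
```

Chunks are kept in the order given, so passing them in ranked order means the most relevant ones are the last to be dropped.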

Benefits for Local RAG

Precision over Recall

Token Efficiency

Incremental Updates

Semantic Structure

Development

Running Tests

mix test

Building Documentation

mix docs

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Roadmap


Built To Save the Tokens