# Chunkex
A powerful Elixir tool for intelligently chunking your project's source code to support Local RAG (Retrieval-Augmented Generation) systems. Chunkex extracts functions from your Elixir codebase and creates structured chunks optimized for embedding and retrieval.
## Overview
Chunkex is designed to bridge the gap between your Elixir codebase and AI agents by creating semantic chunks that can be:
- Embedded into vector representations by embedding models
- Stored in vector databases for efficient retrieval
- Retrieved through hybrid search to provide precise context to AI agents
- Used as JIT context instead of loading entire files
This approach enables your AI agents to access relevant code snippets with high precision, improving response quality while reducing token usage.
## Features
- Function-level chunking: Extracts individual functions (`def`, `defp`, `defmacro`) as semantic units
- Precise location tracking: Maintains file paths and line ranges for each chunk
- Rich metadata: Includes language identification, SHA hashes, and structured text
- Mix task integration: Easy-to-use command-line interface
- Error resilient: Gracefully handles syntax errors and malformed files
- JSONL output: Structured format optimized for embedding pipelines
## Installation
Add `chunkex` to your list of dependencies in `mix.exs`:

```elixir
def deps do
  [
    {:chunkex, "~> 0.1.0"}
  ]
end
```

Then run `mix deps.get` to install the dependency.
## Usage
### Command Line Interface
The easiest way to use Chunkex is through the Mix task:
```shell
# Chunk the current project
mix chunkex run

# Chunk a specific directory
mix chunkex run /path/to/your/project
```

### Programmatic Usage
You can also use Chunkex directly in your Elixir code:
```elixir
# Chunk the current directory
Chunkex.chunk()

# Chunk a specific directory
Chunkex.chunk("/path/to/your/project")
```

## Output Format
Chunkex generates a `tmp/chunks.jsonl` file containing one JSON object per line. Each chunk includes:
```json
{
  "path": "lib/my_module.ex",
  "line_start": 15,
  "line_end": 25,
  "lang": "elixir",
  "text": "  def calculate_total(items) do\n    items |> Enum.sum()\n  end",
  "sha": "a1b2c3d4e5f6..."
}
```

### Field Descriptions
- `path`: Relative path to the source file
- `line_start`: Starting line number of the function
- `line_end`: Ending line number of the function
- `lang`: Language identifier (always `"elixir"`)
- `text`: The actual function code with proper indentation
- `sha`: SHA-256 hash of the source file for change detection
## Integration with RAG Systems
### 1. Embedding Pipeline
```elixir
# Example: process chunks for embedding
chunks =
  File.stream!("tmp/chunks.jsonl")
  |> Stream.map(&Jason.decode!/1)
  |> Enum.map(fn chunk ->
    %{
      id: "#{chunk["path"]}:#{chunk["line_start"]}",
      content: chunk["text"],
      metadata: %{
        file: chunk["path"],
        line_start: chunk["line_start"],
        line_end: chunk["line_end"],
        sha: chunk["sha"]
      }
    }
  end)
```

### 2. Vector Storage
Store the embedded chunks in your preferred vector database (Pinecone, Weaviate, Chroma, etc.) with the metadata for hybrid search.
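The record shape most vector stores expect can be sketched as follows; `embed/1` here is a placeholder that returns a dummy vector (swap in your real embedding call, e.g. via Bumblebee or an API client), and `to_record/1` is an illustrative helper, not part of Chunkex:

```elixir
defmodule UpsertSketch do
  # Placeholder embedding — replace with a real model call.
  def embed(_text), do: List.duplicate(0.0, 4)

  # Turn one decoded chunk into an upsert record: a stable id,
  # the embedding vector, and the metadata needed for hybrid search.
  def to_record(chunk) do
    %{
      id: "#{chunk["path"]}:#{chunk["line_start"]}",
      vector: embed(chunk["text"]),
      metadata: Map.take(chunk, ["path", "line_start", "line_end", "lang", "sha"])
    }
  end
end
```

Keeping `path`, line range, and `sha` in the metadata lets you filter searches by file and detect stale entries later.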
### 3. Retrieval for AI Agents
When your AI agent needs context about specific functionality:
```elixir
# Retrieve relevant chunks based on a query (pseudocode — adapt the
# similarity_search call to your vector database client)
relevant_chunks =
  vector_db.similarity_search(
    query: "How to calculate totals?",
    filter: %{lang: "elixir"},
    limit: 5
  )

# Use chunks as JIT context instead of entire files
context =
  relevant_chunks
  |> Enum.map(& &1.content)
  |> Enum.join("\n\n")
```

## Benefits for Local RAG
### Precision over Recall
- Function-level chunks provide focused context
- Reduces noise from irrelevant code sections
- Improves AI agent response accuracy
### Token Efficiency
- Smaller, targeted chunks reduce token usage
- Enables more context within token limits
- Cost-effective for API-based AI services
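The token savings can be made concrete with the common "roughly 4 characters per token" heuristic (an assumption — real tokenizers vary by model). The strings below stand in for a full module's source versus the single function chunk Chunkex would return:

```elixir
# whole_file stands in for the full source of a module; chunk_text is
# the one relevant function a chunk would carry.
whole_file = String.duplicate("# unrelated code...\n", 200)
chunk_text = "def calculate_total(items) do\n  Enum.sum(items)\nend"

# Rough token estimate: ~4 bytes per token (heuristic, not exact).
estimate = fn text -> div(byte_size(text), 4) end

estimate.(whole_file)  # ~1000 tokens to send the entire file
estimate.(chunk_text)  # ~12 tokens for just the relevant chunk
```

Sending only the matching chunks keeps prompts one to two orders of magnitude smaller when the query touches a single function.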
### Incremental Updates
- SHA hashes enable change detection
- Update only modified chunks in your vector store
- Efficient synchronization with codebase changes
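A minimal sketch of SHA-based change detection, assuming the stored `sha` field is the lowercase hex SHA-256 of the whole source file (as described under Field Descriptions). `stale_chunks/1` and its input list are illustrative names, not part of Chunkex's API:

```elixir
defmodule ChangeDetection do
  # Recompute a file's hash the same way the chunk metadata records it:
  # SHA-256 of the file contents, hex-encoded in lowercase.
  def file_sha(path) do
    :crypto.hash(:sha256, File.read!(path))
    |> Base.encode16(case: :lower)
  end

  # Return the chunks whose source file has changed since embedding,
  # i.e. whose recorded sha no longer matches the file on disk.
  def stale_chunks(stored_chunks) do
    Enum.filter(stored_chunks, fn chunk ->
      file_sha(chunk["path"]) != chunk["sha"]
    end)
  end
end
```

Only the chunks returned by `stale_chunks/1` need to be re-chunked, re-embedded, and upserted; everything else in the vector store stays untouched.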
### Semantic Structure
- Functions represent logical code units
- Natural boundaries for embedding models
- Better semantic understanding by AI agents
## Development
### Running Tests
```shell
mix test
```

### Building Documentation
```shell
mix docs
```

## Contributing
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Roadmap
- Support for additional Elixir constructs (modules, typespecs, etc.)
- Configurable chunking strategies
- Integration with popular vector databases
- Real-time file watching for incremental updates
*Built To Save the Tokens*