Portfolio Index



Production adapters and pipelines for the PortfolioCore hexagonal architecture. Vector stores, graph databases, embedders, Broadway pipelines, and advanced RAG strategies.


Overview

Portfolio Index implements the port specifications defined in Portfolio Core, providing:

- Vector store adapters (PostgreSQL + pgvector)
- Graph store adapters (Neo4j)
- Embedder and LLM adapters (Gemini, OpenAI, Anthropic, Codex, Ollama, vLLM)
- Broadway ingestion and embedding pipelines
- Advanced RAG strategies (Hybrid, Self-RAG, Agentic, GraphRAG)

Prerequisites

PostgreSQL with pgvector

# Ubuntu/WSL
sudo apt install postgresql postgresql-contrib libpq-dev postgresql-16-pgvector

# Create database
createdb portfolio_index_dev

Neo4j

# Install via apt (Ubuntu/WSL)
curl -fsSL https://debian.neo4j.com/neotechnology.gpg.key | \
  sudo gpg --dearmor -o /etc/apt/trusted.gpg.d/neo4j.gpg
echo "deb https://debian.neo4j.com stable latest" | \
  sudo tee /etc/apt/sources.list.d/neo4j.list
sudo apt update && sudo apt install neo4j

# Start service
sudo systemctl enable neo4j && sudo systemctl start neo4j

# Set password
sudo neo4j-admin dbms set-initial-password password

Access Points:

| Service | URL | Credentials |
| --- | --- | --- |
| Neo4j Browser | http://localhost:7474 | neo4j / password |
| Bolt endpoint | bolt://localhost:7687 | neo4j / password |

Gemini API Key

export GEMINI_API_KEY="your-api-key"

Installation

Add portfolio_index to your list of dependencies in mix.exs:

def deps do
  [
    {:portfolio_index, "~> 0.4.0"}
  ]
end

Then run:

mix deps.get
mix ecto.create
mix ecto.migrate

Quick Start

Vector Search

alias PortfolioIndex.Adapters.VectorStore.Pgvector
alias PortfolioIndex.Adapters.Embedder.Gemini

# Create index
:ok = Pgvector.create_index("docs", %{dimensions: 768, metric: :cosine})

# Generate embedding and store
{:ok, %{vector: vec}} = Gemini.embed("Hello, world!")
:ok = Pgvector.store("docs", "doc_1", vec, %{content: "Hello, world!"})

# Search
{:ok, results} = Pgvector.search("docs", query_vector, 10, [])

Graph Operations

alias PortfolioIndex.Adapters.GraphStore.Neo4j

# Create a graph namespace
:ok = Neo4j.create_graph("knowledge", %{})

# Create nodes
{:ok, node1} = Neo4j.create_node("knowledge", %{
  labels: ["Concept"],
  properties: %{name: "Elixir", type: "language"}
})

{:ok, node2} = Neo4j.create_node("knowledge", %{
  labels: ["Concept"],
  properties: %{name: "GenServer", type: "behaviour"}
})

# Create relationship
{:ok, _edge} = Neo4j.create_edge("knowledge", %{
  from_id: node1.id,
  to_id: node2.id,
  type: "HAS_FEATURE",
  properties: %{since: "1.0"}
})

# Query neighbors
{:ok, neighbors} = Neo4j.get_neighbors("knowledge", node1.id, direction: :outgoing)

RAG Query

alias PortfolioIndex.RAG.Strategies.Hybrid

{:ok, result} = Hybrid.retrieve(
  "How does authentication work?",
  %{index_id: "docs"},
  k: 10
)

# result.items contains ranked results
# result.timing_ms contains query duration

Self-RAG with Critique

alias PortfolioIndex.RAG.Strategies.SelfRAG

{:ok, result} = SelfRAG.retrieve(
  "What is GenServer?",
  %{index_id: "docs"},
  k: 5, min_critique_score: 3
)

# result.answer contains the generated answer
# result.critique contains relevance/support/completeness scores

Broadway Pipeline

# Start ingestion pipeline
{:ok, _} = PortfolioIndex.Pipelines.Ingestion.start(
  paths: ["/path/to/docs"],
  patterns: ["**/*.md", "**/*.ex"],
  index_id: "my_index",
  chunk_size: 1000,
  chunk_overlap: 200
)

# Start embedding pipeline
{:ok, _} = PortfolioIndex.Pipelines.Embedding.start(
  index_id: "my_index",
  rate_limit: 100,
  batch_size: 50
)

Configuration

Environment Variables

| Variable | Description | Default |
| --- | --- | --- |
| DATABASE_URL | PostgreSQL connection URL | - |
| NEO4J_URI | Neo4j Bolt URI | bolt://localhost:7687 |
| NEO4J_USER | Neo4j username | neo4j |
| NEO4J_PASSWORD | Neo4j password | - |
| GEMINI_API_KEY | Google Gemini API key | - |
| OPENAI_API_KEY | OpenAI API key (OpenAI + Codex) | - |
| OPENAI_ORGANIZATION | OpenAI organization ID (optional) | - |
| ANTHROPIC_API_KEY | Anthropic API key | - |
| CODEX_API_KEY | Codex SDK API key (optional) | - |
| OLLAMA_HOST | Ollama host URL | http://localhost:11434 |
| OLLAMA_BASE_URL | Ollama base URL (override) | http://localhost:11434/api |
| OLLAMA_API_KEY | Ollama API key (optional) | - |
| VLLM_BASE_URL | vLLM base URL | http://localhost:8000/v1 |
| VLLM_API_KEY | vLLM API key (optional) | - |

Local Model Setup

Ollama examples require a running Ollama server and these models:

- llama3.2 (chat)
- nomic-embed-text (embeddings)

Install them with:

ollama pull llama3.2
ollama pull nomic-embed-text

Or run:

mix run examples/ollama_setup.exs

The vLLM example expects a vLLM server running at VLLM_BASE_URL.

Config Files

# config/dev.exs
config :portfolio_index, PortfolioIndex.Repo,
  username: "postgres",
  password: "postgres",
  hostname: "localhost",
  database: "portfolio_index_dev"

config :boltx, Boltx,
  uri: "bolt://localhost:7687",
  auth: [username: "neo4j", password: "password"],
  pool_size: 10

Adapters

Vector Store

| Adapter | Backend | Features |
| --- | --- | --- |
| Pgvector | PostgreSQL + pgvector | IVFFlat and HNSW indexes; cosine, euclidean, dot_product metrics |

Graph Store

| Adapter | Backend | Features |
| --- | --- | --- |
| Neo4j | Neo4j via boltx | Multi-graph isolation, Cypher queries |

Embedders

| Adapter | Provider | Model |
| --- | --- | --- |
| Gemini | Google | text-embedding-004 (768 dims) |
| OpenAI | OpenAI | text-embedding-3-small/large |
| Ollama | Ollama | nomic-embed-text (default) |

LLMs

| Adapter | Provider | Model |
| --- | --- | --- |
| Gemini | Google | gemini-flash-lite-latest |
| Anthropic | Anthropic | Claude (SDK default) |
| OpenAI | OpenAI | gpt-4o-mini (default) |
| Codex | OpenAI | Codex SDK default |
| VLLM | vLLM | llama3 (default) |
| Ollama | Ollama | llama3.2 (default) |

Chunker

| Adapter | Strategy | Features |
| --- | --- | --- |
| Recursive | Recursive text splitting | Format-aware for 17+ languages |
| Character | Character-based | Boundary modes: word, sentence, none |
| Sentence | Sentence-based | NLP tokenization, abbreviation handling |
| Paragraph | Paragraph-based | Intelligent merge/split at boundaries |
| Semantic | Embedding similarity | Groups by semantic coherence |

Supported Formats

| Category | Formats |
| --- | --- |
| Languages | Elixir, Ruby, PHP, Python, JavaScript, TypeScript, Vue |
| Markup | Markdown, HTML, LaTeX |
| Documents | doc, docx, epub, odt, pdf, rtf |

Token-Based Chunking

All chunkers support custom size measurement via :get_chunk_size:

# Character-based (default)
Recursive.chunk(text, :elixir, %{chunk_size: 1000})

# Token-based (for LLM context limits)
Recursive.chunk(text, :elixir, %{
  chunk_size: 256,
  get_chunk_size: &MyTokenizer.count_tokens/1
})

# Byte-based (for storage limits)
Recursive.chunk(text, :plain, %{
  chunk_size: 4096,
  get_chunk_size: &byte_size/1
})
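The `MyTokenizer` module in the token-based example is a placeholder; any 1-arity function returning an integer works as `:get_chunk_size`. A minimal stand-in, assuming whitespace-delimited words are a good-enough token approximation for rough budgeting:

```elixir
# Hypothetical tokenizer stand-in: counts whitespace-delimited words.
# Swap in a real tokenizer (e.g. a BPE implementation) for accurate
# LLM context budgeting.
count_tokens = fn text -> text |> String.split() |> length() end

count_tokens.("defmodule Foo do end")
#=> 4
```

Pass it as `get_chunk_size: count_tokens` in the chunk options shown above.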

RAG Strategies

| Strategy | Description |
| --- | --- |
| Hybrid | Vector + keyword search with Reciprocal Rank Fusion |
| SelfRAG | Retrieval with self-critique and answer refinement |
| Agentic | Tool-based iterative retrieval |
| GraphRAG | Graph-aware retrieval with vector fusion |
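Reciprocal Rank Fusion, as used by the Hybrid strategy, can be sketched in a few lines: each ranked result list contributes 1 / (k + rank) per document id, and ids are re-ranked by their summed score. This is an illustrative sketch, not the strategy's internal code; k = 60 is the conventional smoothing constant.

```elixir
# Illustrative Reciprocal Rank Fusion: documents ranked highly by
# several retrievers accumulate the largest fused scores.
defmodule RRFSketch do
  @k 60

  def fuse(result_lists) do
    result_lists
    |> Enum.flat_map(fn results ->
      results
      |> Enum.with_index(1)
      |> Enum.map(fn {id, rank} -> {id, 1 / (@k + rank)} end)
    end)
    |> Enum.group_by(&elem(&1, 0), &elem(&1, 1))
    |> Enum.map(fn {id, scores} -> {id, Enum.sum(scores)} end)
    |> Enum.sort_by(&elem(&1, 1), :desc)
  end
end

# "a" tops both lists, so it wins; "c" appears in both and beats "b".
RRFSketch.fuse([["a", "b", "c"], ["a", "c", "d"]])
```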

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Portfolio Index                          │
├─────────────────────────────────────────────────────────────┤
│  Adapters                                                   │
│  ┌───────────────┐ ┌───────────────┐ ┌───────────────┐      │
│  │ Vector Store  │ │ Graph Store   │ │   Embedder    │      │
│  │ • Pgvector    │ │ • Neo4j       │ │ • Gemini      │      │
│  │               │ │               │ │ • OpenAI      │      │
│  └───────────────┘ └───────────────┘ └───────────────┘      │
│  ┌───────────────┐ ┌───────────────┐ ┌───────────────┐      │
│  │     LLM       │ │   Chunker     │ │ Document Store│      │
│  │ • Gemini      │ │ • Recursive   │ │ • Postgres    │      │
│  │ • Anthropic   │ │               │ │               │      │
│  │ • OpenAI      │ │               │ │               │      │
│  │ • Codex       │ │               │ │               │      │
│  │ • Ollama      │ │               │ │               │      │
│  │ • vLLM        │ │               │ │               │      │
│  └───────────────┘ └───────────────┘ └───────────────┘      │
├─────────────────────────────────────────────────────────────┤
│  Pipelines (Broadway)                                       │
│  ┌───────────────────────────┐ ┌───────────────────────────┐│
│  │        Ingestion          │ │        Embedding          ││
│  │ FileProducer → Chunker    │ │ ETSProducer → VectorStore ││
│  └───────────────────────────┘ └───────────────────────────┘│
├─────────────────────────────────────────────────────────────┤
│  RAG Strategies                                             │
│  ┌────────────────────┐ ┌────────────────────┐              │
│  │       Hybrid       │ │      Self-RAG      │              │
│  │ Vector + RRF fusion│ │ Critique + Refine  │              │
│  └────────────────────┘ └────────────────────┘              │
│  ┌────────────────────┐ ┌────────────────────┐              │
│  │      Agentic       │ │      GraphRAG      │              │
│  │   (placeholder)    │ │   (placeholder)    │              │
│  └────────────────────┘ └────────────────────┘              │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                     Portfolio Core                          │
│              (Port Specifications & Registry)               │
└─────────────────────────────────────────────────────────────┘

Testing

# Run unit tests (mocked adapters)
mix test

# Run integration tests (requires running services)
mix test --include integration

# Run only Neo4j integration tests
mix test test/adapters/graph_store/neo4j_test.exs --include integration

# Run only Pgvector integration tests
mix test test/adapters/vector_store/pgvector_test.exs --include integration

Test Structure

The test suite separates unit tests (mocked, fast) from integration tests (live services):

| Test Type | Tag | Services Required | Run Command |
| --- | --- | --- | --- |
| Unit | (default) | None | mix test |
| Integration | @tag :integration | Neo4j, PostgreSQL | mix test --include integration |

Integration tests are excluded by default in test/test_helper.exs:

ExUnit.start(exclude: [:integration, :skip])

Test Fixtures

test/support/fixtures.ex provides test data generators:

alias PortfolioIndex.Fixtures

# Vector fixtures
Fixtures.random_vector(768)              # Random 768-dim vector
Fixtures.random_normalized_vector(768)   # Normalized (unit length)

# Graph fixtures
Fixtures.sample_node("node_1")           # %{id, labels, properties}
Fixtures.sample_edge("from", "to")       # %{id, type, from_id, to_id, properties}
Fixtures.sample_graph(5)                 # %{nodes: [...], edges: [...]}

# Document fixtures
Fixtures.sample_document()               # Sample markdown content
Fixtures.sample_code()                   # Sample Elixir code
Fixtures.sample_chunks(content, 3)       # Split content into chunks

Neo4j Details

Schema Management

Unlike SQL databases, Neo4j doesn’t use traditional migrations. Instead, PortfolioIndex.Adapters.GraphStore.Neo4j.Schema provides schema management:

alias PortfolioIndex.Adapters.GraphStore.Neo4j.Schema

# Setup all constraints and indexes
Schema.setup!()

# Check current schema version
Schema.version()
#=> 1

# Run migrations up to a specific version
Schema.migrate!(2)

# Reset database (DANGEROUS - testing only)
Schema.reset!()

# Clean a specific graph namespace
Schema.clean_graph!("my_graph")

Schema Versioning

Schema versions are tracked in a :SchemaVersion node:

(:SchemaVersion {id: 'current', version: 1, updated_at: datetime()})

Each migration is idempotent and can be re-run safely.

Constraints and Indexes

The schema setup creates:

| Type | Name | Description |
| --- | --- | --- |
| Constraint | node_id_unique | Unique node IDs within a graph |
| Constraint | edge_id_unique | Unique edge IDs within a graph |
| Index | idx_node_graph_id | Fast graph isolation queries |
| Index | idx_node_labels | Label-based queries |
| Index | idx_fulltext_content | Full-text search on content/name/title |

Multi-Graph Isolation

All nodes and edges include a _graph_id property for namespace isolation:

# Each graph is isolated by its graph_id
Neo4j.create_graph("project_a", %{})
Neo4j.create_graph("project_b", %{})

# Nodes in different graphs don't interfere
Neo4j.create_node("project_a", %{labels: ["File"], properties: %{path: "/app.ex"}})
Neo4j.create_node("project_b", %{labels: ["File"], properties: %{path: "/app.ex"}})

# Queries are scoped to a graph
Neo4j.get_neighbors("project_a", node_id, direction: :outgoing)

The underlying Cypher uses _graph_id for isolation:

MATCH (n {id: $node_id, _graph_id: $graph_id})
RETURN n, labels(n) as labels

Custom Cypher Queries

Execute arbitrary Cypher with automatic graph_id injection:

cypher = """
MATCH (p:Person {_graph_id: $graph_id})
WHERE p.age > $min_age
RETURN p.name AS name, p.age AS age
ORDER BY p.age DESC
"""

{:ok, result} = Neo4j.query("my_graph", cypher, %{min_age: 25})
# result.records contains [%{"name" => "Alice", "age" => 30}, ...]

Both $graph_id and $_graph_id are available in queries.

Boltx Driver

This adapter uses boltx (v0.0.6+) for Neo4j connectivity:

# config/dev.exs
config :boltx, Boltx,
  uri: "bolt://localhost:7687",
  auth: [username: "neo4j", password: "password"],
  pool_size: 10,
  name: Boltx  # Required for connection pool registration

Neo4j Integration Tests

Integration tests create isolated graph namespaces per test:

defmodule MyNeo4jTest do
  use ExUnit.Case, async: true

  alias PortfolioIndex.Adapters.GraphStore.Neo4j

  describe "my feature" do
    @tag :integration
    test "creates nodes" do
      # Create unique graph for this test
      graph_id = "test_#{System.unique_integer([:positive])}"
      :ok = Neo4j.create_graph(graph_id, %{})

      # Test logic...
      {:ok, node} = Neo4j.create_node(graph_id, %{
        labels: ["Test"],
        properties: %{name: "example"}
      })

      assert is_binary(node.id)

      # Cleanup
      Neo4j.delete_graph(graph_id)
    end
  end
end

Telemetry Events

The Neo4j adapter emits telemetry events:

# Event names
[:portfolio_index, :graph_store, :create_node]
[:portfolio_index, :graph_store, :create_edge]
[:portfolio_index, :graph_store, :query]

# Measurements
%{duration_ms: 5}

# Metadata
%{graph_id: "my_graph"}

Attach handlers for observability:

:telemetry.attach(
  "neo4j-logger",
  [:portfolio_index, :graph_store, :query],
  fn _event, %{duration_ms: ms}, %{graph_id: id}, _config ->
    Logger.info("Neo4j query on #{id} took #{ms}ms")
  end,
  nil
)

Pgvector Details

PostgreSQL Setup

# Install PostgreSQL and pgvector extension
sudo apt install postgresql postgresql-contrib libpq-dev postgresql-16-pgvector

# Create database
createdb portfolio_index_dev

# Enable pgvector extension
psql -d portfolio_index_dev -c "CREATE EXTENSION IF NOT EXISTS vector;"

# Run migrations
mix ecto.migrate

Index Configuration

Create vector indexes with customizable parameters:

alias PortfolioIndex.Adapters.VectorStore.Pgvector

# Basic index with defaults
:ok = Pgvector.create_index("docs", %{dimensions: 768})

# Cosine similarity with HNSW index
:ok = Pgvector.create_index("embeddings", %{
  dimensions: 768,
  metric: :cosine,
  index_type: :hnsw,
  options: %{m: 16, ef_construction: 64}
})

# Euclidean distance with IVFFlat index
:ok = Pgvector.create_index("images", %{
  dimensions: 512,
  metric: :euclidean,
  index_type: :ivfflat,
  options: %{lists: 100}
})

Distance Metrics

| Metric | Operator | Use Case |
| --- | --- | --- |
| :cosine | <=> | Text embeddings, normalized vectors |
| :euclidean | <-> | Image embeddings, spatial data |
| :dot_product | <#> | When vectors are already normalized |
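For intuition, cosine similarity in plain Elixir (an illustrative sketch; the adapter computes this in SQL via the operators above, where `<=>` returns cosine *distance*, i.e. 1 - similarity):

```elixir
# Cosine similarity: dot product of the vectors divided by the
# product of their magnitudes. 1.0 means identical direction.
defmodule CosineSketch do
  def similarity(a, b) do
    dot = Enum.zip_with(a, b, fn x, y -> x * y end) |> Enum.sum()
    norm = fn v -> :math.sqrt(Enum.sum(Enum.map(v, &(&1 * &1)))) end
    dot / (norm.(a) * norm.(b))
  end
end

CosineSketch.similarity([1.0, 0.0], [1.0, 0.0])
#=> 1.0
```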

Index Types

| Type | Description | Best For |
| --- | --- | --- |
| :ivfflat | Inverted file index | Large datasets, good recall |
| :hnsw | Hierarchical navigable small world | Fast queries, high recall |
| :flat | No index (exact search) | Small datasets, perfect accuracy |

Vector Operations

alias PortfolioIndex.Adapters.VectorStore.Pgvector

# Store a vector with metadata
:ok = Pgvector.store("docs", "doc_1", embedding_vector, %{
  source: "/path/to/file.md",
  title: "My Document",
  chunk_index: 0
})

# Batch store (more efficient)
items = [
  {"doc_1", vector1, %{source: "/a.md"}},
  {"doc_2", vector2, %{source: "/b.md"}},
  {"doc_3", vector3, %{source: "/c.md"}}
]
{:ok, 3} = Pgvector.store_batch("docs", items)

# Search with k nearest neighbors
{:ok, results} = Pgvector.search("docs", query_vector, 10, [])
# results = [%{id: "doc_1", score: 0.95, metadata: %{...}}, ...]

# Search with metadata filter
{:ok, results} = Pgvector.search("docs", query_vector, 10,
  filter: %{source: "/a.md"}
)

# Search with minimum score threshold
{:ok, results} = Pgvector.search("docs", query_vector, 10,
  min_score: 0.8
)

# Include vectors in results
{:ok, results} = Pgvector.search("docs", query_vector, 10,
  include_vector: true
)

# Delete a vector
:ok = Pgvector.delete("docs", "doc_1")

# Get index statistics
{:ok, stats} = Pgvector.index_stats("docs")
# stats = %{count: 1000, dimensions: 768, metric: :cosine, size_bytes: ...}

# Check if index exists
Pgvector.index_exists?("docs")  # => true or false

# Delete entire index
:ok = Pgvector.delete_index("docs")

Table Structure

Each index creates a table with this schema:

CREATE TABLE vectors_<index_id> (
  id VARCHAR(255) PRIMARY KEY,
  embedding vector(<dimensions>),
  metadata JSONB DEFAULT '{}',
  created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

Index metadata is tracked in the registry:

CREATE TABLE vector_index_registry (
  index_id VARCHAR(255) PRIMARY KEY,
  dimensions INTEGER NOT NULL,
  metric VARCHAR(50) NOT NULL,
  index_type VARCHAR(50) NOT NULL,
  options JSONB DEFAULT '{}',
  created_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP DEFAULT NOW()
);

Ecto Configuration

# config/dev.exs
config :portfolio_index, PortfolioIndex.Repo,
  username: "postgres",
  password: "postgres",
  hostname: "localhost",
  database: "portfolio_index_dev",
  pool_size: 10

# config/test.exs
config :portfolio_index, PortfolioIndex.Repo,
  username: "postgres",
  password: "postgres",
  hostname: "localhost",
  database: "portfolio_index_test",
  pool: Ecto.Adapters.SQL.Sandbox

Pgvector Integration Tests

Integration tests use Ecto sandbox for isolation:

defmodule MyVectorTest do
  use ExUnit.Case, async: false

  alias PortfolioIndex.Adapters.VectorStore.Pgvector

  setup do
    pid = Ecto.Adapters.SQL.Sandbox.start_owner!(PortfolioIndex.Repo, shared: true)
    on_exit(fn -> Ecto.Adapters.SQL.Sandbox.stop_owner(pid) end)
    :ok
  end

  @tag :integration
  test "stores and searches vectors" do
    index_id = "test_#{System.unique_integer([:positive])}"
    :ok = Pgvector.create_index(index_id, %{dimensions: 768})

    vector = for _ <- 1..768, do: :rand.uniform()
    :ok = Pgvector.store(index_id, "doc_1", vector, %{})

    {:ok, results} = Pgvector.search(index_id, vector, 1, [])
    assert hd(results).id == "doc_1"

    Pgvector.delete_index(index_id)
  end
end

Telemetry Events

The Pgvector adapter emits telemetry events:

# Event names
[:portfolio_index, :vector_store, :store]
[:portfolio_index, :vector_store, :store_batch]
[:portfolio_index, :vector_store, :search]

# Measurements
%{duration_ms: 5}                    # store
%{duration_ms: 50, count: 100}       # store_batch
%{duration_ms: 10, k: 10, results: 8} # search

# Metadata
%{index_id: "my_index"}

Attach handlers for monitoring:

:telemetry.attach(
  "pgvector-logger",
  [:portfolio_index, :vector_store, :search],
  fn _event, %{duration_ms: ms, results: n}, %{index_id: id}, _config ->
    Logger.info("Search on #{id}: #{n} results in #{ms}ms")
  end,
  nil
)

Performance Tips

  1. Use HNSW for production - Better query performance than IVFFlat
  2. Batch inserts - Use store_batch/2 for bulk ingestion
  3. Tune HNSW parameters:
    • m: Higher = better recall, more memory (default: 16)
    • ef_construction: Higher = better index quality, slower build (default: 64)
  4. Use metadata filters - Reduces search space before vector comparison
  5. Set appropriate min_score - Filters low-quality matches early

Documentation

Related Packages

Acknowledgments

Significant portions of this library’s architecture and features were derived from analysis of Arcana by George Guimarães, licensed under the Apache License 2.0.

Features inspired by Arcana include:

See docs/20251230/arcana_gap_analysis/ for detailed analysis.

License

MIT License - see LICENSE for details.