Portfolio Index



Production adapters and pipelines for the PortfolioCore hexagonal architecture. Vector stores, graph databases, embedders, Broadway pipelines, and advanced RAG strategies.


Overview

Portfolio Index implements the port specifications defined in Portfolio Core, providing:

- Vector store adapters (PostgreSQL + pgvector)
- Graph store adapters (Neo4j)
- Embedder and LLM adapters (Gemini, OpenAI, Anthropic, Codex, Ollama, vLLM)
- Broadway ingestion and embedding pipelines
- Advanced RAG strategies (Hybrid, Self-RAG, Agentic, GraphRAG)

Prerequisites

PostgreSQL with pgvector

# Ubuntu/WSL
sudo apt install postgresql postgresql-contrib libpq-dev postgresql-16-pgvector

# Create database
createdb portfolio_index_dev

Neo4j

# Install via apt (Ubuntu/WSL)
curl -fsSL https://debian.neo4j.com/neotechnology.gpg.key | \
  sudo gpg --dearmor -o /etc/apt/trusted.gpg.d/neo4j.gpg
echo "deb https://debian.neo4j.com stable latest" | \
  sudo tee /etc/apt/sources.list.d/neo4j.list
sudo apt update && sudo apt install neo4j

# Start service
sudo systemctl enable neo4j && sudo systemctl start neo4j

# Set password
sudo neo4j-admin dbms set-initial-password password

Access Points:

| Service | URL | Credentials |
| --- | --- | --- |
| Neo4j Browser | http://localhost:7474 | neo4j / password |
| Bolt endpoint | bolt://localhost:7687 | neo4j / password |

Gemini API Key

export GEMINI_API_KEY="your-api-key"

Installation

Add portfolio_index to your list of dependencies in mix.exs:

def deps do
  [
    {:portfolio_index, "~> 0.4.0"}
  ]
end

Then run:

mix deps.get
mix ecto.create
mix ecto.migrate

Quick Start

Vector Search

alias PortfolioIndex.Adapters.VectorStore.Pgvector
alias PortfolioIndex.Adapters.Embedder.Gemini

# Create index
:ok = Pgvector.create_index("docs", %{dimensions: 768, metric: :cosine})

# Generate embedding and store
{:ok, %{vector: vec}} = Gemini.embed("Hello, world!")
:ok = Pgvector.store("docs", "doc_1", vec, %{content: "Hello, world!"})

# Search
{:ok, results} = Pgvector.search("docs", query_vector, 10, [])

Graph Operations

alias PortfolioIndex.Adapters.GraphStore.Neo4j

# Create a graph namespace
:ok = Neo4j.create_graph("knowledge", %{})

# Create nodes
{:ok, node1} = Neo4j.create_node("knowledge", %{
  labels: ["Concept"],
  properties: %{name: "Elixir", type: "language"}
})

{:ok, node2} = Neo4j.create_node("knowledge", %{
  labels: ["Concept"],
  properties: %{name: "GenServer", type: "behaviour"}
})

# Create relationship
{:ok, _edge} = Neo4j.create_edge("knowledge", %{
  from_id: node1.id,
  to_id: node2.id,
  type: "HAS_FEATURE",
  properties: %{since: "1.0"}
})

# Query neighbors
{:ok, neighbors} = Neo4j.get_neighbors("knowledge", node1.id, direction: :outgoing)

RAG Query

alias PortfolioIndex.RAG.Strategies.Hybrid

{:ok, result} = Hybrid.retrieve(
  "How does authentication work?",
  %{index_id: "docs"},
  k: 10
)

# result.items contains ranked results
# result.timing_ms contains query duration

Self-RAG with Critique

alias PortfolioIndex.RAG.Strategies.SelfRAG

{:ok, result} = SelfRAG.retrieve(
  "What is GenServer?",
  %{index_id: "docs"},
  k: 5, min_critique_score: 3
)

# result.answer contains the generated answer
# result.critique contains relevance/support/completeness scores

Broadway Pipeline

# Start ingestion pipeline
{:ok, _} = PortfolioIndex.Pipelines.Ingestion.start(
  paths: ["/path/to/docs"],
  patterns: ["**/*.md", "**/*.ex"],
  index_id: "my_index",
  chunk_size: 1000,
  chunk_overlap: 200
)

# Start embedding pipeline
{:ok, _} = PortfolioIndex.Pipelines.Embedding.start(
  index_id: "my_index",
  rate_limit: 100,
  batch_size: 50
)

Configuration

Environment Variables

| Variable | Description | Default |
| --- | --- | --- |
| DATABASE_URL | PostgreSQL connection URL | - |
| NEO4J_URI | Neo4j Bolt URI | bolt://localhost:7687 |
| NEO4J_USER | Neo4j username | neo4j |
| NEO4J_PASSWORD | Neo4j password | - |
| GEMINI_API_KEY | Google Gemini API key | - |
| OPENAI_API_KEY | OpenAI API key (OpenAI + Codex) | - |
| OPENAI_ORGANIZATION | OpenAI organization ID (optional) | - |
| ANTHROPIC_API_KEY | Anthropic API key | - |
| CODEX_API_KEY | Codex SDK API key (optional) | - |
| OLLAMA_HOST | Ollama host URL | http://localhost:11434 |
| OLLAMA_BASE_URL | Ollama base URL (override) | http://localhost:11434/api |
| OLLAMA_API_KEY | Ollama API key (optional) | - |
| VLLM_BASE_URL | vLLM base URL | http://localhost:8000/v1 |
| VLLM_API_KEY | vLLM API key (optional) | - |

Local Model Setup

Ollama examples require a running Ollama server and these models:

- llama3.2 (chat)
- nomic-embed-text (embeddings)

Install them with:

ollama pull llama3.2
ollama pull nomic-embed-text

Or run:

mix run examples/ollama_setup.exs

The vLLM example expects a vLLM server running at VLLM_BASE_URL.

Config Files

# config/dev.exs
config :portfolio_index, PortfolioIndex.Repo,
  username: "postgres",
  password: "postgres",
  hostname: "localhost",
  database: "portfolio_index_dev"

config :boltx, Boltx,
  uri: "bolt://localhost:7687",
  auth: [username: "neo4j", password: "password"],
  pool_size: 10

Adapters

Vector Store

| Adapter | Backend | Features |
| --- | --- | --- |
| Pgvector | PostgreSQL + pgvector | IVFFlat and HNSW indexes; cosine, euclidean, dot_product metrics |

Graph Store

| Adapter | Backend | Features |
| --- | --- | --- |
| Neo4j | Neo4j via boltx | Multi-graph isolation, Cypher queries |

Embedders

| Adapter | Provider | Model |
| --- | --- | --- |
| Gemini | Google | text-embedding-004 (768 dims) |
| OpenAI | OpenAI | text-embedding-3-small/large |
| Ollama | Ollama | nomic-embed-text (default) |

LLMs

| Adapter | Provider | Model |
| --- | --- | --- |
| Gemini | Google | gemini-flash-lite-latest |
| Anthropic | Anthropic | Claude (SDK default) |
| OpenAI | OpenAI | gpt-4o-mini (default) |
| Codex | OpenAI | Codex SDK default |
| VLLM | vLLM | llama3 (default) |
| Ollama | Ollama | llama3.2 (default) |

Chunker

| Adapter | Strategy | Features |
| --- | --- | --- |
| Recursive | Recursive text splitting | Format-aware for 17+ languages |
| Character | Character-based | Boundary modes: word, sentence, none |
| Sentence | Sentence-based | NLP tokenization, abbreviation handling |
| Paragraph | Paragraph-based | Intelligent merge/split at boundaries |
| Semantic | Embedding similarity | Groups by semantic coherence |

Supported Formats

| Category | Formats |
| --- | --- |
| Languages | Elixir, Ruby, PHP, Python, JavaScript, TypeScript, Vue |
| Markup | Markdown, HTML, LaTeX |
| Documents | doc, docx, epub, odt, pdf, rtf |

Token-Based Chunking

All chunkers support custom size measurement via :get_chunk_size:

# Character-based (default)
Recursive.chunk(text, :elixir, %{chunk_size: 1000})

# Token-based (for LLM context limits)
Recursive.chunk(text, :elixir, %{
  chunk_size: 256,
  get_chunk_size: &MyTokenizer.count_tokens/1
})

# Byte-based (for storage limits)
Recursive.chunk(text, :plain, %{
  chunk_size: 4096,
  get_chunk_size: &byte_size/1
})
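The `MyTokenizer` module in the token-based example is a placeholder; any 1-arity function returning an integer works as `:get_chunk_size`. A minimal stand-in, assuming whitespace-delimited words are a good-enough token approximation for rough budgeting:

```elixir
# Hypothetical tokenizer stand-in: counts whitespace-delimited words.
# Swap in a real tokenizer (e.g. a BPE implementation) for accurate
# LLM context budgeting.
count_tokens = fn text -> text |> String.split() |> length() end

count_tokens.("defmodule Foo do end")
#=> 4
```

Pass it as `get_chunk_size: count_tokens` in the chunk options shown above.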

RAG Strategies

| Strategy | Description |
| --- | --- |
| Hybrid | Vector + keyword search with Reciprocal Rank Fusion |
| SelfRAG | Retrieval with self-critique and answer refinement |
| Agentic | Tool-based iterative retrieval |
| GraphRAG | Graph-aware retrieval with vector fusion |
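Reciprocal Rank Fusion, as used by the Hybrid strategy, can be sketched in a few lines: each ranked result list contributes 1 / (k + rank) per document id, and ids are re-ranked by their summed score. This is an illustrative sketch, not the strategy's internal code; k = 60 is the conventional smoothing constant.

```elixir
# Illustrative Reciprocal Rank Fusion: documents ranked highly by
# several retrievers accumulate the largest fused scores.
defmodule RRFSketch do
  @k 60

  def fuse(result_lists) do
    result_lists
    |> Enum.flat_map(fn results ->
      results
      |> Enum.with_index(1)
      |> Enum.map(fn {id, rank} -> {id, 1 / (@k + rank)} end)
    end)
    |> Enum.group_by(&elem(&1, 0), &elem(&1, 1))
    |> Enum.map(fn {id, scores} -> {id, Enum.sum(scores)} end)
    |> Enum.sort_by(&elem(&1, 1), :desc)
  end
end

# "a" tops both lists, so it wins; "c" appears in both and beats "b".
RRFSketch.fuse([["a", "b", "c"], ["a", "c", "d"]])
```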

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Portfolio Index                          │
├─────────────────────────────────────────────────────────────┤
│  Adapters                                                   │
│  ┌───────────────┐ ┌───────────────┐ ┌───────────────┐      │
│  │ Vector Store  │ │ Graph Store   │ │   Embedder    │      │
│  │ • Pgvector    │ │ • Neo4j       │ │ • Gemini      │      │
│  │               │ │               │ │ • OpenAI      │      │
│  └───────────────┘ └───────────────┘ └───────────────┘      │
│  ┌───────────────┐ ┌───────────────┐ ┌───────────────┐      │
│  │     LLM       │ │   Chunker     │ │ Document Store│      │
│  │ • Gemini      │ │ • Recursive   │ │ • Postgres    │      │
│  │ • Anthropic   │ │               │ │               │      │
│  │ • OpenAI      │ │               │ │               │      │
│  │ • Codex       │ │               │ │               │      │
│  │ • Ollama      │ │               │ │               │      │
│  │ • vLLM        │ │               │ │               │      │
│  └───────────────┘ └───────────────┘ └───────────────┘      │
├─────────────────────────────────────────────────────────────┤
│  Pipelines (Broadway)                                       │
│  ┌───────────────────────────┐ ┌───────────────────────────┐│
│  │        Ingestion          │ │        Embedding          ││
│  │ FileProducer → Chunker    │ │ ETSProducer → VectorStore ││
│  └───────────────────────────┘ └───────────────────────────┘│
├─────────────────────────────────────────────────────────────┤
│  RAG Strategies                                             │
│  ┌────────────────────┐ ┌────────────────────┐              │
│  │       Hybrid       │ │      Self-RAG      │              │
│  │ Vector + RRF fusion│ │ Critique + Refine  │              │
│  └────────────────────┘ └────────────────────┘              │
│  ┌────────────────────┐ ┌────────────────────┐              │
│  │      Agentic       │ │      GraphRAG      │              │
│  │   (placeholder)    │ │   (placeholder)    │              │
│  └────────────────────┘ └────────────────────┘              │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                     Portfolio Core                          │
│              (Port Specifications & Registry)               │
└─────────────────────────────────────────────────────────────┘

Testing

# Run unit tests (mocked adapters)
mix test

# Run integration tests (requires running services)
mix test --include integration

# Run only Neo4j integration tests
mix test test/adapters/graph_store/neo4j_test.exs --include integration

# Run only Pgvector integration tests
mix test test/adapters/vector_store/pgvector_test.exs --include integration

Test Structure

The test suite separates unit tests (mocked, fast) from integration tests (live services):

| Test Type | Tag | Services Required | Run Command |
| --- | --- | --- | --- |
| Unit | (default) | None | mix test |
| Integration | @tag :integration | Neo4j, PostgreSQL | mix test --include integration |

Integration tests are excluded by default in test/test_helper.exs:

ExUnit.start(exclude: [:integration, :skip])

Test Fixtures

test/support/fixtures.ex provides test data generators:

alias PortfolioIndex.Fixtures

# Vector fixtures
Fixtures.random_vector(768)              # Random 768-dim vector
Fixtures.random_normalized_vector(768)   # Normalized (unit length)

# Graph fixtures
Fixtures.sample_node("node_1")           # %{id, labels, properties}
Fixtures.sample_edge("from", "to")       # %{id, type, from_id, to_id, properties}
Fixtures.sample_graph(5)                 # %{nodes: [...], edges: [...]}

# Document fixtures
Fixtures.sample_document()               # Sample markdown content
Fixtures.sample_code()                   # Sample Elixir code
Fixtures.sample_chunks(content, 3)       # Split content into chunks

Neo4j Details

Schema Management

Unlike SQL databases, Neo4j doesn’t use traditional migrations. Instead, PortfolioIndex.Adapters.GraphStore.Neo4j.Schema provides schema management:

alias PortfolioIndex.Adapters.GraphStore.Neo4j.Schema

# Setup all constraints and indexes
Schema.setup!()

# Check current schema version
Schema.version()
#=> 1

# Run migrations up to a specific version
Schema.migrate!(2)

# Reset database (DANGEROUS - testing only)
Schema.reset!()

# Clean a specific graph namespace
Schema.clean_graph!("my_graph")

Schema Versioning

Schema versions are tracked in a :SchemaVersion node:

(:SchemaVersion {id: 'current', version: 1, updated_at: datetime()})

Each migration is idempotent and can be re-run safely.

Constraints and Indexes

The schema setup creates:

| Type | Name | Description |
| --- | --- | --- |
| Constraint | node_id_unique | Unique node IDs within a graph |
| Constraint | edge_id_unique | Unique edge IDs within a graph |
| Index | idx_node_graph_id | Fast graph isolation queries |
| Index | idx_node_labels | Label-based queries |
| Index | idx_fulltext_content | Full-text search on content/name/title |

Multi-Graph Isolation

All nodes and edges include a _graph_id property for namespace isolation:

# Each graph is isolated by its graph_id
Neo4j.create_graph("project_a", %{})
Neo4j.create_graph("project_b", %{})

# Nodes in different graphs don't interfere
Neo4j.create_node("project_a", %{labels: ["File"], properties: %{path: "/app.ex"}})
Neo4j.create_node("project_b", %{labels: ["File"], properties: %{path: "/app.ex"}})

# Queries are scoped to a graph
Neo4j.get_neighbors("project_a", node_id, direction: :outgoing)

The underlying Cypher uses _graph_id for isolation:

MATCH (n {id: $node_id, _graph_id: $graph_id})
RETURN n, labels(n) as labels

Custom Cypher Queries

Execute arbitrary Cypher with automatic graph_id injection:

cypher = """
MATCH (p:Person {_graph_id: $graph_id})
WHERE p.age > $min_age
RETURN p.name AS name, p.age AS age
ORDER BY p.age DESC
"""

{:ok, result} = Neo4j.query("my_graph", cypher, %{min_age: 25})
# result.records contains [%{"name" => "Alice", "age" => 30}, ...]

Both $graph_id and $_graph_id are available in queries.

Boltx Driver

This adapter uses boltx (v0.0.6+) for Neo4j connectivity:

# config/dev.exs
config :boltx, Boltx,
  uri: "bolt://localhost:7687",
  auth: [username: "neo4j", password: "password"],
  pool_size: 10,
  name: Boltx  # Required for connection pool registration

Neo4j Integration Tests

Integration tests create isolated graph namespaces per test:

defmodule MyNeo4jTest do
  use ExUnit.Case, async: true

  alias PortfolioIndex.Adapters.GraphStore.Neo4j

  describe "my feature" do
    @tag :integration
    test "creates nodes" do
      # Create unique graph for this test
      graph_id = "test_#{System.unique_integer([:positive])}"
      :ok = Neo4j.create_graph(graph_id, %{})

      # Test logic...
      {:ok, node} = Neo4j.create_node(graph_id, %{
        labels: ["Test"],
        properties: %{name: "example"}
      })

      assert is_binary(node.id)

      # Cleanup
      Neo4j.delete_graph(graph_id)
    end
  end
end

Telemetry Events

The Neo4j adapter emits telemetry events:

# Event names
[:portfolio_index, :graph_store, :create_node]
[:portfolio_index, :graph_store, :create_edge]
[:portfolio_index, :graph_store, :query]

# Measurements
%{duration_ms: 5}

# Metadata
%{graph_id: "my_graph"}

Attach handlers for observability:

:telemetry.attach(
  "neo4j-logger",
  [:portfolio_index, :graph_store, :query],
  fn _event, %{duration_ms: ms}, %{graph_id: id}, _config ->
    Logger.info("Neo4j query on #{id} took #{ms}ms")
  end,
  nil
)

Pgvector Details

PostgreSQL Setup

# Install PostgreSQL and pgvector extension
sudo apt install postgresql postgresql-contrib libpq-dev postgresql-16-pgvector

# Create database
createdb portfolio_index_dev

# Enable pgvector extension
psql -d portfolio_index_dev -c "CREATE EXTENSION IF NOT EXISTS vector;"

# Run migrations
mix ecto.migrate

Index Configuration

Create vector indexes with customizable parameters:

alias PortfolioIndex.Adapters.VectorStore.Pgvector

# Basic index with defaults
:ok = Pgvector.create_index("docs", %{dimensions: 768})

# Cosine similarity with HNSW index
:ok = Pgvector.create_index("embeddings", %{
  dimensions: 768,
  metric: :cosine,
  index_type: :hnsw,
  options: %{m: 16, ef_construction: 64}
})

# Euclidean distance with IVFFlat index
:ok = Pgvector.create_index("images", %{
  dimensions: 512,
  metric: :euclidean,
  index_type: :ivfflat,
  options: %{lists: 100}
})

Distance Metrics

| Metric | Operator | Use Case |
| --- | --- | --- |
| :cosine | <=> | Text embeddings, normalized vectors |
| :euclidean | <-> | Image embeddings, spatial data |
| :dot_product | <#> | When vectors are already normalized |
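For intuition, cosine similarity in plain Elixir (an illustrative sketch; the adapter computes this in SQL via the operators above, where `<=>` returns cosine *distance*, i.e. 1 - similarity):

```elixir
# Cosine similarity: dot product of the vectors divided by the
# product of their magnitudes. 1.0 means identical direction.
defmodule CosineSketch do
  def similarity(a, b) do
    dot = Enum.zip_with(a, b, fn x, y -> x * y end) |> Enum.sum()
    norm = fn v -> :math.sqrt(Enum.sum(Enum.map(v, &(&1 * &1)))) end
    dot / (norm.(a) * norm.(b))
  end
end

CosineSketch.similarity([1.0, 0.0], [1.0, 0.0])
#=> 1.0
```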

Index Types

| Type | Description | Best For |
| --- | --- | --- |
| :ivfflat | Inverted file index | Large datasets, good recall |
| :hnsw | Hierarchical navigable small world | Fast queries, high recall |
| :flat | No index (exact search) | Small datasets, perfect accuracy |

Vector Operations

alias PortfolioIndex.Adapters.VectorStore.Pgvector

# Store a vector with metadata
:ok = Pgvector.store("docs", "doc_1", embedding_vector, %{
  source: "/path/to/file.md",
  title: "My Document",
  chunk_index: 0
})

# Batch store (more efficient)
items = [
  {"doc_1", vector1, %{source: "/a.md"}},
  {"doc_2", vector2, %{source: "/b.md"}},
  {"doc_3", vector3, %{source: "/c.md"}}
]
{:ok, 3} = Pgvector.store_batch("docs", items)

# Search with k nearest neighbors
{:ok, results} = Pgvector.search("docs", query_vector, 10, [])
# results = [%{id: "doc_1", score: 0.95, metadata: %{...}}, ...]

# Search with metadata filter
{:ok, results} = Pgvector.search("docs", query_vector, 10,
  filter: %{source: "/a.md"}
)

# Search with minimum score threshold
{:ok, results} = Pgvector.search("docs", query_vector, 10,
  min_score: 0.8
)

# Include vectors in results
{:ok, results} = Pgvector.search("docs", query_vector, 10,
  include_vector: true
)

# Delete a vector
:ok = Pgvector.delete("docs", "doc_1")

# Get index statistics
{:ok, stats} = Pgvector.index_stats("docs")
# stats = %{count: 1000, dimensions: 768, metric: :cosine, size_bytes: ...}

# Check if index exists
Pgvector.index_exists?("docs")  # => true or false

# Delete entire index
:ok = Pgvector.delete_index("docs")

Table Structure

Each index creates a table with this schema:

CREATE TABLE vectors_<index_id> (
  id VARCHAR(255) PRIMARY KEY,
  embedding vector(<dimensions>),
  metadata JSONB DEFAULT '{}',
  created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

Index metadata is tracked in the registry:

CREATE TABLE vector_index_registry (
  index_id VARCHAR(255) PRIMARY KEY,
  dimensions INTEGER NOT NULL,
  metric VARCHAR(50) NOT NULL,
  index_type VARCHAR(50) NOT NULL,
  options JSONB DEFAULT '{}',
  created_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP DEFAULT NOW()
);

Ecto Configuration

# config/dev.exs
config :portfolio_index, PortfolioIndex.Repo,
  username: "postgres",
  password: "postgres",
  hostname: "localhost",
  database: "portfolio_index_dev",
  pool_size: 10

# config/test.exs
config :portfolio_index, PortfolioIndex.Repo,
  username: "postgres",
  password: "postgres",
  hostname: "localhost",
  database: "portfolio_index_test",
  pool: Ecto.Adapters.SQL.Sandbox

Pgvector Integration Tests

Integration tests use Ecto sandbox for isolation:

defmodule MyVectorTest do
  use ExUnit.Case, async: false

  alias PortfolioIndex.Adapters.VectorStore.Pgvector

  setup do
    pid = Ecto.Adapters.SQL.Sandbox.start_owner!(PortfolioIndex.Repo, shared: true)
    on_exit(fn -> Ecto.Adapters.SQL.Sandbox.stop_owner(pid) end)
    :ok
  end

  @tag :integration
  test "stores and searches vectors" do
    index_id = "test_#{System.unique_integer([:positive])}"
    :ok = Pgvector.create_index(index_id, %{dimensions: 768})

    vector = for _ <- 1..768, do: :rand.uniform()
    :ok = Pgvector.store(index_id, "doc_1", vector, %{})

    {:ok, results} = Pgvector.search(index_id, vector, 1, [])
    assert hd(results).id == "doc_1"

    Pgvector.delete_index(index_id)
  end
end

Telemetry Events

The Pgvector adapter emits telemetry events:

# Event names
[:portfolio_index, :vector_store, :store]
[:portfolio_index, :vector_store, :store_batch]
[:portfolio_index, :vector_store, :search]

# Measurements
%{duration_ms: 5}                    # store
%{duration_ms: 50, count: 100}       # store_batch
%{duration_ms: 10, k: 10, results: 8} # search

# Metadata
%{index_id: "my_index"}

Attach handlers for monitoring:

:telemetry.attach(
  "pgvector-logger",
  [:portfolio_index, :vector_store, :search],
  fn _event, %{duration_ms: ms, results: n}, %{index_id: id}, _config ->
    Logger.info("Search on #{id}: #{n} results in #{ms}ms")
  end,
  nil
)

Performance Tips

  1. Use HNSW for production - Better query performance than IVFFlat
  2. Batch inserts - Use store_batch/2 for bulk ingestion
  3. Tune HNSW parameters:
    • m: Higher = better recall, more memory (default: 16)
    • ef_construction: Higher = better index quality, slower build (default: 64)
  4. Use metadata filters - Reduces search space before vector comparison
  5. Set appropriate min_score - Filters low-quality matches early

Documentation

Related Packages

Acknowledgments

Significant portions of this library’s architecture and features were derived from analysis of Arcana by George Guimarães, licensed under the Apache License 2.0.

Features inspired by Arcana include:

See docs/20251230/arcana_gap_analysis/ for detailed analysis.

License

MIT License - see LICENSE for details.