# LangExtract

Extract structured data from text using LLMs, with every extraction grounded to exact byte positions in the source. An Elixir port of google/langextract.
```elixir
client = LangExtract.new(:claude, api_key: System.get_env("ANTHROPIC_API_KEY"))

template = %LangExtract.Prompt.Template{
  description: "Extract people and locations from the text.",
  examples: [
    %LangExtract.Prompt.ExampleData{
      text: "Hamlet is set in Denmark.",
      extractions: [
        %LangExtract.Pipeline.Extraction{class: "work", text: "Hamlet", attributes: %{"type" => "play"}},
        %LangExtract.Pipeline.Extraction{class: "location", text: "Denmark", attributes: %{}}
      ]
    }
  ]
}

{:ok, {spans, _errors}} = LangExtract.run(client, "Romeo and Juliet was written by William Shakespeare.", template)

for span <- spans do
  IO.puts("#{span.class}: \"#{span.text}\" [bytes #{span.byte_start}..#{span.byte_end}] (#{span.status})")
end
# person: "William Shakespeare" [bytes 32..51] (exact)
# work: "Romeo and Juliet" [bytes 0..16] (exact)
```
Every extraction maps back to its exact position in the source binary via
`binary_part(source, span.byte_start, span.byte_end - span.byte_start)`.
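Because the offsets count bytes, the round-trip works even when the source contains multibyte UTF-8 characters. A stdlib-only illustration (no LangExtract call involved; the span map here is hand-built):

```elixir
source = "Café société"

# "é" occupies two bytes in UTF-8, so "société" starts at byte 6
# (not character 5) and spans 9 bytes rather than 7 characters.
span = %{text: "société", byte_start: 6, byte_end: 15}

binary_part(source, span.byte_start, span.byte_end - span.byte_start)
# "société"
```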
## Installation

Add `lang_extract` to your list of dependencies in `mix.exs`:

```elixir
def deps do
  [
    {:lang_extract, "~> 0.1.0"}
  ]
end
```

LangExtract uses Req for HTTP calls. No additional adapter configuration is needed.
## Quick Start

### 1. Create a client

```elixir
client = LangExtract.new(:claude, api_key: "sk-ant-...")
```

Supported providers: `:claude`, `:openai`, `:gemini`.

Provider-specific options are passed as keyword arguments:

```elixir
# OpenAI with a specific model
client = LangExtract.new(:openai, api_key: "sk-...", model: "gpt-4o")

# Gemini
client = LangExtract.new(:gemini, api_key: "gm-...")

# OpenAI-compatible endpoint (Ollama, vLLM, etc.)
client = LangExtract.new(:openai,
  api_key: "not-needed",
  base_url: "http://localhost:11434",
  json_mode: false
)
```

### 2. Define a prompt template
The template tells the LLM what to extract. Few-shot examples teach it the output format using dynamic keys — the extraction class name becomes the YAML key, which reads naturally in context:
```elixir
template = %LangExtract.Prompt.Template{
  description: "Extract medical conditions and medications from clinical text.",
  examples: [
    %LangExtract.Prompt.ExampleData{
      text: "Patient was diagnosed with diabetes and prescribed metformin.",
      extractions: [
        %LangExtract.Pipeline.Extraction{
          class: "condition",
          text: "diabetes",
          attributes: %{"chronicity" => "chronic"}
        },
        %LangExtract.Pipeline.Extraction{
          class: "medication",
          text: "metformin",
          attributes: %{}
        }
      ]
    }
  ]
}
```

### 3. Run extraction
```elixir
source = "The patient presents with hypertension and is taking lisinopril daily."

{:ok, {spans, errors}} = LangExtract.run(client, source, template)
```

When some chunks fail to parse, the successful spans are still returned alongside the errors. Check `errors` to detect partial failures. Infrastructure failures (task exits, timeouts) return `{:error, reason}` instead.
Each span contains:

| Field | Description |
|---|---|
| `text` | The extracted text as returned by the LLM |
| `class` | Entity type (e.g., `"condition"`, `"medication"`) |
| `attributes` | Arbitrary metadata the LLM attached |
| `byte_start` | Inclusive byte offset in `source` (`nil` if not found) |
| `byte_end` | Exclusive byte offset in `source` (`nil` if not found) |
| `status` | `:exact`, `:fuzzy`, or `:not_found` |
Verify byte offsets round-trip:

```elixir
for span <- spans, span.byte_start != nil do
  extracted = binary_part(source, span.byte_start, span.byte_end - span.byte_start)
  IO.puts("#{span.class}: #{extracted}")
end
```

## Chunking
For documents that exceed LLM token limits, pass `:max_chunk_chars` to split the source into sentence-aware chunks and process them in parallel:

```elixir
{:ok, {spans, errors}} = LangExtract.run(client, long_document, template,
  max_chunk_chars: 4000,
  max_concurrency: 5
)
```

Byte offsets in the returned spans are adjusted to reference the original source, not individual chunks.
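That adjustment can be sketched in isolation with a hypothetical `shift` helper (not the library's internal API): each chunk records its byte offset into the original source, and spans aligned within the chunk are shifted back by that base.

```elixir
# Spans with a nil byte_start (not found) pass through unchanged;
# aligned spans get the chunk's base offset added to both ends.
shift = fn spans, chunk_base ->
  Enum.map(spans, fn
    %{byte_start: nil} = span -> span
    span -> %{span | byte_start: span.byte_start + chunk_base,
                     byte_end: span.byte_end + chunk_base}
  end)
end

shift.([%{text: "fox", byte_start: 16, byte_end: 19, status: :exact}], 4000)
# [%{text: "fox", byte_start: 4016, byte_end: 4019, status: :exact}]
```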
## Prompt Validation

Validate that your few-shot examples actually align with their own source text before burning LLM tokens:

```elixir
# Returns :ok or {:error, [issues]}
:ok = LangExtract.Prompt.Validator.validate(template)

# Or raise on failure
:ok = LangExtract.Prompt.Validator.validate!(template)
```

The validator reports what it finds. You decide what to do — log, raise, or ignore. There are no built-in severity levels.
## Alignment Without an LLM

If you already have extraction strings (e.g., from a different source), you can align them against source text directly:

```elixir
spans = LangExtract.align("the quick brown fox", ["quick brown", "fox"])
# [%Span{text: "quick brown", byte_start: 4, byte_end: 15, status: :exact},
#  %Span{text: "fox", byte_start: 16, byte_end: 19, status: :exact}]
```

Or parse raw LLM output and align in one step:

```elixir
yaml = "extractions:\n- class: animal\n  text: fox"
{:ok, spans} = LangExtract.extract("the quick brown fox", yaml)
```

Both the canonical format (`class`/`text`/`attributes` keys) and the dynamic-key format (`"animal": "fox"`) are accepted. Markdown fences and `<think>` tags are stripped automatically.
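The fence and `<think>` stripping can be sketched with two stdlib regexes (a simplified illustration, not the library's exact normalization):

```elixir
defmodule Cleanup do
  # Remove any <think>...</think> block, then strip a leading markdown
  # fence line and a trailing bare fence, leaving only the YAML body.
  def strip(raw) do
    raw
    |> String.replace(~r{<think>.*?</think>}s, "")
    |> String.replace(~r{^```(?:yaml)?\n|```\s*$}m, "")
    |> String.trim()
  end
end

Cleanup.strip("<think>reasoning</think>\n```yaml\nextractions:\n- class: animal\n  text: fox\n```")
# "extractions:\n- class: animal\n  text: fox"
```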
## Serialization

Convert results to plain maps for storage or interop:

```elixir
map = LangExtract.IO.to_map(source, spans)
# %{"text" => "...", "extractions" => [%{"class" => "...", "status" => "exact", ...}]}

{:ok, {source, spans}} = LangExtract.IO.from_map(map)
```

Save and load multiple results as JSONL:

```elixir
LangExtract.IO.save_jsonl([{source1, spans1}, {source2, spans2}], "results.jsonl")
{:ok, results} = LangExtract.IO.load_jsonl("results.jsonl")
```

## Provider Options
All providers accept these common options:

| Option | Default | Description |
|---|---|---|
| `:api_key` | From env var | API key (falls back to a provider-specific env var) |
| `:model` | Provider default | Model ID |
| `:max_tokens` | 4096 | Maximum response tokens |
| `:temperature` | 0 | Sampling temperature |
| `:base_url` | Provider default | API base URL |

Environment variable fallbacks: `ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, `GEMINI_API_KEY`.
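The precedence can be sketched with a hypothetical `resolve` helper (shown only to illustrate the fallback order, not the library's internals): an explicit `:api_key` option wins, otherwise the provider's env var is read.

```elixir
env_vars = %{claude: "ANTHROPIC_API_KEY", openai: "OPENAI_API_KEY", gemini: "GEMINI_API_KEY"}

# Explicit option first, env var second; nil if neither is set.
resolve = fn provider, opts ->
  Keyword.get(opts, :api_key) || System.get_env(env_vars[provider])
end

resolve.(:openai, api_key: "sk-explicit")
# "sk-explicit"
```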
Provider-specific options:

| Provider | Option | Default | Description |
|---|---|---|---|
| `:openai` | `:json_mode` | `true` | Enable JSON mode. Set to `false` for compatible endpoints that don't support it. |
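For illustration, `:json_mode` plausibly maps onto OpenAI's `response_format` request field; this sketch assumes that wire format rather than quoting the library's provider code:

```elixir
# When json_mode is true the payload gains response_format; servers that
# reject the field (some OpenAI-compatible endpoints) can be used by
# passing json_mode: false so the key is omitted entirely.
build_payload = fn model, messages, json_mode ->
  base = %{model: model, messages: messages}
  if json_mode, do: Map.put(base, :response_format, %{type: "json_object"}), else: base
end

build_payload.("gpt-4o", [], true)
# %{model: "gpt-4o", messages: [], response_format: %{type: "json_object"}}
```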
## How It Works

The pipeline has five stages:

1. **Prompt Builder** — renders the few-shot Q&A prompt with dynamic-key examples
2. **LLM Provider** — calls Claude/OpenAI/Gemini via Req
3. **Format Handler** — strips fences and `<think>` tags, normalizes dynamic keys to canonical form
4. **Parser** — validates and constructs `Extraction` structs
5. **Aligner** — maps extraction text to byte positions via Myers diff with a fuzzy fallback

The aligner uses two phases:

- **Phase 1 (Exact):** `List.myers_difference/2` on downcased word tokens. If a contiguous equal segment covers all extraction tokens, it's an exact match.
- **Phase 2 (Fuzzy):** a sliding window scored by token-frequency overlap. The window with the highest overlap ratio above `:fuzzy_threshold` (default 0.75) wins.
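Phase 1 can be tried directly with the stdlib (a simplified sketch that ignores byte positions and fuzzy fallback):

```elixir
source_tokens  = String.split(String.downcase("The quick brown fox jumps"), " ")
extract_tokens = String.split(String.downcase("quick brown"), " ")

diff = List.myers_difference(source_tokens, extract_tokens)
# [del: ["the"], eq: ["quick", "brown"], del: ["fox", "jumps"]]

# Exact match when some :eq segment is exactly the extraction's tokens.
exact? = Enum.any?(diff, fn entry -> entry == {:eq, extract_tokens} end)
```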
## Architecture

```
lib/lang_extract/
├── alignment/        # Tokenizer, Token, Aligner, Span
├── pipeline/         # FormatHandler, Parser, Extraction, ChunkError
├── prompt/           # Template, ExampleData, Builder, Validator
├── provider/         # Claude, OpenAI, Gemini implementations
├── client.ex         # Configured LLM client struct
├── orchestrator.ex   # Pipeline wiring + chunking
├── chunker.ex        # Sentence-aware text splitting
├── pipeline.ex       # Extraction pipeline public API
└── io.ex             # Serialization + JSONL
```

## Compared to the Python Original
This is an Elixir port of google/langextract. Key differences:

| | Python | Elixir |
|---|---|---|
| Codebase | ~4,000 LOC | ~1,400 LOC |
| Providers | Gemini, OpenAI, Ollama | Claude, OpenAI, Gemini |
| Offsets | Character positions | Byte positions |
| Parallelism | `ThreadPoolExecutor` | `Task.async_stream` |
| Chunking | Always-on (1000 chars) | Always-on (1000 chars, configurable) |
| Alignment statuses | 4 (exact, lesser, greater, fuzzy) | 3 (exact, fuzzy, not_found) |
| Prompt validation | Built-in severity levels | Caller decides |

Not ported: visualization (HTML output), multi-pass extraction, batch Vertex AI, the plugin system. See ROADMAP.md for planned improvements.
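The offsets row matters in practice: character positions from the Python original are not interchangeable with byte positions whenever the text contains multibyte characters.

```elixir
s = "naïve"

String.length(s)  # 5 characters
byte_size(s)      # 6 bytes ("ï" is two bytes in UTF-8)

# binary_part/3 works on byte offsets, so character-based positions
# would need conversion before extracting substrings this way.
```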
## License

See LICENSE for details.