# LangExtract

Extract structured data from text using LLMs, with every extraction grounded to exact byte positions in the source. An Elixir port of google/langextract.
```elixir
client = LangExtract.new(:claude, api_key: System.get_env("ANTHROPIC_API_KEY"))

template = %LangExtract.Prompt.Template{
  description: "Extract people and locations from the text.",
  examples: [
    %LangExtract.Prompt.ExampleData{
      text: "Hamlet is set in Denmark.",
      extractions: [
        %LangExtract.Pipeline.Extraction{class: "work", text: "Hamlet", attributes: %{"type" => "play"}},
        %LangExtract.Pipeline.Extraction{class: "location", text: "Denmark", attributes: %{}}
      ]
    }
  ]
}

{:ok, {spans, _errors}} = LangExtract.run(client, "Romeo and Juliet was written by William Shakespeare.", template)

for span <- spans do
  IO.puts("#{span.class}: \"#{span.text}\" [bytes #{span.byte_start}..#{span.byte_end}] (#{span.status})")
end
# person: "William Shakespeare" [bytes 32..51] (exact)
# work: "Romeo and Juliet" [bytes 0..16] (exact)
```
Every extraction maps back to its exact position in the source binary via
`binary_part(source, span.byte_start, span.byte_end - span.byte_start)`.
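Because the offsets count bytes, the round-trip works even when the source contains multibyte UTF-8 characters. A stdlib-only illustration (no LangExtract call involved; the span map here is hand-built):

```elixir
source = "Café société"

# "é" occupies two bytes in UTF-8, so "société" starts at byte 6
# (not character 5) and spans 9 bytes rather than 7 characters.
span = %{text: "société", byte_start: 6, byte_end: 15}

binary_part(source, span.byte_start, span.byte_end - span.byte_start)
# "société"
```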
## Installation

Add `lang_extract` to your list of dependencies in `mix.exs`:

```elixir
def deps do
  [
    {:lang_extract, "~> 0.1.0"}
  ]
end
```

LangExtract uses Req for HTTP calls. No additional adapter configuration is needed.
## Quick Start

### 1. Create a client

```elixir
client = LangExtract.new(:claude, api_key: "sk-ant-...")
```

Supported providers: `:claude`, `:openai`, `:gemini`.

Provider-specific options are passed as keyword arguments:

```elixir
# OpenAI with a specific model
client = LangExtract.new(:openai, api_key: "sk-...", model: "gpt-4o")

# Gemini
client = LangExtract.new(:gemini, api_key: "gm-...")

# OpenAI-compatible endpoint (Ollama, vLLM, etc.)
client = LangExtract.new(:openai,
  api_key: "not-needed",
  base_url: "http://localhost:11434",
  json_mode: false
)
```

### 2. Define a prompt template
The template tells the LLM what to extract. Few-shot examples teach it the output format using dynamic keys — the extraction class name becomes the YAML key, which reads naturally in context:
```elixir
template = %LangExtract.Prompt.Template{
  description: "Extract medical conditions and medications from clinical text.",
  examples: [
    %LangExtract.Prompt.ExampleData{
      text: "Patient was diagnosed with diabetes and prescribed metformin.",
      extractions: [
        %LangExtract.Pipeline.Extraction{
          class: "condition",
          text: "diabetes",
          attributes: %{"chronicity" => "chronic"}
        },
        %LangExtract.Pipeline.Extraction{
          class: "medication",
          text: "metformin",
          attributes: %{}
        }
      ]
    }
  ]
}
```

### 3. Run extraction
```elixir
source = "The patient presents with hypertension and is taking lisinopril daily."

{:ok, {spans, errors}} = LangExtract.run(client, source, template)
```

When some chunks fail to parse, the successful spans are still returned alongside the errors. Check `errors` to detect partial failures. Infrastructure failures (task exits, timeouts) return `{:error, reason}` instead.
Each span contains:

| Field | Description |
|---|---|
| `text` | The extracted text as returned by the LLM |
| `class` | Entity type (e.g., `"condition"`, `"medication"`) |
| `attributes` | Arbitrary metadata the LLM attached |
| `byte_start` | Inclusive byte offset in `source` (`nil` if not found) |
| `byte_end` | Exclusive byte offset in `source` (`nil` if not found) |
| `status` | `:exact`, `:fuzzy`, or `:not_found` |
Verify byte offsets round-trip:

```elixir
for span <- spans, span.byte_start != nil do
  extracted = binary_part(source, span.byte_start, span.byte_end - span.byte_start)
  IO.puts("#{span.class}: #{extracted}")
end
```

## Chunking
For documents that exceed LLM token limits, pass `:max_chunk_chars` to split the source into sentence-aware chunks and process them in parallel:

```elixir
{:ok, {spans, errors}} = LangExtract.run(client, long_document, template,
  max_chunk_chars: 4000,
  max_concurrency: 5
)
```

Byte offsets in the returned spans are adjusted to reference the original source, not individual chunks.
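That adjustment can be sketched in isolation with a hypothetical `shift` helper (not the library's internal API): each chunk records its byte offset into the original source, and spans aligned within the chunk are shifted back by that base.

```elixir
# Spans with a nil byte_start (not found) pass through unchanged;
# aligned spans get the chunk's base offset added to both ends.
shift = fn spans, chunk_base ->
  Enum.map(spans, fn
    %{byte_start: nil} = span -> span
    span -> %{span | byte_start: span.byte_start + chunk_base,
                     byte_end: span.byte_end + chunk_base}
  end)
end

shift.([%{text: "fox", byte_start: 16, byte_end: 19, status: :exact}], 4000)
# [%{text: "fox", byte_start: 4016, byte_end: 4019, status: :exact}]
```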
## Prompt Validation

Validate that your few-shot examples actually align with their own source text before burning LLM tokens:

```elixir
# Returns :ok or {:error, [issues]}
:ok = LangExtract.Prompt.Validator.validate(template)

# Or raise on failure
:ok = LangExtract.Prompt.Validator.validate!(template)
```

The validator reports what it finds. You decide what to do — log, raise, or ignore. There are no built-in severity levels.
## Alignment Without an LLM

If you already have extraction strings (e.g., from a different source), you can align them against source text directly:

```elixir
spans = LangExtract.align("the quick brown fox", ["quick brown", "fox"])
# [%Span{text: "quick brown", byte_start: 4, byte_end: 15, status: :exact},
#  %Span{text: "fox", byte_start: 16, byte_end: 19, status: :exact}]
```

Or parse raw LLM output and align in one step:

```elixir
yaml = "extractions:\n- class: animal\n  text: fox"
{:ok, spans} = LangExtract.extract("the quick brown fox", yaml)
```

Both the canonical format (`class`/`text`/`attributes` keys) and the dynamic-key format (`"animal": "fox"`) are accepted. Markdown fences and `<think>` tags are stripped automatically.
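The fence and `<think>` stripping can be sketched with two stdlib regexes (a simplified illustration, not the library's exact normalization):

```elixir
defmodule Cleanup do
  # Remove any <think>...</think> block, then strip a leading markdown
  # fence line and a trailing bare fence, leaving only the YAML body.
  def strip(raw) do
    raw
    |> String.replace(~r{<think>.*?</think>}s, "")
    |> String.replace(~r{^```(?:yaml)?\n|```\s*$}m, "")
    |> String.trim()
  end
end

Cleanup.strip("<think>reasoning</think>\n```yaml\nextractions:\n- class: animal\n  text: fox\n```")
# "extractions:\n- class: animal\n  text: fox"
```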
## Serialization

Convert results to plain maps for storage or interop:

```elixir
map = LangExtract.IO.to_map(source, spans)
# %{"text" => "...", "extractions" => [%{"class" => "...", "status" => "exact", ...}]}

{:ok, {source, spans}} = LangExtract.IO.from_map(map)
```

Save and load multiple results as JSONL:

```elixir
LangExtract.IO.save_jsonl([{source1, spans1}, {source2, spans2}], "results.jsonl")
{:ok, results} = LangExtract.IO.load_jsonl("results.jsonl")
```

## Provider Options
All providers accept these common options:

| Option | Default | Description |
|---|---|---|
| `:api_key` | From env var | API key (falls back to a provider-specific env var) |
| `:model` | Provider default | Model ID |
| `:max_tokens` | 4096 | Maximum response tokens |
| `:temperature` | 0 | Sampling temperature |
| `:base_url` | Provider default | API base URL |

Environment variable fallbacks: `ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, `GEMINI_API_KEY`.
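The precedence can be sketched with a hypothetical `resolve` helper (shown only to illustrate the fallback order, not the library's internals): an explicit `:api_key` option wins, otherwise the provider's env var is read.

```elixir
env_vars = %{claude: "ANTHROPIC_API_KEY", openai: "OPENAI_API_KEY", gemini: "GEMINI_API_KEY"}

# Explicit option first, env var second; nil if neither is set.
resolve = fn provider, opts ->
  Keyword.get(opts, :api_key) || System.get_env(env_vars[provider])
end

resolve.(:openai, api_key: "sk-explicit")
# "sk-explicit"
```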
Provider-specific options:

| Provider | Option | Default | Description |
|---|---|---|---|
| `:openai` | `:json_mode` | `true` | Enable JSON mode. Set to `false` for compatible endpoints that don't support it. |
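For illustration, `:json_mode` plausibly maps onto OpenAI's `response_format` request field; this sketch assumes that wire format rather than quoting the library's provider code:

```elixir
# When json_mode is true the payload gains response_format; servers that
# reject the field (some OpenAI-compatible endpoints) can be used by
# passing json_mode: false so the key is omitted entirely.
build_payload = fn model, messages, json_mode ->
  base = %{model: model, messages: messages}
  if json_mode, do: Map.put(base, :response_format, %{type: "json_object"}), else: base
end

build_payload.("gpt-4o", [], true)
# %{model: "gpt-4o", messages: [], response_format: %{type: "json_object"}}
```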
## How It Works

The pipeline has five stages:

1. **Prompt Builder** — renders the few-shot Q&A prompt with dynamic-key examples
2. **LLM Provider** — calls Claude/OpenAI/Gemini via Req
3. **Format Handler** — strips fences and `<think>` tags, normalizes dynamic keys to canonical form
4. **Parser** — validates and constructs `Extraction` structs
5. **Aligner** — maps extraction text to byte positions via Myers diff with a fuzzy fallback

The aligner uses two phases:

- **Phase 1 (Exact):** `List.myers_difference/2` on downcased word tokens. If a contiguous equal segment covers all extraction tokens, it's an exact match.
- **Phase 2 (Fuzzy):** a sliding window scored by token-frequency overlap. The window with the highest overlap ratio above `:fuzzy_threshold` (default 0.75) wins.
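Phase 1 can be tried directly with the stdlib (a simplified sketch that ignores byte positions and fuzzy fallback):

```elixir
source_tokens  = String.split(String.downcase("The quick brown fox jumps"), " ")
extract_tokens = String.split(String.downcase("quick brown"), " ")

diff = List.myers_difference(source_tokens, extract_tokens)
# [del: ["the"], eq: ["quick", "brown"], del: ["fox", "jumps"]]

# Exact match when some :eq segment is exactly the extraction's tokens.
exact? = Enum.any?(diff, fn entry -> entry == {:eq, extract_tokens} end)
```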
## Architecture

```
lib/lang_extract/
├── alignment/        # Tokenizer, Token, Aligner, Span
├── pipeline/         # FormatHandler, Parser, Extraction, ChunkError
├── prompt/           # Template, ExampleData, Builder, Validator
├── provider/         # Claude, OpenAI, Gemini implementations
├── client.ex         # Configured LLM client struct
├── orchestrator.ex   # Pipeline wiring + chunking
├── chunker.ex        # Sentence-aware text splitting
├── pipeline.ex       # Extraction pipeline public API
└── io.ex             # Serialization + JSONL
```

## Compared to the Python Original
This is an Elixir port of google/langextract. Key differences:

| | Python | Elixir |
|---|---|---|
| Codebase | ~4,000 LOC | ~1,400 LOC |
| Providers | Gemini, OpenAI, Ollama | Claude, OpenAI, Gemini |
| Offsets | Character positions | Byte positions |
| Parallelism | `ThreadPoolExecutor` | `Task.async_stream` |
| Chunking | Always-on (1000 chars) | Always-on (1000 chars, configurable) |
| Alignment statuses | 4 (exact, lesser, greater, fuzzy) | 3 (exact, fuzzy, not_found) |
| Prompt validation | Built-in severity levels | Caller decides |

Not ported: visualization (HTML output), multi-pass extraction, batch Vertex AI, the plugin system. See ROADMAP.md for planned improvements.
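The offsets row matters in practice: character positions from the Python original are not interchangeable with byte positions whenever the text contains multibyte characters.

```elixir
s = "naïve"

String.length(s)  # 5 characters
byte_size(s)      # 6 bytes ("ï" is two bytes in UTF-8)

# binary_part/3 works on byte offsets, so character-based positions
# would need conversion before extracting substrings this way.
```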
## License

See LICENSE for details.