ExtractousEx

Fast and comprehensive document text extraction for Elixir.

ExtractousEx is an Elixir library for extracting text and metadata from various document formats using the powerful Extractous Rust library.

Features

High Performance: Built on Rust with precompiled binaries for fast extraction
Multiple Formats: Supports PDF, Microsoft Office documents, HTML, plain text, CSV, JSON, Markdown, and many more
Metadata Extraction: Extracts document metadata alongside text content
XML Output: Optional structured XML output for preserving document formatting
Cross-Platform: Precompiled binaries for macOS (arm64/x64), Linux (arm64/x64), and Windows (x64)
Elixir Native: Idiomatic Elixir API with proper error handling

Installation

Add extractous_ex to your list of dependencies in mix.exs:

def deps do
  [
    {:extractous_ex, "~> 0.1.0"}
  ]
end

Usage

Basic Text Extraction

ExtractousEx provides three main extraction functions:

# Extract from a file on disk
{:ok, result} = ExtractousEx.extract_from_file("document.pdf")

# Extract from binary data in memory
{:ok, data} = File.read("document.pdf")
{:ok, result} = ExtractousEx.extract_from_bytes(data)

# Extract from a URL
{:ok, result} = ExtractousEx.extract_from_url("https://example.com/document.pdf")

# Access extracted content and metadata
IO.puts(result.content)
# => "Document text content..."

IO.inspect(result.metadata)
# => %{"author" => "John Doe", "title" => "My Document", ...}
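Because the result exposes `content` and `metadata`, it can be passed straight into ordinary Elixir functions. The following is a minimal sketch, not part of ExtractousEx: `ResultSummary` is a hypothetical helper, and it assumes the metadata uses string keys such as `"title"` and `"author"` with plain string values, as in the sample output above. Real metadata keys and value shapes may differ by document format.

```elixir
defmodule ResultSummary do
  @moduledoc "Illustrative helper; not part of ExtractousEx."

  # Builds a small summary map from anything that has `content` and `metadata`.
  # Missing metadata keys fall back to placeholder strings.
  def summarize(%{content: content, metadata: metadata}) do
    %{
      title: Map.get(metadata, "title", "(untitled)"),
      author: Map.get(metadata, "author", "(unknown)"),
      word_count: content |> String.split() |> length(),
      preview: String.slice(content, 0, 40)
    }
  end
end

# Stand-in for a real extraction result (assumed shape).
result = %{
  content: "Document text content for the example.",
  metadata: %{"author" => "John Doe", "title" => "My Document"}
}

IO.inspect(ResultSummary.summarize(result))
```

With a real extraction, `result` would come from one of the `extract_from_*` calls instead of the hand-written map.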

Options

All extraction functions accept the same options:

# Extract with XML structure preserved
{:ok, result} = ExtractousEx.extract_from_file("document.html", xml: true)

# Limit extracted text length (default: 500,000 characters)
{:ok, result} = ExtractousEx.extract_from_bytes(data, max_length: 100_000)

# Specify encoding (UTF-8, UTF-16BE, US-ASCII)
{:ok, result} = ExtractousEx.extract_from_url(url, encoding: "UTF-16BE")

# Combine options
{:ok, result} = ExtractousEx.extract_from_file("doc.pdf",
  xml: true,
  max_length: 50_000,
  encoding: "UTF-8"
)
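When options come from user input or configuration, it can help to check them before calling the library. The sketch below is one possible approach under stated assumptions: `ExtractOptions` is a hypothetical module, and the accepted encodings are simply the three named in this README. ExtractousEx itself may accept other values or validate differently.

```elixir
defmodule ExtractOptions do
  @moduledoc "Hypothetical option validator; not part of ExtractousEx."

  # Encodings named in this README (assumed to be the full list).
  @encodings ["UTF-8", "UTF-16BE", "US-ASCII"]

  # Returns {:ok, opts} when every option is valid,
  # or {:error, {:invalid_option, key, value}} for the first bad one.
  def build(opts) when is_list(opts) do
    opts
    |> Enum.reduce_while({:ok, []}, fn pair, {:ok, acc} ->
      case validate(pair) do
        :ok -> {:cont, {:ok, [pair | acc]}}
        {:error, _} = err -> {:halt, err}
      end
    end)
    |> case do
      {:ok, acc} -> {:ok, Enum.reverse(acc)}
      err -> err
    end
  end

  defp validate({:xml, v}) when is_boolean(v), do: :ok
  defp validate({:max_length, n}) when is_integer(n) and n > 0, do: :ok
  defp validate({:encoding, e}) when e in @encodings, do: :ok
  defp validate({key, value}), do: {:error, {:invalid_option, key, value}}
end

IO.inspect(ExtractOptions.build(xml: true, max_length: 50_000, encoding: "UTF-8"))
IO.inspect(ExtractOptions.build(encoding: "latin1"))
```

The validated keyword list can then be passed as the second argument to any of the extraction functions.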

Error Handling

# All functions return {:ok, result} or {:error, reason}
case ExtractousEx.extract_from_file("nonexistent.pdf") do
  {:ok, result} ->
    IO.puts("Content: #{result.content}")

  {:error, reason} ->
    # inspect/1 is safe even if the reason is not a string
    IO.puts("Failed to extract: #{inspect(reason)}")
end

# Bang variants of all three functions raise on error instead of returning a tuple
result = ExtractousEx.extract_from_file!("document.pdf")
result = ExtractousEx.extract_from_bytes!(data)
result = ExtractousEx.extract_from_url!(url)
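The tuple-returning functions make it straightforward to process many documents without one failure aborting the rest. The sketch below shows one way to do this. `BatchExtract` is a hypothetical helper, and the extractor is passed in as a function so the example runs without the library installed; in real use it would be something like `&ExtractousEx.extract_from_file/1`. The stub's `:enoent` reason is only a placeholder, not the error shape ExtractousEx actually returns.

```elixir
defmodule BatchExtract do
  @moduledoc "Illustrative helper; not part of ExtractousEx."

  # `extract_fun` is any 1-arity function that returns
  # {:ok, result} or {:error, reason} for a given path.
  # Returns a map separating successes from failures, keyed by path.
  def run(paths, extract_fun) when is_function(extract_fun, 1) do
    Enum.reduce(paths, %{ok: %{}, failed: %{}}, fn path, acc ->
      case extract_fun.(path) do
        {:ok, result} -> put_in(acc, [:ok, path], result)
        {:error, reason} -> put_in(acc, [:failed, path], reason)
      end
    end)
  end
end

# Stub extractor standing in for the real library call.
stub = fn
  "missing.pdf" -> {:error, :enoent}
  path -> {:ok, %{content: "text of " <> path, metadata: %{}}}
end

report = BatchExtract.run(["a.pdf", "missing.pdf"], stub)
IO.inspect(report.failed)
```

Collecting failures by path rather than raising keeps the bang variants for cases where a missing or unreadable document really should stop the program.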

Supported Formats

ExtractousEx supports a wide variety of document formats including:

PDF: Portable Document Format
Microsoft Office: Word (.docx, .doc), Excel (.xlsx, .xls), PowerPoint (.pptx, .ppt)
OpenDocument: Writer (.odt), Calc (.ods), Impress (.odp)
Web: HTML, XML
Text: Plain text (.txt), Markdown (.md), CSV
E-books: EPUB
Images: With OCR support (when available)
Archives: ZIP, TAR (extracts contained documents)
Email: EML, MSG formats
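For pipelines that want to skip obviously unsupported files before extraction, a simple extension lookup can act as a pre-filter. This is a sketch only: `FormatGuess` is a hypothetical module, the categories mirror the list above, and the extensions for HTML, XML, CSV, EPUB, ZIP, TAR, EML, and MSG are conventional guesses rather than anything this README specifies. It says nothing about how the library itself detects formats.

```elixir
defmodule FormatGuess do
  @moduledoc "Illustrative extension lookup; not part of ExtractousEx."

  # Extension-to-category table based on the format list in this README.
  # Entries marked (assumed) use conventional extensions not spelled out there.
  @categories %{
    ".pdf" => :pdf,
    ".docx" => :office, ".doc" => :office,
    ".xlsx" => :office, ".xls" => :office,
    ".pptx" => :office, ".ppt" => :office,
    ".odt" => :opendocument, ".ods" => :opendocument, ".odp" => :opendocument,
    ".html" => :web, ".xml" => :web,        # (assumed)
    ".txt" => :text, ".md" => :text,
    ".csv" => :text,                        # (assumed)
    ".epub" => :ebook,                      # (assumed)
    ".zip" => :archive, ".tar" => :archive, # (assumed)
    ".eml" => :email, ".msg" => :email      # (assumed)
  }

  # Returns the category atom for a path, or :unknown.
  # Matching is case-insensitive on the final extension only.
  def category(path) do
    ext = path |> Path.extname() |> String.downcase()
    Map.get(@categories, ext, :unknown)
  end
end

IO.inspect(FormatGuess.category("Report.DOCX"))
IO.inspect(FormatGuess.category("notes"))
```

Files that map to `:unknown` are not necessarily unsupported, so a caller might still attempt extraction and rely on the `{:error, reason}` return.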

Configuration

Force Building from Source

If you need to build from source instead of using precompiled binaries:

export EXTRACTOUS_EX_BUILD=1
mix deps.compile extractous_ex

Development Setup

For development or when precompiled binaries aren't available:

# In config/config.exs
config :rustler_precompiled, :force_build, extractous_ex: true

Performance

ExtractousEx leverages the Extractous Rust library. According to the Extractous project's own benchmarks, it provides:

~18x faster extraction compared to unstructured-io
~11x lower memory consumption than unstructured-io
High-quality content extraction across formats

Building and Releasing

This project uses GitHub Actions to automatically build precompiled binaries for multiple platforms. When you push a version tag (e.g., v0.1.0), the workflow will:

Build binaries for macOS, Linux, and Windows
Create a GitHub release with the precompiled .tar.gz files
Make those binaries available so that users download them automatically instead of compiling the Rust code themselves

Contributing

Fork the repository
Create a feature branch
Make your changes
Run tests: mix test
Submit a pull request

License

This project is licensed under the MIT License. Extractous, which this library depends on, is licensed under the Apache License 2.0. See the LICENSE file for details.

Foundation

The library is built on top of:

Extractous - a fast Rust library for document text extraction
Rustler for seamless Elixir-Rust integration
rustler_precompiled for cross-platform binary distribution