ExtractousEx



Fast and comprehensive document text extraction for Elixir.

ExtractousEx is an Elixir library for extracting text and metadata from various document formats using the powerful Extractous Rust library.

Features

Installation

Add extractous_ex to your list of dependencies in mix.exs:

def deps do
  [
    {:extractous_ex, "~> 0.1.0"}
  ]
end

Usage

Basic Text Extraction

ExtractousEx provides three main extraction methods:

# Extract from a file on disk
{:ok, result} = ExtractousEx.extract_from_file("document.pdf")

# Extract from binary data in memory
{:ok, data} = File.read("document.pdf")
{:ok, result} = ExtractousEx.extract_from_bytes(data)

# Extract from a URL
{:ok, result} = ExtractousEx.extract_from_url("https://example.com/document.pdf")

# Access extracted content and metadata
IO.puts(result.content)
# => "Document text content..."

IO.inspect(result.metadata)
# => %{"author" => "John Doe", "title" => "My Document", ...}

Options

All extraction methods support the same options:

# Extract with XML structure preserved
{:ok, result} = ExtractousEx.extract_from_file("document.html", xml: true)

# Limit extracted text length (default: 500,000 characters)
{:ok, result} = ExtractousEx.extract_from_bytes(data, max_length: 100_000)

# Specify encoding (UTF-8, UTF-16BE, US-ASCII)
{:ok, result} = ExtractousEx.extract_from_url(url, encoding: "UTF-16BE")

# Combine options
{:ok, result} = ExtractousEx.extract_from_file("doc.pdf",
  xml: true,
  max_length: 50_000,
  encoding: "UTF-8"
)
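If the same options are used throughout a project, one way to avoid repeating them is a thin wrapper that merges per-call overrides over project defaults. The following is a minimal sketch, not part of ExtractousEx: the module name MyApp.Extraction, the default values, and the injectable extractor argument are all hypothetical. Only the option names (max_length, encoding, xml) and extract_from_file come from the examples above.

```elixir
defmodule MyApp.Extraction do
  # Hypothetical project-wide defaults. The option names are taken from the
  # README examples; the values here are arbitrary assumptions.
  @defaults [max_length: 100_000, encoding: "UTF-8"]

  # Merge caller overrides over the defaults. With Keyword.merge/2,
  # values from the second list win on duplicate keys.
  def build_opts(overrides \\ []) do
    Keyword.merge(@defaults, overrides)
  end

  # The extractor is passed in as a function so the wrapper can be
  # exercised with a stub; by default it calls ExtractousEx.
  def extract_file(path, overrides \\ [], extractor \\ &ExtractousEx.extract_from_file/2) do
    extractor.(path, build_opts(overrides))
  end
end
```

A call such as MyApp.Extraction.extract_file("doc.pdf", xml: true) would then pass xml: true together with the default max_length and encoding.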

Error Handling

# Standard tuple return for all methods
case ExtractousEx.extract_from_file("nonexistent.pdf") do
  {:ok, result} ->
    IO.puts("Content: #{result.content}")
  {:error, reason} ->
    # inspect/1 keeps this safe if reason is not a plain string
    IO.puts("Failed to extract: #{inspect(reason)}")
end

# Bang versions available for all methods (raise on error)
result = ExtractousEx.extract_from_file!("document.pdf")
result = ExtractousEx.extract_from_bytes!(data)
result = ExtractousEx.extract_from_url!(url)
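The tuple-returning functions also make it straightforward to process many files and keep going when some of them fail. The sketch below uses Task.async_stream from the Elixir standard library to extract several files concurrently and sort the outcomes into successes and failures. The module name MyApp.BatchExtract, the concurrency and timeout values, and the injectable extractor argument are assumptions for illustration, not part of ExtractousEx.

```elixir
defmodule MyApp.BatchExtract do
  # Extract every path concurrently and split the outcomes into
  # %{ok: %{path => content}, errors: %{path => reason}}.
  # The extractor defaults to ExtractousEx.extract_from_file/1 but can be
  # replaced with any function returning {:ok, result} or {:error, reason}.
  def run(paths, extractor \\ &ExtractousEx.extract_from_file/1) do
    results =
      Task.async_stream(paths, extractor,
        max_concurrency: 4,
        timeout: 60_000,
        on_timeout: :kill_task
      )

    # async_stream is ordered by default, so results line up with paths.
    paths
    |> Enum.zip(results)
    |> Enum.reduce(%{ok: %{}, errors: %{}}, fn
      {path, {:ok, {:ok, result}}}, acc ->
        put_in(acc, [:ok, path], result.content)

      {path, {:ok, {:error, reason}}}, acc ->
        put_in(acc, [:errors, path], reason)

      # A task that timed out or crashed is reported as {:exit, reason}.
      {path, {:exit, reason}}, acc ->
        put_in(acc, [:errors, path], {:exit, reason})
    end)
  end
end
```

Calling MyApp.BatchExtract.run(["a.pdf", "b.docx"]) would return a map whose :ok entry holds the extracted text per path and whose :errors entry holds the failure reason per path.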

Supported Formats

ExtractousEx supports a wide variety of document formats including:

Configuration

Force Building from Source

If you need to build from source instead of using precompiled binaries:

export EXTRACTOUS_EX_BUILD=1
mix deps.compile extractous_ex

Development Setup

For development or when precompiled binaries aren't available:

# In config/config.exs
config :rustler_precompiled, :force_build, extractous_ex: true

Performance

ExtractousEx leverages the Extractous Rust library, which provides:

Building and Releasing

This project uses GitHub Actions to build precompiled binaries for multiple platforms. When you push a version tag (e.g., v0.1.0), the workflow will:

  1. Build binaries for macOS, Linux, and Windows
  2. Create a GitHub release with the precompiled .tar.gz files

Users who install the package then automatically download these precompiled binaries instead of compiling the Rust code themselves.

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Run tests: mix test
  5. Submit a pull request

License

This project is licensed under the MIT License. Extractous, which this library depends on, is licensed under the Apache License 2.0. See the LICENSE file for details.

Foundation

The library is built on top of: