Native Elixir PDF Utilities

A small, native Elixir toolkit for working with classic PDFs:

A fast, pure-Elixir tokenizer that turns a PDF byte stream into tokens.
A pragmatic PDF merger that renumbers objects, collects pages, and writes a fresh xref/trailer to produce a valid combined PDF.

No NIFs, no ports, no external binaries just Elixir. Targeted at structural tasks and best-effort merging for common PDFs.

Status

Elixir: ~> 1.18
Version: 0.1.0 (API may evolve)

Features

Tokenizer:
- Numbers (integers/reals), names (/Name with #xx hex escapes), strings (literal and hex).
- PDF keywords and punctuation: obj, endobj, stream, endstream, xref, trailer, startxref, R, <<, >>, [, ].
- Operators surfaced as {:op, word} for content streams.
- Stream handling: emits :stream then {:stream_data, binary} using /Length when present, or scans up to endstream as fallback. Provides span info via next_with_span/1.
Merger:
- Merges multiple PDF binaries: renumbers objects to avoid collisions.
- Rewrites indirect references to the new numbering.
- Builds a new Catalog and Pages tree and collects all page objects.
- Preserves stream bytes and declared /Length (direct or indirect reference hints).
- Emits a classic xref table and trailer (not an xref stream).

Installation

This project is not yet published on Hex. Add it as a local dependency or use a Git reference once public.

Local path (for development):

def deps do
  [
    {:native_elixir_pdf_utilities, path: "../native_elixir_pdf_utilities"}
  ]
end

When the package is published to Hex, it will look like:

def deps do
  [
    {:native_elixir_pdf_utilities, "~> 0.1.0"}
  ]
end

Quick Start

Interactive shell:

iex -S mix

Tokenize a PDF binary:

alias NativeElixirPdfUtilities.Tokenizer

pdf = File.read!("sample.pdf")
st = Tokenizer.new(pdf)
tokens = Tokenizer.tokenize_all(st)
IO.inspect(tokens, limit: 50)

Merge PDFs:

alias NativeElixirPdfUtilities.Merge

bins = [
  File.read!("a.pdf"),
  File.read!("b.pdf")
]

{:ok, merged} = Merge.merge(bins)
File.write!("merged.pdf", merged)

Tokenizer API

Module: NativeElixirPdfUtilities.Tokenizer

new(binary): Initialize with a PDF byte stream.
next(t): Return {token, t2} for the next token; emits {:eof, nil} at end.
peek(t): Look at the next token without advancing.
next_with_span(t): Like next/1 but also returns byte-span metadata for the token (%{from: pos, to: pos, stream_mode?: :length | :scanned | nil}).
tokenize_all(t): Tokenize all tokens into a list.
tokenize_all_with_spans(t): Tokenize all tokens with spans included.
pending_stream_length(t): If just saw :stream, returns {:direct, int} when /Length was a direct int, {:indirect, {obj, gen}} for an indirect hint, or :unknown.

Token forms include:

{:int, integer} | {:real, float} | {:name, binary} | {:string, binary}
{:hex_string, binary} | {:stream_data, binary} | {:op, binary}
:lbracket | :rbracket | :dict_start | :dict_end | :R
:obj | :endobj | :stream | :endstream | :xref | :trailer | :startxref
:null | true | false | {:eof, nil}

Notes:

Whitespace and % comments are skipped.
Literal strings support escapes (\n, \r, \t, \b, \f, \\, octal) and nested parentheses.
Hex strings <...> are decoded; odd nibble counts are padded per PDF spec.
For streams, one EOL immediately after stream is not part of stream data.

Merging PDFs

Module: NativeElixirPdfUtilities.Merge

merge([pdf_bin]) :: {:ok, pdf_bin}
- Indexes each input into objects and page ids using the tokenizer.
- Assigns non-overlapping id ranges and remaps indirect references.
- Builds a fresh Pages (object 1) and Catalog (object 2) and appends all input objects.
- Constructs a classic xref table and trailer pointing to the generated root.

Example:

{:ok, out} = NativeElixirPdfUtilities.Merge.merge([
  File.read!("doc1.pdf"),
  File.read!("doc2.pdf"),
  File.read!("doc3.pdf")
])
File.write!("merged.pdf", out)

Behavior and constraints:

Preserves all object bodies and stream bytes; does not decode or re-encode filters.
Uses /Length when the value is a direct integer. For indirect lengths, keeps the reference and scans to endstream when emitting token streams.
Expects classic PDFs (xref tables). PDFs using only xref streams or incremental updates may not be fully handled yet.
Rewrites Page dictionaries to ensure a valid tree: sets Parent, keeps or injects Resources and MediaBox when missing; builds a combined Pages tree.

Running Tests

mix test

The test suite exercises the tokenizer (numbers, names, strings, dicts, arrays, operators, streams via /Length and fallback scanning).

Roadmap / Ideas

Include more PDF utilities as I think of them, or suggested by the community.
Make it able to handle more kinds of PDF's.

License

MIT. See LICENSE.