Native Elixir PDF Utilities
A small, native Elixir toolkit for working with classic PDFs:
- A fast, pure-Elixir tokenizer that turns a PDF byte stream into tokens.
- A pragmatic PDF merger that renumbers objects, collects pages, and writes a fresh xref/trailer to produce a valid combined PDF.
No NIFs, no ports, no external binaries: just Elixir. It targets structural tasks and best-effort merging of common PDFs.
Status
- Elixir: ~> 1.18
- Version: 0.1.0 (API may evolve)
Features
- Tokenizer:
  - Numbers (integers/reals), names (/Name with #xx hex escapes), strings (literal and hex).
  - PDF keywords and punctuation: obj, endobj, stream, endstream, xref, trailer, startxref, R, <<, >>, [, ].
  - Operators surfaced as {:op, word} for content streams.
  - Stream handling: emits :stream then {:stream_data, binary} using /Length when present, or scans up to endstream as fallback. Provides span info via next_with_span/1.
- Merger:
  - Merges multiple PDF binaries: renumbers objects to avoid collisions.
  - Rewrites indirect references to the new numbering.
  - Builds a new Catalog and Pages tree and collects all page objects.
  - Preserves stream bytes and declared /Length (direct or indirect reference hints).
  - Emits a classic xref table and trailer (not an xref stream).
Installation
This project is not yet published on Hex. Add it as a local dependency or use a Git reference once public.
Local path (for development):
def deps do
  [
    {:native_elixir_pdf_utilities, path: "../native_elixir_pdf_utilities"}
  ]
end

When the package is published to Hex, it will look like:
def deps do
  [
    {:native_elixir_pdf_utilities, "~> 0.1.0"}
  ]
end

Quick Start
Interactive shell:
iex -S mix

Tokenize a PDF binary:
alias NativeElixirPdfUtilities.Tokenizer
pdf = File.read!("sample.pdf")
st = Tokenizer.new(pdf)
tokens = Tokenizer.tokenize_all(st)
IO.inspect(tokens, limit: 50)

Merge PDFs:
alias NativeElixirPdfUtilities.Merge
bins = [
  File.read!("a.pdf"),
  File.read!("b.pdf")
]
{:ok, merged} = Merge.merge(bins)
File.write!("merged.pdf", merged)

Tokenizer API
Module: NativeElixirPdfUtilities.Tokenizer
- new(binary): Initialize with a PDF byte stream.
- next(t): Return {token, t2} for the next token; emits {:eof, nil} at end.
- peek(t): Look at the next token without advancing.
- next_with_span(t): Like next/1 but also returns byte-span metadata for the token (%{from: pos, to: pos, stream_mode?: :length | :scanned | nil}).
- tokenize_all(t): Consume all remaining tokens into a list.
- tokenize_all_with_spans(t): Like tokenize_all/1, with spans included.
- pending_stream_length(t): If the tokenizer has just seen :stream, returns {:direct, int} when /Length was a direct int, {:indirect, {obj, gen}} for an indirect hint, or :unknown.
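The next/1 contract above lends itself to a simple recursive loop. The following is a minimal sketch, not part of the library: TokenWalkSketch and stub_next are hypothetical names, and the stub stands in for Tokenizer.next/1 so the snippet runs on its own. Because the exact end-of-input return shape is an assumption here, the loop accepts both a bare {:eof, nil} and {{:eof, nil}, state}.

```elixir
defmodule TokenWalkSketch do
  # Hypothetical helper: repeatedly calls next_fun (e.g. &Tokenizer.next/1)
  # and collects tokens until the end-of-input marker is seen.
  def collect(state, next_fun), do: do_collect(state, next_fun, [])

  defp do_collect(state, next_fun, acc) do
    case next_fun.(state) do
      # Assumption: end of input may be returned bare or paired with a state.
      {:eof, nil} -> Enum.reverse(acc)
      {{:eof, nil}, _state} -> Enum.reverse(acc)
      {token, next_state} -> do_collect(next_state, next_fun, [token | acc])
    end
  end
end

# Stand-in for Tokenizer.next/1 that walks a plain list of tokens.
stub_next = fn
  [] -> {{:eof, nil}, []}
  [token | rest] -> {token, rest}
end

tokens = TokenWalkSketch.collect([{:int, 1}, {:int, 0}, :obj], stub_next)
```

With the real module, the call would presumably look like TokenWalkSketch.collect(Tokenizer.new(pdf), &Tokenizer.next/1); tokenize_all/1 already covers this case, so a hand-written loop is mainly useful when you want to stop early.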
Token forms include:
- {:int, integer} | {:real, float} | {:name, binary} | {:string, binary}
- {:hex_string, binary} | {:stream_data, binary} | {:op, binary}
- :lbracket | :rbracket | :dict_start | :dict_end | :R
- :obj | :endobj | :stream | :endstream | :xref | :trailer | :startxref
- :null | true | false | {:eof, nil}
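Since tokens are plain tuples and atoms, they can be pattern matched directly. As an illustration, this sketch pulls indirect references (the "obj gen R" form) out of a token list. RefScanSketch is a hypothetical module, not the library's own reference handling, and it assumes a reference appears as two {:int, _} tokens followed by :R.

```elixir
defmodule RefScanSketch do
  # Two integers followed by :R form an indirect reference {obj, gen}.
  def refs([{:int, obj}, {:int, gen}, :R | rest]), do: [{obj, gen} | refs(rest)]
  # Anything else is skipped one token at a time.
  def refs([_other | rest]), do: refs(rest)
  def refs([]), do: []
end

# Tokens roughly corresponding to: [ 4 0 R 7 0 R ] 12
sample = [:lbracket, {:int, 4}, {:int, 0}, :R, {:int, 7}, {:int, 0}, :R, :rbracket, {:int, 12}]
found = RefScanSketch.refs(sample)
```

Advancing one token at a time on a mismatch means a run such as 1 2 3 R still resolves to the last two integers before :R.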
Notes:
- Whitespace and % comments are skipped.
- Literal strings support escapes (\n, \r, \t, \b, \f, \\, octal) and nested parentheses.
- Hex strings <...> are decoded; odd nibble counts are padded per PDF spec.
- For streams, one EOL immediately after stream is not part of stream data.
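To make the hex-string rule concrete: the PDF spec treats a missing final digit as 0, so <901FA> decodes to the bytes 90 1F A0. The sketch below shows that rule in isolation using only the standard library. HexStringSketch is a hypothetical name, and the code is an illustration of the rule, not the tokenizer's actual implementation; it takes the text between < and > as input.

```elixir
defmodule HexStringSketch do
  # PDF whitespace bytes, which are ignored inside a hex string.
  @whitespace [?\s, ?\t, ?\r, ?\n, ?\f, 0]

  # Takes the characters between "<" and ">" and returns the decoded bytes.
  def decode(hex) when is_binary(hex) do
    digits = for <<c <- hex>>, c not in @whitespace, into: "", do: <<c>>

    # An odd number of digits is padded with a trailing "0".
    padded = if rem(byte_size(digits), 2) == 1, do: digits <> "0", else: digits

    Base.decode16!(padded, case: :mixed)
  end
end

odd = HexStringSketch.decode("901FA")
hello = HexStringSketch.decode("48 65 6C6c6F")
```

Base.decode16!/2 raises on non-hex characters; a tokenizer would more likely report an error token or skip the byte, which this sketch leaves out.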
Merging PDFs
Module: NativeElixirPdfUtilities.Merge
merge([pdf_bin]) :: {:ok, pdf_bin}
- Indexes each input into objects and page ids using the tokenizer.
- Assigns non-overlapping id ranges and remaps indirect references.
- Builds a fresh Pages (object 1) and Catalog (object 2) and appends all input objects.
- Constructs a classic xref table and trailer pointing to the generated root.
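The id-range step above can be sketched as a running offset per input. This is a simplified illustration under stated assumptions, not the merger's actual code: RenumberSketch is a hypothetical module, each input is described only by its highest object number, and ids 1 and 2 are reserved for the generated Pages and Catalog objects.

```elixir
defmodule RenumberSketch do
  # Given the highest object number of each input, returns the offset to add
  # to that input's object numbers so the ranges never overlap.
  # `reserved` is the count of generated objects placed first (Pages, Catalog).
  def offsets(max_ids, reserved \\ 2) do
    {offsets, _next_free} =
      Enum.map_reduce(max_ids, reserved, fn max_id, base -> {base, base + max_id} end)

    offsets
  end

  # Remaps one indirect reference; the generation number is left unchanged
  # here (an assumption, the library may normalise it).
  def remap({obj, gen}, offset), do: {obj + offset, gen}
end

# Three inputs whose highest object numbers are 5, 3 and 4.
offsets = RenumberSketch.offsets([5, 3, 4])
moved = RenumberSketch.remap({1, 0}, Enum.at(offsets, 1))
```

With these offsets the first input occupies ids 3 to 7, the second 8 to 10, and the third 11 to 14.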
Example:
{:ok, out} = NativeElixirPdfUtilities.Merge.merge([
  File.read!("doc1.pdf"),
  File.read!("doc2.pdf"),
  File.read!("doc3.pdf")
])
File.write!("merged.pdf", out)

Behavior and constraints:
- Preserves all object bodies and stream bytes; does not decode or re-encode filters.
- Uses /Length when the value is a direct integer. For indirect lengths, keeps the reference and scans to endstream when emitting token streams.
- Expects classic PDFs (xref tables). PDFs using only xref streams or incremental updates may not be fully handled yet.
- Rewrites Page dictionaries to ensure a valid tree: sets Parent, keeps existing Resources and MediaBox or injects them when missing, and builds a combined Pages tree.
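For readers unfamiliar with the classic xref format mentioned above: each entry is exactly 20 bytes, made of a 10-digit byte offset, a 5-digit generation number, the keyword n (in use) or f (free), and a 2-byte end of line. Object 0 is always a free entry with generation 65535. The sketch below formats such a table from a list of byte offsets. XrefSketch is a hypothetical module written for illustration, and it assumes every object has generation 0 and that objects are numbered consecutively from 1.

```elixir
defmodule XrefSketch do
  # One 20-byte entry: "oooooooooo ggggg k" followed by space + LF.
  def entry(offset, gen, kind) when kind in [:n, :f] do
    pad(offset, 10) <> " " <> pad(gen, 5) <> " " <> Atom.to_string(kind) <> " \n"
  end

  # Builds a single-section table for objects 1..n at the given byte offsets.
  def table(offsets) do
    header = "xref\n0 #{length(offsets) + 1}\n"
    free_head = entry(0, 65_535, :f)
    header <> free_head <> Enum.map_join(offsets, &entry(&1, 0, :n))
  end

  defp pad(int, width), do: int |> Integer.to_string() |> String.pad_leading(width, "0")
end

single = XrefSketch.entry(17, 0, :n)
table = XrefSketch.table([9, 74])
```

A complete file would follow the table with a trailer dictionary carrying /Size and /Root, then startxref and the table's byte offset.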
Running Tests
mix test
The test suite exercises the tokenizer (numbers, names, strings, dicts, arrays,
operators, streams via /Length and fallback scanning).
Roadmap / Ideas
- Include more PDF utilities as I think of them or as the community suggests them.
- Handle more kinds of PDFs.
License
MIT. See LICENSE.