PdfEx

A pure-Elixir PDF parsing and surgery engine. No NIFs, no C bindings, no external binaries; the only runtime dependency is :telemetry.

PdfEx is built around a lossless invariant: serialize(open(bytes)) == bytes for any unmodified document, and edits are appended as PDF incremental updates so the original bytes are always a byte-for-byte prefix of the output.
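The prefix half of that invariant can be checked on raw binaries without involving the parser at all. The sketch below is only an illustration: `PrefixCheck` and `prefix?/2` are hypothetical names that are not part of PdfEx, and the sample bytes stand in for a real file.

```elixir
defmodule PrefixCheck do
  # Hypothetical helper for illustration; not part of the PdfEx API.

  @doc "Returns true when `original` is a byte-for-byte prefix of `edited`."
  def prefix?(original, edited) when is_binary(original) and is_binary(edited) do
    size = byte_size(original)

    # The size guard short-circuits, so binary_part/3 never reads past `edited`.
    byte_size(edited) >= size and binary_part(edited, 0, size) == original
  end
end

# Stand-in bytes: an "original" document and the same bytes with data appended,
# which is the shape an incremental update is described as producing.
original = "%PDF-1.7\nstand-in body\n%%EOF\n"
edited = original <> "stand-in incremental update\n%%EOF\n"

IO.inspect(PrefixCheck.prefix?(original, edited))
```

With real data, `original` would be the bytes read from disk and `edited` the result of serializing the edited document.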

What it does

original = File.read!("report.pdf")
{:ok, doc} = PdfEx.open(original)
{:ok, n} = PdfEx.page_count(doc)
{:ok, text} = PdfEx.extract_text(doc)

# Structural surgery: lossless, incremental
{:ok, doc} = PdfEx.Editor.delete_page(doc, 2)
edited_bytes = PdfEx.Serializer.serialize(doc)
# byte_size(edited_bytes) > byte_size(original); original is a prefix of edited_bytes

# Rewrite the text of a run (addressed by a stable glyph UID)
{:ok, doc} = PdfEx.ContentEdit.replace_text(doc, "p_3_g_0", "Revised heading")

# Semantic HTML, with data-uid back-references for round-tripping edits
{:ok, html} = PdfEx.Convert.to_html(doc, mode: :semantic)

# Collaborative session: reads bypass the server; writes are OT-coordinated
{:ok, id} = PdfEx.Session.open(doc)
{:ok, _op} = PdfEx.Session.apply_op(id, %PdfEx.Op.UpdateText{uid: "p_3_g_0", text: "Hi"})
{:ok, doc} = PdfEx.Session.fetch(id)

Design

Current limitations (0.1.x)

Installation

def deps do
  [
    {:pdf_ex, "~> 0.1.0"}
  ]
end

Documentation

Generate the docs locally with ExDoc:

mix docs # writes HTML to doc/

Testing

mix test # unit + integration suite
mix test --include corpus # also sweep real PDFs in test/fixtures/corpus/
mix dialyzer # static analysis

The corpus sweep asserts the library's hard invariants against any PDFs you drop into test/fixtures/corpus/ (gitignored): open never raises, unmutated round-trips are byte-identical, and incremental edits re-parse.
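As a rough illustration of the round-trip part of that sweep, the sketch below walks a directory and lists files whose serialized form differs from the bytes on disk. It is not the project's actual test code: `CorpusSweepSketch` and `failures/3` are hypothetical names, the open and serialize steps are passed in as functions so the sketch does not depend on exact library signatures, and treating any non-`{:ok, doc}` result as "did not raise, so not a failure" is an assumption.

```elixir
defmodule CorpusSweepSketch do
  # Hypothetical helper for illustration; not PdfEx's real corpus test.

  @doc """
  Returns the paths of `*.pdf` files under `dir` whose unmodified round-trip
  is not byte-identical. `open` maps bytes to `{:ok, doc}` or another term;
  `serialize` maps a doc back to bytes.
  """
  def failures(dir, open, serialize) do
    dir
    |> Path.join("*.pdf")
    |> Path.wildcard()
    |> Enum.reject(fn path ->
      bytes = File.read!(path)

      case open.(bytes) do
        # Opened cleanly: the file passes only if the round-trip is byte-identical.
        {:ok, doc} -> serialize.(doc) == bytes
        # Assumption: a non-ok result means the file was rejected without raising.
        _other -> true
      end
    end)
  end
end
```

Assuming the calls shown earlier in this README, a sweep might be invoked as `CorpusSweepSketch.failures("test/fixtures/corpus", &PdfEx.open/1, &PdfEx.Serializer.serialize/1)`, with an empty list meaning every file round-tripped.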

License

MIT — see LICENSE.