html-to-markdown
Elixir bindings for the Rust html-to-markdown engine. The package exposes a fast HTML-to-Markdown converter as a Rustler NIF, so you get the same Markdown as every other runtime at native speed.
What This Package Provides
- Same renderer as every binding — output matches Rust, Python, Node.js, Ruby, PHP, Go, Java, .NET, Elixir, R, Dart, Swift, Zig, C FFI, and WASM.
- Structured conversion result — Markdown plus metadata, links, headings, images, tables, and warnings where the binding exposes them.
- Production defaults — HTML is parsed with the Rust core, sanitized by default, and rendered without runtime-specific Markdown drift.
- BEAM package — Rustler NIF binding that keeps conversion inside the VM boundary.
Installation
Add {:html_to_markdown, "~> 3.0"} to mix.exs deps
Requires Elixir 1.19+ and OTP 28. Add to your mix.exs:
def deps do
[
{:html_to_markdown, "~> 3.6.0-rc.2"}
]
end
Performance Snapshot
Apple M4 · convert() · Real Wikipedia documents
| Document | Size | Latency | Throughput |
|---|---|---|---|
| Lists (Timeline) | 129KB | | 321.7 MB/s |
| Tables (Countries) | 360KB | | 293.8 MB/s |
| Medium (Python) | 656KB | | 281.5 MB/s |
| Large (Rust) | 567KB | | 268.7 MB/s |
| Small (Intro) | 463KB | | 262.9 MB/s |
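Size and throughput together imply a per-document latency. The sketch below derives it, assuming decimal units (1 MB = 1000 KB); the module name BenchMath is hypothetical and the results are arithmetic estimates, not measured figures.

```elixir
defmodule BenchMath do
  # Implied latency in milliseconds: KB divided by MB/s yields ms
  # when 1 MB = 1000 KB (an assumption about the table's units).
  def implied_latency_ms(size_kb, throughput_mb_per_s) do
    size_kb / throughput_mb_per_s
  end
end

# Lists (Timeline): 129 KB at 321.7 MB/s
BenchMath.implied_latency_ms(129, 321.7) |> Float.round(2) |> IO.inspect()
```

By this estimate every document in the table converts in a few milliseconds at most.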
Quick Start
Basic conversion:
{:ok, result} = HtmlToMarkdown.convert("<h1>Hello</h1><p>This is <strong>fast</strong>!</p>")
IO.puts(result.content)
With conversion options:
opts = %HtmlToMarkdown.Options{wrap: true, wrap_width: 40}
{:ok, result} = HtmlToMarkdown.convert("<h1>Hello</h1><p>World</p>", opts)
IO.puts(result.content)
Architecture
The converter routes each input through one of three tiers based on a fast prescan of the byte stream:
- Tier-1 — single-pass byte scanner. Handles 110+ HTML tags directly. Bails on any construct it cannot prove byte-equivalent to Tier-2.
- Tier-2 — DOM walker. Picks up Tier-1 bails and inputs the classifier rejected up front.
- Tier-3 — standards-conformant parser. Engaged for malformed HTML requiring full HTML5 repair.
The dispatcher is invisible to the caller. Output is byte-identical across tiers — enforced by a 116-snapshot oracle.
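The routing described above can be pictured with a toy dispatcher. This is only a sketch of the prescan, bail, and fallback idea: the module TierSketch, its classification rules, and the tag-stripping "renderer" are all hypothetical stand-ins and say nothing about how the Rust core is actually written.

```elixir
defmodule TierSketch do
  @moduledoc "Toy tiered dispatcher; illustrative only, not the library's code."

  # Returns {tier_used, output} so the routing decision is visible.
  def convert(html) do
    case classify(html) do
      :tier1 ->
        case tier1(html) do
          {:ok, out} -> {:tier1, out}
          :bail -> {:tier2, tier2(html)}
        end

      :tier2 ->
        {:tier2, tier2(html)}

      :tier3 ->
        {:tier3, tier3(html)}
    end
  end

  # Prescan: unbalanced angle brackets count as malformed; tables skip tier 1.
  defp classify(html) do
    cond do
      count(html, "<") != count(html, ">") -> :tier3
      String.contains?(html, "<table") -> :tier2
      true -> :tier1
    end
  end

  # Fast path: bails on entities, which this toy scanner does not handle.
  defp tier1(html) do
    if String.contains?(html, "&"), do: :bail, else: {:ok, render(html)}
  end

  defp tier2(html), do: render(html)

  # Toy repair step: close one dangling tag, then render as usual.
  defp tier3(html), do: render(html <> ">")

  # Shared toy renderer: drop tags, keep text.
  defp render(html), do: String.replace(html, ~r/<[^>]*>/, "")

  defp count(html, char) do
    html |> String.graphemes() |> Enum.count(&(&1 == char))
  end
end
```

Because every tier ends in the same render step, the output does not depend on which tier handled the input, which is the property the snapshot oracle is said to enforce.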
Capabilities
- 16 languages, one Rust core. Rust, Python, Node.js, WASM, Java, Go, C#, PHP, Ruby, Elixir, R, Dart, Kotlin (Android), Swift, Zig, C ABI.
- CommonMark-compatible Markdown with GFM-style tables.
- Djot output: set output_format = "djot" (see Djot Output Format section below).
- Real-HTML robust: unclosed tags, CDATA, custom elements, malformed entities, nested tables, mixed encodings handled without losing content.
- Metadata extraction, visitor API, inline images, configurable preprocessing presets.
- Per-group regression gates in CI: every PR runs the bench harness against per-group thresholds.
API Reference
Core Function
HtmlToMarkdown.convert(html, options \\ nil) :: {:ok, ConversionResult.t()} | {:error, term()}
Converts HTML to Markdown. Returns {:ok, result}, where result is a ConversionResult struct that carries every output of the conversion from a single call, or {:error, reason} on failure.
{:ok, result} = HtmlToMarkdown.convert(html)
result.content # Converted Markdown string
result.metadata # Metadata map (when extract_metadata: true)
result.tables # Table data list (when extract_tables: true)
result.document # Document-level info
result.images # Extracted images
result.warnings # Any conversion warnings
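Since convert/2 returns a tagged tuple, callers usually branch on both shapes. The following is a minimal sketch of that pattern; the module SafeConvert and the injectable convert argument are hypothetical, added only so the function can be exercised without the NIF loaded.

```elixir
defmodule SafeConvert do
  # Returns the Markdown string, or `fallback` when conversion fails.
  # `convert` defaults to the documented entry point and is only
  # resolved when the function is actually called.
  def markdown_or(html, fallback, convert \\ &HtmlToMarkdown.convert/1) do
    case convert.(html) do
      {:ok, result} -> result.content
      {:error, _reason} -> fallback
    end
  end
end
```

In application code you would call SafeConvert.markdown_or(html, "") and let the default converter do the work.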
Options
ConversionOptions – Key configuration fields:
- heading_style: Heading format ("underlined" | "atx" | "atx_closed"). Default: "underlined"
- list_indent_width: Spaces per indent level. Default: 2
- bullets: Bullet characters cycle. Default: "*+-"
- wrap: Enable text wrapping. Default: false
- wrap_width: Wrap at column. Default: 80
- code_language: Default fenced code block language. Default: none
- extract_metadata: Enable metadata extraction into result.metadata. Default: false
- extract_tables: Enable structured table extraction into result.tables. Default: false
- output_format: Output markup format ("markdown" | "djot" | "plain"). Default: "markdown"
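If option values come from user input or configuration files, it can help to check them against the documented value sets before calling convert. The helper below is a hypothetical pre-flight check, not part of the library; it only encodes the values and defaults listed above.

```elixir
defmodule OptionsCheck do
  @heading_styles ~w(underlined atx atx_closed)
  @output_formats ~w(markdown djot plain)

  # Returns :ok or {:error, message} for a map of option overrides.
  # Missing keys fall back to the documented defaults.
  def validate(opts) when is_map(opts) do
    cond do
      Map.get(opts, :heading_style, "underlined") not in @heading_styles ->
        {:error, "unknown heading_style"}

      Map.get(opts, :output_format, "markdown") not in @output_formats ->
        {:error, "unknown output_format"}

      not valid_width?(Map.get(opts, :wrap_width, 80)) ->
        {:error, "wrap_width must be a positive integer"}

      true ->
        :ok
    end
  end

  defp valid_width?(width), do: is_integer(width) and width > 0
end
```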
Djot Output Format
The library supports converting HTML to Djot, a lightweight markup language similar to Markdown but with a different syntax for some elements. Set output_format to "djot" to use this format.
Syntax Differences
| Element | Markdown | Djot |
|---|---|---|
| Strong | **text** | *text* |
| Emphasis | *text* | _text_ |
| Strikethrough | ~~text~~ | {-text-} |
| Inserted/Added | N/A | {+text+} |
| Highlighted | N/A | {=text=} |
| Subscript | N/A | ~text~ |
| Superscript | N/A | ^text^ |
Example Usage
html = "<p>This is <strong>bold</strong> and <em>italic</em> text.</p>"
# Default Markdown output
{:ok, result} = HtmlToMarkdown.convert(html)
# result.content: "This is **bold** and *italic* text."
# Djot output
{:ok, result} = HtmlToMarkdown.convert(html, %{output_format: "djot"})
# result.content: "This is *bold* and _italic_ text."
Djot's extended syntax allows you to express more semantic meaning in lightweight text, making it useful for documents that require strikethrough, insertion tracking, or mathematical notation.
Plain Text Output
Set output_format to "plain" to strip all markup and return only visible text. This bypasses the Markdown conversion pipeline entirely for maximum speed.
html = "<h1>Title</h1><p>This is <strong>bold</strong> and <em>italic</em> text.</p>"
{:ok, result} = HtmlToMarkdown.convert(html, %{output_format: "plain"})
# result.content: "Title\n\nThis is bold and italic text."
Plain text mode is useful for search indexing, text extraction, and feeding content to LLMs.
Metadata Extraction
The metadata extraction feature enables comprehensive document analysis during conversion. Extract document properties, headers, links, images, and structured data in a single pass — all via the standard convert() function.
Use Cases:
- SEO analysis – Extract title, description, Open Graph tags, Twitter cards
- Table of contents generation – Build structured outlines from heading hierarchy
- Content migration – Document all external links and resources
- Accessibility audits – Check for images without alt text, empty links, invalid heading hierarchy
- Link validation – Classify and validate anchor, internal, external, email, and phone links
Zero Overhead When Disabled: Metadata extraction is off by default and costs nothing until you enable it. When enabled, it runs during the HTML parsing pass and adds negligible overhead. Pass extract_metadata: true in ConversionOptions to enable it; the result is available at result.metadata.
Example: Quick Start
html = "<h1>Article</h1><img src=\"test.jpg\" alt=\"test\">"
opts = %HtmlToMarkdown.Options{extract_metadata: true}
{:ok, result} = HtmlToMarkdown.convert(html, opts)
IO.puts(result.content) # Converted Markdown
IO.inspect(result.metadata["document"]["title"]) # Document title
IO.inspect(result.metadata["headers"]) # All h1-h6 elements
IO.inspect(result.metadata["links"]) # All hyperlinks
IO.inspect(result.metadata["images"]) # All images with alt text
IO.inspect(result.metadata["structured_data"]) # JSON-LD, Microdata, RDFa
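Two of the use cases above, table-of-contents generation and alt-text audits, reduce to small transformations over the extracted lists. The sketch below assumes each header is a map with "level" and "text" keys and each image a map with "src" and "alt" keys; those shapes and the module name MetadataReport are assumptions, so inspect result.metadata for the real structure before relying on them.

```elixir
defmodule MetadataReport do
  # Assumed header shape: %{"level" => 1..6, "text" => String.t()}.
  # Builds a nested Markdown bullet list, two spaces per heading level.
  def toc(headers) do
    Enum.map_join(headers, "\n", fn %{"level" => level, "text" => text} ->
      String.duplicate("  ", level - 1) <> "- " <> text
    end)
  end

  # Assumed image shape: %{"src" => String.t(), "alt" => String.t() | nil}.
  # Returns the sources of images whose alt text is missing or empty.
  def missing_alt(images) do
    for %{"src" => src} = image <- images, image["alt"] in [nil, ""], do: src
  end
end
```

With the assumed shapes, MetadataReport.toc(result.metadata["headers"]) would produce an outline and MetadataReport.missing_alt(result.metadata["images"]) a list of images to fix.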
Visitor Pattern
The visitor pattern enables custom HTML→Markdown conversion logic by providing callbacks for specific HTML elements during traversal. Supply the visitor module through the visitor field of the options passed to convert().
Use Cases:
- Custom Markdown dialects – Convert to Obsidian, Notion, or other flavors
- Content filtering – Remove tracking pixels, ads, or unwanted elements
- URL rewriting – Rewrite CDN URLs, add query parameters, validate links
- Accessibility validation – Check alt text, heading hierarchy, link text
- Analytics – Track element usage, link destinations, image sources
Supported Visitor Methods: 40+ callbacks for text, inline elements, links, images, headings, lists, blocks, and tables.
Example: Quick Start
defmodule MyVisitor do
use HtmlToMarkdown.Visitor
@impl true
def handle_link(_ctx, href, text, _title) do
# Rewrite CDN URLs
href = if String.starts_with?(href, "https://old-cdn.com") do
String.replace(href, "https://old-cdn.com", "https://new-cdn.com")
else
href
end
{:custom, "[#{text}](#{href})"}
end
@impl true
def handle_image(_ctx, src, _alt, _title) do
# Skip tracking pixels
if String.contains?(src, "tracking"), do: :skip, else: :continue
end
end
html = "<a href=\"https://old-cdn.com/file.pdf\">Download</a>"
opts = %HtmlToMarkdown.Options{visitor: MyVisitor}
{:ok, result} = HtmlToMarkdown.convert(html, opts)
result.content
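For the "add query parameters" use case, the rewriting itself needs only the standard library's URI module. The helper below is a sketch that a handle_link callback could call before returning its {:custom, ...} tuple; the module name LinkRewrite and its function are hypothetical.

```elixir
defmodule LinkRewrite do
  # Merges `params` into the URL's query string, overriding existing keys.
  def add_params(href, params) when is_map(params) do
    uri = URI.parse(href)

    merged =
      (uri.query || "")
      |> URI.decode_query()
      |> Map.merge(params)

    # Avoid a dangling "?" when there is nothing to encode.
    query = if merged == %{}, do: nil, else: URI.encode_query(merged)

    URI.to_string(%{uri | query: query})
  end
end
```

Keeping the rewrite in a plain function makes it easy to unit-test separately from the visitor callbacks.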
Examples
Links
- GitHub: github.com/kreuzberg-dev/html-to-markdown
- Hex.pm: hex.pm/packages/html_to_markdown
- Discord: discord.gg/xt9WY3GnKR
Part of Kreuzberg.dev
- Kreuzberg — document intelligence: text, tables, metadata from 90+ formats with optional OCR.
- Kreuzberg Cloud — managed extraction API with SDKs, dashboards, and observability.
- kreuzcrawl — web crawling and scraping with HTML→Markdown and headless-Chrome fallback.
- liter-llm — universal LLM API client with native bindings for 14 languages and 143 providers.
- tree-sitter-language-pack — tree-sitter grammars and code-intelligence primitives.
- alef — the polyglot binding generator that produces every per-language binding across the 5 polyglot repos.
- Discord — community, roadmap, announcements.
Contributing
We welcome contributions! Please see our Contributing Guide for details on:
- Setting up the development environment
- Running tests locally
- Submitting pull requests
- Reporting issues
All contributions must follow our code quality standards (enforced via pre-commit hooks):
- Proper test coverage (Rust 95%+, language bindings 80%+)
- Formatting and linting checks
- Documentation for public APIs
License
MIT License – see LICENSE. Copyright © Kreuzberg, Inc.
Support
If you find this library useful, consider sponsoring the project.
Have questions or run into issues? We're here to help:
- GitHub Issues: github.com/kreuzberg-dev/html-to-markdown/issues
- Discord Community: discord.gg/xt9WY3GnKR