html-to-markdown


Elixir bindings for the Rust html-to-markdown engine, exposed as Rustler NIFs. The package provides a fast HTML to Markdown converter that runs at native speed and produces the same Markdown as the bindings for every other runtime.

Installation

Requires Elixir 1.19+ and OTP 28. Add the dependency to the deps list in your mix.exs:

def deps do
  [
    {:html_to_markdown, "~> 3.0.0"}
  ]
end

Performance Snapshot

Apple M4 • Real Wikipedia documents • convert() (Elixir)

| Document | Size | Ops/sec | Throughput |
| --- | --- | --- | --- |
| Lists (Timeline) | 129KB | 2,547 | 321.7 MB/s |
| Tables (Countries) | 360KB | 835 | 293.8 MB/s |
| Medium (Python) | 656KB | 439 | 281.5 MB/s |
| Large (Rust) | 567KB | 485 | 268.7 MB/s |
| Small (Intro) | 463KB | 581 | 262.9 MB/s |

Quick Start

Basic conversion:

{:ok, result} = HtmlToMarkdown.convert("<h1>Hello</h1><p>This is <strong>fast</strong>!</p>")
IO.puts(result.content)

With conversion options:

opts = %HtmlToMarkdown.Options{wrap: true, wrap_width: 40}
{:ok, result} = HtmlToMarkdown.convert("<h1>Hello</h1><p>World</p>", opts)
IO.puts(result.content)

API Reference

Core Function

HtmlToMarkdown.convert(html, options \\ nil) :: {:ok, ConversionResult.t()} | {:error, term()}

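Converts HTML to Markdown. On success it returns {:ok, result}, where result is a ConversionResult struct holding every output of the conversion, so a single call yields the Markdown, metadata, tables, images, and warnings. On failure it returns {:error, reason}.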

{:ok, result} = HtmlToMarkdown.convert(html)
result.content    # Converted Markdown string
result.metadata   # Metadata map (when extract_metadata: true)
result.tables     # Table data list (when extract_tables: true)
result.document   # Document-level info
result.images     # Extracted images
result.warnings   # Any conversion warnings
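Because convert returns a tagged tuple, callers usually pattern-match on both shapes. The sketch below is a minimal illustration, not part of the library: ConversionReport and summarize/1 are hypothetical names, and it assumes result.warnings is either nil or a list.

```elixir
defmodule ConversionReport do
  # Hypothetical helper, not part of html_to_markdown: turns the tuple
  # returned by HtmlToMarkdown.convert/2 into a one-line summary string.

  def summarize({:ok, result}) do
    # Assumption: `warnings` is nil or a list; List.wrap/1 handles both.
    warnings = List.wrap(result.warnings)

    "ok: #{byte_size(result.content)} bytes of Markdown, #{length(warnings)} warning(s)"
  end

  def summarize({:error, reason}) do
    # The error term is documented only as term(), so inspect/1 is used
    # instead of assuming it is a string.
    "error: #{inspect(reason)}"
  end
end

# Usage, assuming the html_to_markdown dependency is installed:
#
#   "<h1>Hello</h1>"
#   |> HtmlToMarkdown.convert()
#   |> ConversionReport.summarize()
#   |> IO.puts()
```

Matching on the tuple in function heads keeps the success and failure paths separate and avoids the MatchError that `{:ok, result} = ...` raises when conversion fails.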

Options

ConversionOptions – Key configuration fields:

Djot Output Format

The library supports converting HTML to Djot, a lightweight markup language similar to Markdown but with a different syntax for some elements. Set output_format to "djot" to use this format.

Syntax Differences

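| Element | Markdown | Djot |
| --- | --- | --- |
| Strong | `**text**` | `*text*` |
| Emphasis | `*text*` | `_text_` |
| Strikethrough | `~~text~~` | `{-text-}` |
| Inserted/Added | N/A | `{+text+}` |
| Highlighted | N/A | `{=text=}` |
| Subscript | N/A | `~text~` |
| Superscript | N/A | `^text^` |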
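The table above can also be read as a lookup from element and format to a pair of delimiters. The following sketch encodes it as plain Elixir data purely for illustration. DjotSyntax and wrap/3 are hypothetical names; this is not how the library implements the conversion.

```elixir
defmodule DjotSyntax do
  # Illustration only: mirrors the syntax table, not the library's internals.
  # Each entry maps an element to {opening, closing} delimiters per format;
  # nil marks an element the table lists as N/A.
  @delimiters %{
    strong: %{markdown: {"**", "**"}, djot: {"*", "*"}},
    emphasis: %{markdown: {"*", "*"}, djot: {"_", "_"}},
    strikethrough: %{markdown: {"~~", "~~"}, djot: {"{-", "-}"}},
    inserted: %{markdown: nil, djot: {"{+", "+}"}},
    highlighted: %{markdown: nil, djot: {"{=", "=}"}},
    subscript: %{markdown: nil, djot: {"~", "~"}},
    superscript: %{markdown: nil, djot: {"^", "^"}}
  }

  # Wraps `text` in the delimiters for `element` in the given `format`.
  # Returns :unsupported when the table has no syntax for that combination.
  def wrap(element, format, text) do
    case get_in(@delimiters, [element, format]) do
      {open, close} -> {:ok, open <> text <> close}
      nil -> :unsupported
    end
  end
end

IO.inspect(DjotSyntax.wrap(:strong, :markdown, "bold"))
IO.inspect(DjotSyntax.wrap(:strong, :djot, "bold"))
IO.inspect(DjotSyntax.wrap(:highlighted, :markdown, "note"))
```

The :unsupported result for the N/A rows shows why Djot is the better target when the source HTML relies on insertions, highlights, subscripts, or superscripts.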

Example Usage

html = "<p>This is <strong>bold</strong> and <em>italic</em> text.</p>"

# Default Markdown output
{:ok, markdown} = HtmlToMarkdown.convert(html)
# markdown.content: "This is **bold** and *italic* text."

# Djot output
{:ok, djot} = HtmlToMarkdown.convert(html, %{output_format: "djot"})
# djot.content: "This is *bold* and _italic_ text."

Djot's extended syntax allows you to express more semantic meaning in lightweight text, making it useful for documents that require strikethrough, insertion tracking, or mathematical notation.

Plain Text Output

Set output_format to "plain" to strip all markup and return only visible text. This bypasses the Markdown conversion pipeline entirely for maximum speed.

html = "<h1>Title</h1><p>This is <strong>bold</strong> and <em>italic</em> text.</p>"

{:ok, plain} = HtmlToMarkdown.convert(html, %{output_format: "plain"})
# plain.content: "Title\n\nThis is bold and italic text."

Plain text mode is useful for search indexing, text extraction, and feeding content to LLMs.

Metadata Extraction

Metadata extraction collects document properties, headers, links, images, and structured data in the same pass as the conversion, all through the standard convert() function.

Use Cases:

Zero Overhead When Disabled: Metadata is collected during the HTML parsing pass, so enabling it adds negligible overhead, and nothing is collected when it is off. Pass extract_metadata: true in ConversionOptions to enable it; the result is available at result.metadata.

Example: Quick Start

html = "<h1>Article</h1><img src=\"test.jpg\" alt=\"test\">"
opts = %HtmlToMarkdown.Options{extract_metadata: true}
{:ok, result} = HtmlToMarkdown.convert(html, opts)

IO.puts(result.content)                           # Converted Markdown
IO.inspect(result.metadata["document"]["title"])  # Document title
IO.inspect(result.metadata["headers"])            # All h1-h6 elements
IO.inspect(result.metadata["links"])              # All hyperlinks
IO.inspect(result.metadata["images"])             # All images with alt text
IO.inspect(result.metadata["structured_data"])    # JSON-LD, Microdata, RDFa
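Once the metadata map is available, ordinary Enum functions are enough to post-process it. The sketch below pulls the unique link targets out of the map. It is a hedged illustration: MetadataExample and link_hrefs/1 are hypothetical names, and it assumes each entry under "links" is a map with an "href" key, which this README does not specify.

```elixir
defmodule MetadataExample do
  # Hypothetical helper, not part of html_to_markdown.
  # Assumption: each entry under "links" is a map with an "href" key.

  # Metadata may be absent when extract_metadata is not enabled.
  def link_hrefs(nil), do: []

  def link_hrefs(metadata) when is_map(metadata) do
    metadata
    |> Map.get("links")
    |> List.wrap()                        # tolerate a missing or nil "links" key
    |> Enum.map(&Map.get(&1, "href"))
    |> Enum.reject(&is_nil/1)             # drop entries without an href
    |> Enum.uniq()
  end
end

# Usage, assuming the html_to_markdown dependency is installed:
#
#   opts = %HtmlToMarkdown.Options{extract_metadata: true}
#   {:ok, result} = HtmlToMarkdown.convert(html, opts)
#   MetadataExample.link_hrefs(result.metadata)
```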

Visitor Pattern

The visitor pattern lets you customize HTML→Markdown conversion by supplying callbacks that run for specific HTML elements during traversal. Pass a visitor module through the visitor option of convert().

Use Cases:

Supported Visitor Methods: 40+ callbacks for text, inline elements, links, images, headings, lists, blocks, and tables.

Example: Quick Start

defmodule MyVisitor do
  use HtmlToMarkdown.Visitor

  @impl true
  def handle_link(_ctx, href, text, _title) do
    # Rewrite CDN URLs
    href = if String.starts_with?(href, "https://old-cdn.com") do
      String.replace(href, "https://old-cdn.com", "https://new-cdn.com")
    else
      href
    end
    {:custom, "[#{text}](#{href})"}
  end

  @impl true
  def handle_image(_ctx, src, _alt, _title) do
    # Skip tracking pixels
    if String.contains?(src, "tracking"), do: :skip, else: :continue
  end
end

html = "<a href=\"https://old-cdn.com/file.pdf\">Download</a>"
opts = %HtmlToMarkdown.Options{visitor: MyVisitor}
{:ok, result} = HtmlToMarkdown.convert(html, opts)
result.content

Examples

Links

Contributing

We welcome contributions! Please see our Contributing Guide for details on:

All contributions must follow our code quality standards (enforced via pre-commit hooks):

License

MIT License – see LICENSE.

Support

If you find this library useful, consider sponsoring the project.

Have questions or run into issues? We're here to help: