Html2Markdown

Convert HTML to clean, readable Markdown. Designed for content extraction, this library handles common HTML patterns while filtering out non-content elements such as navigation and scripts.

Installation

Add html2markdown to your list of dependencies in mix.exs:

def deps do
  [
    {:html2markdown, "~> 0.3.1"}
  ]
end

Quick Start

# Basic conversion
Html2Markdown.convert("<h1>Hello World</h1><p>Welcome to <strong>Elixir</strong>!</p>")
# => "\n# Hello World\n\n\n\nWelcome to **Elixir**!\n"

# With custom options
Html2Markdown.convert(html, %{
  navigation_classes: ["nav", "menu", "custom-nav"],
  normalize_whitespace: true
})

Features

Configuration Options

Html2Markdown.convert(html, %{
  # CSS classes that identify navigation elements to remove
  navigation_classes: ["footer", "menu", "nav", "sidebar", "aside"],
  
  # HTML tags to filter out during conversion
  non_content_tags: ["script", "style", "form", "nav", ...],
  
  # Markdown flavor (only :basic is supported currently; :gfm and :commonmark are planned)
  markdown_flavor: :basic,
  
  # Normalize whitespace (collapses multiple spaces, trims)
  normalize_whitespace: true
})

Common Use Cases

Web Scraping

Extract readable content from web pages:

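# Req.get!/1 returns the response struct directly (it raises on failure),
# so there is no {:ok, _} tuple to match on
%{body: html} = Req.get!(url)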
markdown = Html2Markdown.convert(html)
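
For scraping jobs that should not crash on a failed request, the non-raising `Req.get/1` can be matched explicitly. The following is a minimal sketch, not part of this library: the `ScrapeSketch` module, the `fetch_markdown` function, and the error atoms are hypothetical names chosen for illustration. The fetcher and converter are passed as function arguments (defaulting to `Req.get/1` and `Html2Markdown.convert/1`) so the logic can be exercised with stubs.

```elixir
defmodule ScrapeSketch do
  @moduledoc """
  Hypothetical wrapper (illustration only): fetch a page and convert it,
  returning tagged tuples instead of raising.
  """

  # `fetch` and `convert` default to the real calls; both are injectable
  # so tests can substitute stubs without network access.
  def fetch_markdown(url, fetch \\ &Req.get/1, convert \\ &Html2Markdown.convert/1) do
    case fetch.(url) do
      # Successful response with a textual body: convert it
      {:ok, %{status: 200, body: html}} when is_binary(html) ->
        {:ok, convert.(html)}

      # 200 response whose body was decoded into something else (e.g. JSON)
      {:ok, %{status: 200}} ->
        {:error, :non_text_body}

      # Any other HTTP status
      {:ok, %{status: status}} ->
        {:error, {:unexpected_status, status}}

      # Transport-level failure (DNS, timeout, ...)
      {:error, reason} ->
        {:error, reason}
    end
  end
end
```

With the defaults in place, `ScrapeSketch.fetch_markdown(url)` returns either `{:ok, markdown}` or `{:error, reason}`, which lets a caller skip or retry individual pages.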

Content Migration

Convert existing HTML content to Markdown:

# Convert blog posts from HTML to Markdown
html_content
|> Html2Markdown.convert(%{normalize_whitespace: true})
|> save_as_markdown()
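
The `save_as_markdown/1` call above is left to the application. As one possible shape for a batch migration, the sketch below converts every `.html` file in a directory and writes a matching `.md` file. The `MigrationSketch` module, the `migrate_dir` function, and the directory layout are assumptions made for illustration; only `Html2Markdown.convert/1` comes from this library.

```elixir
defmodule MigrationSketch do
  @moduledoc """
  Hypothetical helper (illustration only): convert each .html file in
  `source_dir` and write the result as a .md file in `target_dir`.
  """

  # `convert` defaults to the library call; it is injectable so the file
  # handling can be tested with a stub converter.
  def migrate_dir(source_dir, target_dir, convert \\ &Html2Markdown.convert/1) do
    File.mkdir_p!(target_dir)

    source_dir
    |> Path.join("*.html")
    |> Path.wildcard()
    |> Enum.map(fn html_path ->
      markdown = html_path |> File.read!() |> convert.()

      # post.html -> <target_dir>/post.md
      md_path = Path.join(target_dir, Path.basename(html_path, ".html") <> ".md")
      File.write!(md_path, markdown)
      md_path
    end)
  end
end
```

The function returns the list of written paths, so a call such as `MigrationSketch.migrate_dir("priv/posts_html", "priv/posts_md")` (paths hypothetical) can be followed by a count or a spot check of the output.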

Email Processing

Clean up HTML emails for plain text storage:

email_html
|> Html2Markdown.convert(%{
  non_content_tags: ["style", "script", "meta"],
  navigation_classes: ["unsubscribe", "footer"]
})

Supported Elements

Documentation

Full documentation is available at https://hexdocs.pm/html2markdown.

Development

This project includes comprehensive testing and quality assurance tools:

Running Tests

# Run all tests
mix test

# Run tests with coverage
mix coveralls.html

Code Quality

# Run all quality checks (formatting, security, linting)
mix quality

# Individual checks
mix format --check-formatted  # Code formatting
mix credo --only warning      # Code linting
mix sobelow --config          # Security analysis

CI/CD

This project uses GitHub Actions for continuous integration.

License

MIT License - see LICENSE file for details.