kreuzcrawl
Elixir bindings for kreuzcrawl — a high-performance Rust web crawling engine. Uses Rustler NIFs for native BEAM integration with OTP-compatible error tuples and ResourceArc handles.
What This Package Provides
- Same crawler as every binding — one Rust engine behind Python, Node.js, Ruby, Go, Java, .NET, PHP, Elixir, Dart, Kotlin Android, Swift, Zig, WASM, and C FFI.
- Structured scrape output — HTML, Markdown, metadata, links, assets, response headers, and extraction warnings with consistent field names.
- Crawl controls — depth, page limits, concurrency, URL filters, robots/sitemap handling, rate limits, and partial failure reporting.
- Rendering path — optional browser rendering for JavaScript-heavy pages; direct HTTP path for fast static pages.
- BEAM package — Rustler NIF binding with OTP-compatible error tuples.
Installation
def deps do
[{:kreuzcrawl, "~> 0.3.0-rc.49"}]
end
Quick Start
# Simplest case: scrape a single page with default settings.
{:ok, engine} = Kreuzcrawl.create_engine()
{:ok, scrape_json} = Kreuzcrawl.scrape_async(engine, "https://example.com/")
scrape = Jason.decode!(scrape_json)
IO.puts("Title: #{scrape["metadata"]["title"]}")
IO.puts("Status: #{scrape["status_code"]}")
IO.puts("Links found: #{length(scrape["links"] || [])}")
# Crawl from a seed URL, limited to one hop and a handful of pages.
config_json = Jason.encode!(%Kreuzcrawl.CrawlConfig{max_depth: 1, max_pages: 5})
{:ok, crawl_engine} = Kreuzcrawl.create_engine(config_json)
{:ok, crawl_json} =
Kreuzcrawl.crawl_async(crawl_engine, "https://en.wikipedia.org/wiki/Web_scraping")
crawl = Jason.decode!(crawl_json)
IO.puts("Pages crawled: #{length(crawl["pages"] || [])}")
API Reference
Full API documentation is available at docs.kreuzcrawl.kreuzberg.dev.
Key functions:
create_engine(config?)— Create a crawl engine with optional configurationscrape(engine, url)— Scrape a single URLcrawl(engine, url)— Crawl a website following linksmap_urls(engine, url)— Discover all pages on a sitebatch_scrape(engine, urls)— Scrape multiple URLs concurrentlybatch_crawl(engine, urls)— Crawl multiple seed URLs concurrently
Contributing
Contributions are welcome! Please see our Contributing Guide for details.
Part of Kreuzberg.dev
- Kreuzberg — document intelligence: text, tables, metadata from 90+ formats with optional OCR.
- Kreuzberg Cloud — managed extraction API with SDKs, dashboards, and observability.
- html-to-markdown — fast, lossless HTML→Markdown engine.
- liter-llm — universal LLM API client with native bindings for 14 languages and 143 providers.
- tree-sitter-language-pack — tree-sitter grammars and code-intelligence primitives.
- alef — the polyglot binding generator that produces this README and all per-language bindings.
- Discord — community, roadmap, announcements.
License
This project is licensed under Elastic License 2.0.