# Crawler
A high performance web crawler in Elixir, with worker pooling and rate limiting via OPQ.
## Features

Crawler is under active development; below is a non-comprehensive list of features (to be) implemented.
- Set the maximum crawl depth.
- Save to disk.
- Set timeouts.
- Crawl assets:
  - [x] js
  - [x] css
  - [x] images
- The ability to manually stop/pause/restart the crawler.
- Restrict crawlable domains, paths or file types.
- Limit concurrent crawlers.
- Limit rate of crawling.
- Set crawler's user agent.
- The ability to retry a failed crawl.
- DSL for scraping page content.
## Architecture
Below is a very high-level architecture diagram demonstrating how Crawler works.
## Usage
```elixir
Crawler.crawl("http://elixir-lang.org", max_depths: 2)
```

There are several ways of accessing the crawled page data:
- Use `Crawler.Store`
- Tap into the registry(?) `Crawler.Store.DB`
- If the `:save_to` option is set, pages will be saved to disk in addition to the above mentioned places
- Provide your own custom parser and manage how data is stored and accessed yourself
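As a minimal sketch of the first approach, assuming `Crawler.Store` exposes a `find/1` lookup keyed by URL (the exact Store API may differ between versions; verify against the API reference):

```elixir
# Kick off a crawl, then look the page up in the store.
Crawler.crawl("http://elixir-lang.org", max_depths: 2)

# Hypothetical lookup by URL - check the hexdocs for the exact
# function name and key shape in your installed version.
page = Crawler.Store.find("http://elixir-lang.org")
```

A fetched page struct would then carry the retrieved content (e.g. its body), which you can inspect or process further.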
## Configurations
| Option | Type | Default Value | Description |
|---|---|---|---|
| `:max_depths` | integer | `3` | Maximum nested depth of pages to crawl. |
| `:workers` | integer | `10` | Maximum number of concurrent workers for crawling. |
| `:interval` | integer | `0` | Rate limit control - number of milliseconds before crawling more pages; defaults to `0`, which is effectively no rate limit. |
| `:timeout` | integer | `5000` | Timeout value for fetching a page, in milliseconds. |
| `:user_agent` | string | `Crawler/x.x.x (...)` | User-Agent value sent by the fetch requests. |
| `:save_to` | string | `nil` | When provided, the path for saving crawled pages. |
| `:assets` | list | `[]` | Whether to fetch any asset files; available options: `"css"`, `"js"`, `"images"`. |
| `:retrier` | module | `Crawler.Fetcher.Retrier` | Custom fetch retrier, useful when you need to retry failed crawls. |
| `:url_filter` | module | `Crawler.Fetcher.UrlFilter` | Custom URL filter, useful when you need to restrict crawlable domains, paths or file types. |
| `:parser` | module | `Crawler.Parser` | Custom parser, useful when you need to handle parsing differently or to add extra functionalities. |
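The options above combine as a keyword list passed to `Crawler.crawl/2`; for example (the URL and values here are illustrative):

```elixir
Crawler.crawl(
  "http://elixir-lang.org",
  max_depths: 2,         # follow links at most two levels deep
  workers: 5,            # at most five concurrent workers
  interval: 500,         # wait 500ms between crawls (rate limiting)
  timeout: 10_000,       # give up on a fetch after 10 seconds
  assets: ["css", "js"], # also fetch linked css and js files
  save_to: "/tmp/pages"  # save crawled pages to disk
)
```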
## Custom Modules
It is possible to swap in your custom logic as shown in the configurations section. Your custom modules need to conform to their respective behaviours:
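A custom module is wired in via the corresponding option from the configurations table; as a sketch (the module names below are placeholders matching the skeletons in the following sections):

```elixir
Crawler.crawl(
  "http://elixir-lang.org",
  retrier: CustomRetrier,      # placeholder implementing Crawler.Fetcher.Retrier.Spec
  url_filter: CustomUrlFilter, # placeholder implementing Crawler.Fetcher.UrlFilter.Spec
  parser: CustomParser         # placeholder implementing Crawler.Parser.Spec
)
```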
### Retrier
Crawler uses ElixirRetry's exponential backoff strategy by default.
```elixir
defmodule CustomRetrier do
  @behaviour Crawler.Fetcher.Retrier.Spec
end
```

### URL Filter
See `Crawler.Fetcher.UrlFilter`.
```elixir
defmodule CustomUrlFilter do
  @behaviour Crawler.Fetcher.UrlFilter.Spec
end
```

### Parser
See `Crawler.Parser`.
```elixir
defmodule CustomParser do
  @behaviour Crawler.Parser.Spec
end
```

## API Reference
Please see https://hexdocs.pm/crawler.
## Changelog
Please see CHANGELOG.md.
## License
Licensed under MIT.