Crawler


A high-performance web crawler in Elixir, with worker pooling and rate limiting via OPQ.

Features

Crawler is under active development; below is a non-comprehensive list of features (to be) implemented.

Usage

Crawler.crawl("http://elixir-lang.org", max_depths: 2)

Configurations

| Option | Type | Default Value | Description |
| --- | --- | --- | --- |
| `:max_depths` | integer | `3` | Maximum nested depth of pages to crawl. |
| `:workers` | integer | `10` | Maximum number of concurrent workers for crawling. |
| `:interval` | integer | `0` | Rate limit control - number of milliseconds before crawling more pages; defaults to `0`, which is effectively no rate limit. |
| `:timeout` | integer | `5000` | Timeout value for fetching a page, in ms. |
| `:user_agent` | string | `Crawler/x.x.x (...)` | User-Agent value sent by the fetch requests. |
| `:save_to` | string | `nil` | When provided, the path for saving crawled pages. |
| `:assets` | list | `[]` | Whether to fetch any asset files; available options: `"css"`, `"js"`, `"images"`. |
| `:retrier` | module | `Crawler.Fetcher.Retrier` | Custom fetch retrier, useful when you need to retry failed crawls. |
| `:url_filter` | module | `Crawler.Fetcher.UrlFilter` | Custom URL filter, useful when you need to restrict crawlable domains, paths or file types. |
| `:parser` | module | `Crawler.Parser` | Custom parser, useful when you need to handle parsing differently or to add extra functionalities. |
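As an illustration, several of these options can be combined in a single call. The save path and the specific values below are hypothetical, not recommendations:

```elixir
Crawler.crawl("http://elixir-lang.org",
  max_depths: 2,            # stop two levels below the entry page
  workers: 5,               # at most 5 concurrent workers
  interval: 500,            # wait 500 ms between crawls
  timeout: 10_000,          # give each fetch up to 10 s
  save_to: "/tmp/crawls",   # hypothetical save path
  assets: ["css", "images"] # also fetch stylesheets and images
)
```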

Custom Modules

It is possible to swap in your custom logic as shown in the configurations section. Your custom modules need to conform to their respective behaviours:

Retrier

See Crawler.Fetcher.Retrier.

Crawler uses ElixirRetry's exponential backoff strategy by default.

defmodule CustomRetrier do
  @behaviour Crawler.Fetcher.Retrier.Spec
end
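A minimal sketch of what a working retrier might look like, assuming the behaviour exposes a `perform/2` callback that receives the fetch function and the crawl options (check `Crawler.Fetcher.Retrier.Spec` for the actual callback name and signature):

```elixir
defmodule CustomRetrier do
  @behaviour Crawler.Fetcher.Retrier.Spec

  # Naive strategy: run the fetch exactly once and give up on failure,
  # instead of the default exponential backoff.
  # The perform/2 shape is an assumption - see Crawler.Fetcher.Retrier.Spec.
  def perform(fetch, _opts), do: fetch.()
end
```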

URL Filter

See Crawler.Fetcher.UrlFilter.

defmodule CustomUrlFilter do
  @behaviour Crawler.Fetcher.UrlFilter.Spec
end
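A sketch of a filter restricted to a single domain, assuming the behaviour expects a `filter/2` callback that returns `{:ok, boolean}` (the callback name and return shape are assumptions; consult `Crawler.Fetcher.UrlFilter.Spec`):

```elixir
defmodule CustomUrlFilter do
  @behaviour Crawler.Fetcher.UrlFilter.Spec

  # Only allow pages on elixir-lang.org to be crawled.
  # filter/2 returning {:ok, boolean} is an assumption - see the Spec module.
  def filter(url, _opts) do
    {:ok, String.contains?(url, "elixir-lang.org")}
  end
end
```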

Parser

See Crawler.Parser.

defmodule CustomParser do
  @behaviour Crawler.Parser.Spec
end
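A sketch of a parser that adds a side effect and then delegates to the default parser, assuming the behaviour expects a `parse/1` callback that receives the fetched page (the callback shape and the page's `url` field are assumptions; see `Crawler.Parser.Spec`):

```elixir
defmodule CustomParser do
  @behaviour Crawler.Parser.Spec

  # Log each page's URL, then hand off to the default parser so link
  # discovery still works. parse/1 and page.url are assumptions -
  # consult Crawler.Parser.Spec for the real contract.
  def parse(page) do
    IO.inspect(page.url, label: "parsed")
    Crawler.Parser.parse(page)
  end
end
```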

Changelog

Please see CHANGELOG.md.

License

Licensed under MIT.