Crawler


A high-performance web crawler in Elixir, with worker pooling and rate limiting via OPQ.

Features

Crawler is under active development. Below is a non-comprehensive list of features, some implemented and some planned.

Architecture

Below is a high-level architecture diagram demonstrating how Crawler works.

Usage

Crawler.crawl("http://elixir-lang.org", max_depths: 2)

There are several ways to access the crawled page data:

  1. Use Crawler.Store
  2. Tap into the registry, Crawler.Store.DB
  3. Use your own scraper
  4. If the :save_to option is set, pages will be saved to disk in addition to the above-mentioned places
  5. Provide your own custom parser and manage how data is stored and accessed yourself
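As a sketch of option 1, the in-memory store can be queried after a crawl. Crawler.Store.find/1 and the page's :body field as used below are assumptions; check https://hexdocs.pm/crawler for the exact interface.

```elixir
# Hypothetical example: crawl a site, wait briefly for the workers,
# then read a page back from Crawler's in-memory store.
# Crawler.Store.find/1 and the :body field are assumptions here.
{:ok, _opts} = Crawler.crawl("http://elixir-lang.org", max_depths: 1)

# Give the workers a moment to fetch the page before reading the store.
Process.sleep(1_000)

case Crawler.Store.find("http://elixir-lang.org") do
  nil  -> IO.puts("page not crawled yet")
  page -> IO.puts(String.slice(page.body, 0, 100))
end
```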

Configurations

| Option | Type | Default Value | Description |
| ------ | ---- | ------------- | ----------- |
| :assets | list | [] | Whether to fetch any asset files, available options: "css", "js", "images". |
| :save_to | string | nil | When provided, the path for saving crawled pages. |
| :workers | integer | 10 | Maximum number of concurrent workers for crawling. |
| :interval | integer | 0 | Rate limit control - number of milliseconds before crawling more pages, defaults to 0 which is effectively no rate limit. |
| :max_depths | integer | 3 | Maximum nested depth of pages to crawl. |
| :timeout | integer | 5000 | Timeout value for fetching a page, in ms. Can also be set to :infinity, useful when combined with Crawler.pause/1. |
| :user_agent | string | Crawler/x.x.x (...) | User-Agent value sent by the fetch requests. |
| :url_filter | module | Crawler.Fetcher.UrlFilter | Custom URL filter, useful for restricting crawlable domains, paths or content types. |
| :retrier | module | Crawler.Fetcher.Retrier | Custom fetch retrier, useful for retrying failed crawls. |
| :scraper | module | Crawler.Scraper | Custom scraper, useful for scraping content as soon as the parser parses it. |
| :parser | module | Crawler.Parser | Custom parser, useful for handling parsing differently or to add extra functionalities. |
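To illustrate, several of these options can be combined in a single call; the URL and save path below are placeholders.

```elixir
# Combine several of the options above in one crawl.
Crawler.crawl("http://elixir-lang.org",
  max_depths: 2,             # follow links at most 2 levels deep
  workers: 5,                # at most 5 concurrent workers
  interval: 500,             # wait 500ms between fetches (rate limiting)
  timeout: 10_000,           # allow up to 10s per fetch
  assets: ["css", "images"], # also fetch CSS and image assets
  save_to: "/tmp/crawler"    # additionally save pages to disk
)
```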

Custom Modules

It is possible to swap in your custom logic as shown in the configurations section. Your custom modules need to conform to their respective behaviours:

Retrier

See Crawler.Fetcher.Retrier.

Crawler uses ElixirRetry's exponential backoff strategy by default.

defmodule CustomRetrier do
  @behaviour Crawler.Fetcher.Retrier.Spec
end
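A slightly fuller sketch is a retrier that runs the fetch once and never retries. The perform/2 callback shape (a zero-arity fetch function plus the crawl options) is an assumption based on the default retrier, so verify it against the Crawler.Fetcher.Retrier source.

```elixir
defmodule NoRetryRetrier do
  @behaviour Crawler.Fetcher.Retrier.Spec

  # Run the fetch exactly once, regardless of the result.
  # (The default retrier instead wraps this call in ElixirRetry's
  # exponential backoff.)
  def perform(fetch_fun, _opts), do: fetch_fun.()
end
```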

URL Filter

See Crawler.Fetcher.UrlFilter.

defmodule CustomUrlFilter do
  @behaviour Crawler.Fetcher.UrlFilter.Spec
end
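For instance, a filter could restrict the crawl to a single domain. The filter/2 callback returning {:ok, boolean} mirrors the default Crawler.Fetcher.UrlFilter, but treat that shape as an assumption and check the hexdocs.

```elixir
defmodule DomainFilter do
  @behaviour Crawler.Fetcher.UrlFilter.Spec

  # Only allow URLs under elixir-lang.org; everything else is skipped.
  def filter(url, _opts) do
    {:ok, String.contains?(url, "elixir-lang.org")}
  end
end
```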

Scraper

See Crawler.Scraper.

defmodule CustomScraper do
  @behaviour Crawler.Scraper.Spec
end
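As an example, a scraper could pull the title out of each crawled page. The scrape/1 callback taking a page with :url and :body fields and returning {:ok, page} is an assumption based on the default Crawler.Scraper; verify the exact struct on hexdocs.

```elixir
defmodule TitleScraper do
  @behaviour Crawler.Scraper.Spec

  # Print the <title> of each crawled page, then hand the page back
  # unchanged so crawling continues normally.
  def scrape(%{url: url, body: body} = page) do
    case Regex.run(~r{<title>(.*?)</title>}s, body) do
      [_, title] -> IO.puts("#{url}: #{title}")
      nil        -> :noop
    end

    {:ok, page}
  end
end
```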

Parser

See Crawler.Parser.

defmodule CustomParser do
  @behaviour Crawler.Parser.Spec
end
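A custom parser might add behaviour around the default one, for example logging each page before parsing. The parse/1 callback and the delegation to Crawler.Parser.parse/1 are assumptions; without delegating (or re-implementing link discovery), the crawl would stop at the first page, so check the Crawler.Parser.Spec docs for the exact contract.

```elixir
defmodule LoggingParser do
  @behaviour Crawler.Parser.Spec

  # Log the page being parsed, then defer to the default parser so
  # link discovery still happens.
  def parse(page) do
    IO.puts("parsing #{page.url}")
    Crawler.Parser.parse(page)
  end
end
```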

Pause / Resume / Stop Crawler

Crawler provides pause/1, resume/1 and stop/1, see below.

{:ok, opts} = Crawler.crawl("http://elixir-lang.org")

Crawler.pause(opts)

Crawler.resume(opts)

Crawler.stop(opts)

Please note that when pausing Crawler, you need to set a large enough :timeout (or even set it to :infinity), otherwise the parser will time out due to unprocessed links.
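Putting the two together, a crawl that is intended to be paused might look like the following sketch (the URL is a placeholder):

```elixir
# Use an unbounded fetch timeout so queued links don't expire
# while the crawler is paused.
{:ok, opts} = Crawler.crawl("http://elixir-lang.org", timeout: :infinity)

Crawler.pause(opts)
# ... later ...
Crawler.resume(opts)
```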

API Reference

Please see https://hexdocs.pm/crawler.

Changelog

Please see CHANGELOG.md.

License

Licensed under MIT.