# Crawler
A high-performance web crawler in Elixir, with worker pooling and rate limiting via OPQ.
## Features
Crawler is under active development. Below is a non-comprehensive list of features, some of which are still to be implemented.
- Crawl assets (JavaScript, CSS and images).
- Save to disk.
- Hook for scraping content.
- Restrict crawlable domains, paths or content types.
- Limit concurrent crawlers.
- Limit rate of crawling.
- Set the maximum crawl depth.
- Set timeouts.
- Set retries strategy.
- Set crawler's user agent.
- Manually pause/resume/stop the crawler.
## Architecture
Below is a very high-level architecture diagram demonstrating how Crawler works.
## Usage

```elixir
Crawler.crawl("http://elixir-lang.org", max_depths: 2)
```

There are several ways to access the crawled page data:
- Use `Crawler.Store` (see the sketch after this list)
- Tap into the registry via `Crawler.Store.DB`
- Use your own scraper
- If the `:save_to` option is set, pages will be saved to disk in addition to the above-mentioned places
- Provide your own custom parser and manage how data is stored and accessed yourself
## Configurations

| Option | Type | Default Value | Description |
| --- | --- | --- | --- |
| `:assets` | list | `[]` | Whether to fetch any asset files. Available options: `"css"`, `"js"`, `"images"`. |
| `:save_to` | string | `nil` | When provided, the path for saving crawled pages. |
| `:workers` | integer | `10` | Maximum number of concurrent workers for crawling. |
| `:interval` | integer | `0` | Rate limit control: the number of milliseconds to wait before crawling more pages. The default of `0` effectively disables rate limiting. |
| `:max_depths` | integer | `3` | Maximum nested depth of pages to crawl. |
| `:timeout` | integer | `5000` | Timeout value for fetching a page, in milliseconds. Can also be set to `:infinity`, which is useful when combined with `Crawler.pause/1`. |
| `:user_agent` | string | `Crawler/x.x.x (...)` | User-Agent value sent with fetch requests. |
| `:url_filter` | module | `Crawler.Fetcher.UrlFilter` | Custom URL filter, useful for restricting crawlable domains, paths or content types. |
| `:retrier` | module | `Crawler.Fetcher.Retrier` | Custom fetch retrier, useful for retrying failed crawls. |
| `:scraper` | module | `Crawler.Scraper` | Custom scraper, useful for scraping content as soon as the parser parses it. |
| `:parser` | module | `Crawler.Parser` | Custom parser, useful for handling parsing differently or adding extra functionality. |
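For illustration, here is a crawl that combines several of the options above (all values are arbitrary examples):

```elixir
Crawler.crawl("http://elixir-lang.org",
  assets:     ["css", "js", "images"], # also fetch CSS, JS and image assets
  save_to:    "/tmp/crawled_pages",    # persist pages under this (example) path
  workers:    5,                       # run at most 5 concurrent workers
  interval:   500,                     # wait 500ms between crawls
  max_depths: 2,                       # follow links at most 2 levels deep
  timeout:    10_000                   # allow up to 10s per fetch
)
```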
## Custom Modules
It is possible to swap in your custom logic as shown in the configurations section. Your custom modules need to conform to their respective behaviours:
### Retrier
Crawler uses ElixirRetry's exponential backoff strategy by default.
```elixir
defmodule CustomRetrier do
  @behaviour Crawler.Fetcher.Retrier.Spec
end
```
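As an illustration, a retrier that disables retries altogether might look as follows. The `perform/2` callback shape (a fetch function plus the crawl opts) is an assumption based on the default retrier; check `Crawler.Fetcher.Retrier` for the actual contract.

```elixir
defmodule NoRetrier do
  @behaviour Crawler.Fetcher.Retrier.Spec

  # Assumed contract: perform/2 receives the fetch function and the crawl
  # opts, and its return value is the fetch result. Invoking the function
  # exactly once means failed fetches are never retried.
  def perform(fetch, _opts), do: fetch.()
end
```

It could then be plugged in via `Crawler.crawl("http://elixir-lang.org", retrier: NoRetrier)`.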
### URL Filter

See `Crawler.Fetcher.UrlFilter`.
```elixir
defmodule CustomUrlFilter do
  @behaviour Crawler.Fetcher.UrlFilter.Spec
end
```
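For example, a filter that restricts crawling to a single domain. The `filter/2` callback returning `{:ok, boolean}` is an assumption mirroring the default filter; check `Crawler.Fetcher.UrlFilter` for the actual contract.

```elixir
defmodule SingleDomainFilter do
  @behaviour Crawler.Fetcher.UrlFilter.Spec

  # Assumed contract: filter/2 receives the candidate URL and the crawl
  # opts; {:ok, true} lets the URL be crawled, {:ok, false} skips it.
  def filter(url, _opts) do
    {:ok, String.contains?(url, "elixir-lang.org")}
  end
end
```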
### Scraper

See `Crawler.Scraper`.
```elixir
defmodule CustomScraper do
  @behaviour Crawler.Scraper.Spec
end
```
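For example, a scraper that logs each crawled page. That `scrape/1` receives a page struct with `url` and `body` fields and returns `{:ok, page}` is an assumption based on the default scraper; check `Crawler.Scraper` for the actual contract.

```elixir
defmodule LoggingScraper do
  @behaviour Crawler.Scraper.Spec

  require Logger

  # Assumed contract: scrape/1 receives the crawled page and must return
  # {:ok, page} so the crawl can continue.
  def scrape(page) do
    Logger.info("Scraped #{page.url} (#{byte_size(page.body)} bytes)")
    {:ok, page}
  end
end
```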
### Parser

See `Crawler.Parser`.
```elixir
defmodule CustomParser do
  @behaviour Crawler.Parser.Spec
end
```
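A custom parser will typically delegate to the default parser for link discovery and then post-process the result. The `parse/1` callback shape and the delegation to `Crawler.Parser.parse/1` are assumptions; check `Crawler.Parser` for the actual contract.

```elixir
defmodule PostProcessingParser do
  @behaviour Crawler.Parser.Spec

  # Assumed contract: parse/1 receives the fetched page and returns
  # {:ok, page}. Delegating to the default parser keeps link discovery
  # (and therefore deeper crawling) intact.
  def parse(page) do
    {:ok, page} = Crawler.Parser.parse(page)
    # ... custom handling of `page` goes here ...
    {:ok, page}
  end
end
```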
## Pause / Resume / Stop Crawler

Crawler provides `pause/1`, `resume/1` and `stop/1`; see below.
```elixir
{:ok, opts} = Crawler.crawl("http://elixir-lang.org")

Crawler.pause(opts)
Crawler.resume(opts)
Crawler.stop(opts)
```
Please note that when pausing Crawler, you need to set a large enough `:timeout` (or even set it to `:infinity`), otherwise the parser would time out due to unprocessed links.
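For instance, a crawl that is intended to be paused and resumed later can be started with an unlimited fetch timeout:

```elixir
{:ok, opts} = Crawler.crawl("http://elixir-lang.org", timeout: :infinity)

Crawler.pause(opts)
# ... some time later ...
Crawler.resume(opts)
```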
## API Reference
Please see https://hexdocs.pm/crawler.
## Changelog
Please see CHANGELOG.md.
## License
Licensed under MIT.