# Crawler
A high performance web crawler in Elixir, with worker pooling and rate limiting via OPQ.
## Features

Crawler is under active development; below is a non-comprehensive list of features (to be) implemented.
- Set the maximum crawl depth.
- Save to disk.
- Set timeouts.
- Crawl assets:
  - [x] js
  - [x] css
  - [x] images
- The ability to manually stop/pause/restart the crawler.
- Restrict crawlable domains, paths or file types.
- Limit concurrent crawlers.
- Limit rate of crawling.
- Set crawler's user agent.
- The ability to retry a failed crawl.
- DSL for scraping page content.
## Architecture
Below is a very high-level architecture diagram demonstrating how Crawler works.
## Usage
```elixir
Crawler.crawl("http://elixir-lang.org", max_depths: 2)
```

There are several ways of accessing the crawled page data:
- Use `Crawler.Store`
- Tap into the registry(?) `Crawler.Store.DB`
- If the `:save_to` option is set, pages will be saved to disk in addition to the above mentioned places
- Provide your own custom parser and manage how data is stored and accessed yourself
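As a minimal sketch of the first approach, assuming `Crawler.Store` exposes a `find/1` lookup keyed by URL (the exact Store API may differ between versions; verify against the API reference):

```elixir
# Kick off a crawl, then look the page up in the store.
Crawler.crawl("http://elixir-lang.org", max_depths: 2)

# Hypothetical lookup by URL - check the hexdocs for the exact
# function name and key shape in your installed version.
page = Crawler.Store.find("http://elixir-lang.org")
```

A fetched page struct would then carry the retrieved content (e.g. its body), which you can inspect or process further.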
## Configurations
| Option | Type | Default Value | Description |
|---|---|---|---|
| `:max_depths` | integer | `3` | Maximum nested depth of pages to crawl. |
| `:workers` | integer | `10` | Maximum number of concurrent workers for crawling. |
| `:interval` | integer | `0` | Rate limit control - number of milliseconds before crawling more pages; defaults to `0`, which is effectively no rate limit. |
| `:timeout` | integer | `5000` | Timeout value for fetching a page, in milliseconds. |
| `:user_agent` | string | `Crawler/x.x.x (...)` | User-Agent value sent by the fetch requests. |
| `:save_to` | string | `nil` | When provided, the path for saving crawled pages. |
| `:assets` | list | `[]` | Whether to fetch any asset files; available options: `"css"`, `"js"`, `"images"`. |
| `:retrier` | module | `Crawler.Fetcher.Retrier` | Custom fetch retrier, useful when you need to retry failed crawls. |
| `:url_filter` | module | `Crawler.Fetcher.UrlFilter` | Custom URL filter, useful when you need to restrict crawlable domains, paths or file types. |
| `:parser` | module | `Crawler.Parser` | Custom parser, useful when you need to handle parsing differently or to add extra functionalities. |
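The options above combine as a keyword list passed to `Crawler.crawl/2`; for example (the URL and values here are illustrative):

```elixir
Crawler.crawl(
  "http://elixir-lang.org",
  max_depths: 2,         # follow links at most two levels deep
  workers: 5,            # at most five concurrent workers
  interval: 500,         # wait 500ms between crawls (rate limiting)
  timeout: 10_000,       # give up on a fetch after 10 seconds
  assets: ["css", "js"], # also fetch linked css and js files
  save_to: "/tmp/pages"  # save crawled pages to disk
)
```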
## Custom Modules
It is possible to swap in your custom logic as shown in the configurations section. Your custom modules need to conform to their respective behaviours:
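A custom module is wired in via the corresponding option from the configurations table; as a sketch (the module names below are placeholders matching the skeletons in the following sections):

```elixir
Crawler.crawl(
  "http://elixir-lang.org",
  retrier: CustomRetrier,      # placeholder implementing Crawler.Fetcher.Retrier.Spec
  url_filter: CustomUrlFilter, # placeholder implementing Crawler.Fetcher.UrlFilter.Spec
  parser: CustomParser         # placeholder implementing Crawler.Parser.Spec
)
```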
### Retrier
Crawler uses ElixirRetry's exponential backoff strategy by default.
```elixir
defmodule CustomRetrier do
  @behaviour Crawler.Fetcher.Retrier.Spec
end
```

### URL Filter
See `Crawler.Fetcher.UrlFilter`.
```elixir
defmodule CustomUrlFilter do
  @behaviour Crawler.Fetcher.UrlFilter.Spec
end
```

### Parser
See `Crawler.Parser`.
```elixir
defmodule CustomParser do
  @behaviour Crawler.Parser.Spec
end
```

## API Reference
Please see https://hexdocs.pm/crawler.
## Changelog
Please see CHANGELOG.md.
## License
Licensed under MIT.