# Crawler

A high-performance web crawler in Elixir, with worker pooling and rate limiting via OPQ.

## Features

Crawler is under active development; below is a non-comprehensive list of features, implemented or planned.
- Set the maximum crawl depth.
- Save to disk.
- Set timeouts.
- Crawl assets:
  - [x] js
  - [x] css
  - [x] images
- The ability to manually stop/pause/restart the crawler.
- Restrict crawlable domains, paths or file types.
- Limit concurrent crawlers.
- Limit rate of crawling.
- Set the crawler's user agent.
- The ability to retry a failed crawl.
- DSL for scraping page content.
## Usage

```elixir
Crawler.crawl("http://elixir-lang.org", max_depths: 2)
```

## Configurations
| Option | Type | Default Value | Description |
|---|---|---|---|
| `:max_depths` | integer | `3` | Maximum nested depth of pages to crawl. |
| `:workers` | integer | `10` | Maximum number of concurrent workers for crawling. |
| `:interval` | integer | `0` | Rate limit control - number of milliseconds to wait before crawling more pages; defaults to `0`, which is effectively no rate limit. |
| `:timeout` | integer | `5000` | Timeout value for fetching a page, in milliseconds. |
| `:user_agent` | string | `Crawler/x.x.x (...)` | User-Agent value sent by the fetch requests. |
| `:save_to` | string | `nil` | When provided, the path for saving crawled pages. |
| `:assets` | list | `[]` | Whether to fetch any asset files; available options: `"css"`, `"js"`, `"images"`. |
| `:retrier` | module | `Crawler.Fetcher.Retrier` | Custom fetch retrier, useful when you need to retry failed crawls. |
| `:url_filter` | module | `Crawler.Fetcher.UrlFilter` | Custom URL filter, useful when you need to restrict crawlable domains, paths or file types. |
| `:parser` | module | `Crawler.Parser` | Custom parser, useful when you need to handle parsing differently or to add extra functionality. |
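As a sketch of how these options combine, here is a single call using only the options documented above (the values themselves are illustrative, not recommendations):

```elixir
Crawler.crawl("http://elixir-lang.org",
  max_depths: 2,            # follow links at most 2 levels deep
  workers: 5,               # at most 5 concurrent workers
  interval: 1_000,          # wait 1 second between fetches (rate limit)
  timeout: 10_000,          # give up on a page after 10 seconds
  save_to: "/tmp/crawls",   # write crawled pages under this path
  assets: ["css", "images"] # also fetch stylesheets and images
)
```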
## Custom Modules

It is possible to swap in your custom logic as shown in the configurations section. Your custom modules need to conform to their respective behaviours:
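For example, custom modules such as the ones defined in the sections that follow are swapped in via the corresponding options:

```elixir
# Each option takes a module implementing the matching behaviour.
Crawler.crawl("http://elixir-lang.org",
  retrier: CustomRetrier,
  url_filter: CustomUrlFilter,
  parser: CustomParser
)
```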
### Retrier

Crawler uses ElixirRetry's exponential backoff strategy by default.

```elixir
defmodule CustomRetrier do
  @behaviour Crawler.Fetcher.Retrier.Spec
end
```

### URL Filter
See `Crawler.Fetcher.UrlFilter`.

```elixir
defmodule CustomUrlFilter do
  @behaviour Crawler.Fetcher.UrlFilter.Spec
end
```

### Parser
See `Crawler.Parser`.

```elixir
defmodule CustomParser do
  @behaviour Crawler.Parser.Spec
end
```

## Changelog
Please see [CHANGELOG.md](CHANGELOG.md).
## License
Licensed under MIT.