Crawler
A high-performance web crawler in Elixir, with worker pooling and rate limiting via OPQ.
Usage
Crawler.crawl("http://elixir-lang.org", max_depths: 2)
Configurations
| Option | Type | Default Value | Description |
|---|---|---|---|
| :max_depths | integer | 3 | Maximum nested depth of pages to crawl. |
| :workers | integer | 10 | Maximum number of concurrent workers for crawling. |
| :interval | integer | 0 | Rate limit control: number of milliseconds to wait before crawling more pages. The default of 0 is effectively no rate limit. |
| :timeout | integer | 5000 | Timeout value for fetching a page, in ms. |
| :user_agent | string | Crawler/x.x.x (...) | User-Agent value sent by the fetch requests. |
| :save_to | string | nil | When provided, the path for saving crawled pages. |
| :parser | module | Crawler.Parser | The parser module to use. Replace the default when you need to handle parsing differently or to add extra functionality. |
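As a rough sketch of how several of the options above might be combined, the snippet below passes them as a keyword list in the same way the usage example passes `max_depths`. The values and the save path are illustrative assumptions, not recommendations.

```elixir
# Options come from the table above; every value here is illustrative.
opts = [
  max_depths: 2,                # follow links at most two levels deep
  workers: 5,                   # at most five concurrent workers
  interval: 500,                # wait 500 ms before crawling more pages
  timeout: 10_000,              # give up on a page fetch after 10 seconds
  save_to: "/tmp/crawled_pages" # hypothetical path for saved pages
]

Crawler.crawl("http://elixir-lang.org", opts)
```

Any option left out of the list should fall back to the default shown in the table.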
Features Backlog
Crawler is under active development. Below is a non-exhaustive list of features to be implemented.
- Set the maximum crawl depth.
- Save to disk.
- Set timeouts.
- Crawl assets (CSS and images, etc).
- The ability to manually stop/pause/restart the crawler.
- Restrict crawlable domains, paths or file types.
- Limit concurrent crawlers.
- Limit rate of crawling.
- Set crawler's user agent.
- The ability to retry a failed crawl.
- DSL for scraping page content.
Changelog
Please see CHANGELOG.md.
License
Licensed under MIT.