Meeseeks
Meeseeks is an Elixir library for extracting data from HTML.
# Fetch HTML with your preferred library
html = Tesla.get("https://news.ycombinator.com/").body
# Select stories and return a map containing the title and url of each
for story <- Meeseeks.all(html, "tr.athing") do
  title_a = Meeseeks.one(story, ".title a")
  %{:title => Meeseeks.text(title_a),
    :url => Meeseeks.attr(title_a, "href")}
end
#=> [%{:title => "...", :url => "..."}, %{:title => "...", :url => "..."}, ...]
Installation
Add Meeseeks to your mix.exs:
defp deps do
  [
    {:meeseeks, "~> 0.2.0"}
  ]
end
Then run mix deps.get.
Dependencies
Meeseeks depends on html5ever via html5ever_elixir.
Because html5ever is a Rust library, you will need to have the Rust compiler installed in order to use Meeseeks.
This is necessary because there are no HTML5 spec-compliant parsers in Erlang/Elixir. The mochiweb_html library is decent, but it can have problems parsing malformed HTML correctly, which leads to weirdness I would just as soon avoid.
Overview
Parsing
Meeseeks parses a source (HTML string or tuple-tree) into a Document. A Document is just an easily queryable view of the source HTML with the nodes assigned ids and the parent-child relationships made explicit.
# Can parse html as a string
Meeseeks.parse("<div id=main><p>Hello, Github!</p></div>")
#=> %Meeseeks.Document{...}
# Or as a tuple-tree
Meeseeks.parse({"div", [{"id", "main"}], [{"p", [], ["Hello, Github!"]}]})
#=> %Meeseeks.Document{...}
The selection functions all and one will accept unparsed HTML, but parsing is expensive, so parse ahead of time if you are planning to run multiple selections on the same Document.
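As a rough sketch of the parse-ahead advice, the snippet below parses a string once and then runs several selections against the resulting Document. It assumes the Meeseeks 0.2.x API described in this README (string selectors passed straight to all and one); the variable names are only illustrative.

```elixir
# Assumes the Meeseeks 0.2.x API shown in this README
html = "<div id=main><p>1</p><p>2</p><p>3</p></div>"

# Pay the parsing cost once...
document = Meeseeks.parse(html)

# ...then reuse the same Document for every selection
paragraphs = Meeseeks.all(document, "#main p")
main = Meeseeks.one(document, "#main")

texts = Enum.map(paragraphs, &Meeseeks.text/1)
#=> ["1", "2", "3"]
main_tag = Meeseeks.tag(main)
#=> "div"
```

Passing the raw html string to each call instead would work too, but it would be parsed again on every selection.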
Selecting
Meeseeks has two selection functions, all and one, which both accept a queryable (a source, a Document, or a Result) and selectors as arguments.
all returns a list of Results representing every node that matches a selector, while one returns a Result representing the first node that matches a selector (depth-first).
A Result is a node id packaged with the Document for which that id is valid.
html = "<div id=main><p>1</p><p>2</p><p>3</p></div>"
document = Meeseeks.parse(html)
# Selection functions will accept raw html as a source, first parsing it
Meeseeks.all(html, "#main p")
#=> [%Meeseeks.Result{...}, %Meeseeks.Result{...}, %Meeseeks.Result{...}]
# Selection functions will also accept a `Document` as a source
Meeseeks.one(document, "#main p")
#=> %Meeseeks.Result{...}
# Selection functions accept a `Result` as a source
Meeseeks.one(html, "#main") |> Meeseeks.all("p")
#=> [%Meeseeks.Result{...}, %Meeseeks.Result{...}, %Meeseeks.Result{...}]
For an overview of valid selectors, see Selector Syntax below.
Extracting
In order to transform a Result into useful data, you need to use an extraction function.
The provided extraction functions are tag, attrs, attr, tree, text, and data.
html = "<div id=main><p>1</p><p>2</p><p>3</p></div>"
result = Meeseeks.one(html, "#main")
# Maybe you want your result's tag
Meeseeks.tag(result)
#=> "div"
# Or a specific attribute from your result
Meeseeks.attr(result, "id")
#=> "main"
# Or a tuple tree representing your result and its children
Meeseeks.tree(result)
#=> {"div", [{"id", "main"}], [{"p", [], ["1"]}, {"p", [], ["2"]}, ...]}
# Or the joined text of a node and its children
Meeseeks.text(result)
#=> "123"
Selector Syntax
Meeseeks's selector syntax is based on CSS selector syntax.
| Pattern | Example | Notes |
|---|---|---|
| Basic Selectors | --- | --- |
| * | * | Matches any ns or tag |
| tag | div | |
| ns\|tag | <foo:div> | |
| #id | div#bar, #bar | |
| .class | div.baz, .baz | |
| [attr] | a[href], [lang] | |
| [^attrPrefix] | div[^data-] | |
| [attr=val] | a[rel="nofollow"] | |
| [attr~=valIncludes] | div[things~=thing1] | |
| [attr\|=valDash] | p[lang\|=en] | |
| [attr^=valPrefix] | a[href^=https:] | |
| [attr$=valSuffix] | img[src$=".png"] | |
| [attr*=valContaining] | a[href*=admin] | |
| ​ | | |
| Pseudo Classes | --- | --- |
| :first-child | li:first-child | |
| :first-of-type | li:first-of-type | |
| :last-child | tr:last-child | |
| :last-of-type | tr:last-of-type | |
| :not | :not(p:nth-child(even)) | The inner selector cannot contain combinators or another :not pseudo class |
| :nth-child(n) | p:nth-child(even) | Supports even, odd, 1.., or an+b formulas |
| :nth-last-child(n) | p:nth-last-child(2) | Supports even, odd, 1.., or an+b formulas |
| :nth-last-of-type(n) | p:nth-last-of-type(2n+1) | Supports even, odd, 1.., or an+b formulas |
| :nth-of-type(n) | p:nth-of-type(1) | Supports even, odd, 1.., or an+b formulas |
| ​ | | |
| Combinators | --- | --- |
| X Y | div.header .logo | Y descendant of X |
| X > Y | ol > li | Y child of X |
| X + Y | div + p | Y is sibling directly after X |
| X ~ Y | div ~ p | Y is any sibling after X |
| X, Y, Z | button.standard, button.alert | Matches X, Y, or Z |
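
The an+b formulas accepted by the nth-* pseudo classes can be hard to read at a glance. The sketch below is not how Meeseeks implements them; it is a small illustration that assumes the standard CSS meaning, where a node at 1-based position i matches when i = a*n + b for some integer n >= 0. The module name NthSketch and the function matches?/3 are hypothetical and exist only for this example.

```elixir
defmodule NthSketch do
  # Hypothetical helper, not part of Meeseeks.
  # True when the 1-based `index` equals a*n + b for some integer n >= 0.
  def matches?(a, b, index) when is_integer(a) and is_integer(b) and is_integer(index) do
    diff = index - b

    if a == 0 do
      # With a == 0 the formula is just the constant b
      diff == 0
    else
      # diff must be a whole, non-negative multiple of a
      rem(diff, a) == 0 and div(diff, a) >= 0
    end
  end
end

# "odd" is shorthand for 2n+1, "even" for 2n
odd = Enum.filter(1..6, &NthSketch.matches?(2, 1, &1))
#=> [1, 3, 5]
even = Enum.filter(1..6, &NthSketch.matches?(2, 0, &1))
#=> [2, 4, 6]

# -n+3 selects the first three positions
first_three = Enum.filter(1..6, &NthSketch.matches?(-1, 3, &1))
#=> [1, 2, 3]
```

For the nth-last-* variants the same arithmetic applies, with positions counted from the end of the sibling list instead of the start.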