ExCavator
Excavate JavaScript variable values from HTML script tags using proper AST parsing — no regex needed.
Uses OXC (via igniter_js) for fast Rust-based JS parsing, then walks the ESTree AST in pure Elixir to extract variable bindings.
Installation
Add excavator to your list of dependencies in mix.exs:
def deps do
  [
    {:excavator, "~> 0.1.0"}
  ]
end
Usage
Extract all variables from HTML
html = """
<html>
<script>
window.__INITIAL_STATE__ = {"user": "alice", "token": "abc-xyz"};
var apiKey = "secret_key_890";
const data = JSON.parse('{"items":[1,2,3]}');
</script>
</html>
"""
{:ok, results} = ExCavator.extract_all(html)
# [
# %{name: "window.__INITIAL_STATE__", value: %{"user" => "alice", "token" => "abc-xyz"}, source: :assignment},
# %{name: "apiKey", value: "secret_key_890", source: :variable_declaration},
# %{name: "data", value: %{"items" => [1, 2, 3]}, source: :variable_declaration}
# ]
Extract a specific variable
{:ok, state} = ExCavator.extract(html, "__INITIAL_STATE__")
# %{"user" => "alice", "token" => "abc-xyz"}
{:ok, key} = ExCavator.extract(html, "apiKey")
# "secret_key_890"
Parse JS directly (skip HTML)
{:ok, results} = ExCavator.extract_from_js("const x = 42;")
# [%{name: "x", value: 42, source: :variable_declaration}]
Supported patterns
- var/let/const x = VALUE: string, number, boolean, null, object, array
- window.x = VALUE: dotted member assignments (including nested: window.a.b)
- const x = JSON.parse('...'): string argument decoded with Jason
- Negative numbers (-42), template literals (`hello`)
- Nested objects and arrays
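As a rough illustration of how the value shapes listed above could be converted, the sketch below maps ESTree expression nodes to Elixir terms. ValueSketch and to_value/1 are hypothetical names, not ExCavator's internals, and the node shapes (string-keyed maps with "Literal", "UnaryExpression", "TemplateLiteral", and so on) are assumed from the ESTree spec rather than taken from actual igniter_js output. JSON.parse handling is left out to keep the example dependency-free.

```elixir
defmodule ValueSketch do
  # Hypothetical helper: converts an assumed ESTree expression node
  # into a plain Elixir term, or returns :error for unsupported nodes.
  def to_value(%{"type" => "Literal", "value" => value}), do: {:ok, value}

  # Negative numbers arrive as a unary minus wrapped around a literal.
  def to_value(%{"type" => "UnaryExpression", "operator" => "-", "argument" => arg}) do
    case to_value(arg) do
      {:ok, n} when is_number(n) -> {:ok, -n}
      _ -> :error
    end
  end

  # Only template literals without ${...} expressions have a static value.
  def to_value(%{"type" => "TemplateLiteral", "expressions" => [], "quasis" => [quasi]}) do
    {:ok, quasi["value"]["cooked"]}
  end

  def to_value(%{"type" => "ArrayExpression", "elements" => elements}) do
    collect(elements, &to_value/1)
  end

  def to_value(%{"type" => "ObjectExpression", "properties" => props}) do
    with {:ok, pairs} <- collect(props, &to_pair/1), do: {:ok, Map.new(pairs)}
  end

  def to_value(_other), do: :error

  defp to_pair(%{"type" => "Property", "key" => key, "value" => value}) do
    with {:ok, k} <- key_name(key), {:ok, v} <- to_value(value), do: {:ok, {k, v}}
  end

  defp to_pair(_other), do: :error

  # Keys may be bare identifiers ({a: 1}) or string literals ({"a": 1}).
  defp key_name(%{"type" => "Identifier", "name" => name}), do: {:ok, name}
  defp key_name(%{"type" => "Literal", "value" => v}) when is_binary(v), do: {:ok, v}
  defp key_name(_other), do: :error

  # Applies fun to every item, stopping at the first :error.
  defp collect(items, fun) do
    result =
      Enum.reduce_while(items, [], fn item, acc ->
        case fun.(item) do
          {:ok, value} -> {:cont, [value | acc]}
          :error -> {:halt, :error}
        end
      end)

    if is_list(result), do: {:ok, Enum.reverse(result)}, else: :error
  end
end

# Hand-written node for: {offset: -42, "tags": ["a", `hello`]}
expr = %{
  "type" => "ObjectExpression",
  "properties" => [
    %{
      "type" => "Property",
      "key" => %{"type" => "Identifier", "name" => "offset"},
      "value" => %{
        "type" => "UnaryExpression",
        "operator" => "-",
        "argument" => %{"type" => "Literal", "value" => 42}
      }
    },
    %{
      "type" => "Property",
      "key" => %{"type" => "Literal", "value" => "tags"},
      "value" => %{
        "type" => "ArrayExpression",
        "elements" => [
          %{"type" => "Literal", "value" => "a"},
          %{
            "type" => "TemplateLiteral",
            "expressions" => [],
            "quasis" => [%{"type" => "TemplateElement", "value" => %{"raw" => "hello", "cooked" => "hello"}}]
          }
        ]
      }
    }
  ]
}

IO.inspect(ValueSketch.to_value(expr))
# {:ok, %{"offset" => -42, "tags" => ["a", "hello"]}}
```

Returning :error for anything that is not a static value (function calls, identifiers, interpolated templates) keeps the converter from guessing at values that would require executing the script.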
How it works
- Floki parses HTML and extracts inline <script> tag contents
- OXC (via igniter_js Rustler NIF) parses JS to ESTree AST
- Pure Elixir pattern matching walks the AST to find assignments and convert values
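The last step can be pictured with a small, self-contained sketch. It assumes the parser hands back a string-keyed ESTree "Program" map; the AST below is written by hand instead of coming from igniter_js, and WalkSketch, bindings/1, and member_path/1 are hypothetical names, not ExCavator's actual code. Only literal values are handled here, to keep the focus on finding bindings and building dotted names.

```elixir
defmodule WalkSketch do
  # Hypothetical walker: collects top-level bindings from an assumed ESTree Program.
  def bindings(%{"type" => "Program", "body" => body}) do
    Enum.flat_map(body, &statement/1)
  end

  # var/let/const x = <literal>; declarators without a literal init are skipped
  # because they do not match the generator pattern.
  defp statement(%{"type" => "VariableDeclaration", "declarations" => decls}) do
    for %{
          "id" => %{"type" => "Identifier", "name" => name},
          "init" => %{"type" => "Literal", "value" => value}
        } <- decls do
      %{name: name, value: value, source: :variable_declaration}
    end
  end

  # window.x = <literal>, including nested targets such as window.a.b
  defp statement(%{
         "type" => "ExpressionStatement",
         "expression" => %{
           "type" => "AssignmentExpression",
           "operator" => "=",
           "left" => left,
           "right" => %{"type" => "Literal", "value" => value}
         }
       }) do
    case member_path(left) do
      {:ok, parts} -> [%{name: Enum.join(parts, "."), value: value, source: :assignment}]
      :error -> []
    end
  end

  defp statement(_other), do: []

  # Flattens a chain of non-computed member accesses into its name segments.
  defp member_path(%{"type" => "Identifier", "name" => name}), do: {:ok, [name]}

  defp member_path(%{
         "type" => "MemberExpression",
         "computed" => false,
         "object" => object,
         "property" => %{"type" => "Identifier", "name" => prop}
       }) do
    with {:ok, parts} <- member_path(object), do: {:ok, parts ++ [prop]}
  end

  defp member_path(_other), do: :error
end

# Hand-written AST for: var apiKey = "secret_key_890"; window.a.b = 1;
ast = %{
  "type" => "Program",
  "body" => [
    %{
      "type" => "VariableDeclaration",
      "kind" => "var",
      "declarations" => [
        %{
          "type" => "VariableDeclarator",
          "id" => %{"type" => "Identifier", "name" => "apiKey"},
          "init" => %{"type" => "Literal", "value" => "secret_key_890"}
        }
      ]
    },
    %{
      "type" => "ExpressionStatement",
      "expression" => %{
        "type" => "AssignmentExpression",
        "operator" => "=",
        "left" => %{
          "type" => "MemberExpression",
          "computed" => false,
          "object" => %{
            "type" => "MemberExpression",
            "computed" => false,
            "object" => %{"type" => "Identifier", "name" => "window"},
            "property" => %{"type" => "Identifier", "name" => "a"}
          },
          "property" => %{"type" => "Identifier", "name" => "b"}
        },
        "right" => %{"type" => "Literal", "value" => 1}
      }
    }
  ]
}

IO.inspect(WalkSketch.bindings(ast))
```

Matching on node shape in function heads means unsupported statements simply fall through to the catch-all clause and contribute nothing, which is one plausible reason pattern matching suits this kind of walk.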
Without a full AST parser, the most pragmatic and common approach to extracting JavaScript variables from HTML in Elixir takes two steps.
Step 1: Extract the <script> contents with Floki
To parse HTML in Elixir, Floki is the gold standard. It allows you to query HTML using CSS selectors.
First, add Floki to your mix.exs:
def deps do
  [
    {:floki, "~> 0.35.0"} # check hex.pm for the latest version
  ]
end
Then, parse the HTML to isolate the script tags:
html = """
<html>
<head>
<script>
var unimportant = "ignore me";
</script>
<script>
window.__INITIAL_STATE__ = {"user_id": 123, "token": "abc-xyz"};
var apiKey = "secret_key_890";
</script>
</head>
<body>...</body>
</html>
"""
# Parse the document
{:ok, document} = Floki.parse_document(html)
# Find all script tags and extract their raw text content.
# Pass js: true because Floki.text/2 excludes <script> contents by default.
script_contents =
  document
  |> Floki.find("script")
  |> Enum.map(&Floki.text(&1, js: true))
  |> Enum.join("\n") # Combine them if you want to search all at once
Step 2: Extract the JavaScript Variables
Elixir doesn't have a built-in JavaScript AST (Abstract Syntax Tree) parser. For extracting variables during web scraping, a common shortcut is Regex combined with a JSON parser (like Jason), rather than fully parsing or executing the JavaScript.
Here are the two most common scenarios:
Scenario A: Extracting a simple string/integer variable
If you just need a standard variable assignment (e.g., var apiKey = "secret_key_890";), Regex is your best friend.
# Extracting the apiKey variable
regex = ~r/var apiKey = "(.*?)";/
case Regex.run(regex, script_contents) do
  [_, api_key] ->
    IO.puts("Found API Key: #{api_key}")
  nil ->
    IO.puts("API Key not found")
end
Scenario B: Extracting a JSON-like object
Often, modern web apps inject state into the HTML (like window.__INITIAL_STATE__ = {...};). You can use Regex to grab the JSON payload and decode it with Jason.
# Add {:jason, "~> 1.4"} to your mix.exs deps
# Look for the variable, capture everything between the braces
regex = ~r/window\.__INITIAL_STATE__ = (\{.*?\});/s
case Regex.run(regex, script_contents) do
  [_, json_string] ->
    # Decode the captured string into an Elixir map
    case Jason.decode(json_string) do
      {:ok, parsed_state} ->
        IO.inspect(parsed_state, label: "Parsed State")
        # Now you can access parsed_state["token"]
      {:error, _} ->
        IO.puts("Found the variable, but it wasn't valid JSON")
    end
  nil ->
    IO.puts("Initial state not found")
end
What if the JavaScript is too complex for Regex?
If the JavaScript is highly complex, minified unpredictably, or requires actual execution (e.g., the variable is generated by a function like var token = generateToken();), Regex won't cut it.
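A small, hedged demonstration of one such failure: the lazy pattern from Scenario B stops at the first "};" it sees, even when that sequence sits inside a string value. The script text below is made up for illustration.

```elixir
# Hypothetical script content: one string value happens to contain "};".
script = ~S"""
window.__INITIAL_STATE__ = {"snippet": "if (x) { y(); };", "token": "abc-xyz"};
"""

# Same pattern as in Scenario B.
regex = ~r/window\.__INITIAL_STATE__ = (\{.*?\});/s

[_, captured] = Regex.run(regex, script)

# The capture ends inside the "snippet" string, so the "token" key is lost
# and the fragment is no longer valid JSON.
IO.inspect(captured, label: "Captured")
IO.inspect(String.contains?(captured, "token"), label: "Contains token?")
```

Making the pattern greedy only moves the problem, since it would then run to the last "};" in the combined script text. A parser that understands string and brace boundaries avoids both cases.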
In those rare cases, you have two options in Elixir:
- Tree-sitter bindings: You can use the tree_sitter library with tree_sitter_javascript to parse the JS into an AST and traverse it. This is robust but has a steep learning curve.
- Execute it externally: Use a library like NodeJS to send the script content to a background Node.js process, evaluate it, and return the result to Elixir.