lexbor_erl


An Erlang wrapper for the Lexbor HTML parser and DOM library, built on a port-based architecture.

Overview

lexbor_erl provides safe, fast HTML parsing, CSS selector querying, DOM manipulation, and streaming parser capabilities for Erlang applications. It wraps the high-performance Lexbor C library using a port-based worker pool architecture for isolation, safety, and parallel processing.

Features

Prerequisites

Installing Lexbor

On macOS with Homebrew:

brew install lexbor

On Ubuntu/Debian:

sudo apt-get install liblexbor-dev

Or build from source:

git clone https://github.com/lexbor/lexbor.git
cd lexbor
mkdir build && cd build
cmake ..
make
sudo make install

Building

make

Quick Start

1> lexbor_erl:start().
ok

%% Stateless: parse and serialize
2> {ok, Html} = lexbor_erl:parse_serialize(<<"<div>Hello<span>World">>).
{ok,<<"<html><head></head><body><div>Hello<span>World</span></div></body></html>">>}

%% Stateless: select elements
3> {ok, List} = lexbor_erl:select_html(
     <<"<ul><li class=a>A</li><li class=b>B</li></ul>">>, 
     <<"li.b">>).
{ok,[<<"<li class=\"b\">B</li>">>]}

%% Stateful: parse document
4> {ok, Doc} = lexbor_erl:parse(
     <<"<div id=app><ul><li class=a>A</li><li class=b>B</li></ul></div>">>).
{ok,1}

%% Select nodes
5> {ok, Nodes} = lexbor_erl:select(Doc, <<"#app li">>).
{ok,[{node,140735108544752},{node,140735108544896}]}

%% Get node HTML
6> [lexbor_erl:outer_html(Doc, N) || N <- Nodes].
[{ok,<<"<li class=\"a\">A</li>">>},{ok,<<"<li class=\"b\">B</li>">>}]

%% DOM manipulation: modify attributes
7> {ok, [Li]} = lexbor_erl:select(Doc, <<"li.a">>).
{ok,[{node,140735108544752}]}

8> lexbor_erl:set_attribute(Doc, Li, <<"class">>, <<"modified">>).
ok

9> lexbor_erl:get_attribute(Doc, Li, <<"class">>).
{ok,<<"modified">>}

%% DOM manipulation: modify text content
10> lexbor_erl:set_text(Doc, Li, <<"New Text">>).
ok

11> lexbor_erl:get_text(Doc, Li).
{ok,<<"New Text">>}

%% Content manipulation: append HTML to matching elements
12> {ok, NumModified} = lexbor_erl:append_content(Doc, <<"ul">>, <<"<li>New Item</li>">>).
{ok,1}

%% Html is already bound at prompt 2, so bind the result to a fresh variable
13> {ok, Serialized} = lexbor_erl:serialize(Doc).
{ok,<<"<!DOCTYPE html><html><head></head><body><div id=\"app\"><ul><li class=\"modified\">New Text</li><li class=\"b\">B</li><li>New Item</li></ul></div></body></html>">>}

%% Streaming parser: parse incrementally
14> {ok, Session} = lexbor_erl:parse_stream_begin().
{ok,72057594037927937}

15> ok = lexbor_erl:parse_stream_chunk(Session, <<"<div><p>He">>).
ok

16> ok = lexbor_erl:parse_stream_chunk(Session, <<"llo</p></div>">>).
ok

17> {ok, StreamDoc} = lexbor_erl:parse_stream_end(Session).
{ok,72057594037927938}

%% Release documents
18> ok = lexbor_erl:release(Doc).
ok

19> ok = lexbor_erl:release(StreamDoc).
ok

20> lexbor_erl:stop().
ok

Also check out the examples/ directory.
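The stateful calls in the Quick Start leave it to the caller to release each document. The following is a minimal sketch of a wrapper that always releases the document, even if the work in between crashes. The module name `lexbor_example` and both function names are hypothetical. Only `lexbor_erl:parse/1`, `select/2`, `outer_html/2`, and `release/1` come from the session above, and the shape of a failed `parse/1` result is not documented here, so it is passed through unchanged.

```erlang
-module(lexbor_example).
-export([with_document/2, select_outer_html/2]).

%% Parse Html, run Fun(Doc), and release the document afterwards,
%% even if Fun raises. Hypothetical helper, not part of lexbor_erl.
with_document(Html, Fun) when is_binary(Html), is_function(Fun, 1) ->
    case lexbor_erl:parse(Html) of
        {ok, Doc} ->
            try
                Fun(Doc)
            after
                %% Runs on both normal return and exception.
                lexbor_erl:release(Doc)
            end;
        Other ->
            %% Assumption: any non-{ok, Doc} result signals failure;
            %% return it to the caller as-is.
            Other
    end.

%% Return the outer HTML of every node matching Selector.
select_outer_html(Html, Selector) ->
    with_document(Html, fun(Doc) ->
        {ok, Nodes} = lexbor_erl:select(Doc, Selector),
        [begin {ok, Out} = lexbor_erl:outer_html(Doc, N), Out end
         || N <- Nodes]
    end).
```

With the application started, `lexbor_example:select_outer_html(Html, <<"#app li">>)` would collect the same fragments as prompts 4 to 6 above while keeping the document handle from leaking.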

Supported Operations

Document Lifecycle

Stateless Operations

CSS Selectors

DOM Queries

Attribute Manipulation

Text Content

HTML Content Manipulation

DOM Tree Manipulation

Streaming Parser
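A minimal sketch of incremental parsing, built only on the `parse_stream_begin/0`, `parse_stream_chunk/2`, and `parse_stream_end/1` calls shown in the Quick Start. The module name `lexbor_stream_example` and the helpers `chunks/2` and `parse_chunks/1` are hypothetical. Splitting on fixed byte boundaries is an assumption made for illustration; in practice chunks usually arrive from a socket or file in whatever sizes the source delivers.

```erlang
-module(lexbor_stream_example).
-export([chunks/2, parse_chunks/1]).

%% Split a binary into pieces of at most Size bytes (pure helper).
chunks(Bin, Size) when is_binary(Bin), is_integer(Size), Size > 0 ->
    case Bin of
        <<>> ->
            [];
        <<Chunk:Size/binary, Rest/binary>> ->
            [Chunk | chunks(Rest, Size)];
        Last ->
            %% Fewer than Size bytes remain.
            [Last]
    end.

%% Feed a list of binaries to one streaming session and finish it.
%% Returns whatever parse_stream_end/1 returns ({ok, Doc} on success).
parse_chunks(Chunks) ->
    {ok, Session} = lexbor_erl:parse_stream_begin(),
    lists:foreach(
        fun(Chunk) ->
            ok = lexbor_erl:parse_stream_chunk(Session, Chunk)
        end,
        Chunks),
    lexbor_erl:parse_stream_end(Session).
```

For example, `parse_chunks(chunks(Html, 4096))` would feed a large binary in 4 KiB pieces. The resulting document still needs `lexbor_erl:release/1` once it is no longer used.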

Application Management

How to use it in your application?

Add to your rebar.config:

{deps, [
    {lexbor_erl, "0.3.0"}
]}.

Then run:

rebar3 get-deps
rebar3 compile

Note: lexbor_erl is a port-based application and cannot be packaged as an escript. It must be used as a library dependency with access to the compiled C port executable.

See the demo/ directory for a complete working application.

Additional configuration

In your sys.config:

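%% sys.config must contain a single list of {Application, Settings} tuples
[
  {lexbor_erl, [
    {port_cmd, "priv/lexbor_port"},
    {op_timeout_ms, 3000}        % Timeout per operation, in milliseconds
  ]}
].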

Parallelism and Concurrency

lexbor_erl uses a worker pool architecture to enable true parallel processing of HTML operations.

Architecture

Configuration

Set the pool size in your sys.config:

[
  {lexbor_erl, [
    {pool_size, 8},              % Number of parallel workers (default: scheduler count)
    {op_timeout_ms, 3000}        % Timeout per operation, in milliseconds
  ]}
].

Or set it in the application environment before starting the application:

application:set_env(lexbor_erl, pool_size, 8).
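To benefit from the worker pool, calls have to be issued from several Erlang processes at once. The following is a rough sketch of one way to do that with a small parallel map. The module name `lexbor_parallel_example` and the functions `pmap/2` and `normalize_all/1` are hypothetical, and how lexbor_erl routes concurrent requests to workers is not described here, so any speedup is an assumption to be measured rather than a guarantee.

```erlang
-module(lexbor_parallel_example).
-export([pmap/2, normalize_all/1]).

%% Apply Fun to each item in its own linked process and collect the
%% results in input order. A crash in any worker process also
%% terminates the caller, because the processes are linked.
pmap(Fun, Items) ->
    Parent = self(),
    Refs = [begin
                Ref = make_ref(),
                spawn_link(fun() -> Parent ! {Ref, Fun(Item)} end),
                Ref
            end || Item <- Items],
    %% Selective receive on each Ref preserves the original order.
    [receive {Ref, Result} -> Result end || Ref <- Refs].

%% Parse and serialize many HTML binaries concurrently using the
%% stateless call from the Quick Start.
normalize_all(Htmls) ->
    pmap(fun lexbor_erl:parse_serialize/1, Htmls).
```

Each element of the list returned by `normalize_all/1` is the result of one `parse_serialize/1` call, so callers still match on `{ok, Html}` per item.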

Thread Safety and Fault Tolerance

Performance Characteristics

License

LGPL-2.1-or-later

Credits

Built on top of the Lexbor HTML parser library.