loki_xml

Effortless XML parsing for Erlang: simple, fast, and neat.

A high-performance XML parser designed for processing large XML files using chunked, parallel, and distributed processing with built-in fault tolerance.


Features

Architecture

Master Node
│
├── File Supervisor (1 per XML file)
│   ├── Splitter Process (splits XML into chunks)
│   ├── Chunk Supervisor (1 per chunk)
│   │   ├── Parser Worker (parses chunk)
│   │   └── XPath Worker (optional query execution)
│   └── Merger Process (combines results)
│
└── Distributor (optional, sends chunks to remote nodes)
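
The tree above describes one parser worker per chunk, with a merger combining the results. The following is a minimal sketch of that fan-out and collect pattern using only `spawn_monitor/1` from the standard library. The module name `chunk_fanout_sketch`, the function `pmap_chunks/2`, and the `worker_crashed` error tag are hypothetical and chosen for illustration. The library's real implementation uses OTP supervisors and is not shown here.

```erlang
-module(chunk_fanout_sketch).
-export([pmap_chunks/2]).

%% Run ParseFun on every chunk in its own monitored process and
%% return the results in chunk order. A crashed worker yields
%% {error, {worker_crashed, Reason}} instead of crashing the caller.
%% (Hypothetical helper, not part of loki_xml.)
-spec pmap_chunks(fun((binary()) -> term()), [binary()]) -> [term()].
pmap_chunks(ParseFun, Chunks) ->
    Workers = [spawn_monitor(fun() -> exit({done, ParseFun(C)}) end)
               || C <- Chunks],
    [collect(W) || W <- Workers].

%% Wait for the 'DOWN' message of one worker. A normal finish carries
%% the result in the exit reason; anything else is reported as a crash.
collect({Pid, Ref}) ->
    receive
        {'DOWN', Ref, process, Pid, {done, Result}} ->
            Result;
        {'DOWN', Ref, process, Pid, Reason} ->
            {error, {worker_crashed, Reason}}
    end.
```

Because each worker is monitored rather than linked, one bad chunk does not take down its siblings, which is the property the supervision layout is meant to provide.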

Quick Start

%% Start the application
loki_xml:start().

%% Parse a simple XML binary
XML = <<"<book id=\"1\">
          <title>Erlang Programming</title>
          <author>Joe Armstrong</author>
        </book>">>,

{ok, Parsed} = loki_xml:parse(XML).
%% Parsed is bound to: [{element, <<"book">>, [{<<"id">>, <<"1">>}], [...]}]

%% Parse a file
{ok, FileParsed} = loki_xml:parse("large.xml").

%% Parse with options (fresh variable names avoid a badmatch in one shell session)
{ok, RecordsParsed} = loki_xml:parse("huge.xml", #{
    tag => <<"record">>,           % Split by <record> tags
    chunk_size => 1048576,          % 1MB chunks (fallback)
    format => map,                  % Convert to maps
    max_retries => 3                % Retry failed chunks
}).

XPath Queries

{ok, Parsed} = loki_xml:parse("library.xml"),

%% Query all titles
Titles = loki_xml:query(Parsed, <<"//title">>).
%% Returns: [<<"Erlang Programming">>, <<"Learn You Some Erlang">>]

%% Query attributes
Ids = loki_xml:query(Parsed, <<"@id">>).
%% Returns: [<<"1">>, <<"2">>]

%% Nested queries
Authors = loki_xml:query(Parsed, <<"//book/author">>).

Distributed Parsing

Process large XML files across multiple Erlang nodes:

%% Start additional nodes
%% Terminal 1: erl -sname node1@localhost
%% Terminal 2: erl -sname node2@localhost

%% On master node
Nodes = ['node1@localhost', 'node2@localhost'],

{ok, Parsed} = loki_xml:parse("huge.xml", #{
    distributed => true,
    nodes => Nodes,
    tag => <<"record">>
}).

Map Conversion

%% Parse directly to maps
{ok, [Book]} = loki_xml:parse(XML, #{format => map}).

%% Access fields
#{tag := <<"book">>,
  attributes := #{<<"id">> := <<"1">>},
  children := [
      #{tag := <<"title">>, children := [<<"Erlang Programming">>]},
      ...
  ]} = Book.

%% Or convert manually
{ok, [Element]} = loki_xml:parse(XML),
Map = loki_xml:to_map(Element).

API Reference

Main API (loki_xml)

parse/1

-spec parse(binary() | file:filename()) -> {ok, [xml_element()]} | {error, term()}.

Parse XML from binary or file with default options.

parse/2

-spec parse(binary() | file:filename(), parse_opts()) -> {ok, [xml_element()]} | {error, term()}.

Options:

query/2

-spec query([xml_element()], binary()) -> [binary() | xml_element()].

Execute XPath query on parsed XML.
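
To illustrate what a descendant query such as `//title` involves, here is a minimal sketch of a tree walk over the `{element, Tag, Attrs, Children}` tuples shown in Quick Start. It assumes text nodes are plain binaries in the children list. The module `descendant_query_sketch` and its functions are hypothetical and are not the internals of `loki_xml:query/2`.

```erlang
-module(descendant_query_sketch).
-export([descendants/2, text/1]).

%% All elements named Tag anywhere in the forest, in document order.
%% (Hypothetical helper, not part of loki_xml.)
descendants(Nodes, Tag) when is_list(Nodes) ->
    lists:flatmap(fun(N) -> descendants_of(N, Tag) end, Nodes).

%% Tag appears twice in the first clause head, so it only matches
%% elements whose name equals the requested tag.
descendants_of({element, Tag, _Attrs, Children} = El, Tag) ->
    [El | descendants(Children, Tag)];
descendants_of({element, _Other, _Attrs, Children}, Tag) ->
    descendants(Children, Tag);
descendants_of(_Text, _Tag) ->
    [].

%% Concatenate the direct text children of an element.
text({element, _Tag, _Attrs, Children}) ->
    iolist_to_binary([C || C <- Children, is_binary(C)]).
```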

to_map/1

-spec to_map(xml_element()) -> map().

Convert XML element to Erlang map.
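
The conversion can be pictured as a small recursive function. The sketch below maps the tuple shape from Quick Start onto the map shape from Map Conversion (`tag`, `attributes`, `children`). It assumes text nodes are plain binaries. The module name `to_map_sketch` is hypothetical, and the real `loki_xml:to_map/1` may differ in detail.

```erlang
-module(to_map_sketch).
-export([to_map/1]).

%% Convert an {element, Tag, Attrs, Children} tuple into a map,
%% recursing into child elements and leaving text nodes untouched.
%% (Hypothetical sketch, not the library's implementation.)
to_map({element, Tag, Attrs, Children}) ->
    #{tag => Tag,
      attributes => maps:from_list(Attrs),
      children => [to_map(C) || C <- Children]};
to_map(Text) when is_binary(Text) ->
    Text.
```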

Parser Module (loki_xml_parser)

parse/1

-spec parse(binary()) -> {ok, [xml_element()]} | {error, term()}.

Low-level parser for XML chunks.

Splitter Module (loki_xml_splitter)

split_by_tag/2

-spec split_by_tag(binary(), binary()) -> {ok, [binary()]} | {error, term()}.

Split XML by specific tag using regex.
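
A rough sketch of regex-based tag splitting, using the standard `re` module, might look like the following. The module name `split_by_tag_sketch` and the `no_match` error reason are hypothetical. The sketch assumes the tag name contains no regex metacharacters, and it does not handle nested elements of the same name or self-closing tags.

```erlang
-module(split_by_tag_sketch).
-export([split_by_tag/2]).

%% Extract every <Tag ...>...</Tag> span from Xml as its own binary.
%% (Hypothetical sketch, not the library's implementation.)
-spec split_by_tag(binary(), binary()) ->
          {ok, [binary()]} | {error, no_match}.
split_by_tag(Xml, Tag) ->
    %% Opening tag with optional attributes, a non-greedy body, and the
    %% closing tag. `dotall` lets the body span multiple lines.
    Pattern = <<"<", Tag/binary, "(?:\\s[^>]*)?>.*?</", Tag/binary, "\\s*>">>,
    case re:run(Xml, Pattern, [global, dotall, {capture, first, binary}]) of
        {match, Matches} -> {ok, [Chunk || [Chunk] <- Matches]};
        nomatch -> {error, no_match}
    end.
```

Each returned binary is a complete element, so it can be handed to a parser on its own.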

split_fixed/2

-spec split_fixed(binary(), pos_integer()) -> {ok, [binary()]}.

Split XML into fixed-size chunks with boundary reconstruction.

validate_chunk/1

-spec validate_chunk(binary()) -> boolean().

Validate that a chunk is well-formed XML.

Query Module (loki_xml_query)

Supported XPath expressions:

Error Handling

The parser is designed for fault tolerance:

%% Partial success - some chunks fail
{ok, Parsed} = loki_xml:parse("file.xml", #{max_retries => 3}).
%% Returns successfully parsed chunks, logs errors

%% Complete failure
{error, {all_chunks_failed, Reasons}} = loki_xml:parse("bad.xml").

%% Individual chunk errors
{error, {parse_error, Reason}} = loki_xml_parser:parse(<<"<unclosed">>).
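
One way to picture the `max_retries` option is a small wrapper that re-runs a failing operation a bounded number of times. The sketch below is an assumption about the general retry pattern, not the library's code. The module `retry_sketch` and the function `with_retries/2` are hypothetical.

```erlang
-module(retry_sketch).
-export([with_retries/2]).

%% Call Fun until it returns {ok, _} or the retry budget is spent.
%% MaxRetries counts retries after the first attempt, so Fun runs
%% at most MaxRetries + 1 times. The last error is returned on failure.
-spec with_retries(fun(() -> {ok, term()} | {error, term()}),
                   non_neg_integer()) -> {ok, term()} | {error, term()}.
with_retries(Fun, MaxRetries) when is_integer(MaxRetries), MaxRetries >= 0 ->
    case Fun() of
        {ok, _} = Ok ->
            Ok;
        {error, _} when MaxRetries > 0 ->
            with_retries(Fun, MaxRetries - 1);
        {error, _} = Err ->
            Err
    end.
```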

Error types:

Optimization tips:

Testing

# Run all tests
rebar3 eunit

# Run specific test module
rebar3 eunit --module=loki_xml_tests

# Run examples
erl -pa _build/default/lib/*/ebin
1> loki_xml_examples:run_all_examples().

Design Decisions

Why Regex-Based Splitting?

Regex splitting allows:

Falling back to fixed-size chunking ensures robustness for unstructured XML.

Why SAX-Like Parsing?

Why OTP Supervision?

Limitations

Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass: rebar3 eunit
  5. Submit a pull request

License

Apache License 2.0

Acknowledgments

Support

Related Projects


Star ⭐ this project if you find it useful!