loki_xml
Effortless XML parsing for Erlang, simple, fast, and neat.
A high-performance XML parser designed for processing large XML files using chunked, parallel, and distributed processing with built-in fault tolerance.
Features
- High Performance: Streaming SAX-like parsing without loading entire files into memory
- Chunked Processing: Intelligent XML splitting using regex or fixed-size chunks
- Parallel Parsing: Process multiple chunks concurrently
- Distributed: Scale across multiple Erlang nodes
- Fault Tolerant: OTP supervision with automatic recovery
- XPath Queries: Basic XPath support (
//,/,@attribute) - Flexible Output: Convert to Erlang terms or maps
Architecture
Master Node
│
├── File Supervisor (1 per XML file)
│ ├── Splitter Process (splits XML into chunks)
│ ├── Chunk Supervisor (1 per chunk)
│ │ ├── Parser Worker (parses chunk)
│ │ └── XPath Worker (optional query execution)
│ └── Merger Process (combines results)
│
└── Distributor (optional, sends chunks to remote nodes)Quick Start
%% Start the application
loki_xml:start().
%% Parse a simple XML binary
XML = <<"<book id=\"1\">
<title>Erlang Programming</title>
<author>Joe Armstrong</author>
</book>">>,
{ok, Parsed} = loki_xml:parse(XML).
%% Returns: [{element, <<"book">>, [{<<"id">>, <<"1">>}], [...]}]
%% Parse a file
{ok, Parsed} = loki_xml:parse("large.xml").
%% Parse with options
{ok, Parsed} = loki_xml:parse("huge.xml", #{
tag => <<"record">>, % Split by <record> tags
chunk_size => 1048576, % 1MB chunks (fallback)
format => map, % Convert to maps
max_retries => 3 % Retry failed chunks
}).XPath Queries
{ok, Parsed} = loki_xml:parse("library.xml"),
%% Query all titles
Titles = loki_xml:query(Parsed, <<"//title">>).
%% Returns: [<<"Erlang Programming">>, <<"Learn You Some Erlang">>]
%% Query attributes
Ids = loki_xml:query(Parsed, <<"@id">>).
%% Returns: [<<"1">>, <<"2">>]
%% Nested queries
Authors = loki_xml:query(Parsed, <<"//book/author">>).Distributed Parsing
Process large XML files across multiple Erlang nodes:
%% Start additional nodes
%% Terminal 1: erl -sname node1@localhost
%% Terminal 2: erl -sname node2@localhost
%% On master node
Nodes = ['node1@localhost', 'node2@localhost'],
{ok, Parsed} = loki_xml:parse("huge.xml", #{
distributed => true,
nodes => Nodes,
tag => <<"record">>
}).Map Conversion
%% Parse directly to maps
{ok, [Book]} = loki_xml:parse(XML, #{format => map}).
%% Access fields
#{tag := <<"book">>,
attributes := #{<<"id">> := <<"1">>},
children := [
#{tag := <<"title">>, children := [<<"Erlang Programming">>]},
...
]} = Book.
%% Or convert manually
{ok, [Element]} = loki_xml:parse(XML),
Map = loki_xml:to_map(Element).API Reference
Main API (loki_xml)
parse/1
-spec parse(binary() | file:filename()) -> {ok, [xml_element()]} | {error, term()}.Parse XML from binary or file with default options.
parse/2
-spec parse(binary() | file:filename(), parse_opts()) -> {ok, [xml_element()]} | {error, term()}.Options:
format:term(default) ormapdistributed:boolean()- Enable distributed parsingnodes:[node()]- List of nodes for distributionchunk_size:pos_integer()- Chunk size in bytes (default: 1MB)tag:binary()- Tag name for regex-based splittingmax_retries:non_neg_integer()- Max retry attempts (default: 3)
query/2
-spec query([xml_element()], binary()) -> [binary() | xml_element()].Execute XPath query on parsed XML.
to_map/1
-spec to_map(xml_element()) -> map().Convert XML element to Erlang map.
Parser Module (loki_xml_parser)
parse/1
-spec parse(binary()) -> {ok, [xml_element()]} | {error, term()}.Low-level parser for XML chunks.
Splitter Module (loki_xml_splitter)
split_by_tag/2
-spec split_by_tag(binary(), binary()) -> {ok, [binary()]} | {error, term()}.Split XML by specific tag using regex.
split_fixed/2
-spec split_fixed(binary(), pos_integer()) -> {ok, [binary()]}.Split XML into fixed-size chunks with boundary reconstruction.
validate_chunk/1
-spec validate_chunk(binary()) -> boolean().Validate that a chunk is well-formed XML.
Query Module (loki_xml_query)
Supported XPath expressions:
//tag- Descendant-or-self axis (all descendants)/tag- Child axis from root//tag/subtag- Nested paths@attribute- Attribute values
Error Handling
The parser is designed for fault tolerance:
%% Partial success - some chunks fail
{ok, Parsed} = loki_xml:parse("file.xml", #{max_retries => 3}).
%% Returns successfully parsed chunks, logs errors
%% Complete failure
{error, {all_chunks_failed, Reasons}} = loki_xml:parse("bad.xml").
%% Individual chunk errors
{error, {parse_error, Reason}} = loki_xml_parser:parse(<<"<unclosed">>).Error types:
{parse_error, Reason}- Malformed XML{unclosed_tag, Tag}- Missing closing tag{tag_mismatch, Expected, Actual}- Mismatched tags{timeout, Pid}- Chunk parsing timeout{max_retries, Chunk}- Exceeded retry limit
Optimization tips:
-
Use
tagoption for structured XML (faster than fixed-size) - Enable distribution for files > 100MB
-
Adjust
chunk_sizebased on record size - Use streaming for very large files
Testing
# Run all tests
rebar3 eunit
# Run specific test module
rebar3 eunit --module=loki_xml_tests
# Run examples
erl -pa _build/default/lib/*/ebin
1> loki_xml_examples:run_all_examples().Design Decisions
Why Regex-Based Splitting?
Regex splitting allows:
- Fast chunk identification without full parsing
- Preservation of logical record boundaries
- Better parallelization for structured XML
Fallback to fixed-size ensures robustness for unstructured XML.
Why SAX-Like Parsing?
- Memory efficient for large files
- Stream processing without loading entire document
- Better performance for selective queries
Why OTP Supervision?
- Automatic recovery from parser crashes
- Isolated failures (one bad chunk doesn't fail entire parse)
- Production-ready fault tolerance
Limitations
- XPath: Basic support only (
//,/,@) - no predicates or functions - Namespaces: Not currently supported
- DTD/Schema: No validation support
- Entities: Limited entity reference support
Contributing
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Add tests for new functionality
-
Ensure all tests pass:
rebar3 eunit - Submit a pull request
License
Apache License 2.0
Acknowledgments
- Inspired by Erlang's robust concurrency model
- Built with OTP supervision principles
- Thanks to the Erlang community
Support
- Issues: https://github.com/roquess/loki_xml/issues
- Documentation: https://hexdocs.pm/loki_xml
- Discussions: https://github.com/roquess/loki_xml/discussions
Related Projects
xmerl- Erlang's standard XML parsererlsom- SAX-based XML parser for Erlangfast_xml- Fast XML parser (C NIF-based)
Star ⭐ this project if you find it useful!