gen_xml
A behaviour module for processing an XML document.
This is the generic version of a couple of near-identical modules I have created in other projects.
The module will scan a supplied XML document using
xmerl_sax_parser:file/2. While scanning the XML file the callback
functions are called when encountering the start/end element tags. See
the overview docs for details.
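As a rough illustration of the underlying mechanism, the sketch below drives xmerl_sax_parser directly and counts start tags. It does not use gen_xml's own callback API (see the overview docs for that); the module name sax_sketch and its functions are hypothetical and exist only for this example.

```erlang
-module(sax_sketch).
-export([count_file/1, count_string/1]).

%% Hypothetical sketch: count start tags in a file with
%% xmerl_sax_parser:file/2, the same parser entry point gen_xml builds on.
count_file(File) ->
    result(xmerl_sax_parser:file(File, options())).

%% Same idea for an in-memory binary, via xmerl_sax_parser:stream/2.
count_string(Xml) when is_binary(Xml) ->
    result(xmerl_sax_parser:stream(Xml, options())).

%% The parser calls event_fun for every SAX event and threads
%% event_state through the calls.
options() ->
    [{event_fun, fun handle_event/3}, {event_state, 0}].

%% Increment the counter on each start tag, ignore all other events.
handle_event({startElement, _Uri, _LocalName, _QName, _Attrs}, _Loc, Count) ->
    Count + 1;
handle_event(_Other, _Loc, Count) ->
    Count.

result({ok, Count, _Rest}) -> {ok, Count};
result(Error) -> {error, Error}.
```

Because the state is a single integer and no document tree is built, memory use stays flat regardless of the number of elements.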
The module is particularly useful for scanning/processing very large
documents. The simpler alternative scanner, xmerl_scan:file/1, does
not scale well for large files, for example a document with 1,000,000
elements.
The behaviour has the added advantage of enabling processing of the XML elements asynchronously while the scanner continues with scanning the rest of the document.
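One way to picture this is a SAX event fun that hands each element to a separate process and returns immediately. The following is a minimal sketch under that assumption, not gen_xml's actual implementation; the module async_sketch, the message shapes and the placeholder "processing" step (upper-casing the tag name) are all invented for illustration.

```erlang
-module(async_sketch).
-export([run/1]).

%% Hypothetical sketch: the parser sends each start tag to a worker
%% process and carries on scanning without waiting for a reply.
run(Xml) when is_binary(Xml) ->
    Worker = spawn_link(fun() -> worker_loop([]) end),
    EventFun =
        fun({startElement, _Uri, LocalName, _QName, _Attrs}, _Loc, Pid) ->
                Pid ! {element, LocalName},   % fire and forget
                Pid;
           (_Other, _Loc, Pid) ->
                Pid
        end,
    {ok, Worker, _Rest} =
        xmerl_sax_parser:stream(Xml, [{event_fun, EventFun},
                                      {event_state, Worker}]),
    %% Messages between two processes arrive in order, so 'done' is
    %% handled only after every element message.
    Worker ! {done, self()},
    receive
        {result, Worker, Names} -> {ok, Names}
    after 5000 ->
        {error, timeout}
    end.

%% Placeholder processing: upper-case each tag name as it arrives.
worker_loop(Acc) ->
    receive
        {element, Name} ->
            worker_loop([string:uppercase(Name) | Acc]);
        {done, From} ->
            From ! {result, self(), lists:reverse(Acc)}
    end.
```

In a real handler the worker would do the expensive part (database writes, transformations), which is where overlapping it with parsing pays off.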
To use the module in a project, add gen_xml to rebar.config, e.g.
{deps, [ gen_xml ]}.
See the Examples directory for some example callback modules.
Recent changes
- The paths callback module has been improved and moved to the main library.
- The test suite has been extended for more complete coverage.
- A new callback module, attrs, has been added to the examples.
Build and test
The rebar3 tool is used for all the development processes.
$ rebar3 dialyzer
$ rebar3 eunit
$ rebar3 shell
...
The callback modules
A number of (hopefully) useful callback modules have been included
with the library. These are described below. There are more in the
Examples/genxml directory.
The null callback module
The genxml_null module is used for testing and benchmarking.
It can also be used as a template for new callback modules.
The callback module can be run manually against a file File with:
$ rebar3 shell
1> genxml_null:start("Examples/sample-xml-files-sample-5.xml").
{ok,null}
2> timer:tc(genxml_null, start, ["Examples/sample-xml-files-sample-4.xml"]).
{158,{ok,null}}
3> timer:tc(genxml_null, start, ["Examples/sample-xml-files-sample-6.xml"]).
{2841,{ok,null}}
The counts callback module
The genxml_counts module is used for testing and benchmarking.
It can also be used as a template for new callback modules.
The module will return the count of the element tags found in the XML document.
The callback module can be run manually against a file File with:
$ rebar3 shell
1> genxml_counts:start("Examples/sample-xml-files-sample-4.xml").
{ok,#{name => 2,root => 1,title => 1,author => 1,person => 2,
age => 2,email => 2,book => 1,year => 1}}
2> genxml_counts:start("Examples/sample-xml-files-sample-6.xml").
{ok,#{name => 48,root => 1,title => 24,author => 24,person => 48,
age => 48,email => 48,book => 24,year => 24}}
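The kind of tally shown above can be sketched with a map as the SAX event state. This is an illustrative stand-in rather than the genxml_counts source: the module name counts_sketch is hypothetical, it calls xmerl_sax_parser directly, and it keys the map by tag-name strings instead of atoms so that no atoms are created from document content.

```erlang
-module(counts_sketch).
-export([counts/1]).

%% Hypothetical sketch: tally start tags per element name in a map.
counts(Xml) when is_binary(Xml) ->
    EventFun =
        fun({startElement, _Uri, LocalName, _QName, _Attrs}, _Loc, Counts) ->
                %% Add 1 to an existing entry, or start a new one at 1.
                maps:update_with(LocalName, fun(N) -> N + 1 end, 1, Counts);
           (_Other, _Loc, Counts) ->
                Counts
        end,
    case xmerl_sax_parser:stream(Xml, [{event_fun, EventFun},
                                       {event_state, #{}}]) of
        {ok, Counts, _Rest} -> {ok, Counts};
        Error -> {error, Error}
    end.
```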
The paths callback module
This module reads an XML document and generates a list of paths that represents the XML document structure.
The module includes two helper functions: collect/1 and print/1,
the former returns the paths as a list, while the latter prints them
directly to the terminal.
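A path list of this kind can be built by keeping a stack of the currently open elements. The sketch below shows that technique with xmerl_sax_parser directly; it is not the genxml_paths source, the module name paths_sketch is hypothetical, and it returns the paths in document order.

```erlang
-module(paths_sketch).
-export([paths/1]).

%% Hypothetical sketch: derive element paths from a stack of open tags.
%% State is {Stack, Acc}: Stack holds open element names, innermost
%% first; Acc holds the paths seen so far, newest first.
paths(Xml) when is_binary(Xml) ->
    EventFun =
        fun({startElement, _Uri, LocalName, _QName, _Attrs}, _Loc, {Stack, Acc}) ->
                NewStack = [LocalName | Stack],
                Path = lists:flatten(lists:join("/", lists:reverse(NewStack))),
                {NewStack, [Path | Acc]};
           ({endElement, _Uri, _LocalName, _QName}, _Loc, {[_ | Stack], Acc}) ->
                %% Closing tag: pop the innermost open element.
                {Stack, Acc};
           (_Other, _Loc, State) ->
                State
        end,
    case xmerl_sax_parser:stream(Xml, [{event_fun, EventFun},
                                       {event_state, {[], []}}]) of
        {ok, {_Stack, Acc}, _Rest} -> {ok, lists:reverse(Acc)};
        Error -> {error, Error}
    end.
```

The real helper functions are used as follows: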
$ rebar3 shell
1> genxml_paths:collect("Examples/sample-xml-files-sample-4.xml").
{ok,["root/book/year","root/book/author","root/book/title",
"root/book","root/person/email","root/person/age",
"root/person/name","root/person","root/person/email",
"root/person/age","root/person/name","root/person","root"]}
2> genxml_paths:print("Examples/sample-xml-files-sample-4.xml").
root
root/person
root/person/name
root/person/age
root/person/email
root/person
root/person/name
root/person/age
root/person/email
root/book
root/book/title
root/book/author
root/book/year
{ok,[]}
The CLI
The command line interface enables running the various handlers from the terminal.
The module is gen_xml_cli.erl; however, the escript generated with
rebar3 is renamed to gen_xml.
$ rebar3 escriptize
$ ./_build/default/bin/gen_xml
error: gen_xml: subcommand expected
Usage:
gen_xml {counts|null|paths} [-v] [--verbose] <file>
Subcommands:
counts run the counts callback module
null run the null callback module
paths run the paths callback module
Arguments:
file file
Optional arguments:
-v, --verbose be verbose, can use multiple times for warning to debug
The CLI subcommands and options are defined in the include file.
You can create a symlink for convenience:
$ ln -sv _build/default/bin/gen_xml
The output of the CLI is formatted for readability as well as for further processing:
$ ./gen_xml counts Examples/sample-xml-files-sample-5.xml
16,name
1,root
8,title
8,author
16,person
16,age
16,email
8,book
8,year
$ ./gen_xml paths Examples/sample-xml-files-sample-4.xml
root
root/person
root/person/name
root/person/age
root/person/email
root/person
root/person/name
root/person/age
root/person/email
root/book
root/book/title
root/book/author
root/book/year
$ ./gen_xml paths Examples/sample-xml-files-sample-4.xml | sort -u
root
root/book
root/book/author
root/book/title
root/book/year
root/person
root/person/age
root/person/email
root/person/name