Text.Stemmer

Pre-compiled Snowball stemmers for Elixir.

This package ships 36 stemming algorithms covering a wide range of natural languages, accessible through a single Text.Stemmer.stem/2 entry point. The stemmers themselves were generated from the canonical Snowball algorithm sources using the :snowball compiler, which is included as a runtime dependency.

Installation

Add :text_stemmer to your mix.exs deps:

def deps do
  [
    {:text_stemmer, "~> 0.1"}
  ]
end

Usage

iex> Text.Stemmer.stem("generalizations", :en)
"general"

iex> Text.Stemmer.stem("gouvernements", :fr)
"gouvern"

iex> Text.Stemmer.stem_list(["running", "ran", "runs"], :en)
["run", "ran", "run"]

Languages are identified by their ISO 639-1 two-letter code, passed as an atom (:en, :fr). Algorithm-specific variants use a <code>_<algorithm> form: :en_porter, :en_lovins, :nl_porter. See the Text.Stemmer moduledoc for the full table.

iex> length(Text.Stemmer.supported_languages())
36
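
The calls above can be combined into a small text-stemming helper. The following is only a sketch: the StemSketch module, its stem_text/3 function, and the tokenization rule (lowercase, split on runs of non-letter characters) are hypothetical and not part of this package. Only Text.Stemmer.stem_list/2 comes from the library, and it is passed in as a default argument so the helper can be exercised with a stand-in function.

```elixir
defmodule StemSketch do
  # Hypothetical helper, not part of Text.Stemmer.
  # Lowercases the text, splits it on runs of non-letter characters,
  # and hands the resulting tokens to a list-stemming function.
  # The default stemming function is Text.Stemmer.stem_list/2.
  def stem_text(text, lang, stem_list_fun \\ &Text.Stemmer.stem_list/2) do
    tokens =
      text
      |> String.downcase()
      |> String.split(~r/[^\p{L}]+/u, trim: true)

    stem_list_fun.(tokens, lang)
  end
end
```

Given the stem_list/2 results shown earlier, StemSketch.stem_text("Running, runs!", :en) should return ["run", "run"]. Real text usually needs more careful tokenization (apostrophes, hyphens, digits), which this sketch deliberately ignores.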

Regenerating stemmers

The pre-generated stemmer modules under lib/text/stemmer/stemmers/ are produced from the .sbl algorithm sources vendored in src/algorithms/ (taken from snowballstem/snowball). To regenerate after editing or updating a source file, run:

mix snowball.gen --module-prefix Text.Stemmer.Stemmers \
                 --output-dir lib/text/stemmer/stemmers

The mix snowball.gen task is supplied by the :snowball compiler dependency.

Compliance testing

Each generated stemmer is verified against the canonical Snowball corpus from snowballstem/snowball-data, vendored under test/data/<lang>/ as gzipped voc.txt/output.txt pairs. Compliance tests are tagged :compliance and excluded by default; run them explicitly with:

mix test --only compliance
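
For readers unfamiliar with ExUnit tag filtering, the setup described above typically looks something like the sketch below. This is an illustration under stated assumptions, not the project's actual test code: the module name, the en directory name, and the .gz file names are all hypothetical. ExUnit.start/1 with the :exclude option, @moduletag, and :zlib.gunzip/1 are standard ExUnit and Erlang/OTP features.

```elixir
# test/test_helper.exs (sketch): skip :compliance tests unless they are
# requested explicitly, e.g. with `mix test --only compliance`.
ExUnit.start(exclude: [:compliance])

# Hypothetical compliance test; paths and file names are assumptions.
defmodule ComplianceSketchTest do
  use ExUnit.Case, async: true

  # Tags every test in this module as :compliance.
  @moduletag :compliance

  # Reads a gzipped file with one word per line and returns the lines.
  defp read_lines(path) do
    path
    |> File.read!()
    |> :zlib.gunzip()
    |> String.trim_trailing("\n")
    |> String.split("\n")
  end

  test "English stemmer matches the reference output" do
    words = read_lines("test/data/en/voc.txt.gz")
    expected = read_lines("test/data/en/output.txt.gz")

    assert Text.Stemmer.stem_list(words, :en) == expected
  end
end
```

Comparing whole lists keeps the sketch short; a per-word comparison would give more readable failures on a large corpus.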

The corpus is not shipped with the Hex package; it lives in the source tree only. Per-language licensing notes from upstream are preserved in test/data/<lang>/COPYING files. The Arabic corpus is GPLv3; the rest are mostly BSD-3-Clause or CC BY-SA. See test/data/COPYING for the umbrella terms.

Documentation

Full API documentation is published at https://hexdocs.pm/text_stemmer.

License

Apache-2.0. See LICENSE.md.