Text

Text & language processing for Elixir.

A toolkit for tokenization, language identification, sentiment analysis, named-entity recognition, word clouds, phonetic encoding, search ranking, and the supporting plumbing — all in pure BEAM, with optional ML backends behind feature flags.

Capabilities

Detection and analysis

Language identification (Text.Language.Classifier.Fasttext) — pure-Elixir port of fastText's lid.176, 176 languages, validated bit-for-bit against the reference. ~100 µs per prediction with EXLA.
Sentiment analysis (Text.Sentiment) — multilingual; bundled AFINN lexicons (default, 7 languages + emoticons) or XLM-RoBERTa via Bumblebee (optional, ~30 languages).
Part-of-speech tagging (Text.POS) — via Bumblebee, English by default.
Named-entity recognition (Text.NER) — via Bumblebee, multilingual (10 high-resource languages).

Strings

String distance (Text.Distance) — Levenshtein, Damerau-Levenshtein, Hamming, Jaro, Jaro-Winkler.
Set similarity (Text.Similarity) — Jaccard, Dice, overlap, cosine.
Phonetic encoding (Text.Phonetic.Soundex, Text.Phonetic.Metaphone).
Slugification (Text.Slug) — locale-aware Unicode folding with cross-script transliteration.
Segmentation (Text.Segment) — UAX #29 word/sentence boundaries with CLDR abbreviation suppressions.
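To make two of the measures above concrete, here is a minimal pure-Elixir sketch of Levenshtein distance and Jaccard similarity. The module name `DistanceSketch` and both function signatures are hypothetical and chosen only for illustration; they are not the Text.Distance or Text.Similarity API, whose exact functions should be checked in the package docs.

```elixir
defmodule DistanceSketch do
  # Illustrative only: not the library's implementation or API.

  @doc "Levenshtein edit distance over graphemes, computed one DP row at a time."
  def levenshtein(a, b) do
    a_chars = String.graphemes(a)
    b_chars = String.graphemes(b)

    # Row 0: distance from the empty prefix of `a` to each prefix of `b`.
    first_row = Enum.to_list(0..length(b_chars))

    a_chars
    |> Enum.reduce(first_row, fn a_char, [prev_head | prev_tail] ->
      # `diag` is the cell up-and-left, `left` is the cell just written.
      {row_reversed, _diag, _left} =
        b_chars
        |> Enum.zip(prev_tail)
        |> Enum.reduce({[prev_head + 1], prev_head, prev_head + 1}, fn
          {b_char, above}, {acc, diag, left} ->
            cost = if a_char == b_char, do: 0, else: 1
            cell = Enum.min([above + 1, left + 1, diag + cost])
            {[cell | acc], above, cell}
        end)

      Enum.reverse(row_reversed)
    end)
    |> List.last()
  end

  @doc "Jaccard similarity of two token lists: |A ∩ B| / |A ∪ B|."
  def jaccard(tokens_a, tokens_b) do
    set_a = MapSet.new(tokens_a)
    set_b = MapSet.new(tokens_b)
    union_size = MapSet.size(MapSet.union(set_a, set_b))

    # By convention in this sketch, two empty sets count as identical.
    if union_size == 0 do
      1.0
    else
      MapSet.size(MapSet.intersection(set_a, set_b)) / union_size
    end
  end
end

DistanceSketch.levenshtein("kitten", "sitting")
#=> 3
```

Working over graphemes rather than bytes keeps accented characters and emoji as single edit units, which matters for multilingual input.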

Statistics and search

N-grams and word counts (Text.Ngram, Text.Word).
TF-IDF and BM25 (Text.IR) — indexed corpus with scoring and top-K search.
Collocation extraction (Text.Collocation) — bigrams ranked by frequency, PMI, or log-likelihood.
Concordance (Text.KWIC) — keyword-in-context lookup.
Word embeddings (Text.Embedding) — load fastText .vec files, then cosine similarity, nearest neighbours, and analogies.
Word clouds (Text.WordCloud) — multilingual keyword extraction (six scoring backends) plus spiral layout and SVG rendering.
Stopwords (Text.Stopwords) — bundled lists for ~60 languages from stopwords-iso.
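As a rough illustration of what BM25 scoring with top-K search involves, the sketch below ranks pre-tokenized documents against a query. Everything here is an assumption made for the example: the `BM25Sketch` module, the `rank/3` signature, the parameters k1 = 1.2 and b = 0.75, and the Lucene-style IDF variant. It is not the Text.IR API and may differ from how that module indexes or scores.

```elixir
defmodule BM25Sketch do
  # Illustrative only: not the library's implementation or API.
  @k1 1.2
  @b 0.75

  @doc """
  Scores each document (a list of tokens) against the query tokens and
  returns the `top_k` best `{doc_index, score}` pairs, highest score first.
  """
  def rank(docs, query, top_k \\ 3) do
    n = length(docs)
    avg_len = Enum.sum(Enum.map(docs, &length/1)) / max(n, 1)

    # Document frequency: the number of documents containing each term.
    df =
      docs
      |> Enum.flat_map(&Enum.uniq/1)
      |> Enum.frequencies()

    docs
    |> Enum.with_index()
    |> Enum.map(fn {doc, index} -> {index, score(doc, query, df, n, avg_len)} end)
    |> Enum.sort_by(fn {_index, score} -> score end, :desc)
    |> Enum.take(top_k)
  end

  defp score(doc, query, df, n, avg_len) do
    tf = Enum.frequencies(doc)

    # Length normalisation; falls back to 1.0 if every document is empty.
    len_norm =
      if avg_len == 0, do: 1.0, else: 1 - @b + @b * length(doc) / avg_len

    query
    |> Enum.uniq()
    |> Enum.reduce(0.0, fn term, acc ->
      f = Map.get(tf, term, 0)
      n_t = Map.get(df, term, 0)
      idf = :math.log(1 + (n - n_t + 0.5) / (n_t + 0.5))
      acc + idf * f * (@k1 + 1) / (f + @k1 * len_norm)
    end)
  end
end

docs = [
  ~w(the cat sat on the mat),
  ~w(dogs chase cats),
  ~w(the cat chased the cat)
]

BM25Sketch.rank(docs, ~w(cat))
```

With these inputs the third document ranks first (two occurrences of "cat" in a shorter document), and the second scores zero because "cats" is a different token; a stemmer or lemmatizer would be needed to merge such variants.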

Inflection

English pluralization (Text.Inflect.En) — modern and classical modes.

Installation

def deps do
  [
    {:text, "~> 0.3.0"}
  ]
end

For the language identifier, fetch the lid.176.bin model once after install:

mix text.download_lid176

For production environments using the optional Bumblebee-backed modules, mix text.download_models (plural) pre-fetches every external artefact — lid.176.bin plus the default Hugging Face checkpoints — so the first call to each module never hits the network.

A taste

# Sentiment — multilingual, no model download by default.
Text.Sentiment.analyze("J'adore ce livre.", language: :fr).label
#=> :positive

# Language identification — load the fastText model once.
{:ok, model} = Text.Language.Classifier.Fasttext.ModelLoader.load(
  Path.join(:code.priv_dir(:text), "lid_176/lid.176.bin")
)

{:ok, "es"} = Text.Language.Classifier.Fasttext.classify("Hola, ¿cómo estás?", model)

# Word cloud → SVG file in four piped steps.
text
|> Text.WordCloud.terms(language: :en)
|> Text.WordCloud.Layout.layout(width: 800, height: 600, rotations: :radial)
|> Text.WordCloud.SVG.render(palette: Color.Palette.tonal("#3b82f6"))
|> then(&File.write!("cloud.svg", &1))

Guides

In-depth walkthroughs with worked examples:

Text classification (language identification) — setup, detect/3, CLDR locale resolution, Hans/Hant disambiguation, performance tuning.
Sentiment analysis — lexicon vs Bumblebee backends, multilingual lexicons, custom lexicons, threshold tuning, production wiring.
Part-of-speech tagging and NER — Bumblebee setup, tag sets, model pre-download, named Nx.Servings for high-QPS workloads.
Keyword-in-context concordance — Text.KWIC.concordance/3, formatting, collocate scans, sense disambiguation patterns.
Word clouds — six scoring backends (YAKE!, frequency, RAKE, TextRank, TF-IDF, KeyBERT), Wordle-style layouts (:radial/:spiral), SVG rendering with Color.Palette integration.

Optional dependencies

The package works without any optional deps. Adding them enables progressively heavier capabilities:

Dep	Enables
`:exla`	Order-of-magnitude faster inference for Fasttext and the Bumblebee-backed modules. Strongly recommended in production.
`:bumblebee`	Neural sentiment, POS, NER, and the KeyBERT word-cloud backend.
`:localize`	CLDR-canonical locale resolution (`fr-Latn-CA`, `zh-Hans-CN`) and `Localize.LanguageTag` input shapes.
`:color`	`Color.Palette.Tonal` and `Theme` palettes for SVG word-cloud rendering.
`:text_stemmer`	Snowball stemming (~30 languages) for word-cloud morphological-variant consolidation.

Calls that need a missing optional dep raise with installation instructions; the rest of the package keeps working.
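A dependency list that opts into two of the optional packages might look like the sketch below. The package names come from the table above; the `">= 0.0.0"` requirements are placeholders that match any version, not recommendations, so pin real versions from each package's page on Hex before using this.

```elixir
def deps do
  [
    {:text, "~> 0.3.0"},
    # Placeholder requirements (assumption): replace with real version pins.
    {:exla, ">= 0.0.0"},
    {:bumblebee, ">= 0.0.0"}
  ]
end
```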

Every public function that takes a :language (or :locale) accepts an atom (:fr), a string ("fr", "fr-CA", "zh-Hans-CN"), or a Localize.LanguageTag struct (when :localize is loaded). See Text.Language for the normalisation helpers.

Roadmap

Quantized model support for lid.176.ftz (917 KB variant).
Nx.Serving integration for batched inference in throughput-bound workloads.
CLDR-tailored segmentation once the unicode_set regex engine matures.

License

Apache 2.0 — see LICENSE.md.