Text
Text & language processing for Elixir.
A toolkit for tokenization, language identification, sentiment analysis, named-entity recognition, word clouds, phonetic encoding, search ranking, and the supporting plumbing — all in pure BEAM, with optional ML backends behind feature flags.
Capabilities
Detection and analysis
- Language identification (Text.Language.Classifier.Fasttext) — pure-Elixir port of fastText's lid.176, 176 languages, validated bit-for-bit against the reference. ~100 µs per prediction with EXLA.
- Sentiment analysis (Text.Sentiment) — multilingual; bundled AFINN lexicons (default, 7 languages + emoticons) or XLM-RoBERTa via Bumblebee (optional, ~30 languages).
- Part-of-speech tagging (Text.POS) — via Bumblebee, English by default.
- Named-entity recognition (Text.NER) — via Bumblebee, multilingual (10 high-resource languages).
Strings
- String distance (Text.Distance) — Levenshtein, Damerau-Levenshtein, Hamming, Jaro, Jaro-Winkler.
- Set similarity (Text.Similarity) — Jaccard, Dice, overlap, cosine.
- Phonetic encoding (Text.Phonetic.Soundex, Text.Phonetic.Metaphone).
- Slugification (Text.Slug) — locale-aware Unicode folding with cross-script transliteration.
- Segmentation (Text.Segment) — UAX #29 word/sentence boundaries with CLDR abbreviation suppressions.
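To make the set-similarity measures concrete, here is a minimal sketch of Jaccard and Dice over word-token sets in plain Elixir. It does not call Text.Similarity, whose function names and options are not shown in this README; the module SimilaritySketch and its whitespace tokenizer are hypothetical, and returning 1.0 for two empty sets is a convention chosen only for this example.

```elixir
defmodule SimilaritySketch do
  # Hypothetical helper module for illustration; not part of the Text package.

  # Lowercase and split on whitespace to obtain a set of word tokens.
  def tokens(string) do
    string
    |> String.downcase()
    |> String.split()
    |> MapSet.new()
  end

  # Jaccard: |A ∩ B| / |A ∪ B|
  def jaccard(a, b) do
    union = MapSet.size(MapSet.union(a, b))

    if union == 0 do
      1.0
    else
      MapSet.size(MapSet.intersection(a, b)) / union
    end
  end

  # Dice: 2|A ∩ B| / (|A| + |B|)
  def dice(a, b) do
    total = MapSet.size(a) + MapSet.size(b)

    if total == 0 do
      1.0
    else
      2 * MapSet.size(MapSet.intersection(a, b)) / total
    end
  end
end

a = SimilaritySketch.tokens("the quick brown fox")
b = SimilaritySketch.tokens("the lazy brown dog")

# Two shared tokens out of six distinct ones: roughly 0.333.
IO.inspect(SimilaritySketch.jaccard(a, b))

# 2 * 2 / (4 + 4)
IO.inspect(SimilaritySketch.dice(a, b))
#=> 0.5
```

Both measures operate on sets, so repeated words in the input do not change the score.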
Statistics and search
- N-grams and word counts (Text.Ngram, Text.Word).
- TF-IDF and BM25 (Text.IR) — indexed corpus with scoring and top-K search.
- Collocation extraction (Text.Collocation) — bigrams ranked by frequency, PMI, or log-likelihood.
- Concordance (Text.KWIC) — keyword-in-context lookup.
- Word embeddings (Text.Embedding) — load fastText .vec files, then cosine similarity, nearest neighbours, and analogies.
- Word clouds (Text.WordCloud) — multilingual keyword extraction (six scoring backends) plus spiral layout and SVG rendering.
- Stopwords (Text.Stopwords) — bundled lists for ~60 languages from stopwords-iso.
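As a rough illustration of what n-gram extraction and word counting involve, the following sketch uses only the Elixir standard library. It is not the Text.Ngram or Text.Word API; the module NgramSketch is hypothetical, and tokens are split on whitespace rather than on UAX #29 boundaries.

```elixir
defmodule NgramSketch do
  # Hypothetical module for illustration; not the Text.Ngram or Text.Word API.

  # Slide a window of n tokens across the list, dropping the incomplete tail.
  def ngrams(tokens, n) when is_integer(n) and n > 0 do
    Enum.chunk_every(tokens, n, 1, :discard)
  end

  # Count how often each item (a word or an n-gram) occurs.
  def counts(items), do: Enum.frequencies(items)
end

tokens = String.split("to be or not to be")

word_counts = NgramSketch.counts(tokens)
Map.get(word_counts, "to")
#=> 2

bigram_counts = tokens |> NgramSketch.ngrams(2) |> NgramSketch.counts()
Map.get(bigram_counts, ["to", "be"])
#=> 2
```

Counts like these are the raw input for frequency-based collocation ranking; PMI and log-likelihood additionally need the individual word counts.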
Inflection
- English pluralization (Text.Inflect.En) — modern and classical modes.
Installation
def deps do
  [
    {:text, "~> 0.3.0"}
  ]
end
For the language identifier, fetch the lid.176.bin model once after install:
mix text.download_lid176
For production environments using the optional Bumblebee-backed modules, mix text.download_models (plural) pre-fetches every external artefact — lid.176.bin plus the default Hugging Face checkpoints — so the first call to each module never hits the network.
A taste
# Sentiment — multilingual, no model download by default.
Text.Sentiment.analyze("J'adore ce livre.", language: :fr).label
#=> :positive
# Language identification — load the fastText model once.
{:ok, model} = Text.Language.Classifier.Fasttext.ModelLoader.load(
  Path.join(:code.priv_dir(:text), "lid_176/lid.176.bin")
)
{:ok, "es"} = Text.Language.Classifier.Fasttext.classify("Hola, ¿cómo estás?", model)
# Word cloud → SVG file in four piped steps.
text
|> Text.WordCloud.terms(language: :en)
|> Text.WordCloud.Layout.layout(width: 800, height: 600, rotations: :radial)
|> Text.WordCloud.SVG.render(palette: Color.Palette.tonal("#3b82f6"))
|> then(&File.write!("cloud.svg", &1))
Guides
In-depth walkthroughs with worked examples:
- Text classification (language identification) — setup, detect/3, CLDR locale resolution, Hans/Hant disambiguation, performance tuning.
- Sentiment analysis — lexicon vs Bumblebee backends, multilingual lexicons, custom lexicons, threshold tuning, production wiring.
- Part-of-speech tagging and NER — Bumblebee setup, tag sets, model pre-download, named Nx.Servings for high-QPS workloads.
- Keyword-in-context concordance — Text.KWIC.concordance/3, formatting, collocate scans, sense disambiguation patterns.
- Word clouds — six scoring backends (YAKE!, frequency, RAKE, TextRank, TF-IDF, KeyBERT), Wordle-style layouts (:radial/:spiral), SVG rendering with Color.Palette integration.
Optional dependencies
The package works without any optional deps. Adding them enables progressively heavier capabilities:
| Dep | Enables |
|---|---|
| :exla | Order-of-magnitude faster inference for Fasttext and the Bumblebee-backed modules. Strongly recommended in production. |
| :bumblebee | Neural sentiment, POS, NER, and the KeyBERT word-cloud backend. |
| :localize | CLDR-canonical locale resolution (fr-Latn-CA, zh-Hans-CN) and Localize.LanguageTag input shapes. |
| :color | Color.Palette.Tonal and Theme palettes for SVG word-cloud rendering. |
| :text_stemmer | Snowball stemming (~30 languages) for word-cloud morphological-variant consolidation. |
Calls that need a missing optional dep raise with installation instructions; the rest of the package keeps working.
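The general pattern behind this kind of guard can be sketched with Code.ensure_loaded?/1 from the standard library. This is an assumption about how such a check might look, not the package's actual code; the module OptionalDepSketch, the function ensure_available!/2, and the message wording are all hypothetical.

```elixir
defmodule OptionalDepSketch do
  # Hypothetical guard; the package's real check and error message may differ.
  def ensure_available!(module, dep) do
    if Code.ensure_loaded?(module) do
      :ok
    else
      raise RuntimeError,
            "#{inspect(module)} is not available. " <>
              "Add #{inspect(dep)} to your deps in mix.exs and run mix deps.get."
    end
  end
end

# Enum ships with Elixir, so this check passes.
OptionalDepSketch.ensure_available!(Enum, :elixir)
#=> :ok
```

Because the check runs at call time, modules that do not depend on the missing library are unaffected.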
Every public function that takes a :language (or :locale) accepts an atom (:fr), a string ("fr", "fr-CA", "zh-Hans-CN"), or a Localize.LanguageTag struct (when :localize is loaded). See Text.Language for the normalisation helpers.
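For intuition only, here is a minimal sketch of reducing the atom and string input shapes to a base language code. It is not the Text.Language implementation: the module LanguageSketch is hypothetical, it ignores Localize.LanguageTag structs, and collapsing "fr-CA" to "fr" is an assumption made for the example rather than documented behaviour.

```elixir
defmodule LanguageSketch do
  # Hypothetical normaliser; Text.Language's real helpers may behave differently.

  # Atoms such as :fr are converted to strings and handled by the clause below.
  def base_language(language) when is_atom(language) and not is_nil(language) do
    language |> Atom.to_string() |> base_language()
  end

  # Strings such as "fr-CA" or "zh-Hans-CN" keep only the leading language subtag.
  def base_language(language) when is_binary(language) do
    language
    |> String.split(["-", "_"], parts: 2)
    |> hd()
    |> String.downcase()
  end
end

LanguageSketch.base_language(:fr)
#=> "fr"
LanguageSketch.base_language("zh-Hans-CN")
#=> "zh"
```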
Roadmap
- Quantized model support for lid.176.ftz (917 KB variant).
- Nx.Serving for batched inference in throughput-bound workloads.
- CLDR-tailored segmentation once the unicode_set regex engine matures.
License
Apache 2.0 — see LICENSE.md.