ExNLP

A comprehensive Natural Language Processing library for Elixir, providing tokenization, stemming, ranking algorithms, similarity metrics, and text analysis tools. Inspired by Python's NLTK and designed with idiomatic Elixir patterns.

Features

- Tokenization: word, regex, and offset-aware tokenizers
- Stemming: Snowball stemmers for seven languages
- Ranking: TF-IDF and BM25 scoring
- Similarity metrics: Levenshtein, Jaro-Winkler, Jaccard, and Dice
- Stopwords: detection and removal for 30+ languages
- Filtering: composable token-filter pipelines
- N-grams: character and word n-grams
- Statistics: term and document frequencies
- Co-occurrence analysis

Installation

Add ex_nlp to your list of dependencies in mix.exs:

def deps do
  [
    {:ex_nlp, "~> 0.1.0"}
  ]
end

Then run mix deps.get.

Quick Start

Tokenization

# Simple word tokenization
iex> ExNlp.Tokenizer.word_tokenize("Hello, world!")
["Hello", "world"]

# Get tokens with position and offset information
iex> ExNlp.Tokenizer.tokenize("Hello, world!")
[
  %ExNlp.Token{text: "Hello", position: 0, start_offset: 0, end_offset: 5},
  %ExNlp.Token{text: "world", position: 1, start_offset: 7, end_offset: 12}
]

# Custom regex tokenizer
iex> ExNlp.Tokenizer.regexp_tokenize("abc123 def456", "\\d+")
["123", "456"]

Stemming

# Stem words in multiple languages
iex> ExNlp.Snowball.stem("running", :english)
"run"

iex> ExNlp.Snowball.stem("caminando", :spanish)
"camin"

iex> ExNlp.Snowball.stem_words(["running", "jumping", "beautiful"], :english)
["run", "jump", "beauti"]

# Check supported languages
iex> ExNlp.Snowball.supported_languages()
[:english, :spanish, :portuguese, :french, :german, :italian, :polish]

Ranking Algorithms

TF-IDF

# Calculate TF-IDF score for a term in a document
iex> documents = ["The quick brown fox", "A brown dog"]
iex> ExNlp.Ranking.TfIdf.calculate("fox", "The quick brown fox", documents)
0.5108256237659907

# With preprocessing options
iex> ExNlp.Ranking.TfIdf.calculate("running", "The runner is running fast", documents,
...>   stem: true, language: :english, remove_stopwords: true)
0.6931471805599453

BM25

# Score documents against a query
iex> documents = ["BM25 is a ranking function", "used by search engines"]
iex> ExNlp.Ranking.Bm25.score(documents, ["ranking", "search"])
[1.8455076734299591, 1.0126973514850315]

# Rank documents with custom parameters
iex> ExNlp.Ranking.Bm25.score(documents, ["ranking", "search"], 
...>   k1: 1.5, b: 0.75, stem: true, language: :english)
[1.923456, 1.123456]
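For reference, the k1 and b options above are the two tuning knobs of the standard Okapi BM25 scoring function; a sketch of that formula follows (whether ExNlp implements exactly this variant, including its IDF smoothing, is an assumption):

```latex
\mathrm{score}(D, Q) \;=\; \sum_{q \in Q} \mathrm{IDF}(q)\,
  \frac{f(q, D)\,(k_1 + 1)}{f(q, D) + k_1\!\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}
```

Here f(q, D) is the frequency of term q in document D, |D| is the document length in tokens, and avgdl is the average document length in the corpus; k1 controls term-frequency saturation and b the strength of length normalization.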

Similarity Metrics

# Levenshtein distance
iex> ExNlp.Similarity.levenshtein("kitten", "sitting")
3

# Levenshtein similarity (normalized)
iex> ExNlp.Similarity.levenshtein_similarity("kitten", "sitting")
0.5714285714285714

# Jaccard similarity
iex> ExNlp.Similarity.jaccard(["cat", "dog"], ["cat", "bird"])
0.3333333333333333

# Jaro-Winkler similarity
iex> ExNlp.Similarity.jaro_winkler_similarity("martha", "marhta")
0.9611111111111111

# Dice coefficient
iex> ExNlp.Similarity.dice_coefficient(["cat", "dog"], ["cat", "bird"])
0.5
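The outputs above match the textbook definitions of these metrics; as a sketch (not taken from the library's docs), for strings a, b and token sets A, B:

```latex
\mathrm{lev\_sim}(a, b) = 1 - \frac{\mathrm{lev}(a, b)}{\max(|a|, |b|)} \qquad
J(A, B) = \frac{|A \cap B|}{|A \cup B|} \qquad
\mathrm{dice}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}
```

For example, lev("kitten", "sitting") = 3 with max length 7 gives 1 - 3/7 ≈ 0.5714, matching the output above.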

Stopwords

# Check if a word is a stopword
iex> ExNlp.Stopwords.is_stopword?("the", :english)
true

# Remove stopwords from a list
iex> words = ["the", "quick", "brown", "fox"]
iex> ExNlp.Stopwords.remove(words, :english)
["quick", "brown", "fox"]

# Get list of stopwords
iex> ExNlp.Stopwords.list(:english)
["a", "all", "and", "as", "at", ...]

Text Filtering

# Build a filtering pipeline
iex> tokens = ExNlp.Tokenizer.tokenize("The Quick Brown Fox")
iex> tokens
...> |> ExNlp.Filter.lowercase()
...> |> ExNlp.Filter.stop_words(:english)
...> |> ExNlp.Filter.min_length(3)
[
  %ExNlp.Token{text: "quick", ...},
  %ExNlp.Token{text: "brown", ...},
  %ExNlp.Token{text: "fox", ...}
]
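The tokenizer, filters, and stemmer compose into a full normalization pipeline. A minimal sketch, assuming the Token struct and function signatures shown above (the module name and the final stemming step are our own illustration, not part of the library):

```elixir
defmodule MyApp.TextPipeline do
  @moduledoc "Tokenize, filter, and stem a document into normalized terms."

  # Returns a list of stemmed term strings for downstream statistics/ranking.
  def terms(text, language \\ :english) do
    text
    |> ExNlp.Tokenizer.tokenize()
    |> ExNlp.Filter.lowercase()
    |> ExNlp.Filter.stop_words(language)
    |> ExNlp.Filter.min_length(3)
    |> Enum.map(fn %ExNlp.Token{text: t} -> ExNlp.Snowball.stem(t, language) end)
  end
end
```

The resulting term list can be fed directly into functions such as ExNlp.Statistics.term_frequency/2, which operate on plain string tokens.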

N-grams

# Character n-grams
iex> ExNlp.Ngram.char_ngrams("hello", 2)
["he", "el", "ll", "lo"]

# Word n-grams
iex> ExNlp.Ngram.word_ngrams(["the", "quick", "brown", "fox"], 2)
[["the", "quick"], ["quick", "brown"], ["brown", "fox"]]

Statistics

# Term frequency in a document
iex> ExNlp.Statistics.term_frequency("cat", ["the", "cat", "sat", "on", "the", "mat"])
1

# Document frequency in a corpus
iex> corpus = [["cat", "dog"], ["cat", "bird"], ["dog", "fish"]]
iex> ExNlp.Statistics.document_frequency("cat", corpus)
2

# Most frequent terms
iex> corpus = [["cat", "dog"], ["cat", "cat"], ["dog"]]
iex> ExNlp.Statistics.most_frequent(corpus, 2)
[{"cat", 3}, {"dog", 2}]

Co-occurrence Analysis

# Build co-occurrence matrix
iex> corpus = [["cat", "dog"], ["cat", "bird", "dog"], ["bird"]]
iex> matrix = ExNlp.Cooccurrence.cooccurrence_matrix(corpus)
iex> matrix["cat"]["dog"]
2

# Find co-occurring terms
iex> ExNlp.Cooccurrence.cooccurring_terms("cat", corpus, 2)
[{"dog", 2}, {"bird", 1}]
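Consistent with the matrix above (cat and dog appear together in the first two documents, giving a count of 2), co-occurrence appears to be counted at the document level; as a sketch, for terms w_i and w_j:

```latex
C(w_i, w_j) \;=\; \bigl|\{\, d \in \mathrm{corpus} : w_i \in d \ \wedge\ w_j \in d \,\}\bigr|
```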

Supported Languages

Stemming

Snowball stemming is available for English, Spanish, Portuguese, French, German, Italian, and Polish (see ExNlp.Snowball.supported_languages/0).

Stopwords

Stopword lists are available for more than 30 languages, including English, Spanish, Portuguese, French, German, Italian, Polish, Russian, Dutch, Swedish, Norwegian, Danish, Finnish, Turkish, Arabic, and Chinese. See priv/stopwords/ for the complete list.

Architecture

The library is organized into focused modules:

- ExNlp.Tokenizer - word, regex, and offset-aware tokenization (producing ExNlp.Token structs)
- ExNlp.Snowball - Snowball stemmers
- ExNlp.Ranking.TfIdf / ExNlp.Ranking.Bm25 - ranking algorithms
- ExNlp.Similarity - string and set similarity metrics
- ExNlp.Stopwords - stopword lists, checks, and removal
- ExNlp.Filter - composable token filters
- ExNlp.Ngram - character and word n-grams
- ExNlp.Statistics - term and document frequency statistics
- ExNlp.Cooccurrence - co-occurrence matrices and term analysis

Performance

The library includes benchmark suites for critical operations. Run benchmarks with:

mix run benchmarks/tokenizer_bench.exs
mix run benchmarks/similarity_bench.exs
mix run benchmarks/ranking_bench.exs

Testing

Run the test suite with:

mix test

Documentation

Generate documentation with:

mix docs

Contributing

Contributions are welcome! This library aims to be a comprehensive NLP toolkit for Elixir, so additional language support, new algorithms, and performance improvements are all good places to start.

Credits

Inspired by Python's NLTK and the Snowball stemming algorithms.

License

MIT License - see LICENSE file for details.