Rake

RAKE (Rapid Automatic Keyword Extraction) for Elixir.

Extract keywords from documents using word co-occurrence patterns. RAKE is an unsupervised, domain-independent, and language-independent algorithm that requires no training data.

Based on the paper: Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic Keyword Extraction from Individual Documents. In Text Mining: Applications and Theory. John Wiley & Sons.

Features

Zero dependencies - Pure Elixir implementation
No training required - Works on individual documents without a corpus
Language independent - Configure stop words for any language
Multiple scoring metrics - Choose between frequency, degree, or degree/frequency ratio
Configurable - Adjust minimum word length, maximum phrase length, and more
Graph inspection - Access the word co-occurrence graph for analysis

Installation

Add rake to your list of dependencies in mix.exs:

def deps do
  [
    {:rake, "~> 0.1.0"}
  ]
end

Quick Start

text = """
Compatibility of systems of linear constraints over the set of natural numbers.
Criteria of compatibility of a system of linear Diophantine equations, strict
inequations, and nonstrict inequations are considered.
"""

Rake.extract(text)
# => [
#   %{keyword: "linear Diophantine equations", score: 8.5, words: ["linear", "diophantine", "equations"]},
#   %{keyword: "linear constraints", score: 4.5, words: ["linear", "constraints"]},
#   %{keyword: "nonstrict inequations", score: 4.0, words: ["nonstrict", "inequations"]},
#   %{keyword: "strict inequations", score: 4.0, words: ["strict", "inequations"]},
#   ...
# ]

How RAKE Works

RAKE identifies keywords through a simple but effective process:

Split text into candidate keywords - The text is divided at stop words (like “the”, “and”, “of”) and phrase delimiters (punctuation). The sequences of words between these boundaries become candidate keywords.
Build a word co-occurrence graph - For each candidate keyword, RAKE tracks which words appear together. Words that frequently co-occur in candidates are likely part of important phrases.
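Score words and candidates - Each word gets a score based on its frequency (how often it occurs in candidates) and its degree (how many words it co-occurs with across those candidates, counting itself). A candidate keyword's score is the sum of its word scores.
Return top keywords - By default, RAKE returns the T highest-scoring candidates, where T is one third of the number of unique words in the co-occurrence graph.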
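
The four steps above can be sketched in a few lines of plain Elixir. This is a minimal illustration, not the library's implementation: the module name RakeSketch, its function names, and the seven-word stop list are assumptions made for the example, and it always uses the degree/frequency metric.

```elixir
defmodule RakeSketch do
  @moduledoc "Illustrative RAKE pipeline; not the library's actual implementation."

  # Tiny stop list for the demo only (assumption; a real list is much larger).
  @stop_words MapSet.new(~w(a and are is of over the))

  # Step 1: split into phrases at punctuation, then split each phrase at stop words.
  def candidates(text, stop_words \\ @stop_words) do
    text
    |> String.downcase()
    |> String.split(~r/[.,;:!?()]+/, trim: true)
    |> Enum.flat_map(fn phrase ->
      phrase
      |> String.split()
      |> Enum.chunk_by(&MapSet.member?(stop_words, &1))
      |> Enum.reject(fn [word | _] -> MapSet.member?(stop_words, word) end)
    end)
  end

  # Step 2: per word, count occurrences (freq) and co-occurrences including itself (deg).
  def word_stats(candidates) do
    Enum.reduce(candidates, %{}, fn words, acc ->
      len = length(words)

      Enum.reduce(words, acc, fn word, inner ->
        Map.update(inner, word, %{freq: 1, deg: len}, fn s ->
          %{freq: s.freq + 1, deg: s.deg + len}
        end)
      end)
    end)
  end

  # Steps 3 and 4: score each candidate as the sum of deg/freq over its words,
  # then keep the top T, where T is one third of the unique word count.
  def extract(text, stop_words \\ @stop_words) do
    cands = candidates(text, stop_words)
    stats = word_stats(cands)

    cands
    |> Enum.uniq()
    |> Enum.map(fn words ->
      score = words |> Enum.map(&(stats[&1].deg / stats[&1].freq)) |> Enum.sum()
      %{keyword: Enum.join(words, " "), score: score}
    end)
    |> Enum.sort_by(& &1.score, :desc)
    |> Enum.take(max(div(map_size(stats), 3), 1))
  end
end

text = """
Compatibility of systems of linear constraints over the set of natural numbers.
Criteria of compatibility of a system of linear Diophantine equations, strict
inequations, and nonstrict inequations are considered.
"""

stats = text |> RakeSketch.candidates() |> RakeSketch.word_stats()
stats["linear"]
# => %{deg: 5, freq: 2}

RakeSketch.extract(text) |> Enum.find(&(&1.keyword == "linear constraints"))
# => %{keyword: "linear constraints", score: 4.5}
```

Here "linear" occurs in a two-word and a three-word candidate, so its degree is 2 + 3 = 5 and its frequency is 2. Unlike the library output shown in Quick Start, this sketch lowercases keywords rather than preserving the original casing.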

Usage

Basic Extraction

# Uses default English stop words
keywords = Rake.extract("Your document text here...")

Custom Stop Words

# For domain-specific text or other languages
stop_words = ~w(el la los las un una de en y que)
keywords = Rake.extract(spanish_text, stop_words: stop_words)

Scoring Metrics

RAKE supports three scoring metrics:

Metric	Description	Best For
`:deg_freq`	degree(word) / frequency(word)	Default. Favors words that appear predominantly in longer phrases
`:deg`	degree(word)	Favors words that appear in many/long phrases
`:freq`	frequency(word)	Favors frequently occurring words

# Use degree scoring (favors words in longer phrases)
Rake.extract(text, score_metric: :deg)

# Use frequency scoring
Rake.extract(text, score_metric: :freq)
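
To see how the three metrics differ, the sketch below scores one phrase under each of them. The module MetricSketch and its functions are hypothetical helpers written for this example, not part of the library. The statistics are worked out by hand from the Quick Start text: "linear" occurs in two candidates with two and three words, and "constraints" occurs in one two-word candidate.

```elixir
defmodule MetricSketch do
  # Score one word from its graph statistics under the chosen metric.
  def word_score(%{freq: f}, :freq), do: f
  def word_score(%{deg: d}, :deg), do: d
  def word_score(%{freq: f, deg: d}, :deg_freq), do: d / f

  # A phrase scores the sum of its word scores.
  def phrase_score(words, stats, metric) do
    words
    |> Enum.map(&word_score(Map.fetch!(stats, &1), metric))
    |> Enum.sum()
  end
end

stats = %{
  "linear" => %{freq: 2, deg: 5},
  "constraints" => %{freq: 1, deg: 2}
}

for metric <- [:freq, :deg, :deg_freq] do
  {metric, MetricSketch.phrase_score(["linear", "constraints"], stats, metric)}
end
# => [freq: 3, deg: 7, deg_freq: 4.5]
```

The `:deg_freq` result, 5/2 + 2/1 = 4.5, matches the score of "linear constraints" in the Quick Start output.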

Controlling Output

# Return the top 10 keywords
Rake.extract(text, top: 10)

# Return ALL scored candidates (no limit)
Rake.extract(text, top: :all)

Filtering

# Only include words with 3+ characters
Rake.extract(text, min_word_length: 3)

# Limit phrases to 4 words maximum
Rake.extract(text, max_words: 4)

Adjoining Keywords

RAKE can detect pairs of keywords that repeatedly appear next to each other, separated only by stop words, and combine them into a single keyword:

# "axis" + "of" + "evil" -> "axis of evil"
Rake.extract(text, adjoining: true)

A combined keyword is created only if the pair appears adjacent at least twice in the document.
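
The idea can be sketched as a count of "word, stop word, word" sequences. This is a simplified illustration under stated assumptions: the module AdjoiningSketch is hypothetical, it only joins single words across a single stop word, and the library's own detection may differ.

```elixir
defmodule AdjoiningSketch do
  # Return "word stop-word word" sequences seen at least `min_count` times.
  def adjoined(text, stop_words, min_count \\ 2) do
    stop = MapSet.new(stop_words)

    text
    |> String.downcase()
    # Punctuation ends a phrase, so sequences never span a delimiter.
    |> String.split(~r/[.,;:!?]+/, trim: true)
    |> Enum.flat_map(fn phrase ->
      phrase |> String.split() |> Enum.chunk_every(3, 1, :discard)
    end)
    # Keep triples whose middle word is a stop word and whose ends are not.
    |> Enum.filter(fn [a, b, c] ->
      not MapSet.member?(stop, a) and MapSet.member?(stop, b) and
        not MapSet.member?(stop, c)
    end)
    |> Enum.frequencies()
    |> Enum.filter(fn {_triple, count} -> count >= min_count end)
    |> Enum.map(fn {triple, _count} -> Enum.join(triple, " ") end)
  end
end

text = "The axis of evil speech named an axis of evil. Critics of the speech disagreed."

AdjoiningSketch.adjoined(text, ~w(the of an))
# => ["axis of evil"]
```

"named an axis" also fits the pattern but occurs only once, so it is not kept.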

Inspecting the Word Graph

For debugging or analysis, you can access the word co-occurrence graph:

{keywords, graph} = Rake.extract_with_graph(text)

# Check statistics for a specific word
graph.words["algorithm"]
# => %{freq: 3, deg: 7}

# freq = how many times the word appears in candidates
# deg = sum of co-occurrences (how connected the word is)
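
Because the word statistics are a plain map, the usual Enum functions work for analysis. The sketch below runs on hand-written sample data shaped like `graph.words`; the values other than "algorithm" are made up for illustration.

```elixir
# Hypothetical data shaped like `graph.words` (word => %{freq: ..., deg: ...}).
words = %{
  "algorithm" => %{freq: 3, deg: 7},
  "keyword" => %{freq: 2, deg: 6},
  "graph" => %{freq: 1, deg: 1}
}

# Words ordered by degree, most connected first.
by_degree =
  words
  |> Enum.sort_by(fn {_word, stats} -> stats.deg end, :desc)
  |> Enum.map(fn {word, _stats} -> word end)

# => ["algorithm", "keyword", "graph"]

# Words whose degree is at least twice their frequency,
# meaning they mostly occur inside multi-word candidates.
multiword =
  for({word, %{freq: f, deg: d}} <- words, d / f >= 2, do: word)
  |> Enum.sort()

# => ["algorithm", "keyword"]
```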

Options Reference

Option	Type	Default	Description
`:stop_words`	list	`Rake.StopWords.english()`	Words that split candidate keywords
`:score_metric`	atom	`:deg_freq`	One of `:freq`, `:deg`, or `:deg_freq`
`:top`	integer or `:all`	T = unique words / 3	Number of keywords to return
`:min_word_length`	integer	1	Minimum characters per word
`:max_words`	integer	unlimited	Maximum words per keyword phrase
`:adjoining`	boolean	false	Detect adjoining keyword pairs

Default Stop Words

Rake.StopWords.english() provides ~100 common English stop words derived from the original RAKE paper’s keyword adjacency stoplist:

Rake.StopWords.english()
# => ["the", "and", "of", "a", "in", "is", "for", "to", ...]

For other languages or domains, provide your own stop word list.
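
One way to manage a custom list is to keep it in a text file. The helper below is a hypothetical sketch, not part of the library: it assumes one word per line, treats lines starting with "#" as comments, and lowercases entries on the assumption that matching is done on lowercased words.

```elixir
defmodule StopWordFile do
  # Parse a stop word list: one word per line, "#" starts a comment line.
  def parse(contents) do
    contents
    |> String.split(["\r\n", "\n"], trim: true)
    |> Enum.map(&String.trim/1)
    |> Enum.reject(&(&1 == "" or String.starts_with?(&1, "#")))
    |> Enum.map(&String.downcase/1)
    |> Enum.uniq()
  end

  # Read and parse a stop word file from disk.
  def load!(path), do: path |> File.read!() |> parse()
end

StopWordFile.parse("# Spanish\nel\nLa\n\nde\nel\n")
# => ["el", "la", "de"]
```

The resulting list can be passed as the `:stop_words` option, for example with `StopWordFile.load!/1` pointed at a file of your choosing.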

Performance

RAKE is designed for efficiency:

Single pass scoring - Unlike iterative algorithms (e.g., TextRank), RAKE scores keywords in one pass
No external dependencies - Pure Elixir, no NIFs or external services
Scales linearly - Processing time grows linearly with document size

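The original paper reports its RAKE implementation extracting keywords from 500 abstracts in 160 ms, compared with 1002 ms for TextRank. These figures describe the paper's implementation, not this library.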
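
To measure extraction time on your own documents, a small timing helper built on Erlang's `:timer.tc/1` is enough. The module TimingSketch is a hypothetical example, and the workload below is a stand-in so that the snippet runs without the library installed.

```elixir
defmodule TimingSketch do
  # Run `fun` `runs` times and return the average wall-clock time in milliseconds.
  def average_ms(fun, runs \\ 10) when is_function(fun, 0) and runs > 0 do
    total_us =
      Enum.reduce(1..runs, 0, fn _run, acc ->
        # :timer.tc/1 returns {elapsed_microseconds, result}.
        {elapsed_us, _result} = :timer.tc(fun)
        acc + elapsed_us
      end)

    total_us / runs / 1000
  end
end

# Stand-in workload; replace the function body with a call to Rake.extract/1.
TimingSketch.average_ms(fn -> Enum.sum(1..100_000) end)
```

Averaging over several runs smooths out scheduler and garbage collection noise, though a dedicated benchmarking tool will give more reliable numbers.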

Examples

Scientific Abstract

abstract = """
Machine learning models require large amounts of labeled training data.
Active learning reduces labeling costs by selecting the most informative
samples for annotation. This paper presents a novel active learning
strategy for deep neural networks.
"""

Rake.extract(abstract, top: 5)
# => [
#   %{keyword: "active learning strategy", score: 9.0, ...},
#   %{keyword: "deep neural networks", score: 9.0, ...},
#   %{keyword: "labeled training data", score: 9.0, ...},
#   %{keyword: "Machine learning models", score: 9.0, ...},
#   %{keyword: "labeling costs", score: 4.0, ...}
# ]

News Article

article = """
The Federal Reserve announced today that interest rates will remain
unchanged. Fed Chair Jerome Powell cited ongoing inflation concerns
and labor market strength as key factors in the decision.
"""

Rake.extract(article, top: 5)
# => [
#   %{keyword: "Fed Chair Jerome Powell", score: 16.0, ...},
#   %{keyword: "ongoing inflation concerns", score: 9.0, ...},
#   %{keyword: "labor market strength", score: 9.0, ...},
#   %{keyword: "Federal Reserve", score: 4.0, ...},
#   %{keyword: "interest rates", score: 4.0, ...}
# ]

With Custom Domain Stop Words

# For legal documents, add domain-specific stop words
legal_stop_words = Rake.StopWords.english() ++ ~w(
  hereby whereas therefore pursuant herein thereof
  plaintiff defendant court jurisdiction
)

Rake.extract(legal_document, stop_words: legal_stop_words)

Comparison with Other Approaches

Method	Training Required	Corpus Required	Speed
RAKE	No	No	Fast (single pass)
TF-IDF	No	Yes	Fast
TextRank	No	No	Slower (iterative)
BERT KeyPhrase	Yes	Yes	Slow (neural)

RAKE is ideal when you need fast, unsupervised keyword extraction from individual documents without maintaining a corpus.

License

MIT License. See LICENSE for details.

References

Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic Keyword Extraction from Individual Documents. In M. W. Berry & J. Kogan (Eds.), Text Mining: Applications and Theory. John Wiley & Sons.