Rake

RAKE (Rapid Automatic Keyword Extraction) for Elixir.

Extract keywords from documents using word co-occurrence patterns. RAKE is an unsupervised, domain-independent, and language-independent algorithm that requires no training data.

Based on the paper: Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic Keyword Extraction from Individual Documents. In Text Mining: Applications and Theory. John Wiley & Sons.

Features

Installation

Add rake to your list of dependencies in mix.exs:

def deps do
  [
    {:rake, "~> 0.1.0"}
  ]
end

Quick Start

text = """
Compatibility of systems of linear constraints over the set of natural numbers.
Criteria of compatibility of a system of linear Diophantine equations, strict
inequations, and nonstrict inequations are considered.
"""

Rake.extract(text)
# => [
#   %{keyword: "linear Diophantine equations", score: 8.5, words: ["linear", "diophantine", "equations"]},
#   %{keyword: "linear constraints", score: 4.5, words: ["linear", "constraints"]},
#   %{keyword: "nonstrict inequations", score: 4.0, words: ["nonstrict", "inequations"]},
#   %{keyword: "strict inequations", score: 4.0, words: ["strict", "inequations"]},
#   ...
# ]

How RAKE Works

RAKE identifies keywords through a simple but effective process:

  1. Split text into candidate keywords - The text is divided at stop words (like “the”, “and”, “of”) and phrase delimiters (punctuation). The sequences of words between these boundaries become candidate keywords.

  2. Build a word co-occurrence graph - For each candidate keyword, RAKE tracks which words appear together. Words that frequently co-occur in candidates are likely part of important phrases.

  3. Score words and candidates - Each word gets a score based on its frequency and degree (number of co-occurrences). Candidate keywords are scored as the sum of their word scores.

  4. Return top keywords - By default, RAKE returns the top T keywords, where T is one third of the number of unique words in the co-occurrence graph.
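
The four steps above can be sketched in plain Elixir. This is a minimal illustration, not the library's implementation: the module name RakeSketch, its function names, and the ten-word stop list are assumptions made for the example, and keywords are returned lowercased for simplicity.

```elixir
defmodule RakeSketch do
  # Hypothetical, deliberately tiny stop list; a real one is much longer.
  @stop_words ~w(the and of a in is are for to over)

  # Step 1: split at phrase delimiters, then at stop words.
  def candidates(text) do
    text
    |> String.downcase()
    |> String.split(~r/[.,;:!?()]+/, trim: true)
    |> Enum.flat_map(&split_at_stop_words/1)
  end

  defp split_at_stop_words(fragment) do
    fragment
    |> String.split(~r/\s+/, trim: true)
    |> Enum.chunk_by(&stop_word?/1)
    |> Enum.reject(fn [word | _] -> stop_word?(word) end)
  end

  defp stop_word?(word), do: word in @stop_words

  # Step 2: every occurrence of a word in a candidate of length n
  # adds 1 to its frequency and n to its degree.
  def word_stats(candidates) do
    Enum.reduce(candidates, %{}, fn words, acc ->
      len = length(words)

      Enum.reduce(words, acc, fn word, acc ->
        Map.update(acc, word, %{freq: 1, deg: len}, fn s ->
          %{freq: s.freq + 1, deg: s.deg + len}
        end)
      end)
    end)
  end

  # Step 3: a candidate's score is the sum of deg/freq over its words.
  def score(candidates, stats) do
    candidates
    |> Enum.uniq()
    |> Enum.map(fn words ->
      total = words |> Enum.map(&(stats[&1].deg / stats[&1].freq)) |> Enum.sum()
      %{keyword: Enum.join(words, " "), score: total}
    end)
    |> Enum.sort_by(& &1.score, :desc)
  end

  # Step 4: keep the top T candidates, T = unique words / 3 (at least 1).
  def extract(text) do
    cands = candidates(text)
    stats = word_stats(cands)
    t = max(div(map_size(stats), 3), 1)
    cands |> score(stats) |> Enum.take(t)
  end
end

text = """
Compatibility of systems of linear constraints over the set of natural numbers.
Criteria of compatibility of a system of linear Diophantine equations, strict
inequations, and nonstrict inequations are considered.
"""

# With the tiny stop list above, the first two results are
# "linear diophantine equations" (8.5) and "linear constraints" (4.5).
RakeSketch.extract(text) |> Enum.each(&IO.inspect/1)
```

The word "linear" occurs in two candidates of lengths 2 and 3, so its degree is 5 and its frequency is 2, which gives it a word score of 2.5 in both phrases.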

Usage

Basic Extraction

# Uses default English stop words
keywords = Rake.extract("Your document text here...")

Custom Stop Words

# For domain-specific text or other languages
stop_words = ~w(el la los las un una de en y que)
keywords = Rake.extract(spanish_text, stop_words: stop_words)

Scoring Metrics

RAKE supports three scoring metrics:

| Metric | Description | Best For |
|---|---|---|
| :deg_freq | degree(word) / frequency(word) | Default. Favors words that appear predominantly in longer phrases |
| :deg | degree(word) | Favors words that appear in many/long phrases |
| :freq | frequency(word) | Favors frequently occurring words |

# Use degree scoring (favors words in longer phrases)
Rake.extract(text, score_metric: :deg)

# Use frequency scoring
Rake.extract(text, score_metric: :freq)
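
To make the difference between the three metrics concrete, the sketch below scores two phrases by hand. The MetricSketch module and the stats map are hypothetical and written only for this example; the freq and deg values are the ones that follow from the Quick Start text when each phrase is a candidate.

```elixir
defmodule MetricSketch do
  # Word score under each of the three metrics.
  def word_score(%{deg: deg, freq: freq}, :deg_freq), do: deg / freq
  def word_score(%{deg: deg}, :deg), do: deg
  def word_score(%{freq: freq}, :freq), do: freq

  # A phrase's score is the sum of its word scores.
  def phrase_score(words, stats, metric) do
    words
    |> Enum.map(&word_score(Map.fetch!(stats, &1), metric))
    |> Enum.sum()
  end
end

# Hand-computed word statistics (assumed for illustration).
stats = %{
  "linear" => %{freq: 2, deg: 5},
  "diophantine" => %{freq: 1, deg: 3},
  "equations" => %{freq: 1, deg: 3},
  "strict" => %{freq: 1, deg: 2},
  "inequations" => %{freq: 2, deg: 4}
}

for metric <- [:deg_freq, :deg, :freq] do
  long = MetricSketch.phrase_score(~w(linear diophantine equations), stats, metric)
  short = MetricSketch.phrase_score(~w(strict inequations), stats, metric)
  IO.puts("#{metric}: #{long} vs #{short}")
end
# deg_freq: 8.5 vs 4.0
# deg: 11 vs 6
# freq: 4 vs 3
```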

Controlling Output

# Return exactly 10 keywords
Rake.extract(text, top: 10)

# Return ALL scored candidates (no limit)
Rake.extract(text, top: :all)

Filtering

# Only include words with 3+ characters
Rake.extract(text, min_word_length: 3)

# Limit phrases to 4 words maximum
Rake.extract(text, max_words: 4)
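
The exact filtering rule is not spelled out here, so the following is only one plausible reading, sketched on plain word lists: a candidate is dropped if any of its words is shorter than the minimum length, or if it has more words than the maximum. FilterSketch and its behavior are assumptions for illustration, not the library's code.

```elixir
defmodule FilterSketch do
  # Assumed semantics: drop a candidate when any word is too short
  # or when the candidate has too many words.
  def filter(candidates, opts \\ []) do
    min_len = Keyword.get(opts, :min_word_length, 1)
    max_words = Keyword.get(opts, :max_words, :infinity)

    Enum.filter(candidates, fn words ->
      Enum.all?(words, &(String.length(&1) >= min_len)) and
        (max_words == :infinity or length(words) <= max_words)
    end)
  end
end

candidates = [
  ~w(ai),
  ~w(deep neural networks),
  ~w(novel active learning strategy today)
]

FilterSketch.filter(candidates, min_word_length: 3)
# => [["deep", "neural", "networks"],
#     ["novel", "active", "learning", "strategy", "today"]]

FilterSketch.filter(candidates, max_words: 4)
# => [["ai"], ["deep", "neural", "networks"]]
```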

Adjoining Keywords

RAKE can detect keywords that frequently appear adjacent to each other (with stop words between) and combine them:

# "axis" + "of" + "evil" -> "axis of evil"
Rake.extract(text, adjoining: true)

For adjoining detection to create a combined keyword, the pair must appear adjacent at least twice in the document.
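
The idea can be sketched as follows: within each punctuation-delimited fragment, find every keyword, stop words, keyword sequence, count how often each sequence occurs, and keep those seen at least twice. AdjoiningSketch, its stop list, and the min_count parameter are hypothetical names for this illustration; the library may detect pairs differently.

```elixir
defmodule AdjoiningSketch do
  # Hypothetical, deliberately tiny stop list.
  @stop_words ~w(the and of a in is are for to was)

  def adjoining(text, min_count \\ 2) do
    text
    |> String.downcase()
    |> String.split(~r/[.,;:!?()]+/, trim: true)
    |> Enum.flat_map(&triples/1)
    |> Enum.frequencies()
    |> Enum.filter(fn {_phrase, count} -> count >= min_count end)
    |> Enum.map(fn {phrase, _count} -> phrase end)
    |> Enum.sort()
  end

  # chunk_by alternates keyword chunks and stop-word chunks, so any
  # window of three chunks that starts with a keyword chunk has the
  # shape keyword, stop words, keyword.
  defp triples(fragment) do
    fragment
    |> String.split(~r/\s+/, trim: true)
    |> Enum.chunk_by(&(&1 in @stop_words))
    |> Enum.chunk_every(3, 1, :discard)
    |> Enum.filter(fn [[first | _], _, _] -> first not in @stop_words end)
    |> Enum.map(fn chunks -> chunks |> List.flatten() |> Enum.join(" ") end)
  end
end

text = "The axis of evil was named. Critics of the axis of evil, however, disagreed."

AdjoiningSketch.adjoining(text)
# => ["axis of evil"]
```

Sequences such as "evil was named" and "critics of the axis" also match the pattern but occur only once, so they are discarded.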

Inspecting the Word Graph

For debugging or analysis, you can access the word co-occurrence graph:

{keywords, graph} = Rake.extract_with_graph(text)

# Check statistics for a specific word
graph.words["algorithm"]
# => %{freq: 3, deg: 7}

# freq = how many times the word appears in candidates
# deg = sum of co-occurrences (how connected the word is)
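
To show where freq and deg come from, the sketch below builds a small co-occurrence table from candidate word lists and derives both numbers from it. GraphSketch and the nested-map layout are assumptions for this example; the struct returned by the library may be organized differently.

```elixir
defmodule GraphSketch do
  # matrix[a][b] is incremented once for every pair (a, b) drawn
  # from the same candidate, including the pair (a, a).
  def cooccurrence(candidates) do
    for words <- candidates, a <- words, b <- words, reduce: %{} do
      acc ->
        row = Map.get(acc, a, %{})
        Map.put(acc, a, Map.update(row, b, 1, &(&1 + 1)))
    end
  end

  # freq is the diagonal entry; deg is the sum of the word's row.
  def stats(matrix) do
    Map.new(matrix, fn {word, row} ->
      {word, %{freq: Map.get(row, word, 0), deg: row |> Map.values() |> Enum.sum()}}
    end)
  end
end

candidates = [~w(linear constraints), ~w(linear diophantine equations)]

candidates |> GraphSketch.cooccurrence() |> GraphSketch.stats() |> Map.fetch!("linear")
# => %{freq: 2, deg: 5}
```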

Options Reference

| Option | Type | Default | Description |
|---|---|---|---|
| :stop_words | list | Rake.StopWords.english() | Words that split candidate keywords |
| :score_metric | atom | :deg_freq | One of :freq, :deg, or :deg_freq |
| :top | integer or :all | T = words/3 | Number of keywords to return |
| :min_word_length | integer | 1 | Minimum characters per word |
| :max_words | integer | unlimited | Maximum words per keyword phrase |
| :adjoining | boolean | false | Detect adjoining keyword pairs |

Default Stop Words

Rake.StopWords.english() provides ~100 common English stop words derived from the original RAKE paper’s keyword adjacency stoplist:

Rake.StopWords.english()
# => ["the", "and", "of", "a", "in", "is", "for", "to", ...]

For other languages or domains, provide your own stop word list.

Performance

RAKE is designed for efficiency. The original paper reports that RAKE extracted keywords from 500 abstracts in 160 ms, compared with 1002 ms for TextRank.

Examples

Scientific Abstract

abstract = """
Machine learning models require large amounts of labeled training data.
Active learning reduces labeling costs by selecting the most informative
samples for annotation. This paper presents a novel active learning
strategy for deep neural networks.
"""

Rake.extract(abstract, top: 5)
# => [
#   %{keyword: "active learning strategy", score: 9.0, ...},
#   %{keyword: "deep neural networks", score: 9.0, ...},
#   %{keyword: "labeled training data", score: 9.0, ...},
#   %{keyword: "Machine learning models", score: 9.0, ...},
#   %{keyword: "labeling costs", score: 4.0, ...}
# ]

News Article

article = """
The Federal Reserve announced today that interest rates will remain
unchanged. Fed Chair Jerome Powell cited ongoing inflation concerns
and labor market strength as key factors in the decision.
"""

Rake.extract(article, top: 5)
# => [
#   %{keyword: "Fed Chair Jerome Powell", score: 16.0, ...},
#   %{keyword: "ongoing inflation concerns", score: 9.0, ...},
#   %{keyword: "labor market strength", score: 9.0, ...},
#   %{keyword: "Federal Reserve", score: 4.0, ...},
#   %{keyword: "interest rates", score: 4.0, ...}
# ]

With Custom Domain Stop Words

# For legal documents, add domain-specific stop words
legal_stop_words = Rake.StopWords.english() ++ ~w(
  hereby whereas therefore pursuant herein thereof
  plaintiff defendant court jurisdiction
)

Rake.extract(legal_document, stop_words: legal_stop_words)

Comparison with Other Approaches

| Method | Training Required | Corpus Required | Speed |
|---|---|---|---|
| RAKE | No | No | Fast (single pass) |
| TF-IDF | No | Yes | Fast |
| TextRank | No | No | Slower (iterative) |
| BERT KeyPhrase | Yes | Yes | Slow (neural) |

RAKE is ideal when you need fast, unsupervised keyword extraction from individual documents without maintaining a corpus.

License

MIT License. See LICENSE for details.

References