motif

Pure Erlang keyword and topic extraction using the RAKE algorithm. Supports French, English, and German with built-in stop-word lists. No external dependencies.

Installation

%% rebar.config
{deps, [{motif, "0.1.0"}]}.

Quick start

%% Extract from English text (language auto-detected)
Results = motif:extract(<<"Red roses are a symbol of love and beauty.">>),
%% [{<<"red roses">>, 4.0}, {<<"symbol">>, 1.0}, {<<"love">>, 1.0}, {<<"beauty">>, 1.0}]

%% Explicit language + max results
Top3 = motif:extract(Text, #{lang => fr, max => 3}),

%% Auto-detect language (samples first 200 words)
Auto = motif:extract(Text, #{lang => auto}),

%% Get the stop-word list for a language
Stops = motif:stop_words(fr).

API

%% Extract keyword candidates. Returns [{Keyword, Score}] sorted by score desc.
-spec extract(binary()) -> [{binary(), float()}].
-spec extract(binary(), #{max  => pos_integer(),
                           lang => fr | en | de | auto}) -> [{binary(), float()}].

%% Return the built-in stop-word list for a language.
-spec stop_words(fr | en | de) -> [binary()].

Algorithm

RAKE (Rapid Automatic Keyword Extraction):

  1. Split text into sentences on . ! ?
  2. Within each sentence, split into candidate phrases on stop words
  3. Score each word: degree(word) / frequency(word), where degree(w) is the sum of the lengths (in words) of all phrases containing w and frequency(w) is the number of times w occurs
  4. Score each candidate: sum of its word scores
  5. Return sorted by score descending, deduplicated

Because a candidate's score is the sum of its word scores, multi-word phrases made of words that rarely occur on their own score highest.
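
The steps above can be sketched in Erlang roughly as follows. This is a minimal illustration, not motif's actual source: the module name rake_sketch, the extract/2 signature taking an explicit stop-word list, and the whitespace-only tokenization are assumptions made for the example.

```erlang
%% Hypothetical sketch of the RAKE steps; not the motif implementation.
-module(rake_sketch).
-export([extract/2]).

%% Text is a binary; Stops is a list of lower-case binaries (assumed format).
extract(Text, Stops) ->
    StopSet = sets:from_list(Stops),
    %% Step 1: split into sentences on . ! ?
    Sentences = binary:split(string:lowercase(Text),
                             [<<".">>, <<"!">>, <<"?">>], [global, trim_all]),
    %% Step 2: split each sentence into candidate phrases on stop words
    Phrases = lists:flatmap(fun(S) -> phrases(S, StopSet) end, Sentences),
    %% Step 3: score each word as degree / frequency
    Scores = word_scores(Phrases),
    %% Step 4: score each distinct candidate as the sum of its word scores
    Scored = [{iolist_to_binary(lists:join(<<" ">>, P)),
               lists:sum([maps:get(W, Scores) || W <- P])}
              || P <- lists:usort(Phrases)],
    %% Step 5: sort by score, descending
    lists:sort(fun({_, A}, {_, B}) -> A >= B end, Scored).

%% Tokenize on whitespace only (a simplification), then cut at stop words.
phrases(Sentence, StopSet) ->
    Words = binary:split(Sentence, [<<" ">>, <<"\t">>, <<"\n">>],
                         [global, trim_all]),
    split_on_stops(Words, StopSet, [], []).

split_on_stops([], _StopSet, Cur, Acc) ->
    lists:reverse(flush(Cur, Acc));
split_on_stops([W | Rest], StopSet, Cur, Acc) ->
    case sets:is_element(W, StopSet) of
        true  -> split_on_stops(Rest, StopSet, [], flush(Cur, Acc));
        false -> split_on_stops(Rest, StopSet, [W | Cur], Acc)
    end.

%% Close the phrase being built, skipping empty ones.
flush([], Acc)  -> Acc;
flush(Cur, Acc) -> [lists:reverse(Cur) | Acc].

%% For each word, accumulate {Degree, Frequency}, then divide.
word_scores(Phrases) ->
    Stats = lists:foldl(
        fun(P, M0) ->
            Len = length(P),
            lists:foldl(
                fun(W, M) ->
                    maps:update_with(W,
                                     fun({D, F}) -> {D + Len, F + 1} end,
                                     {Len, 1}, M)
                end, M0, P)
        end, #{}, Phrases),
    maps:map(fun(_W, {D, F}) -> D / F end, Stats).
```

With the Quick start sentence and the stop words are, a, of, and and, the sketch yields four candidates. The first is {<<"red roses">>, 4.0}: each of the two words has degree 2 and frequency 1, so the phrase scores 2.0 + 2.0. The order of the three candidates tied at 1.0 depends on the sort and is not something the sketch tries to control.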

Language detection

lang => auto samples the first 200 words, counts stop-word hits per language, and picks the language with the most hits. Falls back to en on a tie or empty input.
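
The detection rule described above could look something like the sketch below. The module name lang_detect_sketch, the detect/2 signature, and the whitespace tokenization are assumptions for illustration; motif's real internals may differ.

```erlang
%% Hypothetical sketch of stop-word based language detection.
-module(lang_detect_sketch).
-export([detect/2]).

%% StopLists is [{Lang, [binary()]}], with lower-case stop words (assumed).
detect(Text, StopLists) ->
    %% Sample at most the first 200 whitespace-separated words.
    Words = lists:sublist(
              binary:split(string:lowercase(Text),
                           [<<" ">>, <<"\t">>, <<"\n">>], [global, trim_all]),
              200),
    %% Count stop-word hits per language.
    Hits = [{count_hits(Words, sets:from_list(Stops)), Lang}
            || {Lang, Stops} <- StopLists],
    %% Highest count first; fall back to en on a tie or when nothing matches.
    case lists:reverse(lists:sort(Hits)) of
        [{N, _}, {N, _} | _]    -> en;
        [{N, Lang} | _] when N > 0 -> Lang;
        _                       -> en
    end.

count_hits(Words, StopSet) ->
    length([W || W <- Words, sets:is_element(W, StopSet)]).
```

The language table could be built from the documented stop_words/1 function, for example [{L, motif:stop_words(L)} || L <- [en, fr, de]], assuming the returned lists are lower-case.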

License

Apache 2.0 — see LICENSE.