Gibran

Yesterday is but today's memory, and tomorrow is today's dream.

Gibran

Gibran is an Elixir port of WordsCounted, a Ruby natural language processor. I have lofty goals for Gibran, such as:

Metaphone phonetic coding system
Levenshtein distance algorithm
Soundex algorithm
Porter Stemming algorithm
String similarity as described by Simon White

But for now, you'll have to be content with a powerful tokeniser and a utility counter.

Token count, unique token count, and character count.
Average characters per token.
HashDicts of tokens and their frequencies, lengths, and densities.
The longest token(s) and its length.
The most frequent token(s) and its frequency.
Unique tokens.

Usage

Let's start with something simple.

alias Gibran.Tokeniser
alias Gibran.Counter

str = "Yesterday is but today&#39;s memory, and tomorrow is today&#39;s dream."
Tokeniser.tokenise(str)
# => ["yesterday", "is", "but", "today&#39;s", "memory", "and", "tomorrow", "is", "today&#39;s", "dream"]

Tokeniser.tokenise(str) |> Counter.uniq_token_count
# => 8

By default Gibran uses the following regular expression to tokenise strings: ~r/[^\p{L}'-]/u. However, you can provide your own regular expression through the pattern option. You can also combine pattern with exclude to create sophisticated tokenisation strategies.

Tokeniser.tokenise(string, exclude: &String.length(&1) < 4) |> Counter.token_count
# => 6

The exclude option accepts a string, a function, a regular expression, or a list combining any one or more of those types.

# Using `exclude` with a function.
Tokeniser.tokenise("Kingdom of the Imagination", exclude: &(String.length(&1) < 10))
["imagination"]

# Using `exclude` with a regular expression.
Tokeniser.tokenise("Sand and Foam", exclude: ~r/and/)
["foam"]

# Using `exclude` with a string.
Tokeniser.tokenise("Eye of The Prophet", exclude: "eye of")
["the", "prophet"]

# Using `exclude` with a list of a combination of types.
Tokeniser.tokenise("Eye of The Prophet", exclude: ["eye", &(String.ends_with?(&1, "he")), ~r/of/])
["prophet"]

Gibran has a shortcut to work with strings directly instead of running them through the tokeniser first.

Gibran.from_string(str, :token_count, opts: [exclude: &String.length(&1) < 4])
# => 6

Gibran normalises input before applying transformations to avoid inconsistencies that can arise from character-casing.

The doctests contain extensive usage examples. Please take a look there for more details.