Simile
String similarity and distance algorithms for Elixir.
Algorithms
Distance (lower = more similar)
- Levenshtein -- insert, delete, substitute
- Damerau-Levenshtein -- insert, delete, substitute, transpose (unrestricted)
- Optimal String Alignment (OSA) -- restricted edit distance with transpositions
- Hamming -- positional differences (equal-length strings only)
- Indel -- insert and delete only (no substitution)
- LCS -- longest common subsequence length
- N-gram -- configurable n-gram overlap distance
Similarity (0.0 to 1.0, higher = more similar)
- Jaro -- matching characters and transpositions
- Jaro-Winkler -- Jaro with prefix bonus
- Sorensen-Dice -- bigram overlap coefficient
Usage
Simile.levenshtein("kitten", "sitting") #=> 3
Simile.damerau_levenshtein("abc", "bac") #=> 1
Simile.osa_distance("abc", "bac") #=> 1
Simile.hamming("karolin", "kathrin") #=> {:ok, 3}
Simile.indel("kitten", "sitting") #=> 5
Simile.lcs("kitten", "sitting") #=> 4
Simile.ngram_distance("night", "nacht", 2) #=> 0.75
Simile.jaro("martha", "marhta") #=> 0.944...
Simile.jaro_winkler("martha", "marhta") #=> 0.961...
Simile.sorensen_dice("night", "nacht") #=> 0.25
Simile.normalized_levenshtein("kitten", "sitting") #=> 0.428...
Simile.normalized_indel("kitten", "sitting") #=> 0.714...
Simile.indel_similarity("kitten", "sitting") #=> 0.285...Matching
Simile.best_match("elxir", ["elixir", "erlang", "elm"])
#=> [{"elixir", 0.94...}]
Simile.best_match("rb", ["ruby", "rust", "python"], top: 2)
#=> [{"ruby", ...}, {"rust", ...}]
Simile.filter("elxir", ["elixir", "erlang", "elm"], min_score: 0.8)
#=> [{"elixir", 0.94...}]
Both accept a :by option to use any scoring function:
Simile.best_match("night", ["nacht", "nite", "day"],
by: &Simile.sorensen_dice/2
)Installation
def deps do
[
{:simile, "~> 0.1.0"}
]
endDocumentation: hexdocs.pm/simile
License
MIT