# Kiri (切り)
Japanese morphological analyzer for Elixir, powered by Sudachi dictionaries.
Kiri reads Sudachi-format binary .dic dictionaries and produces segmented
morphemes with part-of-speech tags, readings, normalized forms, and synonym
group IDs.
## Installation
Add `kiri` to your list of dependencies in `mix.exs`:

```elixir
def deps do
  [
    {:kiri, "~> 0.1.0"}
  ]
end
```
You also need a Sudachi dictionary file (`system_core.dic`). Download one from
the SudachiDict releases.
## Usage
```elixir
# Create a tokenizer from a system dictionary
{:ok, tokenizer} = Kiri.create_tokenizer("path/to/system_core.dic")

# Tokenize text (default mode :c — longest units)
morphemes = Kiri.tokenize(tokenizer, "東京都に行った")

# Inspect results
for m <- morphemes do
  IO.puts("#{m.surface}\t#{Enum.join(m.part_of_speech, ",")}\t#{m.normalized_form}")
end

# 東京都  名詞,固有名詞,地名,一般,*,*                  東京都
# に      助詞,格助詞,*,*,*,*                          に
# 行っ    動詞,非自立可能,*,*,五段-カ行,連用形-促音便  行く
# た      助動詞,*,*,*,助動詞-タ,終止形-一般           た
```

## Split modes
Kiri supports three split modes matching Sudachi's behavior:

- `:c` (default) — longest units / named entities (e.g. "東京都")
- `:b` — middle-length units
- `:a` — shortest units (e.g. "東京", "都")

```elixir
Kiri.tokenize(tokenizer, "東京都", mode: :a)
```

## Options
Pass options to `Kiri.create_tokenizer/2`:

| Option | Default | Description |
|---|---|---|
| `:mode` | `:c` | Default split mode |
| `:user_dictionaries` | `[]` | List of user dictionary file paths |
| `:prolonged_sound_marks` | `false` | Collapse repeated prolonged sound marks |
| `:ignore_yomigana` | `false` | Strip bracketed readings before tokenization |
| `:disable_normalization` | `false` | Skip NFKC normalization |
| `:disable_numeric_normalize` | `false` | Skip numeric sequence normalization |
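As a sketch of how several options combine (file paths here are placeholders, not real files):

```elixir
# Sketch: a tokenizer with a user dictionary, middle-length split mode,
# and prolonged-sound-mark collapsing enabled. Replace the paths with
# the locations of your own dictionary files.
{:ok, tokenizer} =
  Kiri.create_tokenizer("path/to/system_core.dic",
    mode: :b,
    user_dictionaries: ["path/to/user.dic"],
    prolonged_sound_marks: true
  )

morphemes = Kiri.tokenize(tokenizer, "すごーーい")
```

A mode passed to `Kiri.tokenize/3` overrides the default set at tokenizer creation.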
## Architecture
Kiri uses a Rust NIF (via Rustler) for dictionary loading and Viterbi lattice search. The input text processing pipeline (NFKC normalization, character categories, prolonged sound marks) and path rewrite plugins (katakana OOV joining, numeric normalization) are implemented in pure Elixir.
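For instance, the NFKC normalization step of that pipeline can be expressed in pure Elixir with OTP's built-in `:unicode` module (a sketch of the idea, not Kiri's actual code):

```elixir
# NFKC normalization folds compatibility characters to canonical forms:
# halfwidth katakana become fullwidth, and fullwidth digits become ASCII.
text = "ﾃｽﾄ１２３"
normalized = :unicode.characters_to_nfkc_binary(text)
IO.puts(normalized)
# => テスト123
```

Disabling `:disable_normalization` skips exactly this kind of folding, which matters when you need surface forms to match the original input byte-for-byte.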
## License
Apache-2.0