Kiri (切り)

Japanese morphological analyzer for Elixir, powered by Sudachi dictionaries.

Kiri reads Sudachi-format binary .dic dictionaries and produces segmented morphemes with part-of-speech tags, readings, normalized forms, and synonym group IDs.

Installation

Add kiri to your list of dependencies in mix.exs:

def deps do
  [
    {:kiri, "~> 0.1.0"}
  ]
end

You also need a Sudachi dictionary file (system_core.dic). Download one from the SudachiDict releases.

Usage

# Create a tokenizer from a system dictionary
{:ok, tokenizer} = Kiri.create_tokenizer("path/to/system_core.dic")

# Tokenize text (default mode :c — longest units)
morphemes = Kiri.tokenize(tokenizer, "東京都に行った")

# Inspect results
for m <- morphemes do
  IO.puts("#{m.surface}\t#{Enum.join(m.part_of_speech, ",")}\t#{m.normalized_form}")
end
# 東京都   名詞,固有名詞,地名,一般,*,*   東京都
# に       助詞,格助詞,*,*,*,*             に
# 行っ     動詞,非自立可能,*,*,五段-カ行,連用形-促音便  行く
# た       助動詞,*,*,*,助動詞-タ,終止形-一般            た
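Each morpheme also carries the readings and synonym group IDs mentioned at the top. The field names below (`reading_form`, `dictionary_form`, `synonym_group_ids`) are assumptions modeled on Sudachi's morpheme API, not confirmed Kiri fields, so check the module docs for the actual struct:

```elixir
# Hypothetical field names, modeled on Sudachi's Morpheme API --
# verify against Kiri's actual morpheme struct before relying on them.
[first | _rest] = Kiri.tokenize(tokenizer, "東京都に行った")

first.surface            # "東京都"
first.reading_form       # katakana reading, e.g. "トウキョウト" (assumed field)
first.dictionary_form    # base form of inflected words (assumed field)
first.synonym_group_ids  # Sudachi synonym group IDs (assumed field)
```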

Split modes

Kiri supports three split modes matching Sudachi's behavior:

:a  shortest units
:b  middle-length units
:c  longest units, including compound names (default)

Kiri.tokenize(tokenizer, "東京都", mode: :a)
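The contrast between the modes is easiest to see on a compound like 東京都. The splits shown in the comments follow Sudachi's documented A/C behavior; the exact segmentation depends on the dictionary version, so treat them as illustrative:

```elixir
# Split granularity varies by mode (expected splits follow Sudachi's
# documented behavior; actual output depends on the loaded dictionary).
Kiri.tokenize(tokenizer, "東京都", mode: :a)  # shortest units: 東京 / 都
Kiri.tokenize(tokenizer, "東京都", mode: :c)  # longest unit:   東京都
```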

Options

Pass options to Kiri.create_tokenizer/2:

Option                      Default  Description
:mode                       :c       Default split mode
:user_dictionaries          []       List of user dictionary file paths
:prolonged_sound_marks      false    Collapse repeated prolonged sound marks
:ignore_yomigana            false    Strip bracketed readings before tokenization
:disable_normalization      false    Skip NFKC normalization
:disable_numeric_normalize  false    Skip numeric sequence normalization
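Putting the table together, a tokenizer with a user dictionary and a different default mode might be created like this (the file paths are placeholders, and the assumption that a per-call :mode overrides the tokenizer default follows from the Usage example above):

```elixir
# Placeholder paths -- point these at your downloaded dictionaries.
{:ok, tokenizer} =
  Kiri.create_tokenizer("path/to/system_core.dic",
    mode: :b,
    user_dictionaries: ["path/to/user.dic"],
    prolonged_sound_marks: true
  )

# The per-call :mode option (see Usage) presumably overrides the
# default configured here.
morphemes = Kiri.tokenize(tokenizer, "東京都", mode: :a)
```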

Architecture

Kiri uses a Rust NIF (via Rustler) for dictionary loading and Viterbi lattice search. The input text processing pipeline (NFKC normalization, character categories, prolonged sound marks) and path rewrite plugins (katakana OOV joining, numeric normalization) are implemented in pure Elixir.
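As a rough illustration of the pure-Elixir side of that pipeline: NFKC normalization on its own is available in Elixir's standard library. This sketch only mirrors the first preprocessing step, not Kiri's actual internals:

```elixir
# NFKC folds full-width ASCII and half-width katakana into their
# canonical forms -- the first step of the input pipeline described above.
normalize = fn text -> String.normalize(text, :nfkc) end

normalize.("ＡＢＣ１２３")  # => "ABC123"
normalize.("ｶﾞｷﾞｸﾞ")        # => "ガギグ"
```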

License

Apache-2.0