Kiri (切り)

Japanese morphological analyzer for Elixir, powered by Sudachi dictionaries.

Kiri reads Sudachi-format dictionaries (converted to .kiri format) and produces segmented morphemes with part-of-speech tags, readings, normalized forms, and synonym group IDs. The default backend is pure Elixir, so no Rust toolchain is required.

Installation

Add kiri to your list of dependencies in mix.exs:

def deps do
  [
    {:kiri, "~> 0.2"}
  ]
end

Dictionary Setup

Download a Sudachi dictionary and convert it to .kiri format:

# Download
mkdir -p ~/.kiri-ji/dict
curl -L -o ~/.kiri-ji/dict/sudachi-dictionary-core.zip \
  https://github.com/WorksApplications/SudachiDict/releases/download/v20260116/sudachi-dictionary-20260116-core.zip
unzip -o ~/.kiri-ji/dict/sudachi-dictionary-core.zip -d ~/.kiri-ji/dict
mv ~/.kiri-ji/dict/sudachi-dictionary-*/system_core.dic ~/.kiri-ji/dict/

# Convert to .kiri format (one-time step)
mix kiri.convert ~/.kiri-ji/dict/system_core.dic ~/.kiri-ji/system_core.kiri

Usage

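# Load once at application startup. Path.expand/1 resolves the leading "~"
# to the home directory, which Elixir's file functions do not do themselves.
{:ok, dict} = Kiri.load_dictionary(Path.expand("~/.kiri-ji/system_core.kiri"))

# Tokenize from any process; calls are safe to run concurrently and need no GenServer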
morphemes = Kiri.tokenize(dict, "東京都に行った")

for m <- morphemes do
  IO.puts "#{m.surface}\t#{Enum.join(m.part_of_speech, ",")}\t#{m.normalized_form}"
end
# 東京都  名詞,固有名詞,地名,一般,*,*  東京都
# に      助詞,格助詞,*,*,*,*            に
# 行っ    動詞,非自立可能,*,*,五段-カ行,連用形-促音便  行く
# た      助動詞,*,*,*,助動詞-タ,終止形-一般  た
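
Downstream code usually filters the morpheme list by part of speech. The sketch below keeps only content words (nouns, verbs, adjectives) and returns their normalized forms. It assumes morphemes expose the part_of_speech and normalized_form fields used above; MorphemeDemo and the hand-written sample data are illustrative stand-ins, not part of Kiri.

```elixir
defmodule MorphemeDemo do
  # Top-level part-of-speech tags treated as content words in this sketch:
  # noun, verb, adjective.
  @content_pos ["名詞", "動詞", "形容詞"]

  @doc "Returns the normalized forms of content-word morphemes, in order."
  def content_words(morphemes) do
    for %{part_of_speech: [top | _], normalized_form: form} <- morphemes,
        top in @content_pos,
        do: form
  end
end

# Hand-written stand-ins mirroring the sample output above; real code would
# pass the result of Kiri.tokenize/2 instead.
sample = [
  %{surface: "東京都", part_of_speech: ["名詞", "固有名詞", "地名", "一般", "*", "*"], normalized_form: "東京都"},
  %{surface: "に", part_of_speech: ["助詞", "格助詞", "*", "*", "*", "*"], normalized_form: "に"},
  %{surface: "行っ", part_of_speech: ["動詞", "非自立可能", "*", "*", "五段-カ行", "連用形-促音便"], normalized_form: "行く"},
  %{surface: "た", part_of_speech: ["助動詞", "*", "*", "*", "助動詞-タ", "終止形-一般"], normalized_form: "た"}
]

IO.inspect(MorphemeDemo.content_words(sample))
# ["東京都", "行く"]
```

Matching on the head of part_of_speech keeps the filter independent of the finer-grained tag levels.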

Concurrency

The %Dictionary{} struct is a ~2 KB handle. The actual ~150 MB binary data lives in :persistent_term, shared across all processes with zero copy.

texts
|> Task.async_stream(&Kiri.tokenize(dict, &1), max_concurrency: 100)
|> Enum.to_list()
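
The handle-plus-:persistent_term pattern can be illustrated without Kiri. The sketch below is not Kiri's actual implementation: HandleDemo, its key layout, and byte_at/2 are hypothetical names chosen for illustration. It stores a large binary once and lets several tasks read it through a small struct. Note that Task.async_stream/3 wraps each result as {:ok, value}.

```elixir
defmodule HandleDemo do
  # Hypothetical handle: holds only the key under which the data is stored.
  defstruct [:key]

  def load(binary) when is_binary(binary) do
    key = {__MODULE__, make_ref()}
    :persistent_term.put(key, binary)
    %__MODULE__{key: key}
  end

  # Any process can read the term; it is not copied into the caller's heap.
  def byte_at(%__MODULE__{key: key}, offset) do
    :binary.at(:persistent_term.get(key), offset)
  end
end

# About 4 MB of repeating bytes stands in for dictionary data.
handle = HandleDemo.load(:binary.copy(<<0, 1, 2, 3>>, 1_000_000))

results =
  0..7
  |> Task.async_stream(&HandleDemo.byte_at(handle, &1), max_concurrency: 8)
  |> Enum.map(fn {:ok, byte} -> byte end)

IO.inspect(results)
# [0, 1, 2, 3, 0, 1, 2, 3]
```

Only the small struct is sent to each task; the stored binary is looked up by key in the reading process.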

Split Modes

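Override the default split mode (:c) per call. Following Sudachi, mode A yields the shortest units, mode C the longest (named-entity-like) units, and mode B falls in between: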

morphemes = Kiri.tokenize(dict, "関西国際空港", mode: :a)
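
To see how the modes differ on a given input, it can help to run all three side by side. The helper below is a small sketch: SplitModeDemo is a hypothetical module, and it takes the tokenizer as a function argument so that it relies only on the mode option and the surface field shown in this README.

```elixir
defmodule SplitModeDemo do
  @modes [:a, :b, :c]

  # `tokenize` is any 2-arity function (text, opts) -> list of morphemes,
  # for example &Kiri.tokenize(dict, &1, &2).
  def compare(tokenize, text) when is_function(tokenize, 2) do
    Map.new(@modes, fn mode ->
      surfaces = text |> tokenize.(mode: mode) |> Enum.map(& &1.surface)
      {mode, surfaces}
    end)
  end
end

# With a loaded dictionary:
#   SplitModeDemo.compare(&Kiri.tokenize(dict, &1, &2), "関西国際空港")
# returns a map of mode => list of surfaces.
```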

Options

Options can be passed to Kiri.tokenize/3:

Option                     Type            Default  Description
mode                       :a | :b | :c    :c       Split mode (A/B/C)
prolonged_sound_marks      boolean         false    Collapse repeated prolonged sound marks
ignore_yomigana            boolean         false    Strip bracketed readings after kanji
disable_normalization      boolean         false    Skip NFKC input text normalization
disable_numeric_normalize  boolean         false    Skip numeric normalization in path rewrite
backend                    :elixir | :nif  :elixir  Tokenization backend
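
To get a feel for what disable_normalization skips, the standard library's String.normalize/2 shows the effect of plain NFKC on mixed-width input. This is only an approximation of the idea: Kiri's normalization step may apply additional dictionary-specific rules that are not shown here.

```elixir
# Half-width katakana, an ideographic space, full-width Latin letters and digits.
raw = "ｶﾞｲﾄﾞ　Ｋｉｒｉ　２０２６"

# NFKC folds compatibility variants into their canonical composed forms.
normalized = String.normalize(raw, :nfkc)

IO.puts(normalized)
# ガイド Kiri 2026
```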

Architecture

The default backend is written entirely in Elixir. It covers the full plugin stack (input text normalization, path rewriting, split modes, prolonged sound marks, yomigana stripping) and the core algorithms (Viterbi lattice solver, Darts-clone trie, MeCab-style out-of-vocabulary handling), all implemented with binary pattern matching against dictionary sections stored in :persistent_term. An optional NIF backend is available for users who want Rust-accelerated lattice construction.
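
For readers unfamiliar with lattice-based segmentation, the following toy sketch shows the general shape of a Viterbi search: each candidate word carries a word cost, each pair of adjacent words carries a connection cost, and the cheapest path through the lattice wins. It is not Kiri's solver. ToyViterbi, the lexicon, the word classes, and every cost are invented for illustration, and it works on graphemes and a plain map rather than on a trie and binary dictionary sections.

```elixir
defmodule ToyViterbi do
  # Toy lexicon: surface => {class, word_cost}. All values are made up.
  @lexicon %{
    "東京" => {:noun, 3},
    "京都" => {:noun, 3},
    "東京都" => {:noun, 4},
    "東" => {:noun, 6},
    "都" => {:suffix, 5},
    "に" => {:particle, 1}
  }

  # Toy connection costs between adjacent classes; unlisted pairs cost 1.
  defp connection(:bos, _), do: 0
  defp connection(:noun, :noun), do: 4
  defp connection(:noun, :suffix), do: 0
  defp connection(_, _), do: 1

  @doc "Returns {:ok, surfaces, total_cost}, or :error if no path covers the text."
  def segment(text) do
    chars = String.graphemes(text)
    n = length(chars)

    # ends[pos] lists {class, best_total_cost, reversed_path} for every
    # lattice node that ends at grapheme position pos.
    start = %{0 => [{:bos, 0, []}]}

    ends =
      Enum.reduce(1..n//1, start, fn stop, acc ->
        nodes =
          for begin <- 0..(stop - 1),
              prevs = Map.get(acc, begin, []),
              prevs != [],
              surface = Enum.join(Enum.slice(chars, begin, stop - begin)),
              {class, word_cost} <- List.wrap(Map.get(@lexicon, surface)) do
            # Pick the cheapest predecessor for this node (the Viterbi step).
            {prev_cost, prev_path} =
              prevs
              |> Enum.map(fn {prev_class, cost, path} ->
                {cost + connection(prev_class, class), path}
              end)
              |> Enum.min_by(&elem(&1, 0))

            {class, prev_cost + word_cost, [surface | prev_path]}
          end

        Map.put(acc, stop, nodes)
      end)

    case Map.get(ends, n, []) do
      [] ->
        :error

      nodes ->
        {_class, cost, path} = Enum.min_by(nodes, &elem(&1, 1))
        {:ok, Enum.reverse(path), cost}
    end
  end
end

IO.inspect(ToyViterbi.segment("東京都に"))
# {:ok, ["東京都", "に"], 6}
```

With these invented costs, the single-word reading 東京都 (cost 4) beats 東京 + 都 (cost 8) and 東 + 京都 (cost 13) before the particle is attached. A real solver would also score the connection to an end-of-sentence node and fall back to OOV candidates where the lexicon has no match, both of which this sketch omits.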

License

Apache-2.0