TinySegmenter

An Elixir port of TinySegmenter, an extremely compact Japanese tokenizer.

TinySegmenter is a compact Japanese word-segmentation library originally written in JavaScript by Taku Kudo in 2008. This Elixir port stays faithful to, and output-compatible with, the original implementation.

Features

  - Faithful port of Taku Kudo's original JavaScript implementation
  - Simple API: tokenize/1, segment/2, and ctype/1
  - Pre-trained scoring weights embedded at compile time

Installation

Add tinysegmenter to your list of dependencies in mix.exs:

def deps do
  [
    {:tinysegmenter, "~> 0.1.0"}
  ]
end
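
After adding the entry, fetch the dependency with the standard Mix workflow:

```shell
mix deps.get
```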

Usage

Basic Tokenization

# Tokenize Japanese text into a list of words
TinySegmenter.tokenize("私の名前は中野です")
# => ["私", "の", "名前", "は", "中野", "です"]

# Get tokens as a delimited string
TinySegmenter.segment("私の名前は中野です")
# => "私 | の | 名前 | は | 中野 | です"

# Use a custom delimiter
TinySegmenter.segment("私の名前は中野です", " / ")
# => "私 / の / 名前 / は / 中野 / です"

Character Type Classification

TinySegmenter.ctype("あ")  # => "I" (Hiragana)
TinySegmenter.ctype("ア")  # => "K" (Katakana)
TinySegmenter.ctype("日")  # => "H" (Kanji)
TinySegmenter.ctype("A")   # => "A" (Alphabet)
TinySegmenter.ctype("1")   # => "N" (Number)
TinySegmenter.ctype("@")   # => "O" (Other)
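
The classification above can be approximated by Unicode codepoint ranges. The sketch below is illustrative only (the hypothetical module name CTypeSketch is not part of this library, and the original implementation's character classes are richer, e.g. it also has an "M" class for kanji numerals such as 一二三):

```elixir
defmodule CTypeSketch do
  # Illustrative classification of a character by Unicode codepoint range.
  # Not the library's actual implementation.
  def ctype(char) do
    <<cp::utf8, _rest::binary>> = char

    cond do
      cp in ?0..?9 or cp in 0xFF10..0xFF19 -> "N"  # ASCII and full-width digits
      cp in ?A..?Z or cp in ?a..?z -> "A"          # Latin letters
      cp in 0x3041..0x3096 -> "I"                  # Hiragana
      cp in 0x30A1..0x30FA -> "K"                  # Katakana
      cp in 0x4E00..0x9FFF -> "H"                  # CJK Unified Ideographs (Kanji)
      true -> "O"                                  # Everything else
    end
  end
end

CTypeSketch.ctype("あ")  # => "I"
CTypeSketch.ctype("日")  # => "H"
```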

Examples

# Mixed Japanese text
TinySegmenter.tokenize("今日はいい天気ですね")
# => ["今日", "は", "いい", "天気", "です", "ね"]

# Text with numbers and dates
TinySegmenter.tokenize("2023年1月1日")
# => ["2", "0", "2", "3", "年", "1月", "1", "日"]

# Location names
TinySegmenter.tokenize("東京都")
# => ["東京都"]

# Technical terms in Katakana
TinySegmenter.tokenize("コンピュータ")
# => ["コンピュータ"]

API

tokenize(text)

Tokenizes Japanese text into a list of words.

segment(text, delimiter \\ " | ")

Convenience function that returns the tokens joined into a single string with the given delimiter (default " | ").
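
As a mental model (illustrative, not necessarily the library's internal implementation), segment behaves like joining the result of tokenize with Enum.join/2:

```elixir
# A precomputed token list stands in for calling tokenize/1 here.
tokens = ["私", "の", "名前", "は", "中野", "です"]
Enum.join(tokens, " | ")
# => "私 | の | 名前 | は | 中野 | です"
```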

ctype(char)

Returns the character type for a given character.

How It Works

TinySegmenter uses a machine learning-based approach to segment Japanese text. It:

  1. Classifies each character into types (Kanji, Hiragana, Katakana, etc.)
  2. Uses pre-trained scoring weights to evaluate potential word boundaries
  3. Applies unigram, bigram, and trigram features for accurate segmentation
  4. Makes boundary decisions based on cumulative scores

The scoring weights were trained on a Japanese corpus and are embedded directly into the module at compile time for maximum performance.
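
The steps above can be sketched as a toy scoring loop. Everything below is illustrative: the module name, bias, and weights are made up, and only a single feature type (a bigram of character types) is scored, whereas the real model sums many unigram, bigram, and trigram features over characters and character types:

```elixir
defmodule BoundarySketch do
  # Made-up bias and weights for character-type bigrams (H = Kanji,
  # I = Hiragana). The real module embeds thousands of trained weights.
  @bias -400
  @weights %{"IH" => 800, "HI" => 200}

  # Segments a list of {char, ctype} pairs: a boundary is placed before a
  # character whenever the cumulative score at that position is positive.
  def split([{first, _} | _] = pairs) do
    pairs
    |> Enum.chunk_every(2, 1, :discard)
    |> Enum.reduce({[], [first]}, fn [{_, t1}, {c2, t2}], {words, current} ->
      score = @bias + Map.get(@weights, t1 <> t2, 0)

      if score > 0 do
        # Positive score: word boundary before c2, so flush the current word.
        {words ++ [Enum.join(current)], [c2]}
      else
        {words, current ++ [c2]}
      end
    end)
    |> then(fn {words, current} -> words ++ [Enum.join(current)] end)
  end
end

# With these toy weights, only a Hiragana-to-Kanji transition splits:
BoundarySketch.split([{"私", "H"}, {"は", "I"}, {"行", "H"}, {"く", "I"}])
# => ["私は", "行く"]
```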

Credits

Original Implementation

TinySegmenter was created by Taku Kudo in 2008 and written in JavaScript.

Python Port

Elixir Port

License

TinySegmenter is freely distributable under the terms of the New BSD License, following the original implementation by Taku Kudo.

References