# TinySegmenter
An Elixir port of TinySegmenter, an extremely compact Japanese tokenizer.
TinySegmenter is a compact Japanese word segmentation/tokenization library that was originally written in JavaScript by Taku Kudo in 2008. This is a faithful port to Elixir that maintains compatibility with the original implementation.
## Features
- **Compact**: Extremely small footprint, with pre-trained weights compiled into the module
- **Fast**: Uses efficient Elixir data structures and pattern matching
- **Accurate**: Machine learning-based segmentation with pre-trained models
- **No Dependencies**: Pure Elixir implementation with no external dependencies
## Installation

Add `tinysegmenter` to your list of dependencies in `mix.exs`:
```elixir
def deps do
  [
    {:tinysegmenter, "~> 0.1.0"}
  ]
end
```

## Usage
### Basic Tokenization
```elixir
# Tokenize Japanese text into a list of words
TinySegmenter.tokenize("私の名前は中野です")
# => ["私", "の", "名前", "は", "中野", "です"]

# Get tokens as a delimited string
TinySegmenter.segment("私の名前は中野です")
# => "私 | の | 名前 | は | 中野 | です"

# Use a custom delimiter
TinySegmenter.segment("私の名前は中野です", " / ")
# => "私 / の / 名前 / は / 中野 / です"
```

### Character Type Classification
```elixir
TinySegmenter.ctype("あ")  # => "I" (Hiragana)
TinySegmenter.ctype("ア")  # => "K" (Katakana)
TinySegmenter.ctype("日")  # => "H" (Kanji)
TinySegmenter.ctype("A")   # => "A" (Alphabet)
TinySegmenter.ctype("1")   # => "N" (Number)
TinySegmenter.ctype("@")   # => "O" (Other)
```

## Examples
```elixir
# Mixed Japanese text
TinySegmenter.tokenize("今日はいい天気ですね")
# => ["今日", "は", "いい", "天気", "です", "ね"]

# Text with numbers and dates
TinySegmenter.tokenize("2023年1月1日")
# => ["2", "0", "2", "3", "年", "1月", "1", "日"]

# Location names
TinySegmenter.tokenize("東京都")
# => ["東京都"]

# Technical terms in Katakana
TinySegmenter.tokenize("コンピュータ")
# => ["コンピュータ"]
```

## API
### `tokenize(text)`

Tokenizes Japanese text into a list of words.

- Parameters:
  - `text` - a Japanese text string to be tokenized
- Returns: a list of strings, where each string is a token (word)
### `segment(text, delimiter \\ " | ")`

Convenience function that tokenizes the text and joins the tokens into a single string.

- Parameters:
  - `text` - a Japanese text string to be tokenized
  - `delimiter` - the string to place between tokens (default: `" | "`)
- Returns: a string with the tokens joined by the delimiter
### `ctype(char)`

Returns the character type for a given character.

- Parameters:
  - `char` - a single-character string
- Returns: a string representing the character type:
  - `"H"` - Kanji (漢字)
  - `"I"` - Hiragana (ひらがな)
  - `"K"` - Katakana (カタカナ)
  - `"A"` - Alphabet
  - `"N"` - Number
  - `"M"` - Numeric Kanji (一、二、三, etc.)
  - `"O"` - Other
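Character types like these are often useful on their own, e.g. for chunking a string into runs of the same script. The sketch below uses a minimal stand-in classifier (`CtypeSketch.ctype/1` is illustrative and covers far fewer ranges than the real `TinySegmenter.ctype/1`):

```elixir
defmodule CtypeSketch do
  # Minimal stand-in for TinySegmenter.ctype/1; the real function also
  # handles numeric kanji ("M") and wider Unicode ranges.
  def ctype(ch) do
    cond do
      ch =~ ~r/[0-9０-９]/u -> "N"
      ch =~ ~r/[ぁ-ん]/u   -> "I"
      ch =~ ~r/[ァ-ヴー]/u -> "K"
      ch =~ ~r/[a-zA-Z]/   -> "A"
      ch =~ ~r/[一-龠]/u   -> "H"
      true                 -> "O"
    end
  end
end

# Chunk a string into runs of characters with the same type.
"2023年です"
|> String.graphemes()
|> Enum.chunk_by(&CtypeSketch.ctype/1)
|> Enum.map(&Enum.join/1)
# => ["2023", "年", "です"]
```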
## How It Works
TinySegmenter uses a machine learning-based approach to segment Japanese text. It:
- Classifies each character into types (Kanji, Hiragana, Katakana, etc.)
- Uses pre-trained scoring weights to evaluate potential word boundaries
- Applies unigram, bigram, and trigram features for accurate segmentation
- Makes boundary decisions based on cumulative scores
The scoring weights were trained on a Japanese corpus and are embedded directly into the module at compile time for maximum performance.
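The steps above can be sketched in miniature. Everything below is illustrative: the weights, the single bigram-of-types feature, and the bias are made up for the example, whereas the real module scores each position with many unigram, bigram, and trigram features over both characters and character types:

```elixir
defmodule SegmenterSketch do
  # Hypothetical trained weights: feature -> score. A positive total
  # favors inserting a word boundary at that position.
  @weights %{
    {:bigram_type, "H", "I"} => 500,   # Kanji -> Hiragana often splits
    {:bigram_type, "I", "H"} => 650,   # Hiragana -> Kanji often splits
    {:bigram_type, "I", "I"} => -300   # Hiragana runs tend to stay joined
  }

  # Negative bias: the default decision is "no boundary".
  @bias -200

  # Tiny character-type classifier (illustrative; see ctype/1 in the API).
  def ctype(ch) do
    cond do
      ch =~ ~r/[ぁ-ん]/u -> "I"
      ch =~ ~r/[ァ-ヴー]/u -> "K"
      ch =~ ~r/[一-龠]/u -> "H"
      true -> "O"
    end
  end

  # Decide a boundary between two adjacent characters from their types.
  def boundary?(prev, next) do
    score = @bias + Map.get(@weights, {:bigram_type, ctype(prev), ctype(next)}, 0)
    score > 0
  end

  # Walk the graphemes, starting a new token whenever boundary?/2 fires.
  def tokenize(text) do
    text
    |> String.graphemes()
    |> Enum.reduce([], fn ch, acc ->
      case acc do
        [] -> [[ch]]
        [cur | rest] ->
          if boundary?(List.last(cur), ch) do
            [[ch], cur | rest]
          else
            [cur ++ [ch] | rest]
          end
      end
    end)
    |> Enum.reverse()
    |> Enum.map(&Enum.join/1)
  end
end

SegmenterSketch.tokenize("名前です")
# => ["名前", "です"]
```

The real model's much larger feature set is what lets it handle cases this sketch cannot, such as splitting two adjacent kanji words.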
## Credits

### Original Implementation
- Original Author: Taku Kudo (taku@chasen.org)
- Original JavaScript Version: (c) 2008 under New BSD License
- Original URL: http://chasen.org/~taku/software/TinySegmenter/
### Python Port
- Author: Masato Hagiwara
- URL: https://masatohagiwara.net/tinysegmenter-in-python.html
### Elixir Port
- Author: Simon Edwardsson
- Ported from the Python implementation
- Maintains compatibility with the original algorithm
## License
TinySegmenter is freely distributable under the terms of the New BSD License, following the original implementation by Taku Kudo.