Razdel
Rule-based Russian sentence and word tokenization — Elixir port of Natasha Razdel.
Part of the natasha-ex ecosystem: Russian NLP for Elixir.
Usage
iex> Razdel.sentenize("Привет. Как дела?")
[%Razdel.Substring{start: 0, stop: 7, text: "Привет."},
%Razdel.Substring{start: 8, stop: 17, text: "Как дела?"}]
iex> Razdel.tokenize("Кружка-термос на 0.5л")
[%Razdel.Substring{start: 0, stop: 13, text: "Кружка-термос"},
%Razdel.Substring{start: 14, stop: 16, text: "на"},
%Razdel.Substring{start: 17, stop: 20, text: "0.5"},
%Razdel.Substring{start: 20, stop: 21, text: "л"}]
Handles Russian abbreviations (т.е., г., ст., к.п.н.), initials (Л. В. Щербы), quotes, brackets, dialogue dashes, list bullets, and smileys.
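In the output above, start and stop read as character offsets (not byte offsets), with stop exclusive. A minimal sketch under that assumption recovers each token from the original text with the standard String module. The offsets are copied from the tokenize example, and plain maps stand in for %Razdel.Substring{} so the snippet runs without the library:

```elixir
text = "Кружка-термос на 0.5л"

# Offsets copied from the tokenize example above; plain maps stand in
# for %Razdel.Substring{} so this runs without the library installed.
spans = [
  %{start: 0, stop: 13},
  %{start: 14, stop: 16},
  %{start: 17, stop: 20},
  %{start: 20, stop: 21}
]

# String.slice/3 counts graphemes, not bytes, so it lines up with
# character-based offsets (assumed) even though each Cyrillic letter
# occupies 2 bytes in UTF-8.
tokens = for %{start: a, stop: b} <- spans, do: String.slice(text, a, b - a)

IO.inspect(tokens)
# → ["Кружка-термос", "на", "0.5", "л"]
```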
Installation
def deps do
  [{:razdel, "~> 0.1.0"}]
end
Algorithm
Split-then-rejoin: the text is first split at every potential delimiter (any of the characters .?!;""»… or a closing parenthesis), then a chain of heuristic rules decides which splits are false positives and rejoins them.
Rules (in priority order):
- empty_side — join if either side is empty
- no_space_prefix — join if no space after delimiter
- lower_right — join if next token is lowercase
- delimiter_right — join if next token is punctuation
- abbr_left — join for known abbreviations (400+ entries)
- inside_pair_abbr — join for paired abbreviations (т.е., и т.д.)
- initials_left — join for single uppercase letters (initials)
- list_item — join for numbered/lettered list bullets
- close_quote — handle closing quotes correctly
- close_bracket — handle closing brackets
- dash_right — join dialogue dashes before lowercase words
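The split-then-rejoin idea can be sketched in a few lines. This is a toy illustration, not the library's code: the SplitRejoinSketch module, its three simplified rules (loosely modelled on no_space_prefix, lower_right and abbr_left), the reduced delimiter set and the three-entry abbreviation list are all assumptions made for the example.

```elixir
defmodule SplitRejoinSketch do
  # Toy illustration only: three simplified rules and a tiny
  # abbreviation list. This is not Razdel's actual implementation.
  @abbrs ["г", "ст", "ул"]

  # Rules are tried in order; the first one that returns true rejoins.
  @rules [:no_space_prefix, :lower_right, :abbr_left]

  def sentenize(text) do
    # 1. Split after every potential delimiter (reduced set: . ? ! …).
    chunks =
      ~r/[^.?!…]*[.?!…]|[^.?!…]+/u
      |> Regex.scan(text)
      |> Enum.map(&hd/1)

    # 2. Walk the split points and rejoin the false positives.
    case chunks do
      [] ->
        []

      [first | rest] ->
        {done, buffer} =
          Enum.reduce(rest, {[], first}, fn right, {done, left} ->
            if Enum.any?(@rules, &join?(&1, left, right)) do
              {done, left <> right}
            else
              {[left | done], right}
            end
          end)

        [buffer | done] |> Enum.reverse() |> Enum.map(&String.trim/1)
    end
  end

  # Join if there is no whitespace right after the delimiter ("0.5").
  defp join?(:no_space_prefix, _left, right), do: not (right =~ ~r/^\s/)

  # Join if the right side starts with a lowercase letter.
  defp join?(:lower_right, _left, right) do
    case right |> String.trim_leading() |> String.first() do
      nil -> false
      ch -> ch == String.downcase(ch) and ch != String.upcase(ch)
    end
  end

  # Join if the word before the delimiter is a known abbreviation.
  defp join?(:abbr_left, left, _right) do
    case Regex.run(~r/([^\s.?!…]+)[.?!…]$/u, left) do
      [_, word] -> String.downcase(word) in @abbrs
      nil -> false
    end
  end
end

IO.inspect(SplitRejoinSketch.sentenize("Он живёт в г. Москве. Объём 0.5 л. и всё. Как дела?"))
# → ["Он живёт в г. Москве.", "Объём 0.5 л. и всё.", "Как дела?"]
```

Here the split after "г." is undone by the abbreviation rule, the split inside "0.5" by the no-space rule, and the split after "л." by the lowercase rule. Ordering the rules cheapest-first means most false positives are rejected before any abbreviation lookup happens.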
Performance
Benchmarked on the original test data (48,735 sentence texts, 208,995 token texts), Apple M5:
| Operation | Python (CPython 3.13) | Elixir (OTP 27) | Faster |
|---|---|---|---|
| sentenize | 78,000 /s | 23,000 /s | Python ~3.4× |
| tokenize | 323,000 /s | 363,000 /s | Elixir ~1.1× |
The sentenize gap comes from UTF-8 character-counting overhead in Substring.locate: the BEAM has no O(1) conversion from byte offsets to character offsets. The split+segment core alone runs faster than Python.
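To illustrate the conversion cost: Elixir strings are UTF-8 binaries, so byte-level functions such as :binary.match/2 report byte offsets, and turning one into a character offset means decoding the whole prefix. The sketch below uses only the standard library. The ByteToChar module and char_offset/2 are hypothetical names for illustration and do not reproduce Substring.locate.

```elixir
defmodule ByteToChar do
  # Hypothetical helper, not Razdel's Substring.locate.
  # Converts a byte offset into a character (grapheme) offset by
  # decoding the whole prefix, which is O(n) in the prefix length.
  def char_offset(text, byte_offset) do
    text
    |> binary_part(0, byte_offset)
    |> String.length()
  end
end

text = "Привет. Как дела?"

# :binary.match/2 works on bytes; each Cyrillic letter takes 2 bytes.
{byte_start, byte_len} = :binary.match(text, "Как")

IO.inspect({byte_start, byte_len})
# → {14, 6}

IO.inspect(ByteToChar.char_offset(text, byte_start))
# → 8
```

The character offset 8 matches the start of the second sentence in the sentenize example, while the byte offset is 14. Repeating this prefix scan for every substring is the kind of overhead the paragraph above describes.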
To reproduce the benchmark:
mix run bench/bench.exs
License
MIT — Danila Poyarkov