# TiktokenEx

Pure Elixir TikToken-style byte-level BPE tokenizer (Kimi K2 compatible).

TiktokenEx is a small, dependency-light implementation of the core TikToken idea:

- Split text with a Unicode-aware regex (`pat_str`)
- Encode pieces with byte-pair encoding (BPE) using `mergeable_ranks`
- Optionally recognize special tokens (e.g. `<|im_end|>`)

It's focused on matching the behavior of MoonshotAI's Kimi K2 tokenizers, which ship a `tiktoken.model` file and a TikToken-compatible `pat_str`.
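The BPE step can be pictured with a minimal sketch (illustration only, not TiktokenEx's implementation): repeatedly merge the adjacent pair with the lowest rank until no mergeable pair remains.

```elixir
# Toy BPE sketch. A real byte-level tokenizer works on raw bytes and
# falls back to single-byte tokens; this version works on graphemes and
# assumes every leftover piece has a rank.
defmodule BpeSketch do
  def encode(text, ranks) do
    text |> String.graphemes() |> merge(ranks)
  end

  defp merge(parts, ranks) do
    mergeable =
      parts
      |> Enum.chunk_every(2, 1, :discard)
      |> Enum.map(fn [a, b] -> a <> b end)
      |> Enum.filter(&Map.has_key?(ranks, &1))

    case mergeable do
      [] ->
        Enum.map(parts, &Map.fetch!(ranks, &1))

      _ ->
        # Merge the lowest-ranked adjacent pair first, then repeat.
        best = Enum.min_by(mergeable, &Map.fetch!(ranks, &1))
        parts |> merge_once(best) |> merge(ranks)
    end
  end

  defp merge_once([a, b | rest], best) do
    if a <> b == best do
      [best | merge_once(rest, best)]
    else
      [a | merge_once([b | rest], best)]
    end
  end

  defp merge_once(rest, _best), do: rest
end
```

With the ranks from the usage example below, `"Hello"` merges `"He"`, then `"ll"`, then `"llo"`, yielding `[0, 2]`.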
## Installation

Add `tiktoken_ex` to your dependencies:

```elixir
def deps do
  [
    {:tiktoken_ex, "~> 0.2.0"}
  ]
end
```

## Usage
### Build an encoding directly

```elixir
alias TiktokenEx.Encoding

mergeable_ranks = %{
  "He" => 0,
  "ll" => 1,
  "llo" => 2,
  "H" => 10,
  "e" => 11,
  "l" => 12,
  "o" => 13
}

{:ok, enc} = Encoding.new(pat_str: ".+", mergeable_ranks: mergeable_ranks)
{:ok, ids} = Encoding.encode(enc, "Hello")
{:ok, text} = Encoding.decode(enc, ids)
```

### Load a Kimi K2 encoding from local HuggingFace artifacts
Kimi provides:

- `tiktoken.model` (mergeable ranks)
- `tokenizer_config.json` (special tokens, etc.)
```elixir
alias TiktokenEx.{Encoding, Kimi}

{:ok, enc} =
  Kimi.from_hf_files(
    tiktoken_model_path: "/path/to/tiktoken.model",
    tokenizer_config_path: "/path/to/tokenizer_config.json"
  )

{:ok, ids} = Encoding.encode(enc, "Say hi")
{:ok, decoded} = Encoding.decode(enc, ids)
```

### Load a Kimi K2 encoding from a HuggingFace repo (cached)
`from_hf_repo/2` downloads and caches `tiktoken.model` and `tokenizer_config.json` under your user cache directory.
```elixir
alias TiktokenEx.{Encoding, Kimi}

{:ok, enc} =
  Kimi.from_hf_repo(
    "moonshotai/Kimi-K2-Thinking",
    revision: "main",
    encoding_cache: true
  )

{:ok, ids} = Encoding.encode(enc, "Say hi")
```
To test without network access, inject a `:fetch_fun` (see `TiktokenEx.HuggingFace`).
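Purely as an illustration, a stubbed fetch might look like the following — the actual `:fetch_fun` contract (arity, arguments, return shape) is defined by `TiktokenEx.HuggingFace`, and the option placement and fixture path here are assumptions:

```elixir
# Hypothetical sketch: check TiktokenEx.HuggingFace for the real
# :fetch_fun contract. Here we assume it takes a URL and returns
# {:ok, binary}, serving a local fixture instead of hitting the network.
fake_fetch = fn _url ->
  File.read("test/fixtures/tiktoken.model")
end

{:ok, enc} =
  TiktokenEx.Kimi.from_hf_repo(
    "moonshotai/Kimi-K2-Thinking",
    fetch_fun: fake_fetch
  )
```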
## Special tokens

Special tokens are recognized by default. To treat them as plain text:

```elixir
{:ok, ids} = TiktokenEx.Encoding.encode(enc, "<|im_end|>", allow_special_tokens: false)
```

### Special token matching
When special tokens overlap (one is a prefix of another), the matching behavior depends on the order of alternatives in the compiled regex.

- Default: `special_token_matching: :parity` (unspecified order; closer to upstream `tiktoken`).
- Optional: `special_token_matching: :longest` (deterministic "longest match wins").
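The order sensitivity is plain PCRE behavior, visible with toy tokens `"foo"` and `"foobar"` standing in for overlapping specials:

```elixir
# PCRE alternation is first-match, not longest-match: with "foo" listed
# before "foobar", the shorter prefix wins.
Regex.run(~r/foo|foobar/, "foobar")
#=> ["foo"]

# Sorting alternatives longest-first restores "longest match wins".
Regex.run(~r/foobar|foo/, "foobar")
#=> ["foobar"]
```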
## Regex compatibility note

Kimi's upstream `pat_str` uses character-class intersections (`&&`), which are not supported by Erlang's PCRE engine. `TiktokenEx.Kimi.pat_str/0` provides a PCRE-compatible translation.
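As a generic example of such a translation (not necessarily the exact rewriting `Kimi.pat_str/0` applies): an intersection like `[\p{L}&&\p{Ll}]` ("a letter that is also lowercase") has no direct PCRE syntax, but a lookahead expresses the same "class A AND class B" constraint:

```elixir
# PCRE treats "&&" inside a character class as literal characters, so
# intersections must be rewritten. (?=A)B means "matches A and B here".
intersect = ~r/(?=\p{L})\p{Ll}/u

Regex.match?(intersect, "a")
#=> true
Regex.match?(intersect, "A")
#=> false
```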
## Development

- Run tests: `mix test`
- Run oracle parity tests (downloads HF artifacts): `mix test --include oracle`
- Run tests across backends: `scripts/test_backends.sh` (add `--oracle` to include parity)
- Run dialyzer: `mix dialyzer`
## License

MIT © 2025 North-Shore-AI