TiktokenEx

Pure Elixir TikToken-style byte-level BPE tokenizer (Kimi K2 compatible).

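TiktokenEx is a small, dependency-light implementation of the core TikToken idea: split text into pieces with a regex (pat_str), encode each piece with byte-level BPE using a table of mergeable ranks, and optionally recognize special tokens.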

It’s focused on matching the behavior of MoonshotAI’s Kimi K2 tokenizers that ship a tiktoken.model file and a TikToken-compatible pat_str.

Installation

Add tiktoken_ex to your dependencies:

def deps do
  [
    {:tiktoken_ex, "~> 0.2.0"}
  ]
end

Usage

Build an encoding directly

alias TiktokenEx.Encoding

mergeable_ranks = %{
  "He" => 0,
  "ll" => 1,
  "llo" => 2,
  "H" => 10,
  "e" => 11,
  "l" => 12,
  "o" => 13
}

{:ok, enc} = Encoding.new(pat_str: ".+", mergeable_ranks: mergeable_ranks)

{:ok, ids} = Encoding.encode(enc, "Hello")
{:ok, text} = Encoding.decode(enc, ids)
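
For intuition about what the ranks do, here is a standalone sketch of rank-ordered byte-pair merging. The BpeSketch module and its encode_piece/2 function are hypothetical names used only for illustration. They are not part of TiktokenEx, and the library's actual implementation may differ. The sketch assumes every remaining part has an entry in the ranks map.

```elixir
defmodule BpeSketch do
  # Hypothetical helper (not part of TiktokenEx): encodes one regex piece
  # by repeatedly merging the adjacent pair with the lowest rank.
  def encode_piece(piece, ranks) when is_binary(piece) do
    # Start from single bytes.
    parts = for <<b <- piece>>, do: <<b>>

    parts
    |> merge(ranks)
    |> Enum.map(&Map.fetch!(ranks, &1))
  end

  defp merge(parts, ranks) do
    # Collect {rank, index} for every adjacent pair that has a rank.
    candidates =
      parts
      |> Enum.chunk_every(2, 1, :discard)
      |> Enum.with_index()
      |> Enum.flat_map(fn {[a, b], i} ->
        case Map.fetch(ranks, a <> b) do
          {:ok, rank} -> [{rank, i}]
          :error -> []
        end
      end)

    case candidates do
      [] ->
        # No mergeable pair is left.
        parts

      _ ->
        # Lowest rank wins; ties go to the leftmost pair.
        {_rank, i} = Enum.min(candidates)
        {left, [a, b | right]} = Enum.split(parts, i)
        merge(left ++ [a <> b | right], ranks)
    end
  end
end

ranks = %{"He" => 0, "ll" => 1, "llo" => 2, "H" => 10, "e" => 11, "l" => 12, "o" => 13}

ids = BpeSketch.encode_piece("Hello", ranks)
# => [0, 2]
```

With these ranks the merges run H,e,l,l,o -> He,l,l,o -> He,ll,o -> He,llo, so the sketch ends with the two tokens "He" and "llo".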

Load a Kimi K2 encoding from local HuggingFace artifacts

Kimi provides a tiktoken.model file and a tokenizer_config.json file. Pass both paths to from_hf_files:

alias TiktokenEx.{Encoding, Kimi}

{:ok, enc} =
  Kimi.from_hf_files(
    tiktoken_model_path: "/path/to/tiktoken.model",
    tokenizer_config_path: "/path/to/tokenizer_config.json"
  )

{:ok, ids} = Encoding.encode(enc, "Say hi")
{:ok, decoded} = Encoding.decode(enc, ids)

Load a Kimi K2 encoding from a HuggingFace repo (cached)

from_hf_repo/2 downloads and caches tiktoken.model and tokenizer_config.json under your user cache directory.

alias TiktokenEx.{Encoding, Kimi}

{:ok, enc} =
  Kimi.from_hf_repo(
    "moonshotai/Kimi-K2-Thinking",
    revision: "main",
    encoding_cache: true
  )

{:ok, ids} = Encoding.encode(enc, "Say hi")

To test without network, inject a :fetch_fun (see TiktokenEx.HuggingFace).

Special tokens

Special tokens are recognized by default. To treat them as plain text:

{:ok, ids} = TiktokenEx.Encoding.encode(enc, "<|im_end|>", allow_special_tokens: false)

Special token matching

When special tokens overlap (one is a prefix of another), which token is matched depends on the order of the alternatives in the special-token regex.
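
The effect of alternative order can be seen with Elixir's Regex module alone. The sketch below uses two hypothetical overlapping tokens and a hypothetical build helper. It does not call TiktokenEx and makes no claim about the order the library itself uses.

```elixir
# Hypothetical overlapping tokens: `short` is a prefix of `long`.
short = "<|im_end"
long = "<|im_end|>"

# Build an alternation regex from literal tokens, preserving the given order.
build = fn tokens ->
  tokens
  |> Enum.map(&Regex.escape/1)
  |> Enum.join("|")
  |> Regex.compile!()
end

text = "x<|im_end|>y"

# Shorter alternative listed first: it wins, and "|>" is left over as plain text.
[short_first] = Regex.run(build.([short, long]), text)

# Longer alternative listed first: the full token is matched.
[long_first] = Regex.run(build.([long, short]), text)

# Sorting by descending byte size makes the result independent of input order.
sorted = Enum.sort_by([short, long], &byte_size/1, :desc)
[sorted_match] = Regex.run(build.(sorted), text)
```

Here short_first is "<|im_end", while long_first and sorted_match are both "<|im_end|>". Sorting longest-first is one common way to get predictable matching when tokens overlap.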

Regex compatibility note

Kimi’s upstream pat_str uses character-class intersections (&&), which are not supported by Erlang’s PCRE engine. TiktokenEx.Kimi.pat_str/0 provides a PCRE-compatible translation.
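
As a generic illustration of this kind of translation, an intersection such as [a-z&&[^aeiou]] (lowercase letters that are not vowels) can be rewritten as a negative lookahead followed by the base class. This is only a sketch of the technique on a made-up ASCII pattern. It is not taken from Kimi's pat_str, and TiktokenEx.Kimi.pat_str/0 may use a different rewrite.

```elixir
# Intersection syntax used by some regex engines: [a-z&&[^aeiou]]
# One PCRE-compatible rewrite: reject vowels with a lookahead, then match [a-z].
consonant = ~r/(?![aeiou])[a-z]/

consonants =
  consonant
  |> Regex.scan("hello")
  |> List.flatten()
# => ["h", "l", "l"]
```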

Development

License

MIT © 2025 North-Shore-AI