IREE.Tokenizers

Fast Hugging Face tokenizer.json, OpenAI .tiktoken, and SentencePiece .model bindings for Elixir, backed by the IREE tokenizer runtime. I discovered the IREE tokenizer through the blog of ZML.ai, a company I deeply admire!

Features

Scope

V1 is intentionally inference-only.

Repository Usage

Install dependencies and run the full local validation flow from the repo root:

mix deps.get
mix test
cargo test --manifest-path native/iree_tokenizers_native/Cargo.toml

In :dev and :test, the project forces a local source build of the Rust NIF, so you do not need precompiled release assets for normal development.
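
As an illustration only, the common RustlerPrecompiled pattern for this kind of setup looks roughly like the sketch below; this is an assumption about how the loader could be wired, and the module name, OTP app, release URL, and version are placeholders, not this project's actual configuration.

defmodule IREE.Tokenizers.Native do
  # Hypothetical sketch: assumes the NIF is loaded via RustlerPrecompiled.
  # force_build: true in :dev and :test compiles the Rust crate from source
  # instead of downloading a precompiled release asset.
  use RustlerPrecompiled,
    otp_app: :iree_tokenizers,
    crate: "iree_tokenizers_native",
    base_url: "https://example.com/releases/download/v0.1.0",
    version: "0.1.0",
    force_build: Mix.env() in [:dev, :test]
end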

Example

{:ok, tokenizer} = IREE.Tokenizers.Tokenizer.from_file("tokenizer.json")

{:ok, encoding} =
  IREE.Tokenizers.Tokenizer.encode(tokenizer, "Hello world", add_special_tokens: false)

encoding.ids

{:ok, text} =
  IREE.Tokenizers.Tokenizer.decode(tokenizer, encoding.ids, skip_special_tokens: false)

For local .tiktoken files, use the same constructors with format: :tiktoken. If the filename carries a standard encoding name, it is inferred automatically:

{:ok, tokenizer} =
  IREE.Tokenizers.Tokenizer.from_file("gpt2.tiktoken", format: :tiktoken)

IREE.Tokenizers.Tokenizer.supported_tiktoken_encodings()

You can also load directly from the Hugging Face Hub:

{:ok, tokenizer} = IREE.Tokenizers.Tokenizer.from_pretrained("gpt2")
{:ok, cl100k} =
  IREE.Tokenizers.Tokenizer.from_pretrained("openai/cl100k_base", format: :tiktoken)

{:ok, gpt4o} =
  IREE.Tokenizers.Tokenizer.from_pretrained("gpt-4o", format: :tiktoken)

For custom .tiktoken repos or arbitrary in-memory buffers, pass tiktoken_encoding: explicitly when it cannot be inferred from the repo/model name or filename.
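
For example, a hedged sketch of how that might look; the repo name and the :cl100k_base value here are illustrative assumptions rather than tested targets:

# Hypothetical repo whose filename does not carry a standard encoding name,
# so the encoding is named explicitly. Check
# IREE.Tokenizers.Tokenizer.supported_tiktoken_encodings() for accepted values.
{:ok, tokenizer} =
  IREE.Tokenizers.Tokenizer.from_pretrained("my-org/custom-tiktoken",
    format: :tiktoken,
    tiktoken_encoding: :cl100k_base
  )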

For SentencePiece .model files, use format: :sentencepiece_model for raw buffers and pretrained loads. Local files ending in .model are inferred automatically:

{:ok, tokenizer} =
  IREE.Tokenizers.Tokenizer.from_file("spiece.model")

{:ok, tokenizer} =
  IREE.Tokenizers.Tokenizer.from_pretrained("google-t5/t5-small",
    format: :sentencepiece_model
  )

If you need authentication for gated/private repos:

{:ok, tokenizer} =
  IREE.Tokenizers.Tokenizer.from_pretrained("some/private-model",
    token: System.fetch_env!("HF_TOKEN")
  )

Benchmarks

Current Local Results

The benchmark harness compares this package against the published tokenizers package.

On a recent local GPT-2 batch-of-100 encode run, this package measured 9.4M tokens/sec, while the IREE tokenizer author reports 10.1M tokens/sec in the upstream post. A gap of that size is expected across different machines and environments and does not by itself indicate a correctness problem.

The important result is that the implementation remains in the same performance class and preserves the expected large speedup over the Elixir tokenizers package.
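
As a rough, hedged illustration of how a tokens/sec figure like that can be reproduced by hand (the real harness lives under bench/; the sample text, batch size of 100, single-process loop, default encode options, and treating encoding.ids as a list are assumptions, not the harness's exact methodology):

# Illustrative throughput measurement: encode 100 texts, count produced ids,
# and divide by elapsed wall-clock time.
{:ok, tokenizer} = IREE.Tokenizers.Tokenizer.from_pretrained("gpt2")
texts = List.duplicate("The quick brown fox jumps over the lazy dog.", 100)

{micros, total_tokens} =
  :timer.tc(fn ->
    texts
    |> Enum.map(fn text ->
      {:ok, encoding} = IREE.Tokenizers.Tokenizer.encode(tokenizer, text)
      length(encoding.ids)
    end)
    |> Enum.sum()
  end)

IO.puts("#{Float.round(total_tokens / (micros / 1_000_000), 1)} tokens/sec")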

Local fixture comparison against elixir-nx/tokenizers

The local fixture comparison script writes its markdown summary and SVG charts to bench/results/.

The SentencePiece-specific comparison script writes bench/results/sentencepiece_compare.md and the corresponding SVG charts to the same directory.

Fixture encode latency chart:

Fixture encode comparison

Fixture decode latency chart:

Fixture decode comparison

SentencePiece .model comparison

The SentencePiece-specific comparison script checks direct .model loading against the official tokenizers package loaded from the corresponding tokenizer.json.
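
Conceptually the check looks like the hedged sketch below; the file names, sample sentence, and use of default encode options are illustrative, and the actual script is bench/sentencepiece_compare.exs, run as shown in the Benchmark Harness section:

# Illustrative comparison: encode the same text with the IREE .model loader
# and with elixir-nx/tokenizers loaded from the matching tokenizer.json,
# then compare the produced ids and the wall-clock encode time.
{:ok, iree} = IREE.Tokenizers.Tokenizer.from_file("spiece.model")
{:ok, hf} = Tokenizers.Tokenizer.from_file("tokenizer.json")

text = "Translate English to German: The house is wonderful."

{iree_us, {:ok, iree_enc}} =
  :timer.tc(fn -> IREE.Tokenizers.Tokenizer.encode(iree, text) end)

{hf_us, {:ok, hf_enc}} =
  :timer.tc(fn -> Tokenizers.Tokenizer.encode(hf, text) end)

true = iree_enc.ids == Tokenizers.Encoding.get_ids(hf_enc)
IO.puts("IREE .model: #{iree_us} μs, tokenizers: #{hf_us} μs")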

Current checked-in results from bench/results/sentencepiece_compare.md:

| Model | Op | Repo | Input bytes | Output ids | IREE .model | tokenizers | Speedup |
|---|---|---|---|---|---|---|---|
| T5-small (SentencePiece Unigram) | encode | google-t5/t5-small | 52 | 10 | 12.0 μs | 35.0 μs | 2.92x |
| LLaMA tokenizer (SentencePiece BPE) | encode | hf-internal-testing/llama-tokenizer | 44 | 12 | 15.0 μs | 16.0 μs | 1.07x |
| T5-small (SentencePiece Unigram) | decode | google-t5/t5-small | 52 | 10 | 4.0 μs | 3.0 μs | 0.75x |
| LLaMA tokenizer (SentencePiece BPE) | decode | hf-internal-testing/llama-tokenizer | 44 | 12 | 9.0 μs | 12.0 μs | 1.33x |

SentencePiece encode latency chart:

SentencePiece encode comparison

SentencePiece decode latency chart:

SentencePiece decode comparison

Model latency comparison

The current checked-in local snapshot from bench/results/model_matrix.md contains:

| Model | Repo used | Tokenizers package (ms) | IREE oneshot / stream (ms) | Speedup |
|---|---|---|---|---|
| LiquidAI/LFM2.5-1.2B-Instruct | LiquidAI/LFM2.5-1.2B-Instruct | 61.4 ms | 15.8 ms / 5.03 ms | 3.9x / 12.2x |
| Qwen/Qwen3.5-9B | Qwen/Qwen3.5-9B | 69.5 ms | 10.9 ms / 10.7 ms | 6.4x / 6.5x |
| zai-org/GLM-5.1 | zai-org/GLM-5.1 | 59.2 ms | 10.7 ms / 5.51 ms | 5.5x / 10.7x |
| mistralai/Ministral-3-3B-Reasoning-2512 | mistralai/Ministral-3-3B-Reasoning-2512 | 79.0 ms | 10.8 ms / 5.89 ms | 7.3x / 13.4x |
| BAAI/bge-m3 | BAAI/bge-m3 | 46.7 ms | 23.1 ms / 14.3 ms | 2.0x / 3.3x |
| google/gemma-4-31B-it | google/gemma-4-31B-it | 20.4 ms | 10.3 ms / 3.78 ms | 2.0x / 5.4x |

The benchmark harness intentionally keeps only one representative repo per tokenizer family when multiple model variants share the same tokenizer; the current family-level matrix targets the repos listed in the table above.

Latency chart:

Model matrix latency

Speedup chart:

Model matrix speedup

Benchmark Harness

The benchmark harness lives under bench/.

Set it up once:

cd bench
mix deps.get

Run the generic encode/decode comparison:

mix run compare.exs

This generates the fixture comparison markdown and SVG charts in bench/results/.

Generate the SentencePiece .model comparison charts:

mix run sentencepiece_compare.exs

Generate the multi-model latency/speedup graphs:

mix run model_matrix_graphs.exs

Limit the multi-model run to a single model while iterating:

MODEL_FILTER="Qwen/Qwen3.5-9B" mix run model_matrix_graphs.exs

You can also target the latest GLM run specifically:

MODEL_FILTER="zai-org/GLM-5.1" mix run model_matrix_graphs.exs

All benchmark outputs are written to bench/results/.

If any benchmark target requires authentication, set HF_TOKEN before running the script:

HF_TOKEN=... mix run model_matrix_graphs.exs

Vendored IREE Bundle

The native crate builds against a curated vendored source bundle under native/iree_tokenizers_native/vendor/iree_tokenizer_src.

The vendored bundle is pinned to the IREE commit recorded in native/iree_tokenizers_native/vendor/IREE_COMMIT.

To refresh that bundle from the pinned upstream IREE checkout:

scripts/update_iree_bundle.sh /path/to/iree