IREE.Tokenizers

IREE.Tokenizers is an inference-only Elixir tokenizer package backed by the IREE tokenizer runtime. It lets Elixir applications load common LLM tokenizer assets and run fast local encode/decode without a Python service. I first discovered IREE's tokenizer work through the ZML.ai blog, and deeply admire the company and the engineering behind it.

In one sentence: this package turns Hugging Face tokenizer.json, OpenAI .tiktoken, and SentencePiece .model files into BEAM-friendly tokenizer handles with one-shot, batch, streaming, offset, mask, and vocab helper APIs.

What this package does

Why use it

Use this package when an Elixir system needs tokenizer performance and LLM-style runtime ergonomics without leaving the BEAM.

Current results

The checked-in benchmark and parity files are generated by scripts in bench/. The README summarizes only results that have corresponding artifacts in bench/results/.

Correctness/parity

bench/validate_parity.exs compares IREE.Tokenizers with elixir-nx/tokenizers, the Rust-backed Hugging Face tokenizers reference package. The current selected matrix is green for 7 public tokenizer families, 19 representative inputs per family, and both add_special_tokens: true and false modes. It also checks batch encode and stream encode parity.

See the full report: bench/results/parity_report.md.
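The shape of such a parity check can be sketched in a few lines. This is only an illustration, not the contents of bench/validate_parity.exs: `ParitySketch`, `candidate`, and `reference` are hypothetical names, and both functions are assumed to take a text and an add_special_tokens flag and return a list of ids.

```elixir
defmodule ParitySketch do
  # Hypothetical harness, not the real bench script.
  # `candidate` and `reference` are 2-arity functions:
  # (text, add_special_tokens?) -> list of token ids.
  def check(inputs, candidate, reference) do
    mismatches =
      for text <- inputs,
          special? <- [true, false],
          candidate.(text, special?) != reference.(text, special?) do
        {text, special?}
      end

    # Every input is checked in both special-token modes.
    %{cases: length(inputs) * 2, mismatches: mismatches}
  end
end

# Toy reference: one id per codepoint, with a leading 0 as the "special" token.
reference = fn text, special? ->
  ids = String.to_charlist(text)
  if special?, do: [0 | ids], else: ids
end

# Toy candidate that forgets the special token.
candidate = fn text, _special? -> String.to_charlist(text) end

report = ParitySketch.check(["a", "b"], candidate, reference)
IO.inspect(report.mismatches)
```

A matrix counts as green in this sketch only when `mismatches` is empty for every model.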

Currently green selected matrix:

Model / load path                             Coverage in the report
Qwen/Qwen2.5-7B-Instruct                      19/19 cases, both special-token modes; batch OK; stream OK
google-bert/bert-base-uncased                 19/19 cases, both special-token modes; batch OK; stream OK
openai-community/gpt2                         19/19 cases, both special-token modes; batch OK; stream OK
microsoft/Phi-3-mini-4k-instruct              19/19 cases, both special-token modes; batch OK; stream OK
google-t5/t5-small from tokenizer.json        19/19 cases, both special-token modes; batch OK; stream OK
google-t5/t5-small from SentencePiece .model  19/19 cases, both special-token modes; batch OK; stream OK
sentence-transformers/all-MiniLM-L6-v2        19/19 cases, both special-token modes; batch OK; stream OK

The benchmark-matrix rows currently published in bench/results/model_matrix.md were also re-checked on this branch for representative one-shot, batch, and stream parity.

Historical upstream/runtime gaps and local fixes are documented in docs/UPSTREAM_BUGS.md. Do not treat that file as the live status by itself; the latest parity report is the authoritative current result.

Performance

Benchmark numbers depend on machine, OTP/Elixir versions, CPU, and cache state. The checked-in numbers show the current shape:

Benchmark artifact                      Summary
bench/results/model_matrix.md           Curated real-model prompt workload: IREE one-shot is 1.6x-5.6x faster than tokenizers; IREE stream is 5.4x-14.0x faster on the published rows.
bench/results/tokenizers_compare.md     Local BPE fixture: medium/long encode is about 1.3x faster; medium/long decode is about 10x faster.
bench/results/sentencepiece_compare.md  Direct .model loading: T5-small encode is 1.97x faster; LLaMA tokenizer encode is 1.18x faster; LLaMA decode is 1.81x faster.

The model-matrix run reports latency only for rows where the benchmark corpus produces equivalent outputs across both libraries, and reports stream numbers only when streamed output matches IREE one-shot output on that corpus.

[Figure: model matrix latency chart]

[Figure: model matrix speedup chart]

Installation

Add the package to your Mix dependencies:

def deps do
  [
    {:iree_tokenizers, "~> 0.7.0"}
  ]
end

Then run:

mix deps.get

The package uses rustler_precompiled for release builds. The current prebuilt NIF target list is:

In :dev and :test, the project forces a local Rust source build. You can also force a local build with:

IREE_TOKENIZERS_BUILD=1 mix compile

Quick start

Load from the Hugging Face Hub

alias IREE.Tokenizers.Tokenizer

{:ok, tokenizer} = Tokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

{:ok, encoding} =
  Tokenizer.encode(tokenizer, "Hello from Elixir", add_special_tokens: false)

encoding.ids
#=> token ids

{:ok, text} = Tokenizer.decode(tokenizer, encoding.ids, skip_special_tokens: false)

text
#=> "Hello from Elixir"

For gated or private Hugging Face repositories, pass a token:

{:ok, tokenizer} =
  Tokenizer.from_pretrained("some/private-model",
    token: System.fetch_env!("HF_TOKEN")
  )

from_pretrained/2 caches downloaded tokenizer assets by ETag in a per-user cache directory by default. You can pass cache_dir:, revision:, subfolder:, filename:, use_cache: false, or a custom http_client:.
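The options can be assembled ahead of the call. The sketch below is one possible way to pass a token only when HF_TOKEN is actually set, so the same code path works for public repositories. `HubOptionsSketch` is a hypothetical helper, and the cache path is an illustrative value; only the option names come from the text above.

```elixir
defmodule HubOptionsSketch do
  # Hypothetical helper. The option names (token:, cache_dir:, revision:)
  # are the ones listed above; the module itself is illustrative only.
  def build(env, overrides \\ []) do
    base =
      case Map.get(env, "HF_TOKEN") do
        nil -> []
        "" -> []
        token -> [token: token]
      end

    # Caller-supplied options win over anything derived from the environment.
    Keyword.merge(base, overrides)
  end
end

# System.get_env/0 returns the environment as a map of strings.
opts = HubOptionsSketch.build(System.get_env(), cache_dir: "/tmp/tokenizer-cache")
IO.inspect(Keyword.keys(opts))

# With the package installed, the options would then be passed along:
# {:ok, tokenizer} = IREE.Tokenizers.Tokenizer.from_pretrained("some/private-model", opts)
```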

Load a local tokenizer.json

{:ok, tokenizer} = Tokenizer.from_file("tokenizer.json")

{:ok, encoding} =
  Tokenizer.encode(tokenizer, "Hello world",
    add_special_tokens: true,
    track_offsets: true
  )

encoding.ids
encoding.tokens
encoding.offsets
encoding.attention_mask
encoding.special_tokens_mask

Load OpenAI .tiktoken encodings

{:ok, tokenizer} =
  Tokenizer.from_pretrained("gpt-4o", format: :tiktoken)

{:ok, cl100k} =
  Tokenizer.from_pretrained("openai/cl100k_base", format: :tiktoken)

Tokenizer.supported_tiktoken_encodings()
#=> ["cl100k_base", "o200k_base", "o200k_harmony", "r50k_base", "gpt2", "p50k_base", "p50k_edit"]

For local .tiktoken files, pass format: :tiktoken when inference from the filename is not enough:

{:ok, tokenizer} =
  Tokenizer.from_file("gpt2.tiktoken", format: :tiktoken)

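# Illustrative path: read the raw .tiktoken bytes into memory first.
buffer = File.read!("cl100k_base.tiktoken")

{:ok, tokenizer} =
  Tokenizer.from_buffer(buffer,
    format: :tiktoken,
    tiktoken_encoding: "cl100k_base"
  )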

Load SentencePiece .model files

Local files ending in .model are inferred automatically:

{:ok, tokenizer} = Tokenizer.from_file("spiece.model")

From Hugging Face, request the SentencePiece path explicitly:

{:ok, tokenizer} =
  Tokenizer.from_pretrained("google-t5/t5-small",
    format: :sentencepiece_model
  )

Batch encode/decode

{:ok, encodings} =
  Tokenizer.encode_batch(tokenizer, ["short prompt", "another prompt"],
    add_special_tokens: false
  )

ids_batch = Enum.map(encodings, & &1.ids)
{:ok, texts} = Tokenizer.decode_batch(tokenizer, ids_batch, skip_special_tokens: false)

encode_batch/3 is intentionally parity-first: it routes through the same single-input encode/3 path for each item so tokenizer defaults, local fixes, and transformations are identical to one-shot encoding.
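As a rough illustration of that parity-first design, the sketch below routes every batch item through a single-input encode function and stops at the first error. `BatchSketch` and the toy encoder are hypothetical and stand in for the package's internals, which are not reproduced here.

```elixir
defmodule BatchSketch do
  # Hypothetical illustration, not the package's actual implementation.
  # `encode_fun` stands in for a single-input encode call and must return
  # {:ok, encoding} or {:error, reason}.
  def encode_batch(inputs, encode_fun) when is_list(inputs) do
    inputs
    |> Enum.reduce_while({:ok, []}, fn input, {:ok, acc} ->
      case encode_fun.(input) do
        {:ok, encoding} -> {:cont, {:ok, [encoding | acc]}}
        {:error, _reason} = error -> {:halt, error}
      end
    end)
    |> case do
      # Results were prepended, so restore input order.
      {:ok, encodings} -> {:ok, Enum.reverse(encodings)}
      error -> error
    end
  end
end

# Toy single-input encoder: one "id" per byte; rejects empty input.
toy_encode = fn
  "" -> {:error, :empty_input}
  text -> {:ok, %{ids: :binary.bin_to_list(text)}}
end

{:ok, encodings} = BatchSketch.encode_batch(["ab", "c"], toy_encode)
IO.inspect(Enum.map(encodings, & &1.ids), charlists: :as_lists)
```

Because each item takes exactly the one-shot path, a batch result can only differ from one-shot results if the single-input function itself does.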

Streaming encode/decode

alias IREE.Tokenizers.{DecodeStream, EncodeStream}

{:ok, stream} = EncodeStream.new(tokenizer, add_special_tokens: false)
{:ok, ids1} = EncodeStream.feed(stream, "Hello ")
{:ok, ids2} = EncodeStream.feed(stream, "world")
{:ok, ids3} = EncodeStream.finalize(stream)
ids = ids1 ++ ids2 ++ ids3

{:ok, decode_stream} = DecodeStream.new(tokenizer, skip_special_tokens: false)
{:ok, text1} = DecodeStream.feed(decode_stream, Enum.take(ids, 2))
{:ok, text2} = DecodeStream.feed(decode_stream, Enum.drop(ids, 2))
{:ok, text3} = DecodeStream.finalize(decode_stream)
text = text1 <> text2 <> text3

For tokenizer families where the native streaming runtime can diverge at chunk boundaries, the wrapper uses buffered-finalize strategies so the final stream output still matches one-shot encode.
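A buffered-finalize strategy can be sketched as follows. This is a simplified, purely functional model: unlike the EncodeStream calls above, it returns the updated state from each feed. `BufferedStreamSketch` and the toy encoder are hypothetical names, not the wrapper's real internals.

```elixir
defmodule BufferedStreamSketch do
  # Hypothetical illustration of a buffered-finalize strategy; the real
  # wrapper's internals are not shown in this README.
  defstruct encode_fun: nil, chunks: []

  def new(encode_fun) when is_function(encode_fun, 1) do
    %__MODULE__{encode_fun: encode_fun}
  end

  # Buffer the chunk and emit no ids yet, so chunk boundaries cannot
  # influence the final tokenization.
  def feed(%__MODULE__{} = stream, chunk) when is_binary(chunk) do
    {:ok, [], %{stream | chunks: [chunk | stream.chunks]}}
  end

  # Encode the concatenated text exactly as a one-shot call would.
  def finalize(%__MODULE__{} = stream) do
    text = stream.chunks |> Enum.reverse() |> IO.iodata_to_binary()
    stream.encode_fun.(text)
  end
end

# Toy whitespace "tokenizer": one id (the word length) per word.
toy_encode = fn text ->
  {:ok, text |> String.split() |> Enum.map(&String.length/1)}
end

stream = BufferedStreamSketch.new(toy_encode)
{:ok, [], stream} = BufferedStreamSketch.feed(stream, "Hel")
{:ok, [], stream} = BufferedStreamSketch.feed(stream, "lo world")
{:ok, streamed} = BufferedStreamSketch.finalize(stream)
{:ok, one_shot} = toy_encode.("Hello world")
true = streamed == one_shot
```

Encoding each chunk on its own would have split "Hello" at the chunk boundary in this toy setup, which is the kind of divergence the buffering avoids. The trade-off is that no ids are available until finalize.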

Encode transformations

alias IREE.Tokenizers.Encoding.Transformation

{:ok, encoding} =
  Tokenizer.encode(tokenizer, "hello",
    add_special_tokens: false,
    encoding_transformations: [
      Transformation.truncate(128),
      Transformation.pad(128, pad_id: 0, pad_token: "[PAD]")
    ]
  )

When a Hugging Face tokenizer.json carries fixed padding or truncation config, that default config is applied automatically. Explicit transformations are then applied after those defaults.
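The ordering can be modeled on plain id lists. In this sketch, `TransformSketch` is a hypothetical module, right-side padding is an assumption, and the real Encoding struct also carries masks, offsets, and tokens that are omitted here.

```elixir
defmodule TransformSketch do
  # Hypothetical helpers operating on plain id lists only.
  def truncate(ids, max_length), do: Enum.take(ids, max_length)

  # Assumes right-side padding; never shortens the list.
  def pad(ids, target_length, pad_id) do
    missing = max(target_length - length(ids), 0)
    ids ++ List.duplicate(pad_id, missing)
  end

  # Apply tokenizer-file defaults first, then caller-supplied steps.
  def apply_all(ids, defaults, explicit) do
    Enum.reduce(defaults ++ explicit, ids, fn step, acc -> step.(acc) end)
  end
end

defaults = [&TransformSketch.truncate(&1, 4)]
explicit = [&TransformSketch.pad(&1, 6, 0)]

TransformSketch.apply_all([11, 12, 13, 14, 15], defaults, explicit)
#=> [11, 12, 13, 14, 0, 0]
```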

API map

Module                                   Purpose
IREE.Tokenizers.Tokenizer                Main load/encode/decode/vocab API.
IREE.Tokenizers.Encoding                 Struct and helpers for token IDs, masks, offsets, tokens, padding, and truncation.
IREE.Tokenizers.Encoding.Transformation  Builders for post-encode transformations.
IREE.Tokenizers.EncodeStream             Incremental encode state.
IREE.Tokenizers.DecodeStream             Incremental decode state.
IREE.Tokenizers.Model and model modules  Build simple BPE, WordPiece, or Unigram specs from Elixir data.

Supported scope

Supported now:

Deferred or intentionally out of scope for v1:

Unsupported pair input returns:

{:error, {:invalid_argument, "pair sequence inputs are not supported in v1"}}
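Callers can pattern match on that tuple shape. The helper below is a hypothetical sketch: `PairErrorSketch` and its return atoms are illustrative names, and only the `{:error, {:invalid_argument, message}}` shape is taken from the text above.

```elixir
defmodule PairErrorSketch do
  # Hypothetical helper: maps tokenizer results to a log-friendly form.
  # Successful encodings are summarized by their id count.
  def describe({:ok, encoding}), do: {:ok, length(encoding.ids)}

  # Invalid-argument errors (such as pair inputs in v1) carry a message.
  def describe({:error, {:invalid_argument, message}}), do: {:rejected, message}

  # Anything else is passed through as a generic failure.
  def describe({:error, other}), do: {:failed, other}
end

result = {:error, {:invalid_argument, "pair sequence inputs are not supported in v1"}}
IO.inspect(PairErrorSketch.describe(result))
```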

How it is implemented

The implementation has four layers:

  1. Elixir public API

    • lib/iree/tokenizers/tokenizer.ex owns loading, options, Hugging Face downloads/caching, batch behavior, tokenizer JSON defaults, and public result shaping.
    • lib/iree/tokenizers/encoding.ex mirrors the practical Encoding helper surface: IDs, masks, offsets, tokens, pad/truncate/transform.
    • lib/iree/tokenizers/encode_stream.ex and decode_stream.ex provide BEAM stream state wrappers.
  2. Rust NIF bridge

    • lib/iree/tokenizers/native.ex uses RustlerPrecompiled in releases and source builds in development/test.
    • native/iree_tokenizers_native/src/tokenizer.rs maps Rust resources and NIF structs to the Elixir API.
    • Dirty CPU NIFs are used for encode/decode paths that can do significant native work.
  3. Vendored IREE tokenizer runtime

    • The native crate builds a curated C source bundle under native/iree_tokenizers_native/vendor/iree_tokenizer_src.
    • The pinned upstream commit is recorded in native/iree_tokenizers_native/vendor/IREE_COMMIT.
    • scripts/update_iree_bundle.sh refreshes the vendored source bundle from a matching upstream IREE checkout.
  4. Parity-preserving compatibility layer

    • SentencePiece .model buffers are converted to tokenizer JSON in Rust before construction.
    • Some tokenizer families use special decode or buffered stream strategies to match the Hugging Face reference output.
    • Encode buffers grow with bounded retry logic so native output-capacity issues return clear errors instead of silently truncating or exhausting the BEAM.
    • encode_batch/3 delegates through one-shot encode/3 for each input to preserve correctness across known native batch-runtime edge cases.
    • Hugging Face tokenizer.json padding/truncation defaults are parsed and applied in the Elixir layer.
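The bounded retry idea from the compatibility layer can be sketched in plain Elixir. Everything here is an assumption made for illustration: `BoundedRetrySketch`, the `:buffer_too_small` signal, the doubling policy, and the error tuple are hypothetical and do not describe the actual NIF contract.

```elixir
defmodule BoundedRetrySketch do
  # Hypothetical illustration of "grow the output buffer, but only up to a
  # limit". `attempt_fun` receives a capacity and returns {:ok, ids} or
  # {:error, :buffer_too_small}; the real native signals are not shown here.
  def encode_with_retry(attempt_fun, capacity, max_capacity)
      when capacity > 0 and max_capacity >= capacity do
    case attempt_fun.(capacity) do
      {:ok, ids} ->
        {:ok, ids}

      {:error, :buffer_too_small} when capacity < max_capacity ->
        # Double the capacity, clamped to the configured ceiling.
        encode_with_retry(attempt_fun, min(capacity * 2, max_capacity), max_capacity)

      {:error, :buffer_too_small} ->
        # Ceiling reached: report a clear error instead of truncating output.
        {:error, {:output_capacity_exceeded, max_capacity}}

      {:error, _reason} = other ->
        other
    end
  end
end

# Toy native call that needs room for 10 ids.
needs_ten = fn capacity ->
  if capacity >= 10, do: {:ok, Enum.to_list(1..10)}, else: {:error, :buffer_too_small}
end

{:ok, _ids} = BoundedRetrySketch.encode_with_retry(needs_ten, 4, 64)
{:error, {:output_capacity_exceeded, 8}} = BoundedRetrySketch.encode_with_retry(needs_ten, 4, 8)
```

The ceiling is what keeps a pathological input from growing the buffer without limit, while the explicit error avoids silently returning a shortened id list.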

Repository usage

Install dependencies and run the normal local checks from the repository root:

mix deps.get
mix test
cargo test --manifest-path native/iree_tokenizers_native/Cargo.toml

Format Elixir and Rust code:

mix format
cargo fmt --manifest-path native/iree_tokenizers_native/Cargo.toml

Run optional pretrained integration suites:

RUN_PRETRAINED_BATCH_INTEGRATION=1 mix test test/iree_tokenizers/batch_integration_test.exs
RUN_PRETRAINED_STREAM_INTEGRATION=1 mix test test/iree_tokenizers/stream_integration_test.exs
RUN_SENTENCEPIECE_INTEGRATION=1 mix test test/iree_tokenizers/sentencepiece_integration_test.exs

Run the full selected parity matrix:

cd bench
mix deps.get
mix run validate_parity.exs

Limit the parity matrix while iterating:

cd bench
MODEL_FILTER="Qwen/Qwen2.5-7B-Instruct" mix run validate_parity.exs

The parity report is written to bench/results/parity_report.md.

Benchmark harness

Set up once:

cd bench
mix deps.get

Run the generic fixture comparison:

mix run compare.exs

Generate the SentencePiece .model comparison charts:

mix run sentencepiece_compare.exs

Generate the curated model latency/speedup matrix:

mix run model_matrix_graphs.exs

Limit a model-matrix run while iterating:

MODEL_FILTER="Qwen/Qwen3.5-9B" mix run model_matrix_graphs.exs

All benchmark outputs are written to bench/results/. If a benchmark target requires authentication, set HF_TOKEN before running the script.

Vendored IREE bundle

The native crate builds against the vendored source bundle under native/iree_tokenizers_native/vendor/iree_tokenizer_src.

The pinned IREE commit is recorded in:

native/iree_tokenizers_native/vendor/IREE_COMMIT

To refresh the bundle from a matching upstream checkout:

scripts/update_iree_bundle.sh /path/to/iree

After any vendor refresh, run Rust tests, Elixir tests, and the pretrained parity suites. Vendor updates can overwrite local C patches that are required for parity.

License

This package is distributed under the Apache-2.0 license. The vendored IREE runtime carries its own license file under native/iree_tokenizers_native/vendor/iree_tokenizer_src/IREE-LICENSE.