IREE.Tokenizers

IREE.Tokenizers is an inference-only Elixir tokenizer package backed by the IREE tokenizer runtime. It lets Elixir applications load common LLM tokenizer assets and run fast local encode/decode without a Python service. I first discovered IREE's tokenizer work through the ZML.ai blog, and I deeply admire that company and the engineering behind it.

In one sentence: this package turns Hugging Face tokenizer.json, OpenAI .tiktoken, and SentencePiece .model files into BEAM-friendly tokenizer handles with one-shot, batch, streaming, offset, mask, and vocab helper APIs.

What this package does

- Loads tokenizer assets from local files, in-memory buffers, or the Hugging Face Hub.
- Supports Hugging Face tokenizer.json, OpenAI .tiktoken, and SentencePiece .model formats.
- Supports BPE, WordPiece, and Unigram model families.
- Encodes and decodes single inputs, lists of inputs, and streams of chunks.
- Returns token IDs, token strings, type IDs, attention masks, special-token masks, and optional byte offsets.
- Applies the padding/truncation defaults stored in tokenizer.json wherever the reference tokenizers package applies them.
- Uses a native Rust/C runtime through Rustler, with precompiled NIFs for common release targets and local source builds in development/test.

Why use it

Use this package when an Elixir system needs tokenizer performance and LLM-style runtime ergonomics without leaving the BEAM:

- serving or batching LLM prompts in Phoenix, Livebook, Broadway, Oban, Nx, or custom inference services
- counting or packing tokens before model calls
- streaming tokenization for large prompts or ingestion pipelines
- using OpenAI/tiktoken-compatible encodings from Elixir
- loading SentencePiece .model files directly when a model repository does not expose the exact tokenizer.json path you want
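
As a rough illustration of the "counting or packing tokens" use case, the sketch below greedily groups prompts into batches under a token budget. `TokenPacker` and the `count_fun` callback are hypothetical names that are not part of this package; the callback stands in for whatever counts tokens in your application.

```elixir
defmodule TokenPacker do
  @moduledoc """
  Hypothetical helper (not part of IREE.Tokenizers): greedily packs prompts
  into batches whose total token count stays within a budget.
  """

  # `count_fun` maps a prompt to its token count. With a loaded tokenizer it
  # could wrap Tokenizer.encode(tokenizer, text, add_special_tokens: false)
  # and return length(encoding.ids).
  def pack(prompts, budget, count_fun) when is_integer(budget) and budget > 0 do
    {batches, current, _used} =
      Enum.reduce(prompts, {[], [], 0}, fn prompt, {batches, current, used} ->
        n = count_fun.(prompt)

        cond do
          # The prompt still fits into the batch being built.
          used + n <= budget -> {batches, [prompt | current], used + n}
          # The batch is full: close it and start a new one with this prompt.
          current != [] -> {[Enum.reverse(current) | batches], [prompt], n}
          # A single prompt larger than the budget gets a batch of its own.
          true -> {batches, [prompt], n}
        end
      end)

    all = if current == [], do: batches, else: [Enum.reverse(current) | batches]
    Enum.reverse(all)
  end
end
```

Passing the counter as a function keeps the packing logic independent of any particular tokenizer, so it can be tested with a trivial counter before wiring in a real one.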

Current results

The checked-in benchmark and parity files are generated by scripts in bench/. This README summarizes only results that have corresponding artifacts in bench/results/.

Correctness/parity

bench/validate_parity.exs compares IREE.Tokenizers with elixir-nx/tokenizers, the Rust-backed Hugging Face tokenizers reference package. The currently selected matrix is green for 7 public model/load-path combinations (google-t5/t5-small is covered through both tokenizer.json and SentencePiece .model), 19 representative inputs per combination, and both add_special_tokens: true and add_special_tokens: false. The script also checks batch-encode and stream-encode parity.

See the full report: bench/results/parity_report.md.

Currently green selected matrix:

| Model / load path | Coverage in the report |
| --- | --- |
| `Qwen/Qwen2.5-7B-Instruct` | 19/19 cases, both special-token modes; batch OK; stream OK |
| `google-bert/bert-base-uncased` | 19/19 cases, both special-token modes; batch OK; stream OK |
| `openai-community/gpt2` | 19/19 cases, both special-token modes; batch OK; stream OK |
| `microsoft/Phi-3-mini-4k-instruct` | 19/19 cases, both special-token modes; batch OK; stream OK |
| `google-t5/t5-small` from `tokenizer.json` | 19/19 cases, both special-token modes; batch OK; stream OK |
| `google-t5/t5-small` from SentencePiece `.model` | 19/19 cases, both special-token modes; batch OK; stream OK |
| `sentence-transformers/all-MiniLM-L6-v2` | 19/19 cases, both special-token modes; batch OK; stream OK |

The benchmark-matrix rows currently published in bench/results/model_matrix.md were also re-checked on this branch for representative one-shot, batch, and stream parity:

- LiquidAI/LFM2.5-1.2B-Instruct
- Qwen/Qwen3.5-9B
- zai-org/GLM-5.1
- mistralai/Ministral-3-3B-Reasoning-2512
- google/gemma-4-31B-it

Historical upstream/runtime gaps and local fixes are documented in docs/UPSTREAM_BUGS.md. Do not treat that file as the live status by itself; the latest parity report is the authoritative current result.

Performance

Benchmark numbers depend on hardware, OTP/Elixir versions, and cache state, so treat them as indicative rather than absolute. The checked-in artifacts currently show:

| Benchmark artifact | Summary |
| --- | --- |
| `bench/results/model_matrix.md` | Curated real-model prompt workload: IREE one-shot is 1.6x-5.6x faster than `tokenizers`; IREE stream is 5.4x-14.0x faster on the published rows. |
| `bench/results/tokenizers_compare.md` | Local BPE fixture: medium/long encode is about 1.3x faster; medium/long decode is about 10x faster. |
| `bench/results/sentencepiece_compare.md` | Direct `.model` loading: T5-small encode is 1.97x faster; LLaMA tokenizer encode is 1.18x faster; LLaMA decode is 1.81x faster. |

The model-matrix run reports latency only for rows where the benchmark corpus produces equivalent outputs across both libraries, and reports stream numbers only when streamed output matches IREE one-shot output on that corpus.

Latency chart: [figure: Model matrix latency]

Speedup chart: [figure: Model matrix speedup]

Installation

Add the package to your Mix dependencies:

def deps do
  [
    {:iree_tokenizers, "~> 0.7.0"}
  ]
end

Then run:

mix deps.get

The package uses rustler_precompiled for release builds. The current prebuilt NIF target list is:

- aarch64-apple-darwin
- x86_64-apple-darwin
- x86_64-unknown-linux-gnu

In :dev and :test, the project forces a local Rust source build. You can also force a local build with:

IREE_TOKENIZERS_BUILD=1 mix compile

Quick start

Load from the Hugging Face Hub

alias IREE.Tokenizers.Tokenizer

{:ok, tokenizer} = Tokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

{:ok, encoding} =
  Tokenizer.encode(tokenizer, "Hello from Elixir", add_special_tokens: false)

encoding.ids
#=> token ids

{:ok, text} = Tokenizer.decode(tokenizer, encoding.ids, skip_special_tokens: false)

text
#=> "Hello from Elixir"

For gated or private Hugging Face repositories, pass a token:

{:ok, tokenizer} =
  Tokenizer.from_pretrained("some/private-model",
    token: System.fetch_env!("HF_TOKEN")
  )

from_pretrained/2 caches downloaded tokenizer assets by ETag in a per-user cache directory by default. You can pass cache_dir:, revision:, subfolder:, filename:, use_cache: false, or a custom http_client:.

Load a local `tokenizer.json`

{:ok, tokenizer} = Tokenizer.from_file("tokenizer.json")

{:ok, encoding} =
  Tokenizer.encode(tokenizer, "Hello world",
    add_special_tokens: true,
    track_offsets: true
  )

encoding.ids
encoding.tokens
encoding.offsets
encoding.attention_mask
encoding.special_tokens_mask

Load OpenAI `.tiktoken` encodings

{:ok, tokenizer} =
  Tokenizer.from_pretrained("gpt-4o", format: :tiktoken)

{:ok, cl100k} =
  Tokenizer.from_pretrained("openai/cl100k_base", format: :tiktoken)

Tokenizer.supported_tiktoken_encodings()
#=> ["cl100k_base", "o200k_base", "o200k_harmony", "r50k_base", "gpt2", "p50k_base", "p50k_edit"]

For local .tiktoken files, pass format: :tiktoken when inference from the filename is not enough:

{:ok, tokenizer} =
  Tokenizer.from_file("gpt2.tiktoken", format: :tiktoken)

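# `buffer` holds the raw bytes of a .tiktoken file; the path here is illustrative.
buffer = File.read!("cl100k_base.tiktoken")

{:ok, tokenizer} =
  Tokenizer.from_buffer(buffer,
    format: :tiktoken,
    tiktoken_encoding: "cl100k_base"
  )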

Load SentencePiece `.model` files

Local files ending in .model are inferred automatically:

{:ok, tokenizer} = Tokenizer.from_file("spiece.model")

From Hugging Face, request the SentencePiece path explicitly:

{:ok, tokenizer} =
  Tokenizer.from_pretrained("google-t5/t5-small",
    format: :sentencepiece_model
  )

Batch encode/decode

{:ok, encodings} =
  Tokenizer.encode_batch(tokenizer, ["short prompt", "another prompt"],
    add_special_tokens: false
  )

ids_batch = Enum.map(encodings, & &1.ids)
{:ok, texts} = Tokenizer.decode_batch(tokenizer, ids_batch, skip_special_tokens: false)

encode_batch/3 is intentionally parity-first: it routes through the same single-input encode/3 path for each item so tokenizer defaults, local fixes, and transformations are identical to one-shot encoding.
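
The parity-first idea can be sketched as mapping a single-input encode function over the list. This is a hypothetical illustration, not the package's implementation: `BatchSketch` and `encode_one` are made-up names, and stopping at the first error is an assumption about error handling rather than documented behavior.

```elixir
defmodule BatchSketch do
  # Hypothetical illustration, not the package's implementation.
  # `encode_one` is any single-input function returning {:ok, encoding}
  # or {:error, reason}, for example a wrapper around Tokenizer.encode/3.
  def encode_batch(inputs, encode_one) when is_list(inputs) do
    inputs
    |> Enum.reduce_while({:ok, []}, fn input, {:ok, acc} ->
      case encode_one.(input) do
        # Keep going and collect results in reverse order.
        {:ok, encoding} -> {:cont, {:ok, [encoding | acc]}}
        # Assumption: the first failure aborts the whole batch.
        {:error, _reason} = error -> {:halt, error}
      end
    end)
    |> case do
      {:ok, encodings} -> {:ok, Enum.reverse(encodings)}
      error -> error
    end
  end
end
```

Because every item goes through the same function as a one-shot call, a batch result can only differ from one-shot results if the single-input path itself differs.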

Streaming encode/decode

alias IREE.Tokenizers.{DecodeStream, EncodeStream}

{:ok, stream} = EncodeStream.new(tokenizer, add_special_tokens: false)
{:ok, ids1} = EncodeStream.feed(stream, "Hello ")
{:ok, ids2} = EncodeStream.feed(stream, "world")
{:ok, ids3} = EncodeStream.finalize(stream)
ids = ids1 ++ ids2 ++ ids3

{:ok, decode_stream} = DecodeStream.new(tokenizer, skip_special_tokens: false)
{:ok, text1} = DecodeStream.feed(decode_stream, Enum.take(ids, 2))
{:ok, text2} = DecodeStream.feed(decode_stream, Enum.drop(ids, 2))
{:ok, text3} = DecodeStream.finalize(decode_stream)
text = text1 <> text2 <> text3

For tokenizer families where the native streaming runtime can diverge at chunk boundaries, the wrapper uses buffered-finalize strategies so the final stream output still matches one-shot encode.
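
One way to picture a buffered-finalize strategy is a stream that only accumulates chunks on feed and runs a one-shot encode on finalize. The sketch below is an assumption about the general technique, not the wrapper's actual code; `BufferedEncodeSketch` and `encode_one` are hypothetical, and state is passed explicitly here instead of living in a stream handle.

```elixir
defmodule BufferedEncodeSketch do
  # Hypothetical illustration of a buffered-finalize strategy; the real
  # EncodeStream wrapper may differ. `encode_one` is a one-shot encode
  # function returning {:ok, ids}.
  defstruct chunks: [], encode_one: nil

  def new(encode_one), do: %__MODULE__{encode_one: encode_one}

  # Feeding only buffers the chunk, so no IDs are emitted yet.
  def feed(%__MODULE__{chunks: chunks} = stream, chunk) when is_binary(chunk) do
    {:ok, [], %{stream | chunks: [chunk | chunks]}}
  end

  # Finalizing encodes the whole buffered text in one shot, so chunk
  # boundaries cannot influence the result.
  def finalize(%__MODULE__{chunks: chunks, encode_one: encode_one}) do
    chunks |> Enum.reverse() |> IO.iodata_to_binary() |> encode_one.()
  end
end
```

The trade-off is that output arrives only at finalize, which gives up incremental results in exchange for output identical to one-shot encode.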

Encode transformations

alias IREE.Tokenizers.Encoding.Transformation

{:ok, encoding} =
  Tokenizer.encode(tokenizer, "hello",
    add_special_tokens: false,
    encoding_transformations: [
      Transformation.truncate(128),
      Transformation.pad(128, pad_id: 0, pad_token: "[PAD]")
    ]
  )

When a Hugging Face tokenizer.json carries fixed padding or truncation config, that default config is applied automatically. Explicit transformations are then applied after those defaults.
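
The ordering can be illustrated on bare ID lists. The module below is a hypothetical stand-in: the package's `Transformation` builders operate on full encodings, and right-side padding is an assumption made only for this sketch.

```elixir
defmodule TransformSketch do
  # Hypothetical stand-ins that operate on bare ID lists.
  def truncate(ids, max_len), do: Enum.take(ids, max_len)

  # Assumption for this sketch: padding is appended on the right.
  def pad(ids, target_len, pad_id) do
    missing = max(target_len - length(ids), 0)
    ids ++ List.duplicate(pad_id, missing)
  end

  # Apply the defaults first, then the explicit transformations, in list order.
  def apply_all(ids, defaults, explicit) do
    Enum.reduce(defaults ++ explicit, ids, fn fun, acc -> fun.(acc) end)
  end
end

defaults = [&TransformSketch.truncate(&1, 4)]
explicit = [&TransformSketch.pad(&1, 6, 0)]

TransformSketch.apply_all([1, 2, 3, 4, 5], defaults, explicit)
#=> [1, 2, 3, 4, 0, 0]
```

Here the default truncation to 4 runs before the explicit pad to 6, so the explicit step sees the already-truncated IDs.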

API map

| Module | Purpose |
| --- | --- |
| `IREE.Tokenizers.Tokenizer` | Main load/encode/decode/vocab API. |
| `IREE.Tokenizers.Encoding` | Struct and helpers for token IDs, masks, offsets, tokens, padding, and truncation. |
| `IREE.Tokenizers.Encoding.Transformation` | Builders for post-encode transformations. |
| `IREE.Tokenizers.EncodeStream` | Incremental encode state. |
| `IREE.Tokenizers.DecodeStream` | Incremental decode state. |
| `IREE.Tokenizers.Model` and model modules | Build simple BPE, WordPiece, or Unigram specs from Elixir data. |

Supported scope

Supported now:

- inference-time encode/decode
- Hugging Face tokenizer.json
- OpenAI .tiktoken
- SentencePiece .model
- BPE, WordPiece, and Unigram tokenizers
- single input encode/decode
- list input batch encode/decode
- streaming encode/decode
- token offsets, type IDs, attention masks, special-token masks, token strings
- special token ID lookup helpers
- tokenizer vocabulary lookup helpers

Deferred or intentionally out of scope for v1:

- pair-sequence encode input such as {left, right}
- tokenizer training APIs
- full tokenizer mutation APIs
- full surface-area parity with every elixir-nx/tokenizers option
- word ID tracking and overflowing-window output

Unsupported pair input returns:

{:error, {:invalid_argument, "pair sequence inputs are not supported in v1"}}

How it is implemented

The implementation has four layers:

- Elixir public API
  - lib/iree/tokenizers/tokenizer.ex owns loading, options, Hugging Face downloads/caching, batch behavior, tokenizer JSON defaults, and public result shaping.
  - lib/iree/tokenizers/encoding.ex mirrors the practical Encoding helper surface: IDs, masks, offsets, tokens, pad/truncate/transform.
  - lib/iree/tokenizers/encode_stream.ex and decode_stream.ex provide BEAM stream state wrappers.
- Rust NIF bridge
  - lib/iree/tokenizers/native.ex uses RustlerPrecompiled in releases and source builds in development/test.
  - native/iree_tokenizers_native/src/tokenizer.rs maps Rust resources and NIF structs to the Elixir API.
  - Dirty CPU NIFs are used for encode/decode paths that can do significant native work.
- Vendored IREE tokenizer runtime
  - The native crate builds a curated C source bundle under native/iree_tokenizers_native/vendor/iree_tokenizer_src.
  - The pinned upstream commit is recorded in native/iree_tokenizers_native/vendor/IREE_COMMIT.
  - scripts/update_iree_bundle.sh refreshes the vendored source bundle from a matching upstream IREE checkout.
- Parity-preserving compatibility layer
  - SentencePiece .model buffers are converted to tokenizer JSON in Rust before construction.
  - Some tokenizer families use special decode or buffered stream strategies to match the Hugging Face reference output.
  - Encode buffers grow with bounded retry logic so native output-capacity issues return clear errors instead of silently truncating or exhausting the BEAM.
  - encode_batch/3 delegates through one-shot encode/3 for each input to preserve correctness across known native batch-runtime edge cases.
  - Hugging Face tokenizer.json padding/truncation defaults are parsed and applied in the Elixir layer.
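
The bounded retry idea from the compatibility layer can be sketched in a few lines. Everything here is an assumption for illustration: the module name, the doubling growth factor, and the error shapes are hypothetical and do not describe the package's internals.

```elixir
defmodule BoundedRetrySketch do
  # Hypothetical illustration of bounded buffer growth.
  # `try_encode` receives an output capacity and returns {:ok, ids} or
  # {:error, :buffer_too_small}.
  def encode_with_growth(try_encode, capacity, max_capacity)
      when capacity > 0 and max_capacity >= capacity do
    case try_encode.(capacity) do
      {:ok, ids} ->
        {:ok, ids}

      {:error, :buffer_too_small} when capacity < max_capacity ->
        # Double the capacity, but never beyond the hard upper bound.
        encode_with_growth(try_encode, min(capacity * 2, max_capacity), max_capacity)

      {:error, :buffer_too_small} ->
        # The bound is reached: fail loudly instead of truncating output.
        {:error, {:output_capacity_exceeded, max_capacity}}

      {:error, _reason} = error ->
        error
    end
  end
end
```

The hard upper bound is what turns an undersized buffer into an explicit error: retries stop after a logarithmic number of attempts and memory use stays capped.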

Repository usage

Install dependencies and run the normal local checks from the repository root:

mix deps.get
mix test
cargo test --manifest-path native/iree_tokenizers_native/Cargo.toml

Format Elixir and Rust code:

mix format
cargo fmt --manifest-path native/iree_tokenizers_native/Cargo.toml

Run optional pretrained integration suites:

RUN_PRETRAINED_BATCH_INTEGRATION=1 mix test test/iree_tokenizers/batch_integration_test.exs
RUN_PRETRAINED_STREAM_INTEGRATION=1 mix test test/iree_tokenizers/stream_integration_test.exs
RUN_SENTENCEPIECE_INTEGRATION=1 mix test test/iree_tokenizers/sentencepiece_integration_test.exs

Run the full selected parity matrix:

cd bench
mix deps.get
mix run validate_parity.exs

Limit the parity matrix while iterating:

cd bench
MODEL_FILTER="Qwen/Qwen2.5-7B-Instruct" mix run validate_parity.exs

The parity report is written to bench/results/parity_report.md.

Benchmark harness

Set up once:

cd bench
mix deps.get

Run the generic fixture comparison:

mix run compare.exs

Generate the SentencePiece .model comparison charts:

mix run sentencepiece_compare.exs

Generate the curated model latency/speedup matrix:

mix run model_matrix_graphs.exs

Limit a model-matrix run while iterating:

MODEL_FILTER="Qwen/Qwen3.5-9B" mix run model_matrix_graphs.exs

All benchmark outputs are written to bench/results/. If a benchmark target requires authentication, set HF_TOKEN before running the script.

Vendored IREE bundle

The native crate builds against the vendored source bundle under native/iree_tokenizers_native/vendor/iree_tokenizer_src.

The pinned IREE commit is recorded in:

native/iree_tokenizers_native/vendor/IREE_COMMIT

To refresh the bundle from a matching upstream checkout:

scripts/update_iree_bundle.sh /path/to/iree

After any vendor refresh, run Rust tests, Elixir tests, and the pretrained parity suites. Vendor updates can overwrite local C patches that are required for parity.

License

This package is distributed under the Apache-2.0 license. The vendored IREE runtime carries its own license file at native/iree_tokenizers_native/vendor/iree_tokenizer_src/IREE-LICENSE.