IREE.Tokenizers
IREE.Tokenizers is an inference-only Elixir tokenizer package backed by the
IREE tokenizer runtime. It lets
Elixir applications load common LLM tokenizer assets and run fast local
encode/decode without a Python service. I first discovered IREE's tokenizer work
through the ZML.ai blog, and deeply
admire the company and the engineering behind it.
In one sentence: this package turns Hugging Face tokenizer.json, OpenAI
.tiktoken, and SentencePiece .model files into BEAM-friendly tokenizer
handles with one-shot, batch, streaming, offset, mask, and vocab helper APIs.
What this package does
- Loads tokenizer assets from local files, in-memory buffers, or the Hugging Face Hub.
- Supports Hugging Face tokenizer.json, OpenAI .tiktoken, and SentencePiece .model formats.
- Supports BPE, WordPiece, and Unigram model families.
- Encodes and decodes single inputs, lists of inputs, and streams of chunks.
- Returns token IDs, token strings, type IDs, attention masks, special-token masks, and optional byte offsets.
- Applies tokenizer-level tokenizer.json padding/truncation defaults where the reference tokenizers package applies them.
- Uses a native Rust/C runtime through Rustler, with precompiled NIFs for common release targets and local source builds in development/test.
Why use it
Use this package when an Elixir system needs tokenizer performance and LLM-style runtime ergonomics without leaving the BEAM:
- serving or batching LLM prompts in Phoenix, Livebook, Broadway, Oban, Nx, or custom inference services
- counting or packing tokens before model calls
- streaming tokenization for large prompts or ingestion pipelines
- using OpenAI/tiktoken-compatible encodings from Elixir
- loading SentencePiece .model files directly when a model repository does not expose the exact tokenizer.json path you want
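As a rough illustration of the token counting and packing use case, the sketch below greedily groups prompts into batches that stay within a token budget. The module name TokenPacker and the count_fun callback are hypothetical and not part of this package; the callback stands in for whatever counts tokens in your system.

```elixir
defmodule TokenPacker do
  @moduledoc """
  Hypothetical helper (not part of IREE.Tokenizers): greedily packs prompts
  into batches whose total token count stays within a budget.
  """

  # count_fun is any 1-arity function returning the token count of a prompt.
  def pack(prompts, budget, count_fun) when is_integer(budget) and budget > 0 do
    {batches, current, _used} =
      Enum.reduce(prompts, {[], [], 0}, fn prompt, {batches, current, used} ->
        n = count_fun.(prompt)

        cond do
          # The prompt fits in the batch currently being filled.
          used + n <= budget ->
            {batches, [prompt | current], used + n}

          # The batch is full: close it and start a new one with this prompt.
          current != [] ->
            {[Enum.reverse(current) | batches], [prompt], n}

          # Empty batch and an oversized prompt: give it a batch of its own.
          true ->
            {[[prompt] | batches], [], 0}
        end
      end)

    # Flush the last partially filled batch, then restore input order.
    batches = if current == [], do: batches, else: [Enum.reverse(current) | batches]
    Enum.reverse(batches)
  end
end
```

With this package, count_fun could wrap Tokenizer.encode/3 and return length(encoding.ids). A prompt larger than the budget is kept in its own batch rather than dropped, which is a design choice of this sketch, not package behavior.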
Current results
The checked-in benchmark and parity files are generated by scripts in bench/.
The README only summarizes results that have corresponding artifacts in
bench/results/.
Correctness/parity
bench/validate_parity.exs compares IREE.Tokenizers with
elixir-nx/tokenizers, the Rust-backed
Hugging Face tokenizers reference package. The current selected matrix is
green for 7 public tokenizer families, 19 representative inputs per family, and
both add_special_tokens: true and false modes. It also checks batch encode
and stream encode parity.
See the full report: bench/results/parity_report.md.
Currently green selected matrix:
| Model / load path | Coverage in the report |
|---|---|
| Qwen/Qwen2.5-7B-Instruct | 19/19 cases, both special-token modes; batch OK; stream OK |
| google-bert/bert-base-uncased | 19/19 cases, both special-token modes; batch OK; stream OK |
| openai-community/gpt2 | 19/19 cases, both special-token modes; batch OK; stream OK |
| microsoft/Phi-3-mini-4k-instruct | 19/19 cases, both special-token modes; batch OK; stream OK |
| google-t5/t5-small from tokenizer.json | 19/19 cases, both special-token modes; batch OK; stream OK |
| google-t5/t5-small from SentencePiece .model | 19/19 cases, both special-token modes; batch OK; stream OK |
| sentence-transformers/all-MiniLM-L6-v2 | 19/19 cases, both special-token modes; batch OK; stream OK |
The benchmark-matrix rows currently published in
bench/results/model_matrix.md were also
re-checked on this branch for representative one-shot, batch, and stream parity:
- LiquidAI/LFM2.5-1.2B-Instruct
- Qwen/Qwen3.5-9B
- zai-org/GLM-5.1
- mistralai/Ministral-3-3B-Reasoning-2512
- google/gemma-4-31B-it
Historical upstream/runtime gaps and local fixes are documented in
docs/UPSTREAM_BUGS.md. Do not treat that file as the
live status by itself; the latest parity report is the authoritative current
result.
Performance
Benchmark numbers depend on machine, OTP/Elixir versions, CPU, and cache state. The checked-in numbers show the current shape:
| Benchmark artifact | Summary |
|---|---|
| bench/results/model_matrix.md | Curated real-model prompt workload: IREE one-shot is 1.6x-5.6x faster than tokenizers; IREE stream is 5.4x-14.0x faster on the published rows. |
| bench/results/tokenizers_compare.md | Local BPE fixture: medium/long encode is about 1.3x faster; medium/long decode is about 10x faster. |
| bench/results/sentencepiece_compare.md | Direct .model loading: T5-small encode is 1.97x faster; LLaMA tokenizer encode is 1.18x faster; LLaMA decode is 1.81x faster. |
The model-matrix run reports latency only for rows where the benchmark corpus produces equivalent outputs across both libraries, and reports stream numbers only when streamed output matches IREE one-shot output on that corpus.
[Figure: latency chart]
[Figure: speedup chart]
Installation
Add the package to your Mix dependencies:
def deps do
  [
    {:iree_tokenizers, "~> 0.7.0"}
  ]
end
Then run:
mix deps.get
The package uses rustler_precompiled for release builds. The current prebuilt
NIF target list is:
- aarch64-apple-darwin
- x86_64-apple-darwin
- x86_64-unknown-linux-gnu
In :dev and :test, the project forces a local Rust source build. You can
also force a local build with:
IREE_TOKENIZERS_BUILD=1 mix compile
Quick start
Load from the Hugging Face Hub
alias IREE.Tokenizers.Tokenizer
{:ok, tokenizer} = Tokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
{:ok, encoding} =
  Tokenizer.encode(tokenizer, "Hello from Elixir", add_special_tokens: false)
encoding.ids
#=> token ids
{:ok, text} = Tokenizer.decode(tokenizer, encoding.ids, skip_special_tokens: false)
#=> "Hello from Elixir"
For gated or private Hugging Face repositories, pass a token:
{:ok, tokenizer} =
  Tokenizer.from_pretrained("some/private-model",
    token: System.fetch_env!("HF_TOKEN")
  )
from_pretrained/2 caches downloaded tokenizer assets by ETag in a per-user
cache directory by default. You can pass cache_dir:, revision:, subfolder:,
filename:, use_cache: false, or a custom http_client:.
Load a local tokenizer.json
{:ok, tokenizer} = Tokenizer.from_file("tokenizer.json")
{:ok, encoding} =
  Tokenizer.encode(tokenizer, "Hello world",
    add_special_tokens: true,
    track_offsets: true
  )
encoding.ids
encoding.tokens
encoding.offsets
encoding.attention_mask
encoding.special_tokens_mask
Load OpenAI .tiktoken encodings
{:ok, tokenizer} =
  Tokenizer.from_pretrained("gpt-4o", format: :tiktoken)
{:ok, cl100k} =
  Tokenizer.from_pretrained("openai/cl100k_base", format: :tiktoken)
Tokenizer.supported_tiktoken_encodings()
#=> ["cl100k_base", "o200k_base", "o200k_harmony", "r50k_base", "gpt2", "p50k_base", "p50k_edit"]
For local .tiktoken files, pass format: :tiktoken when inference from the
filename is not enough:
{:ok, tokenizer} =
  Tokenizer.from_file("gpt2.tiktoken", format: :tiktoken)
{:ok, tokenizer} =
  Tokenizer.from_buffer(buffer,
    format: :tiktoken,
    tiktoken_encoding: "cl100k_base"
  )
Load SentencePiece .model files
Local files ending in .model are inferred automatically:
{:ok, tokenizer} = Tokenizer.from_file("spiece.model")
From Hugging Face, request the SentencePiece path explicitly:
{:ok, tokenizer} =
  Tokenizer.from_pretrained("google-t5/t5-small",
    format: :sentencepiece_model
  )
Batch encode/decode
{:ok, encodings} =
  Tokenizer.encode_batch(tokenizer, ["short prompt", "another prompt"],
    add_special_tokens: false
  )
ids_batch = Enum.map(encodings, & &1.ids)
{:ok, texts} = Tokenizer.decode_batch(tokenizer, ids_batch, skip_special_tokens: false)
encode_batch/3 is intentionally parity-first: it routes through the same
single-input encode/3 path for each item so tokenizer defaults, local fixes,
and transformations are identical to one-shot encoding.
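The parity-first idea can be sketched in plain Elixir: run each input through a single-input encode function and collect the results. This is an illustration only; the module name BatchSketch and the encode_fun callback are hypothetical, and stopping at the first error is an assumption of the sketch rather than documented encode_batch/3 behavior.

```elixir
defmodule BatchSketch do
  @moduledoc """
  Illustrative only: mirrors the "encode each item through the one-shot path"
  idea. The real encode_batch/3 internals may differ.
  """

  # encode_fun is a 1-arity function returning {:ok, encoding} or {:error, reason}.
  def encode_each(inputs, encode_fun) do
    inputs
    |> Enum.reduce_while({:ok, []}, fn input, {:ok, acc} ->
      case encode_fun.(input) do
        {:ok, encoding} -> {:cont, {:ok, [encoding | acc]}}
        # Stop at the first failure and surface it unchanged.
        {:error, _reason} = error -> {:halt, error}
      end
    end)
    |> case do
      # Results were accumulated in reverse; restore input order.
      {:ok, encodings} -> {:ok, Enum.reverse(encodings)}
      error -> error
    end
  end
end
```

Because every item goes through the same function, anything that function does (defaults, fixes, transformations) applies identically to batch and one-shot results, at the cost of not using a dedicated native batch path.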
Streaming encode/decode
alias IREE.Tokenizers.{DecodeStream, EncodeStream}
{:ok, stream} = EncodeStream.new(tokenizer, add_special_tokens: false)
{:ok, ids1} = EncodeStream.feed(stream, "Hello ")
{:ok, ids2} = EncodeStream.feed(stream, "world")
{:ok, ids3} = EncodeStream.finalize(stream)
ids = ids1 ++ ids2 ++ ids3
{:ok, decode_stream} = DecodeStream.new(tokenizer, skip_special_tokens: false)
{:ok, text1} = DecodeStream.feed(decode_stream, Enum.take(ids, 2))
{:ok, text2} = DecodeStream.feed(decode_stream, Enum.drop(ids, 2))
{:ok, text3} = DecodeStream.finalize(decode_stream)
text = text1 <> text2 <> text3
For tokenizer families where the native streaming runtime can diverge at chunk boundaries, the wrapper uses buffered-finalize strategies so the final stream output still matches one-shot encode.
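When chunks arrive as an enumerable, the feed/finalize pattern above can be wrapped in a small helper. The sketch below is a hypothetical convenience (StreamSketch is not part of this package) and assumes only that the feed and finalize callbacks return {:ok, list} or {:error, reason}, as in the example above.

```elixir
defmodule StreamSketch do
  # Hypothetical helper: feeds every chunk through feed_fun, then calls
  # finalize_fun once, concatenating all partial outputs in order.
  def collect(chunks, feed_fun, finalize_fun) do
    result =
      Enum.reduce_while(chunks, {:ok, []}, fn chunk, {:ok, parts} ->
        case feed_fun.(chunk) do
          {:ok, part} -> {:cont, {:ok, [part | parts]}}
          # Abort on the first failed chunk.
          {:error, _reason} = error -> {:halt, error}
        end
      end)

    # Finalize only if every chunk was accepted.
    with {:ok, parts} <- result,
         {:ok, tail} <- finalize_fun.() do
      {:ok, [tail | parts] |> Enum.reverse() |> Enum.concat()}
    end
  end
end
```

With an encode stream, the callbacks could be fn chunk -> EncodeStream.feed(stream, chunk) end and fn -> EncodeStream.finalize(stream) end.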
Encode transformations
alias IREE.Tokenizers.Encoding.Transformation
{:ok, encoding} =
  Tokenizer.encode(tokenizer, "hello",
    add_special_tokens: false,
    encoding_transformations: [
      Transformation.truncate(128),
      Transformation.pad(128, pad_id: 0, pad_token: "[PAD]")
    ]
  )
When a Hugging Face tokenizer.json carries fixed padding or truncation config,
that default config is applied automatically. Explicit transformations are then
applied after those defaults.
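Because transformations run in order, their sequence matters. The sketch below uses plain lists of IDs to show the effect; TransformSketch is a hypothetical stand-in and does not reproduce the package's Transformation internals, and right-side truncation and padding are assumptions made for the illustration.

```elixir
defmodule TransformSketch do
  # Illustrative stand-ins for truncate/pad on a plain list of IDs.
  # Right-side truncation and padding are assumptions made for this sketch.
  def truncate(ids, max_length), do: Enum.take(ids, max_length)

  def pad(ids, target_length, pad_id) do
    # Never pad by a negative amount when the list is already long enough.
    missing = max(target_length - length(ids), 0)
    ids ++ List.duplicate(pad_id, missing)
  end

  # Applies a list of 1-arity transformations left to right.
  def apply_all(ids, transformations) do
    Enum.reduce(transformations, ids, fn transform, acc -> transform.(acc) end)
  end
end
```

Truncating to 2 and then padding to 4 turns [5, 6, 7] into [5, 6, 0, 0], while padding to 4 first and then truncating to 2 yields [5, 6]. The same reasoning applies when explicit transformations follow defaults taken from tokenizer.json.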
API map
| Module | Purpose |
|---|---|
| IREE.Tokenizers.Tokenizer | Main load/encode/decode/vocab API. |
| IREE.Tokenizers.Encoding | Struct and helpers for token IDs, masks, offsets, tokens, padding, and truncation. |
| IREE.Tokenizers.Encoding.Transformation | Builders for post-encode transformations. |
| IREE.Tokenizers.EncodeStream | Incremental encode state. |
| IREE.Tokenizers.DecodeStream | Incremental decode state. |
| IREE.Tokenizers.Model and model modules | Build simple BPE, WordPiece, or Unigram specs from Elixir data. |
Supported scope
Supported now:
- inference-time encode/decode
- Hugging Face tokenizer.json
- OpenAI .tiktoken
- SentencePiece .model
- BPE, WordPiece, and Unigram tokenizers
- single input encode/decode
- list input batch encode/decode
- streaming encode/decode
- token offsets, type IDs, attention masks, special-token masks, token strings
- special token ID lookup helpers
- tokenizer vocabulary lookup helpers
Deferred or intentionally out of scope for v1:
- pair-sequence encode input such as {left, right}
- tokenizer training APIs
- full tokenizer mutation APIs
- full surface-area parity with every elixir-nx/tokenizers option
- word ID tracking and overflowing-window output
Unsupported pair input returns:
{:error, {:invalid_argument, "pair sequence inputs are not supported in v1"}}
How it is implemented
The implementation has four layers:
Elixir public API
- lib/iree/tokenizers/tokenizer.ex owns loading, options, Hugging Face downloads/caching, batch behavior, tokenizer JSON defaults, and public result shaping.
- lib/iree/tokenizers/encoding.ex mirrors the practical Encoding helper surface: IDs, masks, offsets, tokens, pad/truncate/transform.
- lib/iree/tokenizers/encode_stream.ex and decode_stream.ex provide BEAM stream state wrappers.
Rust NIF bridge
- lib/iree/tokenizers/native.ex uses RustlerPrecompiled in releases and source builds in development/test.
- native/iree_tokenizers_native/src/tokenizer.rs maps Rust resources and NIF structs to the Elixir API.
- Dirty CPU NIFs are used for encode/decode paths that can do significant native work.
Vendored IREE tokenizer runtime
- The native crate builds a curated C source bundle under native/iree_tokenizers_native/vendor/iree_tokenizer_src.
- The pinned upstream commit is recorded in native/iree_tokenizers_native/vendor/IREE_COMMIT.
- scripts/update_iree_bundle.sh refreshes the vendored source bundle from a matching upstream IREE checkout.
Parity-preserving compatibility layer
- SentencePiece .model buffers are converted to tokenizer JSON in Rust before construction.
- Some tokenizer families use special decode or buffered stream strategies to match the Hugging Face reference output.
- Encode buffers grow with bounded retry logic so native output-capacity issues return clear errors instead of silently truncating or exhausting the BEAM.
- encode_batch/3 delegates through one-shot encode/3 for each input to preserve correctness across known native batch-runtime edge cases.
- Hugging Face tokenizer.json padding/truncation defaults are parsed and applied in the Elixir layer.
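The bounded-retry idea for growing encode buffers can be illustrated in a few lines. This is a conceptual sketch only, not the package's code: GrowRetrySketch, the {:too_small, needed} error shape, the doubling policy, and the default of four attempts are all assumptions chosen for the example.

```elixir
defmodule GrowRetrySketch do
  # Illustrative only; the package's actual retry logic is not shown here.
  # try_fun receives a capacity and returns {:ok, result} or
  # {:error, {:too_small, needed}}; all names here are hypothetical.
  def run(try_fun, initial_capacity, max_attempts \\ 4) do
    attempt(try_fun, initial_capacity, max_attempts)
  end

  # Out of attempts: report the capacity that would have been tried next
  # instead of growing without bound.
  defp attempt(_try_fun, capacity, 0) do
    {:error, {:capacity_exhausted, capacity}}
  end

  defp attempt(try_fun, capacity, attempts_left) do
    case try_fun.(capacity) do
      {:ok, result} ->
        {:ok, result}

      {:error, {:too_small, needed}} ->
        # Grow to at least what the callee asked for, and at least double.
        attempt(try_fun, max(needed, capacity * 2), attempts_left - 1)

      {:error, _other} = error ->
        error
    end
  end
end
```

Capping the number of attempts is what turns a capacity problem into an explicit error tuple rather than a truncated result or unbounded allocation.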
Repository usage
Install dependencies and run the normal local checks from the repository root:
mix deps.get
mix test
cargo test --manifest-path native/iree_tokenizers_native/Cargo.toml
Format Elixir and Rust code:
mix format
cargo fmt --manifest-path native/iree_tokenizers_native/Cargo.toml
Run optional pretrained integration suites:
RUN_PRETRAINED_BATCH_INTEGRATION=1 mix test test/iree_tokenizers/batch_integration_test.exs
RUN_PRETRAINED_STREAM_INTEGRATION=1 mix test test/iree_tokenizers/stream_integration_test.exs
RUN_SENTENCEPIECE_INTEGRATION=1 mix test test/iree_tokenizers/sentencepiece_integration_test.exs
Run the full selected parity matrix:
cd bench
mix deps.get
mix run validate_parity.exs
Limit the parity matrix while iterating:
cd bench
MODEL_FILTER="Qwen/Qwen2.5-7B-Instruct" mix run validate_parity.exs
The parity report is written to bench/results/parity_report.md.
Benchmark harness
Set up once:
cd bench
mix deps.get
Run the generic fixture comparison:
mix run compare.exs
Generate the SentencePiece .model comparison charts:
mix run sentencepiece_compare.exs
Generate the curated model latency/speedup matrix:
mix run model_matrix_graphs.exs
Limit a model-matrix run while iterating:
MODEL_FILTER="Qwen/Qwen3.5-9B" mix run model_matrix_graphs.exs
All benchmark outputs are written to bench/results/. If a benchmark target
requires authentication, set HF_TOKEN before running the script.
Vendored IREE bundle
The native crate builds against the vendored source bundle under
native/iree_tokenizers_native/vendor/iree_tokenizer_src.
The pinned IREE commit is recorded in:
native/iree_tokenizers_native/vendor/IREE_COMMIT
To refresh the bundle from a matching upstream checkout:
scripts/update_iree_bundle.sh /path/to/iree
After any vendor refresh, run Rust tests, Elixir tests, and the pretrained parity suites. Vendor updates can overwrite local C patches that are required for parity.
License
This package is distributed under the Apache-2.0 license. The vendored IREE
runtime carries its own license file under
native/iree_tokenizers_native/vendor/iree_tokenizer_src/IREE-LICENSE.