whisper_ct2
whisper_ct2 is an Elixir library for running OpenAI Whisper speech-to-text
models inside the BEAM. It loads CTranslate2-converted Whisper models through a
Rustler NIF, so Elixir code can transcribe f32 PCM buffers without starting
Python or a separate inference service.
CTranslate2 is the speed-optimised C++ inference engine that powers
faster-whisper. It runs 4-8x faster than vanilla openai-whisper on the same
hardware and supports int8 / int8-float16 quantisation and CUDA / oneDNN /
MKL / Accelerate backends.
Installation
def deps do
  [{:whisper_ct2, "~> 0.5"}]
end
Installation downloads a precompiled NIF artefact matching your target triple from the project's GitHub releases. No Rust toolchain or CMake is needed on the consumer side.
Source builds
Set WHISPER_CT2_BUILD=1 in your environment (or
config :rustler_precompiled, :force_build, whisper_ct2: true in your parent
project) to compile from source instead. The first source build of CTranslate2
takes ~10 minutes and requires:
- Rust toolchain (rustup, stable)
- cmake, a C++17 compiler, make
- Linux: libstdc++, libgomp available at link time
- CUDA toolkit 12+ if building with cuda or cuda-dynamic features
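As a config fragment, the force-build setting mentioned above would sit in the parent project's configuration roughly like this. The file path is an assumption; any config file that is evaluated at compile time should work.

```elixir
# config/config.exs of the parent project (path assumed)
import Config

# Build the whisper_ct2 NIF from source instead of downloading the
# precompiled artefact; the text above describes this as the alternative
# to setting WHISPER_CT2_BUILD=1 in the environment.
config :rustler_precompiled, :force_build, whisper_ct2: true
```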
Models
Point WhisperCt2.load_model/2 at a directory containing a CTranslate2-converted
Whisper model. Required files:
model.bin
config.json
tokenizer.json
vocabulary.txt
preprocessor_config.json
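A quick pre-flight check can catch a missing file before load_model/2 is called. The sketch below is not part of whisper_ct2; ModelDirCheck is a hypothetical helper that only tests for the five file names listed above.

```elixir
defmodule ModelDirCheck do
  # File names copied from the "Required files" list.
  @required ~w(model.bin config.json tokenizer.json vocabulary.txt preprocessor_config.json)

  @doc "Returns :ok, or {:error, missing} with the required files not found in dir."
  def check(dir) do
    case Enum.reject(@required, &File.exists?(Path.join(dir, &1))) do
      [] -> :ok
      missing -> {:error, missing}
    end
  end
end
```

Calling ModelDirCheck.check("models/faster-whisper-tiny.en") right after downloading only the Systran repository should report preprocessor_config.json as missing.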
The Systran/faster-whisper-* repositories
ship the first four directly. They do not include
preprocessor_config.json; copy the canonical one from openai/whisper-tiny.en
(or any other openai/whisper-* repo - all Whisper sizes share the same file):
uvx hf download Systran/faster-whisper-tiny.en \
  --local-dir models/faster-whisper-tiny.en
uvx hf download openai/whisper-tiny.en preprocessor_config.json \
  --local-dir models/faster-whisper-tiny.en
Backends
The published Hex package ships four precompiled artefacts; install picks the right one automatically based on your target triple:
| Target triple | CPU backend | GPU | Notes |
|---|---|---|---|
| aarch64-apple-darwin | Accelerate | none | Apple Silicon (M1+). Uses Accelerate / AMX paths. |
| x86_64-unknown-linux-gnu | oneDNN | cuda-dynamic | Default x86_64 binary; runs well on Intel and AMD. |
| x86_64-unknown-linux-gnu (mkl) | Intel MKL | cuda-dynamic | Intel-tuned variant. Opt in via env var (below). |
| aarch64-unknown-linux-gnu | oneDNN | cuda-dynamic | Graviton/Grace, optional CUDA on GH200-class hosts. |
cuda-dynamic defers loading libcudart until first GPU use, so each artefact
still runs on hosts without CUDA installed. :device selection picks CUDA when
available, otherwise CPU.
x86_64 macOS and Windows are not shipped.
Selecting the MKL variant
For Intel-only fleets where you want maximum SGEMM throughput:
WHISPER_CT2_VARIANT=mkl mix deps.compile whisper_ct2
rustler_precompiled reads this env var at install time and selects the --mkl
artefact instead of the default.
Build from source with a custom backend
For source builds you can pick any combination of ct2rs features:
WHISPER_CT2_BUILD=1 WHISPER_CT2_FEATURES="dnnl cuda-dynamic" mix compile
# other options: mkl, openblas, accelerate, cuda, cuda-dynamic
Runtime device selection
WhisperCt2.available_devices()
#=> {:ok, %{cpu: 1, cuda: 1, cuda_supported: true}}
{:ok, model} =
  WhisperCt2.load_model("models/faster-whisper-tiny.en",
    device: :auto,        # :cpu | :cuda | :auto (default)
    compute_type: :auto,  # :default | :auto | :float16 | :int8_float16 | ...
    device_indices: [0]
  )
:auto picks CUDA when the artefact supports it and at least one CUDA device
is visible; otherwise CPU. Explicit :cuda returns
{:error, %WhisperCt2.Error{reason: :invalid_request}} if either condition
fails.
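The selection rule can be written out as a small pure function. This is a sketch of the rule as described here, not the library's code: DeviceRule is a hypothetical module, it takes the map found inside the {:ok, map} result of available_devices/0, and it returns a bare :invalid_request atom where the library returns a %WhisperCt2.Error{} struct.

```elixir
defmodule DeviceRule do
  # `devices` is assumed to have the shape shown above,
  # e.g. %{cpu: 1, cuda: 1, cuda_supported: true}.
  def resolve(:cpu, _devices), do: {:ok, :cpu}

  # :auto falls back to CPU when CUDA is unsupported or no device is visible.
  def resolve(:auto, devices) do
    if cuda_usable?(devices), do: {:ok, :cuda}, else: {:ok, :cpu}
  end

  # Explicit :cuda fails instead of falling back.
  def resolve(:cuda, devices) do
    if cuda_usable?(devices), do: {:ok, :cuda}, else: {:error, :invalid_request}
  end

  # Both conditions must hold: artefact support and at least one device.
  defp cuda_usable?(%{cuda_supported: true, cuda: n}) when is_integer(n) and n > 0, do: true
  defp cuda_usable?(_), do: false
end
```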
Usage
{:ok, model} = WhisperCt2.load_model("models/faster-whisper-tiny.en")
# Decode/resample to 16 kHz mono f32 PCM upstream (ffmpeg, Membrane,
# anything that produces little-endian f32 bytes).
pcm = File.read!("jfk.pcm")
{:ok, %WhisperCt2.Transcription{text: text, segments: segs}} =
WhisperCt2.transcribe(model, {:pcm_f32, pcm}, language: "en")
IO.puts(text)
# => "And so, my fellow Americans ask not what your country can do for you ..."
for s <- segs do
  IO.puts("[#{s.start}-#{s.end}] (no_speech=#{Float.round(s.no_speech_prob, 3)}) #{s.text}")
end
%WhisperCt2.Segment{} carries absolute :start / :end seconds,
:no_speech_prob, :avg_logprob, the underlying text token IDs, and
(when :word_timestamps is on) a list of %WhisperCt2.Word{} with
per-word timing.
Audio contract
CTranslate2 expects mono f32 PCM samples at the model's sample rate
(16 kHz for every published Whisper checkpoint), normalized to the
-1.0..1.0 range. transcribe/3 and transcribe_batch/3 accept exactly
one shape:
- {:pcm_f32, binary}: little-endian f32 samples at the model's sample rate.
Anything else (paths, raw bare binaries, WAV bytes, MP3, 44.1 kHz, ...)
is rejected at the boundary with an :invalid_request error. There is
no bundled audio decoder; decode, downmix, and resample upstream using
your tool of choice. For a one-shot file conversion:
ffmpeg -i input.mp3 -ar 16000 -ac 1 -f f32le output.pcm
Audio longer than 30 s is chunked into Whisper windows automatically; the encoder runs once across every chunk in the batch.
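When the upstream source already delivers 16 kHz mono audio as signed 16-bit samples, the conversion to the expected f32 layout can be done in plain Elixir. The sketch below is not part of whisper_ct2: PcmContract and its function names are hypothetical, and the divide-by-32768 scaling is one common normalization convention.

```elixir
defmodule PcmContract do
  @sample_rate 16_000
  @bytes_per_sample 4

  @doc "Converts signed 16-bit little-endian PCM to normalized little-endian f32 PCM."
  def s16le_to_f32le(s16) when is_binary(s16) and rem(byte_size(s16), 2) == 0 do
    for <<sample::little-signed-integer-size(16) <- s16>>, into: <<>> do
      # Dividing by 32768 maps the s16 range onto -1.0..1.0 (upper bound exclusive).
      <<(sample / 32_768)::little-float-size(32)>>
    end
  end

  @doc "True when the binary is whole f32 samples, all finite and within -1.0..1.0."
  def valid?(pcm) when is_binary(pcm) do
    rem(byte_size(pcm), @bytes_per_sample) == 0 and in_range?(pcm)
  end

  @doc "Duration in seconds, assuming 16 kHz mono f32 input."
  def duration_seconds(pcm) when is_binary(pcm) and rem(byte_size(pcm), 4) == 0 do
    div(byte_size(pcm), @bytes_per_sample) / @sample_rate
  end

  defp in_range?(<<>>), do: true

  defp in_range?(<<s::little-float-size(32), rest::binary>>) when s >= -1.0 and s <= 1.0,
    do: in_range?(rest)

  # Out-of-range values, or bit patterns that do not decode as a float (NaN, infinity).
  defp in_range?(_), do: false
end
```

The result of s16le_to_f32le/1 can then be wrapped as {:pcm_f32, binary} and passed to transcribe/3.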
Batched transcribe and word timestamps
# Diarization-driven workflow: one master decode upstream, many short
# splices fed in as PCM byte ranges.
samples = File.read!("call.pcm")
turns =
  [
    WhisperCt2.Pcm.slice(samples, 16_000, 0.0, 3.2),
    WhisperCt2.Pcm.slice(samples, 16_000, 3.2, 4.5)
    # ...
  ]
  |> Enum.map(fn {:ok, bin} -> {:pcm_f32, bin} end)
{:ok, transcriptions} =
  WhisperCt2.transcribe_batch(model, turns, language: "en", word_timestamps: true)
transcribe_batch/3 stacks every chunk of every input into one encoder
forward pass. :word_timestamps adds one batched DTW alignment pass and
attaches %Word{} entries to each segment.
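Slicing a turn out of a master PCM buffer comes down to byte-offset arithmetic: seconds times sample rate gives a sample index, and each f32 sample occupies four bytes. The sketch below illustrates that arithmetic only. SliceSketch is a hypothetical stand-in, not the implementation of WhisperCt2.Pcm.slice/4, and the clamping and :out_of_range behaviour are assumptions.

```elixir
defmodule SliceSketch do
  @bytes_per_sample 4

  @doc "Returns {:ok, binary} for the samples between start_s and end_s (seconds)."
  def slice(pcm, sample_rate, start_s, end_s)
      when is_binary(pcm) and start_s >= 0 and end_s > start_s do
    total = div(byte_size(pcm), @bytes_per_sample)
    # round/1 avoids off-by-one errors from float products such as 3.2 * 16_000.
    first = round(start_s * sample_rate)
    # Clamp the end to the buffer length (assumed behaviour).
    last = min(round(end_s * sample_rate), total)

    if first < last do
      {:ok, binary_part(pcm, first * @bytes_per_sample, (last - first) * @bytes_per_sample)}
    else
      {:error, :out_of_range}
    end
  end
end
```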
Decoding biases
WhisperCt2.transcribe(model, {:pcm_f32, talk_pcm},
  language: "en",
  initial_prompt: "Discussion of CTranslate2, BEAM, and Whisper internals.",
  prefix: "Welcome back to the show."
)
:initial_prompt prepends free-text context (via <|startofprev|>) so the
decoder is biased toward your domain vocabulary or speaker style;
:prefix forces the start of the generated transcript.
Options
transcribe/3 and transcribe_batch/3 accept any subset of:
| Option | Type | Notes |
|---|---|---|
| :language | String.t \| nil | ISO code ("en"). nil auto-detects on multilingual. |
| :initial_prompt | String.t \| nil | Free-text context prepended via <\|startofprev\|>. |
| :prefix | String.t \| nil | Forced text the generation must start with. |
| :word_timestamps | boolean | Attach per-word timing via a batched DTW alignment. |
| :with_timestamps | boolean | Emit <\|t_..\|> segment timestamps (default true). false for fine-tunes that emit plain text. |
| :beam_size | pos_integer | Beam-search width. |
| :patience | float | Beam-search patience. |
| :length_penalty | float | Decoding length penalty. |
| :repetition_penalty | float | Decoding repetition penalty. |
| :no_repeat_ngram_size | non_neg_integer | Disallow repeated n-grams of this size. |
| :sampling_temperature | float | Sampling temperature. |
| :sampling_topk | pos_integer | Top-k sampling. |
| :suppress_blank | boolean | Suppress the initial blank token. |
| :suppress_tokens | [integer] | Suppress these token IDs. |
| :max_length | pos_integer | Max tokens per chunk. |
| :num_hypotheses | pos_integer | Number of decoded hypotheses. |
| :max_initial_timestamp_index | non_neg_integer | Cap the first timestamp token. |
Unset values use the CTranslate2 defaults. no_speech_prob and
avg_logprob are always populated on each segment; there is no option to
turn them on or off.
Unknown option keys and out-of-range values return
{:error, %WhisperCt2.Error{reason: :invalid_request}} before reaching the
NIF.
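Callers that build option lists dynamically may want to reject bad keys even earlier, in their own code. The sketch below shows one way to do that with Keyword.validate/2 (Elixir 1.13+). OptionCheck is a hypothetical helper, not the library's validator; it only range-checks :beam_size as an example and returns a bare atom instead of a %WhisperCt2.Error{}.

```elixir
defmodule OptionCheck do
  # Option names copied from the table above.
  @known [
    :language, :initial_prompt, :prefix, :word_timestamps, :with_timestamps,
    :beam_size, :patience, :length_penalty, :repetition_penalty,
    :no_repeat_ngram_size, :sampling_temperature, :sampling_topk,
    :suppress_blank, :suppress_tokens, :max_length, :num_hypotheses,
    :max_initial_timestamp_index
  ]

  @doc "Returns :ok, or {:error, :invalid_request} for unknown keys or a bad :beam_size."
  def validate(opts) when is_list(opts) do
    with {:ok, _} <- Keyword.validate(opts, @known),
         :ok <- check_beam_size(opts) do
      :ok
    else
      {:error, _} -> {:error, :invalid_request}
    end
  end

  # :beam_size is documented as pos_integer; an absent key passes.
  defp check_beam_size(opts) do
    case Keyword.fetch(opts, :beam_size) do
      :error -> :ok
      {:ok, n} when is_integer(n) and n > 0 -> :ok
      {:ok, _} -> {:error, :beam_size}
    end
  end
end
```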
Errors
All failures return {:error, %WhisperCt2.Error{}}. reason is one of
:invalid_request, :load_error, :inference_error, :runtime_error,
:nif_panic, or :native_error. The struct also implements Exception, so
raise/1 works.
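Because every failure carries a reason atom, callers can branch on it with pattern matching. The sketch below is illustrative only: ErrorTriage and the follow-up action names are hypothetical, and the grouping of reasons is an assumption rather than guidance from the library. It matches on the :reason field alone, so it accepts %WhisperCt2.Error{} structs as well as plain maps.

```elixir
defmodule ErrorTriage do
  @native [:inference_error, :runtime_error, :nif_panic, :native_error]

  # Successful results pass through untouched.
  def next_step({:ok, _result}), do: :continue

  # Rejected at the boundary: bad options, audio shape, or device request.
  def next_step({:error, %{reason: :invalid_request}}), do: :fix_caller_input

  # Model loading failed: the model directory is a likely place to look.
  def next_step({:error, %{reason: :load_error}}), do: :check_model_directory

  # Everything else came from the native side (assumed grouping).
  def next_step({:error, %{reason: reason}}) when reason in @native,
    do: :report_native_failure
end
```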
Testing
Unit tests run with no external dependencies:
mix test
The end-to-end transcription test downloads the faster-whisper-tiny.en model
(~75 MB) and the jfk.wav clip from the whisper.cpp samples:
mix test --include integration
Both downloads are cached under test/fixtures/. Set WHISPER_CT2_REFRESH=1 to redownload.
License
MIT. CTranslate2 itself is MIT-licensed. The bundled ct2rs crate links
CTranslate2 statically by default.