VoskEx
Elixir bindings for the Vosk API - offline speech recognition toolkit.
VoskEx provides a high-performance interface to Vosk's speech recognition capabilities, allowing you to recognize speech from audio files or streams entirely offline, with no network connection required.
Features
- Offline speech recognition - No cloud APIs required
- High performance - Uses a NIF for direct C library integration
- Streaming support - Process audio in real time or from files
- Multi-language - Support for 20+ languages via Vosk models
- Detailed results - Get word-level timing and confidence scores
- Thread-safe - Uses dirty schedulers for non-blocking operation
Installation
VoskEx automatically downloads precompiled Vosk libraries during compilation, so no system dependencies are required!
Simply add vosk_ex to your list of dependencies in mix.exs:
```elixir
def deps do
  [
    {:vosk_ex, "~> 0.2.1"}
  ]
end
```

Then run:
```shell
mix deps.get
mix compile  # Automatically downloads the Vosk library (~2-7 MB) for your platform
```

Supported platforms:
- Linux: x86_64, aarch64 (ARM64)
- macOS: Intel (x86_64), Apple Silicon (M1/M2/M3)
- Windows: x64
The library automatically detects your platform and downloads the appropriate precompiled Vosk library on first compilation.
Windows Users - Additional Setup Required
On Windows, you need to add the Vosk DLL directory to PATH before starting your application. This is a Windows limitation for finding external DLL dependencies.
Why? Unlike bcrypt or other self-contained NIFs, VoskEx depends on external Vosk DLLs (26MB+ of speech recognition libraries). Windows needs to know where to find these at runtime.
Option 1 - Set PATH manually (PowerShell):
```powershell
# In PowerShell, before running your app
$env:PATH = "_build\dev\lib\vosk_ex\priv\native\windows-x86_64;$env:PATH"

# Then run normally
mix test
mix run
iex -S mix
```

Option 2 - Use the included helper script:
```powershell
# Copy scripts/windows/run.ps1 to your project root
.\scripts\windows\run.ps1 mix test
.\scripts\windows\run.ps1 iex -S mix
```

Option 3 - Create a startup script for your app:
```powershell
# my_app.ps1
$env:PATH = "_build\dev\lib\vosk_ex\priv\native\windows-x86_64;$env:PATH"
mix run --no-halt
```

Option 4 - Use Mix releases (recommended for production):
```shell
mix release
# Releases automatically bundle all DLLs - no PATH manipulation needed!
```

Note: For the test environment, use _build\test\lib\vosk_ex\priv\native\windows-x86_64 instead.
Configuration
VoskEx logs are disabled by default. To enable Vosk/Kaldi internal logging, add to your config/config.exs:
```elixir
config :vosk_ex,
  log_level: 0  # -1 = silent (default), 0 = default logging, >0 = more verbose
```

Usage
1. Download a speech model
Use the built-in Mix task to download a model:
```shell
# Download default English model
mix vosk.download_model

# Download Spanish model
mix vosk.download_model es

# Download specific model by name
mix vosk.download_model vosk-model-small-en-us-0.15
```
Available predefined languages: en-us, es, fr, de, ru, cn, ja, pt, it, and more.
Or download manually from https://alphacephei.com/vosk/models.
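If you prefer to script the manual download, here is a minimal sketch using only OTP's `:httpc` and `:zip`. The `ModelFetch` module is hypothetical (not part of VoskEx), and the `.zip` URL pattern is an assumption based on the models page; verify the exact archive name there before relying on it.

```elixir
defmodule ModelFetch do
  @base "https://alphacephei.com/vosk/models"

  # Builds the archive URL for a named model (assumed URL pattern).
  def url(model_name), do: "#{@base}/#{model_name}.zip"

  # Downloads and unpacks a model archive into `dest` (network required).
  def fetch!(model_name, dest \\ ".") do
    {:ok, _} = Application.ensure_all_started(:inets)
    {:ok, _} = Application.ensure_all_started(:ssl)

    {:ok, {{_, 200, _}, _headers, body}} =
      :httpc.request(:get, {String.to_charlist(url(model_name)), []}, [],
        body_format: :binary)

    {:ok, _files} = :zip.unzip(body, cwd: String.to_charlist(dest))
    :ok
  end
end
```

After unpacking, pass the extracted directory to `VoskEx.Model.load/1` as shown below.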
2. Basic usage
```elixir
# Load the model
{:ok, model} = VoskEx.Model.load("vosk-model-small-en-us-0.15")

# Create a recognizer (16kHz sample rate)
{:ok, recognizer} = VoskEx.Recognizer.new(model, 16000.0)

# Optional: Enable word timing
:ok = VoskEx.Recognizer.set_words(recognizer, true)

# Read audio file (PCM 16-bit mono), skipping the 44-byte WAV header
data = File.read!("audio.wav")
audio = binary_part(data, 44, byte_size(data) - 44)

# Process audio in chunks
chunk_size = 8000

for <<chunk::binary-size(chunk_size) <- audio>> do
  case VoskEx.Recognizer.accept_waveform(recognizer, chunk) do
    :utterance_ended ->
      {:ok, result} = VoskEx.Recognizer.result(recognizer)
      IO.inspect(result)

    :continue ->
      {:ok, partial} = VoskEx.Recognizer.partial_result(recognizer)
      IO.inspect(partial, label: "Partial")
  end
end

# Get final result
{:ok, final} = VoskEx.Recognizer.final_result(recognizer)
IO.inspect(final, label: "Final")
```

3. Result format
```elixir
# Simple result
%{"text" => "hello world"}

# With word timing (when set_words is enabled)
%{
  "result" => [
    %{"conf" => 1.0, "end" => 1.110000, "start" => 0.870000, "word" => "hello"},
    %{"conf" => 0.98, "end" => 1.530000, "start" => 1.110000, "word" => "world"}
  ],
  "text" => "hello world"
}

# Partial result
%{"partial" => "hello wor"}
```

4. Streaming audio
```elixir
defmodule AudioProcessor do
  use GenServer

  def start_link(model_path) do
    GenServer.start_link(__MODULE__, model_path)
  end

  def init(model_path) do
    {:ok, model} = VoskEx.Model.load(model_path)
    {:ok, recognizer} = VoskEx.Recognizer.new(model, 16000.0)
    VoskEx.Recognizer.set_words(recognizer, true)
    {:ok, %{model: model, recognizer: recognizer}}
  end

  def handle_call({:process_audio, audio_chunk}, _from, state) do
    result =
      case VoskEx.Recognizer.accept_waveform(state.recognizer, audio_chunk) do
        :utterance_ended ->
          {:ok, result} = VoskEx.Recognizer.result(state.recognizer)
          {:utterance, result}

        :continue ->
          {:ok, partial} = VoskEx.Recognizer.partial_result(state.recognizer)
          {:partial, partial}

        :error ->
          {:error, :recognition_failed}
      end

    {:reply, result, state}
  end
end
```

Documentation
Full documentation is available at https://hexdocs.pm/vosk_ex or you can generate it locally:
```shell
mix docs
open doc/index.html
```

API Reference
VoskEx.Model
- `load(path)` - Load a model from a directory
- `load!(path)` - Load a model, raising on error
- `find_word(model, word)` - Check if a word exists in the vocabulary
VoskEx.Recognizer
- `new(model, sample_rate)` - Create a recognizer
- `new!(model, sample_rate)` - Create a recognizer, raising on error
- `set_max_alternatives(recognizer, max)` - Set the number of alternatives
- `set_words(recognizer, enabled)` - Enable word timing in results
- `set_partial_words(recognizer, enabled)` - Enable word timing in partial results
- `accept_waveform(recognizer, audio)` - Process audio data
- `result(recognizer)` - Get the final result
- `partial_result(recognizer)` - Get a partial result
- `final_result(recognizer)` - Get the final result at stream end
- `reset(recognizer)` - Reset recognizer state
VoskEx (Low-level API)
- `set_log_level(level)` - Set Vosk/Kaldi logging level (-1 = silent, 0 = default, >0 = verbose)
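The recognizer setters above are typically applied together right after `new/2`. A small sketch combining them (this assumes the setters return `:ok` like `set_words/2` does in the usage example; not verified against the NIF):

```elixir
defmodule RecognizerSetup do
  # Sketch: configure a recognizer for N-best output with word timings,
  # using only the VoskEx.Recognizer functions listed above.
  def configure(recognizer, alternatives \\ 3) do
    :ok = VoskEx.Recognizer.set_max_alternatives(recognizer, alternatives)
    :ok = VoskEx.Recognizer.set_words(recognizer, true)
    :ok = VoskEx.Recognizer.set_partial_words(recognizer, true)
    recognizer
  end
end
```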
Audio Format
Vosk expects PCM 16-bit mono audio. Sample rates typically used:
- 8000 Hz - Telephone quality
- 16000 Hz - Standard quality (most models)
- 44100 Hz - CD quality (if model supports it)
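Since the recognizer only sees raw PCM, it can help to check a WAV file's header before stripping it. A minimal sketch (the `WavInfo` module is a hypothetical helper, not part of VoskEx) that pattern-matches the canonical 44-byte PCM WAV header:

```elixir
defmodule WavInfo do
  # Matches a standard PCM WAV header: "RIFF"/"WAVE" magic, "fmt " chunk,
  # audio format 1 (uncompressed PCM), then channel count and sample rate.
  def parse(<<"RIFF", _riff_size::32-little, "WAVE", "fmt ",
              _fmt_size::32-little, 1::16-little, channels::16-little,
              sample_rate::32-little, _rest::binary>>) do
    {:ok, %{channels: channels, sample_rate: sample_rate}}
  end

  def parse(_other), do: {:error, :not_pcm_wav}
end
```

For example, a 16 kHz mono file should yield `{:ok, %{channels: 1, sample_rate: 16000}}`, matching what most small Vosk models expect.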
Converting audio with ffmpeg
```shell
# Convert any audio to 16kHz mono PCM WAV
ffmpeg -i input.mp3 -ar 16000 -ac 1 -f wav output.wav

# Extract raw PCM (no WAV header)
ffmpeg -i input.mp3 -ar 16000 -ac 1 -f s16le output.raw
```

Performance Considerations
- Model size: Larger models are more accurate but slower
  - Small models: ~50 MB, fast, less accurate
  - Large models: 1-2 GB, slower, more accurate
- Dirty schedulers: `accept_waveform` uses dirty CPU schedulers to avoid blocking the BEAM
- Memory management: Models and recognizers are automatically freed by the garbage collector
- Thread safety: Models can be shared, but each GenServer should have its own recognizer
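The sharing rule above can be sketched as one process per audio stream, each owning its recognizer while the model resource is created once and passed in. This is a sketch using the function names from the API reference, not a tested pattern against the real NIF:

```elixir
defmodule SharedModel do
  # Sketch: one model resource backing several recognizers, each owned
  # by exactly one process (e.g. one per incoming audio stream).
  def start(model_path, sample_rates) do
    {:ok, model} = VoskEx.Model.load(model_path)

    for rate <- sample_rates do
      spawn_link(fn ->
        # Each process creates and exclusively uses its own recognizer.
        {:ok, recognizer} = VoskEx.Recognizer.new(model, rate)
        loop(recognizer)
      end)
    end
  end

  defp loop(recognizer) do
    receive do
      {:audio, chunk} ->
        VoskEx.Recognizer.accept_waveform(recognizer, chunk)
        loop(recognizer)
    end
  end
end
```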
Available Models
Download models from https://alphacephei.com/vosk/models
Languages include:
- English (US, UK, Indian)
- Chinese, Japanese, Korean
- Spanish, French, German, Italian, Portuguese
- Russian, Ukrainian, Polish, Czech
- And many more…
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Vosk itself is licensed under the Apache License 2.0.
Acknowledgments
- Vosk Speech Recognition Toolkit
- Alpha Cephei for developing Vosk