OmnivoiceEx
Elixir wrapper for OmniVoice — a unified speech generation model from K2-FSA.
Voice Cloning · Voice Design · Multilingual TTS · Deterministic Generation · 24kHz Output
Features
- 🎤 Voice Cloning — Clone any voice from a short reference audio clip
- 🎨 Voice Design — Describe a voice in natural language ("warm female broadcaster", "deep authoritative narrator")
- 🌍 Multilingual — Supports multiple languages with automatic detection; 646 languages available
- 🔁 Deterministic Generation — Stable, reproducible outputs via seed and temperature controls
- ⚡ GPU Optimized — CUDA, Apple Silicon (MPS), or CPU fallback
- 🔊 24kHz WAV — Professional-grade audio output
- 📦 MessagePack Protocol — Zero-base64 binary transport over Erlang Ports
Requirements
- Elixir ≥ 1.14
- Python ≥ 3.10
- CUDA GPU (recommended), Apple Silicon MPS, or CPU
- omnivoice pip package (auto-installed via mix omnivoice_ex.setup)
Installation
Add to your mix.exs:
def deps do
  [
    {:omnivoice_ex, "~> 0.2.0"}
  ]
end
Then install Python dependencies:
mix omnivoice_ex.setup
Quick Start
# Start the model server
{:ok, pid} = OmnivoiceEx.start_link(device: "cuda")
# Wait for model to load
:ok = OmnivoiceEx.await_ready(pid)
# Generate speech
{:ok, audio} = OmnivoiceEx.generate(pid, "Hello, world!")
# Save to file
:ok = OmnivoiceEx.save(audio, "output.wav")
# Clean shutdown
OmnivoiceEx.stop(pid)
Voice Design
Describe a voice in natural language and OmniVoice generates it:
{:ok, audio} = OmnivoiceEx.generate(pid,
  "Welcome to our luxury resort.",
  instruct: "A warm, professional female concierge with a British accent"
)
Voice Cloning
Clone a voice from a reference audio file:
{:ok, audio} = OmnivoiceEx.generate(pid,
  "This is a cloned voice speaking English.",
  ref_audio: "/path/to/reference.wav",
  ref_text: "Transcript of the reference audio" # optional, improves quality
)
Deterministic / Reproducible Generation (v0.2.0+)
As of v0.2.0, OmnivoiceEx supports deterministic generation for stable outputs across runs (best effort on GPU). This is useful for:
- A/B testing prompts and settings
- CI pipelines that validate audio
- Production systems requiring consistent behavior
Key options:
- seed: Random seed for reproducible generation (default: 42)
- position_temperature: Mask-position selection temperature; 0 = greedy/deterministic (default: 0.0)
- class_temperature: Token sampling temperature; 0 = greedy/deterministic (default: 0.0)
Example:
{:ok, audio} = OmnivoiceEx.generate(pid,
  "This output is fully reproducible.",
  seed: 12345,
  position_temperature: 0.0,
  class_temperature: 0.0
)
Under the hood (v0.2.0 fix):
- Sets CUBLAS_WORKSPACE_CONFIG for deterministic CuBLAS (CUDA ≥ 10.2)
- Enables torch.backends.cudnn.deterministic = True and best-effort use_deterministic_algorithms(True)
- Seeds torch, CUDA, and NumPy RNGs before each generation
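One way to confirm reproducibility in a CI pipeline is to compare digests of two generations made with the same text, seed, and temperatures. The sketch below is illustrative only: ReproCheck is a hypothetical helper, not part of OmnivoiceEx, and it assumes the generated audio is available as a binary of WAV bytes.

```elixir
defmodule ReproCheck do
  @moduledoc "Hypothetical helper for comparing generated audio by digest."

  # SHA-256 hex digest of a binary (assumed here to be raw WAV bytes).
  def fingerprint(audio) when is_binary(audio) do
    :crypto.hash(:sha256, audio) |> Base.encode16(case: :lower)
  end

  # True when both binaries have the same digest, i.e. identical content.
  def same_output?(a, b) when is_binary(a) and is_binary(b) do
    fingerprint(a) == fingerprint(b)
  end
end

# Stand-in binaries; in practice these would come from two generate/3 calls
# made with the same text, seed, and temperatures.
run_a = <<82, 73, 70, 70, 1, 2, 3>>
run_b = <<82, 73, 70, 70, 1, 2, 3>>
IO.puts(ReproCheck.same_output?(run_a, run_b))
```

Storing the digest alongside the seed and options gives an audit trail without keeping every audio file.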
Language Selection
- language: OmniVoice language ID (e.g. "zh", "en", "ja", "ko", "yue"). Auto-detected from text if omitted. For mixed-language content, set this explicitly to avoid unstable detection.
Common IDs: zh (Chinese), en (English), ja (Japanese), ko (Korean), yue (Cantonese), fr (French), de (German), es (Spanish), ru (Russian), pt (Portuguese), it (Italian), th (Thai), vi (Vietnamese), hi (Hindi), ar (Arabic), nl (Dutch), pl (Polish), sv (Swedish), tr (Turkish).
Full list of 646 languages: OmniVoice docs/languages.md
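For mixed-language prompts, a small pre-check can decide when to pass language explicitly. The following is a rough sketch under stated assumptions: LanguageHint is a hypothetical module, it only detects Han plus Latin mixtures, and the fallback ID is whatever the caller chooses. It does not reflect how OmniVoice detects language internally.

```elixir
defmodule LanguageHint do
  @moduledoc "Hypothetical pre-check; not part of OmnivoiceEx."

  # True when the text contains both Han (CJK) and Latin letters.
  def mixed_script?(text) when is_binary(text) do
    String.match?(text, ~r/\p{Han}/u) and String.match?(text, ~r/\p{Latin}/u)
  end

  # Adds an explicit :language to opts only when the text mixes scripts
  # and the caller has not already chosen one.
  def with_language(opts, text, fallback) when is_list(opts) do
    if mixed_script?(text) and not Keyword.has_key?(opts, :language) do
      Keyword.put(opts, :language, fallback)
    else
      opts
    end
  end
end

opts = LanguageHint.with_language([seed: 1], "今天我们用 Elixir 写代码", "zh")
IO.inspect(opts)
```

The resulting keyword list can be passed as the options argument of generate/3.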
Generation Options
- ref_audio: Path to reference audio for cloning
- ref_text: Transcript of reference audio (improves clone quality)
- instruct: Voice instruction for design (e.g. "A warm female broadcaster")
- language: Language ID; auto-detected if omitted
- duration: Target duration in seconds
- speed: Playback speed factor
- num_step: Diffusion steps (higher = better quality, slower). Default: 32
- guidance_scale: Classifier-free guidance. Default: 2.0
- seed: Random seed for reproducible generation. Default: 42
- position_temperature: Mask-position selection temperature; 0 = greedy/deterministic. Default: 0.0
- class_temperature: Token sampling temperature; 0 = greedy/deterministic. Default: 0.0
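To catch misspelled option names before a request reaches the model, the options can be checked on the caller side. This is a minimal sketch, assuming the option names and defaults listed in this section; GenOptsSketch is a hypothetical module and does not describe how OmnivoiceEx validates options internally.

```elixir
defmodule GenOptsSketch do
  # Keys without a documented default are allowed but not filled in;
  # the remaining keys carry the defaults listed in this README.
  @schema [
    :ref_audio,
    :ref_text,
    :instruct,
    :language,
    :duration,
    :speed,
    num_step: 32,
    guidance_scale: 2.0,
    seed: 42,
    position_temperature: 0.0,
    class_temperature: 0.0
  ]

  # Raises ArgumentError on unknown keys and fills in missing defaults.
  def with_defaults(opts) when is_list(opts), do: Keyword.validate!(opts, @schema)
end

opts = GenOptsSketch.with_defaults(seed: 7, instruct: "A warm female broadcaster")
IO.inspect(Keyword.fetch!(opts, :num_step))
```

Keyword.validate!/2 is part of the Elixir standard library from 1.13 onward, so it is available under the stated Elixir ≥ 1.14 requirement.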
Architecture
Elixir (GenServer) ←→ Erlang Port ←→ Python Bridge ←→ OmniVoice Model
                      (stdin/stdout)  (msgpack framed)
Uses MessagePack binary framing over Erlang Ports. Audio is transmitted as raw WAV bytes inside msgpack, which avoids the roughly 33% size overhead that base64 encoding adds in JSON-based solutions.
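The overhead figure and the idea of framing can be illustrated in a few lines. This is a sketch under assumptions: the 4-byte big-endian length prefix is a common convention chosen here for illustration (the README only says "msgpack framed"), FramingSketch is a hypothetical module, and the MessagePack encoding step itself is left out.

```elixir
defmodule FramingSketch do
  # Prefix a payload with its size as a 4-byte big-endian integer.
  def frame(payload) when is_binary(payload) do
    <<byte_size(payload)::unsigned-big-integer-size(32), payload::binary>>
  end

  # Pull one complete frame off the front of a buffer, if available.
  def unframe(
        <<len::unsigned-big-integer-size(32), payload::binary-size(len), rest::binary>>
      ) do
    {:ok, payload, rest}
  end

  # Not enough bytes yet for a full frame.
  def unframe(buffer) when is_binary(buffer), do: :incomplete
end

raw = :binary.copy(<<0>>, 3_000)
IO.inspect(byte_size(raw))                 # 3000
IO.inspect(byte_size(Base.encode64(raw)))  # 4000
```

Base64 maps every 3 input bytes to 4 output characters, which is where the one-third size increase comes from; sending the bytes directly inside a binary frame avoids it.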
Changelog
v0.2.0
- Fixed initialization bug: removed stderr_to_stdout from port options to avoid blocking / startup issues
- Added deterministic generation support:
  - New options: seed, position_temperature, class_temperature
  - CuBLAS and cuDNN determinism settings for stable GPU outputs
- Improved language documentation with common IDs and link to full list
v0.1.0
- Initial release: Voice Cloning, Voice Design, multilingual TTS, 24kHz WAV, MessagePack protocol
Production & Engineering
This section provides practical guidance for using OmnivoiceEx in real systems: concurrency, reliability, monitoring, and common pitfalls.
Concurrency and Request Handling
- The Python bridge is a single process behind one GenServer. Generation calls are executed serially inside the model; concurrent generate/3 requests are queued internally.
- Recommended patterns:
  - Use a single server per node in most cases:
    - Start once at application startup, share it via a named GenServer or Supervisor.
  - For high-load clusters:
    - Run one OmnivoiceEx instance per GPU (or per model replica).
    - Distribute requests across nodes using a load balancer or job queue.
Example: named server in supervision tree
defmodule MyApp.OmniVoiceSupervisor do
  use Supervisor

  def start_link(opts) do
    Supervisor.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(_opts) do
    children = [
      {OmnivoiceEx,
       name: OmniVoiceServer,
       device: System.get_env("OMNIVOICE_DEVICE") || "cuda",
       model: System.get_env("OMNIVOICE_MODEL") || "k2-fsa/OmniVoice"}
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end
end
# Usage elsewhere:
{:ok, audio} = OmnivoiceEx.generate(OmniVoiceServer, "Hello!", seed: 1)
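When several instances run on one node (for example one per GPU), requests need to be spread across them. The sketch below shows a simple round-robin picker. RoundRobinSketch and the server names are hypothetical; it assumes each instance was started under its own registered name, as in the supervisor example above.

```elixir
defmodule RoundRobinSketch do
  # Build a picker over a non-empty list of registered server names.
  def new([_ | _] = servers) do
    {List.to_tuple(servers), :atomics.new(1, signed: false)}
  end

  # Return the next server name. The atomic increment makes this safe
  # to call from many processes at once.
  def next({servers, counter}) do
    n = :atomics.add_get(counter, 1, 1)
    elem(servers, rem(n - 1, tuple_size(servers)))
  end
end

picker = RoundRobinSketch.new([:omnivoice_gpu0, :omnivoice_gpu1])
IO.inspect(RoundRobinSketch.next(picker))  # :omnivoice_gpu0
IO.inspect(RoundRobinSketch.next(picker))  # :omnivoice_gpu1
IO.inspect(RoundRobinSketch.next(picker))  # :omnivoice_gpu0
```

The picked name would then be passed as the first argument to generate/3. Round-robin ignores how long each request takes, so a job queue is a better fit when text lengths vary widely.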
Timeouts and Backpressure
- generate/3 and await_ready/2 accept a timeout argument.
- In production:
  - Never use unlimited timeouts in HTTP handlers or worker loops.
  - Use per-request timeouts (e.g., 30–120s depending on length) so long-running or stuck generations do not freeze processes.
Example:
case OmnivoiceEx.generate(OmniVoiceServer, text, opts, timeout: 60_000) do
  {:ok, audio} ->
    # handle audio
    {:ok, audio}

  {:error, :timeout} ->
    # fallback / retry / user message
    {:error, :timeout}

  {:error, reason} ->
    # log and handle
    {:error, reason}
end
If your system is under heavy load:
- Consider a job queue with bounded concurrency (e.g., Oban) to isolate TTS jobs.
- Use backoff + limited retries instead of aggressive parallel attempts on the same server.
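Backoff with limited retries can be sketched as a small wrapper around any function that returns {:ok, _} or {:error, _}. RetrySketch is a hypothetical helper, and the attempt count and base delay are illustrative values rather than recommendations from the library.

```elixir
defmodule RetrySketch do
  # Calls fun up to max_attempts times, sleeping base_ms, 2 * base_ms,
  # 4 * base_ms, ... between failed attempts.
  def with_backoff(fun, max_attempts \\ 3, base_ms \\ 200)
      when is_function(fun, 0) and max_attempts >= 1 do
    attempt(fun, 1, max_attempts, base_ms)
  end

  defp attempt(fun, n, max, base_ms) do
    case fun.() do
      {:ok, _} = ok ->
        ok

      # Out of attempts: return the last error unchanged.
      {:error, _} = err when n >= max ->
        err

      {:error, _} ->
        Process.sleep(base_ms * Integer.pow(2, n - 1))
        attempt(fun, n + 1, max, base_ms)
    end
  end
end

# Stand-in for a TTS call that always times out.
result = RetrySketch.with_backoff(fn -> {:error, :timeout} end, 2, 10)
IO.inspect(result)
```

In real use the anonymous function would wrap the generate call with its own per-request timeout, so the total wait stays bounded by the attempt count.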
Error Handling
OmnivoiceEx can return errors from:
- Model failures / OOM
- Invalid inputs
- Bridge crashes
General pattern:
require Logger

case OmnivoiceEx.generate(OmniVoiceServer, text, opts) do
  {:ok, audio} ->
    # success
    {:ok, audio}

  {:error, :timeout} ->
    Logger.warning("TTS request timed out")

  {:error, msg} when is_binary(msg) ->
    Logger.error("TTS bridge error: #{msg}")

  {:error, other} ->
    Logger.error("TTS unexpected error: #{inspect(other)}")
end
If the Python bridge process exits unexpectedly:
- The GenServer will transition to an error status.
- In production, wrap OmnivoiceEx in a Supervisor with restart: :transient or :permanent depending on your policy.
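The restart policy can be set by overriding the child spec with Supervisor.child_spec/2. In this sketch, Agent stands in for OmnivoiceEx so the snippet runs without the library installed; the :tts_server id and the restart limits are illustrative assumptions.

```elixir
# Agent is a stand-in here; with the real library the tuple would be
# {OmnivoiceEx, opts} instead.
spec =
  Supervisor.child_spec({Agent, fn -> %{} end},
    id: :tts_server,
    restart: :transient
  )

IO.inspect(spec.restart)  # :transient

# Cap restarts so a bridge that keeps crashing does not loop forever.
{:ok, sup} =
  Supervisor.start_link([spec],
    strategy: :one_for_one,
    max_restarts: 3,
    max_seconds: 60
  )

IO.inspect(Supervisor.count_children(sup).active)
```

With :transient the child is restarted only after an abnormal exit, while :permanent restarts it after any exit, including a clean stop.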
Telemetry and Monitoring
OmnivoiceEx emits telemetry events you can use for observability:
- [:omnivoice_ex, :generate]:
  - Measured: %{duration_ms: float()}, the time to generate audio.
- [:omnivoice_ex, :await_ready]:
  - Measured: model load duration.
Example: attach a handler in your application:
defmodule MyApp do
  use Application
  require Logger

  @impl true
  def start(_type, _args) do
    # Attach handlers before starting children so early events are not missed.
    :ok =
      :telemetry.attach_many(
        "omnivoice-ex-logger",
        [
          [:omnivoice_ex, :generate],
          [:omnivoice_ex, :await_ready]
        ],
        &__MODULE__.handle_event/4,
        nil
      )

    children = [
      MyApp.OmniVoiceSupervisor
    ]

    # start/2 must return {:ok, pid}, so the supervisor call comes last.
    Supervisor.start_link(children, strategy: :one_for_one)
  end

  def handle_event(event, measurements, _meta, _config) do
    Logger.debug(
      "OmnivoiceEx #{inspect(event)} duration_ms=#{inspect(measurements[:duration_ms])}"
    )
  end
end
You can plug this into Prometheus, Grafana, or your internal metrics stack.
Deployment Notes
- Python environment:
  - Run mix omnivoice_ex.setup in your build / deploy step to install required pip packages.
  - Ensure the same Python interpreter is available at runtime as during setup.
- GPU:
  - Use a dedicated container or VM with stable GPU drivers and sufficient memory.
  - Avoid overcommitting GPUs; OmniVoice + large context can use several GBs of VRAM.
- Determinism in production:
  - For content pipelines where you must be able to reproduce outputs (e.g., logs, audits), always pass a seed and keep temperatures at 0.
Common Pitfalls (FAQ-style)
- “Why is startup slow?”
  - The model loads into memory on first start. This is expected; use await_ready/2 with a generous timeout and cache the server instead of restarting it frequently.
- “Why are concurrent requests blocking each other?”
  - The Python bridge processes one request at a time. For high concurrency, deploy multiple instances (e.g., one per GPU) and load-balance between them.
- “Audio sounds different across runs.”
  - If you need stable output, set seed, position_temperature: 0.0, class_temperature: 0.0. Without these, outputs may vary due to stochastic sampling.
- “Language detection seems random for mixed-language text.”
  - Set language explicitly (e.g., "zh" or "en") when mixing languages in the same prompt.
License
Apache 2.0 — see LICENSE.
Related
- OmniVoice on HuggingFace
- VoxCPMEx — Elixir wrapper for VoxCPM2