OmnivoiceEx
Elixir wrapper for OmniVoice — a unified speech generation model from K2-FSA.
Voice Cloning · Voice Design · Multilingual TTS · Deterministic Generation · 24kHz Output
Features
- 🎤 Voice Cloning — Clone any voice from a short reference audio clip
- 🎨 Voice Design — Describe a voice in natural language ("warm female broadcaster", "deep authoritative narrator")
- 🌍 Multilingual — Supports 646 languages, with automatic language detection when no language is specified
- 🔁 Deterministic Generation — Stable, reproducible outputs via seed and temperature controls
- ⚡ GPU Optimized — CUDA, Apple Silicon (MPS), or CPU fallback
- 🔊 24kHz WAV — Professional-grade audio output
- 📦 MessagePack Protocol — Zero-base64 binary transport over Erlang Ports
Requirements
- Elixir ≥ 1.14
- Python ≥ 3.10
- CUDA GPU (recommended), Apple Silicon MPS, or CPU
- omnivoice pip package (auto-installed via mix omnivoice_ex.setup)
Installation
Add to your mix.exs:
def deps do
  [
    {:omnivoice_ex, "~> 0.2.0"}
  ]
end
Then install Python dependencies:
mix omnivoice_ex.setup
Quick Start
# Start the model server
{:ok, pid} = OmnivoiceEx.start_link(device: "cuda")
# Wait for model to load
:ok = OmnivoiceEx.await_ready(pid)
# Generate speech
{:ok, audio} = OmnivoiceEx.generate(pid, "Hello, world!")
# Save to file
:ok = OmnivoiceEx.save(audio, "output.wav")
# Clean shutdown
OmnivoiceEx.stop(pid)
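The Quick Start steps can be wrapped so the server is stopped even when a step fails. This is a minimal sketch, not part of the library: `SpeakOnce` is a hypothetical module, and it assumes failing calls return `{:error, reason}` tuples (only the success shapes are shown above). The `backend` argument exists so the flow can be exercised with a stub module instead of a loaded model.

```elixir
defmodule SpeakOnce do
  # Hypothetical helper, not part of OmnivoiceEx.
  # `backend` defaults to OmnivoiceEx; any module exposing the same five
  # functions used in Quick Start can be passed in for testing.
  def to_file(text, path, start_opts \\ [device: "cuda"], backend \\ OmnivoiceEx) do
    with {:ok, pid} <- backend.start_link(start_opts) do
      try do
        with :ok <- backend.await_ready(pid),
             {:ok, audio} <- backend.generate(pid, text),
             :ok <- backend.save(audio, path) do
          {:ok, path}
        end
      after
        # Runs whether the steps above succeeded, failed, or raised.
        backend.stop(pid)
      end
    end
  end
end
```

Any value that does not match a `with` clause (assumed here to be an error tuple) is returned to the caller unchanged, so `SpeakOnce.to_file("Hello, world!", "output.wav")` yields either `{:ok, "output.wav"}` or the first failure.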
Voice Design
Describe a voice in natural language and OmniVoice generates it:
{:ok, audio} = OmnivoiceEx.generate(pid,
  "Welcome to our luxury resort.",
  instruct: "A warm, professional female concierge with a British accent"
)
Voice Cloning
Clone a voice from a reference audio file:
{:ok, audio} = OmnivoiceEx.generate(pid,
  "This is a cloned voice speaking English.",
  ref_audio: "/path/to/reference.wav",
  ref_text: "Transcript of the reference audio" # optional, improves quality
)
Deterministic / Reproducible Generation (v0.2.0+)
OmnivoiceEx now supports fully deterministic generation for stable outputs across runs. This is useful for:
- A/B testing prompts and settings
- CI pipelines that validate audio
- Production systems requiring consistent behavior
Key options:
- seed: Random seed for reproducible generation (default: 42)
- position_temperature: Mask-position selection temperature; 0 = greedy/deterministic (default: 0.0)
- class_temperature: Token sampling temperature; 0 = greedy/deterministic (default: 0.0)
Example:
{:ok, audio} = OmnivoiceEx.generate(pid,
  "This output is fully reproducible.",
  seed: 12345,
  position_temperature: 0.0,
  class_temperature: 0.0
)
Under the hood (v0.2.0 fix):
- Sets CUBLAS_WORKSPACE_CONFIG for deterministic cuBLAS (CUDA ≥ 10.2)
- Enables torch.backends.cudnn.deterministic = True and best-effort use_deterministic_algorithms(True)
- Seeds torch, CUDA, and NumPy RNGs before each generation
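For a CI pipeline, reproducibility can be checked by generating the same text several times and comparing the results. The sketch below uses a hypothetical `ReproCheck` module and assumes the returned audio is a term that can be compared for equality (for example a WAV binary). It takes the generator as a function so it can be tried without a loaded model.

```elixir
defmodule ReproCheck do
  # Hypothetical helper, not part of OmnivoiceEx.
  # `generate_fun` is any 2-arity function (text, opts) -> {:ok, audio}.
  # With a running server that could be:
  #   fn text, opts -> OmnivoiceEx.generate(pid, text, opts) end
  def reproducible?(generate_fun, text, opts, runs \\ 3) when runs >= 2 do
    outputs =
      for _ <- 1..runs do
        {:ok, audio} = generate_fun.(text, opts)
        audio
      end

    # Reproducible when every run produced the same output.
    length(Enum.uniq(outputs)) == 1
  end
end
```

Exact equality is a strict criterion. If outputs differ slightly across hardware or driver versions, comparing within a single machine is likely the more realistic check.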
Language Selection
language: OmniVoice language ID (e.g. "zh", "en", "ja", "ko", "yue"). Auto-detected from text if omitted. For mixed-language content, set this explicitly to avoid unstable detection.
Common IDs: zh (Chinese), en (English), ja (Japanese), ko (Korean), yue (Cantonese), fr (French), de (German), es (Spanish), ru (Russian), pt (Portuguese), it (Italian), th (Thai), vi (Vietnamese), hi (Hindi), ar (Arabic), nl (Dutch), pl (Polish), sv (Swedish), tr (Turkish).
Full list of 646 languages: OmniVoice docs/languages.md
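When the language is set explicitly, a typo in the ID is easy to miss. The following sketch shows one way to guard against that before calling generate. `LangOpts` is a hypothetical helper, and its allow-list contains only the common IDs listed above, so a real application would extend it with whichever of the 646 IDs it needs.

```elixir
defmodule LangOpts do
  # Hypothetical helper, not part of OmnivoiceEx.
  # Only the common IDs from this README; extend as needed.
  @common ~w(zh en ja ko yue fr de es ru pt it th vi hi ar nl pl sv tr)

  # Adds an explicit :language to a keyword list of generation options,
  # rejecting IDs that are not in the allow-list above.
  def with_language(opts, lang) when is_list(opts) and is_binary(lang) do
    if lang in @common do
      {:ok, Keyword.put(opts, :language, lang)}
    else
      {:error, {:unknown_language, lang}}
    end
  end
end
```

For mixed-language text, `LangOpts.with_language([seed: 42], "zh")` returns options that pin the language, which can then be passed as the third argument to `OmnivoiceEx.generate/3`.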
Generation Options
- ref_audio — Path to reference audio for cloning
- ref_text — Transcript of reference audio (improves clone quality)
- instruct — Voice instruction for design (e.g. "A warm female broadcaster")
- language — Language ID; auto-detected if omitted
- duration — Target duration in seconds
- speed — Playback speed factor
- num_step — Diffusion steps (higher = better quality, slower). Default: 32
- guidance_scale — Classifier-free guidance. Default: 2.0
- seed — Random seed for reproducible generation. Default: 42
- position_temperature — Mask-position selection temperature; 0 = greedy/deterministic. Default: 0.0
- class_temperature — Token sampling temperature; 0 = greedy/deterministic. Default: 0.0
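Because options are passed as a keyword list, a misspelled key may go unnoticed. The sketch below merges caller overrides onto the documented defaults and flags unknown keys. `GenOpts` is a hypothetical module, and whether OmnivoiceEx itself validates option names is not stated here, so treat this as an optional client-side guard.

```elixir
defmodule GenOpts do
  # Hypothetical helper, not part of OmnivoiceEx.
  # Defaults as documented in this README.
  @defaults [
    num_step: 32,
    guidance_scale: 2.0,
    seed: 42,
    position_temperature: 0.0,
    class_temperature: 0.0
  ]

  # Every option name documented in this README.
  @known [:ref_audio, :ref_text, :instruct, :language, :duration, :speed] ++
           Keyword.keys(@defaults)

  # Returns {:ok, merged_opts}, or an error naming any unrecognized keys.
  def build(overrides \\ []) do
    case Keyword.keys(overrides) -- @known do
      [] -> {:ok, Keyword.merge(@defaults, overrides)}
      unknown -> {:error, {:unknown_options, unknown}}
    end
  end
end
```

A call such as `GenOpts.build(seed: 7, instruct: "A warm female broadcaster")` keeps the remaining defaults, while `GenOpts.build(sped: 1.2)` returns `{:error, {:unknown_options, [:sped]}}`.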
Architecture
Elixir (GenServer) ←→ Erlang Port ←→ Python Bridge ←→ OmniVoice Model
                      (stdin/stdout) (msgpack framed)
Uses MessagePack binary framing over Erlang Ports — audio is transmitted as raw WAV bytes inside msgpack, eliminating the 33% base64 overhead of JSON-based solutions.
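To make the overhead figure concrete, the sketch below compares raw and base64 sizes and shows one common way to frame binary messages on a byte stream. The framing layout is an assumption (a 4-byte big-endian length prefix, which is what Erlang ports use with the `{:packet, 4}` option); the actual wire format of the bridge is not specified here, and `FrameSketch` is a hypothetical module.

```elixir
defmodule FrameSketch do
  # Assumed framing: 4-byte big-endian length, then the payload.
  def encode(payload) when is_binary(payload) do
    <<byte_size(payload)::unsigned-big-32, payload::binary>>
  end

  # Returns {:ok, payload, rest} once a whole frame is buffered, else :more.
  def decode(<<len::unsigned-big-32, payload::binary-size(len), rest::binary>>) do
    {:ok, payload, rest}
  end

  def decode(_incomplete), do: :more
end

# Base64 emits 4 output bytes for every 3 input bytes, hence the ~33% growth.
raw = :binary.copy(<<0>>, 3_000)
IO.inspect(byte_size(raw), label: "raw bytes")
IO.inspect(byte_size(Base.encode64(raw)), label: "base64 bytes")
IO.inspect(byte_size(FrameSketch.encode(raw)), label: "framed bytes")
```

For a 3,000-byte payload, base64 produces 4,000 bytes, whereas the length-prefixed frame adds only 4 bytes.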
Changelog
v0.2.0
- Fixed initialization bug: removed stderr_to_stdout from port options to avoid blocking / startup issues
- Added deterministic generation support:
  - New options: seed, position_temperature, class_temperature
  - cuBLAS and cuDNN determinism settings for stable GPU outputs
- Improved language documentation with common IDs and link to full list
v0.1.0
- Initial release: Voice Cloning, Voice Design, multilingual TTS, 24kHz WAV, MessagePack protocol
License
Apache 2.0 — see LICENSE.
Related
- OmniVoice on HuggingFace
- VoxCPMEx — Elixir wrapper for VoxCPM2