OmnivoiceEx
Elixir wrapper for OmniVoice, a unified speech generation model from K2-FSA.
Voice Cloning · Voice Design · Multilingual TTS · 24kHz Output
Features
- Voice Cloning: Clone any voice from a short reference audio clip
- Voice Design: Describe a voice in natural language ("warm female broadcaster", "deep authoritative narrator")
- Multilingual: Supports multiple languages with automatic detection
- GPU Optimized: CUDA, Apple Silicon (MPS), or CPU fallback
- 24kHz WAV: Professional-grade audio output
- MessagePack Protocol: Binary transport over Erlang Ports with no base64 encoding
Requirements
- Elixir ≥ 1.14
- Python ≥ 3.10
- CUDA GPU (recommended), Apple Silicon MPS, or CPU
- omnivoice pip package (auto-installed via mix omnivoice_ex.setup)
Installation
Add to your mix.exs:
def deps do
  [
    {:omnivoice_ex, "~> 0.1.0"}
  ]
end
Then install Python dependencies:
mix omnivoice_ex.setup
Quick Start
# Start the model server
{:ok, pid} = OmnivoiceEx.start_link(device: "cuda")
# Wait for model to load
:ok = OmnivoiceEx.await_ready(pid)
# Generate speech
{:ok, audio} = OmnivoiceEx.generate(pid, "Hello, world!")
# Save to file
:ok = OmnivoiceEx.save(audio, "output.wav")
# Clean shutdown
OmnivoiceEx.stop(pid)
Voice Design
Describe a voice in natural language and OmniVoice generates it:
{:ok, audio} = OmnivoiceEx.generate(pid,
  "Welcome to our luxury resort.",
  instruct: "A warm, professional female concierge with a British accent"
)
Voice Cloning
Clone a voice from a reference audio file:
{:ok, audio} = OmnivoiceEx.generate(pid,
  "This is a cloned voice speaking English.",
  ref_audio: "/path/to/reference.wav",
  ref_text: "Transcript of the reference audio" # optional, improves quality
)
Generation Options
| Option | Type | Default | Description |
|---|---|---|---|
| ref_audio | String.t() | - | Path to reference audio for cloning |
| ref_text | String.t() | - | Transcript of reference audio |
| instruct | String.t() | - | Voice instruction for design |
| language | String.t() | - | Language code (auto-detected) |
| duration | float() | - | Target duration in seconds |
| speed | float() | - | Playback speed factor |
| num_step | pos_integer() | 32 | Diffusion steps (more = higher quality) |
| guidance_scale | float() | 2.0 | CFG guidance scale |
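As a rough illustration of how the options above fit together, the sketch below checks a keyword list against the documented keys and fills in the two documented defaults before a call to generate. GenOptsSketch and normalize/1 are hypothetical names that are not part of OmnivoiceEx, and the README does not say whether the wrapper performs its own validation. Only Keyword.validate!/2 from the Elixir standard library is used.

```elixir
defmodule GenOptsSketch do
  # Hypothetical helper, not part of OmnivoiceEx.
  # Allowed keys and the two defaults are copied from the options table.
  @schema [
    :ref_audio,
    :ref_text,
    :instruct,
    :language,
    :duration,
    :speed,
    num_step: 32,
    guidance_scale: 2.0
  ]

  # Raises ArgumentError on an unknown key and fills in the documented defaults.
  # Keys without a default are left out of the result unless the caller set them.
  def normalize(opts) when is_list(opts) do
    Keyword.validate!(opts, @schema)
  end
end

opts = GenOptsSketch.normalize(instruct: "deep authoritative narrator", num_step: 16)
IO.inspect(Keyword.fetch!(opts, :num_step), label: "num_step")
IO.inspect(Keyword.fetch!(opts, :guidance_scale), label: "guidance_scale")
```

The normalized list could then be passed as the third argument to OmnivoiceEx.generate, in the same position as the instruct and ref_audio options in the earlier examples. A misspelled key such as :num_steps fails fast on the Elixir side instead of being sent to the Python process.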
Architecture
Elixir (GenServer) <-> Erlang Port <-> Python Bridge <-> OmniVoice Model
(stdin/stdout) (msgpack framed)
Uses MessagePack binary framing over Erlang Ports: audio is transmitted as raw WAV bytes inside msgpack, avoiding the roughly 33% size overhead that base64 encoding adds in JSON-based solutions.
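To make the framing and the overhead figure concrete, here is a minimal sketch that wraps a binary payload in a length-prefixed frame and compares its size with the base64 encoding of the same bytes. The frame layout (a 4-byte big-endian length followed by the payload) is an assumption chosen for illustration; the README does not specify the layout OmnivoiceEx actually uses, and FrameSketch is a hypothetical module. Erlang ports offer the same prefix scheme through the {:packet, 4} option, though whether this library relies on it is not stated. The msgpack encoding step is left out so the example needs only the standard library.

```elixir
defmodule FrameSketch do
  # Hypothetical framing: 4-byte big-endian length prefix, then the payload.
  # This is an assumed layout, not the documented OmnivoiceEx wire format.
  def frame(payload) when is_binary(payload) do
    <<byte_size(payload)::unsigned-big-32, payload::binary>>
  end

  # Returns {:ok, payload, rest} once a full frame is buffered, else :incomplete.
  def unframe(<<len::unsigned-big-32, payload::binary-size(len), rest::binary>>) do
    {:ok, payload, rest}
  end

  def unframe(buffer) when is_binary(buffer), do: :incomplete
end

# Stand-in for WAV data: 30_000 zero bytes.
audio = :binary.copy(<<0>>, 30_000)

# Round trip: the payload comes back unchanged with nothing left over.
{:ok, ^audio, <<>>} = audio |> FrameSketch.frame() |> FrameSketch.unframe()

raw_size = byte_size(FrameSketch.frame(audio))
b64_size = byte_size(Base.encode64(audio))
IO.puts("framed binary: #{raw_size} bytes, base64: #{b64_size} bytes")
# prints "framed binary: 30004 bytes, base64: 40000 bytes"
```

Base64 emits four output characters for every three input bytes, which is where the roughly 33% growth comes from; the length prefix adds a fixed four bytes regardless of payload size.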
License
Apache 2.0. See LICENSE.
Related
- OmniVoice on HuggingFace
- VoxCPMEx: Elixir wrapper for VoxCPM2