OmnivoiceEx

Elixir wrapper for OmniVoice — a unified speech generation model from K2-FSA.

Voice Cloning · Voice Design · Multilingual TTS · Deterministic Generation · 24kHz Output

Features

Requirements

Installation

Add to your mix.exs:

def deps do
  [
    {:omnivoice_ex, "~> 0.2.0"}
  ]
end

Then install Python dependencies:

mix omnivoice_ex.setup

Quick Start

# Start the model server
{:ok, pid} = OmnivoiceEx.start_link(device: "cuda")
# Wait for model to load
:ok = OmnivoiceEx.await_ready(pid)
# Generate speech
{:ok, audio} = OmnivoiceEx.generate(pid, "Hello, world!")
# Save to file
:ok = OmnivoiceEx.save(audio, "output.wav")
# Clean shutdown
OmnivoiceEx.stop(pid)

Voice Design

Describe a voice in natural language and OmniVoice generates it:

{:ok, audio} = OmnivoiceEx.generate(pid,
  "Welcome to our luxury resort.",
  instruct: "A warm, professional female concierge with a British accent"
)

Voice Cloning

Clone a voice from a reference audio file:

{:ok, audio} = OmnivoiceEx.generate(pid,
  "This is a cloned voice speaking English.",
  ref_audio: "/path/to/reference.wav",
  ref_text: "Transcript of the reference audio" # optional, improves quality
)

Deterministic / Reproducible Generation (v0.2.0+)

OmnivoiceEx supports fully deterministic generation, so the same text and settings produce the same output across runs.

The example below fixes seed and sets position_temperature and class_temperature to 0.0:

{:ok, audio} = OmnivoiceEx.generate(pid,
  "This output is fully reproducible.",
  seed: 12345,
  position_temperature: 0.0,
  class_temperature: 0.0
)
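
One way to check reproducibility is to run the same call more than once and compare the results. The sketch below is a minimal illustration, not part of OmnivoiceEx: ReproCheck is a hypothetical helper, and it assumes that two results compare equal as Elixir terms exactly when the generated audio is identical. Stand-in functions replace the real call so the sketch runs without a loaded model.

```elixir
defmodule ReproCheck do
  # Hypothetical helper for illustration; not part of OmnivoiceEx.
  # Calls `generate_fun` `runs` times and reports whether every
  # result was identical.
  def reproducible?(generate_fun, runs \\ 2)
      when is_function(generate_fun, 0) and runs >= 2 do
    results = for _ <- 1..runs, do: generate_fun.()
    length(Enum.uniq(results)) == 1
  end
end

# Intended use against a running server (needs a loaded model):
#
#   opts = [seed: 12345, position_temperature: 0.0, class_temperature: 0.0]
#
#   ReproCheck.reproducible?(fn ->
#     OmnivoiceEx.generate(pid, "This output is fully reproducible.", opts)
#   end)

# Stand-ins so the sketch runs without a model:
fixed = fn -> {:ok, <<1, 2, 3>>} end
varying = fn -> {:ok, :erlang.unique_integer()} end

IO.inspect(ReproCheck.reproducible?(fixed))    # true
IO.inspect(ReproCheck.reproducible?(varying))  # false
```

Comparing whole results is the simplest check. If the returned audio carries metadata that varies between runs, compare only the sample data instead.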

Under the hood (v0.2.0 fix):

Language Selection

Common IDs: zh (Chinese), en (English), ja (Japanese), ko (Korean), yue (Cantonese), fr (French), de (German), es (Spanish), ru (Russian), pt (Portuguese), it (Italian), th (Thai), vi (Vietnamese), hi (Hindi), ar (Arabic), nl (Dutch), pl (Polish), sv (Swedish), tr (Turkish).

Full list of 646 languages: OmniVoice docs/languages.md

Generation Options

Architecture

Elixir (GenServer) ←→ Erlang Port ←→ Python Bridge ←→ OmniVoice Model
                      (stdin/stdout) (msgpack framed)

The bridge uses MessagePack binary framing over Erlang Ports. Audio is sent as raw WAV bytes inside the msgpack payload, which avoids the roughly 33% size overhead that base64 encoding adds in JSON-based bridges.
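
To make the size argument concrete, the sketch below compares a raw binary with its base64 encoding and shows one common way to frame binary messages on a byte stream. It is an illustration under stated assumptions, not the library's wire format: FramingSketch is a hypothetical module, and the 4-byte big-endian length prefix (the layout Erlang ports use with the {:packet, 4} option) is an assumption, since the framing details are not specified here.

```elixir
defmodule FramingSketch do
  # Hypothetical module for illustration; not part of OmnivoiceEx.

  # Prefix a payload with its size as a 4-byte big-endian integer.
  def frame(payload) when is_binary(payload) do
    <<byte_size(payload)::unsigned-big-32, payload::binary>>
  end

  # Split one complete frame off the front of a buffer.
  # Returns :more when the buffer does not yet hold a whole frame.
  def unframe(<<len::unsigned-big-32, payload::binary-size(len), rest::binary>>) do
    {:ok, payload, rest}
  end

  def unframe(_incomplete), do: :more
end

# 3,000 bytes standing in for WAV data.
raw = :binary.copy(<<0, 1, 2>>, 1000)

# Base64 emits 4 output characters for every 3 input bytes.
base64 = Base.encode64(raw)

# Length-prefixed framing adds a fixed 4 bytes regardless of payload size.
framed = FramingSketch.frame(raw)

IO.puts("raw bytes:    #{byte_size(raw)}")     # 3000
IO.puts("base64 bytes: #{byte_size(base64)}")  # 4000
IO.puts("framed bytes: #{byte_size(framed)}")  # 3004

# The payload comes back byte for byte.
{:ok, ^raw, <<>>} = FramingSketch.unframe(framed)
```

Base64 grows the payload by one third, while a length prefix costs a constant few bytes per message. That difference grows with audio length.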

Changelog

v0.2.0

v0.1.0

License

Apache 2.0 — see LICENSE.