OmnivoiceEx

Elixir wrapper for OmniVoice — a unified speech generation model from K2-FSA.

Voice Cloning · Voice Design · Multilingual TTS · Deterministic Generation · 24kHz Output

Features

🎤 Voice Cloning — Clone any voice from a short reference audio clip
🎨 Voice Design — Describe a voice in natural language ("warm female broadcaster", "deep authoritative narrator")
🌍 Multilingual — Supports multiple languages with automatic detection; 646 languages available
🔁 Deterministic Generation — Stable, reproducible outputs via seed and temperature controls
⚡ GPU Optimized — CUDA, Apple Silicon (MPS), or CPU fallback
🔊 24kHz WAV — Professional-grade audio output
📦 MessagePack Protocol — Zero-base64 binary transport over Erlang Ports

Requirements

Elixir ≥ 1.14
Python ≥ 3.10
CUDA GPU (recommended), Apple Silicon MPS, or CPU
omnivoice pip package (auto-installed via mix omnivoice_ex.setup)

Installation

Add to your mix.exs:

def deps do
  [
    {:omnivoice_ex, "~> 0.2.0"}
  ]
end

Then install Python dependencies:

mix omnivoice_ex.setup

Quick Start

# Start the model server
{:ok, pid} = OmnivoiceEx.start_link(device: "cuda")

# Wait for model to load
:ok = OmnivoiceEx.await_ready(pid)

# Generate speech
{:ok, audio} = OmnivoiceEx.generate(pid, "Hello, world!")

# Save to file
:ok = OmnivoiceEx.save(audio, "output.wav")

# Clean shutdown
OmnivoiceEx.stop(pid)

Voice Design

Describe a voice in natural language and OmniVoice generates it:

{:ok, audio} = OmnivoiceEx.generate(pid,
  "Welcome to our luxury resort.",
  instruct: "A warm, professional female concierge with a British accent"
)

Voice Cloning

Clone a voice from a reference audio file:

{:ok, audio} = OmnivoiceEx.generate(pid,
  "This is a cloned voice speaking English.",
  ref_audio: "/path/to/reference.wav",
  ref_text: "Transcript of the reference audio"  # optional, improves quality
)

Deterministic / Reproducible Generation (v0.2.0+)

OmnivoiceEx now supports fully deterministic generation for stable outputs across runs. This is useful for:

A/B testing prompts and settings
CI pipelines that validate audio
Production systems requiring consistent behavior

Key options:

seed: Random seed for reproducible generation (default: 42)
position_temperature: Mask-position selection temperature; 0 = greedy/deterministic (default: 0.0)
class_temperature: Token sampling temperature; 0 = greedy/deterministic (default: 0.0)

Example:

{:ok, audio} = OmnivoiceEx.generate(pid,
  "This output is fully reproducible.",
  seed: 12345,
  position_temperature: 0.0,
  class_temperature: 0.0
)

Under the hood (v0.2.0 fix):

Sets CUBLAS_WORKSPACE_CONFIG for deterministic cuBLAS (CUDA ≥ 10.2)
Enables torch.backends.cudnn.deterministic = True and best-effort use_deterministic_algorithms(True)
Seeds torch, CUDA, and NumPy RNGs before each generation
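
One way to confirm reproducibility in a CI pipeline is to generate the same text twice with identical options and compare fingerprints of the results. The sketch below is a hypothetical helper, not part of OmnivoiceEx: MyApp.ReproCheck and its function names are illustrative, and the generator is passed in as a function so the check can also be run against a stub.

```elixir
defmodule MyApp.ReproCheck do
  @moduledoc "Hypothetical helper (not part of OmnivoiceEx) for checking stable output."

  # Runs `generate_fun` twice with the same text and options and reports
  # whether both results are identical.
  def same_output?(generate_fun, text, opts) when is_function(generate_fun, 2) do
    {:ok, first} = generate_fun.(text, opts)
    {:ok, second} = generate_fun.(text, opts)
    fingerprint(first) == fingerprint(second)
  end

  # Hashes any Erlang term, so it does not matter whether the audio value
  # is a raw binary or a struct.
  def fingerprint(term) do
    :crypto.hash(:sha256, :erlang.term_to_binary(term))
    |> Base.encode16(case: :lower)
  end
end
```

With a running server, generate_fun could be fn text, opts -> OmnivoiceEx.generate(pid, text, opts) end, called with a fixed seed and both temperatures at 0.0.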

Language Selection

language: OmniVoice language ID (e.g. "zh", "en", "ja", "ko", "yue"). Auto-detected from text if omitted. For mixed-language content, set this explicitly to avoid unstable detection.

Common IDs: zh (Chinese), en (English), ja (Japanese), ko (Korean), yue (Cantonese), fr (French), de (German), es (Spanish), ru (Russian), pt (Portuguese), it (Italian), th (Thai), vi (Vietnamese), hi (Hindi), ar (Arabic), nl (Dutch), pl (Polish), sv (Swedish), tr (Turkish).

Full list of 646 languages: OmniVoice docs/languages.md

Generation Options

ref_audio — Path to reference audio for cloning
ref_text — Transcript of reference audio (improves clone quality)
instruct — Voice instruction for design (e.g. "A warm female broadcaster")
language — Language ID; auto-detected if omitted
duration — Target duration in seconds
speed — Playback speed factor
num_step — Diffusion steps (higher = better quality, slower). Default: 32
guidance_scale — Classifier-free guidance. Default: 2.0
seed — Random seed for reproducible generation. Default: 42
position_temperature — Mask-position selection temperature; 0 = greedy/deterministic. Default: 0.0
class_temperature — Token sampling temperature; 0 = greedy/deterministic. Default: 0.0
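
Callers that build option lists in several places may want to centralize them. The following is a minimal sketch, assuming the option names and defaults listed above; MyApp.TTSOptions is a hypothetical module, and the library applies its own defaults regardless. It uses Keyword.validate!/2 so that a misspelled key fails loudly instead of being silently ignored.

```elixir
defmodule MyApp.TTSOptions do
  @moduledoc "Hypothetical helper; mirrors the documented option names and defaults."

  # Bare atoms are allowed keys without a default; key/value pairs carry
  # the defaults documented in the options list.
  @known [
    :ref_audio,
    :ref_text,
    :instruct,
    :language,
    :duration,
    :speed,
    num_step: 32,
    guidance_scale: 2.0,
    seed: 42,
    position_temperature: 0.0,
    class_temperature: 0.0
  ]

  # Returns the options with defaults filled in.
  # Raises ArgumentError when an unknown key is passed.
  def build(opts \\ []), do: Keyword.validate!(opts, @known)
end

opts = MyApp.TTSOptions.build(seed: 7, language: "en")
```

The resulting keyword list can be passed as the third argument to OmnivoiceEx.generate/3.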

Architecture

Elixir (GenServer) ←→ Erlang Port ←→ Python Bridge ←→ OmniVoice Model
                    (stdin/stdout)   (msgpack framed)

Uses MessagePack binary framing over Erlang Ports — audio is transmitted as raw WAV bytes inside msgpack, eliminating the 33% base64 overhead of JSON-based solutions.
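
The 33% figure follows from base64 encoding every 3 input bytes as 4 output characters. The snippet below checks this on a stand-in payload. The 16-bit mono format is an assumption made for the arithmetic; only the 24 kHz rate comes from this README.

```elixir
# Ten seconds of silence as 24 kHz, 16-bit mono PCM (bit depth and channel
# count are assumptions for illustration).
raw = :binary.copy(<<0>>, 24_000 * 2 * 10)

# The same payload as it would travel inside a JSON string.
encoded = Base.encode64(raw)

# Relative size increase in percent.
overhead = (byte_size(encoded) - byte_size(raw)) / byte_size(raw) * 100

IO.puts("raw: #{byte_size(raw)} bytes")
IO.puts("base64: #{byte_size(encoded)} bytes")
IO.puts("overhead: #{Float.round(overhead, 1)}%")
```

For this payload, 480,000 raw bytes become 640,000 base64 characters, an overhead of about 33.3%.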

Changelog

v0.2.0

Fixed initialization bug: removed stderr_to_stdout from port options to avoid blocking and startup issues
Added deterministic generation support:
- New options: seed, position_temperature, class_temperature
- cuBLAS and cuDNN determinism settings for stable GPU outputs
Improved language documentation with common IDs and link to full list

v0.1.0

Initial release: Voice Cloning, Voice Design, multilingual TTS, 24kHz WAV, MessagePack protocol

Production & Engineering

This section provides practical guidance for using OmnivoiceEx in real systems: concurrency, reliability, monitoring, and common pitfalls.

Concurrency and Request Handling

The Python bridge is a single OS process owned by one GenServer. It runs one generation at a time, so concurrent generate/3 requests queue up behind one another.
Recommended patterns:
- Use a single server per node in most cases:
  - Start once at application startup, share it via a named GenServer or Supervisor.
- For high-load clusters:
  - Run one OmnivoiceEx instance per GPU (or per model replica).
  - Distribute requests across nodes using a load balancer or job queue.

Example: named server in supervision tree

defmodule MyApp.OmniVoiceSupervisor do
  use Supervisor

  def start_link(opts) do
    Supervisor.start_link(__MODULE__, opts, name: __MODULE__)
  end

  def init(_opts) do
    children = [
      {OmnivoiceEx,
       name: OmniVoiceServer,
       device: System.get_env("OMNIVOICE_DEVICE") || "cuda",
       model: System.get_env("OMNIVOICE_MODEL") || "k2-fsa/OmniVoice"}
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end
end

# Usage elsewhere:
{:ok, audio} = OmnivoiceEx.generate(OmniVoiceServer, "Hello!", seed: 1)
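
When several instances run on the same node (for example one per GPU), requests need to be spread across them. The following is a minimal round-robin sketch. MyApp.TTSRouter and the server names are hypothetical, and it assumes each instance was started under its own name.

```elixir
defmodule MyApp.TTSRouter do
  @moduledoc "Hypothetical round-robin picker; not part of OmnivoiceEx."

  # Creates a shared lock-free counter. Keep the returned reference somewhere
  # globally reachable, for example in :persistent_term.
  def new_counter, do: :atomics.new(1, signed: false)

  # Returns the next server name in rotation.
  def pick(counter, servers) when servers != [] do
    n = :atomics.add_get(counter, 1, 1)
    Enum.at(servers, rem(n, length(servers)))
  end
end

# Hypothetical names, one per running OmnivoiceEx instance.
servers = [:omnivoice_gpu0, :omnivoice_gpu1]
counter = MyApp.TTSRouter.new_counter()
server = MyApp.TTSRouter.pick(counter, servers)
```

The picked name can then be used as the first argument to OmnivoiceEx.generate/3. Round-robin ignores how long each request takes, so a job queue may balance better when text lengths vary widely.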

Timeouts and Backpressure

generate/3 and await_ready/2 accept a timeout argument.
In production:
- Never use unlimited timeouts in HTTP handlers or worker loops.
- Use per-request timeouts (e.g., 30–120s depending on length) so long-running or stuck generations do not freeze processes.

Example:

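# Assumes the timeout is passed as a :timeout key in the options, matching generate/3.
case OmnivoiceEx.generate(OmniVoiceServer, text, Keyword.put(opts, :timeout, 60_000)) do
  {:ok, audio} ->
    # handle audio, e.g. write it to disk
    OmnivoiceEx.save(audio, "output.wav")
  {:error, :timeout} ->
    # fallback / retry / user message
    {:error, :timeout}
  {:error, reason} ->
    # log and handle
    {:error, reason}
end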

If your system is under heavy load:

Consider a job queue or worker pool (e.g., Oban, Poolboy) to isolate TTS jobs.
Use backoff + limited retries instead of aggressive parallel attempts on the same server.
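
Backoff with limited retries can be sketched as a small wrapper around any function that returns {:ok, _} or {:error, _}. MyApp.Retry and its option names (attempts, base_ms) are hypothetical and not part of OmnivoiceEx.

```elixir
defmodule MyApp.Retry do
  @moduledoc "Hypothetical retry helper with exponential backoff."

  # Calls `fun` up to `:attempts` times, sleeping base_ms, 2 * base_ms,
  # 4 * base_ms, ... between failed attempts.
  def with_backoff(fun, opts \\ []) when is_function(fun, 0) do
    attempts = Keyword.get(opts, :attempts, 3)
    base_ms = Keyword.get(opts, :base_ms, 500)
    do_retry(fun, 1, attempts, base_ms)
  end

  defp do_retry(fun, attempt, max, base_ms) do
    case fun.() do
      {:ok, _} = ok ->
        ok

      # Out of attempts: return the last error unchanged.
      {:error, _} = err when attempt >= max ->
        err

      {:error, _} ->
        Process.sleep(base_ms * Integer.pow(2, attempt - 1))
        do_retry(fun, attempt + 1, max, base_ms)
    end
  end
end
```

A call might look like MyApp.Retry.with_backoff(fn -> OmnivoiceEx.generate(OmniVoiceServer, text, opts) end, attempts: 3). Because requests are serialized by the bridge, retrying sequentially avoids stacking more work onto a server that is already behind.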

Error Handling

OmnivoiceEx can return errors from:

Model failures / OOM
Invalid inputs
Bridge crashes

General pattern:

require Logger

case OmnivoiceEx.generate(OmniVoiceServer, text, opts) do
  {:ok, audio} ->
    # success
    {:ok, audio}

  {:error, :timeout} ->
    Logger.warning("TTS request timed out")

  {:error, msg} when is_binary(msg) ->
    Logger.error("TTS bridge error: #{msg}")

  {:error, other} ->
    Logger.error("TTS unexpected error: #{inspect(other)}")
end

If the Python bridge process exits unexpectedly:

The GenServer will transition to an error status.
In production, wrap OmnivoiceEx in a Supervisor with restart: :transient or :permanent depending on your policy.
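
The restart policy can be set by overriding the child spec with Supervisor.child_spec/2. With :permanent the child is always restarted; with :transient it is restarted only after an abnormal exit. The sketch below uses a hypothetical MyApp.TTSChild helper and assumes the server module defines child_spec/1, as the supervision tree example suggests for OmnivoiceEx.

```elixir
defmodule MyApp.TTSChild do
  @moduledoc "Hypothetical helper for building supervised TTS children."

  # Builds a child spec for `server_module` with an explicit restart policy.
  # The :name option doubles as the child id, so several instances can live
  # under one supervisor without id clashes.
  def spec(server_module, opts, restart \\ :permanent)
      when restart in [:permanent, :transient] do
    Supervisor.child_spec({server_module, opts},
      id: Keyword.get(opts, :name, server_module),
      restart: restart
    )
  end
end
```

In a supervisor this could be used as children = [MyApp.TTSChild.spec(OmnivoiceEx, name: OmniVoiceServer, device: "cuda")].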

Telemetry and Monitoring

OmnivoiceEx emits telemetry events you can use for observability:

[:omnivoice_ex, :generate]:
- Measurements: %{duration_ms: float()}, the time taken to generate audio.
[:omnivoice_ex, :await_ready]:
- Measurements: model load duration.

Example: attach a handler in your application:

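defmodule MyApp do
  use Application
  require Logger

  def start(_type, _args) do
    # Attach handlers before the supervisor starts so early events are not missed.
    :telemetry.attach_many(
      "omnivoice-ex-logger",
      [
        [:omnivoice_ex, :generate],
        [:omnivoice_ex, :await_ready]
      ],
      &__MODULE__.handle_event/4,
      nil
    )

    children = [
      MyApp.OmniVoiceSupervisor
    ]

    # start/2 must return the supervisor's {:ok, pid}, so this call comes last.
    Supervisor.start_link(children, strategy: :one_for_one)
  end

  def handle_event(event, measurements, _meta, _config) do
    # measurements[:duration_ms] is nil if an event does not report that key.
    Logger.debug(
      "OmnivoiceEx #{inspect(event)} duration_ms=#{inspect(measurements[:duration_ms])}"
    )
  end
end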

You can plug this into Prometheus, Grafana, or your internal metrics stack.

Deployment Notes

Python environment:
- Run mix omnivoice_ex.setup in your build / deploy step to install required pip packages.
- Ensure the same Python interpreter is available at runtime as during setup.
GPU:
- Use a dedicated container or VM with stable GPU drivers and sufficient memory.
- Avoid overcommitting GPUs; OmniVoice + large context can use several GBs of VRAM.
Determinism in production:
- For content pipelines where you must be able to reproduce outputs (e.g., logs, audits), always pass a seed and keep temperatures at 0.

Common Pitfalls (FAQ-style)

“Why is startup slow?”
- The model loads into memory on first start. This is expected; use await_ready/2 with a generous timeout and keep the server running instead of restarting it frequently.
“Why are concurrent requests blocking each other?”
- The Python bridge processes one request at a time. For high concurrency, deploy multiple instances (e.g., one per GPU) and load-balance between them.
“Audio sounds different across runs.”
- If you need stable output, set seed, position_temperature: 0.0, class_temperature: 0.0. Without these, outputs may vary due to stochastic sampling.
“Language detection seems random for mixed-language text.”
- Set language explicitly (e.g., "zh" or "en") when mixing languages in the same prompt.

License

Apache 2.0 — see LICENSE.

OmniVoice on HuggingFace
VoxCPMEx — Elixir wrapper for VoxCPM2