What Is OmniVoice? A Simple Guide to Multilingual TTS and Voice Cloning

A beginner-friendly guide to k2-fsa/OmniVoice, covering multilingual zero-shot text-to-speech, voice cloning, voice design, installation, Python and CLI usage, batch inference, deployment patterns, and voice-safety notes.

Published: Jun 4, 2026 | Updated: Jun 4, 2026 | Reading time: 11 min
Tags: OmniVoice, Text-to-Speech, Voice Cloning, Multilingual AI, Open Source AI, AI Voice

[Image: OmniVoice project image, from the OmniVoice README/project page] [1]

Quick summary

OmniVoice is an open-source text-to-speech model designed for very broad multilingual speech generation. The official repository describes it as a massively multilingual zero-shot TTS model supporting over 600 languages, with voice cloning, voice design, high-quality speech and fast inference through a diffusion language model-style architecture [2].

In plain terms: you provide text, and OmniVoice generates speech. If you also provide a short reference audio clip, the model can generate the new sentence in a similar voice. If you do not have reference audio, you can describe the desired voice using attributes such as “female, low pitch, british accent,” or let the model automatically pick a voice.

Important: OmniVoice is powerful enough to be misused. The README explicitly prohibits unauthorized voice cloning, voice impersonation, fraud, scams and other illegal or unethical uses [3].

What is OmniVoice used for?

OmniVoice is useful for multilingual speech-generation workflows:

  • video voice-over;
  • audiobook generation;
  • AI assistant voices;
  • multilingual TTS experiments;
  • lawful voice cloning from authorized reference audio;
  • voice design by speaker attributes;
  • batch audio generation;
  • research on multilingual TTS and diffusion-style speech models.

Basic flow:

Text
  + optional: reference audio / voice instruction
        ↓
OmniVoice
        ↓
24 kHz waveform audio
        ↓
WAV / voice-over / media pipeline / application

What stands out in the k2-fsa/OmniVoice repo?

The GitHub repository describes OmniVoice as “High-Quality Voice Cloning TTS for 600+ Languages,” uses the Apache-2.0 license, and publishes a Python package named omnivoice [4][5].

README highlights:

  • support for more than 600 languages;
  • voice cloning from a short reference audio;
  • voice design through speaker attributes;
  • non-verbal symbols such as [laughter] and [sigh];
  • pronunciation control with pinyin for Chinese and CMU pronunciation dictionary notation for English;
  • inference RTF as low as 0.025, or 40x faster than real time under the authors’ benchmark conditions;
  • Python API, Gradio web demo, single-item CLI and batch inference CLI [2].
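
To make the RTF figure concrete: real-time factor is synthesis time divided by audio duration, so an RTF of 0.025 corresponds to 1 / 0.025 = 40x real time. The sketch below only does that arithmetic. The helper names are illustrative, and the 0.025 value applies to the authors’ benchmark conditions, not necessarily to your hardware.

```python
def speedup_from_rtf(rtf: float) -> float:
    """Return how many times faster than real time a given RTF is."""
    if rtf <= 0:
        raise ValueError("RTF must be positive")
    return 1.0 / rtf


def synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimate wall-clock time needed to generate `audio_seconds` of speech."""
    return audio_seconds * rtf


# RTF quoted in the README highlights (benchmark conditions only).
print(f"{speedup_from_rtf(0.025):.1f}x real time")
# Time to generate a 10-minute (600 s) narration at that RTF.
print(f"{synthesis_seconds(600.0, 0.025):.1f} s")
```

Measure RTF on your own machine before relying on it for capacity planning.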

The Hugging Face model card lists the task as Text-to-Speech, states 646 languages, Apache-2.0 license, model size of 0.6B parameters, and a model tree connected to Qwen3-0.6B-Base [6].

What OmniVoice is not

| OmniVoice is | OmniVoice is not |
| --- | --- |
| A multilingual TTS model | An ASR speech-to-text tool |
| A text-to-audio generation tool | A professional audio editor |
| Capable of voice cloning and voice design | A permission system for cloning anyone’s voice |
| Usable through CLI, Python API and Gradio demo | A mandatory cloud service |
| Runnable locally with suitable hardware | Always lightweight on low-end machines |
| Useful for TTS research and applications | A substitute for consent |

If you need to turn speech into text, you need ASR. If you need to turn text into speech, OmniVoice fits the category.

What makes OmniVoice different?

According to the paper, OmniVoice uses a discrete non-autoregressive diffusion language model-style architecture. Unlike conventional two-stage TTS pipelines that go from text to semantic tokens to acoustic tokens, OmniVoice directly maps text to multi-codebook acoustic tokens [7].

The paper emphasizes two technical ideas:

  1. Full-codebook random masking for efficient training.
  2. Initialization from a pre-trained LLM to improve intelligibility.

The paper also states that OmniVoice uses 581k hours of multilingual training data curated entirely from open-source data and achieves broad language coverage and strong results across Chinese, English and multilingual benchmarks [7].

Simplified flow:

Text input
  ↓
Diffusion language model-style TTS
  ↓
Acoustic tokens
  ↓
Audio waveform

Generation modes

The README describes three main generation modes [8].

1. Voice cloning

Provide a short reference audio clip and its transcript. The model generates new speech in a similar voice.

from omnivoice import OmniVoice
import soundfile as sf
import torch

model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice",
    device_map="cuda:0",
    dtype=torch.float16,
)

audio = model.generate(
    text="Hello, this is a test of zero-shot voice cloning.",
    ref_audio="ref.wav",
    ref_text="Transcription of the reference audio.",
)

sf.write("out.wav", audio[0], 24000)

If ref_text is omitted, the README says the model will use Whisper ASR to auto-transcribe the reference audio [8].

2. Voice design

No reference audio is needed. Describe the speaker with instruct.

audio = model.generate(
    text="Hello, this is a test of zero-shot voice design.",
    instruct="female, low pitch, british accent",
)

Voice design supports attributes such as gender, age, pitch, whisper style, English accent and Chinese dialect [9].

3. Auto voice

Only provide text and let the model choose a voice:

audio = model.generate(text="This is a sentence without any voice prompt.")

Use this when speaker identity is not important.

Voice cloning best practices

The README recommends a 3–10 second reference audio clip. Longer clips slow inference and may reduce cloning quality [8].

Reference audio checklist:

  • only use voices you have rights or consent to use;
  • 3–10 seconds long;
  • clear speech, minimal noise;
  • no strong music background;
  • not clipped at beginning or end;
  • use same-language reference audio for more standard pronunciation;
  • expect cross-lingual cloning to carry some accent from the reference language [8].
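
The length item in the checklist is easy to check automatically before a clip ever reaches the model. The sketch below uses only Python’s standard wave module, so it assumes an uncompressed PCM WAV file. The helper names and the demo file are illustrative, and the check says nothing about noise, music or clipping, which still need a listen.

```python
import os
import tempfile
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_good_reference_length(path: str, min_s: float = 3.0, max_s: float = 10.0) -> bool:
    """Check the 3-10 second guideline; audio quality is not inspected."""
    return min_s <= wav_duration_seconds(path) <= max_s


# Demo: write 5 seconds of 16-bit mono silence at 24 kHz, then check it.
demo_path = os.path.join(tempfile.mkdtemp(), "ref.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)       # 2 bytes per sample = 16-bit PCM
    wav.setframerate(24000)
    wav.writeframes(b"\x00\x00" * 24000 * 5)

print(wav_duration_seconds(demo_path))      # 5.0
print(is_good_reference_length(demo_path))  # True
```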

Example:

audio = model.generate(
    text="This is a new sentence spoken using the reference voice.",
    ref_audio="voice_sample.wav",
    ref_text="This is the reference audio transcription.",
)

Voice design without reference audio

The voice design docs say instruct is a comma-separated string of speaker attributes. Each attribute belongs to a category such as gender, age, pitch, style, accent or dialect [9].

Example:

audio = model.generate(
    text="This is a voice designed without a reference audio.",
    instruct="female, young adult, high pitch, british accent",
)

Supported attribute examples:

| Category | Examples |
| --- | --- |
| Gender | male, female |
| Age | child, teenager, young adult, middle-aged, elderly |
| Pitch | very low pitch, low pitch, moderate pitch, high pitch, very high pitch |
| Style | whisper |
| English accent | american accent, british accent, indian accent, chinese accent, japanese accent |
| Chinese dialect | 四川话, 陕西话, 东北话, 青岛话, 河南话 |

The README notes that voice design is trained on Chinese and English data and may be unstable for some low-resource languages or edge cases [8][9].
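
Because instruct is a plain comma-separated string, a typo in an attribute is easy to miss. One possible guard is a small builder that validates values before joining them. In the sketch below, build_instruct and KNOWN_ATTRIBUTES are hypothetical helpers, and the whitelist contains only the example values listed in this article, so the model may well accept values that this check rejects.

```python
# Example values copied from the attribute table in this article; treat the
# whitelist as illustrative, not as the model's full vocabulary.
KNOWN_ATTRIBUTES = {
    "gender": {"male", "female"},
    "age": {"child", "teenager", "young adult", "middle-aged", "elderly"},
    "pitch": {"very low pitch", "low pitch", "moderate pitch",
              "high pitch", "very high pitch"},
    "style": {"whisper"},
    "accent": {"american accent", "british accent", "indian accent",
               "chinese accent", "japanese accent"},
    "dialect": {"四川话", "陕西话", "东北话", "青岛话", "河南话"},
}


def build_instruct(**attributes: str) -> str:
    """Join one value per category into a comma-separated instruct string."""
    parts = []
    for category, value in attributes.items():
        if category not in KNOWN_ATTRIBUTES:
            raise ValueError(f"unknown category: {category}")
        if value not in KNOWN_ATTRIBUTES[category]:
            raise ValueError(f"unknown {category} value: {value}")
        parts.append(value)
    return ", ".join(parts)


print(build_instruct(gender="female", pitch="low pitch", accent="british accent"))
# female, low pitch, british accent
```

The returned string can then be passed as instruct=... to model.generate.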

Installation

Python requirement

pyproject.toml states that omnivoice requires Python >= 3.10 and depends on torch, torchaudio, transformers, accelerate, pydub, gradio, tensorboardX, webdataset, numpy, soundfile and librosa [5].

Create a virtual environment

python -m venv .venv
source .venv/bin/activate

Windows PowerShell:

python -m venv .venv
.\.venv\Scripts\Activate.ps1

NVIDIA GPU

README example for CUDA 12.8:

pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128

Choose the PyTorch build that matches your CUDA/driver setup [10].

Apple Silicon

pip install torch==2.8.0 torchaudio==2.8.0

Then pass device_map="mps" to OmniVoice.from_pretrained(...) when loading the model [8].

Intel Arc GPU

The README says Intel Arc GPUs are supported through PyTorch’s XPU backend. Install PyTorch from Intel’s wheel index [10]:

pip install torch torchaudio --index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

Verify:

python -c "import torch; print(torch.xpu.is_available(), torch.xpu.device_count())"

Use:

device_map="xpu"
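
If the same script has to run on NVIDIA, Intel Arc and Apple Silicon machines, the device_map value can be chosen at runtime. The function below is a sketch, not part of OmniVoice: pick_device_map is a hypothetical helper, and the "cpu" fallback assumes the model can load on CPU at all, which this guide has not verified.

```python
def pick_device_map() -> str:
    """Return "cuda:0", "xpu", "mps" or "cpu" depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed, so there is nothing to probe

    if torch.cuda.is_available():
        return "cuda:0"
    # torch.xpu only exists in recent PyTorch builds, so probe defensively.
    xpu = getattr(torch, "xpu", None)
    if xpu is not None and xpu.is_available():
        return "xpu"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"  # assumption: CPU loading works; expect it to be slow


print(pick_device_map())
```

The result can be passed as device_map=pick_device_map() to OmniVoice.from_pretrained(...).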

Install OmniVoice

From PyPI:

pip install omnivoice

From GitHub:

pip install git+https://github.com/k2-fsa/OmniVoice.git

Development install:

git clone https://github.com/k2-fsa/OmniVoice.git
cd OmniVoice
pip install -e .

uv setup

git clone https://github.com/k2-fsa/OmniVoice.git
cd OmniVoice
uv sync

If model downloads from Hugging Face are difficult, the README suggests [11]:

export HF_ENDPOINT="https://hf-mirror.com"

Quickstart: web UI

The easiest starting point is the Gradio demo:

omnivoice-demo --ip 0.0.0.0 --port 8001

Then open:

http://localhost:8001

Use the web UI to test:

  • voice cloning;
  • voice design;
  • auto voice;
  • generation settings;
  • quick audio playback.

The README also links a Hugging Face Space and a Google Colab notebook for trying OmniVoice without setting up a local environment [11].

CLI: generate one audio file

Voice cloning

omnivoice-infer \
  --model k2-fsa/OmniVoice \
  --text "This is a test for text to speech." \
  --ref_audio ref.wav \
  --ref_text "Transcription of the reference audio." \
  --output hello.wav

ref_text can be omitted; Whisper will auto-transcribe the reference audio [12].

Voice design

omnivoice-infer \
  --model k2-fsa/OmniVoice \
  --text "This is a test for text to speech." \
  --instruct "male, British accent" \
  --output hello.wav

Auto voice

omnivoice-infer \
  --model k2-fsa/OmniVoice \
  --text "This is a test for text to speech." \
  --output hello.wav
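
When these commands are launched from a script, building the argument list in Python avoids shell-quoting problems with text that contains quotes or commas. This is a sketch that uses only the flags shown in the three commands above. build_infer_command is a hypothetical helper, and it assumes you pick either cloning or voice design for a given call.

```python
import shlex


def build_infer_command(text, output, ref_audio=None, ref_text=None,
                        instruct=None, model="k2-fsa/OmniVoice"):
    """Build an argv list for omnivoice-infer from the flags shown above."""
    cmd = ["omnivoice-infer", "--model", model, "--text", text]
    if ref_audio:
        # Voice cloning: the transcript is optional.
        cmd += ["--ref_audio", ref_audio]
        if ref_text:
            cmd += ["--ref_text", ref_text]
    elif instruct:
        # Voice design: describe the speaker instead of supplying audio.
        cmd += ["--instruct", instruct]
    # With neither option the command falls back to auto voice.
    cmd += ["--output", output]
    return cmd


cmd = build_infer_command("This is a test for text to speech.", "hello.wav",
                          instruct="male, British accent")
print(shlex.join(cmd))
# To run it: subprocess.run(cmd, check=True)
```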

CLI: batch inference

omnivoice-infer-batch runs batch inference and can distribute work across multiple GPUs [12].

omnivoice-infer-batch \
  --model k2-fsa/OmniVoice \
  --test_list test.jsonl \
  --res_dir results/

Example test.jsonl line:

{"id": "sample_001", "text": "Hello world", "ref_audio": "/path/to/ref.wav", "ref_text": "Reference transcript", "instruct": "female, british accent", "language_id": "en", "duration": 10.0, "speed": 1.0}

The README says only id and text are mandatory. ref_audio and ref_text are for voice cloning, instruct is for voice design, and language_id, duration and speed are optional [12].
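
A batch list like this is usually generated rather than written by hand. The sketch below serializes a list of dictionaries to JSONL with the standard json module and rejects entries that lack the two mandatory fields. to_jsonl and REQUIRED_FIELDS are illustrative names, and the sample items are made up.

```python
import json

REQUIRED_FIELDS = ("id", "text")


def to_jsonl(items):
    """Serialize a list of dicts to JSONL text, one JSON object per line."""
    lines = []
    for i, item in enumerate(items):
        missing = [f for f in REQUIRED_FIELDS if not item.get(f)]
        if missing:
            raise ValueError(f"item {i} is missing required field(s): {missing}")
        # ensure_ascii=False keeps non-Latin text readable in the file.
        lines.append(json.dumps(item, ensure_ascii=False))
    return "\n".join(lines) + "\n"


items = [
    {"id": "sample_001", "text": "Hello world", "instruct": "female, british accent"},
    {"id": "sample_002", "text": "Good morning", "language_id": "en", "speed": 1.0},
]
jsonl_text = to_jsonl(items)
print(jsonl_text, end="")
# To save it: open("test.jsonl", "w", encoding="utf-8").write(jsonl_text)
```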

Speed and duration control

The generation-parameters docs list controls such as num_step, speed, duration, guidance_scale, position_temperature, class_temperature and long-form chunking settings [13].

Example:

audio = model.generate(
    text="Hello, this is a test of duration control.",
    num_step=32,
    speed=1.2,
)

Fixed 10-second output:

audio = model.generate(
    text="Hello, this is a test of duration control.",
    duration=10.0,
)

Priority: if both are set, duration takes precedence over speed.

If exact duration matters, the docs recommend setting postprocess_output=False, because silence trimming may make the output shorter than the requested duration [13].
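
One way to keep these rules in a single place is a small helper that builds the timing arguments before calling model.generate. This is a sketch of that idea using only the parameter names mentioned above (duration, speed, postprocess_output). timing_kwargs itself is a hypothetical helper, not part of OmniVoice.

```python
def timing_kwargs(duration=None, speed=None):
    """Build keyword arguments for model.generate(...) timing control."""
    kwargs = {}
    if duration is not None:
        # duration takes priority over speed, so speed is not passed along.
        kwargs["duration"] = float(duration)
        # Skip post-processing so silence trimming cannot shorten the output.
        kwargs["postprocess_output"] = False
    elif speed is not None:
        kwargs["speed"] = float(speed)
    return kwargs


print(timing_kwargs(duration=10, speed=1.2))
# {'duration': 10.0, 'postprocess_output': False}
print(timing_kwargs(speed=1.2))
# {'speed': 1.2}
```

Usage would look like model.generate(text=..., **timing_kwargs(duration=10)).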

Long-form generation

The generation-parameters docs say long text is automatically split into smaller segments when the estimated speech duration exceeds audio_chunk_threshold. Each segment generates roughly audio_chunk_duration seconds, allowing long-form speech with near-constant VRAM use [13].

Example:

audio = model.generate(
    text=long_text,
    audio_chunk_duration=15.0,
    audio_chunk_threshold=30.0,
)
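
To get a feel for how the two settings interact, the arithmetic can be sketched without the model. The function below is not how OmniVoice estimates duration or splits text. It assumes a made-up speaking rate (chars_per_second) purely to illustrate the threshold-then-chunk logic described above.

```python
import math


def estimate_chunks(text, chars_per_second=15.0,
                    audio_chunk_duration=15.0, audio_chunk_threshold=30.0):
    """Roughly estimate how many segments a text would be split into.

    chars_per_second is an invented speaking-rate guess, not a value from
    the OmniVoice docs; the library's own duration estimate will differ.
    """
    estimated_seconds = len(text) / chars_per_second
    if estimated_seconds <= audio_chunk_threshold:
        return 1  # short enough to generate in one pass
    return math.ceil(estimated_seconds / audio_chunk_duration)


print(estimate_chunks("word " * 60))   # 300 chars, about 20 s: 1 segment
print(estimate_chunks("word " * 600))  # 3000 chars, about 200 s: 14 segments
```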

Use cases:

  • audiobooks;
  • long narration;
  • blog reading;
  • tutorial voice-over;
  • batch content generation.

Non-verbal symbols and pronunciation control

OmniVoice supports inline non-verbal tags [14]:

audio = model.generate(
    text="[laughter] You really got me. I didn't see that coming at all."
)

Supported examples:

[laughter], [sigh], [confirmation-en], [question-en],
[question-ah], [question-oh], [question-ei], [question-yi],
[surprise-ah], [surprise-oh], [surprise-wa], [surprise-yo],
[dissatisfaction-hnn]

Chinese pronunciation can be corrected with pinyin tone numbers [14]:

audio = model.generate(
    text="这批货物打ZHE2出售后他严重SHE2本了,再也经不起ZHE1腾了。"
)

English pronunciation can be overridden with CMU pronouncing dictionary phonemes, written in uppercase inside square brackets [14]:

audio = model.generate(
    text="He plays the [B EY1 S] guitar while catching a [B AE1 S] fish."
)
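
Since non-verbal tags and English pronunciation overrides both use square brackets, a pre-flight check can catch misspelled tags before generation. The sketch below classifies each bracketed span. The tag set is the list of supported examples given above, the CMU pattern is a loose approximation (uppercase symbols with an optional stress digit), and classify_tags is a hypothetical helper.

```python
import re

NON_VERBAL_TAGS = {
    "laughter", "sigh", "confirmation-en", "question-en",
    "question-ah", "question-oh", "question-ei", "question-yi",
    "surprise-ah", "surprise-oh", "surprise-wa", "surprise-yo",
    "dissatisfaction-hnn",
}
BRACKET = re.compile(r"\[([^\[\]]+)\]")
# CMU-style phonemes: uppercase symbols, optionally followed by stress 0-2.
CMU_SEQUENCE = re.compile(r"[A-Z]+[0-2]?(?: [A-Z]+[0-2]?)*")


def classify_tags(text):
    """Return (tag, kind) pairs; kind is non-verbal, pronunciation or unknown."""
    result = []
    for tag in BRACKET.findall(text):
        if tag in NON_VERBAL_TAGS:
            kind = "non-verbal"
        elif CMU_SEQUENCE.fullmatch(tag):
            kind = "pronunciation"
        else:
            kind = "unknown"
        result.append((tag, kind))
    return result


print(classify_tags("[laughter] He plays the [B EY1 S] guitar. [laughs]"))
# [('laughter', 'non-verbal'), ('B EY1 S', 'pronunciation'), ('laughs', 'unknown')]
```

Anything reported as unknown is worth a second look before the text is sent to the model.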

Which mode should you choose?

| Need | Mode |
| --- | --- |
| Use a lawful reference voice | Voice cloning |
| Create a voice by description | Voice design |
| Generate quick audio with no fixed speaker | Auto voice |
| Generate many outputs | Batch inference |
| Test interactively | omnivoice-demo |
| Integrate into a backend | Python API |
| Run offline jobs | CLI + JSONL |

Personal setup guide

Simple NVIDIA GPU setup:

python -m venv .venv
source .venv/bin/activate

pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128
pip install omnivoice

omnivoice-demo --ip 127.0.0.1 --port 8001

CLI test:

omnivoice-infer \
  --model k2-fsa/OmniVoice \
  --text "Hello, this is a basic OmniVoice test." \
  --output out.wav

Voice cloning:

omnivoice-infer \
  --model k2-fsa/OmniVoice \
  --text "This new sentence is generated using the reference voice." \
  --ref_audio ref.wav \
  --ref_text "This is the reference audio transcription." \
  --output cloned.wav

Backend deployment pattern

Basic backend architecture:

API request
  ↓
Validate text + voice mode
  ↓
OmniVoice worker
  ↓
WAV output
  ↓
Storage/CDN
  ↓
Return URL

Practical suggestions:

  • Do not run long inference directly in the web request thread.
  • Use a queue such as Redis/RQ, Celery, Sidekiq or a dedicated worker.
  • Cache output by text + voice configuration hash.
  • Limit text length and output duration.
  • Delete temporary files.
  • Separate model workers from the API server.
  • Record metadata: model version, mode, text length, duration and reference audio ID.
  • Avoid logging raw reference audio or private transcripts.
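
The caching suggestion can be sketched with the standard library alone. The idea is to hash the text together with the voice configuration so that identical requests map to the same stored file. cache_key and its field names are illustrative. For cloned voices, a reference audio ID would go into the configuration instead of the raw audio, in line with the logging advice above.

```python
import hashlib
import json


def cache_key(text, voice_config, model_version="unversioned"):
    """Return a stable SHA-256 hex digest for a (text, voice config) pair."""
    payload = {
        "model_version": model_version,
        "text": text,
        "voice": voice_config,
    }
    # sort_keys makes the digest independent of dict insertion order.
    blob = json.dumps(payload, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


key_a = cache_key("Hello world", {"instruct": "female, low pitch", "speed": 1.0})
key_b = cache_key("Hello world", {"speed": 1.0, "instruct": "female, low pitch"})
key_c = cache_key("Hello world", {"instruct": "male, low pitch", "speed": 1.0})
print(key_a == key_b, key_a == key_c)  # True False
```

Including the model version in the key means a model upgrade does not silently serve audio generated by the old weights.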

Content-production workflow

Suggested workflow:

  1. Write the script.
  2. Normalize numbers, dates and symbols.
  3. Choose cloning, design or auto voice.
  4. Generate a draft.
  5. Human editor listens and checks pronunciation.
  6. Correct pronunciation using pinyin/CMU notation if needed.
  7. Export WAV.
  8. Mix/master in audio or video editing software.

Do not automatically publish cloned-voice outputs without review. TTS can produce pronunciation errors, bad pauses or unwanted sounds.

Notes for Vietnamese and other languages

OmniVoice supports more than 600 languages, but real quality depends on training data, punctuation, numeric normalization and the reference audio [15].

For Vietnamese, practical tips:

  • write full diacritics;
  • use clear punctuation;
  • spell numbers as words if numeric reading is bad;
  • avoid ambiguous abbreviations;
  • use Vietnamese reference audio for natural pronunciation;
  • check names, English mixed text and product codes;
  • split long paragraphs into shorter sentences.

Min Nan / Hokkien note

The tips docs say Min Nan Chinese / Hokkien currently supports only Tai-lo romanization input; Chinese characters are not supported for this language in the current model version [16].

This shows that “language support” does not always mean every writing system or orthography variant is supported equally.

Training, evaluation and fine-tuning

The README points to examples/ for the complete pipeline from data preparation to training, evaluation and fine-tuning [17].

Use this path if you need to:

  • fine-tune on a domain;
  • evaluate on internal data;
  • build an internal benchmark;
  • improve pronunciation or speaker style;
  • study multilingual TTS architecture.

Most users should start with the pretrained model before considering fine-tuning.

Safety and ethics

Voice cloning is higher risk than many AI tools because it can imitate real people. The README prohibits unauthorized voice cloning, impersonation, fraud, scams and other illegal or unethical activities [3].

Required checklist:

  • Only clone voices with clear permission.
  • Do not impersonate real people.
  • Do not use generated voices for fraud, scam calls or fake evidence.
  • Do not publish private voice datasets.
  • Add watermarking or metadata where appropriate.
  • Store consent records for voice samples.
  • Use human review before publishing.
  • Add abuse detection and rate limits for public systems.
  • Provide a deletion process for reference audio.

When should you use OmniVoice?

Use it when:

  • you need multilingual TTS;
  • you have lawful reference audio for voice cloning;
  • you want voice design by speaker attributes;
  • you create voice-over, audiobooks, AI assistants or demos;
  • you research multilingual TTS or diffusion-style TTS;
  • you need batch speech generation.

Do not use it when:

  • you do not have the right to use a reference voice;
  • legal or financial decisions depend on unreviewed generated audio;
  • you need ASR/speech-to-text;
  • your machine cannot run the model;
  • you require strict real-time guarantees without benchmarking;
  • you do not have a misuse-prevention policy.

OmniVoice vs previous repositories

| Repository | Primary purpose |
| --- | --- |
| OmniVoice | multilingual TTS, voice cloning and voice design |
| PaddleOCR | OCR and document parsing from images/PDFs |
| MarkItDown | document-to-Markdown conversion |
| NVIDIA Cosmos | world models for Physical AI |
| Claude Tap | trace/debug AI coding agents |
| Headroom | compress LLM context/tool output |
| RTK | compress CLI output for coding agents |

OmniVoice is in the speech generation category, not document processing, coding-agent tooling or Physical AI.

FAQ

What is OmniVoice?

OmniVoice is a zero-shot multilingual text-to-speech model from k2-fsa, supporting over 600 languages, voice cloning and voice design [2].

How many languages does OmniVoice support?

The README says 600+ languages; the supported languages file states 646 languages and 581k hours of training data [2][15].

How long should reference audio be?

The README recommends a 3–10 second reference audio clip. Longer audio can slow inference and degrade cloning quality [8].

Is ref_text required?

No. If ref_text is omitted, the README says the model uses Whisper ASR to auto-transcribe the reference audio [8].

Is voice design stable?

The README says voice cloning is the most stable mode. Voice design is trained on Chinese and English data and may be unstable for some low-resource languages or edge cases [8].

What license does OmniVoice use?

The GitHub repository and Hugging Face model card list Apache-2.0 [4][6].

Conclusion

k2-fsa/OmniVoice is notable because it combines very broad language coverage, short-reference voice cloning and text-based voice design in one open TTS project. Beginners should start with omnivoice-demo, then move to omnivoice-infer or the Python API for integration work.

The most important deployment point is not technical. Because OmniVoice supports voice cloning, any production workflow needs consent, misuse controls, access limits, human review and a clear policy. Used responsibly, it is useful for voice-over, audiobook generation, AI assistants and TTS research. Used irresponsibly, it can enable voice impersonation.

References

  1. OmniVoice project image used in README and Hugging Face model card. https://zhu-han.github.io/omnivoice/pics/omnivoice.jpg
  2. OmniVoice README. https://raw.githubusercontent.com/k2-fsa/OmniVoice/master/README.md
  3. OmniVoice README, Disclaimer section. https://raw.githubusercontent.com/k2-fsa/OmniVoice/master/README.md
  4. GitHub. k2-fsa/OmniVoice. https://github.com/k2-fsa/OmniVoice
  5. OmniVoice pyproject.toml. https://raw.githubusercontent.com/k2-fsa/OmniVoice/master/pyproject.toml
  6. Hugging Face model card for k2-fsa/OmniVoice. https://huggingface.co/k2-fsa/OmniVoice
  7. arXiv paper “OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models.” https://arxiv.org/abs/2604.00688
  8. OmniVoice README, Python API section. https://raw.githubusercontent.com/k2-fsa/OmniVoice/master/README.md
  9. OmniVoice voice design docs. https://raw.githubusercontent.com/k2-fsa/OmniVoice/master/docs/voice-design.md
  10. OmniVoice README, Installation section. https://raw.githubusercontent.com/k2-fsa/OmniVoice/master/README.md
  11. OmniVoice README, Quick Start section. https://raw.githubusercontent.com/k2-fsa/OmniVoice/master/README.md
  12. OmniVoice README, Command-Line Tools section. https://raw.githubusercontent.com/k2-fsa/OmniVoice/master/README.md
  13. OmniVoice generation parameters docs. https://raw.githubusercontent.com/k2-fsa/OmniVoice/master/docs/generation-parameters.md
  14. OmniVoice README, Non-Verbal & Pronunciation Control section. https://raw.githubusercontent.com/k2-fsa/OmniVoice/master/README.md
  15. OmniVoice supported languages file. https://raw.githubusercontent.com/k2-fsa/OmniVoice/master/docs/languages.md
  16. OmniVoice tips file. https://raw.githubusercontent.com/k2-fsa/OmniVoice/master/docs/tips.md
  17. OmniVoice README, Training & Evaluation section. https://raw.githubusercontent.com/k2-fsa/OmniVoice/master/README.md

Written by PixelRouter Editorial Team

We publish deep, authoritative guides on AI infrastructure, API gateway security, cloud financial management, and system optimizations for developers.
