What Is OmniVoice? A Simple Guide to Multilingual TTS and Voice Cloning
A beginner-friendly guide to k2-fsa/OmniVoice, covering multilingual zero-shot text-to-speech, voice cloning, voice design, installation, Python and CLI usage, batch inference, deployment patterns, and voice-safety notes.

Figure: OmniVoice project image from the README/project page.1
Quick summary
OmniVoice is an open-source text-to-speech model designed for very broad multilingual speech generation. The official repository describes it as a massively multilingual zero-shot TTS model supporting over 600 languages, with voice cloning, voice design, high-quality speech and fast inference through a diffusion language model-style architecture.2
In plain terms: you provide text, and OmniVoice generates speech. If you also provide a short reference audio clip, the model can generate the new sentence in a similar voice. If you do not have reference audio, you can describe the desired voice using attributes such as “female, low pitch, british accent,” or let the model automatically pick a voice.
Important: OmniVoice is powerful enough to be misused. The README explicitly prohibits unauthorized voice cloning, voice impersonation, fraud, scams and other illegal or unethical uses.3
What is OmniVoice used for?
OmniVoice is useful for multilingual speech-generation workflows:
- video voice-over;
- audiobook generation;
- AI assistant voices;
- multilingual TTS experiments;
- lawful voice cloning from authorized reference audio;
- voice design by speaker attributes;
- batch audio generation;
- research on multilingual TTS and diffusion-style speech models.
Basic flow:
Text
+ optional: reference audio / voice instruction
↓
OmniVoice
↓
24 kHz waveform audio
↓
WAV / voice-over / media pipeline / application
What stands out in the k2-fsa/OmniVoice repo?
The GitHub repository describes OmniVoice as “High-Quality Voice Cloning TTS for 600+ Languages,” uses the Apache-2.0 license, and publishes a Python package named omnivoice.4,5
README highlights:
- support for more than 600 languages;
- voice cloning from a short reference audio;
- voice design through speaker attributes;
- non-verbal symbols such as [laughter] and [sigh];
- pronunciation control with pinyin for Chinese and CMU pronunciation dictionary notation for English;
- inference RTF as low as 0.025, or 40x faster than real time under the authors’ benchmark conditions;
- Python API, Gradio web demo, single-item CLI and batch inference CLI.2
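The real-time factor (RTF) figure above is compute time divided by audio duration, so the speed-up is its reciprocal. A minimal sketch of that arithmetic follows; the helper names are illustrative and not part of OmniVoice, and the 0.025 value is the best-case figure from the authors' benchmark, not a guarantee for your hardware.

```python
def speedup_from_rtf(rtf: float) -> float:
    """Return how many times faster than real time a given RTF is."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


def synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimate the compute time needed to generate `audio_seconds` of speech."""
    return audio_seconds * rtf


rtf = 0.025  # best-case RTF quoted in the README highlights
print(f"speed-up: {speedup_from_rtf(rtf):.0f}x real time")
print(f"60 s of audio needs about {synthesis_seconds(60, rtf):.1f} s of compute")
```

Measure RTF on your own GPU and text lengths before relying on it for capacity planning.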
The Hugging Face model card lists the task as Text-to-Speech, states 646 languages, Apache-2.0 license, model size of 0.6B parameters, and a model tree connected to Qwen3-0.6B-Base.6
What OmniVoice is not
| OmniVoice is | OmniVoice is not |
|---|---|
| A multilingual TTS model | An ASR speech-to-text tool |
| A text-to-audio generation tool | A professional audio editor |
| Capable of voice cloning and voice design | A permission system for cloning anyone’s voice |
| Usable through CLI, Python API and Gradio demo | A mandatory cloud service |
| Runnable locally with suitable hardware | Always lightweight on low-end machines |
| Useful for TTS research and applications | A substitute for consent |
If you need to turn speech into text, you need ASR. If you need to turn text into speech, OmniVoice fits the category.
What makes OmniVoice different?
According to the paper, OmniVoice uses a discrete non-autoregressive diffusion language model-style architecture. Unlike conventional two-stage TTS pipelines that go from text to semantic tokens to acoustic tokens, OmniVoice directly maps text to multi-codebook acoustic tokens.7
The paper emphasizes two technical ideas:
- Full-codebook random masking for efficient training.
- Initialization from a pre-trained LLM to improve intelligibility.
The paper also states that OmniVoice uses 581k hours of multilingual training data curated entirely from open-source data and achieves broad language coverage and strong results across Chinese, English and multilingual benchmarks.7
Simplified flow:
Text input
↓
Diffusion language model-style TTS
↓
Acoustic tokens
↓
Audio waveform
Generation modes
The README describes three main generation modes.8
1. Voice cloning
Provide a short reference audio and its transcript. The model generates new speech in a similar voice.
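from omnivoice import OmniVoice
import soundfile as sf
import torch

# Load the pretrained model onto the first CUDA GPU in half precision.
model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice",
    device_map="cuda:0",
    dtype=torch.float16,
)

# Clone the voice in ref.wav; ref_text is the transcript of that clip.
audio = model.generate(
    text="Hello, this is a test of zero-shot voice cloning.",
    ref_audio="ref.wav",
    ref_text="Transcription of the reference audio.",
)

# Save the first generated waveform at the 24 kHz output rate.
sf.write("out.wav", audio[0], 24000)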
If ref_text is omitted, the README says the model will use Whisper ASR to auto-transcribe the reference audio.8
2. Voice design
No reference audio is needed. Describe the speaker with instruct.
audio = model.generate(
    text="Hello, this is a test of zero-shot voice design.",
    instruct="female, low pitch, british accent",
)
Voice design supports attributes such as gender, age, pitch, whisper style, English accent and Chinese dialect.9
3. Auto voice
Only provide text and let the model choose a voice:
audio = model.generate(text="This is a sentence without any voice prompt.")
Use this when speaker identity is not important.
Voice cloning best practices
The README recommends a 3–10 second reference audio clip. Longer clips slow inference and may reduce cloning quality.8
Reference audio checklist:
- only use voices you have rights or consent to use;
- 3–10 seconds long;
- clear speech, minimal noise;
- no strong music background;
- not clipped at beginning or end;
- use same-language reference audio for more standard pronunciation;
- expect cross-lingual cloning to carry some accent from the reference language.8
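The length item in the checklist above can be checked automatically before a clip is passed to the model. The sketch below uses only Python's standard `wave` module, so it assumes an uncompressed PCM WAV file; the function names are illustrative, and the silent test clips exist only so the check can be tried without real audio.

```python
import os
import tempfile
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def is_reference_length_ok(path: str, min_s: float = 3.0, max_s: float = 10.0) -> bool:
    """Check the 3-10 second guideline for reference clips."""
    return min_s <= wav_duration_seconds(path) <= max_s


def write_silence(path: str, seconds: float, sample_rate: int = 24000) -> None:
    """Write a mono 16-bit silent clip, used here only as test input."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(sample_rate)
        wf.writeframes(b"\x00\x00" * int(seconds * sample_rate))


tmp_dir = tempfile.mkdtemp()
good_clip = os.path.join(tmp_dir, "ref_5s.wav")
short_clip = os.path.join(tmp_dir, "ref_1s.wav")
write_silence(good_clip, 5.0)
write_silence(short_clip, 1.0)
print(is_reference_length_ok(good_clip), is_reference_length_ok(short_clip))
```

A length check says nothing about noise, music or clipping, so the remaining checklist items still need a listen.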
Example:
audio = model.generate(
    text="This is a new sentence spoken using the reference voice.",
    ref_audio="voice_sample.wav",
    ref_text="This is the reference audio transcription.",
)
Voice design without reference audio
The voice design docs say instruct is a comma-separated string of speaker attributes. Each attribute belongs to a category such as gender, age, pitch, style, accent or dialect.9
Example:
audio = model.generate(
    text="This is a voice designed without a reference audio.",
    instruct="female, young adult, high pitch, british accent",
)
Supported attribute examples:
| Category | Examples |
|---|---|
| Gender | male, female |
| Age | child, teenager, young adult, middle-aged, elderly |
| Pitch | very low pitch, low pitch, moderate pitch, high pitch, very high pitch |
| Style | whisper |
| English accent | american accent, british accent, indian accent, chinese accent, japanese accent |
| Chinese dialect | 四川话, 陕西话, 东北话, 青岛话, 河南话 |
The README notes that voice design is trained on Chinese and English data and may be unstable for some low-resource languages or edge cases.8,9
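Because instruct is a plain comma-separated string, typos are easy to miss. One way to guard against them is a small builder that validates each attribute before joining. This is a hypothetical helper, not an OmniVoice API: the allowed values are only the examples from the table above (the full list in the voice design docs may be longer), and the output ordering is an arbitrary choice.

```python
# Example values copied from the attribute table; not necessarily exhaustive.
ATTRIBUTE_EXAMPLES = {
    "gender": {"male", "female"},
    "age": {"child", "teenager", "young adult", "middle-aged", "elderly"},
    "pitch": {"very low pitch", "low pitch", "moderate pitch",
              "high pitch", "very high pitch"},
    "style": {"whisper"},
    "accent": {"american accent", "british accent", "indian accent",
               "chinese accent", "japanese accent"},
    "dialect": {"四川话", "陕西话", "东北话", "青岛话", "河南话"},
}
CATEGORY_ORDER = ("gender", "age", "pitch", "style", "accent", "dialect")


def build_instruct(**attrs: str) -> str:
    """Join validated speaker attributes into a comma-separated instruct string."""
    unknown = set(attrs) - set(CATEGORY_ORDER)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    parts = []
    for category in CATEGORY_ORDER:
        value = attrs.get(category)
        if value is None:
            continue
        if value not in ATTRIBUTE_EXAMPLES[category]:
            raise ValueError(f"{value!r} is not a known {category} value")
        parts.append(value)
    return ", ".join(parts)


instruct = build_instruct(gender="female", pitch="low pitch", accent="british accent")
print(instruct)
```

The resulting string can be passed as the instruct argument shown in the examples above.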
Installation
Python requirement
pyproject.toml states that omnivoice requires Python >= 3.10 and depends on torch, torchaudio, transformers, accelerate, pydub, gradio, tensorboardX, webdataset, numpy, soundfile and librosa.5
Create a virtual environment
python -m venv .venv
source .venv/bin/activate
Windows PowerShell:
python -m venv .venv
.\.venv\Scripts\Activate.ps1
NVIDIA GPU
README example for CUDA 12.8:
pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128
Choose the PyTorch build that matches your CUDA/driver setup.10
Apple Silicon
pip install torch==2.8.0 torchaudio==2.8.0
Then load the model with device_map="mps" in OmniVoice.from_pretrained(...).8
Intel Arc GPU
The README says Intel Arc GPUs are supported through PyTorch’s XPU backend. Install PyTorch from Intel’s wheel index:10
pip install torch torchaudio --index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
Verify:
python -c "import torch; print(torch.xpu.is_available(), torch.xpu.device_count())"
Use:
device_map="xpu"
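The three device strings above can be chosen at runtime rather than hard-coded. The sketch below is an assumption-laden helper, not something the README provides: `pick_device` is a hypothetical name, the probing order is arbitrary, and the "cpu" fallback only reports that no accelerator was found; this guide does not confirm how well OmniVoice runs on CPU.

```python
def pick_device() -> str:
    """Return a device_map string for the first available accelerator."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch not installed; nothing to probe

    if torch.cuda.is_available():
        return "cuda:0"

    # torch.xpu only exists in PyTorch builds with Intel XPU support.
    xpu = getattr(torch, "xpu", None)
    if xpu is not None and xpu.is_available():
        return "xpu"

    # The MPS backend is the Apple Silicon GPU path.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"

    return "cpu"


device = pick_device()
print(f"device_map={device!r}")
```

The returned string would then be passed as device_map to OmniVoice.from_pretrained(...).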
Install OmniVoice
From PyPI:
pip install omnivoice
From GitHub:
pip install git+https://github.com/k2-fsa/OmniVoice.git
Development install:
git clone https://github.com/k2-fsa/OmniVoice.git
cd OmniVoice
pip install -e .
uv setup
git clone https://github.com/k2-fsa/OmniVoice.git
cd OmniVoice
uv sync
If model downloads from Hugging Face are difficult, the README suggests:11
export HF_ENDPOINT="https://hf-mirror.com"
Quickstart: web UI
The easiest starting point is the Gradio demo:
omnivoice-demo --ip 0.0.0.0 --port 8001
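Then open http://localhost:8001 in a browser on the same machine. Binding to 0.0.0.0 listens on all network interfaces, so other machines that can reach the host can also open the demo; use --ip 127.0.0.1 to keep it local.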
Use the web UI to test:
- voice cloning;
- voice design;
- auto voice;
- generation settings;
- quick audio playback.
The README also links a Hugging Face Space and a Google Colab notebook for trying OmniVoice without setting up a local environment.11
CLI: generate one audio file
Voice cloning
omnivoice-infer \
    --model k2-fsa/OmniVoice \
    --text "This is a test for text to speech." \
    --ref_audio ref.wav \
    --ref_text "Transcription of the reference audio." \
    --output hello.wav
ref_text can be omitted; Whisper will auto-transcribe the reference audio.12
Voice design
omnivoice-infer \
    --model k2-fsa/OmniVoice \
    --text "This is a test for text to speech." \
    --instruct "male, british accent" \
    --output hello.wav
Auto voice
omnivoice-infer \
    --model k2-fsa/OmniVoice \
    --text "This is a test for text to speech." \
    --output hello.wav
CLI: batch inference
omnivoice-infer-batch runs batch inference and can distribute work across multiple GPUs.12
omnivoice-infer-batch \
    --model k2-fsa/OmniVoice \
    --test_list test.jsonl \
    --res_dir results/
Example test.jsonl line:
{"id": "sample_001", "text": "Hello world", "ref_audio": "/path/to/ref.wav", "ref_text": "Reference transcript", "instruct": "female, british accent", "language_id": "en", "duration": 10.0, "speed": 1.0}
The README says only id and text are mandatory. ref_audio and ref_text are for voice cloning, instruct is for voice design, and language_id, duration and speed are optional.12
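Writing the JSONL file by hand gets error-prone once there are more than a few items. A minimal sketch of generating it with the standard `json` module follows; the field names come from the example line above, while `make_item` and `write_jsonl` are illustrative helpers and the temporary output path is just for demonstration.

```python
import json
import os
import tempfile

# Optional fields taken from the example test.jsonl line.
OPTIONAL_FIELDS = {"ref_audio", "ref_text", "instruct",
                   "language_id", "duration", "speed"}


def make_item(item_id: str, text: str, **optional) -> dict:
    """Build one batch item; only id and text are mandatory."""
    unknown = set(optional) - OPTIONAL_FIELDS
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    item = {"id": item_id, "text": text}
    item.update({k: v for k, v in optional.items() if v is not None})
    return item


def write_jsonl(path: str, items: list) -> None:
    """Write one JSON object per line, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")


items = [
    make_item("sample_001", "Hello world", instruct="female, british accent"),
    make_item("sample_002", "Second line", ref_audio="/path/to/ref.wav",
              ref_text="Reference transcript", speed=1.0),
]
jsonl_path = os.path.join(tempfile.mkdtemp(), "test.jsonl")
write_jsonl(jsonl_path, items)
print(open(jsonl_path, encoding="utf-8").read())
```

The resulting file is what --test_list expects in the batch command above.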
Speed and duration control
The generation-parameters docs list controls such as num_step, speed, duration, guidance_scale, position_temperature, class_temperature and long-form chunking settings.13
Example:
audio = model.generate(
    text="Hello, this is a test of duration control.",
    num_step=32,
    speed=1.2,
)
Fixed 10-second output:
audio = model.generate(
    text="Hello, this is a test of duration control.",
    duration=10.0,
)
If both are set, duration takes priority over speed.
If exact duration matters, the docs recommend setting postprocess_output=False, because silence trimming may make output shorter than the requested duration.13
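If a wrapper accepts both settings from callers, it can make the duration-over-speed rule and the postprocess_output recommendation explicit instead of relying on callers to remember them. The sketch below only builds a keyword dictionary; `generation_kwargs` is a hypothetical helper, and the parameter names are the ones listed in this section.

```python
from typing import Optional


def generation_kwargs(text: str,
                      speed: Optional[float] = None,
                      duration: Optional[float] = None) -> dict:
    """Build keyword arguments for generate(), letting duration win over speed."""
    kwargs = {"text": text}
    if duration is not None:
        kwargs["duration"] = duration
        # Silence trimming could shorten the clip, so turn it off
        # when an exact length was requested.
        kwargs["postprocess_output"] = False
    elif speed is not None:
        kwargs["speed"] = speed
    return kwargs


both = generation_kwargs("Hello", speed=1.2, duration=10.0)
only_speed = generation_kwargs("Hello", speed=1.2)
print(both)
print(only_speed)
```

The dictionary would be used as model.generate(**kwargs); dropping speed here is a design choice that keeps the request unambiguous in logs.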
Long-form generation
The generation-parameters docs say long text is automatically split into smaller segments when the estimated speech duration exceeds audio_chunk_threshold. Each segment generates roughly audio_chunk_duration seconds, allowing long-form speech with near-constant VRAM use.13
Example:
audio = model.generate(
    text=long_text,
    audio_chunk_duration=15.0,
    audio_chunk_threshold=30.0,
)
Use cases:
- audiobooks;
- long narration;
- blog reading;
- tutorial voice-over;
- batch content generation.
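To get a feel for how the threshold and chunk duration interact, the rough planning sketch below estimates how many segments a text might produce. It is not the library's algorithm: the 150 words-per-minute speaking rate is an assumption, the word count only makes sense for space-delimited languages, and the 15 and 30 second values are simply the ones used in the example call above.

```python
import math


def estimate_speech_seconds(text: str, words_per_minute: float = 150.0) -> float:
    """Rough duration estimate from word count (assumed speaking rate)."""
    return len(text.split()) / words_per_minute * 60.0


def plan_chunks(estimated_seconds: float,
                audio_chunk_duration: float = 15.0,
                audio_chunk_threshold: float = 30.0) -> int:
    """Return 1 below the threshold, else the number of ~chunk-sized segments."""
    if estimated_seconds <= audio_chunk_threshold:
        return 1
    return math.ceil(estimated_seconds / audio_chunk_duration)


sample_text = "word " * 300  # stand-in for a 300-word script
seconds = estimate_speech_seconds(sample_text)
print(f"estimated {seconds:.0f} s, about {plan_chunks(seconds)} chunk(s)")
```

An estimate like this is mainly useful for setting request limits or progress bars; the model's own segmentation decides the real boundaries.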
Non-verbal symbols and pronunciation control
OmniVoice supports inline non-verbal tags:14
audio = model.generate(
    text="[laughter] You really got me. I didn't see that coming at all."
)
Supported examples:
[laughter], [sigh], [confirmation-en], [question-en],
[question-ah], [question-oh], [question-ei], [question-yi],
[surprise-ah], [surprise-oh], [surprise-wa], [surprise-yo],
[dissatisfaction-hnn]
Chinese pronunciation can be corrected by writing the intended reading inline as uppercase pinyin with a tone number:14
audio = model.generate(
    text="这批货物打ZHE2出售后他严重SHE2本了,再也经不起ZHE1腾了。"
)
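English pronunciation can be overridden with uppercase CMU dictionary phonemes in square brackets; in the example below, [B EY1 S] is the musical "bass" and [B AE1 S] is the fish:14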
audio = model.generate(
    text="He plays the [B EY1 S] guitar while catching a [B AE1 S] fish."
)
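Since non-verbal tags and CMU pronunciations share the square-bracket syntax, a script can be linted for bracketed tokens that are neither. The following is a hedged sketch: the tag set is only the list of supported examples quoted above, the CMU pattern is a loose approximation (uppercase phonemes with an optional 0-2 stress digit), and `classify_brackets` is an illustrative name.

```python
import re

# Tags copied from the "Supported examples" list; possibly not exhaustive.
NON_VERBAL_TAGS = {
    "laughter", "sigh", "confirmation-en", "question-en",
    "question-ah", "question-oh", "question-ei", "question-yi",
    "surprise-ah", "surprise-oh", "surprise-wa", "surprise-yo",
    "dissatisfaction-hnn",
}
BRACKET_RE = re.compile(r"\[([^\[\]]+)\]")
# Loose check: space-separated uppercase phonemes, optional stress digit.
CMU_RE = re.compile(r"^[A-Z]+[0-2]?(?: [A-Z]+[0-2]?)*$")


def classify_brackets(text: str) -> dict:
    """Sort bracketed tokens into non-verbal tags, pronunciations and unknowns."""
    result = {"non_verbal": [], "pronunciation": [], "unknown": []}
    for token in BRACKET_RE.findall(text):
        if token in NON_VERBAL_TAGS:
            result["non_verbal"].append(token)
        elif CMU_RE.match(token):
            result["pronunciation"].append(token)
        else:
            result["unknown"].append(token)
    return result


report = classify_brackets("[laughter] He plays the [B EY1 S] guitar. [cough]")
print(report)
```

Anything reported as unknown is worth reviewing by hand before generation, since this guide does not say how the model treats unrecognized tags.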
Which mode should you choose?
| Need | Mode |
|---|---|
| Use a lawful reference voice | Voice cloning |
| Create a voice by description | Voice design |
| Generate quick audio with no fixed speaker | Auto voice |
| Generate many outputs | Batch inference |
| Test interactively | omnivoice-demo |
| Integrate into a backend | Python API |
| Run offline jobs | CLI + JSONL |
Personal setup guide
Simple NVIDIA GPU setup:
python -m venv .venv
source .venv/bin/activate
pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128
pip install omnivoice
omnivoice-demo --ip 127.0.0.1 --port 8001
CLI test:
omnivoice-infer \
    --model k2-fsa/OmniVoice \
    --text "Hello, this is a basic OmniVoice test." \
    --output out.wav
Voice cloning:
omnivoice-infer \
    --model k2-fsa/OmniVoice \
    --text "This new sentence is generated using the reference voice." \
    --ref_audio ref.wav \
    --ref_text "This is the reference audio transcription." \
    --output cloned.wav
Backend deployment pattern
Basic backend architecture:
API request
↓
Validate text + voice mode
↓
OmniVoice worker
↓
WAV output
↓
Storage/CDN
↓
Return URL
Practical suggestions:
- Do not run long inference directly in the web request thread.
- Use a queue such as Redis/RQ, Celery, Sidekiq or a dedicated worker.
- Cache output by text + voice configuration hash.
- Limit text length and output duration.
- Delete temporary files.
- Separate model workers from the API server.
- Record metadata: model version, mode, text length, duration and reference audio ID.
- Avoid logging raw reference audio or private transcripts.
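Two of the suggestions above, limiting text length and caching by a hash of text plus voice configuration, can be sketched with the standard library alone. Everything here is illustrative: the 2000-character limit, the mode names and the shape of `voice_config` are assumptions for the example, not values from OmniVoice.

```python
import hashlib
import json

MAX_TEXT_CHARS = 2000  # hypothetical limit; tune to your own workload
VOICE_MODES = {"clone", "design", "auto"}  # assumed labels for the three modes


def validate_request(text: str, mode: str) -> None:
    """Reject requests before they reach the model worker."""
    if mode not in VOICE_MODES:
        raise ValueError(f"unknown voice mode: {mode!r}")
    if not text.strip():
        raise ValueError("text is empty")
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError("text exceeds MAX_TEXT_CHARS")


def cache_key(text: str, voice_config: dict, model_version: str) -> str:
    """Hash text + voice settings + model version into a stable cache key."""
    payload = json.dumps(
        {"model": model_version, "text": text, "voice": voice_config},
        sort_keys=True,       # key order must not change the hash
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


key_a = cache_key("Hello", {"mode": "design", "instruct": "female, low pitch"}, "v1")
key_b = cache_key("Hello", {"instruct": "female, low pitch", "mode": "design"}, "v1")
key_c = cache_key("Hello!", {"mode": "design", "instruct": "female, low pitch"}, "v1")
print(key_a == key_b, key_a == key_c)
```

For cloning requests, putting a reference audio ID or content hash in `voice_config` (rather than a file path or raw audio) keeps keys stable and avoids leaking private data into cache metadata.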
Content-production workflow
Suggested workflow:
- Write the script.
- Normalize numbers, dates and symbols.
- Choose cloning, design or auto voice.
- Generate a draft.
- Have a human editor listen and check pronunciation.
- Correct pronunciation using pinyin/CMU notation if needed.
- Export WAV.
- Mix/master in audio or video editing software.
Do not automatically publish cloned-voice outputs without review. TTS can produce pronunciation errors, bad pauses or unwanted sounds.
Notes for Vietnamese and other languages
OmniVoice supports more than 600 languages, but real quality depends on training data, punctuation, numeric normalization and the reference audio.15
For Vietnamese, practical tips:
- write full diacritics;
- use clear punctuation;
- spell numbers as words if numeric reading is bad;
- avoid ambiguous abbreviations;
- use Vietnamese reference audio for natural pronunciation;
- check names, English mixed text and product codes;
- split long paragraphs into shorter sentences.
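The last tip, splitting long paragraphs into shorter sentences, is simple to automate as a preprocessing step. The sketch below splits on sentence-final punctuation followed by whitespace; it is a naive rule chosen for illustration, and it will also split after abbreviations that end in a period, which is one more reason to avoid ambiguous abbreviations in scripts.

```python
import re

# Split after ., !, ? or … when followed by whitespace.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?…])\s+")


def split_sentences(paragraph: str) -> list:
    """Return the paragraph's sentences with surrounding whitespace removed."""
    pieces = SENTENCE_BOUNDARY.split(paragraph.strip())
    return [piece.strip() for piece in pieces if piece.strip()]


paragraph = "Xin chào. Hôm nay trời đẹp! Bạn khỏe không?"
sentences = split_sentences(paragraph)
for sentence in sentences:
    print(sentence)
```

Each sentence can then be generated and reviewed separately, which makes it easier to regenerate only the lines that were mispronounced.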
Min Nan / Hokkien note
The tips docs say Min Nan Chinese / Hokkien currently supports only Tai-lo romanization input; Chinese characters are not supported for this language in the current model version.16
This shows that “language support” does not always mean every writing system or orthography variant is supported equally.
Training, evaluation and fine-tuning
The README points to examples/ for the complete pipeline from data preparation to training, evaluation and fine-tuning.17
Use this path if you need to:
- fine-tune on a domain;
- evaluate on internal data;
- build an internal benchmark;
- improve pronunciation or speaker style;
- study multilingual TTS architecture.
Most users should start with the pretrained model before considering fine-tuning.
Safety and ethics
Voice cloning is higher risk than many AI tools because it can imitate real people. The README prohibits unauthorized voice cloning, impersonation, fraud, scams and other illegal or unethical activities.3
Required checklist:
- Only clone voices with clear permission.
- Do not impersonate real people.
- Do not use generated voices for fraud, scam calls or fake evidence.
- Do not publish private voice datasets.
- Add watermarking or metadata where appropriate.
- Store consent records for voice samples.
- Use human review before publishing.
- Add abuse detection and rate limits for public systems.
- Provide a deletion process for reference audio.
When should you use OmniVoice?
Use it when:
- you need multilingual TTS;
- you have lawful reference audio for voice cloning;
- you want voice design by speaker attributes;
- you create voice-over, audiobooks, AI assistants or demos;
- you research multilingual TTS or diffusion-style TTS;
- you need batch speech generation.
Do not use it when:
- you do not have the right to use a reference voice;
- legal or financial decisions depend on unreviewed generated audio;
- you need ASR/speech-to-text;
- your machine cannot run the model;
- you require strict real-time guarantees without benchmarking;
- you do not have a misuse-prevention policy.
OmniVoice vs previous repositories
| Repository | Primary purpose |
|---|---|
| OmniVoice | multilingual TTS, voice cloning and voice design |
| PaddleOCR | OCR and document parsing from images/PDFs |
| MarkItDown | document-to-Markdown conversion |
| NVIDIA Cosmos | world models for Physical AI |
| Claude Tap | trace/debug AI coding agents |
| Headroom | compress LLM context/tool output |
| RTK | compress CLI output for coding agents |
OmniVoice is in the speech generation category, not document processing, coding-agent tooling or Physical AI.
FAQ
What is OmniVoice?
OmniVoice is a zero-shot multilingual text-to-speech model from k2-fsa, supporting over 600 languages, voice cloning and voice design.2
How many languages does OmniVoice support?
The README says 600+ languages; the supported languages file states 646 languages and 581k hours of training data.2,15
How long should reference audio be?
The README recommends a 3–10 second reference audio clip. Longer audio can slow inference and degrade cloning quality.8
Is ref_text required?
No. If ref_text is omitted, the README says the model uses Whisper ASR to auto-transcribe the reference audio.8
Is voice design stable?
The README says voice cloning is the most stable mode. Voice design is trained on Chinese and English data and may be unstable for some low-resource languages or edge cases.8
What license does OmniVoice use?
The GitHub repository and Hugging Face model card list Apache-2.0.4,6
Conclusion
k2-fsa/OmniVoice is notable because it combines very broad language coverage, short-reference voice cloning and text-based voice design in one open TTS project. Beginners should start with omnivoice-demo, then move to omnivoice-infer or the Python API for integration work.
The most important deployment point is not technical. Because OmniVoice supports voice cloning, any production workflow needs consent, misuse controls, access limits, human review and a clear policy. Used responsibly, it is useful for voice-over, audiobook generation, AI assistants and TTS research. Used irresponsibly, it can enable voice impersonation.
References
1. OmniVoice project image used in the README and Hugging Face model card. https://zhu-han.github.io/omnivoice/pics/omnivoice.jpg
2. OmniVoice README. https://raw.githubusercontent.com/k2-fsa/OmniVoice/master/README.md
3. OmniVoice README, Disclaimer section. https://raw.githubusercontent.com/k2-fsa/OmniVoice/master/README.md
4. GitHub, k2-fsa/OmniVoice. https://github.com/k2-fsa/OmniVoice
5. OmniVoice pyproject.toml. https://raw.githubusercontent.com/k2-fsa/OmniVoice/master/pyproject.toml
6. Hugging Face model card for k2-fsa/OmniVoice. https://huggingface.co/k2-fsa/OmniVoice
7. arXiv paper “OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models.” https://arxiv.org/abs/2604.00688
8. OmniVoice README, Python API section. https://raw.githubusercontent.com/k2-fsa/OmniVoice/master/README.md
9. OmniVoice voice design docs. https://raw.githubusercontent.com/k2-fsa/OmniVoice/master/docs/voice-design.md
10. OmniVoice README, Installation section. https://raw.githubusercontent.com/k2-fsa/OmniVoice/master/README.md
11. OmniVoice README, Quick Start section. https://raw.githubusercontent.com/k2-fsa/OmniVoice/master/README.md
12. OmniVoice README, Command-Line Tools section. https://raw.githubusercontent.com/k2-fsa/OmniVoice/master/README.md
13. OmniVoice generation parameters docs. https://raw.githubusercontent.com/k2-fsa/OmniVoice/master/docs/generation-parameters.md
14. OmniVoice README, Non-Verbal & Pronunciation Control section. https://raw.githubusercontent.com/k2-fsa/OmniVoice/master/README.md
15. OmniVoice supported languages file. https://raw.githubusercontent.com/k2-fsa/OmniVoice/master/docs/languages.md
16. OmniVoice tips file. https://raw.githubusercontent.com/k2-fsa/OmniVoice/master/docs/tips.md
17. OmniVoice README, Training & Evaluation section. https://raw.githubusercontent.com/k2-fsa/OmniVoice/master/README.md
Written by PixelRouter Editorial Team