| | | https://github.com/SWivid/F5-TTS | | | | Flow-matching model with a Diffusion Transformer (DiT) backbone for consistent zero-shot voice cloning. | |
| | | https://github.com/suno-ai/bark | ["English","Spanish","French","German","Italian","Japanese","Korean","Chinese","Portuguese","Russian","Turkish","Polish","Hindi"] | | | GPT-style autoregressive model capable of generating speech and non-speech sounds like laughter or music. | |
| | | https://github.com/rhasspy/piper | ["English","Spanish","French","German","Chinese","Multilingual"] | | | Extremely fast, ONNX-based model optimized for edge devices and Raspberry Pi, running entirely on CPU. Supports 30+ languages. | |
| | | https://huggingface.co/microsoft/VibeVoice-1.5B | | | | Microsoft's long-form TTS model using low-frame-rate tokenizers for stable multi-speaker dialogue up to 90 minutes. | |
| | | https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice | ["Chinese","English","Japanese","Korean","German","French","Russian","Portuguese","Spanish","Italian"] | | | High-performance TTS model family (0.6B/1.7B) from Alibaba Qwen, beating closed-source models in voice similarity benchmarks. | |
| | | https://huggingface.co/openbmb/VoxCPM2 | ["Arabic","Burmese","Chinese","Danish","Dutch","English","Finnish","French","German","Greek","Hebrew","Hindi","Indonesian","Italian","Japanese","Khmer","Korean","Lao","Malay","Norwegian","Polish","Portuguese","Russian","Spanish","Swahili","Swedish","Tagalog","Thai","Turkish","Vietnamese"] | | | High-performance model from OpenBMB achieving 85%+ voice similarity in competitive benchmarks. | |
| | | https://huggingface.co/k2-fsa/OmniVoice | | | | Massively multilingual zero-shot TTS model supporting over 600 languages. | |
| | | https://huggingface.co/FunAudioLLM/Fun-CosyVoice3-0.5B-2512 | ["Chinese","English","Japanese","Korean","German","Spanish","French","Italian","Russian"] | | | Small but powerful speech generation model from Alibaba's FunAudioLLM (CosyVoice series) with streaming support. | |
| | | https://huggingface.co/ResembleAI/chatterbox-turbo | ["Arabic","Danish","German","Greek","English","Spanish","Finnish","French","Hebrew","Hindi","Italian","Japanese","Korean","Malay","Dutch","Norwegian","Polish","Portuguese","Russian","Swedish","Swahili","Turkish","Chinese"] | | | Streamlined 350M-parameter model family designed for low-latency, production-grade speech with emotion control. | |
| | | https://github.com/huggingface/parler-tts | | | | Text-controlled TTS model pairing a T5 text encoder with a DAC audio decoder, letting users describe voice characteristics in natural language. | |
| | | https://huggingface.co/hexgrad/Kokoro-82M | ["English","Japanese","Korean","Spanish","French","German","Italian","Portuguese","Hindi"] | | | Lightweight 82M-parameter model based on StyleTTS2 for high-quality, fast inference. | |
| | Fish Audio Research License | https://huggingface.co/fishaudio/s2-pro | ["English","Chinese","Multilingual (80+)"] | | | Decoder-only transformer model with Dual-AR design for high-quality, controllable speech across 80+ languages. | |
| | | https://huggingface.co/coqui/XTTS-v2 | ["English","Spanish","French","German","Italian","Portuguese","Polish","Turkish","Russian","Dutch","Czech","Arabic","Chinese","Japanese","Hungarian","Korean","Hindi"] | | | GPT-style autoregressive model with DVAE for high-fidelity zero-shot voice cloning in 17 languages. | |
| | | https://huggingface.co/nari-labs/Dia2-2B | | | | Dialogue-focused TTS model (1B/2B variants) for multi-speaker conversations with nonverbal tags like laughter. | |
| | | https://github.com/myshell-ai/MeloTTS | ["English","Spanish","French","Chinese","Japanese","Korean"] | | | Fast, multilingual TTS library optimized for CPU inference with support for mixed-language speech. | |
| | | https://github.com/2Noise/ChatTTS | | | | Conversational TTS model optimized for dialogue, supporting natural prosody and expressive speech features. | |
| | | https://huggingface.co/siliconflow/IndexTTS2-1.5B | | | | Advanced zero-shot TTS model from SiliconFlow with high emotional fidelity and superior speaker similarity. | |
| | | https://github.com/meituan-longcat/LongCat-AudioDiT | | | | Diffusion-based TTS model from Meituan's LongCat team, operating in waveform latent space for high-quality voice cloning. | |
| | | https://huggingface.co/Aratako/MioTTS-2.6B | | | | Zero-shot TTS model (2.6B) built on the MioCodec for efficient, high-quality audio generation and cloning. | |
| | | https://huggingface.co/sesame/csm-1b | | | | 1B-parameter TTS model from Sesame focused on high-quality speech generation. | |
| | | https://huggingface.co/myshell-ai/zipvoice-v1 | | | | High-speed zero-shot TTS model from MyShell.ai optimized for low-latency voice cloning and interaction. | |
| | | https://github.com/SWivid/E2-TTS | | | | Highly efficient TTS model from SWivid, a sister project to F5-TTS, designed for fast inference. | |
| | | https://github.com/RVC-Boss/GPT-SoVITS | ["English","Chinese","Japanese","Korean","Cantonese"] | | | Powerful few-shot voice cloning and TTS system with a web UI, widely used for creating custom voice models. | |
| | | https://github.com/kyutai-labs/moshi | | | | Low-latency (~200 ms) full-duplex speech-to-speech model from Kyutai for real-time spoken conversation, runnable on consumer hardware. | |
| | | https://github.com/OpenMOSS/MOSS-TTS-Nano | ["English","Chinese","Multilingual"] | | | Ultra-lightweight 0.1B-parameter model designed for real-time CPU inference across 20 languages. | |
| | | https://github.com/neuphonic/neu-tts | ["English","Spanish","German","French"] | | | On-device foundation TTS model from Neuphonic for super-realistic speech with instant voice cloning. | |
| | | https://github.com/jasonppy/voicecraft | | | | Token-infilling neural codec language model for high-fidelity speech editing and zero-shot voice cloning. | |
| | | https://github.com/FireRedTeam/FireRedTTS | | | | High-quality streaming foundation TTS system from the FireRed Team with two-stage semantic-to-acoustic decoding. | |
| | | https://github.com/open-mmlab/Amphion/tree/main/models/tts/maskgct | ["English","Japanese","Chinese"] | | | Fully non-autoregressive TTS model using a masked generative codec transformer for zero-shot synthesis without text-speech alignment. | |
| | | https://github.com/yl4579/StyleTTS2 | | | | Foundation TTS model using style diffusion and adversarial training for human-level naturalness and expressive prosody. | |
| | | https://github.com/metavoiceio/metavoice-src | | | | Foundational 1.2B-parameter model for human-like, expressive TTS, trained on 100,000 hours of speech. | |
| | | https://github.com/Camb-ai/mars5-tts | | | | Novel speech model from CAMB.AI designed for high-quality prosody and zero-shot cloning from 5 seconds of audio. | |
| | | https://github.com/myshell-ai/OpenVoice | ["English","Spanish","French","Chinese","Japanese","Korean"] | | | Versatile instant voice cloning approach from MIT and MyShell.ai with control over emotion and accent across multiple languages. | |
| | | https://github.com/netease-youdao/EmotiVoice | | | | Multi-voice and prompt-controlled TTS engine from NetEase Youdao with support for over 2000 voices and emotional synthesis. | |
| | | https://github.com/Plachtaa/VALL-E-X | ["English","Chinese","Japanese"] | | | Open-source implementation of Microsoft's VALL-E X for zero-shot cross-lingual voice cloning and synthesis. | |
| | | https://huggingface.co/models?search=MegaTTS3 | | | | Multilingual zero-shot TTS model from ByteDance with high speaker similarity and natural prosody. | |
| | | https://github.com/spark-tts/spark-tts | | | | High-quality zero-shot TTS model with support for expressive speech generation and voice cloning. | |
| | | https://github.com/t5gemma-tts/t5gemma-tts | ["English","Chinese","Japanese"] | | | Multilingual TTS model built on the T5Gemma architecture, supporting voice cloning and precise duration control. | |
| | | https://github.com/kugelaudio/kugelaudio | ["English","French","German","Spanish","Italian","Dutch","Portuguese","Multilingual (23 EU)"] | | | High-quality TTS model supporting 23 European Union languages with zero-shot voice cloning capabilities. | |
| | | https://github.com/tronghieuit/tiny-tts | | | | Ultra-lightweight English TTS model with only 1.6 million parameters, achieving ~53x real-time inference. | |