Flow matching model with a Diffusion Transformer backbone for consistent zero-shot cloning.
GPT-style autoregressive model capable of generating speech and non-speech sounds such as laughter or music.
Extremely fast, ONNX-based model optimized for edge devices such as the Raspberry Pi, running entirely on CPU. Supports 30+ languages.
Microsoft's long-form TTS model using low-frame-rate tokenizers for stable multi-speaker dialogue up to 90 minutes.
High-performance TTS model family (0.6B/1.7B) from Alibaba Qwen, beating closed-source models in voice similarity benchmarks.
High-performance model from OpenBMB achieving 85%+ voice similarity in competitive benchmarks.
A massively multilingual zero-shot text-to-speech (TTS) model supporting over 600 languages.
Small but powerful speech generation model from Alibaba's FunAudioLLM (CosyVoice series) with streaming support.
Streamlined 350M-parameter model family designed for low-latency, production-grade speech with emotion control.
Text-controlled TTS model using T5 and DAC decoder, allowing users to describe voice characteristics via natural language.
Lightweight, 82M-parameter model based on StyleTTS2 for high-quality, fast inference.
Decoder-only transformer model with Dual-AR design for high-quality, controllable speech across 80+ languages.
GPT-style autoregressive model with DVAE for high-fidelity zero-shot voice cloning in 17 languages.
Dialogue-focused TTS model (1B/2B variants) for multi-speaker conversations with nonverbal tags like laughter.
Fast, multilingual TTS library optimized for CPU inference with support for mixed-language speech.
Conversational TTS model optimized for dialogue, supporting natural prosody and expressive speech features.
Advanced zero-shot TTS model from SiliconFlow with high emotional fidelity and superior speaker similarity.
Diffusion-based TTS model from Meituan's LongCat team, operating in waveform latent space for high-quality voice cloning.
Zero-shot TTS model (2.6B) built on MioCodec for efficient, high-quality audio generation and cloning.
1B-parameter TTS model from Sesame focused on high-quality speech generation.
High-speed zero-shot TTS model from MyShell.ai optimized for low-latency voice cloning and interaction.
A highly efficient TTS model from SWivid, sister project to F5-TTS, designed for fast inference.
Powerful few-shot voice cloning and TTS system with a web UI, widely used for creating custom voice models.
Low-latency (200 ms) speech-to-speech model capable of real-time conversational reasoning that runs on consumer hardware.
Ultra-lightweight 0.1B-parameter model designed for real-time CPU inference across 20 languages.
On-device foundation TTS model from Neuphonic for super-realistic speech with instant voice cloning.
Token infilling neural codec language model for high-fidelity speech editing and zero-shot voice cloning.
High-quality streaming foundation TTS system from the FireRed Team with two-stage semantic-to-acoustic decoding.
Fully non-autoregressive TTS model using a masked generative codec transformer for zero-shot synthesis without text-speech alignment.
Foundation TTS model using style diffusion and adversarial training for human-level naturalness and expressive prosody.
Foundational 1.2B-parameter model for human-like, expressive TTS, trained on 100,000 hours of speech.
Novel speech model from CAMB.AI designed for high-quality prosody and zero-shot cloning from 5 seconds of audio.
Versatile instant voice cloning approach from MIT and MyShell.ai with control over emotion and accent across multiple languages.
Multi-voice and prompt-controlled TTS engine from NetEase Youdao with support for over 2000 voices and emotional synthesis.
Open-source implementation of Microsoft's VALL-E X for zero-shot cross-lingual voice cloning and synthesis.
Multilingual zero-shot TTS model with high speaker similarity and natural prosody.
High-quality zero-shot TTS model with support for expressive speech generation and voice cloning.
Multilingual TTS model built on the T5Gemma architecture, supporting voice cloning and precise duration control.
High-quality TTS model supporting 23 European Union languages with zero-shot voice cloning capabilities.
Ultra-lightweight English TTS model with only 1.6 million parameters, achieving ~53x real-time inference.
Made with Webhound · Ask questions about this research, build on it, or start your own