The new Kokoro TTS model just proved something remarkable: a small, efficient model can outperform massive ones that cost millions to train. With only 82 million parameters and fewer than 100 hours of training data, Kokoro topped the TTS Spaces Arena leaderboard against models 10-15x its size.
What makes this interesting is how it challenges assumptions about AI scaling. Models like XTTS v2 used over 10,000 hours of audio and 467M parameters. MetaVoice threw 1.2B parameters and 100,000 hours at the problem. Fish Speech used a whopping 1 million hours. Yet Kokoro beat them all with a fraction of the resources.
The model generates high-quality speech in both American and British English, with 10 different voice options available. Most impressively, it does this without the complex diffusion or encoder architectures that larger models rely on.
I tested Kokoro extensively and found the generation speed remarkable: roughly one second per generated clip. The voice quality stays consistent and natural-sounding across different types of text.
This matters because it shows that thoughtful architecture and training choices can matter more than raw scale. While massive models grab headlines, Kokoro proves that focused, efficient approaches still have a major role to play.
The model is fully open source under the Apache 2.0 license. You can try it yourself through the Hugging Face demo or by running the code locally. The documentation includes clear setup instructions and code examples to get started.
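For a sense of what local usage looks like, here is a minimal sketch based on the `kokoro` Python package that has been published alongside the weights. The exact interface may differ from the version of the docs you have; the `lang_code` value, the `af_heart` voice name, and the 24 kHz sample rate are assumptions drawn from the package's README rather than guarantees.

```python
# Sketch: generate speech with Kokoro locally (assumes `pip install kokoro soundfile`).
from kokoro import KPipeline
import soundfile as sf

# 'a' selects American English; other codes select other accents/languages.
pipeline = KPipeline(lang_code='a')

text = "Small models can punch well above their weight."

# The pipeline yields audio chunks; voice names like 'af_heart' are
# package-specific and may vary between releases.
for i, (graphemes, phonemes, audio) in enumerate(pipeline(text, voice='af_heart')):
    # Kokoro outputs 24 kHz audio.
    sf.write(f"output_{i}.wav", audio, 24000)
```

Each yielded chunk corresponds to a segment of the input text, so longer passages produce several numbered WAV files you can concatenate afterward.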
I expect we will see more examples of smaller, specialized models outperforming general-purpose giants. The future of AI may not be exclusively about who can train the biggest model.
The code and model weights are available at https://huggingface.co/hexgrad/Kokoro-82M for those interested in experimenting with it directly.