Mistral Releases Voxtral TTS: Open-Weight Text-to-Speech Model With 4 Billion Parameters

30-03-2026

Can a 4B model outpace the giants? Mistral AI’s new Voxtral TTS brings human-level speech and fast voice cloning to your own hardware. It is high-performance, private and supports nine languages. See how it compares with established commercial systems.

Written by:

Jorick van Weelie

Mistral AI has released Voxtral TTS, the company’s first text-to-speech model. The 4-billion-parameter model supports nine languages, runs on consumer hardware, and is available both through Mistral’s API and as open weights on Hugging Face. In human evaluations, Voxtral TTS matches or exceeds the voice quality of ElevenLabs, currently one of the most widely used commercial TTS providers.

What Voxtral TTS does

Voxtral TTS converts text into natural-sounding speech across nine languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi and Arabic. The model is built on a transformer-based, autoregressive architecture combined with flow-matching, using Mistral’s Ministral 3B as the backbone. It consists of a 3.4-billion-parameter transformer decoder, a 390-million-parameter flow-matching acoustic transformer and a 300-million-parameter neural audio codec.

The system can generate up to two minutes of audio natively per request, with Mistral’s API handling longer inputs through a smart interleaving process. Voice cloning is supported with as little as three seconds of reference audio, and the model captures speaker characteristics including natural pauses, rhythm, intonation and emotional expression. It also supports zero-shot cross-lingual voice adaptation, meaning a French-accented voice can speak English without retraining.

Performance and benchmarks

Mistral reports a model latency of 70 milliseconds for a typical input of 10 seconds of reference audio and 500 characters of text, with a real-time factor of approximately 9.7x. The time-to-first-audio latency sits at around 90 milliseconds, making the model suitable for real-time and streaming applications such as voice agents and interactive assistants.
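To put the reported figures in perspective, the sketch below converts a real-time factor into an estimated synthesis time. It assumes that "9.7x" means audio is produced 9.7 times faster than it plays back, which the article does not spell out. The function name and the choice of example durations are illustrative only.

```python
# Assumption: a real-time factor of 9.7x means the model produces
# 9.7 seconds of audio per second of compute. The source does not
# define the term, so treat these numbers as rough estimates.
REPORTED_REAL_TIME_FACTOR = 9.7
REPORTED_TIME_TO_FIRST_AUDIO_S = 0.090  # about 90 ms, as reported


def generation_time_seconds(audio_seconds: float,
                            real_time_factor: float = REPORTED_REAL_TIME_FACTOR) -> float:
    """Estimate wall-clock time needed to synthesize `audio_seconds` of speech."""
    if real_time_factor <= 0:
        raise ValueError("real_time_factor must be positive")
    return audio_seconds / real_time_factor


if __name__ == "__main__":
    for seconds in (10, 60, 120):
        estimate = generation_time_seconds(seconds)
        print(f"{seconds:>4} s of audio: about {estimate:.2f} s to generate")
    print(f"Playback could start after roughly {REPORTED_TIME_TO_FIRST_AUDIO_S * 1000:.0f} ms when streaming")
```

Under this reading, a full two-minute clip would take a little over 12 seconds to generate, while a streaming client would begin playback long before synthesis finishes.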

In human evaluations, Voxtral TTS was rated more natural than ElevenLabs Flash v2.5 while maintaining similar time-to-first-audio performance. It also reaches parity with ElevenLabs v3, ElevenLabs’ higher-quality offering, in terms of lifelike voice interactions. These results position Voxtral as a competitive alternative to established commercial TTS solutions.

Lightweight design and on-device potential

At 4 billion parameters total, Voxtral TTS is designed to run on consumer-grade hardware. Mistral states the model can operate on modern laptops, mid-range desktop GPUs and some high-end mobile devices. This lightweight footprint opens the door to on-device voice applications that do not rely on cloud processing, which has implications for latency-sensitive use cases and scenarios where data privacy is a concern.
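A rough way to see why this parameter count fits on consumer hardware is to estimate the memory needed for the weights alone. The sketch below uses the component sizes reported earlier in the article. The numeric precisions listed are assumptions for illustration: the article does not say which formats the released weights use or whether quantized variants exist. The estimate also ignores activations, caches and runtime overhead.

```python
# Component sizes as reported in the article.
COMPONENT_PARAMS = {
    "transformer_decoder": 3_400_000_000,
    "flow_matching_acoustic_transformer": 390_000_000,
    "neural_audio_codec": 300_000_000,
}

# Assumed storage formats (hypothetical; not confirmed by the source).
BYTES_PER_PARAM = {"float32": 4.0, "bfloat16": 2.0, "int8": 1.0, "int4": 0.5}


def total_params() -> int:
    """Sum of all reported component parameter counts."""
    return sum(COMPONENT_PARAMS.values())


def weight_memory_gb(dtype: str) -> float:
    """Approximate weight storage in gigabytes (10**9 bytes) for a given format."""
    return total_params() * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    print(f"Total parameters: {total_params() / 1e9:.2f} billion")
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype:>8}: about {weight_memory_gb(dtype):.1f} GB for weights only")
```

The components add up to about 4.09 billion parameters, which is consistent with the headline figure of 4 billion. At 16-bit precision that is roughly 8 GB of weights, which is within reach of a mid-range desktop GPU.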

The model’s compact size is notable because most competitive TTS systems with comparable quality are significantly larger or only available as cloud APIs. By offering both a hosted API and downloadable weights, Mistral provides flexibility for developers who need either a managed service or full local control.

Pricing and availability

Voxtral TTS is available through Mistral’s API at $0.016 per 1,000 characters, as well as through Mistral Studio and Le Chat. The open-weight version, hosted on Hugging Face under a CC BY-NC 4.0 license, includes the model and several reference voices. The non-commercial license means the open weights can be used freely for research and personal projects, while commercial use requires Mistral’s API or a separate licensing agreement.
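For budgeting, the API price translates directly into a per-text estimate. The sketch below assumes billing is strictly linear in the number of characters, with no minimum charge, rounding or volume discount. None of these details are stated in the article, and the function name is illustrative.

```python
from decimal import Decimal

# Reported API price: $0.016 per 1,000 characters.
PRICE_PER_1K_CHARS_USD = Decimal("0.016")


def estimate_cost_usd(num_characters: int) -> Decimal:
    """Estimate API cost in USD, assuming simple linear per-character billing."""
    if num_characters < 0:
        raise ValueError("num_characters must be non-negative")
    return Decimal(num_characters) / Decimal(1000) * PRICE_PER_1K_CHARS_USD


if __name__ == "__main__":
    for chars in (500, 10_000, 1_000_000):
        print(f"{chars:>9,} characters: ${estimate_cost_usd(chars)}")
```

Under these assumptions, one million characters of text would cost about $16.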

The official announcement and technical details are available on Mistral’s blog. Developer documentation can be found in the Mistral Docs.