April 16, 2026
On April 15, 2026, Google DeepMind released Gemini 3.1 Flash TTS, a text-to-speech model that introduces granular audio tags for controlling vocal style, pace, and delivery. The model supports more than 70 languages and native multi-speaker dialogue, and is now available through the Gemini API, Google AI Studio, Vertex AI, and Google Vids.
Gemini 3.1 Flash TTS ranks second on the Artificial Analysis TTS leaderboard with an Elo score of 1,211, placing it ahead of ElevenLabs v3 in overall quality. Google positions it as the most expressive and controllable text-to-speech model in the Gemini family to date.
What can Gemini 3.1 Flash TTS do?
Gemini 3.1 Flash TTS converts text into natural-sounding speech with a level of control that goes well beyond standard TTS systems. The model introduces over 200 audio tags that developers can embed directly into the input text to steer vocal style, tone, tempo, and accent. Tags like [enthusiasm], [whispers], [curiosity], and [determination] allow fine-grained emotional control without requiring separate configuration or post-processing.
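The inline-tag idea can be sketched in a few lines. The tag names come from the article; the helper function below is a hypothetical convenience for composing tagged input text, not part of any Google SDK.

```python
# Minimal sketch: composing a TTS prompt with inline audio tags.
# Tag names ([enthusiasm], [whispers], ...) are from the article;
# the tag() helper itself is illustrative, not an official API.

def tag(text: str, style: str) -> str:
    """Prefix a text segment with an audio tag like [whispers]."""
    return f"[{style}] {text}"

prompt = " ".join([
    tag("Welcome back to the show!", "enthusiasm"),
    tag("But here is the part nobody talks about.", "whispers"),
])
print(prompt)
# → [enthusiasm] Welcome back to the show! [whispers] But here is the part nobody talks about.
```

Because the tags live in the input text itself, no separate style parameter or post-processing step is needed per segment.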
Beyond emotional expression, the model offers format templates for common use cases: podcast conversation, audiobook narrator, language tutor, voice assistant, wellness guide, news broadcaster, and support agent. Users can select from regional accents across major languages. For English alone, options include American variants like Valley and Southern, as well as British options such as Brixton and RP. All settings can be exported as API code for integration into production workflows.
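Exported settings might look something like the following. The article does not show the actual schema, so every field name here (template, accent, speakers) is an assumption used purely to illustrate the template-plus-accent combination.

```python
# Hypothetical exported configuration combining a format template with
# a regional accent. Field names are illustrative assumptions only;
# the article confirms settings can be exported as API code but does
# not document the schema.

config = {
    "template": "podcast_conversation",  # one of the format templates above
    "accent": "en-GB-RP",                # e.g. British Received Pronunciation
    "speakers": 2,                       # multi-speaker dialogue
}
print(config)
```

A configuration object like this would let the same template and accent choices made interactively in AI Studio be replayed in a production request.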
Gemini 3.1 Flash TTS benchmarks and technical specs
On the Artificial Analysis TTS leaderboard, Gemini 3.1 Flash TTS achieved an Elo score of 1,211, ranking second overall. The model surpasses ElevenLabs v3 in overall quality and sits just behind Inworld 1.5 Max, standing out particularly for its quality-to-price ratio.
The model supports more than 70 languages natively, including Japanese, Hindi, and German. It handles multi-speaker dialogue without requiring separate model calls for each speaker, which simplifies the production of conversational content such as podcasts and interactive voice applications.
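A single-call dialogue request could be assembled as shown below. The speaker-labeled transcript format is an assumption; the article states only that multi-speaker dialogue needs no separate model calls per speaker.

```python
# Hypothetical sketch: formatting a two-speaker transcript for one
# TTS request. The "Speaker: line" layout is an assumption; the source
# confirms only that one call can cover multiple speakers.

def build_dialogue(turns: list[tuple[str, str]]) -> str:
    """Join (speaker, line) pairs into a labeled transcript."""
    return "\n".join(f"{speaker}: {line}" for speaker, line in turns)

script = build_dialogue([
    ("Host", "Thanks for joining us today."),
    ("Guest", "Happy to be here."),
])
print(script)
```

Passing the whole transcript at once is what makes podcast-style content cheaper to produce: one request instead of one per speaker turn.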
Gemini 3.1 Flash TTS pricing and availability
Gemini 3.1 Flash TTS is available in preview through four channels: the Gemini API for developers, Google AI Studio for free experimentation, Vertex AI for enterprise users, and Google Vids for Workspace subscribers. The paid tier is priced at $1.00 per million input tokens and $20.00 per million audio output tokens. A batch mode offers a 50% discount at $0.50 and $10.00 respectively. A free tier is also available, though Google notes that data from free-tier usage may be used for product improvement.
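The pricing above translates into a simple cost estimate. The rates are taken directly from the figures quoted in this article; the token counts in the example are illustrative.

```python
# Cost estimate using the preview prices quoted above: $1.00 per million
# input tokens and $20.00 per million audio output tokens, halved in
# batch mode. Token counts below are illustrative.

def estimate_cost(input_tokens: int, output_tokens: int, batch: bool = False) -> float:
    """Return the estimated cost in USD for one workload."""
    input_rate, output_rate = 1.00, 20.00  # USD per million tokens
    if batch:
        input_rate, output_rate = 0.50, 10.00
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# One million input tokens and one million audio output tokens:
print(estimate_cost(1_000_000, 1_000_000))              # → 21.0
print(estimate_cost(1_000_000, 1_000_000, batch=True))  # → 10.5
```

Because audio output tokens dominate the bill at a 20:1 rate ratio, the batch discount matters most for long-form generation such as audiobooks.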
How does Gemini 3.1 Flash TTS compare to other TTS models?
The TTS landscape in April 2026 includes strong offerings from ElevenLabs, Inworld, and OpenAI. Gemini 3.1 Flash TTS differentiates itself through the combination of its audio tag system and competitive pricing. While Inworld 1.5 Max holds the top Elo position, Google’s model offers a broader feature set at a lower per-token cost, making it attractive for high-volume applications like customer service, content creation, and accessibility tools.
Compared with earlier models in the Gemini TTS family, the 3.1 Flash version adds the 200+ audio tag system, expands language support, and introduces native multi-speaker dialogue. The format templates for specific use cases also reduce the setup required for common applications.
SynthID watermarking and safety
All audio generated by Gemini 3.1 Flash TTS is watermarked with Google’s SynthID technology. This imperceptible watermark is embedded directly into the audio output, allowing reliable detection of AI-generated speech. Google frames this as a safeguard against misuse and misinformation, making it possible for downstream systems to verify whether a piece of audio was produced by the model.
Gemini 3.1 Flash TTS is available now in preview. For full documentation, pricing details, and API access, visit the official Google blog post or the Gemini API documentation.