Google DeepMind has released Gemini 3.1 Flash TTS, its most controllable text-to-speech model yet. Unlike earlier voice systems that offered limited style presets, Flash TTS uses 'audio tags' — natural language commands placed in square brackets within text — that let developers specify emotion, pacing, accent style, and delivery format at a granular level. The model supports more than 200 audio tags, 70 languages, 30 distinct voices, and native multi-speaker dialogue, meaning a single model call can generate a two-person conversation with different voices, accents, and emotional registers.
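To make the bracketed-tag idea concrete, here is a minimal Python sketch of how a developer might assemble tagged text before sending it to a speech model. It makes no API calls. The helper names (`tag`, `tagged_line`, `dialogue`, `extract_tags`, `strip_tags`), the example tag wording, and the `Speaker: text` layout for multi-speaker scripts are all assumptions made for illustration. The article only states that tags are natural language commands in square brackets, so check the official documentation for the supported tags and request format.

```python
import re

# Matches one bracketed tag such as "[whispering]" (no nested brackets).
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def tag(command: str) -> str:
    """Wrap a natural-language command in square brackets."""
    command = command.strip()
    if not command or "[" in command or "]" in command:
        raise ValueError("a tag must be non-empty and must not contain brackets")
    return f"[{command}]"


def tagged_line(text: str, *commands: str) -> str:
    """Prefix a line of text with zero or more audio tags."""
    prefix = " ".join(tag(c) for c in commands)
    return f"{prefix} {text}".strip()


def dialogue(turns) -> str:
    """Build a multi-speaker script from (speaker, commands, text) tuples.

    The "Speaker: text" layout is a hypothetical convention, not a
    documented format.
    """
    return "\n".join(
        f"{speaker}: {tagged_line(text, *commands)}"
        for speaker, commands, text in turns
    )


def extract_tags(script: str) -> list:
    """List every tag in a script, in order of appearance."""
    return TAG_PATTERN.findall(script)


def strip_tags(script: str) -> str:
    """Remove tags and tidy spacing, e.g. to display a plain transcript."""
    without_tags = TAG_PATTERN.sub("", script)
    return re.sub(r"[ \t]{2,}", " ", without_tags).strip()


# Hypothetical tag wording, chosen only for the example.
script = dialogue([
    ("Host", ["cheerful"], "Welcome back."),
    ("Guest", ["calm", "British accent"], "Glad to be here."),
])
print(script)
print(extract_tags(script))
print(strip_tags(script))
```

Keeping tag construction in small helpers like these makes it easy to validate tags before a request and to recover a clean transcript afterwards, which fits the article's point that explicit control reduces unpredictability.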

The model launched in public preview on April 15. It is available via Google AI Studio for free-tier prototyping and through the Gemini API and Vertex AI for production use. Flash TTS follows the broader Gemini 3.1 rollout: Gemini 3.1 Flash Live, an audio-to-audio real-time dialogue model, launched in late March, and Gemini 3.1 Pro became globally available earlier this month with enhanced reasoning for complex coding and data analysis tasks. Google has been deepening the audio layer of Gemini considerably, partly in response to competition from ElevenLabs and OpenAI's voice features, and to growing enterprise demand for voice-first AI interfaces.

Gemini as a platform now serves 750 million users, a figure Google confirmed alongside the 3.1 Pro launch. That scale makes the TTS release more than a feature update: it is a building block for developers creating voice applications, educational tools, accessibility products, and interactive assistants on a platform with a very large user base. The audio tag system is a notable design choice. Rather than training a model to infer vocal style from context, Google is giving developers explicit control, which reduces unpredictability in production.

For students building AI projects, Gemini 3.1 Flash TTS is worth experimenting with via the free tier in Google AI Studio. The audio tag system is a clean example of how AI capabilities are being packaged for developers: not just as raw model outputs, but as structured, controllable interfaces. Voice is one of the fastest-growing AI modalities, and knowing how to build with text-to-speech APIs will be a practical skill in many domains, including education, accessibility, and creative media.