Google's Gemini 3.1 Flash Live Targets Real-Time Voice AI
Gemini 3.1 Flash Live brings low-latency audio, video, and tool use to Google’s Live API, with 90.8% on ComplexFuncBench Audio.

Google says Gemini 3.1 Flash Live is its highest-quality audio and speech model to date, and the numbers explain why. The model is in preview through the Gemini Live API and is built for real-time voice, video, and tool use in one session.
That matters because voice agents usually lose time in a long chain of transcription, reasoning, and speech synthesis. Google is trying to compress that chain into a single low-latency stream, which is exactly what developers need if they want assistants that can keep up with human conversation.
Why Google changed the voice stack
Traditional voice assistants often work like a relay race. First comes voice activity detection, then speech-to-text, then the language model, then text-to-speech. Each step adds delay, and each step creates another place for errors to creep in.

Gemini 3.1 Flash Live takes a different route. It processes audio natively, so it can react to pitch, pace, pauses, and background noise without waiting for a full transcript first. Google says this improves recognition of speech characteristics compared with Gemini 2.5 Flash Native Audio.
That sounds subtle until you think about where voice AI is actually used. A model that works only in a quiet demo room is one thing. A model that can keep up in a car, a store, or a crowded office is the one developers can ship. Google's published specs for the preview spell out the interface:
- Input format: 16-bit PCM audio at 16 kHz, little-endian
- Output format: raw PCM audio at 24 kHz
- Video input: JPEG or PNG frames, about 1 frame per second
- Session memory: 128k tokens
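If your capture pipeline produces WAV files or raw float samples, getting them into that input format is mostly a packing exercise. Here is a minimal sketch using only the Python standard library; the helper names are illustrative and not part of any Google SDK, and it assumes the audio is already mono at 16 kHz rather than resampling it.

```python
import struct
import wave

def wav_to_pcm16le(path: str, expected_rate: int = 16_000) -> bytes:
    """Read a mono 16-bit WAV file and return its raw PCM payload."""
    with wave.open(path, "rb") as wf:
        if wf.getframerate() != expected_rate:
            raise ValueError(f"expected {expected_rate} Hz, got {wf.getframerate()} Hz")
        if wf.getnchannels() != 1 or wf.getsampwidth() != 2:
            raise ValueError("expected mono 16-bit audio")
        return wf.readframes(wf.getnframes())  # WAV already stores PCM little-endian

def float_to_pcm16le(samples) -> bytes:
    """Convert float samples in [-1.0, 1.0] (e.g. from a mic library) to 16-bit PCM."""
    return b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
    )
```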
The Live API also uses a stateful WebSocket connection, so the model can keep a single conversation open instead of treating every request like a fresh start. That design fits real-time agents much better than plain REST calls.
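A hedged sketch of what that looks like with the google-genai Python SDK: `client.aio.live.connect` opens one WebSocket session that stays alive for the whole exchange. The model id below is a placeholder, the config keys and method names (`send_realtime_input`, `receive`) reflect the SDK at time of writing, and the preview API may change, so verify against the current Live API reference.

```python
import asyncio
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

MODEL_ID = "gemini-3.1-flash-live-preview"  # placeholder id; check the model list

async def converse(pcm_chunks):
    config = {"response_modalities": ["AUDIO"]}
    async with client.aio.live.connect(model=MODEL_ID, config=config) as session:
        # One stateful WebSocket session: stream 16 kHz PCM chunks up...
        for chunk in pcm_chunks:
            await session.send_realtime_input(
                audio=types.Blob(data=chunk, mime_type="audio/pcm;rate=16000")
            )
        # ...and read 24 kHz PCM audio back as the model generates it.
        async for message in session.receive():
            if message.data:
                play_audio(message.data)  # hypothetical playback callback

# asyncio.run(converse(chunks))  # usage, given an iterable of PCM byte chunks
```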
There is also support for barge-in, which lets a user interrupt the model mid-response. That feature sounds small, but it changes the feel of the product. People talk over assistants all the time. If the model cannot handle interruptions, it feels slow and mechanical.
The Live API is where the real work happens
Google’s Multimodal Live API is the developer-facing piece that makes this model useful. It keeps a persistent bi-directional stream open over WebSockets, so audio, video, transcripts, and tool calls can move in both directions during one session.
That matters for agent builders because the model can receive new user speech while it is still speaking. It can also send audio output without waiting for a separate text-to-speech service, which cuts out another source of delay.
Google’s docs also point to bundled content parts in a single server event, which simplifies client synchronization. If you have ever tried to keep audio playback, transcript rendering, and tool feedback aligned in a live app, you know that small protocol details can save a lot of engineering time.
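In practice that means one receive loop can demultiplex everything the client needs. A sketch, assuming the google-genai SDK's server message fields (`data`, `server_content.output_transcription`, `tool_call`); the player, UI, and dispatcher arguments are hypothetical application hooks, not part of the API.

```python
async def pump_server_events(session, audio_player, ui, dispatch_tools):
    """Route bundled server events to audio playback, transcript UI, and tools."""
    async for message in session.receive():
        if message.data:  # raw 24 kHz PCM chunk from the model
            audio_player.enqueue(message.data)
        sc = message.server_content
        if sc is not None and sc.output_transcription is not None:
            ui.append_transcript(sc.output_transcription.text)  # model speech as text
        if message.tool_call is not None:  # function calls arrive in the same stream
            dispatch_tools(message.tool_call.function_calls)
```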
“The model doesn’t just use a transcript; it processes acoustic nuances directly.”
That line from Google’s release captures the main idea. The company is betting that native audio understanding will matter more than transcript-first pipelines for the next wave of voice agents.
Google also published a gemini-skills repository to help developers keep agent instructions current inside coding tools. One of those skills, gemini-live-api-dev, focuses on WebSocket sessions and audio/video blob handling.
The benchmark numbers are the part developers will notice
Google did not just talk about latency. It also published benchmark results that point to stronger agentic behavior. The headline number is 90.8% on ComplexFuncBench Audio, which tests multi-step function calling from audio input.

That score matters because voice agents become useful when they can do real work, not just answer questions. If a spoken request can trigger a chain of tool calls, the model starts behaving like an operator instead of a chatbot.
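What that chain looks like in code: the model emits function calls over the same session, the client runs them, and the results go back down the socket. A sketch assuming the SDK's `send_tool_response` and `types.FunctionResponse`; the `get_order_status` tool and `lookup_order` helper are invented for illustration.

```python
from google.genai import types

async def answer_tool_calls(session, tool_call):
    """Run each requested function and return results over the live session."""
    responses = []
    for fc in tool_call.function_calls:
        if fc.name == "get_order_status":  # hypothetical tool declared in the config
            result = {"status": lookup_order(fc.args["order_id"])}  # hypothetical lookup
        else:
            result = {"error": f"unknown tool: {fc.name}"}
        responses.append(
            types.FunctionResponse(id=fc.id, name=fc.name, response=result)
        )
    await session.send_tool_response(function_responses=responses)
```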
Google also reported 36.1% on Audio MultiChallenge with thinking enabled. This benchmark checks how well the model follows instructions in noisy, interrupted speech, which is a better test of real-world use than polished demo prompts.
- ComplexFuncBench Audio: 90.8%
- Audio MultiChallenge: 36.1% with thinking enabled
- Context window: 128k tokens
- Thinking levels: minimal, low, medium, high
Taken together, these numbers show a model being tuned for two goals at once: quick conversational response and deeper reasoning when a task needs it. Developers can choose a thinkingLevel setting from minimal to high, trading latency for more deliberate reasoning.
That tradeoff is the right one. A customer support bot should answer fast. A field assistant inspecting a machine through live video may need more internal reasoning before it speaks. One model can cover both cases if the controls are exposed cleanly.
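As a sketch of how that choice might look in a session config (the exact key names for the thinking control are an assumption based on the article's thinkingLevel description; verify against the current docs):

```python
# Low-latency support bot: answer quickly, minimal internal reasoning.
SUPPORT_CONFIG = {
    "response_modalities": ["AUDIO"],
    "thinking_config": {"thinking_level": "minimal"},  # assumed key names
}

# Field assistant inspecting live video: allow more deliberate reasoning.
FIELD_CONFIG = {
    "response_modalities": ["AUDIO"],
    "thinking_config": {"thinking_level": "high"},
}

# Either dict would be passed as the config argument when opening the session,
# e.g. client.aio.live.connect(model=MODEL_ID, config=SUPPORT_CONFIG).
```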
What the Gemini Skills repo adds
Google’s Gemini Skills repo is easy to miss, but it is one of the more practical parts of this release. The repo packages curated context and documentation that can be injected into coding assistants so they stop relying on stale assumptions.
According to Google’s repo notes, adding a relevant skill improved code-generation accuracy to 87% with Gemini 3 Flash and 96% with Gemini 3 Pro. Those numbers are useful because agent tooling often fails on documentation drift, not just model quality.
Here is the practical comparison developers should care about:
- Without current API context, agents guess at session behavior and event formats
- With a skill pack, the assistant can follow the latest WebSocket and blob-handling rules
- For teams shipping voice products, fewer wrong assumptions means fewer broken demos and fewer late-night fixes
- For internal coding copilots, the repo becomes a living memory layer for new API behavior
This is one of those unglamorous details that can matter more than a flashy demo. A better model is nice. A model plus fresh implementation guidance is what keeps a team from shipping code that almost works.
If you want a broader look at how agent workflows are changing, see our coverage of production-ready AgentScope workflows and recent voice-agent releases.
What this means for real products
Gemini 3.1 Flash Live is still in preview, so nobody should treat it like a finished production endpoint. It also has constraints: 16 kHz PCM input, 24 kHz output, synchronous function calling, and a live-session workflow that asks for more care than a simple chat API.
Still, the direction is clear. Google is making a case that the next useful voice agent will be one that hears, sees, reasons, and acts in one continuous loop. That is a very different product from a chatbot with a microphone attached.
My guess is that the first teams to benefit will be building customer support tools, mobile assistants, and field-service apps where interruptions, noise, and tool calls are part of the job. If Google keeps the latency low and the API stable, the real question will not be whether developers try this model, but which products can justify the extra complexity of live multimodal sessions.
For now, the takeaway is simple: if your voice product still depends on a transcript-first pipeline, Gemini 3.1 Flash Live is a strong reason to rethink that architecture before your competitors do.