[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-googles-gemini-3-1-flash-live-real-time-voice-ai-zh":3,"tags-googles-gemini-3-1-flash-live-real-time-voice-ai-zh":34,"related-lang-googles-gemini-3-1-flash-live-real-time-voice-ai-zh":49,"related-posts-googles-gemini-3-1-flash-live-real-time-voice-ai-zh":53,"series-model-release-5265564f-1e07-4677-acf9-5848867b1aab":90},{"id":4,"title":5,"content":6,"summary":7,"source":8,"source_url":9,"author":10,"image_url":11,"keywords":12,"language":22,"translated_content":10,"views":23,"is_premium":24,"created_at":25,"updated_at":25,"cover_image":11,"published_at":26,"rewrite_status":27,"rewrite_error":10,"rewritten_from_id":28,"slug":29,"category":30,"related_article_id":31,"status":32,"google_indexed_at":33,"x_posted_at":10,"tweet_text":10,"title_rewritten_at":10,"title_original":10,"key_takeaways":10,"topic_cluster_id":10,"embedding":10,"is_canonical_seed":24},"5265564f-1e07-4677-acf9-5848867b1aab","Gemini 3.1 Flash Live 主打即時語音 …","\u003Cp>Google 的 \u003Ca href=\"https:\u002F\u002Fblog.google\u002Finnovation-and-ai\u002Fmodels-and-research\u002Fgemini-models\u002Fgemini-3-1-flash-live\u002F\" target=\"_blank\" rel=\"noopener\">Gemini 3.1 Flash Live\u003C\u002Fa> 很有意思。它把語音、影像、工具呼叫放進同一條即時串流。Google 也丟出一個很硬的數字：\u003Cstrong>ComplexFuncBench Audio 90.8%\u003C\u002Fstrong>。\u003C\u002Fp>\u003Cp>講白了，這是在挑戰傳統語音管線。以前是先轉文字，再推理，再合成語音。每一步都會卡一下。現在它想把這條路縮短成一個連續流程。\u003C\u002Fp>\u003Cp>對台灣開發者來說，這種模型不是拿來看 demo 而已。它影響的是客服、車載助理、門市導購，還有各種要即時回話的軟體。只要延遲太高，使用者就會直接覺得很笨。\u003C\u002Fp>\u003Ch2>為什麼 Google 會改語音架構\u003C\u002Fh2>\u003Cp>傳統語音系統像接力賽。先做 VAD，再做 speech-to-text，再丟給 LLM，最後再做 text-to-speech。每一段都吃時間，也都可能出錯。\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775168331862-ovz5.png\" alt=\"Gemini 3.1 Flash Live 主打即時語音 …\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>\u003Ca href=\"https:\u002F\u002Fai.google.dev\u002Fgemini-api\u002Fdocs\u002Flive\" target=\"_blank\" rel=\"noopener\">Gemini Live API\u003C\u002Fa> 的思路不同。它走 stateful WebSocket。這代表同一個 session 可以一直維持下去，不用每次都重來。\u003C\u002Fp>\u003Cp>更重要的是，它直接處理音訊。Google 說這樣能更好抓到語速、停頓、音高，還有背景雜訊。這點很實際。因為真實世界不是安靜錄音間，而是捷運、辦公室、店面，甚至是工地。\u003C\u002Fp>\u003Cul>\u003Cli>音訊輸入：16-bit PCM、16 kHz、little-endian\u003C\u002Fli>\u003Cli>音訊輸出：raw PCM、24 kHz\u003C\u002Fli>\u003Cli>影像輸入：約 1 fps 的 JPEG 或 PNG frame\u003C\u002Fli>\u003Cli>上下文長度：128k tokens\u003C\u002Fli>\u003C\u002Ful>\u003Cp>這些規格很像工程師會在意的細節，但其實就是產品能不能\u003Ca href=\"\u002Fnews\u002Fferresdb-production-rust-vector-db-updates-zh\">上線的\u003C\u002Fa>分水嶺。16 kHz 足夠做語音理解。24 kHz 輸出也能維持基本的播報品質。對很多即時應用來說，這組設定算務實。\u003C\u002Fp>\u003Cp>它還支援 barge-in。使用者可以直接打斷模型。這功能看起來小，實際上很重要。人本來就會插話，模型如果不會停，整體體驗就會很假。\u003C\u002Fp>\u003Ch2>Live API 才是重點\u003C\u002Fh2>\u003Cp>真正讓這個模型能用的，是 \u003Ca href=\"https:\u002F\u002Fai.google.dev\u002Fgemini-api\u002Fdocs\u002Flive\" target=\"_blank\" rel=\"noopener\">Multimodal Live API\u003C\u002Fa>。它維持雙向串流。音訊、字幕、影像、工具呼叫都能在同一個 session 裡流動。\u003C\u002Fp>\u003Cp>這種設計很適合 agent。因為模型可以邊聽邊想，甚至邊講邊收新指令。你不用先把整段語音轉完，再丟給另一個服務做 TTS。少一層，就少一層延遲。\u003C\u002Fp>\u003Cp>Google 也提到，伺服器事件可以把多個內容片段一起送出。這對前端同步很有幫助。做過即時字幕、音訊播放、工具回應同步的人都知道，對齊這些狀態真的很煩。\u003C\u002Fp>\u003Cblockquote>“The model doesn’t just use a transcript; it processes acoustic nuances directly.”\u003C\u002Fblockquote>\u003Cp>這句話很直白。Google 想說的是，模型不是只看轉好的文字。它直接吃聲學訊號。這對吵雜環境特別有用。\u003C\u002Fp>\u003Cp>另外，Google 還公開了 \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fgoogle-gemini\u002Fgemini-skills\" target=\"_blank\" rel=\"noopener\">gemini-skills\u003C\u002Fa> repo。裡面有針對工具鏈與 API 行為的技能包。像 \u003Ccode>gemini-live-api-dev\u003C\u002Fcode> 這類內容，會幫助工程師少踩一些版本差異的坑。\u003C\u002Fp>\u003Ch2>數字比空話更有用\u003C\u002Fh2>\u003Cp>這次最能打的不是行銷詞，是 benchmark。\u003Cstrong>90.8%\u003C\u002Fstrong> 的 \u003Ca href=\"https:\u002F\u002Fblog.google\u002Finnovation-and-ai\u002Fmodels-and-research\u002Fgemini-models\u002Fgemini-3-1-flash-live\u002F\" target=\"_blank\" rel=\"noopener\">ComplexFuncBench Audio\u003C\u002Fa> 很醒目。它測的是從語音輸入觸發多步驟 function calling 的能力。\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775168327854-e3r0.png\" alt=\"Gemini 3.1 Flash Live 主打即時語音 …\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>這代表模型不只會聊天。它還要能做事。像查訂單、開工單、查庫存、叫 API，這些都屬於 voice agent 真正會碰到的工作。\u003C\u002Fp>\u003Cp>Google 也提到 \u003Cstrong>36.1%\u003C\u002Fstrong> 的 Audio Multi\u003Ca href=\"\u002Fnews\u002Fchainalysis-ai-agents-crypto-investigations-zh\">Cha\u003C\u002Fa>llenge，且是在 thinking enabled 的狀態下。這個測試比較接近吵雜、打斷、指令混雜的現場情境。分數不算誇張，但至少方向對。\u003C\u002Fp>\u003Cul>\u003Cli>ComplexFuncBench Audio：90.8%\u003C\u002Fli>\u003Cli>Audio MultiChallenge：36.1%\u003C\u002Fli>\u003Cli>Context window：128k tokens\u003C\u002Fli>\u003Cli>thinkingLevel：minimal、low、medium、high\u003C\u002Fli>\u003C\u002Ful>\u003Cp>這裡可以看出 Google 的策略。它不是只追求快。它也保留了思考深度的選項。開發者可以調 \u003Ccode>thinkingLevel\u003C\u002Fcode>，在低延遲和高推理之間切換。\u003C\u002Fp>\u003Cp>我覺得這很合理。客服機器人要快。現場維修助理可能要慢一點，但要更準。把兩種需求塞在同一個模型裡，對產品團隊比較省事。\u003C\u002Fp>\u003Ch2>Gemini Skills 為什麼實用\u003C\u002Fh2>\u003Cp>\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fgoogle-gemini\u002Fgemini-skills\" target=\"_blank\" rel=\"noopener\">Gemini Skills\u003C\u002Fa> 這個 repo 很容易被忽略，但它其實很實際。它把最新文件與上下文整理成可注入的技能，讓 coding assistant 少靠猜。\u003C\u002Fp>\u003Cp>Google 在 repo 說明裡提到，加入相關 skill 後，code-generation accuracy 可到 \u003Cstrong>87%\u003C\u002Fstrong>，在 \u003Ca href=\"https:\u002F\u002Fai.google.dev\u002Fgemini-api\u002Fdocs\u002Fmodels\u002Fgemini-3-flash\" target=\"_blank\" rel=\"noopener\">Gemini 3 Flash\u003C\u002Fa> 上是這樣；在 \u003Ca href=\"https:\u002F\u002Fai.google.dev\u002Fgemini-api\u002Fdocs\u002Fmodels\u002Fgemini-3-pro\" target=\"_blank\" rel=\"noopener\">Gemini 3 Pro\u003C\u002Fa> 上則到 \u003Cstrong>96%\u003C\u002Fstrong>。這種提升，通常不是模型本體單獨做到的，而是資料上下文整理得夠好。\u003C\u002Fp>\u003Cp>這對團隊很重要。很多 API 問題不是模型不會，而是文件過期。技能包可以把 WebSocket、blob、session 這些細節固定下來。少猜一次，就少改一次 bug。\u003C\u002Fp>\u003Cul>\u003Cli>沒有最新上下文時，agent 容易亂猜事件格式\u003C\u002Fli>\u003Cli>有 skill pack 時，模型比較能跟上最新 API 規則\u003C\u002Fli>\u003Cli>對 voice team 來說，這能少掉很多 demo 翻車\u003C\u002Fli>\u003Cli>對內部 copilot 來說，這像一層活的知識庫\u003C\u002Fli>\u003C\u002Ful>\u003Cp>這種東西不花俏，但很值錢。因為真正拖慢專案的，常常不是模型不夠強，而是工程團隊一直在補文件落差。\u003C\u002Fp>\u003Cp>如果你想看更多 agent 實作脈絡，可以參考我們對 \u003Ca href=\"\u002Fnews\u002Fproduction-ready-agentscope-workflows\" target=\"_blank\" rel=\"noopener\">production-ready AgentScope workflows\u003C\u002Fa> 的整理。\u003C\u002Fp>\u003Ch2>和其他語音模型比起來怎樣\u003C\u002Fh2>\u003Cp>拿這次的 \u003Ca href=\"https:\u002F\u002Fai.google.dev\u002Fgemini-api\u002Fdocs\u002Fmodels\u002Fgemini-2-5-flash-native-audio\" target=\"_blank\" rel=\"noopener\">Gemini 2.5 Flash Native Audio\u003C\u002Fa> 來比，Gemini 3.1 Flash Live 的重點不是單純變大，而是更適合即時互動。它強調原生音訊處理，也更明顯把工具呼叫放進 live flow。\u003C\u002Fp>\u003Cp>和一般 transcript-first 架構比，差別更大。傳統作法常要先等整段語音結束。這對聊天型產品還行。可是一旦要插話、查資料、控制設備，就會慢半拍。\u003C\u002Fp>\u003Cp>如果拿市面上的 voice agent 來看，很多產品還卡在「可以聽懂」而已。真正難的是「聽懂後立刻做事」。這也是為什麼 function calling 的 benchmark 會變成重點。\u003C\u002Fp>\u003Cul>\u003Cli>傳統管線：VAD → STT → LLM → TTS\u003C\u002Fli>\u003Cli>Gemini 3.1 Flash Live：原生音訊 + 即時串流\u003C\u002Fli>\u003Cli>一般 voice bot：常卡在單向請求\u003C\u002Fli>\u003Cli>Live API：適合持續互動與中途插話\u003C\u002Fli>\u003C\u002Ful>\u003Cp>從產品角度看，這會影響兩類團隊。第一類是做客服與銷售。第二類是做現場工具。前者重視回應速度，後者重視工具串接與上下文維持。\u003C\u002Fp>\u003Cp>如果你現在的架構還很依賴 transcript，那就該認真想一下。因為你的競品可能已經開始測 live multimodal 了。\u003C\u002Fp>\u003Ch2>這波其實反映產業方向\u003C\u002Fh2>\u003Cp>語音 AI 這幾年一直在補洞。早期大家都在拼辨識率。後來開始拼對話品質。現在輪到即時性。因為使用者已經不想等。\u003C\u002Fp>\u003Cp>這也跟硬體環境有關。手機麥克風、筆電鏡頭、耳機、車載系統，全都在變得更常見。硬體到位後，軟體就得跟上。否則裝置再多，也只是多幾個收音孔。\u003C\u002Fp>\u003Cp>對台灣市場來說，這類模型很可能先落在客服、零售、教育、製造維修。這幾個場景都有共同點：噪音多、流程碎、要即時回應。剛好都是 live API 擅長的地方。\u003C\u002Fp>\u003Cp>你如果做 SaaS，現在就可以想兩件事。第一，你的產品需不需要 barge-in。第二，你的工具呼叫是不是可以在同一個 session 裡完成。這兩題的答案，會直接影響架構。\u003C\u002Fp>\u003Ch2>接下來怎麼看\u003C\u002Fh2>\u003Cp>Gemini 3.1 Flash Live 還在 pr\u003Ca href=\"\u002Fnews\u002Ftested-devin-10-tasks-finished-3-zh\">evi\u003C\u002Fa>ew。這代表它還不是可以隨便亂上 production 的東西。16 kHz PCM、24 kHz 輸出、WebSocket session、同步 function calling，這些限制都要先吃下來。\u003C\u002Fp>\u003Cp>但方向已經很清楚。Google 想把語音 AI 從「會回答」推到「會即時互動」。如果你正在做客服、助理、或現場工具，我會建議先做一版小型測試。看延遲、看打斷、看工具呼叫成功率。\u003C\u002Fp>\u003Cp>我的預測很直接。接下來 6 到 12 個月，會有更多團隊把 transcript-first 架構換掉。不是因為舊架構不能用，而是新的 live flow 更像真實對話。問題只剩一個：你的產品準備好接這種 session 了嗎？\u003C\u002Fp>","Gemini 3.1 Flash Live 把低延遲語音、影像與工具呼叫塞進 Google Live API。ComplexFuncBench Audio 拿下 90.8%，很適合做即時語音代理。","www.marktechpost.com","https:\u002F\u002Fwww.marktechpost.com\u002F2026\u002F03\u002F26\u002Fgoogle-releases-gemini-3-1-flash-live-a-real-time-multimodal-voice-model-for-low-latency-audio-video-and-tool-use-for-ai-agents\u002F",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775168331862-ovz5.png",[13,14,15,16,17,18,19,20,21],"Gemini 3.1 Flash Live","Google Gemini","Live API","語音 AI","即時語音","多模態 AI","function calling","WebSocket","台灣開發者","zh",1,false,"2026-04-02T22:18:32.56012+00:00","2026-04-02T22:18:32.504+00:00","done","10967849-85a7-4089-80d7-dfc8b8ae324c","googles-gemini-3-1-flash-live-real-time-voice-ai-zh","model-release","d6233062-5791-432c-944d-02e125e4e299","published","2026-04-07T07:41:14.995+00:00",[35,37,38,40,42,44,46,47],{"name":14,"slug":36},"google-gemini",{"name":17,"slug":17},{"name":18,"slug":39},"多模態-ai",{"name":13,"slug":41},"gemini-31-flash-live",{"name":16,"slug":43},"語音-ai",{"name":20,"slug":45},"websocket",{"name":21,"slug":21},{"name":15,"slug":48},"live-api",{"id":31,"slug":50,"title":51,"language":52},"googles-gemini-3-1-flash-live-real-time-voice-ai-en","Google's Gemini 3.1 Flash Live Targets Real-Time Voice AI","en",[54,60,66,72,78,84],{"id":55,"slug":56,"title":57,"cover_image":58,"image_url":58,"created_at":59,"category":30},"5b5fa24f-5259-4e9e-8270-b08b6805f281","minimax-m1-open-hybrid-attention-reasoning-model-zh","MiniMax-M1：開源 1M Token 推理模型","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778797859209-ea1g.png","2026-05-14T22:30:38.636592+00:00",{"id":61,"slug":62,"title":63,"cover_image":64,"image_url":64,"created_at":65,"category":30},"b1da56ac-8019-4c6b-a8dc-22e6e22b1cb5","gemini-omni-video-review-text-rendering-zh","Gemini Omni 影片模型怎麼了","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778779280109-lrrk.png","2026-05-14T17:20:42.608312+00:00",{"id":67,"slug":68,"title":69,"cover_image":70,"image_url":70,"created_at":71,"category":30},"d63e9d93-e613-4bbf-8135-9599fde11d08","why-xiaomi-mimo-v25-pro-changes-coding-agents-zh","為什麼 Xiaomi 的 MiMo-V2.5-Pro 改變的是 Coding …","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778689858139-v38e.png","2026-05-13T16:30:27.893951+00:00",{"id":73,"slug":74,"title":75,"cover_image":76,"image_url":76,"created_at":77,"category":30},"8f0c9185-52f9-46f2-82c6-5baec126ba2e","openai-realtime-audio-models-live-voice-zh","OpenAI 即時音訊模型瞄準語音互動","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778451657895-2iu7.png","2026-05-10T22:20:32.443798+00:00",{"id":79,"slug":80,"title":81,"cover_image":82,"image_url":82,"created_at":83,"category":30},"52106dc2-4eba-4ca0-8318-fa646064de97","anthropic-10-finance-ai-agents-zh","Anthropic推10款金融AI Agent","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778389843399-vclb.png","2026-05-10T05:10:22.778762+00:00",{"id":85,"slug":86,"title":87,"cover_image":88,"image_url":88,"created_at":89,"category":30},"6ee6ed2a-35c6-4be3-ba2c-43847e592179","why-claudes-infinite-context-window-wont-autonomous-zh","為什麼 Claude 的「無限」上下文窗口，仍然不會讓 AI 自主運作","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778350250836-d5d5.png","2026-05-09T18:10:27.004984+00:00",[91,96,101,106,111,116,121,126,131,136],{"id":92,"slug":93,"title":94,"created_at":95},"58b64033-7eb6-49b9-9aab-01cf8ae1b2f2","nvidia-rubin-six-chips-one-ai-supercomputer-zh","NVIDIA Rubin 把六顆晶片塞進 AI 機櫃","2026-03-26T07:18:45.861277+00:00",{"id":97,"slug":98,"title":99,"created_at":100},"0dcc2c61-c2a6-480d-adb8-dd225fc68914","march-2026-ai-model-news-what-mattered-zh","2026 年 3 月 AI 模型新聞重點","2026-03-26T07:32:08.386348+00:00",{"id":102,"slug":103,"title":104,"created_at":105},"214ab08b-5ce5-4b5c-8b72-47619d8675dd","why-small-models-are-winning-on-device-ai-zh","小模型為何吃下裝置端 AI","2026-03-26T07:36:30.488966+00:00",{"id":107,"slug":108,"title":109,"created_at":110},"785624b2-0355-4b82-adc3-de5e45eecd88","midjourney-v8-faster-images-higher-costs-zh","Midjourney V8 變快了，也變貴了","2026-03-26T07:52:03.562971+00:00",{"id":112,"slug":113,"title":114,"created_at":115},"cda76b92-d209-4134-86c1-a60f5bc7b128","xiaomi-mimo-trio-agents-robots-voice-zh","小米 MiMo 三模型瞄準代理、機器人與語音","2026-03-28T03:05:08.779489+00:00",{"id":117,"slug":118,"title":119,"created_at":120},"9e1044b4-946d-47fe-9e2a-c2ee032e1164","xiaomi-mimo-v2-pro-1t-moe-agents-zh","小米 MiMo-V2-Pro 登場：1T MoE 模型","2026-03-28T03:06:19.002353+00:00",{"id":122,"slug":123,"title":124,"created_at":125},"d68e59a2-55eb-4a8f-95d6-edc8fcbff581","cursor-composer-2-started-from-kimi-zh","Cursor Composer 2 其實從 Kimi 起步","2026-03-28T03:11:58.893796+00:00",{"id":127,"slug":128,"title":129,"created_at":130},"c4b6186f-bd84-4598-997e-c6e31d543c0d","cursor-composer-2-agentic-coding-model-zh","Cursor Composer 2 走向代理式寫碼","2026-03-28T03:13:06.422716+00:00",{"id":132,"slug":133,"title":134,"created_at":135},"45812c46-99fc-4b1f-aae1-56f64f5c9024","openai-shuts-down-sora-video-app-api-zh","OpenAI 關閉 Sora App 與 API","2026-03-29T04:47:48.974108+00:00",{"id":137,"slug":138,"title":139,"created_at":140},"e112e76f-ec3b-408f-810e-e93ae21a888a","apple-siri-gemini-distilled-models-zh","Apple Siri 牽手 Gemini 的真相","2026-03-29T04:52:57.886544+00:00"]