Tag
multimodal agents
Multimodal agents combine text, audio, video, and tool use so models can interpret context and act in real time. The hard part is not only accuracy but deciding when to call tools, when to reason directly, and how to balance latency against reliability.
2 articles

Research/Apr 10
Act Wisely: Teaching Agents When Not to Call Tools
A new training scheme, HDPO, aims to cut blind tool use in multimodal agents by separating accuracy from tool efficiency.

Model Releases/Apr 3
Google’s Gemini 3.1 Flash Live Targets Real-Time Voice AI
Gemini 3.1 Flash Live brings low-latency audio, video, and tool use to Google’s Live API, with 90.8% on ComplexFuncBench Audio.