[
  { "data": 1, "prerenderedAt": -1 },
  ["ShallowReactive", 2],
  { "tag-vision-language": 3 },
  { "tag": 4, "articles": 10 },
  {
    "id": 5,
    "name": 6,
    "slug": 6,
    "article_count": 7,
    "description_zh": 8,
    "description_en": 9
  },
  "fabc42ac-43e0-4090-a27c-92dcce597044",
  "vision-language",
  3,
  "視覺語言模型把影像、文字與推理接到同一條管線，常見於圖文問答、偏好對齊與多模態 MoE。這個主題關注模型如何看懂畫面、選對專家並在任務規則下做出更穩定的判斷。",
  "Vision-language models connect images, text, and reasoning in one pipeline, powering tasks like VQA, preference alignment, and multimodal MoE. This topic centers on how models interpret visuals, route to the right experts, and stay reliable under task-specific constraints.",
  [11],
  {
    "id": 12,
    "slug": 13,
    "title": 14,
    "summary": 15,
    "category": 16,
    "image_url": 17,
    "cover_image": 17,
    "language": 18,
    "created_at": 19
  },
  "10a60b90-b59c-47e7-a6e5-a7fba43c353a",
  "multimodal-moe-routing-distraction-en",
  "Why multimodal MoE models get distracted",
  "A study of multimodal MoE models finds visual inputs can derail routing to reasoning experts, and a routing-guided fix improves results.",
  "research",
  "https://xxdpdyhzhpamafnrdkyq.supabase.co/storage/v1/object/public/covers/inline-1775801394754-ctzn.png",
  "en",
  "2026-04-10T06:09:35.090825+00:00"
]