[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"tag-llm-inference":3},{"tag":4,"articles":11},{"id":5,"name":6,"slug":7,"article_count":8,"description_zh":9,"description_en":10},"a487ff8b-bc7c-473d-b9f2-867dd22c9327","LLM inference","llm-inference",4,"LLM 推論聚焦模型在部署時的延遲、吞吐量與記憶體成本，尤其是 KV cache、量化與加速器友善的實作。這類技術直接影響大模型能否在雲端與邊緣裝置上穩定運行。","LLM inference covers the runtime side of large models: latency, throughput, memory footprint, and how KV cache, quantization, and accelerator-friendly kernels shape deployment. It matters because these choices determine whether a model is practical on GPUs, servers, or edge devices.",[12,21,28,35,42,49,56],{"id":13,"slug":14,"title":15,"summary":16,"category":17,"image_url":18,"cover_image":18,"language":19,"created_at":20},"407ca117-f24b-4ff9-96b8-09d4d4733b31","taming-black-box-llm-inference-scheduling-en","Taming Black-Box LLM Inference Scheduling","A scheduling approach for black-box LLM inference that uses predicted output lengths to reduce queueing friction at scale.","research","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778740250597-fhpf.png","en","2026-05-14T06:30:33.21401+00:00",{"id":22,"slug":23,"title":24,"summary":25,"category":17,"image_url":26,"cover_image":26,"language":19,"created_at":27},"01b8c278-3f2b-4c2c-8505-63dea2a0fd5f","saga-workflow-atomic-scheduling-gpu-clusters-en","SAGA makes AI agent GPU scheduling workflow-aware","SAGA argues GPU schedulers should treat an agent’s chained LLM calls as one workflow, not isolated requests.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778567457823-o68t.png","2026-05-12T06:30:33.774584+00:00",{"id":29,"slug":30,"title":31,"summary":32,"category":17,"image_url":33,"cover_image":33,"language":19,"created_at":34},"3d747e63-24a0-4e20-9e83-e2263d06a779","speckv-adaptive-speculative-decoding-gamma-en","SpecKV tunes speculative decoding on the fly","SpecKV adapts speculative decoding’s token budget per step, using draft-model signals to beat fixed gamma across compression settings.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1777961487463-lssf.png","2026-05-05T06:10:40.207648+00:00",{"id":36,"slug":37,"title":38,"summary":39,"category":17,"image_url":40,"cover_image":40,"language":19,"created_at":41},"bc8a4577-e218-43ae-a08b-4898abf26e2a","turboquant-online-vector-quantization-near-optimal-en","TurboQuant brings near-optimal online vector quantization","TurboQuant is an online, accelerator-friendly vector quantizer that targets near-optimal MSE and inner-product distortion.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1777467656845-z759.png","2026-04-29T13:00:40.593903+00:00",{"id":43,"slug":44,"title":45,"summary":46,"category":17,"image_url":47,"cover_image":47,"language":19,"created_at":48},"d7b529f2-02b7-4d5b-bf82-490aa5fe8362","turboquant-eden-citation-fight-en","TurboQuant, EDEN, and the citation fight","TurboQuant’s KV-cache quantization claims are under fire: EDEN authors say the paper reuses older ideas, weaker scales, and shaky 
benchmarks.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1777467061610-ug4x.png","2026-04-29T12:50:47.131528+00:00",{"id":50,"slug":51,"title":52,"summary":53,"category":17,"image_url":54,"cover_image":54,"language":19,"created_at":55},"fdb997e1-6691-46c5-bb2d-e1ca3f730c25","turboquant-google-paper-explained-en","TurboQuant Explained: Why Google’s New Paper Matters","Google’s TurboQuant paper targets KV cache bottlenecks with lower-bit quantization, aiming to cut LLM memory use and inference costs.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775160958409-7jj5.png","2026-04-02T20:15:40.601225+00:00",{"id":57,"slug":58,"title":59,"summary":60,"category":17,"image_url":61,"cover_image":61,"language":19,"created_at":62},"6fd1f021-a7ca-4fa7-9aae-6ca84b22dc6c","googles-turboquant-cuts-llm-memory-costs-en","Google's TurboQuant Cuts LLM Memory Costs","Google says TurboQuant uses QJL and PolarQuant to shrink vector-quantization memory and speed up LLM inference by up to 8x.","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775160776347-4esa.png","2026-04-02T20:12:32.387326+00:00"]