Tag
LLM inference
LLM inference covers the runtime side of large language models: latency, throughput, memory footprint, and how the KV cache, quantization, and accelerator-friendly kernels shape deployment. These choices determine whether a model is practical to serve on datacenter GPUs or on constrained edge devices.
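To make the memory-footprint point concrete, here is a minimal back-of-the-envelope sketch in Python of KV cache sizing. The model dimensions are hypothetical (roughly a 7B-class decoder) and vary by architecture:

# Back-of-the-envelope KV cache sizing for a decoder-only transformer.
# All dimensions below are hypothetical (roughly 7B-class); adjust for your model.
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_value):
    # Keys and values each occupy [batch, heads, seq_len, head_dim] per layer,
    # hence the leading factor of 2.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value

# fp16 cache (2 bytes per value) at batch 8 with a 4096-token context:
fp16 = kv_cache_bytes(32, 32, 128, 4096, 8, 2)
print(f"fp16 KV cache: {fp16 / 2**30:.1f} GiB")  # 16.0 GiB

# A 4-bit quantized cache holds the same values in a quarter of the space
# (ignoring the small overhead of per-group scales):
print(f"4-bit KV cache: {fp16 / 4 / 2**30:.1f} GiB")  # 4.0 GiB

Even this rough arithmetic shows why KV cache quantization, the focus of several of the articles below, is a first-order lever on serving cost.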
7 articles

Taming Black-Box LLM Inference Scheduling
A scheduling approach for black-box LLM inference that uses predicted output lengths to reduce queueing delays at scale.

SAGA makes AI agent GPU scheduling workflow-aware
SAGA argues GPU schedulers should treat an agent’s chained LLM calls as one workflow, not isolated requests.

SpecKV tunes speculative decoding on the fly
SpecKV adapts speculative decoding’s token budget per step, using draft-model signals to beat a fixed speculation length (gamma) across compression settings.

TurboQuant brings near-optimal online vector quantization
TurboQuant is an online, accelerator-friendly vector quantizer that targets near-optimal MSE and inner-product distortion.

TurboQuant, EDEN, and the citation fight
TurboQuant’s KV-cache quantization claims are under fire: EDEN’s authors say the paper reuses older ideas and rests on weaker scales and shaky benchmarks.

TurboQuant Explained: Why Google’s New Paper Matters
Google’s TurboQuant paper targets KV cache bottlenecks with lower-bit quantization, aiming to cut LLM memory use and inference costs.

Google’s TurboQuant Cuts LLM Memory Costs
Google says TurboQuant uses QJL and PolarQuant to shrink the memory footprint of quantized vectors and speed up LLM inference by up to 8x.