Tag
LLM inference
LLM inference covers the runtime side of large language models: latency, throughput, memory footprint, and how the KV cache, quantization, and accelerator-friendly kernels shape deployment. These choices determine whether a model is practical to serve on datacenter GPUs or on constrained edge devices.
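To make the memory-footprint point concrete, here is a minimal back-of-the-envelope sketch in Python of KV cache sizing. The model dimensions are hypothetical (roughly a 7B-class decoder) and vary by architecture:

# Back-of-the-envelope KV cache sizing for a decoder-only transformer.
# All dimensions below are hypothetical (roughly 7B-class); adjust for your model.
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_value):
    # Keys and values each occupy [batch, heads, seq_len, head_dim] per layer,
    # hence the leading factor of 2.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value

# fp16 cache (2 bytes per value) at batch 8 with a 4096-token context:
fp16 = kv_cache_bytes(32, 32, 128, 4096, 8, 2)
print(f"fp16 KV cache: {fp16 / 2**30:.1f} GiB")  # 16.0 GiB

# A 4-bit quantized cache holds the same values in a quarter of the space
# (ignoring the small overhead of per-group scales):
print(f"4-bit KV cache: {fp16 / 4 / 2**30:.1f} GiB")  # 4.0 GiB

Even this rough arithmetic shows why KV cache quantization, the focus of several of the articles below, is a first-order lever on serving cost.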
7 articles

Taming Black-Box LLM Inference Scheduling
A scheduling approach for black-box LLM inference that uses predicted output lengths to reduce queueing delays at scale.

SAGA makes AI agent GPU scheduling workflow-aware
SAGA argues GPU schedulers should treat an agent’s chained LLM calls as one workflow, not isolated requests.

SpecKV tunes speculative decoding on the fly
SpecKV adapts speculative decoding’s token budget per step, using draft-model signals to beat a fixed speculation length (gamma) across compression settings.

TurboQuant brings near-optimal online vector quantization
TurboQuant is an online, accelerator-friendly vector quantizer that targets near-optimal MSE and inner-product distortion.

TurboQuant, EDEN, and the citation fight
TurboQuant’s KV-cache quantization claims are under fire: EDEN’s authors say the paper reuses older ideas and rests on weaker scales and shaky benchmarks.

TurboQuant Explained: Why Google’s New Paper Matters
Google’s TurboQuant paper targets KV cache bottlenecks with lower-bit quantization, aiming to cut LLM memory use and inference costs.

Google’s TurboQuant Cuts LLM Memory Costs
Google says TurboQuant uses QJL and PolarQuant to shrink the memory footprint of quantized vectors and speed up LLM inference by up to 8x.