Tag
vLLM
vLLM is a high-throughput inference engine for large language models, built around PagedAttention, KV cache management, and continuous batching. It matters for chat services, RAG pipelines, batch generation, and multi-model GPU deployment.
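For orientation, here is a minimal sketch of offline batch generation with vLLM's Python API; the model name is an illustrative placeholder, and PagedAttention plus continuous batching are applied automatically inside the engine:

```python
from vllm import LLM, SamplingParams

# A batch of prompts; the engine schedules and batches them internally.
prompts = [
    "Explain PagedAttention in one sentence.",
    "What is continuous batching?",
]
sampling = SamplingParams(temperature=0.8, max_tokens=64)

llm = LLM(model="facebook/opt-125m")  # small model, purely for illustration
for out in llm.generate(prompts, sampling):
    print(out.prompt, "->", out.outputs[0].text)
```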
3 articles

Research/May 15
TurboQuant vs FP8: vLLM’s first broad test
The vLLM team found that FP8 KV-cache quantization beats TurboQuant on speed, while TurboQuant’s strongest variants hurt accuracy.
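For context, turning on FP8 KV-cache quantization in vLLM is a one-argument change that stores attention keys and values in 8-bit floating point, roughly halving KV-cache memory versus FP16. A minimal sketch (the model name is illustrative):

```python
from vllm import LLM

# kv_cache_dtype="fp8" requests an FP8 format for the KV cache;
# weights and activations are unaffected.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", kv_cache_dtype="fp8")
print(llm.generate("Hello")[0].outputs[0].text)
```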

Tools & Apps/May 9
Gemma 4 assistant models get faster draft tokens
Gemma 4 E2B and E4B assistant models use centroid masking to cut lm_head compute by about 45x with little quality loss.
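The teaser does not spell out the centroid-masking mechanics, so the following is only a hypothetical illustration of where a ~45x saving can come from: scoring roughly 1/45 of an lm_head's vocabulary rows instead of all of them cuts the matmul FLOPs by the same factor.

```python
import torch

# Illustrative sizes only; the candidate selection below is a random
# stand-in for whatever centroid-based selection the article describes.
V, d = 262_144, 2048           # vocab size and hidden dim (assumed)
W = torch.randn(V, d)          # lm_head weight: one row per vocab token
h = torch.randn(1, d)          # hidden state at the next-token position

# Full lm_head: [1, d] x [d, V], ~V*d multiply-adds.
full_logits = h @ W.T

# Masked lm_head: score only ~V/45 candidate tokens, ~45x fewer FLOPs.
candidate_ids = torch.randint(0, V, (V // 45,))
masked_logits = h @ W[candidate_ids].T
```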

Tools & Apps/Apr 12
Awesome Open Source AI: the best projects list
This GitHub list, with 2,486 stars, curates battle-tested open-source AI tools, models, and infrastructure, from PyTorch to vLLM.