Tag
vLLM
vLLM is a high-throughput inference engine for large language models, built around PagedAttention, KV cache management, and continuous batching. It matters for chat services, RAG pipelines, batch generation, and multi-model GPU deployment.
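For orientation, here is a minimal sketch of offline batch generation with vLLM's Python API; the model name is an illustrative placeholder, and PagedAttention plus continuous batching are applied automatically inside the engine:

```python
from vllm import LLM, SamplingParams

# A batch of prompts; the engine schedules and batches them internally.
prompts = [
    "Explain PagedAttention in one sentence.",
    "What is continuous batching?",
]
sampling = SamplingParams(temperature=0.8, max_tokens=64)

llm = LLM(model="facebook/opt-125m")  # small model, purely for illustration
for out in llm.generate(prompts, sampling):
    print(out.prompt, "->", out.outputs[0].text)
```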
3 articles

Research/May 15
TurboQuant vs FP8: vLLM’s first broad test
The vLLM team found that FP8 KV-cache quantization beats TurboQuant on speed, while TurboQuant’s strongest variants hurt accuracy.
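For context, turning on FP8 KV-cache quantization in vLLM is a one-argument change that stores attention keys and values in 8-bit floating point, roughly halving KV-cache memory versus FP16. A minimal sketch (the model name is illustrative):

```python
from vllm import LLM

# kv_cache_dtype="fp8" requests an FP8 format for the KV cache;
# weights and activations are unaffected.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", kv_cache_dtype="fp8")
print(llm.generate("Hello")[0].outputs[0].text)
```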

Tools & Apps/May 9
Gemma 4 assistant models get faster draft tokens
Gemma 4 E2B and E4B assistant models use centroid masking to cut lm_head compute by about 45x with little quality loss.
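The teaser does not spell out the centroid-masking mechanics, so the following is only a hypothetical illustration of where a ~45x saving can come from: scoring roughly 1/45 of an lm_head's vocabulary rows instead of all of them cuts the matmul FLOPs by the same factor.

```python
import torch

# Illustrative sizes only; the candidate selection below is a random
# stand-in for whatever centroid-based selection the article describes.
V, d = 262_144, 2048           # vocab size and hidden dim (assumed)
W = torch.randn(V, d)          # lm_head weight: one row per vocab token
h = torch.randn(1, d)          # hidden state at the next-token position

# Full lm_head: [1, d] x [d, V], ~V*d multiply-adds.
full_logits = h @ W.T

# Masked lm_head: score only ~V/45 candidate tokens, ~45x fewer FLOPs.
candidate_ids = torch.randint(0, V, (V // 45,))
masked_logits = h @ W[candidate_ids].T
```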

Tools & Apps/Apr 12
Awesome Open Source AI: the best projects list
This GitHub list, with 2,486 stars, curates battle-tested open-source AI tools, models, and infrastructure, from PyTorch to vLLM.