Tag
inference
Inference is the production stage where models serve predictions or generate outputs, so latency, throughput, GPU scheduling, memory footprint, and cost all matter. Recent work spans Kubernetes as an AI control plane, quantization, and TensorRT-LLM optimizations.
4 articles

AE-LLM Aims to Make LLMs More Efficient
AE-LLM proposes adaptive efficiency optimization for large language models, though the announcement stops short of benchmark details.

Nvidia’s MLPerf Gains Show Software Still Matters
Nvidia posted up to 2.77x MLPerf gains on GB300 NVL72, with software tricks like Dynamo and TensorRT-LLM doing the heavy lifting.

Kubernetes Is Becoming AI’s Control Plane
KubeCon Europe 2026 showed Kubernetes moving from app orchestration to AI ops, with inference, GPUs, and open standards leading the shift.

Five AI Infra Frontiers Bessemer Expects for 2026
Bessemer’s 2026 AI infra roadmap points to memory, continual learning, RL, inference, and world models as the next big build areas.