Tag: inference

Inference is the production stage where models serve predictions or generate outputs, so latency, throughput, GPU scheduling, memory footprint, and cost all matter. Articles under this tag cover Kubernetes as an AI control plane, quantization, and TensorRT-LLM optimizations.

4 articles