Tag
llama.cpp
llama.cpp is a local inference stack for running LLMs on CPUs, GPUs, and edge devices with tight memory budgets. Articles under this tag cover quantization, KV cache optimization, cold-start latency, and how llama.cpp fits into fine-tuning and multimodal workflows.
5 articles

Gemma 4 12B: Specs, Benchmarks & How to Run It Locally
Gemma 4 12B is a local-first multimodal model you can run on a 16 GB machine.

Why llama.cpp’s release notes matter more than its model bragging
llama.cpp’s latest releases show that backend correctness drives real speed gains.

Why llama.cpp should treat TurboQuant as the new default path
TurboQuant is the right direction for llama.cpp because asymmetric KV compression cuts memory without breaking compatibility.

llama.cpp adds local LLM inference in C/C++
ggml-org’s llama.cpp keeps expanding local LLM support with OpenAI-compatible serving, browser WebGPU, and broad hardware backends.

5 KV cache takeaways for llama.cpp users
5 takeaways from TurboQuant: under-3-bit KV cache compression, memory savings, and the tradeoffs llama.cpp users should watch.