OraCore Editors · 7 min read

Sebastian Raschka’s LLM Architecture Gallery

Raschka’s gallery compares GPT-2, Llama 3, OLMo 2, DeepSeek, and Qwen stacks with exact layer, cache, and attention data.

Sebastian Raschka’s LLM Architecture Gallery is the kind of page model builders bookmark and keep open in another tab. It collects architecture panels for more than 30 language models, from GPT-2 to Llama 4, with concrete facts like layer counts, context length, attention type, and KV cache size.

The most useful part is how it turns model design into numbers you can compare in seconds. A Llama 3 8B stack uses 32 layers and 128 KiB of KV cache per token in bf16, while OLMo 2 7B also uses 32 layers, but with classic multi-head attention plus QK-Norm and a much larger 512 KiB cache per token.
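A quick way to sanity-check those cache figures is the usual back-of-envelope formula: 2 (keys and values) × layers × KV heads × head dimension × bytes per value. A minimal sketch, assuming the commonly published config values for these two models (8 KV heads for Llama 3 8B's grouped-query attention, 32 for OLMo 2 7B's multi-head attention, head dimension 128, bf16 at 2 bytes per value):

```python
def kv_cache_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies: keys + values across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

# Head counts below are assumed from the models' published configs (bf16 = 2 bytes).
llama3_8b = kv_cache_per_token(num_layers=32, num_kv_heads=8, head_dim=128)   # GQA
olmo2_7b = kv_cache_per_token(num_layers=32, num_kv_heads=32, head_dim=128)   # MHA

print(f"Llama 3 8B: {llama3_8b / 1024:.0f} KiB/token")  # 128 KiB
print(f"OLMo 2 7B:  {olmo2_7b / 1024:.0f} KiB/token")   # 512 KiB
```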

A reference library for decoder stacks

Raschka’s page is not a marketing roundup. It is a technical reference built around architecture panels and fact sheets, with links to config files and technical reports where available. That matters because the real story in modern LLMs is often hidden in the boring-looking parts: attention choice, normalization order, number of layers, and how much memory each token consumes during inference.

The gallery pulls from several of Raschka’s own comparison posts, including The Big LLM Architecture Comparison, From GPT-2 to gpt-oss, and From DeepSeek V3 to V3.2. It also points to a diff tool, so you can compare two models side by side instead of squinting at screenshots.

That diff view is the real product here. If you want to understand why one model is cheaper to serve than another, the answer is usually not buried in benchmark charts. It is in the stack itself.

  • GPT-2 XL: 1.5B parameters, 1,024-token context, 48 MHA layers, 300 KiB KV cache per token
  • Llama 3 8B: 8B parameters, 8,192-token context, 32 GQA layers, 128 KiB KV cache per token
  • OLMo 2 7B: 7B parameters, 4,096-token context, 32 MHA layers, 512 KiB KV cache per token
  • DeepSeek V3: 671B total parameters, 37B active, 61 layers, 68.6 KiB KV cache per token
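The gallery’s diff tool is a web page, but the underlying idea is easy to mimic. Here is a minimal sketch that compares two of the fact sheets above field by field, with the values hard-coded from the list rather than pulled from the gallery itself:

```python
# Illustrative fact sheets, copied from the list above (not fetched from the gallery).
gpt2_xl = {"params": "1.5B", "context": 1_024, "layers": 48,
           "attention": "MHA", "kv_cache_per_token_kib": 300}
llama3_8b = {"params": "8B", "context": 8_192, "layers": 32,
             "attention": "GQA", "kv_cache_per_token_kib": 128}

# Print a simple field-by-field diff.
for field in gpt2_xl:
    a, b = gpt2_xl[field], llama3_8b[field]
    marker = "==" if a == b else "!="
    print(f"{field:26} {a!r:>8} {marker} {b!r}")
```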

What the gallery says about design trade-offs

The clearest lesson is that model architecture has become a study in trade-offs. Dense stacks like GPT-2, Llama 3, and OLMo 2 are easier to reason about, but they can be expensive to run at scale. Sparse MoE models like DeepSeek V3 and Llama 4 Maverick spread capacity across many experts, which cuts active compute while keeping parameter counts huge.
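A rough way to see what “active” buys you is the standard approximation that a decoder forward pass costs about two FLOPs per active parameter per token. This is only a back-of-envelope sketch, and the dense 70B comparison point is hypothetical, but it shows why a 671B-total MoE can be cheaper per token than a much smaller dense model:

```python
# Back-of-envelope: ~2 FLOPs per active parameter per generated token
# (ignores attention cost growth with context length).
def flops_per_token(active_params: float) -> float:
    return 2 * active_params

dense_70b = flops_per_token(70e9)      # hypothetical dense 70B-class model
deepseek_v3 = flops_per_token(37e9)    # 671B total parameters, 37B active per token

print(f"dense 70B:   {dense_70b:.2e} FLOPs/token")
print(f"DeepSeek V3: {deepseek_v3:.2e} FLOPs/token "
      f"(~{dense_70b / deepseek_v3:.1f}x less active compute)")
```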

Attention design shows a similar split. Some models stick with standard multi-head attention. Others use grouped-query attention, QK-Norm, sliding windows, or Meta’s chunked-plus-full pattern in Llama 4. Raschka’s gallery makes those choices visible in one place, which is useful because these details often explain memory pressure more directly than raw parameter counts.
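For readers who have not looked at grouped-query attention up close, the mechanics are simple: several query heads share one cached key/value head, which is exactly what shrinks the KV cache. A minimal NumPy sketch with Llama-3-style head counts (32 query heads, 8 KV heads), no causal mask, purely illustrative:

```python
import numpy as np

num_q_heads, num_kv_heads, head_dim, seq = 32, 8, 128, 4
group_size = num_q_heads // num_kv_heads            # 4 query heads per KV head

q = np.random.randn(num_q_heads, seq, head_dim)
k = np.random.randn(num_kv_heads, seq, head_dim)    # only 8 KV heads are ever cached
v = np.random.randn(num_kv_heads, seq, head_dim)

# Expand the shared KV heads to match the query heads at compute time.
k_full = np.repeat(k, group_size, axis=0)
v_full = np.repeat(v, group_size, axis=0)

# Scaled dot-product attention (softmax over keys; causal mask omitted for brevity).
scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ v_full

print(out.shape)  # (32, 4, 128): full query-head output from a 4x smaller KV cache
```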

"The best way to understand a model is to look at its architecture." — Sebastian Raschka, The Big LLM Architecture Comparison

That line captures the whole point of the gallery. Benchmarks tell you what a model did on a test. Architecture tells you what it had to be built like to get there.

Raschka also makes the page practical for people who care about reproducibility. Each card links to source material, and the page itself notes that if a fact sheet is wrong or a link is broken, readers can file an issue in the Architecture Gallery issue tracker. That kind of maintenance is rare, and it matters in a field where model specs age fast.

  • Llama 4 Maverick: 400B total, 17B active, 1,000,000-token context, 36 chunked + 12 full GQA layers
  • Qwen3 235B-A22B: 235B total, 22B active, 94 layers, 188 KiB KV cache per token
  • Gemma 3 27B: 27B parameters, 128,000-token context, 52 sliding-window + 10 global layers
  • Mistral Small 3.1: 24B parameters, 128,000-token context, 40 GQA layers, 160 KiB KV cache per token

Why the comparison tool matters more than the poster

The gallery is available as a poster on Redbubble and as a print-ready download on Gumroad, which is a fun extra for office walls. But the comparison tool is the part that makes this page genuinely useful for engineers. A poster helps you remember what a stack looks like. The diff tool helps you decide whether one model is actually cheaper to serve than another.

That matters because some of the most interesting models in the gallery are surprisingly close in size while being very different in memory behavior. Llama 3 8B uses 128 KiB per token of KV cache, while OLMo 2 7B uses 512 KiB. That is a 4x gap in cache footprint between two dense models in the same broad size class, and it has real consequences for batch size, latency, and deployment cost.
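The batch-level consequences fall straight out of the per-token numbers. A small sketch for a hypothetical serving scenario, 8 concurrent sequences each at a full 8,192-token context:

```python
def kv_cache_gib(kib_per_token: float, context_len: int, batch_size: int) -> float:
    """Total KV cache in GiB for a batch of full-length sequences."""
    return kib_per_token * context_len * batch_size / (1024 ** 2)

# Hypothetical serving scenario: batch of 8, 8,192 tokens each.
print(f"Llama 3 8B: {kv_cache_gib(128, 8_192, 8):.1f} GiB")  # 8.0 GiB
print(f"OLMo 2 7B:  {kv_cache_gib(512, 8_192, 8):.1f} GiB")  # 32.0 GiB
```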

The same pattern shows up in the large models. DeepSeek V3 carries 671B total parameters but only 37B active, which is a very different serving profile from a dense 67B-class model. Llama 4 Maverick goes even further on context length with a 1,000,000-token window, which changes the conversation from “Can it fit?” to “What does long-context inference cost in practice?”

For readers who like to compare architecture notes with release coverage, OraCore has also been tracking adjacent releases in pieces like our Llama 4 Maverick architecture notes and our DeepSeek V3.2 breakdown.

The practical takeaway for builders

This gallery is useful because it makes one thing obvious: model quality is no longer the only question. The real questions are how much memory a model needs, how much of it is active at inference time, and which attention recipe makes a deployment sane for your hardware.

If you are choosing between dense and sparse models, or between standard attention and GQA, the gallery gives you a fast way to sanity-check your assumptions before you read a full technical report. It also gives newer engineers a clean visual map of how the field moved from GPT-2’s 48-layer dense decoder to modern stacks with MoE blocks, long contexts, and more careful cache design.

My prediction is simple: as context windows keep growing, architecture pages like this will become as important as benchmark leaderboards for anyone shipping LLMs. The next question to ask is not which model is largest, but which architecture gives you the best trade-off between memory, latency, and quality for your workload.

If you build with LLMs, this is worth keeping open while you read release notes. The numbers tell a better story than the hype does.