[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-cuda-cp-async-ampere-hbm-latency-en":3,"tags-cuda-cp-async-ampere-hbm-latency-en":33,"related-lang-cuda-cp-async-ampere-hbm-latency-en":46,"related-posts-cuda-cp-async-ampere-hbm-latency-en":50,"series-research-68bfa04a-94c4-4c8a-921c-61e93ab207aa":87},{"id":4,"title":5,"content":6,"summary":7,"source":8,"source_url":9,"author":10,"image_url":11,"keywords":12,"language":19,"translated_content":10,"views":20,"is_premium":21,"created_at":22,"updated_at":22,"cover_image":11,"published_at":23,"rewrite_status":24,"rewrite_error":10,"rewritten_from_id":25,"slug":26,"category":27,"related_article_id":28,"status":29,"google_indexed_at":30,"x_posted_at":10,"tweet_text":10,"title_rewritten_at":31,"title_original":32,"key_takeaways":10,"topic_cluster_id":10,"embedding":10,"is_canonical_seed":21},"68bfa04a-94c4-4c8a-921c-61e93ab207aa","cp.async on Ampere: Hide HBM Latency on A100","\u003Cp>On an \u003Ca href=\"https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fdata-center\u002Fa100\u002F\" target=\"_blank\" rel=\"noopener\">NVIDIA A100\u003C\u002Fa>, an HBM2e load can cost roughly 450 to 600 cycles, which is long enough to leave an entire warp scheduler idle if you do nothing else. Ampere’s \u003Ccode>cp.async\u003C\u002Fcode> changes that by moving data into shared memory without tying up registers or setting the long scoreboard.\u003C\u002Fp>\u003Cp>This is why the instruction matters: it lets the programmer describe what data should move, while the hardware handles when the transfer completes. 
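\u003C\u002Fp>\u003Cp>To make that concrete, here is a minimal sketch of the mechanism using the \u003Ccode>__pipeline_memcpy_async\u003C\u002Fcode> primitives from \u003Ccode>cuda_pipeline.h\u003C\u002Fcode>, which compile to \u003Ccode>cp.async\u003C\u002Fcode> on sm_80. The kernel name and tile size are illustrative choices, not from the source article.\u003C\u002Fp>\u003Cpre>\u003Ccode>// Sketch: each thread issues one 16-byte asynchronous copy from global
// to shared memory. No destination register is written, so the warp stays
// eligible to issue other instructions while the copy is in flight.
#include &lt;cuda_pipeline.h&gt;

__global__ void async_tile_load(const float4* __restrict__ g_in,
                                float4* g_out, int n)
{
    __shared__ float4 tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i &lt; n)  // lowers to a cp.async.cg.shared.global of 16 bytes
        __pipeline_memcpy_async(&amp;tile[threadIdx.x], &amp;g_in[i], sizeof(float4));
    __pipeline_commit();        // mark the batch of copies as one group
    __pipeline_wait_prior(0);   // block until no groups remain in flight
    __syncthreads();            // make the tile visible to the whole block
    if (i &lt; n) g_out[i] = tile[threadIdx.x];
}\u003C\u002Fcode>\u003C\u002Fpre>\u003Cp>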
Part VII of \u003Ca href=\"https:\u002F\u002Fsoftwarefrontier.substack.com\u002Fp\u002Fmastering-cuda-and-high-performance-ea1\" target=\"_blank\" rel=\"noopener\">Mastering CUDA and High-Performance Computing\u003C\u002Fa> is really about that shift in mental model, from blocking loads to overlapped pipelines.\u003C\u002Fp>\u003Ch2>The memory hierarchy on A100 is the real story\u003C\u002Fh2>\u003Cp>The article opens by grounding the discussion in the A100 SXM4 memory stack, and the numbers are worth keeping in your head. Registers are fast, but they are capped at 255 per thread. Shared memory and L1 share a 192 KB pool. L2 is 40 MB. HBM2e tops out at 2 TB\u002Fs on paper, while real kernels usually land somewhere lower depending on access pattern quality.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775167612143-4qvu.png\" alt=\"CUDA cp.async on Ampere: hiding HBM latency\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>That hierarchy is not just trivia. It explains why a kernel that looks fine in source code can fall apart in practice. If the compiler spills registers, those values go to local memory and pay the same global-memory penalty as any other DRAM access. If shared memory access patterns hit the same bank, the warp serializes. 
If the working set misses L2, the kernel ends up paying hundreds of cycles per access.\u003C\u002Fp>\u003Cul>\u003Cli>Register file per SM: 256 KB total, or 64 KB per SMSP\u003C\u002Fli>\u003Cli>Shared memory bank count: 32 banks, 4 bytes wide each\u003C\u002Fli>\u003Cli>L2 cache size on A100: 40 MB split into two 20 MB slices\u003C\u002Fli>\u003Cli>HBM2e peak bandwidth: 2 TB\u002Fs theoretical, about 1.6 to 1.9 TB\u002Fs in strong cases\u003C\u002Fli>\u003Cli>HBM2e latency: roughly 450 to 600 cycles with caches bypassed\u003C\u002Fli>\u003C\u002Ful>\u003Cp>Those numbers matter because they define the gap \u003Ccode>cp.async\u003C\u002Fcode> is trying to close. Ampere is not removing latency. It is making latency easier to hide.\u003C\u002Fp>\u003Ch2>Why cp.async changes the execution model\u003C\u002Fh2>\u003Cp>Traditional global loads write into registers first. That means the warp cannot use those destination registers until the data returns, and the scheduler marks them pending. With \u003Ccode>cp.async\u003C\u002Fcode>, the copy goes straight from global memory into shared memory through a dedicated asynchronous copy engine. No destination registers are occupied. No long scoreboard bits are set for the destination data path.\u003C\u002Fp>\u003Cp>That detail sounds small until you trace what it means for scheduling. The warp issues the copy, hands the transaction off, and immediately becomes eligible for more instructions. The load and the compute no longer have to happen one after the other. They can overlap.\u003C\u002Fp>\u003Cp>Here is the practical effect: instead of waiting 500 cycles for memory before doing useful work, a kernel can spend those cycles computing on the previous tile while the next tile is already in flight. 
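\u003C\u002Fp>\u003Cp>A double-buffered main loop makes that overlap explicit. The following is a hedged sketch of the shape such a kernel takes, assuming the \u003Ccode>cuda_pipeline.h\u003C\u002Fcode> intrinsics; \u003Ccode>compute_on\u003C\u002Fcode>, the tile size, and the indexing are placeholders rather than code from the article.\u003C\u002Fp>\u003Cpre>\u003Ccode>#include &lt;cuda_pipeline.h&gt;

__device__ void compute_on(const float4* tile);  // placeholder for tile math

__global__ void pipelined(const float4* __restrict__ g_in, int num_tiles)
{
    __shared__ float4 buf[2][256];   // two stages: fill one, consume the other

    // Prologue: start filling buffer 0 before entering the loop.
    __pipeline_memcpy_async(&amp;buf[0][threadIdx.x],
                            &amp;g_in[threadIdx.x], sizeof(float4));
    __pipeline_commit();

    for (int t = 0; t &lt; num_tiles; ++t) {
        int cur = t &amp; 1;
        if (t + 1 &lt; num_tiles)       // kick off the next tile's copy early
            __pipeline_memcpy_async(&amp;buf[cur ^ 1][threadIdx.x],
                                    &amp;g_in[(t + 1) * 256 + threadIdx.x],
                                    sizeof(float4));
        __pipeline_commit();          // commit a (possibly empty) group
        __pipeline_wait_prior(1);     // tolerate one group still in flight
        __syncthreads();
        compute_on(buf[cur]);         // consume tile t while t+1 streams in
        __syncthreads();
    }
}\u003C\u002Fcode>\u003C\u002Fpre>\u003Cp>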
That is the whole point of the instruction.\u003C\u002Fp>\u003Cblockquote>“Latency hiding is the name of the game.” — Mark Harris, NVIDIA, in his CUDA programming guidance and talks on overlapping data movement with computation\u003C\u002Fblockquote>\u003Cp>Harris has repeated that idea for years, and \u003Ccode>cp.async\u003C\u002Fcode> is one of the cleanest examples of it in hardware. The programmer still needs to structure the work carefully, but the architecture gives a much better tool than the old load-store-register path.\u003C\u002Fp>\u003Ch2>Commit, wait, and the double-buffer pattern\u003C\u002Fh2>\u003Cp>The article’s explanation of commit and wait groups is the part many CUDA programmers should read twice. \u003Ccode>cp.async.commit_group\u003C\u002Fcode> does bookkeeping only. It marks a batch of copy instructions as a group. \u003Ccode>cp.async.wait_group N\u003C\u002Fcode> blocks until at most \u003Cem>N\u003C\u002Fem> groups remain pending. With \u003Ccode>N=1\u003C\u002Fcode>, one group can still be in flight while the kernel computes on the previous one.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775167614022-18hw.png\" alt=\"CUDA cp.async on Ampere: hiding HBM latency\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>That is what turns asynchronous copy into a pipeline. You keep one buffer being filled while another buffer is being consumed. The kernel does not try to make memory faster. 
It keeps the machine busy while memory is slow.\u003C\u002Fp>\u003Cul>\u003Cli>Conventional load path: load into registers, wait on long scoreboard, then store to shared memory\u003C\u002Fli>\u003Cli>\u003Ccode>cp.async\u003C\u002Fcode> path: copy directly into shared memory, no destination register stall\u003C\u002Fli>\u003Cli>\u003Ccode>cp.async.commit_group\u003C\u002Fcode>: groups prior async copies for bookkeeping\u003C\u002Fli>\u003Cli>\u003Ccode>cp.async.wait_group 1\u003C\u002Fcode>: allows one in-flight group while compute continues\u003C\u002Fli>\u003C\u002Ful>\u003Cp>The article also points out that this is not free. Shared memory usage rises as you add stages to the pipeline, and that can reduce occupancy. On Ampere, the best stage count depends on the kernel. A GEMM kernel with enough arithmetic intensity may benefit from multiple stages, while a lighter kernel may lose more to reduced occupancy than it gains from deeper pipelining.\u003C\u002Fp>\u003Cp>That tradeoff is why libraries such as \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass\" target=\"_blank\" rel=\"noopener\">CUTLASS\u003C\u002Fa> expose the pipeline depth as a tuning parameter. The right answer is usually measured, not guessed.\u003C\u002Fp>\u003Ch2>What the profiler shows before and after\u003C\u002Fh2>\u003Cp>The most useful part of the piece is the profiler framing. It gives you a way to tell whether your kernel is memory-stalled or actually doing work. Before pipelining, a conventional load-heavy kernel often shows long scoreboard stalls dominating the timeline. After a good \u003Ccode>cp.async\u003C\u002Fcode> rewrite, those stalls shrink and the FMA pipe stays busy for a much larger share of cycles.\u003C\u002Fp>\u003Cp>That is a cleaner way to think about optimization than just chasing raw bandwidth. A kernel can look fast on paper and still waste half its issue slots waiting on memory. 
Once the transfer is asynchronous, the metric that matters is overlap, not just throughput.\u003C\u002Fp>\u003Cul>\u003Cli>Before pipelining: \u003Ccode>smsp__warp_issue_stalled_long_scoreboard\u003C\u002Fcode> often dominates at 40% to 70%\u003C\u002Fli>\u003Cli>After pipelining: long scoreboard stalls can drop below 5%\u003C\u002Fli>\u003Cli>Well-tuned kernels: \u003Ccode>smsp__pipe_fma_cycles_active\u003C\u002Fcode> can rise into the 70% to 90% range\u003C\u002Fli>\u003Cli>A100 L2 bandwidth: about 4 TB\u002Fs aggregate, roughly 2x HBM bandwidth\u003C\u002Fli>\u003C\u002Ful>\u003Cp>If you want to see this in practice, look at kernels built with \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcuda-samples\" target=\"_blank\" rel=\"noopener\">NVIDIA CUDA Samples\u003C\u002Fa> and then compare them with a tiled implementation that uses asynchronous copy. The difference is usually obvious in the profiler even before you inspect the assembly.\u003C\u002Fp>\u003Cp>For a broader performance view, the \u003Ca href=\"https:\u002F\u002Fdocs.nvidia.com\u002Fcuda\u002Fprofiler-users-guide\u002Findex.html\" target=\"_blank\" rel=\"noopener\">NVIDIA Nsight Compute documentation\u003C\u002Fa> is the best companion piece. It shows how to read the stall reasons, issue activity, and memory throughput counters that tell you whether your pipeline is actually doing its job.\u003C\u002Fp>\u003Ch2>What this means for Hopper and beyond\u003C\u002Fh2>\u003Cp>The summary line from the source article gets to the heart of it: Ampere still leaves a programmer-visible gap between expression and execution, and Hopper reduces that gap further with TMA, or Tensor Memory Accelerator. 
That matters because every step in this direction makes memory movement feel less like a blocking operation and more like a scheduled transfer handled by the chip itself.\u003C\u002Fp>\u003Cp>My read is simple: if you are still writing CUDA kernels that assume load, wait, compute, repeat, you are leaving a lot of performance on the table. The real question is whether your data movement can be expressed as a pipeline. If it can, \u003Ccode>cp.async\u003C\u002Fcode> is worth the extra bookkeeping. If it cannot, you may need to change the data layout first.\u003C\u002Fp>\u003Cp>For developers already working on Ampere, the next move is practical: profile one hot kernel, measure long scoreboard stalls, then try a double-buffered \u003Ccode>cp.async\u003C\u002Fcode> version before touching anything else. If the stall profile drops and occupancy stays healthy, you have your answer. If not, the bottleneck is probably somewhere else, and the profiler will tell you where to look next.\u003C\u002Fp>\u003Cp>That is the useful lesson here. The hardware is giving you a better way to overlap memory and compute, but it still rewards precise thinking. 
The teams that get the most out of Ampere and Hopper will be the ones that treat data movement as a pipeline problem, not a load instruction problem.\u003C\u002Fp>","Ampere’s cp.async moves data without stalling warps, cutting HBM waits from 450–600 cycles into overlapped compute on A100.","softwarefrontier.substack.com","https:\u002F\u002Fsoftwarefrontier.substack.com\u002Fp\u002Fmastering-cuda-and-high-performance-ea1",null,"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1775167612143-4qvu.png",[13,14,15,16,17,18],"CUDA","cp.async","A100","Ampere","shared memory","HBM2e","en",5,false,"2026-04-02T22:06:36.521272+00:00","2026-04-02T22:06:36.479+00:00","done","bffdfadf-7883-411e-a9aa-ddfb6108265d","cuda-cp-async-ampere-hbm-latency-en","research","d458f7db-1e28-4cf1-9bd8-ad9c95dee997","published","2026-04-07T09:01:02.15+00:00","2026-05-04T02:00:06.981+00:00","CUDA cp.async on Ampere: hiding HBM latency",[34,36,38,40,42,44],{"name":14,"slug":35},"cpasync",{"name":13,"slug":37},"cuda",{"name":17,"slug":39},"shared-memory",{"name":18,"slug":41},"hbm2e",{"name":16,"slug":43},"ampere",{"name":15,"slug":45},"a100",{"id":28,"slug":47,"title":48,"language":49},"cuda-cp-async-ampere-hbm-latency-zh","Ampere 的 cp.async 怎麼藏 HBM 延遲","zh",[51,57,63,69,75,81],{"id":52,"slug":53,"title":54,"cover_image":55,"image_url":55,"created_at":56,"category":27},"94994abd-e24d-4fd1-b941-942d03d19acf","turboquant-seo-shift-small-sites-en","TurboQuant and the SEO Shift for Small Sites","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778840455122-jfce.png","2026-05-15T10:20:28.134545+00:00",{"id":58,"slug":59,"title":60,"cover_image":61,"image_url":61,"created_at":62,"category":27},"670a7f69-911f-41e8-a18b-7d3491253a19","turboquant-vllm-comparison-fp8-kv-cache-en","TurboQuant vs FP8: vLLM’s first broad 
test","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778839858405-b5ao.png","2026-05-15T10:10:37.219158+00:00",{"id":64,"slug":65,"title":66,"cover_image":67,"image_url":67,"created_at":68,"category":27},"5aef1c57-961f-49f7-8277-f83f7336799a","llmbda-calculus-agent-safety-rules-en","LLMbda calculus gives agents safety rules","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778825459914-obkf.png","2026-05-15T06:10:36.242145+00:00",{"id":70,"slug":71,"title":72,"cover_image":73,"image_url":73,"created_at":74,"category":27},"712a0357-f7cd-48f2-adde-c2691da0815f","low-complexity-beamspace-denoiser-mmwave-mimo-en","A simpler beamspace denoiser for mmWave MIMO","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778814646705-e7mx.png","2026-05-15T03:10:31.764301+00:00",{"id":76,"slug":77,"title":78,"cover_image":79,"image_url":79,"created_at":80,"category":27},"f595f949-6ea1-4b0e-a632-f1832ef26e36","ai-benchmark-wins-cyber-scare-defenders-en","Why AI benchmark wins in cyber should scare defenders","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778807444539-gz7f.png","2026-05-15T01:10:30.04579+00:00",{"id":82,"slug":83,"title":84,"cover_image":85,"image_url":85,"created_at":86,"category":27},"3ad202d1-9e5f-49c5-8383-02fcf1a23cf2","why-linux-security-needs-patch-wave-mindset-en","Why Linux security needs a patch-wave mindset","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778741441493-ikl6.png","2026-05-14T06:50:25.906256+00:00",[88,93,98,103,108,113,118,123,128,133],{"id":89,"slug":90,"title":91,"created_at":92},"a2715e72-1fe8-41b3-abb1-d0cf1f710189","ai-predictions-2026-big-changes-en","AI Predictions for 
2026: Brace for Big Changes","2026-03-26T01:25:07.788356+00:00",{"id":94,"slug":95,"title":96,"created_at":97},"8404bd7b-4c2f-4109-9ec4-baf29d88af2b","ml-papers-of-the-week-github-research-desk-en","ML Papers of the Week Turns GitHub Into a Research Desk","2026-03-27T01:11:39.480259+00:00",{"id":99,"slug":100,"title":101,"created_at":102},"87897a94-8065-4464-a016-1f23e89e17cc","ai-ml-conferences-to-watch-in-2026-en","AI\u002FML Conferences to Watch in 2026","2026-03-27T01:51:54.184108+00:00",{"id":104,"slug":105,"title":106,"created_at":107},"6f1987cf-25f3-47a4-b3e6-db0997695be8","openclaw-agents-manipulated-self-sabotage-en","OpenClaw Agents Can Be Manipulated Into Failure","2026-03-28T03:03:18.899465+00:00",{"id":109,"slug":110,"title":111,"created_at":112},"a53571ad-735a-4178-9f93-cb09b699d99c","vega-driving-language-instructions-en","Vega: Driving with Natural Language Instructions","2026-03-28T14:54:04.698882+00:00",{"id":114,"slug":115,"title":116,"created_at":117},"a34581d6-f36e-46da-88bb-582fb3e7425c","personalizing-autonomous-driving-styles-en","Drive My Way: Personalizing Autonomous Driving Styles","2026-03-28T14:54:26.148181+00:00",{"id":119,"slug":120,"title":121,"created_at":122},"2bc1ad7f-26ce-4f02-9885-803b35fd229d","training-knowledge-bases-writeback-rag-en","Training Knowledge Bases with WriteBack-RAG","2026-03-28T14:54:45.643433+00:00",{"id":124,"slug":125,"title":126,"created_at":127},"71adc507-3c54-4605-bbe2-c966acd6187e","packforcing-long-video-generation-en","PackForcing: Efficient Long-Video Generation Method","2026-03-28T14:55:02.646943+00:00",{"id":129,"slug":130,"title":131,"created_at":132},"675942ef-b9ec-4c5f-a997-381250b6eacb","pixelsmile-facial-expression-editing-en","PixelSmile Framework Enhances Facial Expression Editing","2026-03-28T14:55:20.633463+00:00",{"id":134,"slug":135,"title":136,"created_at":137},"6954fa2b-8b66-4839-884b-e46f89fa1bc3","adaptive-block-scaled-data-types-en","IF4: Smarter 4-Bit Quantization That Adapts to Your 
Data","2026-03-31T06:00:36.65963+00:00"]