[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"article-outlier-tokens-diffusion-transformers-dsr-en":3,"tags-outlier-tokens-diffusion-transformers-dsr-en":34,"related-lang-outlier-tokens-diffusion-transformers-dsr-en":44,"related-posts-outlier-tokens-diffusion-transformers-dsr-en":48,"series-research-25495601-69d8-42fa-868d-ccd71c6d1347":85},{"id":4,"title":5,"content":6,"summary":7,"source":8,"source_url":9,"author":10,"image_url":11,"keywords":12,"language":18,"translated_content":10,"views":19,"is_premium":20,"created_at":21,"updated_at":21,"cover_image":11,"published_at":22,"rewrite_status":23,"rewrite_error":10,"rewritten_from_id":24,"slug":25,"category":26,"related_article_id":27,"status":28,"google_indexed_at":29,"x_posted_at":10,"tweet_text":10,"title_rewritten_at":10,"title_original":10,"key_takeaways":30,"topic_cluster_id":10,"embedding":10,"is_canonical_seed":20},"25495601-69d8-42fa-868d-ccd71c6d1347","Outlier Tokens in DiTs, and How DSR Fixes Them","\u003Cp data-speakable=\"summary\">This paper shows outlier tokens hurt Diffusion Transformers and proposes Dual-Stage Registers to reduce them.\u003C\u002Fp>\u003Cp>Diffusion Transformers, or DiTs, are powerful image generators, but this paper argues they have a hidden failure mode: a small number of high-norm “outlier” tokens can distort attention and degrade the quality of generated images. The practical takeaway is simple—if you are building on modern DiT pipelines, \u003Ca href=\"\u002Ftag\u002Ftoken\">token\u003C\u002Fa> behavior inside the model may matter just as much as architecture size or training scale.\u003C\u002Fp>\u003Cp>The paper, \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.05206\">Taming Outlier Tokens in Diffusion Transformers\u003C\u002Fa>, looks at this problem in Representation Autoencoder, or RAE, based DiT pipelines. It finds that outliers show up in both the pretrained ViT encoder and in the DiT denoiser itself, which means the issue is not isolated to one stage. That makes this a useful systems-level paper for anyone working on image generation models, because it points to a failure mode that can propagate through the whole pipeline.\u003C\u002Fp>\u003Ch2>What problem this paper is trying to fix\u003C\u002Fh2>\u003Cp>The core issue is outlier tokens: a small number of tokens with unusually large norms that attract a lot of attention while carrying limited local information. Prior work had already observed this in Vision Transformers, but mostly in the context of recognition or representation learning. This paper asks what happens when the same phenomenon appears inside generative models, where corrupted token semantics can directly affect image synthesis.\u003C\u002Fp>\n\u003Cfigure class=\"my-6\">\u003Cimg src=\"https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778134850923-gtbr.png\" alt=\"Outlier Tokens in DiTs, and How DSR Fixes Them\" class=\"rounded-xl w-full\" loading=\"lazy\" \u002F>\u003C\u002Ffigure>\n\u003Cp>The answer is that outlier tokens are not just a curiosity. In modern RAE-DiT pipelines, pretrained ViT encoders can emit outlier representations, and DiTs can also develop internal outlier tokens, especially in intermediate layers. That matters because diffusion models depend on stable token interactions during denoising. 
<p>There is also an important negative result: simply masking high-norm tokens does not improve performance. That suggests the problem is not just a handful of extreme activations that can be clipped away. The paper argues the issue is more closely tied to corrupted local patch semantics, which is a more structural problem than a simple outlier-removal trick.</p>
<h2>How the method works in plain English</h2>
<p>To address this, the authors introduce Dual-Stage Registers, or DSR. The idea is to use register tokens as an intervention in both parts of the pipeline: the encoder and the denoiser. Rather than trying to delete bad tokens after the fact, the method gives the model a way to absorb or reroute the problematic behavior through registers.</p>
<p>DSR has three pieces described in the abstract. First, it uses trained registers when they are available. Second, if trained registers are not available, it falls back to recursive test-time registers. Third, it adds diffusion registers for the denoiser. In plain terms, the method stabilizes token processing at both stages instead of treating the encoder and denoiser as separate problems.</p>
<p>That two-stage design is the main engineering insight. If the encoder already produces outlier representations, the denoiser inherits a messy input. If the denoiser itself develops internal outlier tokens, the model’s own dynamics can amplify the issue. DSR is meant to address both sources rather than only one of them.</p>
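<p>The abstract does not spell out DSR’s exact wiring, so the sketch below shows only the generic register mechanism the method builds on, as introduced in earlier ViT work: learnable tokens are appended to the sequence before the transformer blocks and stripped from the output, giving attention a place to park mass that would otherwise land on patch tokens. All class and parameter names here are illustrative, not from the paper.</p>
<pre><code class="language-python">import torch
import torch.nn as nn

class BlocksWithRegisters(nn.Module):
    """Wrap a stack of transformer blocks with learnable register tokens.

    Registers are concatenated to the token sequence before the blocks
    and stripped afterwards, so callers see the original sequence length.
    Generic mechanism only; how DSR combines trained, test-time, and
    diffusion registers is not specified in the abstract.
    """
    def __init__(self, blocks: nn.Module, dim: int, num_registers: int = 4):
        super().__init__()
        self.blocks = blocks
        self.registers = nn.Parameter(torch.empty(1, num_registers, dim))
        nn.init.trunc_normal_(self.registers, std=0.02)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        b, n, _ = tokens.shape
        regs = self.registers.expand(b, -1, -1)
        x = torch.cat([tokens, regs], dim=1)   # [b, n + R, dim]
        x = self.blocks(x)
        return x[:, :n]                        # registers are discarded

# Usage with a stand-in denoiser core built from stock encoder layers.
layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
core = nn.TransformerEncoder(layer, num_layers=2)
model = BlocksWithRegisters(core, dim=768, num_registers=4)
out = model(torch.randn(2, 256, 768))
print(out.shape)  # torch.Size([2, 256, 768])
</code></pre>
<p>Note the key property: downstream code never sees the registers, so a wrapper like this can be retrofitted onto an existing encoder or denoiser without changing its interface.</p>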
<h2>What the paper actually shows</h2>
<p>The paper says the phenomenon appears in both the encoder and the denoiser of modern RAE-DiT pipelines. It also says these interventions consistently reduce outlier artifacts and improve generation quality across ImageNet and large-scale text-to-image generation. That is the strongest claim available in the abstract: the method seems to help in multiple settings, not just a narrow benchmark.</p>
<figure class="my-6"><img src="https://xxdpdyhzhpamafnrdkyq.supabase.co/storage/v1/object/public/covers/inline-1778134847820-fpy2.png" alt="Outlier Tokens in DiTs, and How DSR Fixes Them" class="rounded-xl w-full" loading="lazy" /></figure>
<p>What the abstract does not provide is benchmark numbers. There are no exact scores, no percentage improvements, and no table-level results in the source material. So while the direction of the effect is clear, the magnitude is not something we can infer from the abstract alone.</p>
<p>Still, the qualitative result is meaningful. The paper is not just saying “outliers exist.” It is saying they are a practical obstacle in generative transformers, that they can arise in more than one stage, and that a register-based intervention can reduce the visible artifacts they cause.</p>
<ul><li>Outlier tokens appear in both ViT encoders and DiT denoisers.</li><li>Intermediate layers are especially prone to internal outlier tokens.</li><li>Masking high-norm tokens alone does not solve the problem.</li><li>Dual-Stage Registers intervene in both stages of the pipeline.</li><li>The paper reports better generation quality and fewer outlier artifacts across ImageNet and text-to-image generation.</li></ul>
<h2>Why developers should care</h2>
<p>If you work on image generation, this paper is a reminder that transformer internals can fail in subtle ways that are easy to miss if you only look at final outputs. Outlier tokens are a model-behavior issue, not just a training bug or a data-cleaning problem. That means they may affect architecture choices, encoder selection, and how you think about inference-time stabilization.</p>
<p>The paper is also useful because it broadens the conversation around registers. Registers are not just a representation-learning trick; here they are used as a practical control mechanism for a generative pipeline. For engineers, that suggests a possible design pattern: when token semantics drift or become unstable, introducing structured tokens may be more effective than blunt token suppression.</p>
<p>There are, however, open questions. The abstract does not tell us how expensive DSR is to train or run, how sensitive it is to model size, or how it behaves outside the reported settings. It also does not establish whether the approach generalizes to non-RAE pipelines or other diffusion backbones. Those are the kinds of details practitioners would want before treating DSR as a default fix.</p>
<p>Even with those limits, the paper is valuable because it identifies a concrete failure mode in a fast-moving part of the generative stack. For teams building DiT-based systems, the takeaway is not just “use a new trick.” It is to pay attention to token norms, attention concentration, and whether the model is preserving local patch semantics as it moves from encoder to denoiser.</p>
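<p>That closing advice can be made concrete. One simple diagnostic, again an illustrative metric of ours rather than anything from the paper, is the share of total attention mass each key token receives; outlier tokens show up as keys that soak up a disproportionate fraction across heads and queries.</p>
<pre><code class="language-python">import torch

def key_attention_share(attn: torch.Tensor) -> torch.Tensor:
    """Fraction of total attention mass received by each key token.

    attn: [batch, heads, queries, keys] softmaxed attention weights.
    Returns [batch, keys], summing to 1 per image. Uniform attention
    gives ~1/keys everywhere; outlier tokens appear as large spikes.
    Illustrative diagnostic, not a metric defined in the paper.
    """
    mass = attn.sum(dim=(1, 2))                   # [batch, keys]
    return mass / mass.sum(dim=-1, keepdim=True)

# Toy check: bias every query toward key 0 and watch it dominate.
logits = torch.randn(2, 12, 256, 256)
logits[..., 0] += 6.0                             # one "sink" key per image
share = key_attention_share(logits.softmax(dim=-1))
print(share[:, 0])    # far above the uniform baseline of 1/256 ~ 0.004
</code></pre>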
<p>In short, this is a paper about making diffusion transformers less brittle by controlling outlier tokens at both ends of the pipeline. If DiTs are going to keep scaling as image generators, papers like this point to the kinds of internal pathologies engineers will need to understand and manage.</p>
<h2>Bottom line</h2>
<p>Outlier tokens are now a generative-model problem, not just a ViT curiosity. This paper’s Dual-Stage Registers approach offers a register-based way to reduce their impact in RAE-DiT pipelines, and the reported result is better image generation quality with fewer artifacts.</p>
defenders","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778807444539-gz7f.png","2026-05-15T01:10:30.04579+00:00",{"id":80,"slug":81,"title":82,"cover_image":83,"image_url":83,"created_at":84,"category":26},"3ad202d1-9e5f-49c5-8383-02fcf1a23cf2","why-linux-security-needs-patch-wave-mindset-en","Why Linux security needs a patch-wave mindset","https:\u002F\u002Fxxdpdyhzhpamafnrdkyq.supabase.co\u002Fstorage\u002Fv1\u002Fobject\u002Fpublic\u002Fcovers\u002Finline-1778741441493-ikl6.png","2026-05-14T06:50:25.906256+00:00",[86,91,96,101,106,111,116,121,126,131],{"id":87,"slug":88,"title":89,"created_at":90},"a2715e72-1fe8-41b3-abb1-d0cf1f710189","ai-predictions-2026-big-changes-en","AI Predictions for 2026: Brace for Big Changes","2026-03-26T01:25:07.788356+00:00",{"id":92,"slug":93,"title":94,"created_at":95},"8404bd7b-4c2f-4109-9ec4-baf29d88af2b","ml-papers-of-the-week-github-research-desk-en","ML Papers of the Week Turns GitHub Into a Research Desk","2026-03-27T01:11:39.480259+00:00",{"id":97,"slug":98,"title":99,"created_at":100},"87897a94-8065-4464-a016-1f23e89e17cc","ai-ml-conferences-to-watch-in-2026-en","AI\u002FML Conferences to Watch in 2026","2026-03-27T01:51:54.184108+00:00",{"id":102,"slug":103,"title":104,"created_at":105},"6f1987cf-25f3-47a4-b3e6-db0997695be8","openclaw-agents-manipulated-self-sabotage-en","OpenClaw Agents Can Be Manipulated Into Failure","2026-03-28T03:03:18.899465+00:00",{"id":107,"slug":108,"title":109,"created_at":110},"a53571ad-735a-4178-9f93-cb09b699d99c","vega-driving-language-instructions-en","Vega: Driving with Natural Language Instructions","2026-03-28T14:54:04.698882+00:00",{"id":112,"slug":113,"title":114,"created_at":115},"a34581d6-f36e-46da-88bb-582fb3e7425c","personalizing-autonomous-driving-styles-en","Drive My Way: Personalizing Autonomous Driving Styles","2026-03-28T14:54:26.148181+00:00",{"id":117,"slug":118,"title":119,"created_at":120},"2bc1ad7f-26ce-4f02-9885-803b35fd229d","training-knowledge-bases-writeback-rag-en","Training Knowledge Bases with WriteBack-RAG","2026-03-28T14:54:45.643433+00:00",{"id":122,"slug":123,"title":124,"created_at":125},"71adc507-3c54-4605-bbe2-c966acd6187e","packforcing-long-video-generation-en","PackForcing: Efficient Long-Video Generation Method","2026-03-28T14:55:02.646943+00:00",{"id":127,"slug":128,"title":129,"created_at":130},"675942ef-b9ec-4c5f-a997-381250b6eacb","pixelsmile-facial-expression-editing-en","PixelSmile Framework Enhances Facial Expression Editing","2026-03-28T14:55:20.633463+00:00",{"id":132,"slug":133,"title":134,"created_at":135},"6954fa2b-8b66-4839-884b-e46f89fa1bc3","adaptive-block-scaled-data-types-en","IF4: Smarter 4-Bit Quantization That Adapts to Your Data","2026-03-31T06:00:36.65963+00:00"]