pith. machine review for the scientific record.

arxiv: 2605.13974 · v1 · submitted 2026-05-13 · 💻 cs.CV · cs.AI · cs.MM

Recognition: no theorem link

Few Channels Draw The Whole Picture: Revealing Massive Activations in Diffusion Transformers

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 05:55 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.MM
keywords diffusion transformers · massive activations · activation analysis · semantic transport · text-to-image generation · hidden channels · prompt interpolation · generation control

The pith

A small set of massive-activation channels in Diffusion Transformers controls image semantics: the channels are functionally critical, spatially organized, and transferable.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies massive activations, a sparse subset of hidden-state channels in Diffusion Transformers whose responses are consistently much larger than the rest. It shows these channels are functionally critical: zeroing them collapses generation quality, while zeroing an equal number of low-magnitude channels barely affects output. They are spatially organized: restricting tokens to these channels and clustering them yields partitions that match the main subject and salient regions. And they are transferable: moving the activations from one prompt-conditioned run to another shifts the image toward the source semantics while retaining much of the target content. This recasts the channels as a prompt-conditioned carrier subspace that organizes semantic information, enabling new editing uses without retraining.

Core claim

Massive activations form a sparse prompt-conditioned carrier subspace that organizes and controls semantic information in modern DiT models. They are functionally critical because a controlled disruption that zeroes them causes sharp collapse in generation quality, while an equally sized set of low-statistic channels has marginal effect. They are spatially organized because restricting image-stream tokens to massive channels and clustering them produces coherent partitions that closely align with the main subject and salient regions. They are transferable because transporting the activations from one prompt trajectory into another shifts the final image toward the source prompt while keeping substantial content from the target, producing localized semantic interpolation rather than unstructured blending.

What carries the argument

massive activations: the small subset of hidden-state channels whose responses are consistently much larger than the rest, acting as a sparse prompt-conditioned carrier subspace
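As a concrete reading of "magnitude statistics," the identification step can be sketched in a few lines. A top-k-by-mean-absolute-activation rule is assumed here; the paper's exact criterion may differ, and `find_massive_channels` is an illustrative name, not the authors' code.

```python
import numpy as np

def find_massive_channels(hidden: np.ndarray, top_k: int = 4) -> np.ndarray:
    """Indices of the top-k channels by mean absolute activation.

    `hidden` is a (tokens, channels) hidden-state matrix. Top-k by
    magnitude is one plausible reading of the paper's criterion.
    """
    per_channel = np.abs(hidden).mean(axis=0)      # (channels,)
    return np.argsort(per_channel)[-top_k:][::-1]  # largest first

# Toy check: channels 2 and 5 dominate by construction.
h = np.random.default_rng(0).normal(size=(16, 8))
h[:, 2] += 50.0
h[:, 5] -= 40.0
print(find_massive_channels(h, top_k=2))  # -> [2 5]
```

In a real DiT one would capture `hidden` with a forward hook at a chosen block and timestep; the load-bearing premise below is that this index set stays stable across prompts.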

If this is right

  • Zeroing the massive channels causes sharp collapse in generation quality while zeroing an equal number of low-magnitude channels does not.
  • Clustering tokens restricted to massive channels produces spatial partitions that align with the main subject and salient regions.
  • Transporting massive activations between prompt trajectories shifts the output toward the source semantics while preserving target content.
  • The transport property supports text-conditioned and image-conditioned semantic editing and subject-driven generation without any additional training.
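The second bullet is easy to make concrete: restrict tokens to the massive channels, cluster, and reshape the labels onto the latent grid. A minimal k-means sketch under stated assumptions (evenly spaced initialization; names illustrative, not from the paper):

```python
import numpy as np

def cluster_tokens(hidden: np.ndarray, massive_idx: np.ndarray,
                   k: int = 2, iters: int = 20) -> np.ndarray:
    """Tiny k-means over tokens restricted to the massive channels.

    In the paper, the resulting labels, reshaped to the latent grid,
    align with subject vs. background. The init is illustrative.
    """
    x = hidden[:, massive_idx]                              # (tokens, m)
    centers = x[np.linspace(0, len(x) - 1, k).astype(int)]  # spread init
    for _ in range(iters):
        dist = ((x[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = dist.argmin(axis=1)
        centers = np.stack([x[labels == j].mean(axis=0) for j in range(k)])
    return labels

# Toy: two blobs separated only along a "massive" channel.
toy = np.random.default_rng(1).normal(scale=0.1, size=(8, 6))
toy[:4, 0] += 5.0   # "subject" tokens
toy[4:, 0] -= 5.0   # "background" tokens
labels = cluster_tokens(toy, massive_idx=np.array([0, 1]))
```

With real latents the labels would be reshaped to the (H, W) token grid and compared against the image's subject region.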

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Focusing computation or pruning on these channels could improve efficiency in large DiT models.
  • The same magnitude-based identification might reveal similar sparse subspaces in other transformer generators.
  • Semantic transport via activations could enable more controllable editing interfaces for artists without model fine-tuning.
  • If the pattern holds across scales, it might indicate a general architectural feature of diffusion transformers rather than a model-specific quirk.

Load-bearing premise

Identification of massive channels through magnitude statistics stays stable across prompts and models, and the zeroing probe isolates their causal role without confounding network dynamics.

What would settle it

An experiment in which zeroing the massive channels identified on one set of prompts leaves generation quality intact on a different prompt or model, or in which their token clustering fails to align with image subjects.
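The clustering half of that test is quantifiable with standard overlap metrics against ground-truth subject masks; a minimal sketch (the helper is hypothetical, not from the paper):

```python
import numpy as np

def iou_dice(pred: np.ndarray, gt: np.ndarray):
    """IoU and Dice between a binary cluster mask and a subject mask."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    iou = inter / union if union else 1.0
    dice = 2 * inter / (pred.sum() + gt.sum()) if (pred.sum() + gt.sum()) else 1.0
    return float(iou), float(dice)

pred = np.array([[1, 1, 0, 0]])  # cluster mask from massive channels
gt   = np.array([[1, 0, 0, 0]])  # ground-truth subject mask
print(iou_dice(pred, gt))  # -> (0.5, 0.6666666666666666)
```

Consistently low scores on held-out prompts would count against the "structured spatial code" reading.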

Figures

Figures reproduced from arXiv: 2605.13974 by Davide Bucciarelli, Evelyn Turri, Lorenzo Baraldi, Marcella Cornia, Sara Sarto.

Figure 1: Overview of our findings. (Left) Disrupting the top-k massive activation channels severely degrades generation quality, revealing their functional importance. (Center) These channels exhibit structured spatial organization, which we capture as a mask. (Right) Transplanting the top-k activations within this mask from a source to a target generation enables localized semantic transfer, producing coherent…

Figure 2: Effect of channel disruption across models, metrics, and streams.

Figure 3: Spatial structure of MAs on the FLUX.2-klein model. (A) Channel-wise activation maps…

Figure 4: Activation transport via MAs. (Left) Overall pipeline…

Figure 5: Semantic effect of activations transport across models and layer regimes.

Figure 6: Qualitative examples of prompt-to-prompt (left) and image-conditioned (right) semantic transport.

Figure 7: Effects of channel disruption on GenAI-Bench, with top-k…

Figure 8: Metrics for dichotomous image segmentations across layers for various generative models.

Figure 9: Screenshot of the platform used for collecting user-study data.

Figure 10: Qualitative examples of MAs-based text-conditioned transport on FLUX.1-dev.

Figure 11: Qualitative examples of MAs-based text-conditioned transport on Qwen-Image.

Figure 12: Qualitative examples of MAs-based text-conditioned transport on FLUX.2-klein.

Figure 13: Qualitative examples of MAs-based text-conditioned transport on FLUX.1-schnell.

Figure 14: Qualitative examples of MAs-based image-conditioned transport on FLUX.1-dev.

Figure 15: Qualitative examples of MAs-based image-conditioned transport on Qwen-Image.

Figure 16: Qualitative examples of MAs-based image-conditioned transport on FLUX.2-klein.

Figure 17: Qualitative examples of MAs-based image-conditioned transport on FLUX.1-schnell.

Figure 18: Qualitative examples of MAs-based image-conditioned transport on SANA1.5.

Figure 19: Qualitative examples for MAs-based masking extraction, on FLUX.1-dev (top row) and…
Original abstract

Diffusion Transformers (DiTs) and related flow-based architectures are now among the strongest text-to-image generators, yet the internal mechanisms through which prompts shape image semantics remain poorly understood. In this work, we study massive activations: a small subset of hidden-state channels whose responses are consistently much larger than the rest. We show that, despite their sparsity, these few channels effectively draw the whole picture, in three complementary senses. First, they are functionally critical: a controlled disruption probe that zeroes the massive channels causes a sharp collapse in generation quality, while disrupting an equally-sized set of low-statistic channels has marginal effect. Second, they are spatially organized: restricting image-stream tokens to massive channels and clustering them yields coherent partitions that closely align with the main subject and salient regions, exposing a structured spatial code hidden inside an apparently outlier-like subspace. Third, they are transferable: transporting massive activations from one prompt-conditioned trajectory into another, shifts the final image toward the source prompt while preserving substantial content from the target, producing localized semantic interpolation rather than unstructured pixel blending. We exploit this property in two use cases: text-conditioned and image-conditioned semantic transport, where massive activations transport enables prompt interpolation and subject-driven generation without any additional training. Together, these results recast massive activations not as activation anomalies, but as a sparse prompt-conditioned carrier subspace that organizes and controls semantic information in modern DiT models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper studies massive activations—a sparse subset of hidden-state channels with consistently larger magnitudes—in Diffusion Transformers (DiTs). It claims these channels are functionally critical (zeroing them sharply degrades generation quality while zeroing low-magnitude channels does not), spatially organized (clustering massive-channel tokens yields coherent subject-aligned partitions), and transferable (transporting them between prompt trajectories produces localized semantic interpolation). The authors demonstrate two downstream uses: text-conditioned prompt interpolation and image-conditioned subject-driven generation, both without additional training.

Significance. If the empirical claims hold after addressing controls, the work identifies a sparse, prompt-conditioned carrier subspace that organizes semantic information in modern DiTs. This reframes massive activations from anomalies to a structured mechanism, with immediate applications for training-free semantic editing. The combination of disruption probes, spatial clustering, and cross-prompt transport provides complementary evidence; the absence of parameter fitting or invented axioms is a strength.

major comments (3)
  1. [disruption probe section] § on controlled disruption probe (zeroing experiment): zeroing the top-k massive channels necessarily perturbs per-token mean and variance fed to subsequent RMSNorm/LayerNorm layers. Without a matched control that restores original activation statistics (e.g., re-normalization or bias correction after zeroing), the observed quality collapse could arise from global distributional shift rather than loss of semantic content carried by those channels alone. A quantitative comparison of activation statistics before/after zeroing is required.
  2. [spatial clustering section] Spatial organization section: the claim that clustering massive-channel tokens produces partitions that 'closely align with the main subject' needs quantitative validation. Report IoU, Dice, or precision-recall against ground-truth subject masks across a held-out prompt set; qualitative examples alone are insufficient to support the 'structured spatial code' conclusion.
  3. [transfer experiment section] Transferability experiments: when transporting massive activations from source to target trajectory, clarify whether the remaining (non-massive) channels retain their original statistics or are also rescaled. If the transport implicitly alters layer-norm inputs, the observed semantic shift may not be attributable solely to the massive channels.
minor comments (2)
  1. [figures] Figure captions should explicitly state the number of prompts, models, and random seeds used for each panel so readers can assess reproducibility.
  2. [methods] Notation for 'massive channels' should be defined once (e.g., channels whose magnitude exceeds k standard deviations of the layer) and used consistently; current usage mixes 'top-k' and 'outlier' terminology.
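Major comment 1 asks for a zeroing probe whose per-token statistics match the original. One way to build that matched control is to zero the channels and then restore each token's mean and variance before the next normalization layer; a hedged sketch of the referee's proposed control, not the paper's protocol:

```python
import numpy as np

def zero_and_renormalize(hidden: np.ndarray, idx: np.ndarray) -> np.ndarray:
    """Zero the given channels, then restore per-token mean/std.

    Downstream LayerNorm/RMSNorm then sees inputs distributed like the
    originals, isolating semantic loss from distributional shift.
    """
    out = hidden.copy()
    out[:, idx] = 0.0
    mu0 = hidden.mean(axis=1, keepdims=True)
    sd0 = hidden.std(axis=1, keepdims=True)
    mu1 = out.mean(axis=1, keepdims=True)
    sd1 = out.std(axis=1, keepdims=True)
    return (out - mu1) / (sd1 + 1e-6) * sd0 + mu0

h = np.random.default_rng(0).normal(size=(4, 16))
probed = zero_and_renormalize(h, np.array([0, 1]))
```

If quality still collapses under this control, the collapse is harder to attribute to normalization artifacts. (Note the rescaling moves the zeroed channels to small per-token constants; a cleaner variant might renormalize only the surviving channels.)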

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the empirical support without altering the core claims.

Point-by-point responses
  1. Referee: [disruption probe section] § on controlled disruption probe (zeroing experiment): zeroing the top-k massive channels necessarily perturbs per-token mean and variance fed to subsequent RMSNorm/LayerNorm layers. Without a matched control that restores original activation statistics (e.g., re-normalization or bias correction after zeroing), the observed quality collapse could arise from global distributional shift rather than loss of semantic content carried by those channels alone. A quantitative comparison of activation statistics before/after zeroing is required.

    Authors: We agree that zeroing channels perturbs per-token statistics for subsequent normalization layers. Our existing control—zeroing an equal number of low-magnitude channels—undergoes a comparable distributional shift yet produces only marginal quality degradation, which supports that the collapse is driven by loss of semantic content rather than the shift alone. To address the request directly, we will add quantitative tables comparing mean and variance (pre- and post-zeroing) for both massive and low-magnitude cases across layers and timesteps in the revised manuscript. revision: yes

  2. Referee: [spatial clustering section] Spatial organization section: the claim that clustering massive-channel tokens produces partitions that 'closely align with the main subject' needs quantitative validation. Report IoU, Dice, or precision-recall against ground-truth subject masks across a held-out prompt set; qualitative examples alone are insufficient to support the 'structured spatial code' conclusion.

    Authors: We concur that quantitative metrics are needed to substantiate the spatial alignment claim. We will generate ground-truth subject masks for a held-out prompt set, compute IoU and Dice scores for the clusters obtained from massive-channel tokens, and report mean scores with standard deviations in the revised spatial organization section. revision: yes

  3. Referee: [transfer experiment section] Transferability experiments: when transporting massive activations from source to target trajectory, clarify whether the remaining (non-massive) channels retain their original statistics or are also rescaled. If the transport implicitly alters layer-norm inputs, the observed semantic shift may not be attributable solely to the massive channels.

    Authors: In the transfer procedure we replace only the values of the massive channels with those from the source trajectory; the non-massive channels are left exactly as computed in the target trajectory with no rescaling or adjustment. Consequently, any change to layer-norm inputs arises exclusively from the massive-channel substitution. We will add an explicit description of this protocol, including pseudocode, to the transfer experiment section of the revised manuscript. revision: yes
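The protocol as stated in response 3 is a one-line substitution; a sketch of the kind of pseudocode the authors commit to adding (names hypothetical, not the authors' code):

```python
import numpy as np

def transport(target_hidden: np.ndarray, source_hidden: np.ndarray,
              massive_idx: np.ndarray) -> np.ndarray:
    """Copy only the massive channels from source into target.

    Non-massive channels keep their target-trajectory values exactly,
    so any change to layer-norm inputs comes from the substitution alone.
    """
    out = target_hidden.copy()
    out[:, massive_idx] = source_hidden[:, massive_idx]
    return out

src = np.full((4, 8), 7.0)   # stand-in for a source-trajectory state
tgt = np.zeros((4, 8))       # stand-in for the target-trajectory state
out = transport(tgt, src, massive_idx=np.array([2, 5]))
```

In practice this substitution would be applied at selected layers and timesteps of the target trajectory's forward passes.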

Circularity Check

0 steps flagged

No significant circularity; purely empirical observations

full rationale

The paper contains no derivations, equations, fitted parameters, or self-citation chains that reduce claims to their own inputs. Massive channels are identified via direct magnitude statistics on observed activations; functional criticality is shown by zeroing interventions whose outcomes are measured independently; spatial organization and transferability are demonstrated through clustering and cross-prompt activation swapping. None of these steps define a quantity in terms of itself or rename a fitted result as a prediction. The analysis is self-contained against external benchmarks and exhibits none of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is observational with no explicit mathematical derivations; no free parameters, axioms, or invented entities are stated in the abstract.

pith-pipeline@v0.9.0 · 5563 in / 1013 out tokens · 37841 ms · 2026-05-15T05:55:17.889017+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 4 internal anchors

  1. [1]

    Building normalizing flows with stochastic interpolants

Michael S Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. In ICLR, 2023

  2. [2]

    All are Worth Words: A ViT Backbone for Diffusion Models

Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are Worth Words: A ViT Backbone for Diffusion Models. In CVPR, 2023

  3. [3]

    Tiny Inference-Time Scaling with Latent Verifiers

Davide Bucciarelli, Evelyn Turri, Lorenzo Baraldi, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. Tiny Inference-Time Scaling with Latent Verifiers. In CVPR Findings, 2026

  4. [4]

    SANA-Sprint: One-Step Diffusion with Continuous-Time Consistency Distillation

Junsong Chen, Shuchen Xue, Yuyang Zhao, Jincheng Yu, Sayak Paul, Junyu Chen, Han Cai, Song Han, and Enze Xie. SANA-Sprint: One-Step Diffusion with Continuous-Time Consistency Distillation. In ICCV, 2025

  5. [5]

    Vision Transformers Need Registers

Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision Transformers Need Registers. In ICLR, 2024

  6. [6]

    Scaling Vision Transformers to 22 Billion Parameters

Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling Vision Transformers to 22 Billion Parameters. In ICML, 2023

  7. [7]

    Diffusion Models Beat GANs on Image Synthesis

    Prafulla Dhariwal and Alexander Nichol. Diffusion Models Beat GANs on Image Synthesis. In NeurIPS, 2021

  8. [8]

    Attention (as Discrete-Time Markov) Chains

Yotam Erel, Olaf Dünkel, Rishabh Dabral, Vladislav Golyanik, Christian Theobalt, and Amit Haim Bermano. Attention (as Discrete-Time Markov) Chains. In NeurIPS, 2025

  9. [9]

    Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024

  10. [10]

    Unleashing Diffusion Transformers for Visual Correspondence by Modulating Massive Activations

Chaofan Gan, Yuanpeng Tu, Xi Chen, Tieyuan Chen, Yuxi Li, Mehrtash Harandi, and Weiyao Lin. Unleashing Diffusion Transformers for Visual Correspondence by Modulating Massive Activations. In NeurIPS, 2025

  11. [11]

    Massive Activations are the Key to Local Detail Synthesis in Diffusion Transformers

Chaofan Gan, Zicheng Zhao, Yuanpeng Tu, Xi Chen, Ziran Qin, Tieyuan Chen, Mehrtash Harandi, and Weiyao Lin. Massive Activations are the Key to Local Detail Synthesis in Diffusion Transformers. In ICLR, 2026

  12. [12]

Tokenverse: Versatile multi-concept personalization in token modulation space

Daniel Garibi, Shahar Yadin, Roni Paiss, Omer Tov, Shiran Zada, Ariel Ephrat, Tomer Michaeli, Inbar Mosseri, and Tali Dekel. Tokenverse: Versatile multi-concept personalization in token modulation space. ACM TOG, 2025

  13. [13]

    Unsupervised Semantic Correspondence Using Stable Diffusion

Eric Hedlin, Gopal Sharma, Shweta Mahajan, Hossam Isack, Abhishek Kar, Andrea Tagliasacchi, and Kwang Moo Yi. Unsupervised Semantic Correspondence Using Stable Diffusion. In NeurIPS, 2023

  14. [14]

    ConceptAttention: Diffusion Transformers Learn Highly Interpretable Features

    Alec Helbling, Tuna Han Salih Meral, Benjamin Hoover, Pinar Yanardag, and Duen Horng Chau. ConceptAttention: Diffusion Transformers Learn Highly Interpretable Features. In ICML, 2025

  15. [15]

    CLIPScore: A Reference-free Evaluation Metric for Image Captioning

Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A Reference-free Evaluation Metric for Image Captioning. In EMNLP, 2021

  16. [16]

    GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In NeurIPS, 2017

  17. [17]

    Denoising Diffusion Probabilistic Models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic Models. In NeurIPS, 2020

  18. [18]

FLUX

Black Forest Labs. FLUX. https://github.com/black-forest-labs/flux, 2024

  19. [19]

    FLUX.2: Frontier Visual Intelligence

    Black Forest Labs. FLUX.2: Frontier Visual Intelligence. https://bfl.ai/blog/flux-2, 2025

  20. [20]

    FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, et al. FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space. arXiv preprint arXiv:2506.15742, 2025

  21. [21]

GenAI-Bench: Evaluating and improving compositional text-to-visual generation

Baiqi Li, Zhiqiu Lin, Deepak Pathak, Jiayao Li, Yixin Fei, Kewen Wu, Tiffany Ling, Xide Xia, Pengchuan Zhang, Graham Neubig, et al. GenAI-Bench: Evaluating and improving compositional text-to-visual generation. arXiv preprint arXiv:2406.13743, 2024

  22. [22]

    Flow Matching for Generative Modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow Matching for Generative Modeling. In ICLR, 2023

  23. [23]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow. In ICLR, 2023

  24. [24]

    Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, 2023

  25. [25]

    DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation

Yuang Peng, Yuxin Cui, Haomiao Tang, Zekun Qi, Runpei Dong, Jing Bai, Chunrui Han, Zheng Ge, Xiangyu Zhang, and Shu-Tao Xia. DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation. In ICLR, 2025

  26. [26]

    A Hidden Semantic Bottleneck in Conditional Embeddings of Diffusion Transformers

Trung X Pham, Kang Zhang, Ji Woo Hong, and Chang D Yoo. A Hidden Semantic Bottleneck in Conditional Embeddings of Diffusion Transformers. In ICLR, 2026

  27. [27]

    High-Resolution Image Synthesis with Latent Diffusion Models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. In CVPR, 2022

  28. [28]

ImageNet Large Scale Visual Recognition Challenge

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015

  29. [29]

LAION-5B: An open large-scale dataset for training next generation image-text models

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. In NeurIPS, 2022

  30. [30]

Exploring Multimodal Diffusion Transformers for Enhanced Prompt-based Image Editing

Joonghyuk Shin, Alchan Hwang, Yujin Kim, Daneul Kim, and Jaesik Park. Exploring Multimodal Diffusion Transformers for Enhanced Prompt-based Image Editing. In ICCV, 2025

  31. [31]

    Massive Activations in Large Language Models

Mingjie Sun, Xinlei Chen, J Zico Kolter, and Zhuang Liu. Massive Activations in Large Language Models. In COLM, 2024

  32. [32]

    What the DAAM: Interpreting Stable Diffusion Using Cross Attention

Raphael Tang, Linqing Liu, Akshat Pandey, Zhiying Jiang, Gefei Yang, Karun Kumar, Pontus Stenetorp, Jimmy Lin, and Ferhan Ture. What the DAAM: Interpreting Stable Diffusion Using Cross Attention. In ACL, 2023

  33. [33]

    Gemma 3 Technical Report

Gemma Team. Gemma 3 Technical Report. arXiv preprint arXiv:2503.19786, 2025

  34. [34]

    Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion

Junjiao Tian, Lavisha Aggarwal, Andrea Colaco, Zsolt Kira, and Mar Gonzalez-Franco. Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion. In CVPR, 2024

  35. [35]

    Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017

  36. [36]

    Qwen-Image Technical Report

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-Image Technical Report. arXiv preprint arXiv:2508.02324, 2025

  37. [37]

    SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer

Enze Xie, Junsong Chen, Yuyang Zhao, Jincheng Yu, Ligeng Zhu, Yujun Lin, Zhekai Zhang, Muyang Li, Junyu Chen, Han Cai, et al. SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer. In ICML, 2025

  38. [38]

    ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation

Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation. In NeurIPS, 2023

  39. [39]

    Qwen2.5 Technical Report

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, et al. Qwen2.5 Technical Report. arXiv preprint arXiv:2412.15115, 2024

  40. [40]

    A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot Semantic Correspondence

Junyi Zhang, Charles Herrmann, Junhwa Hur, Luisa Polania Cabrera, Varun Jampani, Deqing Sun, and Ming-Hsuan Yang. A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot Semantic Correspondence. In NeurIPS, 2023

  41. [41]

Bilateral Reference for High-Resolution Dichotomous Image Segmentation

Peng Zheng, Dehong Gao, Deng-Ping Fan, Li Liu, Jorma Laaksonen, Wanli Ouyang, and Nicu Sebe. Bilateral Reference for High-Resolution Dichotomous Image Segmentation. CAAI Artificial Intelligence Research, 2024