pith. machine review for the scientific record.

arxiv: 2605.13974 · v1 · submitted 2026-05-13 · 💻 cs.CV · cs.AI · cs.MM

Recognition: no theorem link

Few Channels Draw The Whole Picture: Revealing Massive Activations in Diffusion Transformers

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 05:55 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.MM
keywords diffusion transformers · massive activations · activation analysis · semantic transport · text-to-image generation · hidden channels · prompt interpolation · generation control

The pith

A small set of massive-activation channels in Diffusion Transformers controls image semantics: the channels are functionally critical, spatially organized, and transferable.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies massive activations, a sparse subset of hidden-state channels in Diffusion Transformers whose responses are consistently much larger than the rest. It shows these channels are functionally critical: zeroing them collapses generation quality, while zeroing an equal number of low-magnitude channels barely affects output. They are spatially organized: restricting tokens to these channels and clustering them yields partitions that match the main subject and salient regions. And they are transferable: moving the activations from one prompt-conditioned run to another shifts the image toward the source semantics while retaining much of the target content. This recasts the channels as a prompt-conditioned carrier subspace that organizes semantic information, enabling new editing uses without retraining.

Core claim

Massive activations form a sparse prompt-conditioned carrier subspace that organizes and controls semantic information in modern DiT models. They are functionally critical because a controlled disruption that zeroes them causes sharp collapse in generation quality, while an equally sized set of low-statistic channels has marginal effect. They are spatially organized because restricting image-stream tokens to massive channels and clustering them produces coherent partitions that closely align with the main subject and salient regions. They are transferable because transporting the activations from one prompt trajectory into another shifts the final image toward the source prompt while keeping substantial content from the target, producing localized semantic interpolation rather than unstructured blending.

What carries the argument

massive activations: the small subset of hidden-state channels whose responses are consistently much larger than the rest, acting as a sparse prompt-conditioned carrier subspace
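As a concrete reading of "magnitude statistics," the identification step can be sketched in a few lines. A top-k-by-mean-absolute-activation rule is assumed here; the paper's exact criterion may differ, and `find_massive_channels` is an illustrative name, not the authors' code.

```python
import numpy as np

def find_massive_channels(hidden: np.ndarray, top_k: int = 4) -> np.ndarray:
    """Indices of the top-k channels by mean absolute activation.

    `hidden` is a (tokens, channels) hidden-state matrix. Top-k by
    magnitude is one plausible reading of the paper's criterion.
    """
    per_channel = np.abs(hidden).mean(axis=0)      # (channels,)
    return np.argsort(per_channel)[-top_k:][::-1]  # largest first

# Toy check: channels 2 and 5 dominate by construction.
h = np.random.default_rng(0).normal(size=(16, 8))
h[:, 2] += 50.0
h[:, 5] -= 40.0
print(find_massive_channels(h, top_k=2))  # -> [2 5]
```

In a real DiT one would capture `hidden` with a forward hook at a chosen block and timestep; the load-bearing premise below is that this index set stays stable across prompts.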

If this is right

  • Zeroing the massive channels causes sharp collapse in generation quality while zeroing an equal number of low-magnitude channels does not.
  • Clustering tokens restricted to massive channels produces spatial partitions that align with the main subject and salient regions.
  • Transporting massive activations between prompt trajectories shifts the output toward the source semantics while preserving target content.
  • The transport property supports text-conditioned and image-conditioned semantic editing and subject-driven generation without any additional training.
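The second bullet is easy to make concrete: restrict tokens to the massive channels, cluster, and reshape the labels onto the latent grid. A minimal k-means sketch under stated assumptions (evenly spaced initialization; names illustrative, not from the paper):

```python
import numpy as np

def cluster_tokens(hidden: np.ndarray, massive_idx: np.ndarray,
                   k: int = 2, iters: int = 20) -> np.ndarray:
    """Tiny k-means over tokens restricted to the massive channels.

    In the paper, the resulting labels, reshaped to the latent grid,
    align with subject vs. background. The init is illustrative.
    """
    x = hidden[:, massive_idx]                              # (tokens, m)
    centers = x[np.linspace(0, len(x) - 1, k).astype(int)]  # spread init
    for _ in range(iters):
        dist = ((x[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = dist.argmin(axis=1)
        centers = np.stack([x[labels == j].mean(axis=0) for j in range(k)])
    return labels

# Toy: two blobs separated only along a "massive" channel.
toy = np.random.default_rng(1).normal(scale=0.1, size=(8, 6))
toy[:4, 0] += 5.0   # "subject" tokens
toy[4:, 0] -= 5.0   # "background" tokens
labels = cluster_tokens(toy, massive_idx=np.array([0, 1]))
```

With real latents the labels would be reshaped to the (H, W) token grid and compared against the image's subject region.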

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Focusing computation or pruning on these channels could improve efficiency in large DiT models.
  • The same magnitude-based identification might reveal similar sparse subspaces in other transformer generators.
  • Semantic transport via activations could enable more controllable editing interfaces for artists without model fine-tuning.
  • If the pattern holds across scales, it might indicate a general architectural feature of diffusion transformers rather than a model-specific quirk.

Load-bearing premise

Identification of massive channels through magnitude statistics stays stable across prompts and models, and the zeroing probe isolates their causal role without confounding network dynamics.

What would settle it

An experiment in which zeroing the massive channels identified on one set of prompts leaves generation quality intact on a different prompt or model, or in which their token clustering fails to align with image subjects.
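The clustering half of that test is quantifiable with standard overlap metrics against ground-truth subject masks; a minimal sketch (the helper is hypothetical, not from the paper):

```python
import numpy as np

def iou_dice(pred: np.ndarray, gt: np.ndarray):
    """IoU and Dice between a binary cluster mask and a subject mask."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    iou = inter / union if union else 1.0
    dice = 2 * inter / (pred.sum() + gt.sum()) if (pred.sum() + gt.sum()) else 1.0
    return float(iou), float(dice)

pred = np.array([[1, 1, 0, 0]])  # cluster mask from massive channels
gt   = np.array([[1, 0, 0, 0]])  # ground-truth subject mask
print(iou_dice(pred, gt))  # -> (0.5, 0.6666666666666666)
```

Consistently low scores on held-out prompts would count against the "structured spatial code" reading.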

Figures

Figures reproduced from arXiv: 2605.13974 by Davide Bucciarelli, Evelyn Turri, Lorenzo Baraldi, Marcella Cornia, Sara Sarto.

Figure 1: Overview of our findings. (Left) Disrupting the top-k massive activation channels severely degrades generation quality, revealing their functional importance. (Center) These channels exhibit structured spatial organization, which we capture as a mask. (Right) Transplanting the top-k activations within this mask from a source to a target generation enables localized semantic transfer, producing coherent…

Figure 2: Effect of channel disruption across models, metrics, and streams.

Figure 3: Spatial structure of MAs on the FLUX.2-klein model. (A) Channel-wise activation maps…

Figure 4: Activation transport via MAs. (Left) Overall pipeline…

Figure 5: Semantic effect of activations transport across models and layer regimes.

Figure 6: Qualitative examples of prompt-to-prompt (left) and image-conditioned (right) semantic transport.

Figure 7: Effects of channel disruption on GenAI-Bench, with top-k…

Figure 8: Metrics for dichotomous image segmentations across layers for various generative models.

Figure 9: Screenshot of the platform used for collecting user-study data.

Figure 10: Qualitative examples of MAs-based text-conditioned transport on FLUX.1-dev.

Figure 11: Qualitative examples of MAs-based text-conditioned transport on Qwen-Image.

Figure 12: Qualitative examples of MAs-based text-conditioned transport on FLUX.2-klein.

Figure 13: Qualitative examples of MAs-based text-conditioned transport on FLUX.1-schnell.

Figure 14: Qualitative examples of MAs-based image-conditioned transport on FLUX.1-dev.

Figure 15: Qualitative examples of MAs-based image-conditioned transport on Qwen-Image.

Figure 16: Qualitative examples of MAs-based image-conditioned transport on FLUX.2-klein.

Figure 17: Qualitative examples of MAs-based image-conditioned transport on FLUX.1-schnell.

Figure 18: Qualitative examples of MAs-based image-conditioned transport on SANA1.5.

Figure 19: Qualitative examples for MAs-based masking extraction, on FLUX.1-dev (top row) and…
Original abstract

Diffusion Transformers (DiTs) and related flow-based architectures are now among the strongest text-to-image generators, yet the internal mechanisms through which prompts shape image semantics remain poorly understood. In this work, we study massive activations: a small subset of hidden-state channels whose responses are consistently much larger than the rest. We show that, despite their sparsity, these few channels effectively draw the whole picture, in three complementary senses. First, they are functionally critical: a controlled disruption probe that zeroes the massive channels causes a sharp collapse in generation quality, while disrupting an equally-sized set of low-statistic channels has marginal effect. Second, they are spatially organized: restricting image-stream tokens to massive channels and clustering them yields coherent partitions that closely align with the main subject and salient regions, exposing a structured spatial code hidden inside an apparently outlier-like subspace. Third, they are transferable: transporting massive activations from one prompt-conditioned trajectory into another, shifts the final image toward the source prompt while preserving substantial content from the target, producing localized semantic interpolation rather than unstructured pixel blending. We exploit this property in two use cases: text-conditioned and image-conditioned semantic transport, where massive activations transport enables prompt interpolation and subject-driven generation without any additional training. Together, these results recast massive activations not as activation anomalies, but as a sparse prompt-conditioned carrier subspace that organizes and controls semantic information in modern DiT models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper studies massive activations—a sparse subset of hidden-state channels with consistently larger magnitudes—in Diffusion Transformers (DiTs). It claims these channels are functionally critical (zeroing them sharply degrades generation quality while zeroing low-magnitude channels does not), spatially organized (clustering massive-channel tokens yields coherent subject-aligned partitions), and transferable (transporting them between prompt trajectories produces localized semantic interpolation). The authors demonstrate two downstream uses: text-conditioned prompt interpolation and image-conditioned subject-driven generation, both without additional training.

Significance. If the empirical claims hold after addressing controls, the work identifies a sparse, prompt-conditioned carrier subspace that organizes semantic information in modern DiTs. This reframes massive activations from anomalies to a structured mechanism, with immediate applications for training-free semantic editing. The combination of disruption probes, spatial clustering, and cross-prompt transport provides complementary evidence; the absence of parameter fitting or invented axioms is a strength.

major comments (3)
  1. [disruption probe section] § on controlled disruption probe (zeroing experiment): zeroing the top-k massive channels necessarily perturbs per-token mean and variance fed to subsequent RMSNorm/LayerNorm layers. Without a matched control that restores original activation statistics (e.g., re-normalization or bias correction after zeroing), the observed quality collapse could arise from global distributional shift rather than loss of semantic content carried by those channels alone. A quantitative comparison of activation statistics before/after zeroing is required.
  2. [spatial clustering section] Spatial organization section: the claim that clustering massive-channel tokens produces partitions that 'closely align with the main subject' needs quantitative validation. Report IoU, Dice, or precision-recall against ground-truth subject masks across a held-out prompt set; qualitative examples alone are insufficient to support the 'structured spatial code' conclusion.
  3. [transfer experiment section] Transferability experiments: when transporting massive activations from source to target trajectory, clarify whether the remaining (non-massive) channels retain their original statistics or are also rescaled. If the transport implicitly alters layer-norm inputs, the observed semantic shift may not be attributable solely to the massive channels.
minor comments (2)
  1. [figures] Figure captions should explicitly state the number of prompts, models, and random seeds used for each panel so readers can assess reproducibility.
  2. [methods] Notation for 'massive channels' should be defined once (e.g., channels whose magnitude exceeds k standard deviations of the layer) and used consistently; current usage mixes 'top-k' and 'outlier' terminology.
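Major comment 1 asks for a zeroing probe whose per-token statistics match the original. One way to build that matched control is to zero the channels and then restore each token's mean and variance before the next normalization layer; a hedged sketch of the referee's proposed control, not the paper's protocol:

```python
import numpy as np

def zero_and_renormalize(hidden: np.ndarray, idx: np.ndarray) -> np.ndarray:
    """Zero the given channels, then restore per-token mean/std.

    Downstream LayerNorm/RMSNorm then sees inputs distributed like the
    originals, isolating semantic loss from distributional shift.
    """
    out = hidden.copy()
    out[:, idx] = 0.0
    mu0 = hidden.mean(axis=1, keepdims=True)
    sd0 = hidden.std(axis=1, keepdims=True)
    mu1 = out.mean(axis=1, keepdims=True)
    sd1 = out.std(axis=1, keepdims=True)
    return (out - mu1) / (sd1 + 1e-6) * sd0 + mu0

h = np.random.default_rng(0).normal(size=(4, 16))
probed = zero_and_renormalize(h, np.array([0, 1]))
```

If quality still collapses under this control, the collapse is harder to attribute to normalization artifacts. (Note the rescaling moves the zeroed channels to small per-token constants; a cleaner variant might renormalize only the surviving channels.)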

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the empirical support without altering the core claims.

Point-by-point responses
  1. Referee: [disruption probe section] § on controlled disruption probe (zeroing experiment): zeroing the top-k massive channels necessarily perturbs per-token mean and variance fed to subsequent RMSNorm/LayerNorm layers. Without a matched control that restores original activation statistics (e.g., re-normalization or bias correction after zeroing), the observed quality collapse could arise from global distributional shift rather than loss of semantic content carried by those channels alone. A quantitative comparison of activation statistics before/after zeroing is required.

    Authors: We agree that zeroing channels perturbs per-token statistics for subsequent normalization layers. Our existing control—zeroing an equal number of low-magnitude channels—undergoes a comparable distributional shift yet produces only marginal quality degradation, which supports that the collapse is driven by loss of semantic content rather than the shift alone. To address the request directly, we will add quantitative tables comparing mean and variance (pre- and post-zeroing) for both massive and low-magnitude cases across layers and timesteps in the revised manuscript. revision: yes

  2. Referee: [spatial clustering section] Spatial organization section: the claim that clustering massive-channel tokens produces partitions that 'closely align with the main subject' needs quantitative validation. Report IoU, Dice, or precision-recall against ground-truth subject masks across a held-out prompt set; qualitative examples alone are insufficient to support the 'structured spatial code' conclusion.

    Authors: We concur that quantitative metrics are needed to substantiate the spatial alignment claim. We will generate ground-truth subject masks for a held-out prompt set, compute IoU and Dice scores for the clusters obtained from massive-channel tokens, and report mean scores with standard deviations in the revised spatial organization section. revision: yes

  3. Referee: [transfer experiment section] Transferability experiments: when transporting massive activations from source to target trajectory, clarify whether the remaining (non-massive) channels retain their original statistics or are also rescaled. If the transport implicitly alters layer-norm inputs, the observed semantic shift may not be attributable solely to the massive channels.

    Authors: In the transfer procedure we replace only the values of the massive channels with those from the source trajectory; the non-massive channels are left exactly as computed in the target trajectory with no rescaling or adjustment. Consequently, any change to layer-norm inputs arises exclusively from the massive-channel substitution. We will add an explicit description of this protocol, including pseudocode, to the transfer experiment section of the revised manuscript. revision: yes
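The protocol as stated in response 3 is a one-line substitution; a sketch of the kind of pseudocode the authors commit to adding (names hypothetical, not the authors' code):

```python
import numpy as np

def transport(target_hidden: np.ndarray, source_hidden: np.ndarray,
              massive_idx: np.ndarray) -> np.ndarray:
    """Copy only the massive channels from source into target.

    Non-massive channels keep their target-trajectory values exactly,
    so any change to layer-norm inputs comes from the substitution alone.
    """
    out = target_hidden.copy()
    out[:, massive_idx] = source_hidden[:, massive_idx]
    return out

src = np.full((4, 8), 7.0)   # stand-in for a source-trajectory state
tgt = np.zeros((4, 8))       # stand-in for the target-trajectory state
out = transport(tgt, src, massive_idx=np.array([2, 5]))
```

In practice this substitution would be applied at selected layers and timesteps of the target trajectory's forward passes.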

Circularity Check

0 steps flagged

No significant circularity; purely empirical observations

full rationale

The paper contains no derivations, equations, fitted parameters, or self-citation chains that reduce claims to their own inputs. Massive channels are identified via direct magnitude statistics on observed activations; functional criticality is shown by zeroing interventions whose outcomes are measured independently; spatial organization and transferability are demonstrated through clustering and cross-prompt activation swapping. None of these steps define a quantity in terms of itself or rename a fitted result as a prediction. The analysis is self-contained against external benchmarks and exhibits none of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is observational with no explicit mathematical derivations; no free parameters, axioms, or invented entities are stated in the abstract.

pith-pipeline@v0.9.0 · 5563 in / 1013 out tokens · 37841 ms · 2026-05-15T05:55:17.889017+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 4 internal anchors

  1. [1]

    Building normalizing flows with stochastic interpolants

Michael S Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. In ICLR, 2023

  2. [2]

    All are Worth Words: A ViT Backbone for Diffusion Models

Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are Worth Words: A ViT Backbone for Diffusion Models. In CVPR, 2023

  3. [3]

    Tiny Inference-Time Scaling with Latent Verifiers

Davide Bucciarelli, Evelyn Turri, Lorenzo Baraldi, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. Tiny Inference-Time Scaling with Latent Verifiers. In CVPR Findings, 2026

  4. [4]

    SANA-Sprint: One-Step Diffusion with Continuous-Time Consistency Distillation

Junsong Chen, Shuchen Xue, Yuyang Zhao, Jincheng Yu, Sayak Paul, Junyu Chen, Han Cai, Song Han, and Enze Xie. SANA-Sprint: One-Step Diffusion with Continuous-Time Consistency Distillation. In ICCV, 2025

  5. [5]

    Vision Transformers Need Registers

Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision Transformers Need Registers. In ICLR, 2024

  6. [6]

    Scaling Vision Transformers to 22 Billion Parameters

Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling Vision Transformers to 22 Billion Parameters. In ICML, 2023

  7. [7]

    Diffusion Models Beat GANs on Image Synthesis

    Prafulla Dhariwal and Alexander Nichol. Diffusion Models Beat GANs on Image Synthesis. In NeurIPS, 2021

  8. [8]

    Attention (as Discrete-Time Markov) Chains

Yotam Erel, Olaf Dünkel, Rishabh Dabral, Vladislav Golyanik, Christian Theobalt, and Amit Haim Bermano. Attention (as Discrete-Time Markov) Chains. In NeurIPS, 2025

  9. [9]

    Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024

  10. [10]

    Unleashing Diffusion Transformers for Visual Correspondence by Modulating Massive Activations

Chaofan Gan, Yuanpeng Tu, Xi Chen, Tieyuan Chen, Yuxi Li, Mehrtash Harandi, and Weiyao Lin. Unleashing Diffusion Transformers for Visual Correspondence by Modulating Massive Activations. In NeurIPS, 2025

  11. [11]

    Massive Activations are the Key to Local Detail Synthesis in Diffusion Transformers

Chaofan Gan, Zicheng Zhao, Yuanpeng Tu, Xi Chen, Ziran Qin, Tieyuan Chen, Mehrtash Harandi, and Weiyao Lin. Massive Activations are the Key to Local Detail Synthesis in Diffusion Transformers. In ICLR, 2026

  12. [12]

Tokenverse: Versatile multi-concept personalization in token modulation space

Daniel Garibi, Shahar Yadin, Roni Paiss, Omer Tov, Shiran Zada, Ariel Ephrat, Tomer Michaeli, Inbar Mosseri, and Tali Dekel. Tokenverse: Versatile multi-concept personalization in token modulation space. ACM TOG, 2025

  13. [13]

    Unsupervised Semantic Correspondence Using Stable Diffusion

Eric Hedlin, Gopal Sharma, Shweta Mahajan, Hossam Isack, Abhishek Kar, Andrea Tagliasacchi, and Kwang Moo Yi. Unsupervised Semantic Correspondence Using Stable Diffusion. In NeurIPS, 2023

  14. [14]

    ConceptAttention: Diffusion Transformers Learn Highly Interpretable Features

    Alec Helbling, Tuna Han Salih Meral, Benjamin Hoover, Pinar Yanardag, and Duen Horng Chau. ConceptAttention: Diffusion Transformers Learn Highly Interpretable Features. In ICML, 2025

  15. [15]

    CLIPScore: A Reference-free Evaluation Metric for Image Captioning

Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A Reference-free Evaluation Metric for Image Captioning. In EMNLP, 2021

  16. [16]

    GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In NeurIPS, 2017

  17. [17]

    Denoising Diffusion Probabilistic Models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic Models. In NeurIPS, 2020

  18. [18]

FLUX

Black Forest Labs. FLUX. https://github.com/black-forest-labs/flux, 2024

  19. [19]

    FLUX.2: Frontier Visual Intelligence

    Black Forest Labs. FLUX.2: Frontier Visual Intelligence. https://bfl.ai/blog/flux-2, 2025

  20. [20]

    FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, et al. FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space. arXiv preprint arXiv:2506.15742, 2025

  21. [21]

GenAI-Bench: Evaluating and improving compositional text-to-visual generation

Baiqi Li, Zhiqiu Lin, Deepak Pathak, Jiayao Li, Yixin Fei, Kewen Wu, Tiffany Ling, Xide Xia, Pengchuan Zhang, Graham Neubig, et al. GenAI-Bench: Evaluating and improving compositional text-to-visual generation. arXiv preprint arXiv:2406.13743, 2024

  22. [22]

    Flow Matching for Generative Modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow Matching for Generative Modeling. In ICLR, 2023

  23. [23]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow. In ICLR, 2023

  24. [24]

    Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, 2023

  25. [25]

    DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation

Yuang Peng, Yuxin Cui, Haomiao Tang, Zekun Qi, Runpei Dong, Jing Bai, Chunrui Han, Zheng Ge, Xiangyu Zhang, and Shu-Tao Xia. DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation. In ICLR, 2025

  26. [26]

    A Hidden Semantic Bottleneck in Conditional Embeddings of Diffusion Transformers

Trung X Pham, Kang Zhang, Ji Woo Hong, and Chang D Yoo. A Hidden Semantic Bottleneck in Conditional Embeddings of Diffusion Transformers. In ICLR, 2026

  27. [27]

    High-Resolution Image Synthesis with Latent Diffusion Models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. In CVPR, 2022

  28. [28]

ImageNet Large Scale Visual Recognition Challenge

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015

  29. [29]

LAION-5B: An open large-scale dataset for training next generation image-text models

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. In NeurIPS, 2022

  30. [30]

Exploring Multimodal Diffusion Transformers for Enhanced Prompt-based Image Editing

Joonghyuk Shin, Alchan Hwang, Yujin Kim, Daneul Kim, and Jaesik Park. Exploring Multimodal Diffusion Transformers for Enhanced Prompt-based Image Editing. In ICCV, 2025

  31. [31]

    Massive Activations in Large Language Models

Mingjie Sun, Xinlei Chen, J Zico Kolter, and Zhuang Liu. Massive Activations in Large Language Models. In COLM, 2024

  32. [32]

    What the DAAM: Interpreting Stable Diffusion Using Cross Attention

Raphael Tang, Linqing Liu, Akshat Pandey, Zhiying Jiang, Gefei Yang, Karun Kumar, Pontus Stenetorp, Jimmy Lin, and Ferhan Ture. What the DAAM: Interpreting Stable Diffusion Using Cross Attention. In ACL, 2023

  33. [33]

    Gemma 3 Technical Report

Gemma Team. Gemma 3 Technical Report. arXiv preprint arXiv:2503.19786, 2025

  34. [34]

    Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion

Junjiao Tian, Lavisha Aggarwal, Andrea Colaco, Zsolt Kira, and Mar Gonzalez-Franco. Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion. In CVPR, 2024

  35. [35]

    Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017

  36. [36]

    Qwen-Image Technical Report

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-Image Technical Report. arXiv preprint arXiv:2508.02324, 2025

  37. [37]

    SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer

Enze Xie, Junsong Chen, Yuyang Zhao, Jincheng Yu, Ligeng Zhu, Yujun Lin, Zhekai Zhang, Muyang Li, Junyu Chen, Han Cai, et al. SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer. In ICML, 2025

  38. [38]

    ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation

Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation. In NeurIPS, 2023

  39. [39]

    Qwen2.5 Technical Report

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, et al. Qwen2.5 Technical Report. arXiv preprint arXiv:2412.15115, 2024

  40. [40]

    A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot Semantic Correspondence

Junyi Zhang, Charles Herrmann, Junhwa Hur, Luisa Polania Cabrera, Varun Jampani, Deqing Sun, and Ming-Hsuan Yang. A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot Semantic Correspondence. In NeurIPS, 2023

  41. [41]

Bilateral Reference for High-Resolution Dichotomous Image Segmentation

Peng Zheng, Dehong Gao, Deng-Ping Fan, Li Liu, Jorma Laaksonen, Wanli Ouyang, and Nicu Sebe. Bilateral Reference for High-Resolution Dichotomous Image Segmentation. CAAI Artificial Intelligence Research, 2024