pith. machine review for the scientific record.

arxiv: 2605.01711 · v2 · submitted 2026-05-03 · 💻 cs.CV

Recognition: unknown

Linear-Time Global Visual Modeling without Explicit Attention

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:49 UTC · model grok-4.3

classification 💻 cs.CV
keywords dynamic parameterization · linear complexity · global visual modeling · attention replacement · vision transformers · MLP layers · sequence modeling

The pith

Attention can be reframed as an MLP with dynamically predicted parameters to deliver global sequence modeling in linear time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reframes the global modeling power of Transformer attention not as explicit pairwise token weighting but as the effect of parameters generated on the fly from the input. By testing several strategies for predicting those parameters and inserting them into ordinary layers, the authors build vision networks that match or approach attention-based performance while scaling linearly with sequence length. A sympathetic reader would see this as evidence that the expensive quadratic step is not necessary once the right kind of dynamic compression of context is available. The result points toward simpler, faster architectures for images, video, and other long sequences.

Core claim

Attention can be mathematically rewritten as a multi-layer perceptron whose weights are generated dynamically from the input sequence. When these dynamic-parameter layers are substituted for explicit attention in vision models, global context is still captured at linear rather than quadratic cost, and empirical results on standard vision benchmarks remain competitive.
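
To make the rewriting concrete, here is a minimal numerical sketch (illustrative code, not the paper's released implementation) showing that single-head softmax attention is exactly an ordinary linear layer whose N × N mixing weights are generated from the input itself:

```python
# Minimal sketch: attention output equals applying an input-dependent weight
# matrix W_dyn(X) = softmax(QK^T / sqrt(d)) to the value projection of X.
import torch

def attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = torch.softmax(Q @ K.T / K.shape[-1] ** 0.5, dim=-1)  # N x N weights
    return A @ V

def as_dynamic_mlp(X, Wq, Wk, Wv):
    # Same computation, read as a layer whose parameters are predicted from X.
    W_dyn = torch.softmax((X @ Wq) @ (X @ Wk).T / Wk.shape[-1] ** 0.5, dim=-1)
    return W_dyn @ (X @ Wv)  # dynamically parameterized "MLP" forward pass

N, d = 8, 16
X = torch.randn(N, d)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
assert torch.allclose(attention(X, Wq, Wk, Wv), as_dynamic_mlp(X, Wq, Wk, Wv))
```

Note that this identity alone is still quadratic, since it materializes the N × N matrix; the paper's contribution is to replace that matrix with parameter generators that never build it.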

What carries the argument

Dynamic parameter prediction inserted into standard network layers, serving as a compressed representation of global context that replaces explicit attention weights.
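
A hedged sketch of that mechanism (mean pooling stands in for the paper's spatial-compression step, which is an assumption on our part): a fixed-size summary of the whole sequence predicts the weights of an ordinary MLP, so global context reaches every token at linear cost.

```python
# Sketch of a dynamic-parameter layer: global context is compressed into a
# fixed-size vector, which predicts the MLP weights applied to every token.
import torch
import torch.nn as nn

class DynamicMLP(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.dim, self.hidden = dim, hidden
        self.predict_w1 = nn.Linear(dim, dim * hidden)   # weight predictor
        self.predict_w2 = nn.Linear(dim, hidden * dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (N, dim)
        ctx = x.mean(dim=0)                  # O(N) fixed-size global summary
        w1 = self.predict_w1(ctx).view(self.dim, self.hidden)
        w2 = self.predict_w2(ctx).view(self.hidden, self.dim)
        return torch.relu(x @ w1) @ w2       # ordinary MLP, dynamic weights

layer = DynamicMLP(dim=64, hidden=128)
out = layer(torch.randn(196, 64))            # e.g., 14 x 14 patch tokens
```

Because the summary has fixed size, the weight-prediction cost is independent of N, and the token-wise MLP pass is O(N); no N × N intermediate appears anywhere.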

If this is right

  • Vision transformers and other sequence models can be redesigned around dynamic layers to handle larger images or longer videos without quadratic slowdown (a back-of-envelope cost comparison follows this list).
  • Global context modeling becomes possible in any architecture that already contains MLPs, simply by adding a lightweight parameter predictor.
  • Training and inference cost for high-resolution visual tasks decreases while retaining the ability to integrate information across the entire input.
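
A back-of-envelope comparison of token-mixing cost under assumed ViT-Small-like widths (illustrative arithmetic, not measurements from the paper):

```python
# Token-mixing FLOPs as sequence length N grows at fixed width: explicit
# attention scales as O(N^2 d), a dynamic-parameter MLP as O(N d h).
d, hidden = 384, 1536                 # assumed widths, roughly ViT-Small
for N in (196, 784, 3136):            # 224^2, 448^2, 896^2 inputs, patch 16
    attn = 2 * N * N * d              # QK^T plus AV
    dyn = 2 * N * d * hidden          # MLP with predicted weights
    print(f"N={N:5d}  attention={attn:.2e}  dynamic={dyn:.2e}  ratio={attn / dyn:.2f}")
```

Under these assumptions the dynamic layer only wins once N exceeds the hidden width (here 1536), which is exactly the high-resolution regime the first bullet points at; for short sequences, attention's mixing step is actually cheaper.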

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same dynamic-parameter substitution could be applied to language or multimodal sequences, potentially yielding linear-time alternatives outside vision.
  • Because the mechanism is expressed as ordinary layer weights rather than attention matrices, it may combine more easily with existing convolution or state-space models.
  • Further work could test whether the predicted parameters themselves become interpretable representations of global scene structure.

Load-bearing premise

The tested ways of predicting layer parameters from the input are enough to encode the same global information that explicit attention extracts, without any hidden quadratic overhead or overfitting to particular tasks.

What would settle it

A controlled experiment on a long-range dependency benchmark where the dynamic-parameter model is run at the same linear cost yet shows a clear drop in accuracy relative to an attention baseline of equal parameter count.
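
As a sketch, that settling experiment could be organized like this (a hypothetical harness; the benchmark, models, and training routine are placeholders, not anything from the paper):

```python
# Hypothetical harness for the settling experiment: matched parameter count,
# same training budget, accuracy gap as the outcome measure.
def count_params(model):
    return sum(p.numel() for p in model.parameters())

def settle(dynamic_model, attention_model, train, evaluate, tol=0.01):
    base = count_params(attention_model)
    # Matched capacity is the controlled variable.
    assert abs(count_params(dynamic_model) - base) <= tol * base
    train(dynamic_model)
    train(attention_model)
    # A clear positive gap (attention ahead) at the same linear-cost budget
    # would count against the paper's claim; parity or better would support it.
    return evaluate(attention_model) - evaluate(dynamic_model)
```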

Figures

Figures reproduced from arXiv: 2605.01711 by Dongchen Han, Gao Huang, Ruize He.

Figure 1
Figure 1. ImageNet accuracy of WeightFormer and baselines. ◦ marks models with O(N²) complexity, while • marks models with O(N). Each bubble's area is proportional to FLOPs. view at source ↗
Figure 2
Figure 2. (a) Explicit weighted aggregation, where … view at source ↗
Figure 3
Figure 3. Comparison of ERF [26] between DeiT, static and dynamic weight prediction strategies. Pixels with higher intensity indicate larger responses related to the central pixel. view at source ↗
Figure 4
Figure 4. Illustration of the WeightFormer architecture. LayerNorm is omitted for simplicity. view at source ↗
Figure 5
Figure 5. Comparisons between DeiT and WeightFormer in throughput on an RTX 3090, and per-image GPU memory usage. view at source ↗
Figure 6
Figure 6. Dynamic weight strength across depth. view at source ↗
read the original abstract

Existing research largely attributes the global sequence modeling capability of Transformers to the explicit computation of attention weights, a process that inherently incurs quadratic computational complexity. In this work, we offer a novel perspective: we demonstrate that attention can be mathematically reframed as a Multi-Layer Perceptron (MLP) equipped with dynamically predicted parameters. Through this lens, we explain attention's global modeling power not as explicit token-wise aggregation, but as an implicit process where dynamically generated parameters act as a compressed representation of the global context. Inspired by this insight, we investigate a fundamental question: can we achieve Transformer-level sequence global modeling entirely through dynamic parameterization while maintaining linear complexity, effectively replacing explicit attention? To explore this, we design various dynamic parameter prediction strategies and integrate them into standard network layers. Extensive empirical studies on vision models demonstrate that dynamic parameterization can indeed serve as a highly effective, linear-complexity alternative to explicit attention, opening new pathways for efficient sequence modeling. Code is available at https://github.com/LeapLabTHU/WeightFormer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that attention can be reframed as an MLP whose weights are dynamically predicted from the input sequence to encode global context implicitly, and that various dynamic parameter prediction strategies can therefore replace explicit attention for global visual modeling while preserving linear complexity; this is supported by empirical studies on vision models and released code.

Significance. If the empirical results hold under scrutiny, the work offers a concrete pathway toward linear-complexity global modeling that avoids the quadratic cost of explicit attention, with potential impact on efficient vision transformers and sequence models. The release of code strengthens reproducibility and allows direct verification of the claimed strategies.

major comments (2)
  1. [§3] §3 (Dynamic Parameter Prediction Strategies): the central equivalence claim requires that the parameter generator itself be strictly linear in sequence length and produce token-specific, context-dependent weights equivalent in mixing power to softmax(QK^T). The description does not include a formal complexity breakdown or a proof sketch showing that the chosen generators (hypernetwork, linear projection, or recurrent update) avoid implicit quadratic terms or collapse to per-token/global-average modulation; without this, the linear-complexity replacement claim remains unverified.
  2. [§4] §4 (Experiments) and associated tables: the reported vision-model results lack ablations that isolate whether global context is actually being mixed (e.g., a control that disables cross-token information in the parameter predictor while keeping the same architecture). Without such controls or quantitative metrics (accuracy deltas, complexity measurements with error bars), it is impossible to confirm that the observed performance stems from true long-range token interactions rather than local modulation.
minor comments (2)
  1. [Abstract] The abstract states 'extensive empirical studies' but supplies no numbers; moving at least one key table or accuracy comparison into the abstract would improve immediate readability.
  2. [§3.1] Notation for the dynamically generated parameters (e.g., W_dyn(t)) is introduced without an explicit equation linking it to the MLP forward pass; adding a compact equation in §3.1 would clarify the reframing (one possible form is sketched just below).
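
For illustration only, the compact equation the referee asks for might take this form in standard single-head notation (the paper's own symbols may differ):

```latex
% Attention as an MLP forward pass with dynamically predicted weights:
% the N x N mixing matrix is a function of the input X, not a fixed parameter.
\[
\mathrm{Attn}(X)
= \underbrace{\mathrm{softmax}\!\left(\tfrac{X W_Q (X W_K)^{\top}}{\sqrt{d}}\right)}_{W_{\mathrm{dyn}}(X)} \, X W_V
= W_{\mathrm{dyn}}(X)\,\bigl(X W_V\bigr).
\]
```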

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, providing clarifications and committing to revisions that strengthen the formal analysis and experimental validation.

read point-by-point responses
  1. Referee: [§3] §3 (Dynamic Parameter Prediction Strategies): the central equivalence claim requires that the parameter generator itself be strictly linear in sequence length and produce token-specific, context-dependent weights equivalent in mixing power to softmax(QK^T). The description does not include a formal complexity breakdown or a proof sketch showing that the chosen generators (hypernetwork, linear projection, or recurrent update) avoid implicit quadratic terms or collapse to per-token/global-average modulation; without this, the linear-complexity replacement claim remains unverified.

    Authors: We appreciate the referee's emphasis on rigor here. We will revise §3 to include a dedicated complexity analysis subsection. For the linear projection generator, parameter prediction is strictly O(N) in sequence length N (independent per-token linear transforms with no cross-token operations in the generator itself). The hypernetwork uses a fixed-size global context vector aggregated via a linear layer, avoiding quadratic terms. The recurrent update processes tokens sequentially with constant state size. We will add a proof sketch showing these designs preclude implicit quadratic costs and produce token-specific weights that encode global context through the dynamic prediction process, distinguishing them from per-token or global-average modulation. The mathematical reframing of attention as a dynamic-parameter MLP underpins the equivalence claim, with empirical results confirming comparable mixing power. (A schematic rendering of these three generators appears after these responses.) revision: yes

  2. Referee: [§4] §4 (Experiments) and associated tables: the reported vision-model results lack ablations that isolate whether global context is actually being mixed (e.g., a control that disables cross-token information in the parameter predictor while keeping the same architecture). Without such controls or quantitative metrics (accuracy deltas, complexity measurements with error bars), it is impossible to confirm that the observed performance stems from true long-range token interactions rather than local modulation.

    Authors: We agree that targeted controls are needed to isolate global mixing. In the revision, we will add ablations in §4 where cross-token information is disabled in the parameter predictor (e.g., by restricting it to independent per-token features with no global aggregation). We will report accuracy deltas relative to the full models, along with complexity measurements (FLOPs and runtime) averaged over multiple runs with error bars. These results will quantify the contribution of long-range interactions versus local modulation, directly addressing the concern. revision: yes
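
The rebuttal's three generators, read schematically (assumptions drawn from the response text above, not from the released repository; the hypernetwork's linear aggregation is simplified to mean pooling here), along with the cross-token ablation the referee asks for, expressed as a flag:

```python
# Schematic reading of the rebuttal's three O(N) parameter generators; none
# of them materializes an N x N matrix. `use_global=False` is the referee's
# control: the predictor sees each token in isolation.
import torch
import torch.nn as nn

class HyperGenerator(nn.Module):
    """Predicts weights from a fixed-size context vector (hypernetwork path)."""
    def __init__(self, dim, hidden, use_global=True):
        super().__init__()
        self.use_global = use_global
        self.hidden = hidden
        self.to_weights = nn.Linear(dim, dim * hidden)

    def forward(self, x):                        # x: (N, dim)
        if self.use_global:
            ctx = x.mean(dim=0, keepdim=True)    # cross-token info, fixed size
        else:
            ctx = x                               # ablation: per-token only
        w = self.to_weights(ctx)                  # O(N) at most
        return w.view(-1, x.shape[-1], self.hidden)

def linear_projection_generator(x, proj):
    # Strictly per-token linear transforms: O(N), with no cross-token
    # operations inside the generator itself.
    return proj(x)

class RecurrentGenerator(nn.Module):
    """Sequential update with constant state size: one O(N) pass over tokens."""
    def __init__(self, dim, state_dim=64):
        super().__init__()
        self.cell = nn.GRUCell(dim, state_dim)

    def forward(self, x):                         # x: (N, dim)
        state = torch.zeros(1, self.cell.hidden_size)
        for token in x:
            state = self.cell(token.unsqueeze(0), state)
        return state                               # summary that parameterizes the MLP
```

The accuracy delta between `use_global=True` and `use_global=False` is exactly the quantity the referee's second comment wants reported with error bars.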

Circularity Check

0 steps flagged

No significant circularity; reframing is conceptual and claims rest on empirical results.

full rationale

The paper presents a mathematical reframing of attention as an MLP with dynamically predicted parameters as a perspective, then tests dynamic parameterization strategies empirically in vision models for linear-complexity global modeling. No load-bearing step reduces by construction to a fitted input, self-definition, or self-citation chain; the equivalence claim is not derived tautologically but validated through experiments. This matches the default expectation for non-circular papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on (1) the mathematical equivalence between attention and an MLP whose parameters are functions of the input, and (2) the empirical sufficiency of the chosen dynamic-prediction modules. No new physical entities are introduced.

axioms (1)
  • domain assumption: Attention weights can be exactly re-expressed as the parameters of an input-dependent MLP.
    Invoked in the opening paragraph of the abstract as the basis for the entire alternative approach.

pith-pipeline@v0.9.0 · 5470 in / 1159 out tokens · 56210 ms · 2026-05-10T15:49:33.710346+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

46 extracted references · 10 canonical work pages · 7 internal anchors

  1. [1]

    Longformer: The Long-Document Transformer

    Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020

  2. [2]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  3. [3]

    Cascade R-CNN: High quality object detection and instance segmentation

    Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: High quality object detection and instance segmentation. IEEE transactions on pattern analysis and machine intelligence, 43(5):1483–1498, 2019

  4. [4]

    Encoder-decoder with atrous separable convolution for semantic image segmentation

    Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV), pages 801–818, 2018

  5. [5]

    Dynamic convolution: Attention over convolution kernels

    Yinpeng Chen, Xiyang Dai, Mengchen Liu, Dongdong Chen, Lu Yuan, and Zicheng Liu. Dynamic convolution: Attention over convolution kernels. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11030–11039, 2020

  6. [6]

    Generating Long Sequences with Sparse Transformers

    Rewon Child. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019

  7. [7]

    Rethinking Attention with Performers

    Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. arXiv preprint arXiv:2009.14794, 2020

  8. [8]

    Randaugment: Practical automated data augmentation with a reduced search space

    Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 702–703, 2020

  9. [9]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009

  10. [10]

    Bert: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019

  11. [11]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020

  12. [12]

    HyperNetworks

    David Ha, Andrew Dai, and Quoc V Le. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016

  13. [13]

    Vision transformers are circulant attention learners

    Dongchen Han, Tianyu Li, Ziyi Wang, and Gao Huang. Vision transformers are circulant attention learners. arXiv preprint arXiv:2512.21542, 2025

  14. [14]

    Agent attention: On the integration of softmax and linear attention

    Dongchen Han, Tianzhu Ye, Yizeng Han, Zhuofan Xia, Siyuan Pan, Pengfei Wan, Shiji Song, and Gao Huang. Agent attention: On the integration of softmax and linear attention. In European Conference on Computer Vision, pages 124–140, 2024

  15. [15]

    On the connection between local attention and dynamic depth-wise convolution

    Qi Han, Zejia Fan, Qi Dai, Lei Sun, Ming-Ming Cheng, Jiaying Liu, and Jingdong Wang. On the connection between local attention and dynamic depth-wise convolution. arXiv preprint arXiv:2106.04263, 2021

  16. [16]

    Dynamic neural networks: A survey

    Yizeng Han, Gao Huang, Shiji Song, Le Yang, Honghui Wang, and Yulin Wang. Dynamic neural networks: A survey. IEEE Transactions on Pattern Analysis & Machine Intelligence, 44(11):7436–7456, 2022

  17. [17]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

  18. [18]

    Dynamic filter networks

    Xu Jia, Bert De Brabandere, Tinne Tuytelaars, and Luc V Gool. Dynamic filter networks. Advances in neural information processing systems, 29, 2016

  19. [19]

    Transformers are rnns: Fast autoregressive transformers with linear attention

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pages 5156–5165. PMLR, 2020

  20. [20]

    Involution: Inverting the inherence of convolution for visual recognition

    Duo Li, Jie Hu, Changhu Wang, Xiangtai Li, Qi She, Lei Zhu, Tong Zhang, and Qifeng Chen. Involution: Inverting the inherence of convolution for visual recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12321–12330, 2021

  21. [21]

    Exploring plain vision transformer backbones for object detection

    Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. Exploring plain vision transformer backbones for object detection. In European conference on computer vision, pages 280–296. Springer, 2022

  22. [22]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014

  23. [23]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021

  24. [24]

    A convnet for the 2020s

    Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11976–11986, 2022

  25. [25]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

  26. [26]

    Understanding the effective receptive field in deep convolutional neural networks

    Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard Zemel. Understanding the effective receptive field in deep convolutional neural networks. Advances in neural information processing systems, 29, 2016

  27. [27]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  28. [28]

    Film: Visual reasoning with a general conditioning layer

    Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI conference on artificial intelligence, 2018

  29. [29]

    Weight normalization: A simple reparameterization to accelerate training of deep neural networks

    Tim Salimans and Durk P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. Advances in neural information processing systems, 29, 2016

  30. [30]

    Training data-efficient image transformers & distillation through attention

    Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International conference on machine learning, pages 10347–10357. PMLR, 2021

  31. [31]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  32. [32]

    Linformer: Self-Attention with Linear Complexity

    Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020

  33. [33]

    Pyramid vision transformer: A versatile backbone for dense prediction without convolutions

    Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF international conference on computer vision, pages 568–578, 2021

  34. [34]

    Pay Less Attention with Lightweight and Dynamic Convolutions

    Felix Wu, Angela Fan, Alexei Baevski, Yann N Dauphin, and Michael Auli. Pay less attention with lightweight and dynamic convolutions. arXiv preprint arXiv:1901.10430, 2019

  35. [35]

    Unified perceptual parsing for scene understanding

    Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In Proceedings of the European conference on computer vision (ECCV), pages 418–434, 2018

  36. [36]

    Nyströmformer: A nyström-based algorithm for approximating self-attention

    Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, and Vikas Singh. Nyströmformer: A nyström-based algorithm for approximating self-attention. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 14138–14148, 2021

  37. [37]

    CondConv: Conditionally parameterized convolutions for efficient inference

    Brandon Yang, Gabriel Bender, Quoc V Le, and Jiquan Ngiam. CondConv: Conditionally parameterized convolutions for efficient inference. Advances in neural information processing systems, 32, 2019

  38. [38]

    Native sparse attention: Hardware-aligned and natively trainable sparse attention

    Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Yuxing Wei, Lean Wang, Zhiping Xiao, et al. Native sparse attention: Hardware-aligned and natively trainable sparse attention. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23078–23097, 2025

  39. [39]

    Cutmix: Regularization strategy to train strong classifiers with localizable features

    Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6023–6032, 2019

  40. [40]

    Big bird: Transformers for longer sequences

    Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big bird: Transformers for longer sequences. Advances in neural information processing systems, 33:17283–17297, 2020

  41. [41]

    mixup: Beyond empirical risk minimization

    Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018

  42. [42]

    Random erasing data augmentation

    Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 13001–13008, 2020

  43. [43]

    Semantic understanding of scenes through the ADE20K dataset

    Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ADE20K dataset. International journal of computer vision, 127(3):302–321, 2019

  44. [44]

    Interpret vision transformers as convnets with dynamic convolutions

    Chong Zhou, Chen Change Loy, and Bo Dai. Interpret vision transformers as convnets with dynamic convolutions. CoRR, 2023

  45. [45]

    Dig: Scalable and efficient diffusion models with gated linear attention

    Lianghui Zhu, Zilong Huang, Bencheng Liao, Jun Hao Liew, Hanshu Yan, Jiashi Feng, and Xinggang Wang. Dig: Scalable and efficient diffusion models with gated linear attention. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 7664–7674, 2025

  46. [46]

    Vision mamba: efficient visual representation learning with bidirectional state space model

    Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, and Xinggang Wang. Vision mamba: efficient visual representation learning with bidirectional state space model. In Proceedings of the 41st International Conference on Machine Learning, pages 62429–62442, 2024