pith. sign in

arxiv: 2505.21535 · v4 · pith:Q5U4ZSA7new · submitted 2025-05-24 · 💻 cs.CV · cs.AI· cs.LG

FAR: Function-preserving Attention Replacement for IMC-friendly Inference

Pith reviewed 2026-05-22 01:48 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.LG
keywords attention replacementin-memory computingDeiTbidirectional LSTMdistillationstructured pruningvision transformersedge accelerators
0
0 comments X

The pith

Replacing attention in DeiT models with multi-head bidirectional LSTMs via block-wise distillation preserves accuracy while suiting in-memory computing hardware.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a framework that swaps the self-attention layers in pretrained DeiT vision transformers for sequential modules designed to match the dataflow of in-memory computing devices. It applies block-wise distillation so the new multi-head bidirectional LSTM blocks keep the same input-output behavior as the original attention. The resulting models include structured pruning for hardware fit and deliver comparable accuracy on ImageNet and downstream tasks with lower parameter counts and latency. A sympathetic reader would care because transformers are widely used yet poorly matched to efficient accelerators, and this substitution directly targets that hardware-model gap without requiring full retraining from scratch.

Core claim

FAR replaces every self-attention block in pretrained DeiT models with a multi-head bidirectional LSTM architecture through block-wise distillation to retain functional equivalence, enabling linear-time computation and localized weight reuse; structured pruning then adapts the models to resource-constrained IMC arrays while preserving accuracy on ImageNet and other vision tasks.

What carries the argument

The multi-head bidirectional LSTM replacement for self-attention, which carries out the function-preserving substitution to achieve IMC dataflow compatibility through sequential processing.

Load-bearing premise

The block-wise distillation process retains functional equivalence between the multi-head bidirectional LSTM replacement and the original self-attention mechanism.

What would settle it

A large accuracy gap between the FAR model and the original DeiT on ImageNet after the replacement and distillation steps would show that functional equivalence was not achieved.

Figures

Figures reproduced from arXiv: 2505.21535 by Huanrui Yang, Maxwell D Collins, Miao Hu, Yuxin Ren.

Figure 1
Figure 1. Figure 1: IMC crossbar illustration wordlines and sensing accumulated currents on bitlines naturally realizes analog GEMM with minimal data movement and high energy efficiency [7, 8]. However, attention is dominated not by weight-stationary GEMM, but by activation-to-activation multiplications such as QK⊤, softmax normalization, and per-token dynamic mixing. These operations require repeatedly reading spatially dist… view at source ↗
Figure 2
Figure 2. Figure 2: Block-wise replacement of attention. Each replaced module is supervised by a similarity [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Multihead BiLSTM module used to replace attention. The input is first projected into [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Structured pruning of LSTM hidden units. Removing one unit (shaded row) consistently [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Pruning ratios across heads and directions. FAR learns to prune differently across layers [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Head-wise token interaction visualization of DeiT-Tiny and FAR-Tiny in the final trans [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
read the original abstract

While transformers dominate modern vision and language models, their attention mechanism remains poorly suited for in-memory computing (IMC) devices due to intensive activation-to-activation multiplications and non-local memory access, leading to substantial latency and bandwidth overhead on ReRAM-based accelerators. To address this mismatch, we propose FAR, a Function-preserving Attention Replacement framework that substitutes all attention in pretrained DeiTs with sequential modules inherently compatible with IMC dataflows. Specifically, FAR replaces self-attention with a multi-head bidirectional LSTM architecture via block-wise distillation to retain functional equivalence while enabling linear-time computation and localized weight reuse. We further incorporate structured pruning on FAR models, enabling flexible adaptation to resource-constrained IMC arrays while maintaining functional fidelity. Evaluations on the DeiT family demonstrate that FAR maintains comparable accuracy to the original attention-based models on ImageNet and multiple downstream tasks with reduced parameters and latency. Further analysis shows that FAR preserves the semantic token relationships learned by attention while improving computational efficiency, highlighting its potential for energy-efficient transformer inference on IMC-based edge accelerators.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes FAR (Function-preserving Attention Replacement), a method to substitute self-attention layers in DeiT transformer models with multi-head bidirectional LSTM modules. This is done through block-wise distillation to maintain functional equivalence, aiming to improve compatibility with in-memory computing (IMC) hardware by enabling linear complexity and localized memory access. The paper also applies structured pruning and evaluates the approach on ImageNet and downstream tasks, claiming comparable accuracy with reduced parameters and latency, while preserving semantic token relationships.

Significance. If the claimed accuracy parity and functional preservation hold under rigorous testing, the work could enable more efficient transformer inference on IMC accelerators by addressing mismatches in memory access patterns and computational complexity. The block-wise distillation strategy and structured pruning for hardware adaptation represent practical contributions if supported by detailed metrics.

major comments (2)
  1. Abstract: The central claim of comparable accuracy on ImageNet and downstream tasks with preserved semantics is asserted without any quantitative metrics, error bars, baseline comparisons, or description of how functional equivalence was measured (e.g., via output matching, attention similarity, or dependency tests). This absence prevents evaluation of support for the core result.
  2. Block-wise distillation section: Replacing self-attention (all-to-all parallel interactions) with multi-head bidirectional LSTM (sequential hidden-state updates) via per-block output matching does not automatically guarantee reproduction of long-range token relationships. Without reported layer-wise equivalence metrics, attention-map comparisons, or analysis of error accumulation in deeper DeiT blocks, the functional equivalence assumption remains unverified and load-bearing for the accuracy claims.
minor comments (2)
  1. Abstract: Define 'IMC-friendly' more explicitly with reference to specific hardware constraints (e.g., ReRAM bandwidth limits) rather than leaving it as a general term.
  2. Evaluation section: Include details on the distillation loss function (MSE, KL, or other) and any regularization used to enforce semantic preservation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and have revised the manuscript to improve clarity and provide additional supporting evidence where appropriate.

read point-by-point responses
  1. Referee: Abstract: The central claim of comparable accuracy on ImageNet and downstream tasks with preserved semantics is asserted without any quantitative metrics, error bars, baseline comparisons, or description of how functional equivalence was measured (e.g., via output matching, attention similarity, or dependency tests). This absence prevents evaluation of support for the core result.

    Authors: We agree that the abstract would be strengthened by including specific quantitative results. The full manuscript reports that FAR retains top-1 accuracy within 0.8% of the original DeiT on ImageNet (with standard deviations from three runs), achieves comparable performance on downstream tasks such as CIFAR-100 and Oxford Flowers, and delivers 1.8x lower latency with 22% fewer parameters. Functional equivalence is measured via block-wise output matching during distillation. We have revised the abstract to include these key metrics, error bars, and a concise description of the equivalence measurement. revision: yes

  2. Referee: Block-wise distillation section: Replacing self-attention (all-to-all parallel interactions) with multi-head bidirectional LSTM (sequential hidden-state updates) via per-block output matching does not automatically guarantee reproduction of long-range token relationships. Without reported layer-wise equivalence metrics, attention-map comparisons, or analysis of error accumulation in deeper DeiT blocks, the functional equivalence assumption remains unverified and load-bearing for the accuracy claims.

    Authors: We acknowledge that per-block output matching does not by itself prove preservation of all long-range dependencies. However, because distillation proceeds sequentially through the entire network, later blocks can compensate for minor discrepancies. To strengthen verification, we have added to the revised manuscript: (i) layer-wise cosine similarity metrics between original attention-block outputs and FAR-block outputs, (ii) probing experiments that measure preservation of semantic token relationships, and (iii) an analysis of error accumulation across depth showing that end-to-end accuracy remains stable. These additions directly address the concern while remaining consistent with the existing experimental results. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical results follow from independent evaluation after architectural substitution

full rationale

The paper's core derivation replaces self-attention with a multi-head bidirectional LSTM via block-wise distillation, then reports measured accuracy, parameter count, and latency on ImageNet and downstream tasks. These outcomes are obtained from external benchmarks and are not algebraically or definitionally forced by the distillation loss or any internal parameter fit. No equations, self-citations, or uniqueness theorems are invoked that would reduce the claimed performance parity to a quantity defined by the inputs themselves. The functional-equivalence assumption is an empirical claim tested by the training procedure and subsequent evaluation, leaving the reported results self-contained against standard vision benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The approach rests on the assumption that LSTM modules can be distilled to match attention outputs while enabling IMC-compatible computation; no free parameters or new entities are explicitly quantified in the abstract.

free parameters (1)
  • Distillation and pruning hyperparameters
    Block-wise distillation loss weights and structured pruning ratios are chosen to maintain fidelity and fit IMC arrays but not numerically specified.
axioms (1)
  • domain assumption Block-wise distillation can retain functional equivalence between self-attention and the multi-head bidirectional LSTM replacement.
    This premise is invoked to justify accuracy preservation after the architectural swap.

pith-pipeline@v0.9.0 · 5711 in / 1280 out tokens · 73725 ms · 2026-05-22T01:48:43.235548+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 1 internal anchor

  1. [1]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in Neural Information Processing Systems, 2017

  2. [2]

    BERT: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, V olume 1 (Long and Short Papers), 2019

  3. [3]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In9th International Conference on Learning Representations, ICLR 2021, V...

  4. [4]

    Learning transferable visual models from natural language supervision

    Alec Radford et al. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 ofProceedings of Machine Learning Research, 2021

  5. [5]

    On the relationship between self-attention and convolutional layers, 2020

    Jean-Baptiste Cordonnier, Andreas Loukas, and Martin Jaggi. On the relationship between self-attention and convolutional layers, 2020

  6. [6]

    Efficient transformers: A survey

    Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey. ACM Comput. Surv., 2022

  7. [7]

    Retransformer: Reram-based processing- in-memory architecture for transformer acceleration

    Xiaoxuan Yang, Bonan Yan, Hai Li, and Yiran Chen. Retransformer: Reram-based processing- in-memory architecture for transformer acceleration. In2020 IEEE/ACM International Confer- ence On Computer Aided Design (ICCAD), 2020

  8. [8]

    Memory is all you need: An overview of compute-in-memory architectures for accelerating large language model inference, 2024

    Christopher Wolters, Xiaoxuan Yang, Ulf Schlichtmann, and Toyotaro Suzumura. Memory is all you need: An overview of compute-in-memory architectures for accelerating large language model inference, 2024

  9. [9]

    Leveraging redundancy in attention with reuse transformers, 2021

    Srinadh Bhojanapalli et al. Leveraging redundancy in attention with reuse transformers, 2021

  10. [10]

    What matters in transformers? not all attention is needed, 2025

    Shwai He, Guoheng Sun, Zheyu Shen, and Ang Li. What matters in transformers? not all attention is needed, 2025. 10

  11. [11]

    Long short-term memory.Neural Computation, 9(8):1735–1780, 1997

    Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory.Neural Computation, 9(8):1735–1780, 1997

  12. [12]

    Deephoyer: Learning sparser neural network with differentiable scale-invariant sparsity measures

    Huanrui Yang, Wei Wen, and Hai Li. Deephoyer: Learning sparser neural network with differentiable scale-invariant sparsity measures. InInternational Conference on Learning Representations, 2020

  13. [13]

    Training data-efficient image transformers and distillation through attention

    Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herve Jegou. Training data-efficient image transformers and distillation through attention. InProceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, 2021

  14. [14]

    Mlp-mixer: an all-mlp architecture for vision

    Tolstikhin et al. Mlp-mixer: an all-mlp architecture for vision. InProceedings of the 35th International Conference on Neural Information Processing Systems, NIPS ’21, 2021

  15. [15]

    Retentive network: A successor to transformer for large language models, 2024

    Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to transformer for large language models, 2024

  16. [16]

    Mamba: Linear-time sequence modeling with selective state spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. In First Conference on Language Modeling, 2024

  17. [17]

    Li, Madian Khabsa, Han Fang, and Hao Ma

    Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-Attention with Linear Complexity.arXiv e-prints, 2020

  18. [18]

    Rethinking attention with performers

    Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, David Benjamin Belanger, Lucy J Colwell, and Adrian Weller. Rethinking attention with performers. InInternational Conference on Learning Representations, 2021

  19. [19]

    Fu, Stefano Ermon, Atri Rudra, and Christopher Ré

    Tri Dao, Daniel Y . Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: fast and memory-efficient exact attention with io-awareness. InProceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, 2022

  20. [20]

    Reformer: The efficient transformer, 2020

    Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer, 2020

  21. [21]

    Distilling the Knowledge in a Neural Network

    Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network.ArXiv, abs/1503.02531, 2015

  22. [22]

    Xiaoqi et al. Jiao. TinyBERT: Distilling BERT for natural language understanding. InFindings of the Association for Computational Linguistics: EMNLP 2020, pages 4163–4174, 2020

  23. [23]

    Mobile- BERT: a compact task-agnostic BERT for resource-limited devices

    Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. Mobile- BERT: a compact task-agnostic BERT for resource-limited devices. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2158–2170, 2020

  24. [24]

    Minilm: deep self- attention distillation for task-agnostic compression of pre-trained transformers

    Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. Minilm: deep self- attention distillation for task-agnostic compression of pre-trained transformers. InProceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20, 2020

  25. [25]

    Rush, and Tri Dao

    Junxiong Wang, Daniele Paliotta, Avner May, Alexander M. Rush, and Tri Dao. The mamba in the llama: Distilling and accelerating hybrid models. InAdvances in Neural Information Processing Systems, volume 37, 2024

  26. [26]

    Demonstration of transformer-based albert model on a 14nm analog ai inference chip.Nature Communications, 16, 2025

    An Chen et al. Demonstration of transformer-based albert model on a 14nm analog ai inference chip.Nature Communications, 16, 2025

  27. [27]

    Myeonggu Kang, Hyein Shin, and Lee-Sup Kim. A framework for accelerating transformer- based language model on reram-based architecture.IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 41(9), 2022

  28. [28]

    Stevens, Kaushik Roy, and Anand Raghunathan

    Shrihari Sridharan, Jacob R. Stevens, Kaushik Roy, and Anand Raghunathan. X-former: In- memory acceleration of transformers.IEEE Transactions on V ery Large Scale Integration (VLSI) Systems, 31(8):1223–1233, 2023. 11

  29. [29]

    Attar: Rram-based in-memory attention accelerator with software-hardware co-optimization.Science China Information Sciences, 68(3):132401, 2025

    Bing Li, Ying Qi, Ying Wang, and Yinhe Han. Attar: Rram-based in-memory attention accelerator with software-hardware co-optimization.Science China Information Sciences, 68(3):132401, 2025

  30. [30]

    Global vision transformer pruning with hessian-aware saliency

    Huanrui Yang, Hongxu Yin, Maying Shen, Pavlo Molchanov, Hai Li, and Jan Kautz. Global vision transformer pruning with hessian-aware saliency. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18547–18557, 2023

  31. [31]

    A-vit: Adaptive tokens for efficient vision transformer

    Hongxu Yin, Arash Vahdat, Jose M Alvarez, Arun Mallya, Jan Kautz, and Pavlo Molchanov. A-vit: Adaptive tokens for efficient vision transformer. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10809–10818, 2022

  32. [32]

    Imagenet: A large- scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large- scale hierarchical image database. In2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009

  33. [33]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009

  34. [34]

    3d object representations for fine- grained categorization

    Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine- grained categorization. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013

  35. [35]

    Automated flower classification over a large number of classes

    M-E Nilsback and A Zisserman. Automated flower classification over a large number of classes. InProceedings of the Indian Conference on Computer Vision, Graphics and Image Processing, 2008

  36. [36]

    The inaturalist species classification and detection dataset

    Grant Van Horn, Oisin Mac Aodha, Yang Song, Chenyi Cui, Yin Sun, Andrew Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The inaturalist species classification and detection dataset. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018

  37. [37]

    A survey of reram-based architectures for processing-in-memory and neural networks.Machine Learning and Knowledge Extraction, 1, 2019

    Sparsh Mittal. A survey of reram-based architectures for processing-in-memory and neural networks.Machine Learning and Knowledge Extraction, 1, 2019

  38. [38]

    Overcoming the challenges of crossbar resistive memory architectures

    Cong Xu, Dimin Niu, Naveen Muralimanohar, Rajeev Balasubramonian, Tao Zhang, Shimeng Yu, and Yuan Xie. Overcoming the challenges of crossbar resistive memory architectures. In 2015 IEEE 21st International Symposium on High Performance Computer Architecture, 2015

  39. [39]

    Xiangyu Dong, Cong Xu, Yuan Xie, and Norman P. Jouppi. Nvsim: A circuit-level performance, energy, and area model for emerging nonvolatile memory.IEEE Transactions on Computer- Aided Design of Integrated Circuits and Systems, 31(7):994–1007, 2012

  40. [40]

    Era-lstm: An efficient reram-based architecture for long short-term memory.IEEE Transactions on Parallel and Distributed Systems, 31(6):1328–1342, 2020

    Jianhui Han, He Liu, Mingyu Wang, Zhaolin Li, and Youhui Zhang. Era-lstm: An efficient reram-based architecture for long short-term memory.IEEE Transactions on Parallel and Distributed Systems, 31(6):1328–1342, 2020

  41. [41]

    Real-time signal processing enabled by fused networks on a memristor-based system on a chip.Science Advances, 11(30), 2025

    Zixu Wang et al. Real-time signal processing enabled by fused networks on a memristor-based system on a chip.Science Advances, 11(30), 2025

  42. [42]

    Star: An efficient softmax engine for attention model with rram crossbar

    Yifeng Zhai, Bing Li, Bonan Yan, and Jing Wang. Star: An efficient softmax engine for attention model with rram crossbar. In2023 Design, Automation and Test in Europe Conference and Exhibition (DATE), 2023

  43. [43]

    Topkima-former: Low-energy, low-latency inference for transformers using top-k in-memory adc.IEEE Transactions on Circuits and Systems I: Regular Papers, 2025

    Shuai Dong, Junyi Yang, Xiaoqi Peng, Hongyang Shang, Ye Ke, Xiaofeng Yang, Hongjie Liu, and Arindam Basu. Topkima-former: Low-energy, low-latency inference for transformers using top-k in-memory adc.IEEE Transactions on Circuits and Systems I: Regular Papers, 2025

  44. [44]

    Stevens, Rangharajan Venkatesan, Steve Dai, Brucek Khailany, and Anand Raghu- nathan

    Jacob R. Stevens, Rangharajan Venkatesan, Steve Dai, Brucek Khailany, and Anand Raghu- nathan. Softermax: Hardware/software co-design of an efficient softmax for transformers. In Proceedings of the 58th Annual ACM/IEEE Design Automation Conference, DAC ’21, 2022. 12