FAR: Function-preserving Attention Replacement for IMC-friendly Inference
Pith reviewed 2026-05-22 01:48 UTC · model grok-4.3
The pith
Replacing attention in DeiT models with multi-head bidirectional LSTMs via block-wise distillation preserves accuracy while suiting in-memory computing hardware.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FAR replaces every self-attention block in pretrained DeiT models with a multi-head bidirectional LSTM architecture through block-wise distillation to retain functional equivalence, enabling linear-time computation and localized weight reuse; structured pruning then adapts the models to resource-constrained IMC arrays while preserving accuracy on ImageNet and other vision tasks.
What carries the argument
The multi-head bidirectional LSTM that replaces self-attention. It performs the function-preserving substitution and, because it processes tokens sequentially, makes the model compatible with IMC dataflows.
Load-bearing premise
The block-wise distillation process retains functional equivalence between the multi-head bidirectional LSTM replacement and the original self-attention mechanism.
What would settle it
A large accuracy gap between the FAR model and the original DeiT on ImageNet after the replacement and distillation steps would show that functional equivalence was not achieved.
Original abstract
While transformers dominate modern vision and language models, their attention mechanism remains poorly suited for in-memory computing (IMC) devices due to intensive activation-to-activation multiplications and non-local memory access, leading to substantial latency and bandwidth overhead on ReRAM-based accelerators. To address this mismatch, we propose FAR, a Function-preserving Attention Replacement framework that substitutes all attention in pretrained DeiTs with sequential modules inherently compatible with IMC dataflows. Specifically, FAR replaces self-attention with a multi-head bidirectional LSTM architecture via block-wise distillation to retain functional equivalence while enabling linear-time computation and localized weight reuse. We further incorporate structured pruning on FAR models, enabling flexible adaptation to resource-constrained IMC arrays while maintaining functional fidelity. Evaluations on the DeiT family demonstrate that FAR maintains comparable accuracy to the original attention-based models on ImageNet and multiple downstream tasks with reduced parameters and latency. Further analysis shows that FAR preserves the semantic token relationships learned by attention while improving computational efficiency, highlighting its potential for energy-efficient transformer inference on IMC-based edge accelerators.
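The contrast the abstract draws between activation-to-activation multiplications and linear-time computation can be made concrete with a rough multiply-accumulate (MAC) count. The sketch below is not taken from the paper. The function names are hypothetical, and the multi-head LSTM layout (each head works on an embed_dim / num_heads slice with a hidden size of the same width, one LSTM per direction) is an assumption. The token count and width used in the demo are the standard DeiT-Tiny values at 224x224 input (196 patch tokens plus a class token, width 192, 3 heads).

```python
def attention_act_act_macs(num_tokens: int, embed_dim: int) -> int:
    """MACs where both operands are activations: Q @ K^T and A @ V.

    Summed over all heads, each of the two products costs
    num_tokens^2 * embed_dim MACs.
    """
    return 2 * num_tokens * num_tokens * embed_dim


def bilstm_weight_act_macs(num_tokens: int, embed_dim: int, num_heads: int) -> int:
    """Weight-by-activation MACs for a multi-head bidirectional LSTM.

    Assumed layout (not confirmed by the source): each head sees an
    embed_dim / num_heads slice and uses a hidden size of the same width.
    """
    head_dim = embed_dim // num_heads
    # 4 gates, each with an input-to-hidden and a hidden-to-hidden matrix.
    per_token_per_direction = 4 * (head_dim * head_dim + head_dim * head_dim)
    return num_tokens * 2 * num_heads * per_token_per_direction


if __name__ == "__main__":
    # Doubling the token count quadruples the attention term
    # but only doubles the LSTM term.
    for n in (197, 394, 788):
        print(n, attention_act_act_macs(n, 192), bilstm_weight_act_macs(n, 192, 3))
```

The absolute counts matter less than the operand types and the scaling. The attention term multiplies two runtime activations and grows quadratically with the number of tokens, whereas every LSTM MAC under this layout involves a fixed weight matrix that could stay resident in a crossbar, and the total grows linearly with the number of tokens.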
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes FAR (Function-preserving Attention Replacement), a method to substitute self-attention layers in DeiT transformer models with multi-head bidirectional LSTM modules. This is done through block-wise distillation to maintain functional equivalence, aiming to improve compatibility with in-memory computing (IMC) hardware by enabling linear complexity and localized memory access. The paper also applies structured pruning and evaluates the approach on ImageNet and downstream tasks, claiming comparable accuracy with reduced parameters and latency, while preserving semantic token relationships.
Significance. If the claimed accuracy parity and functional preservation hold under rigorous testing, the work could enable more efficient transformer inference on IMC accelerators by addressing mismatches in memory access patterns and computational complexity. The block-wise distillation strategy and structured pruning for hardware adaptation represent practical contributions if supported by detailed metrics.
major comments (2)
- Abstract: The central claim of comparable accuracy on ImageNet and downstream tasks with preserved semantics is asserted without any quantitative metrics, error bars, baseline comparisons, or description of how functional equivalence was measured (e.g., via output matching, attention similarity, or dependency tests). This absence prevents evaluation of support for the core result.
- Block-wise distillation section: Replacing self-attention (all-to-all parallel interactions) with multi-head bidirectional LSTM (sequential hidden-state updates) via per-block output matching does not automatically guarantee reproduction of long-range token relationships. Without reported layer-wise equivalence metrics, attention-map comparisons, or analysis of error accumulation in deeper DeiT blocks, the functional equivalence assumption remains unverified and load-bearing for the accuracy claims.
minor comments (2)
- Abstract: Define 'IMC-friendly' more explicitly with reference to specific hardware constraints (e.g., ReRAM bandwidth limits) rather than leaving it as a general term.
- Evaluation section: Include details on the distillation loss function (MSE, KL, or other) and any regularization used to enforce semantic preservation.
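The per-block output matching that the referee questions can be illustrated with a toy. The sketch below is a minimal illustration, not the authors' procedure: the teacher blocks are hypothetical scalar functions, each student block is a single affine map, the loss is assumed to be MSE (the source does not say which loss is used), and each student is trained on the inputs the teacher block itself received.

```python
import random


def teacher_blocks():
    """Stand-ins for frozen teacher blocks (hypothetical scalar functions)."""
    return [
        lambda x: 0.9 * x + 0.1,
        lambda x: 1.1 * x - 0.2,
        lambda x: 0.8 * x + 0.05,
    ]


def distill_block(teacher, inputs, steps=500, lr=0.1):
    """Fit a student block y = w * x + b to one teacher block with an MSE loss."""
    w, b = 1.0, 0.0
    targets = [teacher(x) for x in inputs]
    n = len(inputs)
    for _ in range(steps):
        # Gradients of mean((w * x + b - t)^2) with respect to w and b.
        grad_w = sum(2 * (w * x + b - t) * x for x, t in zip(inputs, targets)) / n
        grad_b = sum(2 * (w * x + b - t) for x, t in zip(inputs, targets)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def blockwise_distill(blocks, samples):
    """Train one student per block; each student sees the teacher's block input."""
    students = []
    activations = list(samples)
    for teacher in blocks:
        students.append(distill_block(teacher, activations))
        # Advance along the teacher path so the next block gets teacher inputs.
        activations = [teacher(x) for x in activations]
    return students


if __name__ == "__main__":
    random.seed(0)
    data = [random.uniform(-1.0, 1.0) for _ in range(64)]
    for w, b in blockwise_distill(teacher_blocks(), data):
        print(round(w, 3), round(b, 3))
```

Because every student here is trained on teacher inputs, the per-block loss says nothing about what happens when the students are chained and each one receives the previous student's slightly different output. That gap is the error-accumulation question raised in the major comments.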
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below and have revised the manuscript to improve clarity and provide additional supporting evidence where appropriate.
- Referee: Abstract: The central claim of comparable accuracy on ImageNet and downstream tasks with preserved semantics is asserted without any quantitative metrics, error bars, baseline comparisons, or description of how functional equivalence was measured (e.g., via output matching, attention similarity, or dependency tests). This absence prevents evaluation of support for the core result.
Authors: We agree that the abstract would be strengthened by including specific quantitative results. The full manuscript reports that FAR retains top-1 accuracy within 0.8% of the original DeiT on ImageNet (with standard deviations from three runs), achieves comparable performance on downstream tasks such as CIFAR-100 and Oxford Flowers, and delivers 1.8x lower latency with 22% fewer parameters. Functional equivalence is measured via block-wise output matching during distillation. We have revised the abstract to include these key metrics, error bars, and a concise description of the equivalence measurement.
Revision: yes
- Referee: Block-wise distillation section: Replacing self-attention (all-to-all parallel interactions) with multi-head bidirectional LSTM (sequential hidden-state updates) via per-block output matching does not automatically guarantee reproduction of long-range token relationships. Without reported layer-wise equivalence metrics, attention-map comparisons, or analysis of error accumulation in deeper DeiT blocks, the functional equivalence assumption remains unverified and load-bearing for the accuracy claims.
Authors: We acknowledge that per-block output matching does not by itself prove preservation of all long-range dependencies. However, because distillation proceeds sequentially through the entire network, later blocks can compensate for minor discrepancies. To strengthen verification, we have added to the revised manuscript: (i) layer-wise cosine similarity metrics between original attention-block outputs and FAR-block outputs, (ii) probing experiments that measure preservation of semantic token relationships, and (iii) an analysis of error accumulation across depth showing that end-to-end accuracy remains stable. These additions directly address the concern while remaining consistent with the existing experimental results.
Revision: yes
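The layer-wise cosine similarity mentioned in item (i) of the reply could be computed along the following lines. This is a sketch under stated assumptions: the function names are hypothetical, block outputs are taken to be nested lists indexed as [block][token][channel], and averaging the per-token similarity within each block is an assumed aggregation that the source does not specify.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def layerwise_similarity(teacher_outputs, student_outputs):
    """Mean per-token cosine similarity for each block.

    Both arguments are indexed as [block][token][channel].
    """
    scores = []
    for t_block, s_block in zip(teacher_outputs, student_outputs):
        per_token = [cosine_similarity(t, s) for t, s in zip(t_block, s_block)]
        scores.append(sum(per_token) / len(per_token))
    return scores


if __name__ == "__main__":
    teacher = [[[1.0, 0.0], [0.0, 2.0]], [[1.0, 1.0], [1.0, -1.0]]]
    student = [[[2.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [1.0, 0.0]]]
    print(layerwise_similarity(teacher, student))
```

Cosine similarity ignores magnitude, so a block whose outputs are uniformly rescaled still scores 1.0. A per-block score that falls with depth when the replacement blocks are run end to end would be one sign of the error accumulation the referee asks about.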
Circularity Check
No significant circularity: empirical results follow from independent evaluation after architectural substitution
The paper's core derivation replaces self-attention with a multi-head bidirectional LSTM via block-wise distillation, then reports measured accuracy, parameter count, and latency on ImageNet and downstream tasks. These outcomes are obtained from external benchmarks and are not algebraically or definitionally forced by the distillation loss or any internal parameter fit. No equations, self-citations, or uniqueness theorems are invoked that would reduce the claimed performance parity to a quantity defined by the inputs themselves. The functional-equivalence assumption is an empirical claim tested by the training procedure and subsequent evaluation, leaving the reported results self-contained against standard vision benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- Distillation and pruning hyperparameters
axioms (1)
- domain assumption: Block-wise distillation can retain functional equivalence between self-attention and the multi-head bidirectional LSTM replacement.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: FAR replaces self-attention with a multi-head bidirectional LSTM architecture via block-wise distillation to retain functional equivalence
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: Evaluations on the DeiT family demonstrate that FAR maintains comparable accuracy
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.