Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer

Sergey Zagoruyko, Nikos Komodakis

classification cs.CV

keywords attention, neural, convolutional, network, networks, performance, role, variety

Attention plays a critical role in human visual experience. Furthermore, it has recently been demonstrated that attention can also play an important role when applying artificial neural networks to a variety of tasks in fields such as computer vision and NLP. In this work we show that, by properly defining attention for convolutional neural networks, we can use this information to significantly improve the performance of a student CNN by forcing it to mimic the attention maps of a powerful teacher network. To that end, we propose several novel methods of transferring attention, showing consistent improvement across a variety of datasets and convolutional neural network architectures. Code and models for our experiments are available at https://github.com/szagoruyko/attention-transfer
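
The abstract does not spell out how an attention map is defined or compared, so the following is only a minimal sketch under stated assumptions: attention is taken to be the channel-wise sum of squared activations at each spatial location, flattened and L2-normalized, and the transfer loss is the mean squared difference between the student's and teacher's normalized maps. The function names `attention_map` and `attention_transfer_loss` are hypothetical, and plain nested lists stand in for framework tensors.

```python
import math


def attention_map(features):
    """Collapse a C x H x W activation (nested lists) into a flattened,
    L2-normalized spatial attention vector of length H * W.

    Assumption: attention = sum over channels of squared activations.
    """
    height, width = len(features[0]), len(features[0][0])
    # Sum squared activations across the channel axis at each location.
    raw = [
        sum(channel[i][j] ** 2 for channel in features)
        for i in range(height)
        for j in range(width)
    ]
    # Fall back to 1.0 so an all-zero map does not divide by zero.
    norm = math.sqrt(sum(v * v for v in raw)) or 1.0
    return [v / norm for v in raw]


def attention_transfer_loss(student_features, teacher_features):
    """Mean squared difference between the two normalized attention vectors."""
    s = attention_map(student_features)
    t = attention_map(teacher_features)
    if len(s) != len(t):
        raise ValueError("student and teacher spatial sizes must match")
    return sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)


# Toy 2 x 2 activations: the teacher has two channels, the students one.
teacher = [[[1.0, 0.0], [0.0, 0.0]],
           [[1.0, 0.0], [0.0, 0.0]]]
aligned_student = [[[3.0, 0.0], [0.0, 0.0]]]   # attends to the same location
shifted_student = [[[0.0, 0.0], [0.0, 3.0]]]   # attends to the opposite corner

print(attention_transfer_loss(aligned_student, teacher))  # 0.0
print(attention_transfer_loss(shifted_student, teacher))  # 0.5
```

Because the map is reduced over channels and then normalized, the student and teacher may have different channel counts and activation scales; only the spatial sizes need to agree. In a training setup of this kind, such a term would presumably be added to the ordinary task loss with a weighting factor, which is left out here.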

Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Generative Diffusion Prior Distillation for Long-Context Knowledge Transfer
cs.LG 2026-05 unverdicted novelty 7.0

GDPD treats partial student features as degraded observations and uses a learned diffusion prior over teacher features to sample restorative long-context targets for improved partial time-series classification.

GaitKD: A Universal Decoupled Distillation Framework for Efficient Gait Recognition
cs.CV 2026-04 unverdicted novelty 6.0

GaitKD introduces a decoupled distillation framework that transfers inter-class decisions via part-calibrated logits and preserves embedding space partitioning via activation boundaries, yielding consistent gains over...

Rapidly deploying on-device eye tracking by distilling visual foundation models
cs.CV 2026-04 unverdicted novelty 6.0

DistillGaze reduces median gaze error by 58.62% on a 2000+ participant dataset by distilling foundation models into a 256K-parameter on-device model using synthetic labeled data and unlabeled real data.

CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society
cs.AI 2023-03 conditional novelty 6.0

CAMEL proposes a role-playing framework with inception prompting that enables autonomous multi-agent cooperation among LLMs and generates conversational data for studying their behaviors.

Deep Reprogramming Distillation for Medical Foundation Models
cs.CV 2026-05 unverdicted novelty 5.0

DRD introduces a reprogramming module and CKA-based distillation to enable efficient, robust adaptation of medical foundation models to downstream 2D/3D classification and segmentation tasks, outperforming prior PEFT ...

SwiftChannel: Algorithm-Hardware Co-Design for Deep Learning-Based 5G Channel Estimation
cs.IT 2026-05 unverdicted novelty 5.0

SwiftChannel delivers a compressed CNN-based channel estimator with parameter-free attention running on FPGA, achieving sub-millisecond latency, 24x speedup, and 33x better energy efficiency than GPU baselines while g...

Improving Diversity in Black-box Few-shot Knowledge Distillation
cs.CV 2026-04 unverdicted novelty 5.0

An adaptive high-confidence image selection scheme during GAN training expands diversity in the distillation set for black-box few-shot KD and yields SOTA student accuracy on seven image datasets.

Self-Abstraction Learning for Effective and Stable Training of Deep Neural Networks
cs.LG 2026-04 unverdicted novelty 5.0

SAL is a hierarchical framework that trains deep neural networks by starting with the simplest network and using its hidden and output layers to guide more complex networks, resulting in more stable training and bette...

FedProxy: Federated Fine-Tuning of LLMs via Proxy SLMs and Heterogeneity-Aware Fusion
cs.LG 2026-04 unverdicted novelty 5.0

FedProxy replaces weak adapters with a proxy SLM for federated LLM fine-tuning, outperforming prior methods and approaching centralized performance via compression, heterogeneity-aware aggregation, and training-free fusion.

Multi-Dataset Cross-Domain Knowledge Distillation for Unified Medical Image Segmentation, Classification, and Detection
cs.CV 2026-05 unverdicted novelty 4.0

A multi-dataset cross-domain knowledge distillation approach improves unified performance on medical image segmentation, classification, and detection by transferring domain-invariant features from a joint teacher mod...