arXiv: 2601.11667 · v2 · pith:V2V5FTTKnew · submitted 2026-01-16 · cs.LG · cs.AI

Distill-then-Replace: Efficient Task-Specific Hybrid Attention Model Construction

keywords: attention, linear, hybrid, full-attention, distill-then-replace, efficient, full, model
Transformer architectures deliver state-of-the-art accuracy via dense full-attention, but their quadratic time and memory complexity with respect to sequence length limits practical deployment. Linear attention mechanisms offer linear or near-linear scaling yet often incur performance degradation. Hybrid models that integrate full and linear attention layers promise a balance between efficiency and expressiveness, but face two major challenges: training such hybrid models from scratch is computationally expensive, and manually designing the optimal placement of attention types is highly nontrivial. We propose DtR (Distill-then-Replace), which first transfers weights from the pretrained full-attention modules to their linear attention counterparts through blockwise local distillation, and then applies a greedy layer replacement strategy that iteratively substitutes full attention blocks with linear ones while monitoring validation performance on the target task. DtR yields a task-specific hybrid model in a single efficient pass, without costly re-training or neural architecture search, and can be applied to any pretrained full-attention backbone for diverse downstream tasks.
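
The greedy replacement step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes the distillation step has already produced a linear counterpart for every layer, and it assumes a particular selection and stopping rule that the abstract does not specify: at each round, try every remaining full-attention layer, keep the swap with the best validation score, and stop once the score falls more than `max_drop` below the all-full baseline. The names `greedy_replace`, `evaluate`, `max_drop`, `toy_evaluate`, and the per-layer costs are all hypothetical.

```python
from typing import Callable, List, Sequence


def greedy_replace(
    num_layers: int,
    evaluate: Callable[[Sequence[str]], float],
    max_drop: float,
) -> List[str]:
    """Greedily swap 'full' layers for 'linear' ones.

    `evaluate` is a hypothetical callback returning a validation score
    (higher is better) for a given per-layer configuration.
    The stopping rule (score within `max_drop` of the baseline) is an
    assumption made for this sketch.
    """
    config = ["full"] * num_layers
    baseline = evaluate(config)

    while True:
        best_idx, best_score = None, float("-inf")
        # Try replacing each remaining full-attention layer in turn.
        for i, kind in enumerate(config):
            if kind != "full":
                continue
            trial = config.copy()
            trial[i] = "linear"
            score = evaluate(trial)
            if score > best_score:
                best_idx, best_score = i, score

        # Stop when no full layers remain or the best swap costs too much.
        if best_idx is None or baseline - best_score > max_drop:
            break
        config[best_idx] = "linear"

    return config


# Toy stand-in for validation: each layer has a made-up accuracy cost
# that is paid when that layer is replaced by linear attention.
TOY_COSTS = [0.01, 0.20, 0.02, 0.30]


def toy_evaluate(config: Sequence[str]) -> float:
    penalty = sum(c for c, kind in zip(TOY_COSTS, config) if kind == "linear")
    return 1.0 - penalty


result = greedy_replace(len(TOY_COSTS), toy_evaluate, max_drop=0.05)
print(result)  # ['linear', 'full', 'linear', 'full']
```

In this toy run the two cheap layers are replaced and the two expensive ones keep full attention. In practice `evaluate` would run the hybrid model on the target task's validation set, so each round costs one evaluation per remaining full-attention layer.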