Distill-then-Replace: Efficient Task-Specific Hybrid Attention Model Construction

Chaoliang Zhong; Huigang Zhang; Jun Sun; Xiaojie Xia; Yusuke Oishi

REVIEW 4 major objections 4 minor 45 references

A pretrained full-attention transformer can be converted into a task-specific hybrid model that matches or exceeds its accuracy while running faster, in a single pass without retraining or architecture search.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.

T0 review · deepseek-v4-flash

2026-08-03 10:08 UTC pith:V2V5FTTK

Load-bearing objection: Plausible practical pipeline for building task-specific hybrid attention models, but its central 'match or exceed' claim is inflated by selection-on-validation and missing variance.

arxiv 2601.11667 v2 pith:V2V5FTTK submitted 2026-01-16 cs.LG cs.AI

Distill-then-Replace: Efficient Task-Specific Hybrid Attention Model Construction

Xiaojie Xia, Huigang Zhang, Chaoliang Zhong, Jun Sun, Yusuke Oishi

classification cs.LG cs.AI

keywords: hybrid attention models; linear attention; blockwise local distillation; greedy layer replacement; task-specific architecture; inference efficiency; transformer compression; knowledge distillation

verification ladder T0 review T1 audit T2 compute T3 formal T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper claims that a pretrained full-attention transformer can be converted into a task-specific hybrid model—part full attention, part linear attention—in a single low-cost pass, without retraining or neural architecture search. The method, called DtR, first distills each full-attention block into a linear-attention counterpart using blockwise local distillation, then greedily replaces full layers with these linear blocks while monitoring validation performance on the target task. The central result is that the searched hybrid models match or exceed the base model's accuracy on most of the twelve evaluated tasks, while replacing several to many layers and thereby increasing inference throughput, especially at long sequence lengths. If true, this gives practitioners a practical recipe for turning an existing large language model into a faster, task-tuned model in hours on a single GPU.

Core claim

The central claim is that a pretrained full-attention transformer can be made task-specific and faster by replacing many of its full-attention blocks with linear-attention counterparts trained to reproduce each block's output in isolation. After blockwise local distillation, a greedy algorithm repeatedly swaps in the linear block that most improves (or least hurts) validation performance on the target task, continuing until a performance threshold is breached. Across three base models of different families and scales, the best hybrid found this way matches or exceeds the base model on most of twelve tasks, and the maximum-replacement hybrid stays within the allowed performance drop while sub

What carries the argument

The method combines two mechanisms. Blockwise local distillation trains each linear-attention block independently, in parallel, to reproduce the output of its parent full-attention block on the same hidden states, using a mean-squared-error loss and no backpropagation through the whole model. Then a greedy layer-replacement loop evaluates, for each remaining full-attention layer, the task validation metric after replacing that layer with its distilled linear counterpart, commits the swap that yields the best score, and halts when the score falls below a minimum acceptable threshold. This two-stage design is what lets the method avoid retraining and architecture search, and it also explains t

Load-bearing premise

The load-bearing premise is that the linear blocks distilled in isolation remain accurate when chained after already-replaced predecessors; if errors accumulate under that distribution shift, the greedy validation scores won't reflect the final hybrid's test behavior.

What would settle it

Run the greedy search on a long-context task using only the short validation set, then evaluate the chosen hybrid on a fresh, longer test set; if the validation-selected model's accuracy drops below the tolerated margin on the longer data, the greedy validation criterion is not a faithful proxy and the match-or-exceed result would be a selection artifact.


If this is right

Any pretrained full-attention backbone can be converted to a task-specific hybrid in a few GPU-hours, using roughly 100M general tokens plus a small task validation set.
The greedy search preserves full attention in task-critical layers and replaces the rest, so the hybrid retains baseline accuracy while improving throughput; speedups grow with sequence length.
With a user-specified allowed accuracy drop, the method returns a maximum-replacement hybrid whose measured drop tracks the allowed margin, giving a controllable efficiency/accuracy dial.
The searched hybrids can be further improved by supervised fine-tuning, sometimes surpassing the fine-tuned base model, meaning the hybrid architecture is a viable substrate for downstream training.
Replacement order is largely consistent across linear-attention variants, suggesting that a single search per model and task can inform the placement of many different linear backends.
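The "controllable dial" in the third point reduces to a relative-drop check against a user-set margin. A minimal sketch, assuming scores are accuracies and the margin is a relative fraction; the function names and the example scores are hypothetical and not taken from the paper:

```python
def relative_drop(base_score: float, hybrid_score: float) -> float:
    """Relative performance drop of the hybrid versus the base, as a fraction."""
    return (base_score - hybrid_score) / base_score


def within_margin(base_score: float, hybrid_score: float,
                  allowed_drop: float = 0.05) -> bool:
    """True if the hybrid stays inside the allowed relative drop."""
    return relative_drop(base_score, hybrid_score) <= allowed_drop


# Hypothetical scores: a 3.75% drop passes a 5% margin, a 10% drop does not.
print(within_margin(80.0, 77.0))
print(within_margin(80.0, 72.0))
```

The check only controls the split it is computed on; a margin enforced on validation says nothing by itself about the drop measured on test.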

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

A selection-bias caution the paper does not address: because the same validation set drives the greedy search, the reported test scores for the 'best hybrid' are likely optimistically selected; reserving a separate validation split for a single final evaluation would be a stronger check.
The results imply that many full-attention layers in pretrained LLMs are task-redundant; if so, task-specific compression headroom is large, and DtR-style layer replacement could plausibly combine with layer pruning or quantization for even larger speedups.
The method's benefit should grow with sequence length, so testing it on long-context tasks such as document QA or extended dialogue, with the same protocol, is a natural next step because linear layers' KV-cache savings only materialize there.
One datum in the tables appears impossible: Table 3 reports a replacement count of 72 for a 32-layer model in one cell. Since a layer can only be replaced once, this is likely a typo, and that cell should be corrected before being used as evidence for the match-or-exceed claim.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

Plausible practical pipeline for building task-specific hybrid attention models, but its central 'match or exceed' claim is inflated by selection-on-validation and missing variance.


Colleague,

This is a practical engineering paper with one inflated claim. The pipeline — blockwise local distillation of linear attention counterparts plus greedy, validation-guided layer replacement — is not in the prior literature as a task-specific single-pass recipe, and the experiments span three backbones and a dozen tasks. That is real work, and the cost numbers (a few GPU-hours) are attractive. The method is sensible and clearly written.

The soft spot is the empirical support for "match or exceed." Algorithm 1 updates M_best only when a candidate's validation score is >= the current best, and the initial best is the base model, so the hybrid is guaranteed to match or beat the base on validation by construction. That is selection, not evidence. The test results are the independent part, and they show variance: the Table 4 test drops reach 8.47% (MK) and 9.58% (EC) even though the validation tolerance was 5%. There are no error bars or repeated runs, so we cannot tell whether the gains are signal or selection noise. The stress-test arithmetic is right: for 28–32 layers, the greedy search evaluates a few hundred candidates per task, and picking the max of ~500-sample validation estimates carries real optimism bias.
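The optimism bias can be illustrated with a small simulation. This is a sketch under strong simplifying assumptions that are not from the paper: every candidate has the same true accuracy (0.70, a hypothetical value), and each candidate is scored on its own independent draw of 500 examples, whereas a real search reuses one validation set and its scores are correlated.

```python
import random


def max_of_noisy_estimates(true_acc: float, n_val: int, n_candidates: int,
                           rng: random.Random) -> float:
    """Best observed validation accuracy among equally good candidates."""
    best = 0.0
    for _ in range(n_candidates):
        # Each validation example is answered correctly with prob. true_acc.
        correct = sum(rng.random() < true_acc for _ in range(n_val))
        best = max(best, correct / n_val)
    return best


rng = random.Random(0)
# 400 candidates per search, 500 validation examples, 10 repeated searches.
trials = [max_of_noisy_estimates(0.70, 500, 400, rng) for _ in range(10)]
mean_best = sum(trials) / len(trials)
print(f"true accuracy 0.700, mean selected validation accuracy {mean_best:.3f}")
```

Even though no candidate is better than any other here, the selected maximum sits well above the true accuracy, which is why a validation-max score needs an independent test measurement with variance.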

Other issues: no code or training hyperparameters for the distillation stage are given, so the central cost/quality trade-off is not independently checkable. The closest prior methods (PULSE/Puzzle and Jet-Nemotron) are cited but never compared, which weakens the novelty claim. Table 3 has a typo (PQ #Rep=72) that needs fixing. The distribution-shift concern — independent block distillation followed by sequential replacement — is partially addressed because the greedy search evaluates the actual hybrid on validation, but it would be nice to see evidence that distillation quality survives the shift.

On balance, this is a plausible method with an overstated headline. It deserves a serious referee, but the authors need to add variance, a random/null-selection baseline, and ideally code or full configs. I'd bring it to reading group mostly as a case study in selection-on-validation.
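One way to build the requested random/null-selection baseline is to score randomly chosen architectures with the same number of replaced layers as the searched hybrid. A minimal sketch; `null_baseline`, `toy_score`, and the numbers are hypothetical, and `test_score` stands in for whatever evaluation harness the authors use:

```python
import random
import statistics
from typing import Callable, FrozenSet, Tuple


def null_baseline(num_layers: int, num_replaced: int,
                  test_score: Callable[[FrozenSet[int]], float],
                  n_draws: int = 20, seed: int = 0) -> Tuple[float, float]:
    """Mean and sample std. dev. of scores for random matched-#Rep hybrids."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_draws):
        # Pick which layers to replace uniformly at random, without repeats.
        layers = frozenset(rng.sample(range(num_layers), num_replaced))
        scores.append(test_score(layers))
    return statistics.mean(scores), statistics.stdev(scores)


# Toy stand-in scorer (assumption): replacing earlier layers costs more.
def toy_score(layers: FrozenSet[int]) -> float:
    return 0.80 - sum(0.001 * (28 - i) for i in layers)


mean, spread = null_baseline(28, 10, toy_score)
print(f"random 10-of-28 hybrids: mean {mean:.3f}, std {spread:.3f}")
```

A searched hybrid that does not clearly beat this distribution at the same #Rep would suggest the gain comes from selection noise rather than from the search.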

Recommendation: send to peer review with the expectation of major revision. Verify the variance and comparison claims before relying on this.

Referee Report

4 major / 4 minor

Summary. The paper proposes DtR, a two-stage method for converting a pretrained full-attention transformer into a task-specific hybrid model. In the first stage, each full-attention block is paired with a linear-attention counterpart trained to match the block's output via blockwise MSE distillation (Sec. 3.1). In the second stage, Algorithm 1 greedily replaces full-attention layers with these distilled linear blocks: at each step, every remaining full-attention position is evaluated on the task validation set, and the replacement yielding the highest validation score is committed, while tracking the best-scoring hybrid and a maximum-replacement hybrid subject to a performance threshold. The method is tested on twelve benchmarks across three base models (Qwen2.5-1.5B, Llama-3.2-3B-Instruct, Llama-3.1-8B-Instruct) with three linear-attention variants (GLA, GDN, JET). The authors report that the validation-best hybrid matches or exceeds the base model on most test sets and that decoding throughput grows with the number of replaced layers.
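The second stage as summarized here can be sketched in a few lines. This is a reading of the description above, not the authors' Algorithm 1: the function name, the tie-breaking rule, the stop-on-threshold behavior, and the toy scorer are all assumptions made for illustration.

```python
from typing import Callable, FrozenSet, Tuple


def greedy_replace(
    num_layers: int,
    score: Callable[[FrozenSet[int]], float],
    p_min: float,
) -> Tuple[FrozenSet[int], FrozenSet[int], int]:
    """Return (best-scoring set, max-replacement set above p_min, #evaluations)."""
    replaced: FrozenSet[int] = frozenset()
    # The unmodified base model is the initial incumbent.
    best_set, best_score = replaced, score(replaced)
    opt_set = replaced
    evaluations = 0
    while len(replaced) < num_layers:
        candidates = []
        for layer in range(num_layers):
            if layer in replaced:
                continue
            trial = replaced | {layer}
            # -layer breaks ties toward the lowest layer index (an assumption).
            candidates.append((score(trial), -layer, trial))
            evaluations += 1
        step_score, _, step_set = max(candidates)
        if step_score < p_min:
            break  # the best available swap breaches the threshold
        replaced = step_set
        opt_set = replaced
        if step_score >= best_score:
            best_set, best_score = replaced, step_score
    return best_set, opt_set, evaluations


# Toy validation scorer (hypothetical): correct answers out of 1000.
costs = [0, 10, 20, 30]
toy = lambda layers: 800 - sum(costs[i] for i in layers)
best, opt, n_evals = greedy_replace(4, toy, p_min=760)
print(sorted(best), sorted(opt), n_evals)
```

Because the incumbent starts as the base model and is replaced only on `>=`, the returned best set can never score below the base on the validation scorer, which is the selection property the report flags.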

Significance. If the empirical claims are robust, DtR is a practically valuable recipe for cheaply converting existing LLMs into hybrid architectures without from-scratch training, and the reported GPU-hour costs (Table 5) are attractive for deployment-oriented researchers. The experiments are broad and the greedy, validation-driven search is simple and potentially reusable. However, the central empirical claim is weakened by best-of-many selection on small validation sets without uncertainty quantification, and by the threshold violations in Table 4. The core idea deserves publication only after the statistical evidence is strengthened.

major comments (4)

[Sec. 4.2, Tables 1-3, Algorithm 1] The claim that the 'searched best hybrid models match or even exceed the performance of the base model in the majority of cases' is not supported by the reported numbers. Algorithm 1 selects the architecture with maximum validation score over O(L^2) candidate configurations per task and variant (L(L+1)/2 if every step scores each remaining layer and the search runs to full replacement: 406 for 28 layers, 528 for 32 layers). With validation sets that are either official splits or only 500 random samples (Sec. 4.1), this best-of-many selection produces upward-biased validation estimates. The P_best values in Tables 1-3 are the test scores of the validation-max architecture, reported as single points without error bars, repeated runs, or a null comparison (e.g., test scores of random architectures with matched #Rep). The base model is a single fixed point, so max-of-many can beat it even when all hybrids are on average worse. Please report standard errors or binomial confidence
[Table 4, Sec. 4.2] The stated protocol allows a maximum 5% relative validation drop, yet the selected optimal models show test drops of 9.58% (EC) and 8.47% (MK). This discrepancy indicates that the 500-sample validation sets are too noisy to control the advertised performance-efficiency trade-off. The authors should report validation drops alongside test drops for all tasks and explain the EC/MK outliers. Without this, the claim that a user can preselect a degradation margin and trust the resulting hybrid is not validated. This is load-bearing for the practical contribution.
[Sec. 3.1, Sec. 4.1] The blockwise local distillation (BLD) stage is the foundation of the method, but its training setup is underspecified: no optimizer, learning rate, number of training steps/epochs, batch size, sequence length, or exact linear-block configuration is given, and the 100M-token corpus is described only as 'a combination of Nemotron-CC and Redstone-QA.' This prevents reproduction of the central component. Please provide full hyperparameters and ideally code or checkpoints.
[Sec. 4.2, Fig. 4] The text says the proposed replacement strategy 'consistently and significantly outperforms' alternative strategies, but Fig. 4 appears to show single runs without error bars or significance tests. The word 'significantly' is unsupported. Add variance estimates across repeated runs or random seeds, or soften the claim to 'numerically outperforms in these runs.'
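The candidate counts behind the first major comment follow from simple arithmetic. A short sketch, assuming each greedy step scores every remaining full-attention layer once and the search runs until all layers are replaced (an upper bound; an early stop evaluates fewer):

```python
def greedy_evaluations(num_layers: int) -> int:
    """Candidates scored when the greedy loop runs to full replacement."""
    # Step k (0-based) has num_layers - k layers left to try.
    return sum(num_layers - step for step in range(num_layers))


for layers in (28, 32):
    print(layers, greedy_evaluations(layers))
# prints "28 406" and "32 528", i.e. L * (L + 1) / 2
```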

minor comments (4)

[Table 3, PQ row] GDN #Rep is reported as 72, which exceeds the 32-layer Llama-3.1-8B model. This is likely a typo (7 or 2?) and should be corrected.
[Abstract, Sec. 3.2] 'Without costly re-training or neural architecture search' is misleading because Algorithm 1 is itself a greedy architecture search. Suggest rewording to 'without expensive end-to-end training or exhaustive architecture search.'
[Sec. 2.1] 'impossible triangle' should likely be 'impossible trinity' for standard terminology.
[Sec. 4.1] The exact validation set sizes per task are not listed. A supplementary table with official vs. sampled validation sizes would help interpret the noise levels.
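To make the noise level in the last comment concrete, a normal-approximation interval for an accuracy measured on 500 examples can be computed directly. A minimal sketch; the accuracy of 0.70 is a hypothetical value chosen for illustration, not a number from the paper:

```python
import math
from typing import Tuple


def binomial_ci(accuracy: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation confidence interval for an accuracy on n examples."""
    se = math.sqrt(accuracy * (1.0 - accuracy) / n)
    return accuracy - z * se, accuracy + z * se


low, high = binomial_ci(0.70, 500)
print(f"95% CI: [{low:.3f}, {high:.3f}]")
# prints "95% CI: [0.660, 0.740]"
```

At this accuracy the half-width is about 0.040, which is wider than a 5% relative drop (0.035), so a single 500-sample estimate cannot reliably tell a hybrid inside the margin from one outside it.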

Circularity Check

1 step flagged

Minor tautology in defining the 'best' hybrid via the validation update rule; the test-set results remain genuinely out-of-sample.

specific steps

self-definitional [Algorithm 1 lines 14–16; Section 4.2, first paragraph]
"if P* ≥ P_best then M_best ← M*, P_best ← P*" (Algorithm 1); "the searched best hybrid models match or even exceed the performance of the base model in the majority of cases" (Sec. 4.2)

M_best is defined as the candidate attaining the maximum validation metric among all visited hybrids, with the full-attention model as the initial incumbent. Therefore the statement that the 'best searched hybrid' matches or exceeds the base on validation is true by the update rule, not by measured model quality. Repeating it as an empirical result in Sec. 4.2 is partly a restatement of the selection criterion. The Tables 1-3 P_best values are test scores, not the validation scores used in the update, so those numbers are not forced by construction; the circularity is partial and mainly affects the framing, not the out-of-sample measurement.

full rationale

The central pipeline is not circular in an algebraic sense. Blockwise local distillation trains each linear block to regress on its parent full-attention block's outputs (L = MSE(O_full, O_linear)), which is an independent local fitting step. The greedy replacement then evaluates actual hybrid models on validation, so the distribution-shift concern is empirically measured rather than assumed. The core empirical claim (Tables 1-3) is reported on test splits, which lie outside the selection loop, so 'match or exceed base' is not guaranteed by construction on test data. The only circularity-adjacent element is Algorithm 1's rule that M_best is updated only when a candidate's validation score is >= the current best, making 'the best searched hybrid matches or exceeds the base' tautologically true on validation; Sec. 4.2 restates this selection property as a substantive finding. The absence of error bars and the Table 4 test drops of 9.58% (EC) and 8.47% (MK) against a 5% validation tolerance indicate selection noise or overfitting, a statistical-validity concern rather than definitional circularity. There are no load-bearing self-citations, no imported uniqueness theorems, and no ansatz smuggled in by the authors' prior work; the paper relies on external benchmarks and externally introduced linear modules (GLA, GDN, JetBlock). Overall this is a minor self-definitional framing issue, not circularity of the derivation itself.
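The "independent local fitting step" can be pictured with a deliberately tiny stand-in. In this sketch each teacher block is a scalar function and each student is a scalar affine map fitted by gradient descent on the MSE; real blocks are attention layers over hidden-state tensors, and every name, learning rate, and step count here is an assumption for illustration only.

```python
import random


def distill_block(teacher, inputs, steps=500, lr=0.1):
    """Fit a scalar affine student w*x + b to one teacher block by MSE descent."""
    w, b = 0.0, 0.0
    n = len(inputs)
    targets = [teacher(x) for x in inputs]  # teacher outputs on the same inputs
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, t in zip(inputs, targets):
            err = (w * x + b) - t
            grad_w += 2.0 * err * x / n
            grad_b += 2.0 * err / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def mse(teacher, student, inputs):
    """Mean squared error between student and teacher outputs."""
    w, b = student
    return sum((w * x + b - teacher(x)) ** 2 for x in inputs) / len(inputs)


rng = random.Random(0)
hidden = [rng.uniform(-1.0, 1.0) for _ in range(64)]  # stand-in hidden states
teachers = [lambda x: 2.0 * x + 1.0, lambda x: -0.5 * x + 3.0]
# Each student sees only its own teacher: no gradient crosses blocks.
students = [distill_block(t, hidden) for t in teachers]
print([mse(t, s, hidden) for t, s in zip(teachers, students)])
```

The sketch fits every student on the teacher's own inputs, which is exactly why it says nothing about behavior once earlier blocks have been swapped and the inputs shift.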

Axiom & Free-Parameter Ledger

4 free parameters · 4 axioms · 0 invented entities

The central claim rests on the heuristic assumptions that blockwise local distillation preserves functional fidelity under replacement, that greedy replacement is near-optimal, and that small validation sets are representative. Method hyperparameters (5% tolerance, 500 validation samples, 100M distillation tokens) are user-chosen and not fitted to data.

free parameters (4)

Maximum allowed performance drop (P_min threshold) = 5% relative drop (used in Table 4; not reported per-task)
User-set constraint controlling M_opt; the greedy search stops when validation drops below it.
Validation set size for tasks without official split = 500 randomly sampled examples
Chosen by hand; affects search reliability; no repeated sampling.
Distillation corpus size = ~100M tokens (Nemotron-CC + Redstone-QA)
Arbitrary budget; BLD stage depends on it.
BLD training hyperparameters (learning rate, steps, batch)
Required to reproduce distillation but omitted from paper.

axioms (4)

domain assumption MSE blockwise distillation preserves enough of each full-attention block's function under later replacements
Section 3.1 defines L = MSE(O_full, O_linear), but gives no analysis of distribution shift after replacing earlier layers.
domain assumption Greedy one-way layer replacement finds a task-optimal hybrid configuration
Algorithm 1 commits the best-scoring replacement at each step and never reverts it; there is no backtracking or global search. Section 3.2 asserts it adapts to task-specific knowledge.
domain assumption Validation metrics on 500 samples are reliable for architecture selection
Section 4.1 uses official val split or 500 random training examples; all greedy decisions rely on this.
domain assumption Linear attention modules can be inserted into any pretrained transformer with hidden-size matching
Section 4.1 states all variants use same hidden size; no discussion of architectural compatibility (e.g., normalization, residual structure).


Original abstract

Transformer architectures deliver state-of-the-art accuracy via dense full-attention, but their quadratic time and memory complexity with respect to sequence length limits practical deployment. Linear attention mechanisms offer linear or near-linear scaling yet often incur performance degradation. Hybrid models that integrate full and linear attention layers promise a balance between efficiency and expressiveness, but face two major challenges: training such hybrid models from scratch is computationally expensive, and manually designing the optimal placement of attention types is highly nontrivial. We propose DtR (Distill-then-Replace), which first transfers weights from the pretrained full-attention modules to its linear attention counterparts through blockwise local distillation, and then applies a greedy layer replacement strategy that iteratively substitutes full attention blocks with linear ones while monitoring validation performance on the target task. DtR yields a task-specific hybrid model in a single efficient pass, without costly re-training or neural architecture search, and can be applied to any pretrained full-attention backbone for diverse downstream tasks.

Figures

Figures reproduced from arXiv: 2601.11667 by Chaoliang Zhong, Huigang Zhang, Jun Sun, Xiaojie Xia, Yusuke Oishi.

**Figure 1.** Linear attention weight by blockwise local distillation. (a) Overall distillation framework from full attention to the linear attention. (b) BLD (blockwise local distillation), which are trained in parallel and independently. This decoupled distillation ensures that each linear attention module depends solely on the behavior of its associated full-attention block, without requiring back-propagation throug…

**Figure 2.** Throughput comparison under context length of 512, 2,048, 16,384 and 65,536. The numbers above points indicate the speedup relative to base full-attention model. We allow a maximum performance drop of 5% relative to the base model on the validation set and select the hybrid model with the most linear attention layers as the optimal configuration. Then we examine the resulting trade-off between performance…

**Figure 3.** Layer replacement trajectories on PubMedQA (PM) and CommonsenseQA (CQ) using Qwen2.5-1.5B and Llama3.2-3B-Instruct (28 layers each) with linear attention variants: Gated Linear Attention (GLA) and Jet-Block (JET).

**Figure 4.** Comparison of different replacement strategies. Experimental cost: We report the total experimental cost of our method on a single NVIDIA A800 GPU on the PubmedQA dataset with different base models. Note that the greedy search ends when all full-attention layers are replaced, providing an upper-bound greedy search runtime.

**Figure 5.** Impact of supervised fine-tuning (SFT) on searched hybrid model.


Reference graph

Works this paper leans on

45 extracted references · 27 linked inside Pith

[1]

In: The Twelfth International Conference on Learning Representations (2024)

Agarwal, R., Vieillard, N., Zhou, Y., Stanczyk, P., Garea, S.R., Geist, M., Bachem, O.: On-policy distillation of language models: Learning from self-generated mis- takes. In: The Twelfth International Conference on Learning Representations (2024)

2024
[2]

Amini, A., Gabriel, S., Lin, S., Koncel-Kedziorski, R., Choi, Y., Hajishirzi, H.: Mathqa: Towards interpretable math word problem solving with operation-based formalisms. In: Proceedings of the 2019 conference of the North American chap- ter of the association for computational linguistics: Human language technologies, volume 1 (long and short papers). pp...

2019
[3]

arXiv preprint arXiv:2402.18668 (2024)

Arora, S., Eyuboglu, S., Zhang, M., Timalsina, A., Alberti, S., Zinsley, D., Zou, J., Rudra, A., Ré, C.: Simple linear attention language models balance the recall- throughput tradeoff. arXiv preprint arXiv:2402.18668 (2024)

Pith/arXiv arXiv 2024
[4]

arXiv preprint arXiv:2411.19146 (2024)

Bercovich, A., Ronen, T., Abramovich, T., Ailon, N., Assaf, N., Dabbah, M., Galil, I., Geifman, A., Geifman, Y., Golan, I., et al.: Puzzle: Distillation-based nas for inference-optimized llms. arXiv preprint arXiv:2411.19146 (2024)

Pith/arXiv arXiv 2024
[5]

In: Proceedings of the AAAI conference on artificial intelligence

Bisk, Y., Zellers, R., Gao, J., Choi, Y., et al.: Piqa: Reasoning about physical commonsense in natural language. In: Proceedings of the AAAI conference on artificial intelligence. vol. 34, pp. 7432–7439 (2020)

2020
[6]

In: Proceedings of the AAAI Conference on Artificial Intelligence

Bulatov, A., Kuratov, Y., Kapushev, Y., Burtsev, M.: Beyond attention: Breaking the limits of transformer context length with recurrent memory. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 17700–17708 (2024)

2024
[7]

arXiv preprint arXiv:2412.03398 (2024)

Chang, Y., Cui, L., Dong, L., Huang, S., Huang, Y., Huang, Y., Li, S., Lv, T., Ma, S., Sun, Q., et al.: Redstone: Curating general, code, math, and qa data for large language models. arXiv preprint arXiv:2412.03398 (2024)

Pith/arXiv arXiv 2024
[8]

arXiv preprint arXiv:2009.14794 (2020)

Choromanski, K., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J., Mohiuddin, A., Kaiser, L., et al.: Rethinking attention with performers. arXiv preprint arXiv:2009.14794 (2020)

Pith/arXiv arXiv 2009
[9]

arXiv preprint arXiv:1905.10044 (2019)

Clark, C., Lee, K., Chang, M.W., Kwiatkowski, T., Collins, M., Toutanova, K.: Boolq: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044 (2019)

Pith/arXiv arXiv 1905
[10]

arXiv preprint arXiv:1803.05457 (2018)

Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., Tafjord, O.: Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457 (2018)

Pith/arXiv arXiv 2018
[11]

arXiv preprint arXiv:2405.21060 (2024) 14 Xiaojie Xia, Huigang Zhang, Chaoliang Zhong, Jun Sun, and Yusuke Oishi

Dao, T., Gu, A.: Transformers are ssms: Generalized models and efficient algo- rithms through structured state space duality. arXiv preprint arXiv:2405.21060 (2024) 14 Xiaojie Xia, Huigang Zhang, Chaoliang Zhong, Jun Sun, and Yusuke Oishi

Pith/arXiv arXiv 2024
[12]

arXiv preprint arXiv:2402.19427 (2024)

De, S., Smith, S.L., Fernando, A., Botev, A., Cristian-Muraru, G., Gu, A., Haroun, R., Berrada, L., Chen, Y., Srinivasan, S., et al.: Griffin: Mixing gated linear recurrences with local attention for efficient language models. arXiv preprint arXiv:2402.19427 (2024)

Pith/arXiv arXiv 2024
[13]

arXiv preprint arXiv:2212.14052 (2022)

Fu, D.Y., Dao, T., Saab, K.K., Thomas, A.W., Rudra, A., Ré, C.: Hungry hun- gry hippos: Towards language modeling with state space models. arXiv preprint arXiv:2212.14052 (2022)

Pith/arXiv arXiv 2022
[14]

Gao, L., Tow, J., Abbasi, B., Biderman, S., Black, S., DiPofi, A., Foster, C., Golding, L., Hsu, J., Le Noac’h, A., Li, H., McDonell, K., Muennighoff, N., Ociepa, C., Phang, J., Reynolds, L., Schoelkopf, H., Skowron, A., Sutawika, L., Tang, E., Thite, A., Wang, B., Wang, K., Zou, A.: The language model eval- uation harness (07 2024).https://doi.org/10.528...

arXiv 2024
[15]

arXiv preprint arXiv:2405.16712 (2024)

Glorioso, P., Anthony, Q., Tokpanov, Y., Whittington, J., Pilault, J., Ibrahim, A., Millidge, B.: Zamba: A compact 7b ssm hybrid model. arXiv preprint arXiv:2405.16712 (2024)

Pith/arXiv arXiv 2024
[16]

arXiv preprint arXiv:2505.03005 (2025)

Goldstein, D., Alcaide, E., Lu, J., Cheah, E.: Radlads: Rapid attention distillation to linear attention decoders at scale. arXiv preprint arXiv:2505.03005 (2025)

arXiv 2025
[17]

arXiv preprint arXiv:2407.21783 (2024)

Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Let- man, A., Mathur, A., Schelten, A., Vaughan, A., et al.: The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)

Pith/arXiv arXiv 2024
[18]

In: First conference on language modeling (2024)

Gu, A., Dao, T.: Mamba: Linear-time sequence modeling with selective state spaces. In: First conference on language modeling (2024)

2024
[19]

arXiv preprint arXiv:2111.00396 (2021)

Gu, A., Goel, K., Ré, C.: Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396 (2021)

Pith/arXiv arXiv 2021
[20]

arXiv preprint arXiv:2306.08543 (2023)

Gu, Y., Dong, L., Wei, F., Huang, M.: Minillm: Knowledge distillation of large language models. arXiv preprint arXiv:2306.08543 (2023)

Pith/arXiv arXiv 2023
[21]

arXiv preprint arXiv:2508.15884 (2025)

Gu, Y., Hu, Q., Yang, S., Xi, H., Chen, J., Han, S., Cai, H.: Jet-nemotron: Efficient language model with post neural architecture search. arXiv preprint arXiv:2508.15884 (2025)

arXiv 2025
[22]

arXiv preprint arXiv:2009.03300 (2020)

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., Stein- hardt, J.: Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300 (2020)

Pith/arXiv arXiv 2009
[23]

arXiv preprint arXiv:1503.02531 (2015)

Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)

Pith/arXiv arXiv 2015
[24]

Jin, Q., Dhingra, B., Liu, Z., Cohen, W., Lu, X.: Pubmedqa: A dataset for biomedi- calresearchquestionanswering.In:Proceedingsofthe2019conferenceonempirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). pp. 2567–2577 (2019)

2019
[25]

In: International conference on machine learning

Katharopoulos, A., Vyas, A., Pappas, N., Fleuret, F.: Transformers are rnns: Fast autoregressive transformers with linear attention. In: International conference on machine learning. pp. 5156–5165. PMLR (2020)

2020
[26]

In: The Thirteenth International Conference on Learning Representations (2025)

Lenz, B., Lieber, O., Arazi, A., Bergman, A., Manevich, A., Peleg, B., Aviram, B., Almagor, C., Fridman, C., Padnos, D., et al.: Jamba: Hybrid transformer- mamba language models. In: The Thirteenth International Conference on Learning Representations (2025)

2025
[27]

arXiv preprint arXiv:2503.13299 (2025)

Liu, Y., Yu, J., Xu, Y., Li, Z., Zhu, Q.: A survey on transformer context extension: Approaches and evaluation. arXiv preprint arXiv:2503.13299 (2025)

Pith/arXiv arXiv 2025
[28]

arXiv preprint arXiv:1809.02789 (2018) Efficient Task-Specific Hybrid Attention Model Construction 15

Mihaylov, T., Clark, P., Khot, T., Sabharwal, A.: Can a suit of armor con- duct electricity? a new dataset for open book question answering. arXiv preprint arXiv:1809.02789 (2018) Efficient Task-Specific Hybrid Attention Model Construction 15

Pith/arXiv arXiv 2018
[29]

arXiv preprint arXiv:2305.13048 (2023)

Peng, B., Alcaide, E., Anthony, Q., Albalak, A., Arcadinho, S., Biderman, S., Cao, H., Cheng, X., Chung, M., Grella, M., et al.: Rwkv: Reinventing rnns for the transformer era. arXiv preprint arXiv:2305.13048 (2023)

Pith/arXiv arXiv 2023
[30]

arXiv preprint arXiv:2408.01129 (2024)

Qu, H., Ning, L., An, R., Fan, W., Derr, T., Liu, H., Xu, X., Li, Q.: A survey of mamba. arXiv preprint arXiv:2408.01129 (2024)

Pith/arXiv arXiv 2024
[31]

arXiv preprint arXiv:2406.07522 (2024)

Ren, L., Liu, Y., Lu, Y., Shen, Y., Liang, C., Chen, W.: Samba: Simple hybrid state space models for efficient unlimited context language modeling. arXiv preprint arXiv:2406.07522 (2024)

Pith/arXiv arXiv 2024
[32]

Communications of the ACM64(9), 99–106 (2021)

Sakaguchi, K., Bras, R.L., Bhagavatula, C., Choi, Y.: Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM64(9), 99–106 (2021)

2021
[33]

In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Su,D.,Kong,K.,Lin,Y.,Jennings,J.,Norick,B.,Kliegl,M.,Patwary,M.,Shoeybi, M., Catanzaro, B.: Nemotron-cc: Transforming common crawl into a refined long- horizon pretraining dataset. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 2459– 2475 (2025)

2025
[34]

arXiv preprint arXiv:2307.08621 (2023)

Sun, Y., Dong, L., Huang, S., Ma, S., Xia, Y., Xue, J., Wang, J., Wei, F.: Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621 (2023)

Pith/arXiv arXiv 2023
[35]

In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Talmor, A., Herzig, J., Lourie, N., Berant, J.: Commonsenseqa: A question an- swering challenge targeting commonsense knowledge. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4149–4158 (2019)

[36]

Tang, Y., Wang, Y., Guo, J., Tu, Z., Han, K., Hu, H., Tao, D.: A survey on transformer compression. arXiv preprint arXiv:2402.05964 (2024)

[37]

Team, Q., et al.: Qwen2 technical report. arXiv preprint arXiv:2407.10671 (2024)

[38]

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I., et al.: Attention is all you need. Advances in neural information processing systems 30(1), 261–272 (2017)

[39]

Waleffe, R., Byeon, W., Riach, D., Norick, B., Korthikanti, V., Dao, T., Gu, A., Hatamizadeh, A., Singh, S., Narayanan, D., et al.: An empirical study of mamba-based language models. arXiv preprint arXiv:2406.07887 (2024)

[40]

Wang, D., Zhu, R.J., Abreu, S., Shan, Y., Kergan, T., Pan, Y., Chou, Y., Li, Z., Zhang, G., Huang, W., et al.: A systematic analysis of hybrid linear attention. arXiv preprint arXiv:2507.06457 (2025)

[41]

Wang, J., Paliotta, D., May, A., Rush, A., Dao, T.: The mamba in the llama: Distilling and accelerating hybrid models. Advances in Neural Information Processing Systems 37, 62432–62457 (2024)

[42]

Welbl, J., Liu, N.F., Gardner, M.: Crowdsourcing multiple choice science questions. arXiv preprint arXiv:1707.06209 (2017)

[43]

Yang, M., Rezagholizadeh, M., Li, G., Appia, V., Barsoum, E.: Zebra-llama: Towards extremely efficient hybrid models. arXiv preprint arXiv:2505.17272 (2025)

[44]

Yang, S., Kautz, J., Hatamizadeh, A.: Gated delta networks: Improving mamba2 with delta rule. arXiv preprint arXiv:2412.06464 (2024)

[45]

Yang, S., Wang, B., Shen, Y., Panda, R., Kim, Y.: Gated linear attention transformers with hardware-efficient training. In: International Conference on Machine Learning. pp. 56501–56523. PMLR (2024)

