
arxiv: 2410.13846 · v3 · pith:BO4RFU3Nnew · submitted 2024-10-17 · 💻 cs.CL · cs.AI · cs.LG

LightTransfer: Your Long-Context LLM is Secretly a Hybrid Model with Effortless Adaptation

Pith reviewed 2026-05-23 18:28 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords hybrid models · lazy layers · streaming attention · long-context LLMs · KV cache · throughput improvement · transformer adaptation · reasoning models

The pith

Pretrained LLMs can be converted to hybrid models by replacing full attention in lazy layers with streaming attention, with little or no retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that some layers in long-context transformers attend mainly to recent or initial tokens, and that full attention in these layers can be replaced by streaming attention to create hybrid models. This change works without training for long-context understanding tasks and with minimal fine-tuning for reasoning tasks. The result is up to 2.17 times higher throughput with a performance drop below 1.5 percent on LongBench, plus 53.3 percent on AIME24 for the o1-like reasoning model QwQ-STILL. A sympathetic reader would care because the method makes existing models more efficient without designing or training new architectures from scratch.

Core claim

Transformer models can be converted into hybrid models by identifying lazy layers that attend mainly to recent or initial tokens and replacing their full attention with streaming attention. The conversion works without training for long-context understanding tasks and with minimal fine-tuning for o1-like long reasoning, delivering up to 2.17 times higher throughput with a performance drop below 1.5 percent on LongBench, and reaching 53.3 percent on AIME24 with the o1-like reasoning model QwQ-STILL.

What carries the argument

Lazy layers, defined as those focusing on recent or initial tokens, whose full attention is replaced by streaming attention to form hybrid models.
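
As a minimal sketch of what "focusing on recent or initial tokens" could mean operationally, the NumPy snippet below builds a streaming-attention mask (a few initial "sink" tokens plus a sliding window of recent tokens) and measures how much of a layer's attention mass falls inside it. The names `streaming_mask` and `lazy_ratio`, the parameters `n_sink` and `n_recent`, and the choice to average over all queries are illustrative assumptions, not the paper's exact definition.

```python
import numpy as np


def streaming_mask(seq_len: int, n_sink: int, n_recent: int) -> np.ndarray:
    """Boolean (query, key) mask for causal streaming attention.

    Query i may attend to key j when j <= i and j is either one of the
    first `n_sink` tokens or among the `n_recent` most recent tokens.
    """
    q = np.arange(seq_len)[:, None]   # query positions as a column
    k = np.arange(seq_len)[None, :]   # key positions as a row
    causal = k <= q
    is_sink = k < n_sink
    is_recent = (q - k) < n_recent
    return causal & (is_sink | is_recent)


def lazy_ratio(attn: np.ndarray, n_sink: int, n_recent: int) -> float:
    """Mean share of attention mass that lands on sink or recent keys.

    `attn` is a (seq_len, seq_len) causal attention matrix whose rows
    sum to 1. A value near 1.0 suggests the layer would lose little if
    restricted to streaming attention (hypothetical criterion).
    """
    mask = streaming_mask(attn.shape[0], n_sink, n_recent)
    return float((attn * mask).sum(axis=-1).mean())
```

A layer whose ratio stays close to 1.0 across prompts would be a candidate lazy layer under this assumed criterion; the threshold for "close" is a free parameter that the page does not specify.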

If this is right

  • Even with half the layers replaced, throughput improves by up to 2.17 times.
  • Performance loss stays below 1.5 percent on LongBench across tested models.
  • The o1-like reasoning model QwQ-STILL reaches 53.3 percent on the AIME24 math benchmark after the change and minimal fine-tuning.
  • The method applies to multiple model families including LLaMA, Mistral, and QwQ-STILL.
  • Transformation needs no training for understanding tasks and only light fine-tuning for reasoning.
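
To make the memory side of these bullets concrete, here is a back-of-the-envelope sketch of KV-cache size before and after replacing half the layers. The configuration (32 layers, 8 KV heads, head dimension 128, 16-bit cache, 32K-token context, 1,024 retained tokens per lazy layer) is an assumed LLaMA-3-8B-like setup chosen for illustration; the helper names are hypothetical. The arithmetic covers memory only and does not derive the reported 2.17 times throughput figure, which is an empirical measurement.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2 accounts for storing both the key and the value tensors.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem


def hybrid_kv_cache_bytes(n_layers, n_lazy, n_kv_heads, head_dim,
                          context_len, window, bytes_per_elem=2):
    """Cache size when `n_lazy` layers keep only `window` tokens each."""
    full_layers = kv_cache_bytes(
        n_layers - n_lazy, n_kv_heads, head_dim, context_len, bytes_per_elem)
    lazy_layers = kv_cache_bytes(
        n_lazy, n_kv_heads, head_dim, min(window, context_len), bytes_per_elem)
    return full_layers + lazy_layers


# Assumed LLaMA-3-8B-like shape; not taken from the page.
CONFIG = dict(n_kv_heads=8, head_dim=128)

full = kv_cache_bytes(32, n_tokens=32_768, **CONFIG)
hybrid = hybrid_kv_cache_bytes(
    32, 16, context_len=32_768, window=1_024, **CONFIG)

print(f"full: {full / 2**30:.2f} GiB, hybrid: {hybrid / 2**30:.2f} GiB, "
      f"ratio: {hybrid / full:.3f}")
```

Under these assumptions the full cache is 4 GiB per sequence and the hybrid cache is slightly over 2 GiB, so the saving approaches one half as the context grows relative to the window.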

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Many existing models may already contain underused layers that could be optimized this way without redesign.
  • Hybrid architectures might become standard by retrofitting rather than building new models.
  • Further gains could come from applying similar identification to other components like feed-forward layers.
  • The approach might generalize to other efficiency methods such as selective KV cache eviction in lazy layers.

Load-bearing premise

Layers identified as lazy by their attention focus can have full attention swapped for streaming attention without harming the model's core capabilities.

What would settle it

A test where lazy layers are accurately identified but replacing their attention causes a drop of more than 5 percent in accuracy on LongBench or similar long-context tasks.

Figures

Figures reproduced from arXiv: 2410.13846 by Chao Du, Cunxiao Du, Fengzhuo Zhang, Min Lin, Tianyu Pang, Wei Gao, Xuan Zhang.

Figure 1. (a) A standard transformer architecture. (b) A hybrid model in which certain layers of a standard transformer are replaced with more memory-efficient designs. LightTransfer identifies lazy layers in (a) and transforms them into more efficient variants, yielding (b).
Figure 2. Visualization of attention weight distributions on LLaMA3-8B. Left: The attention patterns across different layers. Right: Each cell represents an attention weight from each token (x-axis) to the initial tokens and the most recent tokens during both the prefilling and decoding stages. Layers that predominantly attend to these tokens are outlined in black boxes.
Figure 3. The framework of LightTransfer-Test. A priority queue is maintained during the prefilling stage to store the lazy ratio and corresponding layer index after processing each layer. Once the queue reaches its capacity, the layer with the highest lazy ratio is identified as a lazy layer, and its KV cache is reduced, freeing memory for storing the KV cache of the current layer.
Figure 4. Performance comparison of LightTransfer and the standard model on NIAH tasks using Mistral-7B-Instruct. The transferred hybrid architectures can preserve strong long-context understanding capability.
Figure 5. Lazy ratio scores across layers in QwQ-32B-STILL.
Figure 7. Different layer replacement strategies and their performance on LLaMA3-8B-Instruct: 1) Standard: Use standard attention in all layers. 2) Our LightTransfer: Dynamically identify lazy layers on the fly, and replace their attention mechanism accordingly. 3) Pyramid: Replace each layer with memory-efficient attention; the budget decreases with depth, forming a pyramid-like structure. 4) Random: Randomly re…
Figure 9. Additional examples of layer behavior across tokens.
Figure 8. Comparison of SnapKV and SnapKV+LightTransfer.
Original abstract

Scaling language models to handle longer contexts introduces substantial memory challenges due to the growing cost of key-value (KV) caches. Motivated by the efficiency gains of hybrid models and the broad availability of pretrained large transformer backbones, we explore transitioning transformer models into hybrid architectures for a more efficient generation. In this work, we propose LightTransfer, a lightweight method that transforms models such as LLaMA into hybrid variants. Our approach identifies lazy layers -- those focusing on recent or initial tokens -- and replaces their full attention with streaming attention. This transformation can be performed without any training for long-context understanding tasks or with minimal fine-tuning for o1-like long reasoning generation tasks that require stronger reasoning capabilities. Experiments across diverse benchmarks and models (e.g., LLaMA, Mistral, QwQ-STILL) demonstrate that, even when half of the layers are identified as lazy, LightTransfer achieves up to 2.17× throughput improvement with minimal performance loss (<1.5% on LongBench) and achieves 53.3% on math benchmark AIME24 of advanced o1-like long reasoning model QwQ-STILL.
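
The on-the-fly selection described in the Figure 3 caption above can be sketched with a bounded max-heap. This is a sketch under stated assumptions: the function name `select_lazy_layers`, the parameter `full_capacity` (how many layers may keep a full KV cache), and the rule of evicting as soon as the queue exceeds that capacity are illustrative choices, and the lazy ratios are taken as given inputs rather than computed from a model.

```python
import heapq


def select_lazy_layers(lazy_ratios, full_capacity):
    """Pick lazy layers online while layers are processed in order.

    lazy_ratios:   per-layer scores, visited in layer order (prefilling).
    full_capacity: number of layers allowed to keep a full KV cache.
    Returns the sorted indices of layers whose cache would be reduced.
    """
    heap = []   # max-heap emulated by pushing negated ratios
    lazy = []
    for layer, ratio in enumerate(lazy_ratios):
        heapq.heappush(heap, (-ratio, layer))
        if len(heap) > full_capacity:
            # Queue is over budget: the layer with the highest lazy
            # ratio seen so far gives up its full KV cache.
            _, evicted = heapq.heappop(heap)
            lazy.append(evicted)
    return sorted(lazy)


# Hypothetical scores for a 6-layer model, keeping 3 full-attention layers.
print(select_lazy_layers([0.2, 0.9, 0.5, 0.95, 0.1, 0.8], full_capacity=3))
```

Because the heap always evicts its current maximum, the layers left with a full cache at the end are the `full_capacity` layers with the lowest ratios, so the online decision coincides with a global top-k selection (up to ties) while never holding more than `full_capacity + 1` full caches at once.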

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces LightTransfer, a method to transform standard transformer-based LLMs (e.g., LLaMA, Mistral, QwQ-STILL) into hybrid architectures by identifying 'lazy layers'—those whose attention focuses primarily on recent or initial tokens—and replacing their full attention with streaming attention. Identification is performed by inspecting attention patterns; the transformation requires no training for long-context tasks or only minimal fine-tuning for reasoning tasks. Experiments claim up to 2.17× throughput improvement with <1.5% drop on LongBench even when half the layers are replaced, plus 53.3% on AIME24.

Significance. If the lazy-layer identification proves robust across distributions, the work would offer a low-cost route to hybrid-model efficiency gains on existing pretrained backbones, directly addressing KV-cache scaling issues in long-context inference. The reported throughput numbers and cross-model results (LLaMA/Mistral/QwQ) constitute concrete empirical support for effortless adaptation.

major comments (3)
  1. [§3] §3 (Lazy Layer Identification): The central claim rests on reliable detection of layers whose attention concentrates on recent/initial tokens, yet no cross-task or cross-prompt stability test is reported for the selected layer set. Without this, the <1.5% LongBench drop could be an artifact of identification tuned to the evaluation distribution.
  2. [§4] §4 (Experiments): No ablation replaces a random or attention-entropy-matched set of layers with streaming attention. Such a control is required to confirm that performance preservation stems from the lazy-layer criterion rather than incidental properties of the model or task.
  3. [Results] Results on AIME24: The 53.3% figure for QwQ-STILL after minimal fine-tuning is presented without a matched baseline (full-attention QwQ-STILL) or details on the fine-tuning data distribution, leaving the incremental benefit of the LightTransfer swap unclear.
minor comments (2)
  1. [Methods] The precise definition and implementation of 'streaming attention' (e.g., window size, eviction policy) should be stated explicitly in the methods rather than assumed from prior work.
  2. [Experiments] Throughput and accuracy tables would be strengthened by reporting standard deviations over multiple random seeds, since the abstract's numerical claims are given as point estimates.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

Thank you for the detailed review and constructive feedback. We appreciate the opportunity to strengthen the manuscript and address each major comment below. We will incorporate revisions to improve clarity and robustness where appropriate.

  1. Referee: [§3] §3 (Lazy Layer Identification): The central claim rests on reliable detection of layers whose attention concentrates on recent/initial tokens, yet no cross-task or cross-prompt stability test is reported for the selected layer set. Without this, the <1.5% LongBench drop could be an artifact of identification tuned to the evaluation distribution.

    Authors: We thank the referee for highlighting this. The lazy-layer identification was derived from attention patterns observed on a diverse set of long-context examples, and the selected layers showed consistent behavior across the LongBench tasks we evaluated. However, we agree that explicit cross-task and cross-prompt stability analysis would strengthen the claim. In the revised manuscript, we will add a new subsection with overlap statistics for the identified lazy layers across different LongBench subsets, prompt lengths, and task types to demonstrate that the selection is not tuned to the evaluation distribution. revision: yes

  2. Referee: [§4] §4 (Experiments): No ablation replaces a random or attention-entropy-matched set of layers with streaming attention. Such a control is required to confirm that performance preservation stems from the lazy-layer criterion rather than incidental properties of the model or task.

    Authors: This is a fair criticism and a useful control. We will add the requested ablation in the revised experiments section: we will replace an equal number of randomly chosen layers (and separately, layers matched on attention entropy) with streaming attention and report the resulting performance on LongBench. This will allow direct comparison to our lazy-layer selection and help isolate the contribution of the identification criterion. revision: yes

  3. Referee: [Results] Results on AIME24: The 53.3% figure for QwQ-STILL after minimal fine-tuning is presented without a matched baseline (full-attention QwQ-STILL) or details on the fine-tuning data distribution, leaving the incremental benefit of the LightTransfer swap unclear.

    Authors: We agree that the presentation can be improved. In the revision we will report the AIME24 score of the original full-attention QwQ-STILL model under identical evaluation conditions and provide additional details on the fine-tuning data distribution, number of examples, and training hyperparameters used for the hybrid model. This will make the incremental effect of the LightTransfer replacement clearer. revision: yes
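
The two checks promised in responses 1 and 2 are simple to state in code. The sketch below, with hypothetical helper names, computes the mean pairwise Jaccard overlap between lazy-layer sets selected on different task subsets, and draws a size-matched random layer set to serve as a control. It is an illustration of the analysis described, not the authors' evaluation code.

```python
import random
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two collections of layer indices."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_jaccard(selections):
    """Average Jaccard overlap over all pairs of lazy-layer selections."""
    pairs = list(combinations(selections, 2))
    if not pairs:
        raise ValueError("need at least two selections to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def random_control(n_layers, k, seed=0):
    """Size-matched random layer set for the ablation control."""
    return sorted(random.Random(seed).sample(range(n_layers), k))
```

An overlap near 1.0 across subsets, prompt lengths, and task types would indicate that the selection is a stable property of the model, while a clear gap between lazy-layer and random-control accuracy would attribute the preserved performance to the criterion itself.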

Circularity Check

0 steps flagged

No circularity: empirical identification and replacement method


The paper describes an empirical procedure: inspect attention patterns to label layers as lazy, then replace full attention with streaming attention (with optional minimal fine-tuning). No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. Claims rest on benchmark measurements (LongBench, AIME24) rather than any derivation that reduces to its own inputs by construction. This is the common case of a self-contained empirical transformation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the newly introduced concept of lazy layers whose attention can be safely downgraded, plus the empirical claim that identification works across models.

free parameters (1)
  • lazy-layer identification threshold or criterion
    Used to decide which layers count as lazy; value or rule not stated in abstract.
axioms (1)
  • domain assumption: Streaming attention in lazy layers preserves sufficient information for downstream task performance
    Invoked to justify the replacement without training.
invented entities (1)
  • lazy layers (no independent evidence)
    purpose: Layers whose attention pattern is limited to recent or initial tokens and can therefore use cheaper attention
    Newly defined category used to select which layers to modify.

pith-pipeline@v0.9.0 · 5756 in / 1142 out tokens · 30457 ms · 2026-05-23T18:28:29.794284+00:00 · methodology


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MoBA: Mixture of Block Attention for Long-Context LLMs

    cs.LG · 2025-02 · unverdicted · novelty 6.0

    MoBA routes attention over blocks via MoE-style gating to enable dynamic, bias-light long-context attention that matches full attention performance at lower cost.

  2. The Pitfalls of KV Cache Compression

    cs.LG · 2025-09 · conditional · novelty 5.0

    KV cache compression causes certain instructions to degrade rapidly and be ignored in multi-instruction prompting, with system prompt leakage worsened by method choice, instruction order, and eviction bias; simple pol...
