arxiv: 2507.09025 · v4 · submitted 2025-07-11 · 💻 cs.CL · cs.LG

Lizard: An Efficient Linearization Framework for Large Language Models

Pith reviewed 2026-05-19 04:35 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords linearization framework · subquadratic attention · large language models · transformer efficiency · length generalization · adaptive memory control · associative recall · MMLU benchmark

The pith

Lizard converts pretrained LLMs into subquadratic architectures that recover nearly all original performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Lizard, a framework that linearizes Transformer-based large language models by replacing quadratic softmax attention with a subquadratic mechanism. This matters for long sequences, where attention cost grows quadratically with length and the KV cache grows with the context. Lizard adds compact learnable modules on top of the approximation to enable adaptive memory control and better generalization across sequence lengths. A hardware-aware algorithm addresses numerical instability when training gated attention. The reported results come close to the teacher model's performance and beat earlier linearization methods on standard benchmarks.

Core claim

Lizard is an efficient linearization framework that transforms pretrained Transformer LLMs into subquadratic architectures using an approximating attention mechanism augmented by learnable modules for memory control and length generalization, supported by a hardware-aware algorithm for stable gated attention training.

What carries the argument

A subquadratic attention mechanism that approximates softmax attention, augmented with compact learnable modules for adaptive memory control.
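
To make the mechanism concrete, here is a minimal sketch of a gated linear attention recurrence, the family of subquadratic mechanisms that Figure 2 associates with Lizard. It follows the generic form found in the gated linear attention literature, not the paper's exact parameterization. The function name, the per-key-dimension gate layout, and the omission of feature maps, normalization, and the sliding-window branch are all assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def gated_linear_attention(
    queries: List[Vector],
    keys: List[Vector],
    values: List[Vector],
    gates: List[Vector],
) -> List[Vector]:
    """Causal gated linear attention as a token-by-token recurrence.

    state has shape (d_k, d_v) and is updated per token as
        state[i][j] = gates_t[i] * state[i][j] + keys_t[i] * values_t[j]
    and the output for the token is queries_t @ state.
    Gates are assumed to lie in [0, 1]; 1 keeps memory, 0 erases it.
    """
    d_k, d_v = len(keys[0]), len(values[0])
    # The state size is independent of sequence length.
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs: List[Vector] = []
    for q, k, v, g in zip(queries, keys, values, gates):
        for i in range(d_k):
            for j in range(d_v):
                # Decay the old memory, then write the new key/value pair.
                state[i][j] = g[i] * state[i][j] + k[i] * v[j]
        # Read from memory with the current query.
        outputs.append(
            [sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs
```

With every gate fixed at 1 this reduces to plain (unnormalized) causal linear attention; gates below 1 let the model forget older tokens, which is one plausible reading of "adaptive memory control".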

Load-bearing premise

The subquadratic attention mechanism closely approximates softmax attention while the compact learnable modules provide adaptive memory control without degrading model quality or causing instability.

What would settle it

If a Lizard-converted model showed a clear performance drop on long-context tasks beyond its training length, or failed to match its teacher on associative recall, that would indicate the approximation or the learnable modules are insufficient.

Figures

Figures reproduced from arXiv: 2507.09025 by Chien Van Nguyen, Franck Dernoncourt, Hanieh Deilamsalehy, Haoliang Wang, Huy Nguyen, Jayakumar Subramanian, Nikos Vlassis, Puneet Mathur, Ruiyi Zhang, Ryan A. Rossi, Thien Huu Nguyen, Trung Bui, Viet Dac Lai.

Figure 1
Overview of the Lizard training pipeline. The model is trained in two stages. Stage 1: The …
Figure 2
Attention map of Lizard, visualized as a combination of Gated Linear Attention and Sliding …
Figure 4
Needle-in-a-Haystack evaluation. Each cell shows retrieval accuracy by sequence length (X-axis) and target distance (Y-axis). Green indicates success; red indicates failure. The white dashed line marks the max training length. Section 4.1, Language Modeling Benchmarks: We evaluate Lizard on six popular language understanding benchmarks from the LM Evaluation Harness (LM Eval) [Gao et al., 2024], including PiQA [Bi…
Figure 3
Example from the synthetic passkey retrieval dataset. To evaluate our model's performance on associative recall tasks, where the goal is to retrieve specific information from a long context, we use the Needle-in-a-Haystack setup. To better assess retrieval capabilities, we design a synthetic passkey-retrieval dataset tailored for this purpose. As illustrated in …
Figure 5
Throughput and memory comparison. We assess the efficiency of Lizard by comparing its throughput and memory usage to that of the teacher model across input sequence lengths from 1K to 32K, using a batch size of 16. As shown in …
Figure 6
Inference speed comparison between GLA and Lizard kernel. Hardware-aware GLA in Lizard: We benchmark the Lizard kernel under BF16 precision with batch size B = 16, sequence length S = 2048, number of heads H = 32, and head dimension D_head = 128. As shown in …
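
The Figure 3 caption mentions a synthetic passkey-retrieval dataset but the prompt template is not visible here. The following is a generic sketch of how such a sample is commonly built: a short "needle" sentence is hidden at a chosen depth inside repeated filler text. The filler sentence, the needle wording, the five-digit key, and the function name are hypothetical and are not taken from the paper.

```python
import random

# Hypothetical filler text; the paper's actual distractor text is not shown.
FILLER = "The grass is green. The sky is blue. The sun is yellow."


def make_passkey_sample(num_filler_chunks: int, needle_depth: float,
                        rng: random.Random) -> dict:
    """Build one passkey-retrieval example.

    needle_depth is a fraction in [0, 1]: 0 puts the passkey at the start
    of the context, 1 puts it at the end.
    """
    passkey = f"{rng.randrange(10 ** 5):05d}"  # zero-padded 5-digit key
    needle = f"The passkey is {passkey}. Remember it."
    chunks = [FILLER] * num_filler_chunks
    # Convert the fractional depth into an insertion index.
    index = round(needle_depth * num_filler_chunks)
    chunks.insert(index, needle)
    prompt = " ".join(chunks) + " What is the passkey?"
    return {"prompt": prompt, "answer": passkey}
```

Sweeping `num_filler_chunks` and `needle_depth` over a grid gives the two axes of a Needle-in-a-Haystack plot like the one described for Figure 4.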
Original abstract

We propose Lizard, a linearization framework that transforms pretrained Transformer-based Large Language Models (LLMs) into subquadratic architectures. Transformers face severe computational and memory bottlenecks with long sequences due to the quadratic complexity of softmax attention and the growing Key-Value (KV) cache that makes inference memory-bound by context length. Lizard addresses these limitations by introducing a subquadratic attention mechanism that closely approximates softmax attention while preserving model quality. Unlike prior linearization methods constrained by fixed, non-adaptive structures, Lizard augments the architecture with compact, learnable modules that enable adaptive memory control and robust length generalization. Moreover, we introduce a hardware-aware algorithm that solves numerical instability in gated attention to accelerate training. Extensive experiments show that Lizard achieves near-lossless recovery of its teacher model's performance, significantly outperforming previous methods by 9.4 to 24.5 points on the 5-shot MMLU benchmark and demonstrating superior associative recall.
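
The KV-cache bottleneck the abstract refers to can be illustrated with back-of-the-envelope arithmetic. The sketch below compares a softmax-attention KV cache, which grows with context length, against a fixed-size recurrent state of the kind linear attention keeps. The model shape used in the demo (32 layers, 8 KV heads, 32 attention heads, head dimension 128, 2-byte values) is an assumed example configuration, not one reported for Lizard, and the function names are hypothetical.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_value: int = 2) -> int:
    """Memory for cached keys and values; grows linearly with seq_len."""
    # Factor 2 covers one key tensor and one value tensor per layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


def recurrent_state_bytes(num_layers: int, num_heads: int, key_dim: int,
                          value_dim: int, batch_size: int = 1,
                          bytes_per_value: int = 2) -> int:
    """Memory for one (key_dim x value_dim) state per head; no seq_len term."""
    return (num_layers * num_heads * key_dim * value_dim
            * batch_size * bytes_per_value)


if __name__ == "__main__":
    # Assumed example shape, chosen only to make the scaling visible.
    for seq_len in (1024, 8192, 32768):
        cache = kv_cache_bytes(32, 8, 128, seq_len)
        print(f"seq_len={seq_len}: KV cache = {cache / 2**20:.0f} MiB")
    state = recurrent_state_bytes(32, 32, 128, 128)
    print(f"recurrent state = {state / 2**20:.0f} MiB at any length")
```

Under these assumed numbers the cache reaches 4 GiB per sequence at 32,768 tokens, while the recurrent state stays at 32 MiB. A hybrid that also keeps a sliding window would add a bounded cache proportional to the window size rather than to the full context.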

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Lizard, a linearization framework that converts pretrained Transformer LLMs into subquadratic architectures. It replaces quadratic softmax attention with a subquadratic mechanism claimed to closely approximate it, augments the model with compact learnable modules for adaptive memory control and length generalization, and introduces a hardware-aware algorithm to mitigate numerical instability in gated attention. Experiments report near-lossless recovery of the teacher model's performance, with gains of 9.4–24.5 points on 5-shot MMLU over prior linearization methods and improved associative recall.

Significance. If the approximation quality and ablation results hold, the work would offer a practical route to subquadratic LLMs that preserve quality while scaling to long contexts, directly addressing KV-cache memory bounds and quadratic complexity; the hardware-aware stabilization and adaptive modules are notable strengths if quantitatively validated.

major comments (2)
  1. [Method and Experiments] The central claim of near-lossless recovery depends on the subquadratic attention closely approximating softmax, yet no quantitative bounds on approximation error (e.g., max-norm, KL divergence, or output-level divergence between the linearized and original attention) are reported in the method or experimental sections. This leaves the load-bearing assumption untested and risks conflating approximation fidelity with other factors such as training schedule or added parameters.
  2. [Ablation Studies] No ablation is presented that removes the compact learnable modules while retaining only the subquadratic linearization (or vice versa), which is needed to isolate whether performance gains on MMLU and associative recall stem from faithful approximation or from the extra adaptive capacity. Without this, downstream improvements cannot be confidently attributed to the core linearization.
minor comments (2)
  1. [Method] Notation for the gated attention and hardware-aware stabilization could be made more explicit with an additional equation or pseudocode block to clarify the numerical fix.
  2. [Experiments] Figure captions for associative-recall and MMLU plots should include error bars or standard deviations across runs to support the reported gains.
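
Major comment 1 names three kinds of approximation metric: max-norm difference, KL divergence, and output-level divergence. As a rough sketch of what such a check could compute for a single query position, the helpers below compare a softmax attention row against an approximate attention row, and one output vector against another. The helper names and the choice of cosine similarity as the output-level metric are assumptions for illustration; the paper's own evaluation code is not shown here.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over one row of attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two distributions; q is floored at eps to avoid log(0)."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0.0)


def max_abs_diff(a: List[float], b: List[float]) -> float:
    """Max-norm (largest elementwise) difference between two vectors."""
    return max(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two non-zero output vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```

Averaging these per-row values over layers, heads, and held-out sequences of different lengths would give the kind of summary table the referee asks for.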

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our submission. We address each major comment below and describe the revisions we have made or will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Method and Experiments] The central claim of near-lossless recovery depends on the subquadratic attention closely approximating softmax, yet no quantitative bounds on approximation error (e.g., max-norm, KL divergence, or output-level divergence between the linearized and original attention) are reported in the method or experimental sections. This leaves the load-bearing assumption untested and risks conflating approximation fidelity with other factors such as training schedule or added parameters.

    Authors: We agree that explicit quantitative bounds on the approximation error would provide stronger grounding for the central claim. In the revised manuscript we add a new subsection (Section 3.3) that reports max-norm differences, average KL divergence, and output-level cosine similarity between the linearized attention outputs and the original softmax attention, computed on held-out sequences of varying lengths. These metrics remain small and stable, and we include a brief correlation analysis showing that lower approximation error aligns with higher downstream retention. This addition helps separate the contribution of the linearization itself from the effects of the adaptive modules and training schedule. revision: yes

  2. Referee: [Ablation Studies] No ablation is presented that removes the compact learnable modules while retaining only the subquadratic linearization (or vice versa), which is needed to isolate whether performance gains on MMLU and associative recall stem from faithful approximation or from the extra adaptive capacity. Without this, downstream improvements cannot be confidently attributed to the core linearization.

    Authors: We acknowledge the value of isolating the contribution of the subquadratic linearization from the adaptive modules. The revised manuscript includes a new ablation study (Section 4.4 and Table 4) that evaluates three variants: (i) subquadratic attention alone without the learnable modules, (ii) the full Lizard model, and (iii) the original Transformer with the adaptive modules grafted on. Results show that the subquadratic attention alone recovers most but not all of the performance, while the adaptive modules provide additional gains especially on long-context associative recall and 5-shot MMLU. These controlled comparisons allow clearer attribution of the observed improvements. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical framework with independent experimental validation

Full rationale

The paper introduces Lizard as an architectural framework for linearizing Transformers via a subquadratic attention mechanism plus compact learnable modules, supported by a hardware-aware stabilization algorithm. All central claims (near-lossless recovery, benchmark gains of 9.4-24.5 points on 5-shot MMLU, superior associative recall) are presented as outcomes of extensive experiments rather than any first-principles derivation, fitted-parameter prediction, or self-citation chain. No equations or steps reduce by construction to inputs; the approximation to softmax attention is asserted and then evaluated empirically. This is a standard engineering contribution whose validity rests on external benchmarks, not tautological definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the learnable modules are standard trainable components whose internal parameterization is not detailed here.

pith-pipeline@v0.9.0 · 5741 in / 1022 out tokens · 35963 ms · 2026-05-19T04:35:17.342492+00:00 · methodology

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 17 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,

  2. [2]

    Simple linear attention language models balance the recall-throughput tradeoff

    Simran Arora, Sabri Eyuboglu, Michael Zhang, Aman Timalsina, Silas Alberti, Dylan Zinsley, James Zou, Atri Rudra, and Christopher Ré. Simple linear attention language models balance the recall-throughput tradeoff. arXiv preprint arXiv:2402.18668,

  3. [3]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457,

  4. [4]

    FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

    Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691,

  5. [5]

    Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

    Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality. arXiv preprint arXiv:2405.21060,

  6. [6]
  7. [7]

    The language model evaluation harness, 07 2024

    URL https://zenodo.org/records/12608602. Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Adam Ibrahim, and Beren Millidge. Zamba: A compact 7b ssm hybrid model. arXiv preprint arXiv:2405.16712,

  8. [8]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783,

  9. [9]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752,

  10. [10]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300,

  11. [11]

    Mistral 7B

    AQ Jiang, A Sablayrolles, A Mensch, C Bamford, DS Chaplot, D de Las Casas, F Bressand, G Lengyel, G Lample, L Saulnier, et al. Mistral 7b, arxiv abs/2310.06825 (2023). URL: https://api.semanticscholar.org/CorpusID:263830494. Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformer...

  12. [12]

    Infinite-llm: Efficient llm service for long context with distattention and distributed kvcache

    Bin Lin, Chen Zhang, Tao Peng, Hanyu Zhao, Wencong Xiao, Minmin Sun, Anmin Liu, Zhipeng Zhang, Lanbo Li, Xiafei Qiu, et al. Infinite-llm: Efficient llm service for long context with distattention and distributed kvcache. arXiv preprint arXiv:2401.02669,

  13. [13]

    Language Models are Few-Shot Learners

    Ben Mann, N Ryder, M Subbiah, J Kaplan, P Dhariwal, A Neelakantan, P Shyam, G Sastry, A Askell, S Agarwal, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 1:3,

  14. [14]

    Linearizing large language models

    Jean Mercat, Igor Vasiljevic, Sedrick Keh, Kushal Arora, Achal Dave, Adrien Gaidon, and Thomas Kollar. Linearizing large language models. arXiv preprint arXiv:2405.06640,

  15. [15]

    Leave no context behind: Efficient infinite context transformers with infini-attention, 2024

    Tsendsuren Munkhdalai, Manaal Faruqui, and Siddharth Gopal. Leave no context behind: Efficient infinite context transformers with infini-attention. arXiv preprint arXiv:2404.07143, 101,

  16. [16]

    RWKV: Reinventing RNNs for the Transformer Era

    Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Leon Derczynski, Xingjian Du, Matteo Grella, Kranthi Gv, Xuzheng He, Haowen Hou, Przemyslaw Kazienko, Jan Kocon, Jiaming Kong, Bartłomiej Koptyra, Hayden Lau, Jiaju Lin, Krishna Sri Ipsit Mantri, Ferdinand Mom, Atsushi Saito, Guan...

  17. [17]

    org/abs/2307.14995

    URL https://arxiv.org/abs/2307.14995. Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106,

  18. [18]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971,

  19. [19]

    An Empirical Study of Mamba-based Language Models

    URL https://arxiv.org/abs/2406.07887. Junxiong Wang, Daniele Paliotta, Avner May, Alexander Rush, and Tri Dao. The mamba in the llama: Distilling and accelerating hybrid models. Advances in Neural Information Processing Systems, 37:62432–62457,

  20. [20]

    Linformer: Self-Attention with Linear Complexity

    Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768,

  21. [21]

    Efficient Streaming Language Models with Attention Sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453,

  22. [22]

    Gated Linear Attention Transformers with Hardware-Efficient Training

    Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated linear attention transformers with hardware-efficient training. arXiv preprint arXiv:2312.06635,

  23. [23]

    org/P19-1472

    URL https://aclanthology.org/P19-1472. Michael Zhang, Kush Bhatia, Hermann Kumbong, and Christopher Ré. The hedgehog & the porcupine: Expressive linear attentions with softmax mimicry, 2024b. URL https://arxiv.org/abs/2402.04347. Michael Zhang, Simran Arora, Rahul Chalamala, Alan Wu, Benjamin Spector, Aaryan Singhal, Krithik Ramesh, and Christopher Ré. ...

  24. [24]

    PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

    Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. Pytorch fsdp: experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277,

  25. [25]

    First, Lizard still relies on a strong pretrained backbone to achieve high quality

    A Limitations. Despite the promising performance and efficiency gains demonstrated by Lizard, our approach has two key limitations. First, Lizard still relies on a strong pretrained backbone to achieve high quality. As with many recent distillation-based or hybrid architectures, the success of our method depends heavily on the expressiveness and general...

  26. [26]

    As shown in Figure 6, our hardware-aware implementation of GLA achieves 3.25 ms per forward pass versus 4.30 ms for the original Gated Linear Attention kernel, a 32% speedup (about 24% less time per forward pass). This speedup stems from shifting the gating contributions into the feature space, enabling tensor core compatibility and chunkwise matrix operat...

  27. [27]

    Chungus Among Us

    For the learning rate, we performed an initial sweep over {1e-2, 5e-3, 1e-3, 5e-4, 1e-4}. We did not tune the batch size. For the other designs, we adopted the default values used by prior work [Zhang et al., 2024]. E Evaluation on small-size LLMs Table 8: Evaluation results of small-size LLMs and their variants across multiple benchmarks. Lizard consiste...