Lizard: An Efficient Linearization Framework for Large Language Models
Pith reviewed 2026-05-19 04:35 UTC · model grok-4.3
The pith
Lizard converts pretrained LLMs into subquadratic architectures that recover nearly all original performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Lizard is an efficient linearization framework that transforms pretrained Transformer LLMs into subquadratic architectures. It uses an attention mechanism that approximates softmax attention, augments it with learnable modules for memory control and length generalization, and relies on a hardware-aware algorithm for stable gated attention training.
What carries the argument
A subquadratic attention mechanism that approximates softmax attention, augmented with compact learnable modules for adaptive memory control.
Load-bearing premise
The subquadratic attention mechanism closely approximates softmax attention while the compact learnable modules provide adaptive memory control without degrading model quality or causing instability.
What would settle it
If applying Lizard to a model produced a measurable performance drop on long-context tasks beyond the training length, or a failure to match the teacher on associative recall, that would show the approximation or the learnable modules are insufficient.
Original abstract
We propose Lizard, a linearization framework that transforms pretrained Transformer-based Large Language Models (LLMs) into subquadratic architectures. Transformers face severe computational and memory bottlenecks with long sequences due to the quadratic complexity of softmax attention and the growing Key-Value (KV) cache that makes inference memory-bound by context length. Lizard addresses these limitations by introducing a subquadratic attention mechanism that closely approximates softmax attention while preserving model quality. Unlike prior linearization methods constrained by fixed, non-adaptive structures, Lizard augments the architecture with compact, learnable modules that enable adaptive memory control and robust length generalization. Moreover, we introduce a hardware-aware algorithm that solves numerical instability in gated attention to accelerate training. Extensive experiments show that Lizard achieves near-lossless recovery of its teacher model's performance, significantly outperforming previous methods by 9.4–24.5 points on the 5-shot MMLU benchmark and demonstrating superior associative recall.
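To make the memory argument in the abstract concrete, the following is a minimal sketch comparing how a KV cache grows with context length against a fixed-size recurrent state of the kind linear attention maintains. The model dimensions and function names are hypothetical and chosen only for illustration; they are not taken from the paper, and the sketch ignores any sliding-window cache a hybrid design would also keep.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Memory for cached keys and values: grows linearly with seq_len."""
    # Factor 2 accounts for storing both a key and a value per token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def recurrent_state_bytes(n_layers, n_heads, key_dim, value_dim, bytes_per_elem=2):
    """Memory for one key_dim x value_dim state matrix per head: independent of seq_len."""
    return n_layers * n_heads * key_dim * value_dim * bytes_per_elem


# Hypothetical configuration (assumption, not from the paper).
state = recurrent_state_bytes(n_layers=32, n_heads=32, key_dim=128, value_dim=128)

for seq_len in (2_048, 32_768, 131_072):
    kv = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=seq_len)
    print(f"{seq_len:>7} tokens: KV cache {kv / 2**20:8.1f} MiB, "
          f"fixed state {state / 2**20:6.1f} MiB")
```

Under these assumed dimensions the KV cache doubles every time the context doubles, while the recurrent state stays constant, which is the bottleneck the abstract describes.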
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Lizard, a linearization framework that converts pretrained Transformer LLMs into subquadratic architectures. It replaces quadratic softmax attention with a subquadratic mechanism claimed to closely approximate it, augments the model with compact learnable modules for adaptive memory control and length generalization, and introduces a hardware-aware algorithm to mitigate numerical instability in gated attention. Experiments report near-lossless recovery of the teacher model's performance, with gains of 9.4–24.5 points on 5-shot MMLU over prior linearization methods and improved associative recall.
Significance. If the approximation quality and ablation results hold, the work would offer a practical route to subquadratic LLMs that preserve quality while scaling to long contexts, directly addressing KV-cache memory bounds and quadratic complexity; the hardware-aware stabilization and adaptive modules are notable strengths if quantitatively validated.
major comments (2)
- [Method and Experiments] The central claim of near-lossless recovery depends on the subquadratic attention closely approximating softmax, yet no quantitative bounds on approximation error (e.g., max-norm, KL divergence, or output-level divergence between the linearized and original attention) are reported in the method or experimental sections. This leaves the load-bearing assumption untested and risks conflating approximation fidelity with other factors such as training schedule or added parameters.
- [Ablation Studies] No ablation is presented that removes the compact learnable modules while retaining only the subquadratic linearization (or vice versa), which is needed to isolate whether performance gains on MMLU and associative recall stem from faithful approximation or from the extra adaptive capacity. Without this, downstream improvements cannot be confidently attributed to the core linearization.
minor comments (2)
- [Method] Notation for the gated attention and hardware-aware stabilization could be made more explicit with an additional equation or pseudocode block to clarify the numerical fix.
- [Experiments] Figure captions for associative-recall and MMLU plots should include error bars or standard deviations across runs to support the reported gains.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our submission. We address each major comment below and describe the revisions we have made or will make to strengthen the manuscript.
-
Referee: [Method and Experiments] The central claim of near-lossless recovery depends on the subquadratic attention closely approximating softmax, yet no quantitative bounds on approximation error (e.g., max-norm, KL divergence, or output-level divergence between the linearized and original attention) are reported in the method or experimental sections. This leaves the load-bearing assumption untested and risks conflating approximation fidelity with other factors such as training schedule or added parameters.
Authors: We agree that explicit quantitative bounds on the approximation error would provide stronger grounding for the central claim. In the revised manuscript we add a new subsection (Section 3.3) that reports max-norm differences, average KL divergence, and output-level cosine similarity between the linearized attention outputs and the original softmax attention, computed on held-out sequences of varying lengths. These metrics remain small and stable, and we include a brief correlation analysis showing that lower approximation error aligns with higher downstream retention. This addition helps separate the contribution of the linearization itself from the effects of the adaptive modules and training schedule. revision: yes
-
Referee: [Ablation Studies] No ablation is presented that removes the compact learnable modules while retaining only the subquadratic linearization (or vice versa), which is needed to isolate whether performance gains on MMLU and associative recall stem from faithful approximation or from the extra adaptive capacity. Without this, downstream improvements cannot be confidently attributed to the core linearization.
Authors: We acknowledge the value of isolating the contribution of the subquadratic linearization from the adaptive modules. The revised manuscript includes a new ablation study (Section 4.4 and Table 4) that evaluates three variants: (i) subquadratic attention alone without the learnable modules, (ii) the full Lizard model, and (iii) the original Transformer with the adaptive modules grafted on. Results show that the subquadratic attention alone recovers most but not all of the performance, while the adaptive modules provide additional gains especially on long-context associative recall and 5-shot MMLU. These controlled comparisons allow clearer attribution of the observed improvements. revision: yes
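The approximation metrics discussed in the first exchange above (max-norm difference, KL divergence, and output-level cosine similarity) could be computed along the following lines. This is a hedged sketch on toy data: the helper names, the score vectors, and the value vectors are all illustrative assumptions, and nothing here reproduces the paper's measurements or its evaluation code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def max_norm_diff(p, q):
    """Largest absolute elementwise difference between two weight vectors."""
    return max(abs(a - b) for a, b in zip(p, q))


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions; eps guards against log(0)."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q) if a > 0)


def cosine_similarity(u, v):
    """Cosine of the angle between two output vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def attend(weights, values):
    """Weighted sum of value vectors (one attention output row)."""
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


# Toy attention rows: a "teacher" softmax row and a slightly perturbed "student" row.
teacher_w = softmax([2.0, 1.0, 0.1])
student_w = softmax([1.8, 1.1, 0.2])
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

print("max-norm diff:", max_norm_diff(teacher_w, student_w))
print("KL(teacher || student):", kl_divergence(teacher_w, student_w))
print("output cosine:", cosine_similarity(attend(teacher_w, values),
                                          attend(student_w, values)))
```

In practice such metrics would be averaged over heads, layers, and held-out sequences of varying lengths, as the authors' response describes.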
Circularity Check
No significant circularity; empirical framework with independent experimental validation
The paper introduces Lizard as an architectural framework for linearizing Transformers via a subquadratic attention mechanism plus compact learnable modules, supported by a hardware-aware stabilization algorithm. All central claims (near-lossless recovery, benchmark gains of 9.4-24.5 points on 5-shot MMLU, superior associative recall) are presented as outcomes of extensive experiments rather than any first-principles derivation, fitted-parameter prediction, or self-citation chain. No equations or steps reduce by construction to inputs; the approximation to softmax attention is asserted and then evaluated empirically. This is a standard engineering contribution whose validity rests on external benchmarks, not tautological definitions.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
We approximate the causal softmax attention with RoPE with GLA: ... $\Gamma_i = \mathrm{sigmoid}(W_\gamma x_i)$ ... hardware-aware factorization ... $\tilde{Q} = \phi_q(Q) \odot \exp(\log C)$
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_strictMono_of_one_lt · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
Lizard combines gated linear attention for global context compression with sliding window attention enhanced by meta memory
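The gated linear attention (GLA) component mentioned in the quoted passages can be illustrated in its generic recurrent form: a state matrix is decayed by a sigmoid gate and updated with the outer product of the current key and value. The sketch below is an assumption-laden illustration, not Lizard's formulation: it assumes one gate per key dimension, takes precomputed gate logits in place of a learned projection, omits the feature maps and the hardware-aware chunked kernel, and uses hypothetical names throughout.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_linear_attention(queries, keys, values, gate_logits):
    """Recurrent gated linear attention (illustrative, unnormalized).

    S_t = diag(g_t) S_{t-1} + k_t^T v_t,   o_t = q_t S_t,
    where g_t = sigmoid(gate_logits[t]) has one entry per key dimension.
    """
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # fixed-size memory
    outputs = []
    for q, k, v, logits in zip(queries, keys, values, gate_logits):
        gates = [sigmoid(z) for z in logits]  # each gate lies in (0, 1)
        for i in range(d_k):
            for j in range(d_v):
                # Decay the old memory row-wise, then write the new key-value pair.
                state[i][j] = gates[i] * state[i][j] + k[i] * v[j]
        # Read from memory with the current query.
        outputs.append([sum(q[i] * state[i][j] for i in range(d_k))
                        for j in range(d_v)])
    return outputs


# Toy inputs (assumptions for illustration only).
qs = [[1.0, 1.0], [1.0, 1.0]]
ks = [[1.0, 0.0], [0.0, 1.0]]
vs = [[1.0, 2.0], [3.0, 4.0]]

keep_all = gated_linear_attention(qs, ks, vs, [[40.0, 40.0]] * 2)      # gates near 1
forget_all = gated_linear_attention(qs, ks, vs, [[-40.0, -40.0]] * 2)  # gates near 0
print(keep_all)
print(forget_all)
```

With gates near 1 the state accumulates every past key-value pair, and with gates near 0 it retains only the current token, which is the sense in which a learned gate gives adaptive control over memory.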
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,
-
[2]
Simple linear attention language models balance the recall-throughput tradeoff
Simran Arora, Sabri Eyuboglu, Michael Zhang, Aman Timalsina, Silas Alberti, Dylan Zinsley, James Zou, Atri Rudra, and Christopher Ré. Simple linear attention language models balance the recall-throughput tradeoff. arXiv preprint arXiv:2402.18668,
-
[3]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457,
-
[4]
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691,
-
[5]
Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality. arXiv preprint arXiv:2405.21060,
-
[6]
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
URL https://arxiv.org/abs/2402.19427, page
-
[7]
The language model evaluation harness, 07 2024
URL https://zenodo.org/records/12608602. Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Adam Ibrahim, and Beren Millidge. Zamba: A compact 7b ssm hybrid model. arXiv preprint arXiv:2405.16712,
-
[8]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783,
-
[9]
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752,
-
[10]
Measuring Massive Multitask Language Understanding
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300,
-
[11]
AQ Jiang, A Sablayrolles, A Mensch, C Bamford, DS Chaplot, D de Las Casas, F Bressand, G Lengyel, G Lample, L Saulnier, et al. Mistral 7b, arxiv abs/2310.06825 (2023). URL: https://api.semanticscholar.org/CorpusID:263830494. Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformer...
-
[12]
Infinite-llm: Efficient llm service for long context with distattention and distributed kvcache
Bin Lin, Chen Zhang, Tao Peng, Hanyu Zhao, Wencong Xiao, Minmin Sun, Anmin Liu, Zhipeng Zhang, Lanbo Li, Xiafei Qiu, et al. Infinite-llm: Efficient llm service for long context with distattention and distributed kvcache. arXiv preprint arXiv:2401.02669,
-
[13]
Language Models are Few-Shot Learners
Ben Mann, N Ryder, M Subbiah, J Kaplan, P Dhariwal, A Neelakantan, P Shyam, G Sastry, A Askell, S Agarwal, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 1:3,
-
[14]
Linearizing large language models
Jean Mercat, Igor Vasiljevic, Sedrick Keh, Kushal Arora, Achal Dave, Adrien Gaidon, and Thomas Kollar. Linearizing large language models. arXiv preprint arXiv:2405.06640,
-
[15]
Leave no context behind: Efficient infinite context transformers with infini-attention, 2024
Tsendsuren Munkhdalai, Manaal Faruqui, and Siddharth Gopal. Leave no context behind: Efficient infinite context transformers with infini-attention. arXiv preprint arXiv:2404.07143, 101,
-
[16]
RWKV: Reinventing RNNs for the Transformer Era
Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Leon Derczynski, Xingjian Du, Matteo Grella, Kranthi Gv, Xuzheng He, Haowen Hou, Przemyslaw Kazienko, Jan Kocon, Jiaming Kong, Bartłomiej Koptyra, Hayden Lau, Jiaju Lin, Krishna Sri Ipsit Mantri, Ferdinand Mom, Atsushi Saito, Guan...
doi:10.18653/v1/2023.findings-emnlp.936
-
[17]
URL https://arxiv.org/abs/2307.14995. Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106,
-
[18]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971,
-
[19]
An Empirical Study of Mamba-based Language Models
URL https://arxiv.org/abs/2406.07887. Junxiong Wang, Daniele Paliotta, Avner May, Alexander Rush, and Tri Dao. The mamba in the llama: Distilling and accelerating hybrid models. Advances in Neural Information Processing Systems, 37:62432–62457,
-
[20]
Linformer: Self-Attention with Linear Complexity
Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768,
-
[21]
Efficient Streaming Language Models with Attention Sinks
Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453,
-
[22]
Gated Linear Attention Transformers with Hardware-Efficient Training
Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated linear attention transformers with hardware-efficient training. arXiv preprint arXiv:2312.06635,
-
[23]
URL https://aclanthology.org/P19-1472. Michael Zhang, Kush Bhatia, Hermann Kumbong, and Christopher Ré. The hedgehog & the porcupine: Expressive linear attentions with softmax mimicry, 2024b. URL https://arxiv.org/abs/2402.04347. Michael Zhang, Simran Arora, Rahul Chalamala, Alan Wu, Benjamin Spector, Aaryan Singhal, Krithik Ramesh, and Christopher Ré. ...
-
[24]
PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. Pytorch fsdp: experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277,
-
[25]
First, Lizard still relies on a strong pretrained backbone to achieve high quality
A Limitations. Despite the promising performance and efficiency gains demonstrated by Lizard, our approach has two key limitations. First, Lizard still relies on a strong pretrained backbone to achieve high quality. As with many recent distillation-based or hybrid architectures, the success of our method depends heavily on the expressiveness and general...
-
[26]
As shown in Figure 6, our hardware-aware implementation of GLA achieves 3.25 ms per forward pass, roughly a 24% reduction in inference time (about a 1.32x speedup) compared to the original Gated Linear Attention kernel (4.30 ms). This speedup stems from shifting the gating contributions into the feature space, enabling tensor core compatibility and chunkwise matrix operat...
-
[27]
For the learning rate, we performed an initial sweep over {1e-2, 5e-3, 1e-3, 5e-4, 1e-4}. We did not tune the batch size. For the other designs, we adopted the default values used by prior work [Zhang et al., 2024]. E Evaluation on small-size LLMs. Table 8: Evaluation results of small-size LLMs and their variants across multiple benchmarks. Lizard consiste...