LoopQ: Quantization for Recursive Transformers

Hsi-Wen Chen; Ming-Syan Chen; Rui Fang

arxiv: 2605.16343 · v1 · pith:646YTXTNnew · submitted 2026-05-08 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-20 22:38 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords post-training quantization, looped language models, recursive transformers, quantization error accumulation, activation scaling, low-bit inference, parameter-efficient models

The pith

LoopQ enables practical 4-bit quantization for looped language models by fixing role shifts, state reuse, and recursive error buildup.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Looped language models reuse the same transformer blocks repeatedly to get deeper computation from a fixed parameter budget. This reuse creates three distinct problems under post-training quantization: activation distributions change depending on the loop step, hidden states get reused across iterations, and small quantization errors compound over successive loops. LoopQ keeps one shared quantized backbone and adds a small set of loop-specific fixes: activation scaling, selective transformations, cross-loop state alignment, and trajectory-aware optimization. The result is a large recovery in both task accuracy and language-modeling quality when running at W4A4 precision. A sympathetic reader would care because the approach turns an otherwise fragile efficiency trick into something that can actually be deployed on low-precision hardware.

Core claim

LoopQ is a loop-aware post-training quantization framework for looped language models. It retains a single shared quantized model while inserting lightweight adaptations that correct distributional mismatch inside each loop and limit error accumulation across loops. The adaptations consist of activation scaling, selective transformation, cross-loop state alignment, and trajectory-aware optimization. Across seven benchmarks, W4A4 LoopQ raises average downstream accuracy by 68.8 percent and lowers average perplexity by 87.7 percent relative to the strongest static PTQ baseline.
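
To make the loop-dependent activation problem concrete, the following minimal sketch compares one static activation scale shared by all loop steps with one scale per loop step, under symmetric 4-bit fake quantization. It is an illustration under stated assumptions, not LoopQ's actual procedure: the synthetic Gaussian activations, the assumed growth of their spread with the loop index, and the helper names `fake_quant`, `absmax_scale`, and `rel_err` are all hypothetical.

```python
import numpy as np

def fake_quant(x, scale, bits=4):
    """Symmetric uniform fake quantization: snap to a signed integer grid, then dequantize."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4 bits
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

def absmax_scale(x, bits=4):
    """Scale that maps the largest magnitude in x onto the top of the integer grid."""
    return np.abs(x).max() / (2 ** (bits - 1) - 1)

def rel_err(x, scale):
    """Relative L2 error introduced by quantizing x with the given scale."""
    return np.linalg.norm(fake_quant(x, scale) - x) / np.linalg.norm(x)

rng = np.random.default_rng(0)
# Hypothetical calibration activations: spread grows with the loop index (assumed drift).
acts = [rng.normal(0.0, 1.0 + 2.0 * t, size=4096) for t in range(4)]

shared_scale = absmax_scale(np.concatenate(acts))  # one static scale for every loop
per_loop_scales = [absmax_scale(a) for a in acts]  # one scale per loop step

static_errors = [rel_err(a, shared_scale) for a in acts]
loop_errors = [rel_err(a, s) for a, s in zip(acts, per_loop_scales)]

for t, (e_static, e_loop) in enumerate(zip(static_errors, loop_errors)):
    print(f"loop {t}: static scale err={e_static:.3f}, per-loop scale err={e_loop:.3f}")
```

In this toy setting the shared scale is dominated by the widest loop, so early loops with narrow activations lose most of their resolution. Per-loop scales avoid that while leaving the shared weights untouched, which is the general idea behind keeping one quantized backbone and adding small loop-specific corrections.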

What carries the argument

LoopQ, a loop-aware PTQ method that keeps a shared quantized backbone and adds lightweight per-loop adaptations for activation scaling, selective transformation, cross-loop alignment, and trajectory optimization.

If this is right

Quantized looped models can reach downstream accuracy levels much closer to their full-precision counterparts.
Language-modeling perplexity drops sharply once loop-specific alignment is applied.
The same shared quantized weights remain usable while the lightweight corrections handle iteration-dependent effects.
No full retraining is required, so the method stays practical for large looped architectures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same loop-aware corrections might extend to other recursive architectures such as iterated vision transformers or recurrent diffusion models.
If the adaptations stay lightweight at 4 bits, they could be tested at 2-bit or 3-bit widths to see how far the error-compensation scales.
Hardware designs could add dedicated support for cross-loop state alignment to make the method even faster on edge devices.

Load-bearing premise

The three challenges of distribution shift, state reuse, and recursive error accumulation are the main causes of quantization failure in LoopLMs, and the added adaptations correct them without introducing new mismatches or errors.

What would settle it

Apply LoopQ to a held-out looped model not used in the original experiments and check whether the accuracy gain stays above 50 percent and the perplexity reduction stays above 70 percent under the same W4A4 setting.
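
The check described above can be written down as a short sketch. It assumes the reported percentages are relative changes against the static PTQ baseline (the source does not spell out the formula), and every number in the example call is hypothetical rather than taken from the paper. The function names are illustrative.

```python
def relative_gain(baseline, method):
    """Relative increase of a higher-is-better metric, in percent."""
    return 100.0 * (method - baseline) / baseline

def relative_reduction(baseline, method):
    """Relative decrease of a lower-is-better metric, in percent."""
    return 100.0 * (baseline - method) / baseline

def passes_held_out_check(base_acc, method_acc, base_ppl, method_ppl,
                          min_gain=50.0, min_reduction=70.0):
    """True if both thresholds from the settling test are exceeded."""
    gain_ok = relative_gain(base_acc, method_acc) > min_gain
    reduction_ok = relative_reduction(base_ppl, method_ppl) > min_reduction
    return gain_ok and reduction_ok

# Hypothetical held-out results under W4A4 (not from the paper).
print(passes_held_out_check(base_acc=30.0, method_acc=48.0,
                            base_ppl=200.0, method_ppl=40.0))
```

If the paper's percentages are instead absolute accuracy points, `relative_gain` would need to be replaced by a plain difference.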

Figures

Figures reproduced from arXiv: 2605.16343 by Hsi-Wen Chen, Ming-Syan Chen, Rui Fang.

**Figure 1.** Left: LoopLMs exhibit loop-dependent activation drift (Challenge 1), error spikes at loop transitions (Challenge 2), and error accumulation across loops (Challenge 3). Right: LoopQ preserves a shared quantized backbone while adding lightweight loop-aware scaling, selective transformation, and cross-loop state alignment to reduce recursive quantization errors. The root cause is that the same shared block is…

**Figure 2.** Quantization error across layers and loops on LAMBADA, measured by the relative…

**Figure 3.** Loop-dependent selection heatmap on Ouro 1.4B.

**Figure 4.** Perplexity under different numbers of selected loop-dependent linear layers.

**Figure 5.** Perplexity under different loop depths.

**Figure 6.** Perplexity under different calibration set sizes. Axes: calibration set size (32 to 2048) against LAMBADA perplexity and WikiText2 perplexity.

Original abstract

Looped language models (LoopLMs) improve parameter efficiency by recursively reusing Transformer blocks, enabling deeper computation under a fixed model size. However, this reuse makes LoopLMs more fragile under post-training quantization (PTQ). We present the first systematic study of quantization in LoopLMs and identify three challenges: distribution shift across roles, state reuse across loop transitions, and recursive error accumulation. To address these challenges, we propose LoopQ, a loop-aware PTQ framework that preserves a shared quantized backbone while introducing lightweight adaptations. LoopQ combines activation scaling, selective transformation, cross-loop state alignment, and trajectory-aware optimization to reduce distributional mismatch within loops and error accumulation across loops. Experiments across seven benchmarks show that, under W4A4 quantization, LoopQ improves average downstream accuracy by 68.8% and reduces average perplexity by 87.7% compared with the strongest static PTQ baseline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LoopQ gives a practical loop-aware PTQ method for recursive transformers with big reported gains over static baselines, but the evidence is still thin on details and robustness.

The core contribution is the first explicit look at why looped language models break under standard post-training quantization. The authors flag three concrete issues (distribution shift across different loop roles, state reuse between iterations, and accumulating errors over multiple passes) and then build four lightweight fixes on top of a shared quantized backbone: activation scaling, selective transformation, cross-loop state alignment, and trajectory-aware optimization. That framing is new and useful for anyone working with parameter-efficient recursive architectures.

The reported numbers are eye-catching: under W4A4, they claim a 68.8% lift in average downstream accuracy and an 87.7% drop in average perplexity versus the strongest static PTQ baseline across seven benchmarks. If those deltas hold, the work directly lowers deployment cost and energy for this model family. The experiments at least span multiple tasks, which is better than single-benchmark claims.

The main soft spot is the lack of visible ablations, error bars, or full method descriptions in what we have so far. Without those, it is hard to tell whether the gains come from the proposed components or from favorable hyperparameter choices. The assumption that the three challenges dominate and that the fixes do not introduce offsetting mismatches also needs checking against the actual code and runs.

This paper is for people already working on quantization or on looped/recursive transformers who need practical inference tools. It is not reshaping the broader field, but it fills a clear gap in the efficiency literature. The thinking looks coherent and the empirical angle is honest, so it deserves a serious referee to verify the numbers and the implementation.

Referee Report

2 major / 2 minor

Summary. The paper claims to present the first systematic study of post-training quantization (PTQ) for looped language models (LoopLMs), which reuse Transformer blocks recursively for parameter efficiency. It identifies three challenges—distribution shift across roles, state reuse across loop transitions, and recursive error accumulation—and proposes LoopQ, a loop-aware PTQ framework that retains a shared quantized backbone while adding lightweight adaptations: activation scaling, selective transformation, cross-loop state alignment, and trajectory-aware optimization. Under W4A4 quantization, experiments across seven benchmarks report that LoopQ improves average downstream accuracy by 68.8% and reduces average perplexity by 87.7% relative to the strongest static PTQ baseline.

Significance. If the reported gains prove robust, the work would be significant for enabling low-bit deployment of parameter-efficient recursive transformers. The systematic identification of loop-specific quantization challenges and the targeted lightweight fixes represent a clear contribution over generic PTQ methods. The large empirical deltas on multiple benchmarks are a strength, provided they are supported by proper controls, ablations, and statistical reporting in the full manuscript.

major comments (2)

[Abstract and Experiments] The central performance claims (68.8% accuracy lift and 87.7% perplexity reduction under W4A4) are load-bearing, yet the manuscript provides no error bars, number of runs, or variance estimates. This makes it impossible to determine whether the gains are statistically reliable or sensitive to random seeds and post-hoc hyperparameter choices.
[Method and Experiments] The weakest assumption, that the three identified challenges dominate quantization degradation and that the proposed adaptations address them without introducing offsetting distribution mismatches, is not directly tested via ablations that isolate each component (e.g., removing cross-loop state alignment while keeping the others). Without such controls, it remains unclear whether the full LoopQ suite is necessary or if simpler static PTQ suffices.
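
The variance reporting requested in the first major comment could look like the sketch below: a mean and sample standard deviation per configuration across seeds. The per-seed values and configuration labels are hypothetical placeholders, not results from the manuscript, and `summarize` is an illustrative helper.

```python
import statistics

def summarize(runs):
    """Return (mean, sample standard deviation) of one metric across seeds."""
    return statistics.mean(runs), statistics.stdev(runs)

# Hypothetical per-seed average accuracies under W4A4 (not from the paper).
acc_by_seed = {
    "static PTQ baseline": [30.1, 29.4, 30.8],
    "loop-aware PTQ": [48.2, 47.5, 49.0],
}

for name, runs in acc_by_seed.items():
    mean, std = summarize(runs)
    print(f"{name}: {mean:.2f} ± {std:.2f} (n={len(runs)})")
```

With only three seeds the standard deviation is itself a rough estimate, so reporting the number of runs next to it matters as much as the spread.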

minor comments (2)

[Abstract] The abstract would benefit from briefly stating the model sizes, number of loop iterations, and the exact seven benchmarks used, to allow readers to assess the scope of the claimed improvements.
[Method] Notation for loop transitions and state reuse should be defined more explicitly in the method section to avoid ambiguity when describing cross-loop alignment.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications on our experimental design and commit to targeted revisions that strengthen the statistical rigor and component isolation without altering the core claims.

Referee: [Abstract and Experiments] The central performance claims (68.8% accuracy lift and 87.7% perplexity reduction under W4A4) are load-bearing, yet the manuscript provides no error bars, number of runs, or variance estimates. This makes it impossible to determine whether the gains are statistically reliable or sensitive to random seeds and post-hoc hyperparameter choices.

Authors: We acknowledge that the manuscript as submitted does not report error bars or the number of runs. The experiments used a single fixed seed per benchmark for direct reproducibility with prior PTQ work. In the revised manuscript we will add results averaged over three independent random seeds, reporting means and standard deviations for all key metrics (accuracy and perplexity) under W4A4. This will allow readers to assess the stability of the reported 68.8% and 87.7% relative improvements.
revision: yes

Referee: [Method and Experiments] The weakest assumption, that the three identified challenges dominate quantization degradation and that the proposed adaptations address them without introducing offsetting distribution mismatches, is not directly tested via ablations that isolate each component (e.g., removing cross-loop state alignment while keeping the others). Without such controls, it remains unclear whether the full LoopQ suite is necessary or if simpler static PTQ suffices.

Authors: We agree that isolating each adaptation is necessary to substantiate the design choices. The current manuscript contains a cumulative ablation that adds components sequentially and shows consistent gains, but it does not include leave-one-out variants. In the revision we will add explicit ablations that remove cross-loop state alignment and selective transformation individually while retaining the remaining modules. These new controls will demonstrate that each element contributes measurably and that omitting any of them degrades performance relative to the full LoopQ configuration.
revision: yes
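
The difference between the cumulative ablation the authors describe and the leave-one-out variants they promise can be sketched as two ways of enumerating configurations. The component names follow the four adaptations listed in the abstract; the identifiers and function names are hypothetical, and the ordering of the cumulative schedule is an assumption.

```python
COMPONENTS = (
    "activation_scaling",
    "selective_transformation",
    "cross_loop_state_alignment",
    "trajectory_aware_optimization",
)

def cumulative_configs(components):
    """Add components one at a time: [], [c1], [c1, c2], ..."""
    return [list(components[:k]) for k in range(len(components) + 1)]

def leave_one_out_configs(components):
    """For each component, keep all the others enabled and drop only that one."""
    return {c: [x for x in components if x != c] for c in components}

for cfg in cumulative_configs(COMPONENTS):
    print("cumulative:", cfg)
for removed, cfg in leave_one_out_configs(COMPONENTS).items():
    print(f"without {removed}:", cfg)
```

A cumulative schedule only measures each component in the context of the ones added before it, whereas leave-one-out measures what is lost when a single component is missing from the full system. The two can disagree when components interact.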

Circularity Check

0 steps flagged

No significant circularity

The paper presents an empirical study that identifies three challenges in quantizing LoopLMs and introduces the LoopQ framework with targeted lightweight adaptations (activation scaling, selective transformation, cross-loop state alignment, and trajectory-aware optimization). These are evaluated via experiments on seven benchmarks under W4A4 PTQ, reporting accuracy and perplexity deltas relative to static baselines. No derivation chain, first-principles prediction, or fitted quantity is claimed that reduces by construction to its own inputs, self-citations, or ansatzes; the contribution is self-contained as a set of practical techniques plus empirical validation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated. The lightweight adaptations may implicitly introduce scaling factors or alignment parameters, but none are enumerated.

pith-pipeline@v0.9.0 · 5682 in / 1150 out tokens · 39837 ms · 2026-05-20T22:38:41.096450+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: Proposition 2... ε_{t+1} ≤ ε_quant_t + γ_t ε_t ... ε_T ≤ Σ_{τ=0}^{T-1} (∏_{t=τ+1}^{T-1} γ_t) ε_quant_τ

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: LoopQ combines activation scaling, selective transformation, cross-loop state alignment, and trajectory-aware optimization
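
The recursion quoted from Proposition 2 above can be checked numerically. The sketch below takes the one-step inequality with equality (the worst case), assumes a zero initial error ε_0 = 0, and confirms that iterating it gives the same value as the unrolled sum. Both assumptions are mine, since the quoted passage elides its conditions, and the values of γ_t and ε_quant_t are hypothetical.

```python
import math

def iterate_recursion(gammas, eps_quant, eps0=0.0):
    """Apply eps_{t+1} = eps_quant_t + gamma_t * eps_t for t = 0..T-1."""
    eps = eps0
    for gamma_t, q_t in zip(gammas, eps_quant):
        eps = q_t + gamma_t * eps
    return eps

def unrolled_bound(gammas, eps_quant):
    """Sum over tau of (product of gamma_t for t = tau+1..T-1) * eps_quant_tau."""
    T = len(eps_quant)
    # math.prod of an empty slice is 1, which handles the tau = T-1 term.
    return sum(math.prod(gammas[tau + 1:T]) * eps_quant[tau] for tau in range(T))

# Hypothetical per-loop amplification factors and per-loop quantization errors.
gammas = [0.9, 1.1, 1.2]
eps_quant = [0.1, 0.2, 0.3]

print(iterate_recursion(gammas, eps_quant))
print(unrolled_bound(gammas, eps_quant))
```

The unrolled form makes the role of the factors γ_t visible: an error injected at loop τ is multiplied by every later γ_t, so factors above 1 amplify early errors and factors below 1 damp them.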

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

Alizadeh, A

M. Alizadeh, A. Behboodi, M. van Baalen, C. Louizos, T. Blankevoort, and M. Welling. Gradient l1 regularization for quantization robustness. InInternational Conference on Learning Representations, 2020

work page 2020

[2] [2]

Ashkboos, A

S. Ashkboos, A. Mohtashami, M. L. Croci, B. Li, P. Cameron, M. Jaggi, D. Alistarh, T. Hoefler, and J. Hensman. Quarot: Outlier-free 4-bit inference in rotated llms.Advances in Neural Information Processing Systems, 37:100213–100240, 2024

work page 2024

[3] [3]

S. Bae, Y . Kim, R. Bayat, S. Kim, J. Ha, T. Schuster, A. Fisch, H. Harutyunyan, Z. Ji, A. Courville, et al. Mixture-of-recursions: Learning dynamic recursive depths for adaptive token-level computation.arXiv preprint arXiv:2507.10524, 2025

work page arXiv 2025

[4] [4]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge.arXiv preprint arXiv:1803.05457, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[5] [5]

Universal Transformers

M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, and Ł. Kaiser. Universal transformers.arXiv preprint arXiv:1807.03819, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[6] [6]

Dettmers, M

T. Dettmers, M. Lewis, Y . Belkada, and L. Zettlemoyer. Gpt3. int8 (): 8-bit matrix multiplication for transformers at scale.Advances in neural information processing systems, 35:30318–30332, 2022

work page 2022

[7] [7]

Egashira, M

K. Egashira, M. Vero, R. Staab, J. He, and M. Vechev. Exploiting llm quantization.Advances in Neural Information Processing Systems, 37:41709–41732, 2024

work page 2024

[8] [8]

Y . Fan, Y . Du, K. Ramchandran, and K. Lee. Looped transformers for length generalization. arXiv preprint arXiv:2409.15647, 2024

work page arXiv 2024

[9] [9]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers.arXiv preprint arXiv:2210.17323, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[10] [10]

L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling.arXiv preprint arXiv:2101.00027, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2020

[11] [11]

Gupta, P

N. Gupta, P. Wang, R. Kannan, and V . K. Prasanna. A persistent-state dataflow accelerator for memory-bound linear attention decode on fpga.arXiv preprint arXiv:2603.05931, 2026

work page arXiv 2026

[12] [12]

Measuring Massive Multitask Language Understanding

D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2009

[13] [13]

Distilling the Knowledge in a Neural Network

G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network.arXiv preprint arXiv:1503.02531, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[14] [14]

X. Hu, Y . Cheng, D. Yang, Z. Xu, Z. Yuan, J. Yu, C. Xu, Z. Jiang, and S. Zhou. Ostquant: Refining large language model quantization with orthogonal and scaling transformations for better distribution fitting.arXiv preprint arXiv:2501.13987, 2025

work page arXiv 2025

[15] [15]

Huang, H

W. Huang, H. Qin, Y . Liu, Y . Li, Q. Liu, X. Liu, L. Benini, M. Magno, S. Zhang, and X. Qi. Slim-llm: Salience-driven mixed-precision quantization for large language models.arXiv preprint arXiv:2405.14917, 2024

work page arXiv 2024

[16] [16]

Imfeld, J

M. Imfeld, J. Graldi, M. Giordano, T. Hofmann, S. Anagnostidis, and S. P. Singh. Transformer fusion with optimal transport.arXiv preprint arXiv:2310.05719, 2023

work page arXiv 2023

[17] [17]

Jeddi, M

A. Jeddi, M. Ciccone, and B. Taati. Loopformer: Elastic-depth looped transformers for latent reasoning via shortcut modulation.arXiv preprint arXiv:2602.11451, 2026. 10

work page arXiv 2026

[18]

J. Kim, Y. J. Park, S. Son, C. Lee, H.-y. Kim, J. Kim, and Y. Jeon. Turboboa: Faster and exact attention-aware quantization without backpropagation. arXiv preprint arXiv:2602.04929, 2026.

[19]

H. Kwon, K. Koo, J. Kim, W. Lee, M. Lee, G. Jung, H. Lee, Y. Jung, J. Park, Y. Song, et al. Pimphony: Overcoming bandwidth and capacity inefficiency in pim-based long-context llm inference system. In 2026 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 1–21. IEEE, 2026.

[20]

A. F. Laguna, M. M. Sharifi, A. Kazemi, X. Yin, M. Niemier, and X. S. Hu. Hardware-software co-design of an in-memory transformer network accelerator. Frontiers in Electronics, 3:847069, 2022.

[21]

Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942, 2019.

[22]

J. Lin, J. Tang, H. Tang, S. Yang, W.-M. Chen, W.-C. Wang, G. Xiao, X. Dang, C. Gan, and S. Han. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of machine learning and systems, 6:87–100, 2024.

[23]

D. Liu, Z. Qin, H. Wang, Z. Yang, Z. Wang, F. Rong, Q. Liu, Y. Hao, B. Li, X. Chen, et al. Pruning via merging: Compressing llms via manifold alignment based layer merging. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 17817–17829, 2024.

[24]

Z. Liu, C. Zhao, I. Fedorov, B. Soran, D. Choudhary, R. Krishnamoorthi, V. Chandra, Y. Tian, and T. Blankevoort. Spinquant: Llm quantization with learned rotations. In The Thirteenth International Conference on Learning Representations, 2025.

[25]

J. Martens and R. Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In International conference on machine learning, pages 2408–2417. PMLR, 2015.

[26]

S. Merity, C. Xiong, J. Bradbury, and R. Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.

[27]

D. Paperno, G. Kruszewski, A. Lazaridou, N.-Q. Pham, R. Bernardi, S. Pezzelle, M. Baroni, G. Boleda, and R. Fernández. The lambada dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th annual meeting of the association for computational linguistics (volume 1: Long papers), pages 1525–1534, 2016.

[28]

H. Prairie, Z. Novack, T. Berg-Kirkpatrick, and D. Y. Fu. Parcae: Scaling laws for stable looped language models. arXiv preprint arXiv:2604.12946, 2026.

[29]

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

[30]

M. Riera, J. M. Arnau, and A. González. Crew: Computation reuse and efficient weight storage for hardware-accelerated mlps and rnns. Journal of Systems Architecture, 129:102604, 2022.

[31]

K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021.

[32]

S. Sanjeet, I. Colbert, P. Monteagudo-Lago, G. Franco, Y. Umuroglu, and N. J. Fraser. Mixquant: Pushing the limits of block rotations in post-training quantization. arXiv preprint arXiv:2601.22347, 2026.

[33]

U. Saxena, S. Sharify, K. Roy, and X. Wang. Resq: Mixed-precision quantization of large language models with low-rank residuals. In International Conference on Machine Learning, pages 53095–53114. PMLR, 2025.

[34]

M. Shkolnik, B. Chmiel, R. Banner, G. Shomron, Y. Nahshan, A. Bronstein, and U. Weiser. Robust quantization: one model to rule them all. In Proceedings of the 34th International Conference on Neural Information Processing Systems, pages 5308–5317, 2020.

[35]

Y. Sun, R. Liu, H. Bai, H. Bao, K. Zhao, Y. Li, J. Hu, X. Yu, L. Hou, C. Yuan, et al. Flatquant: Flatness matters for llm quantization. In International Conference on Machine Learning, pages 57587–57613. PMLR, 2025.

[36]

N. Verma, K. Murray, and K. Duh. Merging feed-forward sublayers for compressed transformers. arXiv preprint arXiv:2501.06126, 2025.

[37]

G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han. Smoothquant: Accurate and efficient post-training quantization for large language models. In International conference on machine learning, pages 38087–38099. PMLR, 2023.

[38]

Y. Xiao, A. Liu, T. Zhang, H. Qin, J. Guo, and X. Liu. Robustmq: benchmarking robustness of quantized models. Visual Intelligence, 1(1):30, 2023.

[39]

Z. Yu, Z. Wang, Y. Li, R. Gao, X. Zhou, S. R. Bommu, Y. Zhao, and Y. Lin. Edge-llm: Enabling efficient large language model adaptation on edge devices via unified compression and adaptive layer voting. In Proceedings of the 61st ACM/IEEE Design Automation Conference, pages 1–6, 2024.

[40]

A. Zeitoun, L. Torroba-Hennigen, and Y. Kim. Hyperloop transformers. arXiv preprint arXiv:2604.21254, 2026.

[41]

R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th annual meeting of the association for computational linguistics, pages 4791–4800, 2019.

[42]

B. Zhang and R. Sennrich. Root mean square layer normalization. Advances in neural information processing systems, 32, 2019.

[43]

M. Zhou, W. Xu, J. Kang, and T. Rosing. Transpim: A memory-based acceleration via software-hardware co-design for transformer. In 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 1071–1085. IEEE, 2022.

[44]

W. Zhou, R. Le Bras, and Y. Choi. Modular transformers: Compressing transformers into modularized layers for flexible efficient inference. In Findings of the Association for Computational Linguistics: ACL 2023, pages 10452–10465, 2023.

[45]

R.-J. Zhu, Z. Wang, K. Hua, T. Zhang, Z. Li, H. Que, B. Wei, Z. Wen, F. Yin, H. Xing, et al. Scaling latent reasoning via looped language models. arXiv preprint arXiv:2510.25741, 2025.

A Detailed Proof

A.1 Notation Table

Table 3: Summary of notations. Category Notation Description LoopLM X Input token sequence. T, L Number of recursive loops and number of ...