Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models

Arash Behboodi; Arnau Padres Masdemont; Fabio Valerio Massoli; Jordi Ros-Giralt; Niccolò Grillo; Victor Conchello Vendrell

arxiv: 2605.07721 · v2 · pith:HH6MRAPLnew · submitted 2026-05-08 · 💻 cs.CL · cs.AI · cs.LG

Victor Conchello Vendrell, Arnau Padres Masdemont, Niccolò Grillo, Jordi Ros-Giralt, Arash Behboodi, Fabio Valerio Massoli

Pith reviewed 2026-05-20 22:56 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords memory-efficient looped transformers · constant-memory iterative reasoning · shared KV cache · learnable gating · recurrent language models · chunk-wise training · LoopLM

The pith

Looped language models can perform iterative reasoning with constant memory by sharing a single KV cache updated via learnable gating.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MELT to address memory growth in recurrent LLMs that reason over multiple steps by updating internal states across loops. Looped models such as Ouro retain a separate KV cache for every iteration, so memory grows linearly with reasoning depth and quickly becomes impractical. MELT instead maintains one shared KV cache per layer, updated across loops through a learnable gating mechanism. A two-phase chunk-wise training procedure transfers capabilities from a pretrained LoopLM: an interpolated transition followed by attention-aligned distillation. The resulting models match or exceed standard LLMs of similar size while keeping memory usage fixed and far below Ouro's.

Core claim

MELT replaces per-iteration KV caches with a single shared cache per layer that is updated across loops by a learnable gating mechanism, combined with a two-phase chunk-wise training process of interpolated transition and attention-aligned distillation from a base LoopLM, to achieve constant-memory iterative reasoning without loss of performance.

What carries the argument

Learnable gating mechanism that updates a single shared KV cache per layer across all reasoning loops.
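A minimal sketch of what such a gated update could look like, using the element-wise form h_t = z ⊙ h_{t-1} + (1 - z) ⊙ x_t that this page quotes from the paper. The function names, the toy vector size, and the fixed gate logits are illustrative assumptions; in the paper the gate is learned, and its exact parameterization is not given on this page.

```python
import math


def sigmoid(u):
    """Logistic function, used here to squash gate logits into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-u))


def gated_cache_update(cache, candidate, gate_logits):
    """Blend the previous cache entry with a fresh candidate, element-wise.

    Implements h_t = z * h_{t-1} + (1 - z) * x_t with z = sigmoid(gate_logits).
    The result has the same length as `cache`, so memory does not grow.
    """
    updated = []
    for h_prev, x_new, logit in zip(cache, candidate, gate_logits):
        z = sigmoid(logit)
        updated.append(z * h_prev + (1.0 - z) * x_new)
    return updated


# One cached vector for one token in one layer (toy size, hypothetical values).
cache = [0.5, -1.0, 2.0]
for loop in range(4):
    candidate = [float(loop)] * 3   # stand-in for this loop's new key or value
    gate_logits = [0.0, 0.0, 0.0]   # hypothetical; produced by learned weights in practice
    cache = gated_cache_update(cache, candidate, gate_logits)

print(len(cache))  # the cache keeps its original size after every loop
```

A gate near 1 keeps the old entry almost unchanged, while a gate near 0 overwrites it with the current loop's value. That trade-off is the one the load-bearing premise below depends on.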

If this is right

MELT achieves constant memory footprint independent of reasoning depth.
Fine-tuned MELT models from Ouro parameters outperform standard LLMs of comparable size.
Memory usage stays comparable to standard transformers and dramatically smaller than Ouro.
Only a lightweight post-training procedure is required rather than full retraining.
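The memory consequences listed above can be made concrete with back-of-the-envelope arithmetic. The sketch below compares a cache kept per layer and per loop with a single shared cache per layer. The configuration values are hypothetical placeholders, not figures from the paper, and the formula ignores MQA/GQA and any implementation overhead.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, caches_per_layer,
                   bytes_per_value=2):
    """Approximate KV cache size in bytes.

    The leading factor 2 accounts for storing both keys and values.
    `caches_per_layer` is the number of loops when every loop keeps its own
    cache, and 1 when a single cache is shared across loops.
    """
    return (2 * layers * kv_heads * head_dim * seq_len
            * caches_per_layer * bytes_per_value)


# Hypothetical configuration, chosen only for illustration.
cfg = dict(layers=24, kv_heads=16, head_dim=128, seq_len=4096)

for loops in (1, 2, 4, 8):
    per_loop = kv_cache_bytes(caches_per_layer=loops, **cfg)
    shared = kv_cache_bytes(caches_per_layer=1, **cfg)
    print(f"loops={loops}: per-loop cache {per_loop / 2**20:.0f} MiB, "
          f"shared cache {shared / 2**20:.0f} MiB")
```

Under these assumptions the per-loop variant grows in proportion to the number of loops, while the shared variant stays at the single-loop size.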

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Deeper reasoning chains become feasible on memory-constrained hardware without scaling cache size.
The gating approach might extend to other recurrent or stateful network components to reduce memory overhead.
Limits of information retention in the gated cache could appear on very long reasoning sequences.

Load-bearing premise

The learnable gate can preserve and update all necessary state information across reasoning loops without significant loss of context or performance.
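One way to see why this premise matters: if the update has the convex form h_t = z ⊙ h_{t-1} + (1 - z) ⊙ x_t and the gate were held at a constant scalar z (a simplifying assumption, since the real gate is learned and input-dependent), the weight left on the initial cache content after T loops is z^T. The helper names below are hypothetical.

```python
def initial_state_weight(gate, loops):
    """Weight of the loop-0 cache content after `loops` updates with a constant gate."""
    return gate ** loops


def unrolled_cache(h0, inputs, gate):
    """Apply h <- gate * h + (1 - gate) * x once per loop, for scalar values."""
    h = h0
    for x in inputs:
        h = gate * h + (1.0 - gate) * x
    return h


for gate in (0.5, 0.9, 0.99):
    weights = [round(initial_state_weight(gate, t), 4) for t in (4, 16, 64)]
    print(f"gate={gate}: weight on initial state after 4/16/64 loops = {weights}")
```

The geometric decay shows why retention at depths well beyond the training regime is an empirical question rather than something the update rule guarantees.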

What would settle it

Measure whether model accuracy on reasoning tasks remains stable or degrades when the number of loops is increased far beyond the training regime, compared against a baseline that retains full per-loop caches.

Figures

Figures reproduced from arXiv: 2605.07721 by Arash Behboodi, Arnau Padres Masdemont, Fabio Valerio Massoli, Jordi Ros-Giralt, Niccolò Grillo, Victor Conchello Vendrell.

**Figure 1.** (a) MELT achieves superior performance compared to similarly sized non-looped models while maintaining an equivalent memory footprint, only slightly higher due to the absence of MQA. (b) As in looped transformers, layers are reused across iterations, but the KV cache is updated rather than expanded across loops.

∗Equal contribution. †Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.

**Figure 2.** Visualization of the MELT architecture and its KV cache dynamics. The pink arrows…

**Figure 3.** Visualization of the Phase 1 training techniques proposed.

**Figure 4.** Example reasoning trace in Ouro-1.4B-Thinking illustrating the failure mode of last-loop…

**Figure 5.** The auxiliary alignment loss matches MELT attention outputs to the corresponding outputs of the frozen LoopLM teacher at each layer and reasoning loop.
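The Figure 5 caption describes an auxiliary loss that matches student and teacher attention outputs at each layer and loop. A minimal sketch of such a loss follows, assuming a mean squared error as the distance (the page does not say which distance the paper uses) and plain nested lists indexed as [loop][layer][feature]. The function name is hypothetical.

```python
def attention_alignment_loss(student_outputs, teacher_outputs):
    """Mean squared error between student and teacher attention outputs.

    Both arguments are nested lists indexed as [loop][layer][feature].
    The teacher values are treated as fixed targets (frozen teacher).
    """
    total, count = 0.0, 0
    for student_loop, teacher_loop in zip(student_outputs, teacher_outputs):
        for student_layer, teacher_layer in zip(student_loop, teacher_loop):
            for s, t in zip(student_layer, teacher_layer):
                total += (s - t) ** 2
                count += 1
    return total / count


# Toy example: 2 loops, 2 layers, 2 features; one feature differs by 2.0.
teacher = [[[1.0, 0.0], [0.5, 0.5]], [[0.0, 1.0], [1.0, 1.0]]]
student = [[[1.0, 0.0], [0.5, 0.5]], [[0.0, 1.0], [1.0, 3.0]]]
loss = attention_alignment_loss(student, teacher)
print(loss)
```

In a real training loop this term would be added to the main objective with some weight. That weight is a hyperparameter not stated on this page.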

Original abstract

Recurrent LLM architectures have emerged as a promising approach for improving reasoning, as they enable multi-step computation in the embedding space without generating intermediate tokens. Models such as Ouro perform reasoning by iteratively updating internal representations while retaining a standard Key-Value (KV) cache across iterations, causing memory consumption to grow linearly with reasoning depth. Consequently, increasing the number of reasoning iterations can lead to prohibitive memory usage, limiting the practical scalability of such architectures. In this work, we propose Memory-Efficient Looped Transformer (MELT), a novel architecture that decouples reasoning depth from memory consumption. Instead of using a standard KV cache per layer and loop, MELT maintains a single KV cache per layer that is shared across reasoning loops. This cache is updated over time via a learnable gating mechanism. To enable stable and efficient training under this architecture, we propose to train MELT using chunk-wise training in a two-phase procedure: interpolated transition, followed by attention-aligned distillation, both from the LoopLM starting model to MELT. Empirically, we show that MELT models fine-tuned from pretrained Ouro parameters outperform standard LLMs of comparable size, while maintaining a memory footprint comparable to those models and dramatically smaller than Ouro's. Overall, MELT achieves constant-memory iterative reasoning without sacrificing LoopLM performance, using only a lightweight post-training procedure.
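The abstract names an "interpolated transition" phase without spelling out its mechanics. One plausible reading, offered purely as an assumption, is a scheduled blend that starts from the LoopLM behaviour and ends at the MELT behaviour. The linear ramp, the function names, and the idea of blending two candidate values are all illustrative and are not confirmed by this page.

```python
def interpolation_weight(step, total_steps):
    """Linear ramp from 0.0 (pure LoopLM behaviour) to 1.0 (pure MELT behaviour)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return min(1.0, max(0.0, step / total_steps))


def interpolate(looplm_value, melt_value, alpha):
    """Element-wise convex blend of the two candidate values."""
    return [(1.0 - alpha) * a + alpha * b
            for a, b in zip(looplm_value, melt_value)]


# Hypothetical values standing in for what each variant would attend over.
looplm_value = [0.0, 2.0]
melt_value = [4.0, 2.0]
for step in (0, 5, 10):
    alpha = interpolation_weight(step, total_steps=10)
    print(step, alpha, interpolate(looplm_value, melt_value, alpha))
```

A gradual ramp of this kind is a common way to keep a pretrained model stable while its architecture changes. Whether the paper uses a linear schedule is not stated here.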

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MELT shares one gated KV cache per layer to cap memory at constant cost while keeping looped reasoning performance, but the gating and distillation transfer still need tighter validation on long chains.

The core move here is replacing Ouro-style per-iteration KV caches with a single shared cache per layer that gets updated by a learnable gate. That directly attacks the linear memory growth that has limited how deep these recurrent models can run on current hardware. The two-phase chunk-wise training—first an interpolated transition, then attention-aligned distillation from the base LoopLM—is a practical way to adapt an existing pretrained model without retraining from scratch, and that part looks like a useful engineering contribution rather than a wholly new theoretical insight.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Memory-Efficient Looped Transformer (MELT), a modification to looped language models such as Ouro. MELT replaces per-iteration KV caches with a single shared KV cache per layer that is updated across reasoning loops by a learnable gating mechanism. Training uses a two-phase chunk-wise procedure (interpolated transition followed by attention-aligned distillation) from a pretrained LoopLM starting model. The central empirical claim is that fine-tuned MELT models outperform standard LLMs of comparable size while maintaining memory usage comparable to standard models and dramatically lower than Ouro, thereby achieving constant-memory iterative reasoning without performance loss via only lightweight post-training.

Significance. If the performance and memory claims are substantiated, the work would address a practical scalability barrier in recurrent LLM architectures by removing linear memory growth with reasoning depth. The lightweight post-training recipe from existing Ouro parameters is a pragmatic strength that could facilitate adoption. No parameter-free derivations or formal information-retention bounds are indicated, so significance rests entirely on the empirical results.

major comments (2)

[Abstract] Abstract: the central claim that MELT 'outperform[s] standard LLMs of comparable size' and has 'dramatically smaller' memory than Ouro is presented without any quantitative metrics, task descriptions, baseline comparisons, ablation results, or error analysis. This is load-bearing for the performance and memory assertions; the abstract supplies no numbers against which to evaluate whether the data support the claims.
[Architecture and gating description] Description of the learnable gating mechanism: the architecture replaces Ouro's independent per-iteration caches with a single shared KV cache updated by gating. No analysis, bounds, or experiments are supplied on whether this update preserves critical state across arbitrary reasoning depths (especially beyond training chunk lengths). If the gate is low-rank or linear, irreversible compression could occur; the two-phase distillation procedure may align only at fixed chunk sizes and fail to generalize. This directly affects the constant-memory claim.

minor comments (2)

[Method] Clarify the exact functional form, initialization, and parameter count of the gating mechanism relative to the base Ouro model.
[Experiments] Add a table or figure showing memory usage and accuracy versus number of reasoning loops for MELT, Ouro, and standard LLMs.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the detailed and constructive report. We address each major comment point by point below, indicating where revisions will be made to the manuscript.

Referee: [Abstract] Abstract: the central claim that MELT 'outperform[s] standard LLMs of comparable size' and has 'dramatically smaller' memory than Ouro is presented without any quantitative metrics, task descriptions, baseline comparisons, ablation results, or error analysis. This is load-bearing for the performance and memory assertions; the abstract supplies no numbers against which to evaluate whether the data support the claims.

Authors: We agree that the abstract would be strengthened by including concrete quantitative support. In the revised version we will update the abstract to report specific performance improvements (e.g., accuracy gains on reasoning benchmarks relative to standard LLMs of similar size), memory-footprint comparisons (constant vs. linear scaling with Ouro), and brief references to the evaluation tasks and baselines used.
Revision: yes
Referee: [Architecture and gating description] Description of the learnable gating mechanism: the architecture replaces Ouro's independent per-iteration caches with a single shared KV cache updated by gating. No analysis, bounds, or experiments are supplied on whether this update preserves critical state across arbitrary reasoning depths (especially beyond training chunk lengths). If the gate is low-rank or linear, irreversible compression could occur; the two-phase distillation procedure may align only at fixed chunk sizes and fail to generalize. This directly affects the constant-memory claim.

Authors: We acknowledge the absence of formal analysis or bounds on long-horizon state preservation. The manuscript currently relies on empirical results obtained within the chunk lengths used during the two-phase distillation. We will add a dedicated discussion subsection describing the gating design and why the learnable parameters plus attention-aligned distillation are intended to mitigate irreversible compression. We will also include new experiments that extend reasoning depth beyond the training chunk sizes to test generalization of the constant-memory behavior.
Revision: partial

standing simulated objections not resolved

Formal information-retention bounds or parameter-free derivations for the gating mechanism across arbitrary reasoning depths

Circularity Check

0 steps flagged

No circularity: empirical architecture and training claims are independent of inputs

The paper introduces MELT as a concrete architectural change (single shared KV cache per layer updated by learnable gating) plus a two-phase chunk-wise training procedure (interpolated transition followed by attention-aligned distillation) applied to a pretrained Ouro/LoopLM base model. All central claims—constant memory, retained performance, and outperformance of comparable LLMs—are presented strictly as measured empirical outcomes of this design and procedure. No derivation, uniqueness theorem, or prediction is offered that reduces by construction to a fitted parameter, self-citation chain, or redefinition of the input model; the work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the effectiveness of a newly introduced gating mechanism and a two-phase training strategy that are not supported by prior independent evidence or external benchmarks in the abstract.

axioms (1)

domain assumption: The learnable gating mechanism can be trained to maintain necessary state information across loops without performance degradation.
This assumption underpins the claim that constant memory is achieved while preserving LoopLM capabilities.

invented entities (1)

Learnable gating mechanism for shared KV cache update (no independent evidence)
purpose: To refresh a single shared KV cache across multiple reasoning loops while keeping memory constant
This is a new architectural component introduced to solve the memory growth problem in looped transformers.

pith-pipeline@v0.9.0 · 5804 in / 1336 out tokens · 80117 ms · 2026-05-20T22:56:20.290671+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: MELT maintains a single KV cache per layer that is shared across reasoning loops. This cache is updated over time via a learnable gating mechanism... $h^{(l)}_t = z^{(l)}_t \odot h^{(l)}_{t-1} + (1 - z^{(l)}_t) \odot x^{(l)}_t$

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · reality_from_one_distinction · unclear

Paper passage: Recurrent steps: 4 (hyperparameters); chunk-wise training with fixed chunk size 500
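The quoted hyperparameters mention a fixed chunk size of 500. As a minimal illustration of the chunking step only, the sketch below splits a token sequence into consecutive chunks. How state or gradients are carried between chunks during training is not described on this page, so that part is left out. The function name is hypothetical.

```python
def split_into_chunks(token_ids, chunk_size=500):
    """Split a token sequence into consecutive chunks of at most `chunk_size`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [token_ids[i:i + chunk_size]
            for i in range(0, len(token_ids), chunk_size)]


tokens = list(range(1200))          # stand-in for a tokenized training sequence
chunks = split_into_chunks(tokens)
print([len(chunk) for chunk in chunks])
```

The last chunk is shorter whenever the sequence length is not a multiple of the chunk size. A training pipeline would need to pad or mask it accordingly.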

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · 22 internal anchors

[1]

Scaling Latent Reasoning via Looped Language Models

Rui-Jie Zhu, Zixuan Wang, Kai Hua, Tianyu Zhang, Ziniu Li, Haoran Que, Boyi Wei, Zixin Wen, Fan Yin, He Xing, Lu Li, Jiajun Shi, Kaijing Ma, Shanda Li, Taylor Kergan, Andrew Smith, Xingwei Qu, Mude Hui, Bohong Wu, Qiyang Min, Hongzhi Huang, Xun Zhou, Wei Ye, Jiaheng Liu, Jian Yang, Yunfeng Shi, Chenghua Lin, Enduo Zhao, Tianle Cai, Ge Zhang, Wenhao Huang,...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Universal transformers

Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. Universal transformers. In International Conference on Learning Representations (ICLR), 2019

work page 2019
[3]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022

work page 2022
[4]

Hierarchical Reasoning Model

Guan Wang, Jin Li, Yuhao Sun, Xing Chen, Changling Liu, Yue Wu, Meng Lu, Sen Song, and Yasin Abbasi Yadkori. Hierarchical Reasoning Model, 2025. URL https://arxiv.org/abs/2506.21734. Version Number: 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Less is More: Recursive Reasoning with Tiny Networks

Alexia Jolicoeur-Martineau. Less is more: Recursive reasoning with tiny networks, 2025. URL https://arxiv.org/abs/2510.04871

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models

Tianyu Fu, Yichen You, Zekai Chen, Guohao Dai, Huazhong Yang, and Yu Wang. Think-at-hard: Selective latent iterations to improve reasoning language models, 2026. URL https://arxiv.org/abs/2511.08577

work page internal anchor Pith review Pith/arXiv arXiv 2026
[8]

Nikunj Saunshi, Nishanth Dikkala, Zhiyuan Li, Sanjiv Kumar, and Sashank J. Reddi. Reasoning with latent thoughts: On the power of looped transformers. arXiv preprint arXiv:2502.17416,

work page arXiv
[10]

Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers

Harsh Kohli, Srinivasan Parthasarathy, Huan Sun, and Yuekun Yao. Loop, think, & generalize: Implicit reasoning in recurrent-depth transformers, 2026. URL https://arxiv.org/abs/2604.07822

work page internal anchor Pith review Pith/arXiv arXiv 2026
[12]

Looped transformers are better at learning learning algorithms

Liu Yang, Kangwook Lee, Robert Nowak, and Dimitris Papailiopoulos. Looped transformers are better at learning learning algorithms. arXiv preprint arXiv:2311.12424, 2024. doi: 10.48550/ARXIV.2311.12424. URL https://arxiv.org/abs/2311.12424. Accepted at ICLR 2024

work page arXiv 2024
[13]

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up test-time compute with latent reasoning: A recurrent depth approach. arXiv preprint arXiv:2502.05171, 2025. doi: 10.48550/ARXIV.2502.05171. URL https://arxiv.org/abs/2502.05171

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv 2025
[14]

Hayden Prairie, Zachary Novack, Taylor Berg-Kirkpatrick, and Daniel Y. Fu. Parcae: Scaling laws for stable looped language models, 2026. URL https://arxiv.org/abs/2604.12946

work page internal anchor Pith review Pith/arXiv arXiv 2026
[15]

Hyperloop Transformers

Abbas Zeitoun, Lucas Torroba-Hennigen, and Yoon Kim. Hyperloop transformers, 2026. URL https://arxiv.org/abs/2604.21254

work page internal anchor Pith review Pith/arXiv arXiv 2026
[16]

Fast Transformer Decoding: One Write-Head is All You Need

Noam Shazeer. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150, 2019. URL https://arxiv.org/abs/1911.02150

work page internal anchor Pith review Pith/arXiv arXiv 1911
[17]

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai. GQA: Training generalized multi-query transformer models. arXiv preprint arXiv:2305.13245, 2023.

work page internal anchor Pith review Pith/arXiv arXiv 2023
[18]

Reducing transformer key-value cache size with cross-layer attention

William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, and Jonathan Ragan Kelly. Reducing transformer key-value cache size with cross-layer attention, 2024. URL https://arxiv.org/abs/2405.12981

work page arXiv 2024
[19]

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Hao Yang, Haowei Zhang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Li, Hui Qu, J. L. Cai, Jian L...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

Parallel loop transformer for efficient test-time computation scaling

Bohong Wu, Mengzhao Chen, Xiang Luo, Shen Yan, Qifan Yu, Fan Xia, Tianqi Zhang, Hongrui Zhan, Zheng Zhong, Xun Zhou, Siyuan Qiao, and Xingyan Bin. Parallel loop transformer for efficient test-time computation scaling, 2025. URL https://arxiv.org/abs/2510.24824

work page arXiv 2025
[21]

Mixture-of-recursions: Learning dynamic recursive depths for adaptive token-level computation

Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, and Se-Young Yun. Mixture-of-recursions: Learning dynamic recursive depths for adaptive token-level computation, 2025. URL https://arxiv.org/abs/2507.10524

work page arXiv 2025
[22]

Progressive growing of gans for improved quality, stability, and variation, 2018

Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation, 2018. URL https://arxiv.org/abs/1710.10196

work page 2018
[23]

Progressive residual warmup for language model pretraining, 2026

Tianhao Chen, Xin Xu, Lu Yin, Hao Chen, Yang Wang, Shizhe Diao, and Can Yang. Progressive residual warmup for language model pretraining, 2026. URL https://arxiv.org/abs/2603.05369

work page 2026
[24]

Learning without Forgetting

Zhizhong Li and Derek Hoiem. Learning without forgetting, 2017. URL https://arxiv.org/abs/1606.09282

work page internal anchor Pith review Pith/arXiv arXiv 2017
[25]

Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion

Zhen Cheng, Hao-Bo Yang, Wan-Yi Huang, and Jin-Long Li. Attention editing: A versatile framework for cross-architecture attention conversion, 2026. URL https://arxiv.org/abs/2604.05688

work page internal anchor Pith review Pith/arXiv arXiv 2026
[26]

Sparse upcycling: Training mixture-of-experts from dense checkpoints, 2023

Aran Komatsuzaki, Joan Puigcerver, James Lee-Thorp, Carlos Riquelme Ruiz, Basil Mustafa, Joshua Ainslie, Yi Tay, Mostafa Dehghani, and Neil Houlsby. Sparse upcycling: Training mixture-of-experts from dense checkpoints, 2023. URL https://arxiv.org/abs/2212.05055

work page 2023
[27]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the Knowledge in a Neural Network, March 2015. URL http://arxiv.org/abs/1503.02531. arXiv:1503.02531 [stat]

work page internal anchor Pith review Pith/arXiv arXiv 2015
[28]

Knowledge Distillation from Internal Representations, January 2020

Gustavo Aguilar, Yuan Ling, Yu Zhang, Benjamin Yao, Xing Fan, and Chenlei Guo. Knowledge Distillation from Internal Representations, January 2020. URL http://arxiv.org/abs/1910.03723. arXiv:1910.03723 [cs]

work page arXiv 2020
[29]

Cross-Layer Distillation with Semantic Calibration, August 2021

Defang Chen, Jian-Ping Mei, Yuan Zhang, Can Wang, Yan Feng, and Chun Chen. Cross-Layer Distillation with Semantic Calibration, August 2021. URL http://arxiv.org/abs/2012.03236. arXiv:2012.03236 [cs]

work page arXiv 2021
[30]

Compact language models via pruning and knowledge distillation, 2024

Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, and Pavlo Molchanov. Compact Language Models via Pruning and Knowledge Distillation, November 2024. URL http://arxiv.org/abs/2407.14679. arXiv:2407.14679 [cs]

work page arXiv 2024
[31]

A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation through Low-Rank Clone, December 2025

Jitai Hao, Qiang Huang, Hao Liu, Xinyan Xiao, Zhaochun Ren, and Jun Yu. A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation through Low-Rank Clone, December 2025. URL http://arxiv.org/abs/2505.12781. arXiv:2505.12781 [cs].

work page arXiv 2025
[32]

Knowledge Distillation and Dataset Distillation of Large Language Models: Emerging Trends, Challenges, and Future Directions, January 2026

Luyang Fang, Xiaowei Yu, Jiazhang Cai, Yongkai Chen, Shushan Wu, Zhengliang Liu, Zhenyuan Yang, Haoran Lu, Xilin Gong, Yufang Liu, Terry Ma, Wei Ruan, Ali Abbasi, Jing Zhang, Tao Wang, Ehsan Latif, Weihang You, Hanqi Jiang, Wei Liu, Wei Zhang, Soheil Kolouri, Xiaoming Zhai, Dajiang Zhu, Wenxuan Zhong, Tianming Liu, and Ping Ma. Knowledge Distillation and ...

work page arXiv 2026
[33]

Acereason-nemotron 1.1: Advancing math and code reasoning through sft and rl synergy

Zihan Liu, Zhuolin Yang, Yang Chen, Chankyu Lee, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy, June 2025. URL http://arxiv.org/abs/2506.13284

work page arXiv 2025
[34]

OpenThoughts: Data Recipes for Reasoning Models

Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, Ashima Suvarna, Benjamin Feuer, Liangyu Chen, Zaid Khan, Eric Frankel, Sachin Grover, Caroline Choi, Niklas Muennighoff, Shiye Su, Wanjia Zhao, John Yang, Shreyas Pimpalgaonkar, Kartik Sharma, Charlie Cheng-Jie Ji, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

American invitational mathematics examination (aime) 2024, 2024

Mathematical Association of America. American invitational mathematics examination (aime) 2024, 2024. URL https://maa.org/. Problems I and II

work page 2024
[36]

American invitational mathematics examination (aime) 2025, 2025

Mathematical Association of America. American invitational mathematics examination (aime) 2025, 2025. URL https://maa.org/. Problems I and II

work page 2025
[37]

American invitational mathematics examination (aime) 2026, 2026

Mathematical Association of America. American invitational mathematics examination (aime) 2026, 2026. URL https://maa.org/. Problems I and II

work page 2026
[38]

American mathematics competitions (amc) 10/12 2023,

Mathematical Association of America. American mathematics competitions (amc) 10/12 2023,

work page 2023
[39]

Let's Verify Step by Step

Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. arXiv preprint arXiv:2305.20050, 2023. doi: 10.48550/arXiv.2305.20050. URL https://arxiv.org/abs/2305.20050

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2305.20050 2023
[40]

Olympiadbench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Zhou, Lei Hou, Juanzi Li, and Maosong Sun. Olympiadbench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In Proceedings of the 62nd Annual Meeting of the Association for Co...

work page 2024
[41]

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark, 2023. URL https://arxiv.org/abs/2311.12022

work page internal anchor Pith review Pith/arXiv arXiv 2023
[42]

Humanity's Last Exam

Long Phan, Alice Gatti, Nathaniel Li, Adam Khoja, Ryan Kim, Richard Ren, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Dan Hendrycks, Ziwen Han, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Aakaash Nattanmai, Gordon McKellips, Anish Cheraku, Asim Suhail, Luo, et al. A ben...

work page internal anchor Pith review doi:10.1038/s41586-025-09962-4 2026
[43]

Are we done with mmlu?, 2024

Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, Claire Barale, Robert McHardy, Joshua Harris, Jean Kaddour, Emile van Krieken, and Pasquale Minervini. Are we done with mmlu?, 2024.

work page 2024
[45]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

work page 2021
[46]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, and others. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388

work page internal anchor Pith review Pith/arXiv arXiv 2025
[47]

Gemma open models, 2024

Google. Gemma open models, 2024. URL https://ai.google.dev/gemma

work page 2024
[48]

Qwen3.5: Accelerating productivity with native multimodal agents, February

Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February

work page
[49]

URL https://qwen.ai/blog?id=qwen3.5

work page
[50]

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Wu, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645(8081):633–6...

work page doi:10.1038/s41586-025-09422-z 2025
[51]

Efficient memory management for large language model serving with pagedattention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

work page 2023
[52]

A Mechanistic Analysis of Looped Reasoning Language Models

Hugh Blayney, Álvaro Arroyo, Johan Obando-Ceron, Pablo Samuel Castro, Aaron Courville, Michael M. Bronstein, and Xiaowen Dong. A mechanistic analysis of looped reasoning language models, 2026. URL https://arxiv.org/abs/2604.11791

work page internal anchor Pith review Pith/arXiv arXiv 2026
[53]

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2021. URL https://arxiv.org/abs/2009.03300

work page internal anchor Pith review Pith/arXiv arXiv 2021
[54]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[55]

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-...

work page 2020
[56]

TRL: Transformers Reinforcement Learning

Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. TRL: Transformers Reinforcement Learning, 2020. URL https://github.com/huggingface/trl

work page 2020
[57]

Lighteval: A lightweight framework for LLM evaluation

Nathan Habib, Clémentine Fourrier, Hynek Kydlíček, Thomas Wolf, and Lewis Tunstall. Lighteval: A lightweight framework for LLM evaluation, 2023. URL https://github.com/huggingface/lighteval.

A Extended Related Work

Looped transformers. While CoT [3] and other ITC techniques have recently been highly influential, a complementary direction has emerge...

work page 2023
[58]

Term 2: The derivative of the sigmoid function, $\sigma'(u) = \sigma(u)(1 - \sigma(u))$, vanishes as $z_t \to 1$. Thus, $\frac{\partial z_t}{\partial h_{t-1}} \to 0$.

work page
[59]

Term 3: The term $(1 - z_t)$ approaches 0, nullifying the contribution of the recurrent weight matrix in $\frac{\partial \tilde{h}_t}{\partial h_{t-1}}$. Consequently: $\lim_{z \to 1} J_t = I + 0 + 0 \implies J_t \approx I$. Since the eigenvalues of the identity matrix are all 1, the spectral radius is $\rho(J_t) = 1$.

Proposition E.1 gives more insights into the role of the gate $z_t$. Rather than simply selecting information, it a...
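The limiting behavior described in Terms 2 and 3 can be checked numerically. The sketch below is illustrative only: it assumes a GRU-style update $h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t$ with $z_t = \sigma(W_z h_{t-1} + b_z)$ and $\tilde{h}_t = \tanh(W_h h_{t-1})$, which is not stated in this excerpt. The names `gated_jacobian`, `W_z`, `W_h`, and `b_z` are hypothetical, and the gate is pushed toward 1 simply by raising the bias `b_z`.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def gated_jacobian(h_prev, W_z, W_h, b_z):
    """Analytic Jacobian dh_t/dh_{t-1} for the ASSUMED update
    h_t = z * h_prev + (1 - z) * h_tilde (not taken from the paper)."""
    z = sigmoid(W_z @ h_prev + b_z)       # gate
    h_tilde = np.tanh(W_h @ h_prev)       # candidate state
    term1 = np.diag(z)                                          # direct path, tends to I
    term2 = np.diag((h_prev - h_tilde) * z * (1.0 - z)) @ W_z   # through the gate
    term3 = np.diag((1.0 - z) * (1.0 - h_tilde**2)) @ W_h       # through the candidate
    return term1 + term2 + term3

rng = np.random.default_rng(0)
d = 4
h_prev = rng.normal(size=d)
W_z = rng.normal(size=(d, d))
W_h = rng.normal(size=(d, d))

# A larger gate bias drives z toward 1, so J should approach the identity.
for b_z in (0.0, 5.0, 20.0):
    J = gated_jacobian(h_prev, W_z, W_h, b_z)
    gap = np.linalg.norm(J - np.eye(d))
    rho = np.max(np.abs(np.linalg.eigvals(J)))
    print(f"b_z={b_z:5.1f}  ||J - I||={gap:.2e}  rho(J)={rho:.4f}")
```

Under these assumptions, the printed gap to the identity should shrink as `b_z` grows, and the spectral radius should approach 1, matching the three-term argument.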

work page 2024

[1] [1]

Scaling Latent Reasoning via Looped Language Models

Rui-Jie Zhu, Zixuan Wang, Kai Hua, Tianyu Zhang, Ziniu Li, Haoran Que, Boyi Wei, Zixin Wen, Fan Yin, He Xing, Lu Li, Jiajun Shi, Kaijing Ma, Shanda Li, Taylor Kergan, Andrew Smith, Xingwei Qu, Mude Hui, Bohong Wu, Qiyang Min, Hongzhi Huang, Xun Zhou, Wei Ye, Jiaheng Liu, Jian Yang, Yunfeng Shi, Chenghua Lin, Enduo Zhao, Tianle Cai, Ge Zhang, Wenhao Huang,...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Universal transformers

Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. Universal transformers. In International Conference on Learning Representations (ICLR), 2019

work page 2019

[3] [3]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022

work page 2022

[4] [4]

Hierarchical Reasoning Model

Guan Wang, Jin Li, Yuhao Sun, Xing Chen, Changling Liu, Yue Wu, Meng Lu, Sen Song, and Yasin Abbasi Yadkori. Hierarchical Reasoning Model, 2025. URL https://arxiv.org/abs/2506.21734. Version Number: 3

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

Less is More: Recursive Reasoning with Tiny Networks

Alexia Jolicoeur-Martineau. Less is more: Recursive reasoning with tiny networks, 2025. URL https://arxiv.org/abs/2510.04871

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [7]

Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models

Tianyu Fu, Yichen You, Zekai Chen, Guohao Dai, Huazhong Yang, and Yu Wang. Think-at-hard: Selective latent iterations to improve reasoning language models, 2026. URL https://arxiv.org/abs/2511.08577

work page internal anchor Pith review Pith/arXiv arXiv 2026

[7] [8]

Nikunj Saunshi, Nishanth Dikkala, Zhiyuan Li, Sanjiv Kumar, and Sashank J. Reddi. Reasoning with latent thoughts: On the power of looped transformers. arXiv preprint arXiv:2502.17416,

work page arXiv

[8] [10]

Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers

Harsh Kohli, Srinivasan Parthasarathy, Huan Sun, and Yuekun Yao. Loop, think, & generalize: Implicit reasoning in recurrent-depth transformers, 2026. URL https://arxiv.org/abs/2604.07822

work page internal anchor Pith review Pith/arXiv arXiv 2026

[9] [12]

Looped transformers are better at learning learning algorithms

Liu Yang, Kangwook Lee, Robert Nowak, and Dimitris Papailiopoulos. Looped transformers are better at learning learning algorithms. arXiv preprint arXiv:2311.12424, 2024. doi: 10.48550/ARXIV.2311.12424. URL https://arxiv.org/abs/2311.12424. Accepted at ICLR 2024

work page arXiv 2024

[10] [13]

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up test-time compute with latent reasoning: A recurrent depth approach. arXiv preprint arXiv:2502.05171, 2025. doi: 10.48550/ARXIV.2502.05171. URL https://arxiv.org/abs/2502.05171

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv 2025

[11] [14]

Hayden Prairie, Zachary Novack, Taylor Berg-Kirkpatrick, and Daniel Y. Fu. Parcae: Scaling laws for stable looped language models, 2026. URL https://arxiv.org/abs/2604.12946

work page internal anchor Pith review Pith/arXiv arXiv 2026

[12] [15]

Hyperloop Transformers

Abbas Zeitoun, Lucas Torroba-Hennigen, and Yoon Kim. Hyperloop transformers, 2026. URL https://arxiv.org/abs/2604.21254

work page internal anchor Pith review Pith/arXiv arXiv 2026

[13] [16]

Fast Transformer Decoding: One Write-Head is All You Need

Noam Shazeer. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150, 2019. URL https://arxiv.org/abs/1911.02150

work page internal anchor Pith review Pith/arXiv arXiv 1911

[14] [17]

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023.

work page internal anchor Pith review Pith/arXiv arXiv 2023

[15] [18]

Reducing transformer key-value cache size with cross-layer attention

William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, and Jonathan Ragan Kelly. Reducing transformer key-value cache size with cross-layer attention, 2024. URL https://arxiv.org/abs/2405.12981

work page arXiv 2024

[16] [19]

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Hao Yang, Haowei Zhang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Li, Hui Qu, J. L. Cai, Jian L...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[17] [20]

Parallel loop transformer for efficient test-time computation scaling

Bohong Wu, Mengzhao Chen, Xiang Luo, Shen Yan, Qifan Yu, Fan Xia, Tianqi Zhang, Hongrui Zhan, Zheng Zhong, Xun Zhou, Siyuan Qiao, and Xingyan Bin. Parallel loop transformer for efficient test-time computation scaling, 2025. URL https://arxiv.org/abs/2510.24824

work page arXiv 2025

[18] [21]

Mixture-of-recursions: Learning dynamic recursive depths for adaptive token-level computation

Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, and Se-Young Yun. Mixture-of-recursions: Learning dynamic recursive depths for adaptive token-level computation, 2025. URL https://arxiv.org/abs/2507.10524

work page arXiv 2025

[19] [22]

Progressive growing of GANs for improved quality, stability, and variation

Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation, 2018. URL https://arxiv.org/abs/1710.10196

work page 2018

[20] [23]

Progressive residual warmup for language model pretraining

Tianhao Chen, Xin Xu, Lu Yin, Hao Chen, Yang Wang, Shizhe Diao, and Can Yang. Progressive residual warmup for language model pretraining, 2026. URL https://arxiv.org/abs/2603.05369

work page 2026

[21] [24]

Learning without Forgetting

Zhizhong Li and Derek Hoiem. Learning without forgetting, 2017. URL https://arxiv.org/abs/1606.09282

work page internal anchor Pith review Pith/arXiv arXiv 2017

[22] [25]

Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion

Zhen Cheng, Hao-Bo Yang, Wan-Yi Huang, and Jin-Long Li. Attention editing: A versatile framework for cross-architecture attention conversion, 2026. URL https://arxiv.org/abs/2604.05688

work page internal anchor Pith review Pith/arXiv arXiv 2026

[23] [26]

Sparse upcycling: Training mixture-of-experts from dense checkpoints

Aran Komatsuzaki, Joan Puigcerver, James Lee-Thorp, Carlos Riquelme Ruiz, Basil Mustafa, Joshua Ainslie, Yi Tay, Mostafa Dehghani, and Neil Houlsby. Sparse upcycling: Training mixture-of-experts from dense checkpoints, 2023. URL https://arxiv.org/abs/2212.05055

work page 2023

[24] [27]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the Knowledge in a Neural Network, March 2015. URL http://arxiv.org/abs/1503.02531. arXiv:1503.02531 [stat]

work page internal anchor Pith review Pith/arXiv arXiv 2015

[25] [28]

Knowledge Distillation from Internal Representations

Gustavo Aguilar, Yuan Ling, Yu Zhang, Benjamin Yao, Xing Fan, and Chenlei Guo. Knowledge Distillation from Internal Representations, January 2020. URL http://arxiv.org/abs/1910.03723. arXiv:1910.03723 [cs]

work page arXiv 2020

[26] [29]

Cross-Layer Distillation with Semantic Calibration

Defang Chen, Jian-Ping Mei, Yuan Zhang, Can Wang, Yan Feng, and Chun Chen. Cross-Layer Distillation with Semantic Calibration, August 2021. URL http://arxiv.org/abs/2012.03236. arXiv:2012.03236 [cs]

work page arXiv 2021

[27] [30]

Compact Language Models via Pruning and Knowledge Distillation

Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, and Pavlo Molchanov. Compact Language Models via Pruning and Knowledge Distillation, November 2024. URL http://arxiv.org/abs/2407.14679. arXiv:2407.14679 [cs]

work page arXiv 2024

[28] [31]

A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation through Low-Rank Clone

Jitai Hao, Qiang Huang, Hao Liu, Xinyan Xiao, Zhaochun Ren, and Jun Yu. A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation through Low-Rank Clone, December 2025. URL http://arxiv.org/abs/2505.12781. arXiv:2505.12781 [cs].

work page arXiv 2025

[29] [32]

Knowledge Distillation and Dataset Distillation of Large Language Models: Emerging Trends, Challenges, and Future Directions

Luyang Fang, Xiaowei Yu, Jiazhang Cai, Yongkai Chen, Shushan Wu, Zhengliang Liu, Zhenyuan Yang, Haoran Lu, Xilin Gong, Yufang Liu, Terry Ma, Wei Ruan, Ali Abbasi, Jing Zhang, Tao Wang, Ehsan Latif, Weihang You, Hanqi Jiang, Wei Liu, Wei Zhang, Soheil Kolouri, Xiaoming Zhai, Dajiang Zhu, Wenxuan Zhong, Tianming Liu, and Ping Ma. Knowledge Distillation and ...

work page arXiv 2026

[30] [33]

AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy

Zihan Liu, Zhuolin Yang, Yang Chen, Chankyu Lee, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy, June 2025. URL http://arxiv.org/abs/2506.13284

work page arXiv 2025

[31] [34]

OpenThoughts: Data Recipes for Reasoning Models

Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, Ashima Suvarna, Benjamin Feuer, Liangyu Chen, Zaid Khan, Eric Frankel, Sachin Grover, Caroline Choi, Niklas Muennighoff, Shiye Su, Wanjia Zhao, John Yang, Shreyas Pimpalgaonkar, Kartik Sharma, Charlie Cheng-Jie Ji, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[32] [35]

American Invitational Mathematics Examination (AIME) 2024

Mathematical Association of America. American Invitational Mathematics Examination (AIME) 2024, 2024. URL https://maa.org/. Problems I and II

work page 2024

[33] [36]

American Invitational Mathematics Examination (AIME) 2025

Mathematical Association of America. American Invitational Mathematics Examination (AIME) 2025, 2025. URL https://maa.org/. Problems I and II

work page 2025

[34] [37]

American Invitational Mathematics Examination (AIME) 2026

Mathematical Association of America. American Invitational Mathematics Examination (AIME) 2026, 2026. URL https://maa.org/. Problems I and II

work page 2026

[35] [38]

American mathematics competitions (amc) 10/12 2023,

Mathematical Association of America. American mathematics competitions (amc) 10/12 2023,

work page 2023

[36] [39]

Let's Verify Step by Step

Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. arXiv preprint arXiv:2305.20050, 2023. doi: 10.48550/arXiv.2305.20050. URL https://arxiv.org/abs/2305.20050

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2305.20050 2023

[37] [40]

Olympiadbench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Zhou, Lei Hou, Juanzi Li, and Maosong Sun. Olympiadbench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In Proceedings of the 62nd Annual Meeting of the Association for Co...

work page 2024

[38] [41]

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. GPQA: A graduate-level Google-proof Q&A benchmark, 2023. URL https://arxiv.org/abs/2311.12022

work page internal anchor Pith review Pith/arXiv arXiv 2023

[39] [42]

Humanity's Last Exam

Long Phan, Alice Gatti, Nathaniel Li, Adam Khoja, Ryan Kim, Richard Ren, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Dan Hendrycks, Ziwen Han, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Aakaash Nattanmai, Gordon McKellips, Anish Cheraku, Asim Suhail, Luo, et al. A ben...

work page internal anchor Pith review doi:10.1038/s41586-025-09962-4 2026

[40] [43]

Are we done with MMLU?

Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, Claire Barale, Robert McHardy, Joshua Harris, Jean Kaddour, Emile van Krieken, and Pasquale Minervini. Are we done with MMLU?, 2024.

work page 2024
