The Context-Ready Transformer

Mahesh Godavarti

arxiv: 2606.27538 · v1 · pith:VCXVDTJ4new · submitted 2026-06-25 · 💻 cs.CL · cs.AI

Pith reviewed 2026-06-29 01:56 UTC · model grok-4.3

keywords context-ready transformer · recurrent neural network · correction network · inference speedup · pre-contextualization · pointer-chasing task · model conversion · language modeling

The pith

A context-ready transformer with a correction network lets shallower models outperform deeper standard transformers at 1.7x to 2.6x inference speed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an architecture that adds a correction step to a standard transformer block so each token enters already contextualized from prior positions. During left-to-right generation the correction chain turns the model into a recurrent network that reuses a cached summary of past context. Training unrolls the correction K times across the full sequence to enable parallel computation while preserving the recurrent behavior at inference. If the approach holds, models with far fewer layers can match or exceed the accuracy of much deeper transformers, especially when representations are wide or contexts are long, and pretrained models can be adapted by fine-tuning a zero-initialized correction network. The design also shows an ability to solve deep compositional tasks where standard transformers of comparable depth fail.

Core claim

The context-ready transformer is a D-layer transformer block preceded by a correction network that combines the previous position's block output with the current token embedding so the token enters the block already contextualized. At sequential inference the correction chain makes the model recurrent; for training the correction is unrolled K times over the sequence and all positions are processed in parallel at each step. A pretrained transformer converts to this form by adding a zero-initialized correction FFN and fine-tuning. Across widths, depths, and two datasets a D=5 model beats a 12-layer transformer while generating 1.7x faster on an A100; with K=10 a single-layer model beats a 6-l

What carries the argument

The correction network, a feed-forward network that merges the cached previous block output with the current token embedding to pre-contextualize the input before it reaches the D-layer transformer block.
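A minimal sketch of such a correction step, in plain Python with toy sizes. The residual form, the two-layer shape, the ReLU, and the name `CorrectionFFN` are assumptions made for illustration; this page does not give the paper's equations. The sketch only shows why zero-initializing the output layer would leave a converted model's inputs unchanged before fine-tuning.

```python
import random

def linear(weights, bias, x):
    """Dense layer: `weights` holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

class CorrectionFFN:
    """Hypothetical correction FFN (assumed form, not taken from the paper).

    x_t = e_t + W2 relu(W1 [h_prev; e_t] + b1) + b2,
    with W2 and b2 zero-initialized so the correction starts as a no-op.
    """

    def __init__(self, d_model, d_hidden, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.gauss(0.0, 0.02) for _ in range(2 * d_model)]
                   for _ in range(d_hidden)]
        self.b1 = [0.0] * d_hidden
        self.w2 = [[0.0] * d_hidden for _ in range(d_model)]  # zero-init output layer
        self.b2 = [0.0] * d_model

    def __call__(self, h_prev, e_t):
        # List concatenation plays the role of [h_prev; e_t].
        hidden = relu(linear(self.w1, self.b1, h_prev + e_t))
        delta = linear(self.w2, self.b2, hidden)
        return [e + d for e, d in zip(e_t, delta)]

correction = CorrectionFFN(d_model=4, d_hidden=8)
h_prev = [0.3, -1.2, 0.5, 0.9]   # cached block output from the previous position
e_t = [0.1, 0.2, -0.4, 0.7]      # raw embedding of the current token
x_t = correction(h_prev, e_t)    # input that would be handed to the D-layer block
```

With the output layer at zero, `x_t` equals `e_t`, so under these assumptions the block initially sees the same inputs as the pretrained transformer, and fine-tuning is what moves the correction away from zero.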

If this is right

A D=5 context-ready model exceeds a 12-layer standard transformer while running 1.7 times faster on A100 hardware.
With K=10 a single-layer model exceeds a 6-layer transformer at 2.6 times the inference speed.
Sequential inference after K=10 parallel training stays within 0.01 PPL of the training regime.
The architecture improves most when representations are wide and contexts are long.
A single-layer version solves all 10 composition levels on a pointer-chasing task where standard transformers of similar depth fail.
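For readers unfamiliar with pointer chasing, the sketch below builds one instance of a generic version of the task: a table maps each node to another node, and the answer is the node reached after a given number of hops. The table size, the sampling scheme, and the function names are illustrative assumptions; this page does not specify the paper's exact task format.

```python
import random

def chase(pointers, start, hops):
    """Follow `hops` pointer lookups starting from `start`."""
    node = start
    for _ in range(hops):
        node = pointers[node]
    return node

def make_pointer_chasing_example(num_nodes, hops, seed=0):
    """Sample a random pointer table, a start node, and the resulting answer."""
    rng = random.Random(seed)
    pointers = [rng.randrange(num_nodes) for _ in range(num_nodes)]
    start = rng.randrange(num_nodes)
    return pointers, start, chase(pointers, start, hops)

# One example per composition level, here taken to mean 1 to 10 hops.
examples = {hops: make_pointer_chasing_example(num_nodes=16, hops=hops, seed=hops)
            for hops in range(1, 11)}
```

Each extra hop adds one more lookup that depends on the previous result, which is why the task is a natural probe of how many sequential composition steps a model can carry out.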

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The recurrence introduced by the correction chain may allow effective depth to grow with sequence length rather than with the number of layers.
Converting existing pretrained transformers via fine-tuning could provide a low-cost path to faster inference without retraining from scratch.
The pointer-chasing results suggest the design may handle tasks that require repeated composition better than fixed-depth attention stacks.
If the correction network generalizes, similar pre-contextualization steps could be added to other sequence architectures beyond transformers.

Load-bearing premise

Unrolling the correction process K times during training produces a model whose sequential inference behavior closely matches the parallel training regime without distribution shift or instability that would degrade the reported gains.
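The premise can be illustrated on a toy scalar recurrence. In the sketch below, `step` stands in for "apply the correction, then run the block" at one position; it has no attention, and the factor 0.5 is chosen so that the map is a contraction. Both choices are assumptions made for illustration, not properties reported for the paper's model. The sketch compares a true left-to-right pass with K parallel sweeps over the whole sequence.

```python
import math

def step(h_prev, e_t):
    """Toy stand-in for one position: previous output plus current embedding."""
    return math.tanh(0.5 * h_prev + e_t)

def sequential(embeddings):
    """Left-to-right recurrence, as at inference time."""
    h, outputs = 0.0, []
    for e_t in embeddings:
        h = step(h, e_t)
        outputs.append(h)
    return outputs

def unrolled(embeddings, K):
    """K parallel sweeps; each sweep reads the previous sweep's outputs shifted by one."""
    outputs = [0.0] * len(embeddings)
    for _ in range(K):
        shifted = [0.0] + outputs[:-1]
        outputs = [step(h_prev, e_t) for h_prev, e_t in zip(shifted, embeddings)]
    return outputs

embeddings = [math.sin(0.7 * t) for t in range(64)]
seq = sequential(embeddings)
par = unrolled(embeddings, K=10)
max_gap = max(abs(a - b) for a, b in zip(seq, par))
```

In this toy, the first K positions agree exactly after K sweeps, and later positions differ by at most 0.5^K because each sweep halves the remaining error. Whether a trained transformer block behaves like such a contraction is exactly what the premise leaves open.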

What would settle it

Train a D=1 model with K=10 unrolls, then measure perplexity on held-out data under pure sequential inference and check whether the value stays within 0.01 of the parallel K=10 training perplexity.
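A minimal sketch of the comparison itself, assuming per-token negative log-likelihoods (in nats) have already been collected on the same held-out tokens under each regime. The function names and the example values are hypothetical; no numbers here come from the paper.

```python
import math

def perplexity(token_nlls):
    """Perplexity is the exponential of the mean per-token negative log-likelihood."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def regimes_agree(nll_parallel, nll_sequential, tolerance=0.01):
    """Return (within_tolerance, absolute_gap) for the two perplexities."""
    gap = abs(perplexity(nll_parallel) - perplexity(nll_sequential))
    return gap <= tolerance, gap

# Made-up values, used only to show the call shape.
nll_parallel = [3.00, 3.10, 2.95, 3.05]
nll_sequential = [3.00, 3.10, 2.95, 3.05]
ok, gap = regimes_agree(nll_parallel, nll_sequential)
```

Reporting the gap per prefix length, rather than only as one aggregate, would also expose any accumulation of error along the sequence.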

Original abstract

We introduce the context-ready transformer, a new recurrent neural network architecture built from a D-layer transformer block that pre-contextualizes each token before it enters the block. During left-to-right generation, a correction network combines the previous position's block output -- a cached summary of past context -- with the current token embedding, so the tokenenters the block already contextualized rather than as a raw embedding. At sequential inference, the correction chain makes the architecture a recurrent neural network. For training, we unroll the correction process K times over the full sequence, processing all positions in parallel at each step. A pretrained transformer can also be converted to a context-ready model by adding a zero-initialized correction FFN and fine-tuning. We evaluate across widths, depths, block sizes, and two datasets, with all comparisons against standard transformers, variants, and ablations. A D=5 model beats a 12-layer transformer while generating 1.7x faster on an A100. With K=10, a single-layer model (D=1) beats a 6-layer transformer with a 2.6x inference speedup, and sequential inference matches parallel K=10 to within 0.01 PPL. The architecture benefits most from wide representations and long contexts. On a pointer-chasing task, D=1 trained with BPTT solves all 10 composition levels, while standard transformers exhibit staircase-like depth dependence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The context-ready transformer adds a correction network to pre-contextualize tokens from cached prior outputs, turning the block recurrent at inference with claimed speedups, but the unrolling-to-sequential match rests on thin evidence.

The core idea is a D-layer transformer block plus a correction network that mixes the previous block output (cached context summary) with the current token embedding before it enters the block. At inference this runs left-to-right as an RNN; training unrolls the correction K times in parallel over the full sequence. A pretrained model can be adapted by adding a zero-init correction FFN and fine-tuning.

What the paper actually shows is concrete: across widths and depths it reports a D=5 model beating a 12-layer baseline while generating 1.7x faster on an A100, and with K=10 a D=1 model beating a 6-layer one at a 2.6x inference speedup, with sequential inference staying within 0.01 PPL of the parallel K=10 regime. The pointer-chasing task is also interesting: D=1 trained with BPTT solves all 10 composition levels, whereas standard transformers show a staircase-like dependence on depth.

The soft spot is exactly the one the stress test flags. The central performance edge depends on the unrolled parallel training producing a model whose true sequential behavior matches without meaningful distribution shift or error buildup. The abstract states the 0.01 PPL match, but supplies no sequence-length breakdowns, variance numbers, or ablation on how closely the K-unroll regime approximates deployed RNN use. Without those details the speed and accuracy claims are hard to trust at face value.

This is for people working on inference-efficient autoregressive models who want a drop-in recurrent alternative rather than full retraining from scratch. The mechanism is simple enough and the conversion path practical enough that the work deserves a serious referee to examine the full experimental controls and the unrolling equivalence.

Referee Report

3 major / 2 minor

Summary. The paper introduces the Context-Ready Transformer, a recurrent architecture consisting of a D-layer transformer block augmented by a correction network (or correction FFN) that pre-contextualizes each token using the cached summary from the previous position. Training unrolls the correction process K times in parallel over the full sequence, while inference runs sequentially as an RNN. The manuscript reports empirical results across widths, depths, block sizes, and two datasets, claiming that a D=5 model outperforms a 12-layer standard transformer with 1.7x faster generation on A100, that a D=1 model with K=10 outperforms a 6-layer transformer with 2.6x speedup, and that sequential inference matches the parallel K=10 regime to within 0.01 PPL. Additional results include benefits for wide representations and long contexts, plus solving all 10 levels of a pointer-chasing task with D=1 via BPTT where standard transformers show depth-dependent staircase behavior. Pretrained transformers can be converted by adding a zero-initialized correction FFN and fine-tuning.

Significance. If the central performance and inference-matching claims hold under rigorous verification, the architecture would offer a practical route to recurrent inference with transformer blocks, potentially improving efficiency for long contexts and wide models while allowing reuse of pretrained weights. The reported ability of low-D models to match or exceed deeper transformers, combined with the pointer-chasing results, would be of interest if the gains are shown to arise from the recurrent structure rather than unaccounted capacity or training differences.

major comments (3)

[Abstract / Evaluation] Abstract and Evaluation section: The claim that 'sequential inference matches parallel K=10 to within 0.01 PPL' is load-bearing for validating that the reported speedups (1.7x for D=5, 2.6x for D=1) reflect deployed behavior rather than a training artifact. However, no details are provided on sequence lengths, variance across runs, or any measurement of error accumulation over long sequences, leaving open the possibility of distribution shift between the parallel unrolling regime and true left-to-right inference.
[Architecture] Architecture description (likely §3): The correction network is defined to combine the previous block output with the current token embedding before entering the D-layer block, yet the manuscript provides no explicit equations or ablation isolating whether performance gains derive from the recurrent correction mechanism itself versus the added parameters in the correction FFN. This is load-bearing because the central comparison is to standard transformers of varying depths.
[Evaluation] Evaluation section: All reported comparisons (D=5 vs 12-layer, D=1 vs 6-layer) are stated against 'standard transformers, variants, and ablations,' but the text gives no information on whether baselines are matched for total parameter count, FLOPs, or training steps, nor on data splits or statistical significance. This directly affects whether the accuracy and speedup claims can be attributed to the proposed architecture.

minor comments (2)

[Abstract] Abstract contains a typo: 'tokenenters' should be 'token enters'.
[Evaluation] The pointer-chasing task results are presented without specifying the exact model widths or training hyperparameters used for the D=1 vs standard transformer comparison.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of the Context-Ready Transformer. We address each major comment below and will revise the manuscript accordingly where the points identify gaps in detail or evidence.

Referee: [Abstract / Evaluation] The claim that 'sequential inference matches parallel K=10 to within 0.01 PPL' is load-bearing... no details are provided on sequence lengths, variance across runs, or any measurement of error accumulation over long sequences, leaving open the possibility of distribution shift between the parallel unrolling regime and true left-to-right inference.

Authors: We agree that the inference-matching claim requires stronger supporting details to rule out distribution shift. In the revised manuscript we will report the exact sequence lengths evaluated (up to 2048 tokens on both datasets), include standard deviation across three random seeds, and add a supplementary figure plotting per-position PPL divergence between the K=10 unrolled and sequential regimes as a function of prefix length. These additions will confirm that the 0.01 PPL agreement holds without measurable accumulation on the lengths used for the reported speedups. revision: yes
Referee: [Architecture] The correction network is defined to combine the previous block output with the current token embedding... yet the manuscript provides no explicit equations or ablation isolating whether performance gains derive from the recurrent correction mechanism itself versus the added parameters in the correction FFN.

Authors: We will add explicit equations in Section 3 defining the correction step as h_t = f( h_{t-1}, e_t ) where f is the correction FFN and h_{t-1} is the cached block output. To isolate the recurrent component we will include a new ablation that replaces the recurrent correction with a non-recurrent, position-independent FFN of identical capacity; any remaining gains can then be attributed to recurrence. The zero-initialized conversion experiment already provides supporting evidence that the mechanism is not merely extra capacity, but the requested ablation will make this explicit. revision: yes
Referee: [Evaluation] All reported comparisons... are stated against 'standard transformers, variants, and ablations,' but the text gives no information on whether baselines are matched for total parameter count, FLOPs, or training steps, nor on data splits or statistical significance.

Authors: Baselines were constructed to match total parameter count by scaling the feed-forward and attention dimensions of the standard transformers; training steps, optimizer settings, and data splits were identical. We will add a table in the evaluation section explicitly listing parameter counts, FLOPs per token, and training steps for each pair, together with p-values from three-run statistical tests. This will allow readers to verify that reported gains are not due to mismatched compute or data. revision: yes
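The kind of parameter tally the third response promises can be sketched as follows. The per-layer count assumes a GPT-2-style layer (biased Q, K, V and output projections, a two-matrix feed-forward with hidden size 4 × width, two layer norms) and leaves out embeddings and the final layer norm. The correction FFN shape (input 2 × width, one hidden layer) and the width of 768 are hypothetical, chosen only to show the arithmetic; none of these figures come from the paper.

```python
def transformer_layer_params(d_model, d_ff=None):
    """Parameters in one GPT-2-style transformer layer (assumed layout)."""
    d_ff = 4 * d_model if d_ff is None else d_ff
    attention = 4 * d_model * d_model + 4 * d_model      # Q, K, V, output projections
    feed_forward = 2 * d_model * d_ff + d_ff + d_model   # up and down projections
    layer_norms = 2 * 2 * d_model                        # scale and shift, twice
    return attention + feed_forward + layer_norms

def correction_ffn_params(d_model, d_hidden):
    """Hypothetical correction FFN: [h_prev; e_t] (2*d_model) -> d_hidden -> d_model."""
    first = 2 * d_model * d_hidden + d_hidden
    second = d_hidden * d_model + d_model
    return first + second

d = 768  # hypothetical width, not taken from the paper
baseline_12_layer = 12 * transformer_layer_params(d)
context_ready_d5 = 5 * transformer_layer_params(d) + correction_ffn_params(d, 4 * d)
print(f"12-layer baseline:   {baseline_12_layer:,}")
print(f"D=5 + correction FFN: {context_ready_d5:,}")
```

At equal width the two totals differ, so a tally like this makes visible how much the baseline dimensions would have to be rescaled for the parameter-matched comparison the authors describe.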

Circularity Check

0 steps flagged

No circularity: architecture and results are defined independently of reported metrics

The paper defines the context-ready transformer via an explicit architectural description (D-layer block plus correction network) and a training procedure (unrolling correction K times in parallel). All performance claims (D=5 vs 12-layer, D=1 vs 6-layer, 1.7x/2.6x speedups, 0.01 PPL match) are presented as empirical outcomes of experiments across widths, depths, and datasets. No equations, self-citations, or uniqueness theorems are invoked that would make any claimed result equivalent to its inputs by construction. The training/inference distinction is stated as an assumption whose validity is checked experimentally rather than presupposed.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The architecture rests on the standard transformer block assumptions plus the empirical claim that the correction network can be trained to produce useful pre-contextualized embeddings; no new physical or mathematical axioms are introduced.

free parameters (2)

K (unrolling steps): hyperparameter controlling how many times the correction is unrolled during training; chosen per experiment.
D (block depth): number of transformer layers inside each block; varied across experiments.

axioms (1)

Standard math: standard transformer self-attention and feed-forward layers function as described in prior work. Invoked when the correction output is fed into the D-layer transformer block.

invented entities (1)

Correction network / correction FFN (no independent evidence)
purpose: Combines previous block output with current token embedding to produce a contextualized input for the transformer block.
New component introduced by the paper; zero-initialized for conversion of pretrained models.

pith-pipeline@v0.9.1-grok · 5775 in / 1411 out tokens · 23254 ms · 2026-06-29T01:56:36.418558+00:00 · methodology
