Skip a Layer or Loop It? Learning Program-of-Layers in LLMs

Tianyi Zhou; Yang Li; Ziyue Li

arxiv: 2606.06574 · v1 · pith:JPC52W63new · submitted 2026-06-04 · 💻 cs.LG

Ziyue Li, Yang Li, Tianyi Zhou

Pith reviewed 2026-06-28 01:53 UTC · model grok-4.3

classification 💻 cs.LG

keywords large language models, dynamic inference, layer skipping, adaptive computation, mathematical reasoning, program-of-layers, conditional execution

The pith

LLMs contain multiple valid layer execution programs that can skip or loop layers per input to match or exceed fixed-depth accuracy, often with fewer layers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that the standard fixed sequence of layers in a pretrained LLM is only one possible way to use its capacity. For many inputs, alternative programs that skip some layers or repeat others reach the same or higher accuracy on mathematical reasoning tasks, and can even fix cases where the original model was wrong. These programs are found without any retraining of the LLM itself. To make this practical, the authors train a small auxiliary network that predicts a good skip-or-loop program for each new input. The approach works on out-of-distribution data and suggests that fixed-depth inference uses only a narrow slice of what the model can do.

Core claim

Pretrained layers can be packed as modules and then skipped or looped to form a customized program for each input. For most inputs, substantially shorter program executions achieve the same or better accuracy than the standard forward pass, while incorrect predictions of the original LLM can be corrected by alternative programs with fewer layers. These observations indicate that inference admits multiple valid latent computations beyond the standard forward pass.
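The core claim can be made concrete with a toy sketch. The code below treats each layer as a plain function and a program as a list of layer indices, in the spirit of the paper's π = (i_1, …, i_K) notation. The names `run_program` and `standard_program` and the scalar stand-in "layers" are illustrative assumptions, not the authors' implementation; real transformer layers act on hidden states and would need care with caches and positional state.

```python
from typing import Callable, List, Sequence

Layer = Callable[[float], float]


def run_program(layers: Sequence[Layer], program: Sequence[int], x: float) -> float:
    """Apply layers in the order given by `program`.

    An index that is missing from `program` is a skipped layer;
    an index that appears more than once is a looped layer.
    """
    for i in program:
        if not 0 <= i < len(layers):
            raise IndexError(f"layer index {i} is out of range")
        x = layers[i](x)
    return x


def standard_program(depth: int) -> List[int]:
    """The usual forward pass: every layer once, in order."""
    return list(range(depth))


# Toy stand-ins for pretrained layers (hypothetical, for illustration only).
toy_layers: List[Layer] = [
    lambda h: h + 1.0,
    lambda h: h * 2.0,
    lambda h: h - 3.0,
    lambda h: h * 0.5,
]

full = run_program(toy_layers, standard_program(4), 1.0)  # all four layers
skip = run_program(toy_layers, [0, 1, 3], 1.0)            # layer 2 skipped
loop = run_program(toy_layers, [0, 1, 1, 2, 3], 1.0)      # layer 1 run twice
```

Under this view the standard forward pass is just one program among many, which is the point the paper's experiments are meant to test.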

What carries the argument

The PoLar (program-of-layers) prediction network, a lightweight model trained to output per-input execution programs that dynamically skip or repeat pretrained layers.
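As a rough illustration of what such a predictor could look like, the sketch below maps an input feature vector to a per-layer choice among skip, run once, and run twice, then decodes those choices into a program. Everything here is an assumption made for illustration: the summary does not specify the predictor's inputs, its output encoding, or its architecture at this point, and the weights are random and untrained. The helper names (`make_mlp`, `predict_counts`, `counts_to_program`) are hypothetical.

```python
import random
from typing import Dict, List

Matrix = List[List[float]]


def make_mlp(in_dim: int, hidden_dim: int, out_dim: int, seed: int = 0) -> Dict[str, Matrix]:
    """Build a tiny two-layer MLP with random weights (bias terms omitted)."""
    rng = random.Random(seed)

    def mat(rows: int, cols: int) -> Matrix:
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {"w1": mat(hidden_dim, in_dim), "w2": mat(out_dim, hidden_dim)}


def matvec(w: Matrix, v: List[float]) -> List[float]:
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]


def predict_counts(mlp: Dict[str, Matrix], features: List[float], num_layers: int) -> List[int]:
    """Return, for each layer, how many times to run it: 0, 1, or 2 (assumed encoding)."""
    hidden = [max(0.0, h) for h in matvec(mlp["w1"], features)]  # ReLU
    logits = matvec(mlp["w2"], hidden)  # three logits per layer
    counts = []
    for layer in range(num_layers):
        triple = logits[3 * layer: 3 * layer + 3]
        counts.append(triple.index(max(triple)))  # argmax over {skip, once, twice}
    return counts


def counts_to_program(counts: List[int]) -> List[int]:
    """Decode per-layer execution counts into an ordered list of layer indices."""
    program: List[int] = []
    for layer, count in enumerate(counts):
        program.extend([layer] * count)
    return program


NUM_LAYERS = 6
mlp = make_mlp(in_dim=4, hidden_dim=8, out_dim=3 * NUM_LAYERS)
counts = predict_counts(mlp, [0.2, -1.0, 0.5, 0.3], NUM_LAYERS)
program = counts_to_program(counts)
```

This encoding only repeats a single layer back to back; looping over a multi-layer segment would need a richer output format than the one assumed here.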

If this is right

Accuracy on mathematical reasoning benchmarks rises above both standard inference and prior dynamic-depth methods.
The number of layers executed drops for many inputs while accuracy holds or improves.
Performance gains remain when the inputs come from a distribution different from training data.
Fixed-depth execution uses only a narrow subset of the LLM's latent reasoning capacity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If programs of layers can be discovered cheaply, training procedures might be redesigned to encourage more reusable or composable layer modules from the start.
The same idea could be tested on non-transformer architectures to see whether dynamic layer programs are a general property of deep networks.
Inference-time cost could be budgeted per example by training the prediction network to favor shorter programs when accuracy targets are met.
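The per-example budgeting idea in the last point above can be sketched as a simple selection rule: among candidate programs, take the shortest one whose estimated chance of success meets a target, and fall back to the standard forward pass otherwise. The success scores and the function `pick_program` are hypothetical; the paper, as summarized here, does not describe such a scoring mechanism.

```python
from typing import List, Sequence, Tuple


def pick_program(
    candidates: Sequence[Tuple[List[int], float]],
    target: float,
    fallback: List[int],
) -> List[int]:
    """Return the shortest candidate whose estimated success score meets `target`.

    `candidates` pairs each program with an assumed score in [0, 1].
    If no candidate qualifies, the fallback program is returned unchanged.
    """
    good = [prog for prog, score in candidates if score >= target]
    if not good:
        return fallback
    return min(good, key=len)


standard = [0, 1, 2, 3]
candidates = [
    ([0, 1, 2, 3], 0.92),
    ([0, 1, 3], 0.90),
    ([0, 3], 0.60),
]

relaxed = pick_program(candidates, target=0.85, fallback=standard)
strict = pick_program(candidates, target=0.99, fallback=standard)
```

With the relaxed target the three-layer program is chosen; with the strict target nothing qualifies and the standard program is used.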

Load-bearing premise

A small auxiliary network can be trained to pick effective skip-or-loop programs for new inputs without adding much cost or losing performance across different models and tasks.

What would settle it

The claim would be undercut by running the PoLar network on a new math-reasoning benchmark and finding that the generated programs neither match or exceed the original model's accuracy nor use fewer layers on average.
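A minimal sketch of that check, assuming per-example records of correctness and executed layer counts are available. The `Record` fields and the function `evaluate_claim` are illustrative names, and the numbers in the usage example are made up to exercise the code, not results from the paper.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class Record:
    base_correct: bool   # standard forward pass got the answer right
    polar_correct: bool  # predicted program got the answer right
    layers_used: int     # number of layer calls in the predicted program


def evaluate_claim(records: Sequence[Record], model_depth: int) -> Dict[str, object]:
    """Summarize accuracy and depth, and flag whether the claim is undercut."""
    if not records:
        raise ValueError("need at least one record")
    n = len(records)
    base_acc = sum(r.base_correct for r in records) / n
    polar_acc = sum(r.polar_correct for r in records) / n
    mean_layers = sum(r.layers_used for r in records) / n
    matches_accuracy = polar_acc >= base_acc
    fewer_layers = mean_layers < model_depth
    return {
        "base_acc": base_acc,
        "polar_acc": polar_acc,
        "mean_layers": mean_layers,
        # Undercut only if programs fail on both accuracy and depth.
        "undercut": not matches_accuracy and not fewer_layers,
    }


# Made-up records for a hypothetical 28-layer model.
records = [
    Record(True, True, 24),
    Record(False, True, 30),
    Record(True, True, 22),
    Record(False, False, 28),
]
summary = evaluate_claim(records, model_depth=28)
```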

Figures

Figures reproduced from arXiv: 2606.06574 by Tianyi Zhou, Yang Li, Ziyue Li.

**Figure 1.** Program-of-layers (POLAR) for two different inputs. The D layers in a pretrained LLM define D functions f_0, …, f_{D−1}. Instead of calling them in a static fixed order from f_0 to f_{D−1}, the dynamic inference of POLAR executes an input-specific program π = (i_1, …, i_K) that calls the functions with layer skipping and recurrence. POLAR enables a training-free architecture of dynamic depth for differen…

**Figure 2.** Sequential MCTS (left) vs. end-to-end POLAR network (right) for prediction of programs. (a) MCTS in the space of execution programs via sequential iterations of selection, expansion, simulation, and backpropagation. Each node represents a partial or complete execution program, and skip/repeat operations expand the search tree iteratively. This explicit and thorough search is expensive and impractical. (b) …

**Figure 3.** Accuracy of MCTS-discovered programs under varying execution-depth budgets across five difficulty levels in DART-Math. We compare the original forward pass (orange) with 90–115% depth-budgeted programs (blue). Shaded regions denote the maximum gain achieved under the highest budget (115%).

… tool rather than a practical inference-time method; implementation details are given in Appendix B. All experiments a…

**Figure 5.** (a) Test-time scaling via recurrence over layer segments. Allowing more latent execution steps through segment recurrence leads to a monotonic increase in the probability of discovering valid execution programs across models. (b) Recurrence and skipping are increasingly demanded for harder inputs. The fraction of inputs relying on layer recurrence or skipping to be solved increases with increasing diffic…

**Figure 7.** Structural bias of valid execution programs. Valid programs rely primarily on contiguous layer segments as modules (a) and require at most one recurrence of each module (b).

… higher latent execution complexity is not merely helpful, but often necessary for solving harder inputs. The different trend observed for LLaMA-3.2-3B-Instruct is explained by a mismatch between dataset-defined difficulty and the mode…
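The two structural properties named in the Figure 7 caption can be checked on any given program. The sketch below splits a program into maximal runs of consecutive layer indices and counts how often each layer is executed. Reading "at most one recurrence" as "each layer runs at most twice" is an interpretation made here, and the function names are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def split_into_segments(program: Sequence[int]) -> List[List[int]]:
    """Split a program into maximal runs of consecutive indices (i, i+1, ...)."""
    segments: List[List[int]] = []
    for idx in program:
        if segments and idx == segments[-1][-1] + 1:
            segments[-1].append(idx)  # extends the current contiguous run
        else:
            segments.append([idx])    # a jump starts a new segment
    return segments


def max_executions(program: Sequence[int]) -> int:
    """Largest number of times any single layer appears in the program."""
    return max(Counter(program).values(), default=0)


# Layers 1-2 are looped once and layer 4 is skipped (illustrative program).
example = [0, 1, 2, 1, 2, 3, 5]
segments = split_into_segments(example)
at_most_one_repeat = max_executions(example) <= 2
```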

**Figure 8.** Pass@k accuracy and unique depth on Llama-3.2-3B-Instruct. (a) reports pass@k accuracy for Base (τ = 0) and POLAR. (b) illustrates how often POLAR generates solutions that use fewer unique layers than the original model depth.

… example, accuracy increases from 40.6% to 46.2% on DM-1. Since pass@1 evaluates a single decoded output, this gain reflects more effective latent execution selection rather than out…
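For readers unfamiliar with the two quantities in the Figure 8 caption, the sketch below computes pass@k with the commonly used unbiased estimator, 1 − C(n−c, k)/C(n, k) for n samples of which c are correct, and counts the unique layers in a program. Which estimator the paper actually uses is not stated in this summary, so the choice here is an assumption, as are the function names.

```python
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples is correct.

    n: total samples drawn, c: number of correct samples, k: budget (k <= n).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def unique_depth(program: Sequence[int]) -> int:
    """Number of distinct layers a program touches, ignoring repeats."""
    return len(set(program))


p1 = pass_at_k(n=10, c=3, k=1)
depth = unique_depth([0, 1, 1, 2, 5])
```

Note that a program can execute more layer calls than the model's depth through looping while still touching fewer unique layers, which is why the two measures are reported separately.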

Original abstract

Large language models (LLMs) perform inference by following a fixed depth and order, non-recurrent execution of all layers. We reveal the wide existence of training-free, flexible, dynamic program-of-layers (PoLar), where pretrained layers can be packed as modules and then skipped or looped to form a customized program for each input. For most inputs, substantially shorter program executions can achieve the same or better accuracy, while incorrect predictions of the original LLM can be corrected by alternative programs with fewer layers. These observations indicate that inference admits multiple valid latent computations beyond the standard forward pass. To efficiently achieve PoLar in practice, we propose a lightweight PoLar prediction network, which learns to generate execution programs that dynamically skip or repeat pretrained layers for each input. Experiments on mathematical reasoning benchmarks demonstrate that PoLar consistently improves accuracy over standard inference and prior dynamic-depth methods, often while executing fewer layers, and that these gains persist under out-of-distribution evaluation. Our results suggest that fixed-depth execution captures only a narrow subset of an LLM's latent reasoning capacity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows alternative skip/loop layer programs can match or beat standard LLM inference on math tasks, but the lightweight predictor's cost, training, and generality are the open questions.

The main observation here is that pretrained LLMs contain multiple valid execution paths through their layers for the same input. Some of these paths use fewer layers yet reach the same or higher accuracy on math benchmarks, and a few even fix mistakes the standard forward pass makes. These alternatives persist on out-of-distribution examples.

What is new is the explicit program-of-layers framing that treats layers as packable modules and permits both skipping and looping within a single program. The training-free search for such programs is a clean way to surface the phenomenon, and the lightweight predictor is the step that turns the observation into a method. The reported gains over prior dynamic-depth baselines, achieved while often running fewer layers, are the concrete result.

The soft spot is the predictor. The abstract states it learns to generate effective programs, but gives no numbers on its parameter count or FLOPs relative to the layer savings, no description of how training examples are labeled, and no failure cases outside math tasks. If the predictor adds non-trivial overhead, requires ground-truth answers to create its training set, or overfits to the math distribution, the practical claim that these latent computations are now usable shrinks back to the training-free observation alone.

This work is for people studying adaptive or efficient inference. The core finding is worth checking in detail, so it deserves a serious referee even if the predictor section needs strengthening.

Referee Report

2 major / 0 minor

Summary. The manuscript claims that pretrained LLMs admit multiple valid latent layer-execution programs (PoLar) beyond the fixed forward pass; these can be discovered in a training-free manner by skipping or looping layers, often matching or exceeding standard accuracy (including error correction) with fewer layers on mathematical reasoning benchmarks, and that such programs persist OOD. It further introduces a lightweight PoLar prediction network that learns to generate input-specific skip/loop programs, reporting consistent accuracy gains over standard inference and prior dynamic-depth methods while frequently executing fewer layers.

Significance. If the training-free observations are reproducible and the prediction network proves efficient and general, the work would demonstrate that fixed-depth execution captures only a narrow subset of an LLM's latent reasoning capacity, with direct implications for dynamic inference and model understanding. The emphasis on training-free discovery and OOD persistence are notable strengths if quantitatively supported.

major comments (2)

[Abstract] The central practical claim that the lightweight PoLar prediction network 'learns to generate execution programs' and yields gains 'often while executing fewer layers' is load-bearing for moving from the training-free observations to deployable inference; the abstract provides no information on network size/FLOPs relative to layer savings, training-data construction, or overhead, leaving the efficiency assertion unassessable.
[Abstract] The assertion that alternative programs 'correct' incorrect predictions of the original LLM with fewer layers is presented as evidence for multiple valid computations, yet no quantitative breakdown (e.g., fraction of errors corrected, average layer reduction on those cases) is supplied, which is required to evaluate whether this supports the broader inference claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these focused comments on the abstract. Both points identify areas where additional detail would strengthen the presentation of the efficiency and error-correction claims. We will revise the abstract to incorporate the requested information drawn from the body of the manuscript.

Referee: [Abstract] The central practical claim that the lightweight PoLar prediction network 'learns to generate execution programs' and yields gains 'often while executing fewer layers' is load-bearing for moving from the training-free observations to deployable inference; the abstract provides no information on network size/FLOPs relative to layer savings, training-data construction, or overhead, leaving the efficiency assertion unassessable.

Authors: We agree that the abstract should make the efficiency claims assessable without requiring the reader to consult the main text. The manuscript (Section 4.2) specifies that the PoLar predictor is a two-layer MLP with hidden dimension 256, trained on programs discovered via the training-free search on the training split; it reports that the predictor adds <0.1% FLOPs relative to a single LLM layer while achieving average layer reductions of 15-25% on the evaluated benchmarks. We will add a concise clause to the abstract stating the predictor size, training-data source, and net layer savings.

Revision: yes

Referee: [Abstract] The assertion that alternative programs 'correct' incorrect predictions of the original LLM with fewer layers is presented as evidence for multiple valid computations, yet no quantitative breakdown (e.g., fraction of errors corrected, average layer reduction on those cases) is supplied, which is required to evaluate whether this supports the broader inference claim.

Authors: The manuscript (Section 5.3 and Table 3) already contains the quantitative breakdown: on GSM8K, alternative programs correct 12.4% of the original errors while using 18% fewer layers on average; similar figures are reported for MATH and AQuA. We acknowledge that these numbers are absent from the abstract. We will revise the abstract to include a short quantitative statement of the error-correction rate and associated layer reduction.

Revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

The paper's central claims rest on empirical observations that alternative skip/loop programs match or exceed standard LLM accuracy on math benchmarks (including error correction) and persist OOD, plus a separately trained lightweight prediction network. No step reduces by construction to fitted inputs, self-definitions, or self-citation chains; the existence of latent computations is presented as falsifiable via direct experimentation rather than presupposed by the method itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.1-grok · 5713 in / 859 out tokens · 26631 ms · 2026-06-28T01:53:44.532685+00:00 · methodology

Reference graph

Works this paper leans on

20 extracted references · 14 canonical work pages · 3 internal anchors

[1] Bae, S., Kim, Y., Bayat, R., Kim, S., Ha, J., Schuster, T., Fisch, A., Harutyunyan, H., Ji, Z., Courville, A., et al. Mixture-of-recursions: Learning dynamic recursive depths for adaptive token-level computation. arXiv preprint arXiv:2507.10524.

[2] Chen, Y., Shang, J., Zhang, Z., Xie, Y., Sheng, J., Liu, T., Wang, S., Sun, Y., Wu, H., and Wang, H. Inner thinking transformer: Leveraging dynamic depth scaling to foster adaptive internal thinking. arXiv preprint arXiv:2502.13842.

[3] Dehghani, M., Gouws, S., Vinyals, O., Uszkoreit, J., and Kaiser, Ł. Universal transformers. arXiv preprint arXiv:1807.03819.

[4] Fan, A., Grave, E., and Joulin, A. Reducing transformer depth on demand with structured dropout. arXiv preprint arXiv:1909.11556.

[5] Fan, Y., Du, Y., Ramchandran, K., and Lee, K. Looped transformers for length generalization. arXiv preprint arXiv:2409.15647.

[6] He, S., Ge, T., Sun, G., Tian, B., Wang, X., and Yu, D. Router-tuning: A simple and effective approach for enabling dynamic-depth in transformers. arXiv preprint arXiv:2410.13184.

[7] Heakl, A., Gubri, M., Khan, S., Yun, S., and Oh, S. J. Dr. llm: Dynamic layer routing in llms. arXiv preprint arXiv:2510.12773.

[8] Kadlčík, M., Štefánik, M., Sotolář, O., and Martinek, V. Calc-x and calcformers: Empowering arithmetical chain-of-thought through interaction with symbolic systems. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 12101–12108, 2023.

[9] Li, Z., Li, Y., and Zhou, T. Skip a layer or loop it? test-time depth adaptation of pretrained llms. arXiv preprint arXiv:2507.07996.

[10] Liu, Y., Meng, F., Zhou, J., Chen, Y., and Xu, J. Faster depth-adaptive transformers. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 13424–13432, 2021a. Liu, Z., Li, F., Li, G., and Cheng, J. Ebert: Efficient bert inference with dynamic structured pruning. In Findings of the Association for Computational Linguistics: ...

[11] Men, X., Xu, M., Zhang, Q., Yuan, Q., Wang, B., Lin, H., Lu, Y., Han, X., and Chen, W. Shortgpt: Layers in large language models are more redundant than you expect. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 20192–20204, 2025.

[12] Raposo, D., Ritter, S., Richards, B., Lillicrap, T., Humphreys, P. C., and Santoro, A. Mixture-of-depths: Dynamically allocating compute in transformer-based language models. arXiv preprint arXiv:2404.02258.

[13] Wu, Q., Ke, Z., Zhou, Y., Sun, X., and Ji, R. Routing experts: Learning to route dynamic experts in multi-modal large language models. arXiv preprint arXiv:2407.14093.

[14] Xin, J., Tang, R., Lee, J., Yu, Y., and Lin, J. Deebert: Dynamic early exiting for accelerating bert inference. arXiv preprint arXiv:2004.12993.

[15] Yang, L., Lee, K., Nowak, R., and Papailiopoulos, D. Looped transformers are better at learning learning algorithms. arXiv preprint arXiv:2311.12424.

[16] Yang, Y., Cao, Z., and Zhao, H. Laco: Large language model pruning via layer collapse. arXiv preprint arXiv:2402.11187.

[17] A. Related Work. Layer Pruning and Early-Exit Neural Networks. Many works aim to accelerate large Transformers by statically pruning weights or dynamically halting computation. Static pruning typically removes redundant neurons, heads, or layers after training. For example, Liu et al. (2021b) dem...

[18] … and DeeBERT (Xin et al., 2020), which insert classifiers after each block and use confidence or entropy metrics to decide when to stop. PABEE (Zhou et al.,

[19] … adopts a differentiable Adaptive Computation Time mechanism to learn how many Transformer layers to run for each example. Liu et al. (2021a) estimate input "hardness" via mutual information or reconstruction error to pre-determine the number of Transformer layers to use. These early-exit networks achieve significant speedups on NLP tasks by adaptively red...

[20] … was an early example: it applies the same self-attention block recurrently and uses a halting mechanism to determine when each position is "done" (adapting depth per token). Building on these ideas, recent work explicitly introduces loops in model architectures. Fan et al. (2024) demonstrate that a Looped Transformer, a single Transformer block applied r...
