Generative Recursive Reasoning

Junyeob Baek; Mengye Ren; Mingyu Jo; Minsu Kim; Sungjin Ahn; Yoshua Bengio

arxiv: 2605.19376 · v2 · pith:SVHPMX4Dnew · submitted 2026-05-19 · 💻 cs.AI

Junyeob Baek, Mingyu Jo, Minsu Kim, Mengye Ren, Yoshua Bengio, Sungjin Ahn

Pith reviewed 2026-05-21 07:36 UTC · model grok-4.3

classification 💻 cs.AI

keywords: generative models · recursive reasoning · latent variable models · variational inference · multi-hypothesis reasoning · constraint satisfaction · neural reasoning

The pith

GRAM turns recursive latent reasoning into probabilistic multi-trajectory computation to support multiple hypotheses and unconditional generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that neural reasoning systems should implement extended computation through iterative refinement of latent states using shared transition functions rather than extending sequences token by token. Existing recursive reasoning models remain deterministic and converge to a single prediction along one trajectory. The authors introduce a generative version that treats the latent reasoning process as stochastic, producing multiple trajectories that represent alternative hypotheses or solution strategies. Trained via amortized variational inference, this yields a latent-variable model that performs conditional reasoning given an input and unconditional generation when inputs are absent, with improvements shown on tasks involving structured reasoning and multiple valid solutions.

Core claim

We introduce Generative Recursive reAsoning Models (GRAM), a framework that turns recursive latent reasoning into probabilistic multi-trajectory computation. GRAM models reasoning as a stochastic latent trajectory, enabling multiple hypotheses, alternative solution strategies, and inference-time scaling through both recursive depth and parallel trajectory sampling. This yields a latent-variable generative model supporting conditional reasoning via p_θ(y | x) and, with fixed or absent inputs, unconditional generation via p_θ(x).

What carries the argument

Stochastic latent trajectory that replaces the single deterministic path in recursive reasoning with probabilistic multi-trajectory computation.
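
The Figure 2 caption on this page describes one transition as K low-level refinements via f_L, a deterministic high-level proposal u_t from f_H, and additive stochastic guidance, h_t = u_t + ϵ_t. The toy sketch below illustrates only that control flow. The functions `f_low`, `f_high`, and `noise_params` are hypothetical hand-written stand-ins, not the paper's networks, and the zero initial states are an assumption.

```python
import math
import random


def f_low(l, h, x):
    # Hypothetical stand-in for f_L: pull the low-level state toward h + x.
    return [0.5 * li + 0.5 * (hi + xi) for li, hi, xi in zip(l, h, x)]


def f_high(h, l):
    # Hypothetical stand-in for f_H: deterministic proposal u_t.
    return [math.tanh(hi + li) for hi, li in zip(h, l)]


def noise_params(u):
    # Hypothetical stand-ins for mu_theta(u_t) and sigma_theta(u_t).
    mu = [0.0 for _ in u]
    sigma = [0.1 + 0.1 * abs(ui) for ui in u]
    return mu, sigma


def transition(h, l, x, k, rng):
    # K low-level refinements, then one stochastic high-level update.
    for _ in range(k):
        l = f_low(l, h, x)
    u = f_high(h, l)
    mu, sigma = noise_params(u)
    eps = [rng.gauss(m, s) for m, s in zip(mu, sigma)]
    h_next = [ui + ei for ui, ei in zip(u, eps)]  # h_t = u_t + eps_t
    return h_next, l


def sample_trajectory(x, steps, k, seed):
    # One latent trajectory; a different seed gives a different hypothesis.
    rng = random.Random(seed)
    h = [0.0] * len(x)  # assumed initial state
    l = [0.0] * len(x)  # assumed initial state
    for _ in range(steps):
        h, l = transition(h, l, x, k, rng)
    return h


x = [0.3, -0.7, 1.2]
# Parallel trajectory sampling: same input, five independent noise streams.
finals = [sample_trajectory(x, steps=4, k=3, seed=s) for s in range(5)]
```

Setting every sigma to zero in this sketch would recover a single deterministic path, which is the behaviour the page attributes to existing recursive reasoning models.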

Load-bearing premise

Amortized variational inference can train the stochastic latent trajectories to produce useful, non-collapsed multi-hypothesis reasoning without requiring task-specific architectural changes or post-hoc selection of trajectories.

What would settle it

A result on multi-solution tasks showing that sampling additional trajectories yields no gain in solution diversity or accuracy compared to the deterministic baseline.
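
One way to make this test concrete on N-Queens is to measure how many of the distinct valid solutions a batch of sampled outputs covers. The sketch below is an illustration under assumptions, not the paper's evaluation code: boards are encoded as one column index per row, the given queens are passed as a hypothetical `fixed` mapping from row to column, and the reference solution set is found by brute force.

```python
from itertools import permutations


def is_valid_queens(cols):
    # cols[r] is the column of the queen in row r (one queen per row).
    n = len(cols)
    if sorted(cols) != list(range(n)):
        return False  # a column is reused or out of range
    diag = {r + c for r, c in enumerate(cols)}
    anti = {r - c for r, c in enumerate(cols)}
    return len(diag) == n and len(anti) == n  # no shared diagonal


def solutions(n, fixed):
    # All valid boards consistent with the given queens (brute force).
    return {
        p
        for p in permutations(range(n))
        if all(p[r] == c for r, c in fixed.items()) and is_valid_queens(p)
    }


def coverage(samples, n, fixed):
    # Fraction of distinct valid solutions that appear among the samples.
    target = solutions(n, fixed)
    if not target:
        return 0.0
    found = {tuple(s) for s in samples} & target
    return len(found) / len(target)
```

If coverage stays flat as more trajectories are sampled, the extra samples are adding no new valid solutions, which is the outcome this section says would count against the method.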

Figures

Figures reproduced from arXiv: 2605.19376 by Junyeob Baek, Mengye Ren, Mingyu Jo, Minsu Kim, Sungjin Ahn, Yoshua Bengio.

**Figure 2.** GRAM Architecture. A single stochastic latent transition in the hierarchical instantiation z = (h, l). After K low-level refinements via f_L, the high-level update f_H produces a deterministic proposal u_t, to which stochastic guidance ϵ_t is added: h_t = u_t + ϵ_t. Overview. GRAM models the conditional distribution p_θ(y | x) by marginalizing over stochastic latent reasoning trajectories. Given an input x, GRAM…

**Figure 3.** Performance on puzzle benchmarks. On both Sudoku-Extreme and ARC-AGI, GRAM consistently outperforms all deterministic recursive baselines (Looped TF, HRM, TRM), demonstrating that stochastic latent transitions yield substantial gains within the recursive-reasoning paradigm. Looped TF results on ARC-AGI are omitted due to prohibitive training cost (see Section C.1.1). Note that large reasoning model scores a…

**Figure 4.** (Left) Inference-time scaling on Sudoku-Extreme.

**Figure 5.** Unconditional Sudoku generation. Validity (%) of generated Sudoku puzzles. GRAM achieves higher validity than D3PM with substantially fewer parameters and steps. Setup. To investigate GRAM’s unconditional generative capability beyond conditional reasoning, we evaluate generation in two domains: structured constraint generation on Sudoku (from empty boards, evaluated by the fraction of generated boards …

**Figure 6.** Visualization of the generation process and samples.

**Figure 7.** Qualitative examples of unconditional Sudoku generation by GRAM.

**Figure 8.** Full ELBO L_ELBO and surrogate objective L_GRAM throughout training (plotted as −ELBO, smaller is better). On both Sudoku-Extreme (left) and N-Queens 8 × 8 (right), both quantities decrease monotonically over training, indicating that gradient updates of L_GRAM consistently improve the full variational bound. The two curves do not coincide because L_ELBO sums KL contributions across all T_Total transitions whil…

**Figure 9.** Example of an 8 × 8 N-Queens puzzle instance. In this example, 5 queens are removed from the full board, leaving 3 queens. The model must find the positions of the remaining queens. This configuration admits exactly 3 valid solutions. Data Generation Details. The N-Queens problem requires placing N queens on an N × N chessboard such that no two queens attack each other, meaning no queens share the same row,…

**Figure 10.** Distribution of the number of valid solutions for generated N-Queens instances.

**Figure 11.** Graph Coloring Example. [Plot residue from the extraction: panels "Vertex 8 Graph Coloring" and "Vertex 10 Graph Coloring"; x-axis: number of solutions; y-axis: counts.]

**Figure 12.** Distribution of the number of valid solutions for generated graph coloring instances.

**Figure 13.** Effect of sampling on ARC-AGI-1 without data augmentation.

**Figure 14.** Effect of augmentation on sampling efficiency.

**Figure 15.** Solution coverage analysis on N-Queens (…

**Figure 16.** Additional generated samples from GRAM. We provide 8 additional samples generated unconditionally on binarized MNIST using GRAM. Each row represents a single generated sample, visualized across its recursive refinement process.

**Figure 17.** Unconditional Sudoku generation setup. Starting from an empty board, the task is to generate complete boards, and validity is determined by whether the generated board satisfies all Sudoku constraints.

**Figure 18.** Latent reasoning trajectory of TRM. The red dot indicates the initial state h_0 and the green dot indicates the final state h_T. Background color represents the loss landscape: bright yellow corresponds to high loss regions, while dark blue indicates low loss (optimal) regions. TRM follows a single deterministic path with no ability to escape suboptimal trajectories.

**Figure 19.** Latent reasoning trajectories of GRAM (50 samples).
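
The Sudoku validity metric mentioned in the Figure 5 and Figure 17 captions can be sketched as follows. This is a minimal checker under assumptions, not the paper's evaluation code: a board is a 9 × 9 list of integers, and "valid" means every row, column, and 3 × 3 box contains each digit 1 to 9 exactly once. The helper `pattern_board` is a hypothetical fixture used only to produce one known-valid board.

```python
def is_valid_sudoku(board):
    # board: 9 rows of 9 integers; valid if every unit holds digits 1..9 once.
    if len(board) != 9 or any(len(row) != 9 for row in board):
        return False
    digits = set(range(1, 10))
    rows = [set(row) for row in board]
    cols = [set(col) for col in zip(*board)]
    boxes = [
        {board[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)}
        for br in (0, 3, 6)
        for bc in (0, 3, 6)
    ]
    return all(unit == digits for unit in rows + cols + boxes)


def validity_rate(boards):
    # Percentage of generated boards that satisfy all constraints.
    if not boards:
        return 0.0
    return 100.0 * sum(is_valid_sudoku(b) for b in boards) / len(boards)


def pattern_board():
    # Hypothetical fixture: a standard shifted-row construction of a valid board.
    return [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]


good = pattern_board()
bad = [row[:] for row in good]
bad[0][0], bad[0][1] = bad[0][1], bad[0][0]  # row stays valid, columns break
rate = validity_rate([good, bad])
```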

Original abstract

How should future neural reasoning systems implement extended computation? Recursive Reasoning Models (RRMs) offer a promising alternative to autoregressive sequence extension by performing iterative latent-state refinement with shared transition functions. Yet existing RRMs are largely deterministic, following a single latent trajectory and converging to a single prediction. We introduce Generative Recursive reAsoning Models (GRAM), a framework that turns recursive latent reasoning into probabilistic multi-trajectory computation. GRAM models reasoning as a stochastic latent trajectory, enabling multiple hypotheses, alternative solution strategies, and inference-time scaling through both recursive depth and parallel trajectory sampling. This yields a latent-variable generative model supporting conditional reasoning via $p_\theta(y \mid x)$ and, with fixed or absent inputs, unconditional generation via $p_\theta(x)$. Trained with amortized variational inference, GRAM improves over deterministic recurrent and recursive baselines on structured reasoning and multi-solution constraint satisfaction tasks, while demonstrating an unconditional generation capability. https://ahn-ml.github.io/gram-website
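
The marginalization that the abstract alludes to can be written out as a hedged sketch. The transition form $p_\theta(z_t \mid z_{t-1}, e_x)$ is the one quoted elsewhere on this page, with $e_x$ presumably an embedding of the input $x$. The remaining choices are assumptions made for illustration: a fixed initial state $z_0$, a decoder that reads only the final state $z_T$, and $S$ independently sampled trajectories.

```latex
p_\theta(y \mid x)
  = \int p_\theta(y \mid z_T) \prod_{t=1}^{T} p_\theta(z_t \mid z_{t-1}, e_x) \, dz_{1:T}
  \approx \frac{1}{S} \sum_{s=1}^{S} p_\theta\big(y \mid z_T^{(s)}\big),
  \qquad z_t^{(s)} \sim p_\theta\big(z_t \mid z_{t-1}^{(s)}, e_x\big).
```

Under these assumptions the two inference-time scaling axes named in the abstract correspond to increasing $T$ (recursive depth) and increasing $S$ (parallel trajectory sampling).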

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

GRAM adds stochastic multi-trajectory sampling to recursive reasoning models, but amortized VI risks collapse that could make the probabilistic part do little real work.

The main point is that this paper takes existing recursive reasoning models, which refine a latent state through shared transitions, and adds stochasticity so the system can sample multiple distinct trajectories instead of converging on one path. That change turns the setup into a latent-variable generative model supporting both conditional reasoning and unconditional generation when inputs are absent or fixed. The authors train it with amortized variational inference and claim gains over deterministic recurrent and recursive baselines on structured reasoning and multi-solution constraint tasks, plus the ability to scale inference by sampling more paths in parallel or recursing deeper.

The framing is straightforward, and the motivation for latent-space computation over token extension makes sense for tasks that benefit from exploring alternatives. What stands out positively is the clean extension: it keeps the shared transition structure while enabling branching at inference time without obvious task-specific additions. If the trajectories stay meaningfully different, this could offer a practical route to multi-hypothesis reasoning.

The soft spot is the training procedure. Amortized variational inference on models with shared recursive transitions has a well-known tendency toward posterior collapse, where the variational posterior learns to ignore the stochasticity and the model behaves like its deterministic counterpart. The abstract gives no numbers, ablations, or evidence that trajectories actually differ in useful ways, so it is unclear whether the reported improvements come from the multi-trajectory mechanism or simply from extra capacity and training changes. The collapse concern could apply here unless the full experiments include controls that rule it out.

This paper is aimed at researchers working on neural reasoning architectures that need to handle uncertainty or generate diverse solutions, such as in planning or constraint satisfaction. Readers interested in scaling test-time compute inside latent states rather than through longer sequences would find the framework worth examining. It deserves a serious referee because the core proposal is coherent and the extension is well motivated, even though the results section will need close scrutiny on the collapse question and on quantitative details.
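
A common way to probe the collapse concern raised in this letter is to inspect the per-dimension KL divergence between the approximate posterior and the prior over each stochastic step: dimensions whose KL sits near zero carry no information. The sketch below uses the closed-form KL between diagonal Gaussians. It is a generic diagnostic, not something the paper is known to report, and the `threshold` value is an arbitrary assumption.

```python
import math


def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    # Per-dimension KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ).
    return [
        math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5
        for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p)
    ]


def active_fraction(kl_per_dim, threshold=0.01):
    # Share of latent dimensions whose KL exceeds an (assumed) threshold.
    return sum(k > threshold for k in kl_per_dim) / len(kl_per_dim)


# Dimension 0: posterior equals prior (collapsed). Dimension 1: mean shifted by 1.
kl = kl_diag_gaussians([0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
frac = active_fraction(kl)
```

An active fraction near zero across training would be consistent with the stochastic component being ignored. A high value does not by itself show that the sampled trajectories are useful.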

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Generative Recursive reAsoning Models (GRAM) that extend deterministic Recursive Reasoning Models by modeling iterative latent-state refinement as stochastic multi-trajectory computation. This produces a latent-variable generative model supporting conditional reasoning via p_θ(y | x) and unconditional generation via p_θ(x). The approach is trained with amortized variational inference and is claimed to improve over deterministic recurrent and recursive baselines on structured reasoning and multi-solution constraint satisfaction tasks while enabling inference-time scaling via recursive depth and parallel trajectory sampling.

Significance. If the multi-trajectory mechanism produces genuinely distinct and useful hypotheses rather than collapsing, and if the reported gains are robustly quantified, the work would offer a principled extension of recursive models to probabilistic settings. This could support more flexible handling of uncertainty and alternative solutions in neural reasoning without requiring task-specific architectural modifications.

major comments (2)

[§3 (Training and Inference)] The central claim that amortized variational inference trains the stochastic latent trajectories to yield non-collapsed, useful multi-hypothesis reasoning (without task-specific changes or post-hoc selection) is load-bearing for the distinction from deterministic RRMs. No analysis of trajectory diversity, posterior utilization of stochasticity, or KL-term behavior is referenced to rule out collapse to a single effective path.
[Abstract and §4 (Experiments)] The abstract and results summary assert improvements over baselines on structured reasoning and multi-solution tasks, yet supply no quantitative metrics, error bars, dataset sizes, ablation controls, or statistical significance tests. This prevents verification that gains arise from the probabilistic multi-trajectory component rather than capacity or training differences.

minor comments (2)

[Abstract] Notation for the generative model p_θ(y | x) and p_θ(x) is introduced clearly in the abstract but should be cross-referenced to the precise definitions of the latent trajectory variables and transition functions in the methods section for consistency.
[Abstract] The website link is provided but the manuscript should include a brief summary of any additional experimental details or visualizations hosted there to ensure self-contained evaluation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript introducing Generative Recursive reAsoning Models (GRAM). We address each major comment below, providing clarifications and committing to specific revisions that strengthen the presentation of our training procedure and experimental results without altering the core claims.

Referee: [§3 (Training and Inference)] The central claim that amortized variational inference trains the stochastic latent trajectories to yield non-collapsed, useful multi-hypothesis reasoning (without task-specific changes or post-hoc selection) is load-bearing for the distinction from deterministic RRMs. No analysis of trajectory diversity, posterior utilization of stochasticity, or KL-term behavior is referenced to rule out collapse to a single effective path.

Authors: We agree that the manuscript would benefit from explicit supporting analysis. In the revised version we will add to §3: (i) visualizations of distinct latent trajectories on example inputs, (ii) quantitative diversity metrics such as average pairwise Euclidean distance between final latent states across sampled trajectories, and (iii) plots of the KL divergence term over training epochs together with posterior utilization statistics (e.g., entropy of the approximate posterior). These additions will directly demonstrate that the stochasticity is actively used rather than collapsed.
Revision: yes
Referee: [Abstract and §4 (Experiments)] The abstract and results summary assert improvements over baselines on structured reasoning and multi-solution tasks, yet supply no quantitative metrics, error bars, dataset sizes, ablation controls, or statistical significance tests. This prevents verification that gains arise from the probabilistic multi-trajectory component rather than capacity or training differences.

Authors: The current manuscript contains tables reporting accuracy and success rates on the evaluated tasks, but we accept that error bars, explicit dataset cardinalities, ablation controls isolating the stochastic component, and significance testing are not present. We will revise §4 to include standard deviations over five random seeds, precise dataset sizes and splits, an ablation removing the stochastic latent transitions, and paired t-test p-values for all baseline comparisons. The abstract will be updated to reference these quantitative details.
Revision: yes
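
Two of the quantities promised in these responses are simple to state precisely: the average pairwise Euclidean distance between final latent states, and the paired t statistic over per-seed scores. The sketch below shows both with the standard library only. The function names and the example numbers are illustrative assumptions, not results from the paper.

```python
import math
from itertools import combinations
from statistics import mean, stdev


def mean_pairwise_distance(states):
    # Average Euclidean distance over all unordered pairs of final states.
    # Needs at least two states; identical states give 0.0 (full collapse).
    return mean(math.dist(a, b) for a, b in combinations(states, 2))


def paired_t_statistic(a, b):
    # t statistic for paired samples, e.g. per-seed scores of two models.
    # Undefined when all differences are equal (zero standard deviation).
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


# Illustrative values only.
spread = mean_pairwise_distance([(0.0, 0.0), (3.0, 4.0), (0.0, 0.0)])
t_stat = paired_t_statistic([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
```

Turning the t statistic into a p-value requires the Student t distribution with n − 1 degrees of freedom, which the standard library does not provide; `scipy.stats.ttest_rel` is one existing option.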

Circularity Check

0 steps flagged

No circularity: GRAM derivation introduces independent stochastic trajectories and VI training.

The paper's core contribution is the definition of GRAM as a latent-variable model that converts deterministic recursive reasoning into stochastic multi-trajectory sampling, trained end-to-end via amortized variational inference to support both conditional p_θ(y|x) and unconditional p_θ(x) generation. This architecture and training procedure are presented as novel extensions beyond existing deterministic RRMs, with performance gains demonstrated empirically on structured reasoning and constraint satisfaction tasks. No load-bearing step reduces a claimed prediction or uniqueness result to a self-citation, fitted parameter renamed as output, or ansatz smuggled from prior author work; the derivation chain remains self-contained against external benchmarks and does not equate any output quantity to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Review is limited to the abstract; the ledger therefore captures only the high-level modeling assumptions stated or implied there. No specific numerical free parameters are mentioned.

axioms (1)

domain assumption: Amortized variational inference can learn a useful posterior over stochastic latent trajectories for reasoning tasks.
The training procedure described in the abstract relies on this standard assumption from variational generative modeling.

invented entities (1)

GRAM (Generative Recursive reAsoning Model): no independent evidence
Purpose: to implement probabilistic multi-trajectory recursive reasoning in latent space.
New model class introduced by the paper; no independent evidence outside the abstract is provided.

pith-pipeline@v0.9.0 · 5703 in / 1349 out tokens · 41161 ms · 2026-05-21T07:36:58.423389+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: GRAM models reasoning as a stochastic latent trajectory... z_t ~ p_θ(z_t | z_{t-1}, e_x) ... ϵ_t ~ N(μ_θ(u_t), σ²_θ(u_t)I) ... hierarchical instantiation z=(h,l)

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat induction and recovery · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: Trained with amortized variational inference... ELBO... deep supervision over N_sup consecutive supervision steps

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 23 internal anchors

[1]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

work page 2022
[2]

Tree of thoughts: Deliberate problem solving with large language models.Ad- vances in neural information processing systems, 36:11809–11822, 2023

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models.Ad- vances in neural information processing systems, 36:11809–11822, 2023

work page 2023
[3]

Graph of thoughts: Solving elaborate problems with large language models

Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph of thoughts: Solving elaborate problems with large language models. InProceedings of the AAAI conference on artificial intelligence, volume 38, pages 17682–17690, 2024

work page 2024
[4]

Training Large Language Models to Reason in a Continuous Latent Space

Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space.arXiv preprint arXiv:2412.06769, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

Reasoning by superposition: A theoretical perspective on chain of continuous thought.CoRR, abs/2505.12514, 2025

Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, and Yuandong Tian. Reasoning by superposition: A theoretical perspective on chain of continuous thought.arXiv preprint arXiv:2505.12514, 2025

work page arXiv 2025
[6]

Continuous chain of thought enables parallel exploration and reasoning.arXiv preprint arXiv:2505.23648, 2025

Halil Alperen Gozeten, M Emrullah Ildiz, Xuechen Zhang, Hrayr Harutyunyan, Ankit Singh Rawat, and Samet Oymak. Continuous chain of thought enables parallel exploration and reasoning.arXiv preprint arXiv:2505.23648, 2025

work page arXiv 2025
[7]

Looped transformers are better at learning learning algorithms.arXiv preprint arXiv:2311.12424,

Liu Yang, Kangwook Lee, Robert Nowak, and Dimitris Papailiopoulos. Looped transformers are better at learning learning algorithms.arXiv preprint arXiv:2311.12424, 2023

work page arXiv 2023
[8]

Hierarchical Reasoning Model

Guan Wang, Jin Li, Yuhao Sun, Xing Chen, Changling Liu, Yue Wu, Meng Lu, Sen Song, and Yasin Abbasi Yadkori. Hierarchical reasoning model.arXiv preprint arXiv:2506.21734, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Less is More: Recursive Reasoning with Tiny Networks

Alexia Jolicoeur-Martineau. Less is more: Recursive reasoning with tiny networks.arXiv preprint arXiv:2510.04871, 2025. 11

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Universal Transformers

Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Uni- versal transformers.arXiv preprint arXiv:1807.03819, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[11]

Thinking, fast and slow.Farrar, Straus and Giroux, 2011

Daniel Kahneman. Thinking, fast and slow.Farrar, Straus and Giroux, 2011

work page 2011
[12]

arXiv:1709.08568 [cs.LG].https: //arxiv.org/abs/1709.08568

Yoshua Bengio. The consciousness prior.arXiv preprint arXiv:1709.08568, 2017

work page arXiv 2017
[13]

On the Measure of Intelligence

François Chollet. On the measure of intelligence.arXiv preprint arXiv:1911.01547, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1911
[14]

ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems

Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard. Arc- agi-2: A new challenge for frontier ai reasoning systems.arXiv preprint arXiv:2505.11831, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

Gradient-based learning applied to document recognition,

Y . Lecun, L. Bottou, Y . Bengio, and P. Haffner. Gradient-based learning applied to document recognition.Proceedings of the IEEE, 86(11):2278–2324, 1998. doi: 10.1109/5.726791

work page doi:10.1109/5.726791 1998
[16]

An efficient gradient-based algorithm for on-line training of recurrent network trajectories.Neural computation, 2(4):490–501, 1990

Ronald J Williams and Jing Peng. An efficient gradient-based algorithm for on-line training of recurrent network trajectories.Neural computation, 2(4):490–501, 1990

work page 1990
[17]

Unbiasing Truncated Backpropagation Through Time

Corentin Tallec and Yann Ollivier. Unbiasing truncated backpropagation through time.arXiv preprint arXiv:1705.08209, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[18]

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R Bartold- son, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up test-time compute with latent reasoning: A recurrent depth approach.arXiv preprint arXiv:2502.05171, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

Text generation beyond discrete token sampling.arXiv preprint arXiv:2505.14827, 2025

Yufan Zhuang, Liyuan Liu, Chandan Singh, Jingbo Shang, and Jianfeng Gao. Text generation beyond discrete token sampling.arXiv preprint arXiv:2505.14827, 2025

work page arXiv 2025
[20]

Soft thinking: Unlocking the reasoning potential of llms in continuous concept space.arXiv preprint arXiv:2505.15778, 2025

Zhen Zhang, Xuehai He, Weixiang Yan, Ao Shen, Chenyang Zhao, Shuohang Wang, Yelong Shen, and Xin Eric Wang. Soft thinking: Unlocking the reasoning potential of llms in continuous concept space.arXiv preprint arXiv:2505.15778, 2025

work page arXiv 2025
[21]

Soft tokens, hard truths.arXiv preprint arXiv:2509.19170, 2025

Natasha Butt, Ariel Kwiatkowski, Ismail Labiad, Julia Kempe, and Yann Ollivier. Soft tokens, hard truths.arXiv preprint arXiv:2509.19170, 2025

work page arXiv 2025
[22]

CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation

Zhenyi Shen, Hanqi Yan, Linhai Zhang, Zhanghao Hu, Yali Du, and Yulan He. Codi: Compressing chain-of-thought into continuous space via self-distillation.arXiv preprint arXiv:2502.21074, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[23]

Hybrid latent reasoning via reinforcement learning.arXiv preprint arXiv:2505.18454, 2025

Zhenrui Yue, Bowen Jin, Huimin Zeng, Honglei Zhuang, Zhen Qin, Jinsung Yoon, Lanyu Shang, Jiawei Han, and Dong Wang. Hybrid latent reasoning via reinforcement learning.arXiv preprint arXiv:2505.18454, 2025

work page arXiv 2025
[24]

Mixture-of-recursions: Learning dynamic recur- sive depths for adaptive token-level computation.arXiv preprint arXiv:2507.10524,

Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, et al. Mixture-of-recursions: Learning dynamic recursive depths for adaptive token-level computation.arXiv preprint arXiv:2507.10524, 2025

work page arXiv 2025
[25]

Cotformer: A chain-of- thought driven architecture with budget-adaptive computation cost at inference.arXiv preprint arXiv:2310.10845, 2023

Amirkeivan Mohtashami, Matteo Pagliardini, and Martin Jaggi. Cotformer: A chain-of- thought driven architecture with budget-adaptive computation cost at inference.arXiv preprint arXiv:2310.10845, 2023

work page arXiv 2023
[26]

Relaxed recursive transformers: Effective parameter sharing with layer-wise lora.arXiv preprint arXiv:2410.20672, 2024

Sangmin Bae, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Seungyeon Kim, and Tal Schuster. Relaxed recursive transformers: Effective parameter sharing with layer-wise lora.arXiv preprint arXiv:2410.20672, 2024

work page arXiv 2024
[27]

Finding structure in time.Cognitive science, 14(2):179–211, 1990

Jeffrey L Elman. Finding structure in time.Cognitive science, 14(2):179–211, 1990

work page 1990
[28]

Long short-term memory.Neural computation, 9(8): 1735–1780, 1997

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory.Neural computation, 9(8): 1735–1780, 1997

work page 1997
[29]

Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder- decoder for statistical machine translation.arXiv preprint arXiv:1406.1078, 2014. 12

work page internal anchor Pith review Pith/arXiv arXiv 2014
[30]

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite bert for self-supervised learning of language representations.arXiv preprint arXiv:1909.11942, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1909
[31]

Depth-adaptive transformer.arXiv preprint arXiv:1910.10073, 2019

Maha Elbayad, Jiatao Gu, Edouard Grave, and Michael Auli. Depth-adaptive transformer.arXiv preprint arXiv:1910.10073, 2019

work page arXiv 1910
[32]

Adaptive Computation Time for Recurrent Neural Networks

Alex Graves. Adaptive computation time for recurrent neural networks.arXiv preprint arXiv:1603.08983, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[33]

A Recurrent Latent Variable Model for Sequential Data

Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data, 2016. URL https://arxiv. org/abs/1506.02216

work page internal anchor Pith review Pith/arXiv arXiv 2016
[34]

Sequential Neural Models with Stochastic Layers

Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. Sequential neural models with stochastic layers, 2016. URLhttps://arxiv.org/abs/1605.07571

work page internal anchor Pith review Pith/arXiv arXiv 2016
[35]

Deep Kalman Filters

Rahul G. Krishnan, Uri Shalit, and David Sontag. Deep kalman filters, 2015. URL https: //arxiv.org/abs/1511.05121

work page internal anchor Pith review Pith/arXiv arXiv 2015
[36]

Learning Latent Dynamics for Planning from Pixels

Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels, 2019. URL https: //arxiv.org/abs/1811.04551

work page internal anchor Pith review Pith/arXiv arXiv 2019
[37]

Mastering Atari with Discrete World Models

Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models.arXiv preprint arXiv:2010.02193, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010
[38]

Mastering Diverse Domains through World Models

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models.arXiv preprint arXiv:2301.04104, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[39]

Attention is all you need.Advances in neural information processing systems, 30, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

[40] ARC-Prize-Foundation. ARC-AGI benchmarking: Leaderboard and dataset for the ARC-AGI benchmark. https://arcprize.org/leaderboard, 2026. Accessed: 2026-1-22

[41] Subham Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models. Advances in Neural Information Processing Systems, 37:130136–130184, 2024

[42] Shane Barratt and Rishi Sharma. A note on the inception score. arXiv preprint arXiv:1801.01973, 2018

[43] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017

[44] Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34:17981–17993, 2021

[45] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013

[46] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

[47] Noam Shazeer. GLU variants improve transformer. arXiv preprint arXiv:2002.05202, 2020

[48] Simo Ryu. Minimal implementation of a D3PM (structured denoising diffusion models in discrete state-spaces), in PyTorch. https://github.com/cloneofsimo/d3pm, 2024

[49] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023

[50] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998

[51] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 2012

[52] Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 107:3–11, 2018

[53] Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2018

[54] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

[55] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018

[56] Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. Advances in Neural Information Processing Systems, 33:12438–12448, 2020

[57] P. Erdős and A. Rényi. On the strength of connectedness of a random graph. Acta Mathematica Academiae Scientiarum Hungarica, 12(1):261–267, Mar 1964. ISSN 1588-2632. doi: 10.1007/BF02066689. URL https://doi.org/10.1007/BF02066689

[58] Henrique Lemos, Marcelo Prates, Pedro Avelar, and Luis Lamb. Graph colouring meets deep learning: Effective graph neural network models for combinatorial problems. In 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI), pages 879–885. IEEE, 2019

[59] Alexey L. Pomerantsev. Principal component analysis (PCA). Encyclopedia of Autism Spectrum Disorders, 2014. URL https://api.semanticscholar.org/CorpusID:2534141

[60] Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Commun. ACM, 18:509–517, 1975. URL https://api.semanticscholar.org/CorpusID:13091446

A Additional Method Details

A.1 Adaptive Computation Time

GRAM optionally adopts adaptive computation time (ACT) [8–10] at inference, allowing each trajectory to terminate at a ...

Norm.: Input Tokens (B, C, H, W); Linear Scaling [−1, 1] (B, C, H, W)
Conv: Conv2d 5×5 (p = 2) (B, D/2, H, W); SiLU → GN(32); Conv2d 5×5 (p = 2) (B, D/2, H, W); SiLU → GN(32)
Patch: Flatten Patches (B, Np, P² · D/2); Linear Projection (B, Np, D)

Hyperparameters. Following Wang et al. [8] and Jolicoeur-Martineau [9], both the input and output are represented as sequences of shape [B, L], where B denotes the batch size and L the context length. Each input sequence includes 16 fixed puzzle embedding tokens. The latent states h_t and l...
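The shape annotations in the Norm., Conv, and Patch rows above can be checked with a little arithmetic. The following is a minimal sketch in plain Python, not the authors' implementation. It assumes stride-1 convolutions, non-overlapping P×P patches, and an input range [lo, hi] for the linear scaling, none of which is stated in the surviving text. The function names and the example sizes are hypothetical.

```python
def scale_to_unit_range(value, lo, hi):
    """Linearly map a value from [lo, hi] to [-1, 1] (the input range is an assumption)."""
    return 2.0 * (value - lo) / (hi - lo) - 1.0


def stem_shapes(B, C, H, W, D, P):
    """Trace tensor shapes through the normalise -> conv -> patchify stages.

    Assumes stride-1 convolutions and non-overlapping P x P patches.
    """
    if H % P or W % P:
        raise ValueError("H and W must be divisible by the patch size P")
    if D % 2:
        raise ValueError("D must be even so that D/2 is an integer")
    n_patches = (H // P) * (W // P)  # Np in the table
    return [
        ("input tokens", (B, C, H, W)),
        ("linear scaling", (B, C, H, W)),
        # A 5x5 kernel with padding 2 and stride 1 keeps H and W:
        # H + 2*2 - 5 + 1 == H.
        ("conv 1", (B, D // 2, H, W)),
        ("conv 2", (B, D // 2, H, W)),
        # Each patch holds P*P positions with D/2 channels each.
        ("flatten patches", (B, n_patches, P * P * (D // 2))),
        ("linear projection", (B, n_patches, D)),
    ]


if __name__ == "__main__":
    # Hypothetical sizes chosen only for illustration.
    for name, shape in stem_shapes(B=2, C=1, H=28, W=28, D=64, P=4):
        print(f"{name:18s} {shape}")
```

SiLU and GroupNorm are omitted from the trace because they act elementwise or per group and leave the shape unchanged.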

