
arxiv: 2605.19376 · v1 · pith:SVHPMX4D · submitted 2026-05-19 · 💻 cs.AI

Generative Recursive Reasoning

Pith reviewed 2026-05-20 05:53 UTC · model grok-4.3

classification 💻 cs.AI
keywords generative recursive reasoning · stochastic latent trajectories · variational inference · multi-hypothesis reasoning · unconditional generation · neural reasoning · constraint satisfaction

The pith

GRAM turns recursive reasoning probabilistic so models can follow many trajectories instead of one.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Generative Recursive reAsoning Models to replace the single fixed path of existing recursive reasoning systems with a probabilistic process. Reasoning is represented as a sequence of random latent-state updates that share the same transition rule, so the model can draw multiple distinct trajectories for the same input. This change produces a generative model capable of both conditional inference and unconditional generation when no input is supplied. A reader would care because it adds the ability to explore alternatives and scale computation at test time without changing the core recursive structure. The training uses amortized variational inference to learn these stochastic paths from data.

Core claim

GRAM models reasoning as a stochastic latent trajectory inside an iterative refinement loop that reuses the same transition function, producing a latent-variable generative model that supports p(y|x) for conditional reasoning and p(x) for unconditional generation when inputs are absent or fixed, and is trained by amortized variational inference to improve over deterministic recursive baselines on structured tasks.
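One way to write the claim down, as a hedged sketch: the generic form below is the standard factorization for a sequential latent-variable model with a shared transition, not necessarily the paper's exact objective. The symbols $z_{1:T}$ (latent trajectory), $q_\phi$ (amortized posterior), and the fixed initial state $z_0$ are assumptions introduced here for illustration.

```latex
% Conditional likelihood: marginalize over latent trajectories z_{1:T}
p_\theta(y \mid x)
  = \int p_\theta(y \mid z_T, x)
    \prod_{t=1}^{T} p_\theta(z_t \mid z_{t-1}, x)\, dz_{1:T}

% Evidence lower bound with a step-wise amortized posterior q_\phi
\log p_\theta(y \mid x)
  \ge \mathbb{E}_{q_\phi(z_{1:T} \mid x, y)}
      \big[ \log p_\theta(y \mid z_T, x) \big]
  - \sum_{t=1}^{T} \mathbb{E}_{q_\phi}
      \Big[ \mathrm{KL}\big( q_\phi(z_t \mid z_{t-1}, x, y)
            \,\|\, p_\theta(z_t \mid z_{t-1}, x) \big) \Big]
```

The same transition $p_\theta(z_t \mid z_{t-1}, x)$ appears at every step, which is what "reuses the same transition function" means in this form; the KL sum has one term per transition.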

What carries the argument

Stochastic latent trajectory formed by repeated application of a shared probabilistic transition function in latent space, which permits parallel sampling of distinct reasoning paths.
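A minimal sketch of that mechanism, assuming a toy transition: the update rule inside `transition` (a pull toward `tanh(x)` plus Gaussian noise) is hypothetical and stands in for the learned network; only the structure (one shared stochastic rule, applied repeatedly, sampled several times) mirrors the description above.

```python
import math
import random


def transition(z, x, rng, noise_scale=0.3):
    """One shared stochastic update: deterministic proposal plus Gaussian noise."""
    # Hypothetical proposal: move each latent coordinate halfway toward tanh(x_i).
    proposal = [0.5 * zi + 0.5 * math.tanh(xi) for zi, xi in zip(z, x)]
    # Additive noise makes repeated runs follow different paths.
    return [u + rng.gauss(0.0, noise_scale) for u in proposal]


def sample_trajectory(x, steps, rng, noise_scale=0.3):
    """Apply the same transition `steps` times, starting from a zero latent."""
    z = [0.0] * len(x)
    path = [z]
    for _ in range(steps):
        z = transition(z, x, rng, noise_scale)
        path.append(z)
    return path


def sample_trajectories(x, steps, num_samples, seed=0, noise_scale=0.3):
    """Draw several independent trajectories for the same input x."""
    rng = random.Random(seed)
    return [sample_trajectory(x, steps, rng, noise_scale) for _ in range(num_samples)]
```

Setting `noise_scale=0.0` recovers the deterministic single-path behaviour that the paper attributes to earlier recursive models; any positive value yields distinct end states for the same input.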

If this is right

  • The model outperforms deterministic recurrent and recursive baselines on structured reasoning and multi-solution constraint tasks.
  • Inference-time scaling is possible by increasing recursive depth or drawing more parallel trajectories.
  • Unconditional generation of inputs becomes possible by sampling from the marginal p(x).
  • Multiple hypotheses and alternative solution strategies arise naturally from different latent trajectories.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same mechanism could be tested on planning problems where several valid action sequences exist for one goal.
  • Parallel trajectory sampling offers a direct way to trade extra compute for higher solution coverage without retraining.

Load-bearing premise

That amortized variational inference applied to stochastic trajectories will yield coherent and diverse reasoning paths instead of mode collapse or incoherent outputs.

What would settle it

On a multi-solution constraint satisfaction task, if increasing the number of sampled trajectories produces no gain in solution diversity or validity compared with a single trajectory, the benefit of the generative component is falsified.
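The test described above can be sketched as a coverage curve: count distinct valid solutions as the sample budget grows. The code below is an illustration under stated assumptions, not the paper's evaluation script. `toy_sampler` is a hypothetical stand-in (a random permutation) for decoded model trajectories, and N-Queens is used because the paper lists it among its multi-solution tasks.

```python
import random


def is_valid_queens(cols):
    """cols[r] is the queen's column in row r; True if no two queens attack."""
    n = len(cols)
    if sorted(cols) != list(range(n)):
        return False  # a column is reused or out of range
    diag_down = {r + c for r, c in enumerate(cols)}
    diag_up = {r - c for r, c in enumerate(cols)}
    return len(diag_down) == n and len(diag_up) == n


def toy_sampler(n, rng):
    """Hypothetical stand-in for one sampled trajectory: a random permutation."""
    cols = list(range(n))
    rng.shuffle(cols)
    return tuple(cols)


def coverage_curve(sampler, n, budgets, seed=0):
    """Number of distinct valid solutions found within the first k samples."""
    rng = random.Random(seed)
    samples = [sampler(n, rng) for _ in range(max(budgets))]
    curve = {}
    for k in budgets:
        curve[k] = len({s for s in samples[:k] if is_valid_queens(s)})
    return curve
```

Calling `coverage_curve(toy_sampler, 6, [1, 10, 100])` returns a dictionary keyed by budget. A curve that stays flat as the budget grows is the outcome the falsification criterion describes; a rising curve indicates that extra trajectories buy extra solutions.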

Figures

Figures reproduced from arXiv: 2605.19376 by Junyeob Baek, Mengye Ren, Mingyu Jo, Minsu Kim, Sungjin Ahn, Yoshua Bengio.

Figure 1. Comparison of Latent Reasoning Trajectories.
Figure 2. GRAM Architecture. A single stochastic latent transition in the hierarchical instantiation z = (h, l). After K low-level refinements via f_L, the high-level update f_H produces a deterministic proposal u_t, to which stochastic guidance ϵ_t is added: h_t = u_t + ϵ_t. Overview. GRAM models the conditional distribution p_θ(y | x) by marginalizing over stochastic latent reasoning trajectories. Given an input x, GRAM…
Figure 3. Performance on puzzle benchmarks. On both Sudoku-Extreme and ARC-AGI, GRAM consistently outperforms all deterministic recursive baselines (Looped TF, HRM, TRM), demonstrating that stochastic latent transitions yield substantial gains within the recursive-reasoning paradigm. Looped TF results on ARC-AGI are omitted due to prohibitive training cost (see Section C.1.1). Note that large reasoning model scores a…
Figure 4. (Left) Inference-time scaling on Sudoku-Extreme.
Figure 5. Unconditional Sudoku generation. Validity (%) of generated Sudoku puzzles. GRAM achieves higher validity than D3PM with substantially fewer parameters and steps. Setup. To investigate GRAM's unconditional generative capability beyond conditional reasoning, we evaluate generation in two domains: structured constraint generation on Sudoku (from empty boards, evaluated by the fraction of generated boards …
Figure 6. Visualization of the generation process and samples.
Figure 7. Qualitative examples of unconditional Sudoku generation by GRAM.
Figure 8. Full ELBO L_ELBO and surrogate objective L_GRAM throughout training (plotted as −ELBO, smaller is better). On both Sudoku-Extreme (left) and N-Queens 8 × 8 (right), both quantities decrease monotonically over training, indicating that gradient updates of L_GRAM consistently improve the full variational bound. The two curves do not coincide because L_ELBO sums KL contributions across all T_Total transitions whil…
Figure 9. Example of an 8 × 8 N-Queens puzzle instance. In this example, 5 queens are removed from the full board, leaving 3 queens. The model must find the positions of the remaining queens. This configuration admits exactly 3 valid solutions. Data Generation Details. The N-Queens problem requires placing N queens on an N × N chessboard such that no two queens attack each other, meaning no queens share the same row,…
Figure 10. Distribution of the number of valid solutions for generated N-Queens instances.
Figure 11. Graph Coloring Example. [Plot residue: panels "Vertex 8 Graph Coloring" and "Vertex 10 Graph Coloring"; x-axis: Number of solutions; y-axis: Counts.]
Figure 12. Distribution of the number of valid solutions for generated graph coloring instances.
Figure 13. Effect of sampling on ARC-AGI-1 without data augmentation.
Figure 14. Effect of augmentation on sampling efficiency.
Figure 15. Solution coverage analysis on N-Queens (…
Figure 16. Additional generated samples from GRAM. We provide 8 additional samples generated unconditionally on binarized MNIST using GRAM. Each row represents a single generated sample, visualized across its recursive refinement process.
Figure 17. Unconditional Sudoku generation setup. Starting from an empty board, the task is to generate complete boards, and validity is determined by whether the generated board satisfies all Sudoku constraints.
Figure 18. Latent reasoning trajectory of TRM. The red dot indicates the initial state h_0 and the green dot indicates the final state h_T. Background color represents the loss landscape: bright yellow corresponds to high loss regions, while dark blue indicates low loss (optimal) regions. TRM follows a single deterministic path with no ability to escape suboptimal trajectories.
Figure 19. Latent reasoning trajectories of GRAM (50 samples).
Original abstract

How should future neural reasoning systems implement extended computation? Recursive Reasoning Models (RRMs) offer a promising alternative to autoregressive sequence extension by performing iterative latent-state refinement with shared transition functions. Yet existing RRMs are largely deterministic, following a single latent trajectory and converging to a single prediction. We introduce Generative Recursive reAsoning Models (GRAM), a framework that turns recursive latent reasoning into probabilistic multi-trajectory computation. GRAM models reasoning as a stochastic latent trajectory, enabling multiple hypotheses, alternative solution strategies, and inference-time scaling through both recursive depth and parallel trajectory sampling. This yields a latent-variable generative model supporting conditional reasoning via $p_\theta(y \mid x)$ and, with fixed or absent inputs, unconditional generation via $p_\theta(x)$. Trained with amortized variational inference, GRAM improves over deterministic recurrent and recursive baselines on structured reasoning and multi-solution constraint satisfaction tasks, while demonstrating an unconditional generation capability. https://ahn-ml.github.io/gram-website/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Generative Recursive reAsoning Models (GRAM), a probabilistic framework that extends deterministic Recursive Reasoning Models by modeling reasoning as stochastic latent trajectories. Trained via amortized variational inference, GRAM supports conditional reasoning p_θ(y|x) and unconditional generation p_θ(x), with inference-time scaling via recursive depth and parallel trajectory sampling. The central empirical claim is that GRAM outperforms deterministic recurrent and recursive baselines on structured reasoning and multi-solution constraint satisfaction tasks while enabling unconditional generation.

Significance. If the reported gains hold under rigorous controls, the work would be significant for neural reasoning systems by addressing the single-trajectory limitation of prior RRMs and providing a generative model for multi-hypothesis reasoning. The combination of recursive structure with latent stochasticity and unconditional generation capability is a notable modeling advance.

major comments (2)
  1. [§4] §4 (Experiments): The abstract and framework description assert measurable improvements and diversity benefits from stochastic trajectories, yet the manuscript must include explicit quantitative results (e.g., accuracy deltas, error bars, dataset sizes, and ablation controls on the number of trajectories) to substantiate the central empirical claim; without these, the degree of support for outperformance over deterministic baselines cannot be assessed.
  2. [§3.2] §3.2 (Training and Inference): The claim that amortized variational inference on stochastic latent trajectories yields diverse, useful reasoning paths (rather than mode collapse or incoherent samples) is load-bearing for the multi-trajectory advantage; the manuscript should report specific diversity metrics (e.g., sample entropy or distinct solution counts) and controls for posterior collapse to verify this assumption holds in the reported tasks.
minor comments (2)
  1. [Abstract] Notation for the generative model p_θ(x) in the unconditional case should be clarified with respect to the input x being absent or fixed, to avoid ambiguity in the conditional vs. unconditional distinction.
  2. [Abstract] The website link is provided but the manuscript should include a brief summary of any additional results or visualizations hosted there to make the paper self-contained.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments. We address the two major comments point by point below. We have made revisions to the manuscript to provide the requested quantitative details and metrics, which we believe strengthen the empirical section.

Point-by-point responses
  1. Referee: [§4] §4 (Experiments): The abstract and framework description assert measurable improvements and diversity benefits from stochastic trajectories, yet the manuscript must include explicit quantitative results (e.g., accuracy deltas, error bars, dataset sizes, and ablation controls on the number of trajectories) to substantiate the central empirical claim; without these, the degree of support for outperformance over deterministic baselines cannot be assessed.

    Authors: We agree that the manuscript would benefit from more explicit quantitative results to allow readers to assess the improvements. In the revised manuscript, we have included additional tables in Section 4 that provide accuracy values with error bars (standard deviations over 5 independent runs), the exact dataset sizes for each experiment, and ablation studies showing performance for different numbers of trajectories (specifically 1, 5, and 10). These results demonstrate the advantage of GRAM's stochastic trajectories. revision: yes

  2. Referee: [§3.2] §3.2 (Training and Inference): The claim that amortized variational inference on stochastic latent trajectories yields diverse, useful reasoning paths (rather than mode collapse or incoherent samples) is load-bearing for the multi-trajectory advantage; the manuscript should report specific diversity metrics (e.g., sample entropy or distinct solution counts) and controls for posterior collapse to verify this assumption holds in the reported tasks.

    Authors: We thank the referee for this suggestion. To verify that the stochastic trajectories provide diverse reasoning paths without mode collapse, we have added in the revised manuscript specific metrics in Section 3.2, including the average entropy of the trajectory distributions and the number of distinct valid solutions obtained from sampling multiple trajectories. We also report the evolution of the KL divergence to confirm the absence of posterior collapse. These additions support our claim that the generative model produces useful diversity. revision: yes
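A common way to monitor the KL term for posterior collapse is to compute it per latent dimension and flag dimensions where the posterior has become indistinguishable from the prior. The sketch below assumes diagonal Gaussian posterior and prior, which the page does not confirm for GRAM; the function names and the `threshold` value are hypothetical.

```python
import math


def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """Per-dimension KL(q || p) for diagonal Gaussians, in nats."""
    kls = []
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        # Closed form for two univariate Gaussians N(mq, sq^2) and N(mp, sp^2).
        kl = math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5
        kls.append(kl)
    return kls


def collapsed_dims(kls, threshold=1e-2):
    """Indices whose KL is below the threshold, i.e. the posterior matches the prior."""
    return [i for i, kl in enumerate(kls) if kl < threshold]
```

For example, `kl_diag_gaussians([0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])` gives a zero KL in the first dimension and 0.5 nats in the second, so `collapsed_dims` flags only index 0. If every dimension at every transition is flagged, the sampled trajectories carry no information from the posterior and the multi-trajectory benefit would be in doubt.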

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The paper presents GRAM as a modeling framework that extends deterministic recursive reasoning models into a stochastic latent-variable generative model trained via amortized variational inference. No derivation chain, equations, or first-principles results are shown in the abstract or description that reduce any claimed prediction or result to fitted inputs or self-citations by construction. The central claims concern empirical improvements on reasoning tasks and unconditional generation capability, which are presented as outcomes of the proposed architecture rather than identities or renamings of prior results. No load-bearing self-citation, ansatz smuggling, or uniqueness theorem from overlapping authors is visible. The framework is self-contained as an architectural proposal with independent empirical content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Review is limited to the abstract; the ledger therefore records only the modeling assumptions stated or implied there.

axioms (1)
  • domain assumption: Amortized variational inference can train a latent-variable model of stochastic reasoning trajectories without mode collapse or loss of diversity.
    Standard VAE assumption invoked to justify training the generative recursive model.
invented entities (1)
  • Stochastic latent trajectory (no independent evidence)
    purpose: To represent multiple alternative reasoning paths inside the recursive model.
    New modeling object introduced to turn deterministic refinement into probabilistic multi-trajectory computation.

pith-pipeline@v0.9.0 · 5719 in / 1238 out tokens · 32738 ms · 2026-05-20T05:53:36.638888+00:00 · methodology



Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 23 internal anchors

  1. [1]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

  2. [2]

    Tree of thoughts: Deliberate problem solving with large language models

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in neural information processing systems, 36:11809–11822, 2023

  3. [3]

    Graph of thoughts: Solving elaborate problems with large language models

    Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI conference on artificial intelligence, volume 38, pages 17682–17690, 2024

  4. [4]

    Training Large Language Models to Reason in a Continuous Latent Space

    Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769, 2024

  5. [5]

    Reasoning by superposition: A theoretical perspective on chain of continuous thought

    Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, and Yuandong Tian. Reasoning by superposition: A theoretical perspective on chain of continuous thought. arXiv preprint arXiv:2505.12514, 2025

  6. [6]

    Continuous chain of thought enables parallel exploration and reasoning

    Halil Alperen Gozeten, M Emrullah Ildiz, Xuechen Zhang, Hrayr Harutyunyan, Ankit Singh Rawat, and Samet Oymak. Continuous chain of thought enables parallel exploration and reasoning. arXiv preprint arXiv:2505.23648, 2025

  7. [7]

    Looped transformers are better at learning learning algorithms

    Liu Yang, Kangwook Lee, Robert Nowak, and Dimitris Papailiopoulos. Looped transformers are better at learning learning algorithms. arXiv preprint arXiv:2311.12424, 2023

  8. [8]

    Hierarchical Reasoning Model

    Guan Wang, Jin Li, Yuhao Sun, Xing Chen, Changling Liu, Yue Wu, Meng Lu, Sen Song, and Yasin Abbasi Yadkori. Hierarchical reasoning model. arXiv preprint arXiv:2506.21734, 2025

  9. [9]

    Less is More: Recursive Reasoning with Tiny Networks

    Alexia Jolicoeur-Martineau. Less is more: Recursive reasoning with tiny networks. arXiv preprint arXiv:2510.04871, 2025

  10. [10]

    Universal Transformers

    Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. arXiv preprint arXiv:1807.03819, 2018

  11. [11]

    Thinking, fast and slow

    Daniel Kahneman. Thinking, fast and slow. Farrar, Straus and Giroux, 2011

  12. [12]

    The consciousness prior

    Yoshua Bengio. The consciousness prior. arXiv preprint arXiv:1709.08568, 2017

  13. [13]

    On the Measure of Intelligence

    François Chollet. On the measure of intelligence. arXiv preprint arXiv:1911.01547, 2019

  14. [14]

    ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems

    Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard. Arc-agi-2: A new challenge for frontier ai reasoning systems. arXiv preprint arXiv:2505.11831, 2025

  15. [15]

    Gradient-based learning applied to document recognition

    Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. doi: 10.1109/5.726791

  16. [16]

    An efficient gradient-based algorithm for on-line training of recurrent network trajectories

    Ronald J Williams and Jing Peng. An efficient gradient-based algorithm for on-line training of recurrent network trajectories. Neural computation, 2(4):490–501, 1990

  17. [17]

    Unbiasing Truncated Backpropagation Through Time

    Corentin Tallec and Yann Ollivier. Unbiasing truncated backpropagation through time. arXiv preprint arXiv:1705.08209, 2017

  18. [18]

    Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

    Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up test-time compute with latent reasoning: A recurrent depth approach. arXiv preprint arXiv:2502.05171, 2025

  19. [19]

    Text generation beyond discrete token sampling

    Yufan Zhuang, Liyuan Liu, Chandan Singh, Jingbo Shang, and Jianfeng Gao. Text generation beyond discrete token sampling. arXiv preprint arXiv:2505.14827, 2025

  20. [20]

    Soft thinking: Unlocking the reasoning potential of llms in continuous concept space

    Zhen Zhang, Xuehai He, Weixiang Yan, Ao Shen, Chenyang Zhao, Shuohang Wang, Yelong Shen, and Xin Eric Wang. Soft thinking: Unlocking the reasoning potential of llms in continuous concept space. arXiv preprint arXiv:2505.15778, 2025

  21. [21]

    Soft tokens, hard truths

    Natasha Butt, Ariel Kwiatkowski, Ismail Labiad, Julia Kempe, and Yann Ollivier. Soft tokens, hard truths. arXiv preprint arXiv:2509.19170, 2025

  22. [22]

    CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation

    Zhenyi Shen, Hanqi Yan, Linhai Zhang, Zhanghao Hu, Yali Du, and Yulan He. Codi: Compressing chain-of-thought into continuous space via self-distillation. arXiv preprint arXiv:2502.21074, 2025

  23. [23]

    Hybrid latent reasoning via reinforcement learning

    Zhenrui Yue, Bowen Jin, Huimin Zeng, Honglei Zhuang, Zhen Qin, Jinsung Yoon, Lanyu Shang, Jiawei Han, and Dong Wang. Hybrid latent reasoning via reinforcement learning. arXiv preprint arXiv:2505.18454, 2025

  24. [24]

    Mixture-of-recursions: Learning dynamic recursive depths for adaptive token-level computation

    Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, et al. Mixture-of-recursions: Learning dynamic recursive depths for adaptive token-level computation. arXiv preprint arXiv:2507.10524, 2025

  25. [25]

    Cotformer: A chain-of-thought driven architecture with budget-adaptive computation cost at inference

    Amirkeivan Mohtashami, Matteo Pagliardini, and Martin Jaggi. Cotformer: A chain-of-thought driven architecture with budget-adaptive computation cost at inference. arXiv preprint arXiv:2310.10845, 2023

  26. [26]

    Relaxed recursive transformers: Effective parameter sharing with layer-wise lora

    Sangmin Bae, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Seungyeon Kim, and Tal Schuster. Relaxed recursive transformers: Effective parameter sharing with layer-wise lora. arXiv preprint arXiv:2410.20672, 2024

  27. [27]

    Finding structure in time

    Jeffrey L Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990

  28. [28]

    Long short-term memory

    Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997

  29. [29]

    Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

    Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014

  30. [30]

    ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

    Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942, 2019

  31. [31]

    Depth-adaptive transformer

    Maha Elbayad, Jiatao Gu, Edouard Grave, and Michael Auli. Depth-adaptive transformer. arXiv preprint arXiv:1910.10073, 2019

  32. [32]

    Adaptive Computation Time for Recurrent Neural Networks

    Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983, 2016

  33. [33]

    A Recurrent Latent Variable Model for Sequential Data

    Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data, 2016. URL https://arxiv.org/abs/1506.02216

  34. [34]

    Sequential Neural Models with Stochastic Layers

    Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. Sequential neural models with stochastic layers, 2016. URL https://arxiv.org/abs/1605.07571

  35. [35]

    Deep Kalman Filters

    Rahul G. Krishnan, Uri Shalit, and David Sontag. Deep kalman filters, 2015. URL https://arxiv.org/abs/1511.05121

  36. [36]

    Learning Latent Dynamics for Planning from Pixels

    Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels, 2019. URL https://arxiv.org/abs/1811.04551

  37. [37]

    Mastering Atari with Discrete World Models

    Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models. arXiv preprint arXiv:2010.02193, 2020

  38. [38]

    Mastering Diverse Domains through World Models

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023

  39. [39]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  40. [40]

    ARC-AGI benchmarking: Leaderboard and dataset for the ARC-AGI benchmark

    ARC-Prize-Foundation. ARC-AGI benchmarking: Leaderboard and dataset for the ARC-AGI benchmark. https://arcprize.org/leaderboard, 2026. Accessed: 2026-1-22

  41. [41]

    Simple and effective masked diffusion language models

    Subham Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models. Advances in Neural Information Processing Systems, 37:130136–130184, 2024

  42. [42]

    A Note on the Inception Score

    Shane Barratt and Rishi Sharma. A note on the inception score. arXiv preprint arXiv:1801.01973, 2018

  43. [43]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017

  44. [44]

    Structured denoising diffusion models in discrete state-spaces

    Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. Structured denoising diffusion models in discrete state-spaces. Advances in neural information processing systems, 34:17981–17993, 2021

  45. [45]

    Auto-Encoding Variational Bayes

    Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013

  46. [46]

    Roformer: Enhanced transformer with rotary position embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

  47. [47]

    GLU Variants Improve Transformer

    Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020

  48. [48]

    Minimal implementation of a d3pm (structured denoising diffusion models in discrete state-spaces), in pytorch

    Simo Ryu. Minimal implementation of a d3pm (structured denoising diffusion models in discrete state-spaces), in pytorch. https://github.com/cloneofsimo/d3pm, 2024

  49. [49]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  50. [50]

    Gradient-based learning applied to document recognition

    Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 2002

  51. [51]

    Imagenet classification with deep convolutional neural networks

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012

  52. [52]

    Sigmoid-weighted linear units for neural network function approximation in reinforcement learning

    Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural networks, 107:3–11, 2018

  53. [53]

    Group normalization

    Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European conference on computer vision (ECCV), pages 3–19, 2018

  54. [54]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

  55. [55]

    Large Scale GAN Training for High Fidelity Natural Image Synthesis

    Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018

  56. [56]

    Improved techniques for training score-based generative models

    Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. Advances in neural information processing systems, 33:12438–12448, 2020

  57. [57]

    On the strength of connectedness of a random graph

    P. Erdős and A. Rényi. On the strength of connectedness of a random graph. Acta Mathematica Academiae Scientiarum Hungarica, 12(1):261–267, Mar 1964. ISSN 1588-2632. doi: 10.1007/BF02066689. URL https://doi.org/10.1007/BF02066689

  58. [58]

    Graph colouring meets deep learning: Effective graph neural network models for combinatorial problems

    Henrique Lemos, Marcelo Prates, Pedro Avelar, and Luis Lamb. Graph colouring meets deep learning: Effective graph neural network models for combinatorial problems. In 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI), pages 879–885. IEEE, 2019

  59. [59]

    Principal component analysis (pca)

    Alexey L. Pomerantsev. Principal component analysis (pca). Encyclopedia of Autism Spectrum Disorders, 2014. URL https://api.semanticscholar.org/CorpusID:2534141

  60. [60]

    Multidimensional binary search trees used for associative searching

    Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Commun. ACM, 18:509–517, 1975. URL https://api.semanticscholar.org/CorpusID:13091446. A Additional Method Details. A.1 Adaptive Computation Time. GRAM optionally adopts adaptive computation time (ACT) [8–10] at inference, allowing each trajectory to terminate at a ...

  61. [61]

    Input Tokens (B, C, H, W) Linear Scaling [−1, 1] (B, C, H, W)

    Norm. Input Tokens (B, C, H, W) Linear Scaling [−1, 1] (B, C, H, W)

  62. [62]

    Conv Conv2d 5×5 (p = 2) (B, D/2, H, W) SiLU → GN(32) Conv2d 5×5 (p = 2) (B, D/2, H, W) SiLU → GN(32)

  63. [63]

    Following Wang et al. [8], Jolicoeur-Martineau [9], both the input and output are represented as sequences of shape [B, L], where B denotes the batch size and L the context length

    Patch Flatten Patches (B, Np, P^2 · D/2) Linear Projection (B, Np, D) Hyperparameters. Following Wang et al. [8], Jolicoeur-Martineau [9], both the input and output are represented as sequences of shape [B, L], where B denotes the batch size and L the context length. Each input sequence includes 16 fixed puzzle embedding tokens. The latent states ht and l...