Multi-Perspective Transformers in ARC-AGI-2 Challenge

Caleb Talley; Fariha Sheikh; Seun Adekunle; Vedant Tibrewal; Weiwen Dong; Xinyu Wu

arxiv: 2605.01154 · v1 · submitted 2026-05-01 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-09 18:59 UTC · model grok-4.3

keywords: ARC-AGI-2 · transformers · test-time training · products of experts · generalization · few-shot learning · visual reasoning · TinyLM

The pith

A TinyLM-based transformer with test-time training and products of experts reaches 21.7% accuracy on ARC-AGI-2 evaluation puzzles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a multi-perspective transformer approach built on TinyLM and augmented with test-time training plus products of experts to address ARC-AGI-2 visual puzzles. These puzzles test the ability to generalize from few examples, read symbolic patterns, and apply rules in new arrangements. The reported results show the model reaches 96.1% accuracy on the training set while attaining 21.7% on the evaluation set. A sympathetic reader would care because the benchmark isolates core generalization skills that current large-scale training often fails to produce.

Core claim

Our model, based on TinyLM with multi-perspective transformers and additional fine-tuning at test time that includes Test-Time-Training (TTT) and Products of Experts (POE), achieves 96.1% accuracy on the training set and 21.7% accuracy on the evaluation set of ARC-AGI-2.

What carries the argument

Multi-perspective transformers combined with Test-Time-Training (TTT) and Products of Experts (POE) that adapt the TinyLM base model to each new puzzle at inference time.
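As a rough illustration of how a products-of-experts step can combine several perspectives, the sketch below multiplies the probabilities that each view assigns to a candidate answer (a sum in log space) and renormalizes. The function name, the candidate labels, and the probabilities are illustrative assumptions; the paper's actual scoring procedure is not described in this summary.

```python
import math

def product_of_experts(candidate_logprobs):
    """Combine per-view log-probabilities for each candidate answer.

    candidate_logprobs maps a candidate id to a list of log-probabilities,
    one per view (expert). A product of probabilities is a sum of logs.
    """
    totals = {c: sum(lps) for c, lps in candidate_logprobs.items()}
    # Normalize with log-sum-exp so the combined scores sum to one.
    m = max(totals.values())
    log_z = m + math.log(sum(math.exp(t - m) for t in totals.values()))
    return {c: math.exp(t - log_z) for c, t in totals.items()}

# Hypothetical scores from three views of the same puzzle.
scores = {
    "candidate_a": [math.log(0.6), math.log(0.5), math.log(0.7)],
    "candidate_b": [math.log(0.9), math.log(0.1), math.log(0.2)],
}
combined = product_of_experts(scores)
best = max(combined, key=combined.get)
```

In this toy case a candidate that one view strongly prefers but the others reject loses to a candidate that every view finds plausible, which is the usual motivation for taking a product rather than an average.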

If this is right

The adaptation techniques produce higher training accuracy by capturing common puzzle structures.
Moderate evaluation accuracy demonstrates partial success at applying learned rules to unseen puzzle layouts.
Products of experts combine multiple learned perspectives to improve robustness on symbolic tasks.
The overall pipeline shows one workable way to handle few-example generalization in visual reasoning benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar test-time adaptation could be tested on other abstraction benchmarks to see whether the same accuracy gap appears.
Scaling the base model size while keeping the TTT and POE components fixed would test whether larger capacity narrows the training-to-evaluation drop.
Evaluating on a fresh set of puzzles that introduce entirely new rule primitives would isolate whether the current generalization is limited to patterns seen in training.

Load-bearing premise

That test-time training and products of experts let the model flexibly interpret symbolic meaning and apply rules across varying contexts rather than simply overfitting to the training puzzle distributions.

What would settle it

If evaluation accuracy stays at roughly the same level when test-time training and products of experts are turned off, the gains come from the base model rather than the adaptation methods.
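The ablation described here could be organized as in the minimal sketch below, which scores several solver variants on the same task list using exact-match accuracy. The variant names, the toy tasks, and the lambda solvers are stand-ins invented for illustration; they are not the paper's models or data.

```python
def exact_match_accuracy(solver, tasks):
    """Fraction of tasks whose predicted grid equals the target exactly."""
    if not tasks:
        return 0.0
    correct = sum(1 for t in tasks if solver(t["input"]) == t["output"])
    return correct / len(tasks)

def ablation_table(solvers, tasks):
    """Score each named solver variant on the same task list."""
    return {name: exact_match_accuracy(fn, tasks) for name, fn in solvers.items()}

# Toy stand-ins (assumptions): one variant copies the input,
# the other reverses each row.
toy_tasks = [
    {"input": [[1, 2]], "output": [[2, 1]]},
    {"input": [[3, 3]], "output": [[3, 3]]},
]
variants = {
    "base_only": lambda g: g,
    "base_plus_adaptation": lambda g: [row[::-1] for row in g],
}
results = ablation_table(variants, toy_tasks)
```

Running every variant on an identical task list keeps the comparison fair: any difference in the table is attributable to the component that was switched off.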

Figures

Figures reproduced from arXiv: 2605.01154 by Caleb Talley, Fariha Sheikh, Seun Adekunle, Vedant Tibrewal, Weiwen Dong, Xinyu Wu.

**Figure 1.** Example ARC-AGI-2 Puzzles (Huang and Grady, 2024)

1.2 Methods Overview

The algorithm starts with tokenizing each grid, turning it into a short left-to-right text with small markers for width, height, and colors, so the model understands the grid's shape and palette. We then create multiple views of the same puzzle by rotating, flipping, transposing, and relabeling colors. Using the tokenized examples, a c…
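The tokenization and view-generation steps in the Methods Overview excerpt can be sketched in plain Python as follows. The marker strings (`<w2>`, `<h2>`, `<pal0>`, `<eol>`), the cell token format, and the particular set of views are assumptions made for illustration; the paper's actual vocabulary is shown only in its Figure 2.

```python
def tokenize_grid(grid):
    """Flatten a grid row by row, prefixed with width, height, and palette markers."""
    height, width = len(grid), len(grid[0])
    palette = sorted({c for row in grid for c in row})
    tokens = [f"<w{width}>", f"<h{height}>"]
    tokens += [f"<pal{c}>" for c in palette]
    for row in grid:
        tokens += [f"c{c}" for c in row]
        tokens.append("<eol>")  # hypothetical end-of-row marker
    return tokens

def rotate90(grid):
    """Rotate the grid clockwise by 90 degrees."""
    return [list(row) for row in zip(*grid[::-1])]

def flip_horizontal(grid):
    """Mirror each row left to right."""
    return [row[::-1] for row in grid]

def transpose(grid):
    """Swap rows and columns."""
    return [list(row) for row in zip(*grid)]

def relabel_colors(grid, mapping):
    """Replace colors according to mapping; unmapped colors are kept."""
    return [[mapping.get(c, c) for c in row] for row in grid]

def make_views(grid, color_mapping):
    """Return several views of the same grid, keyed by transformation name."""
    return {
        "identity": grid,
        "rot90": rotate90(grid),
        "flip_h": flip_horizontal(grid),
        "transpose": transpose(grid),
        "recolor": relabel_colors(grid, color_mapping),
    }

grid = [[0, 1], [2, 0]]
views = make_views(grid, {1: 2, 2: 1})
tokens = tokenize_grid(grid)
```

For a full puzzle, the same transformation would have to be applied to every input and output grid in the task so that the underlying rule is preserved within each view.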

**Figure 2.** Vocabulary Dictionary

**Figure 3.** Encoding Algorithm

3.4 TinyLM Transformer Model

Our system is built around TinyLM, a decoder-only transformer architecture meticulously optimized for the Abstraction and Reasoning Corpus (ARC) domain. The design centers on balancing representational power, computational efficiency, and generalization capability, prioritizing compactness with a default configuration of approximately 20 million parameters. …
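The figure of roughly 20 million parameters can be sanity-checked with the usual back-of-the-envelope count for a decoder-only transformer: about 12·d² weights per layer (4·d² for attention, 8·d² for a feed-forward block with 4× expansion), plus token and position embeddings. The configuration below (width 512, 6 layers, vocabulary of 64, context of 2048) is purely a hypothetical example that lands near 20M; the excerpt does not state TinyLM's actual hyperparameters, and biases and layer norms are ignored.

```python
def decoder_param_estimate(d_model, n_layers, vocab_size, context_len):
    """Approximate weight count for a decoder-only transformer.

    Ignores biases and layer-norm parameters and assumes a 4x MLP expansion
    and tied input/output embeddings.
    """
    attention = 4 * d_model * d_model      # Q, K, V, and output projections
    mlp = 8 * d_model * d_model            # two d x 4d matrices
    per_layer = attention + mlp
    embeddings = vocab_size * d_model + context_len * d_model
    return n_layers * per_layer + embeddings

# Hypothetical configuration, not taken from the paper.
count = decoder_param_estimate(d_model=512, n_layers=6, vocab_size=64, context_len=2048)
```

Because an ARC vocabulary is tiny compared with a natural-language one, almost all of the budget in such a configuration sits in the transformer layers rather than the embeddings.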

**Figure 4.** Model evaluation before few-shot prompting

**Figure 5.** Model evaluation after few-shot prompting

Original abstract

ARC-AGI-2 is a benchmark of human-intuitive visual puzzles that measures a machine's ability to generalize from limited examples, interpret symbolic meaning, and flexibly apply rules in varying contexts. In this paper, we discuss our approach to solving the ARC-AGI-2 puzzles with TinyLM, with additional fine-tuning at test time, including Test-Time-Training (TTT) and Products of Experts (POE). Our model achieves 96.1% accuracy on the training set and 21.7% accuracy on the evaluation set.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

High train accuracy and modest eval score on ARC-AGI-2 likely indicate memorization through test-time adaptation rather than genuine abstraction.

The main thing to know is that the authors get 21.7% accuracy on the ARC-AGI-2 evaluation set using a small model with test-time training and products of experts, after hitting 96.1% on the training set. This is an incremental result at best. The paper applies established techniques to the benchmark without proposing new theoretical ideas or first-principles changes. The multi-perspective transformers framing seems to be a way to combine different views, but it builds directly on prior work in adaptation methods. Reporting specific accuracy numbers is at least a positive step for transparency.

The main weakness is that the high training accuracy raises a real question about whether the model is abstracting rules or simply memorizing the limited set of training puzzles. ARC tasks are few and repetitive in structure, so test-time training could easily allow per-task fitting rather than cross-task generalization. The evaluation number by itself does not address this without ablations that test the contribution of each component or controls for how novel the evaluation tasks are. There are also no details on experimental setup like data splits or variance in the results.

This work is aimed at researchers who follow every update on the ARC-AGI benchmark and want to see how adaptation tricks perform there. A reader looking for new insights into abstraction or generalization will not find much here. The evidence presented does not strongly support the claim of flexible rule application in varying contexts. I would not bring this to a reading group or cite it in my own work. It does not deserve peer review in its current state because the central results require more rigorous validation to be convincing.

Referee Report

2 major / 2 minor

Summary. The paper proposes using a TinyLM transformer model enhanced with test-time training (TTT) and products of experts (POE) to solve ARC-AGI-2 visual reasoning puzzles. It claims this multi-perspective approach enables interpretation of symbolic meaning and flexible rule application, achieving 96.1% accuracy on the training set and 21.7% accuracy on the evaluation set.

Significance. If the reported accuracies are reproducible and ablations confirm that TTT and POE drive genuine cross-task generalization rather than memorization of the limited ARC task distribution, the result would be significant for few-shot abstract reasoning. ARC-AGI-2 is a demanding benchmark, and a 21.7% evaluation score would exceed many current baselines; however, the absence of any experimental controls, baselines, or statistical details prevents assessing whether the claim holds.

major comments (2)

[Abstract] The central empirical claims (96.1% training accuracy and 21.7% evaluation accuracy) are presented with no description of the experimental protocol, including train/eval split sizes, whether TTT is applied to evaluation tasks (and if so, how without ground-truth labels), model size, training hyperparameters, or any ablation studies isolating TTT or POE. This is load-bearing for the generalization claim, as the high training accuracy is consistent with per-task memorization on the ~400 structured ARC tasks.
[Abstract] No comparison to baselines (e.g., TinyLM without TTT/POE, or standard ARC solvers) or controls for task overlap between splits is provided. Without such evidence, the 21.7% evaluation figure cannot be interpreted as support for the claim that TTT and POE enable the model to "interpret symbolic meaning, and flexibly apply rules in varying contexts" rather than overfit to training puzzle distributions.
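On the question of how TTT can run on evaluation tasks without ground-truth labels: one common approach, shown here as an assumption rather than as the paper's confirmed procedure, is to build training examples from the demonstration pairs that ship with each task. Each demonstration takes a turn as the held-out query while the others act as context, so the hidden test output is never touched. The function and field names below are illustrative.

```python
def leave_one_out_examples(demo_pairs):
    """Build test-time training examples from a task's demonstration pairs only.

    demo_pairs is a list of (input_grid, output_grid) tuples. Each pair is
    held out once as the query; the remaining pairs form the context.
    """
    examples = []
    for i, (query_in, query_out) in enumerate(demo_pairs):
        context = demo_pairs[:i] + demo_pairs[i + 1:]
        examples.append({"context": context, "query": query_in, "target": query_out})
    return examples

# Three hypothetical demonstration pairs for one task.
demos = [
    ([[1]], [[2]]),
    ([[3]], [[4]]),
    ([[5]], [[6]]),
]
ttt_examples = leave_one_out_examples(demos)
```

A task with n demonstrations yields n such examples, and the count can be multiplied further by applying geometric or color augmentations to each one.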

minor comments (2)

[Abstract] The term 'TinyLM' is used without definition, architecture details, or citation to prior work.
[Abstract] No related-work section or discussion of prior ARC-AGI solvers is present, making it difficult to situate the contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater experimental transparency. We agree that the current abstract is too brief and will revise the manuscript to include the requested details on protocol, ablations, and baselines while preserving the core claims.

Referee: [Abstract] The central empirical claims (96.1% training accuracy and 21.7% evaluation accuracy) are presented with no description of the experimental protocol, including train/eval split sizes, whether TTT is applied to evaluation tasks (and if so, how without ground-truth labels), model size, training hyperparameters, or any ablation studies isolating TTT or POE. This is load-bearing for the generalization claim, as the high training accuracy is consistent with per-task memorization on the ~400 structured ARC tasks.

Authors: We accept this criticism. The abstract will be expanded in revision to state the ARC-AGI-2 split sizes (~400 training tasks, held-out evaluation set), TinyLM parameter count, core hyperparameters, and the TTT procedure (self-supervised adaptation on test inputs alone, without ground-truth labels, via reconstruction and consistency objectives). We will also add a dedicated experimental section with ablations that isolate TTT and POE contributions, directly addressing the memorization concern by showing performance drops when either component is removed.

revision: yes

Referee: [Abstract] No comparison to baselines (e.g., TinyLM without TTT/POE, or standard ARC solvers) or controls for task overlap between splits is provided. Without such evidence, the 21.7% evaluation figure cannot be interpreted as support for the claim that TTT and POE enable the model to "interpret symbolic meaning, and flexibly apply rules in varying contexts" rather than overfit to training puzzle distributions.

Authors: We agree that baselines and overlap controls are required for interpretability. The revised manuscript will include a new results table comparing the full model against (i) TinyLM without TTT or POE and (ii) representative published ARC baselines. We will also document the train/eval split construction to confirm zero task overlap, allowing the 21.7% evaluation accuracy to be read as evidence of cross-task generalization rather than distribution memorization.

revision: yes
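A zero-overlap claim of this kind could be checked mechanically. The sketch below fingerprints each task by hashing a canonical JSON serialization and reports evaluation tasks whose fingerprint also appears in the training split. The task dictionary layout and the task names are assumptions for illustration, and the check catches only exact duplicates, not rotated or recolored variants of the same puzzle.

```python
import hashlib
import json

def task_fingerprint(task):
    """Hash a task's grids in a canonical JSON form."""
    canonical = json.dumps(task, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def overlapping_tasks(train_tasks, eval_tasks):
    """Return names of evaluation tasks that exactly duplicate a training task."""
    train_hashes = {task_fingerprint(t) for t in train_tasks.values()}
    return sorted(
        name for name, t in eval_tasks.items()
        if task_fingerprint(t) in train_hashes
    )

# Hypothetical miniature splits.
train_split = {"t1": {"train": [{"input": [[1]], "output": [[2]]}]}}
eval_split = {
    "e1": {"train": [{"input": [[1]], "output": [[2]]}]},  # duplicate of t1
    "e2": {"train": [{"input": [[7]], "output": [[8]]}]},
}
duplicates = overlapping_tasks(train_split, eval_split)
```

A stricter control would canonicalize each grid over the same rotations, flips, and color relabelings used for augmentation before hashing, so that transformed copies also register as overlap.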

Circularity Check

0 steps flagged

No circularity: empirical performance claims with no derivations or self-referential reductions

The manuscript reports an empirical ML approach (TinyLM + TTT + POE) and two accuracy numbers (96.1% train, 21.7% eval) on ARC-AGI-2. No equations, parameter-fitting steps, uniqueness theorems, or derivation chains appear in the text. The central claims are measured accuracies rather than any quantity that is defined in terms of itself or obtained by renaming a fitted input as a prediction. Self-citations, if present, are irrelevant because no load-bearing mathematical argument exists to reduce to them. The result is therefore self-contained against external benchmarks and receives the default non-circularity score.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no equations, parameters, or new constructs, so the ledger is empty.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

[1] Sonya Huang and Pat Grady.
[2] Searching Latent Program Spaces. 2025.
[3] ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems. 2025.
[4] Francois Chollet, Mike Knoop, Greg Kamradt, Walter Reade, and Addison Howard. 2025.
[5] Lewish Hemens.
[6] Geoffrey E. Hinton. Neural Computation, 2002.
[7] The Surprising Effectiveness of Test-Time Training for Abstract Reasoning. arXiv preprint arXiv:2411.07279.
[8] Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Advances in Neural Information Processing Systems 36 (NeurIPS 2023).
[9] Product of Experts with LLMs: Boosting Performance on ARC Is a Matter of Perspective. 2025.
[10] ARC Prize Foundation. 2025.
[11] Michael Hodel. 2023.
[12] Lewis Hughes. 2025.
[13] Michael Hodel. 2022.
