Multi-Perspective Transformers in ARC-AGI-2 Challenge
Pith reviewed 2026-05-09 18:59 UTC · model grok-4.3
The pith
A TinyLM-based transformer with test-time training and products of experts reaches 21.7% accuracy on ARC-AGI-2 evaluation puzzles.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our model, based on TinyLM with multi-perspective transformers and additional fine-tuning at test time that includes Test-Time-Training (TTT) and Products of Experts (POE), achieves 96.1% accuracy on the training set and 21.7% accuracy on the evaluation set of ARC-AGI-2.
What carries the argument
Multi-perspective transformers combined with Test-Time-Training (TTT) and Products of Experts (POE) that adapt the TinyLM base model to each new puzzle at inference time.
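The page does not spell out what "multi-perspective" means in practice. One common reading in ARC work is that the same puzzle is presented under several invertible views (rotations, reflections, transposition) and each prediction is mapped back to the original orientation afterwards. The following is a minimal sketch under that assumption. Grids are assumed to be lists of lists of integers, and every name here (`PERSPECTIVES`, `views`, `restore`) is hypothetical rather than taken from the paper.

```python
from typing import Callable, Dict, List, Tuple

Grid = List[List[int]]


def identity(g: Grid) -> Grid:
    """Return a copy of the grid unchanged."""
    return [row[:] for row in g]


def rot90(g: Grid) -> Grid:
    """Rotate 90 degrees clockwise."""
    return [list(row) for row in zip(*g[::-1])]


def rot180(g: Grid) -> Grid:
    """Rotate 180 degrees."""
    return [row[::-1] for row in g[::-1]]


def rot270(g: Grid) -> Grid:
    """Rotate 90 degrees counterclockwise."""
    return [list(row) for row in zip(*g)][::-1]


def flip_h(g: Grid) -> Grid:
    """Mirror left to right."""
    return [row[::-1] for row in g]


def transpose(g: Grid) -> Grid:
    """Swap rows and columns."""
    return [list(row) for row in zip(*g)]


# Each perspective maps to (forward transform, inverse transform).
PERSPECTIVES: Dict[str, Tuple[Callable[[Grid], Grid], Callable[[Grid], Grid]]] = {
    "identity": (identity, identity),
    "rot90": (rot90, rot270),
    "rot180": (rot180, rot180),
    "rot270": (rot270, rot90),
    "flip_h": (flip_h, flip_h),
    "transpose": (transpose, transpose),
}


def views(grid: Grid) -> Dict[str, Grid]:
    """Produce one transformed copy of the grid per perspective."""
    return {name: fwd(grid) for name, (fwd, _) in PERSPECTIVES.items()}


def restore(name: str, grid: Grid) -> Grid:
    """Map a grid predicted under a perspective back to the original frame."""
    return PERSPECTIVES[name][1](grid)
```

Because every transform has an exact inverse, predictions made under different views can be compared cell by cell once they are restored.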
If this is right
- The adaptation techniques produce higher training accuracy by capturing common puzzle structures.
- Moderate evaluation accuracy demonstrates partial success at applying learned rules to unseen puzzle layouts.
- Products of experts combine multiple learned perspectives to improve robustness on symbolic tasks.
- The overall pipeline shows one workable way to handle few-example generalization in visual reasoning benchmarks.
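To make the products-of-experts point concrete: the classic formulation multiplies the probabilities that each expert assigns to a candidate and renormalizes, so a candidate that any one expert finds very unlikely is heavily penalized. The sketch below assumes each "expert" is a score for the same candidate answer under a different perspective, which is an interpretation and not something the page confirms. The function name and input layout are hypothetical.

```python
import math
from typing import Dict, List


def product_of_experts(log_probs: Dict[str, List[float]]) -> Dict[str, float]:
    """Combine per-expert log-probabilities into one normalized distribution.

    log_probs maps a candidate label to its log-probability under each expert.
    Summing logs is the same as multiplying probabilities.
    """
    if not log_probs:
        raise ValueError("at least one candidate is required")
    counts = {len(lps) for lps in log_probs.values()}
    if len(counts) != 1 or 0 in counts:
        raise ValueError("every candidate needs a score from every expert")

    scores = {cand: sum(lps) for cand, lps in log_probs.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    weights = {cand: math.exp(s - top) for cand, s in scores.items()}
    total = sum(weights.values())
    return {cand: w / total for cand, w in weights.items()}


# Illustrative numbers only: candidate B is favored by two experts but
# nearly vetoed by the second one, so the product prefers candidate A.
example = {
    "A": [math.log(0.5), math.log(0.4), math.log(0.5)],
    "B": [math.log(0.9), math.log(0.01), math.log(0.6)],
}
combined = product_of_experts(example)
best = max(combined, key=combined.get)
```

The design choice that matters is the product itself: averaging the same scores would have favored the candidate with the highest single-expert probability, whereas the product rewards agreement across all views.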
Where Pith is reading between the lines
- Similar test-time adaptation could be tested on other abstraction benchmarks to see whether the same accuracy gap appears.
- Scaling the base model size while keeping the TTT and POE components fixed would test whether larger capacity narrows the training-to-evaluation drop.
- Evaluating on a fresh set of puzzles that introduce entirely new rule primitives would isolate whether the current generalization is limited to patterns seen in training.
Load-bearing premise
That test-time training and products of experts let the model flexibly interpret symbolic meaning and apply rules across varying contexts rather than simply overfitting to the training puzzle distributions.
What would settle it
If evaluation-set performance stays at roughly the same level with test-time training and products of experts turned off, the gains come from the base model rather than from the adaptation methods.
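The check described here amounts to a two-by-two ablation: score the same tasks with each component on or off and compare the four accuracies. The following is a minimal sketch of such a harness. The `Solver` signature, the flag names, and the toy solver are all hypothetical stand-ins, since the page gives no details of the actual pipeline.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

Grid = List[List[int]]
Task = Tuple[Grid, Grid]  # (test input, expected output)
# Hypothetical solver signature: (input grid, use_ttt, use_poe) -> predicted grid
Solver = Callable[[Grid, bool, bool], Grid]


def exact_match_accuracy(
    tasks: List[Task], solve: Solver, use_ttt: bool, use_poe: bool
) -> float:
    """Fraction of tasks whose predicted grid equals the expected grid exactly."""
    if not tasks:
        raise ValueError("no tasks to evaluate")
    correct = sum(solve(x, use_ttt, use_poe) == y for x, y in tasks)
    return correct / len(tasks)


def ablation_table(
    tasks: List[Task], solve: Solver
) -> Dict[Tuple[bool, bool], float]:
    """Accuracy for every (use_ttt, use_poe) combination."""
    return {
        (ttt, poe): exact_match_accuracy(tasks, solve, ttt, poe)
        for ttt, poe in product([False, True], repeat=2)
    }


def toy_solver(grid: Grid, use_ttt: bool, use_poe: bool) -> Grid:
    """Stand-in solver for illustration: mirrors rows only when use_ttt is set."""
    return [row[::-1] for row in grid] if use_ttt else [row[:] for row in grid]


toy_tasks: List[Task] = [
    ([[1]], [[1]]),
    ([[1, 2]], [[2, 1]]),
]
table = ablation_table(toy_tasks, toy_solver)
```

With a real solver in place of `toy_solver`, near-identical values across the four cells would point to the base model as the source of the score.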
Original abstract
ARC-AGI-2 is a benchmark of human-intuitive visual puzzles that measures a machine's ability to generalize from limited examples, interpret symbolic meaning, and flexibly apply rules in varying contexts. In this paper, we discuss our approach to solving the ARC-AGI-2 puzzles with TinyLM, with additional fine-tuning at test time, including Test-Time-Training (TTT) and Products of Experts (POE). Our model achieves 96.1% accuracy on the training set and 21.7% accuracy on the evaluation set.
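The abstract does not say how the test-time training data are built. One construction known from the test-time-training literature on ARC uses only the demonstration pairs that come with each puzzle: each pair in turn is held out as the target while the remaining pairs serve as context, so no test label is needed. The sketch below builds such examples and nothing more. It is an assumption about one plausible setup, not the authors' procedure, and all names are hypothetical.

```python
from typing import Dict, List, Tuple

Grid = List[List[int]]
Pair = Tuple[Grid, Grid]  # (demonstration input, demonstration output)


def leave_one_out_examples(demos: List[Pair]) -> List[Dict[str, object]]:
    """Turn a puzzle's demonstration pairs into fine-tuning examples.

    Each example holds one pair out as (query, target) and keeps the
    remaining pairs as in-context demonstrations.
    """
    if len(demos) < 2:
        raise ValueError("need at least two demonstration pairs")
    examples: List[Dict[str, object]] = []
    for i, (query, target) in enumerate(demos):
        context = demos[:i] + demos[i + 1:]
        examples.append({"context": context, "query": query, "target": target})
    return examples
```

A puzzle with three demonstration pairs yields three fine-tuning examples, each with two context pairs. The gradient step that would consume these examples is deliberately left out.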
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes using a TinyLM transformer model enhanced with test-time training (TTT) and products of experts (POE) to solve ARC-AGI-2 visual reasoning puzzles. It claims this multi-perspective approach enables interpretation of symbolic meaning and flexible rule application, achieving 96.1% accuracy on the training set and 21.7% accuracy on the evaluation set.
Significance. If the reported accuracies are reproducible and ablations confirm that TTT and POE drive genuine cross-task generalization rather than memorization of the limited ARC task distribution, the result would be significant for few-shot abstract reasoning. ARC-AGI-2 is a demanding benchmark, and a 21.7% evaluation score would exceed many current baselines; however, the absence of any experimental controls, baselines, or statistical details prevents assessing whether the claim holds.
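One of the missing statistical details is the uncertainty around the 21.7% figure. A Wilson score interval is a standard way to express it for a proportion. The sketch below assumes an evaluation set of 120 tasks with 26 solved, chosen only because 26/120 rounds to 21.7%. The page does not state the set size, so both numbers are illustrative assumptions.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("require n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Assumed counts for illustration only: 26 of 120 tasks solved.
low, high = wilson_interval(26, 120)
```

Under these assumed counts the interval spans roughly 15% to 30%, which shows how much a single accuracy figure on a small evaluation set can move.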
major comments (2)
- [Abstract] The central empirical claims (96.1% training accuracy and 21.7% evaluation accuracy) are presented with no description of the experimental protocol, including train/eval split sizes, whether TTT is applied to evaluation tasks (and if so, how without ground-truth labels), model size, training hyperparameters, or any ablation studies isolating TTT or POE. This is load-bearing for the generalization claim, as the high training accuracy is consistent with per-task memorization on the ~400 structured ARC tasks.
- [Abstract] No comparison to baselines (e.g., TinyLM without TTT/POE, or standard ARC solvers) or controls for task overlap between splits is provided. Without such evidence, the 21.7% evaluation figure cannot be interpreted as support for the claim that TTT and POE enable the model to "flexibly interpret symbolic meaning and apply rules in varying contexts" rather than overfit to training puzzle distributions.
minor comments (2)
- [Abstract] The term 'TinyLM' is used without definition, architecture details, or citation to prior work.
- [Abstract] No related-work section or discussion of prior ARC-AGI solvers is present, making it difficult to situate the contribution.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need for greater experimental transparency. We agree that the current abstract is too brief and will revise the manuscript to include the requested details on protocol, ablations, and baselines while preserving the core claims.
- Referee: [Abstract] The central empirical claims (96.1% training accuracy and 21.7% evaluation accuracy) are presented with no description of the experimental protocol, including train/eval split sizes, whether TTT is applied to evaluation tasks (and if so, how without ground-truth labels), model size, training hyperparameters, or any ablation studies isolating TTT or POE. This is load-bearing for the generalization claim, as the high training accuracy is consistent with per-task memorization on the ~400 structured ARC tasks.
  Authors: We accept this criticism. The abstract will be expanded in revision to state the ARC-AGI-2 split sizes (~400 training tasks, held-out evaluation set), TinyLM parameter count, core hyperparameters, and the TTT procedure (self-supervised adaptation on test inputs alone, without ground-truth labels, via reconstruction and consistency objectives). We will also add a dedicated experimental section with ablations that isolate the TTT and POE contributions, directly addressing the memorization concern by showing performance drops when either component is removed.
  Revision: yes
- Referee: [Abstract] No comparison to baselines (e.g., TinyLM without TTT/POE, or standard ARC solvers) or controls for task overlap between splits is provided. Without such evidence, the 21.7% evaluation figure cannot be interpreted as support for the claim that TTT and POE enable the model to "flexibly interpret symbolic meaning and apply rules in varying contexts" rather than overfit to training puzzle distributions.
  Authors: We agree that baselines and overlap controls are required for interpretability. The revised manuscript will include a new results table comparing the full model against (i) TinyLM without TTT or POE and (ii) representative published ARC baselines. We will also document the train/eval split construction to confirm zero task overlap, allowing the 21.7% evaluation accuracy to be read as evidence of cross-task generalization rather than distribution memorization.
  Revision: yes
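The overlap control promised in the rebuttal can be approximated with a fingerprint comparison between splits. The sketch below hashes a canonical JSON form of each task and reports any evaluation task that is byte-for-byte identical to a training task. It assumes tasks are JSON-like dictionaries keyed by task id, and the function names are hypothetical. It detects exact duplicates only, so rotated, recolored, or otherwise near-duplicate tasks would need a stronger canonicalization.

```python
import hashlib
import json
from typing import Dict, List, Tuple


def task_fingerprint(task: dict) -> str:
    """SHA-256 of a canonical JSON encoding, independent of key order."""
    canonical = json.dumps(task, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def overlapping_tasks(
    train: Dict[str, dict], evaluation: Dict[str, dict]
) -> List[Tuple[str, str]]:
    """Return (train_id, eval_id) pairs whose task contents are identical."""
    seen: Dict[str, str] = {}
    for train_id, task in train.items():
        seen.setdefault(task_fingerprint(task), train_id)

    overlaps: List[Tuple[str, str]] = []
    for eval_id, task in evaluation.items():
        fingerprint = task_fingerprint(task)
        if fingerprint in seen:
            overlaps.append((seen[fingerprint], eval_id))
    return overlaps
```

An empty result supports the zero-overlap claim only at the level of exact copies. It says nothing about evaluation tasks that reuse a training rule with different grids.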
Circularity Check
No circularity: empirical performance claims with no derivations or self-referential reductions
The manuscript reports an empirical ML approach (TinyLM + TTT + POE) and two accuracy numbers (96.1% train, 21.7% eval) on ARC-AGI-2. No equations, parameter-fitting steps, uniqueness theorems, or derivation chains appear in the text. The central claims are measured accuracies rather than any quantity that is defined in terms of itself or obtained by renaming a fitted input as a prediction. Self-citations, if present, are irrelevant because no load-bearing mathematical argument exists to reduce to them. The result is therefore self-contained against external benchmarks and receives the default non-circularity score.