Language-Guided Abstraction for Visual Reasoning

Ruping Wang; Xu-Jing Ye; Yuan-Gen Wang

arxiv: 2606.12847 · v1 · pith:PQN76U2Inew · submitted 2026-06-11 · 💻 cs.CV


Pith reviewed 2026-06-27 07:45 UTC · model grok-4.3


keywords abstraction and reasoning · visual reasoning · language-guided learning · privileged information · few-shot generalization · lightweight models · semantic embeddings · ARC benchmark


The pith

A temporary language branch injects structured semantic embeddings into visual training to improve abstract rule learning on ARC tasks in a final 18-million-parameter model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework that adds a language-guided branch only during training to help a visual model capture high-level semantics for few-shot abstraction tasks. This branch compresses language descriptions of tasks into embeddings and aligns them with visual features before the branch is removed at inference. The result is a small model that avoids overfitting to pixel patterns and outperforms prior vision-only methods. A reader would care if the approach shows a practical way to use language priors without keeping large models in the deployed system. This targets the challenge of learning general transformation rules rather than surface patterns from limited examples.

Core claim

The central claim is that guiding a visual model with semantic embeddings derived from language descriptions of transformation rules, aligned through cross-attention, enables better learning of abstract concepts in few-shot settings on the ARC benchmark. The approach structures language annotations into embeddings and transfers their information during training only, producing improved generalization on new tasks while keeping the inference model lightweight at 18 million parameters.

What carries the argument

The language-guided Learning Using Privileged Information (LUPI) branch. It consists of a Semantic Compression Module, which refines language descriptions into structured embeddings, and a Cross-Attention Projector, which aligns those embeddings with visual features during training.
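
To make the alignment step concrete, here is a minimal single-head cross-attention sketch in plain Python. It assumes visual tokens act as queries and semantic embeddings act as keys and values, and it omits learned projection matrices. The paper's actual Cross-Attention Projector may differ in attention direction, number of heads, and dimensions; all names below are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(visual_tokens, semantic_tokens):
    """Each visual token attends over all semantic tokens.

    Assumption: queries come from the visual side, keys and values from
    the language side, with identity projections (no learned weights).
    """
    dim = len(semantic_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    attended = []
    for query in visual_tokens:
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in semantic_tokens]
        weights = softmax(scores)
        # Weighted sum of the semantic tokens, which double as values here.
        mixed = [sum(w * value[i] for w, value in zip(weights, semantic_tokens))
                 for i in range(dim)]
        attended.append(mixed)
    return attended

visual = [[1.0, 0.0], [0.0, 1.0]]      # two toy visual tokens
semantic = [[4.0, 0.0], [0.0, 4.0]]    # two toy semantic embeddings
out = cross_attention(visual, semantic)
```

In this toy case each visual token pulls mostly from the semantic embedding it is most similar to, which is the behavior an alignment module relies on.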

If this is right

The final model requires no language input at test time and stays at 18 million parameters.
Visual reasoning avoids overfitting to low-level patterns by incorporating high-level semantic guidance from language.
The framework connects language-based and vision-only approaches to abstraction tasks.
Ablation experiments confirm separate contributions from the compression step and the alignment projector.
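
The train-only role of the branch can be sketched as a loss with an optional privileged term. This is an illustrative outline, not the authors' code: the alignment loss form (one minus cosine similarity), the default weight, and all function names are assumptions.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def alignment_loss(visual_feature, semantic_embedding):
    """Assumed form: 0 when the vectors point the same way, 2 when opposite."""
    return 1.0 - cosine_similarity(visual_feature, semantic_embedding)

def total_loss(task_loss, visual_feature, semantic_embedding=None, align_weight=0.1):
    """Objective with an optional privileged-information term.

    When semantic_embedding is None (evaluation, or a vision-only baseline),
    only the task loss remains, so no language input is required.
    """
    if semantic_embedding is None:
        return task_loss
    return task_loss + align_weight * alignment_loss(visual_feature, semantic_embedding)

train_value = total_loss(0.8, [1.0, 0.0], [0.0, 1.0], align_weight=0.5)
eval_value = total_loss(0.8, [1.0, 0.0])
```

Because the language term only ever enters through the loss, dropping it leaves the backbone's forward pass and parameter count unchanged.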

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar privileged language information could be collected and applied to other few-shot visual reasoning problems beyond ARC.
If the embeddings encode general rules, the training method might support transfer to abstract domains with novel transformations.
Expanding the collection of structured language descriptions for such tasks could strengthen the guidance effect.

Load-bearing premise

The embeddings from compressed language descriptions must accurately represent abstract transformation rules in a form that transfers to the visual model without introducing distribution shifts or new overfitting.

What would settle it

An experiment that replaces the semantic embeddings with unrelated vectors and finds no loss in the reported performance gains, or that shows the model fails to generalize on a fresh set of ARC tasks with distinct rule structures, would falsify the central claim.
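
One way to build the "unrelated vectors" control is to give each task the embedding of a different task. This keeps the overall embedding distribution intact while breaking the pairing between a task and its rule description. A minimal sketch, assuming embeddings are stored in a dictionary keyed by hypothetical task identifiers:

```python
import random

def mismatched_embeddings(embeddings, seed=0):
    """Return a copy in which every task receives another task's embedding.

    The task order is shuffled and then shifted by one position, which
    guarantees that no task keeps its own embedding (needs two or more tasks).
    """
    task_ids = sorted(embeddings)
    if len(task_ids) < 2:
        raise ValueError("need at least two tasks to mismatch embeddings")
    rng = random.Random(seed)
    order = task_ids[:]
    rng.shuffle(order)
    # order[i] receives the embedding that belonged to order[i + 1].
    return {order[i]: embeddings[order[(i + 1) % len(order)]]
            for i in range(len(order))}

original = {"task_a": [1.0, 0.0], "task_b": [0.0, 1.0], "task_c": [0.5, 0.5]}
control = mismatched_embeddings(original, seed=42)
```

If a model trained on `control` matches one trained on `original`, the gains are more plausibly due to regularization than to rule semantics.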

Figures

Figures reproduced from arXiv: 2606.12847 by Ruping Wang, Xu-Jing Ye, Yuan-Gen Wang.

**Figure 1.** Sample tasks from the ARC training set. Task 0 involves a rule of positional extension, while Task 1 requires filling enclosed regions. The model must infer these hidden rules from the Support Set to solve the Query Set.

Corresponding authors: Yuan-Gen Wang (wangyg@gzhu.edu.cn); Ruping Wang (dxcyjc@163.com).

**Figure 2.** Visualization of the generalization gap. Training tasks typically involve specific core knowledge priors (e.g., object persistence), while test tasks require combining these priors into novel abstract rules. This strict separation necessitates a model capable of meta-learning rather than simple pattern recognition.

**Figure 3.** Overview of the proposed L-VARC framework. The architecture is organized into two branches. Top (Main Backbone): A shared Vision Transformer (ViT) encodes the input grid and predicts the output grid, optimized by the task reconstruction loss. This branch is used for both training and inference. Bottom (Training Branch): A language-guided training branch injects semantic knowledge. Raw descriptions are refi…

**Figure 5.** Illustration of the semantic compression module.

**Figure 4.** Qualitative comparison on unseen test tasks. Case A (Logic Transfer): The training task demonstrates a “crop green object” rule. In the test input, despite the presence of distractors (blue/orange boxes), L-VARC correctly generalizes the rule (crop largest object) to crop the target green box, showing significantly higher voting confidence (48) than the baseline (6). Case B (Complex Filling): L-VARC demons…

**Figure 6.** Sensitivity to the alignment loss weight.

**Figure 7.** Training Dynamics. L-VARC (CAP) demonstrates…

**Figure 9.** Number of successfully solved tasks per cate…

**Figure 8.** Three-run distribution of PASS@1 on ARC-1.
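
The "voting confidence" figures quoted in the Figure 4 caption suggest that predictions are aggregated by vote. Below is a minimal sketch of majority voting over candidate output grids. It assumes each candidate comes from a separate prediction pass (for example, a different augmented view) and that confidence is the winner's vote count; the paper's exact scheme is not described on this page.

```python
from collections import Counter

def vote(candidate_grids):
    """Return the most frequent grid and the number of votes it received."""
    # Grids are lists of lists; convert to nested tuples so they are hashable.
    keys = [tuple(tuple(row) for row in grid) for grid in candidate_grids]
    winner, votes = Counter(keys).most_common(1)[0]
    return [list(row) for row in winner], votes

candidates = [
    [[1, 1], [0, 0]],
    [[1, 1], [0, 0]],
    [[1, 0], [0, 0]],
]
best_grid, confidence = vote(candidates)
```

Under this reading, a higher count means more prediction passes agreed on exactly the same output grid.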

Original abstract

The Abstraction and Reasoning Corpus (ARC) is viewed as a critical avenue to Artificial General Intelligence (AGI), as it enables models to learn abstract transformation rules from few-shot examples and then generalize to new tasks. However, prevalent ARC methodology is either pure language or vision-only (i.e., VARC). The former depends heavily on LLMs, consuming billions of parameters. The latter often struggles to capture high-level semantics, leading to overfitting on pixel-level patterns. To bridge this gap, we propose L-VARC, a novel framework that enhances visual reasoning via a language-guided Learning Using Privileged Information (LUPI) branch. Specifically, we design a Semantic Compression Module by feeding a unified, task-agnostic prompt into DeepSeek-V3. In this way, the raw LARC (a crowd-sourced language description dataset) can be substantially refined and structured, fitting with the context length constraint of standard text encoders (e.g., CLIP). Moreover, we design a Cross-Attention Projector to align visual features with semantic embeddings, aiming to guide the training of the ARC model. Notably, the LUPI branch is taken in the training process and will be discarded during inference, thereby yielding a lightweight model with a mere 18 million parameters. Extensive experiments demonstrate that our L-VARC effectively leverages linguistic priors to boost visual reasoning and outperforms state-of-the-art. Ablation studies further confirm the contribution of the two new designs towards the L-VARC framework. The code is available at https://github.com/GZHU-DVL/L-VARC.
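
The context-length constraint mentioned in the abstract can be illustrated with a rough pre-check. The original CLIP text encoder accepts at most 77 tokens, so long crowd-sourced descriptions must be shortened before encoding. The prompt template below is hypothetical (the authors' actual prompt is not reproduced here), and the word count is only a crude stand-in for a real tokenizer.

```python
CLIP_CONTEXT_LENGTH = 77  # token limit of the original CLIP text encoder

# Hypothetical task-agnostic prompt; not the template used in the paper.
PROMPT_TEMPLATE = (
    "Rewrite the following puzzle description as one short, structured "
    "statement of the transformation rule.\n\nDescription: {description}"
)

def build_prompt(description):
    """Fill the single shared template with one raw task description."""
    return PROMPT_TEMPLATE.format(description=description.strip())

def approx_token_count(text):
    """Crude proxy: whitespace-separated words.

    Subword tokenizers generally emit at least as many tokens as words,
    so treat this as a lower bound on the true token count.
    """
    return len(text.split())

def needs_compression(description, limit=CLIP_CONTEXT_LENGTH):
    return approx_token_count(description) > limit

raw = "fill every enclosed region with the marker color " * 20
short = "Fill enclosed regions with the marker color."
```

Here `needs_compression(raw)` flags the long description, while `short` already fits; a real pipeline would count tokens with the encoder's own tokenizer.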

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

L-VARC adds an LLM compression step and a cross-attention projector in a LUPI branch for ARC, but the abstract shows no metrics and no validation that the embeddings capture the rules.


L-VARC uses language from an LLM to guide a visual model on ARC tasks during training only, then drops that branch so the final model has just 18 million parameters. The abstract says this beats prior work, but it gives no numbers at all.

The new elements are two modules. The Semantic Compression Module takes LARC descriptions and runs them through DeepSeek-V3 with a single prompt, producing structured descriptions short enough for a standard text encoder such as CLIP. The Cross-Attention Projector then aligns the resulting semantic embeddings with visual features in the LUPI branch. This combination has not been tried for ARC before.

The approach makes sense for the problem. Vision-only models often latch onto pixel patterns and overfit, while full language models are too large. Letting language provide privileged information at train time and removing it later keeps things efficient and focused on the visual side.

The soft spot is the missing evidence. The abstract asserts outperformance and that ablations back the two modules, yet there are no metrics, no baseline details, no error bars, and no check on whether the embeddings actually encode the abstract rules like color changes or rotations instead of just rephrasing the input. Without that, it is impossible to know if the projector transfers useful semantics or just acts as extra regularization.

This work is aimed at the ARC community and anyone exploring how to add language priors to small visual reasoners. A reader who wants to see a concrete LUPI implementation for this benchmark would find the architecture worth looking at.

It deserves a serious referee because the idea is grounded in existing techniques but applied in a new way to a key testbed, and the code is public. I recommend sending it to peer review so the experimental claims can be examined properly.

Referee Report

2 major / 1 minor

Summary. The paper proposes L-VARC, a language-guided LUPI framework for ARC visual reasoning. A Semantic Compression Module feeds a unified task-agnostic prompt to DeepSeek-V3 on LARC data to produce structured descriptions that fit the context length of standard text encoders such as CLIP; the resulting semantic embeddings are aligned to visual features via a Cross-Attention Projector during training only. The LUPI branch is discarded at inference, yielding an 18M-parameter model claimed to outperform prior SOTA, with ablations said to confirm the contribution of both modules.

Significance. If the quantitative claims hold with proper controls, the work would show a practical route to injecting linguistic priors into small visual backbones for few-shot abstraction tasks without inference overhead, addressing the parameter bloat of LLM-only ARC methods and the overfitting of pure vision approaches.

major comments (2)

[Abstract] Abstract and experimental section: the central claim of outperforming SOTA with an 18M model via LUPI requires quantitative support (accuracy deltas, baselines, error bars, statistical tests); none are supplied in the abstract and the skeptic note indicates the full experimental section lacks these details, rendering the outperformance assertion unverifiable.
[Method] Semantic Compression Module and Cross-Attention Projector sections: the claim that DeepSeek-V3 embeddings on the unified prompt capture abstract transformation rules (rotation, color mapping) in a task-agnostic manner, rather than surface descriptions, is load-bearing for the LUPI benefit; no embedding-quality validation, no ablation isolating the projector, and no train/test distribution alignment check are described, so the reported gains could arise from regularization alone.
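
As an illustration of the reporting the first major comment asks for, the sketch below summarizes repeated runs with a mean and a sample standard deviation. The scores are made-up placeholders for demonstration only and are not results from the paper.

```python
import statistics

def summarize_runs(scores):
    """Mean and sample standard deviation across independent training runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return statistics.mean(scores), statistics.stdev(scores)

# Placeholder values only; NOT taken from the paper.
placeholder_scores = [0.50, 0.52, 0.54]
mean, spread = summarize_runs(placeholder_scores)
print(f"{mean:.3f} +/- {spread:.3f}")
```

With only a handful of runs the spread estimate is itself noisy, so it is best read alongside the per-run values.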

minor comments (1)

[Abstract] The GitHub link is a positive step for reproducibility; ensure the released code includes the exact prompt template and projector architecture.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the quantitative claims require more explicit support and that additional validation for the modules would strengthen the paper. We address each major comment below and commit to revisions.


Referee: [Abstract] Abstract and experimental section: the central claim of outperforming SOTA with an 18M model via LUPI requires quantitative support (accuracy deltas, baselines, error bars, statistical tests); none are supplied in the abstract and the skeptic note indicates the full experimental section lacks these details, rendering the outperformance assertion unverifiable.

Authors: We agree the abstract should include specific quantitative support and that the experimental section needs accuracy deltas, baselines, error bars, and statistical tests for verifiability. In the revision we will update the abstract with key performance numbers (e.g., accuracy improvements over prior SOTA) and expand the experiments section accordingly, including any required controls to address the skeptic note. revision: yes

Referee: [Method] Semantic Compression Module and Cross-Attention Projector sections: the claim that DeepSeek-V3 embeddings on the unified prompt capture abstract transformation rules (rotation, color mapping) in a task-agnostic manner, rather than surface descriptions, is load-bearing for the LUPI benefit; no embedding-quality validation, no ablation isolating the projector, and no train/test distribution alignment check are described, so the reported gains could arise from regularization alone.

Authors: We acknowledge the need for explicit validation that the embeddings capture abstract rules rather than surface features. In the revision we will add embedding-quality validation (e.g., qualitative examples or similarity analyses), an ablation study isolating the Cross-Attention Projector, and train/test distribution alignment checks to demonstrate the source of the gains. revision: yes
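
One possible form of the similarity analysis mentioned in the response: if embeddings encode transformation rules, tasks that share a rule should sit closer to each other than to tasks with different rules. A minimal sketch with hypothetical rule labels and toy vectors (not data from the paper):

```python
import math
from itertools import combinations

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def rule_separation(embeddings, rule_labels):
    """Mean within-rule cosine similarity minus mean cross-rule similarity.

    A clearly positive value is consistent with embeddings grouping by rule;
    a value near zero suggests they do not separate rules.
    """
    within, across = [], []
    for (i, a), (j, b) in combinations(enumerate(embeddings), 2):
        bucket = within if rule_labels[i] == rule_labels[j] else across
        bucket.append(cosine(a, b))
    if not within or not across:
        raise ValueError("need both same-rule and different-rule pairs")
    return sum(within) / len(within) - sum(across) / len(across)

toy_embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
toy_labels = ["rotation", "rotation", "color_map", "color_map"]
gap = rule_separation(toy_embeddings, toy_labels)
```

A positive gap alone does not prove the embeddings capture abstract rules rather than shared wording, so it would complement, not replace, the promised ablations.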

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external LLM processing and standard training.


The paper introduces L-VARC using DeepSeek-V3 (external) for the Semantic Compression Module on LARC data and a Cross-Attention Projector in a LUPI setup discarded at inference. No equations, self-citations, or fitted inputs are presented as predictions or first-principles results. The method is self-contained against external benchmarks with no reduction of claims to author-defined inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the utility of the crowd-sourced LARC descriptions and the effectiveness of the two new modules; no free parameters or invented entities are introduced beyond standard training choices.

axioms (1)

Domain assumption: Crowd-sourced language descriptions in LARC accurately encode the abstract visual transformation rules.
The framework depends on these descriptions being faithful inputs to the Semantic Compression Module.

