Symbolic Grounding Reveals Representational Bottlenecks in Abstract Visual Reasoning

arXiv: 2604.21346 · v1 · submitted 2026-04-23 · cs.AI · cs.CL · cs.CV

Mohit Vaishnav, Tanel Tammet

Pith reviewed 2026-05-09 22:21 UTC · model grok-4.3

keywords: abstract visual reasoning · Bongard problems · symbolic grounding · representation bottleneck · vision-language models · LOGO programs · diagnostic probe · concept learning

The pith

Switching from raw images to symbolic descriptions from generative programs lets language models solve abstract visual reasoning tasks where vision-language models fail.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether failures on abstract visual reasoning benchmarks like Bongard-LOGO stem from reasoning limits or from how models extract structure from pixels. It compares end-to-end vision-language models on images against language models given symbolic inputs derived from the ground-truth programs that generate those images. Large and consistent accuracy gains with the symbolic versions, reaching the mid-90s on free-form problems, show that the shift away from pixels drives the improvement far more than prompt details or input formatting. This positions symbolic grounding as a diagnostic probe that establishes an upper bound on performance once representation is no longer the obstacle.

Core claim

Reformulating Bongard-LOGO problems in a Componential-Grammatical paradigm as symbolic reasoning tasks based on LOGO-style action programs or structured descriptions allows LLMs to achieve mid-90s accuracy while matched visual baselines stay near chance; ablations confirm that the move to symbolic structure outweighs changes in input format, explicit concept prompts, or minimal visual grounding.

What carries the argument

Symbolic inputs derived from ground-truth generative programs, used as a controlled diagnostic probe in the Componential-Grammatical paradigm to separate representation from reasoning.

If this is right

Representation quality, rather than reasoning capacity, limits current vision-language models on abstract concept learning.
Symbolic descriptions can function as an upper-bound diagnostic for evaluating how close visual models come to ideal performance.
Explicit concept prompts and minor changes in input format matter less than the presence of explicit symbolic structure.
Ablation results imply that efforts to improve pixel-to-concept mapping will yield larger gains than further scaling of visual reasoning alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Systems that automatically convert images into compact symbolic descriptions could bridge the gap without requiring full end-to-end visual learning.
The same diagnostic approach could be applied to other multimodal benchmarks to test whether representation bottlenecks appear outside of LOGO-style tasks.
If models learn to generate their own symbolic programs from images, the performance gap observed here would shrink without external symbolic oracles.

Load-bearing premise

The symbolic inputs derived from the ground-truth generative programs capture exactly the information needed for the task without introducing artifacts or advantages unavailable to the visual models under matched definitions.

What would settle it

A vision-language model achieving mid-90s accuracy on the original pixel-based Bongard-LOGO tasks when given identical task definitions and training would falsify the claim that representation is the primary bottleneck.

Figures

Figures reproduced from arXiv: 2604.21346 by Mohit Vaishnav, Tanel Tammet.

**Figure 1.** Example Bongard-LOGO instance under the Componential–Grammatical interface. From left to right we show the rendered shape, its action-program (AP) representation, and its action-description (AD) rendering. In the concept-conditioned variant, the prompt additionally provides the high-level concept label, allowing us to compare visual, procedural, and natural-language interfaces for the same underlying exa…

**Figure 2.** Comparison of visual and symbolic pipelines on Bongard-LOGO. In the …

**Figure 3.** Per-model percentage-point change under symbolic interventions, aggregated from Table …

Original abstract

Vision-language models (VLMs) often fail on abstract visual reasoning benchmarks such as Bongard problems, raising the question of whether the main bottleneck lies in reasoning or representation. We study this on Bongard-LOGO, a synthetic benchmark of abstract concept learning with ground-truth generative programs, by comparing end-to-end VLMs on raw images with large language models (LLMs) given symbolic inputs derived from those images. Using symbolic inputs as a diagnostic probe rather than a practical multimodal architecture, our Componential-Grammatical (C-G) paradigm reformulates Bongard-LOGO as a symbolic reasoning task based on LOGO-style action programs or structured descriptions. LLMs achieve large and consistent gains, reaching mid-90s accuracy on Free-form problems, while a strong visual baseline remains near chance under matched task definitions. Ablations on input format, explicit concept prompts, and minimal visual grounding show that these factors matter much less than the shift from pixels to symbolic structure. These results identify representation as a key bottleneck in abstract visual reasoning and show how symbolic input can serve as a controlled diagnostic upper bound.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper shows a clear performance gap on Bongard-LOGO between VLMs on pixels and LLMs on symbolic inputs from ground-truth programs, but the symbolic side may embed abstractions that tilt the comparison.

The main thing to know is that VLMs stay near chance on raw images while LLMs hit the mid-90s once given symbolic descriptions or LOGO-style programs derived from the same generative source. The authors treat the symbolic route as a diagnostic probe rather than a deployed system, and they keep the task definitions matched across the two tracks. That produces a measurable upper bound on what current visual encoders are missing for abstract reasoning tasks.

Referee Report

2 major / 2 minor

Summary. The paper claims that representation, not reasoning, is the primary bottleneck for vision-language models on abstract visual reasoning tasks such as Bongard-LOGO. It supports this by showing that LLMs given symbolic inputs (LOGO-style action programs or structured descriptions derived from ground-truth generative programs via the Componential-Grammatical paradigm) reach mid-90s accuracy on free-form problems, while strong visual baselines remain near chance under matched task definitions. Ablations on input format, explicit concept prompts, and minimal visual grounding indicate that the shift to symbolic structure drives the gains, positioning symbolic inputs as a controlled diagnostic upper bound.

Significance. If the symbolic inputs are informationally equivalent to what a perfect visual parser could recover, the results usefully isolate representation as a load-bearing limitation in current VLMs and provide a reproducible diagnostic method for future work on abstraction. The empirical comparison on an external benchmark with ground-truth programs is a strength, but the diagnostic interpretation depends on the fairness of the symbolic encoding.

major comments (2)

[Abstract and §3] C-G paradigm definition: the central claim that the performance gap isolates representation as the sole bottleneck requires that the symbolic inputs contain exactly the information recoverable by a perfect visual parser (low-level primitives without explicit concept encodings or rule-facilitating structure). If the ground-truth-derived LOGO programs or structured descriptions directly encode higher-level attributes such as shape types, relations, or drawing sequences that reveal the Bongard rule, the LLM gains reflect access to pre-abstracted structure rather than a fair test of visual representation quality.
[Experiments] Ablations on input format and minimal visual grounding: these ablations do not address whether the symbolic representations introduce artifacts or advantages (e.g., explicit relational encodings) unavailable to visual models under matched task definitions, leaving the 'controlled diagnostic upper bound' interpretation vulnerable.

minor comments (2)

[Methods and Results] Report error bars, exact data splits, and full task-definition matching details to allow verification that visual and symbolic conditions are perfectly aligned.
[Figures and Notation] Clarify how 'minimal visual grounding' is operationalized, and ensure all acronyms (e.g., C-G) are expanded on first use.
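
The request for error bars can be made concrete with a standard interval on accuracy. The following is a minimal sketch, not the authors' evaluation code: it computes a Wilson score interval for a binomial proportion and checks whether chance level falls inside it. The function names and the 0.5 chance level (assuming a balanced positive/negative classification of the test image) are assumptions made for illustration.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half, centre + half


def consistent_with_chance(correct: int, total: int, chance: float = 0.5) -> bool:
    """True when the assumed chance level lies inside the Wilson interval."""
    low, high = wilson_interval(correct, total)
    return low <= chance <= high
```

With hypothetical counts (not taken from the paper), `consistent_with_chance(50, 100)` is true while `consistent_with_chance(95, 100)` is false. Reporting the interval alongside each accuracy would let a reader judge whether "near chance" and "mid-90s" are separated by more than sampling noise.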

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and detailed comments, which highlight important nuances in interpreting our diagnostic results. We address each major comment point by point below, providing clarifications and indicating where revisions will be made to strengthen the manuscript.

Referee: [Abstract and §3] C-G paradigm definition: the central claim that the performance gap isolates representation as the sole bottleneck requires that the symbolic inputs contain exactly the information recoverable by a perfect visual parser (low-level primitives without explicit concept encodings or rule-facilitating structure). If the ground-truth-derived LOGO programs or structured descriptions directly encode higher-level attributes such as shape types, relations, or drawing sequences that reveal the Bongard rule, the LLM gains reflect access to pre-abstracted structure rather than a fair test of visual representation quality.

Authors: We agree that the validity of isolating representation as the bottleneck hinges on the symbolic inputs being informationally equivalent to the output of a perfect visual parser. In the C-G paradigm, the LOGO-style action programs are sequences of low-level primitives (e.g., FORWARD, TURN, PENUP, PENDOWN) that exactly match the ground-truth generative process used to create the images; they contain no explicit shape-type labels, abstract relations, or the Bongard rule itself. The LLM must still perform the abstraction step by comparing the programs across positive and negative sets. Structured descriptions are direct compositional renderings of these same primitives without added rule-facilitating structure. To make this equivalence explicit, we will revise §3 to include a precise enumeration of the allowed primitives and an explicit statement that no higher-level encodings are introduced beyond the generative program. revision: yes

Referee: [Experiments] Ablations on input format and minimal visual grounding: these ablations do not address whether the symbolic representations introduce artifacts or advantages (e.g., explicit relational encodings) unavailable to visual models under matched task definitions, leaving the 'controlled diagnostic upper bound' interpretation vulnerable.

Authors: The existing ablations already vary input format (raw action sequences vs. structured descriptions) and test minimal visual grounding, showing that performance remains high as long as symbolic structure is present. However, we acknowledge that a more direct comparison of information content is warranted to rule out unintended advantages. We will add a dedicated paragraph in the Experiments section (and a corresponding note in the discussion) that (i) enumerates the exact information present in the symbolic inputs, (ii) confirms the absence of pre-encoded relations or concepts, and (iii) contrasts this with the pixel-level information available to visual models under identical task definitions. This addition will reinforce that the symbolic inputs serve as a controlled upper bound without introducing extraneous artifacts. revision: yes
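
To illustrate what "sequences of low-level primitives" could look like in practice, here is a minimal sketch of a turtle-style interpreter. Only the primitive names (FORWARD, TURN, PENUP, PENDOWN) come from the rebuttal above; the `(op, arg)` tuple encoding, the use of degrees, the start pose, and the `trace` function are assumptions for illustration and not the benchmark's actual serialization.

```python
import math


def trace(program):
    """Run (op, arg) steps and return the drawn segments as ((x0, y0), (x1, y1))."""
    x, y, heading, pen_down = 0.0, 0.0, 0.0, True  # assumed start pose
    segments = []
    for op, arg in program:
        if op == "FORWARD":
            nx = x + arg * math.cos(math.radians(heading))
            ny = y + arg * math.sin(math.radians(heading))
            if pen_down:
                segments.append(((x, y), (nx, ny)))
            x, y = nx, ny
        elif op == "TURN":
            heading = (heading + arg) % 360.0  # counter-clockwise, in degrees
        elif op == "PENUP":
            pen_down = False
        elif op == "PENDOWN":
            pen_down = True
        else:
            raise ValueError(f"unknown primitive: {op}")
    return segments


# A hypothetical program: four equal strokes with 90-degree turns.
square = [("FORWARD", 1.0), ("TURN", 90.0)] * 4
```

Note that nothing in `square` says "square" or "closed shape"; that property only emerges when the steps are executed or reasoned about. This is the sense in which the authors argue the abstraction step is left to the model, and also why the referee's question about what else the serialization encodes matters.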

Circularity Check

0 steps flagged

No circularity: empirical comparison with external ground-truth programs

The paper conducts an empirical evaluation of VLMs on raw images versus LLMs on symbolic representations derived from the benchmark's provided ground-truth generative programs. No mathematical derivations, fitted parameters, or predictions are presented that reduce to the result by construction. The central claim rests on measured accuracy differences under matched task definitions, with ablations on input formats. This is a standard experimental diagnostic setup using an external benchmark; no self-definitional steps, self-citation load-bearing premises, or renamings of known results appear in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that symbolic conversions preserve task-relevant structure without adding information unavailable to visual models; no free parameters or invented entities are described in the abstract.

axioms (1)

domain assumption Symbolic descriptions derived from ground-truth LOGO programs contain exactly the information required for the abstract concept learning task.
Invoked when treating symbolic inputs as an upper-bound diagnostic; if the conversion loses or adds structure, the performance gap cannot be attributed solely to representation.

pith-pipeline@v0.9.0 · 5498 in / 1173 out tokens · 37581 ms · 2026-05-09T22:21:40.165547+00:00 · methodology

Reference graph

Works this paper leans on

17 extracted references · 3 canonical work pages · 1 internal anchor

[1]

Mikhail M Bongard

Nsa: Neuro-symbolic arc challenge. arXiv preprint arXiv:2501.04424. Mikhail M Bongard. 1970. Pattern Recognition. Hayden Book Company, Rochelle Park, NJ. Declan Campbell, Sunayana Rane, Tyler Giallanza, C Nicolò De Sabbata, Guillermo Ortiz-Jiménez, Pascal Frossard, and Olga Russakovsky. 2024. Understanding the limits of vision language models through ...

[2]

In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20406–20417

Tifa: Accurate and interpretable text-to-image faithfulness evaluation with question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20406–20417. Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling up visual and vision-language...

[3]

How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks

Causal graphical models for vision-language compositional understanding. In International Conference on Learning Representations (ICLR). Wujian Peng, Sicheng Xie, Zuyao You, Shiyi Lan, and Zuxuan Wu. 2024. Synthesize, diagnose, and optimize: Towards fine-grained vision-language understanding. In Proceedings of the IEEE/CVF Conference on Computer Vis...

[4]

**Describe the visual characteristics of *each* category (cat_1 and cat_2) in detail.** Focus on shapes, lines, orientations, spatial relationships, and quantities

**Analyze Positive and Negative Images:**
* Carefully compare the positive (cat_2) and negative (cat_1) images.
* **Describe the visual characteristics of *each* category (cat_1 and cat_2) in detail.** Focus on shapes, lines, orientations, spatial relationships, and quantities. Break down complex shapes into simpler components. Describe shapes as if e...
[5]

**Analyze the Test Image:**
* Describe the test image's visual features in detail, using the same level of detail as in step 1.
[6]

cat_2 (Positive)

**Classify the Test Image:**
* Based on your analysis of the positive, negative, and test images, determine whether the test image belongs to cat_2 (positive) or cat_1 (negative).
* **Provide a clear justification for your classification.** Explain *why* the test image's features align more closely with the positive examples or the negative examples, ...
[10]

Analysis

**Crucially, your reasoning must focus on the final geometric properties of the shapes, not the specific steps used to construct them.**

Here is the overall concept behind the positive samples: {concept}

**Your Task and Required Output:**
Respond **only** with a single JSON object. Do not include any text or explanations outside of the JSON st...
[11]

action descriptions

Analyze the provided "action descriptions" for the positive, negative, and test sets
[12]

Infer the final geometric properties of the shapes from these descriptions
[13]

Identify the abstract rule that defines the positive set
[14]

Analysis

**Crucially, your reasoning must focus on the final geometric properties of the shapes, not the specific steps used to construct them.**

**Your Task and Required Output:**
Respond **only** with a single JSON object. Do not include any text or explanations outside of the JSON structure.

The JSON object must have the following keys:
{ "Analy...
[15]

`set of base actions`: A list of primitives like `line_LENGTH` or `arc_RADIUS_ANGLE`

**Stage 1: Base Shape Definition:** Shapes are first defined with basic geometry using degrees for angles.
* `set of base actions`: A list of primitives like `line_LENGTH` or `arc_RADIUS_ANGLE`.
* `turn angles`: A sequence of turns like `L90.0--R45.0...`, specified in degrees
[16]

zigzag" or

**Stage 2: Object Creation (`BasicAction`):** The system reads the base geometry and creates programmable `BasicAction` objects. At this stage, a visual stroke style (`line_type` like "zigzag" or "triangle") is added
[17]

All values are **normalized** into a [0, 1] range

**Stage 3: Final Serialization (The Input You See):** The `BasicAction` objects are serialized into the final action program strings. All values are **normalized** into a [0, 1] range.
* **Line String:** `line_TYPE_LENGTH-TURNANGLE`
* **Arc String:** `arc_TYPE_ARCANGLE_ARCRADIUS-TURNANGLE`
* **Program Structure Hierarchy:** The final action prog...
[18]

**Positive Samples (`pos`):** Action programs that all conform to a common abstract rule
[19]

**Negative Samples (`neg`):** Action programs that do not conform to this rule
[20]

Analysis

**Test Sample:** An action program to be categorized, accompanied by its rendered image to provide visual context.
* **Crucial Interpretive Note:** While the action programs detail *how* a figure is constructed, your task is to reason about the final, abstract geometric properties of the resulting figure. The construction method is just one of severa...
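
The serialized line and arc strings quoted in the Stage 3 excerpt above can be read mechanically. The following is a minimal sketch of a parser, assuming the field order exactly as quoted (`line_TYPE_LENGTH-TURNANGLE` and `arc_TYPE_ARCANGLE_ARCRADIUS-TURNANGLE`), that the TYPE field contains no underscore, and that every numeric value lies in [0, 1]. The `Action` class, the `parse_action` function, and the example strings are hypothetical and not taken from the paper or the benchmark code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    kind: str      # "line" or "arc"
    stroke: str    # TYPE field, e.g. "zigzag" or "triangle"
    params: tuple  # (length,) for lines; (arc_angle, arc_radius) for arcs
    turn: float    # normalized turn angle


# Number of underscore-separated fields expected before the "-TURNANGLE" suffix.
_EXPECTED_FIELDS = {"line": 3, "arc": 4}


def parse_action(text: str) -> Action:
    """Parse one serialized action string under the assumptions stated above."""
    body, sep, turn_text = text.rpartition("-")
    if not sep:
        raise ValueError(f"missing turn angle in {text!r}")
    fields = body.split("_")
    kind = fields[0]
    if kind not in _EXPECTED_FIELDS or len(fields) != _EXPECTED_FIELDS[kind]:
        raise ValueError(f"unrecognized action string: {text!r}")
    params = tuple(float(v) for v in fields[2:])
    turn = float(turn_text)
    for value in params + (turn,):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"value {value} outside the normalized [0, 1] range")
    return Action(kind, fields[1], params, turn)
```

For example, `parse_action("line_zigzag_0.500-0.250")` yields a line action with stroke `"zigzag"`, length 0.5, and turn 0.25. A listing like this makes it easy to enumerate exactly which fields reach the language model, which is the information-content audit the referee asks for.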
