CREG: Compass Relational Evidence Graph for Characterizing Directional Structure in VLM Spatial-Reasoning Attribution

Heqing Du; Kaizhen Tan; Yang Feng

arxiv: 2603.20475 · v3 · submitted 2026-03-20 · 💻 cs.CV

CREG: Compass Relational Evidence Graph for Characterizing Directional Structure in VLM Spatial-Reasoning Attribution

Kaizhen Tan , Yang Feng , Heqing Du This is my paper

Pith reviewed 2026-05-15 07:55 UTC · model grok-4.3

classification 💻 cs.CV

keywords attributionvision-language modelsspatial reasoningdirectional alignmentCREGdiagnostic frameworkbenchmark

0 comments

The pith

Box-only geometry achieves over 30 degrees lower directional alignment error than current VLM attribution methods on spatial tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CREG as a training-free way to turn token attribution scores into a directional distribution centered on a reference object and then measure how well that distribution matches the queried spatial relation. It shows that simple bounding-box geometry consistently produces tighter directional alignment than any of the model-based attribution techniques tested. This matters because it demonstrates that higher accuracy on spatial reasoning questions does not automatically mean the model is recovering evidence organized around the actual direction. The work supplies a shared metric and a set of diagnostic interventions to separate genuine directional structure from layout biases in the image.

Core claim

CREG converts token-level attribution into a reference-centered compass distribution and computes its Direction Alignment Error against the true spatial relation. On three spatial-relation benchmarks, box-only geometry records more than 30 degrees lower error than existing attribution methods, indicating that the directional evidence recoverable from current techniques remains limited and frequently entangled with image layout. Task accuracy gains from LoRA fine-tuning or newer model versions do not reliably reduce this error, showing that improved answers can occur without corresponding improvement in directional attribution structure.

What carries the argument

CREG (Compass Relational Evidence Graph), which aggregates attribution into a directional histogram relative to the reference object and quantifies alignment via Direction Alignment Error.

If this is right

Attribution heatmaps from current methods mix directional evidence with image layout biases.
Higher task accuracy in VLMs does not guarantee better directional attribution.
Small-scale LoRA training and newer model generations can raise accuracy without lowering Direction Alignment Error.
CREG supplies a controlled protocol for checking whether spatial-reasoning gains are accompanied by more organized directional evidence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The observed gap implies that existing attribution techniques may be unsuitable for explaining spatial decisions inside VLMs.
New attribution algorithms could be designed to explicitly favor directional organization around the reference object.
The same compass readout could be applied to other relational reasoning tasks to test whether layout bias is a broader problem.

Load-bearing premise

The reference-centered compass distribution and Direction Alignment Error metric isolate directional evidence structure from image layout biases without introducing their own artifacts.

What would settle it

An attribution method that achieves Direction Alignment Error equal to or lower than box-only geometry while preserving or improving task accuracy would falsify the claim that current attribution methods are inherently limited in recovering directional structure.

read the original abstract

Standard attribution heatmaps show where a vision-language model (VLM) focuses, but they do not reveal whether the recovered evidence is organized by the queried spatial relation or merely reflects image layout. To address this problem, we introduce CREG (Compass Relational Evidence Graph), a training-free diagnostic framework that converts token-level attribution into a reference-centered compass distribution and measures its directional alignment. CREG provides a shared directional readout across attribution methods and makes comparison with geometric controls explicit. Across three spatial-relation benchmarks, box-only geometry achieves Direction Alignment Error more than 30 degrees lower than current model-based attribution methods, leaving a substantial gap between attribution structure and simple target localization. To examine this gap, we apply a diagnostic battery including target intervention, reference-center randomization, and variance partition. Taken together, the results suggest that the directional structure recoverable from current attribution methods is limited and often mixed with image layout. We further find that higher task accuracy does not reliably coincide with better directional attribution: small-scale LoRA training and newer model generations can improve task accuracy while leaving Direction Alignment Error unchanged or worse. These findings characterize what current attribution methods reveal rather than the model's internal spatial representation. CREG provides a controlled protocol for testing whether improvements in spatial reasoning are accompanied by more directionally organized evidence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CREG turns VLM attributions into a compass readout for spatial direction and shows current methods lag simple geometry by over 30 degrees, with controls that mostly hold up.

read the letter

The main takeaway is that CREG converts token attributions into a reference-centered compass distribution and measures directional alignment error, revealing that box-only geometry beats model-based methods by a wide margin on three spatial benchmarks. The framework also includes a diagnostic battery of target intervention, reference randomization, and variance checks to separate real directional signal from layout noise. This is useful because it gives a shared, geometry-grounded way to evaluate different attribution techniques without training anything new. The finding that task accuracy often fails to track better directional alignment is a clear and practical distinction. The soft spots are limited. The abstract skips the exact formulas for compass construction and error calculation, so the full paper needs to spell those out for easy reproduction. The gap looks real from the controls described, but more detail on the exact models, attribution methods, and how many images per benchmark would strengthen the case that the result generalizes. No circularity or obvious fitting artifacts show up in the protocol. This is for people working on VLM interpretability for robotics or navigation who want a concrete test for whether explanations actually reflect spatial structure. It deserves peer review because the protocol is straightforward to apply and the gap it reports is large enough to matter for downstream trust.

Referee Report

2 major / 2 minor

Summary. The paper introduces CREG, a training-free diagnostic framework that converts token-level attribution maps from vision-language models into reference-centered compass distributions and quantifies directional alignment via a Direction Alignment Error (DAE) metric. Across three spatial-relation benchmarks, it reports that simple box-only geometric controls achieve more than 30° lower DAE than current model-based attribution methods. A diagnostic battery (target intervention, reference-center randomization, variance partition) is used to show that recovered directional structure is limited and often entangled with image layout biases, and that task accuracy improvements (e.g., via LoRA or newer models) do not reliably reduce DAE.

Significance. If the quantitative gap and diagnostic results hold under explicit verification, the work supplies a controlled, shared readout for assessing whether attribution methods capture directional evidence structure rather than layout artifacts. This directly addresses a gap in VLM interpretability for spatial reasoning and provides a falsifiable protocol for testing whether accuracy gains reflect improved evidence organization.

major comments (2)

[Methods] Methods section on compass construction: the reference-centered compass distribution and Direction Alignment Error metric are described at a conceptual level but without explicit equations, binning procedure, or pseudocode for converting attributions to angular distributions; this is load-bearing for reproducing the central >30° DAE gap claim and confirming the metric isolates directional structure from layout biases.
[Results] Results, diagnostic battery subsection: the variance partition and reference-center randomization outcomes are summarized qualitatively but lack the specific numerical breakdowns (e.g., percentage of variance attributed to directional vs. layout components) needed to evaluate the strength of the claim that attributions are 'mixed with image layout.'

minor comments (2)

[Abstract] Abstract: state the exact number of benchmarks, models, and attribution methods evaluated to allow immediate assessment of scope.
[Figures] Figure captions: ensure all compass visualizations include explicit scale bars for angle bins and reference-center markers for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate the requested clarifications for improved reproducibility and quantitative rigor.

read point-by-point responses

Referee: [Methods] Methods section on compass construction: the reference-centered compass distribution and Direction Alignment Error metric are described at a conceptual level but without explicit equations, binning procedure, or pseudocode for converting attributions to angular distributions; this is load-bearing for reproducing the central >30° DAE gap claim and confirming the metric isolates directional structure from layout biases.

Authors: We agree that the Methods section requires more explicit detail. In the revised manuscript we will add the full equations defining the reference-centered compass distribution, specify the binning procedure (including angular resolution and aggregation of attribution weights), provide the exact formula for Direction Alignment Error as a circular statistic, and include pseudocode for the attribution-to-distribution pipeline. These additions will enable direct reproduction of the reported >30° DAE gap and confirm isolation from layout biases. revision: yes
Referee: [Results] Results, diagnostic battery subsection: the variance partition and reference-center randomization outcomes are summarized qualitatively but lack the specific numerical breakdowns (e.g., percentage of variance attributed to directional vs. layout components) needed to evaluate the strength of the claim that attributions are 'mixed with image layout.'

Authors: We acknowledge that the diagnostic results are currently presented only qualitatively. We will expand the Results section to report the precise numerical outcomes, including the percentage of variance attributed to directional versus layout components from the variance partition, and quantitative DAE shifts (with means and confidence intervals) from the reference-center randomization experiments. These numbers will strengthen the evidence for layout entanglement. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper defines CREG explicitly as a conversion of token attributions into a reference-centered compass distribution followed by a geometric Direction Alignment Error (DAE) computation against ground-truth spatial relations. This DAE is then applied uniformly to both model-based attribution maps and an independent box-only geometric baseline. The reported >30° DAE gap is therefore an empirical comparison between two distinct inputs (attribution heatmaps vs. bounding-box centers), not a quantity derived from the metric by construction. No equations reduce the target result to fitted parameters, self-citations, or ansatzes imported from prior author work. The diagnostic battery (target intervention, reference-center randomization, variance partition) further operates as external controls on the same geometric readout. The framework is therefore self-contained against external benchmarks and contains no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review identifies no explicit free parameters, axioms, or invented entities; CREG is introduced as a training-free conversion and measurement protocol without additional postulated constructs.

pith-pipeline@v0.9.0 · 5534 in / 1070 out tokens · 66506 ms · 2026-05-15T07:55:55.637452+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

We then divide the angular space into K sectors, with default K=8, and aggregate attribution mass within each sector

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.