pith. machine review for the scientific record.

arxiv: 2605.06043 · v1 · submitted 2026-05-07 · 💻 cs.CV

Recognition: unknown

Domain Generalization through Spatial Relation Induction over Visual Primitives

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 14:10 UTC · model grok-4.3

classification 💻 cs.CV
keywords domain generalization · visual primitives · spatial relations · soft predicates · concept bottleneck · relational composition · image classification

The pith

Explicit spatial relations over visual primitives strengthen domain generalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to show that domain generalization can be enhanced by explicitly factoring visual recognition into visual primitives and their spatial relational compositions instead of leaving structure implicit. This is done by using soft predicates to score alignments among primitives detected via a concept bottleneck layer. A sympathetic reader would care if this leads to more stable performance when domains shift in compositionally relevant ways, such as changes in part arrangements. If correct, the method would provide a way to learn invariant structures end-to-end without solely depending on training tricks like augmentation.

Core claim

PARSE represents images through visual primitives located via heatmaps and evaluates their spatial relations using differentiable soft binary, ternary, and quaternary predicates. These relations are scored in a structural layer, and class probabilities are computed from the joint evidence of class-specific compositions. This explicit modeling improves accuracy on domain generalization benchmarks.
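The review doesn't spell out how heatmaps become differentiable coordinates; a minimal sketch of one standard construction (soft-argmax, in the spirit of the numerical coordinate regression work the paper cites) shows how a primitive's location can remain a smooth function of the heatmap. The `beta` sharpness knob and the toy heatmap are illustrative assumptions, not values from the paper.

```python
import numpy as np

def soft_argmax(heatmap, beta=10.0):
    """Differentiable (x, y) coordinate from a primitive heatmap.

    A softmax over all pixels gives a probability map; the expected
    (x, y) under that map is smooth in the heatmap values, so gradients
    from coordinate-based losses can flow back into the bottleneck.
    beta (hypothetical knob) sharpens the softmax.
    """
    h, w = heatmap.shape
    probs = np.exp(beta * (heatmap - heatmap.max()))  # stable softmax numerator
    probs /= probs.sum()
    ys, xs = np.mgrid[0:h, 0:w]                       # per-pixel coordinates
    return float((probs * xs).sum()), float((probs * ys).sum())

# A heatmap with a single sharp peak recovers its location.
hm = np.zeros((5, 5))
hm[1, 3] = 5.0
x, y = soft_argmax(hm)
```

With a sharp, unimodal heatmap the expectation collapses onto the peak; with a diffuse one it interpolates, which is exactly the property that makes end-to-end training possible.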

What carries the argument

Soft predicates of different arities applied to primitive spatial coordinates to create differentiable spatial alignment measures.
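As a sketch of what such predicates could look like (the paper's exact functional forms aren't given here, and the primitive names are hypothetical), a soft "left-of" relation can be built from a sigmoid over a coordinate difference, and a higher-arity predicate composed from lower-arity scores:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def left_of(p, q, tau=1.0):
    # Soft binary predicate in (0, 1): how strongly point p lies left of q.
    # tau controls how sharply the score saturates; differentiable in both points.
    return sigmoid((q[0] - p[0]) / tau)

def between_x(p, q, r, tau=1.0):
    # Soft ternary predicate: q lies horizontally between p and r,
    # composed here as a product of two binary scores.
    return left_of(p, q, tau) * left_of(q, r, tau)

# Hypothetical (x, y) primitive coordinates for a bird-part layout.
beak, eye, tail = (1.0, 2.0), (3.0, 2.0), (8.0, 2.5)
score = between_x(beak, eye, tail)
```

A class score could then pool many such relation scores; the composition rule here (a plain product) is one simple choice, not necessarily the paper's.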

If this is right

  • The method achieves over 4.5 percentage point accuracy gains on the CUB-DG benchmark.
  • It remains competitive with existing methods on the DomainBed suite.
  • Primitives and relations are learned jointly through the end-to-end architecture.
  • Decisions rely on evidence from multiple relational compositions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This could extend to other vision tasks where spatial structure is key, like scene understanding.
  • The approach might offer more interpretable decisions by highlighting active relations.
  • A natural next step is testing the method on benchmarks designed to vary relational structure deliberately.

Load-bearing premise

The learned visual primitives and their spatial relations will correspond to domain-invariant features that support reliable classification across domains.

What would settle it

Ablating the structural scoring layer on CUB-DG and DomainBed, then checking whether the accuracy advantage over baselines disappears.

Figures

Figures reproduced from arXiv: 2605.06043 by Dat Nguyen, Duc-Duy Nguyen.

Figure 1: The overall pipeline of PARSE: primitives and their spatial relations, learned end-to-end from image-level supervision.
Figure 2: Error analysis on 2 correct and 2 false classification samples of the target (Photo) domain.
Figure 3: Primitive 0 visualization over different …
Original abstract

Domain generalization requires identifying stable representations that support reliable classification across domains. Most existing methods seek such stability through improving the training process, for example, through model selection strategies, data augmentation, or feature-alignment objectives. Although these strategies can be effective, they leave the representation learning of structural composition implicit, which may limit performance on compositional domain generalization benchmarks. In this work, we propose Primitive-Aware Relational Structure for domain gEneralization (PARSE), an image classification framework that factors visual recognition into visual primitives and their relational composition. We represent these compositions using soft binary, ternary, and quaternary predicates over primitive locations, yielding differentiable measures of spatial alignment that can be learned end-to-end. To learn primitives and relational structures jointly, we design an end-to-end architecture with three components: (1) a convolutional neural network (CNN) backbone that extracts general visual features, (2) a concept bottleneck layer that maps these features to primitive heatmaps with differentiable spatial coordinates, and (3) a structural scoring layer that evaluates candidate spatial relations among the detected primitives. We then compute class probability from the joint evidence of its class-specific relational compositions. Across CUB-DG and the DomainBed benchmark suite, PARSE improves accuracy by over 4.5 percentage points on CUB-DG and remains competitive with existing DG methods on DomainBed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes PARSE, an end-to-end image classification architecture for domain generalization that decomposes recognition into (1) a CNN backbone, (2) a concept bottleneck producing primitive heatmaps with differentiable coordinates, and (3) a structural scoring layer that computes class-specific soft binary/ternary/quaternary spatial predicates over those primitives. Class probabilities are derived from the joint evidence of these relational compositions. Empirical claims include a >4.5 percentage point accuracy gain on CUB-DG and competitive performance versus existing DG methods on the DomainBed suite.

Significance. If the primitives and induced relations can be shown to be domain-invariant, the approach offers a structured, interpretable alternative to implicit alignment or augmentation strategies and could advance compositional domain generalization. The end-to-end differentiability of the predicate scoring is a technical strength, but the significance is limited by the absence of direct evidence that the bottleneck discovers stable geometric structure rather than domain-sensitive appearance cues.

major comments (2)
  1. [Architecture (concept bottleneck + structural scoring)] The architecture description (concept bottleneck and structural scoring layer): no invariance loss, part-level supervision, or cross-domain consistency regularizer is applied to the primitive heatmaps. Without such a mechanism, the soft predicates may simply compose domain-sensitive detectors, directly undermining the claim that relational induction produces domain-invariant features responsible for the reported gains.
  2. [Experiments (CUB-DG results)] Experimental section on CUB-DG: the headline >4.5 pp improvement is presented without reported ablations that isolate the contribution of the relational scoring layer (e.g., removing predicates while keeping the bottleneck and capacity), without statistical significance across multiple runs, and without controls for extra parameters introduced by the predicate heads. This makes it impossible to attribute the gain to spatial relation induction rather than incidental regularization or capacity.
minor comments (1)
  1. [Abstract] The acronym expansion in the abstract contains an inconsistent capitalization ('gEneralization'); this should be corrected for consistency.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major comment point by point below and describe the revisions we will incorporate.

read point-by-point responses
  1. Referee: The architecture description (concept bottleneck and structural scoring layer): no invariance loss, part-level supervision, or cross-domain consistency regularizer is applied to the primitive heatmaps. Without such a mechanism, the soft predicates may simply compose domain-sensitive detectors, directly undermining the claim that relational induction produces domain-invariant features responsible for the reported gains.

    Authors: We acknowledge the absence of an explicit invariance loss or cross-domain regularizer on the primitive heatmaps. However, because the predicates operate exclusively on differentiable spatial coordinates rather than appearance features, the relational scoring layer imposes a geometric inductive bias. End-to-end optimization against the classification objective therefore favors primitives whose locations support consistent relational evidence across domains; appearance-specific cues that fail to align spatially cannot contribute reliably to the class scores. We will add a clarifying paragraph in Section 3.3 of the revised manuscript that explicitly articulates this mechanism and its connection to domain invariance. revision: partial

  2. Referee: Experimental section on CUB-DG: the headline >4.5 pp improvement is presented without reported ablations that isolate the contribution of the relational scoring layer (e.g., removing predicates while keeping the bottleneck and capacity), without statistical significance across multiple runs, and without controls for extra parameters introduced by the predicate heads. This makes it impossible to attribute the gain to spatial relation induction rather than incidental regularization or capacity.

    Authors: We agree that the current experimental presentation does not sufficiently isolate the contribution of the relational scoring layer. In the revised manuscript we will include (i) an ablation that removes the predicate heads while increasing the capacity of the concept bottleneck to match parameter count, (ii) mean accuracy and standard deviation over five independent runs with different random seeds, and (iii) a table reporting parameter counts for all model variants. These additions will enable clearer attribution of the observed gains to spatial relation induction. revision: yes
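The reporting protocol promised in (ii) is simple to pin down; a minimal sketch, with hypothetical per-seed accuracies rather than numbers from the paper:

```python
import statistics

def summarize_runs(accuracies):
    # Mean ± sample standard deviation over independent seeded runs,
    # formatted as a table-ready string.
    mean = statistics.mean(accuracies)
    std = statistics.stdev(accuracies)
    return f"{mean:.1f} ± {std:.1f}"

# Hypothetical accuracies from five seeds; illustrative only.
runs = [72.4, 71.8, 73.1, 72.0, 72.7]
print(summarize_runs(runs))
```

Sample (n-1) standard deviation is the usual choice for a handful of seeds, since the runs are a sample of the seed distribution rather than the whole of it.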

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper defines an explicit end-to-end architecture (CNN backbone to concept-bottleneck primitive heatmaps to structural scoring layer with soft predicates) whose class probabilities are computed from learned relational compositions. Performance claims rest on empirical results across CUB-DG and DomainBed rather than any reduction of the target quantity to fitted inputs or self-referential definitions. No load-bearing step equates a prediction to its own construction or imports uniqueness via self-citation chains.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim rests on the new concepts of visual primitives and multi-arity soft predicates as core mechanisms, plus standard assumptions about CNN feature extraction and end-to-end differentiability; no specific numerical free parameters are detailed in the abstract.

axioms (2)
  • domain assumption A CNN backbone extracts general visual features suitable for downstream primitive detection.
    Invoked as the first stage of the three-component architecture.
  • domain assumption Primitive heatmaps with differentiable spatial coordinates can be produced by a concept bottleneck layer.
    Core premise enabling the structural scoring layer.
invented entities (2)
  • Visual primitives no independent evidence
    purpose: Basic components into which visual recognition is factored.
    New representational unit introduced to support relational composition.
  • Soft binary, ternary, and quaternary predicates no independent evidence
    purpose: Differentiable measures of spatial alignment among detected primitives.
    Novel mechanism for evaluating relational compositions.

pith-pipeline@v0.9.0 · 5535 in / 1521 out tokens · 66219 ms · 2026-05-08T14:10:38.296510+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

13 extracted references · 11 canonical work pages · 3 internal anchors

  1. [1] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893.
  2. [2] Zhi Chen, Yijie Bei, and Cynthia Rudin. Concept whitening for interpretable image recognition. Nature Machine Intelligence, 2(12):772–782.
  3. [3] Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. MaPLe: Multi-modal prompt learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19113–19122.
  4. [4] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  5. [5] Jiayuan Mao, Chuang Gan, Pushmeet Kohli, Joshua B Tenenbaum, and Jiajun Wu. The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision. arXiv preprint arXiv:1904.12584.
  6. [6] Krikamol Muandet, David Balduzzi, and Bernhard Schölkopf. Domain generalization via invariant feature representation. In ICML (1), pages 10–18.
  7. [7] Duc-Duy Nguyen and Dat Nguyen. VirDA: Reusing backbone for unsupervised domain adaptation with visual reprogramming. Transactions on Machine Learning Research.
  8. [8] Aiden Nibali, Zhen He, Stuart Morgan, and Luke Prendergast. Numerical coordinate regression with convolutional neural networks. arXiv preprint arXiv:1801.07372.
  9. [9] Tao Sun, Cheng Lu, Tianshuo Zhang, and Haibin Ling. Safe self-refinement for transformer-based domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  10. [10] Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. Deep hashing network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5018–5027.
  11. [11] Kexin Yi, Jiajun Wu, Chuang Gan, Antonio Torralba, Pushmeet Kohli, and Joshua B. Tenenbaum. Neural-symbolic VQA: Disentangling reasoning from vision and language understanding. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NIPS'18).
  12. [12] Kaiyang Zhou, Yongxin Yang, Yu Qiao, and Tao Xiang. Domain generalization with MixStyle. arXiv preprint arXiv:2104.02008.
  13. [13] (internal anchor, Appendix A: Dataset and Implementation details) The CUB-DG dataset comprises 4 domains, with 47,152 images across 200 classes of North American bird species. DomainBed consists of 5 datasets: PACS [Li et al., 2017] (4 domains, 7 classes, 9,991 images), VLCS [Fang et al., 2013] (4 domains, 5 classes, 10,729 images), Office-Home [Venkateswara et al., 2017] (4 domains, 65 classes, 15,588 images), TerraIncognita [Beery et al., 2018] (4 domains, 10 c...