Weakly Supervised Concept Learning for Object-centric Visual Reasoning

Bettina Finzel; Gesina Schwalbe; Sparsh Tiwari

arXiv: 2605.08201 · v1 · submitted 2026-05-05 · cs.LG · cs.AI · cs.CV

Pith reviewed 2026-05-12 00:45 UTC · model grok-4.3

classification: cs.LG, cs.AI, cs.CV

keywords: weak supervision, concept learning, object-centric vision, neurosymbolic AI, variational autoencoder, inductive logic programming, domain generalization

The pith

Sparse concept labels combined with VAE self-supervision ground object representations that support logical rule induction from images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a two-stage neurosymbolic pipeline that first extracts object-centric concepts from raw images and then feeds those concepts as symbols into rule-learning systems. It achieves this by training a slot-based variational autoencoder whose reconstruction objective competes with a small number of human-provided concept labels on the latent dimensions. A sympathetic reader would care because the resulting symbols enable discovery of abstract rules for visual reasoning tasks while cutting the required labeled data to one percent and preserving performance under domain shifts where other methods degrade. The approach is evaluated on both synthetic datasets designed for rule induction and real-world image collections, showing that the grounded symbols translate effectively into background knowledge for inductive logic programming, decision trees, and Bayesian networks.

Core claim

The central claim is that a slot-based VAE architecture integrates reconstruction-based self-supervision with sparse concept guidance on latent slots to produce human-interpretable, grounded object representations; these representations convert directly into symbolic background knowledge that allows inductive logic programming and related reasoning engines to discover complex abstract rules for object-centric tasks, even when only one percent of the training data carries concept labels and when test images come from substantially shifted domains.
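
The conversion from per-slot concept predictions to symbolic background knowledge can be illustrated with a minimal sketch. The predicate names (`contains`, `shape`, `color`), the object-naming scheme, and the example predictions are assumptions made for illustration; the paper's actual fact format is not shown here.

```python
from typing import Dict, List


def slots_to_facts(image_id: str, slots: List[Dict[str, str]]) -> List[str]:
    """Turn per-slot concept predictions into Prolog-style ground facts.

    Hypothetical format: one `contains/2` fact per object plus one fact
    per predicted attribute, e.g. `shape(img0_obj0, cube).`
    """
    facts = []
    for idx, concepts in enumerate(slots):
        obj = f"{image_id}_obj{idx}"
        facts.append(f"contains({image_id}, {obj}).")
        # Sort attributes so the output order is deterministic.
        for attribute, value in sorted(concepts.items()):
            facts.append(f"{attribute}({obj}, {value}).")
    return facts


# Made-up predictions for one image with two occupied slots.
predicted = [
    {"shape": "cube", "color": "red"},
    {"shape": "sphere", "color": "blue"},
]
facts = slots_to_facts("img0", predicted)
print("\n".join(facts))
```

Facts of this kind could serve as background knowledge for an ILP system, or be flattened into feature columns for a decision tree or Bayesian network.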

What carries the argument

A slot-based variational autoencoder whose reconstruction loss competes with limited concept supervision on the latent dimensions to learn disentangled, object-centric representations.
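
A minimal sketch of how such a competing objective might be computed for a single slot, assuming a diagonal Gaussian posterior, a standard normal prior, and a cross-entropy concept loss that is applied only when a label exists. The function names, the weights `beta` and `delta`, and the input values are illustrative assumptions; the coordinate and presence terms that appear in the loss excerpt quoted further down this page are omitted.

```python
import math
from typing import List, Optional


def kl_to_standard_normal(mu: List[float], logvar: List[float]) -> float:
    """KL(N(mu, exp(logvar)) || N(0, 1)), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))


def cross_entropy(probs: List[float], label: int) -> float:
    """Negative log-probability of the labeled class (clamped for stability)."""
    return -math.log(max(probs[label], 1e-12))


def total_loss(
    recon_error: float,
    mu: List[float],
    logvar: List[float],
    concept_probs: List[float],
    label: Optional[int],
    beta: float = 1.0,
    delta: float = 1.0,
) -> float:
    """Reconstruction + beta * KL, plus delta * supervised term if labeled."""
    loss = recon_error + beta * kl_to_standard_normal(mu, logvar)
    if label is not None:  # only the sparse labeled subset contributes
        loss += delta * cross_entropy(concept_probs, label)
    return loss


unlabeled = total_loss(2.0, [0.0], [0.0], [0.5, 0.5], label=None)
labeled = total_loss(2.0, [0.0], [0.0], [0.5, 0.5], label=1)
print(unlabeled, labeled)
```

With most samples unlabeled, the supervised term fires rarely, so the relative size of `delta` and `beta` decides how strongly the few labels shape the latent dimensions.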

If this is right

Object-centric reasoning tasks become feasible with labeling budgets reduced by two orders of magnitude.
The learned symbols remain usable by multiple symbolic engines including inductive logic programming, decision trees, and Bayesian networks.
Performance holds under domain shifts that cause fully supervised perception modules to fail.
At one percent supervision the method exceeds the domain generalization of current foundation-model baselines on the evaluated tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same weak-supervision recipe could be applied to other perception modules that feed symbolic planners in robotics or planning domains.
Increasing the number of slots or concepts would test whether the grounding mechanism scales without additional label cost.
Replacing the VAE with other self-supervised objectives might further lower the supervision threshold while keeping the symbols interpretable.

Load-bearing premise

The VAE reconstruction signal together with the few concept labels will produce latent dimensions that correspond to stable, human-interpretable object properties rather than dataset-specific artifacts.

What would settle it

Running the full pipeline on a domain-shifted test set and finding that the induced logical rules achieve accuracy no better than random guessing, or that the extracted concepts cannot be matched to human-provided interpretations even at one percent supervision.
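
The check described above can be sketched as a comparison between an induced rule's accuracy and a chance-level reference on a shifted test set. The rule, the scenes, and the use of a majority-class baseline as the chance reference are all illustrative assumptions, not results or procedures from the paper.

```python
from typing import Callable, Dict, List, Tuple

Scene = List[Dict[str, str]]  # one concept -> value dict per detected object


def rule_red_cube(scene: Scene) -> bool:
    """Hypothetical induced rule: positive iff the scene contains a red cube."""
    return any(o.get("color") == "red" and o.get("shape") == "cube" for o in scene)


def accuracy(rule: Callable[[Scene], bool], data: List[Tuple[Scene, bool]]) -> float:
    """Fraction of scenes on which the rule agrees with the label."""
    return sum(rule(scene) == label for scene, label in data) / len(data)


def majority_baseline(data: List[Tuple[Scene, bool]]) -> float:
    """Accuracy of always predicting the more frequent class."""
    positives = sum(label for _, label in data)
    return max(positives, len(data) - positives) / len(data)


# Made-up symbols extracted from a domain-shifted test set.
shifted_test = [
    ([{"color": "red", "shape": "cube"}], True),
    ([{"color": "blue", "shape": "cube"}], False),
    ([{"color": "red", "shape": "sphere"}, {"color": "red", "shape": "cube"}], True),
    ([{"color": "green", "shape": "sphere"}], False),
]
acc = accuracy(rule_red_cube, shifted_test)
baseline = majority_baseline(shifted_test)
print(f"rule accuracy={acc:.2f}, baseline={baseline:.2f}")
```

If the rule accuracy on shifted data were to fall to the baseline, that would count against the grounding claim under this kind of test.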

Figures

Figures reproduced from arXiv: 2605.08201 by Bettina Finzel, Gesina Schwalbe, Sparsh Tiwari.

**Figure 2.** UMAP visualizations of the latent space, colored by different ground-truth concepts.

**Figure 3.** Left to right: samples from the datasets used: Clevr, Clevr-Tex, 2D version of Clevr, 3D shapes, melanoma, HAM10000, Dsprites.

**Figure 4.** Model accuracy on in-domain (HAM, left) and out of

Original abstract

Neurosymbolic systems promise to combine deep neural networks' (DNNs') processing of raw sensor inputs with the few-shot performance of symbolic artificial intelligence. Two-stage approaches explicitly decouple DNN-based perception from subsequent rule-based reasoning. This avoids the optimization and interpretability issues of end-to-end differentiable approaches, but requires costly labels for the perception output. This paper introduces an efficient weak supervision scheme for the perception stage to ground its output symbols for logical induction in object-centric reasoning tasks. It combines a slot-based architecture for object-centricity with a Variational Autoencoder (VAE) for self-supervision, competing with concept guidance on latent dimensions for human-interpretable grounding. The resulting predictions are translated into symbolic background knowledge for reasoning frameworks, such as Inductive Logic Programming (ILP), Decision Trees, and Bayesian Networks. Our extensive empirical evaluation on synthetic and real-world datasets shows that our approach can discover complex, abstract rules for object-centric reasoning whilst reducing supervision to as little as 1% of labels, and being robust even under substantial domain shift. Notably, at 1% supervision it even outperforms state-of-the-art foundation model baselines in domain generalization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

The paper sketches a VAE-slot competition trick to ground object concepts with 1% labels for neurosymbolic reasoning, but the empirical backing for reliable disentanglement and domain-shift gains is still thin.

The main takeaway is that this work tries to solve the labeling bottleneck in two-stage neurosymbolic pipelines by letting a VAE reconstruction loss compete with sparse concept supervision on slot latents, then feeding the resulting symbols into ILP or decision trees. That setup is the concrete new piece relative to the fully supervised perception stages the authors cite. The paper does a clean job of laying out why object-centric slots plus self-supervision could cut annotation costs while keeping the symbolic part interpretable and few-shot. The framing around domain-shift robustness is also useful for anyone thinking about real-world transfer in visual reasoning.

On the soft spots, the central claims rest on experiments whose supporting evidence is not shown. The abstract asserts outperformance at 1% supervision and better generalization than foundation models, yet there are no reported alignment scores between latents and ground-truth attributes, no ablation removing the guidance term, and no statistical detail on how often the downstream reasoner actually receives human-interpretable symbols rather than spurious correlations. Without those checks it is hard to rule out that the reported rule discovery works mainly on the synthetic data, or that it works because the symbolic module exploits non-grounded features. The stress-test worry about collapse under domain shift therefore stands until the paper shows direct evidence that the latents stay disentangled at low supervision.

This is the kind of paper that would interest people already working on neurosymbolic object reasoning or weak supervision for perception modules. A reader in that niche could pick up the architectural pattern even if the numbers need verification. I would send it for peer review so the experimental controls and metrics can be examined properly rather than desk-rejecting it outright.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes a two-stage neurosymbolic pipeline for object-centric visual reasoning that decouples perception from symbolic reasoning. Perception uses a slot-based VAE trained with competing reconstruction and sparse concept-supervision losses on selected latent dimensions; the resulting symbols are fed as background knowledge to ILP, decision trees, or Bayesian networks. The central empirical claim is that this scheme discovers complex abstract rules while reducing labeled supervision to 1% and remains robust under substantial domain shift, even outperforming foundation-model baselines in generalization on both synthetic and real-world datasets.

Significance. If the grounding and generalization results hold, the work would meaningfully lower the annotation cost of neurosymbolic systems and demonstrate a practical route to interpretable, few-shot visual reasoning. The combination of self-supervised disentanglement with minimal concept guidance is a concrete contribution to the perception stage of two-stage neurosymbolic architectures.

major comments (3)

[§4] §4 (Experiments), 1% supervision rows: the reported outperformance over foundation-model baselines and the claim of reliable rule discovery rest on the assumption that the competing VAE + sparse guidance loss produces human-interpretable, grounded concepts. No concept-alignment scores, mutual-information metrics between latents and ground-truth attributes, or ablation that removes the 1% guidance term are provided; without these, it is impossible to confirm that the downstream ILP/decision-tree gains are not artifacts of non-interpretable features or dataset-specific correlations.
[§3.2] §3.2 (Method), competing-loss formulation: the paper states that the unsupervised VAE term enforces disentanglement while the sparse concept guidance aligns selected dimensions. However, no analysis or hyper-parameter study shows that this balance remains stable when supervision drops to 1% or when test domains differ; the absence of such analysis makes the domain-shift robustness claim difficult to evaluate.
[Table 2] Table 2 / Figure 5 (domain-shift results): the generalization gains are presented without statistical significance tests across multiple random seeds or runs, and without an ablation that isolates the contribution of the VAE self-supervision versus the concept guidance. This weakens the load-bearing claim that the method is “robust even under substantial domain shift.”
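
One of the missing checks named in these comments, mutual information between a latent dimension and a ground-truth attribute, can be sketched as follows. This is a plug-in estimate over discrete values, so a continuous latent would first need binning; the binning, the variable names, and the sample data are assumptions for illustration only.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def mutual_information(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """Empirical mutual information I(X; Y) in nats from paired samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p_xy / (p_x * p_y), written with raw counts
        mi += p_xy * math.log(p_xy * n * n / (count_x[x] * count_y[y]))
    return mi


# Made-up data: a binned latent that tracks shape but ignores color.
latent_bin = [0, 0, 1, 1]
shape = ["cube", "cube", "sphere", "sphere"]
color = ["red", "blue", "red", "blue"]
mi_shape = mutual_information(latent_bin, shape)
mi_color = mutual_information(latent_bin, color)
print(mi_shape, mi_color)
```

A grounded dimension would be expected to show high mutual information with its assigned attribute and low values for the others; the plug-in estimate is biased upward on small samples, so it is best read comparatively.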

minor comments (2)

[Abstract] The abstract and §1 contain minor grammatical inconsistencies (e.g., “whilst reducing supervision to as little as 1% of labels” and inconsistent capitalization of “Neurosymbolic”).
[§3] Notation for the slot-VAE latent dimensions and the sparse supervision mask is introduced without a clear summary table; a single equation or diagram reference would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments that highlight opportunities to strengthen the empirical support for our claims. We respond to each major comment below and indicate the revisions we will make.

Referee: [§4] §4 (Experiments), 1% supervision rows: the reported outperformance over foundation-model baselines and the claim of reliable rule discovery rest on the assumption that the competing VAE + sparse guidance loss produces human-interpretable, grounded concepts. No concept-alignment scores, mutual-information metrics between latents and ground-truth attributes, or ablation that removes the 1% guidance term are provided; without these, it is impossible to confirm that the downstream ILP/decision-tree gains are not artifacts of non-interpretable features or dataset-specific correlations.

Authors: We agree that direct quantitative evidence of concept grounding is needed to support the interpretability assumption. In the revised manuscript we will add concept-alignment accuracy scores and mutual-information values between selected latent dimensions and ground-truth attributes at the 1% supervision level. We will also include an ablation that removes the sparse guidance term while keeping the VAE reconstruction loss, to isolate its contribution to the downstream symbolic reasoning performance.

revision: yes

Referee: [§3.2] §3.2 (Method), competing-loss formulation: the paper states that the unsupervised VAE term enforces disentanglement while the sparse concept guidance aligns selected dimensions. However, no analysis or hyper-parameter study shows that this balance remains stable when supervision drops to 1% or when test domains differ; the absence of such analysis makes the domain-shift robustness claim difficult to evaluate.

Authors: We acknowledge the absence of a dedicated sensitivity study on the loss weighting. The revised version will contain an analysis that varies the relative weight between the VAE reconstruction term and the sparse concept-supervision term, reporting downstream task performance at 1% supervision and across the domain-shift settings to demonstrate stability of the balance.

revision: yes

Referee: [Table 2] Table 2 / Figure 5 (domain-shift results): the generalization gains are presented without statistical significance tests across multiple random seeds or runs, and without an ablation that isolates the contribution of the VAE self-supervision versus the concept guidance. This weakens the load-bearing claim that the method is “robust even under substantial domain shift.”

Authors: We agree that statistical rigor and isolating ablations are required. The revision will report means and standard deviations over at least five random seeds, include statistical significance tests (e.g., paired t-tests with p-values), and add ablations that separately disable the VAE self-supervision term and the concept-guidance term to quantify their individual roles in domain-shift generalization.

revision: yes
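
The seed-level reporting promised in the last response can be sketched as below. The accuracy values are invented placeholders, not results from the paper, and the function only returns the paired t statistic and its degrees of freedom; converting that to a p-value needs a t-distribution, for example via `scipy.stats.ttest_rel`, which is not used here to keep the sketch dependency-free.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom for per-seed scores a vs. b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    return t_stat, n - 1


# Placeholder accuracies over five seeds (NOT taken from the paper).
method = [0.80, 0.82, 0.79, 0.81, 0.83]
baseline_scores = [0.75, 0.78, 0.74, 0.77, 0.76]

t_stat, dof = paired_t(method, baseline_scores)
print(f"method:   {statistics.mean(method):.3f} +/- {statistics.stdev(method):.3f}")
print(f"baseline: {statistics.mean(baseline_scores):.3f} +/- {statistics.stdev(baseline_scores):.3f}")
print(f"paired t = {t_stat:.2f} with {dof} degrees of freedom")
```

Pairing by seed is appropriate only if both systems share the same seeds and data splits; otherwise an unpaired test would be the safer choice.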

Circularity Check

0 steps flagged

No circularity in empirical pipeline

The manuscript describes an empirical architecture (slot-VAE with competing reconstruction and sparse concept-supervision losses) whose outputs are fed to off-the-shelf symbolic reasoners. All reported performance numbers are obtained by training on external datasets and measuring accuracy, domain-shift robustness, and comparison against baselines; no equation or claim is shown to be definitionally equivalent to its own fitted parameters or to a self-citation chain. The central claim therefore remains independently falsifiable by replication on the cited datasets.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on domain assumptions about the ability of slot attention and VAE self-supervision to produce usable symbols; no free parameters or invented entities are specified in the abstract.

axioms (2)

domain assumption: Slot-based architectures can decompose images into distinct object-centric representations suitable for downstream reasoning.
Stated as the foundation for the perception stage.

domain assumption: VAE reconstruction provides sufficient self-supervisory signal to ground latent dimensions when competing against sparse concept guidance.
Core mechanism enabling the 1% supervision regime.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

$L_{\text{total}} = L_{\text{recon}} + \beta L_{\text{KL}} + \delta L_{\text{sup}} + \lambda L_{\text{coord}} + \gamma L_{\text{presence}}$ ... slot-based VAE with concept heads for shape/color/size/material

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Slot Attention ... 10 slots, 3 attention iterations ... 15% supervision checkpoint frozen for predicate generation

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

