pith. sign in

arxiv: 2606.30101 · v1 · pith:ZSBJZRHQnew · submitted 2026-06-29 · 💻 cs.RO · cs.CV

SIR: Structured Image Representations for Explainable Robot Learning

Pith reviewed 2026-06-30 05:47 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords scene graphsrobot policy learningexplainable representationsgraph sparsificationvisual embeddingsdataset bias analysisstructured representations
0
0 comments X

The pith

Robot policies learn to sparsify scene graphs from images end-to-end, raising task success while exposing dataset biases through the selected objects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Structured Image Representations to replace opaque visual embeddings in robot policies with explicit scene graphs that a learned module sparsifies into task-relevant sub-graphs. This structure is passed directly to the action model, making decisions inspectable by examining which objects the policy attends to at each step. Evaluations claim higher average success on RoboCasa tasks than image baselines, and the sub-graphs allow detection of when the model relies on distractors or ignores key items. A reader would care because current learned policies are hard to debug or trust when they fail due to hidden correlations in training data.

Core claim

Our approach first constructs a fully connected graph, using image-derived features as initial node representations. Then, a module learns to sparsify this graph end-to-end, creating a task-relevant sub-graph that is passed to the action generation model. This process makes our model intrinsically explainable. Evaluations on RoboCasa show that our sparse graph policies outperform image-based baselines on average with 19.5% vs 14.81% success rate. Most importantly, we show that the learned sparse graphs are a powerful tool for model analysis. By analysing when the model's sub-graph deviates from human expectation, such as by including distractor nodes or omitting key objects, we successfully

What carries the argument

end-to-end learned sparsification module that prunes a fully connected scene graph of image features into a task-relevant sub-graph for the policy

If this is right

  • Sparse graph policies reach 19.5 percent success versus 14.81 percent for image baselines on the RoboCasa benchmark.
  • Inspection of the learned sub-graphs reveals cases where the model includes distractor nodes or omits key objects.
  • Such inspection identifies spurious correlations and positional biases present in the training dataset.
  • The representation supplies intrinsic explainability because the policy's inputs are the explicit selected nodes rather than opaque embeddings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the sparsification proves stable across environments, the same graphs could serve as a bridge for transferring policies between simulation and real robots by focusing on object relations instead of pixel statistics.
  • Dataset bias detection via graph inspection could be turned into an active data collection loop that flags and augments scenes where the model consistently selects the wrong nodes.
  • The method suggests that many visual policy failures may stem from attending to the wrong objects, so similar sparsification layers might improve performance in other embodied tasks such as manipulation or navigation.

Load-bearing premise

The sparsification step selects sub-graphs that capture the true task-relevant objects and relations without omitting critical elements or adding distractors in ways the downstream policy cannot handle.

What would settle it

Re-training the identical policy head on the full unsparsified graph or on raw images and observing equal or higher success rates, or finding that the selected sub-graphs show no systematic relation to human-labeled task objects yet still produce the reported performance gains.

Figures

Figures reproduced from arXiv: 2606.30101 by Jan Schwab, Jens Bosch, Maximilian Xiling Li, Minh-Trung Tang, Moritz Haberland, Nils Blank, Paul Mattes, Rudolf Lioutikov.

Figure 1
Figure 1. Figure 1: First, SIR extracts image features to generate an initial, fully connected SG, using these features as node representations. This Fully-Connected-Graph (FC-Graph) is then passed to the Goal-Conditioned Imitation Learning (GCIL) model. Within the model, a learnable sparsification module first prunes the Fully-Connected-Graph (FC-Graph) into a minimal, task-relevant sub-graph, retaining only the nodes the mo… view at source ↗
Figure 2
Figure 2. Figure 2: Multiple views of a scene observation can be han [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: When novel distractor objects, not seen during training, are introduced at inference, the image-based baseline’s (Img) performance [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Possible sub-graph generation, when using our explana [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Comparing Fig. 5a (a hard task) and Fig. 5b (an easier task) shows that the explanation for the harder task is more consistent, suggesting the model learns to identify more informative nodes as task complexity increases. These learned sub-graphs are effective at revealing significant dataset biases. For example, in Fig. 5c, the model includes many unimportant nodes, which indicates that it may be exploitin… view at source ↗
Figure 6
Figure 6. Figure 6: Percentage of nodes to be kept in the sub-graph per seeded model [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Percentage of nodes to be kept in the sub-graph per seeded model [PITH_FULL_IMAGE:figures/full_fig_p021_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Percentage of nodes to be kept in the sub-graph per seeded model [PITH_FULL_IMAGE:figures/full_fig_p022_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Percentage of nodes to be kept in the sub-graph per seeded model [PITH_FULL_IMAGE:figures/full_fig_p022_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Percentage of nodes to be kept in the sub-graph per seeded model [PITH_FULL_IMAGE:figures/full_fig_p023_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Percentage of nodes to be kept in the sub-graph per seeded model in [PITH_FULL_IMAGE:figures/full_fig_p023_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Grad-CAM explanations over time for a rollout of the [PITH_FULL_IMAGE:figures/full_fig_p024_12.png] view at source ↗
read the original abstract

Existing robot policies based on learned visual embeddings lack explicit structure and are sensitive to visual distractions. Thus, the representations that drive their behaviour are often opaque, making their decision-making process difficult to interpret. To address this, we introduce Structured Image Representations (SIR), a method that leverages Scene Graphs (SGs) as an intermediate representation for robot policy learning. Our approach first constructs a fully connected graph, using image-derived features as initial node representations. Then, a module learns to sparsify this graph end-to-end, creating a task-relevant sub-graph that is passed to the action generation model. This process makes our model intrinsically explainable. Evaluations on RoboCasa show that our sparse graph policies outperform image-based baselines on average with 19.5% vs 14.81% success rate. Most importantly, we show that the learned sparse graphs are a powerful tool for model analysis. By analysing when the model's sub-graph deviates from human expectation, such as by including distractor nodes or omitting key objects, we successfully uncover dataset biases, including spurious correlations and positional biases. https://github.com/intuitive-robots/SIR_Model

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes Structured Image Representations (SIR) that build fully-connected scene graphs from image features, apply an end-to-end learned sparsification module to produce task-relevant sub-graphs, and feed these to an action-generation policy. On RoboCasa it reports average success rates of 19.5 % for the sparse-graph policies versus 14.81 % for image-based baselines and claims that analysis of deviations between the learned sub-graphs and human expectations reveals dataset biases such as spurious correlations and positional biases.

Significance. If the sparsified graphs can be shown to be both performance-critical and faithful to scene structure, the approach would supply a concrete mechanism for intrinsically explainable robot policies together with a diagnostic tool for data artifacts. The public code release is a clear asset. At present, however, the absence of statistical detail on the reported gains and the lack of controls that separate sparsifier behavior from policy behavior limit the strength of both the performance and the bias-analysis claims.

major comments (2)
  1. [Abstract] Abstract: aggregate success rates (19.5 % vs 14.81 %) are presented without variance, trial counts, statistical tests, or implementation details for the image-based baselines, so it is impossible to determine whether the reported lift is reliable or reproducible across seeds and environments.
  2. [Method (sparsification module)] Method description of the sparsification module: because the sparsifier is trained jointly with the policy on the same RoboCasa trajectories, any spurious correlation in the data can be exploited by node selection; the downstream policy then only needs to act on the selected nodes. This creates a circularity that affects both the performance attribution and the subsequent claim that graph deviations reliably expose dataset biases rather than model artifacts. No ablation that freezes the sparsifier, substitutes a fixed graph, or compares against human node labels is described.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for highlighting issues with statistical reporting and the need for additional controls on the sparsification module. Below we provide point-by-point responses and indicate planned revisions.

read point-by-point responses
  1. Referee: [Abstract] Abstract: aggregate success rates (19.5 % vs 14.81 %) are presented without variance, trial counts, statistical tests, or implementation details for the image-based baselines, so it is impossible to determine whether the reported lift is reliable or reproducible across seeds and environments.

    Authors: We concur that the abstract would benefit from additional statistical information to allow readers to assess the reliability of the results. The full manuscript reports results over multiple seeds with standard deviations and specifies the number of evaluation episodes. We will revise the abstract to include these details, such as the number of trials and variance measures, along with a brief note on baseline implementations. revision: yes

  2. Referee: [Method (sparsification module)] Method description of the sparsification module: because the sparsifier is trained jointly with the policy on the same RoboCasa trajectories, any spurious correlation in the data can be exploited by node selection; the downstream policy then only needs to act on the selected nodes. This creates a circularity that affects both the performance attribution and the subsequent claim that graph deviations reliably expose dataset biases rather than model artifacts. No ablation that freezes the sparsifier, substitutes a fixed graph, or compares against human node labels is described.

    Authors: The referee correctly identifies a potential issue with joint end-to-end training. We will add text to the method section explaining that while the sparsifier can leverage data correlations, the resulting sub-graphs are still used to analyze model decisions by comparing them to human expectations. This comparison can reveal when the model attends to biased features. To further address the concern, we will include an ablation study comparing the learned sparsifier against a non-learned baseline (e.g., random sparsification) in the revised manuscript. We disagree that this necessarily invalidates the bias analysis, as the deviations are observed post-training and provide diagnostic value regardless of how the selection was learned. revision: partial

Circularity Check

0 steps flagged

No circularity; empirical pipeline with external benchmark evaluation

full rationale

The paper presents a learned end-to-end sparsification pipeline for scene graphs without any derivation chain, equations, or first-principles results. Performance claims rest on success-rate comparisons against image baselines on the external RoboCasa benchmark, and bias analysis is qualitative. No self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations appear in the text. The joint training of sparsifier and policy does not match any enumerated circularity pattern because no specific reduction to inputs by construction is exhibited.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on standard assumptions that scene graphs can be reliably extracted from images and that a learned sparsifier can identify task-relevant subgraphs without external supervision. No free parameters or invented entities are explicitly named in the abstract.

axioms (1)
  • domain assumption Scene graphs constructed from image-derived features contain nodes and edges that can be meaningfully sparsified for downstream control.
    Invoked when the paper states that a module learns to sparsify the fully connected graph into a task-relevant sub-graph.

pith-pipeline@v0.9.1-grok · 5747 in / 1264 out tokens · 23654 ms · 2026-06-30T05:47:35.340754+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

50 extracted references · 8 canonical work pages · 2 internal anchors

  1. [1]

    A survey of robot learning from demon- stration.Robotics and autonomous systems, 57(5):469–483,

    Brenna D Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demon- stration.Robotics and autonomous systems, 57(5):469–483,

  2. [2]

    Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Open X-Embodiment Collaboration. Open X-Embodiment: Robotic learning datasets and RT-X models.https:// arxiv.org/abs/2310.08864, 2023. 1

  3. [3]

    One-shot imitation learning with graph neural networks for pick-and- place manipulation tasks.IEEE Robotics and Automation Letters, 2023

    Francesco Di Felice, Salvatore D’Avella, Alberto Remus, Paolo Tripicchio, and Carlo Alberto Avizzano. One-shot imitation learning with graph neural networks for pick-and- place manipulation tasks.IEEE Robotics and Automation Letters, 2023. 2, 3

  4. [4]

    Goal-conditioned imitation learning

    Yiming Ding, Carlos Florensa, Pieter Abbeel, and Mari- ano Phielipp. Goal-conditioned imitation learning. InAd- vances in Neural Information Processing Systems. Curran Associates, Inc., 2019. 1

  5. [5]

    Towards fusing point cloud and visual representations for imitation learning,

    Atalay Donat, Xiaogang Jia, Xi Huang, Aleksandar Tara- novic, Denis Blessing, Ge Li, Hongyi Zhou, Hanyi Zhang, Rudolf Lioutikov, and Gerhard Neumann. Towards fusing point cloud and visual representations for imitation learning,

  6. [6]

    Imitation from observation using rl and graph-based repre- sentation of demonstrations

    Yassine El Manyari, Patrick Le Callet, and Laurent Doll ´e. Imitation from observation using rl and graph-based repre- sentation of demonstrations. In2022 21st IEEE Interna- tional Conference on Machine Learning and Applications (ICMLA), pages 1258–1265. IEEE, 2022. 2

  7. [7]

    Conceptgraphs: Open-vocabulary 3d scene graphs for per- ception and planning

    Qiao Gu, Ali Kuwajerwala, Sacha Morin, Krishna Murthy Jatavallabhula, Bipasha Sen, Aditya Agarwal, Corban Rivera, William Paul, Kirsty Ellis, Rama Chellappa, et al. Conceptgraphs: Open-vocabulary 3d scene graphs for per- ception and planning. In2024 IEEE International Confer- ence on Robotics and Automation (ICRA), pages 5021–5028. IEEE, 2024. 1, 3

  8. [8]

    Aditya Prakash, and Wei Jin

    Mohammad Hashemi, Shengbo Gong, Juntong Ni, Wenqi Fan, B. Aditya Prakash, and Wei Jin. A Comprehensive Survey on Graph Reduction: Sparsification, Coarsening, and Condensation. pages 8058–8066, 2024. ISSN: 1045-0823. 3

  9. [9]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceed- ings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 4, 1

  10. [10]

    VIRL: Self-supervised visual graph inverse reinforcement learning

    Lei Huang, Weijia Cai, Zihan Zhu, Chen Feng, Helge Rhodin, and Zhengbo Zou. VIRL: Self-supervised visual graph inverse reinforcement learning. In8th Annual Confer- ence on Robot Learning, 2024. 3

  11. [11]

    Graph inverse reinforcement learning from diverse videos

    Sateesh Kumar, Jonathan Zamora, Nicklas Hansen, Rishabh Jangir, and Xiaolong Wang. Graph inverse reinforcement learning from diverse videos. InProceedings of The 6th Con- ference on Robot Learning, pages 55–66. PMLR, 2023. 3

  12. [12]

    Self-Attention Graph Pooling

    Junhyun Lee, Inyeop Lee, and Jaewoo Kang. Self-Attention Graph Pooling. InProceedings of the 36th International Conference on Machine Learning, pages 3734–3743. PMLR,

  13. [13]

    Collective behavior clone with visual attention via neural interaction graph prediction.arXiv preprint arXiv:2503.06869, 2025

    Kai Li, Zhao Ma, Liang Li, and Shiyu Zhao. Collective behavior clone with visual attention via neural interaction graph prediction.arXiv preprint arXiv:2503.06869, 2025. 3

  14. [14]

    Multi-objective photoreal simulation (mops) dataset for computer vision in robotic manipulation

    Maximilian Xiling Li, Paul Mattes, Nils Blank, Ko- rbinian Franz Rudolf, Paul Werner L ¨odige, and Rudolf Li- outikov. Multi-objective photoreal simulation (mops) dataset for computer vision in robotic manipulation. InStructured World Models for Robotic Manipulation, 2025. 1

  15. [15]

    Efficient and interpretable robot manipulation with graph neural networks.IEEE Robotics and Automation Let- ters, 7(2):2740–2747, 2022

    Yixin Lin, Austin S Wang, Eric Undersander, and Akshara Rai. Efficient and interpretable robot manipulation with graph neural networks.IEEE Robotics and Automation Let- ters, 7(2):2740–2747, 2022. 2, 3

  16. [16]

    Careful Selection and Thoughtful Discarding: Graph Explicit Pooling Utiliz- ing Discarded Nodes, 2023

    Chuang Liu, Wenhang Yu, Kuang Gao, Xueqi Ma, Yibing Zhan, Jia Wu, Bo Du, and Wenbin Hu. Careful Selection and Thoughtful Discarding: Graph Explicit Pooling Utiliz- ing Discarded Nodes, 2023. arXiv:2311.12644 [cs]. 3

  17. [17]

    What mat- ters in learning from offline human demonstrations for robot manipulation

    Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiri- any, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Mart´ın-Mart´ın. What mat- ters in learning from offline human demonstrations for robot manipulation. In5th Annual Conference on Robot Learning,

  18. [18]

    Calvin: A benchmark for language- conditioned policy learning for long-horizon robot manip- ulation tasks.IEEE Robotics and Automation Letters, 7(3): 7327–7334, 2022

    Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wol- fram Burgard. Calvin: A benchmark for language- conditioned policy learning for long-horizon robot manip- ulation tasks.IEEE Robotics and Automation Letters, 7(3): 7327–7334, 2022. 2, 4, 1

  19. [19]

    Robocasa: Large-scale simulation of every- day tasks for generalist robots

    Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. Robocasa: Large-scale simulation of every- day tasks for generalist robots. InRobotics: Science and Systems (RSS), 2024. 2, 4, 5, 6, 1

  20. [20]

    Octo: An open-source generalist robot policy

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Charles Xu, Jianlan Luo, Tobias Kreiman, You Liang Tan, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. InProceedings of Robotics: Science and Systems, Delft, Netherl...

  21. [21]

    An algorithmic perspective on imitation learning.Foundations and Trends® in Robotics, 7(1-2):1–179, 2018

    Takayuki Osa, Joni Pajarinen, Gerhard Neumann, J Andrew Bagnell, Pieter Abbeel, Jan Peters, et al. An algorithmic perspective on imitation learning.Foundations and Trends® in Robotics, 7(1-2):1–179, 2018. 1

  22. [22]

    Imitating human behaviour with diffusion models

    Tim Pearce, Tabish Rashid, Anssi Kanervisto, Dave Bignell, Mingfei Sun, Raluca Georgescu, Sergio Valcarcel Macua, Shan Zheng Tan, Ida Momennejad, Katja Hofmann, and Sam Devlin. Imitating human behaviour with diffusion models. In The Eleventh International Conference on Learning Repre- sentations, 2023. 1

  23. [23]

    FiLM: Visual Reasoning with a General Conditioning Layer.Proceedings of the AAAI Conference on Artificial Intelligence, 32(1), 2018

    Ethan Perez, Florian Strub, Harm de Vries, Vincent Du- moulin, and Aaron Courville. FiLM: Visual Reasoning with a General Conditioning Layer.Proceedings of the AAAI Conference on Artificial Intelligence, 32(1), 2018. 4

  24. [24]

    Pointnet: Deep learning on point sets for 3d classification and segmentation

    Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 652–660,

  25. [25]

    Compose by focus: Scene graph-based atomic skills, 2025

    Han Qi, Changhe Chen, and Heng Yang. Compose by focus: Scene graph-based atomic skills, 2025. 2, 3, 8

  26. [26]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InInternational Conference on Machine Learning, 2021. 4

  27. [27]

    Learnt Sparsification for Inter- pretable Graph Neural Networks, 2021

    Mandeep Rathee, Zijian Zhang, Thorben Funke, Megha Khosla, and Avishek Anand. Learnt Sparsification for Inter- pretable Graph Neural Networks, 2021. arXiv:2106.12920 [cs]. 3

  28. [28]

    Goal conditioned imitation learning using score- based diffusion policies

    Moritz Reuss, Maximilian Li, Xiaogang Jia, and Rudolf Li- outikov. Goal conditioned imitation learning using score- based diffusion policies. InRobotics: Science and Systems,

  29. [29]

    Multimodal diffusion transformer: Learn- ing versatile behavior from multimodal goals

    Moritz Reuss, ¨Omer Erdinc ¸ Ya˘gmurlu, Fabian Wenzel, and Rudolf Lioutikov. Multimodal diffusion transformer: Learn- ing versatile behavior from multimodal goals. InRobotics: Science and Systems, 2024. 1, 3, 4

  30. [30]

    FLOWER: Democratizing generalist robot policies with efficient vision- language-flow models

    Moritz Reuss, Hongyi Zhou, Marcel R ¨uhle, ¨Omer Erdinc ¸ Ya˘gmurlu, Fabian Otto, and Rudolf Lioutikov. FLOWER: Democratizing generalist robot policies with efficient vision- language-flow models. In9th Annual Conference on Robot Learning, 2025. 1

  31. [31]

    Behavior transformers: Cloning kmodes with one stone.Advances in neural information processing systems, 35:22955–22968, 2022

    Nur Muhammad Shafiullah, Zichen Cui, Ariuntuya Arty Al- tanzaya, and Lerrel Pinto. Behavior transformers: Cloning kmodes with one stone.Advances in neural information processing systems, 35:22955–22968, 2022. 1

  32. [32]

    Graph-structured visual im- itation

    Maximilian Sieb, Zhou Xian, Audrey Huang, Oliver Kroe- mer, and Katerina Fragkiadaki. Graph-structured visual im- itation. InConference on Robot Learning, pages 979–989. PMLR, 2020. 3

  33. [33]

    Scene graph contrastive learning for embodied navigation

    Kunal Pratap Singh, Jordi Salvador, Luca Weihs, and Aniruddha Kembhavi. Scene graph contrastive learning for embodied navigation. InProceedings of the IEEE/CVF In- ternational Conference on Computer Vision, pages 10884– 10894, 2023. 2

  34. [34]

    Maximum likelihood training of score-based diffusion mod- els.Advances in neural information processing systems, 34: 1415–1428, 2021

    Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of score-based diffusion mod- els.Advances in neural information processing systems, 34: 1415–1428, 2021. 1

  35. [35]

    3d-oes: Viewpoint-invariant object-factorized environment simulators, 2020

    Hsiao-Yu Fish Tung, Zhou Xian, Mihir Prabhudesai, Shamit Lal, and Katerina Fragkiadaki. 3d-oes: Viewpoint-invariant object-factorized environment simulators, 2020. 2, 3

  36. [36]

    Attention is all you need.Advances in Neural Information Processing Systems, 2017

    A Vaswani. Attention is all you need.Advances in Neural Information Processing Systems, 2017. 1, 4

  37. [37]

    Instant policy: In- context imitation learning via graph diffusion

    Vitalis V osylius and Edward Johns. Instant policy: In- context imitation learning via graph diffusion. InThe Thir- teenth International Conference on Learning Representa- tions, 2025. 2, 3

  38. [38]

    Learning to Reduce the Scale of Large Graphs: A Comprehensive Survey.ACM Trans

    Hongjia Xu, Liangliang Zhang, Yao Ma, Sheng Zhou, Zhuo- nan Zheng, and Jiajun Bu. Learning to Reduce the Scale of Large Graphs: A Comprehensive Survey.ACM Trans. Knowl. Discov. Data, 19(5):101:1–101:25, 2025. 3

  39. [39]

    Sparse Graph Attention Networks

    Yang Ye and Shihao Ji. Sparse Graph Attention Networks. IEEE Transactions on Knowledge and Data Engineering, 35 (1):905–916, 2023. 3, 1

  40. [40]

    A Recipe for Causal Graph Regression: Confounding Effects Revisited

    Yujia Yin, Tianyi Qu, Zihao Wang, and Yifan Chen. A Recipe for Causal Graph Regression: Confounding Effects Revisited. 2025. 3, 1

  41. [41]

    Hierarchical Graph Rep- resentation Learning with Differentiable Pooling

    Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, Will Hamilton, and Jure Leskovec. Hierarchical Graph Rep- resentation Learning with Differentiable Pooling. InAd- vances in Neural Information Processing Systems. Curran Associates, Inc., 2018. 3

  42. [42]

    Graph Sparsifi- cation via Mixture of Graphs, 2024

    Guibin Zhang, Xiangguo Sun, Yanwei Yue, Chonghe Jiang, Kun Wang, Tianlong Chen, and Shirui Pan. Graph Sparsifi- cation via Mixture of Graphs, 2024. arXiv:2405.14260 [cs]. 3

  43. [43]

    Robust Graph Representation Learning via Neural Sparsification

    Cheng Zheng, Bo Zong, Wei Cheng, Dongjin Song, Jingchao Ni, Wenchao Yu, Haifeng Chen, and Wei Wang. Robust Graph Representation Learning via Neural Sparsification. In Proceedings of the 37th International Conference on Ma- chine Learning, pages 11458–11468. PMLR, 2020. ISSN: 2640-3498. 3

  44. [44]

    robosuite: A Modular Simulation Framework and Benchmark for Robot Learning

    Yuke Zhu, Josiah Wong, Ajay Mandlekar, Roberto Mart ´ın- Mart´ın, Abhishek Joshi, Kevin Lin, Soroush Nasiriany, and Yifeng Zhu. robosuite: A modular simulation frame- work and benchmark for robot learning. InarXiv preprint arXiv:2009.12293, 2020. 2

  45. [45]

    Hierarchical planning for long-horizon manipulation with geometric and symbolic scene graphs

    Yifeng Zhu, Jonathan Tremblay, Stan Birchfield, and Yuke Zhu. Hierarchical planning for long-horizon manipulation with geometric and symbolic scene graphs. In2021 IEEE In- ternational Conference on Robotics and Automation (ICRA), pages 6541–6548. IEEE, 2021. 2, 3

  46. [46]

    Rt-2: Vision-language-action models transfer web knowledge to robotic control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. InConference on Robot Learning, pages 2165–2183. PMLR, 2023. 1 SIR: Structured Image Representations for Explainable Robot Learning Supplementary Material

  47. [47]

    Additional information Source code will be made publicly available upon accep- tance. Our GNN implementation employs residual connec- tions and normalization to mitigate oversmoothing, a phe- nomenon to which small, fully-connected graphs are partic- ularly susceptible. 8.1. Pre-Trained Models The pre-trained models for generating Cropped-Image- Features ...

  48. [48]

    How- ever, when combining Cropped-Image-Features with BB- Coordinates, we use a smaller ResNet8 backbone with an embedding dimension of 37

    backbone with an embedding dimension of 256. How- ever, when combining Cropped-Image-Features with BB- Coordinates, we use a smaller ResNet8 backbone with an embedding dimension of 37. This lower dimension was chosen specifically to match the size of the object label vec- tors (one-hot encoded length of 37), allowing the visual fea- tures to serve as a di...

  49. [49]

    Ad- ditionally, we provide fine-grained performance metrics for Table 4

    Additional Evaluation Results The following sections present results for SIR using BC, generated graphs and the CALVIN benchmark [18]. Ad- ditionally, we provide fine-grained performance metrics for Table 4. Ablation results on RoboCasa using generated graphs, which are predicted using a fine-tuned DETR model. Models are not trained on these generated gra...

  50. [50]

    Additional Explainability Results We further present additional explainability results for SIR on 6 tasks in RoboCasa, not present in the main paper, as well as a Grad-CAM visualization of the image baseline models. 10.1. Sub-Graph Statistics The generated explanations by the different models are pre- sented in Tab. 17, Tab. 18, Tab. 19, Tab. 20, Tab. 21,...