SADP: Subgoal-Aware Diffusion Policy for Explainable Robots Learned from Foundation Model Generated Demonstrations

Site Hu; Takato Horii

arxiv: 2605.16871 · v1 · pith:B7Y4EP6Unew · submitted 2026-05-16 · 💻 cs.RO

Pith reviewed 2026-05-19 20:47 UTC · model grok-4.3

classification 💻 cs.RO

keywords subgoal-aware diffusion policy · explainable robotics · foundation models · imitation learning · robot manipulation · diffusion policies · long-horizon tasks

The pith

Conditioning diffusion policies on foundation-model-generated subgoals improves both task success and explainability for robot manipulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that robots become more explainable when imitation learning models subgoal structure explicitly, and it targets the scarcity of subgoal-level supervision in standard datasets. Foundation models generate subgoal annotations automatically from task demonstrations. A diffusion policy is then trained conditioned on both task and subgoal descriptions, with an auxiliary head that predicts when each subgoal is completed. This built-in interpretability supports progress monitoring and failure diagnosis without sacrificing performance: the paper reports higher task success rates than task-conditioned diffusion baselines.

Core claim

SADP structures policy execution around human-interpretable subgoals by conditioning action generation on both task-level and subgoal-level descriptions from foundation model annotations, and uses a lightweight auxiliary head to predict subgoal completion states, enabling the robot to expose its current execution stage while achieving higher task success rates than task-conditioned diffusion baselines in RLBench simulations and real-world UR5e robot evaluations.

What carries the argument

The Subgoal-Aware Diffusion Policy (SADP) that conditions action generation on task and subgoal descriptions and includes an auxiliary predictor for subgoal completion.
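One way to picture how an auxiliary completion head could sit alongside the diffusion objective is a joint loss: a noise-prediction term plus a weighted completion-classification term. The sketch below is an illustration under stated assumptions, not the authors' implementation. The function names, the binary cross-entropy choice, and the `aux_weight` value are all hypothetical; the page does not report the paper's actual loss or weighting.

```python
import math

def diffusion_loss(pred_noise, true_noise):
    """Mean squared error between predicted and injected noise."""
    return sum((p - t) ** 2 for p, t in zip(pred_noise, true_noise)) / len(true_noise)

def completion_loss(logits, labels):
    """Binary cross-entropy on per-subgoal completion logits (stable form)."""
    total = 0.0
    for z, y in zip(logits, labels):
        # Equivalent to -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))]
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(labels)

def joint_loss(pred_noise, true_noise, logits, labels, aux_weight=0.1):
    """Assumed objective: diffusion term plus a weighted auxiliary term.

    aux_weight is a hypothetical hyperparameter, not a value from the paper.
    """
    return diffusion_loss(pred_noise, true_noise) + aux_weight * completion_loss(logits, labels)

# Toy call: perfect noise prediction, one undecided completion logit.
loss = joint_loss([0.5, -0.5], [0.5, -0.5], logits=[0.0], labels=[1])
```

Keeping the auxiliary term small relative to the diffusion term is one plausible design choice if the head is meant to stay lightweight, but the right balance would have to be checked empirically.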

If this is right

Robots can provide subgoal-level execution signals for real-time progress monitoring.
Failures can be diagnosed by identifying at which subgoal the policy struggles.
Built-in interpretability is achieved alongside improved task performance without post-hoc methods.
Long-horizon manipulation tasks become more tractable due to structured subgoal progression.
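The monitoring and diagnosis implications above can be sketched as a small post-processing step over the completion scores. This is a minimal illustration assuming the head emits one score in [0, 1] per subgoal at each timestep and that subgoals complete in order. The threshold, the function names, and the example subgoal strings are hypothetical.

```python
def current_stage(scores, threshold=0.5):
    """Index of the first subgoal not yet judged complete.

    Returns len(scores) when every subgoal clears the threshold.
    """
    for i, score in enumerate(scores):
        if score < threshold:
            return i
    return len(scores)

def diagnose(score_history, subgoals, threshold=0.5):
    """Name the subgoal a rollout stalled on, or None if all completed.

    score_history holds one list of per-subgoal scores per timestep;
    only the final timestep decides the outcome in this sketch.
    """
    stage = current_stage(score_history[-1], threshold)
    return None if stage == len(subgoals) else subgoals[stage]

# Hypothetical rollout that never finishes the second subgoal.
subgoals = ["reach the drawer handle", "grasp the handle", "pull the drawer open"]
history = [[0.1, 0.0, 0.0], [0.9, 0.2, 0.1], [0.95, 0.4, 0.1]]
stalled_on = diagnose(history, subgoals)
```

A fixed threshold is the simplest possible rule; a deployed monitor would likely need smoothing over time to avoid flicker between stages.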

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future work could explore using these subgoal signals for adaptive replanning when a subgoal fails.
Similar subgoal generation might apply to other policy architectures like transformers for robotics.
This method could help in creating datasets with inherent explainability for training more transparent agents.

Load-bearing premise

Foundation models autonomously generate accurate subgoal annotations from raw task demonstrations without introducing systematic biases or errors.

What would settle it

A direct comparison of policy success rates and explanation accuracy when trained on foundation model generated subgoals versus manually annotated subgoals would show if the automatic annotations degrade performance or mislead monitoring.

Figures

Figures reproduced from arXiv: 2605.16871 by Site Hu, Takato Horii.

**Figure 2.** Subgoal-aware diffusion policy with completion prediction head. The completion prediction head shares the same encoded observation features with …

**Figure 3.** Six tasks in simulation experiment. The tasks cover different temporal …

**Figure 4.** Real-world experiment setup. The red circles indicate the RGB-D …

**Figure 5.** Predicted subgoal completion scores for a representative successful …

**Figure 6.** Predicted subgoal completion scores for a representative failed …

Original abstract

Explainable robots require not only successful task execution but also the ability to expose internal decision-making process in a user-friendly manner. However, most imitation learning methods are trained solely on task-level demonstrations, without explicitly modeling subgoal structure or execution progress. This limitation is further exacerbated by the scarcity of subgoal-level supervision in standard robot learning datasets, which restricts the development of robots that can convey the subtasks they are executing during long-horizon manipulation. To address this issue, this paper proposes Subgoal-Aware Diffusion Policy (SADP), a framework that leverages foundation models to autonomously generate subgoal-annotated demonstrations and trains diffusion policies on these datasets. SADP structures policy execution around human-interpretable subgoals by conditioning action generation on both task-level and subgoal-level descriptions. A lightweight auxiliary head further predicts subgoal completion states, allowing the robot to expose its current execution stage and monitor subgoal progression. Experiments in RLBench simulations and real-world evaluations on a UR5e robot demonstrate that SADP achieves higher task success rates than strong task-conditioned diffusion baselines, while providing subgoal-level execution signals for monitoring progress and diagnosing failures. These results highlight that built-in, rather than post-hoc, interpretability can coexist with high task performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SADP conditions diffusion policies on foundation-model subgoal labels plus an auxiliary completion head, reporting better success rates on RLBench and UR5e while adding built-in progress signals, but the accuracy of those labels is untested.

The main takeaway is that this work takes existing diffusion policies for manipulation and adds subgoal conditioning generated by a foundation model, along with a small auxiliary head that predicts subgoal completion. The result is a policy that produces both actions and human-readable execution stage signals without post-processing. They show higher task success than plain task-conditioned baselines in simulation and on a real UR5e arm. That combination of performance and built-in monitoring is the practical part worth noting. The approach is a direct extension of prior diffusion-policy and subgoal work rather than a big conceptual leap, but the pipeline is straightforward to implement if you already run diffusion models.

The clearest weakness is the lack of any check on the foundation-model subgoal labels themselves. There is no reported human agreement, label noise stats, or ablation that replaces the model labels with human ones. If the generated subgoals contain systematic errors or biases, the conditioning signal becomes unreliable and the reported gains could come from extra input dimensions instead of genuine subgoal structure. The auxiliary head is presented as helping both performance and explainability, yet the summary gives no separate ablation showing its effect on success rate versus its value for monitoring.

This paper is useful for people already working on imitation learning for long-horizon tasks who want a simple way to add progress signals. Readers focused on diffusion policies or interpretable robot learning will get the most out of the empirical results on standard benchmarks plus hardware. It is grounded enough in reproducible setups to deserve a full referee process rather than a desk reject, even though the label-quality assumption needs tighter evidence.

Referee Report

2 major / 3 minor

Summary. The paper proposes Subgoal-Aware Diffusion Policy (SADP), which uses foundation models to autonomously generate subgoal-annotated demonstrations from task-level robot data. It trains a diffusion policy conditioned on both task-level and subgoal-level descriptions, augmented by a lightweight auxiliary head that predicts subgoal completion states to enable monitoring of execution progress. Experiments in RLBench simulations and real-world UR5e robot evaluations report higher task success rates than strong task-conditioned diffusion baselines, while providing subgoal-level signals for interpretability and failure diagnosis.

Significance. If the foundation-model-generated subgoal labels are sufficiently accurate, SADP demonstrates that built-in subgoal conditioning and auxiliary prediction can improve both performance and explainability in long-horizon imitation learning without requiring manually annotated subgoal datasets. This could be a useful direction for making diffusion policies more transparent in robotics applications.

major comments (2)

Abstract and Experiments section: the central performance and explainability claims rest on the assumption that foundation-model-generated subgoal annotations are accurate and unbiased, yet the manuscript provides no quantitative validation such as inter-annotator agreement with humans, label noise statistics, or an ablation replacing FM labels with human-generated ones. Without this, it is unclear whether reported success-rate gains arise from genuine subgoal awareness or from incidental effects of extra conditioning dimensions.
Experiments section: the comparison to task-conditioned diffusion baselines does not include controls for the number of conditioning tokens or the auxiliary head's contribution, making it difficult to isolate whether the subgoal signals themselves drive the observed improvements in RLBench and UR5e tasks.
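The inter-annotator agreement check requested in the first major comment could be computed with Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is a generic implementation, not something taken from the paper. The label strings in the example are made up for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equal in length")
    n = len(labels_a)
    # Fraction of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)

# Hypothetical per-segment subgoal labels from a foundation model and a human.
fm_labels = ["reach", "reach", "grasp", "grasp", "lift", "lift"]
human_labels = ["reach", "grasp", "grasp", "grasp", "lift", "lift"]
kappa = cohens_kappa(fm_labels, human_labels)
```

Here the two annotators agree on five of six segments, and chance agreement is one third, so kappa comes out at 0.75.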

minor comments (3)

Specify the exact foundation model, prompt templates, and post-processing steps used for subgoal annotation so that the generation process can be reproduced.
Clarify the precise architecture and loss weighting of the auxiliary completion head relative to the main diffusion policy.
Add error bars or statistical significance tests to the success-rate tables to support the claim of consistent outperformance.
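For the error bars requested in the last minor comment, a Wilson score interval is a common choice for success rates from a small number of trials. The sketch below is generic and assumes independent trials. The counts in the example are hypothetical and are not results from the paper.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 gives ~95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half

# Hypothetical evaluation: 18 successes in 25 rollouts.
low, high = wilson_interval(18, 25)
```

Reporting such an interval per task makes it visible whether two methods' success rates overlap once trial counts are taken into account.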

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments, which highlight important aspects of validation and experimental design. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

Referee: Abstract and Experiments section: the central performance and explainability claims rest on the assumption that foundation-model-generated subgoal annotations are accurate and unbiased, yet the manuscript provides no quantitative validation such as inter-annotator agreement with humans, label noise statistics, or an ablation replacing FM labels with human-generated ones. Without this, it is unclear whether reported success-rate gains arise from genuine subgoal awareness or from incidental effects of extra conditioning dimensions.

Authors: We agree that direct quantitative validation of the foundation-model-generated subgoal annotations would help substantiate the claims and rule out alternative explanations for the observed gains. In the revised manuscript, we will add a dedicated analysis in the Experiments section (or an appendix) reporting agreement rates between the generated labels and human annotations on a representative sample of demonstrations, along with basic label noise and consistency statistics across repeated FM queries. We will also briefly discuss the prompting strategy and model choice used to generate the annotations. These additions will clarify the reliability of the subgoal labels without altering the core experimental results.

Revision: yes

Referee: Experiments section: the comparison to task-conditioned diffusion baselines does not include controls for the number of conditioning tokens or the auxiliary head's contribution, making it difficult to isolate whether the subgoal signals themselves drive the observed improvements in RLBench and UR5e tasks.

Authors: We acknowledge that the current set of baselines leaves open the possibility that performance differences arise from factors other than the subgoal conditioning itself. In the revised version, we will add two targeted controls: (1) a task-conditioned diffusion baseline augmented with an equivalent number of additional conditioning tokens (e.g., repeated or dummy tokens) to match the token budget used in SADP, and (2) an ablation of SADP that retains subgoal conditioning but removes the auxiliary completion predictor head. These new comparisons will be reported alongside the existing results in the Experiments section to better isolate the contribution of the subgoal-aware components.

Revision: yes
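The token-budget control described in the rebuttal can be sketched as follows: the baseline receives the task tokens plus uninformative padding, so its conditioning sequence has the same length as the task-plus-subgoal sequence. This is an illustration under assumptions. The function names, the `<pad>` token, and the example instructions are hypothetical and do not come from the paper.

```python
def sadp_condition(task_tokens, subgoal_tokens):
    """Conditioning sequence with both task and subgoal descriptions."""
    return list(task_tokens) + list(subgoal_tokens)

def padded_baseline_condition(task_tokens, budget, pad_token="<pad>"):
    """Task-only conditioning padded with dummy tokens up to `budget`."""
    if budget < len(task_tokens):
        raise ValueError("budget is smaller than the task description")
    return list(task_tokens) + [pad_token] * (budget - len(task_tokens))

# Hypothetical task and subgoal descriptions, split on whitespace.
task = "open the top drawer".split()
subgoal = "grasp the handle".split()

with_subgoal = sadp_condition(task, subgoal)
control = padded_baseline_condition(task, budget=len(with_subgoal))
```

If the padded control matches the subgoal-conditioned policy, the gains would point to input size rather than subgoal content; if it does not, the subgoal text itself is carrying information.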

Circularity Check

0 steps flagged

No circularity: empirical pipeline with external validation

The paper describes an empirical method that uses foundation models to generate subgoal annotations, then trains a conditioned diffusion policy with an auxiliary completion head. No equations, fitted parameters, or derivations are presented that reduce claimed performance gains or explainability to inputs by construction. Claims rest on RLBench simulations and UR5e real-robot comparisons against task-conditioned baselines, which are independent measurements. No self-citation chains, uniqueness theorems, or ansatzes imported from prior author work appear as load-bearing steps. The central assumption about FM label quality is a validity concern rather than a definitional or fitted-input circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unverified premise that foundation models produce subgoal labels of sufficient quality and consistency to improve both policy performance and human interpretability; no free parameters or invented physical entities are described in the abstract.

axioms (1)

domain assumption Foundation models can generate accurate subgoal annotations from task demonstrations without systematic bias or error that would harm policy learning.
This premise is required for the generated dataset to be usable; it is invoked when the abstract states that foundation models 'autonomously generate subgoal-annotated demonstrations'.

pith-pipeline@v0.9.0 · 5747 in / 1289 out tokens · 24384 ms · 2026-05-19T20:47:41.639676+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

SADP structures policy execution around human-interpretable subgoals by conditioning action generation on both task-level and subgoal-level descriptions. A lightweight auxiliary head further predicts subgoal completion states

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

Transparent, explainable, and accountable ai for robotics,

S. Wachter, B. Mittelstadt, and L. Floridi, “Transparent, explainable, and accountable ai for robotics,” Science robotics, vol. 2, no. 6, p. eaan6080, 2017

work page 2017

[2] [2]

A review of robot learning for manipulation: Challenges, representations, and algorithms,

O. Kroemer, S. Niekum, and G. Konidaris, “A review of robot learning for manipulation: Challenges, representations, and algorithms,” Journal of machine learning research, vol. 22, no. 30, pp. 1–82, 2021

work page 2021

[3] [3]

A survey of demonstration learning,

A. Correia and L. A. Alexandre, “A survey of demonstration learning,” Robotics and Autonomous Systems, vol. 182, p. 104812, 2024

work page 2024

[4] [4]

Hierarchical reinforce- ment learning: A survey and open research challenges,

M. Hutsebaut-Buysse, K. Mets, and S. Latr ´e, “Hierarchical reinforce- ment learning: A survey and open research challenges,” Machine Learning and Knowledge Extraction, vol. 4, no. 1, pp. 172–221, 2022

work page 2022

[5] [5]

Explainable artificial intelligence: A survey of needs, techniques, applications, and future direction,

M. Mersha, K. Lam, J. Wood, A. K. Alshami, and J. Kalita, “Explainable artificial intelligence: A survey of needs, techniques, applications, and future direction,” Neurocomputing, vol. 599, p. 128111, 2024

work page 2024

[6] [6]

V oxposer: Composable 3d value maps for robotic manipulation with language models,

W. Huang, C. Wang, R. Zhang, Y . Li, J. Wu, and L. Fei-Fei, “V oxposer: Composable 3d value maps for robotic manipulation with language models,” in Conference on Robot Learning. PMLR, 2023, pp. 540–562

work page 2023

[7] [7]

Scaling up and distilling down: Language-guided robot skill acquisition,

H. Ha, P. Florence, and S. Song, “Scaling up and distilling down: Language-guided robot skill acquisition,” in Conference on Robot Learning. PMLR, 2023, pp. 3766–3777

work page 2023

[8] [8]

Tarad: Task-aware robot affordance- centric diffusion policy learned from llm-generated demonstrations,

S. Hu, T. Nagai, and T. Horii, “Tarad: Task-aware robot affordance- centric diffusion policy learned from llm-generated demonstrations,” IEEE Robotics and Automation Letters, 2025

work page 2025

[9] [9]

Rlbench: The robot learning benchmark & learning environment,

S. James, Z. Ma, D. R. Arrojo, and A. J. Davison, “Rlbench: The robot learning benchmark & learning environment,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 3019–3026, 2020

work page 2020

[10] [10]

Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead,

C. Rudin, “Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead,” Nature machine intelligence, vol. 1, no. 5, pp. 206–215, 2019

work page 2019

[11] [11]

Explanation in artificial intelligence: Insights from the social sciences,

T. Miller, “Explanation in artificial intelligence: Insights from the social sciences,” Artificial intelligence, vol. 267, pp. 1–38, 2019

work page 2019

[12] [12]

Explainable agents and robots: Results from a systematic literature review,

S. Anjomshoae, A. Najjar, D. Calvaresi, and K. Fr ¨amling, “Explainable agents and robots: Results from a systematic literature review,” in 18th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2019), Montreal, Canada, May 13–17, 2019. Inter- national Foundation for Autonomous Agents and Multiagent Systems, 2019, pp. 1078–1088

[13]

Explainable autonomous robots: A survey and perspective,

T. Sakai and T. Nagai, “Explainable autonomous robots: A survey and perspective,” Advanced Robotics, vol. 36, no. 5-6, pp. 219–238, 2022

[14]

Improving robot controller transparency through autonomous policy explanation,

B. Hayes and J. A. Shah, “Improving robot controller transparency through autonomous policy explanation,” in Proceedings of the 2017 ACM/IEEE international conference on human-robot interaction, 2017, pp. 303–312

[15]

A tale of two explanations: Enhancing human trust by explaining robot behavior,

M. Edmonds, F. Gao, H. Liu, X. Xie, S. Qi, B. Rothrock, Y. Zhu, Y. N. Wu, H. Lu, and S.-C. Zhu, “A tale of two explanations: Enhancing human trust by explaining robot behavior,” Science Robotics, vol. 4, no. 37, p. eaay4663, 2019

[16]

Explainable autonomous robots in continuous state space based on graph-structured world model,

S. Hu and T. Nagai, “Explainable autonomous robots in continuous state space based on graph-structured world model,” Advanced Robotics, pp. 1–17, 2023

[17]

Adaptive and transparent decision-making in autonomous robots through graph-structured world models,

S. Hu, T. Horii, and T. Nagai, “Adaptive and transparent decision-making in autonomous robots through graph-structured world models,” Advanced Robotics, vol. 38, no. 22, pp. 1579–1599, 2024

[18]

Data-efficient hierarchical reinforcement learning,

O. Nachum, S. S. Gu, H. Lee, and S. Levine, “Data-efficient hierarchical reinforcement learning,” Advances in neural information processing systems, vol. 31, 2018

[19]

Hierarchical planning through goal-conditioned offline reinforcement learning,

J. Li, C. Tang, M. Tomizuka, and W. Zhan, “Hierarchical planning through goal-conditioned offline reinforcement learning,” IEEE Robotics and Automation Letters, vol. 7, no. 4, pp. 10216–10223, 2022

[20]

Hierarchical diffusion for offline decision making,

W. Li, X. Wang, B. Jin, and H. Zha, “Hierarchical diffusion for offline decision making,” in International Conference on Machine Learning. PMLR, 2023, pp. 20035–20064

[21]

Seqvla: Sequential task execution for long-horizon manipulation with completion-aware vision-language-action model,

R. Yang, Z. An, L. Zhou, and Y. Feng, “Seqvla: Sequential task execution for long-horizon manipulation with completion-aware vision-language-action model,” arXiv preprint arXiv:2509.14138, 2025

[22]

Diffusion policy: Visuomotor policy learning via action diffusion,

C. Chi, Z. Xu, S. Feng, E. Cousineau, Y. Du, B. Burchfiel, R. Tedrake, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,” The International Journal of Robotics Research, Oct. 2024

[23]

3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations,

Y. Ze, G. Zhang, K. Zhang, C. Hu, M. Wang, and H. Xu, “3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations,” in Proceedings of Robotics: Science and Systems (RSS), 2024

[24]

3d diffuser actor: Policy diffusion with 3d scene representations,

T.-W. Ke, N. Gkanatsios, and K. Fragkiadaki, “3d diffuser actor: Policy diffusion with 3d scene representations,” in Conference on Robot Learning. PMLR, 2025, pp. 1949–1974

[25]

Do as i can, not as i say: Grounding language in robotic affordances,

A. Brohan, Y. Chebotar, C. Finn, K. Hausman, A. Herzog, D. Ho, J. Ibarz, A. Irpan, E. Jang, R. Julian, et al., “Do as i can, not as i say: Grounding language in robotic affordances,” in Conference on robot learning. PMLR, 2023, pp. 287–318

[26]

Text2motion: From natural language instructions to feasible plans,

K. Lin, C. Agia, T. Migimatsu, M. Pavone, and J. Bohg, “Text2motion: From natural language instructions to feasible plans,” Autonomous Robots, vol. 47, no. 8, pp. 1345–1365, 2023

[27]

Language models as zero-shot planners: Extracting actionable knowledge for embodied agents,

W. Huang, P. Abbeel, D. Pathak, and I. Mordatch, “Language models as zero-shot planners: Extracting actionable knowledge for embodied agents,” in International conference on machine learning. PMLR, 2022, pp. 9118–9147

[28]

Grounding dino: Marrying dino with grounded pre-training for open-set object detection,

S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, Q. Jiang, C. Li, J. Yang, H. Su, et al., “Grounding dino: Marrying dino with grounded pre-training for open-set object detection,” in European Conference on Computer Vision. Springer, 2024, pp. 38–55

[29]

Sam 2: Segment anything in images and videos,

N. Ravi, V. Gabeur, Y.-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. Rädle, C. Rolland, L. Gustafson, et al., “Sam 2: Segment anything in images and videos,” in The Thirteenth International Conference on Learning Representations, 2025

[30]

Copa: General robotic manipulation through spatial constraints of parts with foundation models,

H. Huang, F. Lin, Y. Hu, S. Wang, and Y. Gao, “Copa: General robotic manipulation through spatial constraints of parts with foundation models,” in 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2024, pp. 9488–9495

[31]

Robotgpt: Robot manipulation learning from chatgpt,

Y. Jin, D. Li, A. Yong, J. Shi, P. Hao, F. Sun, J. Zhang, and B. Fang, “Robotgpt: Robot manipulation learning from chatgpt,” IEEE Robotics and Automation Letters, vol. 9, no. 3, pp. 2543–2550, 2024

[32]

Gensim2: Scaling robot data generation with multi-modal and reasoning llms,

P. Hua, M. Liu, A. Macaluso, Y. Lin, W. Zhang, H. Xu, and L. Wang, “Gensim2: Scaling robot data generation with multi-modal and reasoning llms,” in Conference on Robot Learning. PMLR, 2025, pp. 5030–5066

[33]

Openvla: An open-source vision-language-action model,

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. P. Foster, P. R. Sanketi, Q. Vuong, et al., “Openvla: An open-source vision-language-action model,” in Conference on Robot Learning. PMLR, 2025, pp. 2679–2713

[34]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al., “π0: a vision-language-action flow model for general robot control,” arXiv preprint arXiv:2410.24164, 2024

[35]

π0.5: a vision-language-action model with open-world generalization,

K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. R. Equi, C. Finn, N. Fusai, M. Y. Galliker, D. Ghosh, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, D. LeBlanc, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, J. Tanner, Q. Vuong, H. Walke...

[36]

Cot-vla: Visual chain-of-thought reasoning for vision-language-action models,

Q. Zhao, Y. Lu, M. J. Kim, Z. Fu, Z. Zhang, Y. Wu, Z. Li, Q. Ma, S. Han, C. Finn, et al., “Cot-vla: Visual chain-of-thought reasoning for vision-language-action models,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 1702–1713

[37]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763

[38]

Film: Visual reasoning with a general conditioning layer,

E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville, “Film: Visual reasoning with a general conditioning layer,” in Proceedings of the AAAI conference on artificial intelligence, vol. 32, no. 1, 2018, pp. 3942–3951

[39]

Focal loss for dense object detection,

T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988

[40]

GPT-4 Technical Report

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023
