RoboMD: Uncovering Robot Vulnerabilities through Semantic Potential Fields

Dieter Fox; Heni Ben Amor; Jiafei Duan; Ransalu Senanayake; Som Sagar; Sreevishakh Vasudevan; Yifan Zhou

arxiv: 2412.02818 · v4 · pith:V7OM6JV2new · submitted 2024-12-03 · 💻 cs.RO · cs.LG

Som Sagar, Jiafei Duan, Sreevishakh Vasudevan, Yifan Zhou, Heni Ben Amor, Dieter Fox, Ransalu Senanayake

Pith reviewed 2026-05-23 07:46 UTC · model grok-4.3

classification 💻 cs.RO cs.LG

keywords robot manipulation · vulnerability detection · vision-language embeddings · deep reinforcement learning · potential fields · robot safety · manipulation policies · semantic embeddings

The pith

A reinforcement learning policy on a vision-language embedding treated as a potential field uncovers up to 23% more unique robot manipulation vulnerabilities than baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Robot manipulation policies are prone to failure under real-world variations, yet identifying those variations through direct physical testing is costly and unsafe. The paper trains a separate deep RL policy that treats a continuous vision-language embedding as a potential field, moving toward regions that cause failures while avoiding successful ones. This policy is learned entirely from virtual rollouts using limited success-failure data. Experiments on simulation benchmarks and a physical robot arm show the approach reveals more subtle vulnerabilities than existing vision-language methods. The resulting vulnerability map also supports fine-tuning the original policy with reduced data.

Core claim

The central claim is that treating a vision-language embedding space as a semantic potential field allows a deep RL vulnerability prediction policy, trained on virtual runs, to scalably locate failure-prone regions for a target manipulation policy, producing a probabilistic vulnerability-likelihood map that identifies up to 23% more unique vulnerabilities than state-of-the-art baselines while also improving the original policy through targeted fine-tuning.

What carries the argument

The semantic potential field formed by the continuous vision-language embedding, which guides the deep RL vulnerability prediction policy to navigate toward failure regions.
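
As a rough illustration of the potential-field idea, the sketch below scores a point in embedding space by its attraction to known failure embeddings and its repulsion from known success embeddings, then rewards a step by the change in that score. The Gaussian kernel, the `bandwidth` parameter, and all function names are assumptions made for illustration; this summary does not state the reward the paper actually uses.

```python
import math

def _kernel(a, b, bandwidth):
    """Gaussian similarity between two embedding vectors (assumed kernel)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def potential(z, failure_embs, success_embs, bandwidth=1.0):
    """Hypothetical potential: high near failures, low near successes."""
    attract = sum(_kernel(z, f, bandwidth) for f in failure_embs)
    repel = sum(_kernel(z, s, bandwidth) for s in success_embs)
    return attract - repel

def step_reward(z_prev, z_next, failure_embs, success_embs, bandwidth=1.0):
    """Reward one step of the search policy by the change in potential."""
    return (potential(z_next, failure_embs, success_embs, bandwidth)
            - potential(z_prev, failure_embs, success_embs, bandwidth))

# Toy 2-D embeddings (made up): one failure region, one success region.
failures = [(1.0, 1.0)]
successes = [(-1.0, -1.0)]
toward_failure = step_reward((0.0, 0.0), (0.5, 0.5), failures, successes)
toward_success = step_reward((0.0, 0.0), (-0.5, -0.5), failures, successes)
```

In this toy setup a step toward the failure embedding earns a positive reward and a step toward the success embedding earns a negative one, which is the qualitative behaviour the potential-field framing describes.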

If this is right

The vulnerability prediction policy enables scalable analysis without expensive or unsafe physical trials.
Querying the policy produces a probabilistic map of vulnerability likelihood across the embedding space.
Fine-tuning the original manipulation policy on the discovered vulnerabilities improves performance with substantially less data.
The method reveals subtle vulnerabilities that heuristic testing overlooks.
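
One way to picture the probabilistic map mentioned above is to normalise the scores a trained policy assigns to candidate environment variations into likelihoods. The sketch below uses a softmax for that step. The scores, the variation names, and the choice of softmax are assumptions for illustration, not the paper's procedure.

```python
import math

def vulnerability_map(action_scores):
    """Softmax-normalise per-variation scores into likelihoods that sum to 1."""
    peak = max(action_scores.values())  # subtract the max for numerical stability
    exp_scores = {name: math.exp(s - peak) for name, s in action_scores.items()}
    total = sum(exp_scores.values())
    return {name: v / total for name, v in exp_scores.items()}

# Hypothetical scores a trained policy might assign to three variations.
scores = {"red table": 2.0, "green cube": 0.5, "small cube": 1.0}
likelihoods = vulnerability_map(scores)

# Variations ordered from most to least likely to induce a failure.
ranked = sorted(likelihoods, key=likelihoods.get, reverse=True)
```

A ranking like `ranked` is the kind of output that could then guide which variations to collect fine-tuning data for first.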

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The separation of vulnerability discovery into its own policy could allow the testing strategy to be refined independently of the manipulation policy being evaluated.
Because the embedding serves as a proxy space, the same framework could be applied to other embodied tasks where direct variation sampling is expensive.
If the embedding captures additional modalities, the potential-field approach might locate vulnerabilities arising from combined visual and language perturbations.

Load-bearing premise

The vision-language embedding trained on limited success-failure data contains enough semantic and visual variation to act as a potential field that reliably separates vulnerable regions from successful ones.

What would settle it

A controlled test on a held-out manipulation task or physical robot where the framework finds no more unique vulnerabilities than the vision-language baselines, or where the vulnerability map shows no correlation with measured failure rates.
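
The correlation check described above could be run along the following lines: compare predicted vulnerability likelihoods against measured failure rates with a rank correlation. This is a minimal sketch with made-up numbers; the helper names are hypothetical, and the simple ranking assumes no tied values.

```python
import math

def ranks(values):
    """Rank positions (0 = smallest); assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = float(rank)
    return result

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Made-up example: predicted likelihoods vs. failure rates measured in trials.
predicted = [0.60, 0.25, 0.10, 0.05]
measured = [0.80, 0.50, 0.30, 0.10]
rho = spearman(predicted, measured)
rho_reversed = spearman(predicted, list(reversed(measured)))
```

A value of `rho` near zero on real data would be the "no correlation" outcome that counts against the framework.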

Figures

Figures reproduced from arXiv: 2412.02818 by Dieter Fox, Heni Ben Amor, Jiafei Duan, Ransalu Senanayake, Som Sagar, Sreevishakh Vasudevan, Yifan Zhou.

**Figure 2.** RoboMD Framework: (1) A PPO-based deep RL agent identifies configurations most likely to induce failures by …

**Figure 3.** The pipeline illustrates how rollouts with disruptions (e.g., object or lighting changes) are processed to learn meaningful …

**Figure 4.** Continuous Action Space Exploration. The diagram …

**Figure 5.** Some environment variations for both simulation and real-world evaluation.

**Figure 6.** Action diversity across RL algorithms. The X-axis …

**Figure 7.** Individual FM analysis of multiple models. Each radar plot represents the failure likelihood of a specific action. The …

**Figure 9.** Confusion matrices of embeddings trained using a) …

**Figure 8.** Failure distribution before and after fine-tuning “Lift” …

**Figure 10.** Scenes from experiments on real world robot

**Figure 11.** Scenes from experiments on Robosuite. C. Baselines: To validate the effectiveness of our method, we compared it against two categories of baselines: Reinforcement Learning (RL) baselines and Vision-Language Model (VLM) baselines. Below, we detail their implementation, hyperparameters, and specific configurations. 1) Reinforcement Learning (RL) Baselines: The RL baselines were implemented using well-establi…

**Figure 12.** The order in which the confusion matrix is a) Image Encoder + BCE b) Image + Text Encoder + BCE loss c) Image …

**Figure 13.** Testing Robustness Under Visual Perturbations: Successful Rollout in Training vs. Failure Induced by Red Table

**Figure 14.** Performance comparison of behavior cloning (BC) and diffusion-based policies on the Lift task before and after …

**Figure 8.**
1) Change cube color to red
2) Change cube color to green
3) Change cube color to blue
4) Change cube color to gray
5) Change table color to green
6) Change table color to blue
7) Change table color to red
8) Change table color to gray
9) Resize table to (0.8, 0.2, 0.025)
10) Resize table to (0.2, 0.8, 0.025)
11) Resize cube to (0.04, 0.04, 0.04)
12) Resize cube to (0.01, 0.01, 0.01)
13) Resize cube to (0.…

**Figure 15.** kNN Accuracy Drop with Increasing k in Continuous …

**Figure 17.** BC lift finetuned on a combined dataset of 12 different …

**Figure 16.** Training loss for training action representations

**Figure 18.** Environmental and Object Perturbations on Manipulation Tasks

Original abstract

Robot manipulation policies, while central to the promise of physical AI, are highly vulnerable in the presence of external variations in the real world. Diagnosing these vulnerabilities is hindered by two key challenges: (i) the relevant variations to test against are often unknown, and (ii) direct testing in the real world is costly and unsafe. We introduce a framework that tackles both issues by learning a separate deep reinforcement learning (deep RL) policy for vulnerability prediction through virtual runs on a continuous vision-language embedding trained with limited success-failure data. By treating this embedding space, which is rich in semantic and visual variations, as a potential field, the policy learns to move toward vulnerable regions while being repelled from success regions. This vulnerability prediction policy, trained on virtual rollouts, enables scalable and safe vulnerability analysis without expensive physical trials. By querying this policy, our framework builds a probabilistic vulnerability-likelihood map. Experiments across simulation benchmarks and a physical robot arm show that our framework uncovers up to 23% more unique vulnerabilities than state-of-the-art vision-language baselines, revealing subtle vulnerabilities overlooked by heuristic testing. Additionally, we show that fine-tuning the manipulation policy with the vulnerabilities discovered by our framework improves manipulation performance with much less fine-tuning data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper frames vulnerability search in robot policies as RL navigation inside a vision-language embedding treated as a potential field, but provides no evidence that the embedding actually separates vulnerable regions.

The main contribution is treating a vision-language embedding—built from limited success-failure pairs—as a continuous potential field, then training a separate deep RL policy to navigate toward vulnerable states for a target manipulation policy. This produces a probabilistic vulnerability map from virtual rollouts only. They report up to 23% more unique vulnerabilities than vision-language baselines on simulation benchmarks and a physical arm, plus improved fine-tuning with less data afterward.

Referee Report

2 major / 0 minor

Summary. The paper introduces RoboMD, a framework that trains a separate deep RL policy to discover vulnerabilities in robot manipulation policies. It does so by treating a continuous vision-language embedding—trained on limited success-failure data—as a semantic potential field that attracts the policy toward vulnerable regions and repels it from successful ones. Virtual rollouts of this policy are used to construct a probabilistic vulnerability map. Experiments on simulation benchmarks and a physical arm reportedly uncover up to 23% more unique vulnerabilities than vision-language baselines and enable more data-efficient fine-tuning of the original policy.

Significance. If the embedding truly encodes sufficient semantic and visual variation to form a separating potential field, the approach would provide a scalable, simulation-only method for identifying subtle vulnerabilities that heuristic testing misses, reducing reliance on costly or unsafe physical trials and improving robustness of manipulation policies.

major comments (2)

[Abstract] The central claim that the vision-language embedding 'is rich in semantic and visual variations' and can serve as a potential field that 'meaningfully separates vulnerable from successful regions' rests on an unverified representational assumption; no architecture, training objective, gradient validation, or coverage analysis is supplied to support that the limited success-failure data produces the required separating structure.
[Abstract] The reported 'up to 23% more unique vulnerabilities' is presented without any description of how vulnerabilities are quantified, what statistical significance tests were used, how the state-of-the-art vision-language baselines were implemented, or the data-exclusion rules applied; these omissions make the quantitative improvement impossible to evaluate and render the experimental claim load-bearing but unsupported.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will make revisions to improve clarity and support for the claims in the abstract.

Referee: [Abstract] The central claim that the vision-language embedding 'is rich in semantic and visual variations' and can serve as a potential field that 'meaningfully separates vulnerable from successful regions' rests on an unverified representational assumption; no architecture, training objective, gradient validation, or coverage analysis is supplied to support that the limited success-failure data produces the required separating structure.

Authors: We agree that the abstract's claim would benefit from explicit support. The manuscript body details the embedding architecture and contrastive training on success-failure pairs, but we will revise to briefly incorporate these elements into the abstract, add a gradient validation figure or analysis showing directional separation, and include coverage metrics demonstrating that the limited data yields the required structure. This addresses the unverified assumption directly.
Revision: yes

Referee: [Abstract] The reported 'up to 23% more unique vulnerabilities' is presented without any description of how vulnerabilities are quantified, what statistical significance tests were used, how the state-of-the-art vision-language baselines were implemented, or the data-exclusion rules applied; these omissions make the quantitative improvement impossible to evaluate and render the experimental claim load-bearing but unsupported.

Authors: We concur that the abstract requires these methodological details for the quantitative result to be evaluable. In revision, we will expand the abstract to concisely define unique vulnerabilities (distinct failure modes via embedding-space clustering), note the statistical tests applied, summarize baseline implementations, and state the data-exclusion criteria. Corresponding expansions will also appear in the results section to ensure the claim is fully supported.
Revision: yes
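
To make the "unique vulnerabilities via embedding-space clustering" definition in the response above concrete, the sketch below counts failures that fall into distinct clusters and computes a relative gain over a baseline. The greedy radius rule, the `radius` value, and the toy embeddings are assumptions for illustration; neither the clustering method nor any numbers here come from the paper.

```python
import math

def count_unique(failure_embs, radius):
    """Greedy clustering: a failure is new if no kept centre lies within radius."""
    centres = []
    for emb in failure_embs:
        if all(math.dist(emb, c) > radius for c in centres):
            centres.append(emb)
    return len(centres)

def relative_gain(ours, baseline):
    """Percentage by which `ours` exceeds `baseline`."""
    return 100.0 * (ours - baseline) / baseline

# Made-up failure embeddings found by two hypothetical methods.
method_found = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (9.0, 0.0)]
baseline_found = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0)]

n_method = count_unique(method_found, radius=1.0)
n_baseline = count_unique(baseline_found, radius=1.0)
gain = relative_gain(n_method, n_baseline)
```

The count depends on `radius` and on the order in which failures are visited, which is one reason the referee's request for an explicit definition matters.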

Circularity Check

0 steps flagged

No circularity: framework uses separately trained embedding and RL policy on virtual rollouts

The paper presents an empirical framework that trains a vision-language embedding on limited success-failure data, treats the resulting space as a potential field by construction of the method, and trains a separate deep RL policy on virtual rollouts within that space to produce a vulnerability map. No equations, derivations, or self-citations are shown that reduce any claimed prediction or result to its own inputs by definition. Experimental claims (e.g., 23% more vulnerabilities) rest on benchmark comparisons rather than fitted parameters renamed as predictions or self-referential uniqueness theorems. The approach is self-contained as a methodological pipeline without load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Central claim depends on the vision-language embedding containing the relevant variations for policy failures and on virtual rollouts being representative of real-world vulnerabilities; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

domain assumption: The vision-language embedding space is rich in semantic and visual variations that align with manipulation policy vulnerabilities.
Invoked when treating the embedding as a potential field for the vulnerability prediction policy.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

RePO-VLA: Recovery-Driven Policy Optimization for Vision-Language-Action Models
cs.RO · 2026-05 · unverdicted · novelty 6.0

RePO-VLA raises average adversarial success rates in VLA manipulation from 20% to 75% by using recovery-aware initialization, a progress-aware semantic value function, and value-conditioned refinement on success and c...

Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · cited by 1 Pith paper · 16 internal anchors

[1]

The colosseum: A benchmark for evaluating generalization for robotic manipulation

Wilbert Pumacay, Ishika Singh, Jiafei Duan, Ranjay Kr- ishna, Jesse Thomason, and Dieter Fox. The colosseum: A benchmark for evaluating generalization for robotic manipulation. arXiv preprint arXiv:2402.08191 , 2024. URL https://arxiv.org/pdf/2402.08191

work page arXiv 2024
[2]

Decomposing the generalization gap in imitation learn- ing for visual robotic manipulation

Annie Xie, Lisa Lee, Ted Xiao, and Chelsea Finn. Decomposing the generalization gap in imitation learn- ing for visual robotic manipulation. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 3153–3160. IEEE, 2024. URL https: //arxiv.org/abs/2307.03659

work page arXiv 2024
[3]

Data scaling laws in im- itation learning for robotic manipulation

Fanqi Lin, Yingdong Hu, Pingyue Sheng, Chuan Wen, Jiacheng You, and Yang Gao. Data scaling laws in im- itation learning for robotic manipulation. arXiv preprint arXiv:2410.18647, 2024. URL https://arxiv.org/abs/2410. 18647

work page arXiv 2024
[4]

The role of predictive uncertainty and diversity in embodied ai and robot learning

Ransalu Senanayake. The role of predictive uncertainty and diversity in embodied ai and robot learning. arXiv preprint arXiv:2405.03164, 2024. URL https://arxiv.org/ pdf/2405.03164

work page arXiv 2024
[5]

AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation

Jiafei Duan, Wilbert Pumacay, Nishanth Kumar, Yi Ru Wang, Shulin Tian, Wentao Yuan, Ranjay Krishna, Di- eter Fox, Ajay Mandlekar, and Yijie Guo. AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation. arXiv preprint arXiv:2410.00371, 2024. URL https://arxiv.org/pdf/2410. 00371

work page arXiv 2024
[6]

Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, page 02783649241273668, 2023. URL https: //arxiv.org/abs/2303.04137

work page internal anchor Pith review Pith/arXiv arXiv 2023
[7]

Probabilistic robotics

Sebastian Thrun. Probabilistic robotics. Communications of the ACM , 45(3):52–57, 2002. URL https://docs.ufpr. br/∼danielsantos/ProbabilisticRobotics.pdf

work page 2002
[8]

Fast Gaussian Process Occupancy Maps

Simon T O’Callaghan and Fabio T Ramos. Gaussian process occupancy maps. The International Journal of Robotics Research , 31(1):42–62, 2012. URL https: //arxiv.org/pdf/1811.10156

work page internal anchor Pith review Pith/arXiv arXiv 2012
[9]

What uncertainties do we need in bayesian deep learning for computer vision? Advances in neural information processing systems , 30,

Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer vision? Advances in neural information processing systems , 30,

work page
[10]

URL https://arxiv.org/pdf/1703.04977

work page internal anchor Pith review Pith/arXiv arXiv
[11]

On the importance of exploration for generalization in re- inforcement learning

Yiding Jiang, J Zico Kolter, and Roberta Raileanu. On the importance of exploration for generalization in re- inforcement learning. Advances in Neural Information Processing Systems, 36, 2024. URL https://arxiv.org/pdf/ 2306.05483

work page arXiv 2024
[12]

A bayesian approach to generative adversarial imitation learning

Wonseok Jeon, Seokin Seo, and Kee-Eung Kim. A bayesian approach to generative adversarial imitation learning. Advances in neural information processing systems , 31, 2018. URL https://papers.nips.cc/paper files/paper/2018/file/ 943aa0fcda4ee2901a7de9321663b114-Paper.pdf

work page 2018
[13]

Safe imitation learning via fast bayesian reward inference from preferences

Daniel Brown, Russell Coleman, Ravi Srinivasan, and Scott Niekum. Safe imitation learning via fast bayesian reward inference from preferences. In International Con- ference on Machine Learning, pages 1165–1177. PMLR,

work page
[14]

URL https://papers.nips.cc/paper files/paper/2018/ file/943aa0fcda4ee2901a7de9321663b114-Paper.pdf

work page 2018
[15]

Bayesian In- verse Reinforcement Learning

Deepak Ramachandran and Eyal Amir. Bayesian In- verse Reinforcement Learning. In IJCAI, volume 7, pages 2586–2591, 2007. URL https://www.ijcai.org/ Proceedings/07/Papers/416.pdf

work page 2007
[16]

Unpacking failure modes of generative policies: Runtime monitoring of consistency and progress

Christopher Agia, Rohan Sinha, Jingyun Yang, Zi-ang Cao, Rika Antonova, Marco Pavone, and Jeannette Bohg. Unpacking failure modes of generative policies: Runtime monitoring of consistency and progress. arXiv preprint arXiv:2410.04640, 2024. URL https://arxiv.org/pdf/2410. 04640

work page arXiv 2024
[17]

Interactive Semantic Interventions for VLMs: A Human- in-the-Loop Investigation of VLM Failure

Lukas Klein, Kenza Amara, Carsten T L ¨uth, Hendrik Strobelt, Mennatallah El-Assady, and Paul F Jaeger. Interactive Semantic Interventions for VLMs: A Human- in-the-Loop Investigation of VLM Failure. In Neurips Safe Generative AI Workshop 2024 , 2024. URL https: //openreview.net/pdf?id=3kMucCYhYN

work page 2024
[18]

Decider: Leveraging foundation model priors for improved model failure detection and explanation

Rakshith Subramanyam, Kowshik Thopalli, Vivek Narayanaswamy, and Jayaraman J Thiagarajan. Decider: Leveraging foundation model priors for improved model failure detection and explanation. In European Con- ference on Computer Vision , pages 465–482. Springer,

work page
[19]

URL https://arxiv.org/pdf/2408.00331

work page arXiv
[20]

Reflect: Summarizing robot experiences for failure explanation and correction

Zeyi Liu, Arpit Bahety, and Shuran Song. Reflect: Summarizing robot experiences for failure explanation and correction. arXiv preprint arXiv:2306.15724 , 2023. URL https://arxiv.org/abs/2306.15724

work page arXiv 2023
[22]

URL https://arxiv.org/pdf/2406.07145

work page arXiv
[23]

How do we fail? stress testing perception in autonomous vehicles

Harrison Delecki, Masha Itkina, Bernard Lange, Ransalu Senanayake, and Mykel J Kochenderfer. How do we fail? stress testing perception in autonomous vehicles. In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , pages 5139–5146. IEEE,

work page 2022
[24]

URL https://arxiv.org/pdf/2203.14155

work page arXiv
[25]

Curiosity-driven red- teaming for large language models

Zhang-Wei Hong, Idan Shenfeld, Tsun-Hsuan Wang, Yung-Sung Chuang, Aldo Pareja, James Glass, Akash Srivastava, and Pulkit Agrawal. Curiosity-driven red- teaming for large language models. arXiv preprint arXiv:2402.19464, 2024. URL https://arxiv.org/pdf/2402. 19464

work page arXiv 2024
[26]

A survey of algorithms for black-box safety validation of cyber-physical systems

Anthony Corso, Robert Moss, Mark Koren, Ritchie Lee, and Mykel Kochenderfer. A survey of algorithms for black-box safety validation of cyber-physical systems. Journal of Artificial Intelligence Research , 72:377–428,

work page
[27]

URL https://arxiv.org/pdf/2005.02979

work page arXiv 2005
[28]

Out-of-distribution detection for automotive perception

Julia Nitsch, Masha Itkina, Ransalu Senanayake, Juan Nieto, Max Schmidt, Roland Siegwart, Mykel J Kochen- derfer, and Cesar Cadena. Out-of-distribution detection for automotive perception. In 2021 IEEE International Intelligent Transportation Systems Conference (ITSC) , pages 2938–2943. IEEE, 2021. URL https://arxiv.org/ pdf/2011.01413

work page arXiv 2021
[29]

SAFE: Sensitivity-aware features for out-of-distribution object detection

Samuel Wilson, Tobias Fischer, Feras Dayoub, Dimity Miller, and Niko S ¨underhauf. SAFE: Sensitivity-aware features for out-of-distribution object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 23565–23576, 2023. URL https://openaccess.thecvf.com/content/ICCV2023/ papers/Wilson SAFE Sensitivity-Aware Features...

work page 2023
[30]

Pytorch-ood: A library for out-of-distribution detection based on pytorch

Konstantin Kirchheim, Marco Filax, and Frank Ortmeier. Pytorch-ood: A library for out-of-distribution detection based on pytorch. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 4351–4360, 2022. URL https://openaccess.thecvf.com/content/CVPR2022W/ HCIS/papers/Kirchheim PyTorch-OOD A Library for Out-of-Distribut...

work page 2022
[31]

PAGER: A Framework for Failure Analysis of Deep Regression Models

Jayaraman J Thiagarajan, Vivek Narayanaswamy, Puja Trivedi, and Rushil Anirudh. PAGER: A Framework for Failure Analysis of Deep Regression Models. arXiv preprint arXiv:2309.10977, 2023. URL https://arxiv.org/ pdf/2309.10977

work page arXiv 2023
[32]

Sam2act: Integrating visual foundation model with a memory architecture for robotic manipulation

Haoquan Fang, Markus Grotz, Wilbert Pumacay, Yi Ru Wang, Dieter Fox, Ranjay Krishna, and Jiafei Duan. Sam2act: Integrating visual foundation model with a memory architecture for robotic manipulation. arXiv preprint arXiv:2501.18564, 2025

work page arXiv 2025
[33]

RT-1: Robotics Transformer for Real-World Control at Scale

Anthony Brohan, Noah Brown, Justice Carbajal, Yev- gen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022. URL https://arxiv.org/pdf/2212.06817

work page internal anchor Pith review Pith/arXiv arXiv 2022
[35]

URL https://arxiv.org/pdf/2307.15818

work page internal anchor Pith review Pith/arXiv arXiv
[36]

Learning Dexterous In-Hand Manipulation

OpenAI: Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pa- chocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous in-hand manipula- tion. The International Journal of Robotics Research , 39 (1):3–20, 2020. URL https://arxiv.org/pdf/1808.00177

work page internal anchor Pith review Pith/arXiv arXiv 2020
[37]

Human-AI Safety: A Descendant of Generative AI and Control Systems Safety

Andrea Bajcsy and Jaime F Fisac. Human-AI Safety: A Descendant of Generative AI and Control Systems Safety. arXiv preprint arXiv:2405.09794 , 2024. URL https://arxiv.org/pdf/2405.09794

work page arXiv 2024
[38]

Concept: Dynamic Risk Assessment for AI-Controlled Robotic Systems

Philipp Grimmeisen, Friedrich Sautter, and Andrey Morozov. Concept: Dynamic Risk Assessment for AI-Controlled Robotic Systems. arXiv preprint arXiv:2401.14147, 2024. URL https://arxiv.org/pdf/2401. 14147

work page arXiv 2024
[39]

The situation awareness framework for explainable AI (SAFE-AI) and human factors considerations for XAI systems

Lindsay Sanneman and Julie A Shah. The situation awareness framework for explainable AI (SAFE-AI) and human factors considerations for XAI systems. Inter- national Journal of Human–Computer Interaction , 38 (18-20):1772–1788, 2022. URL https://pmc.ncbi.nlm. nih.gov/articles/PMC7338174/

work page 2022
[40]

Failure prediction with statistical guaran- tees for vision-based robot control

Alec Farid, David Snyder, Allen Z Ren, and Anirudha Majumdar. Failure prediction with statistical guaran- tees for vision-based robot control. arXiv preprint arXiv:2202.05894, 2022. URL https://arxiv.org/pdf/2202. 05894

work page arXiv 2022
[41]

Distributionally robust policy learning via adversarial environment gen- eration

Allen Z Ren and Anirudha Majumdar. Distributionally robust policy learning via adversarial environment gen- eration. IEEE Robotics and Automation Letters , 7(2): 1379–1386, 2022. URL https://arxiv.org/pdf/2107.06353

work page arXiv 2022
[42]

Teaser: Fast and certiﬁable point cloud registration,

Heng Yang, Jingnan Shi, and Luca Carlone. Teaser: Fast and certifiable point cloud registration. IEEE Transactions on Robotics , 37(2):314–333, 2020. URL https://arxiv.org/abs/2001.07715

work page arXiv 2020
[43]

Full-Distribution Generalization Bounds for Imitation Learning Policies

Joseph A Vincent, Haruki Nishimura, Masha Itkina, and Mac Schwager. Full-Distribution Generalization Bounds for Imitation Learning Policies. In First Workshop on Out-of-Distribution Generalization in Robotics at CoRL 2023 , 2023. URL https://openreview.net/pdf?id= JZkwYiyy9I

work page 2023
[44]

Minimum-violation LTL Planning with Conflicting Specifications

Jana Tmov, Luis I Reyes Castro, Sertac Karaman, Emilio Frazzoli, and Daniela Rus. Minimum-violation LTL plan- ning with conflicting specifications. In 2013 American Control Conference, pages 200–205. IEEE, 2013. URL https://arxiv.org/pdf/1303.3679

work page internal anchor Pith review Pith/arXiv arXiv 2013
[45]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. URL https://arxiv.org/pdf/1707.06347

work page internal anchor Pith review Pith/arXiv arXiv 2017
[46]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Inter- national Conference on Learning Representations , 2020. URL https://arxiv.org/pdf/2010.11929

work page internal anchor Pith review Pith/arXiv arXiv 2020
[47]

Learning Transferable Visual Models From Natural Language Supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural lan- guage supervision. In International conference on ma- chine learning , pages 8748–8763. PMLR, 2021. URL https://arxiv.org/pdf/2103.00020

work page internal anchor Pith review Pith/arXiv arXiv 2021
[48]

robosuite: A Modular Simulation Framework and Benchmark for Robot Learning

Yuke Zhu, Josiah Wong, Ajay Mandlekar, Roberto Mart´ın-Mart´ın, Abhishek Joshi, Soroush Nasiriany, and Yifeng Zhu. robosuite: A modular simulation frame- work and benchmark for robot learning. arXiv preprint arXiv:2009.12293, 2020. URL https://arxiv.org/abs/2009. 12293

work page internal anchor Pith review Pith/arXiv arXiv 2009
[49]

What Matters in Learning from Offline Human Demonstrations for Robot Manipulation

Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Mart ´ın-Mart´ın. What matters in learning from offline human demon- strations for robot manipulation. In arXiv preprint arXiv:2108.03298, 2021. URL https://arxiv.org/abs/2108. 03298

work page internal anchor Pith review Pith/arXiv arXiv 2021
[50]

MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations

Ajay Mandlekar, Soroush Nasiriany, Bowen Wen, Iretiayo Akinola, Yashraj Narang, Linxi Fan, Yuke Zhu, and Dieter Fox. Mimicgen: A data generation system for scalable robot learning using human demonstrations. In 7th Annual Conference on Robot Learning, 2023. URL https://arxiv.org/abs/2310.17596

[52]

URL https://arxiv.org/abs/1602.01783

[53]

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861–1870. PMLR, 2018. URL https://arxiv.org/abs/1801.01290

[54]

Offline Reinforcement Learning with Implicit Q-Learning

Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit q-learning. arXiv preprint arXiv:2110.06169, 2021. URL https://arxiv.org/abs/2110.06169

[55]

Off-policy deep reinforcement learning without exploration

Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pages 2052–2062. PMLR, 2019. URL https://arxiv.org/abs/1812.02900

[56]

Learning to generalize across long-horizon tasks from human demonstrations

Ajay Mandlekar, Danfei Xu, Roberto Martín-Martín, Silvio Savarese, and Li Fei-Fei. Learning to generalize across long-horizon tasks from human demonstrations. arXiv preprint arXiv:2003.06085, 2020. URL https://arxiv.org/abs/2003.06085

[57]

Modularity through attention: Efficient training and transfer of language-conditioned policies for robot manipulation

Yifan Zhou, Shubham Sonawani, Mariano Phielipp, Simon Stepputtis, and Heni Ben Amor. Modularity through attention: Efficient training and transfer of language-conditioned policies for robot manipulation. arXiv preprint arXiv:2212.04573, 2022. URL https://arxiv.org/abs/2212.04573.

APPENDIX I. EXPERIMENTAL SETUP

A. Real-World Experiment Setup

Real-worl...

Reinforcement Learning (RL) Baselines: The RL baselines were implemented using well-established algorithms, each optimized for the task to ensure a fair comparison. The following RL methods were included:

• Proximal Policy Optimization (PPO): A policy-gradient method known for its stability and efficiency. Key hyperparameters included:
– Learning rate...

Vision-Language Model (VLM) Baselines: The VLM baselines take advantage of the interplay between visual and textual modalities for task representation. We evaluated 3 state-of-the-art VLMs adapted to our task:

• Qwen2-VL

Additionally, we leverage GPT-4o with in-context learning, using five demonstrations. First, we process the output trajectories into videos and compute the appropriate frame rate to generate video sequences equivalent to 15 frames per trajectory pair. These sequences, representing perturbation scenarios, are provided to the VLMs along with a sy...

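The frame-selection step described for the VLM baselines can be illustrated with a small sketch. It assumes that "15 frames" means 15 evenly spaced frames drawn from one rendered trajectory; the function name `sample_frame_indices` and its parameters are hypothetical and do not come from the paper.

```python
def sample_frame_indices(num_frames: int, target: int = 15) -> list[int]:
    """Pick `target` evenly spaced frame indices from a trajectory.

    Hypothetical helper: the paper only states that videos are reduced
    to the equivalent of 15 frames; uniform spacing is an assumption.
    """
    if num_frames <= 0 or target <= 0:
        raise ValueError("num_frames and target must be positive")
    if num_frames <= target:
        # Short trajectory: keep every frame rather than duplicating any.
        return list(range(num_frames))
    if target == 1:
        return [0]
    # Spread indices uniformly from the first frame to the last frame.
    step = (num_frames - 1) / (target - 1)
    return [round(i * step) for i in range(target)]


if __name__ == "__main__":
    # Example: reduce a 100-frame rollout to 15 frames.
    print(sample_frame_indices(100, 15))
```

Under this assumption the effective frame rate is simply the target frame count divided by the clip duration, so longer rollouts are sampled more sparsely.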
• Change cube color to red
• Change cube color to green
• Change cube color to blue
• Change cube color to gray
• Change table color to green
• Change table color to blue
• Change table color to red
• Change table color to gray
• Resize table to (0.8, 0.2, 0.025)
• Resize table to (0.2, 0.8, 0.025)
• Resize cube to (0.04, 0.04, 0.04)
• Resize cube to (0.01, 0.01, 0.01)
• Resize cube to (0.04, 0.01, 0.01)
• Change robot color to red
• Change robot color to green
• Change robot color to cyan
• Change robot color to gray
• Change lighting color to red
• Change lighting color to green
• Change lighting color to blue
• Change lighting color to gray

B. Evaluation

Fig. 12 illustrates the similarity structure of embeddings trained using only Binary Cross-Entropy (BCE) loss, resulting in highly correlated representations. In contrast, the right matrix, trained with a combination of BCE and contrastive loss, demonstrates improved separation, as evidenced by the stronger diago...

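The similarity structure discussed for Fig. 12 can be illustrated with a minimal sketch. It assumes the matrices are pairwise cosine similarities between embeddings, which the text does not confirm; the helper names and the toy vectors are hypothetical and are not the paper's data.

```python
import math


def cosine_similarity_matrix(embeddings):
    """Return the pairwise cosine similarity matrix for a list of vectors."""
    unit = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec))
        if norm == 0.0:
            raise ValueError("cannot normalize a zero vector")
        unit.append([x / norm for x in vec])
    # Dot products of unit vectors are cosine similarities.
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]


def mean_off_diagonal(sim):
    """Average similarity between distinct embeddings (lower = better separated)."""
    n = len(sim)
    if n < 2:
        raise ValueError("need at least two embeddings")
    total = sum(sim[i][j] for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))


if __name__ == "__main__":
    # Toy vectors only: one highly correlated set, one well separated set.
    correlated = [[1.0, 0.9, 0.9], [0.9, 1.0, 0.9], [0.9, 0.9, 1.0]]
    separated = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(mean_off_diagonal(cosine_similarity_matrix(correlated)))
    print(mean_off_diagonal(cosine_similarity_matrix(separated)))
```

A matrix with a strong diagonal and small off-diagonal entries corresponds to a low off-diagonal mean, which is one simple way to quantify the separation the figure is described as showing.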
Semantic Guidance: Textual representations carry rich semantic information that can guide the image backbone. Instead of relying solely on visual cues, the model gains an additional perspective on the underlying concepts (e.g., object names, attributes, or relations).

Improved Discriminative Power: With access to text-based information, the model can differentiate between visually similar classes by leveraging linguistic differences in their corresponding textual descriptions.
