
arxiv: 2412.02818 · v4 · pith:V7OM6JV2new · submitted 2024-12-03 · 💻 cs.RO · cs.LG

RoboMD: Uncovering Robot Vulnerabilities through Semantic Potential Fields

Pith reviewed 2026-05-23 07:46 UTC · model grok-4.3

classification 💻 cs.RO cs.LG
keywords robot manipulation · vulnerability detection · vision-language embeddings · deep reinforcement learning · potential fields · robot safety · manipulation policies · semantic embeddings

The pith

A reinforcement learning policy on a vision-language embedding treated as a potential field uncovers up to 23% more unique robot manipulation vulnerabilities than baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Robot manipulation policies are prone to failure under real-world variations, yet identifying those variations through direct physical testing is costly and unsafe. The paper trains a separate deep RL policy that treats a continuous vision-language embedding as a potential field, moving toward regions that cause failures while avoiding successful ones. This policy is learned entirely from virtual rollouts using limited success-failure data. Experiments on simulation benchmarks and a physical robot arm show the approach reveals more subtle vulnerabilities than existing vision-language methods. The resulting vulnerability map also supports fine-tuning the original policy with reduced data.

Core claim

The central claim is that treating a vision-language embedding space as a semantic potential field allows a deep RL vulnerability prediction policy, trained on virtual runs, to scalably locate failure-prone regions for a target manipulation policy, producing a probabilistic vulnerability-likelihood map that identifies up to 23% more unique vulnerabilities than state-of-the-art baselines while also improving the original policy through targeted fine-tuning.

What carries the argument

The semantic potential field formed by the continuous vision-language embedding, which guides the deep RL vulnerability prediction policy to navigate toward failure regions.
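The potential-field idea can be illustrated with a small toy, offered as a sketch under stated assumptions rather than the paper's implementation. It assumes a few labeled 2-D embedding points, Gaussian kernels around them, and a reward equal to the drop in potential after a step. The names `potential`, `step_reward`, and `sigma`, as well as the kernel choice, are hypothetical.

```python
import math

def potential(z, failures, successes, sigma=1.0):
    """Toy semantic potential: low near failure embeddings, high near successes.

    Assumption: Gaussian kernels around labeled points. The field
    construction used in the paper is not specified in this summary.
    """
    def kernel(a, b):
        d2 = sum((x - y) ** 2 for x, y in zip(a, b))
        return math.exp(-d2 / (2.0 * sigma ** 2))

    attract = sum(kernel(z, f) for f in failures)   # wells at failure points
    repel = sum(kernel(z, s) for s in successes)    # hills at success points
    return repel - attract

def step_reward(z, z_next, failures, successes):
    """Positive when a move lowers the potential, i.e. heads toward failures."""
    return potential(z, failures, successes) - potential(z_next, failures, successes)

# Made-up labeled embeddings and two candidate moves from the origin.
failures = [(1.0, 1.0)]
successes = [(-1.0, -1.0)]
start = (0.0, 0.0)
r_fail = step_reward(start, (0.5, 0.5), failures, successes)
r_succ = step_reward(start, (-0.5, -0.5), failures, successes)
```

In this toy, a step toward the failure point earns a positive reward and a step toward the success point earns the mirror-image negative reward, which is the attract/repel behavior the summary describes.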

If this is right

  • The vulnerability prediction policy enables scalable analysis without expensive or unsafe physical trials.
  • Querying the policy produces a probabilistic map of vulnerability likelihood across the embedding space.
  • Fine-tuning the original manipulation policy on the discovered vulnerabilities improves performance with substantially less data.
  • The method reveals subtle vulnerabilities that heuristic testing overlooks.
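One way to picture the vulnerability-likelihood map is as a normalized distribution over candidate perturbations obtained from the trained policy. The sketch below assumes the policy exposes one logit per discrete perturbation, which is a simplification; the logit values are made up, and `vulnerability_map` is a hypothetical helper, not an API from the paper.

```python
import math

def vulnerability_map(action_logits):
    """Convert per-perturbation logits into a likelihood map via softmax."""
    m = max(action_logits.values())  # subtract the max for numerical stability
    exps = {a: math.exp(v - m) for a, v in action_logits.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}

# Hypothetical logits for three perturbations (values are illustrative only).
logits = {
    "cube color: red": 2.0,
    "table color: green": 0.5,
    "resize cube": -1.0,
}
vmap = vulnerability_map(logits)

# Rank perturbations from most to least likely to induce failure.
ranked = sorted(vmap, key=vmap.get, reverse=True)
```

Ranking the map gives a prioritized list of perturbations to verify in simulation or on hardware.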

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The separation of vulnerability discovery into its own policy could allow the testing strategy to be refined independently of the manipulation policy being evaluated.
  • Because the embedding serves as a proxy space, the same framework could be applied to other embodied tasks where direct variation sampling is expensive.
  • If the embedding captures additional modalities, the potential-field approach might locate vulnerabilities arising from combined visual and language perturbations.

Load-bearing premise

The vision-language embedding trained on limited success-failure data contains enough semantic and visual variation to act as a potential field that reliably separates vulnerable regions from successful ones.

What would settle it

A controlled test on a held-out manipulation task or physical robot where the framework finds no more unique vulnerabilities than the vision-language baselines, or where the vulnerability map shows no correlation with measured failure rates.
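The correlation half of this test could be run as a rank correlation between predicted likelihoods and measured failure rates. The following is a minimal sketch with made-up numbers; the choice of Spearman correlation is an assumption, since the page does not say which statistic would be used.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

predicted = [0.70, 0.20, 0.08, 0.02]  # made-up map values per perturbation
measured = [0.90, 0.40, 0.10, 0.05]   # made-up measured failure rates
rho = spearman(predicted, measured)
```

A value of `rho` near zero on held-out perturbations would be the "no correlation" outcome described above.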

Figures

Figures reproduced from arXiv: 2412.02818 by Dieter Fox, Heni Ben Amor, Jiafei Duan, Ransalu Senanayake, Som Sagar, Sreevishakh Vasudevan, Yifan Zhou.

Figure 1: RoboMD diagnoses failure modes in pre-trained ma…
Figure 2: RoboMD Framework: (1) A PPO-based deep RL agent identifies configurations most likely to induce failures by…
Figure 3: The pipeline illustrates how rollouts with disruptions (e.g., object or lighting changes) are processed to learn meaningful…
Figure 4: Continuous Action Space Exploration. The diagram…
Figure 5: Some environment variations for both simulation and real-world evaluation.
Figure 6: Action diversity across RL algorithms. The X-axis…
Figure 7: Individual FM analysis of multiple models. Each radar plot represents the failure likelihood of a specific action. The…
Figure 9: Confusion matrices of embeddings trained using a)…
Figure 8: Failure distribution before and after fine-tuning “Lift”…
Figure 10: Scenes from experiments on real world robot
Figure 11: Scenes from experiments on Robosuite
  C. Baselines: To validate the effectiveness of our method, we compared it against two categories of baselines: Reinforcement Learning (RL) baselines and Vision-Language Model (VLM) baselines. Below, we detail their implementation, hyperparameters, and specific configurations. 1) Reinforcement Learning (RL) Baselines: The RL baselines were implemented using well-establi…
Figure 12: The order in which the confusion matrix is a) Image Encoder + BCE b) Image + Text Encoder + BCE loss c) Image…
Figure 13: Testing Robustness Under Visual Perturbations: Successful Rollout in Training vs. Failure Induced by Red Table
Figure 14: Performance comparison of behavior cloning (BC) and diffusion-based policies on the Lift task before and after…
Figure 8:
  1) Change cube color to red
  2) Change cube color to green
  3) Change cube color to blue
  4) Change cube color to gray
  5) Change table color to green
  6) Change table color to blue
  7) Change table color to red
  8) Change table color to gray
  9) Resize table to (0.8, 0.2, 0.025)
  10) Resize table to (0.2, 0.8, 0.025)
  11) Resize cube to (0.04, 0.04, 0.04)
  12) Resize cube to (0.01, 0.01, 0.01)
  13) Resize cube to (0.…
Figure 15: kNN Accuracy Drop with Increasing k in Continuous…
Figure 17: BC lift finetuned on a combined dataset of 12 different…
Figure 16: Training loss for training action representations
Figure 18: Environmental and Object Perturbations on Manipulation Tasks
Original abstract

Robot manipulation policies, while central to the promise of physical AI, are highly vulnerable in the presence of external variations in the real world. Diagnosing these vulnerabilities is hindered by two key challenges: (i) the relevant variations to test against are often unknown, and (ii) direct testing in the real world is costly and unsafe. We introduce a framework that tackles both issues by learning a separate deep reinforcement learning (deep RL) policy for vulnerability prediction through virtual runs on a continuous vision-language embedding trained with limited success-failure data. By treating this embedding space, which is rich in semantic and visual variations, as a potential field, the policy learns to move toward vulnerable regions while being repelled from success regions. This vulnerability prediction policy, trained on virtual rollouts, enables scalable and safe vulnerability analysis without expensive physical trials. By querying this policy, our framework builds a probabilistic vulnerability-likelihood map. Experiments across simulation benchmarks and a physical robot arm show that our framework uncovers up to 23% more unique vulnerabilities than state-of-the-art vision-language baselines, revealing subtle vulnerabilities overlooked by heuristic testing. Additionally, we show that fine-tuning the manipulation policy with the vulnerabilities discovered by our framework improves manipulation performance with much less fine-tuning data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces RoboMD, a framework that trains a separate deep RL policy to discover vulnerabilities in robot manipulation policies. It does so by treating a continuous vision-language embedding—trained on limited success-failure data—as a semantic potential field that attracts the policy toward vulnerable regions and repels it from successful ones. Virtual rollouts of this policy are used to construct a probabilistic vulnerability map. Experiments on simulation benchmarks and a physical arm reportedly uncover up to 23% more unique vulnerabilities than vision-language baselines and enable more data-efficient fine-tuning of the original policy.

Significance. If the embedding truly encodes sufficient semantic and visual variation to form a separating potential field, the approach would provide a scalable, simulation-only method for identifying subtle vulnerabilities that heuristic testing misses, reducing reliance on costly or unsafe physical trials and improving robustness of manipulation policies.

major comments (2)
  1. [Abstract] Abstract: the central claim that the vision-language embedding 'is rich in semantic and visual variations' and can serve as a potential field that 'meaningfully separates vulnerable from successful regions' rests on an unverified representational assumption; no architecture, training objective, gradient validation, or coverage analysis is supplied to support that the limited success-failure data produces the required separating structure.
  2. [Abstract] Abstract: the reported 'up to 23% more unique vulnerabilities' is presented without any description of how vulnerabilities are quantified, what statistical significance tests were used, how the state-of-the-art vision-language baselines were implemented, or the data-exclusion rules applied; these omissions make the quantitative improvement impossible to evaluate and render the experimental claim load-bearing but unsupported.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will make revisions to improve clarity and support for the claims in the abstract.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that the vision-language embedding 'is rich in semantic and visual variations' and can serve as a potential field that 'meaningfully separates vulnerable from successful regions' rests on an unverified representational assumption; no architecture, training objective, gradient validation, or coverage analysis is supplied to support that the limited success-failure data produces the required separating structure.

    Authors: We agree that the abstract's claim would benefit from explicit support. The manuscript body details the embedding architecture and contrastive training on success-failure pairs, but we will revise to briefly incorporate these elements into the abstract, add a gradient validation figure or analysis showing directional separation, and include coverage metrics demonstrating that the limited data yields the required structure. This addresses the unverified assumption directly.

    Revision: yes

  2. Referee: [Abstract] Abstract: the reported 'up to 23% more unique vulnerabilities' is presented without any description of how vulnerabilities are quantified, what statistical significance tests were used, how the state-of-the-art vision-language baselines were implemented, or the data-exclusion rules applied; these omissions make the quantitative improvement impossible to evaluate and render the experimental claim load-bearing but unsupported.

    Authors: We concur that the abstract requires these methodological details for the quantitative result to be evaluable. In revision, we will expand the abstract to concisely define unique vulnerabilities (distinct failure modes via embedding-space clustering), note the statistical tests applied, summarize baseline implementations, and state the data-exclusion criteria. Corresponding expansions will also appear in the results section to ensure the claim is fully supported.

    Revision: yes
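The simulated rebuttal defines unique vulnerabilities as distinct failure modes found by embedding-space clustering but does not name a clustering method. As a purely illustrative sketch, the code below uses a greedy radius-based count on made-up 2-D embeddings for two hypothetical methods and reports the relative gain. The helper `count_unique`, the `radius` value, and all data are assumptions, and the output has no connection to the paper's reported 23% figure.

```python
import math

def count_unique(embeddings, radius):
    """Greedy count: a point opens a new cluster if every existing
    cluster center is farther away than `radius`."""
    centers = []
    for e in embeddings:
        if all(math.dist(e, c) > radius for c in centers):
            centers.append(e)
    return len(centers)

# Made-up failure embeddings discovered by two hypothetical methods.
method_a = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (2.0, 0.0), (2.05, 0.0), (3.0, 3.0)]
method_b = [(0.0, 0.0), (0.05, 0.05), (1.0, 1.0), (1.1, 1.0), (2.0, 0.0)]

unique_a = count_unique(method_a, radius=0.5)
unique_b = count_unique(method_b, radius=0.5)

# Relative gain in unique failure modes of method A over method B.
relative_gain = (unique_a - unique_b) / unique_b
```

The count depends on `radius` and on the order in which points are visited, which is one reason the referee's request for an explicit definition matters.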

Circularity Check

0 steps flagged

No circularity: framework uses separately trained embedding and RL policy on virtual rollouts

Full rationale

The paper presents an empirical framework that trains a vision-language embedding on limited success-failure data, treats the resulting space as a potential field by construction of the method, and trains a separate deep RL policy on virtual rollouts within that space to produce a vulnerability map. No equations, derivations, or self-citations are shown that reduce any claimed prediction or result to its own inputs by definition. Experimental claims (e.g., 23% more vulnerabilities) rest on benchmark comparisons rather than fitted parameters renamed as predictions or self-referential uniqueness theorems. The approach is self-contained as a methodological pipeline without load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Central claim depends on the vision-language embedding containing the relevant variations for policy failures and on virtual rollouts being representative of real-world vulnerabilities; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • Domain assumption: the vision-language embedding space is rich in semantic and visual variations that align with manipulation policy vulnerabilities.
    Invoked when treating the embedding as a potential field for the vulnerability prediction policy.

pith-pipeline@v0.9.0 · 5773 in / 1220 out tokens · 42399 ms · 2026-05-23T07:46:59.960344+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. RePO-VLA: Recovery-Driven Policy Optimization for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    RePO-VLA raises average adversarial success rates in VLA manipulation from 20% to 75% by using recovery-aware initialization, a progress-aware semantic value function, and value-conditioned refinement on success and c...

Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · cited by 1 Pith paper · 16 internal anchors

  1. [1]

    The colosseum: A benchmark for evaluating generalization for robotic manipulation

    Wilbert Pumacay, Ishika Singh, Jiafei Duan, Ranjay Kr- ishna, Jesse Thomason, and Dieter Fox. The colosseum: A benchmark for evaluating generalization for robotic manipulation. arXiv preprint arXiv:2402.08191 , 2024. URL https://arxiv.org/pdf/2402.08191

  2. [2]

    Decomposing the generalization gap in imitation learn- ing for visual robotic manipulation

    Annie Xie, Lisa Lee, Ted Xiao, and Chelsea Finn. Decomposing the generalization gap in imitation learn- ing for visual robotic manipulation. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 3153–3160. IEEE, 2024. URL https: //arxiv.org/abs/2307.03659

  3. [3]

    Data scaling laws in im- itation learning for robotic manipulation

    Fanqi Lin, Yingdong Hu, Pingyue Sheng, Chuan Wen, Jiacheng You, and Yang Gao. Data scaling laws in im- itation learning for robotic manipulation. arXiv preprint arXiv:2410.18647, 2024. URL https://arxiv.org/abs/2410. 18647

  4. [4]

    The role of predictive uncertainty and diversity in embodied ai and robot learning

    Ransalu Senanayake. The role of predictive uncertainty and diversity in embodied ai and robot learning. arXiv preprint arXiv:2405.03164, 2024. URL https://arxiv.org/ pdf/2405.03164

  5. [5]

    AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation

    Jiafei Duan, Wilbert Pumacay, Nishanth Kumar, Yi Ru Wang, Shulin Tian, Wentao Yuan, Ranjay Krishna, Di- eter Fox, Ajay Mandlekar, and Yijie Guo. AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation. arXiv preprint arXiv:2410.00371, 2024. URL https://arxiv.org/pdf/2410. 00371

  6. [6]

    Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, page 02783649241273668, 2023. URL https: //arxiv.org/abs/2303.04137

  7. [7]

    Probabilistic robotics

    Sebastian Thrun. Probabilistic robotics. Communications of the ACM , 45(3):52–57, 2002. URL https://docs.ufpr. br/∼danielsantos/ProbabilisticRobotics.pdf

  8. [8]

    Fast Gaussian Process Occupancy Maps

    Simon T O’Callaghan and Fabio T Ramos. Gaussian process occupancy maps. The International Journal of Robotics Research , 31(1):42–62, 2012. URL https: //arxiv.org/pdf/1811.10156

  9. [9]

    What uncertainties do we need in bayesian deep learning for computer vision? Advances in neural information processing systems , 30,

    Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer vision? Advances in neural information processing systems , 30,

  10. [10]

    URL https://arxiv.org/pdf/1703.04977

  11. [11]

    On the importance of exploration for generalization in re- inforcement learning

    Yiding Jiang, J Zico Kolter, and Roberta Raileanu. On the importance of exploration for generalization in re- inforcement learning. Advances in Neural Information Processing Systems, 36, 2024. URL https://arxiv.org/pdf/ 2306.05483

  12. [12]

    A bayesian approach to generative adversarial imitation learning

    Wonseok Jeon, Seokin Seo, and Kee-Eung Kim. A bayesian approach to generative adversarial imitation learning. Advances in neural information processing systems , 31, 2018. URL https://papers.nips.cc/paper files/paper/2018/file/ 943aa0fcda4ee2901a7de9321663b114-Paper.pdf

  13. [13]

    Safe imitation learning via fast bayesian reward inference from preferences

    Daniel Brown, Russell Coleman, Ravi Srinivasan, and Scott Niekum. Safe imitation learning via fast bayesian reward inference from preferences. In International Con- ference on Machine Learning, pages 1165–1177. PMLR,

  14. [14]

    URL https://papers.nips.cc/paper files/paper/2018/ file/943aa0fcda4ee2901a7de9321663b114-Paper.pdf

  15. [15]

    Bayesian In- verse Reinforcement Learning

    Deepak Ramachandran and Eyal Amir. Bayesian In- verse Reinforcement Learning. In IJCAI, volume 7, pages 2586–2591, 2007. URL https://www.ijcai.org/ Proceedings/07/Papers/416.pdf

  16. [16]

    Unpacking failure modes of generative policies: Runtime monitoring of consistency and progress

    Christopher Agia, Rohan Sinha, Jingyun Yang, Zi-ang Cao, Rika Antonova, Marco Pavone, and Jeannette Bohg. Unpacking failure modes of generative policies: Runtime monitoring of consistency and progress. arXiv preprint arXiv:2410.04640, 2024. URL https://arxiv.org/pdf/2410. 04640

  17. [17]

    Interactive Semantic Interventions for VLMs: A Human- in-the-Loop Investigation of VLM Failure

    Lukas Klein, Kenza Amara, Carsten T L ¨uth, Hendrik Strobelt, Mennatallah El-Assady, and Paul F Jaeger. Interactive Semantic Interventions for VLMs: A Human- in-the-Loop Investigation of VLM Failure. In Neurips Safe Generative AI Workshop 2024 , 2024. URL https: //openreview.net/pdf?id=3kMucCYhYN

  18. [18]

    Decider: Leveraging foundation model priors for improved model failure detection and explanation

    Rakshith Subramanyam, Kowshik Thopalli, Vivek Narayanaswamy, and Jayaraman J Thiagarajan. Decider: Leveraging foundation model priors for improved model failure detection and explanation. In European Con- ference on Computer Vision , pages 465–482. Springer,

  19. [19]

    URL https://arxiv.org/pdf/2408.00331

  20. [20]

    Reflect: Summarizing robot experiences for failure explanation and correction

    Zeyi Liu, Arpit Bahety, and Shuran Song. Reflect: Summarizing robot experiences for failure explanation and correction. arXiv preprint arXiv:2306.15724 , 2023. URL https://arxiv.org/abs/2306.15724

  21. [22]

    URL https://arxiv.org/pdf/2406.07145

  22. [23]

    How do we fail? stress testing perception in autonomous vehicles

    Harrison Delecki, Masha Itkina, Bernard Lange, Ransalu Senanayake, and Mykel J Kochenderfer. How do we fail? stress testing perception in autonomous vehicles. In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , pages 5139–5146. IEEE,

  23. [24]

    URL https://arxiv.org/pdf/2203.14155

  24. [25]

    Curiosity-driven red- teaming for large language models

    Zhang-Wei Hong, Idan Shenfeld, Tsun-Hsuan Wang, Yung-Sung Chuang, Aldo Pareja, James Glass, Akash Srivastava, and Pulkit Agrawal. Curiosity-driven red- teaming for large language models. arXiv preprint arXiv:2402.19464, 2024. URL https://arxiv.org/pdf/2402. 19464

  25. [26]

    A survey of algorithms for black-box safety validation of cyber-physical systems

    Anthony Corso, Robert Moss, Mark Koren, Ritchie Lee, and Mykel Kochenderfer. A survey of algorithms for black-box safety validation of cyber-physical systems. Journal of Artificial Intelligence Research , 72:377–428,

  26. [27]

    URL https://arxiv.org/pdf/2005.02979

  27. [28]

    Out-of-distribution detection for automotive perception

    Julia Nitsch, Masha Itkina, Ransalu Senanayake, Juan Nieto, Max Schmidt, Roland Siegwart, Mykel J Kochen- derfer, and Cesar Cadena. Out-of-distribution detection for automotive perception. In 2021 IEEE International Intelligent Transportation Systems Conference (ITSC) , pages 2938–2943. IEEE, 2021. URL https://arxiv.org/ pdf/2011.01413

  28. [29]

    SAFE: Sensitivity-aware features for out-of-distribution object detection

    Samuel Wilson, Tobias Fischer, Feras Dayoub, Dimity Miller, and Niko S ¨underhauf. SAFE: Sensitivity-aware features for out-of-distribution object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 23565–23576, 2023. URL https://openaccess.thecvf.com/content/ICCV2023/ papers/Wilson SAFE Sensitivity-Aware Features...

  29. [30]

    Pytorch-ood: A library for out-of-distribution detection based on pytorch

    Konstantin Kirchheim, Marco Filax, and Frank Ortmeier. Pytorch-ood: A library for out-of-distribution detection based on pytorch. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 4351–4360, 2022. URL https://openaccess.thecvf.com/content/CVPR2022W/ HCIS/papers/Kirchheim PyTorch-OOD A Library for Out-of-Distribut...

  30. [31]

    PAGER: A Framework for Failure Analysis of Deep Regression Models

    Jayaraman J Thiagarajan, Vivek Narayanaswamy, Puja Trivedi, and Rushil Anirudh. PAGER: A Framework for Failure Analysis of Deep Regression Models. arXiv preprint arXiv:2309.10977, 2023. URL https://arxiv.org/ pdf/2309.10977

  31. [32]

    Sam2act: Integrating visual foundation model with a memory architecture for robotic manipulation

    Haoquan Fang, Markus Grotz, Wilbert Pumacay, Yi Ru Wang, Dieter Fox, Ranjay Krishna, and Jiafei Duan. Sam2act: Integrating visual foundation model with a memory architecture for robotic manipulation. arXiv preprint arXiv:2501.18564, 2025

  32. [33]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yev- gen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022. URL https://arxiv.org/pdf/2212.06817

  33. [35]

    URL https://arxiv.org/pdf/2307.15818

  34. [36]

    Learning Dexterous In-Hand Manipulation

    OpenAI: Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pa- chocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous in-hand manipula- tion. The International Journal of Robotics Research , 39 (1):3–20, 2020. URL https://arxiv.org/pdf/1808.00177

  35. [37]

    Human-AI Safety: A Descendant of Generative AI and Control Systems Safety

    Andrea Bajcsy and Jaime F Fisac. Human-AI Safety: A Descendant of Generative AI and Control Systems Safety. arXiv preprint arXiv:2405.09794 , 2024. URL https://arxiv.org/pdf/2405.09794

  36. [38]

    Concept: Dynamic Risk Assessment for AI-Controlled Robotic Systems

    Philipp Grimmeisen, Friedrich Sautter, and Andrey Morozov. Concept: Dynamic Risk Assessment for AI-Controlled Robotic Systems. arXiv preprint arXiv:2401.14147, 2024. URL https://arxiv.org/pdf/2401. 14147

  37. [39]

    The situation awareness framework for explainable AI (SAFE-AI) and human factors considerations for XAI systems

    Lindsay Sanneman and Julie A Shah. The situation awareness framework for explainable AI (SAFE-AI) and human factors considerations for XAI systems. Inter- national Journal of Human–Computer Interaction , 38 (18-20):1772–1788, 2022. URL https://pmc.ncbi.nlm. nih.gov/articles/PMC7338174/

  38. [40]

    Failure prediction with statistical guaran- tees for vision-based robot control

    Alec Farid, David Snyder, Allen Z Ren, and Anirudha Majumdar. Failure prediction with statistical guaran- tees for vision-based robot control. arXiv preprint arXiv:2202.05894, 2022. URL https://arxiv.org/pdf/2202. 05894

  39. [41]

    Distributionally robust policy learning via adversarial environment gen- eration

    Allen Z Ren and Anirudha Majumdar. Distributionally robust policy learning via adversarial environment gen- eration. IEEE Robotics and Automation Letters , 7(2): 1379–1386, 2022. URL https://arxiv.org/pdf/2107.06353

  40. [42]

    Teaser: Fast and certifiable point cloud registration,

    Heng Yang, Jingnan Shi, and Luca Carlone. Teaser: Fast and certifiable point cloud registration. IEEE Transactions on Robotics , 37(2):314–333, 2020. URL https://arxiv.org/abs/2001.07715

  41. [43]

    Full-Distribution Generalization Bounds for Imitation Learning Policies

    Joseph A Vincent, Haruki Nishimura, Masha Itkina, and Mac Schwager. Full-Distribution Generalization Bounds for Imitation Learning Policies. In First Workshop on Out-of-Distribution Generalization in Robotics at CoRL 2023 , 2023. URL https://openreview.net/pdf?id= JZkwYiyy9I

  42. [44]

    Minimum-violation LTL Planning with Conflicting Specifications

    Jana Tmov, Luis I Reyes Castro, Sertac Karaman, Emilio Frazzoli, and Daniela Rus. Minimum-violation LTL plan- ning with conflicting specifications. In 2013 American Control Conference, pages 200–205. IEEE, 2013. URL https://arxiv.org/pdf/1303.3679

  43. [45]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. URL https://arxiv.org/pdf/1707.06347

  44. [46]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Inter- national Conference on Learning Representations , 2020. URL https://arxiv.org/pdf/2010.11929

  45. [47]

    Learning Transferable Visual Models From Natural Language Supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural lan- guage supervision. In International conference on ma- chine learning , pages 8748–8763. PMLR, 2021. URL https://arxiv.org/pdf/2103.00020

  46. [48]

    robosuite: A Modular Simulation Framework and Benchmark for Robot Learning

    Yuke Zhu, Josiah Wong, Ajay Mandlekar, Roberto Mart´ın-Mart´ın, Abhishek Joshi, Soroush Nasiriany, and Yifeng Zhu. robosuite: A modular simulation frame- work and benchmark for robot learning. arXiv preprint arXiv:2009.12293, 2020. URL https://arxiv.org/abs/2009. 12293

  47. [49]

    What Matters in Learning from Offline Human Demonstrations for Robot Manipulation

    Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Mart ´ın-Mart´ın. What matters in learning from offline human demon- strations for robot manipulation. In arXiv preprint arXiv:2108.03298, 2021. URL https://arxiv.org/abs/2108. 03298

  48. [50]

    MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations

    Ajay Mandlekar, Soroush Nasiriany, Bowen Wen, Ireti- ayo Akinola, Yashraj Narang, Linxi Fan, Yuke Zhu, and Dieter Fox. Mimicgen: A data generation system for scalable robot learning using human demonstrations. In 7th Annual Conference on Robot Learning , 2023. URL https://arxiv.org/abs/2310.17596

  49. [52]

    URL https://arxiv.org/abs/1602.01783

  50. [53]

    Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

    Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning , pages 1861–1870. PMLR, 2018. URL https://arxiv.org/ abs/1801.01290

  51. [54]

    Offline Reinforcement Learning with Implicit Q-Learning

    Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit q-learning. arXiv preprint arXiv:2110.06169, 2021. URL https://arxiv.org/ abs/2110.06169

  52. [55]

    Off-policy deep reinforcement learning without explo- ration

    Scott Fujimoto, David Meger, and Doina Precup. Off- policy deep reinforcement learning without exploration. In International conference on machine learning , pages 2052–2062. PMLR, 2019. URL https://arxiv.org/abs/ 1812.02900

  53. [56]

    Learning to generalize across long-horizon tasks from human demonstrations

    Ajay Mandlekar, Danfei Xu, Roberto Mart ´ın-Mart´ın, Silvio Savarese, and Li Fei-Fei. Learning to generalize across long-horizon tasks from human demonstrations. arXiv preprint arXiv:2003.06085 , 2020. URL https: //arxiv.org/abs/2003.06085

  54. [57]

    Modularity through attention: Efficient training and transfer of language-conditioned policies for robot manipulation

    Yifan Zhou, Shubham Sonawani, Mariano Phielipp, Simon Stepputtis, and Heni Ben Amor. Modularity through attention: Efficient training and transfer of language-conditioned policies for robot manipulation. arXiv preprint arXiv:2212.04573, 2022. URL https://arxiv.org/abs/2212.04573

    APPENDIX I. EXPERIMENTAL SETUP

    A. Real-World Experiment Setup

    Real-worl...

  55. [58]

    The following RL methods were included: • Proximal Policy Optimization (PPO): A policy-gradient method known for its stability and efficiency

    Reinforcement Learning (RL) Baselines: The RL baselines were implemented using well-established algorithms, each optimized for the task to ensure a fair comparison. The following RL methods were included:

    • Proximal Policy Optimization (PPO): A policy-gradient method known for its stability and efficiency. Key hyperparameters included:
      – Learning rate...

  56. [59]

    We evaluated 3 state-of-the-art VLMs adapted to our task:

    Vision-Language Model (VLM) Baselines: The VLM baselines take advantage of the interplay between visual and textual modalities for task representation. We evaluated 3 state-of-the-art VLMs adapted to our task:

  57. [60]

    First, we process the output trajectories into videos and compute the appropriate frame rate to generate video sequences equivalent to 15 frames per trajectory pair

    Qwen2-VL

    Additionally, we leverage GPT-4o with in-context learning, using five demonstrations. First, we process the output trajectories into videos and compute the appropriate frame rate to generate video sequences equivalent to 15 frames per trajectory pair. These sequences, representing perturbation scenarios, are provided to the VLMs along with a sy...
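The frame-reduction step described above can be illustrated with a minimal sketch. It assumes uniform sampling with the first and last frame always kept, which is only one plausible reading of "compute the appropriate frame rate"; the source does not spell out the exact procedure. The function name `sample_frame_indices` and its parameters are hypothetical and used only for illustration.

```python
def sample_frame_indices(num_frames, target=15):
    """Return up to `target` evenly spaced frame indices for one trajectory video.

    Assumption (not from the paper): frames are picked uniformly and the
    first and last frame are always included.
    """
    if num_frames <= 0:
        raise ValueError("trajectory has no frames")
    if target < 2:
        raise ValueError("target must be at least 2")
    if num_frames <= target:
        # Short trajectories are kept whole rather than padded.
        return list(range(num_frames))
    # Spacing between sampled frames; strictly greater than 1 here,
    # so the rounded indices never repeat.
    step = (num_frames - 1) / (target - 1)
    return [round(i * step) for i in range(target)]


# Example: a 100-frame rollout reduced to 15 frames.
indices = sample_frame_indices(100)
print(len(indices), indices[0], indices[-1])
```

Sampling by index rather than re-encoding at a fixed frame rate keeps the sketch independent of any particular video library; a real pipeline would apply the same indices when writing the clip.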

  58. [61]

    Change cube color to red

  59. [62]

    Change cube color to green

  60. [63]

    Change cube color to blue

  61. [64]

    Change cube color to gray

  62. [65]

    Change table color to green

  63. [66]

    Change table color to blue

  64. [67]

    Change table color to red

  65. [68]

    Change table color to gray

  66. [69]

    Resize table to (0.8, 0.2, 0.025)

  67. [70]

    Resize table to (0.2, 0.8, 0.025)

  68. [71]

    Resize cube to (0.04, 0.04, 0.04)

  69. [72]

    Resize cube to (0.01, 0.01, 0.01)

  70. [73]

    Resize cube to (0.04, 0.01, 0.01)

  71. [74]

    Change robot color to red

  72. [75]

    Change robot color to green

  73. [76]

    Change robot color to cyan

  74. [77]

    Change robot color to gray

  75. [78]

    Change lighting color to red

  76. [79]

    Change lighting color to green

  77. [80]

    Change lighting color to blue

  78. [81]

    Change lighting color to gray

    B. Evaluation

    Fig. 12 illustrates the similarity structure of embeddings trained using only Binary Cross-Entropy (BCE) loss, resulting in highly correlated representations. In contrast, the right matrix, trained with a combination of BCE and Contrastive Loss, demonstrates improved separation, as evidenced by the stronger diago...

  79. [82]

    Instead of relying solely on visual cues, the model gains an additional perspective on the underlying concepts (e.g., object names, attributes, or relations)

    Semantic Guidance: Textual representations carry rich semantic information that can guide the image backbone. Instead of relying solely on visual cues, the model gains an additional perspective on the underlying concepts (e.g., object names, attributes, or relations)

  80. [83]

    Improved Discriminative Power: With access to text-based information, the model can differentiate between visually similar classes by leveraging linguistic differences in their corresponding textual descriptions

Showing first 80 references.
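
The perturbation entries listed above ("Change cube color to red" through "Change lighting color to gray") read as a discrete set of environment edits. As a hedged illustration, the sketch below encodes the 21 entries visible in this listing as a small Python table. The class name `Perturbation`, its fields, and the `describe` method are hypothetical conveniences, not the authors' data structure, and the listing may not show the complete set used in the paper.

```python
from dataclasses import dataclass
from typing import Tuple, Union

Value = Union[str, Tuple[float, float, float]]


@dataclass(frozen=True)
class Perturbation:
    kind: str      # "color" or "resize" (hypothetical labels)
    target: str    # "cube", "table", "robot", or "lighting"
    value: Value   # color name, or a size triple as given in the listing

    def describe(self) -> str:
        if self.kind == "color":
            return f"Change {self.target} color to {self.value}"
        return f"Resize {self.target} to {self.value}"


def _colors(target, names):
    return [Perturbation("color", target, n) for n in names]


def _sizes(target, triples):
    return [Perturbation("resize", target, t) for t in triples]


# Same order as the entries in the listing above.
PERTURBATIONS = (
    _colors("cube", ["red", "green", "blue", "gray"])
    + _colors("table", ["green", "blue", "red", "gray"])
    + _sizes("table", [(0.8, 0.2, 0.025), (0.2, 0.8, 0.025)])
    + _sizes("cube", [(0.04, 0.04, 0.04), (0.01, 0.01, 0.01), (0.04, 0.01, 0.01)])
    + _colors("robot", ["red", "green", "cyan", "gray"])
    + _colors("lighting", ["red", "green", "blue", "gray"])
)

for action_id, p in enumerate(PERTURBATIONS):
    print(action_id, p.describe())
```

Indexing the entries this way gives each perturbation an integer ID, which is the form a discrete-action RL agent would typically consume; whether the paper indexes them in this order is an assumption.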