RoboMD: Uncovering Robot Vulnerabilities through Semantic Potential Fields
Pith reviewed 2026-05-23 07:46 UTC · model grok-4.3
The pith
A reinforcement learning policy on a vision-language embedding treated as a potential field uncovers up to 23% more unique robot manipulation vulnerabilities than baselines.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that treating a vision-language embedding space as a semantic potential field allows a deep RL vulnerability prediction policy, trained on virtual runs, to scalably locate failure-prone regions for a target manipulation policy, producing a probabilistic vulnerability-likelihood map that identifies up to 23% more unique vulnerabilities than state-of-the-art baselines while also improving the original policy through targeted fine-tuning.
What carries the argument
The semantic potential field formed by the continuous vision-language embedding, which guides the deep RL vulnerability prediction policy to navigate toward failure regions.
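As a rough illustration of the potential-field idea, the sketch below scores a point in the embedding space so that it is attracted to known failure embeddings and repelled from known success embeddings, then rewards a step by the resulting drop in potential. The function names, the quadratic and inverse-square terms, and the anchor sets are all assumptions borrowed from classical artificial potential fields; the paper's actual reward is not given here.

```python
import math

def potential(z, failure_anchors, success_anchors, eps=1e-6):
    """Hypothetical potential: low near failure anchors, high near success anchors."""
    # Attractive term: grows with squared distance to each failure embedding.
    attract = sum(math.dist(z, f) ** 2 for f in failure_anchors)
    # Repulsive term: blows up as z approaches a success embedding.
    repel = sum(1.0 / (math.dist(z, s) ** 2 + eps) for s in success_anchors)
    return attract + repel

def step_reward(z_prev, z_next, failure_anchors, success_anchors):
    """Reward is the decrease in potential caused by moving from z_prev to z_next."""
    before = potential(z_prev, failure_anchors, success_anchors)
    after = potential(z_next, failure_anchors, success_anchors)
    return before - after

failures = [(1.0, 0.0)]   # illustrative failure embedding
successes = [(-1.0, 0.0)]  # illustrative success embedding
toward_failure = step_reward((0.0, 0.0), (0.5, 0.0), failures, successes)
toward_success = step_reward((0.0, 0.0), (-0.5, 0.0), failures, successes)
```

Under these assumptions a step toward the failure anchor earns a positive reward and a step toward the success anchor earns a negative one, which is the qualitative behavior the summary describes.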
If this is right
- The vulnerability prediction policy enables scalable analysis without expensive or unsafe physical trials.
- Querying the policy produces a probabilistic map of vulnerability likelihood across the embedding space.
- Fine-tuning the original manipulation policy on the discovered vulnerabilities improves performance with substantially less data.
- The method reveals subtle vulnerabilities that heuristic testing overlooks.
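One way to picture the probabilistic map is as a normalized distribution over candidate perturbations, obtained by querying the trained policy for a preference score per perturbation. The sketch below assumes a discrete set of perturbations and a softmax normalization; both the scores and the normalization are hypothetical and are not taken from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of real-valued scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def vulnerability_map(perturbations, policy_scores):
    """Map each perturbation to a likelihood, sorted from most to least likely."""
    probs = softmax(policy_scores)
    ranked = sorted(zip(perturbations, probs), key=lambda kv: kv[1], reverse=True)
    return dict(ranked)

# Illustrative perturbation names and made-up policy scores.
names = ["Change cube color to red", "Change table color to blue", "Resize cube"]
vmap = vulnerability_map(names, [0.0, 2.0, 1.0])
```

The highest-ranked entries would then be the perturbations to test first, or to prioritize when collecting fine-tuning data.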
Where Pith is reading between the lines
- The separation of vulnerability discovery into its own policy could allow the testing strategy to be refined independently of the manipulation policy being evaluated.
- Because the embedding serves as a proxy space, the same framework could be applied to other embodied tasks where direct variation sampling is expensive.
- If the embedding captures additional modalities, the potential-field approach might locate vulnerabilities arising from combined visual and language perturbations.
Load-bearing premise
The vision-language embedding trained on limited success-failure data contains enough semantic and visual variation to act as a potential field that reliably separates vulnerable regions from successful ones.
What would settle it
A controlled test on a held-out manipulation task or physical robot where the framework finds no more unique vulnerabilities than the vision-language baselines, or where the vulnerability map shows no correlation with measured failure rates.
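The correlation check mentioned above could be run, for example, as a Spearman rank correlation between the predicted vulnerability likelihood of each perturbation and its measured failure rate. The sketch below is a plain-Python version under that assumption; the choice of Spearman and the sample numbers are illustrative only.

```python
import math

def ranks(values):
    """Average ranks (1-based), with ties sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(predicted, measured):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(predicted), ranks(measured))

# Made-up example: predicted likelihoods vs. measured failure rates.
rho = spearman([0.1, 0.5, 0.9], [0.0, 0.3, 0.8])
```

A value near zero on held-out perturbations would be the kind of evidence that counts against the map.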
Original abstract
Robot manipulation policies, while central to the promise of physical AI, are highly vulnerable in the presence of external variations in the real world. Diagnosing these vulnerabilities is hindered by two key challenges: (i) the relevant variations to test against are often unknown, and (ii) direct testing in the real world is costly and unsafe. We introduce a framework that tackles both issues by learning a separate deep reinforcement learning (deep RL) policy for vulnerability prediction through virtual runs on a continuous vision-language embedding trained with limited success-failure data. By treating this embedding space, which is rich in semantic and visual variations, as a potential field, the policy learns to move toward vulnerable regions while being repelled from success regions. This vulnerability prediction policy, trained on virtual rollouts, enables scalable and safe vulnerability analysis without expensive physical trials. By querying this policy, our framework builds a probabilistic vulnerability-likelihood map. Experiments across simulation benchmarks and a physical robot arm show that our framework uncovers up to 23% more unique vulnerabilities than state-of-the-art vision-language baselines, revealing subtle vulnerabilities overlooked by heuristic testing. Additionally, we show that fine-tuning the manipulation policy with the vulnerabilities discovered by our framework improves manipulation performance with much less fine-tuning data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces RoboMD, a framework that trains a separate deep RL policy to discover vulnerabilities in robot manipulation policies. It does so by treating a continuous vision-language embedding—trained on limited success-failure data—as a semantic potential field that attracts the policy toward vulnerable regions and repels it from successful ones. Virtual rollouts of this policy are used to construct a probabilistic vulnerability map. Experiments on simulation benchmarks and a physical arm reportedly uncover up to 23% more unique vulnerabilities than vision-language baselines and enable more data-efficient fine-tuning of the original policy.
Significance. If the embedding truly encodes sufficient semantic and visual variation to form a separating potential field, the approach would provide a scalable, simulation-only method for identifying subtle vulnerabilities that heuristic testing misses, reducing reliance on costly or unsafe physical trials and improving robustness of manipulation policies.
major comments (2)
- [Abstract] The central claim that the vision-language embedding 'is rich in semantic and visual variations' and can serve as a potential field that 'meaningfully separates vulnerable from successful regions' rests on an unverified representational assumption; no architecture, training objective, gradient validation, or coverage analysis is supplied to support that the limited success-failure data produces the required separating structure.
- [Abstract] The reported 'up to 23% more unique vulnerabilities' is presented without any description of how vulnerabilities are quantified, what statistical significance tests were used, how the state-of-the-art vision-language baselines were implemented, or the data-exclusion rules applied; these omissions make the quantitative improvement impossible to evaluate and render the experimental claim load-bearing but unsupported.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will make revisions to improve clarity and support for the claims in the abstract.
-
Referee: [Abstract] The central claim that the vision-language embedding 'is rich in semantic and visual variations' and can serve as a potential field that 'meaningfully separates vulnerable from successful regions' rests on an unverified representational assumption; no architecture, training objective, gradient validation, or coverage analysis is supplied to support that the limited success-failure data produces the required separating structure.
Authors: We agree that the abstract's claim would benefit from explicit support. The manuscript body details the embedding architecture and contrastive training on success-failure pairs, but we will revise to briefly incorporate these elements into the abstract, add a gradient validation figure or analysis showing directional separation, and include coverage metrics demonstrating that the limited data yields the required structure. This addresses the unverified assumption directly. revision: yes
-
Referee: [Abstract] The reported 'up to 23% more unique vulnerabilities' is presented without any description of how vulnerabilities are quantified, what statistical significance tests were used, how the state-of-the-art vision-language baselines were implemented, or the data-exclusion rules applied; these omissions make the quantitative improvement impossible to evaluate and render the experimental claim load-bearing but unsupported.
Authors: We concur that the abstract requires these methodological details for the quantitative result to be evaluable. In revision, we will expand the abstract to concisely define unique vulnerabilities (distinct failure modes via embedding-space clustering), note the statistical tests applied, summarize baseline implementations, and state the data-exclusion criteria. Corresponding expansions will also appear in the results section to ensure the claim is fully supported. revision: yes
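To make the "distinct failure modes via embedding-space clustering" definition concrete, the sketch below counts unique vulnerabilities by greedily grouping failure embeddings that fall within a fixed radius of an existing cluster center. The greedy procedure, the Euclidean metric, and the `radius` parameter are assumptions chosen for brevity; the rebuttal does not say which clustering method is used.

```python
import math

def count_unique_failures(failure_embeddings, radius):
    """Greedy leader clustering: a point starts a new cluster only if it is
    farther than `radius` from every existing cluster center."""
    centers = []
    for z in failure_embeddings:
        if all(math.dist(z, c) > radius for c in centers):
            centers.append(z)
    return len(centers)

# Four made-up failure embeddings forming two tight groups.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
n_unique = count_unique_failures(points, radius=1.0)
```

Because the count depends on `radius`, any reported "percent more unique vulnerabilities" would need the threshold stated alongside it, which is the gap the referee points to.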
Circularity Check
No circularity: framework uses separately trained embedding and RL policy on virtual rollouts
The paper presents an empirical framework that trains a vision-language embedding on limited success-failure data, treats the resulting space as a potential field by construction of the method, and trains a separate deep RL policy on virtual rollouts within that space to produce a vulnerability map. No equations, derivations, or self-citations are shown that reduce any claimed prediction or result to its own inputs by definition. Experimental claims (e.g., 23% more vulnerabilities) rest on benchmark comparisons rather than fitted parameters renamed as predictions or self-referential uniqueness theorems. The approach is self-contained as a methodological pipeline without load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The vision-language embedding space is rich in semantic and visual variations that align with manipulation policy vulnerabilities.
Forward citations
Cited by 1 Pith paper
-
RePO-VLA: Recovery-Driven Policy Optimization for Vision-Language-Action Models
RePO-VLA raises average adversarial success rates in VLA manipulation from 20% to 75% by using recovery-aware initialization, a progress-aware semantic value function, and value-conditioned refinement on success and c...
Reference graph
Works this paper leans on
-
[1]
The colosseum: A benchmark for evaluating generalization for robotic manipulation
Wilbert Pumacay, Ishika Singh, Jiafei Duan, Ranjay Krishna, Jesse Thomason, and Dieter Fox. The colosseum: A benchmark for evaluating generalization for robotic manipulation. arXiv preprint arXiv:2402.08191, 2024. URL https://arxiv.org/pdf/2402.08191
-
[2]
Decomposing the generalization gap in imitation learning for visual robotic manipulation
Annie Xie, Lisa Lee, Ted Xiao, and Chelsea Finn. Decomposing the generalization gap in imitation learning for visual robotic manipulation. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 3153–3160. IEEE, 2024. URL https://arxiv.org/abs/2307.03659
-
[3]
Data scaling laws in imitation learning for robotic manipulation
Fanqi Lin, Yingdong Hu, Pingyue Sheng, Chuan Wen, Jiacheng You, and Yang Gao. Data scaling laws in imitation learning for robotic manipulation. arXiv preprint arXiv:2410.18647, 2024. URL https://arxiv.org/abs/2410.18647
-
[4]
The role of predictive uncertainty and diversity in embodied ai and robot learning
Ransalu Senanayake. The role of predictive uncertainty and diversity in embodied ai and robot learning. arXiv preprint arXiv:2405.03164, 2024. URL https://arxiv.org/pdf/2405.03164
-
[5]
AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation
Jiafei Duan, Wilbert Pumacay, Nishanth Kumar, Yi Ru Wang, Shulin Tian, Wentao Yuan, Ranjay Krishna, Dieter Fox, Ajay Mandlekar, and Yijie Guo. AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation. arXiv preprint arXiv:2410.00371, 2024. URL https://arxiv.org/pdf/2410.00371
-
[6]
Diffusion Policy: Visuomotor Policy Learning via Action Diffusion
Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, page 02783649241273668, 2023. URL https://arxiv.org/abs/2303.04137
-
[7]
Sebastian Thrun. Probabilistic robotics. Communications of the ACM, 45(3):52–57, 2002. URL https://docs.ufpr.br/~danielsantos/ProbabilisticRobotics.pdf
-
[8]
Fast Gaussian Process Occupancy Maps
Simon T O’Callaghan and Fabio T Ramos. Gaussian process occupancy maps. The International Journal of Robotics Research, 31(1):42–62, 2012. URL https://arxiv.org/pdf/1811.10156
-
[9]
Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer vision? Advances in neural information processing systems , 30,
-
[10]
URL https://arxiv.org/pdf/1703.04977
-
[11]
On the importance of exploration for generalization in reinforcement learning
Yiding Jiang, J Zico Kolter, and Roberta Raileanu. On the importance of exploration for generalization in reinforcement learning. Advances in Neural Information Processing Systems, 36, 2024. URL https://arxiv.org/pdf/2306.05483
-
[12]
A bayesian approach to generative adversarial imitation learning
Wonseok Jeon, Seokin Seo, and Kee-Eung Kim. A bayesian approach to generative adversarial imitation learning. Advances in neural information processing systems, 31, 2018. URL https://papers.nips.cc/paper_files/paper/2018/file/943aa0fcda4ee2901a7de9321663b114-Paper.pdf
-
[13]
Safe imitation learning via fast bayesian reward inference from preferences
Daniel Brown, Russell Coleman, Ravi Srinivasan, and Scott Niekum. Safe imitation learning via fast bayesian reward inference from preferences. In International Conference on Machine Learning, pages 1165–1177. PMLR,
-
[14]
URL https://papers.nips.cc/paper_files/paper/2018/file/943aa0fcda4ee2901a7de9321663b114-Paper.pdf
-
[15]
Bayesian Inverse Reinforcement Learning
Deepak Ramachandran and Eyal Amir. Bayesian Inverse Reinforcement Learning. In IJCAI, volume 7, pages 2586–2591, 2007. URL https://www.ijcai.org/Proceedings/07/Papers/416.pdf
-
[16]
Unpacking failure modes of generative policies: Runtime monitoring of consistency and progress
Christopher Agia, Rohan Sinha, Jingyun Yang, Zi-ang Cao, Rika Antonova, Marco Pavone, and Jeannette Bohg. Unpacking failure modes of generative policies: Runtime monitoring of consistency and progress. arXiv preprint arXiv:2410.04640, 2024. URL https://arxiv.org/pdf/2410.04640
-
[17]
Interactive Semantic Interventions for VLMs: A Human-in-the-Loop Investigation of VLM Failure
Lukas Klein, Kenza Amara, Carsten T Lüth, Hendrik Strobelt, Mennatallah El-Assady, and Paul F Jaeger. Interactive Semantic Interventions for VLMs: A Human-in-the-Loop Investigation of VLM Failure. In Neurips Safe Generative AI Workshop 2024, 2024. URL https://openreview.net/pdf?id=3kMucCYhYN
-
[18]
Decider: Leveraging foundation model priors for improved model failure detection and explanation
Rakshith Subramanyam, Kowshik Thopalli, Vivek Narayanaswamy, and Jayaraman J Thiagarajan. Decider: Leveraging foundation model priors for improved model failure detection and explanation. In European Conference on Computer Vision, pages 465–482. Springer,
- [19]
-
[20]
Reflect: Summarizing robot experiences for failure explanation and correction
Zeyi Liu, Arpit Bahety, and Shuran Song. Reflect: Summarizing robot experiences for failure explanation and correction. arXiv preprint arXiv:2306.15724 , 2023. URL https://arxiv.org/abs/2306.15724
- [22]
-
[23]
How do we fail? stress testing perception in autonomous vehicles
Harrison Delecki, Masha Itkina, Bernard Lange, Ransalu Senanayake, and Mykel J Kochenderfer. How do we fail? stress testing perception in autonomous vehicles. In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5139–5146. IEEE,
- [24]
-
[25]
Curiosity-driven red-teaming for large language models
Zhang-Wei Hong, Idan Shenfeld, Tsun-Hsuan Wang, Yung-Sung Chuang, Aldo Pareja, James Glass, Akash Srivastava, and Pulkit Agrawal. Curiosity-driven red-teaming for large language models. arXiv preprint arXiv:2402.19464, 2024. URL https://arxiv.org/pdf/2402.19464
-
[26]
A survey of algorithms for black-box safety validation of cyber-physical systems
Anthony Corso, Robert Moss, Mark Koren, Ritchie Lee, and Mykel Kochenderfer. A survey of algorithms for black-box safety validation of cyber-physical systems. Journal of Artificial Intelligence Research , 72:377–428,
- [27]
-
[28]
Out-of-distribution detection for automotive perception
Julia Nitsch, Masha Itkina, Ransalu Senanayake, Juan Nieto, Max Schmidt, Roland Siegwart, Mykel J Kochenderfer, and Cesar Cadena. Out-of-distribution detection for automotive perception. In 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), pages 2938–2943. IEEE, 2021. URL https://arxiv.org/pdf/2011.01413
-
[29]
SAFE: Sensitivity-aware features for out-of-distribution object detection
Samuel Wilson, Tobias Fischer, Feras Dayoub, Dimity Miller, and Niko Sünderhauf. SAFE: Sensitivity-aware features for out-of-distribution object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23565–23576, 2023. URL https://openaccess.thecvf.com/content/ICCV2023/ papers/Wilson SAFE Sensitivity-Aware Features...
-
[30]
Pytorch-ood: A library for out-of-distribution detection based on pytorch
Konstantin Kirchheim, Marco Filax, and Frank Ortmeier. Pytorch-ood: A library for out-of-distribution detection based on pytorch. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 4351–4360, 2022. URL https://openaccess.thecvf.com/content/CVPR2022W/ HCIS/papers/Kirchheim PyTorch-OOD A Library for Out-of-Distribut...
-
[31]
PAGER: A Framework for Failure Analysis of Deep Regression Models
Jayaraman J Thiagarajan, Vivek Narayanaswamy, Puja Trivedi, and Rushil Anirudh. PAGER: A Framework for Failure Analysis of Deep Regression Models. arXiv preprint arXiv:2309.10977, 2023. URL https://arxiv.org/pdf/2309.10977
-
[32]
Sam2act: Integrating visual foundation model with a memory architecture for robotic manipulation
Haoquan Fang, Markus Grotz, Wilbert Pumacay, Yi Ru Wang, Dieter Fox, Ranjay Krishna, and Jiafei Duan. Sam2act: Integrating visual foundation model with a memory architecture for robotic manipulation. arXiv preprint arXiv:2501.18564, 2025
-
[33]
RT-1: Robotics Transformer for Real-World Control at Scale
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022. URL https://arxiv.org/pdf/2212.06817
-
[35]
URL https://arxiv.org/pdf/2307.15818
-
[36]
Learning Dexterous In-Hand Manipulation
OpenAI: Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous in-hand manipulation. The International Journal of Robotics Research, 39(1):3–20, 2020. URL https://arxiv.org/pdf/1808.00177
-
[37]
Human-AI Safety: A Descendant of Generative AI and Control Systems Safety
Andrea Bajcsy and Jaime F Fisac. Human-AI Safety: A Descendant of Generative AI and Control Systems Safety. arXiv preprint arXiv:2405.09794 , 2024. URL https://arxiv.org/pdf/2405.09794
-
[38]
Concept: Dynamic Risk Assessment for AI-Controlled Robotic Systems
Philipp Grimmeisen, Friedrich Sautter, and Andrey Morozov. Concept: Dynamic Risk Assessment for AI-Controlled Robotic Systems. arXiv preprint arXiv:2401.14147, 2024. URL https://arxiv.org/pdf/2401.14147
-
[39]
Lindsay Sanneman and Julie A Shah. The situation awareness framework for explainable AI (SAFE-AI) and human factors considerations for XAI systems. International Journal of Human–Computer Interaction, 38(18-20):1772–1788, 2022. URL https://pmc.ncbi.nlm.nih.gov/articles/PMC7338174/
-
[40]
Failure prediction with statistical guarantees for vision-based robot control
Alec Farid, David Snyder, Allen Z Ren, and Anirudha Majumdar. Failure prediction with statistical guarantees for vision-based robot control. arXiv preprint arXiv:2202.05894, 2022. URL https://arxiv.org/pdf/2202.05894
-
[41]
Distributionally robust policy learning via adversarial environment generation
Allen Z Ren and Anirudha Majumdar. Distributionally robust policy learning via adversarial environment generation. IEEE Robotics and Automation Letters, 7(2):1379–1386, 2022. URL https://arxiv.org/pdf/2107.06353
-
[42]
Teaser: Fast and certifiable point cloud registration
Heng Yang, Jingnan Shi, and Luca Carlone. Teaser: Fast and certifiable point cloud registration. IEEE Transactions on Robotics , 37(2):314–333, 2020. URL https://arxiv.org/abs/2001.07715
-
[43]
Full-Distribution Generalization Bounds for Imitation Learning Policies
Joseph A Vincent, Haruki Nishimura, Masha Itkina, and Mac Schwager. Full-Distribution Generalization Bounds for Imitation Learning Policies. In First Workshop on Out-of-Distribution Generalization in Robotics at CoRL 2023, 2023. URL https://openreview.net/pdf?id=JZkwYiyy9I
-
[44]
Minimum-violation LTL Planning with Conflicting Specifications
Jana Tůmová, Luis I Reyes Castro, Sertac Karaman, Emilio Frazzoli, and Daniela Rus. Minimum-violation LTL planning with conflicting specifications. In 2013 American Control Conference, pages 200–205. IEEE, 2013. URL https://arxiv.org/pdf/1303.3679
-
[45]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. URL https://arxiv.org/pdf/1707.06347
-
[46]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations, 2020. URL https://arxiv.org/pdf/2010.11929
-
[47]
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. URL https://arxiv.org/pdf/2103.00020
-
[48]
robosuite: A Modular Simulation Framework and Benchmark for Robot Learning
Yuke Zhu, Josiah Wong, Ajay Mandlekar, Roberto Martín-Martín, Abhishek Joshi, Soroush Nasiriany, and Yifeng Zhu. robosuite: A modular simulation framework and benchmark for robot learning. arXiv preprint arXiv:2009.12293, 2020. URL https://arxiv.org/abs/2009.12293
-
[49]
What Matters in Learning from Offline Human Demonstrations for Robot Manipulation
Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Martín-Martín. What matters in learning from offline human demonstrations for robot manipulation. In arXiv preprint arXiv:2108.03298, 2021. URL https://arxiv.org/abs/2108.03298
-
[50]
MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations
Ajay Mandlekar, Soroush Nasiriany, Bowen Wen, Iretiayo Akinola, Yashraj Narang, Linxi Fan, Yuke Zhu, and Dieter Fox. Mimicgen: A data generation system for scalable robot learning using human demonstrations. In 7th Annual Conference on Robot Learning, 2023. URL https://arxiv.org/abs/2310.17596
-
[52]
URL https://arxiv.org/abs/1602.01783
-
[53]
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pages 1861–1870. PMLR, 2018. URL https://arxiv.org/abs/1801.01290
-
[54]
Offline Reinforcement Learning with Implicit Q-Learning
Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit q-learning. arXiv preprint arXiv:2110.06169, 2021. URL https://arxiv.org/abs/2110.06169
-
[55]
Off-policy deep reinforcement learning without exploration
Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International conference on machine learning, pages 2052–2062. PMLR, 2019. URL https://arxiv.org/abs/1812.02900
-
[56]
Learning to generalize across long-horizon tasks from human demonstrations
Ajay Mandlekar, Danfei Xu, Roberto Martín-Martín, Silvio Savarese, and Li Fei-Fei. Learning to generalize across long-horizon tasks from human demonstrations. arXiv preprint arXiv:2003.06085, 2020. URL https://arxiv.org/abs/2003.06085
-
[57]
Yifan Zhou, Shubham Sonawani, Mariano Phielipp, Simon Stepputtis, and Heni Ben Amor. Modularity through attention: Efficient training and transfer of language-conditioned policies for robot manipulation. arXiv preprint arXiv:2212.04573, 2022. URL https://arxiv.org/abs/2212.04573.
APPENDIX I. EXPERIMENTAL SETUP
A. Real-World Experiment Setup
Real-worl...
-
[58]
Reinforcement Learning (RL) Baselines: The RL baselines were implemented using well-established algorithms, each optimized for the task to ensure a fair comparison. The following RL methods were included:
• Proximal Policy Optimization (PPO): A policy-gradient method known for its stability and efficiency. Key hyperparameters included:
– Learning rate...
-
[59]
We evaluated 3 state-of-the-art VLMs adapted to our task:
Vision-Language Model (VLM) Baselines: The VLM baselines take advantage of the interplay between visual and textual modalities for task representation. We evaluated 3 state-of-the-art VLMs adapted to our task:
-
[60]
Qwen2-VL
Additionally, we leverage GPT-4o with in-context learning, using five demonstrations. First, we process the output trajectories into videos and compute the appropriate frame rate to generate video sequences equivalent to 15 frames per trajectory pair. These sequences, representing perturbation scenarios, are provided to the VLMs along with a sy...
-
[61]
Change cube color to red
-
[62]
Change cube color to green
-
[63]
Change cube color to blue
-
[64]
Change cube color to gray
-
[65]
Change table color to green
-
[66]
Change table color to blue
-
[67]
Change table color to red
-
[68]
Change table color to gray
-
[69]
Resize table to (0.8, 0.2, 0.025)
-
[70]
Resize table to (0.2, 0.8, 0.025)
-
[71]
Resize cube to (0.04, 0.04, 0.04)
-
[72]
Resize cube to (0.01, 0.01, 0.01)
-
[73]
Resize cube to (0.04, 0.01, 0.01)
-
[74]
Change robot color to red
-
[75]
Change robot color to green
-
[76]
Change robot color to cyan
-
[77]
Change robot color to gray
-
[78]
Change lighting color to red
-
[79]
Change lighting color to green
-
[80]
Change lighting color to blue
-
[81]
Change lighting color to gray
B. Evaluation
Fig 12 illustrates the similarity structure of embeddings trained using only Binary Cross-Entropy (BCE) loss, resulting in highly correlated representations. In contrast, the right matrix, trained with a combination of BCE and Contrastive Loss, demonstrates improved separation, as evidenced by the stronger diago...
-
[82]
Semantic Guidance: Textual representations carry rich semantic information that can guide the image backbone. Instead of relying solely on visual cues, the model gains an additional perspective on the underlying concepts (e.g., object names, attributes, or relations)
-
[83]
Improved Discriminative Power: With access to text-based information, the model can differentiate between visually similar classes by leveraging linguistic differences in their corresponding textual descriptions
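The Evaluation note above contrasts embeddings trained with BCE alone against embeddings trained with BCE plus a contrastive term. A minimal sketch of such a combined objective for one pair of samples follows; the margin-based pairwise contrastive form, the `weight` and `margin` values, and all function names are assumptions, since the appendix text here does not spell out the exact loss.

```python
import math

def bce(p, y, eps=1e-7):
    """Binary cross-entropy for a predicted success probability p and label y in {0, 1}."""
    p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def contrastive(z_a, z_b, same_label, margin=1.0):
    """Pairwise contrastive loss: pull same-label embeddings together,
    push different-label embeddings at least `margin` apart."""
    d = math.dist(z_a, z_b)
    return d ** 2 if same_label else max(0.0, margin - d) ** 2

def combined_loss(p_a, y_a, p_b, y_b, z_a, z_b, weight=0.5):
    """BCE on both predictions plus a weighted contrastive term on the embeddings."""
    return bce(p_a, y_a) + bce(p_b, y_b) + weight * contrastive(z_a, z_b, y_a == y_b)

# A success sample and a failure sample whose embeddings are already well separated.
loss = combined_loss(0.9, 1, 0.1, 0, (0.0, 0.0), (2.0, 0.0))
```

In this sketch the contrastive term only penalizes success and failure embeddings that sit closer than the margin, which is one way to obtain the better-separated similarity structure the note describes.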