{"total":28,"items":[{"citing_arxiv_id":"2605.22816","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation","primary_cat":"cs.RO","submitted_at":"2026-05-21T17:58:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AwareVLN introduces a structural reasoning module and automatic data engine with progress division to equip VLN agents with self-awareness of agent state and task progress, outperforming prior methods on Habitat datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22183","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Action with Visual Primitives","primary_cat":"cs.RO","submitted_at":"2026-05-21T08:52:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AVP architecture has VLM emit visual-primitive tokens to condition flow-matching action expert, yielding 27.61% higher success rate than pi_0.5 on real-robot pick-and-place tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17077","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"How to Instruct Your Robot: Dense Language Annotations Power Robot Policy Learning","primary_cat":"cs.RO","submitted_at":"2026-05-16T16:52:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DeMiAn re-annotates robot and egocentric videos with VLM-generated dense labels across motion, scene, pose, and reasoning aspects, then uses a learned instructor to boost policy success by 5 points on RoboCasa over task-only 
baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13632","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-13T14:58:29+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GTA-VLA conditions VLA models on user spatial priors to produce a unified spatial-visual chain-of-thought, reaching 81.2% success on SimplerEnv WidowX and improving performance under out-of-distribution shifts.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"semantic shifts, and provide little transparency when failures occur. When perception fails due to clutter, lighting variation, or unseen objects, humans have no explicit interface to re-ground the robot's attention or provide targeted corrective guidance. Recent work has begun to move beyond direct \"Sense-to-Act\" policies through Embodied Chain-of-Thought (CoT) reasoning [2,13,11,35,27,20], shifting toward a more structured \"Sense, Think, and Act\" paradigm. By explicitly predicting intermediate representations, such as task decomposition, grounding cues, or motion plans, these methods improve interpretability and expose part of the policy's decision-making process. 
However, in existing systems, the reasoning process remains largely self-contained: although inter-"},{"citing_arxiv_id":"2605.13119","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-13T07:40:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"where each Tg is a callable VLA tool specialized for one tool family. The instruction zk grounds this tool family in the current scene by specifying the object, relation, and desired local effect. The selected tool Tgk executes the call over a bounded low-level horizon. Given robot observations ot, it produces actions and a bounded trajectory at ∼T gk(· |o t, zk, ht), τ k = (otk:tk+Hk , atk:tk+Hk−1),(2) where ht denotes an optional short execution history and Hk is the call horizon. After or during execution, the selected tool returns feedback rk such as progress or completion information. The agent state is then updated as sk+1 =U(s k, ck, τk, rk), r k ∈ R,(3) and the loop repeats until the task terminates. 
Thus, the interface defines both directions of com-"},{"citing_arxiv_id":"2605.12167","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-12T14:15:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation [ICML 2026] Den Oord et al., 2017) for discrete latent actions, as shown in the left part of Figure 2. Given a current RGB frame orgb t and a future RGB frame orgb t+k, we first extract RGB features using a ViT-based (Dosovitskiy, 2020) image encoder: hrgb t =E rgb(orgb t ), h rgb t+k =E rgb(orgb t+k).(1) To model the temporal interaction between the current and future observations, we introduce a set of learnable latent action queries, which are initialized and interact with the RGB features through a spatiotemporal transformer: ˜ht→t+k =T (m) \u0010 q(m), hrgb t , hrgb t+k \u0011 ,(2) where q(m) denotes the latent action queries and T (m) de- notes the modality-specific spatiotemporal transformer that"},{"citing_arxiv_id":"2604.18463","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Using large language models for embodied planning introduces systematic safety risks","primary_cat":"cs.AI","submitted_at":"2026-04-20T16:18:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLM planners for robots often produce dangerous plans even when planning succeeds, with 
safety awareness staying flat as model scale improves planning ability.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"More recent work has explored multimodal embodied language models that incorporate sensor data directly [38], vision-language-action models that output robot actions as text tokens [39], and closed-loop reasoning systems that incorporate environment feedback [40]. Additional approaches include hierarchical policies bridging high-level language to low-level motor execution [41], efficient action tokenization [42], foundation models for humanoid robots [43], large-scale multi-robot datasets [44], and on-device distillation of language models for robot planning with minimal human supervision [45]. Our benchmark evaluates raw LLM planning capabilities rather than hybrid systems integrating external verifiers. Systems combining LLMs with symbolic planners [46, 47] may exhibit different"},{"citing_arxiv_id":"2604.15938","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VADF: Vision-Adaptive Diffusion Policy Framework for Efficient Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-04-17T10:56:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VADF adds an Adaptive Loss Network for hard-negative training sampling and a Hierarchical Vision Task Segmenter for adaptive noise scheduling during inference to speed convergence and reduce timeouts in diffusion robotic policies.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"distinguishes itself via the HVTS, which achieves zero-shot adaptive acceleration. Unlike prior works, HVTS jointly co-adapts both the denoising budget (Nd) and action prediction horizon (Na) based on real-time subtask semantics, providing a plug-and-play solution without additional training. 
Task Decomposition and Hierarchical Control. Hierarchical architectures [11, 22, 31] and Chain-of-Thought (CoT) reasoning [1,2,33] have shown excellence in long-horizon tasks through semantic planning and spatial decomposition. Related advances in VLA systems and spatial reasoning [14,15,18,19] further highlight the value of structured perception, reasoning, and action for complex embodied decision-making. However, their autoregressive nature often incurs high latency, hindering real-time interaction."},{"citing_arxiv_id":"2604.14125","ref_index":2,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System","primary_cat":"cs.CV","submitted_at":"2026-04-15T17:50:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HiVLA decouples VLM-based semantic planning with visual grounding from a cascaded cross-attention DiT action expert, outperforming end-to-end VLAs on long-horizon and fine-grained manipulation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"grated paradigms have shown considerable promise, they face a critical bottleneck [13,15] that fine-tuning VLMs on relatively scarce and domain-specific manipulation data inevitably degrades their original reasoning capabilities. This degradation, widely recognized as catastrophic forgetting, ultimately limits the ability to leverage the full cognitive power of the most advanced VLMs. Hierarchical systems [2,23,32] offer a compelling alternative by explicitly decoupling high-level semantic planning from low-level motor control. In this paradigm, the VLM operates purely as a high-level planner, preserving its reasoning capabilities by avoiding low-level fine-tuning, while a dedicated action expert executes the plans. 
However, the success of this decoupled design heav-"},{"citing_arxiv_id":"2604.09059","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Learning Vision-Language-Action World Models for Autonomous Driving","primary_cat":"cs.CV","submitted_at":"2026-04-10T07:38:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[5] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025. 1, 3 [6] Suneel Belkhale, Tianli Ding, Ted Xiao, Pierre Sermanet, Quon Vuong, Jonathan Tompson, Yevgen Chebotar, Debidatta Dwibedi, and Dorsa Sadigh. Rt-h: Action hierarchies using language. arXiv preprint arXiv:2403.01823, 2024. 
3 [7] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al."},{"citing_arxiv_id":"2603.15620","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Towards Generalizable Robotic Manipulation in Dynamic Environments","primary_cat":"cs.CV","submitted_at":"2026-03-16T17:59:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DOMINO dataset and PUMA architecture enable better dynamic robotic manipulation by incorporating motion history, delivering 6.3% higher success rates than prior VLA models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.08392","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs","primary_cat":"cs.RO","submitted_at":"2026-02-09T08:47:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ST-BiBench reveals a coordination paradox in which MLLMs show strong high-level strategic reasoning yet fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025. 2, S2 [12] Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. Multimodal machine learning: A survey and taxonomy. IEEE transactions on pattern analysis and machine intelligence, 41(2):423-443, 2018. 
3 [13] Suneel Belkhale, Tianli Ding, Ted Xiao, Pierre Sermanet, Quon Vuong, Jonathan Tompson, Yevgen Chebotar, Debidatta Dwibedi, and Dorsa Sadigh. Rt-h: Action hierarchies using language. arXiv preprint arXiv:2403.01823, 2024. [14] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al."},{"citing_arxiv_id":"2601.07060","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-01-11T21:00:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Several works repurpose pre-trained Vision-Language Models (VLMs) [1, 22, 41, 63, 71, 90, 104, 154, 155] into VLA policies that map visual observations and language instructions to low-level robot actions [6, 21, 23, 47, 52, 74, 75, 105, 123, 149] by fine-tuning on large-scale robotics datasets [27, 54, 94]. A prominent paradigm, pioneered by the RT series [4, 10, 157], formulates action generation as autoregressive prediction over tokenized sequences [9, 55, 70, 156]. In parallel, diffusion-based action generators [19, 37, 53, 80, 137] treat control as a denoising process in continuous trajectory spaces, producing smoother temporal dynamics. 
Despite their success, both paradigms rely on direct action prediction,"},{"citing_arxiv_id":"2511.18960","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention","primary_cat":"cs.LG","submitted_at":"2025-11-24T10:22:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AVA-VLA reformulates VLA learning as a POMDP using recurrent states and active visual attention to achieve state-of-the-art results on LIBERO, CALVIN, and real dual-arm tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.18085","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Continually Evolving Skill Knowledge in Vision Language Action Model","primary_cat":"cs.RO","submitted_at":"2025-11-22T15:00:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Stellar VLA achieves continual learning in VLA models by maintaining a growing knowledge space and routing tasks to specialized experts conditioned on semantic relations, delivering strong LIBERO benchmark results with only 1% data replay and successful real-world transfer on dual-arm hardware.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.12710","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Reflection-Based Task Adaptation for Self-Improving VLA","primary_cat":"cs.RO","submitted_at":"2025-10-14T16:44:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Reflective Self-Adaptation combines failure-reflective reinforcement learning with success-guided imitation learning to enable faster and more reliable 
task adaptation for pre-trained Vision-Language-Action models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.04447","ref_index":85,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge","primary_cat":"cs.CV","submitted_at":"2025-07-06T16:14:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DreamVLA uses dynamic-region-guided world knowledge prediction, block-wise attention to disentangle information types, and a diffusion transformer for actions, reaching 76.7% success on real robot tasks and 4.44 average length on CALVIN ABC-D.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Choromanski, Xi Chen, Yevgen Chebotar, Justice Carbajal, Noah Brown, Anthony Brohan, Montserrat Gonzalez Arenas, and Kehang Han. RT-2: vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, CoRL 2023, 6-9 November 2023, Atlanta, GA, USA, volume 229 of Proceedings of Machine Learning Research, pages 2165-2183. PMLR, 2023. 3, 28 [85] Suneel Belkhale, Tianli Ding, Ted Xiao, Pierre Sermanet, Quon Vuong, Jonathan Tompson, Yevgen Chebotar, Debidatta Dwibedi, and Dorsa Sadigh. RT-H: action hierarchies using language. CoRR, abs/2403.01823, 2024. 
3, 28 [86] Jiaming Liu, Hao Chen, Pengju An, Zhuoyang Liu, Renrui Zhang, Chenyang Gu, Xiaoqi Li, Ziyu Guo, Sixiang Chen, Mengzhen Liu, et al."},{"citing_arxiv_id":"2507.01925","ref_index":32,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Survey on Vision-Language-Action Models: An Action Tokenization Perspective","primary_cat":"cs.RO","submitted_at":"2025-07-02T17:34:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The survey frames VLA models as pipelines that generate progressively grounded action tokens and classifies those tokens into eight types to guide future development.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"large-scale embodied datasets and training generalist agents end-to-end on top of vision-language foundation models [21, 22, 23]. These diverse approaches have led to a rapid proliferation of VLA models in robotic manipulation [24, 25], navigation [26, 27], and autonomous driving [28, 29, 30], demonstrating promising capabilities in multitask learning [31], long-horizon task completion [22], and strong generalization [32]. By leveraging foundation model intelligence, they offer new directions for addressing long-standing challenges in embodied AI, such as data scarcity and poor cross-embodiment transferability, and pave the way for agents capable of solving open-ended tasks expressed via open-vocabulary instructions in open-world physical environments. 
The rapid progress, promising empirical results, and growing diversity of VLA models create an urgent need for"},{"citing_arxiv_id":"2505.13255","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Policy Contrastive Decoding for Robotic Foundation Models","primary_cat":"cs.RO","submitted_at":"2025-05-19T15:39:08+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PCD redirects robotic policies toward object-relevant visual features via contrastive decoding on masked inputs, improving generalization without retraining or weight access.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.16054","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"$\\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization","primary_cat":"cs.LG","submitted_at":"2025-04-22T17:31:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"π_{0.5} is a VLA model that achieves long-horizon dexterous manipulation in entirely new homes through co-training on heterogeneous tasks and multi-source data including web and semantic predictions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.15558","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning","primary_cat":"cs.AI","submitted_at":"2025-03-18T22:06:58+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Cosmos-Reason1-7B and 56B models are trained with physical common sense and embodied reasoning ontologies via supervised fine-tuning and reinforcement learning to produce next-step physical 
actions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.05231","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Kaiwu: A Multimodal Manipulation Dataset and Framework for Robot Learning and Human-Robot Interaction","primary_cat":"cs.RO","submitted_at":"2025-03-07T08:28:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces the Kaiwu multimodal dataset and framework with 11,664 synchronized assembling demonstrations including hand motions, pressures, sounds, multi-view videos, motion capture, eye gaze, and EMG signals with timestamp-based and semantic annotations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.03480","ref_index":33,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning","primary_cat":"cs.RO","submitted_at":"2025-03-05T13:16:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SafeVLA applies constrained reinforcement learning via CMDP min-max optimization to VLAs, cutting safety violation costs by 83.58% while preserving task success on long-horizon mobile manipulation tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.19417","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2025-02-26T18:58:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A hierarchical VLA architecture lets robots follow 
complex instructions and situated feedback by separating high-level reasoning from low-level control.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.05855","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control","primary_cat":"cs.RO","submitted_at":"2025-02-09T11:25:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DexVLA combines a scaled diffusion action expert with embodiment curriculum learning to achieve better generalization and performance than prior VLA models on diverse robot hardware and long-horizon tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Zheng, Z. Chen, J. Jang, Y. Li, C. Wang, M. Ding, D. Fox, and H. Yao. Grape: Generalizing robot policy via preference alignment. arXiv preprint arXiv:2411.19309, 2024. [45] Y. Guo, J. Zhang, X. Chen, X. Ji, Y.-J. Wang, Y. Hu, and J. Chen. Improving vision-language-action model with online reinforcement learning. arXiv preprint arXiv:2501.16664, 2025. [46] S. Belkhale, T. Ding, T. Xiao, P. Sermanet, Q. Vuong, J. Tompson, Y. Chebotar, D. Dwibedi, and D. Sadigh. Rt-h: Action hierarchies using language. arXiv preprint arXiv:2403.01823, 2024. [47] L. Yen-Chen, A. Zeng, S. Song, P. Isola, and T.-Y. Lin. Learning to see before learning to act: Visual pre-training for manipulation. 
In 2020 IEEE International Conference on Robotics and"},{"citing_arxiv_id":"2501.09747","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"FAST: Efficient Action Tokenization for Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2025-01-16T18:57:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FAST applies discrete cosine transform to robot action sequences for efficient tokenization, enabling autoregressive VLAs to succeed on high-frequency dexterous tasks and scale to 10k hours of data while matching diffusion VLA performance with up to 5x faster training.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ers, Clayton Tan, Alexander Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Mengyuan Yan, and Andy Zeng. Do as i can and not as i say: Grounding language in robotic affordances. In arXiv preprint arXiv:2204.01691, 2022. [3] Suneel Belkhale and Dorsa Sadigh. Minivla: A better vla with a smaller footprint, 2024. URL https://github.com/Stanford-ILIAD/openvla-mini. [4] Suneel Belkhale, Tianli Ding, Ted Xiao, Pierre Sermanet, Quon Vuong, Jonathan Tompson, Yevgen Chebotar, Debidatta Dwibedi, and Dorsa Sadigh. Rt-h: Action hierarchies using language, 2024. URL https://arxiv.org/abs/2403.01823. 
[5] Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen,"},{"citing_arxiv_id":"2412.13877","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation","primary_cat":"cs.RO","submitted_at":"2024-12-18T14:17:16+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RoboMIND is a large-scale multi-embodiment teleoperation dataset for robot manipulation containing 107k trajectories across four robots, with failure annotations and a digital twin simulator.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"show the annotation of a video of a Franka Emika Panda arm picking the apple and placing it in the drawer using the above standard procedure in Figure 5. The results show that our annotation scheme can accurately segment the key actions in the video and provide precise language descriptions of these key actions. These detailed descriptions can be used for training models like RT-H [7]. IV. DATASET ANALYSIS Based on a standardized procedure, we collected a large-scale, multi-embodiment dataset named RoboMIND. This dataset consists of 107k high-quality trajectories across 4 robotic embodiments, 479 tasks, 96 object classes, and 38 skills. 
Robotic data diversity plays a crucial role in model generalization, encompassing various dimensions across hard-"},{"citing_arxiv_id":"2405.14093","ref_index":116,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Survey on Vision-Language-Action Models for Embodied AI","primary_cat":"cs.RO","submitted_at":"2024-05-23T01:43:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"This is the first survey on vision-language-action models, providing a taxonomy across three lines, plus summaries of datasets, simulators, benchmarks, challenges, and future directions in embodied AI.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"RT-H [114]♢(Model design follows RT-2) BC (disc) Diverse+KitchenReal: Diverse+Kitchen eval tasks RT-X [115]♢(Models from RT-1 and RT-2) BC (disc) [SC: OXE]Real: BridgeV2, RT-1 evaluation tasks, etc. OpenVLA [35]♢DINOv2, SigLIP Prismatic-7B Symbol- tuning Concat BC (disc) OXE, DROIDReal: BridgeV2, RT-1 evaluation tasks, Franka-Tabletop, DROID, etc. OpenVLA-OFT [116]♢ (Improves OpenVLA with OFT recipe) BC (cont, parallel decode w/ chunk.) LIBERO, [SC]Sim: LIBERO;Real(ALOHA setup): fold, scoop, put TraceVLA [117]♢(Model design follows OpenVLA, adding visual trace prompting) BC (disc) BridgeV2, Fractal, [SC] Sim: SimplerEnv;Real(WidowX): pick, push, fold, swipe π0 [118]♢SigLIP PaliGemma Action expert Concat Flow matching OXE, [SC:π-"}],"limit":50,"offset":0}