{"total":15,"items":[{"citing_arxiv_id":"2607.02501","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Embodied.cpp: A Portable Inference Runtime of Embodied AI Models on Heterogeneous Robots","primary_cat":"cs.RO","submitted_at":"2026-07-02T17:58:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Embodied.cpp introduces a portable C++ inference runtime with modular layers for deploying VLA and WAM models on heterogeneous robots, reporting 100% and 91% task success on two models plus memory reduction on a WAM benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.01804","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VLA-Corrector: Lightweight Detect-and-Correct Inference for Adaptive Action Horizon","primary_cat":"cs.RO","submitted_at":"2026-07-02T07:18:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VLA-Corrector adds a detect-and-correct inference layer using a latent vision monitor and online gradient guidance to enable adaptive action horizons in chunked VLA policies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.31723","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"UniTacVLA: Unified Tactile Understanding and Prediction in Vision Language Action Models","primary_cat":"cs.RO","submitted_at":"2026-06-30T14:24:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UniTacVLA builds a state-aware and dynamics-aware tactile prior via unified latent space, tactile chain-of-thought, and mixed real/predicted feedback controller to boost dexterous manipulation 
performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30686","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Position: Vision-Language-Action Models Cannot Be Verified to Perform Physical Reasoning","primary_cat":"cs.RO","submitted_at":"2026-06-28T14:03:57+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"VLA benchmark success rates cannot distinguish semantic generalization from physical reasoning due to an identifiability gap in current evaluation protocols.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.20905","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Vesta: A Generalist Embodied Reasoning Model","primary_cat":"cs.RO","submitted_at":"2026-06-18T20:01:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Vesta is a unified embodied generalist model that outperforms specialist baselines by over 20% on average and improves real-world robotic task success by over 35%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.20458","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Slow Brain, Fast Planner: Latency-Resilient VLM-Augmented Urban Navigation","primary_cat":"cs.RO","submitted_at":"2026-06-18T16:40:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A training-free fusion layer enables stale VLM selections to improve a real-time planner's trajectory scoring for urban sidewalk navigation, yielding 30% ADE reduction in challenging 
scenarios.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.17055","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"T-Rex: Tactile-Reactive Dexterous Manipulation","primary_cat":"cs.RO","submitted_at":"2026-06-15T17:59:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"T-Rex introduces a large tactile dataset and MoT architecture that achieves over 30% higher success rates than baselines on 12 tasks requiring force control and deformable object handling.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22671","ref_index":7,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Abstraction to Instantiation: Learning Behavioral Representation for Vision-Language-Action Model","primary_cat":"cs.CV","submitted_at":"2026-05-21T16:14:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"BehaviorVLA learns long-horizon behavioral representations via causal Mamba encoder and phase-conditioned decoder, reporting SOTA results of 58% on RoboTwin 2.0, 98% on LIBERO, 4.36 on CALVIN, and matching OpenVLA-OFT performance with 50% data in sim-to-real transfer.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10942","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-11T17:59:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating 
mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"with web-scale knowledge for robot manipulation. arXiv preprint arXiv:2410.06158, 2024. [11] Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14455-14465, 2024. [12] Hao Chen, Jiaming Liu, Chenyang Gu, Zhuoyang Liu, Renrui Zhang, Xiaoqi Li, Xiao He, Yandong Guo, Chi-Wing Fu, Shanghang Zhang, et al. Fast-in-slow: A dual-system foundation model unifying fast manipulation within slow reasoning. arXiv preprint arXiv:2506.01953, 2025. [13] Xiaowei Chi, Peidong Jia, Chun-Kai Fan, Xiaozhu Ju, Weishi Mi, Kevin Zhang, Zhiyuan Qin,"},{"citing_arxiv_id":"2605.07308","ref_index":11,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-08T06:17:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AT-VLA proposes adaptive tactile injection and a dual-stream tactile reaction mechanism to enhance VLA models for contact-rich robotic manipulation with real-time responses.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"highlights the importance of effective training strategies by introducing a componentized architecture, where outputs from a visual-language model serve as preconditions for the diffusion-based action transformer. In this paper, following the vanilla VLA GO-1 [10], we adopt this manner to formulate the action modeling. 
Moreover, to enhance inference speed, several VLA [11, 12, 16, 29-31, 33, 34] approaches adopt a dual-system strategy for action generation, where the slow system targets on high-level Vision-Language Model reasoning and the fast system acts as low-level policy. For instance, Gr00t-N1 [5] employs a fast visual stream alongside a slower semantic reasoning stream to preserve high-level planning ability and accelerate low-level action prediction."},{"citing_arxiv_id":"2604.28192","ref_index":9,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning","primary_cat":"cs.RO","submitted_at":"2026-04-30T17:59:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LaST-R1 introduces a RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025. [8] Jiaming Liu, Hao Chen, Pengju An, Zhuoyang Liu, Renrui Zhang, Chenyang Gu, Xiaoqi Li, Ziyu Guo, Sixiang Chen, Mengzhen Liu, et al. Hybridvla: Collaborative diffusion and autoregression in a unified vision-language-action model. arXiv preprint arXiv:2503.10631, 2025. 10 [9] Hao Chen, Jiaming Liu, Chenyang Gu, Zhuoyang Liu, Renrui Zhang, Xiaoqi Li, Xiao He, Yandong Guo, Chi-Wing Fu, Shanghang Zhang, et al. Fast-in-slow: A dual-system foundation model unifying fast manipulation within slow reasoning. arXiv preprint arXiv:2506.01953, 2025. 
[10] Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng,"},{"citing_arxiv_id":"2604.04161","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Adaptive Action Chunking at Inference-time for Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-04-05T16:03:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Adaptive Action Chunking uses action entropy to dynamically adjust chunk sizes in VLA models, improving performance on simulated and real robotic manipulation tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ishnan, Karol Hausman, Alexander Herzog, Jasmine Hsu, et al. RT-1: Robotics Transformer for real-world control at scale. Robotics: Science and Systems XIX, 2023. 2 [6] Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. WorldVLA: Towards autoregressive action world model. arXiv preprint arXiv:2506.21539, 2025. 2 [7] Hao Chen, Jiaming Liu, Chenyang Gu, Zhuoyang Liu, Renrui Zhang, Xiaoqi Li, Xiao He, Yandong Guo, Chi-Wing Fu, Shanghang Zhang, et al. Fast-in-Slow: A dual-system foundation model unifying fast manipulation within slow reasoning. arXiv preprint arXiv:2506.01953, 2025. 2 [8] Ruopei Chen, Ke Wang, et al. Adaptive action chunk selector. 
Stanford CS224R 2025 Final Report, 2025."},{"citing_arxiv_id":"2602.09023","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TwinRL: Digital Twin-Driven Reinforcement Learning for Real-World Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-02-09T18:59:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TwinRL expands RL exploration via digital twin reconstruction and twin RL warm-up to guide real-world learning, reaching near-100% success with 20 minutes of on-robot time across four tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.01773","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"IGen: Scalable Data Generation for Robot Learning from Open-World Images","primary_cat":"cs.RO","submitted_at":"2025-12-01T15:15:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"IGen generates realistic visuomotor training data including actions and temporally coherent visuals from unstructured open-world images via 3D reconstruction and VLM reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.13073","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey","primary_cat":"cs.RO","submitted_at":"2025-08-18T16:45:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"This survey organizes large VLM-based VLA models for robotic manipulation into monolithic and hierarchical paradigms, reviews their integrations and datasets, and outlines future 
directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"generation or a token to refuse infeasible commands. VQ-VLA [136] adopts a convolutional residual VQ-VAE [146], which is pretrained with action sequence, to take the place of the binning method of OpenVLA [26]. The model shows linear performance gains from more simulated data and has less sim-to-real gap. Some models integrate the action expert into the VLA backbone, Fast-in-Slow [40] is a typical example. As shown in the \"Unified Action Expert\" part at the top-right of Fig. 5, the action expert utilizes the final transformer blocks of the VLM backbone. They run at different frequencies, enabling seamless coordination between the two systems within a single pretrained model. 3.2.2 Parallel-based Methods As shown in the lower part of Fig."}],"limit":50,"offset":0}