{"total":12,"items":[{"citing_arxiv_id":"2605.10903","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models","primary_cat":"cs.CV","submitted_at":"2026-05-11T17:41:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Capability vectors extracted from parameter differences between standard and auxiliary-finetuned VLA models can be merged into pretrained weights to match auxiliary-training performance while reducing computational overhead during adaptation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02881","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MolmoAct2: Action Reasoning Models for Real-world Deployment","primary_cat":"cs.RO","submitted_at":"2026-05-04T17:51:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MolmoAct2 is an open VLA model that outperforms baselines like Pi-05 on 7 benchmarks and whose backbone surpasses GPT-5 on 13 embodied-reasoning tasks through new datasets, specialized training, and architecture changes for lower latency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17887","ref_index":54,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement","primary_cat":"cs.RO","submitted_at":"2026-04-20T06:57:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, 
directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict action accuracy on AgiBot and 9.7-17.6% gains in real-robot tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"[52] Andy Zhai, Brae Liu, Bruno Fang, Chalse Cai, Ellie Ma, Ethan Yin, Hao Wang, Hugo Zhou, James Wang, Lights Shi, et al. Igniting vlms toward the embodied space. arXiv preprint arXiv:2509.11766, 2025. 1 [53] Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023. 2 [54] Tony Z Zhao, Jonathan Tompson, Danny Driess, Pete Florence, Kamyar Ghasemipour, Chelsea Finn, and Ayzaan Wahid. Aloha unleashed: A simple recipe for robot dexterity. arXiv preprint arXiv:2410.13126, 2024. 2 [55] Siyuan Zhou, Yilun Du, Jiaben Chen, Yandong Li, Dit-Yan Yeung, and Chuang Gan. Robodreamer: Learning compositional world models for robot imagination."},{"citing_arxiv_id":"2602.00937","ref_index":67,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CLAMP: Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining","primary_cat":"cs.RO","submitted_at":"2026-01-31T23:32:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CLAMP pretrains 3D multi-view encoders with contrastive learning on point clouds and actions, then initializes diffusion policies for more sample-efficient fine-tuning on robotic tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.07371","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ESPADA: Execution Speedup via Semantics Aware Demonstration Data Downsampling for Imitation 
Learning","primary_cat":"cs.RO","submitted_at":"2025-12-08T10:08:33+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ESPADA uses semantic segmentation from VLMs and LLMs plus DTW to downsample non-critical segments in demonstrations, delivering about 2x faster robot execution in behavior cloning while maintaining task success rates.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.08547","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"R2RGEN: Real-to-Real 3D Data Generation for Spatially Generalized Manipulation","primary_cat":"cs.RO","submitted_at":"2025-10-09T17:55:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"R2RGen introduces a simulator-free three-stage pipeline that parses, augments, and post-processes real pointcloud observation-action pairs to improve spatial generalization in robotic manipulation policies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.15953","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ViTacFormer: Learning Cross-Modal Representation for Visuo-Tactile Dexterous Manipulation","primary_cat":"cs.RO","submitted_at":"2025-06-19T01:38:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ViTacFormer learns a cross-modal visuo-tactile latent space with autoregressive tactile prediction and an easy-to-hard curriculum, then uses the representation for imitation learning that yields ~50% higher success and the first reported 11-stage, 2.5-minute autonomous dexterous 
tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.15799","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Steering Your Diffusion Policy with Latent Space Reinforcement Learning","primary_cat":"cs.RO","submitted_at":"2025-06-18T18:35:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DSRL steers pretrained diffusion policies for robotics by applying RL to their latent noise inputs, achieving sample-efficient real-world adaptation with only black-box access.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.07339","ref_index":69,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Real-Time Execution of Action Chunking Flow Policies","primary_cat":"cs.RO","submitted_at":"2025-06-09T01:01:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Real-time chunking (RTC) allows diffusion- and flow-based action chunking policies to execute smoothly and asynchronously, maintaining high success rates on dynamic tasks even with significant inference latency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.19645","ref_index":58,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success","primary_cat":"cs.RO","submitted_at":"2025-02-27T00:30:29+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OpenVLA-OFT fine-tuning boosts LIBERO success rate from 76.5% to 97.1%, speeds action generation 26x, and outperforms baselines on real bimanual dexterous 
tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[56] Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023. [57] Tony Z Zhao, Jonathan Tompson, Danny Driess, Pete Florence, Kamyar Ghasemipour, Chelsea Finn, and Ayzaan Wahid. Aloha unleashed: A simple recipe for robot dexterity. arXiv preprint arXiv:2410.13126, 2024. [58] Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, and Chuang Gan. 3d-vla: 3d vision-language-action generative world model. arXiv preprint arXiv:2403.09631, 2024. [59] Hongkuan Zhou, Xiangtong Yao, Yuan Meng, Siming Sun, Zhenshan Bing, Kai Huang, and Alois Knoll. Language-conditioned learning for robotic manipulation:"},{"citing_arxiv_id":"2501.09747","ref_index":71,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FAST: Efficient Action Tokenization for Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2025-01-16T18:57:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FAST applies discrete cosine transform to robot action sequences for efficient tokenization, enabling autoregressive VLAs to succeed on high-frequency dexterous tasks and scale to 10k hours of data while matching diffusion VLA performance with up to 5x faster training.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[69] Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023. [70] Tony Z Zhao, Jonathan Tompson, Danny Driess, Pete Florence, Kamyar Ghasemipour, Chelsea Finn, and Ayzaan Wahid. Aloha unleashed: A simple recipe for robot dexterity. arXiv preprint arXiv:2410.13126, 2024. 
[71] Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, and Chuang Gan. 3d-vla: A 3d vision-language-action generative world model. arXiv preprint arXiv:2403.09631, 2024. [72] Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, and Chuang Gan. 3d-vla: 3d vision-language-action generative world model."},{"citing_arxiv_id":"2410.24164","ref_index":58,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"$\\pi_0$: A Vision-Language-Action Flow Model for General Robot Control","primary_cat":"cs.LG","submitted_at":"2024-10-31T17:22:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"π₀ is a vision-language-action flow model trained on diverse multi-platform robot data that supports zero-shot task performance, language instruction following, and efficient fine-tuning for dexterous tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"of a larger VLM backbone and a smaller action expert for processing robot states and actions. The VLM backbone weights are initialized from PaliGemma [5], providing representations learned from large-scale Internet pre-training. The resulting π₀ model can be used to control multiple robot embodiments with differing action spaces to accomplish a wide variety of tasks. of more complex and dexterous behaviors, such as tying shoelaces [58] or cooking shrimp [17], we show that our framework can learn very long tasks, sometimes tens of minutes in length, for behaviors that combine both physical dexterity and combinatorial complexity. For example, our laundry folding task requires the robot to manipulate a variety of clothing items that can start in any configuration, and fold multiple items in sequence."}],"limit":50,"offset":0}