{"total":15,"items":[{"citing_arxiv_id":"2605.17486","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization","primary_cat":"cs.RO","submitted_at":"2026-05-17T14:55:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DyGRO-VLA is a two-stage optimization framework for cross-task scaling of Vision-Language-Action models via dynamic grouped residual optimization in RL.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11114","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SEVO: Semantic-Enhanced Virtual Observation for Robust VLA Manipulation via Active Illumination and Data-Centric Collection","primary_cat":"cs.RO","submitted_at":"2026-05-11T18:23:04+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"SEVO raises ACT and SmolVLA pick-and-place success from 30-35% to 75-85% in novel environments by using active illumination, semantic cues, and diversified teleoperation data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00438","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Thinking in Text and Images: Interleaved Vision--Language Reasoning Traces for Long-Horizon Robot Manipulation","primary_cat":"cs.AI","submitted_at":"2026-05-01T06:15:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A multimodal transformer generates and caches interleaved text-image traces to guide closed-loop actions, achieving 92.4% success on LIBERO-Long and 95.5% average on 
LIBERO.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Eo-1: Interleaved vision-text-action pretraining for general robot control. arXiv preprint arXiv:2508.21112, 2025. [21] Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. Spatialvla: Exploring spatial representations for vision-language-action model. arXiv preprint arXiv:2501.15830, 2025. [22] Austin Stone, Ted Xiao, Yao Lu, Keerthana Gopalakrishnan, Kuang-Huei Lee, Quan Vuong, Paul Wohlhart, Sean Kirmani, Brianna Zitkovich, Fei Xia, et al. Open-world object manipulation using pre-trained vision-language models. arXiv preprint arXiv:2303.00905, 2023. [23] Wan Team, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu,"},{"citing_arxiv_id":"2604.15483","ref_index":76,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"${\\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities","primary_cat":"cs.LG","submitted_at":"2026-04-16T19:18:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ishnan, Kuang-Huei Lee, Quan Vuong, Paul Wohlhart, Brianna Zitkovich, Fei Xia, Chelsea Finn, et al. Open-world object manipulation using pre-trained vision-language models. arXiv preprint arXiv:2303.00905, 2023. [75] Danny Driess, Fei Xia, Mehdi S.M. Sajjadi, et al. Palm-e: An embodied multimodal language model. International Conference on Machine Learning (ICML), 2023. 
[76] Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Anima Anandkumar, Yuke Zhu, and Linxi Fan. Vima: General robot manipulation with multimodal prompts. International Conference on Machine Learning (ICML), 2023. 3 [77] OX-Embodiment Collaboration, A Padalkar, A Pooley, A Jain, A Bewley, A Herzog, A Irpan, A Khazatsky,"},{"citing_arxiv_id":"2511.15279","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Look, Zoom, Understand: The Robotic Eyeball for Embodied Perception","primary_cat":"cs.RO","submitted_at":"2025-11-19T09:42:08+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EyeVLA transfers open-world VLM understanding to a PTZ camera control policy via hierarchical action tokens and GRPO reinforcement learning, reaching 96% task completion on 50 real scenes with only 500 training samples.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.03233","ref_index":60,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data","primary_cat":"cs.RO","submitted_at":"2025-05-06T06:59:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GraspVLA shows that pretraining a grasping model on a billion synthetic action frames enables zero-shot open-vocabulary performance and sim-to-real transfer.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.16054","ref_index":74,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"$\\pi_{0.5}$: a Vision-Language-Action Model with Open-World 
Generalization","primary_cat":"cs.LG","submitted_at":"2025-04-22T17:31:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"π_{0.5} is a VLA model that achieves long-horizon dexterous manipulation in entirely new homes through co-training on heterogeneous tasks and multi-source data including web and semantic predictions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.19645","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success","primary_cat":"cs.RO","submitted_at":"2025-02-27T00:30:29+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OpenVLA-OFT fine-tuning boosts LIBERO success rate from 76.5% to 97.1%, speeds action generation 26x, and outperforms baselines on real bimanual dexterous tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[45] Chan Hee Song, Jiaman Wu, Clayton Washington, Brian M. Sadler, Wei-Lun Chao, and Yu Su. Llm-planner: Few-shot grounded planning for embodied agents with large language models, 2023. URL https://arxiv.org/abs/2212.04088. [46] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020. [47] Austin Stone, Ted Xiao, Yao Lu, Keerthana Gopalakrishnan, Kuang-Huei Lee, Quan Vuong, Paul Wohlhart, Sean Kirmani, Brianna Zitkovich, Fei Xia, et al. Open-world object manipulation using pre-trained vision-language models. arXiv preprint arXiv:2303.00905, 2023. [48] Mingxing Tan and Quoc Le. 
Efficientnet: Rethinking model scaling for convolutional neural networks."},{"citing_arxiv_id":"2502.19417","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2025-02-26T18:58:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A hierarchical VLA architecture lets robots follow complex instructions and situated feedback by separating high-level reasoning from low-level control.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.10345","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies","primary_cat":"cs.RO","submitted_at":"2024-12-13T18:40:51+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Visual trace prompting improves spatial-temporal awareness in VLA models, delivering 10% gains on SimplerEnv and 3.5x on real-robot tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2406.09246","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OpenVLA: An Open-Source Vision-Language-Action Model","primary_cat":"cs.RO","submitted_at":"2024-06-13T15:46:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OpenVLA achieves 16.5% higher task success than the 55B RT-2-X model across 29 tasks with 7x fewer parameters while enabling effective fine-tuning and quantization without performance 
loss.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"existing foundation models for vision and language as a core building block for training robotic policies that can generalize to objects, scenes, and tasks beyond their training data. Towards this goal, existing work has explored integrating pretrained language and vision-language models for robotic representation learning [12-14] and as a component in modular systems for task planning and execution [15, 16]. More recently, they have been used for directly learning vision-language-action models [VLAs; 1, 7, 17, 18] for control. VLAs provide a direct instantiation of using pretrained vision-and-language foundation models for robotics, directly fine-tuning visually-conditioned language models (VLMs) such as PaLI [19, 20] to generate robot control actions."},{"citing_arxiv_id":"2405.14093","ref_index":98,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey on Vision-Language-Action Models for Embodied AI","primary_cat":"cs.RO","submitted_at":"2024-05-23T01:43:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"This is the first survey on vision-language-action models, providing a taxonomy across three lines, plus summaries of datasets, simulators, benchmarks, challenges, and future directions in embodied AI.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"VIMA [96] ViT, Mask R-CNN T5 TFM Xattn BC (SE(2)) [SC:VIMA-Data]Sim(Ravens): VIMA-Bench BC-Z [82] ResNet18 (p, s) USE MLP FiLM BC (cont) [SC]Real(EDR): pick-place/wipe/drag, grasp, push RT-1 [97] EfficientNet USE TFM FiLM BC (disc) [SC: Fractal]Real(EDR): pick-place, move, knock MOO [96] OWL-ViT (p), EfficientNet (s) USE TFM FiLM BC (disc) [SC]Real(EDR): pick, move near, knock, place upright, place into Q-Transformer [98] EfficientNet USE TFM FiLM 
TD error Fractal, Auto-collect Sim: pick;Real(EDR): pick, place, open/close drawer, move near (RT-Trajectory) [99] EfficientNet TFM BC (disc) [SC]Real(EDR): pick, place, fold towel, swivel chair, etc. (ACT) [100] ResNet18 CV AE-TFM BC (cont, action chunking) [SC] with ALOHA Sim: transfer cube, bimanual insertion;Real (ViperX, WidowX): slot battery, open cup, etc"},{"citing_arxiv_id":"2310.10639","ref_index":57,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models","primary_cat":"cs.RO","submitted_at":"2023-10-16T17:57:23+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SuSIE uses a finetuned InstructPix2Pix diffusion model to propose subgoal images that guide a low-level goal-conditioned policy, achieving SOTA zero-shot performance on CALVIN and real-world manipulation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2310.08864","ref_index":115,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Open X-Embodiment: Robotic Learning Datasets and RT-X Models","primary_cat":"cs.RO","submitted_at":"2023-10-13T05:20:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A collaborative dataset spanning 22 robots and 527 skills enables RT-X models that transfer capabilities across different robot embodiments.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Fox, \"Cliport: What and where pathways for robotic manipulation,\" in Conference on Robot Learning. PMLR, 2022, pp. 894-906. [114] A. Stone, T. Xiao, Y. Lu, K. Gopalakrishnan, K.-H. Lee, Q. Vuong, P. Wohlhart, B. Zitkovich, F. Xia, C. 
Finn et al., \"Open-world object manipulation using pre-trained vision-language models,\" arXiv preprint arXiv:2303.00905, 2023. [115] Y. Mu, Q. Zhang, M. Hu, W. Wang, M. Ding, J. Jin, B. Wang, J. Dai, Y. Qiao, and P. Luo, \"EmbodiedGPT: Vision-language pre-training via embodied chain of thought,\" arXiv preprint arXiv:2305.15021, 2023. [116] E. Perez, F. Strub, H. de Vries, V. Dumoulin, and A. Courville, \"Film: Visual reasoning with a general conditioning layer,\" 2017. [117] M."},{"citing_arxiv_id":"2307.05973","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models","primary_cat":"cs.RO","submitted_at":"2023-07-12T07:40:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Levine. Lm-nav: Robotic navigation with large pre-trained models of language, vision, and action. arXiv preprint arXiv:2207.04429, 2022. [32] Y. Cui, S. Karamcheti, R. Palleti, N. Shivakumar, P. Liang, and D. Sadigh. \"no, to the right\" - online language corrections for robotic manipulation via shared autonomy. arXiv preprint arXiv:2301.02555, 2023. [33] A. Stone, T. Xiao, Y. Lu, K. Gopalakrishnan, K.-H. Lee, Q. Vuong, P. Wohlhart, B. Zitkovich, F. Xia, C. Finn, et al. Open-world object manipulation using pre-trained vision-language models. arXiv preprint arXiv:2303.00905, 2023. [34] S. Nair, A. Rajeswaran, V. Kumar, C. Finn, and A. Gupta. R3m: A universal visual representation for robot manipulation."}],"limit":50,"offset":0}