{"total":43,"items":[{"citing_arxiv_id":"2605.21414","ref_index":73,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PointACT: Vision-Language-Action Models with Multi-Scale Point-Action Interaction","primary_cat":"cs.RO","submitted_at":"2026-05-20T17:10:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PointACT proposes a 3D-aware dual-system VLA policy using multi-scale point-action interaction with bottleneck window self-attention, achieving 10% higher success rates on RLBench-10Tasks over prior pretrained VLAs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20774","ref_index":34,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VLA-REPLICA: A Low-Cost, Reproducible Benchmark for Real-World Evaluation of Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-20T06:15:30+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"VLA-REPLICA is a low-cost and reproducible real-world benchmark for evaluating VLA models in robotic manipulation tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19294","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DEFLECT: Delay-Robust Execution via Flow-matching Likelihood-Estimated Counterfactual Tuning for VLA Policies","primary_cat":"cs.RO","submitted_at":"2026-05-19T03:14:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DEFLECT is an offline post-training method that improves async VLA policy success rates under high inference delays by using flow-matching likelihood ratios on counterfactual fresh/stale action pairs from a frozen reference 
policy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15298","ref_index":43,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PhysBrain 1.0 Technical Report","primary_cat":"cs.RO","submitted_at":"2026-05-14T18:11:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PhysBrain 1.0 extracts scene elements, spatial dynamics, actions and depth relations from human egocentric video to create QA supervision for VLMs, then transfers the resulting physical priors to VLA policies via capability-preserving adaptation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13632","ref_index":38,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-13T14:58:29+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GTA-VLA conditions VLA models on user spatial priors to produce a unified spatial-visual chain-of-thought, reaching 81.2% success on SimplerEnv WidowX and improving performance under out-of-distribution shifts.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ing methods, highlighting the value of interactive reasoning for failure recovery in embodied control. Details of the project can be found here:https://signalispupupu.github.io/GTA-VLA_ProjPage/ 1 Introduction The pursuit of robust generalist robotic agents for open-world environments is a central goal of em- bodied AI. 
A major step toward this vision is the emergence of Vision-Language-Action (VLA) mod- els [38,18,21,29,9,3,4,37,10,6,24,8], which leverage large pre-trained vision language models to scale robot learning across diverse tasks and embodiments. Despite this progress, most existing VLAs still operate through an implicit direct \"Sense-to-Act\" mapping from multimodal observations to robot actions. While effective within the training distribution, such tightly coupled policies often become brittle under visual and"},{"citing_arxiv_id":"2605.13548","ref_index":40,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AttenA+: Rectifying Action Inequality in Robotic Foundation Models","primary_cat":"cs.RO","submitted_at":"2026-05-13T13:55:37+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"3 -62.2 OpenVLA-OFT [17] 97.6 98.4 97.9 94.5 97.10 2.90 +1.5 -51.7 AttenA+OFT (ours)99.0±0.16 100±0.00 98.8±0.28 96.6±0.30 98.60 1.40- - Table 2: Performance on RoboTwin 2.0 Compared with SOTA Methods. Method Embodied PT.Clean Rand. 
SR↑ ER↓ SR-I RER-R π0 [2] ✓ 65.92 58.40 62.20 37.80 +30.3 -80.1 π0.5[10] ✓ 82.74 76.76 79.75 20.25 +12.7 -62.8 X-VLA [40] ✓ 72.90 72.80 72.85 27.15 +19.6 -72.2 Motus [13] ✓ 88.66 87.02 87.80 12.20 +4.6 -38.0 LingBot-V A [14] ✓ 92.90 91.50 92.2 7.80 +0.3 -3.3 Fast-W AM [29] ✗ 91.88 91.78 91.80 8.20 +0.6 -7.7 AttenA+W AM (ours) ✗ 93.06 91.86 92.46 7.54- - steps (clip, align, release), where precision is essential but receives equal loss weight to fast transi- tional motions."},{"citing_arxiv_id":"2605.13403","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RotVLA: Rotational Latent Action for Vision-Language-Action Model","primary_cat":"cs.RO","submitted_at":"2026-05-13T11:58:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"conditional generative process, producing smooth and coherent action sequences. 2.2 Latent Action Model Most VLA methods rely on large-scale, robot-specific annotated datasets for pretraining. However, learning from cross-embodiment datasets remains challenging due to substantial variations in visual observations, action spaces, and embodiment configurations [26]. Furthermore, these approaches rely on ground-truth action annotations, which limits their ability to leverage large-scale Internet video data. To address these issues, a line of work focuses on Latent Action Models, which aim to learn a unified action space across heterogeneous datasets by encoding the transition between sequential observations. 
Mainstream approaches [6, 27-31] typically adopt an Inverse Dynamics Model (IDM)"},{"citing_arxiv_id":"2605.11817","ref_index":5,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"See What Matters: Differentiable Grid Sample Pruning for Generalizable Vision-Language-Action Model","primary_cat":"cs.RO","submitted_at":"2026-05-12T09:08:42+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GridS is a plug-and-play differentiable module for geometry-aware visual token resampling in VLA models that achieves under 10% token retention and 76% FLOPs reduction with no success-rate loss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11564","ref_index":53,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RIO: Flexible Real-Time Robot I/O for Cross-Embodiment Robot Learning","primary_cat":"cs.RO","submitted_at":"2026-05-12T05:49:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RIO introduces a lightweight open-source framework that abstracts real-time robot I/O to support easy switching between embodiments and platforms for collecting data and deploying VLAs.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"support, middleware, data formats, and policy architecture support. For example, some stacks combine robot arm and robot gripper drivers, making it difficult to use other end effectors on arms. 
Framework Humanoids Bimanual Single arm Robot grippers Teleop Cameras Middleware(s) Data format(s) Policies Ark [16] : LCM : Pickle LeRobot [9] : Threads/gRPC 1 : LeRobotDataset ManiUniCon [53] : Shm : Zarr PAPRLE [30] : ROS : Pickle n/a PyRobot [35] : ROS : Pickle n/a RCS [25] : RPC : Parquet RoBits [20] : ZMQ : NPZ/JSON n/a UMI, DP [12, 13] : Shm : Zarr : DP RIO (ours) :any :any 1LeRobot uses Threads for hardware drivers and gRPC for asynchronous policy inference. Despite this proliferation of frameworks, robot code remains"},{"citing_arxiv_id":"2605.10921","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"RoboMemArena: A Comprehensive and Challenging Robotic Memory Benchmark","primary_cat":"cs.RO","submitted_at":"2026-05-11T17:54:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RoboMemArena is a new large-scale robotic memory benchmark with real-world tasks, and PrediMem is a dual VLA system that outperforms baselines by managing memory buffers with predictive coding.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"forcing: Implicit spatial representation alignment for vision-language-action model.arXiv preprint arXiv:2510.12276, 2025a. Hao Li, Shuai Yang, Yilun Chen, Xinyi Chen, Xiaoda Yang, Yang Tian, Hanqing Wang, Tai Wang, Dahua Lin, Feng Zhao, et al. Cronusvla: Towards efficient and robust manipulation via multi-frame vision-language-action modeling. Proceedings of the AAAI Conference on Artificial Intelligence, 40(22), 2026a. Ji Li, Bo Wang, Jing Xia, Mingyi Li, and Shiyan Hu. Himm: Human-inspired long-term memory modeling for embodied exploration and question answering.arXiv preprint arXiv:2602.15513, 2026b. Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. 
Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic"},{"citing_arxiv_id":"2605.10821","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Unified Noise Steering for Efficient Human-Guided VLA Adaptation","primary_cat":"cs.RO","submitted_at":"2026-05-11T16:37:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UniSteer unifies human corrective actions and noise-space RL for VLA adaptation by inverting actions to noise targets, raising success rates from 20% to 90% in 66 minutes across four real-world manipulation tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Fine-tuning vision-language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645, 2025. [19] Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model.arXiv preprint arXiv:2510.10274, 2025. [20] Yifan Zhong, Xuchuan Huang, Ruochong Li, Ceyao Zhang, Zhang Chen, Tianrui Guan, Fanlian Zeng, Ka Num Lui, Yuyao Ye, Yitao Liang, et al. Dexgraspvla: A vision-language-action framework towards general dexterous grasping.arXiv preprint arXiv:2502.20900, 2025. [21] Junjie Wen, Yichen Zhu, Jinming Li, Zhibin Tang, Chaomin Shen, and Feifei Feng. 
Dexvla:"},{"citing_arxiv_id":"2605.10819","ref_index":62,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-11T16:37:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07931","ref_index":51,"ref_count":3,"confidence":0.98,"is_internal_anchor":true,"paper_title":"One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy","primary_cat":"cs.CV","submitted_at":"2026-05-08T16:04:43+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[49] Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware.arXiv preprint arXiv:2304.13705, 2023. [50] Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, and Chuang Gan. 3d-vla: A 3d vision-language-action generative world model.arXiv preprint arXiv:2403.09631, 2024. [51] Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model.arXiv preprint arXiv:2510.10274, 2025. 
[52] Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul"},{"citing_arxiv_id":"2605.07514","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Is the Future Compatible? Diagnosing Dynamic Consistency in World Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-08T09:44:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Action-state consistency in World Action Models distinguishes successful from failed imagined futures and supports value-free selection of better rollouts via consensus among predictions.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"suggests that consistency-guided selection can identify branches with more reliable action-conditioned consequences. More examples are provided in Appendix J. 5.2 Results on RoboTwin 2.0 Table 2:Results on RoboTwin 2.0.\" ∗\" denotes our reimplementation; all other results are taken from [25]. TTSindicates whether test-time scaling is applied. Method TTS Average SR (%) X-VLA [43]✗72.9 π0 [6]✗65.9 π0.5 [19]✗82.7 Motus [4]✗88.7 ∗LingBot-V A [25]✗90.2 + Consistency-Consensus (ours)✓93.0 We further evaluate consistency-guided se- lection on RoboTwin 2.0 [9], following the setup of LingBot-V A [25]. 
RoboTwin 2.0 contains over 50 bimanual manipulation tasks requiring coordinated dual-arm con- trol, providing a complementary testbed"},{"citing_arxiv_id":"2605.07306","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"BioProVLA-Agent: An Affordable, Protocol-Driven, Vision-Enhanced VLA-Enabled Embodied Multi-Agent System with Closed-Loop-Capable Reasoning for Biological Laboratory Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-08T06:15:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"BioProVLA-Agent integrates protocol parsing, visual state verification, and VLA-based execution in a closed-loop multi-agent framework with AugSmolVLA augmentation to improve robustness for biological lab tasks like tube handling and liquid pouring.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"relatively reliable depth perception, object segmentation, or three-dimensional reconstruction. In wet-lab scenarios involv- ing transparent labware, reﬂective surfaces, and liquid contain- ers, such perception outputs can become unstable. Meanwhile, Vision-Language-Action (VLA) models and imitation learning methods, such as X-VLA [ 7] and SmolVLA [ 8], have shown promising performance in language-conditioned robotic con- trol, dual-arm manipulation, and cross-embodiment generaliza- tion. Nevertheless, most VLA systems still emphasize direct observation-to-action mapping and lack explicit semantic ver- iﬁcation before and after execution. 
As a result, VLA execu- tion is commonly treated as an instruction-following process,"},{"citing_arxiv_id":"2605.06481","ref_index":97,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-07T16:06:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Bestresults are in bold, and second-best are underlined. Method Cam Robot LayoutGeo AvgLight BG Lang Noise Avg Vision-Language-Action Models OpenVLA-OFT [33] 56.4 31.9 74.2 54.2 88.7 93.3 79.5 75.8 69.6 π0 [3] 13.8 6.0 68.9 29.6 85.0 81.4 58.8 79.0 53.6 π0.5 [60] 75.4 77.585.779.5 96.994.6 85.6 89.7 85.7 ABot-M0 [87] 60.4 67.9 82.6 70.3 96.2 91.6 86.4 86.4 80.5 X-VLA [97] 23.489.771.8 61.6 88.296.075.7 62.7 71.4 A V A-VLA [83] 55.5 25.9 74.1 51.8 95.5 88.9 85.6 78.0 70.1 World-Action Models WorldVLA [10] 0.1 27.9 38.0 22.0 43.7 17.1 41.6 10.9 25.0 VLA-JEPA [73] 64.2 67.7 83.9 71.9 91.8 93.488.165.8 79.5 GE-Act [42] 60.7 77.0 80.2 72.6 95.8 86.0 77.490.980.3 HoloBrain-0 [43] 65.5 58.2 79.5 67.7 88.1 90.3 78.7 66.9 74."},{"citing_arxiv_id":"2605.06311","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Toward Visually Realistic Simulation: A Benchmark for Evaluating Robot Manipulation in Simulation","primary_cat":"cs.RO","submitted_at":"2026-05-07T14:13:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VISER is a new visually realistic simulation benchmark for robot manipulation tasks that uses PBR 
materials and MLLM-assisted asset generation, achieving 0.92 Pearson correlation with real-world policy performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06175","ref_index":34,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts","primary_cat":"cs.RO","submitted_at":"2026-05-07T12:56:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"VLA-GSE uses spectral decomposition of the VLA backbone to create generalized and specialized experts, enabling effective robot task adaptation while updating only 2.51% of parameters and achieving 81.2% zero-shot success on LIBERO-Plus.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Hence, the expected squared Frobenius norms of the localized gradients are equalized across all specialized experts. This proves the theorem. Proposition 2(Isotropic global gradient variance implies projected uniformity).A sufficient condition for the projected-uniformity condition in Eq. (28) is that the global gradient second moments are isotropic, namely GL =E[gg ⊤] =c LI, G R =E[g ⊤g] =c RI,(34) for some constantsc L, cR >0. Under this assumption, for every specialized experti, αL i =c L Tr(Σi), α R i =c R Tr(Σi),(35) where αL i = Tr ΣiU ⊤ i GLUi \u0001 , α R i = Tr ΣiV ⊤ i GRVi \u0001 .(36) Therefore, Eq.(28)holds withκ L =c L andκ R =c R. Proof. 
Since Ui and Vi are composed of orthogonal singular vectors, they satisfy U ⊤ i Ui =I and V ⊤ i Vi =I."},{"citing_arxiv_id":"2605.04504","ref_index":59,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SpecPL: Disentangling Spectral Granularity for Prompt Learning","primary_cat":"cs.CV","submitted_at":"2026-05-06T05:13:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SpecPL introduces spectral decomposition via frozen VAE and counterfactual high-frequency permutation to bridge modality asymmetry in VLM prompt learning, reaching 81.51% harmonic-mean accuracy on 11 benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03269","ref_index":123,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"RLDX-1 Technical Report","primary_cat":"cs.RO","submitted_at":"2026-05-05T01:40:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"RLDX-1 outperforms frontier VLAs such as π0.5 and GR00T N1.6 on dexterous manipulation benchmarks, reaching 86.8% success on ALLEX humanoid tasks versus around 40% for the baselines.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"robotics: a survey. InIEEE symposium series on computational intelligence (SSCI), 2020. [122] Zhenyu Zhao, Hongyi Jing, Xiawei Liu, Jiageng Mao, Abha Jha, Hanwen Yang, Rong Xue, Sergey Zakharor, Vitor Guizilini, and Yue Wang. Humanoid everyday: A comprehensive robotic dataset for open-world humanoid manipulation. arXiv preprint arXiv:2510.08807, 2025. [123] Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model. 
arXiv preprint arXiv:2510.10274, 2025a. [124] Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, and"},{"citing_arxiv_id":"2605.02881","ref_index":45,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MolmoAct2: Action Reasoning Models for Real-world Deployment","primary_cat":"cs.RO","submitted_at":"2026-05-04T17:51:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MolmoAct2 is an open VLA model that outperforms baselines like Pi-05 on 7 benchmarks and whose backbone surpasses GPT-5 on 13 embodied-reasoning tasks through new datasets, specialized training, and architecture changes for lower latency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00078","ref_index":112,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Being-H0.7: A Latent World-Action Model from Egocentric Videos","primary_cat":"cs.RO","submitted_at":"2026-04-30T14:16:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Table 1: Benchmark comparison on multiple embodied manipulation tasks. 
CALVIN denotes \"ABCD→D\" and CALVIN∗denotes \"ABC→D\", LIBERO-plus∗denotes finetuning with LIBERO-plus dataset Model Size LIBERO LIBERO-plus LIBERO-plus ∗RoboCasa-50 GR1 CALVIN CALVIN∗Robotwin2 # VLA π0 [4] 3B 94.4 53.6 - 42.4 - - 3.92 65.9/58.4 π0-FAST[111] 3B 85.5 61.6 - - - - - - X-VLA [112] 0.9B - - - - - 4.43 - 72.9/72.8 UniVLA [87] 8B 95.5 - - - - 4.63 4.41 - gr00t-N1.6 [5] 3B 93.9 - - 36.0 47.6 4.60 4.24 - π0.5 [40] 3B 96.9 77.4 - 41.4 - 4.06 4.13 82.7/76.8 starVLA [113] 4B 96.5 77.0 - - 48.8 - - 88.2/88.3 MINT-4B [114] 4B 98.7 80.1 84.1 - - 4.57 - - ABot-M0 [115] 4B 98.6 80.5 - - 58.3 - - 86.1/85.1 LingBot-VLA [116]4B - - - - - - - 86."},{"citing_arxiv_id":"2604.27472","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations","primary_cat":"cs.AI","submitted_at":"2026-04-30T06:14:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26848","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"STARRY: Spatial-Temporal Action-Centric World Modeling for Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-04-29T16:13:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"STARRY uses unified diffusion to align spatial-temporal world predictions with action generation plus GASAM for geometry-aware attention, reaching 93.82%/93.30% success on 50 bimanual tasks in simulation and raising real-world success from 42.5% to 
70.8%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24622","ref_index":58,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies","primary_cat":"cs.CV","submitted_at":"2026-04-27T15:51:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CF-VLA uses a coarse initialization over endpoint velocity followed by single-step refinement to achieve strong performance with low inference steps on CALVIN, LIBERO, and real-robot tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23272","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Modular Sensory Stream for Integrating Physical Feedback in Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-04-25T12:28:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MoSS augments VLAs with decoupled modality streams for multiple physical signals, achieving synergistic gains in real-world robot tasks via joint attention and auxiliary future-signal prediction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18000","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-04-20T09:25:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"State-of-the-art vision-language-action models catastrophically fail dynamic embodied reasoning due to lexical-kinematic shortcuts, behavioral inertia, and 
semantic feature collapse caused by architectural bottlenecks, as shown by the new BeTTER benchmark with real-world validation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[22] Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165-2183. PMLR, 2023. [23] Luciano Floridi and Massimo Chiriatti. Gpt-3: Its nature, scope, limits, and consequences.Minds and machines, 30(4):681-694, 2020. [24] Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model.arXiv preprint arXiv:2510.10274, 2025. [25] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative"},{"citing_arxiv_id":"2604.15483","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"${\\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities","primary_cat":"cs.LG","submitted_at":"2026-04-16T19:18:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"robotics: Bringing ai into the physical world.arXiv preprint arXiv:2503.20020, 2025. 
3 [14] Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y Galliker, et al.π 0.5: a vision-language-action model with open- world generalization. In9th Annual Conference on Robot Learning, 2025. 3, 4, 10 [15] Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Ji- ayin Zou, Yilun Chen, Jia Zeng, et al. X- vla: Soft-prompted transformer as scalable cross- embodiment vision-language-action model.arXiv preprint arXiv:2510.10274, 2025. [16] Tao Jiang, Tianyuan Yuan, Yicheng Liu, Chenhao Lu, Jianning Cui, Xiao Liu, Shuiqi Cheng, Jiyang Gao,"},{"citing_arxiv_id":"2604.13942","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Goal2Skill: Long-Horizon Manipulation with Adaptive Planning and Reflection","primary_cat":"cs.RO","submitted_at":"2026-04-15T14:53:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A dual VLM-VLA framework for long-horizon robot manipulation achieves 32.4% success on RMBench tasks versus 9.8% for the strongest baseline via structured memory and closed-loop adaptive replanning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11751","ref_index":67,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Grounded World Model for Semantically Generalizable Planning","primary_cat":"cs.RO","submitted_at":"2026-04-13T17:25:41+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 
22%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11135","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AIM: Intent-Aware Unified world action Modeling with Spatial Value Maps","primary_cat":"cs.RO","submitted_at":"2026-04-13T07:48:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AIM predicts aligned spatial value maps inside a shared video-generation transformer to produce reliable robot actions, reaching 94% success on RoboTwin 2.0 with larger gains on long-horizon and contact-rich tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09330","ref_index":86,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis","primary_cat":"cs.RO","submitted_at":"2026-04-10T13:59:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Wenkang Qin, Guan Huang, and Xingang Wang. Re- condreamer++: Harmonizing generative and reconstructive models for driving scene representation.arXiv preprint arXiv:2503.18438, 2025. 2 [85] Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. InProceedings of Robotics: Science and Systems (RSS), 2023. 2 [86] Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, et al. 
X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model. arXiv preprint arXiv:2510.10274, 2025. 2 13"},{"citing_arxiv_id":"2604.05672","ref_index":60,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model","primary_cat":"cs.RO","submitted_at":"2026-04-07T10:18:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A1 is a transparent VLA framework achieving state-of-the-art robot manipulation success with up to 72% lower latency via adaptive layer truncation and inter-layer flow matching.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11142-11152, 2025. [59] Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, Ankur Handa, Ming-Yu Liu, Donglai Xiang, Gordon Wetzstein, and Tsung-Yi Lin. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models, 2025. https://arxiv.org/abs/2503.22020. [60] Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, Ya-Qin Zhang, Jiangmiao Pang, Jingjing Liu, Tai Wang, and Xianyuan Zhan. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model, 2025. https://arxiv.org/abs/2510.10274. 
[61] Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang,"},{"citing_arxiv_id":"2604.04161","ref_index":48,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Adaptive Action Chunking at Inference-time for Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-04-05T16:03:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Adaptive Action Chunking uses action entropy to dynamically adjust chunk sizes in VLA models, improving performance on simulated and real robotic manipulation tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[46] Tony Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. Robotics: Science and Systems XIX, 2023. 1, 2 [47] Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, and Chuang Gan. 3D-VLA: A 3D vision-language-action generative world model. In International Conference on Machine Learning, 2024. 1 [48] Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, et al. X-VLA: Soft-prompted Transformer as scalable cross-embodiment vision-language-action model. arXiv preprint arXiv:2510.10274, 2025. 
2 [49] Yifan Zhong, Xuchuan Huang, Ruochong Li, Ceyao Zhang, Zhang Chen, Tianrui Guan, Fanlian Zeng, Ka Nam"},{"citing_arxiv_id":"2603.24935","ref_index":39,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SABER: A Stealthy Agentic Black-Box Attack Framework for Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-03-26T01:56:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SABER uses a trained ReAct agent to produce bounded adversarial edits to robot instructions, cutting task success by 20.6% and increasing execution length and violations on the LIBERO benchmark across six VLA models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.13966","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"vla-eval: A Unified Evaluation Harness for Vision-Language-Action Models","primary_cat":"cs.AI","submitted_at":"2026-03-14T14:38:53+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"vla-eval decouples VLA model inference from benchmark execution via WebSocket and Docker, supporting 14 benchmarks with up to 47x speedup and reproducing published scores across six codebases.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.20309","ref_index":52,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models","primary_cat":"cs.LG","submitted_at":"2026-02-23T19:55:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"QuantVLA is the first post-training quantization framework for VLA models that quantizes the diffusion transformer action head and 
reports higher task success rates than full-precision baselines with roughly 70% memory savings on the quantized components.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.10503","ref_index":79,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Towards Long-Lived Robots: Continual Learning VLA Models via Reinforcement Fine-Tuning","primary_cat":"cs.RO","submitted_at":"2026-02-11T04:05:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LifeLong-RFT applies chunking-level on-policy reinforcement learning with Quantized Action Consistency Reward, Continuous Trajectory Alignment Reward, and Format Compliance Reward to fine-tune VLA models, achieving a 22% average success rate gain over supervised fine-tuning on the LIBERO benchmark's","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.21998","ref_index":93,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Causal World Modeling for Robot Control","primary_cat":"cs.CV","submitted_at":"2026-01-29T17:07:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LingBot-VA combines video world modeling with policy learning via Mixture-of-Transformers, closed-loop rollouts, and asynchronous inference to improve robot manipulation in simulation and real settings.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"0 Simulation (Easy vs Hard, 50 tasks). RoboTwin 2.0 is a challenging bimanual manipulation benchmark requiring coordinated dual-arm control. Easy uses fixed initial configurations while Hard involves randomized object poses and scene layouts. ∗ Results for X-VLA are adopted from Motus [5]. 
Improvements in parentheses indicate gains over the second-best method (underlined). X-VLA∗ [93] π0 [7] π0.5 [29] Motus [5] LingBot-VA (Ours) Metric Easy Hard Easy Hard Easy Hard Easy Hard Easy Hard Average Horizon = 1 81.6 82.5 66.5 61.6 85.1 80.2 91.0 90.6 94.18 (+3.2) 93.56 (+3.0) Average Horizon = 2 59.3 55.9 66.1 54.7 79.3 73.0 85.2 80.9 90.35 (+5.2) 86.95 (+6.1) Average Horizon = 3 61.2 66.0 61.6 50.2 78.6 67.4 85.0 84.2 93.22 (+8.2) 93.28 (+9.1) Average 50 Tasks 72."},{"citing_arxiv_id":"2601.02078","ref_index":51,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Genie Sim 3.0 : A High-Fidelity Comprehensive Simulation Platform for Humanoid Robot","primary_cat":"cs.RO","submitted_at":"2026-01-05T12:59:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Genie Sim 3.0 introduces an LLM-powered scene generator, the first LLM-based automated evaluation benchmark, and a large open synthetic dataset that demonstrates zero-shot sim-to-real transfer for robotic manipulation policies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.13030","ref_index":60,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Motus: A Unified Latent Action World Model","primary_cat":"cs.CV","submitted_at":"2025-12-15T06:58:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Motus unifies understanding, video generation, and action in one latent world model via MoT experts and optical-flow latent actions, reporting gains over prior methods in simulation and real robots.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"promising general multimodal priors. 
• A scalable robotic recipe with a three-phase training pipeline and six-layer data pyramid that leverages optical flow-based latent action to learn cross-embodiment 2 transferable motion knowledge. • Extensive experiments show that Motus significantly outperforms state-of-the-art approaches in both simulation (a +15% improvement over X-VLA [60] and a +45% improvement over π0.5 [8]) and real-world scenarios (improved by +11~48%), demonstrating that large-scale general and domain-specific priors can be effectively fused to enhance the generalization of policy learning. 2. Related Works 2.1. Unified Multimodal Models Unified multimodal models jointly model various modalities and tasks within a single generative framework [29, 40,"},{"citing_arxiv_id":"2511.14148","ref_index":80,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2025-11-18T05:21:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AsyncVLA adds asynchronous flow matching and a confidence rater to VLA models so they can generate actions on flexible schedules and selectively refine low-confidence tokens before execution.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.02776","ref_index":108,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"XR-1: Towards Versatile Vision-Language-Action Models via Learning Unified Vision-Motion Representations","primary_cat":"cs.RO","submitted_at":"2025-11-04T17:59:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"XR-1 introduces Unified Vision-Motion Codes learned by dual-branch VQ-VAE and applies them in a three-stage training pipeline to outperform prior VLA models on 120+ real-world 
manipulation tasks across six robot embodiments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}