{"total":20,"items":[{"citing_arxiv_id":"2606.20092","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies","primary_cat":"cs.CV","submitted_at":"2026-06-18T11:11:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EventVLA introduces foundational visual anchors and a Keyframe Evidence Memory module that predicts future keyframe probabilities from VLA embeddings to improve long-horizon task success by an average of 40% on 17 simulation and 4 real-world tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25044","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"X-DiffVLA: X-Embodied Diffusion Action Heads for Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-24T12:41:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"X-DiffVLA proposes a diffusion VLA model using Embodiment Forcing and Morphological Tree Diffusion to achieve SOTA cross-embodied performance on simulation benchmarks with 15.3% and 12.5% gains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.24890","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"QuoVLA: Quotient Space for Vision-Language-Action Models","primary_cat":"cs.CV","submitted_at":"2026-05-24T06:28:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"QuoVLA introduces a quotient-space framework that compresses VLM latents into action-sufficient representations via quantization and dual-branch design for better VLA generalization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19282","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR","primary_cat":"cs.LG","submitted_at":"2026-05-19T03:00:26+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Pion modifies Muon's Newton-Schulz iterations into a controllable high-pass filter that anchors dominant singular values at 1 while suppressing noisy tails, outperforming Muon and AdamW in VLA and RLVR regimes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13382","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"BlockVLA: Accelerating Autoregressive VLA via Block Diffusion Finetuning","primary_cat":"cs.RO","submitted_at":"2026-05-13T11:37:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"BlockVLA accelerates autoregressive VLA models by 3.3x using block diffusion finetuning, with faster training convergence and better early performance on long-horizon robotic tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12369","ref_index":55,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization","primary_cat":"cs.RO","submitted_at":"2026-05-12T16:38:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GuidedVLA improves VLA generalization by supervising individual attention heads with manually defined auxiliary signals for three task-relevant factors.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"generation via pose-conditioned anchor attention.arXiv preprint arXiv:2512.03724, 2025. [54] Yixing Liang, Anna Xie, Ziyun Feng, Yuke Zhu, Song- Chun Zhu, and Yunzhu Li. Skilldiffuser: Interpretable skill planning for latent diffusion-based manipulation. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 16467-16476, 2024. [55] Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Tian Nian, Liuao Pei, Shunbo Zhou, Xiaokang Yang, Jiangmiao Pang, et al. Discrete diffu- sion vla: Bringing discrete diffusion to action decod- ing in vision-language-action policies.arXiv preprint arXiv:2508.20072, 2025. [56] Fanqi Lin, Haojie Lu, Haojian Fang, and Ping Luo. Manicm: Real-time 3d diffusion policy via consis-"},{"citing_arxiv_id":"2605.11832","ref_index":81,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-12T09:21:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The method uses multi-view diffusion priors and action manifold learning to resolve depth ambiguity and improve action prediction in VLA robotic manipulation models, reporting higher success rates than baselines on LIBERO, RoboTwin, and real-robot tasks.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Diffusion Policy [19] 78.5 87.5 73.5 64.8 76.1 OpenVLA [5] 84.7 88.4 79.2 53.7 76.5 SpatialVLA [78] 88.2 89.9 78.6 55.5 78.1 CoT-VLA [37] 87.5 91.6 87.6 69.0 83.9 π0-Fast [12] 96.4 96.8 88.6 60.2 85.5 GR00T-N1 [20] 94.4 97.6 93.0 90.6 93.9 π0 [6] 98.0 96.8 94.4 88.4 94.4 F1 [79] 98.2 97.8 95.4 91.3 95.7 InternVLA-M1 [80] 98.0 99.0 93.8 92.6 95.9 Dis. Diff. VLA [81] 97.2 98.6 97.4 92.0 96.3 π0.5 [82] 98.8 98.2 98.0 92.4 96.9 GR00T-N1.6 [20] 97.7 98.5 97.5 94.4 97.0 OpenVLA-OFT [11] 97.6 98.4 97.9 94.5 97.1 UniVLA [38] 96.5 96.8 95.6 92.0 95.2 X-VLA [35] 98.2 98.6 97.897.698.1 GeoVLA [64] 98.4 99.0 96.6 96.6 97.7 3D-CAVLA [62] 98.2 99.8 98.2 96.1 98.1 Spatial Forcing [16] 99.4 99.6 98.8 96.0 98.5 Ours98.899.8 99."},{"citing_arxiv_id":"2605.11459","ref_index":50,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models","primary_cat":"cs.RO","submitted_at":"2026-05-12T03:17:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Pace-and-Path Correction decomposes a quadratic cost minimization into orthogonal pace and path channels to correct chunked actions in VLA models, raising success rates by up to 28.8% in dynamic settings.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"or predictive heads, all requiring retraining and architecture-specific integration [16]. The second reduces inference latency while retaining the single-frame paradigm: DynamicVLA [ 3] shrinks the backbone to 0.4B, PD-VLA [26] parallelizes autoregressive decoding, FASTer [27] compresses action tokenization, and others accelerate through token caching [48, 49], discrete diffusion [50], or asynchronous inference [51]. Orthogonal efforts repair chunk boundaries at inference time through temporal ensembling [10], guided rejection sampling [29], asynchronous inpainting [28, 52], learned correction heads [22], native continuation [53], or adaptive chunk sizing [30], smoothing inter-chunk seams without addressing intra-chunk drift. 3 Methodology"},{"citing_arxiv_id":"2605.10925","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PriorVLA: Prior-Preserving Adaptation for Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-11T17:56:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PriorVLA preserves pretrained priors in VLA models through a frozen Prior Expert and trained Adaptation Expert, delivering better robot manipulation performance than full fine-tuning with only 25% of the parameter updates.","context_count":1,"top_context_role":"method","top_context_polarity":"baseline","context_text":"Average success rates (%) across four task suites. Best results inbold. Methods Spatial Object Goal Long Avg. Success Diffusion Policy [26] 78.3 92.5 68.3 50.5 72.4 π0-FAST [41] 96.4 96.8 88.6 60.2 85.5 DreamVLA [42] 97.5 94.0 89.5 89.5 92.6 GR00T-N1 [35] 94.4 97.6 93.0 90.6 93.9 π0 [4] 96.8 98.8 95.8 85.2 94.1 UniVLA [11] 95.4 98.8 93.6 94.0 95.5 F1 [43] 98.2 97.8 95.4 91.3 95.7 DD-VLA [44] 97.2 98.6 97.4 92.0 96.3 GE-Act [45] 98.2 97.6 95.8 94.4 96.5 MemoryVLA [46] 98.4 98.4 96.4 93.4 96.7 π0.5 [5] 98.8 98.2 98.0 92.4 96.9 OpenVLA-OFT [12] 97.6 98.4 97.9 94.5 97.1 PriorVLA (Ours) 99.4 99.8 99.4 97.6 99.1 Table 4:Real-robot standard-data results.Success rates (%) on eight real-world tasks under ID and OOD evaluation. Average gains overπ 0."},{"citing_arxiv_id":"2605.09302","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Discrete Langevin-Inspired Posterior Sampling","primary_cat":"cs.LG","submitted_at":"2026-05-10T03:59:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ΔLPS is a gradient-guided discrete posterior sampler for inverse problems that works with masked or uniform discrete diffusion priors and outperforms prior discrete methods on image restoration tasks.","context_count":1,"top_context_role":"other","top_context_polarity":"unclear","context_text":"Gradient-based learning applied to document recognition.Proceedings of the IEEE, 86(11):2278-2324, 2002. [19] Shufan Li, Konstantinos Kallidromitis, Hritik Bansal, Akash Gokul, Yusuke Kato, Kazuki Kozuka, Jason Kuen, Zhe Lin, Kai-Wei Chang, and Aditya Grover. Lavida: A large diffusion language model for multimodal understanding.arXiv preprint arXiv:2505.16839, 2025. [20] Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Liuao Pei, Xiaokang Yang, Jiangmiao Pang, Yao Mu, and Ping Luo. Discrete diffusion vla: Bringing discrete diffusion to action decoding in vision-language-action policies.arXiv preprint arXiv:2508.20072, 2025. [21] Naoki Murata, Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Bac Nguyen, Stefano Ermon,"},{"citing_arxiv_id":"2605.00078","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Being-H0.7: A Latent World-Action Model from Egocentric Videos","primary_cat":"cs.RO","submitted_at":"2026-04-30T14:16:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation.arXiv preprint arXiv:2411.19650, 2024. [38] Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation.arXiv preprint arXiv:2410.07864, 2024. [39] Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Tian Nian, Liuao Pei, Shunbo Zhou, Xiaokang Yang, Jiangmiao Pang, et al. Discrete diffusion vla: Bringing discrete diffusion to action decoding in vision-language-action policies.arXiv preprint arXiv:2508.20072, 2025. [40] Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan"},{"citing_arxiv_id":"2604.25050","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"DiscreteRTC: Discrete Diffusion Policies are Natural Asynchronous Executors","primary_cat":"cs.RO","submitted_at":"2026-04-27T23:04:03+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"Finally, the correction guidance term at every denoising step roughly doubles inference cost, ironically increasing the very latency RTC aims to hide. Many ongoing efforts [23, 24, 21, 25] seek to resolve these aforementioned issues within the flow- matching paradigm. Instead, our key insight and observation are that by replacing the action head with a discrete diffusion policy [26], all the aforementioned limitations can be resolved at once. Or, to put it simply:Discrete Diffusion Policies are Natural Asynchronous Executors: (a)Inpainting as Pre-training.Discrete diffusion policies are pre-trained to inpaint upon ran- domly masked sequences. Therefore, scaling pre-training directly improves asynchronous performance, and the native forward pass suits inference-time inpainting;"},{"citing_arxiv_id":"2604.22152","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model","primary_cat":"cs.RO","submitted_at":"2026-04-24T01:50:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A discrete diffusion model tokenizes multimodal robotic data and uses a progress token to predict future states and task completion for scalable policy evaluation.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"In summary, we make three contributions: •We propose dWorldEval, a discrete-diffusion world model that significantly enhances action controlla- bility, utilizing sparse keyframe memory to ensure spatiotemporal consistency. •We jointly predict visual outcomes and a discrete progress token to enable automatic success detection. •We conduct a systematic evaluation on LIBERO [22], RoboTwin [30], and real-world tasks. Exten- sive experiments confirm that dWorldEval achieves substantially better action controllability measured by our proposed action-sensitive∆-LPIPS metric. Furthermore, its estimated success rates correlate strongly with actual execution performance (Pearsonr≈0.9), enabling accurate ranking of policies across capabilities."},{"citing_arxiv_id":"2604.20472","ref_index":42,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Temporal Difference Calibration in Sequential Tasks: Application to Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-04-22T11:58:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Temporal difference calibration aligns uncertainty estimates in vision-language-action models with their value functions for better sequential performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14732","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems","primary_cat":"cs.RO","submitted_at":"2026-04-16T07:46:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The World-Value-Action model enables implicit planning for VLA systems by performing inference over a learned latent representation of high-value future trajectories instead of direct action prediction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.19710","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Universal Pose Pretraining for Generalizable Vision-Language-Action Policies","primary_cat":"cs.CV","submitted_at":"2026-02-23T11:00:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Pose-VLA uses a decoupled two-stage pre-training with discrete pose tokens to extract universal 3D spatial priors from 3D datasets and robotic trajectories, achieving 79.5% success on RoboTwin 2.0 and 96.0% on LIBERO.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.12978","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Learning Native Continuation for Action Chunking Flow Policies","primary_cat":"cs.RO","submitted_at":"2026-02-13T14:56:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Legato trains flow-based VLA policies with schedule-shaped action-noise mixtures and randomized conditions to achieve smoother trajectories and ~10% faster task completion than real-time chunking across five real-world manipulation tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.11236","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ABot-M0: VLA Foundation Model for Robotic Manipulation with Action Manifold Learning","primary_cat":"cs.CV","submitted_at":"2026-02-11T16:47:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ABot-M0 unifies heterogeneous robot data into a 6-million-trajectory dataset and introduces Action Manifold Learning to predict stable actions on a low-dimensional manifold using a DiT backbone.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.21998","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Causal World Modeling for Robot Control","primary_cat":"cs.CV","submitted_at":"2026-01-29T17:07:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LingBot-VA combines video world modeling with policy learning via Mixture-of-Transformers, closed-loop rollouts, and asynchronous inference to improve robot manipulation in simulation and real settings.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"0 81.1 ThinkAct [28] 88.3 91.4 87.1 70.9 84.4 SmolVLA [67] 93.0 94.0 91.0 77.0 88.8 CronusVLA [37] 97.3 99.6 96.9 94.0 97.0 FLOWER [62] 97.1 96.7 95.6 93.5 95.7 GR00T-N1 [6] 94.4 97.6 93.0 90.6 93.9 π0 [7] 96.8 98.8 95.8 85.2 94.1 π0+FAST [57] 96.4 96.8 88.6 60.2 85.5 OpenVLA [34] 84.7 88.4 79.2 53.7 76.5 OpenVLA-OFT [32] 97.6 98.497.994.5 97.1 DD-VLA [44] 97.2 98.6 97.4 92.0 96.3 UniVLA [78] 95.4 98.8 93.6 94.0 95.4 X-VLA [93] 98.2 98.6 97.8 97.6 98.1 LingBot-V A(Ours) 98.5±0.3 99.6±0.397.2±0.298.5±0.5 98.5 two manipulators, making it significantly more difficult for policy learning. We evaluate under bothEasy(fixed initial configurations) andHard(varied object poses and scene layouts) settings. As shown in Tab."},{"citing_arxiv_id":"2511.14148","ref_index":42,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2025-11-18T05:21:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AsyncVLA adds asynchronous flow matching and a confidence rater to VLA models so they can generate actions on flexible schedules and selectively refine low-confidence tokens before execution.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}