{"total":55,"items":[{"citing_arxiv_id":"2606.27663","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Direct Action-Head Injection of A Grounded 3D Point Unlocks Spatial and Task Generalization","primary_cat":"cs.RO","submitted_at":"2026-06-26T02:44:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Direct 3D point grounding injected into the action head via a two-layer MLP and adaptive layer norm boosts VLA success rates by 32-46 points on spatial and task perturbations in LIBERO-PRO.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.23686","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LIBERO-Safety: A Comprehensive Benchmark for Physical and Semantic Safety in Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-06-22T17:59:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LIBERO-Safety supplies a scalable benchmark, data-generation pipeline, and 19,664-demonstration dataset that exposes a generalization-safety tension in current VLA models where diverse training improves collision avoidance but task success stays limited by trajectory quality and semantic understandi","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30484","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ELAN4D: Embodiment-Centric 4D Supervision for Vision-Language-Action Models via Plug-and-Play Adaptation","primary_cat":"cs.RO","submitted_at":"2026-05-28T19:03:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ELAN4D introduces plug-and-play 4D keypoint track supervision from forward 
kinematics to enhance VLA policy generalization in robotic manipulation tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29605","ref_index":33,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VLAConf: Calibrated Task-Success Confidence for Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-28T08:42:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VLAConf is a one-class discriminative method that estimates step-wise task-success confidence for VLA models via anomaly scoring on frozen representations plus step-conditioned modeling, shown to be more efficient than ensemble or probability baselines on LIBERO and real robots.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29416","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"3DVLA: Enhancing Vision-Language-Action Models via 3D Spatial and Instance Understanding","primary_cat":"cs.RO","submitted_at":"2026-05-28T06:07:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"3DVLA is a plug-and-play framework that enhances pretrained VLAs with pervasive 3D feature encoding using multi-view consistency and Spatially-Conditioned Geometry Aggregation, an instance estimation module, and a masked self-supervised 3D branch, yielding gains on LIBERO-Plus and RoboTwin 2.0.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25802","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Rethinking VLM Representation for VLA 
Initialization","primary_cat":"cs.CV","submitted_at":"2026-05-25T12:51:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Experiments indicate original VLM representations are crucial for VLA performance, LoRA outperforms full finetuning, and staged robot-data pretraining yields the strongest initialization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22183","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Action with Visual Primitives","primary_cat":"cs.RO","submitted_at":"2026-05-21T08:52:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AVP architecture has VLM emit visual-primitive tokens to condition flow-matching action expert, yielding 27.61% higher success rate than pi_0.5 on real-robot pick-and-place tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21414","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PointACT: Vision-Language-Action Models with Multi-Scale Point-Action Interaction","primary_cat":"cs.RO","submitted_at":"2026-05-20T17:10:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PointACT proposes a 3D-aware dual-system VLA policy using multi-scale point-action interaction with bottleneck window self-attention, achieving 10% higher success rates on RLBench-10Tasks over prior pretrained VLAs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19986","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond Binary Success: A Diagnostic Meta-Evaluation Framework for Fine-Grained 
Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-19T15:25:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MetaFine reconstructs benchmarks into diagnostic scenarios to evaluate vision-language-action models on fine-grained manipulation, exposing dimension-specific failures and identifying the visual encoder as a key bottleneck.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19678","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RoVLA: Multi-Consistency Constraints for Robust Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-19T11:10:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"RoVLA enforces instructional, evolutionary, and observational consistency to improve robustness of VLA policies on manipulation benchmarks and real robots.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19282","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR","primary_cat":"cs.LG","submitted_at":"2026-05-19T03:00:26+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Pion modifies Muon's Newton-Schulz iterations into a controllable high-pass filter that anchors dominant singular values at 1 while suppressing noisy tails, outperforming Muon and AdamW in VLA and RLVR regimes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18556","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Key-Gram: Extensible World 
Knowledge for Embodied Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-18T15:37:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Key-Gram uses a memory module with key-grams and hashed lookup to inject static linguistic priors into vision-language-action backbones, yielding reported gains on manipulation benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17601","ref_index":141,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From a Single Demonstration to a General Policy for Contact-Rich Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-17T18:58:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A one-shot LfD framework abstracts a single demonstration into environmental-constraint primitives, then uses self-exploration, human corrections, and compliant recovery to produce a policy that generalizes across poses and geometries, achieving over 90% success on seven real-world multi-stage tasks","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Tedrake, F. Park, and K. Goldberg, \"\"data will solve robotics and automation: True or false?\": A debate,\"Science Robotics, vol. 10, no. 105, p. eaea7897, 2025. [140] C. He, X. Liu, G. S. Camps, G. Sartoretti, and M. Schwager, \"Demystifying diffusion policies: Action memorization and simple lookup table alternatives,\"arXiv preprint arXiv:2505.05787, 2025. [141] S. Fei, S. Wang, J. Shi, Z. Dai, J. Cai, P. Qian, L. Ji, X. He, S. Zhang, Z. Fei,et al., \"Libero-plus: In-depth robustness analysis of vision- language-action models,\"arXiv preprint arXiv:2510.13626, 2025. 20 cm 15 cm external light source visual alignment target Fig. 
21: Experimental setup for evaluating the impact of vision-based augmentation under lighting changes."},{"citing_arxiv_id":"2605.15705","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Feedback World Model Enables Precise Guidance of Diffusion Policy","primary_cat":"cs.RO","submitted_at":"2026-05-15T07:52:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Feedback world model closes the prediction-observation loop at inference time to correct errors and improve diffusion policy performance under distribution shift in robotics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13119","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-13T07:40:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"For reinforcement learning, each invocation (z, g) defines a bounded subtask MDP whose reset distribution starts from states where that invocation should begin. Let S0 z,g be the pool of such states, obtained from demonstration boundaries or successful executions of preceding subtasks. We sample ρz,g(s) = 1 |S0z,g | X ¯s∈S0z,g δ(s= ¯s), s 0 ∼ρ z,g .(7) Each invocation has a state-based completion predicate ψz,g(s)∈ {0,1} . For a bounded rollout τ= (s 0, a0, . . . 
, sH), the invocation-level reward is R(τ;z, g) =1 \u0014 max 0≤t≤H ψz,g(st) = 1 \u0015 .(8) Thus the RL stage optimizes the expected completion of the current invocation, max θ,Φ E(z,g), τ∼π θ,ϕg (·|z),M z,g [R(τ;z, g)],(9) which we instantiate with GRPO in our experiments."},{"citing_arxiv_id":"2605.12369","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization","primary_cat":"cs.RO","submitted_at":"2026-05-12T16:38:40+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"language instructions to low-level robot actions by combin- ing pretrained vision-language models with large-scale robot demonstrations [2, 10, 104, 23, 31, 45, 6, 101, 90]. One important research direction focuses on scaling embodied data through multi-source datasets [67, 81, 43, 20, 63, 40, 48], standardized multi-task benchmarks [64, 59, 28, 39, 92], and evaluations under distribution shift [28, 65, 24]. Another line of work improves training and inference recipes, including multimodal prompting [42, 25], parameter-efficient adaptation [44, 87, 74, 34], and inference-time acceleration [88, 9, 41, 62, 99]. 
In parallel, prior work strengthens the action pathway through alternative action parameterizations and learning objectives, including diffusion- or flow-based generation"},{"citing_arxiv_id":"2605.12167","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-12T14:15:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12090","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"World Action Models: The Next Frontier in Embodied AI","primary_cat":"cs.RO","submitted_at":"2026-05-12T13:10:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"However, standard VLA models do not explicitly model world dynamics: they learn direct observation-to-action mappings without predicting how the environment changes under intervention [4]. This absence of predictive physical reasoning limits their generalization, where anticipating future states is essential. Equipping embodied policy models with world modeling capabilities thus emerges as a natural direction [5]. A growing body of recent work has begun integrating world models into the embodied policy pipeline. 
These approaches leverage predictive models of environment dynamics to provide agents with physical foresight, whether through video prediction as visual planning [6-10], latent dynamics modeling for policy conditioning [11-15], or joint state-action generation within unified architectures [16-22]."},{"citing_arxiv_id":"2605.11832","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-12T09:21:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The method uses multi-view diffusion priors and action manifold learning to resolve depth ambiguity and improve action prediction in VLA robotic manipulation models, reporting higher success rates than baselines on LIBERO, RoboTwin, and real-robot tasks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Unlike traditional diffusion policies that predict noise or velocity, AML directly predicts clean action chunks on this underlying action manifold. By explicitly mapping policy outputs to action trajectories, our approach eliminates the inefficiency of indirect decoding, ensuring more efficient optimization. We evaluate our method on several benchmarks, including LIBERO [27], LIBERO-Plus [28], RoboTwin 2.0 [29], and real-world robotic tasks. Experimental results demonstrate that our framework consistently outperforms state-of-the-art methods in both success rate and robustness. 
Our main contributions are summarized as follows: •We present a VLA framework that enables reliable spatial perception and efficient action learning, al- lowing for robust and precise robotic manipulation."},{"citing_arxiv_id":"2605.11817","ref_index":1,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"See What Matters: Differentiable Grid Sample Pruning for Generalizable Vision-Language-Action Model","primary_cat":"cs.RO","submitted_at":"2026-05-12T09:08:42+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GridS is a plug-and-play differentiable module for geometry-aware visual token resampling in VLA models that achieves under 10% token retention and 76% FLOPs reduction with no success-rate loss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11809","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond World-Frame Action Heads: Motion-Centric Action Frames for Vision-Language-Action Models","primary_cat":"cs.AI","submitted_at":"2026-05-12T09:03:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MCF-Proto adds a motion-centric local action frame and prototype parameterization to VLA models, inducing emergent geometric structure and improved robustness from standard demonstrations alone.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"2 , and use the smoothness loss Lsmooth = 1−cosθ t.The final objective is: L=L act +λ orthoLortho +λ smoothLsmooth. 4 Experiments We evaluate MCF-Proto from four perspectives: benchmark performance, robustness under distri- bution shift, analysis of the learned local action representation, and real-world validation. 
We first report results on LIBERO [27] and LIBERO-plus [15], then analyze how the learned motion-centric frame reshapes the action space through concentration and geometric compatibility diagnostics, and finally present real-world results and hierarchical ablations. 4.1 Benchmarks and metrics LIBERO.We evaluate on the standard LIBERO benchmark, including the Spatial, Object, Goal, and Long suites. We report task success rate (%) on each suite and the average across all suites."},{"citing_arxiv_id":"2605.11205","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Scaling Law of Evaluation Failure: Why Simple Averaging Collapses Under Data Sparsity and Item Difficulty Gaps, and How Item Response Theory Recovers Ground Truth Across Domains","primary_cat":"cs.LG","submitted_at":"2026-05-11T20:17:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Simple averaging of evaluation scores degrades in rank correlation with ground truth under data sparsity and difficulty variation, while a two-parameter logistic Item Response Theory model maintains high correlation across conditions.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"In AV safety, they tell regulators which test conditions to mandate. In cybersecurity, they tell CISOs which threat types matter most for product differentiation. 6 Analysis 6.1 Why Simple Averaging Fails The failure of simple averaging under sparsity can be understood through a decomposition. For systemjwith abilityθj, the expected value of the simple average is: E[¯rj] = 1 |Ij| ∑ i∈Ij σ(ai(θj−bi))(7) whereI j ={i:Mji = 1}is the set of items observed for systemj. WhenIj is a biased subset-containing predominantly easy items (lowbi) or predominantly hard items (highbi)- the expected valueE[¯rj]is a biased estimator of system ability. 
Critically, the bias direction depends on which items are observed, not on the system's true ability. 9 Consider two systems: system A (θA = 2."},{"citing_arxiv_id":"2605.10094","ref_index":7,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Retrieve-then-Steer: Online Success Memory for Test-Time Adaptation of Generative VLAs","primary_cat":"cs.RO","submitted_at":"2026-05-11T07:11:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A retrieve-then-steer method stores successful robot actions in memory and uses them to steer a frozen VLA's flow-matching sampler for better test-time reliability without parameter updates.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Embracing this \"persistent online\" perspective is vital. While many existing VLAs possess strong foundational capabilities, their closed-loop execution often remains unstable during real-world deployment. Although a robot might occasionally complete a task, it is highly prone to failure in nearly identical states due to perception noise, viewpoint shifts, or accumulated errors [ 7, 21]. Preprint. arXiv:2605.10094v2 [cs.RO] 12 May 2026 This fragility underscores the value of successful experiences. A successful grasp or placement, for instance, implicitly captures the visual geometry, actuation biases, and execution timing specific to that environment. 
Consequently, these trajectories should not be treated as isolated samples discarded"},{"citing_arxiv_id":"2605.09948","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models","primary_cat":"cs.AI","submitted_at":"2026-05-11T03:51:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"where θ is a predefined threshold determined heuristically based on the wσ rule of a normal distribu- tion (e.g., θ≈0.68,0.95,0.997 for w= 1,2,3 , respectively).The final action is selected from the most probable visited step: n∗ = arg max k≤n p(k), A=A (n∗).(16) 4 Experiments 4.1 Experiment Settings Simulation Benchmark Details.We evaluate on LIBERO [ 35], VLA-Arena [36], and LIBERO- Plus [37]. LIBERO contains multiple task suites covering spatial reasoning, object interaction, and long-horizon manipulation, while VLA-Arena provides a standardized benchmark with diverse tasks and difficulty levels. To assess generalization, we additionally report zero-shot performance on LIBERO-Plus, which introduces unseen task compositions. 
Following prior work, we report the average success rate over 50 trials per suite on LIBERO and"},{"citing_arxiv_id":"2605.10993","ref_index":36,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ECHO: Continuous Hierarchical Memory for Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-09T13:06:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ECHO organizes VLA experiences into a hierarchical memory tree in hyperbolic space via autoencoder and entailment constraints, delivering a 12.8% success-rate gain on LIBERO-Long over the pi0 baseline.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"how does the memory injection strength affect policy stability? 4.1 Experimental Setup Environments and Tasks.We conduct experiments primarily on theLIBERO[ 35] simulation benchmark and further validateECHOon a real-world robotic platform. All models are evaluated on four standard LIBERO suites:LIBERO-Spatial,LIBERO-Object,LIBERO-Goal, andLIBERO-Long (the 10-task sequence). We additionally useLIBERO-Plus[ 36] as a supplementary evaluation for high-complexity scene understanding. Real-world experiments are conducted on a Franka Emika Panda robot to assess the deployability ofECHOin physical manipulation settings. Baselines.We compareECHOwith three groups of baselines. 
First, Octo and OpenVLA are included as representative generalist VLA policies on the standard LIBERO suites."},{"citing_arxiv_id":"2605.07943","ref_index":57,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"TAVIS: A Benchmark for Egocentric Active Vision and Anticipatory Gaze in Imitation Learning","primary_cat":"cs.RO","submitted_at":"2026-05-08T16:11:13+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"TAVIS is a released benchmark showing active vision improves imitation learning in a task-dependent manner, multi-task policies struggle with shifts, and imitation produces human-like anticipatory gaze.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"7]22.1[18.6,26.0]9.0[6.7,11.8]11.2[8.7,14.4]8.3[6.2,11.1] TA VIS-Hands peeking-box 64.6[54.6,73.4]51.0[41.2,60.8]15.6[9.7,24.2]84.4[75.8,90.3]68.8[58.9,77.1]39.6[30.4,49.6] occluded-reach 87.5[79.4,92.7]60.4[50.4,69.6]24.0[16.5,33.4]78.1[68.9,85.2]43.8[34.3,53.7]43.8[34.3,53.7] blocked-clutter-pick-cube 58.3[48.3,67.7]35.4[26.6,45.4]4.2[1.6,10.2]67.7[57.8,76.2]40.6[31.3,50.6]31.2[22.9,41.1] suite mean70.1[64.6,75.1]49.0[43.2,54.7]14.6[11.0,19.1]76.7[71.5,81.2]51.0[45.3,56.8]38.2[32.8,43.9] 17 Table 5:Single-task π0 success rates (%) on TA VIS, with 95% Wilson confidence intervals.Each cell corresponds to an independent π0 checkpoint trained on a single (suite, robot, camera-mode, task) tuple and evaluated for 96 episodes."},{"citing_arxiv_id":"2605.07381","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation","primary_cat":"cs.RO","submitted_at":"2026-05-08T07:35:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Anchor-Centric Adaptation escapes the diversity trap by prioritizing 
repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"OPTIMALNUMBER OFANCHORSUNDERUNIFORMALLOCATION Assume the anchors are quasi-uniform in the sense that for a geometry-dependent constantc P, hK ≤c P K −1/d.(34) Assume further that a total budget N is split uniformly across anchors, so ni =N/K , and that the expected anchor estimation error satisfies max i E∥ ˆf(z, p i, t)−f ⋆(z, pi, t)∥ ≤Cσ r K N .(35) Combining (23), (34), and (35) gives E(K)≤Cσ r K N +Lc P K −1/d.(36) The first term increases with K because each anchor receives fewer samples; the second term decreases with K because the anchor set coversPmore finely. 16 Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation Ignoring integer constraints, minimizing the right-hand side of (36) yields"},{"citing_arxiv_id":"2605.06481","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-07T16:06:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"build on this by jointly predicting actions and future world states, supplying extra temporal and causal supervision [9, 89, 90]. 
However, robustness benchmarks show that high standard-benchmark scores do not imply reliable scene understanding: under modest perturbations of object layout, camera viewpoint, robot initial state, background, lighting, or sensor noise, policies often collapse from near-saturated success rates [19, 23, 61], and remain insensitive to paraphrased or even meaningless instruction tokens [19, 99], both signs that target selection is bound to training layouts and visual context rather than to the language-named objects. Most existing WAMs still represent the predicted world as full-frame observations, image/video token streams, or shared global vision-action latents [10, 95, 101]; such"},{"citing_arxiv_id":"2605.06247","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CKT-WAM: Parameter-Efficient Context Knowledge Transfer Between World Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-07T13:26:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CKT-WAM transfers teacher WAM knowledge to students via compressed text-embedding contexts using LQCA and adapters, reaching 86.1% success on LIBERO-Plus with 1.17% trainable parameters and 83.3% in real-world tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06175","ref_index":9,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts","primary_cat":"cs.RO","submitted_at":"2026-05-07T12:56:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"VLA-GSE uses spectral decomposition of the VLA backbone to create generalized and specialized experts, enabling effective robot task adaptation while updating only 2.51% of parameters and achieving 81.2% 
zero-shot success on LIBERO-Plus.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"To preserve the pre-trained backbone as much as possible at initialization, we compensate for the non-zero perturbation introduced by the generalized and specialized experts by adjusting the frozen backbone weight from W0 to ˜W0. Let Weq(x) denote the input-dependent equivalent weight of the GSE block. Then, it follows that y=W eq(x)x, W eq(x) = ˜W0 +s gBgAg + EX i=1 wi(x)s i sBi sAi s.(9) In principle, one may desire an exact sample-wise equality: Weq(x) =W 0,∀x . However, this is generally impossible because the selection of specialized experts varies withx through wi(x). Accord- ingly, we aim to maintain consistency with the pretrained weightin expectationunder the routing dis- tribution at initialization. At random initialization, the router is symmetric across specialized experts"},{"citing_arxiv_id":"2605.08215","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Test-Time Training for Visual Foresight Vision-Language-Action Models","primary_cat":"cs.CV","submitted_at":"2026-05-06T11:21:25+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03269","ref_index":33,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RLDX-1 Technical Report","primary_cat":"cs.RO","submitted_at":"2026-05-05T01:40:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"RLDX-1 outperforms frontier VLAs such as π0.5 and GR00T N1.6 on dexterous manipulation benchmarks, reaching 86.8% success on ALLEX humanoid tasks versus around 40% for the 
baselines.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"Scaling rectified flow transformers for high-resolution image synthesis. In International Conference on Machine Learning, 2024. [32] Haoquan Fang, Markus Grotz, Wilbert Pumacay, Yi Ru Wang, Dieter Fox, Ranjay Krishna, and Jiafei Duan. Sam2act: Integrating visual foundation model with a memory architecture for robotic manipulation.arXiv preprint arXiv:2501.18564, 2025. 30 RLDX-1 Technical Report [33] Senyu Fei, Siyin Wang, Junhao Shi, Zihao Dai, Jikun Cai, Pengfang Qian, Li Ji, Xinzhe He, Shiduo Zhang, Zhaoye Fei, et al. LIBERO-Plus: In-depth robustness analysis of vision-language-action models.arXiv preprint arXiv:2510.13626, 2025. [34] Yao Mu Fourier ActionNet Team. Fourier actionnet dataset.https://action-net.org, 2025. [35] NVIDIA GEAR. GR00T N1."},{"citing_arxiv_id":"2605.00321","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Embodied Interpretability: Linking Causal Understanding to Generalization in Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-01T01:00:00+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00078","ref_index":121,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Being-H0.7: A Latent World-Action Model from Egocentric Videos","primary_cat":"cs.RO","submitted_at":"2026-04-30T14:16:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA 
efficiency.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"unseen object instances and novel kitchen styles. • GR1[ 5]: GR1 is a bimanual manipulation benchmark featuring a GR-1 humanoid robot equipped with Fourier dexterous hands. It comprises 24 complex tabletop manipulation tasks that require fine-grained dexterity and coordination. We train our model using 1000 demonstrations per task. Evaluation is performed with 50 trials per task. • LIBERO-plus[ 121]: LIBERO-plus is explicitly designed to systematically assess policy robustness and zero-shot generalization under a diverse set of controlled environmental perturbations. Following standard practice [121], we evaluate our model under two distinct training configurations: a baseline trained exclusively on the standard LIBERO dataset, and a variant fine-tuned on the augmented"},{"citing_arxiv_id":"2604.27472","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations","primary_cat":"cs.AI","submitted_at":"2026-04-30T06:14:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"agnostic to temporal distance to task completion. (ii) Language to State-Action (l→s, a ):In contrast, the temporal signal is introduced in the reverse direction. 
For a language anchor l_i with positive set S(i) = {j ∈ B : τ_j = τ_i}, we optimize: L_{l→sa} = E_{i∼B}[ −Σ_{j∈S(i)} q_{ij} log( exp(ψ_i^⊤ ϕ_j) / Σ_{k∈B} exp(ψ_i^⊤ ϕ_k) ) ], with q_{ij} = γ^{T_j − t_j} / Σ_{j′∈S(i)} γ^{T_{j′} − t_{j′}}. (12) Unlike the sa→l direction, l_i now has multiple positives. The soft targets q_{ij} require the predicted probability that the state-action pair j belongs to task i to scale as γ^{T_j − t_j}. This forces the representation inner product to satisfy ψ_i^⊤ ϕ_j = (T_j − t_j) log γ + C according to Eq. (9), thereby encoding the temporal distance to task completion within the representation space."},{"citing_arxiv_id":"2604.23121","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Breaking Lock-In: Preserving Steerability under Low-Data VLA Post-Training","primary_cat":"cs.RO","submitted_at":"2026-04-25T03:18:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"UNKNOWN","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DeLock mitigates lock-in in low-data VLA post-training via visual grounding preservation and test-time contrastive prompt guidance, outperforming baselines across eight evaluations while matching data-heavy generalist policies.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[16] X. Li, K. Hsu, J. Gu, K. Pertsch, O. Mees, H. R. Walke, C. Fu, I. Lunawat, I. Sieh, S. Kirmani, et al. Evaluating real-world robot manipulation policies in simulation. arXiv preprint arXiv:2405.05941, 2024. [17] I. Shenfeld, J. Pari, and P. Agrawal. Rl's razor: Why online reinforcement learning forgets less. arXiv preprint arXiv:2509.04259, 2025. [18] S. Fei, S. Wang, J. Shi, Z. Dai, J. Cai, P. Qian, L. Ji, X. He, S. Zhang, Z. Fei, et al. Libero-plus: In-depth robustness analysis of vision-language-action models. arXiv preprint arXiv:2510.13626, 2025. [19] X. Zhou, Y. Xu, G. Tie, Y. Chen, G. Zhang, D. Chu, P. Zhou, and L. Sun. 
Libero-pro: Towards robust and fair evaluation of vision-language-action models beyond memorization."},{"citing_arxiv_id":"2604.23001","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Vision-Language-Action in Robotics: A Survey of Datasets, Benchmarks, and Data Engines","primary_cat":"cs.RO","submitted_at":"2026-04-24T20:41:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A survey of VLA robotics research identifies data infrastructure as the primary bottleneck and distills four open challenges in representation alignment, multimodal supervision, reasoning assessment, and scalable data generation.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"Dapeng Zhang, Jing Sun, Chenghui Hu, Xiaoyan Wu, Zhenlong Yuan, Rui Zhou, Fei Shen, and Qingguo Zhou. Pure vision language action (vla) models: A comprehensive survey. arXiv preprint arXiv:2509.19012, 2025a. Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision tasks: A survey. IEEE transactions on pattern analysis and machine intelligence, 46(8):5625-5644, 2024a. Shiduo Zhang, Zhe Xu, Peiju Liu, Xiaopeng Yu, Yuan Li, Qinghui Gao, Zhaoye Fei, Zhangyue Yin, Zuxuan Wu, Yu-Gang Jiang, and Xipeng Qiu. Vlabench: A large-scale benchmark for language-conditioned robotics manipulation with long-horizon reasoning tasks, 2024b. URL https://arxiv.org/abs/2412.18194. 
Yuhong Zhang, Zihan Gao, Shengpeng Li, Ling-Hao Chen, Kaisheng Liu, Runqing Cheng, Xiao Lin, Junjia"},{"citing_arxiv_id":"2604.21241","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors","primary_cat":"cs.RO","submitted_at":"2026-04-23T03:17:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"follows the public implementation provided byStarVLA. Unless stated otherwise, we keep the training protocols and hyperparameters identical to the respective official defaults for both backbones, ensuring a fair and reproducible com- parison. TABLE III SUCCESS RATES(%)ONLIBEROFOR THE4-IN-1MODEL.Corr DENOTES OUR METHOD. Method Long Goal Object Spatial Avg π0(3.3B) 73 95.0 86.0 90.0 86.0 GraspVLA [27] 82.0 91.2 94.1 - 89.1 NORA [28] 74.6 89.4 89.4 92.2 87.9 SmolVLA-Base 72.0 89.0 87.0 98.0 86.5 SmolVLA-Corr85.2 90.8 95.8 9290.95 Our method introduces a sparse set of future spatial anchors derived from the action chunk. 
Specifically, given the action horizon (chunk size) used by the flow-matching action head, we sampleKsparse anchor steps and predict their"},{"citing_arxiv_id":"2604.20834","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PokeVLA: Empowering Pocket-Sized Vision-Language-Action Model with Comprehensive World Knowledge Guidance","primary_cat":"cs.RO","submitted_at":"2026-04-22T17:58:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PokeVLA is a lightweight VLA model pre-trained on 2.4M samples for spatial grounding and reasoning, then adapted via multi-view semantics and geometry alignment to achieve state-of-the-art robot manipulation performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18107","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Test-Time Perturbation Learning with Delayed Feedback for Vision-Language-Action Models","primary_cat":"cs.CV","submitted_at":"2026-04-20T11:25:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PDF improves VLA success rates on LIBERO and Atari by applying test-time perturbation learning with delayed feedback to correct trajectory overfitting and overconfidence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18000","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-04-20T09:25:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"State-of-the-art vision-language-action models catastrophically fail dynamic embodied reasoning 
due to lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse caused by architectural bottlenecks, as shown by the new BeTTER benchmark with real-world validation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"corresponding gains in downstream action prediction. As illustrated in Figure 1, this discrepancy gives rise to what we term theillusion of embodied reasoning: policies appear competent in visually stable, in-domain settings, yet rely on shortcut correlations rather than genuine semantic grounding or causal reasoning. Recent robustness-oriented benchmarks, such as LIBERO-Plus [16] and LIBERO-PRO [17], take important steps toward evaluating policies under distribution shifts. However, their reliance on fixed simulation assets limits the ability to systematically vary task semantics, interaction structures, and underlying task logic. This makes it difficult to disentangle reasoning failures from execution-level and perceptual confounders."},{"citing_arxiv_id":"2604.17876","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OFlow: Injecting Object-Aware Temporal Flow Matching for Robust Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-04-20T06:38:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"UNKNOWN","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OFlow unifies temporal foresight and object-aware reasoning inside a shared latent space via flow matching to improve VLA robustness in robotic manipulation under distribution shifts.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Spatial Object Goal Long Overall Spatial Object Goal Long Overall Diffusion Policy (Chi et al. [13]) 78.5 87.5 73.5 64.8 76.1 - - - - - OpenVLA (Kim et al. [25]) 84.7 88.4 79.2 53.7 76.5 19.4 14.0 15.1 14.3 15.6 WorldVLA (Cen et al. 
[9]) 87.6 96.2 83.4 60.0 81.8 32.5 28.6 31.8 8.2 25.0 UniVLA (Bu et al. [7]) 95.4 98.8 93.6 94.0 95.5 55.5 36.7 40.7 39.9 42.9 NORA (Hung et al. [22]) 92.2 95.4 89.4 74.6 87.9 47.6 34.4 38.8 36.3 39.0 GR00T-N1 (Bjorck et al. [4]) 94.4 97.6 93.0 90.6 93.9 - - - - - GR00T-N1.5 (Bjorck et al. [4]) 96.5 98.5 91.0 91.5 94.4 77.1 77.1 64.7 59.7 69.5 𝜋0 (Black et al. [5]) 96.8 98.895.885.2 94.2 60.7 61.4 44.9 48.4 53.6 𝜋0-FAST (Pertsch et al. [41]) 96.4 96.8 88.6 60.2 85.5 74.4 72.7 57.5 43.4 61.6 CoT-VLA (Zhao et al."},{"citing_arxiv_id":"2604.17019","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Mini-BEHAVIOR-Gran: Revealing U-Shaped Effects of Instruction Granularity on Language-Guided Embodied Agents","primary_cat":"cs.AI","submitted_at":"2026-04-18T14:57:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Mini-BEHAVIOR-Gran benchmark reveals a U-shaped effect of instruction granularity on embodied agent performance, with planning-width correlating best and coarse instructions linked to vision-dominant shallow policies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09824","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ProGAL-VLA: Grounded Alignment through Prospective Reasoning in Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-04-10T18:56:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"UNKNOWN","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ProGAL-VLA uses 3D graphs, symbolic sub-goals, and a Grounding Alignment Contrastive loss to ground actions on verified embeddings, raising robustness from 30.3% to 71.5% and ambiguity AUROC to 0.81 on robotic 
benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08031","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Open-Ended Instruction Realization with LLM-Enabled Multi-Planner Scheduling in Autonomous Vehicles","primary_cat":"cs.RO","submitted_at":"2026-04-09T09:32:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLM-driven multi-planner scheduling framework turns open-ended passenger instructions into safe, traceable control signals for autonomous vehicles while cutting query costs and matching specialized safety levels.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05672","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model","primary_cat":"cs.RO","submitted_at":"2026-04-07T10:18:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A1 is a transparent VLA framework achieving state-of-the-art robot manipulation success with up to 72% lower latency via adaptive layer truncation and inter-layer flow matching.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Kembhavi. Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models, 2024. https://arxiv.org/abs/2409.17146. [9] Hengyu Fang, Yijiang Liu, Yuan Du, Li Du, and Huanrui Yang. Sqap-vla: A synergistic quantization-aware pruning framework for high-performance vision-language-action models, 2025.https://arxiv.org/abs/2509.09090. 
[10] Senyu Fei, Siyin Wang, Junhao Shi, Zihao Dai, Jikun Cai, Pengfang Qian, Li Ji, Xinzhe He, Shiduo Zhang, Zhaoye Fei, Jinlan Fu, Jingjing Gong, and Xipeng Qiu. Libero-plus: In-depth robustness analysis of vision-language-action models.arXiv preprint arXiv:2510.13626, 2025. [11] Ankit Goyal, Jie Xu, Yijie Guo, Valts Blukis, Yu-Wei Chao, and Dieter Fox."},{"citing_arxiv_id":"2604.05595","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Uncovering Linguistic Fragility in Vision-Language-Action Models via Diversity-Aware Red Teaming","primary_cat":"cs.RO","submitted_at":"2026-04-07T08:43:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DAERT generates diverse adversarial instructions via a uniform policy in RL to drop VLA task success rates from 93.33% to 5.85% on benchmarks with models like π0 and OpenVLA.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"International Conference on Machine Learning, volume 202 ofProceedings of Machine Learning Research. PMLR, 2023. URL https://proceedings. mlr.press/v202/driess23a.html. [6] Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O. Stanley, and Jeff Clune. Go-explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995, 2019. URL https://arxiv.org/abs/ 1901.10995. [7] Senyu Fei, Siyin Wang, Junhao Shi, Zihao Dai, Jikun Cai, Pengfang Qian, Li Ji, Xinzhe He, Shiduo Zhang, Zhaoye Fei, et al. Libero-plus: In- depth robustness analysis of vision-language-action models.arXiv preprint arXiv:2510.13626, 2025. 
16 [8] Divyam Goel, Yufei Wang, Tiancheng Wu, Guixiu Qiao, Pavel Piliptchak, David Held, and Zackory Erickson."},{"citing_arxiv_id":"2604.04834","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes","primary_cat":"cs.CV","submitted_at":"2026-04-06T16:35:57+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"E-VLA integrates event streams directly into VLA models via lightweight fusion, raising Pick-Place success from 0% to 60-90% at 20 lux and from 0% to 20-25% under severe motion blur.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"control, and have shown strong performance on open-ended manipulation tasks in well-lit laboratory settings [15,32,41,66]. However, real-world deployment exposes a critical weakness of current VLA systems: perceptual robustness under sensing-stage degradations. Recent surveys and benchmarks highlight illumination variation and visual domain shift as ma- jor factors limiting stable VLA performance [17,32,41]. In particular, low-light conditions substantially reduce signal quality, while increasing exposure time to recover brightness inevitably introduces motion blur and additional latency dur- ing fast manipulation. In extreme cases, severe under-exposure can lead to black clipping, where frame images become nearly unusable. 
Importantly, these fail-"},{"citing_arxiv_id":"2604.09651","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"FlowHijack: A Dynamics-Aware Backdoor Attack on Flow-Matching Vision-Language-Action Models","primary_cat":"cs.CV","submitted_at":"2026-03-30T03:54:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"FlowHijack is the first dynamics-aware backdoor attack on flow-matching VLAs that achieves high success rates with stealthy triggers while preserving benign performance and making malicious actions kinematically indistinguishable from normal ones.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02900","ref_index":94,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Safety in Embodied AI: A Survey of Risks, Attacks, and Defenses","primary_cat":"cs.CR","submitted_at":"2026-03-28T13:21:44+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"1) Attacks (3):[310] [32] [126] Goal Conflicts (§ 4.3.2) Attacks (3):[70] [172] [18] Potential Defenses (§ 4.3.3) Defenses (4):[194] [311] [184] [104] Action (§ 5) Control (§ 5.1) Adversarial Attacks White-box (17):[315] [438] [387] [185] [317] [60] [63] [463] [365] [280] [166] [144] [84] [371] [83] [82] [86] Black-box (8):[107] [439] [470] [250] [170] [20] [94] [372] Adversarial Defenses Robust Training (38):[288] [283] [257] [328] [344] [438] [323] [163] [439] [276] [8] [138] [212] [180] [423] [216] [426] [115] [318] [110] [137] [482] [476] [102] [213] [234] [312] [408] [325] [26] [409] [349] [272] [266] [25] [202] [434] [83] Robust Inference (6):[388] [246] [235] [122] [135] [263] Backdoor Attacks Training Manipulation (4):[360] [59] [114] 
[480]"}],"limit":50,"offset":0}