{"total":166,"items":[{"citing_arxiv_id":"2606.27872","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"S$^2$-VLA: State-Space Guided Vision-Language-Action Models for Long-Horizon Manipulation","primary_cat":"cs.RO","submitted_at":"2026-06-26T09:13:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"S²-VLA uses a state-space model to maintain a belief state that produces dynamic gating weights for fusing visual, language, and action features, claiming better long-horizon manipulation than 7B models with only 2B parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.03847","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Denoising Tells When to Replan: Denoising-Variance Adaptive Chunking for Flow-Based Robot Policies","primary_cat":"cs.RO","submitted_at":"2026-06-02T16:26:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DVAC uses denoising variance as an intrinsic signal to adaptively chunk actions in flow-based robot policies, improving success rates and cutting replans on LIBERO, RoboTwin, CALVIN, and real-world tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00537","ref_index":29,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PACE: Phase-Aware Chunk Execution for Robot Policies with Action Chunking","primary_cat":"cs.RO","submitted_at":"2026-05-30T05:11:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PACE dynamically selects execution horizons for action chunks in robot policies by detecting low-speed transition points in predicted speed profiles, raising success rates from 
57.8% to 64.2% on 50 simulation tasks and from 50.7% to 70.4% in real-robot tests.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30569","ref_index":43,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Any-ttach: Quick End-effector Swapping Enables Manipulation Dexterity with Simplicity","primary_cat":"cs.RO","submitted_at":"2026-05-28T21:00:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Any-ttach shows that rapid end-effector swapping combined with demonstration collection and task planning enables reliable multi-tool skills in long-horizon tasks such as sandwich making.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30326","ref_index":54,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RoboWits: Unexpected Challenges for Robotic Creative Problem Solving","primary_cat":"cs.RO","submitted_at":"2026-05-28T17:57:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RoboWits benchmark with 238 tasks shows pre-trained VLAs succeed on seed tasks but fail on mutated ones, highlighting brittleness in reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29766","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MARS Policy: Multimodality Only When It Matters","primary_cat":"cs.RO","submitted_at":"2026-05-28T11:12:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MARS policy adaptively activates multimodal generation only when beneficial in robotic tasks, claiming 16.67% higher success and 83.20% lower inference latency than baselines in real-world 
tests.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29662","ref_index":50,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SAFE-Pruner: Semantic Attention-Guided Future-Aware Token Pruning for Efficient Vision-Language-Action Manipulation","primary_cat":"cs.CV","submitted_at":"2026-05-28T09:23:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SAFE-Pruner forecasts deep-layer token saliency in VLA models via semantic attention consistency and adaptive subtask detection to achieve up to 1.89x speedup with under 1.7% success rate loss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23856","ref_index":81,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Point Tracking Improves World Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-22T17:08:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23128","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"$\\pi_0$-EqM: Equilibrium Matching for Closed-Loop Vision-Language-Action Control","primary_cat":"cs.RO","submitted_at":"2026-05-22T01:07:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Equilibrium Matching decoder substitution in π₀ improves RoboTwin success from 40.4% to 50.2% across 19 tasks and reaches 87.0% on 
LIBERO-10.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22493","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Understanding Multimodal Failure in Action-Chunking Behavioral Cloning","primary_cat":"cs.LG","submitted_at":"2026-05-21T13:45:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"The paper identifies distinct failure mechanisms: excessive posterior-prior regularization erases mode information in latent policies, while smooth base-to-action maps limit mode coverage in generative policies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21976","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TacO: Benchmarking Tactile Sensors for Object Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-21T04:11:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The paper provides a task-driven benchmark comparing visual, acoustic, magnetic, and resistive tactile sensors on three manipulation tasks and concludes that sensor utility depends on modality, material friction, and task specifics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21258","ref_index":37,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Learning Structural Latent Points for Efficient Visual Representations in Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-20T14:48:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A hybrid structural latent points representation is learned by inserting a point-wise latent VAE into a point-cloud 
autoencoder and regularizing toward a Gaussian prior, paired with a lightweight 3DGS rendering pipeline, yielding gains on RLBench and ManiSkill2 benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20774","ref_index":33,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VLA-REPLICA: A Low-Cost, Reproducible Benchmark for Real-World Evaluation of Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-20T06:15:30+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"VLA-REPLICA is a low-cost and reproducible real-world benchmark for evaluating VLA models in robotic manipulation tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19138","ref_index":6,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"COBALT: Crowdsourcing Robot Learning via Cloud-Based Teleoperation with Smartphones","primary_cat":"cs.RO","submitted_at":"2026-05-18T21:37:32+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"COBALT enables scalable crowdsourced teleoperation of robots using smartphones, supporting concurrent users with low latency and yielding a 7500+ demonstration dataset validated on imitation learning tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18727","ref_index":61,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DexHoldem: Playing Texas Hold'em with Dexterous Embodied System","primary_cat":"cs.RO","submitted_at":"2026-05-18T17:51:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DexHoldem is a new benchmark providing 1,470 teleoperated 
demonstrations across 14 manipulation primitives, plus standardized tests for dexterous policy execution and agentic perception in a physical Texas Hold'em setting.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17522","ref_index":34,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RoboFlow4D: A Lightweight Flow World Model Toward Real-Time Flow-Guided Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-17T16:11:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RoboFlow4D is an end-to-end lightweight flow world model that predicts multi-frame 3D flows from visual observations and textual instructions to provide explicit planning for real-time robotic manipulation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17486","ref_index":46,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization","primary_cat":"cs.RO","submitted_at":"2026-05-17T14:55:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DyGRO-VLA is a two-stage optimization framework for cross-task scaling of Vision-Language-Action models via dynamic grouped residual optimization in RL.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17300","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"HCLM: A Hierarchical Framework for Cooperative Loco-Manipulation with Dual 
Quadrupeds","primary_cat":"cs.RO","submitted_at":"2026-05-17T07:23:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HCLM presents a hierarchical architecture that uses an SE(3)-invariant diffusion policy for coordination and a hybrid whole-body controller with MPC and admittance control for safe closed-chain loco-manipulation on dual quadrupeds.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16257","ref_index":49,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DexJoCo: A Benchmark and Toolkit for Task-Oriented Dexterous Manipulation on MuJoCo","primary_cat":"cs.RO","submitted_at":"2026-05-15T17:59:51+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DexJoCo is a benchmark and toolkit with 11 functionally grounded tasks, 1.1K trajectories, and empirical benchmarks for task-oriented dexterous manipulation on MuJoCo.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16241","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Offline Semantic Guidance for Efficient Vision-Language-Action Policy Distillation","primary_cat":"cs.CV","submitted_at":"2026-05-15T17:48:25+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VLA-AD distills 7B VLA teachers into 158M students using offline VLM semantic guidance on task phases and directions, matching teacher performance on LIBERO with 44x size reduction and 3.28x speedup.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16043","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Learning Sim-Grounded 
Policies for Bimanual Rope Manipulation from Human Teleoperation Data","primary_cat":"cs.RO","submitted_at":"2026-05-15T15:15:12+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A simulation-grounded state policy using 3D particle dynamics outperforms an egocentric vision policy by 30.8% in L1 error on unseen rope configurations for bimanual manipulation from limited human data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15492","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"FLASH: Efficient Visuomotor Policy via Sparse Sampling","primary_cat":"cs.RO","submitted_at":"2026-05-15T00:15:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FLASH Policy uses sparse Legendre polynomial trajectory fitting and history-anchored flow matching to enable single-step inference for visuomotor control, reporting 31.4 ms per-episode latency and >=92% success on five simulated plus two real manipulation tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15352","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Diffusion Policy for Coordinated Control of a Nonholonomic Mobile Base and Dual Arms in Door Opening and Passing","primary_cat":"cs.RO","submitted_at":"2026-05-14T19:23:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A diffusion policy learns coordinated control of a mobile base and dual arms to open and traverse damped pull doors in a single end-to-end visuomotor 
model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14598","ref_index":65,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DSSP: Diffusion State Space Policy with Full-History Encoding","primary_cat":"cs.RO","submitted_at":"2026-05-14T09:06:01+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DSSP is a history-conditioned diffusion state space policy that uses SSMs to encode full observation streams with an auxiliary dynamics objective and hierarchical fusion, achieving SOTA results with reduced model size in robot manipulation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14106","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Behavior Cloning for Active Perception with Low-Resolution Egocentric Vision","primary_cat":"cs.RO","submitted_at":"2026-05-13T20:45:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Behavior cloning produces active perception in a plant-centering task where a robot arm uses low-resolution egocentric RGB images to predict joint movements, with relative deltas outperforming absolute positions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13778","ref_index":33,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs","primary_cat":"cs.RO","submitted_at":"2026-05-13T16:57:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new speculative inference system speeds up diffusion VLAs to 19.1 ms average latency (3.04x faster) on LIBERO by replacing most 
full 58 ms inferences with 7.8 ms draft rounds while preserving task performance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Their diffusion-based continuous action generation captures multimodal action distributions, making them well suited to complex manipulation tasks. However, deploying dVLAs on real robotic systems remains challenging. In practice, robot control typically runs at a much higher frequency than model inference, as shown in Figure 1, and existing systems therefore rely on open-loop action chunking [33] to bridge this mismatch. For example, π0 [1] predicts a chunk of future actions and replans only after several control steps have already been executed. While this design reduces the frequency of replanning, each replanning round still requires the same expensive full inference pipeline. As a result, end-to-end inference latency remains the main bottleneck, limiting the applicability of dVLAs to reactive, latency-sensitive tasks."},{"citing_arxiv_id":"2605.13548","ref_index":30,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AttenA+: Rectifying Action Inequality in Robotic Foundation Models","primary_cat":"cs.RO","submitted_at":"2026-05-13T13:55:37+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"method","top_context_polarity":"background","context_text":"intrinsic physical hierarchy and heterogeneous importance of different motion phases. 2.2 Action Sequence Modeling for Robotics Modeling sequential robotic actions is a core research direction, with early efforts focusing on trajectory optimization and inverse reinforcement learning (IRL). 
Recent data-driven approaches include Action Chunking with Transformers (ACT) [30], which uses transformers to model temporal dependencies in action sequences, and Diffusion Policy [31], which leverages diffusion models for smooth, feasible trajectory generation-though these prioritize action quality over critical action prioritization based on physical characteristics (e.g., velocity). Prior works have explored importance weighting for imitation learning: some weight entire trajectories by demonstration quality [32, 33],"},{"citing_arxiv_id":"2605.13428","ref_index":60,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SID: Sliding into Distribution for Robust Few-Demonstration Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-13T12:22:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SID achieves approximately 90% success on six real-world manipulation tasks with only two demonstrations under out-of-distribution initializations, with less than 10% performance drop under distractors and disturbances.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13067","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"When Absolute State Fails: Evaluating Proprioceptive Encodings for Robust Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-13T06:41:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Episode-wise relative proprioceptive encoding outperforms absolute state baselines for robust robotic manipulation under varying reference frames.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12369","ref_index":98,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GuidedVLA: Specifying 
Task-Relevant Factors via Plug-and-Play Action Attention Specialization","primary_cat":"cs.RO","submitted_at":"2026-05-12T16:38:40+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"multimodal prompting [42, 25], parameter-efficient adaptation [44, 87, 74, 34], and inference-time acceleration [88, 9, 41, 62, 99]. In parallel, prior work strengthens the action pathway through alternative action parameterizations and learning objectives, including diffusion- or flow-based generation [18, 8, 7, 60, 17, 55, 14, 86], action chunking for temporal abstraction [98], and discrete or compressed action tokenizers to better match control bandwidth [69, 85]. Auxiliary Tasks for Robotics Models. Structured intermediate representations improve policy robustness under distribution shift. Object-centric methods factor manipulation around task-relevant entities such as object poses, keypoints, and relations [36, 32, 58, 33, 89, 53, 50, 49, 68, 13]."},{"citing_arxiv_id":"2605.12228","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Morphologically Equivariant Flow Matching for Bimanual Mobile Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-12T15:04:38+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A morphologically equivariant flow matching policy for bimanual robots enforces reflective symmetry to improve sample efficiency and enable zero-shot generalization to mirrored task configurations.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In future work, we will integrate visual observations and scale our method to temporal and rotational symmetries in the environment. REFERENCES [1] C. Chi, S. Feng, Y. Du, Z. Xu, E. 
Cousineau, B. Burchfiel, and S. Song, \"Diffusion policy: Visuomotor policy learning via action diffusion,\" in Proceedings of Robotics: Science and Systems (RSS), 2023. [2] T. Z. Zhao, V. Kumar, S. Levine, and C. Finn, \"Learning fine-grained bimanual manipulation with low-cost hardware,\" arXiv preprint arXiv:2304.13705, 2023. [3] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu et al., \"Rt-1: Robotics transformer for real-world control at scale,\" arXiv preprint"},{"citing_arxiv_id":"2605.12162","ref_index":75,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"X-Imitator: Spatial-Aware Imitation Learning via Bidirectional Action-Pose Interaction","primary_cat":"cs.RO","submitted_at":"2026-05-12T14:13:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"X-Imitator is a bidirectional action-pose interaction framework for spatial-aware imitation learning that outperforms vanilla policies and explicit pose guidance on 24 simulated and 3 real-world robotic tasks.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"P_t = F_p(Cond_p(ϕ_p(A_{t−1}), F^{vis}_t)) (1) where ϕ(·) projects the past trajectory into feature space, and Cond(·) is a feature fusion operator which degrades to output F^{vis}_t directly when there is no past estimate (t = 0). 3.3 Instantiations To demonstrate the versatility of X-Imitator, we instantiate the action branch using three representative visuomotor policies: DP3 [72], ACT [75], and RISE [59]. For the pose branch, we implement it as a lightweight diffusion head [12, 27] for simplicity. While these instantiations share the high-level dual-path logic, the key differences lie in how the conditional sequence projector ϕ(·) and feature fusion operator Cond(·) in Eqn. (1) are implemented to suit the base architecture. Fig. 
2 illustrates the feature fusion in action branch."},{"citing_arxiv_id":"2605.12090","ref_index":262,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"World Action Models: The Next Frontier in Embodied AI","primary_cat":"cs.RO","submitted_at":"2026-05-12T13:10:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11809","ref_index":50,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond World-Frame Action Heads: Motion-Centric Action Frames for Vision-Language-Action Models","primary_cat":"cs.AI","submitted_at":"2026-05-12T09:03:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MCF-Proto adds a motion-centric local action frame and prototype parameterization to VLA models, inducing emergent geometric structure and improved robustness from standard demonstrations alone.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"We further evaluate robustness on LIBERO-plus, which introduces seven perturbation categories: Camera, Robot, Language, Light, Background, Noise, and Layout. We report success rate (%) for each category. 4.2 Baselines We compare against a diverse set of recent VLA and robot policy baselines. On LIBERO, we include π0+FAST [33], OpenVLA-OFT [25], π0 [2], FLOWER [35], GR00T-N1.5 [30], and BEAST [50]. On LIBERO-plus, we compare against OpenVLA [ 24], OpenVLA-OFT, π0, π0-fast, Nora [ 20], WorldVLA [8], UniVLA [47], and RIPT-VLA [45]. 5 Table 1: Experimental Results on the LIBERO Benchmark. 
Success rate (%) is reported for each task suite. Bold indicates the best performance, underlined indicates the second best. Model Spatial Object Goal Long Avg"},{"citing_arxiv_id":"2605.11665","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Nautilus: From One Prompt to Plug-and-Play Robot Learning","primary_cat":"cs.RO","submitted_at":"2026-05-12T07:26:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"shared robot learning harness: a substrate of typed contracts, chambered execution, and uniform transport, plus a content layer of Guides, Sensors, and State. This changes the unit of work from pairwise integration to reusable onboarding, reducing the burden to Θ(N+M+K). (WAM)), benchmark suites B (e.g., LIBERO [9], RoboCasa [10], ManiSkill [11], ALOHA [12], etc.), and robot embodiments R (e.g., single-arm, bimanual, dexterous-hand, locomotion, humanoid, etc.), with cardinalities N, M, and K, respectively. 
Each non-trivial (P, B, R) cross-comparison typically requires a distinct hand-written integration layer, so docker container setup, observation adapters, smoke tests, and trust-validation procedures are repeatedly re-implemented across papers, labs, and"},{"citing_arxiv_id":"2605.11459","ref_index":10,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models","primary_cat":"cs.RO","submitted_at":"2026-05-12T03:17:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Pace-and-Path Correction decomposes a quadratic cost minimization into orthogonal pace and path channels to correct chunked actions in VLA models, raising success rates by up to 28.8% in dynamic settings.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"the backbone capacity that gives larger VLAs their generalization while still leaving each newly issued chunk blind to motion within the previous one. Indiscriminate re-inference can also break the temporal smoothness across chunks and degrade long-horizon coherence [28]. Other methods include asynchronous inpainting [28], rejection sampling [29], temporal ensembling [10], adaptive chunk sizing [30], and learned correction heads [22], which improve reactivity indirectly through smoother seams or more frequent re-planning. 
However, the chunks themselves still treat the environment as static, and any learnable corrector still suffers from the dilemma between latency and capacity as well as the ego-motion problem [10, 31]."},{"citing_arxiv_id":"2605.11114","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SEVO: Semantic-Enhanced Virtual Observation for Robust VLA Manipulation via Active Illumination and Data-Centric Collection","primary_cat":"cs.RO","submitted_at":"2026-05-11T18:23:04+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"SEVO raises ACT and SmolVLA pick-and-place success from 30-35% to 75-85% in novel environments by using active illumination, semantic cues, and diversified teleoperation data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10819","ref_index":61,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-11T16:37:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10094","ref_index":29,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Retrieve-then-Steer: Online Success Memory for Test-Time Adaptation of Generative VLAs","primary_cat":"cs.RO","submitted_at":"2026-05-11T07:11:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A retrieve-then-steer method 
stores successful robot actions in memory and uses them to steer a frozen VLA's flow-matching sampler for better test-time reliability without parameter updates.","context_count":1,"top_context_role":"other","top_context_polarity":"unclear","context_text":"Move Near, Open/Close Drawer, and Open Top Drawer and Place Apple, respectively. The largest gain appears on the most challenging long-horizon task, suggesting that our retrieval-guided prior is effective under layout variations and visual perturbations. 5.2 Real-World Experiments 5.2.1 Setup We evaluate our method on two real-world bimanual platforms: an OpenArm-based dual-arm system [6] and an ALOHA-PiPER system [29, 30]. We collect 100 training trajectories per task and evaluate four tasks: bowl stacking, cube handoff, and sequential test-tube placement on OpenArm, and bimanual T-shirt folding on ALOHA-PiPER. The tasks cover long-horizon manipulation, bimanual coordination, fine-grained placement, deformable-object manipulation, and appearance shifts. Hardware details, task definitions, and training/testing protocols are provided in Appendix D."},{"citing_arxiv_id":"2605.09860","ref_index":57,"ref_count":3,"confidence":0.98,"is_internal_anchor":true,"paper_title":"When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-11T01:43:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Learns state-conditioned commitment depth in a 7B vision-language policy that jointly predicts actions and replan intervals, outperforming fixed-depth baselines and larger models on Sliding Puzzle and Sokoban while providing a theoretical dominance result.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"typical instances (≈4× optimal-length ratio at mean h≈4 ); tight (10,4) applies a ∼33% cut. 
All main results (§5.3, 3.3, 6, 6.3) report loose; tight in App. F.1 as consistency check. 5.2 Comparison to frontier and open-weight zero-shot baselines We compare our 7 B fine-tuned policy against three frontier closed-source VLMs (GPT-5.5 [41], Claude Sonnet [4], Gemini 3.1 Pro [45]) and seven open-weight VLMs 8-78 B (InternVL3-8/14/78 B [57], Qwen2.5-VL-7/72 B [7], Qwen3-VL-8/32 B [6]), zero-shot under the same commitment interface (App. G). Not apples-to-apples, but tests whether scale alone recovers state-conditioned commitment-depth. It does not (Tab. 1): no frontier wins both (GPT-5.5 plateaus 22-35%; Gemini wins Sokoban but 11% Sliding; Claude solves neither); every open-weight VLM scores 0% at every"},{"citing_arxiv_id":"2605.08713","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"REAP: Reinforcement-Learning End-to-End Autonomous Parking with Gaussian Splatting Simulator for Real2Sim2Real Transfer","primary_cat":"cs.RO","submitted_at":"2026-05-09T05:50:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"REAP trains an end-to-end SAC policy with behavior cloning and collision penalties inside a 3DGS Real2Sim simulator and transfers it to physical vehicles, succeeding in narrow mechanical parking slots.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"challenges, ranging from multi-objective strategic planning [26] to end-to-end maneuver control in demanding scenarios like consecutive sharp turns [27]. For camera-based autonomous parking, learning policies directly from high-dimensional pixel inputs further increases the difficulty and instability of reinforcement learning. 
Although vision-based end-to-end methods [28] have been extensively studied in autonomous driving and robotics, most existing algorithms still rely on imitation learning because of the high training cost of reinforcement learning and the transfer challenges caused by the visual gap. A common strategy in this setting is offline reinforcement learning [29], [30], which improves training efficiency and has therefore attracted"},{"citing_arxiv_id":"2605.07931","ref_index":49,"ref_count":3,"confidence":0.98,"is_internal_anchor":true,"paper_title":"One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy","primary_cat":"cs.CV","submitted_at":"2026-05-08T16:04:43+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"with comprehensive world knowledge. arXiv preprint arXiv:2507.04447, 2025. [48] Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1702-1713, 2025. [49] Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023. [50] Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, and Chuang Gan. 
3d-vla: A 3d vision-language-action generative world model. arXiv preprint arXiv:2403."},{"citing_arxiv_id":"2605.07687","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PhySPRING: Structure-Preserving Reduction of Physics-Informed Twins via GNN","primary_cat":"cs.RO","submitted_at":"2026-05-08T12:55:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PhySPRING uses differentiable GNNs to learn hierarchical coarsened spring-mass topologies and parameters from observations, delivering up to 2.3x speedup on PhysTwin benchmarks and comparable robot policy success rates in zero-shot Real2Sim substitution.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"form of the digital twin across all reduction levels. Galerkin projection induces coarse spring-mass representation while a GNN decoder refines the resulting mechanical parameters through differentiable rollouts. (iii) We evaluate PhySPRING in a Real2Sim [10] policy-evaluation pipeline. The reduced models can be substituted zero-shot into ACT [19] and π0 [20] without retraining, preserving manipulation success rate, while improving action-sampling throughput by up to 1.23× over the original model. 2 Related works Physics-driven 4D Reconstruction Recent 4D reconstruction methods achieve high-fidelity geometry and appearance from visual observations [14, 21, 22, 23, 24, 25]. 
However, they are primarily"},{"citing_arxiv_id":"2605.07605","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"BrickCraft: Visuomotor Skill Composition with Situated Manual Guidance for Long-Horizon Interlocking Brick Assembly","primary_cat":"cs.RO","submitted_at":"2026-05-08T11:30:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"BrickCraft composes reusable visuomotor skills via relative anchoring to partial structures and situated visual manuals to achieve long-horizon interlocking brick assembly from limited demonstrations with generalization to unseen designs.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In contrast, our work decomposes long-horizon assembly into reusable learned primitives, which enhances data efficiency and facilitates seamless adaptation to novel assembly tasks. B. Visuomotor Imitation Learning Visuomotor imitation learning has emerged as a data-driven paradigm that learns to map raw visual observations to control actions by mimicking expert behaviors. Generative approaches like ACT [15] and diffusion policy [16], [17] formulate policies as conditional action distributions over multimodal observations, showing remarkable efficacy in task-specific, fine-grained manipulation. 
Building upon these foundations, Vision-Language-Action (VLA) models [3], [4], [18] leverage large-scale multimodal pre-training to achieve broader semantic scene understanding and cross-task generalization."},{"citing_arxiv_id":"2605.07306","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"BioProVLA-Agent: An Affordable, Protocol-Driven, Vision-Enhanced VLA-Enabled Embodied Multi-Agent System with Closed-Loop-Capable Reasoning for Biological Laboratory Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-08T06:15:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"BioProVLA-Agent integrates protocol parsing, visual state verification, and VLA-based execution in a closed-loop multi-agent framework with AugSmolVLA augmentation to improve robustness for biological lab tasks like tube handling and liquid pouring.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"mance of visuomotor policies in multi-task manipulation [26]. Subsequently, Zitkovich et al. proposed RT-2, which transfers Internet-scale vision-language pretraining knowledge to robotic control [27]. For dual-arm and complex manipulation, Liu et al. proposed RDT-1B, which adopts a diffusion Transformer to construct a foundation model for dual-arm manipulation [28]. Black et al. proposed π0, a general robotic control model built on a pretrained VLM and flow matching [29]. Zheng et al. proposed X-VLA, which uses soft prompts to address data heterogeneity across robotic embodiments [7], whereas Shukor et al. 
proposed SmolVLA, emphasizing VLA deployment under low-cost and low-computation conditions [8]."},{"citing_arxiv_id":"2605.06481","ref_index":96,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-07T16:06:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"A robustness study. arXiv preprint arXiv:2603.22078, 2026. [95] Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. CoT-VLA: Visual chain-of-thought reasoning for Vision-Language-Action models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. arXiv:2503.22020. [96] Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. In Robotics: Science and Systems (RSS), 2023. arXiv:2304.13705. [97] Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, et al. 
X-VLA: Soft-prompted transformer as scalable cross-embodiment"},{"citing_arxiv_id":"2605.06222","ref_index":31,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"When to Trust Imagination: Adaptive Action Execution for World Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-07T13:18:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A verifier called Future Forward Dynamics Causal Attention enables adaptive action execution in World Action Models, reducing model inferences by 69% and improving success rates in robotic tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"However, despite their ability to imagine how the world will evolve, current WAMs typically use their predicted future only to generate an action chunk, while the execution process itself remains largely blind to whether the imagined future is still consistent with the physical rollout. This reveals a fundamental limitation in current WAM execution. At each inference step, a WAM predicts a chunk of future actions [32] and the robot executes a fixed number of them before querying the model again. Such fixed-size execution ignores the fact that the reliability of WAM imagination varies across tasks and across phases within a task. 
For simple and predictable dynamics, such as approaching or grasping a rigid cup, the WAM prediction may remain accurate over a long horizon;"},{"citing_arxiv_id":"2605.05925","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DexSynRefine: Synthesizing and Refining Human-Object Interaction Motion for Physically Feasible Dexterous Robot Actions","primary_cat":"cs.RO","submitted_at":"2026-05-07T09:31:43+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05812","ref_index":49,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities","primary_cat":"cs.AI","submitted_at":"2026-05-07T07:47:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LQL turns n-step action-sequence lower bounds into a practical hinge-loss stabilizer for off-policy Q-learning without extra networks or forward passes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05544","ref_index":57,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Adaptive Q-Chunking for Offline-to-Online Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-07T00:48:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Adaptive Q-Chunking selects optimal action chunk sizes at each state via normalized advantage comparisons to outperform fixed chunk sizes in offline-to-online RL on robot benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}