{"total":161,"items":[{"citing_arxiv_id":"2606.27872","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"S$^2$-VLA: State-Space Guided Vision-Language-Action Models for Long-Horizon Manipulation","primary_cat":"cs.RO","submitted_at":"2026-06-26T09:13:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"S²-VLA uses a state-space model to maintain a belief state that produces dynamic gating weights for fusing visual, language, and action features, claiming better long-horizon manipulation than 7B models with only 2B parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09499","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Targeting World Models to Compromise Robot Learning Pipelines","primary_cat":"cs.RO","submitted_at":"2026-06-08T13:50:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"World models introduce a stealthy poisoning vector into robot learning pipelines where malicious prompts or dynamics in teleoperated data activate only during synthetic trajectory generation, enabling backdoors in downstream policies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.07107","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Coarse-to-Control: Action-Token Planning for Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-06-05T10:01:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Coarse-to-Control adds planning via coarse action tokens in the same vocabulary as control actions, improving VLA performance on long-horizon manipulation 
tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00664","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SKIP: Sparse Keyframe Interpolation Paradigm for Efficient Embodied World Models","primary_cat":"cs.RO","submitted_at":"2026-05-30T10:41:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SKIP achieves 4.16x faster dense video rollouts for robot world models by synthesizing only multimodal-identified keyframes and interpolating the rest, preserving policy training effectiveness with minimal success rate drops.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30660","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"BOKBO (Best of K Bad Options): Calibrated Abstention for VLA Policies","primary_cat":"cs.LG","submitted_at":"2026-05-28T23:39:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"BOKBO is the first conformal abstention method for K-sample VLA policies that supplies finite-sample distribution-free guarantees on executed violation rates, with global and Mondrian per-task variants.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30326","ref_index":37,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RoboWits: Unexpected Challenges for Robotic Creative Problem Solving","primary_cat":"cs.RO","submitted_at":"2026-05-28T17:57:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RoboWits benchmark with 238 tasks shows pre-trained VLAs succeed on seed tasks but fail on mutated ones, highlighting brittleness in 
reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30011","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies","primary_cat":"cs.CV","submitted_at":"2026-05-28T14:36:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VISUALTHINK-VLA uses visual evidence tokens and selective routing to reach top success rates on VLA benchmarks while cutting reasoning latency from multi-second to sub-second levels.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29562","ref_index":39,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VLA-Pro: Cross-Task Procedural Memory Transfer for Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-28T08:14:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"VLA-Pro improves cross-task generalization in vision-language-action models by storing task-specific LoRA adapters as procedural memories and retrieving/fusing them at inference.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29438","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ElegantVLA: Learning When to Think for Efficient Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-28T06:33:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ElegantVLA accelerates VLA models up to 3.77x by dynamically scheduling compute across vision, language, and action components without retraining the base 
model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29360","ref_index":38,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MiraBench: Evaluating Action-Conditioned Reliability in Robotic World Models","primary_cat":"cs.AI","submitted_at":"2026-05-28T04:58:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MiraBench defines action-conditioned reliability via three levels (physics adherence, action-following fidelity, optimism bias detection) and applies it to 12 model configurations using a 16,000-judgment human corpus, finding visual fidelity a poor proxy for action fidelity, no reliable scale benefi","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23856","ref_index":69,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Point Tracking Improves World Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-22T17:08:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23733","ref_index":37,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Any2Any: Efficient Cross-Embodiment Transfer for Humanoid Whole-Body 
Tracking","primary_cat":"cs.RO","submitted_at":"2026-05-22T15:10:42+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23128","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"$\\pi_0$-EqM: Equilibrium Matching for Closed-Loop Vision-Language-Action Control","primary_cat":"cs.RO","submitted_at":"2026-05-22T01:07:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Equilibrium Matching decoder substitution in π₀ improves RoboTwin success from 40.4% to 50.2% across 19 tasks and reaches 87.0% on LIBERO-10.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22896","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Agentic-VLA: Efficient Online Adaptation for Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-21T15:24:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Agentic-VLA enables efficient online adaptation of VLA models, delivering +12.3% on long-horizon tasks, +28.5% in 1-shot learning, and 2.4x faster convergence on LIBERO through three new components.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22493","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Understanding Multimodal Failure in Action-Chunking Behavioral Cloning","primary_cat":"cs.LG","submitted_at":"2026-05-21T13:45:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"The paper identifies distinct failure mechanisms: 
excessive posterior-prior regularization erases mode information in latent policies, while smooth base-to-action maps limit mode coverage in generative policies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22183","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Action with Visual Primitives","primary_cat":"cs.RO","submitted_at":"2026-05-21T08:52:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AVP architecture has VLM emit visual-primitive tokens to condition flow-matching action expert, yielding 27.61% higher success rate than pi_0.5 on real-robot pick-and-place tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21862","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EvoScene-VLA: Evolving Scene Beliefs Inside the Action Decoder for Chunked Robot Control","primary_cat":"cs.RO","submitted_at":"2026-05-21T01:19:17+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EvoScene-VLA maintains an action-updated scene prior across control chunks in VLA policies, raising success rates on RoboTwin tasks from 87.2% to 89.1% fixed and 86.1% to 88.5% randomized while outperforming baselines on a real robot.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20856","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DISC: Decoupling Instruction from State-Conditioned Control via Policy Generation","primary_cat":"cs.RO","submitted_at":"2026-05-20T07:45:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A hypernetwork generates 
complete task-specific visuomotor policy parameters from instructions alone to structurally eliminate observation leakage in language-conditioned robotic control.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19986","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond Binary Success: A Diagnostic Meta-Evaluation Framework for Fine-Grained Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-19T15:25:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MetaFine reconstructs benchmarks into diagnostic scenarios to evaluate vision-language-action models on fine-grained manipulation, exposing dimension-specific failures and identifying the visual encoder as a key bottleneck.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19924","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RoHIL: Robust Human-in-the-Loop Robotic Reinforcement Learning Against Illumination Variations","primary_cat":"cs.RO","submitted_at":"2026-05-19T14:47:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RoHIL adapts human-in-the-loop RL policies to new illumination conditions offline by combining world-model image relighting, illumination-retention replay, and anchored Bellman regularisation, improving shifted-light performance while preserving source performance on four real-robot tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20299","ref_index":146,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mechanisms of Misgeneralization in Physical Sequence 
Modeling","primary_cat":"cs.LG","submitted_at":"2026-05-19T12:34:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Generative sequence models for physical tasks exhibit physical misgeneralization where local prediction errors propagate through physical measurements to distort aggregate distributions over quantities like distance or energy; a data deviation kernel explains and predicts the shifts and supports a kernel","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19678","ref_index":29,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RoVLA: Multi-Consistency Constraints for Robust Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-19T11:10:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"RoVLA enforces instructional, evolutionary, and observational consistency to improve robustness of VLA policies on manipulation benchmarks and real robots.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19580","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PAPO-VLA: Planning-Aware Policy Optimization for Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-19T09:22:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PAPO-VLA identifies planning actions via variation and outcome, estimates their causal importance, and folds that importance into GRPO to emphasize key decisions while still using full-trajectory 
feedback.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19319","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SWEET: Sparse World Modeling with Image Editing for Embodied Task Execution","primary_cat":"cs.CV","submitted_at":"2026-05-19T03:54:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SWEET is a one-shot sparse visual planning framework that progressively generates manipulation keyframes via image editing conditioned on language and spatial guidance, then converts them to actions with a diffusion predictor, showing better fidelity and lower cost than video models on DROID and Rob","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18287","ref_index":37,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"StableVLA: Towards Robust Vision-Language-Action Models without Extra Data","primary_cat":"cs.CV","submitted_at":"2026-05-18T12:15:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"StableVLA adds an Information Bottleneck Adapter to VLA models that improves robustness to visual corruptions by 30% on average with under 10M extra parameters and no extra data, even when using a much smaller backbone.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17522","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RoboFlow4D: A Lightweight Flow World Model Toward Real-Time Flow-Guided Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-17T16:11:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RoboFlow4D is an end-to-end 
lightweight flow world model that predicts multi-frame 3D flows from visual observations and textual instructions to provide explicit planning for real-time robotic manipulation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17486","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization","primary_cat":"cs.RO","submitted_at":"2026-05-17T14:55:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DyGRO-VLA is a two-stage optimization framework for cross-task scaling of Vision-Language-Action models via dynamic grouped residual optimization in RL.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16743","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LACE: Latent Visual Representation for Cross-Embodiment Learning","primary_cat":"cs.RO","submitted_at":"2026-05-16T01:50:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LACE aligns human-robot visual features via semantic distribution matching on corresponding body parts plus Gram loss, yielding 65% better zero-shot policy transfer than baseline DINO.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16054","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Ada-Diffuser: Latent-Aware Adaptive Diffusion for Decision-Making","primary_cat":"cs.LG","submitted_at":"2026-05-15T15:21:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Ada-Diffuser is a causal diffusion model that 
jointly learns observed interaction structure and underlying latent dynamics from minimal observations for adaptive planning and policy learning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"M = ⟨S, A, Θ, T, R, γ, P_Θ⟩, where S is the state space, A is the action space, and Θ is the space of task-specific latent parameters. For each θ ∈ Θ, the transition and reward functions are given by T_θ : S × A → P(S) and R_θ : S × A → ℝ, respectively. The parameter θ is sampled from a prior distribution P_Θ at the beginning of an episode and remains fixed during the episode. The discount factor is denoted by γ ∈ [0, 1). This framework defines a family of MDPs indexed by the latent parameter θ, with each θ inducing a different set of dynamics and reward functions. It can be seen as a special case of a contextual MDP where the context is latent and fixed per episode. Xie et al. (2021) [Published as a conference paper at ICLR 2026] further generalize this framework by allowing the task parameter θ to evolve dynamically across episodes, rather than being fixed. Bayes-Adaptive MDPs (BAMDPs) are closely related to both HiP-MDPs and contextual MDPs (CMDPs). In BAMDPs, the agent maintains a posterior distribution over MDPs based on its interaction history. Specifically, it maintains a belief b_t(R, T) = p(R, T | τ_{:t}), where τ_{:t} = {s_0, a_0, r_0, ..., s_t} denotes the trajectory observed up to time t. This belief captures the agent's uncertainty about the underlying transition and reward functions. The transition and reward functions can then be defined in expectation over this posterior, effectively conditioning decision-making on the belief b_t. When the environment is driven by hidden contextual variables or latent task parameters, such as in CMDPs or HiP-MDPs, this belief can be interpreted as a distribution over these latent variables. 
In this view, BAMDPs provide a non-parametric framework for reasoning over hidden structure, while approaches like ours explicitly model such latent variables and infer their posterior distributions using amortized inference. Both aim to enable adaptive planning and learning under uncertainty, but differ in how latent structure is represented and "},{"citing_arxiv_id":"2605.15735","ref_index":40,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"UAM: A Dual-Stream Perspective on Forgetting in VLA Training","primary_cat":"cs.CV","submitted_at":"2026-05-15T08:45:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UAM adds a Dorsal Expert initialized from a generative model and trained on visual dynamics prediction to preserve over 95% of VLM multimodal ability in VLA training while achieving top success rates on manipulation tasks including OOD cases.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15536","ref_index":32,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SkiP: When to Skip and When to Refine for Efficient Robot Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-15T02:16:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SkiP introduces action relabeling and Motion Spectrum Keying to skip redundant steps in robot trajectories, cutting executed steps by 15-40% while maintaining success rates across 72 simulated and 3 real tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15298","ref_index":35,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PhysBrain 1.0 Technical 
Report","primary_cat":"cs.RO","submitted_at":"2026-05-14T18:11:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PhysBrain 1.0 extracts scene elements, spatial dynamics, actions and depth relations from human egocentric video to create QA supervision for VLMs, then transfers the resulting physical priors to VLA policies via capability-preserving adaptation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14598","ref_index":50,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DSSP: Diffusion State Space Policy with Full-History Encoding","primary_cat":"cs.RO","submitted_at":"2026-05-14T09:06:01+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DSSP is a history-conditioned diffusion state space policy that uses SSMs to encode full observation streams with an auxiliary dynamics objective and hierarchical fusion, achieving SOTA results with reduced model size in robot manipulation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13757","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"FrameSkip: Learning from Fewer but More Informative Frames in VLA Training","primary_cat":"cs.RO","submitted_at":"2026-05-13T16:38:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FrameSkip improves VLA policy training success from 66.50% to 76.15% by selecting high-importance frames and retaining only 20% of unique frames across three 
benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13925","ref_index":110,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Towards Robotic Dexterous Hand Intelligence: A Survey","primary_cat":"cs.RO","submitted_at":"2026-05-13T15:23:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A structured survey of dexterous robotic hand research that reviews hardware, control methods, data resources, and benchmarks while identifying major limitations and future directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"GraspVLA (Deng et al) Paper Picking (Lin et al) D3Grasp (Wang et al) UniGraspTransformer (Wang et al) Fig. 5. Grasp & Pick-and-Place Timeline controllers. OpenVLA [109] establishes the open-source VLA paradigm by fine-tuning vision-language-action models for robotic manipulation, enabling dexterous hands to execute language-conditioned tasks. Octo [110] scales this paradigm to large trajectory datasets and demonstrates cross-embodiment generalization, while subsequent work further expands training data sources by incorporating Internet-scale human interaction data into the VLA framework [111]. 
Recent studies extend VLA toward stronger geometric grounding and hierarchical control, which are critical for"},{"citing_arxiv_id":"2605.13632","ref_index":30,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-13T14:58:29+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GTA-VLA conditions VLA models on user spatial priors to produce a unified spatial-visual chain-of-thought, reaching 81.2% success on SimplerEnv WidowX and improving performance under out-of-distribution shifts.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"306K to learn the Guide and Think components, using stochastic spatial conditioning so that the model is exposed to both guided and unguided inputs. We then train the Flow-Matching action head to map the latent reasoning states H_reasoning together with control observations to continuous action chunks. In Stage 2, we jointly fine-tune the full policy on domain-specific robot data (e.g., BridgeData V2 [30]) to adapt the reasoning module and action head to the target embodiment and environment. Unless otherwise specified, reasoning generation is optimized with autoregressive token prediction on C, while the action module is optimized with the standard flow-matching objective on action chunks. 
Additional implementation details are provided in the supplementary material."},{"citing_arxiv_id":"2605.13548","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AttenA+: Rectifying Action Inequality in Robotic Foundation Models","primary_cat":"cs.RO","submitted_at":"2026-05-13T13:55:37+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022. [5] Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165-2183. PMLR, 2023. [6] Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024. [7] Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. 
Univla: Learning to act anywhere with task-centric latent actions."},{"citing_arxiv_id":"2605.13316","ref_index":25,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Test-time Sparsity for Extreme Fast Action Diffusion","primary_cat":"cs.CV","submitted_at":"2026-05-13T10:28:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Test-time sparsity with a parallel pipeline and omnidirectional feature reuse accelerates action diffusion by 5x to 47.5 Hz while cutting FLOPs 92% with no performance loss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13119","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-13T07:40:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12624","ref_index":74,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving","primary_cat":"cs.RO","submitted_at":"2026-05-12T18:09:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language 
interfaces.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11809","ref_index":47,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond World-Frame Action Heads: Motion-Centric Action Frames for Vision-Language-Action Models","primary_cat":"cs.AI","submitted_at":"2026-05-12T09:03:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MCF-Proto adds a motion-centric local action frame and prototype parameterization to VLA models, inducing emergent geometric structure and improved robustness from standard demonstrations alone.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"We report success rate (%) for each category. 4.2 Baselines We compare against a diverse set of recent VLA and robot policy baselines. On LIBERO, we include π0+FAST [33], OpenVLA-OFT [25], π0 [2], FLOWER [35], GR00T-N1.5 [30], and BEAST [50]. On LIBERO-plus, we compare against OpenVLA [24], OpenVLA-OFT, π0, π0-fast, Nora [20], WorldVLA [8], UniVLA [47], and RIPT-VLA [45]. Table 1: Experimental Results on the LIBERO Benchmark. Success rate (%) is reported for each task suite. Bold indicates the best performance, underlined indicates the second best. 
Model Spatial Object Goal Long Avg π0+FAST 96.4 96.8 88.6 60.2 85.5 OpenVLA-OFT 97.6 98.4 97.9 94.5 97.1 π0 96.8 98.8 95.8 85.2 94.1 FLOWER 97.1 96."},{"citing_arxiv_id":"2605.11459","ref_index":6,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models","primary_cat":"cs.RO","submitted_at":"2026-05-12T03:17:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Pace-and-Path Correction decomposes a quadratic cost minimization into orthogonal pace and path channels to correct chunked actions in VLA models, raising success rates by up to 28.8% in dynamic settings.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Sycara, Matthew Johnson-Roberson, Dhruv Batra, Xiaolong Wang, Sebastian Scherer, Zsolt Kira, Fei Xia, and Yonatan Bisk. Toward general-purpose robots via foundation models: A survey and meta-analysis. ArXiv, abs/2312.08782, 2023. [5] Yueen Ma, Zixing Song, Yuzheng Zhuang, Jianye Hao, and Irwin King. A survey on vision-language-action models for embodied ai. ArXiv, abs/2405.14093, 2024. [6] Octo Model Team, Dibya Ghosh, Homer Rich Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, You Liang Tan, Pannag R. Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. ArXiv, abs/2405.12213, 2024. 
[7] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Krzysztof Choromanski,"},{"citing_arxiv_id":"2605.10942","ref_index":46,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-11T17:59:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[44] Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. In Conference on Robot Learning, pages 785-799. PMLR, 2023. [45] Hengkai Tan, Yao Feng, Xinyi Mao, Shuhe Huang, Guodong Liu, Zhongkai Hao, Hang Su, and Jun Zhu. Anypos: Automated task-agnostic actions for bimanual manipulation. arXiv preprint arXiv:2507.12768, 2025. [46] Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024. 
[47] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al."},{"citing_arxiv_id":"2605.10925","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PriorVLA: Prior-Preserving Adaptation for Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-11T17:56:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PriorVLA preserves pretrained priors in VLA models through a frozen Prior Expert and trained Adaptation Expert, delivering better robot manipulation performance than full fine-tuning with only 25% of the parameter updates.","context_count":1,"top_context_role":"method","top_context_polarity":"baseline","context_text":"Kevin Black, Cheng Chi, Kyle Beltran Hatch, Shan Lin, Jingpei Lu, Jean Mercat, Abdul Rehman, Pannag R. Sanketi, Archit Sharma, Cody Simpson, Quan Vuong, Homer Rich Walke, Blake Wulfe, Ted Xiao, Jonathan Heewon Yang, Arefeh Yavary, Tony Z. Zhao, et al. DROID: A large-scale in-the-wild robot manipulation dataset. In Proceedings of Robotics: Science and Systems, Delft, Netherlands, July 2024. [9] Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. 
Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024."},{"citing_arxiv_id":"2605.10821","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Unified Noise Steering for Efficient Human-Guided VLA Adaptation","primary_cat":"cs.RO","submitted_at":"2026-05-11T16:37:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UniSteer unifies human corrective actions and noise-space RL for VLA adaptation by inverting actions to noise targets, raising success rates from 20% to 90% in 66 minutes across four real-world manipulation tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Vision-Language-Action (VLA) models have become a central paradigm for robot learning [1-19]. Early methods formulate action prediction as autoregressive token generation to integrate robot control into language-model-style sequence modeling [1-4, 10, 13, 18]. Other methods attach continuous action heads or diffusion-style decoders to vision-language models for better representation of high-dimensional continuous control [5, 17, 45, 46]. More recently, many VLA policies adopt flow-matching action heads, showing strong generative capability and promising performance in real-world robotic manipulation [8, 12, 14-16, 19]. 
These policies generate action chunks by transporting initial noise variables to continuous actions through a learned state-conditioned velocity field."},{"citing_arxiv_id":"2605.10485","ref_index":37,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-11T12:44:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VEGA improves spatial reasoning in VLA models for robotics by aligning visual encoder features with 3D-supervised DINOv2 representations via a temporary projector and cosine similarity loss.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Vision-Language-Action (VLA) models, which have significantly advanced the field of embodied AI. Early pioneering works, such as RT-1 [5] and RT-2 [47], demonstrated that pre-trained Vision-Language Models (VLMs) could be effectively repurposed to map raw visual observations and natural language instructions directly into low-level robot actions. Following this paradigm, open-source models like OpenVLA [21] and Octo [37] have become standard baselines, leveraging powerful 2D vision backbones (e.g., SigLIP [45], DINOv2 [29]) and large-scale cross-embodiment datasets [30] to achieve remarkable zero-shot generalization. 
More recently, architectures such as π0 [4, 3] and GR00T [2] have introduced flow-matching and continuous action generation, further improving dexterous manipulation capabilities."},{"citing_arxiv_id":"2605.10094","ref_index":25,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Retrieve-then-Steer: Online Success Memory for Test-Time Adaptation of Generative VLAs","primary_cat":"cs.RO","submitted_at":"2026-05-11T07:11:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A retrieve-then-steer method stores successful robot actions in memory and uses them to steer a frozen VLA's flow-matching sampler for better test-time reliability without parameter updates.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"LIBERO, and adopt CogACT [14] for experiments on SimplerEnv. We also compare with TACO [27], a test-time scaling method, to evaluate our method against existing test-time steering approaches. For a more comprehensive comparison, we further report the success rates of representative VLA policies on selected benchmarks, including OpenVLA [12], π0-FAST [20], RT-1 [2], RT-1-X [19], RT-2-X [19], and Octo [25]. Table 2: Success rates (%) on the SIMPLER benchmark. We compare our method on top of CogACT with prior VLA policies. For CogACT and CogACT + Ours, we report mean ± std over three random seeds. 
Method Pick Coke Can Move Near Open/Close Drawer Open Top Drawer and Place Apple Average RT-1 [2] 85.7 44.2 73.0 6.5 52.4 RT-1-X [19] 56.7 31.7 59.7 21.3 42."},{"citing_arxiv_id":"2605.11020","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Trust Region Inverse Reinforcement Learning: Explicit Dual Ascent using Local Policy Updates","primary_cat":"cs.LG","submitted_at":"2026-05-10T15:32:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TRIRL enables explicit dual-ascent IRL via trust-region local policy updates that guarantee monotonic improvement without full RL solves per iteration, outperforming prior imitation methods by 2.4x aggregate IQM and recovering generalizable rewards.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09487","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Kintsugi: Learning Policies by Repairing Executable Knowledge Bases","primary_cat":"cs.LG","submitted_at":"2026-05-10T11:51:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Kintsugi learns policies by repairing composable executable knowledge bases through agentic diagnosis, localized typed edits, and deterministic verification gates that admit only improvements.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"deterministic inference, but their artifacts are often prompt memories, memory extensions, free-form programs, or one-shot symbolic files. Kintsugi instead uses the LLM only between rollouts to propose structured edits; the typed KB, applier, executor, and verifier decide whether an edit becomes policy. 
VLA and world-model policies. Vision-Language-Action models such as RT-2 [39], OpenVLA [14], Octo [28], π0 [2], and π0.7 [12] excel at data-scale perception and open-vocabulary control. World-model methods such as Dreamer [9] and TD-MPC2 [11] learn latent dynamics, while diffusion policies such as DP3 [35] generate actions over 3D observations. Kintsugi is complementary rather than directly comparable: neural perception and motor policies can serve as perception sources"},{"citing_arxiv_id":"2605.10993","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ECHO: Continuous Hierarchical Memory for Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-09T13:06:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ECHO organizes VLA experiences into a hierarchical memory tree in hyperbolic space via autoencoder and entailment constraints, delivering a 12.8% success-rate gain on LIBERO-Long over the pi0 baseline.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Table 1: Success rates (%) on commonly used and challenging manipulation benchmarks with standard deviations reported when available. Octo, OpenVLA, and MAP-VLA results are taken from reported settings, while MemoryVLA is evaluated in our local pipeline. A dash (\"-\") denotes that the baseline was not evaluated in the corresponding environment. Method Standard LIBERO Suites LIBERO-Plus Spatial Object Goal Long-10 Octo [17] 78.9±1.0 85.7±0.9 84.6±0.9 51.1±1.3 - OpenVLA [2] 84.7±0.9 88.4±0.8 79.2±1.0 53.7±1.3 17.3±3.2 MAP-VLA [6] 96.3 98.4 95.4 83.4±0.7 - MemoryVLA [4] 98.0±0.6 97.4±0.9 96.4±1.3 92.4±1.1 - Vanilla π0 [3] 97.5±1.7 97.0±1.2 92.3±2.5 80.7±2.0 54.2±2.9 ECHO (Ours) 98.3±1.0 98.8±0.5 98.6±1.0 93.5±2.6 56.5±2.0"}],"limit":50,"offset":0}