{"total":206,"items":[{"citing_arxiv_id":"2605.23733","ref_index":38,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Any2Any: Efficient Cross-Embodiment Transfer for Humanoid Whole-Body Tracking","primary_cat":"cs.RO","submitted_at":"2026-05-22T15:10:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Any2Any transfers pretrained humanoid whole-body tracking policies to new embodiments with 1% of original training cost via kinematic alignment and parameter-efficient fine-tuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22812","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GesVLA: Gesture-Aware Vision-Language-Action Model Embedded Representations","primary_cat":"cs.RO","submitted_at":"2026-05-21T17:57:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GesVLA encodes gesture features directly into the latent space of VLA models using a dual-VLM architecture and a rendering-based data pipeline, yielding improved target grounding in real robotic tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22671","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From Abstraction to Instantiation: Learning Behavioral Representation for Vision-Language-Action Model","primary_cat":"cs.CV","submitted_at":"2026-05-21T16:14:19+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22446","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Pre-VLA: Preemptive Runtime 
Verification for Reliable Vision-Language-Action and World-Model Rollouts","primary_cat":"cs.CV","submitted_at":"2026-05-21T13:13:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Pre-VLA is a multimodal runtime verifier that predicts safety confidence and advantage scores for action chunks, raising closed-loop success rates on the LIBERO benchmark from 30.79% to 37.62%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21976","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TacO: Benchmarking Tactile Sensors for Object Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-21T04:11:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The paper provides a task-driven benchmark comparing visual, acoustic, magnetic, and resistive tactile sensors on three manipulation tasks and concludes that sensor utility depends on modality, material friction, and task specifics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22882","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GEM-4D: Geometry-Enhanced Video World Models for Robot Manipulation","primary_cat":"cs.CV","submitted_at":"2026-05-20T21:36:44+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21414","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PointACT: Vision-Language-Action Models with Multi-Scale Point-Action 
Interaction","primary_cat":"cs.RO","submitted_at":"2026-05-20T17:10:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PointACT proposes a 3D-aware dual-system VLA policy using multi-scale point-action interaction with bottleneck window self-attention, achieving 10% higher success rates on RLBench-10Tasks over prior pretrained VLAs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21372","ref_index":36,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Closed Loop Dynamic Driving Data Mixture for Real-Synthetic Co-Training","primary_cat":"cs.CV","submitted_at":"2026-05-20T16:36:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AutoScale is a closed-loop data engine using Graph-RAE for scene representation and Cluster-GA for importance-based retrieval to improve real-synthetic co-training for autonomous driving.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21133","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Humanoid Whole-Body Manipulation via Active Spatial Brain and Generalizable Action Cerebellum","primary_cat":"cs.RO","submitted_at":"2026-05-20T13:05:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A multi-agent LLM framework for humanoid loco-manipulation that separates active spatial perception and task planning from generalizable action generation without task-specific real-robot data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19678","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RoVLA: Multi-Consistency 
Constraints for Robust Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-19T11:10:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"RoVLA enforces instructional, evolutionary, and observational consistency to improve robustness of VLA policies on manipulation benchmarks and real robots.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19371","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Multi-Scale Generative Modeling with Heat Dissipation Flow Matching","primary_cat":"cs.CV","submitted_at":"2026-05-19T05:08:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HDFM adds a continuous heat-dissipation (blur) process to flow matching, aligns an interpolated path to fix ill-posed inverse heat dissipation, and uses x-prediction to ease high-dimensional regression, yielding better performance than most baselines on image datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19294","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DEFLECT: Delay-Robust Execution via Flow-matching Likelihood-Estimated Counterfactual Tuning for VLA Policies","primary_cat":"cs.RO","submitted_at":"2026-05-19T03:14:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DEFLECT is an offline post-training method that improves async VLA policy success rates under high inference delays by using flow-matching likelihood ratios on counterfactual fresh/stale action pairs from a frozen reference 
policy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18722","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Dexora: Open-source VLA for High-DoF Bimanual Dexterity","primary_cat":"cs.RO","submitted_at":"2026-05-18T17:50:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Dexora is the first open-source VLA system for dual-arm dual-hand high-DoF manipulation, trained on 100K simulated and 10K real teleoperated trajectories with a discriminator-weighted diffusion policy, achieving 66.7% dexterous success versus 51.7% for baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18556","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Key-Gram: Extensible World Knowledge for Embodied Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-18T15:37:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Key-Gram uses a memory module with key-grams and hashed lookup to inject static linguistic priors into vision-language-action backbones, yielding reported gains on manipulation benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18287","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"StableVLA: Towards Robust Vision-Language-Action Models without Extra Data","primary_cat":"cs.CV","submitted_at":"2026-05-18T12:15:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"StableVLA adds an Information Bottleneck Adapter to VLA models that improves robustness to visual corruptions by 30% on average with under 10M extra 
parameters and no extra data, even when using a much smaller backbone.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17601","ref_index":30,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From a Single Demonstration to a General Policy for Contact-Rich Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-17T18:58:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A one-shot LfD framework abstracts a single demonstration into environmental-constraint primitives, then uses self-exploration, human corrections, and compliant recovery to produce a policy that generalizes across poses and geometries, achieving over 90% success on seven real-world multi-stage tasks","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"X-Embodiment collaboration,\" inIEEE International Conference on Robotics and Automation (ICRA), 2024, pp. 6892-6903. [29] A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y . Chen, K. Ellis, et al., \"Droid: A large-scale in-the-wild robot manipulation dataset,\" inRobotics: Science and Systems (RSS), 2024. [30] J. Bjorck, F. Casta ˜neda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y . Fang, D. Fox, F. Hu, S. Huang,et al., \"Gr00t n1: An open foundation model for generalist humanoid robots,\"arXiv preprint arXiv:2503.14734, 2025. [31] G. R. Team, S. Abeyruwan, J. Ainslie, J.-B. Alayrac, M. G. Arenas, T. Armstrong, A. Balakrishna, R. Baruch, M. Bauza, M. 
Blokzijl,"},{"citing_arxiv_id":"2605.17522","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RoboFlow4D: A Lightweight Flow World Model Toward Real-Time Flow-Guided Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-17T16:11:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RoboFlow4D is an end-to-end lightweight flow world model that predicts multi-frame 3D flows from visual observations and textual instructions to provide explicit planning for real-time robotic manipulation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17517","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AffordVLA: Injecting Affordance Representations into Vision-Language-Action Models via Implicit Feature Alignment","primary_cat":"cs.RO","submitted_at":"2026-05-17T16:02:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AffordVLA improves VLA models for robotic manipulation by implicitly injecting affordance perception through feature alignment with a zero-shot teacher, claiming SOTA results in simulation and real-world tests.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17077","ref_index":30,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"How to Instruct Your Robot: Dense Language Annotations Power Robot Policy Learning","primary_cat":"cs.RO","submitted_at":"2026-05-16T16:52:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DeMiAn re-annotates robot and egocentric videos with VLM-generated dense labels across motion, scene, pose, and reasoning aspects, then uses a learned instructor to boost policy 
success by 5 points on RoboCasa over task-only baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15298","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PhysBrain 1.0 Technical Report","primary_cat":"cs.RO","submitted_at":"2026-05-14T18:11:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PhysBrain 1.0 extracts scene elements, spatial dynamics, actions and depth relations from human egocentric video to create QA supervision for VLMs, then transfers the resulting physical priors to VLA policies via capability-preserving adaptation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15157","ref_index":2,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Hand-in-the-Loop: Improving VLA Policies for Dexterous Manipulation via Seamless Hand-Arm Intervention","primary_cat":"cs.RO","submitted_at":"2026-05-14T17:51:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HandITL enables seamless human intervention in VLA policies for bimanual dexterous manipulation, cutting jitter by 99.8% and improving refined policies by 19% over standard teleoperation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13959","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"WarmPrior: Straightening Flow-Matching Policies with Temporal Priors","primary_cat":"cs.LG","submitted_at":"2026-05-13T18:00:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Replacing Gaussian noise with a temporally grounded prior from recent actions straightens flow-matching paths 
and improves success rates in robotic manipulation and prior-space RL.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13757","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"FrameSkip: Learning from Fewer but More Informative Frames in VLA Training","primary_cat":"cs.RO","submitted_at":"2026-05-13T16:38:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FrameSkip improves VLA policy training success from 66.50% to 76.15% by selecting high-importance frames and retaining only 20% of unique frames across three benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16412","ref_index":36,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SCAR: Self-Supervised Continuous Action Representation Learning","primary_cat":"cs.RO","submitted_at":"2026-05-13T16:23:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SCAR proposes a joint inverse-forward dynamics framework to learn transferable continuous action representations across embodiments from visual data using regularization and adversarial invariance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13632","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-13T14:58:29+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GTA-VLA conditions VLA models on user spatial priors to produce a unified spatial-visual chain-of-thought, reaching 81.2% success on 
SimplerEnv WidowX and improving performance under out-of-distribution shifts.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"reproduced results evaluated with a maximum inference horizon of 120 steps, consistent with the setting used for other models. Method LIBERO SIMPLER-Env (Bridge) Spatial Object Goal Long Avg Spoon Carrot Cube Eggplant Avg OpenVLA [18] 84.7 88.4 79.2 53.7 76.5 4.2 0.0 8.3 45.8 14.6 OpenVLA-OFT [17] 96.2 98.3 96.2 90.7 95.3 - - - - - π 0 [3] 96.8 98.8 95.8 85.2 94.1 50.0 41.7 29.2 70.8 47.9 GR00T-N1 [24] 94.4 97.6 93.0 90.6 93.9 64.5 65.5 5.5 93.0 57.1 π 0.5 [4] 98.8 98.2 98.0 92.4 96.9 - - - - - X-VLA* [37] 98.2 98.6 97.897.6 98.1 95.875.0 62.5 70.8 76.0 ThinkAct [13] 88.3 91.4 87.1 70.9 84.4 37.5 8.7 58.3 70.8 43.8 CoT-VLA [36] 87.5 91.6 87.6 69.0 81.1 - - - - - Uni-VLA [31] 97.099.092.6 90.8 94.8 83.3 66.7 33.395.8 69.8 MolmoAct [20] 87.0 95.4 87."},{"citing_arxiv_id":"2605.13403","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RotVLA: Rotational Latent Action for Vision-Language-Action Model","primary_cat":"cs.RO","submitted_at":"2026-05-13T11:58:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Mainstream approaches [6, 27-31] typically adopt an Inverse Dynamics Model (IDM) 3 to infer latent actions from consecutive video frames, coupled with a Forward Dynamics Model (FDM) that reconstructs future observations conditioned on the inferred latent action. 
Building on this formulation, [32, 33] incorporate annotated actions to guide latent-action learning, while [22, 34] treat latent actions as surrogate labels for unlabeled data. Other works enhance these models by introducing additional modalities such as depth or optical flow [34-37]. Despite their promise, these pipelines have the risk of degenerating into trivial solutions that simply encode and reconstruct the future frame [ 10]. Existing methods attempt to mitigate this issue by"},{"citing_arxiv_id":"2605.13276","ref_index":6,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models","primary_cat":"cs.AI","submitted_at":"2026-05-13T09:54:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"D-VLA uses plane decoupling and a swimlane pipeline to deliver higher throughput and linear speedup than prior RL frameworks when training billion- and trillion-parameter VLA models on benchmarks like LIBERO.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13119","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-13T07:40:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin 
benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12386","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SafeManip: A Property-Driven Benchmark for Temporal Safety Evaluation in Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-12T16:49:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"SafeManip is a new benchmark that applies LTLf monitors to assess temporal safety properties across eight categories in robotic manipulation, demonstrating that task success frequently fails to ensure safe execution in vision-language-action policies.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"ated with task-specific objects, fixtures, regions, or skills, allowing the same safety specifications to generalize across tasks and environments. We instantiate SAFEMANIPusing 50 RoboCasa365 tasks spanning diverse manipulation and naviga- tion skills, including cleaning, cooking, and other household tasks [14]. Our evaluation covers six VLA policies and training variants, including π0 [2], π0.5 [7], GR00T N1.5 [1], and GR00T variants trained from RoboCasa365 checkpoints. For each policy rollout, SAFEMANIPjointly measures task completion and temporal safety, distinguishing successful rollouts that satisfy safety properties from those that complete the task while violating them. 
These metrics allow us to characterize safety failures across property categories, manipulation suites, and task horizons."},{"citing_arxiv_id":"2605.12369","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization","primary_cat":"cs.RO","submitted_at":"2026-05-12T16:38:40+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726, 2024. [5] Vineet Bhat, Yu-Hsiang Lan, Prashanth Krishnamurthy, Ramesh Karri, and Farshad Khorrami. 3d cavla: Leveraging depth and 3d context to generalize vision language action models for unseen tasks.arXiv preprint arXiv:2505.05800, 2025. [6] Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025. 
[7] Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert"},{"citing_arxiv_id":"2605.12167","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-12T14:15:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"˜ht→t+k =T (m) \u0010 q(m), hrgb t , hrgb t+k \u0011 ,(2) where q(m) denotes the latent action queries and T (m) de- notes the modality-specific spatiotemporal transformer that captures cross-frame correspondences. The transformer outputs are then mapped to discrete latent actions through a modality-specific vector-quantized code- book: z(m) t→t+k = VQ(m) \u0010 ˜ht→t+k \u0011 ,(3) yielding compact action tokens that capture the causal tran- sition between the two frames. To induce modality awareness, each IDM is trained with distinct reconstruction targets. 
The predicted latent action is combined with the current RGB feature and decoded by a shared ViT-based RGB decoder to reconstruct the future RGB frame: ˆorgb t+k =D rgb"},{"citing_arxiv_id":"2605.12162","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"X-Imitator: Spatial-Aware Imitation Learning via Bidirectional Action-Pose Interaction","primary_cat":"cs.RO","submitted_at":"2026-05-12T14:13:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"X-Imitator is a bidirectional action-pose interaction framework for spatial-aware imitation learning that outperforms vanilla policies and explicit pose guidance on 24 simulated and 3 real-world robotic tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"performs both vanilla policies and prior methods utilizing explicit pose guidance. The code will be open sourced. Keywords:Robotic Manipulation·Imitation Learning·Object Pose 1 Introduction Robotic manipulation, with applications ranging from industrial assembly to domestic assistance, has been revolutionized by data-driven approaches in recent years [4-6]. Specifically, imitation learning [1,6,7,45,50,59,75] has emerged as a simple but effective paradigm, allowing robots to acquire complex skills from pre-collected expert demonstrations [18,32,64]. 
Despite recent advancements in vision-based imitation learning, enabling robots to perform complex, dynamic manipulation where the spatial relationship between the robot and the object"},{"citing_arxiv_id":"2605.12160","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Premover: Fast Vision-Language-Action Control by Acting Before Instructions Are Complete","primary_cat":"cs.RO","submitted_at":"2026-05-12T14:10:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Premover enables VLA policies to act on partial instructions by precomputing focus maps from intermediate backbone layers, reducing wall-clock time 13.6 percent on LIBERO while preserving 95 percent success rate.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[13] Ziyan Liu, Yeqiu Chen, Hongyi Cai, Tao Lin, Shuo Yang, Zheng Liu, and Bo Zhao. VLA-Pruner: Temporal-aware dual-level visual token pruning for efficient vision-language-action inference.arXiv preprint arXiv:2511.16449, 2025. 10 [14] I. Scott MacKenzie. A note on calculating text entry speed, 2002. Research note, York University, https://www.yorku.ca/mack/RN-TextEntrySpeed.html. 
[15] NVIDIA, Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu-Wei Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, Joel Jang, Zhenyu Jiang, Jan Kautz, Kaushil Kundalia, Lawrence Lao, Zhiqi Li, Zongyu Lin, Kevin Lin, Guilin Liu, Edith Llontop, Loic Magne, Ajay Mandlekar, Avnish Narayan, Soroush Nasiriany, Scott Reed, You Liang Tan, Guanzhi Wang, Zu Wang,"},{"citing_arxiv_id":"2605.11951","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From Reaction to Anticipation: Proactive Failure Recovery through Agentic Task Graph for Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-12T11:00:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AgentChord models manipulation tasks as directed graphs enriched with anticipatory recovery branches, using specialized agents to enable immediate, low-latency failure responses and improve success on long-horizon bimanual tasks.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"recovery branch and then rejoins the nominal task graph. Visualizations with more tasks and types of disturbances (e.g., tipped object) are presented in the Appendix E. TABLE III: Success rates of different policies on the simulated single-arm pour water task. Bold values indicate the best result. Method Success Rate RDT [36] 16/50 π 0 [5] 20/50 GR00T N1.5 [4] 15/50 Sim2Real-VLA [52] 26/50 Sim2Real-VLA (rec) 39/50 E. Leveraging Generated Data for Policy Learning Intuitively, the failure-recovery behaviors generated by AgentChord could be valuable for policy learning. 
In this section, we present a simple yet effective experiment to assess whether incorporating such trajectories can enhance the policy's ability to recover from failures."},{"citing_arxiv_id":"2605.11832","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-12T09:21:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The method uses multi-view diffusion priors and action manifold learning to resolve depth ambiguity and improve action prediction in VLA robotic manipulation models, reporting higher success rates than baselines on LIBERO, RoboTwin, and real-robot tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"coordination [6], [10], [19]-[21], [60]. Diffusion Policy [19] pioneers conditional diffusion for visuomotor control, demonstrating significant improve- ments over deterministic baselines. More recently,π 0 [6] employs a flow-matching action expert based on a DiT architecture for efficient trajectory generation via ODE in- tegration. Similarly, GR00T N1 [20] couples a deliberative VLM backbone with a fast flow-matching generator, while CogACT [21] aligns diffusion-based control with cognitive reasoning. Despite their success, these generative methods often struggle with high-dimensional action spaces due to error accumulation during iterative sampling. 
Our proposed Action Manifold Learning (AML) module addresses this by"},{"citing_arxiv_id":"2605.11750","ref_index":30,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DreamAvoid: Critical-Phase Test-Time Dreaming to Avoid Failures in VLA Policies","primary_cat":"cs.RO","submitted_at":"2026-05-12T08:27:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DreamAvoid uses a Dream Trigger, Action Proposer, and Dream Evaluator trained on success/failure/boundary data to let VLA policies avoid critical-phase failures via test-time future dreaming.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"5 47.5 55.0 30.0 48.8 DREAMAVOIDVariants ×(Repeated ODE)✓67.5 55.0 70.0 37.5 57.5 ✓(SDE) ×(Random Select) 65.0 35.0 55.0 25.0 45.0 DA-ABL✓ ✓90.0 67.5 80.0 52.5 72.5 6 LIBERO and SimplerEnv Benchmarks We evaluate on two simulation benchmarks: LIBERO [16] and SimplerEnv [17]. The base policy adopted for LIBERO is π0.5, while SimplerEnv uses GR00T-N1.6 [30]. This design aims to verify that DREAMAVOIDcan generalize to other flow policies. Unlike real-world experiments that rely on human-annotated visual frames, we directly leverage the precise underlying physical states of the simulator to locate the critical phase (Appendix E). 
Experimental results show (Table 6 and Table 7, Appendix E) that the DA-ABL method achieves an average success rate of 97."},{"citing_arxiv_id":"2605.11665","ref_index":56,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Nautilus: From One Prompt to Plug-and-Play Robot Learning","primary_cat":"cs.RO","submitted_at":"2026-05-12T07:26:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Shen, Yaxin Peng, Feifei Feng, et al. Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation.arXiv preprint arXiv:2409.12514, 2024. [55] Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, and Chuang Gan. 3d-vla: 3d vision-language-action generative world model.arXiv preprint arXiv:2403.09631, 2024. [56] Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025. 
[57] Moritz Reuss, Hongyi Zhou, Marcel Rühle, Ömer Erdinç Yağmurlu, Fabian Otto, and Rudolf Lioutikov."},{"citing_arxiv_id":"2605.11567","ref_index":7,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Dynamic Execution Commitment of Vision-Language-Action Models","primary_cat":"cs.CV","submitted_at":"2026-05-12T05:52:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A3 adaptively selects verifiable action prefixes in VLA models using group-sampled consensus and conditional re-decoding to balance robustness and speed without manual horizon tuning.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"instructions directly to precise motor sequences. To mitigate the high computational overhead of large-scale vision-language backbones [6], modern VLA architectures increasingly adopt dual-system designs [1] that decouple deliberative reasoning from reactive execution. A core efficiency strategy within these frameworks is action chunking, where an action expert (e.g., π-0.5 [1] or GR00T [7]) generates a sequence of future actions in a single forward pass before interacting with the environment again.
While substantially reducing inference frequency and improving computational utilization, the determination of the execution horizon, i.e., the number of actions actually committed before the next re-planning cycle, remains a critical yet underexplored design choice."},{"citing_arxiv_id":"2605.11564","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RIO: Flexible Real-Time Robot I/O for Cross-Embodiment Robot Learning","primary_cat":"cs.RO","submitted_at":"2026-05-12T05:49:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RIO introduces a lightweight open-source framework that abstracts real-time robot I/O to support easy switching between embodiments and platforms for collecting data and deploying VLAs.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025. [3] Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer. arXiv preprint arXiv:2407.07726, 2024. [4] Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025.
[5] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy"},{"citing_arxiv_id":"2605.11479","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Offline Policy Evaluation for Manipulation Policies via Discounted Liveness Formulation","primary_cat":"cs.RO","submitted_at":"2026-05-12T03:54:30+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A liveness-based Bellman operator enables conservative offline policy evaluation for manipulation tasks by encoding task progression and reducing truncation bias from finite horizons.","context_count":1,"top_context_role":"other","top_context_polarity":"unclear","context_text":"2253, 2017. doi: 10.1109/CDC.2017.8263977. URL https://arxiv.org/abs/1709.07523. [3] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289-300, 1995. URL https://www.jstor.org/stable/2346101?seq=1. [4] Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025. URL https://arxiv.org/abs/2503.14734.
[5] Kevin Black, Noah Brown, Danny Driess, Adnan Es- mail, Michael Equi, Chelsea Finn, Niccolo Fusai,"},{"citing_arxiv_id":"2605.11459","ref_index":12,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models","primary_cat":"cs.RO","submitted_at":"2026-05-12T03:17:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Pace-and-Path Correction decomposes a quadratic cost minimization into orthogonal pace and path channels to correct chunked actions in VLA models, raising success rates by up to 28.8% in dynamic settings.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"on MOVEBENCHdemonstrations under their official recipes and dynamics-adaptive baselines follow 6 Method Static Uniform Motion Accelerated Motion Irregular Motion Average Moving Accelerating Rand. Walk Stop & Go Teleport Dyn. Only AllEasy Med. Hard Easy Med. Hard Found. Diffusion Policy [11] 75 56 60 21 43 28 17 63 50 5643.8 ±3.2 46.9±3.1GR00T N1.6 [12]8874 64 11 11 6 1 67 35 67 37.3±3.2 42.4±3.1SmolVLA [35] 81 76 57 27 41 33 13 53 40 4442.7 ±3.2 46.5±3.1π0[9] 82 81 63 30 44 30 22 60 43 5147.1 ±3.3 50.6±3.1π0.5[34] 80 85 78 34 58 43 29 54 48 6054.3 ±3.3 56.9±3.1 Comp. 
ACT [10] 82 79 77 19 69 50 30 53 48 147.3 ±3.3 50.8±3.1BID [29] 79 80 75 29 57 50 33 68 51 4854.6 ±3.3 57.0±3.1DynamicVLA [3] 70 73 57 20 45 42 29 49 40 2442."},{"citing_arxiv_id":"2605.10925","ref_index":35,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PriorVLA: Prior-Preserving Adaptation for Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-11T17:56:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PriorVLA preserves pretrained priors in VLA models through a frozen Prior Expert and trained Adaptation Expert, delivering better robot manipulation performance than full fine-tuning with only 25% of the parameter updates.","context_count":1,"top_context_role":"method","top_context_polarity":"baseline","context_text":"data regimes is still challenging: full fine-tuning can quickly improve in-distribution performance, but may over-specialize the model and weaken pretrained capabilities. Recent work improves VLA adaptation by refining action representations and fine-tuning objectives [12], reducing trainable parameters with low-rank updates [24], or freezing/protecting the VLM while placing more adaptation capacity in action-side modules [35, 16]. Other approaches improve robustness by constraining or merging parameter updates [18, 19], incorporating temporal context [ 38], or preserving previous capabilities through continual-learning regularization and replay [ 39, 40]. 
Lightweight interface designs further connect vision-language representations to action policies through query-style tokens"},{"citing_arxiv_id":"2605.10903","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models","primary_cat":"cs.CV","submitted_at":"2026-05-11T17:41:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Capability vectors extracted from parameter differences between standard and auxiliary-finetuned VLA models can be merged into pretrained weights to match auxiliary-training performance while reducing computational overhead during adaptation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"This indicates that we can extract the capability vectors by simply conducting parameter arithmetic between two models finetuned with different strategies. Then, to achieve our goal of transferring the properties of 3 θao to θpt, we merge the capability vectorsγao and θpt and get the capability-enhanced meta model with properties: θmeta =θ pt +αγ ao,(4) where α denotes vector weights. 
This provides a better initialization for further performing finetuning on any new tasks: θ′ ft =θ meta + ∆′ ft.(5) 2.3 During Training: Regularization in Orthogonal Subspaces While we have transferred the properties to the pretrained model, there is an obvious question:how to retain the properties during regular finetuning?"},{"citing_arxiv_id":"2605.11048","ref_index":36,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ForceFlow: Learning to Feel and Act via Contact-Driven Flow Matching","primary_cat":"cs.RO","submitted_at":"2026-05-11T13:27:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ForceFlow improves success rates by 37% on six real-world contact-rich tasks over ForceVLA by treating force as a global regulatory signal in a flow-matching policy with hierarchical vision-to-force decomposition.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10485","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-11T12:44:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VEGA improves spatial reasoning in VLA models for robotics by aligning visual encoder features with 3D-supervised DINOv2 representations via a temporary projector and cosine similarity loss.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"natural language instructions directly into low-level robot actions. 
Following this paradigm, open-source models like OpenVLA [21] and Octo [37] have become standard baselines, leveraging powerful 2D vision backbones (e.g., SigLIP [45], DINOv2 [29]) and large-scale cross-embodiment datasets [30] to achieve remarkable zero-shot generalization. More recently, architectures such as π0 [4, 3] and GR00T [2] have introduced flow-matching and continuous action generation, further improving dexterous manipulation capabilities. Despite their semantic reasoning capabilities, the visual backbones of these mainstream VLAs are primarily pretrained on massive 2D image-text pairs via contrastive learning. Consequently, they excel at capturing 2D semantic correspondences"},{"citing_arxiv_id":"2605.10408","ref_index":11,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VISOR: A Vision-Language Model-based Test Oracle for Testing Robots","primary_cat":"cs.SE","submitted_at":"2026-05-11T11:46:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VISOR is a VLM-based automated test oracle that evaluates robot task correctness and quality from videos while reporting its own uncertainty, tested on GPT and Gemini across four tasks and over 1000 videos with Gemini showing higher recall and GPT higher precision but low uncertainty-correctness tie","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"arXiv:2410.24164 doi:10.48550/ARXIV.2410.24164 [10] Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. 2024. SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 14455-14465. doi:10.1109/CVPR52733.2024.01370 [11] Yongchao Chen, Jacob Arkin, Charles Dawson, Yang Zhang, Nicholas Roy, and Chuchu Fan. 2024.
AutoTAMP: Autoregressive Task and Motion Planning with LLMs as Translators and Checkers. In 2024 IEEE International Conference on Robotics and Automation (ICRA). 6695-6702. doi:10.1109/ICRA57147.2024.10611163 AIware 2026, July 06-07, 2026, Montreal, Canada Prasun Saurabh, Pablo Valle, Aitor Arrieta, Shaukat Ali, and Paolo Arcaini"},{"citing_arxiv_id":"2605.10044","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Adaptive Action Chunking via Multi-Chunk Q Value Estimation","primary_cat":"cs.LG","submitted_at":"2026-05-11T06:14:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ACH lets RL policies dynamically pick action chunk lengths by jointly estimating Q-values for all candidate lengths via a single Transformer pass.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[1] Michael Albergo and Eric Vanden-Eijnden. 2023. Building Normalizing Flows with Stochastic Interpolants. In The Eleventh International Conference on Learning Representations. [2] Philip J Ball, Laura Smith, Ilya Kostrikov, and Sergey Levine. 2023. Efficient online reinforcement learning with offline data. In International Conference on Machine Learning. PMLR, 1577-1594. [3] Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. 2025. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734 (2025).
[4] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al."},{"citing_arxiv_id":"2605.09994","ref_index":27,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"BatchWeave: A Consistent Object-Store-Native Data Plane for Large Foundation Model Training","primary_cat":"cs.DC","submitted_at":"2026-05-11T05:10:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"BatchWeave delivers an object-store-native data plane for distributed large foundation model training via transactional global batches and a decentralized adaptive commit algorithm.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"and increase by one on each conflict), andAIMD(Additive Increase Multiplicative Decrease, the classic TCP-style congestion control policy [16]: increase the interval by a fixed addend on success, halve it on conflict). Consumer microbenchmarks additionally compare dense-read, which reads the full TGB object span and filters locally. Workloads.We use four workload families: end-to-end GR00T [27] training, end-to-end HoloAssist [ 37] video SFT, end-to-end BEHAVIOR-1K [ 20] VLA training, and controlled data-plane microbenchmarks. All GR00T runs use a trainer-side DataLoader with num_workers=1 and prefetch_factor=4. 
HoloAssist runs train Qwen3-VL-30B-A3B [7] on the HoloAssist reasoning split, with online video decode, frame sampling at 2 FPS, and 8-16"},{"citing_arxiv_id":"2605.09989","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception","primary_cat":"cs.RO","submitted_at":"2026-05-11T05:06:12+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Li, Z. Lin, K. Lin, G. Liu, E. Llontop, L. Magne, A. Mandlekar, A. Narayan, S. Nasiriany, S. Reed, Y . L. Tan, G. Wang, Z. Wang, J. Wang, Q. Wang, J. Xiang, Y . Xie, Y . Xu, Z. Xu, S. Ye, Z. Yu, A. Zhang, H. Zhang, Y . Zhao, R. Zheng, and Y . Zhu. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv: 2503.14734, 2025. [13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition, 2015. URL https://arxiv.org/abs/1512.03385. [14] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervi- sion. 
InInternational Conference on Machine Learning, pages 8748-8763."},{"citing_arxiv_id":"2605.09948","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models","primary_cat":"cs.AI","submitted_at":"2026-05-11T03:51:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"in success rate while significantly reducing parameter count and improving throughput.Bold denotes the best result and underline denotes the second-best. Model Spatial Object Goal Long Average Params↓FLOPs↓Thrpt. (Hz)↑ Diffusion Policy [29] 78.3 92.5 68.3 50.0 72.4 - - - OpenVLA [3]†84.7 88.4 79.2 53.7 76.5 7.2B 6.58T 3.26 π0+FAST [30] 96.4 96.8 88.6 60.2 85.5 3.5B 606.68T 0.05 GR00T-N1.5 [15] 92.0 92.0 86.0 76.0 86.5 2.4B 0.47T 7.63 UniVLA [31] 96.5 96.8 95.692.095.2 7.2B 108.71T 1.41 π0 [13] 96.8 98.8 95.8 85.2 94.1 4.0B 1.79T 3.70 π0.5 [13] 95.4 98.4 97.2 89.6 95.1 4.1B 2.41T 3.11 F1-VLA [32]†98.297.8 95.4 91.3 95.7 4.2B 5.93T 1.68 Qwen3FM 94.0 92.3 91.3 65.7 85.8 2.3B 0.53T 0.97 LoopFMα(3⊗8) 91.0 99.0 95.3 79.0 91.1 1.3B 0.38T 2.04"}],"limit":50,"offset":0}