{"total":13,"items":[{"citing_arxiv_id":"2606.01072","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Expanding Spatial and Temporal Context for Robotic Imitation Learning With Scene Graphs","primary_cat":"cs.RO","submitted_at":"2026-05-31T07:34:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Dynamic scene graphs serve as explicit memory to improve imitation learning policies for spatial-temporal reasoning under partial observability in mobile and tabletop manipulation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25829","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OASIS: Observation-Action Space Alignment via SE(3) Trajectory Prediction for Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-25T13:28:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"OASIS improves robotic manipulation success and generalization by predicting camera-frame SE(3) end-effector trajectories to condition the action decoder on pose-supervised states.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.24642","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Understanding the Impact of Geometric Foundation Models on Vision-Language-Action Models","primary_cat":"cs.CV","submitted_at":"2026-05-23T16:18:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The paper quantifies the geometric gap in current VLAs via linear probing and compares three architectures for injecting geometry from GFMs while analyzing impacts of data, cameras, and reconstruction 
quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21862","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EvoScene-VLA: Evolving Scene Beliefs Inside the Action Decoder for Chunked Robot Control","primary_cat":"cs.RO","submitted_at":"2026-05-21T01:19:17+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EvoScene-VLA maintains an action-updated scene prior across control chunks in VLA policies, raising success rates on RoboTwin tasks from 87.2% to 89.1% fixed and 86.1% to 88.5% randomized while outperforming baselines on a real robot.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21258","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learning Structural Latent Points for Efficient Visual Representations in Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-20T14:48:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A hybrid structural latent points representation is learned by inserting a point-wise latent VAE into a point-cloud autoencoder and regularizing toward a Gaussian prior, paired with a lightweight 3DGS rendering pipeline, yielding gains on RLBench and ManiSkill2 benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01448","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Decompose and Recompose: Reasoning New Skills from Existing Abilities for Cross-Task Robotic 
Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-02T13:55:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Decompose and Recompose decomposes seen robotic demonstrations into skill-action alignments and recomposes them via visual-semantic retrieval and planning to enable zero-shot cross-task generalization.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"where (ix, iy, iz) are voxel indices for translation, (ir, ip, iψ) are discrete bins for Euler angles (roll, pitch, yaw), and g is the gripper command. A.3. Encoding: continuous control→discrete LLM tokens Translation discretization.Let the workspace bounds be an axis-aligned box bmin ∈R 3,b max ∈R 3,(8) and letVbe the number of uniform bins per axis (we useV=100). Define per-axis resolution r= (b max −b min)/V.(9) Given a continuous positionp, the voxel index is computed as i= \u0004 (p−b min)⊘r \u0005 ,(10) where⊘denotes elementwise division and the result is clipped to[0, V−1]per axis. Rotation discretization.We convert the quaternion q into Euler angles (degrees) θ= [θ r, θp, θy] in a fixed convention. 
We then quantize angles with resolution∆(we use∆=5 ◦): k= \u0004 (θ+ 180 ◦)/∆"},{"citing_arxiv_id":"2604.10573","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learning 3D Representations for Spatial Intelligence from Unposed Multi-View Images","primary_cat":"cs.CV","submitted_at":"2026-04-12T10:36:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"UniSplat learns consistent 3D geometry, appearance, and semantics from unposed images using dual masking, progressive Gaussian splatting, and recalibration to align predictions across tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05672","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model","primary_cat":"cs.RO","submitted_at":"2026-04-07T10:18:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A1 is a transparent VLA framework achieving state-of-the-art robot manipulation success with up to 72% lower latency via adaptive layer truncation and inter-layer flow matching.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[10] Senyu Fei, Siyin Wang, Junhao Shi, Zihao Dai, Jikun Cai, Pengfang Qian, Li Ji, Xinzhe He, Shiduo Zhang, Zhaoye Fei, Jinlan Fu, Jingjing Gong, and Xipeng Qiu. Libero-plus: In-depth robustness analysis of vision-language-action models.arXiv preprint arXiv:2510.13626, 2025. [11] Ankit Goyal, Jie Xu, Yijie Guo, Valts Blukis, Yu-Wei Chao, and Dieter Fox. Rvt: Robotic view transformer for 3d object manipulation, 2023.https://arxiv.org/abs/2306.14896. [12] Ankit Goyal, Valts Blukis, Jie Xu, Yijie Guo, Yu-Wei Chao, and Dieter Fox. 
Rvt-2: Learning precise manipulation from few demonstrations, 2024.https://arxiv.org/abs/2406.08545. [13] Xiaofeng Han, Shunpeng Chen, Zenghuang Fu, Zhe Feng, Lue Fan, Dong An, Changwei Wang, Li Guo, Weiliang Meng, Xiaopeng Zhang, et al. Multimodal fusion and vision-language models: A survey for robot vision."},{"citing_arxiv_id":"2603.05117","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SeedPolicy: Horizon Scaling via Self-Evolving Diffusion Policy for Robot Manipulation","primary_cat":"cs.RO","submitted_at":"2026-03-05T12:42:53+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SeedPolicy introduces self-evolving gated attention to extend the temporal horizon of diffusion policies, yielding 36.8% and 169% relative gains over standard DP on clean and randomized RoboTwin 2.0 tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.03233","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data","primary_cat":"cs.RO","submitted_at":"2025-05-06T06:59:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GraspVLA shows that pretraining a grasping model on a billion synthetic action frames enables zero-shot open-vocabulary performance and sim-to-real transfer.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.10631","ref_index":83,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action 
Model","primary_cat":"cs.CV","submitted_at":"2025-03-13T17:59:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HybridVLA unifies diffusion and autoregression in a single VLA model via collaborative training and ensemble to raise robot manipulation success rates by 14% in simulation and 19% in real-world tasks.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"we use the AdamW optimizer with a fixed learning rate of 2e-5 to update both the LLM and the injected MLP parameters. Our models are trained for 300 epochs on 8 NVIDIA A800 GPUs with mixed-precision training. For evaluation, we follow [10, 14] and test all methods using 20 rollouts from the latest epoch checkpoint. Since RLBench employs a sampling-based motion planner [83], we evaluate each model three times per task and report the mean success rate along with its variance. Quantitative Results. As shown in Table 2, HybridVLA(7B) achieves an average success rate of 74% across 10 distinct tasks, outperforming the previous SOTA autoregressive-based VLA (OpenVLA) and diffusion-based VLA (CogACT) by 33% and 14%, respectively."},{"citing_arxiv_id":"2409.01652","ref_index":156,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2024-09-03T06:45:22+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ReKep encodes robotic tasks as optimizable Python functions over 3D keypoints that are generated automatically from language and RGB-D input, enabling real-time hierarchical planning on single- and dual-arm platforms without task-specific data.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Parmar, J. 
Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017. [155] A. Goyal, J. Xu, Y. Guo, V. Blukis, Y.-W. Chao, and D. Fox. Rvt: Robotic view transformer for 3d object manipulation. In Conference on Robot Learning, pages 694-710. PMLR, 2023. [156] A. Goyal, V. Blukis, J. Xu, Y. Guo, Y.-W. Chao, and D. Fox. Rvt-2: Learning precise manipulation from few demonstrations. arXiv preprint arXiv:2406.08545, 2024. [157] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. An image is worth 16x16 words: Trans-"},{"citing_arxiv_id":"2405.14093","ref_index":95,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey on Vision-Language-Action Models for Embodied AI","primary_cat":"cs.RO","submitted_at":"2024-05-23T01:43:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"This is the first survey on vision-language-action models, providing a taxonomy across three lines, plus summaries of datasets, simulators, benchmarks, challenges, and future directions in embodied AI.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"RVT [92], RVT-2 [93] CLIP-ResNet50 CLIP-GPT TFM Concat 2D affordance (project to 3D) RLBench data, [SC] Sim: RLBench;Real(Franka): stack, press, place RoboPoint [94] ViT-L/14 Vicuna-V1.5 13B Concat 2D affordance (project to 3D) [SC]Real(Franka): pick, place Gato [18] ViT Sent.Piece TFM Concat BC (cont & disc) [SC]Sim&Real(Sawyer): RGB-stacking (RoboCat) [95] VQ-GAN (p, s) TFM Quant. 
BC, observation prediction Self-improvement Sim&Real(Sawyer, Franka, KUKA): stacking, building, lifting, insertion, removal VIMA [96] ViT, Mask R-CNN T5 TFM Xattn BC (SE(2)) [SC:VIMA-Data]Sim(Ravens): VIMA-Bench BC-Z [82] ResNet18 (p, s) USE MLP FiLM BC (cont) [SC]Real(EDR): pick-place/wipe/drag, grasp, push RT-1 [97] EfficientNet USE TFM FiLM BC (disc) [SC: Fractal]Real(EDR): pick-place, move, knock"}],"limit":50,"offset":0}