{"total":12,"items":[{"citing_arxiv_id":"2606.00110","ref_index":117,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"General Covariant Action Modeling: Constructing Generalized Manifolds via Spatio-Temporal Decoupling","primary_cat":"cs.CV","submitted_at":"2026-05-27T03:38:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"GAM framework uses arc-length parameterization for temporal invariance and schema-affine factorization for geometric invariance to build a covariant action manifold integrated into VLA models for improved generalization from sparse data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25829","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OASIS: Observation-Action Space Alignment via SE(3) Trajectory Prediction for Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-25T13:28:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"OASIS improves robotic manipulation success and generalization by predicting camera-frame SE(3) end-effector trajectories to condition the action decoder on pose-supervised states.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25547","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TapSampling: Inference-Time Sampling with a Task-Progress-Understanding Verifier for Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-25T08:03:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TapSampling improves generalist robotic manipulation policies at inference time via latent action sampling with an Action-VAE and selection by a 
task-progress outcome predictor.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22671","ref_index":5,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Abstraction to Instantiation: Learning Behavioral Representation for Vision-Language-Action Model","primary_cat":"cs.CV","submitted_at":"2026-05-21T16:14:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"BehaviorVLA learns long-horizon behavioral representations via causal Mamba encoder and phase-conditioned decoder, reporting SOTA results of 58% on RoboTwin 2.0, 98% on LIBERO, 4.36 on CALVIN, and matching OpenVLA-OFT performance with 50% data in sim-to-real transfer.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12167","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-12T14:15:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14125","ref_index":8,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System","primary_cat":"cs.CV","submitted_at":"2026-04-15T17:50:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HiVLA decouples VLM-based semantic planning with visual grounding from a cascaded 
cross-attention DiT action expert, outperforming end-to-end VLAs on long-horizon and fine-grained manipulation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"(b) Success Rate in RoboTwin Stack Block Stamp Seal Fig.1:(a) Overview of our proposed HiVLA framework. (b) Success rate comparison on RoboTwin benchmark. the development of Vision-Language-Action (VLA) models [6,17,20]. However, current VLA research predominantly adopts end-to-end architectures, utilizing either single-system [7,19,40] or dual-system [5,8,9] approaches that tightly couple visual reasoning with low-level action generation. Although these integrated paradigms have shown considerable promise, they face a critical bottleneck [13,15] that fine-tuning VLMs on relatively scarce and domain-specific manipulation data inevitably degrades their original reasoning capabilities. This degradation, widely recognized as catastrophic forgetting, ultimately limits the"},{"citing_arxiv_id":"2602.20200","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-02-22T15:39:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OptimusVLA augments hierarchical VLA models with Global Prior Memory for shorter generative paths and Local Consistency Memory for temporal coherence, yielding higher success rates and 2.9x faster inference on simulation and real-world robotic benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.13073","ref_index":126,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A 
Survey","primary_cat":"cs.RO","submitted_at":"2025-08-18T16:45:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"This survey organizes large VLM-based VLA models for robotic manipulation into monolithic and hierarchical paradigms, reviews their integrations and datasets, and outlines future directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"\" denotes diffusion-based learning, \"FM\" denotes flow-matching, \"MSE\" denotes mean squared error, \"BCE\" denotes binary cross-entropy, and \"AR\" denotes autoregressive learning. Model System 2 Backbone System 1 Learning Contribution Cascade-based DP-VLA [125] OpenVLA Regression Propose a dual-system architecture for robot manipulation with efficiency and performance. RoboDual [126] OpenVLA Diff. Combine a VLA-based generalist for reasoning and a DIT specialist for control. LCB [127] LLaVA Diff. Leverage an added special token to encode VLM reasoning and act as conditions for policy. GR00T N1 [32] Eagle-2 FM Combine a VLM and DiT for humanoid robots manipulation. CogACT [128] OpenVLA Diff. 
Propose an action ensemble algorithm to integrate the action diffusion process into VLA."},{"citing_arxiv_id":"2507.04447","ref_index":119,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge","primary_cat":"cs.CV","submitted_at":"2025-07-06T16:14:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DreamVLA uses dynamic-region-guided world knowledge prediction, block-wise attention to disentangle information types, and a diffusion transformer for actions, reaching 76.7% success on real robot tasks and 4.44 average length on CALVIN ABC-D.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"DreamVLA shows significant superiority over baselines. The best results are bolded. Method Task completed in a row 1 2 3 4 5 Avg. Len. ↑ Roboflamingo [30] 82.4 61.9 46.6 33.1 23.5 2.47 Susie [118] 87.0 69.0 49.0 38.0 26.0 2.69 GR-1 [14] 85.4 71.2 59.6 49.7 40.1 3.06 3D Diffusor Actor [93] 92.2 78.7 63.9 51.2 41.2 3.27 OpenVLA [1] 91.3 77.8 62.0 52.1 43.5 3.27 RoboDual [119] 94.4 82.7 72.1 62.4 54.4 3.66 UNIVLA [120] 95.5 85.8 75.4 66.9 56.5 3.80 Pi0 [32] 93.8 85.0 76.7 68.1 59.9 3.92 CLOVER [121] 96.0 83.5 70.8 57.5 45.4 3.53 UP-VLA [57] 92.8 86.5 81.5 76.9 69.9 4.08 Robovlm [37] 98.0 93.6 85.4 77.8 70.4 4.25 Seer [56] 96.3 91.6 86.1 80.3 74.0 4.28 VPP [49] 95.7 91.2 86.3 81.0 75.0 4.29 DreamVLA 98.2 94.6 89.5 83.4 78."},{"citing_arxiv_id":"2507.01925","ref_index":289,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey on Vision-Language-Action Models: An Action Tokenization Perspective","primary_cat":"cs.RO","submitted_at":"2025-07-02T17:34:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The survey frames VLA models as pipelines that generate 
progressively grounded action tokens and classifies those tokens into eight types to guide future development.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"manipulation dataset with Genie-1 1M trajectories, 3K hours, 100 scenes, 5 domains, 200 tasks, 87 skills RGB-D videos, action, skill, proprioception FLaRe [286] WOMD [287, 288] a diverse interactive motion dataset for autonomous driving 103K segments, 20 seconds each, 574 hours ego pose, images, object tracks, 3D bounding box, Lidar data EMMA [29] nuScenes[289] a large dataset for autonomous driving 1K driving scenes, 20 seconds each ego pose, image, Lidar, Radar, object 3D bounding box EMMA [29], VLM-E2E [197] CoVLA [28] a comprehensive Vision-Language-Action dataset for autonomous driving 80 hours, 10K video clips videos, frame-level language captions, future trajectory actions CoVLA [28] 41 A Survey on Vision-Language-Action Models: An Action Tokenization Perspective"},{"citing_arxiv_id":"2505.06111","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"UniVLA: Learning to Act Anywhere with Task-centric Latent Actions","primary_cat":"cs.RO","submitted_at":"2025-05-09T15:11:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UniVLA trains cross-embodiment vision-language-action policies from unlabeled videos via a latent action model in DINO space, beating OpenVLA on benchmarks with 1/20th pretraining compute and 1/10th downstream data.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In contrast, some prior works [10, 39, 46] leverage pretrained VLMs to generate robotic actions by tapping into world knowledge from large-scale vision-language datasets. 
For instance, RT-2 [10] and OpenVLA [39] treat actions as tokens within the language model's vocabulary, while RoboFlamingo [46] introduces an additional policy head for action prediction. Building on these generalist policies, RoboDual [13] proposes a synergistic dual-system that combines the strengths of both generalist and specialist policy. Other works incorporate goal image [9] or video [24, 82, 14] prediction tasks to generate valid, executable plans conditioned on language instructions, with these visual cues subsequently guiding the policy in action generation. However, these methods heavily rely on"},{"citing_arxiv_id":"2503.06669","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems","primary_cat":"cs.RO","submitted_at":"2025-03-09T15:40:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AgiBot World supplies over 1 million trajectories enabling GO-1 to deliver 30% average gains over Open X-Embodiment and over 60% success on complex dexterous tasks while open-sourcing everything.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"the discretized low-level actions used in OpenVLA [4], this approach also facilitates the efficient adaptation of general-purpose VLMs into robot policies. C. Action Expert To achieve high-frequency and dexterous manipulation, Stage 3 integrates an action expert that utilizes a diffusion objective to model the continuous distribution of low-level actions [34]. Although the action expert shares the same architectural framework as the latent planner, their objectives diverge: the latent planner generates discretized latent action tokens through masked language modeling, while the action expert regresses low-level actions via an iterative denoising process. 
Both expert modules are conditioned hierarchically"}],"limit":50,"offset":0}