{"total":25,"items":[{"citing_arxiv_id":"2606.27872","ref_index":33,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"S$^2$-VLA: State-Space Guided Vision-Language-Action Models for Long-Horizon Manipulation","primary_cat":"cs.RO","submitted_at":"2026-06-26T09:13:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"S²-VLA uses a state-space model to maintain a belief state that produces dynamic gating weights for fusing visual, language, and action features, claiming better long-horizon manipulation than 7B models with only 2B parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29416","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"3DVLA: Enhancing Vision-Language-Action Models via 3D Spatial and Instance Understanding","primary_cat":"cs.RO","submitted_at":"2026-05-28T06:07:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"3DVLA is a plug-and-play framework that enhances pretrained VLAs with pervasive 3D feature encoding using multi-view consistency and Spatially-Conditioned Geometry Aggregation, an instance estimation module, and a masked self-supervised 3D branch, yielding gains on LIBERO-Plus and RoboTwin 2.0.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19580","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PAPO-VLA: Planning-Aware Policy Optimization for Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-19T09:22:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PAPO-VLA identifies planning actions via variation and outcome, estimates their causal 
importance, and folds that importance into GRPO to emphasize key decisions while still using full-trajectory feedback.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13548","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AttenA+: Rectifying Action Inequality in Robotic Foundation Models","primary_cat":"cs.RO","submitted_at":"2026-05-13T13:55:37+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"action generation, with its variant OpenVLA-OFT [17] further optimizing via orthogonal fine-tuning to push state-of-the-art (SOTA) performance on LIBERO tasks. The π model series, including π0 [2], π0 + FAST [18], and π0.5 [10], advances generative VLA capabilities through flow matching for strong generalization. Other representative VLA models and optimizations include UniVLA [7], VLA-ADP [19], CogACT [20], SmolVLA [21], NORA and NORA-Long [22], WorldVLA and WorldVLA* [8], SP-VLA [23], FlashVLA [24], VLA-Cache [25], FastV and FastV(+OFT) [26], SparseVLM [27], and CSP [28]. Parallel efforts emerging as WAMs include Motus [13], LingBot-VA [14], and Fast-WAM [29]. 
Despite consistent progress across benchmarks, nearly all existing action models share a core limitation: treating all action timesteps equally during training, neglecting the"},{"citing_arxiv_id":"2605.12369","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization","primary_cat":"cs.RO","submitted_at":"2026-05-12T16:38:40+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"among single-head variants on the Object and Long suites, the skill head gives the best single-head result on the Goal suite, and the depth head performs best on the Spatial suite. Perturbation Dimensions Task Suites Model Camera Robot Language Light Background Noise Layout Spatial Object Goal Long Total OpenVLA [45] 0.8 3.5 23.0 8.1 34.8 15.2 28.5 19.4 14.0 15.1 14.3 15.6 OpenVLA-OFT [44] 56.4 31.9 79.5 88.7 93.3 75.8 74.2 84.0 66.5 63.0 66.4 69.6 NORA [38] 2.2 37.0 65.1 45.7 58.6 12.8 62.1 47.6 34.4 38.8 36.3 39.0 WorldVLA [12] 0.1 27.9 41.6 43.7 17.1 10.9 38.0 32.5 28.6 31.8 8.2 25.0 UniVLA [11] 1.8 46.2 69.6 69.0 81.0 21.2 31.9 55.5 36.7 40.7 39.9 43.9 π0-Fast [69] 65.1 21.6 61.0 73.2 73.2 74.4 68.8 74.4 72.7 57.5 43.4 61.6 RIPT-VLA [77] 55.2 31.2 77.6 88.4 91.6 73.5 74.2 85.8 64.3 58.0 67.5 68.4"},{"citing_arxiv_id":"2605.12167","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-12T14:15:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve 
robot manipulation policies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11832","ref_index":75,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Learning Action Manifold with Multi-view Latent Priors for Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-12T09:21:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The method uses multi-view diffusion priors and action manifold learning to resolve depth ambiguity and improve action prediction in VLA robotic manipulation models, reporting higher success rates than baselines on LIBERO, RoboTwin, and real-robot tasks.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"OpenVLA-OFT m shows the performance of OpenVLA-OFT with a mix-sft [28]. Method Camera Robot Language Light Background Noise Layout Total OpenVLA [5] 0.8 3.5 23.0 8.1 34.8 15.2 28.5 15.6 OpenVLA-OFT [11] 56.4 31.9 79.5 88.7 93.3 75.8 74.2 69.6 OpenVLA-OFT w [11] 10.4 38.7 70.5 76.8 93.6 49.9 69.9 55.8 OpenVLA-OFT m [11] 55.6 21.7 81.0 92.7 91.0 78.6 68.7 67.9 NORA [75] 2.2 37.0 65.1 45.7 58.6 12.8 62.1 39.0 WorldVLA [39] 0.1 27.9 41.6 43.7 17.1 10.9 38.0 25.0 UniVLA [38] 1.8 46.2 69.6 69.0 81.0 21.2 31.9 42.9 π0 [6] 13.8 6.0 58.8 85.0 81.4 79.0 68.9 53.6 π0-Fast [12] 65.1 21.6 61.0 73.2 73.2 74.4 68.8 61.6 RIPT-VLA [41] 55.2 31.2 77.6 88.4 91.6 73.5 74.2 68.4 MergeVLA [76] 50.7 30.3 66.0 84.2 85.7 66.0 68.1 62."},{"citing_arxiv_id":"2605.11809","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Beyond World-Frame Action Heads: Motion-Centric Action Frames for Vision-Language-Action Models","primary_cat":"cs.AI","submitted_at":"2026-05-12T09:03:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MCF-Proto 
adds a motion-centric local action frame and prototype parameterization to VLA models, inducing emergent geometric structure and improved robustness from standard demonstrations alone.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"directly from policy features, and use it to express the action before mapping it back to the world frame. In this way, the learned frame serves as a compact intermediate representation for action generation rather than an externally specified geometric input. Prototype Actions. Our action representation is also related to prior work on motion primitives, dictionary learning, structured latent actions, and compositional policy outputs [21, 32, 28, 16]. Across robotics and machine learning, reusable basis elements or prototype components have been used to capture recurring structure in trajectories, skills, and control policies, often improving parameter efficiency, interpretability, and transfer [21, 32, 36, 34, 18]. Related ideas also appear in mixture-of-experts models, sparse coding, and other compositional prediction frameworks, where outputs are"},{"citing_arxiv_id":"2605.06481","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-07T16:06:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"SlotVLA: Towards modeling of object-relation representations in robotic manipulation. arXiv preprint arXiv:2511.06754, 2025. 
[27] Chi-Pin Huang, Yueh-Hua Wu, Min-Hung Chen, Yu-Chiang Frank Wang, and Fu-En Yang. ThinkAct: Vision-language-action reasoning via reinforced visual latent planning. In Advances in Neural Information Processing Systems (NeurIPS), 2025. arXiv:2507.16815. [28] Chia-Yu Hung, Qi Sun, Pengfei Hong, Amir Zadeh, Chuan Li, U-Xuan Tan, Navonil Majumder, and Soujanya Poria. NORA: A small open-sourced generalist vision language action model for embodied tasks. arXiv preprint arXiv:2504.19854, 2025. [29] Youngjoon Jeong, Junha Chun, Soonwoo Cha, and Taesup Kim. Object-centric world model for language-guided manipulation."},{"citing_arxiv_id":"2605.06175","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts","primary_cat":"cs.RO","submitted_at":"2026-05-07T12:56:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"VLA-GSE uses spectral decomposition of the VLA backbone to create generalized and specialized experts, enabling effective robot task adaptation while updating only 2.51% of parameters and achieving 81.2% zero-shot success on LIBERO-Plus.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Let g = ∂L_final/∂W_eq(x) denote the gradient of the loss with respect to the aggregated equivalent weight. 
By the chain rule, the localized gradients with respect to A_s^i and B_s^i are g_A^i = ∂L_final/∂A_s^i = [∂(w_i(x) W_s^i)/∂A_s^i] [∂L_final/∂(w_i(x) W_s^i)] = w_i(x) s_s^i (B_s^i)^⊤ g, (13) g_B^i = ∂L_final/∂B_s^i = [∂L_final/∂(w_i(x) W_s^i)] [∂(w_i(x) W_s^i)/∂B_s^i] = w_i(x) s_s^i g (A_s^i)^⊤. (14) Using the initialization B_s^i = √(1/s_s^i) U_i Σ_i^{1/2}, A_s^i = √(1/s_s^i) Σ_i^{1/2} V_i^⊤, (15) we obtain g_A^i = w_i(x) s_s^i (√(1/s_s^i) U_i Σ_i^{1/2})^⊤ g = w_i(x) √(s_s^i) Σ_i^{1/2} U_i^⊤ g, (16) g_B^i = w_i(x) s_s^i g (√(1/s_s^i) Σ_i^{1/2} V_i^⊤)^⊤ = w_i(x) √(s_s^i) g V_i Σ_i^{1/2}. (17) We first analyze the squared Frobenius norm of g_A^i: ∥g_A^i∥_F^2 = Tr((g_A^i)^⊤ g_A^i) = Tr((w_i(x) √(s_s^i) Σ^{1/2}"},{"citing_arxiv_id":"2605.02881","ref_index":16,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MolmoAct2: Action Reasoning Models for Real-world Deployment","primary_cat":"cs.RO","submitted_at":"2026-05-04T17:51:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MolmoAct2 is an open VLA model that outperforms baselines like Pi-05 on 7 benchmarks and whose backbone surpasses GPT-5 on 13 embodied-reasoning tasks through new datasets, specialized training, and architecture changes for lower latency.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"contains bidirectional self-attention over the action chunk, cross-attention to the VLM context, and an MLP. A sinusoidal time embedding is passed through a small MLP, then used to produce DiT-style shift, scale, and gate parameters for all three residual branches. Schematically, block ℓ computes h′_ℓ = h_ℓ + g_ℓ^{sa} SA(AdaRMS_ℓ^{sa}(h_ℓ, t)), (16) h̄_ℓ = h′_ℓ + g_ℓ^{ca} CA(AdaRMS_ℓ^{ca}(h′_ℓ, t), K̃_ℓ, Ṽ_ℓ), (17) h_{ℓ+1} = h̄_ℓ + g_ℓ^{ff} MLP(AdaRMS_ℓ^{ff}(h̄_ℓ, t)). (18) Here, AdaRMS denotes RMS normalization followed by the time-dependent affine modulation. The self-attention and cross-attention layers use query-key normalization. 
The expert uses rotary position embeddings for the action sequence, which gives the denoising transformer an explicit ordering over the predicted trajectory"},{"citing_arxiv_id":"2604.27472","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations","primary_cat":"cs.AI","submitted_at":"2026-04-30T06:14:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"to standard CRL's discounted occupancy estimation. Theorem 1 (Temporal Weighting Implements Geometric Sampling). Let π∗ be the deterministic expert policy generating the demonstrations. For any state-action pair (s, a) in the dataset belonging to task with goal l, the optimal representations (ϕ∗, ψ∗) minimizing L_{l→sa} satisfy: ψ∗(l)^⊤ ϕ∗(s, a) = log Q_l^{π∗}(s, a) + C(l), (13) where Q_l^{π∗}(s, a) = p_γ^{π∗}(s_{t+} = s_g | s, a) is the discounted state occupancy measure (i.e., the probability of reaching goal s_g under π∗ starting from (s, a)), and C(l) is a function depending only on the goal l. Proof. The proof is deferred to Appendix B. 
Theorem 1 reveals that while the weights q_ij are computed using temporal distances (T − t), the learned"},{"citing_arxiv_id":"2604.21924","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Long-Horizon Manipulation via Trace-Conditioned VLA Planning","primary_cat":"cs.RO","submitted_at":"2026-04-23T17:59:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LoHo-Manip enables robust long-horizon robot manipulation by using a receding-horizon VLM manager to output progress-aware subtask sequences and 2D visual traces that condition a VLA executor for automatic replanning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21241","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors","primary_cat":"cs.RO","submitted_at":"2026-04-23T03:17:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"either term causes a clear drop in performance, while using TABLE IV SUCCESS RATES (%) ON LIBERO-PLUS FOR THE 4-IN-1 MODEL. Corr DENOTES OUR METHOD. 
Method Long Goal Object Spatial Avg NORA [28] 36.3 38.8 34.4 47.6 39 UniVLA [19] 39.9 40.7 36.7 55.5 42.9 SmolVLA-Base 46.53 35.89 66.2 32.85 45.37 SmolVLA-Corr 49.27 55.27 72.36 54.04 57.74 π0 48.4 44.9 61.4 60.7 53.6 OpenVLA-OFT [29] 66.4 63 66.5 84 69.6 GR00T-Base [30] 62.21 68.54 84.55 85.64 75.23 GR00T-Corr 74.55 85.75 88.4 84.14 83.21 TABLE V SUCCESS RATE (%) ON LIBERO (4-IN-1) WITH ABLATED CORRIDOR LOSS COMPONENTS. Method Long Goal Object Spatial Avg merge 79.2 90.4 94 92.4 89 +Lbuf 80.6 92.4 92.6 92.4 89.5 +Lcons 82.4 89.2 97.8 92.2 90.4 +Lbuf +Lcons −RDP 80.2 88.2 95.8 92.2 89."},{"citing_arxiv_id":"2604.20472","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Temporal Difference Calibration in Sequential Tasks: Application to Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-04-22T11:58:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Temporal difference calibration aligns uncertainty estimates in vision-language-action models with their value functions for better sequential performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17876","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OFlow: Injecting Object-Aware Temporal Flow Matching for Robust Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-04-20T06:38:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"UNKNOWN","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OFlow unifies temporal foresight and object-aware reasoning inside a shared latent space via flow matching to improve VLA robustness in robotic manipulation under distribution shifts.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"most existing approaches remain fundamentally reactive, selecting actions solely 
from the current observation. However, effective robotic manipulation in real-world settings relies on two fundamental and complementary capabilities. The first is accurate object-aware perception, i.e., the ability to localize and represent objects within complex scenes. Recent approaches [27, 48, 49, 58] leverage the Segment Anything Model (SAM) to extract object regions and feed them into downstream policies. While effective, these methods typically rely on human assistance to specify or refine target regions prior to execution, limiting their autonomy in deployments. The second capability is foresight over scene dynamics. By modeling the underlying temporal evolution of the"},{"citing_arxiv_id":"2604.12447","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"HazardArena: Evaluating Semantic Safety in Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-04-14T08:32:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"HazardArena shows VLA models trained on safe data frequently produce unsafe actions in semantically risky but visually similar settings, and a training-free Safety Option Layer reduces those failures with little performance cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.18532","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VLANeXt: Recipes for Building Strong VLA Models","primary_cat":"cs.CV","submitted_at":"2026-02-20T09:26:17+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VLANeXt distills 12 design insights from a unified VLA study into a model that outperforms prior methods on LIBERO benchmarks while releasing code for further 
exploration.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.11236","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ABot-M0: VLA Foundation Model for Robotic Manipulation with Action Manifold Learning","primary_cat":"cs.CV","submitted_at":"2026-02-11T16:47:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ABot-M0 unifies heterogeneous robot data into a 6-million-trajectory dataset and introduces Action Manifold Learning to predict stable actions on a low-dimensional manifold using a DiT backbone.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.10503","ref_index":25,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Towards Long-Lived Robots: Continual Learning VLA Models via Reinforcement Fine-Tuning","primary_cat":"cs.RO","submitted_at":"2026-02-11T04:05:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LifeLong-RFT applies chunking-level on-policy reinforcement learning with Quantized Action Consistency Reward, Continuous Trajectory Alignment Reward, and Format Compliance Reward to fine-tune VLA models, achieving a 22% average success rate gain over supervised fine-tuning on the LIBERO benchmark's","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.18960","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention","primary_cat":"cs.LG","submitted_at":"2025-11-24T10:22:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AVA-VLA reformulates VLA learning as a POMDP 
using recurrent states and active visual attention to achieve state-of-the-art results on LIBERO, CALVIN, and real dual-arm tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.15669","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models","primary_cat":"cs.LG","submitted_at":"2025-10-31T05:26:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DeepThinkVLA shows CoT improves VLA models only under decoding and causal alignment, delivering 97% success on LIBERO and 21.7-point gains via hybrid attention and SFT-RL training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.09674","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning","primary_cat":"cs.RO","submitted_at":"2025-09-11T17:59:17+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SimpleVLA-RL applies tailored reinforcement learning to VLA models, reaching SoTA on LIBERO, outperforming π₀ on RoboTwin, and surpassing SFT in real-world tasks while reducing data needs and identifying a 'pushcut' phenomenon.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.13073","ref_index":45,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey","primary_cat":"cs.RO","submitted_at":"2025-08-18T16:45:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"This survey organizes large VLM-based VLA 
models for robotic manipulation into monolithic and hierarchical paradigms, reviews their integrations and datasets, and outlines future directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"guage instructions, and (2) performs reasoning processes that directly or indirectly serve robotic action generation. We further distinguish two principal categories of large VLM-based VLA models, as shown in Fig. 2 and Fig. 3: • Monolithic Models (Fig. 3, left) comprise single-system and dual-system implementations. (1) Single-system models [26], [27], [44], [45] integrate both environmental comprehension (including visual perception, linguistic understanding, and robot state awareness) and action generation within a unified architecture. In contrast, (2) dual-system models [29]-[32] employ a VLM backbone for scene interpretation and an action expert for action determination, exchanging information via"},{"citing_arxiv_id":"2506.13757","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning","primary_cat":"cs.CV","submitted_at":"2025-06-16T17:58:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AutoVLA unifies semantic reasoning and trajectory planning in one autoregressive VLA model for end-to-end autonomous driving by tokenizing trajectories into discrete actions and using GRPO reinforcement fine-tuning to adaptively reduce unnecessary reasoning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024. 
[31] Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. arXiv preprint arXiv:2503.22020, 2025. [32] Chia-Yu Hung, Qi Sun, Pengfei Hong, Amir Zadeh, Chuan Li, U-Xuan Tan, Navonil Majumder, and Soujanya Poria. Nora: A small open-sourced generalist vision language action model for embodied tasks, 2025. URL https://arxiv.org/abs/2504.19854. [33] Hidehisa Arai, Keita Miwa, Kento Sasaki, Kohei Watanabe, Yu Yamaguchi, Shunsuke Aoki, and Issei Yamamoto. Covla: Comprehensive vision-language-action dataset for autonomous"}],"limit":50,"offset":0}