{"total":178,"items":[{"citing_arxiv_id":"2607.00673","ref_index":25,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Path Planning in Physically Viable World Models","primary_cat":"cs.RO","submitted_at":"2026-07-01T09:19:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A physically viable world model augments 3D Gaussian splats with physics simulation to assess robot route feasibility under simulated terrain changes like flooding, revealing failures not visible in static maps.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.00361","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ReShift: Aha-Moment-Driven Reasoning-Level Backdoor Attacks on Vision-Language Models","primary_cat":"cs.CR","submitted_at":"2026-07-01T02:59:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ReShift is a reasoning-level backdoor framework for VLMs that uses poisoned data construction and joint optimization to shift CoT trajectories on trigger while preserving surface coherence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.00316","ref_index":122,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Evolving Intelligent Complex Systems via Intellicise Networks: Architecture, Technologies, and Pathways","primary_cat":"eess.SP","submitted_at":"2026-07-01T01:32:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Proposes a cross-layer intellicise network architecture grounded in multiple theories to support intelligent complex systems, with reviews of enabling technologies and a case 
study.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.00310","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RetailSMV: Exocentric vs. Egocentric Adaptation of Foundation Video World Models in Retail","primary_cat":"cs.CV","submitted_at":"2026-07-01T01:23:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Exocentric-only LoRA adaptation of Cosmos3-Nano on a new synchronized retail video dataset matches or exceeds combined ego+exo training on most held-out metrics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.31958","ref_index":44,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Adapting Generalist Robot Policies with Semantic Reinforcement Learning","primary_cat":"cs.RO","submitted_at":"2026-06-30T17:00:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SARL optimizes language prompt inputs to generalist vision-language-action policies through online RL to solve complex long-horizon tasks by composing existing skills.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.31382","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Revisiting Parameter Redundancy in Vision-Language-Action Models: Insights from VLM-to-VLA Adaptation","primary_cat":"cs.RO","submitted_at":"2026-06-30T09:10:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VLA models from VLM adaptation can be pruned 12-30% via multi-module joint scheme based on divergence signals while keeping ~90% performance on LIBERO without post-pruning recovery, unlike standard 
criteria that collapse.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30111","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Automating the Design of Embodied Agent Architectures","primary_cat":"cs.RO","submitted_at":"2026-06-29T10:45:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Automated architecture search for embodied agents produces directional success-rate gains on vision-language and manipulation tasks while exposing limits from simulation noise and incomplete credit assignment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29892","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Trust Your Instincts: Confidence-Driven Test-Time RL for Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-06-29T07:31:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"T^2VLA is a test-time reinforcement learning framework for VLAs that uses internal confidence to define intrinsic rewards via similarity to high-confidence expert demonstrations and a dual-expert bootstrapping mechanism.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29699","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Early Warning Signals for OpenVLA Failure under Visual Distribution Shift","primary_cat":"cs.CV","submitted_at":"2026-06-29T02:07:17+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"OpenVLA layer-16 activations allow a logistic probe to predict failure within 15 steps under occlusion (AUROC 0.972) better than 
baselines, with some transfer to camera jitter.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29431","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"FADE: Mitigating Hallucinations by Reducing Language-Prior Dominance in Large Vision-Language Models","primary_cat":"cs.AI","submitted_at":"2026-06-28T14:48:08+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29384","ref_index":35,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Event-VLA: Action-Conditioned Event Fusion for Robust Vision-Language-Action Model","primary_cat":"cs.CV","submitted_at":"2026-06-28T13:19:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Event-VLA integrates event streams into VLA models through action-conditioned gated cross-attention to maintain performance in normal light while improving success rates under low-light and near-dark conditions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29350","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Fast Enough to Act: Spatio-Temporal Visual Token Merging for Low-Latency Robotic VLMs and VLAs","primary_cat":"cs.CV","submitted_at":"2026-06-28T11:42:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ST-Merge is a plug-and-play spatio-temporal token merging method that delivers 2x speedup on VLMs and 8.3x on a VLA at high resolution with minimal accuracy loss via 3D coordinate matching and positional 
correction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29267","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Enhancing Part-Level Point Grounding for Any Open-Source MLLMs","primary_cat":"cs.CV","submitted_at":"2026-06-28T08:32:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A plug-in Q-Synth Module plus Attention-to-Point Decoder converts text-conditioned attention in frozen MLLMs into point heatmaps, improving part-level grounding accuracy on multiple datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.26800","ref_index":40,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SSI-Policy: Learning Structured Scene Interfaces for Vision-Language Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-06-25T09:38:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SSI-Policy uses an RGB-only Structured Scene Interface to improve LIBERO benchmark performance by nearly 15% with only 10 demonstrations per task compared to prior methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.22449","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Self-Evolving Cognitive Framework via Causal World Modeling for Embodied Scientific Intelligence","primary_cat":"cs.AI","submitted_at":"2026-06-21T11:46:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Proposes a self-evolving cognitive framework integrating causal world modeling, intervention-driven reasoning, and continual refinement for embodied scientific 
intelligence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.20515","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence","primary_cat":"cs.CV","submitted_at":"2026-06-18T17:34:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"S-Agent augments VLMs with spatial tools, scene and agent memory for evidence accumulation on multi-view and video tasks, and produces an 8B model via SFT on its own trajectories that beats same-scale baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.07383","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RhinoVLA Technical Report","primary_cat":"cs.RO","submitted_at":"2026-06-05T15:21:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"RhinoVLA cuts VLM tokens with a Qwen3-VL backbone and continuous action expert, adds a unified cross-robot interface, and reaches real-time 11.69 Hz on Huixi R1 while matching π0.5 downstream performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.06491","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies","primary_cat":"cs.RO","submitted_at":"2026-06-04T17:59:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TempoVLA learns a single VLA policy with controllable execution speed via variable-speed trajectory augmentation and explicit speed 
conditioning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.06556","ref_index":121,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Robots Need More than VLA and World Models","primary_cat":"cs.RO","submitted_at":"2026-06-04T10:43:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper identifies four missing interfaces (data autolabelling, embodiment retargeting, physics-grounded world models, and video-based reward inference) as the central bottleneck beyond VLA scaling for robot intelligence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.05979","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis","primary_cat":"cs.RO","submitted_at":"2026-06-04T10:23:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"WLA models use an autoregressive Transformer to jointly predict textual subtasks, subgoal images, and robot actions from instructions, images, and states, reporting SOTA success rates on RoboTwin2.0 and RMBench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.28345","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Auditing LLM-Governed Social Robots with Culture-Specific Moral Gradients","primary_cat":"cs.RO","submitted_at":"2026-06-02T10:22:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces a gradient-based multilingual audit framework for LLM moral decisions in robot assistance scenarios and reports persistent 
culturally asymmetric gradient tracking failures not fixed by prompting.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.03047","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ModuLoop : Low-Level Code Generation using Modular Synthesizer and Closed-Loop Debugger for Robotic Control","primary_cat":"cs.RO","submitted_at":"2026-06-02T02:30:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Closed-Loop Modular Code Synthesizer uses pre-trained LLMs for modular code generation plus iterative execution-based debugging to produce working robotic control programs for camera calibration and pick-and-place tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.02277","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RoboSemanticBench: Diagnosing Semantic Grounding in Action Prediction for VLA Models","primary_cat":"cs.RO","submitted_at":"2026-06-01T14:02:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RoboSemanticBench reveals that representative VLA models grasp blocks successfully but select the semantically correct answer at near-random rates, indicating a gap between backbone semantics and action prediction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00828","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RoboStressBench: Benchmarking VLM Robustness to Physical Visual Stress in Embodied Scenes","primary_cat":"cs.CV","submitted_at":"2026-05-30T17:55:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RoboStressBench 
decomposes visual stress into four physically grounded dimensions to benchmark VLM robustness in embodied scenes and proposes a stress-aware solver.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00241","ref_index":58,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"InfoAtlas: A Foundation Model for Zero-Shot Statistical Dependence Estimate","primary_cat":"cs.LG","submitted_at":"2026-05-29T18:16:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InfoAtlas is a pretrained neural model for zero-shot mutual information estimation that matches state-of-the-art accuracy with 100x speedup and handles varying dimensions via a single model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00229","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Continuous Reasoning for Vision-Language-Action","primary_cat":"cs.RO","submitted_at":"2026-05-29T18:02:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Continuous Reasoning for VLA introduces a shared Gaussian latent for continuous thoughts, trained with self-verification to improve action prediction on LIBERO-PRO and real robots.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30311","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Archon: A Unified Multimodal Model for Holistic Digital Human Generation","primary_cat":"cs.CV","submitted_at":"2026-05-28T17:53:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Archon unifies seven modalities via modality-specific tokenizers and an autoregressive 
backbone pretrained on 72 tasks, plus a 4x-efficient video reparameterization and stepwise 'Thinking in Modality' procedure, and reports superior or comparable results on digital-human tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30011","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies","primary_cat":"cs.CV","submitted_at":"2026-05-28T14:36:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VISUALTHINK-VLA uses visual evidence tokens and selective routing to reach top success rates on VLA benchmarks while cutting reasoning latency from multi-second to sub-second levels.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29662","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SAFE-Pruner: Semantic Attention-Guided Future-Aware Token Pruning for Efficient Vision-Language-Action Manipulation","primary_cat":"cs.CV","submitted_at":"2026-05-28T09:23:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SAFE-Pruner forecasts deep-layer token saliency in VLA models via semantic attention consistency and adaptive subtask detection to achieve up to 1.89x speedup with under 1.7% success rate loss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27960","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mags-RL: Wearing Multimodal LLMs a Magnifying Glass via Agentic Reinforcement Learning For Complex Scene 
Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-27T04:54:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Mags-RL uses agentic RL and a super-resolution agent for two-round reasoning in MLLMs, claiming gains on VSR, TallyQA, and GQA with a curriculum needing only 40 samples.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00110","ref_index":204,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"General Covariant Action Modeling: Constructing Generalized Manifolds via Spatio-Temporal Decoupling","primary_cat":"cs.CV","submitted_at":"2026-05-27T03:38:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"GAM framework uses arc-length parameterization for temporal invariance and schema-affine factorization for geometric invariance to build a covariant action manifold integrated into VLA models for improved generalization from sparse data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27817","ref_index":29,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Turning Video Models into Generalist Robot Policies","primary_cat":"cs.RO","submitted_at":"2026-05-27T01:21:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Decouples action-free video world models from embodiment-specific IDMs using Jacobian-based translation to achieve zero-shot cross-embodiment robot policies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27759","ref_index":29,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Colosseum V2: Benchmarking Generalization for Vision Language Action 
Models","primary_cat":"cs.RO","submitted_at":"2026-05-26T23:17:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces Colosseum V2 benchmark for evaluating VLA model generalization in robotic manipulation with 28 tasks, revealing limitations in current methods and sim-real correlations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27491","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GE-Sim 2.0: A Roadmap Towards Comprehensive Closed-loop Video World Simulators for Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-26T16:23:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"GE-Sim 2.0 is a video-based closed-loop simulator for robotic manipulation that adds state expert, world judge, and acceleration modules on top of prior video generation to support policy learning and evaluation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00104","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PEACE: A Planner-Executor Agent with Constraint Enforcement for UAVs","primary_cat":"cs.RO","submitted_at":"2026-05-26T10:03:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"PEACE decouples single-pass LLM planning from PX4 execution via ROS 2 and a constraint layer, with modular 3D perception, and shows feasibility in Gazebo SITL with improved explainability and fewer LLM calls.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.26256","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Personalizing Embodied 
Multimodal Large Language Model Agents over Long-term User Interactions","primary_cat":"cs.AI","submitted_at":"2026-05-25T18:27:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"POLAR organizes prior interactions into a multimodal knowledge graph with semantic and episodic memory to improve personalized embodied task execution across multiple MLLM backbones.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25802","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Rethinking VLM Representation for VLA Initialization","primary_cat":"cs.CV","submitted_at":"2026-05-25T12:51:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Experiments indicate original VLM representations are crucial for VLA performance, LoRA outperforms full finetuning, and staged robot-data pretraining yields the strongest initialization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.24892","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"X-Foresight: A Joint Vision-Action Causal Forecasting Network via Predictive World Modeling","primary_cat":"cs.CV","submitted_at":"2026-05-24T06:37:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"X-Foresight adds a long-horizon chunk-wise auto-regressive world model with temporal importance sampling and curriculum learning to VLA architectures for improved planning and generative fidelity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22570","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VGenST-Bench: A 
Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis","primary_cat":"cs.CV","submitted_at":"2026-05-21T14:48:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VGenST-Bench is a new video benchmark for MLLM spatio-temporal reasoning built via generative synthesis, a multi-agent pipeline with human oversight, a 3x2x2 taxonomy, and hierarchical tasks separating perception from reasoning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"C) Benchmark statistics. VGenST-Bench comprises 1,200 videos and 33K QA pairs spanning 12 task types and 12 QA types. 1 Introduction Multimodal Large Language Models (MLLMs) have rapidly advanced beyond basic perceptual tasks such as image recognition and captioning, and are now being deployed in physically grounded applications, including robotics [14, 94, 34] and autonomous driving [81, 69]. These deployments position MLLMs as a foundation toward world models that can understand and predict the dynamics of physical environments [28, 30]. However, despite this progress, current MLLMs still exhibit notable challenges in understanding how objects and scenes evolve over time and across viewpoints. 
In"},{"citing_arxiv_id":"2605.19728","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Aero-World: Action-Conditioned Aerial Video Generation from Inertial Controls","primary_cat":"cs.CV","submitted_at":"2026-05-19T12:02:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Aero-World adapts a pretrained latent diffusion transformer for action-conditioned aerial video generation by injecting inertial action tokens and using a frozen latent-space Physics Probe for inertial consistency supervision during LoRA finetuning, with a new AeroBench benchmark showing improved AA","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19420","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond Waypoints: Dual-Heatmap Grounding for Cross-Embodiment Semantic Navigation","primary_cat":"cs.RO","submitted_at":"2026-05-19T06:12:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A vision-language model outputs dual heatmaps for navigation affordance and facing to ground semantic instructions into executable free space, achieving higher affordance rates than waypoint regression across simulated robot embodiments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19319","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SWEET: Sparse World Modeling with Image Editing for Embodied Task Execution","primary_cat":"cs.CV","submitted_at":"2026-05-19T03:54:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SWEET is a one-shot sparse visual planning framework that progressively generates manipulation keyframes via image 
editing conditioned on language and spatial guidance, then converts them to actions with a diffusion predictor, showing better fidelity and lower cost than video models on DROID and Rob","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18331","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Prune, Update and Trim: Robust Structured Pruning for Large Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-18T12:48:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Putri is a structured pruning technique for LLMs that compensates for pruning errors via weight updates and sequential processing while pruning at the attention-head level to reach state-of-the-art results at extreme sparsity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17486","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization","primary_cat":"cs.RO","submitted_at":"2026-05-17T14:55:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DyGRO-VLA is a two-stage optimization framework for cross-task scaling of Vision-Language-Action models via dynamic grouped residual optimization in RL.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"(13) Cal-QL calibration regularizer. To enable a smooth transition from offline data to online rollouts, we apply the Cal-QL calibration regularizer to each ensemble member, encouraging high value on policy actions while remaining anchored to dataset actions: L_CalReg(θ_i) = E_{s_t∼D}[ E_{a∼π(·|s_t)}[ max(Q_{θ_i}(s_t, a), V^µ(s_t)) ] − 
E_{a∼D(·|s_t)}[ Q_{θ_i}(s_t, a) ] ]. (14) Here a_t = a_{t:t+h−1} denotes an action chunk and π(a|s) is the chunk-level policy. We approximate V^µ(s_t) using Monte Carlo returns from the offline dataset. 12 DyGRO-VLA B. Proof Proposition B.1 (Variational Information Bottleneck for Action Regression). Let Z be a latent representation produced by an encoder p_θ(z|o), and let π_θ(a|z) denote a probabilistic action decoder."},{"citing_arxiv_id":"2605.17077","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"How to Instruct Your Robot: Dense Language Annotations Power Robot Policy Learning","primary_cat":"cs.RO","submitted_at":"2026-05-16T16:52:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DeMiAn re-annotates robot and egocentric videos with VLM-generated dense labels across motion, scene, pose, and reasoning aspects, then uses a learned instructor to boost policy success by 5 points on RoboCasa over task-only baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16932","ref_index":40,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MORN: Metacognitive Object-Goal Regulation for Resource-Rational Long-Horizon Navigation","primary_cat":"cs.RO","submitted_at":"2026-05-16T10:59:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MORN augments frozen VLM-based object navigation agents with a System 2 meta-controller using Potentiality Index, Persistence Gating, and Evidence Accumulation to improve goal completion rate from 0.23 to 0.30 and reduce wasted steps on the HM3D dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15951","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From Failure to 
Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding","primary_cat":"cs.CV","submitted_at":"2026-05-15T13:41:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A group-revision paradigm for GRPO-based RL fine-tuning of VLMs converts failure responses into improvement signals that refine rewards and advantages, yielding gains on referring segmentation, REC, and counting benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14700","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SR-Platform: An Agentic Pipeline for Natural Language-Driven Robot Simulation Environment Synthesis","primary_cat":"cs.RO","submitted_at":"2026-05-14T11:14:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"SR-Platform is a production-deployed nine-service Docker system that synthesizes physically valid MuJoCo environments from natural language using LLM orchestration, CadQuery asset forging, constraint-aware layout, and MJCF assembly, with reported median latency of ~50 s for five-object scenes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11665","ref_index":44,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Nautilus: From One Prompt to Plug-and-Play Robot Learning","primary_cat":"cs.RO","submitted_at":"2026-05-12T07:26:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning 
research.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Emily Perez, Karl Pertsch, Jornell Quiambao, Kanishka Rao, Michael Ryoo, Grecia Salazar, Pannag Sanketi, Kevin Sayed, Jaspiar Singh, Sumedh Sontakke, Austin Stone, Clayton Tan, Huong Tran, Vincent Vanhoucke, Steve Vega, Quan Vuong, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Tianhe Yu, and Brianna Zitkovich. Rt-1: Robotics transformer for real-world control at scale. In arXiv preprint arXiv:2212.06817, 2022. [44] Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023. [45] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan"},{"citing_arxiv_id":"2605.11534","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PRISM: : Planning and Reasoning with Intent in Simulated Embodied Environments","primary_cat":"cs.RO","submitted_at":"2026-05-12T04:59:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PRISM is a tiered benchmark with 300 human-verified tasks across five photorealistic apartments that diagnoses embodied agent failures in basic ability, reasoning ability, and long-horizon ability using an agent-agnostic API.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"instantiating memory as an independently substitutable module, enabling direct attribution of failures to retention versus reasoning at the apartment level.
LLM/VLM-Based Embodied Planning. LLMs have been integrated as planning backbones via affordance grounding [3], closed-loop multimodal feedback [6], and code generation [11]; VLMs enable direct visual plan grounding without symbolic state extraction [5, 1]. A persistent confound in evaluating these systems is action hallucination: a planner may generate syntactically valid but physically unexecutable actions, and penalizing these silently [18] conflates hallucination failures with genuine reasoning errors. PRISM eliminates this confound structurally: the affordance-grounded atomic action space prevents actions outside the predefined executable action space, ensuring that"}],"limit":50,"offset":0}