{"total":13,"items":[{"citing_arxiv_id":"2607.01784","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SpaceEra++: A Unified Framework Towards 3D Spatial Reasoning in Video","primary_cat":"cs.CV","submitted_at":"2026-07-02T06:56:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"SpaceEra++ adds ScenePick frame sampling and SpaceAlign pairwise constraints to the prior SpaceEra system, claiming consistent benchmark gains for 3D video spatial reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.11683","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Reason, Then Re-reason: Cross-view Revisiting Improves Spatial Reasoning","primary_cat":"cs.CV","submitted_at":"2026-06-10T05:52:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ReRe boosts open-source MLLMs on spatial reasoning benchmarks VSI-Bench and STI-Bench to rival proprietary SOTA by using a two-phase Reason then Re-reason process with Geometry-to-Video novel view synthesis.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.07433","ref_index":219,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Watch, Remember, Reason: Human-View Video Understanding with MLLMs","primary_cat":"cs.CV","submitted_at":"2026-06-05T16:29:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"This is a survey that frames video MLLM research via a human-view formulation of perceptual representations, memory states, reasoning traces, and predictions, then reviews methods, datasets, benchmarks, and open 
problems.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"behaviors. VIDEO-STR [217] models spatial and temporal interactions using object-centric relation graphs, enabling structured reasoning over multi-object layouts across time. SpatialLadder [218] proposes a progressive curriculum that incrementally builds spatial reasoning from basic perceptual grounding to higher-level spatial abstraction. Cambrian-S [219] targets long-horizon spatial cognition by introducing visual spatial recall and continual visual spatial counting tasks, emphasizing sustained spatial memory over extended videos. Overall, these models enhance spatial modeling, enabling video MLLMs to better capture real-world spatial structure. 3.3.2 Thinking with Videos In text-only reasoning, models may ignore important visual"},{"citing_arxiv_id":"2606.01247","ref_index":57,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?","primary_cat":"cs.CV","submitted_at":"2026-05-31T14:00:10+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Introduces the TVR active viewpoint-matching task and TVRBench indoor simulation benchmark, where foundation models start at low single-digit success rates but reach 51.4% after visual-action SFT and multi-turn GRPO post-training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20165","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-19T17:50:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Proposes Spatial Narrative Score 
(SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20705","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models","primary_cat":"cs.CV","submitted_at":"2026-04-22T15:46:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SSL-R1 reformulates visual SSL tasks into verifiable puzzles to supply rewards for RL post-training of MLLMs, yielding gains on multimodal benchmarks without external supervision.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"MLLMs still face two key challenges: (i) The rewards are primarily language-centric [27, 44, 64, 66, 79], with dense visual information only being leveraged to extract sparse cues for text-based reasoning. As a result, they exhibit systematic weaknesses in fine-grained visual understanding. (ii) The training data typically require manual annotations [35, 41, 61, 65], which is very expensive to obtain and difficult to scale given the increasing complexity and quantity of real-world tasks. These shortcomings highlight the need for training methods that reinforce vision-centric grounding and reasoning without human supervision. 
To this end, we propose SSL-R1, a generic self-supervised post-training framework that derives verifiable"},{"citing_arxiv_id":"2604.12908","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \\rightarrow G$): Vision-Geometry Backbones over Language and Video Models","primary_cat":"cs.RO","submitted_at":"2026-04-14T15:57:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06725","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning","primary_cat":"cs.CV","submitted_at":"2026-04-08T06:47:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A training-free Visual Chain-of-Thought framework reconstructs high-fidelity 3D meshes from single images and iteratively synthesizes optimal novel views to enhance MLLM spatial comprehension on benchmarks like 3DSRBench.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"tool libraries, and building flexible, scalable toolkits remains a challenge. Post-training approaches enhance MLLMs via SFT and RL [45,47]. SPRITE [19] trains MLLMs on programmatically synthesized data to improve performance on spatial benchmarks. However, reliance on high-quality spatial datasets bottlenecks SFT methods due to data scarcity. 
vsGRPO [32] uses a self-play paradigm to reduce this dependency by automatically generating and solving spatial problems, using training strategies similar to those in R1-Zero [16,21,63]. While these RL methods complement SFT, they require careful reward design and introduce high computational overhead. Unlike these paradigms, our method requires no training."},{"citing_arxiv_id":"2604.03318","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs","primary_cat":"cs.CV","submitted_at":"2026-04-01T15:28:13+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"tialVLM [4] leverages large-scale scene-centric datasets to enhance spatial awareness, while Video3DLLM [57] extends this idea to multi-frame scenarios. SpaceR [32] introduces 2D grids with object-layout intermediate supervision to guide learning, and ST-Think [46] integrates reverse reasoning into reinforcement learning to improve spatial inference. R1-Zero-VSI [24] constructs a high-quality spatial reasoning dataset and fine-tunes MLLMs using an optimized GRPO algorithm, whereas Spatial-Ladder [21] adopts a three-stage training strategy to progressively enhance spatial understanding. 
However, these methods depend on additional supervision or large-scale data, resulting in substantial training costs and limited generalization."},{"citing_arxiv_id":"2603.14184","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models","primary_cat":"cs.CV","submitted_at":"2026-03-15T02:21:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Attention dispersion during extended reasoning impairs MLLM perception on images, and a training-free VRGA framework mitigates it by selecting and reweighting visual attention heads using an entropy-focus criterion.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.00748","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Improving the Reasoning of Multi-Image Grounding in MLLMs via Reinforcement Learning","primary_cat":"cs.CV","submitted_at":"2025-07-01T13:48:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A pipeline of chain-of-thought data synthesis, LoRA-based supervised fine-tuning, rejection sampling, and rule-based reinforcement learning raises multi-image grounding accuracy by 9.04% on MIG-Bench and 4.41% on average across seven other benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.21374","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Video-Holmes: Can MLLM Think Like Holmes for Complex Video 
Reasoning?","primary_cat":"cs.CV","submitted_at":"2025-05-27T16:05:01+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Video-Holmes benchmark shows top MLLMs achieve at most 45% accuracy on tasks needing integration of multiple clues from suspense films, unlike existing perception-focused tests.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.13377","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding","primary_cat":"cs.CV","submitted_at":"2025-03-17T17:04:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Time-R1 applies RL with verifiable rewards to post-train LVLMs for temporal video grounding, reaching state-of-the-art results on multiple datasets using only 2.5K samples while also improving general video capabilities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}