{"total":13,"items":[{"citing_arxiv_id":"2607.00881","ref_index":37,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OmniView-Space: Reinforcing Spatial Reasoning via Multi-Perspective Spatial Mapping","primary_cat":"cs.CV","submitted_at":"2026-07-01T12:45:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"OmniView-Space framework with MPSM, tool-guided reasoning, and distillation achieves SOTA on spatial reasoning benchmarks for MLLMs while reducing external geometry dependencies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.05677","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LongSpace: Exploring Long-Horizon Spatial Memory from Perception to Recall in Video","primary_cat":"cs.CV","submitted_at":"2026-06-04T04:00:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Presents LongSpace-Bench benchmark and LongSpace framework that chunks long videos, adds 3D structural cues, and builds layer-aware memory to improve spatial reasoning in multimodal LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.03577","ref_index":54,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching","primary_cat":"cs.CV","submitted_at":"2026-06-02T12:46:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Authors create ReasonMatch-Bench and DCRL training to boost MLLM performance on wide-baseline matching, reporting gains over baselines while preserving general 
capabilities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01247","ref_index":71,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?","primary_cat":"cs.CV","submitted_at":"2026-05-31T14:00:10+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Introduces the TVR active viewpoint-matching task and TVRBench indoor simulation benchmark, where foundation models start at low single-digit success rates but reach 51.4% after visual-action SFT and multi-turn GRPO post-training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30561","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VLM3: Vision Language Models Are Native 3D Learners","primary_cat":"cs.CV","submitted_at":"2026-05-28T20:48:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Standard VLMs achieve expert-level 3D performance on depth estimation, pose estimation, and object understanding via three simple techniques without architecture changes or regression losses.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30231","ref_index":57,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-28T17:00:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GASP injects geometric priors into VLMs via a deep-supervised correspondence head trained on video point correspondences 
and depth consistency, raising internal matching accuracy and delivering gains on spatial benchmarks without any 3D VQA data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15876","ref_index":63,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Unlocking Dense Metric Depth Estimation in VLMs","primary_cat":"cs.CV","submitted_at":"2026-05-15T11:54:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DepthVLM converts a standard VLM into a dense metric depth predictor by attaching a lightweight head and training under unified vision-text supervision, outperforming prior VLMs and some pure vision models on a new indoor-outdoor benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02130","ref_index":69,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs","primary_cat":"cs.CV","submitted_at":"2026-05-04T01:19:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SFI-Bench shows current multimodal LLMs struggle to integrate spatial memory with functional reasoning and external knowledge in video tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Multimodal chain-of-thought reasoning: A comprehensive survey.arXiv preprint arXiv:2503.12605, 2025. 6 [68] Yi Xin, Qi Qin, Siqi Luo, Kaiwen Zhu, Juncheng Yan, Yan Tai, Jiayi Lei, Yuewen Cao, Keqi Wang, Yibin Wang, et al. Lumina-dimoo: An omni diffusion large language model for multi-modal generation and understanding.arXiv preprint arXiv:2510.06308, 2025. 
8 [69] Runsen Xu, Weiyao Wang, Hao Tang, Xingyu Chen, Xiaodong Wang, Fu-Jen Chu, Dahua Lin, Matt Feiszli, and Kevin J Liang. Multi-spatialmllm: Multi-frame spatial understanding with multi-modal large language models. arXiv preprint arXiv:2505.17015, 2025. 8 [70] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen"},{"citing_arxiv_id":"2604.07296","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence","primary_cat":"cs.CL","submitted_at":"2026-04-08T17:03:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"OpenSpatial supplies a principled open-source data engine and 3-million-sample dataset that raises spatial-reasoning model performance by an average of 19 percent on benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"nificant strides in general image and video understanding, existing LVLMs still struggle with sophisticated spatial reasoning tasks that necessitate the interpretation of intricate geometric transformations and spatial configurations. To enhance the spatial intelligence of Large Vision-Language Models (LVLMs), existing research has diverged into architectural augmentation, large-scale dataset curation [17,50,51], and advanced training paradigms [32]. Architecturally, models such as Spatial-MLLM [46], VLM-3R [18], and 3DThinker [11] incorporate geometric priors via external 3D encoders, while SpatialBot [6] and VILASR [47] utilize external tools for depth estimation and ground perception. 
Simultaneously, the field has transitioned toward data-driven scaling; SpatialVLM [9] and"},{"citing_arxiv_id":"2604.02870","ref_index":104,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Token Warping Helps MLLMs Look from Nearby Viewpoints","primary_cat":"cs.CV","submitted_at":"2026-04-03T08:37:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Backward token warping in ViT-based MLLMs enables reliable reasoning from nearby viewpoints by preserving semantic coherence better than pixel-wise warping or fine-tuning baselines.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"viewpoints while accurately transferring fine-grained details from the observed viewpoint. Data. To construct source-target viewpoint pairs for generating VQAs, we collect image pairs captured from adjacent viewpoints with overlapping fields of view, drawn from real-world scans in ScanNet [19]. The collected pairs are divided into difficulty levels based on their overlap ratios [104], which reflect the amount of shared content between the two views. For each pair, one viewpoint is designated as the source, with image I𝑆 and pose Π𝑆, and the other as the target, with image I𝑇 and pose Π𝑇. 
We then generate a question𝑄 answerable only from the target viewpoint, using information available in the source view together with an instruction describing the relative pose change between the two viewpoints."},{"citing_arxiv_id":"2512.23365","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SpatialMosaic: A Multiview VLM Dataset for Partial Visibility","primary_cat":"cs.CV","submitted_at":"2025-12-29T10:48:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SpatialMosaic introduces a 2M-pair multi-view QA dataset and 1M-pair benchmark for MLLMs on spatial reasoning under partial visibility, plus a hybrid baseline that integrates 3D reconstruction models as geometry encoders.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.17012","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation","primary_cat":"cs.CV","submitted_at":"2025-12-18T19:13:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"4D-RGPT uses perceptual 4D distillation to boost region-level 4D perception in multimodal LLMs and reports gains on existing and new video QA benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.04670","ref_index":142,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Cambrian-S: Towards Spatial Supersensing in Video","primary_cat":"cs.CV","submitted_at":"2025-11-06T18:55:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Cambrian-S introduces VSI-SUPER benchmarks for long-horizon spatial recall and counting, shows data scaling yields 30% 
gains on existing tests, and demonstrates a self-supervised next-latent predictor using surprise outperforms baselines on the new spatial supersensing tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"bench: Towards benchmarking continuous improvement of language agents. InNeurIPS, 2024. [140] Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding. InNeurIPS, 2024. [141] xAI. Grok-1.5 Vision Preview. https://x.ai/blog/grok-1-5v , April 2024. RealworldQA, Blog post, Announced on April 12, 2024. [142] Runsen Xu, Weiyao Wang, Hao Tang, Xingyu Chen, Xiaodong Wang, Fu-Jen Chu, Dahua Lin, Matt Feiszli, and Kevin J Liang. Multi-spatialmllm: Multi-frame spatial understanding with multi-modal large language models.arXiv preprint arXiv:2505.17015, 2025. [143] Yuanzhong Xu, HyoukJoong Lee, Dehao Chen, Blake Hechtman, Yanping Huang, Rahul Joshi, Maxim Krikun, Dmitry Lepikhin, Andy Ly, Marcello Maggioni, et al."}],"limit":50,"offset":0}