arxiv: 2605.22536 · v1 · pith:VJONWMA7new · submitted 2026-05-21 · 💻 cs.CV · cs.CL

SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation

Pith reviewed 2026-05-22 06:20 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords spatial reasoning · visual degradation · multimodal large language models · 3D Gaussian Splatting · robustness · VQA benchmark · degradation synthesis · indoor scenes

The pith

Visual degradations substantially impair spatial reasoning in multimodal models, but finetuning on a new synthesized dataset restores robustness and can exceed human performance without harming clean-image results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs SpaceDG, a dataset of nearly one million question-answer pairs drawn from indoor scenes rendered with controlled visual degradations. It uses this data to test twenty-five different multimodal models and documents consistent drops in accuracy on spatial tasks whenever inputs contain blur, low light, weather effects, or similar imperfections. The authors then show that further training on the same dataset produces models that handle degraded views better than before while keeping original accuracy on pristine images. A reader would care because real cameras almost always introduce some form of degradation, so any spatial intelligence meant for practical use must remain reliable under those conditions. The central result is therefore evidence that targeted exposure to realistic degradations during training can close the observed robustness gap.

Core claim

By embedding nine types of degradation formation directly into the 3D Gaussian Splatting rendering pipeline, the authors generate a large collection of spatially grounded question-answer pairs that reflect imperfect visual observations; evaluation on the resulting human-verified benchmark demonstrates that every tested model loses substantial spatial reasoning accuracy under these conditions, while subsequent finetuning on the full SpaceDG collection yields models whose degraded-condition performance surpasses human baselines with no measurable regression on clean inputs.

What carries the argument

A physically grounded degradation synthesis engine that integrates degradation formation processes into 3D Gaussian Splatting rendering to produce realistic impaired images paired with spatial questions.
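
As one illustration of what "physically grounded" can mean, the sketch below applies the standard atmospheric scattering model, I = J·t + A·(1 − t) with t = exp(−β·d), to a rendered image using per-pixel depth. This page only states that degradation formation is embedded in 3DGS rendering; the choice of this model, the function name `add_haze`, and the parameters `beta` and `airlight` are assumptions made for illustration and are not the authors' implementation.

```python
import math

def add_haze(image, depth, beta=0.8, airlight=0.9):
    """Apply depth-dependent haze to a grayscale image.

    image: rows of pixel intensities in [0, 1] (e.g. a clean render)
    depth: rows of per-pixel depths in meters, same shape as image
    beta, airlight: hypothetical scattering coefficient and ambient light level
    """
    hazy = []
    for img_row, depth_row in zip(image, depth):
        out_row = []
        for j, d in zip(img_row, depth_row):
            t = math.exp(-beta * d)  # transmission falls off with depth
            out_row.append(j * t + airlight * (1.0 - t))
        hazy.append(out_row)
    return hazy

# Toy 2x2 "render": uniform intensity, increasing depth.
clean = [[0.2, 0.2], [0.2, 0.2]]
depth = [[0.0, 1.0], [2.0, 5.0]]
hazy = add_haze(clean, depth)
```

Because the output depends on depth, pixels that look identical in the clean render end up with different intensities. A purely 2D filter has no access to that geometry, which is presumably why a renderer with known depth is attractive for this kind of synthesis.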

If this is right

  • Spatial reasoning benchmarks that use only pristine images will systematically overestimate model capability for real deployment.
  • Training procedures that include synthesized degradations can produce spatial intelligence that remains effective when visual quality is imperfect.
  • Different degradation types affect different categories of spatial reasoning at different rates, so targeted data collection can address specific weaknesses.
  • Models trained this way can reach or exceed human accuracy on degraded spatial tasks while retaining full performance on clean inputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same synthesis approach could be applied to generate training data for other multimodal reasoning tasks that also rely on visual detail.
  • Future benchmarks in related areas such as navigation or manipulation should adopt similar degradation-aware construction to avoid hidden robustness gaps.
  • If the simulation-to-reality gap proves small, this method offers a scalable way to create large volumes of labeled degraded data without needing expensive real-world captures.

Load-bearing premise

The simulated degradations created inside the 3D rendering process are close enough to real camera and environmental effects that performance measured on them predicts performance on actual deployed hardware.

What would settle it

Run the finetuned models on a fresh collection of real photographs taken under motion blur, low light, or rain and compare their spatial reasoning scores to both the synthetic-degradation results and to human scores on the same real images.
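
The comparison described above could be tabulated along these lines. This is a minimal sketch, assuming per-question correctness records are available for both synthetic and real images; the function names, the degradation keys, and all values are placeholders and not results from the paper.

```python
def accuracy(results):
    """Fraction of correct answers; results is a list of booleans."""
    return sum(results) / len(results)

def sim_to_real_gap(synthetic, real):
    """Per-degradation accuracy on synthetic and real images, plus their difference."""
    report = {}
    for name in synthetic:
        acc_syn = accuracy(synthetic[name])
        acc_real = accuracy(real[name])
        report[name] = {"synthetic": acc_syn, "real": acc_real, "gap": acc_syn - acc_real}
    return report

# Placeholder outcomes (True = correct answer), purely illustrative.
synthetic = {
    "motion_blur": [True, True, False, True],
    "low_light": [True, False, False, True],
}
real = {
    "motion_blur": [True, False, False, True],
    "low_light": [True, False, False, True],
}
report = sim_to_real_gap(synthetic, real)
```

A gap close to zero for every degradation type would support the load-bearing premise; a consistently positive gap would suggest that synthetic scores overstate real-world robustness.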

Figures

Figures reproduced from arXiv: 2605.22536 by Hongjie Zhang, Jiarui Li, Le Ma, Muyao Niu, Qiyue Zhao, Xiaolong Zhou, Xue Yang, Yifei Liu, Yuanyuan Gao, Zhihang Zhong, Ziyang Gong.

Figure 1: Overview of SpaceDG. Top-left: Nine physically-realistic degradations across four categories. Top-right: Three spatial task groups covering camera-centric, camera-object, and object-centric questions. Mid-left: 3DGS-based degradation data engine. Bottom-right: Performance comparison of representative models against human-on-clean-image and non-image baselines, reflecting the upper bound and lower bound re…
Figure 2: Four degradation categories and their specific degradation QA Examples. We show both …
Figure 3: The SpaceDG data engine. Input multi-view RGB images are first reconstructed into …
Figure 4: Distribution of SpaceDG and SpaceDG-Bench. Inner-to-outer rings show the proportion of QA pairs by view configuration, spatial task group, and degradation type. Spatial Questions Design. To guarantee a comprehensive assessment of spatial intelligence, we systematically design 11 distinct question categories categorized into single-view and multi-view settings. These tasks evaluate three fundamental aspects:…
Figure 5: Per-degradation performance of representative models. Visual degradation consistently impairs spatial reasoning. As shown in …
Figure 6: Degradation-wise correlation analysis. Sensitivity of spatial intelligence is measured by …
Figure 7: The four identified errors caused by visual degradations. We provide the correct answer, …
Figure 8: Two-stage prompt template for degradation-guided spatial reasoning.
Figure 9: Two-stage prompt template for degradation-guided spatial reasoning.
Figure 10: MLLM filter prompt template. … reviewers can navigate through samples, filter cases by task or keyword, and directly edit the question or answer when ambiguity, formatting issues, or incorrect labels are observed. All modifications are saved back to the new benchmark file, enabling an efficient and traceable review process for improving the quality and consistency of the final evaluation set …
Figure 11: Human review interface for SpaceDG-Bench.
Figure 12: Complete QA example of distortion.
Figure 13: Complete QA example of defocus. Motion Blur, Obj. Size Comparison Question: In image A, the object described as “light grey rectangular pillow with subtle horizontal stitching pattern” is the pillow. In image B, the object described as “dark grey fabric bed with brown pillow, positioned next to white wardrobe with wooden trim” is the bed. Which object is taller (consider the top of the objects)? pillow or…
Figure 14: Complete QA example of motion blur.
Figure 15: Complete QA example of water droplets. Haze, Camera Translation Question: Based on the images, decide whether the camera translation from view A to view B exceeds 0.83 meters. Compute the translation distance as the Euclidean norm of the translation vector between the two camera centers (in meters). Return ONLY one token: 'yes' or 'no'. Output format: <answer>yes</answer> or <answer>no</answer>. Ground Tru…
Figure 16: Complete QA example of haze.
Figure 17: Complete QA example of low light. Over Exposure, Cam.-Obj. Cardinal Direction Question: In image A, the object described as "cardboard box with red outline, rectangular shape, flat panels, reinforced edges, standing upright, no visible text or logos" is the box. In image B, the object described as "white grid-patterned paper with rectangular cells, mounted on a surface, likely a planning or scheduling boa…
Figure 18: Complete QA example of over exposure.
Figure 19: Complete QA example of low resolution. JPEG, Counting Question: What is the total count of jackets shown? Answer with only a single integer number. Ground Truth: 1 Model clean condition answer: 1 DG Reasoning: The image exhibits significant blocky artifacts and color banding characteristic of low-quality image compression. <degradation> jpeg_compression </degradation>. Scanning the room, there is a black…
Figure 20: Complete QA example of JPEG compression.
Original abstract

Multimodal Large Language Models (MLLMs) have made rapid progress in spatial intelligence, yet existing spatial reasoning benchmarks largely assume pristine visual inputs and overlook the degradations that commonly occur in real-world deployment, such as motion blur, low light, adverse weather, lens distortion, and compression artifacts. This raises a fundamental question: how robust is the spatial intelligence of current MLLMs when visual observations are imperfect? To answer this question, we introduce SpaceDG, the first large-scale dataset for degradation-aware spatial understanding. It is constructed with a physically grounded degradation synthesis engine that embeds the degradation formation process into 3D Gaussian Splatting (3DGS) rendering, enabling realistic simulation of nine degradation types. The resulting dataset contains approximately 1M QA pairs from nearly 1,000 indoor scenes. We further introduce SpaceDG-Bench, a human-verified benchmark with 1,102 questions spanning 11 reasoning categories and 9 visual degradation types, yielding over 10K VQA instances. Evaluating 25 open- and closed-source MLLMs reveals that visual degradations consistently and substantially impair spatial reasoning, exposing a critical robustness gap. Finally, we show that finetuning on SpaceDG markedly improves degradation robustness and can even surpass human performance under degraded conditions without any performance drop on clean images, highlighting the promise of degradation-aware training for robust spatial intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces SpaceDG, a large-scale dataset of approximately 1M QA pairs from nearly 1,000 indoor scenes for evaluating MLLM spatial reasoning under visual degradations. It uses a physically grounded synthesis engine that embeds nine degradation types (motion blur, low light, weather, distortion, compression, etc.) into 3D Gaussian Splatting rendering. SpaceDG-Bench is a human-verified subset with 1,102 questions spanning 11 reasoning categories and 9 degradations, yielding over 10K VQA instances. Evaluation of 25 open- and closed-source MLLMs shows consistent and substantial performance impairment under degradations. Finetuning on SpaceDG improves robustness and can surpass human performance on degraded inputs without regression on clean images.

Significance. If the synthetic degradations are representative of real-world camera and environmental conditions, the work identifies a critical robustness gap in current MLLMs for spatial tasks and demonstrates a practical mitigation via degradation-aware finetuning that preserves clean-image performance. Notable strengths include the dataset scale, human verification on 1,102 questions, and the no-regression finetuning result. These elements could inform future benchmark construction and training strategies for robust multimodal spatial intelligence.

major comments (1)
  1. [Abstract and synthesis engine] Abstract and degradation synthesis description: the central claims of consistent impairment across 25 MLLMs and finetuning gains (including surpassing humans) rest on the assumption that the 3DGS-embedded synthesis of nine degradations produces simulations representative of real deployment conditions. No quantitative validation is reported, such as perceptual metrics (e.g., LPIPS), distribution matching to real camera captures, or downstream task parity checks. Without this, the observed robustness gap and transfer results cannot be reliably interpreted as applying to actual camera/environmental degradations.
minor comments (2)
  1. [Evaluation section] The abstract states consistent impairment and finetuning gains but provides no error bars, confidence intervals, or statistical significance tests for the 25-model evaluation results.
  2. [Methods] Exact synthesis parameters for the nine degradation types (e.g., blur kernel sizes, noise levels, compression ratios) and any controls for rendering artifacts are not detailed, limiting reproducibility.
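
The first minor comment asks for uncertainty estimates. One common way to attach an interval to a clean-versus-degraded accuracy drop is a paired percentile bootstrap over questions. The sketch below is an assumption about how such an analysis could look, not a procedure taken from the paper; the function name `bootstrap_drop_ci` and the correctness vectors are hypothetical.

```python
import random

def bootstrap_drop_ci(clean, degraded, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the accuracy drop (clean minus degraded).

    clean, degraded: lists of 0/1 correctness for the same questions, same order.
    Questions are resampled jointly so the pairing is preserved.
    """
    rng = random.Random(seed)
    n = len(clean)
    drops = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        drops.append(sum(clean[i] - degraded[i] for i in idx) / n)
    drops.sort()
    lo = drops[int((alpha / 2) * n_resamples)]
    hi = drops[int((1 - alpha / 2) * n_resamples) - 1]
    point = (sum(clean) - sum(degraded)) / n
    return point, lo, hi

# Placeholder correctness vectors, not data from the paper.
clean = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1] * 10
degraded = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1] * 10
point, lo, hi = bootstrap_drop_ci(clean, degraded)
```

Resampling questions rather than the two conditions independently keeps each question's clean and degraded outcomes together, which matters here because every benchmark question is evaluated under every condition.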

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and for identifying this key assumption underlying our central claims. We address the concern about validation of the synthetic degradations point by point below and outline concrete revisions.

Point-by-point responses
  1. Referee: [Abstract and synthesis engine] Abstract and degradation synthesis description: the central claims of consistent impairment across 25 MLLMs and finetuning gains (including surpassing humans) rest on the assumption that the 3DGS-embedded synthesis of nine degradations produces simulations representative of real deployment conditions. No quantitative validation is reported, such as perceptual metrics (e.g., LPIPS), distribution matching to real camera captures, or downstream task parity checks. Without this, the observed robustness gap and transfer results cannot be reliably interpreted as applying to actual camera/environmental degradations.

    Authors: We agree that explicit quantitative validation would strengthen the link between our synthetic degradations and real-world camera/environmental conditions. Our synthesis engine is constructed by embedding the physical formation processes of each degradation (e.g., integrating motion trajectories for blur, photon-shot noise models for low light, and atmospheric scattering for weather) directly into the differentiable 3DGS rendering pipeline rather than applying post-hoc 2D filters. This design choice was intended to produce degradations that respect scene geometry and lighting. Nevertheless, the submitted manuscript does not report perceptual metrics such as LPIPS or SSIM against matched real captures, nor formal distribution matching or downstream parity checks. In the revised manuscript we will add a dedicated validation subsection (and corresponding appendix) that includes: (1) LPIPS and SSIM comparisons on a held-out set of real degraded captures we collected for four of the nine degradation types; (2) feature-distribution matching using CLIP and DINOv2 embeddings between synthetic and real degraded images; and (3) a small-scale parity experiment showing that MLLM error patterns on our synthetic degradations correlate with those observed on the real captures. We will also explicitly discuss remaining limitations, such as the difficulty of perfectly replicating sensor-specific noise or rare compound degradations. These additions will allow readers to assess the degree of realism more rigorously while preserving the scale and controlled nature of the benchmark.

    Revision: yes
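
The parity experiment in item (3) of the response amounts to checking whether error patterns on synthetic and real degradations move together. A rank correlation is one plausible way to quantify that; the sketch below implements Spearman's coefficient from scratch. The choice of Spearman, the helper names `ranks` and `spearman`, and the error rates are assumptions for illustration, since the response does not specify a statistic.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Placeholder per-task error rates, not figures from the paper.
synthetic_err = [0.10, 0.25, 0.40, 0.55]
real_err = [0.12, 0.30, 0.35, 0.60]
rho = spearman(synthetic_err, real_err)
```

A coefficient near 1 would indicate that the tasks hurt most by synthetic degradation are also the ones hurt most by real degradation, even if the absolute error rates differ.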

Circularity Check

0 steps flagged

No circularity: purely empirical dataset construction, benchmarking, and finetuning experiments

full rationale

The paper constructs SpaceDG via a 3DGS-embedded degradation synthesis method, creates SpaceDG-Bench with human-verified questions, evaluates 25 MLLMs to show robustness gaps, and reports finetuning results that improve degraded performance without clean-image regression. No equations, predictions, or first-principles derivations are present that reduce by construction to fitted inputs, self-definitions, or self-citation chains. The central claims rest on experimental outcomes from synthetic data generation and model training, which are externally falsifiable and do not rely on any load-bearing self-referential step. The realism of the degradation engine is a methodological assumption subject to independent validation, not a circularity issue.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the realism of synthetic degradations and the validity of generated QA pairs for measuring spatial intelligence; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: The degradation synthesis engine embedding formation processes into 3D Gaussian Splatting produces realistic simulations matching real-world visual degradations.
    Invoked to justify the dataset as a proxy for deployment conditions.

pith-pipeline@v0.9.0 · 5809 in / 1374 out tokens · 44848 ms · 2026-05-22T06:20:39.425685+00:00 · methodology

Reference graph

Works this paper leans on

86 extracted references · 86 canonical work pages · 4 internal anchors

  1. [1]

    Visual instruction tuning, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023

  2. [2]

    Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

    Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence.arXiv preprint arXiv:2505.23747, 2025

  3. [3]

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky.π0: A visio...

  4. [4]

    Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

    Jihan Yang, Shusheng Yang, Anjali Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces.arXiv preprint arXiv:2412.14171, 2024

  5. [5]

    Visual embodied brain: Let multimodal large language models see, think, and control in spaces.arXiv preprint arXiv:2506.00123, 2025

    Gen Luo, Ganlin Yang, Ziyang Gong, Guanzhou Chen, Haonan Duan, Erfei Cui, Ronglei Tong, Zhi Hou, Tianyi Zhang, Zhe Chen, et al. Visual embodied brain: Let multimodal large language models see, think, and control in spaces.arXiv preprint arXiv:2506.00123, 2025

  6. [6]

    Robotic visual instruction

    Yanbang Li, Ziyang Gong, Haoyang Li, Xiaoqi Huang, Haolan Kang, Guangping Bai, and Xianzheng Ma. Robotic visual instruction. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 12155–12165, 2025

  7. [7]

    Mmsi-bench: A benchmark for multi-image spatial intelligence

    Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, Dahua Lin, Tai Wang, and Jiangmiao Pang. Mmsi-bench: A benchmark for multi-image spatial intelligence. InICLR, 2025

  8. [8]

    Dsi-bench: A benchmark for dynamic spatial intelligence, 2025

    Ziang Zhang, Zehan Wang, Guanghao Zhang, Weilong Dai, Yan Xia, Ziang Yan, Minjie Hong, and Zhou Zhao. Dsi-bench: A benchmark for dynamic spatial intelligence, 2025

  9. [9]

    Cambrian-S: Towards Spatial Supersensing in Video

    Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis Brown, Zihao Yang, Yue Yu, Shengbang Tong, Zihan Zheng, Yifan Xu, Muhan Wang, Daohan Lu, Rob Fergus, Yann LeCun, Li Fei-Fei, and Saining Xie. Cambrian-s: Towards spatial supersensing in video.arXiv preprint arXiv:2511.04670, 2025

  10. [10]

    Omnispatial: Towards comprehensive spatial reasoning benchmark for vision language models, 2026

    Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, and Li Yi. Omnispatial: Towards comprehensive spatial reasoning benchmark for vision language models, 2026

  11. [11]

    Mindcube: Spatial mental modeling from limited views, 2026

    Qineng Wang, Baiqiao Yin, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, Saining Xie, Jiajun Wu, Li Fei-Fei, and Manling Li. Mindcube: Spatial mental modeling from limited views, 2026

  12. [12]

    Viewspatial-bench: Evaluating multi-perspective spatial localization in vision-language models, 2025

    Dingming Li, Hongxing Li, Zixuan Wang, Yuchen Yan, Hang Zhang, Siqi Chen, Guiyang Hou, Shengpei Jiang, Wenqi Zhang, Yongliang Shen, Weiming Lu, and Yueting Zhuang. Viewspatial-bench: Evaluating multi-perspective spatial localization in vision-language models, 2025. 10

  13. [13]

    Stepping vlms onto the court: Benchmarking spatial intelligence in sports.arXiv preprint arXiv:2603.09896, 2026

    Yuchen Yang, Yuqing Shao, Duxiu Huang, Linfeng Dong, Yifei Liu, Suixin Tang, Xiang Zhou, Yuanyuan Gao, Wei Wang, Yue Zhou, et al. Stepping vlms onto the court: Benchmarking spatial intelligence in sports.arXiv preprint arXiv:2603.09896, 2026

  14. [14]

    Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

    Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14455–14465, June 2024

  15. [15]

    Visual spatial tuning.arXiv preprint arXiv:2511.05491, 2025

    Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wenqian Wang, et al. Visual spatial tuning.arXiv preprint arXiv:2511.05491, 2025

  16. [16]

    Scaling spatial intelligence with multimodal foundation models.arXiv preprint arXiv:2511.13719, 2025

    Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu, Yubo Wang, Wanqi Yin, Zhitao Yang, Chen Wei, Qingping Sun, Tongxi Zhou, Jiaqi Li, Hui En Pang, Oscar Qian, Yukun Wei, Zhiqian Lin, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Xiangyu Fan, Hanming Deng, Lewei Lu, Liang Pan, Bo Li, Ziwei Liu, Quan Wang, Dahua Lin, and Lei Yang. Scaling spa...

  17. [17]

    Deep video deblurring for hand-held cameras

    Shuochen Su, Mauricio Delbracio, Jue Wang, Guillermo Sapiro, Wolfgang Heidrich, and Oliver Wang. Deep video deblurring for hand-held cameras. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1279–1288, 2017

  18. [18]

    Deep multi-scale convolutional neural network for dynamic scene deblurring

    Seungjun Nah, Tae Hyun Kim, and Kyoung Mu Lee. Deep multi-scale convolutional neural network for dynamic scene deblurring. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3883–3891, 2017

  19. [19]

    Efficient spatio-temporal recurrent neural network for video deblurring

    Zhihang Zhong, Ye Gao, Yinqiang Zheng, and Bo Zheng. Efficient spatio-temporal recurrent neural network for video deblurring. InEuropean conference on computer vision, pages 191–207. Springer, 2020

  20. [20]

    Real-world video deblurring: A benchmark dataset and an efficient recurrent neural network.International Journal of Computer Vision, 131(1):284–301, 2023

    Zhihang Zhong, Ye Gao, Yinqiang Zheng, Bo Zheng, and Imari Sato. Real-world video deblurring: A benchmark dataset and an efficient recurrent neural network.International Journal of Computer Vision, 131(1):284–301, 2023

  21. [21]

    Blur interpolation transformer for real-world motion from blur

    Zhihang Zhong, Mingdeng Cao, Xiang Ji, Yinqiang Zheng, and Imari Sato. Blur interpolation transformer for real-world motion from blur. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5713–5723, 2023

  22. [22]

    Image super-resolution using deep convolutional networks.IEEE transactions on pattern analysis and machine intelligence, 38(2):295–307, 2015

    Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using deep convolutional networks.IEEE transactions on pattern analysis and machine intelligence, 38(2):295–307, 2015

  23. [23]

    Photo-realistic single image super- resolution using a generative adversarial network

    Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super- resolution using a generative adversarial network. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4681–4690, 2017

  24. [24]

    Esrgan: Enhanced super-resolution generative adversarial networks

    Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. Esrgan: Enhanced super-resolution generative adversarial networks. InProceedings of the European conference on computer vision (ECCV) workshops, pages 0–0, 2018

  25. [25]

    Transformer for single image super-resolution

    Zhisheng Lu, Juncheng Li, Hong Liu, Chaoyan Huang, Linlin Zhang, and Tieyong Zeng. Transformer for single image super-resolution. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 457–466, 2022

  26. [26]

    Deep shutter unrolling network

    Peidong Liu, Zhaopeng Cui, Viktor Larsson, and Marc Pollefeys. Deep shutter unrolling network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5941–5949, 2020

  27. [27]

    Towards rolling shutter correction and deblurring in dynamic scenes

    Zhihang Zhong, Yinqiang Zheng, and Imari Sato. Towards rolling shutter correction and deblurring in dynamic scenes. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9219–9228, 2021

  28. [28]

    Learning adaptive warping for real-world rolling shutter correction

    Mingdeng Cao, Zhihang Zhong, Jiahao Wang, Yinqiang Zheng, and Yujiu Yang. Learning adaptive warping for real-world rolling shutter correction. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17785–17793, 2022

  29. [29]

    Learning to see in the dark

    Chen Chen, Qifeng Chen, Jia Xu, and Vladlen Koltun. Learning to see in the dark. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3291–3300, 2018. 11

  30. [30]

    Visibility constrained wide-band illumina- tion spectrum design for seeing-in-the-dark

    Muyao Niu, Zhuoxiao Li, Zhihang Zhong, and Yinqiang Zheng. Visibility constrained wide-band illumina- tion spectrum design for seeing-in-the-dark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13976–13985, 2023

  31. [31]

    Nir-assisted video enhancement via unpaired 24-hour data

    Muyao Niu, Zhihang Zhong, and Yinqiang Zheng. Nir-assisted video enhancement via unpaired 24-hour data. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10778–10788, 2023

  32. [32]

    Single image haze removal using dark channel prior.IEEE transactions on pattern analysis and machine intelligence, 33(12):2341–2353, 2010

    Kaiming He, Jian Sun, and Xiaoou Tang. Single image haze removal using dark channel prior.IEEE transactions on pattern analysis and machine intelligence, 33(12):2341–2353, 2010

  33. [33]

    Clearing the skies: A deep network architecture for single-image rain removal.IEEE Transactions on Image Processing, 26(6):2944– 2956, 2017

    Xueyang Fu, Jiabin Huang, Xinghao Ding, Yinghao Liao, and John Paisley. Clearing the skies: A deep network architecture for single-image rain removal.IEEE Transactions on Image Processing, 26(6):2944– 2956, 2017

  34. [34]

    Robust-r1: Degradation-aware reasoning for robust visual understanding

    Jiaqi Tang, Jianmin Chen, Wei Wei, Xiaogang Xu, Runtao Liu, Xiangyu Wu, Qipeng Xie, Jiafei Wu, Lei Zhang, and Qifeng Chen. Robust-r1: Degradation-aware reasoning for robust visual understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, 2026

  35. [35]

    Vlm-robustbench: A comprehensive benchmark for robustness of vision-language models, 2026

    Rohit Saxena, Alessandro Suglia, and Pasquale Minervini. Vlm-robustbench: A comprehensive benchmark for robustness of vision-language models, 2026

  36. [36]

    3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics, 42(4), July 2023

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics, 42(4), July 2023

  37. [37]

    Holi-spatial: Evolving video streams into holistic 3d spatial intelligence, 2026

    Yuanyuan Gao, Hao Li, Yifei Liu, Xinhao Ji, Yuning Gong, Yuanjun Liao, Fangfu Liu, Manyuan Zhang, Yuchen Yang, Dan Xu, Xue Yang, Huaxi Huang, Hongjie Zhang, Ziwei Liu, Xiao Sun, Dingwen Zhang, and Zhihang Zhong. Holi-spatial: Evolving video streams into holistic 3d spatial intelligence, 2026

  38. [38]

    Scannet++: A high-fidelity dataset of 3d indoor scenes

    Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d indoor scenes. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 12–22, 2023

  39. [39]

    Qwen3.5: Towards native multimodal agents, February 2026

    Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026

  40. [40]

    Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency, 2025

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, Zhaokai Wang, Zhe Chen, Hongjie Zhang, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency, 2025

  41. [41]

    Kimi-VL Technical Report

    Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report. arXiv preprint arXiv:2504.07491, 2025

  42. [42]

    Mimo-vl technical report, 2025

    LLM-Core-Team Xiaomi. Mimo-vl technical report, 2025

  43. [43]

    Spatialrgpt: Grounded spatial reasoning in vision language models, 2024

    An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spatialrgpt: Grounded spatial reasoning in vision language models, 2024

  44. [44]

    Mm-spatial: Exploring 3d spatial understanding in multimodal llms, 2025

    Erik Daxberger, Nina Wenzel, David Griffiths, Haiming Gang, Justin Lazarow, Gefen Kohavi, Kai Kang, Marcin Eichner, Yinfei Yang, Afshin Dehghan, and Peter Grasch. Mm-spatial: Exploring 3d spatial understanding in multimodal llms, 2025

  45. [45]

    Vlm4d: Towards spatiotemporal awareness in vision language models

    Shijie Zhou, Alexander Vilesov, Xuehai He, Ziyu Wan, Shuwang Zhang, Aditya Nagachandra, Di Chang, Dongdong Chen, Eric Xin Wang, and Achuta Kadambi. Vlm4d: Towards spatiotemporal awareness in vision language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8600–8612, 2025

  46. [46]

    Benchmarking neural network robustness to common corruptions and perturbations, 2019

    Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations, 2019

  47. [47]

    On the robustness of large multimodal models against image adversarial attacks, 2023

    Xuanming Cui, Alejandro Aparcedo, Young Kyun Jang, and Ser-Nam Lim. On the robustness of large multimodal models against image adversarial attacks, 2023

  48. [48]

    Analysing the robustness of vision-language-models to common corruptions, 2025

    Muhammad Usama, Syeda Aishah Asim, Syed Bilal Ali, Syed Talal Wasim, and Umair Bin Mansoor. Analysing the robustness of vision-language-models to common corruptions, 2025

  49. [49]

    Zhiyuan Fan, Yumeng Wang, Sandeep Polisetty, and Yi R. Fung. V2r-bench: Holistically evaluating lvlm robustness to fundamental visual variations, 2025

  50. [50]

    Scaffold-gs: Structured 3d gaussians for view-adaptive rendering

    Tao Lu, Mulin Yu, Linning Xu, Yuanbo Xiangli, Limin Wang, Dahua Lin, and Bo Dai. Scaffold-gs: Structured 3d gaussians for view-adaptive rendering. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20654–20664, 2024

  51. [51]

    Octree-gs: Towards consistent real-time rendering with lod-structured 3d gaussians

    Kerui Ren, Lihan Jiang, Tao Lu, Mulin Yu, Linning Xu, Zhangkai Ni, and Bo Dai. Octree-gs: Towards consistent real-time rendering with lod-structured 3d gaussians. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–15, 2025

  52. [52]

    Mip-splatting: Alias-free 3d gaussian splatting

    Zehao Yu, Anpei Chen, Binbin Huang, Torsten Sattler, and Andreas Geiger. Mip-splatting: Alias-free 3d gaussian splatting. Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  53. [53]

    Proxy-gs: Unified occlusion priors for training and inference in structured 3d gaussian splatting

    Yuanyuan Gao, Yuning Gong, Yifei Liu, Li Jingfeng, Dingwen Zhang, Yanci Zhang, Dan Xu, Xiao Sun, and Zhihang Zhong. Proxy-gs: Unified occlusion priors for training and inference in structured 3d gaussian splatting. arXiv preprint arXiv:2509.24421, 2025

  54. [54]

    Pgsr: Planar-based gaussian splatting for efficient and high-fidelity surface reconstruction

    Danpeng Chen, Hai Li, Weicai Ye, Yifan Wang, Weijian Xie, Shangjin Zhai, Nan Wang, Haomin Liu, Hujun Bao, and Guofeng Zhang. Pgsr: Planar-based gaussian splatting for efficient and high-fidelity surface reconstruction. arXiv preprint arXiv:2406.06521, 2024

  55. [55]

    Citygaussian: Real-time high-quality large-scale scene rendering with gaussians

    Yang Liu, Chuanchen Luo, Lue Fan, Naiyan Wang, Junran Peng, and Zhaoxiang Zhang. Citygaussian: Real-time high-quality large-scale scene rendering with gaussians. In European Conference on Computer Vision, pages 265–282. Springer, 2025

  56. [56]

    Citygaussianv2: Efficient and geometrically accurate reconstruction for large-scale scenes

    Yang Liu, Chuanchen Luo, Zhongkai Mao, Junran Peng, and Zhaoxiang Zhang. Citygaussianv2: Efficient and geometrically accurate reconstruction for large-scale scenes. In The Thirteenth International Conference on Learning Representations, 2025

  57. [57]

    Citygs-x: A scalable architecture for efficient and geometrically accurate large-scale scene reconstruction

    Yuanyuan Gao, Hao Li, Jiaqi Chen, Zhengyu Zou, Zhihang Zhong, Dingwen Zhang, Xiao Sun, and Junwei Han. Citygs-x: A scalable architecture for efficient and geometrically accurate large-scale scene reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 27187–27196, October 2025

  58. [58]

    Vastgaussian: Vast 3d gaussians for large scene reconstruction

    Jiaqi Lin, Zhihao Li, Xiao Tang, Jianzhuang Liu, Shiyong Liu, Jiayue Liu, Yangdi Lu, Xiaofei Wu, Songcen Xu, Youliang Yan, and Wenming Yang. Vastgaussian: Vast 3d gaussians for large scene reconstruction. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5166–5175, 2024

  59. [59]

    Compact 3d gaussian representation for radiance field

    Joo Chan Lee, Daniel Rho, Xiangyu Sun, Jong Hwan Ko, and Eunbyung Park. Compact 3d gaussian representation for radiance field. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21719–21728, 2024

  60. [60]

    Maskgaussian: Adaptive 3d gaussian representation from probabilistic masks

    Yifei Liu, Zhihang Zhong, Yifan Zhan, Sheng Xu, and Xiao Sun. Maskgaussian: Adaptive 3d gaussian representation from probabilistic masks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 681–690, June 2025

  61. [61]

    Lightgaussian: Unbounded 3d gaussian compression with 15x reduction and 200+ FPS

    Zhiwen Fan, Kevin Wang, Kairun Wen, Zehao Zhu, Dejia Xu, and Zhangyang Wang. Lightgaussian: Unbounded 3d gaussian compression with 15x reduction and 200+ FPS. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  62. [62]

    Ntire 2019 challenge on video deblurring and super-resolution: Dataset and study

    Seungjun Nah, Sungyong Baik, Seokil Hong, Gyeongsik Moon, Sanghyun Son, Radu Timofte, and Kyoung Mu Lee. Ntire 2019 challenge on video deblurring and super-resolution: Dataset and study. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1996–2005, 2019

  63. [63]

    Bad-gaussians: Bundle adjusted deblur gaussian splatting

    Lingzhe Zhao, Peng Wang, and Peidong Liu. Bad-gaussians: Bundle adjusted deblur gaussian splatting. In European Conference on Computer Vision (ECCV), 2024

  64. [64]

    Motion-aware animatable gaussian avatars deblurring, 2026

    Muyao Niu, Yifan Zhan, Qingtian Zhu, Zhuoxiao Li, Wei Wang, Zhihang Zhong, Xiao Sun, and Yinqiang Zheng. Motion-aware animatable gaussian avatars deblurring, 2026

  65. [65]

    Dp-nerf: Deblurred neural radiance field with physical scene priors

    Dogyoon Lee, Minhyeok Lee, Chajin Shin, and Sangyoun Lee. Dp-nerf: Deblurred neural radiance field with physical scene priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12386–12396, June 2023

  66. [66]

    Dof-gs: Adjustable depth-of-field 3d gaussian splatting for refocusing, defocus rendering and blur removal

    Yujie Wang, Praneeth Chakravarthula, and Baoquan Chen. Dof-gs: Adjustable depth-of-field 3d gaussian splatting for refocusing, defocus rendering and blur removal. The IEEE/CVF Computer Vision and Pattern Recognition Conference, 2025

  67. [67]

    Nerf in the dark: High dynamic range view synthesis from noisy raw images

    Ben Mildenhall, Peter Hedman, Ricardo Martin-Brualla, Pratul P. Srinivasan, and Jonathan T. Barron. Nerf in the dark: High dynamic range view synthesis from noisy raw images. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16169–16178, 2022

  68. [68]

    Physics-based noise modeling for extreme low-light photography

    Kaixuan Wei, Ying Fu, Yinqiang Zheng, and Jiaolong Yang. Physics-based noise modeling for extreme low-light photography. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11):8520–8537, 2021

  69. [69]

    Fisheye-gs: Lightweight and extensible gaussian splatting module for fisheye cameras, 2024

    Zimu Liao, Siyan Chen, Rong Fu, Yi Wang, Zhongling Su, Hao Luo, Li Ma, Linning Xu, Bo Dai, Hengjie Li, Zhilin Pei, and Xingcheng Zhang. Fisheye-gs: Lightweight and extensible gaussian splatting module for fisheye cameras, 2024

  70. [70]

    3dgut: Enabling distorted cameras and secondary rays in gaussian splatting

    Qi Wu, Janick Martinez Esturo, Ashkan Mirzaei, Nicolas Moenne-Loccoz, and Zan Gojcic. 3dgut: Enabling distorted cameras and secondary rays in gaussian splatting. Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  71. [71]

    Depth Anything 3: Recovering the Visual Space from Any Views

    Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y. Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647, 2025

  72. [72]

    Structure-from-motion revisited

    Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016

  73. [73]

    SAM 3: Segment anything with concepts

    Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris Coll-Vinent, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Z...

  74. [74]

    Internspatial: A comprehensive dataset for spatial reasoning in vision-language models

    Nianchen Deng, Lixin Gu, Shenglong Ye, Yinan He, Zhe Chen, Songze Li, Haomin Wang, Xingguang Wei, Tianshuo Yang, Min Dou, et al. Internspatial: A comprehensive dataset for spatial reasoning in vision-language models. arXiv preprint arXiv:2506.18385, 2025

  75. [75]

    MapAnything: Universal feed-forward metric 3D reconstruction

    Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, Jonathon Luiten, Manuel Lopez-Antequera, Samuel Rota Bulò, Christian Richardt, Deva Ramanan, Sebastian Scherer, and Peter Kontschieder. MapAnything: Universal feed-forward metric 3D reconstruction. I...

  76. [76]

    Heartfelt – Shadertoy

    Martijn Steinrucken. Heartfelt – Shadertoy. https://www.shadertoy.com/view/ltffzl, 2017. License: CC BY-NC-SA 3.0

  77. [77]

    Introducing GPT-5.4

    OpenAI. Introducing GPT-5.4. https://openai.com/index/introducing-gpt-5-4/, March 2026

  78. [78]

    Gemini 3.1 pro: A smarter model for your most complex tasks

    Google. Gemini 3.1 pro: A smarter model for your most complex tasks. https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/, February 2026

  79. [79]

    Gemini 3.1 flash-lite: Built for intelligence at scale

    Google. Gemini 3.1 flash-lite: Built for intelligence at scale. https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-lite/, March 2026

  80. [80]

    Introducing claude sonnet 4.6

    Anthropic. Introducing claude sonnet 4.6. https://www.anthropic.com/news/claude-sonnet-4-6, February 2026
