SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation
Pith reviewed 2026-05-22 06:20 UTC · model grok-4.3
The pith
Visual degradations substantially impair spatial reasoning in multimodal models, but finetuning on a new synthesized dataset restores robustness and can exceed human performance without harming clean-image results.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By embedding nine types of degradation formation directly into the 3D Gaussian Splatting rendering pipeline, the authors generate a large collection of spatially grounded question-answer pairs that reflect imperfect visual observations; evaluation on the resulting human-verified benchmark demonstrates that every tested model loses substantial spatial reasoning accuracy under these conditions, while subsequent finetuning on the full SpaceDG collection yields models whose degraded-condition performance surpasses human baselines with no measurable regression on clean inputs.
What carries the argument
A physically grounded degradation synthesis engine that integrates degradation formation processes into 3D Gaussian Splatting rendering to produce realistic impaired images paired with spatial questions.
If this is right
- Spatial reasoning benchmarks that use only pristine images will systematically overestimate model capability for real deployment.
- Training procedures that include synthesized degradations can produce spatial intelligence that remains effective when visual quality is imperfect.
- Different degradation types affect different categories of spatial reasoning at different rates, so targeted data collection can address specific weaknesses.
- Models trained this way can reach or exceed human accuracy on degraded spatial tasks while retaining full performance on clean inputs.
Where Pith is reading between the lines
- The same synthesis approach could be applied to generate training data for other multimodal reasoning tasks that also rely on visual detail.
- Future benchmarks in related areas such as navigation or manipulation should adopt similar degradation-aware construction to avoid hidden robustness gaps.
- If the simulation-to-reality gap proves small, this method offers a scalable way to create large volumes of labeled degraded data without needing expensive real-world captures.
Load-bearing premise
The simulated degradations created inside the 3D rendering process are close enough to real camera and environmental effects that performance measured on them predicts performance on actual deployed hardware.
What would settle it
Run the finetuned models on a fresh collection of real photographs taken under motion blur, low light, or rain and compare their spatial reasoning scores to both the synthetic-degradation results and to human scores on the same real images.
Figures
read the original abstract
Multimodal Large Language Models (MLLMs) have made rapid progress in spatial intelligence, yet existing spatial reasoning benchmarks largely assume pristine visual inputs and overlook the degradations that commonly occur in real-world deployment, such as motion blur, low light, adverse weather, lens distortion, and compression artifacts. This raises a fundamental question: how robust is the spatial intelligence of current MLLMs when visual observations are imperfect? To answer this question, we introduce SpaceDG, the first large-scale dataset for degradation-aware spatial understanding. It is constructed with a physically grounded degradation synthesis engine that embeds degradation formation process into 3D Gaussian Splatting (3DGS) rendering, enabling realistic simulation of nine degradation types. The resulting dataset contains approximately 1M QA pairs from nearly 1,000 indoor scenes. We further introduce SpaceDG-Bench, an human-verified benchmark with 1,102 questions spanning 11 reasoning categories and 9 visual degradation types, yielding over 10K VQA instances. Evaluating 25 open- and closed-source MLLMs reveals that visual degradations consistently and substantially impair spatial reasoning, exposing a critical robustness gap. Finally, we show that finetuning on SpaceDG markedly improves degradation robustness and can even surpass human performance under degraded conditions without any performance drop on clean images, highlighting the promise of degradation-aware training for robust spatial intelligence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SpaceDG, a large-scale dataset of approximately 1M QA pairs from nearly 1,000 indoor scenes for evaluating MLLM spatial reasoning under visual degradations. It uses a physically grounded synthesis engine that embeds nine degradation types (motion blur, low light, weather, distortion, compression, etc.) into 3D Gaussian Splatting rendering. SpaceDG-Bench is a human-verified subset with 1,102 questions spanning 11 reasoning categories and 9 degradations, yielding over 10K VQA instances. Evaluation of 25 open- and closed-source MLLMs shows consistent and substantial performance impairment under degradations. Finetuning on SpaceDG improves robustness and can surpass human performance on degraded inputs without regression on clean images.
Significance. If the synthetic degradations are representative of real-world camera and environmental conditions, the work identifies a critical robustness gap in current MLLMs for spatial tasks and demonstrates a practical mitigation via degradation-aware finetuning that preserves clean-image performance. Notable strengths include the dataset scale, human verification on 1,102 questions, and the no-regression finetuning result. These elements could inform future benchmark construction and training strategies for robust multimodal spatial intelligence.
major comments (1)
- [Abstract and synthesis engine] Abstract and degradation synthesis description: the central claims of consistent impairment across 25 MLLMs and finetuning gains (including surpassing humans) rest on the assumption that the 3DGS-embedded synthesis of nine degradations produces simulations representative of real deployment conditions. No quantitative validation is reported, such as perceptual metrics (e.g., LPIPS), distribution matching to real camera captures, or downstream task parity checks. Without this, the observed robustness gap and transfer results cannot be reliably interpreted as applying to actual camera/environmental degradations.
minor comments (2)
- [Evaluation section] The abstract states consistent impairment and finetuning gains but provides no error bars, confidence intervals, or statistical significance tests for the 25-model evaluation results.
- [Methods] Exact synthesis parameters for the nine degradation types (e.g., blur kernel sizes, noise levels, compression ratios) and any controls for rendering artifacts are not detailed, limiting reproducibility.
Simulated Author's Rebuttal
We thank the referee for the careful review and for identifying this key assumption underlying our central claims. We address the concern about validation of the synthetic degradations point by point below and outline concrete revisions.
read point-by-point responses
-
Referee: [Abstract and synthesis engine] Abstract and degradation synthesis description: the central claims of consistent impairment across 25 MLLMs and finetuning gains (including surpassing humans) rest on the assumption that the 3DGS-embedded synthesis of nine degradations produces simulations representative of real deployment conditions. No quantitative validation is reported, such as perceptual metrics (e.g., LPIPS), distribution matching to real camera captures, or downstream task parity checks. Without this, the observed robustness gap and transfer results cannot be reliably interpreted as applying to actual camera/environmental degradations.
Authors: We agree that explicit quantitative validation would strengthen the link between our synthetic degradations and real-world camera/environmental conditions. Our synthesis engine is constructed by embedding the physical formation processes of each degradation (e.g., integrating motion trajectories for blur, photon-shot noise models for low light, and atmospheric scattering for weather) directly into the differentiable 3DGS rendering pipeline rather than applying post-hoc 2D filters. This design choice was intended to produce degradations that respect scene geometry and lighting. Nevertheless, the submitted manuscript does not report perceptual metrics such as LPIPS or SSIM against matched real captures, nor formal distribution matching or downstream parity checks. In the revised manuscript we will add a dedicated validation subsection (and corresponding appendix) that includes: (1) LPIPS and SSIM comparisons on a held-out set of real degraded captures we collected for four of the nine degradation types; (2) feature-distribution matching using CLIP and DINOv2 embeddings between synthetic and real degraded images; and (3) a small-scale parity experiment showing that MLLM error patterns on our synthetic degradations correlate with those observed on the real captures. We will also explicitly discuss remaining limitations, such as the difficulty of perfectly replicating sensor-specific noise or rare compound degradations. These additions will allow readers to assess the degree of realism more rigorously while preserving the scale and controlled nature of the benchmark. revision: yes
Circularity Check
No circularity: purely empirical dataset construction, benchmarking, and finetuning experiments
full rationale
The paper constructs SpaceDG via a 3DGS-embedded degradation synthesis method, creates SpaceDG-Bench with human-verified questions, evaluates 25 MLLMs to show robustness gaps, and reports finetuning results that improve degraded performance without clean-image regression. No equations, predictions, or first-principles derivations are present that reduce by construction to fitted inputs, self-definitions, or self-citation chains. The central claims rest on experimental outcomes from synthetic data generation and model training, which are externally falsifiable and do not rely on any load-bearing self-referential step. The realism of the degradation engine is a methodological assumption subject to independent validation, not a circularity issue.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The degradation synthesis engine embedding formation processes into 3D Gaussian Splatting produces realistic simulations matching real-world visual degradations.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Evaluating 25 open- and closed-source MLLMs reveals that visual degradations consistently and substantially impair spatial reasoning
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Visual instruction tuning, 2023
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023
work page 2023
-
[2]
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
Diankun Wu, Fangfu Liu, Yi-Hsin Hung, and Yueqi Duan. Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence.arXiv preprint arXiv:2505.23747, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky.π0: A visio...
work page 2026
-
[4]
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
Jihan Yang, Shusheng Yang, Anjali Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces.arXiv preprint arXiv:2412.14171, 2024
work page Pith review arXiv 2024
-
[5]
Gen Luo, Ganlin Yang, Ziyang Gong, Guanzhou Chen, Haonan Duan, Erfei Cui, Ronglei Tong, Zhi Hou, Tianyi Zhang, Zhe Chen, et al. Visual embodied brain: Let multimodal large language models see, think, and control in spaces.arXiv preprint arXiv:2506.00123, 2025
-
[6]
Yanbang Li, Ziyang Gong, Haoyang Li, Xiaoqi Huang, Haolan Kang, Guangping Bai, and Xianzheng Ma. Robotic visual instruction. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 12155–12165, 2025
work page 2025
-
[7]
Mmsi-bench: A benchmark for multi-image spatial intelligence
Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, Dahua Lin, Tai Wang, and Jiangmiao Pang. Mmsi-bench: A benchmark for multi-image spatial intelligence. InICLR, 2025
work page 2025
-
[8]
Dsi-bench: A benchmark for dynamic spatial intelligence, 2025
Ziang Zhang, Zehan Wang, Guanghao Zhang, Weilong Dai, Yan Xia, Ziang Yan, Minjie Hong, and Zhou Zhao. Dsi-bench: A benchmark for dynamic spatial intelligence, 2025
work page 2025
-
[9]
Cambrian-S: Towards Spatial Supersensing in Video
Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis Brown, Zihao Yang, Yue Yu, Shengbang Tong, Zihan Zheng, Yifan Xu, Muhan Wang, Daohan Lu, Rob Fergus, Yann LeCun, Li Fei-Fei, and Saining Xie. Cambrian-s: Towards spatial supersensing in video.arXiv preprint arXiv:2511.04670, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[10]
Omnispatial: Towards comprehensive spatial reasoning benchmark for vision language models, 2026
Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, and Li Yi. Omnispatial: Towards comprehensive spatial reasoning benchmark for vision language models, 2026
work page 2026
-
[11]
Mindcube: Spatial mental modeling from limited views, 2026
Qineng Wang, Baiqiao Yin, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, Saining Xie, Jiajun Wu, Li Fei-Fei, and Manling Li. Mindcube: Spatial mental modeling from limited views, 2026
work page 2026
-
[12]
Viewspatial-bench: Evaluating multi-perspective spatial localization in vision-language models, 2025
Dingming Li, Hongxing Li, Zixuan Wang, Yuchen Yan, Hang Zhang, Siqi Chen, Guiyang Hou, Shengpei Jiang, Wenqi Zhang, Yongliang Shen, Weiming Lu, and Yueting Zhuang. Viewspatial-bench: Evaluating multi-perspective spatial localization in vision-language models, 2025. 10
work page 2025
-
[13]
Yuchen Yang, Yuqing Shao, Duxiu Huang, Linfeng Dong, Yifei Liu, Suixin Tang, Xiang Zhou, Yuanyuan Gao, Wei Wang, Yue Zhou, et al. Stepping vlms onto the court: Benchmarking spatial intelligence in sports.arXiv preprint arXiv:2603.09896, 2026
-
[14]
Spatialvlm: Endowing vision-language models with spatial reasoning capabilities
Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14455–14465, June 2024
work page 2024
-
[15]
Visual spatial tuning.arXiv preprint arXiv:2511.05491, 2025
Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wenqian Wang, et al. Visual spatial tuning.arXiv preprint arXiv:2511.05491, 2025
-
[16]
Scaling spatial intelligence with multimodal foundation models.arXiv preprint arXiv:2511.13719, 2025
Zhongang Cai, Ruisi Wang, Chenyang Gu, Fanyi Pu, Junxiang Xu, Yubo Wang, Wanqi Yin, Zhitao Yang, Chen Wei, Qingping Sun, Tongxi Zhou, Jiaqi Li, Hui En Pang, Oscar Qian, Yukun Wei, Zhiqian Lin, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Xiangyu Fan, Hanming Deng, Lewei Lu, Liang Pan, Bo Li, Ziwei Liu, Quan Wang, Dahua Lin, and Lei Yang. Scaling spa...
-
[17]
Deep video deblurring for hand-held cameras
Shuochen Su, Mauricio Delbracio, Jue Wang, Guillermo Sapiro, Wolfgang Heidrich, and Oliver Wang. Deep video deblurring for hand-held cameras. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1279–1288, 2017
work page 2017
-
[18]
Deep multi-scale convolutional neural network for dynamic scene deblurring
Seungjun Nah, Tae Hyun Kim, and Kyoung Mu Lee. Deep multi-scale convolutional neural network for dynamic scene deblurring. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3883–3891, 2017
work page 2017
-
[19]
Efficient spatio-temporal recurrent neural network for video deblurring
Zhihang Zhong, Ye Gao, Yinqiang Zheng, and Bo Zheng. Efficient spatio-temporal recurrent neural network for video deblurring. InEuropean conference on computer vision, pages 191–207. Springer, 2020
work page 2020
-
[20]
Zhihang Zhong, Ye Gao, Yinqiang Zheng, Bo Zheng, and Imari Sato. Real-world video deblurring: A benchmark dataset and an efficient recurrent neural network.International Journal of Computer Vision, 131(1):284–301, 2023
work page 2023
-
[21]
Blur interpolation transformer for real-world motion from blur
Zhihang Zhong, Mingdeng Cao, Xiang Ji, Yinqiang Zheng, and Imari Sato. Blur interpolation transformer for real-world motion from blur. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5713–5723, 2023
work page 2023
-
[22]
Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using deep convolutional networks.IEEE transactions on pattern analysis and machine intelligence, 38(2):295–307, 2015
work page 2015
-
[23]
Photo-realistic single image super- resolution using a generative adversarial network
Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super- resolution using a generative adversarial network. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4681–4690, 2017
work page 2017
-
[24]
Esrgan: Enhanced super-resolution generative adversarial networks
Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. Esrgan: Enhanced super-resolution generative adversarial networks. InProceedings of the European conference on computer vision (ECCV) workshops, pages 0–0, 2018
work page 2018
-
[25]
Transformer for single image super-resolution
Zhisheng Lu, Juncheng Li, Hong Liu, Chaoyan Huang, Linlin Zhang, and Tieyong Zeng. Transformer for single image super-resolution. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 457–466, 2022
work page 2022
-
[26]
Deep shutter unrolling network
Peidong Liu, Zhaopeng Cui, Viktor Larsson, and Marc Pollefeys. Deep shutter unrolling network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5941–5949, 2020
work page 2020
-
[27]
Towards rolling shutter correction and deblurring in dynamic scenes
Zhihang Zhong, Yinqiang Zheng, and Imari Sato. Towards rolling shutter correction and deblurring in dynamic scenes. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9219–9228, 2021
work page 2021
-
[28]
Learning adaptive warping for real-world rolling shutter correction
Mingdeng Cao, Zhihang Zhong, Jiahao Wang, Yinqiang Zheng, and Yujiu Yang. Learning adaptive warping for real-world rolling shutter correction. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17785–17793, 2022
work page 2022
-
[29]
Chen Chen, Qifeng Chen, Jia Xu, and Vladlen Koltun. Learning to see in the dark. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3291–3300, 2018. 11
work page 2018
-
[30]
Visibility constrained wide-band illumina- tion spectrum design for seeing-in-the-dark
Muyao Niu, Zhuoxiao Li, Zhihang Zhong, and Yinqiang Zheng. Visibility constrained wide-band illumina- tion spectrum design for seeing-in-the-dark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13976–13985, 2023
work page 2023
-
[31]
Nir-assisted video enhancement via unpaired 24-hour data
Muyao Niu, Zhihang Zhong, and Yinqiang Zheng. Nir-assisted video enhancement via unpaired 24-hour data. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10778–10788, 2023
work page 2023
-
[32]
Kaiming He, Jian Sun, and Xiaoou Tang. Single image haze removal using dark channel prior.IEEE transactions on pattern analysis and machine intelligence, 33(12):2341–2353, 2010
work page 2010
-
[33]
Xueyang Fu, Jiabin Huang, Xinghao Ding, Yinghao Liao, and John Paisley. Clearing the skies: A deep network architecture for single-image rain removal.IEEE Transactions on Image Processing, 26(6):2944– 2956, 2017
work page 2017
-
[34]
Robust-r1: Degradation-aware reasoning for robust visual understanding
Jiaqi Tang, Jianmin Chen, Wei Wei, Xiaogang Xu, Runtao Liu, Xiangyu Wu, Qipeng Xie, Jiafei Wu, Lei Zhang, and Qifeng Chen. Robust-r1: Degradation-aware reasoning for robust visual understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, 2026
work page 2026
-
[35]
Vlm-robustbench: A comprehensive benchmark for robustness of vision-language models, 2026
Rohit Saxena, Alessandro Suglia, and Pasquale Minervini. Vlm-robustbench: A comprehensive benchmark for robustness of vision-language models, 2026
work page 2026
-
[36]
Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics, 42(4), July 2023
work page 2023
-
[37]
Holi-spatial: Evolving video streams into holistic 3d spatial intelligence, 2026
Yuanyuan Gao, Hao Li, Yifei Liu, Xinhao Ji, Yuning Gong, Yuanjun Liao, Fangfu Liu, Manyuan Zhang, Yuchen Yang, Dan Xu, Xue Yang, Huaxi Huang, Hongjie Zhang, Ziwei Liu, Xiao Sun, Dingwen Zhang, and Zhihang Zhong. Holi-spatial: Evolving video streams into holistic 3d spatial intelligence, 2026
work page 2026
-
[38]
Scannet++: A high-fidelity dataset of 3d indoor scenes
Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d indoor scenes. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 12–22, 2023
work page 2023
-
[39]
Qwen3.5: Towards native multimodal agents, February 2026
Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026
work page 2026
-
[40]
Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency, 2025
Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, Zhaokai Wang, Zhe Chen, Hongjie Zhang, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency, 2025
work page 2025
-
[41]
Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. Kimi-vl technical report.arXiv preprint arXiv:2504.07491, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [42]
-
[43]
Spatialrgpt: Grounded spatial reasoning in vision language models, 2024
An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spatialrgpt: Grounded spatial reasoning in vision language models, 2024
work page 2024
-
[44]
Mm-spatial: Exploring 3d spatial understanding in multimodal llms, 2025
Erik Daxberger, Nina Wenzel, David Griffiths, Haiming Gang, Justin Lazarow, Gefen Kohavi, Kai Kang, Marcin Eichner, Yinfei Yang, Afshin Dehghan, and Peter Grasch. Mm-spatial: Exploring 3d spatial understanding in multimodal llms, 2025
work page 2025
-
[45]
Vlm4d: Towards spatiotemporal awareness in vision language models
Shijie Zhou, Alexander Vilesov, Xuehai He, Ziyu Wan, Shuwang Zhang, Aditya Nagachandra, Di Chang, Dongdong Chen, Eric Xin Wang, and Achuta Kadambi. Vlm4d: Towards spatiotemporal awareness in vision language models. InProceedings of the IEEE/CVF international conference on computer vision, pages 8600–8612, 2025
work page 2025
-
[46]
Benchmarking neural network robustness to common corruptions and perturbations, 2019
Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations, 2019
work page 2019
-
[47]
On the robustness of large multimodal models against image adversarial attacks, 2023
Xuanming Cui, Alejandro Aparcedo, Young Kyun Jang, and Ser-Nam Lim. On the robustness of large multimodal models against image adversarial attacks, 2023
work page 2023
-
[48]
Analysing the robustness of vision-language-models to common corruptions, 2025
Muhammad Usama, Syeda Aishah Asim, Syed Bilal Ali, Syed Talal Wasim, and Umair Bin Mansoor. Analysing the robustness of vision-language-models to common corruptions, 2025
work page 2025
-
[49]
Zhiyuan Fan, Yumeng Wang, Sandeep Polisetty, and Yi R. Fung. V2r-bench: Holistically evaluating lvlm robustness to fundamental visual variations, 2025. 12
work page 2025
-
[50]
Scaffold-gs: Structured 3d gaussians for view-adaptive rendering
Tao Lu, Mulin Yu, Linning Xu, Yuanbo Xiangli, Limin Wang, Dahua Lin, and Bo Dai. Scaffold-gs: Structured 3d gaussians for view-adaptive rendering. In2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20654–20664, 2024
work page 2024
-
[51]
Kerui Ren, Lihan Jiang, Tao Lu, Mulin Yu, Linning Xu, Zhangkai Ni, and Bo Dai. Octree-gs: Towards consistent real-time rendering with lod-structured 3d gaussians.IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–15, 2025
work page 2025
-
[52]
Zehao Yu, Anpei Chen, Binbin Huang, Torsten Sattler, and Andreas Geiger. Mip-splatting: Alias-free 3d gaussian splatting.Conference on Computer Vision and Pattern Recognition (CVPR), 2024
work page 2024
-
[53]
Yuanyuan Gao, Yuning Gong, Yifei Liu, Li Jingfeng, Dingwen Zhang, Yanci Zhang, Dan Xu, Xiao Sun, and Zhihang Zhong. Proxy-gs: Unified occlusion priors for training and inference in structured 3d gaussian splatting.arXiv preprint arXiv:2509.24421, 2025
-
[54]
Danpeng Chen, Hai Li, Weicai Ye, Yifan Wang, Weijian Xie, Shangjin Zhai, Nan Wang, Haomin Liu, Hujun Bao, and Guofeng Zhang. Pgsr: Planar-based gaussian splatting for efficient and high-fidelity surface reconstruction.arXiv preprint arXiv:2406.06521, 2024
-
[55]
Citygaussian: Real-time high-quality large-scale scene rendering with gaussians
Yang Liu, Chuanchen Luo, Lue Fan, Naiyan Wang, Junran Peng, and Zhaoxiang Zhang. Citygaussian: Real-time high-quality large-scale scene rendering with gaussians. InEuropean Conference on Computer Vision, pages 265–282. Springer, 2025
work page 2025
-
[56]
Citygaussianv2: Effi- cient and geometrically accurate reconstruction for large-scale scenes
Yang Liu, Chuanchen Luo, Zhongkai Mao, Junran Peng, and Zhaoxiang Zhang. Citygaussianv2: Effi- cient and geometrically accurate reconstruction for large-scale scenes. InThe Thirteenth International Conference on Learning Representations, 2025
work page 2025
-
[57]
Yuanyuan Gao, Hao Li, Jiaqi Chen, Zhengyu Zou, Zhihang Zhong, Dingwen Zhang, Xiao Sun, and Junwei Han. Citygs-x: A scalable architecture for efficient and geometrically accurate large-scale scene reconstruction. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 27187–27196, October 2025
work page 2025
-
[58]
Vastgaussian: Vast 3d gaussians for large scene reconstruction
Jiaqi Lin, Zhihao Li, Xiao Tang, Jianzhuang Liu, Shiyong Liu, Jiayue Liu, Yangdi Lu, Xiaofei Wu, Songcen Xu, Youliang Yan, and Wenming Yang. Vastgaussian: Vast 3d gaussians for large scene reconstruction. In2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5166–5175, 2024
work page 2024
-
[59]
Compact 3d gaussian representation for radiance field
Joo Chan Lee, Daniel Rho, Xiangyu Sun, Jong Hwan Ko, and Eunbyung Park. Compact 3d gaussian representation for radiance field. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21719–21728, 2024
work page 2024
-
[60]
Maskgaussian: Adaptive 3d gaussian representation from probabilistic masks
Yifei Liu, Zhihang Zhong, Yifan Zhan, Sheng Xu, and Xiao Sun. Maskgaussian: Adaptive 3d gaussian representation from probabilistic masks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 681–690, June 2025
work page 2025
-
[61]
Lightgaussian: Unbounded 3d gaussian compression with 15x reduction and 200+ FPS
Zhiwen Fan, Kevin Wang, Kairun Wen, Zehao Zhu, Dejia Xu, and Zhangyang Wang. Lightgaussian: Unbounded 3d gaussian compression with 15x reduction and 200+ FPS. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024
work page 2024
-
[62]
Ntire 2019 challenge on video deblurring and super-resolution: Dataset and study
Seungjun Nah, Sungyong Baik, Seokil Hong, Gyeongsik Moon, Sanghyun Son, Radu Timofte, and Kyoung Mu Lee. Ntire 2019 challenge on video deblurring and super-resolution: Dataset and study. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1996–2005, 2019
work page 2019
-
[63]
Bad-gaussians: Bundle adjusted deblur gaussian splatting
Lingzhe Zhao, Peng Wang, and Peidong Liu. Bad-gaussians: Bundle adjusted deblur gaussian splatting. In European Conference on Computer Vision (ECCV), 2024
work page 2024
-
[64]
Motion-aware animatable gaussian avatars deblurring, 2026
Muyao Niu, Yifan Zhan, Qingtian Zhu, Zhuoxiao Li, Wei Wang, Zhihang Zhong, Xiao Sun, and Yinqiang Zheng. Motion-aware animatable gaussian avatars deblurring, 2026
work page 2026
-
[65]
Dp-nerf: Deblurred neural radiance field with physical scene priors
Dogyoon Lee, Minhyeok Lee, Chajin Shin, and Sangyoun Lee. Dp-nerf: Deblurred neural radiance field with physical scene priors. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12386–12396, June 2023
work page 2023
-
[66]
Yujie Wang, Praneeth Chakravarthula, and Baoquan Chen. Dof-gs: Adjustable depth-of-field 3d gaussian splatting for refocusing, defocus rendering and blur removal.The IEEE / CVF Computer Vision and Pattern Recognition Conference, 2025. 13
work page 2025
-
[67]
Ben Mildenhall, Peter Hedman, Ricardo Martin-Brualla, Pratul P. Srinivasan, and Jonathan T. Barron. Nerf in the dark: High dynamic range view synthesis from noisy raw images. In2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16169–16178, 2022
work page 2022
-
[68]
Kaixuan Wei, Ying Fu, Yinqiang Zheng, and Jiaolong Yang. Physics-based noise modeling for extreme low- light photography.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11):8520–8537, 2021
work page 2021
-
[69]
Fisheye-gs: Lightweight and extensible gaussian splatting module for fisheye cameras, 2024
Zimu Liao, Siyan Chen, Rong Fu, Yi Wang, Zhongling Su, Hao Luo, Li Ma, Linning Xu, Bo Dai, Hengjie Li, Zhilin Pei, and Xingcheng Zhang. Fisheye-gs: Lightweight and extensible gaussian splatting module for fisheye cameras, 2024
work page 2024
-
[70]
Qi Wu, Janick Martinez Esturo, Ashkan Mirzaei, Nicolas Moenne-Loccoz, and Zan Gojcic. 3dgut: Enabling distorted cameras and secondary rays in gaussian splatting.Conference on Computer Vision and Pattern Recognition (CVPR), 2025
work page 2025
-
[71]
Depth Anything 3: Recovering the Visual Space from Any Views
Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y . Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[72]
Structure-from-motion revisited
Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. InConference on Computer Vision and Pattern Recognition (CVPR), 2016
work page 2016
-
[73]
SAM 3: Segment anything with concepts
Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris Coll- Vinent, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Z...
work page 2026
-
[74]
Nianchen Deng, Lixin Gu, Shenglong Ye, Yinan He, Zhe Chen, Songze Li, Haomin Wang, Xingguang Wei, Tianshuo Yang, Min Dou, et al. Internspatial: A comprehensive dataset for spatial reasoning in vision-language models.arXiv preprint arXiv:2506.18385, 2025
-
[75]
Ma- pAnything: Universal feed-forward metric 3D reconstruction
Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, Jonathon Luiten, Manuel Lopez-Antequera, Samuel Rota Bulò, Christian Richardt, Deva Ramanan, Sebastian Scherer, and Peter Kontschieder. Ma- pAnything: Universal feed-forward metric 3D reconstruction. I...
work page 2026
-
[76]
Martijn Steinrucken. Heartfelt – Shadertoy. https://www.shadertoy.com/view/ltffzl, 2017. Li- cense: CC BY-NC-SA 3.0
work page 2017
-
[77]
OpenAI. Introducing GPT-5.4. https://openai.com/index/introducing-gpt-5-4/ , March 2026
work page 2026
-
[78]
Gemini 3.1 pro: A smarter model for your most complex tasks
Google. Gemini 3.1 pro: A smarter model for your most complex tasks. https://blog.google/ innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/, Febrary 2026
work page 2026
-
[79]
Gemini 3.1 flash-lite: Built for intelligence at scale
Google. Gemini 3.1 flash-lite: Built for intelligence at scale. https://blog.google/ innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-lite/ , March 2026
work page 2026
-
[80]
Anthropic. Introducing claude sonnet 4.6. https://www.anthropic.com/news/claude-sonnet-4-6 , February 2026
work page 2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.