Can These Views Be One Scene? Evaluating Multiview 3D Consistency when 3D Foundation Models Hallucinate
Pith reviewed 2026-05-20 10:38 UTC · model grok-4.3
The pith
COLMAP-based metrics achieve up to 4 times higher correlation with human judgments of multiview 3D consistency than MEt3R.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce a controlled robustness benchmark for multiview 3D consistency and a parametric family that decomposes neural metrics into backbone, residual, and aggregation components. This family recovers MEt3R and yields variants up to 3 times more robust. Foundation models such as VGGT, MASt3R, DUSt3R, and Fast3R can hallucinate dense geometry and cross-view support for unrelated scenes, repeated images, and random noise. We introduce COLMAP-based metrics that use matches, registration, dense support, and reconstruction failure as failure-aware consistency signals. On real NVS outputs and a structured human study, these metrics achieve up to 4 times higher correlation with human judgments.
What carries the argument
COLMAP-based metrics that treat matches, registration success, dense support, and reconstruction failure as explicit failure-aware consistency signals.
If this is right
- Neural 3D foundation models can produce dense cross-view support even when inputs do not depict a single static scene.
- Varying the backbone, residual, or aggregation stage in neural consistency metrics can increase their robustness by up to a factor of three.
- COLMAP-based signals detect inconsistencies through geometric failure modes that learned priors often miss.
- Evaluation protocols for novel view synthesis and sparse reconstruction should incorporate explicit reconstruction-failure cues to avoid over-optimistic scores.
Where Pith is reading between the lines
- Replacing or supplementing neural consistency checks with COLMAP-based ones could reduce the rate at which inconsistent generated views are accepted in downstream 3D pipelines.
- The parametric decomposition offers a systematic way to diagnose which part of a neural metric is most vulnerable to hallucinations.
- The benchmark could serve as a testbed for future hybrid metrics that combine classical geometry with learned components.
Load-bearing premise
The selected NVS outputs and the human raters in the study are representative of the artifacts and inconsistencies that occur across broader 3D model outputs.
What would settle it
A new collection of multiview images containing hallucinations or repeated views where MEt3R shows higher correlation with fresh human ratings than any of the COLMAP-based metrics.
Figures
read the original abstract
Multiview 3D evaluation assumes that the images being scored are observations of one static 3D scene. This assumption can fail in NVS and sparse-view reconstruction: inputs or generated outputs may contain artifacts, outlier frames, repeated views, or noise, yet still receive high 3D consistency scores. Existing reference-based metrics require ground truth, while ground-truth-free metrics such as MEt3R depend on learned reconstruction backbones whose failure modes are poorly characterized. We study this reliability problem by comparing neural reconstruction priors with classical geometric verification. We introduce \benchmark, a controlled robustness benchmark for multiview 3D consistency, and a parametric family that decomposes neural metrics into backbone, residual, and aggregation components. This family recovers MEt3R and yields variants up to $3\times$ more robust. Our analysis shows that VGGT, MASt3R, DUSt3R, and Fast3R can hallucinate dense geometry and cross-view support for unrelated scenes, repeated images, and random noise. We introduce COLMAP-based metrics that use matches, registration, dense support, and reconstruction failure as failure-aware consistency signals. On real NVS outputs and a structured human study, these metrics achieve up to $4\times$ higher correlation with human judgments than MEt3R.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript examines the reliability of multiview 3D consistency metrics when inputs or outputs from neural view synthesis (NVS) contain artifacts such as hallucinations, repeated views, or noise. It introduces the benchmark, a parametric decomposition of neural metrics into backbone/residual/aggregation components that recovers MEt3R and produces more robust variants, demonstrates that models including VGGT, MASt3R, DUSt3R, and Fast3R can hallucinate dense geometry for unrelated scenes or noise, and proposes COLMAP-based metrics using matches, registration, and reconstruction failure signals. On real NVS outputs evaluated via a human study, the COLMAP metrics are reported to achieve up to 4× higher correlation with human judgments than MEt3R.
Significance. If the experimental claims hold after additional validation, the work would be significant for the field of 3D reconstruction and novel view synthesis evaluation. It provides concrete evidence of failure modes in current 3D foundation models, introduces a controlled benchmark for robustness testing, and offers a practical alternative using established classical geometry tools that better aligns with human perception of 3D consistency. The parametric decomposition is a useful contribution that could guide future metric design. Credit is due for the human study component and the explicit comparison against neural priors.
major comments (2)
- [Experimental Evaluation / Human Study] The central claim that COLMAP-based metrics achieve up to 4× higher correlation with human judgments than MEt3R (abstract and experimental section) is load-bearing for the paper's recommendation of these metrics as more reliable. However, the manuscript provides insufficient detail on the human study protocol, including the number of scenes, the source models and artifact types in the selected NVS outputs, the number of raters, inter-rater reliability statistics (e.g., Krippendorff’s alpha), and the statistical test used to establish significance of the correlation difference. Without these, it is unclear whether the reported gain generalizes beyond the particular test distribution or is influenced by post-hoc selection favoring classical geometry.
- [§4] §4 (Benchmark and Parametric Family): The claim that the parametric decomposition yields variants up to 3× more robust is central to the technical contribution, yet the manuscript does not specify the exact robustness metric (e.g., correlation under controlled artifact injection) or whether these variants were evaluated on the same human-study NVS outputs used for the 4× COLMAP comparison. This leaves open whether the robustness improvement is independent of the human correlation results.
minor comments (3)
- [Abstract / Introduction] The abstract and introduction use the term 'hallucinate dense geometry' without a precise operational definition tied to the benchmark; adding a short formalization (e.g., cross-view support for unrelated scenes) would improve clarity.
- [Results / Tables] Figure captions and tables reporting correlation values should include confidence intervals or p-values for the differences between metrics to allow readers to assess the strength of the 4× claim directly.
- [Related Work] Ensure the related-work section explicitly contrasts the proposed COLMAP signals against prior geometric verification methods in multi-view stereo literature.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to improve clarity and provide the requested experimental details.
read point-by-point responses
-
Referee: [Experimental Evaluation / Human Study] The central claim that COLMAP-based metrics achieve up to 4× higher correlation with human judgments than MEt3R (abstract and experimental section) is load-bearing for the paper's recommendation of these metrics as more reliable. However, the manuscript provides insufficient detail on the human study protocol, including the number of scenes, the source models and artifact types in the selected NVS outputs, the number of raters, inter-rater reliability statistics (e.g., Krippendorff’s alpha), and the statistical test used to establish significance of the correlation difference. Without these, it is unclear whether the reported gain generalizes beyond the particular test distribution or is influenced by post-hoc selection favoring classical geometry.
Authors: We agree that additional protocol details are necessary to substantiate the claim and address potential concerns about generalization or selection bias. In the revised manuscript we have expanded the experimental section with a dedicated description of the human study. This now specifies the number of scenes, the NVS models and artifact types sampled, the number of raters, the inter-rater reliability (Krippendorff’s alpha), and the statistical test used to compare correlations. Scene selection followed a stratified sampling strategy based on the artifact categories defined in the benchmark, ensuring coverage of common failure modes rather than post-hoc optimization for any particular metric. revision: yes
-
Referee: [§4] §4 (Benchmark and Parametric Family): The claim that the parametric decomposition yields variants up to 3× more robust is central to the technical contribution, yet the manuscript does not specify the exact robustness metric (e.g., correlation under controlled artifact injection) or whether these variants were evaluated on the same human-study NVS outputs used for the 4× COLMAP comparison. This leaves open whether the robustness improvement is independent of the human correlation results.
Authors: The robustness metric is explicitly the change in correlation to ground-truth 3D consistency labels when artifacts are injected under controlled conditions in the benchmark introduced in §4. This evaluation is performed entirely on the synthetic benchmark data and is independent of the separate human study on real NVS outputs. We have revised §4 to state the definition of the robustness metric and to clarify that the reported 3× improvement is measured on the benchmark, while the human-study results serve as an orthogonal validation on real data. revision: yes
Circularity Check
No significant circularity; empirical evaluation stands on external human judgments and classical geometry.
full rationale
The paper's central claim rests on an empirical comparison: COLMAP-based metrics show up to 4× higher correlation with human judgments than MEt3R on selected NVS outputs. This is not derived from any internal equations or self-referential fitting; the parametric family is introduced as an analysis tool that recovers MEt3R rather than predicting it, and COLMAP metrics are constructed from standard geometric primitives (matches, registration, reconstruction failure) independent of the neural backbones under test. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear in the described chain. The result is self-contained because validation comes from an external human study and classical SfM software rather than reducing to the paper's own fitted parameters or prior outputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption COLMAP registration and match statistics provide reliable, failure-aware signals for multiview consistency even when neural backbones hallucinate.
invented entities (1)
-
Benchmark (named via LaTeX macro in abstract)
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AlexanderDuality.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We introduce SysCON3D, a controlled robustness benchmark... COLMAP-based metrics that use matches, registration, dense support, and reconstruction failure as failure-aware consistency signals.
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
a parametric family that decomposes neural metrics into backbone, residual, and aggregation components
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Met3r: Measuring multi-view consistency in generated images
Mohammad Asim, Christopher Wewer, Thomas Wimmer, Bernt Schiele, and Jan Eric Lenssen. Met3r: Measuring multi-view consistency in generated images. InComputer Vision and Pattern Recognition (CVPR), 2025. 2, 3, 4, 5
work page 2025
-
[2]
Mip-nerf 360: Unbounded anti-aliased neural radiance fields
Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5470–5479, 2022. 6, 9
work page 2022
-
[3]
Mvgenmaster: Scaling multi-view generation from any image via 3d priors enhanced diffusion model
Chenjie Cao, Chaohui Yu, Shang Liu, Fan Wang, Xiangyang Xue, and Yanwei Fu. Mvgenmaster: Scaling multi-view generation from any image via 3d priors enhanced diffusion model. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 6045–6056, 2025. 3, 9
work page 2025
-
[4]
Yuedong Chen, Chuanxia Zheng, Haofei Xu, Bohan Zhuang, Andrea Vedaldi, Tat-Jen Cham, and Jianfei Cai. Mvsplat360: Feed-forward 360 scene synthesis from sparse views.Advances in Neural Information Processing Systems, 37:107064–107086, 2024. 3, 9, 33
work page 2024
-
[5]
Shivam Duggal, Yushi Hu, Oscar Michel, Aniruddha Kembhavi, William T. Freeman, Noah A. Smith, Ranjay Krishna, Antonio Torralba, Ali Farhadi, and Wei-Chiu Ma. Eval3d: Interpretable and fine-grained evaluation for 3d generation.CVPR, 2025. 3
work page 2025
-
[6]
Martin A. Fischler and Robert C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography.Commun. ACM, 24(6):381–395, 1981. 4
work page 1981
-
[7]
Brandt, Axel Feldmann, Zhoutong Zhang, and William T
Stephanie Fu, Mark Hamilton, Laura E. Brandt, Axel Feldmann, Zhoutong Zhang, and William T. Freeman. Featup: A model-agnostic framework for features at any resolution. InThe Twelfth International Conference on Learning Representations, 2024. 2, 3, 4
work page 2024
-
[8]
Rating the chess rating system
Mark E Glickman and Albyn C Jones. Rating the chess rating system. 9, 22
-
[9]
Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test.J. Mach. Learn. Res., 13(null):723–773, 2012. 3
work page 2012
-
[10]
Frank R. Hampel. The influence curve and its role in robust estimation.Journal of the American Statistical Association, 69(346):383–393, 1974. 3
work page 1974
-
[11]
Jisang Han, Sunghwan Hong, Jaewoo Jung, Wooseok Jang, Honggyu An, Qianqian Wang, Seungryong Kim, and Chen Feng. Emergent outlier view rejection in visual geometry grounded transformers.arXiv preprint arXiv:2512.04012, 2025. 2, 8, 33
-
[12]
Benchmarking neural network robustness to common corruptions and perturbations
Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. InInternational Conference on Learning Representations, 2019. 3
work page 2019
-
[13]
Peter J. Huber. Robust estimation of a location parameter.The Annals of Mathematical Statistics, 35(1): 73–101, 1964. 3
work page 1964
-
[14]
Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and temples: Benchmarking large-scale scene reconstruction.ACM Transactions on Graphics (ToG), 36(4):1–13, 2017. 3
work page 2017
-
[15]
Explaining human preferences via metrics for structured 3d reconstruction
Jack Langerman, Denys Rozumnyi, Yuzhong Huang, and Dmytro Mishkin. Explaining human preferences via metrics for structured 3d reconstruction. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 26944–26953, 2025. 22
work page 2025
-
[16]
Grounding image matching in 3d with mast3r
Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r. In European conference on computer vision, pages 71–91. Springer, 2024. 2, 3, 4, 9
work page 2024
-
[17]
Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision
Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22160–22169, 2024. 9 12
work page 2024
-
[18]
Zero- 1-to-3: Zero-shot one image to 3d object
Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl V ondrick. Zero- 1-to-3: Zero-shot one image to 3d object. InProceedings of the IEEE/CVF international conference on computer vision, pages 9298–9309, 2023. 5
work page 2023
-
[19]
Gen3deval: Using vllms for automatic evaluation of generated 3d objects
Shalini Maiti, Lourdes Agapito, and Filippos Kokkinos. Gen3deval: Using vllms for automatic evaluation of generated 3d objects. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 18552–18562, 2025. 3
work page 2025
-
[20]
Maxime Oquab, Timothée Darcet, Theo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patrick Laba...
work page 2023
-
[21]
Structure-from-Motion Revisited
Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-Motion Revisited. InConference on Computer Vision and Pattern Recognition (CVPR), 2016. 2, 4, 5, 37
work page 2016
-
[22]
Daniel J Simons and Christopher F Chabris. Gorillas in our midst: Sustained inattentional blindness for dynamic events.perception, 28(9):1059–1074, 1999. 22
work page 1999
-
[23]
Saar Stern, Ido Sobol, and Or Litany. Appreciate the view: A task-aware evaluation framework for novel view synthesis.arXiv preprint arXiv:2511.12675, 2025. 3, 5
-
[24]
Vggt: Visual geometry grounded transformer
Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 2, 3, 4, 9
work page 2025
-
[25]
Dust3r: Geometric 3d vision made easy
Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. InCVPR, 2024. 2, 3, 9
work page 2024
-
[26]
Difix3d+: Improving 3d reconstructions with single-step diffusion models
Jay Zhangjie Wu, Yuxuan Zhang, Haithem Turki, Xuanchi Ren, Jun Gao, Mike Zheng Shou, Sanja Fidler, Zan Gojcic, and Huan Ling. Difix3d+: Improving 3d reconstructions with single-step diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26024–26035, 2025. 2, 3, 9
work page 2025
-
[27]
Genfusion: Closing the loop between reconstruction and generation via videos
Sibo Wu, Congrong Xu, Binbin Huang, Geiger Andreas, and Anpei Chen. Genfusion: Closing the loop between reconstruction and generation via videos. InConference on Computer Vision and Pattern Recognition (CVPR), 2025. 3
work page 2025
-
[28]
Doppelgangers++: Improved visual disambiguation with geometric 3d features
Yuanbo Xiangli, Ruojin Cai, Hanyu Chen, Jeffrey Byrne, and Noah Snavely. Doppelgangers++: Improved visual disambiguation with geometric 3d features. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 3
work page 2025
-
[29]
Depthsplat: Connecting gaussian splatting and depth
Haofei Xu, Songyou Peng, Fangjinhua Wang, Hermann Blum, Daniel Barath, Andreas Geiger, and Marc Pollefeys. Depthsplat: Connecting gaussian splatting and depth. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 16453–16463, 2025. 2, 3, 9
work page 2025
-
[30]
Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli
Jianing Yang, Alexander Sax, Kevin J. Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 2, 3, 4, 9
work page 2025
-
[31]
Mvsnet: Depth inference for unstructured multi-view stereo
Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. Mvsnet: Depth inference for unstructured multi-view stereo. InProceedings of the European conference on computer vision (ECCV), pages 767–783,
-
[32]
Meng You, Zhiyu Zhu, Hui Liu, and Junhui Hou. Nvs-solver: Video diffusion model as zero-shot novel view synthesizer.arXiv preprint arXiv:2405.15364, 2024. 3, 9
-
[33]
Long-term photometric consistent novel view synthesis with diffusion models
Jason J Yu, Fereshteh Forghani, Konstantinos G Derpanis, and Marcus A Brubaker. Long-term photometric consistent novel view synthesis with diffusion models. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 7094–7104, 2023. 3, 5
work page 2023
-
[34]
ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis
Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, and Yonghong Tian. Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis.arXiv preprint arXiv:2409.02048, 2024. 2, 3, 9 13
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[35]
Jensen (Jinghao) Zhou, Hang Gao, Vikram V oleti, Aaryaman Vasishta, Chun-Han Yao, Mark Boss, Philip Torr, Christian Rupprecht, and Varun Jampani. Stable virtual camera: Generative view synthesis with diffusion models.arXiv preprint arXiv:2503.14489, 2025. 3, 9
-
[36]
Long-lrm: Long-sequence large reconstruction model for wide-coverage gaussian splats
Chen Ziwen, Hao Tan, Kai Zhang, Sai Bi, Fujun Luan, Yicong Hong, Li Fuxin, and Zexiang Xu. Long-lrm: Long-sequence large reconstruction model for wide-coverage gaussian splats. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4349–4359, 2025. 3, 9 14 Appendix Contents A Notation and Common Terms 17 B FAQs 18 C Detailed Cont...
work page 2025
-
[37]
Why not apply RANSAC-style geometric checks to matches predicted by VGGT, DUSt3R, or MASt3R? 19 Classical matching separates evidence from verification: SIFT proposes discrete matches, and RANSAC tests whether they support a geometry. Learned 3D backbones predict geometry, cameras, and correspondences together, so their matches are already tied to the mod...
-
[38]
We introduce aunified parametric frameworkfor ground-truth-free 3D consistency met- rics, showing that existing and new neural metrics can be decomposed into three components: areconstruction backbone, aresidual function, and anaggregation function. This formu- lation clarifies the design space of multi-view consistency metrics and enables principled vari...
-
[39]
We constructSysCON3D, a robustness benchmark for 3D consistency evaluation, which systematically injects controlled cross-scene outliers and synthetic corruptions into multi- view image sets. This benchmark enables direct testing of whether a metric can distinguish geometrically consistent scenes from increasingly inconsistent ones
-
[40]
Through SysCON3D, we uncover asystematic and previously underappreciated failure modeof modern data-driven 3D reconstruction backbones: rather than rejecting impossible inputs, they hallucinate non-trivial 3D structure and spurious cross-view consistency for cross-scene mixtures, repeated images, and random noise. Consequently, evaluation metrics built on...
-
[41]
To address this limitation, we developrobust COLMAP-based 3D consistency metrics that avoid learned priors and instead rely on classical geometric verification. These metrics provide a more interpretable, failure-aware, and robust measure of scene-level multi-view consistency
-
[42]
We design astructured human preference studyfor evaluating 3D consistency, with explicit protocols that distinguish 3D consistency from visual realism and plausibility. This yields a more targeted human reference for assessing how well automatic metrics align with human judgments of scene-level consistency
-
[43]
We perform acomprehensive empirical evaluationon SysCON3D, Mip-NeRF360, and DL3DV , together with our human study, and show that the proposed metrics substantially improve robustness over prior work while also aligning more closely with human judgments. In particular, neural distributional metrics improve over MEt3R, and COLMAP-based metrics achieve the s...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.