
arxiv: 2605.16873 · v1 · pith:7MNTUTZTnew · submitted 2026-05-16 · 💻 cs.CV

HAD: Hallucination-Aware Diffusion Priors for 3D Reconstruction

Pith reviewed 2026-05-19 20:21 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D reconstruction · diffusion priors · hallucination detection · novel view synthesis · sparse view reconstruction · multi-view consistency · artifact reduction

The pith

HAD estimates pixel-wise hallucination scores from a pre-trained novel view synthesis network to mask unreliable pixels in diffusion-augmented images during sparse-view 3D reconstruction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the introduction of invented content by diffusion models when they generate extra views to help build 3D models from limited input images. It does so by running those generated images through a separate network that can compare information across multiple viewpoints and assign a score to each pixel indicating how inconsistent it is with the real inputs. Selective masking of high-score pixels during the reconstruction process then keeps only the reliable additions, while fusing several differently conditioned versions of each new view brings in more surrounding context from the original set.

Core claim

HAD estimates pixel-wise hallucination score maps for augmented images by leveraging multi-view reasoning capabilities from a feedforward novel view synthesis network pre-trained on large-scale 3D data. These hallucination scores enable selective masking of unreliable pixels during the progressive 3D reconstruction procedure, preventing the introduction of non-existent artifacts into the 3D model. To further enhance performance, multiple versions of augmented images at each novel view are created by conditioning the diffusion prior on different input views and then fused into a final image that leverages the broader context across all input views.

What carries the argument

Pixel-wise hallucination score maps produced by a pre-trained feedforward novel view synthesis network, which identify inconsistent pixels for selective masking in diffusion-augmented training views during progressive 3D reconstruction.
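As a rough illustration of selective masking, the sketch below thresholds a score map and averages a photometric error over the surviving pixels only. The threshold value, the L1 loss, and the names `reliability_mask` and `masked_l1_loss` are assumptions made for illustration; the paper's actual masking rule and loss are not specified in this summary.

```python
from typing import List

Image = List[List[float]]  # grayscale H x W image, values in [0, 1]


def reliability_mask(scores: Image, threshold: float = 0.5) -> List[List[int]]:
    """Return 1 where the hallucination score is below the threshold, else 0.

    The threshold of 0.5 is a hypothetical choice, not a value from the paper.
    """
    return [[1 if s < threshold else 0 for s in row] for row in scores]


def masked_l1_loss(rendered: Image, augmented: Image, mask: List[List[int]]) -> float:
    """Mean absolute error over kept pixels only; 0.0 if every pixel is masked."""
    total = 0.0
    kept = 0
    for r_row, a_row, m_row in zip(rendered, augmented, mask):
        for r, a, m in zip(r_row, a_row, m_row):
            if m:
                total += abs(r - a)
                kept += 1
    return total / kept if kept else 0.0


# Toy 2x2 example: the top-right pixel is flagged as hallucinated.
rendered = [[0.2, 0.4], [0.6, 0.8]]
augmented = [[0.2, 0.9], [0.5, 0.8]]
scores = [[0.1, 0.9], [0.2, 0.0]]

mask = reliability_mask(scores)
loss = masked_l1_loss(rendered, augmented, mask)
```

In this toy case the large disagreement at the flagged pixel does not contribute to the loss, so the reconstruction is not pulled toward it.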

If this is right

  • Selective masking prevents non-existent artifacts from being baked into the final 3D model.
  • Fusing multiple conditioned augmentations at each novel viewpoint incorporates broader context from all input views.
  • The overall procedure substantially reduces hallucination artifacts compared to standard diffusion-assisted reconstruction.
  • The approach reaches state-of-the-art results on multiple benchmarks for novel view synthesis from sparse inputs.
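One way the fusion of several conditioned augmentations could work is a per-pixel average weighted by reliability. The sketch below assumes weights of one minus the hallucination score and a plain-mean fallback when every candidate is flagged. Both choices, and the name `fuse_candidates`, are hypothetical; the paper's fusion rule is not described in this summary.

```python
from typing import List

Image = List[List[float]]  # grayscale H x W image, values in [0, 1]


def fuse_candidates(candidates: List[Image], scores: List[Image], eps: float = 1e-8) -> Image:
    """Fuse candidate views per pixel, weighting each by (1 - hallucination score)."""
    height, width = len(candidates[0]), len(candidates[0][0])
    fused = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            weights = [1.0 - s[y][x] for s in scores]
            total = sum(weights)
            if total < eps:
                # Every candidate is flagged here: fall back to a plain mean.
                fused[y][x] = sum(c[y][x] for c in candidates) / len(candidates)
            else:
                fused[y][x] = sum(w * c[y][x] for w, c in zip(weights, candidates)) / total
    return fused


# Two 1x2 candidates generated from different (hypothetical) reference views.
cand_a = [[0.0, 1.0]]
cand_b = [[1.0, 1.0]]
score_a = [[0.0, 0.5]]  # candidate A is trusted at the first pixel
score_b = [[1.0, 0.5]]  # candidate B is fully flagged at the first pixel

fused = fuse_candidates([cand_a, cand_b], [score_a, score_b])
```

At the first pixel the fused value follows candidate A alone, because candidate B carries zero weight there.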

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same scoring mechanism could be tested on other generative priors used in 3D tasks to check whether multi-view consistency filtering generalizes beyond diffusion models.
  • If the pre-trained network's reasoning is the key enabler, similar networks might serve as lightweight consistency checkers in related pipelines such as dynamic scene reconstruction.
  • An extension worth checking is whether the fusion step remains effective when the number of original input views drops below the levels tested in the benchmarks.

Load-bearing premise

The pre-trained novel view synthesis network reliably produces hallucination scores that accurately identify pixels inconsistent with the original input views, and masking these pixels improves rather than harms the final 3D model quality.

What would settle it

A direct comparison showing that reconstructions using the hallucination-masked augmented views produce no measurable improvement or even lower quality on standard novel view synthesis metrics than reconstructions that use the unmasked diffusion outputs.
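A minimal sketch of such a comparison computes PSNR for a masked-pipeline render and an unmasked-pipeline render against the same held-out ground-truth view. The images below are toy values invented for illustration, not data from the paper, and the function name `psnr` is an assumption.

```python
import math
from typing import List

Image = List[List[float]]  # grayscale H x W image, values in [0, 1]


def psnr(pred: Image, target: Image, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB; infinite when the images are identical."""
    squared_error = 0.0
    count = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            squared_error += (p - t) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / mse)


# Toy ground truth and two hypothetical renders of the same held-out view.
ground_truth = [[0.5, 0.5]]
render_masked = [[0.6, 0.4]]    # stands in for the masked pipeline
render_unmasked = [[0.7, 0.3]]  # stands in for the unmasked pipeline

psnr_masked = psnr(render_masked, ground_truth)
psnr_unmasked = psnr(render_unmasked, ground_truth)
gain_db = psnr_masked - psnr_unmasked  # > 0 would favor masking
```

The premise would be in doubt if this gain were zero or negative when averaged over the benchmark scenes.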

Figures

Figures reproduced from arXiv: 2605.16873 by Chris Broaddus, Laurent Guigues, Siyu Huang, Weiwei Sun, Xi Liu, Zhou Ren.

Figure 1. While diffusion priors [41] enhance the quality of 3D reconstruction, they introduce detrimental aliens – the hallucinated elements that do not exist in the observed regions, as highlighted in boxes. This work addresses this issue through hallucination score modeling, achieving high-quality 3D reconstruction with improved fidelity.
Figure 2. Overview of framework – We train 3DGS with input images and HAD-augmented novel views. HAD combines a pretrained diffusion prior (which generates images from 3DGS-rendered views conditioned on reference input images) with our hallucination score network (which predicts pixel-wise reliability maps). Our multi-sampling strategy fuses multiple generated versions into refined augmented views. Hallucination sco…
Figure 3. Examples on DL3DV [25] – We show novel-view rendering obtained by ours, Gspat-mcmc [22], LVSM [20] and Difix3D [41]. Our approach achieves sharper rendering as well as better fidelity to the ground truth. … the input views, then optimize the model using both input and augmented views. Under this setting, we replace Difix3D’s diffusion prior with our HAD. We demonstrate that even under this suboptima…
Figure 4. Qualitative demonstration of hallucination patterns in diffusion-assisted 3DGS pipelines. Both Difix3D (image diffusion) and …
Figure 5. Hallucination analysis for multi-view diffusion (SVC). Hallucination maps are computed against ground-truth images to highlight …
Figure 6. Overview of hallucination scoring network – The network predicts a pixel-wise hallucination score map s for a hallucinated novel view ĩ_G. It consists of a multi-view feature encoder V (the frozen feature backbone of a pre-trained LVSM) and a three-layer U-Net score branch S, which estimates hallucination scores using both multi-view features and the novel view image. The model is trained on curated mu…
Figure 7. Hallucination Scoring for GenFusion. Our hallucination scoring network can also mitigate hallucinations in video diffusion without fine-tuning
Figure 8. Generalization of our hallucination scoring network to video diffusion (GenFusion). Our model is …
Figure 9. Generalization of our hallucination scoring network to multi-view diffusion (SVC). Our model is …
Figure 10. More Qualitative Results on DL3DV [25]
Figure 11. More Qualitative Results on DL3DV [25]
Figure 12. More Qualitative Results on MipNeRF-360 [3]
Original abstract

Diffusion priors have recently demonstrated strong capability in enhancing the quality of sparse-view 3D reconstruction by augmenting training views at novel viewpoints, but they inevitably introduce hallucinated content -- artifacts inconsistent with the input views -- into the final 3D model. To address this challenge, we propose Hallucination-Aware Diffusion prior (HAD), which estimates pixel-wise hallucination score maps for augmented images by leveraging multi-view reasoning capabilities from a feedforward novel view synthesis (NVS) network pre-trained on large-scale 3D data. These hallucination scores enable selective masking of unreliable pixels during the progressive 3D reconstruction procedure, preventing the introduction of non-existent artifacts into the 3D model. To further enhance performance, we create multiple versions of augmented images at each novel view by conditioning the diffusion prior on different input views, which are then fused into a final image that leverages the broader context across all input views. We show that our method substantially reduces hallucination artifacts in diffusion-assisted 3D reconstruction, thereby achieving state-of-the-art performance across multiple benchmarks on novel view synthesis. Our project is publicly available at the project website: https://xiliu8006.github.io/HAD-Project-website/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Hallucination-Aware Diffusion prior (HAD) for sparse-view 3D reconstruction. It estimates pixel-wise hallucination score maps on diffusion-augmented novel-view images by exploiting multi-view reasoning from a pre-trained feedforward novel view synthesis (NVS) network. These scores enable selective masking of unreliable pixels during progressive reconstruction; multiple conditioned augmentations are fused to leverage broader context. The method is claimed to substantially reduce hallucination artifacts and deliver state-of-the-art novel-view synthesis results on multiple benchmarks.

Significance. If the quantitative claims hold, the work would be significant because it supplies a practical mechanism to detect and suppress diffusion-induced inconsistencies by reusing an existing large-scale NVS prior. The combination of per-pixel masking and multi-view fusion directly targets a known failure mode of diffusion-assisted reconstruction pipelines and could improve reliability in downstream applications that require geometrically consistent 3D models from limited input views.

major comments (2)
  1. [Method / Experiments] The central claim that masking pixels flagged by the pre-trained NVS network improves final 3D reconstruction quality (rather than discarding useful signal) is load-bearing yet rests on an unverified assumption. The manuscript should provide a direct ablation (e.g., reconstruction metrics with vs. without masking) together with qualitative examples showing that masked regions correspond to genuine hallucinations rather than view-consistent content.
  2. [Abstract / Experiments] The abstract asserts SOTA performance across benchmarks, but the provided description contains no quantitative tables, error bars, or statistical significance tests. Without these data the magnitude of improvement attributable to HAD versus prior diffusion-augmented baselines cannot be assessed.
minor comments (2)
  1. [Method] Clarify the exact architecture and training data of the feedforward NVS network used for scoring; a brief citation or diagram would help readers reproduce the pipeline.
  2. [Method] The fusion step for multiple conditioned augmentations is described at a high level; a short algorithmic outline or pseudocode would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps clarify the empirical validation of our approach. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Method / Experiments] The central claim that masking pixels flagged by the pre-trained NVS network improves final 3D reconstruction quality (rather than discarding useful signal) is load-bearing yet rests on an unverified assumption. The manuscript should provide a direct ablation (e.g., reconstruction metrics with vs. without masking) together with qualitative examples showing that masked regions correspond to genuine hallucinations rather than view-consistent content.

    Authors: We agree that a direct ablation study is necessary to substantiate the benefit of the hallucination-aware masking. In the revised manuscript, we will add a dedicated ablation in the Experiments section that reports reconstruction metrics (PSNR, SSIM, LPIPS) on the same benchmarks with and without the masking step. We will also include qualitative examples that visualize the hallucination score maps overlaid on the augmented views, highlighting regions that are masked and demonstrating their inconsistency with the input views (e.g., via multi-view consistency checks) rather than view-consistent geometry. This addition directly addresses the concern and will be supported by the existing multi-view reasoning mechanism described in Section 3.

    revision: yes

  2. Referee: [Abstract / Experiments] The abstract asserts SOTA performance across benchmarks, but the provided description contains no quantitative tables, error bars, or statistical significance tests. Without these data the magnitude of improvement attributable to HAD versus prior diffusion-augmented baselines cannot be assessed.

    Authors: The full manuscript already contains quantitative tables in the Experiments section that compare HAD against prior diffusion-augmented baselines on multiple benchmarks, reporting standard novel-view synthesis metrics. To improve clarity and address the referee's point, we will revise the abstract to include a concise statement of the key quantitative gains and will augment the tables with error bars (computed over multiple runs or scenes) as well as statistical significance tests (e.g., paired t-tests) where appropriate. These changes will make the magnitude of improvement more transparent without altering the core claims.

    revision: partial
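For reference, the paired t-test mentioned in the response reduces to a simple statistic over per-scene metric differences. The sketch below uses only the Python standard library. The per-scene PSNR values are placeholders invented for illustration, not results from the paper, and `paired_t_statistic` is a hypothetical helper name.

```python
import math
import statistics
from typing import Sequence


def paired_t_statistic(with_masking: Sequence[float], without_masking: Sequence[float]) -> float:
    """t = mean(d) / (stdev(d) / sqrt(n)) for paired differences d."""
    if len(with_masking) != len(without_masking):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(with_masking, without_masking)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired scenes")
    std_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if std_diff == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return statistics.mean(diffs) / (std_diff / math.sqrt(n))


# Placeholder per-scene PSNR values (dB), one entry per scene.
psnr_with = [21.0, 23.0, 26.0]
psnr_without = [20.0, 21.0, 23.0]

t_stat = paired_t_statistic(psnr_with, psnr_without)
degrees_of_freedom = len(psnr_with) - 1
```

The statistic would then be compared against a t distribution with `degrees_of_freedom` degrees of freedom to obtain a p-value.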

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper describes a pipeline that estimates pixel-wise hallucination scores using a pre-trained external feedforward NVS network on large-scale 3D data, then applies selective masking during progressive reconstruction and fuses multiple conditioned augmentations. No equations, fitted parameters, or self-citations are shown reducing the hallucination scores or performance gains to quantities defined by the method's own inputs or outputs. The approach is benchmarked on external NVS tasks with claimed SOTA results, confirming the derivation remains independent of self-referential definitions or forced predictions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the pre-trained NVS network provides accurate hallucination detection without introducing new biases, and that selective masking preserves geometric consistency. No explicit free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption: A pre-trained feedforward NVS network can produce reliable pixel-wise hallucination scores that reflect inconsistency with input views.
    Invoked when using the NVS network to estimate scores for masking in the reconstruction procedure.




Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 4 internal anchors

  1. [1]

    Instant uncertainty calibration of nerfs using a meta-calibrator

    Niki Amini-Naieni, Tomas Jakab, Andrea Vedaldi, and Ronald Clark. Instant uncertainty calibration of nerfs using a meta-calibrator. In European Conference on Computer Vision, pages 309–324. Springer, 2024.

  2. [2]

    Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields

    Jonathan T Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P Srinivasan. Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5855–5864,

  3. [3]

    Mip-nerf 360: Unbounded anti-aliased neural radiance fields

    Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5470–5479, 2022.

  4. [4]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.

  5. [5]

    Tensorf: Tensorial radiance fields

    Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. Tensorf: Tensorial radiance fields. In European conference on computer vision, pages 333–350. Springer, 2022.

  6. [6]

    Pgsr: Planar-based gaussian splatting for efficient and high-fidelity surface reconstruction

    Danpeng Chen, Hai Li, Weicai Ye, Yifan Wang, Weijian Xie, Shangjin Zhai, Nan Wang, Haomin Liu, Hujun Bao, and Guofeng Zhang. Pgsr: Planar-based gaussian splatting for efficient and high-fidelity surface reconstruction. IEEE Transactions on Visualization and Computer Graphics, 2024.

  7. [7]

    Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images

    Yuedong Chen, Haofei Xu, Chuanxia Zheng, Bohan Zhuang, Marc Pollefeys, Andreas Geiger, Tat-Jen Cham, and Jianfei Cai. Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images. In European Conference on Computer Vision, pages 370–386. Springer, 2024.

  8. [8]

    Mvsplat360: Feed-forward 360 scene synthesis from sparse views

    Yuedong Chen, Chuanxia Zheng, Haofei Xu, Bohan Zhuang, Andrea Vedaldi, Tat-Jen Cham, and Jianfei Cai. Mvsplat360: Feed-forward 360 scene synthesis from sparse views. Advances in Neural Information Processing Systems, 37:107064–107086, 2024.

  9. [9]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

  10. [10]

    Lightgaussian: Unbounded 3d gaussian compression with 15x reduction and 200+ fps

    Zhiwen Fan, Kevin Wang, Kairun Wen, Zehao Zhu, Dejia Xu, Zhangyang Wang, et al. Lightgaussian: Unbounded 3d gaussian compression with 15x reduction and 200+ fps. Advances in neural information processing systems, 37:140138–140158, 2024.

  11. [11]

    Flowr: Flowing from sparse to dense 3d reconstructions

    Tobias Fischer, Samuel Rota Bulò, Yung-Hsu Yang, Nikhil Keetha, Lorenzo Porzi, Norman Müller, Katja Schwarz, Jonathon Luiten, Marc Pollefeys, and Peter Kontschieder. Flowr: Flowing from sparse to dense 3d reconstructions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 27702–27712, 2025.

  12. [12]

    Plenoxels: Radiance fields without neural networks

    Sara Fridovich-Keil, Alex Yu, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5501–5510, 2022.

  13. [13]

    CAT3D: Create Anything in 3D with Multi-View Diffusion Models

    Ruiqi Gao, Aleksander Holynski, Philipp Henzler, Arthur Brussee, Ricardo Martin-Brualla, Pratul Srinivasan, Jonathan T Barron, and Ben Poole. Cat3d: Create anything in 3d with multi-view diffusion models. arXiv preprint arXiv:2405.10314, 2024.

  14. [14]

    Bayes’ rays: Uncertainty quantification for neural radiance fields

    Lily Goli, Cody Reading, Silvia Sellán, Alec Jacobson, and Andrea Tagliasacchi. Bayes’ rays: Uncertainty quantification for neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20061–20070, 2024.

  15. [15]

    Uc-nerf: Uncertainty-aware conditional neural radiance fields from endoscopic sparse views

    Jiaxin Guo, Jiangliu Wang, Ruofeng Wei, Di Kang, Qi Dou, and Yun-Hui Liu. Uc-nerf: Uncertainty-aware conditional neural radiance fields from endoscopic sparse views. IEEE Transactions on Medical Imaging, 44(3):1284–1296, 2024.

  16. [16]

    2d gaussian splatting for geometrically accurate radiance fields

    Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2d gaussian splatting for geometrically accurate radiance fields. In SIGGRAPH 2024 Conference Papers. Association for Computing Machinery, 2024.

  17. [17]

    Putting nerf on a diet: Semantically consistent few-shot view synthesis

    Ajay Jain, Matthew Tancik, and Pieter Abbeel. Putting nerf on a diet: Semantically consistent few-shot view synthesis. In ICCV, pages 5885–5894, 2021.

  18. [18]

    Rayzer: A self-supervised large view synthesis model

    Hanwen Jiang, Hao Tan, Peng Wang, Haian Jin, Yue Zhao, Sai Bi, Kai Zhang, Fujun Luan, Kalyan Sunkavalli, Qixing Huang, et al. Rayzer: A self-supervised large view synthesis model. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4918–4929, 2025.

  19. [19]

    Anysplat: Feed-forward 3d gaussian splatting from unconstrained views

    Lihan Jiang, Yucheng Mao, Linning Xu, Tao Lu, Kerui Ren, Yichen Jin, Xudong Xu, Mulin Yu, Jiangmiao Pang, Feng Zhao, et al. Anysplat: Feed-forward 3d gaussian splatting from unconstrained views. ACM Transactions on Graphics (TOG), 44(6):1–16, 2025.

  20. [20]

    Lvsm: A large view synthesis model with minimal 3d inductive bias

    Haian Jin, Hanwen Jiang, Hao Tan, Kai Zhang, Sai Bi, Tianyuan Zhang, Fujun Luan, Noah Snavely, and Zexiang Xu. Lvsm: A large view synthesis model with minimal 3d inductive bias. In The Thirteenth International Conference on Learning Representations, 2025.

  21. [21]

    3d gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), 2023.

  22. [22]

    3d gaussian splatting as markov chain monte carlo

    Shakiba Kheradmand, Daniel Rebain, Gopal Sharma, Weiwei Sun, Yang-Che Tseng, Hossam Isack, Abhishek Kar, Andrea Tagliasacchi, and Kwang Moo Yi. 3d gaussian splatting as markov chain monte carlo. Advances in Neural Information Processing Systems, 37:80965–80986, 2024.

  23. [23]

    Dngaussian: Optimizing sparse-view 3d gaussian radiance fields with global-local depth normalization

    Jiahe Li, Jiawei Zhang, Xiao Bai, Jin Zheng, Xin Ning, Jun Zhou, and Lin Gu. Dngaussian: Optimizing sparse-view 3d gaussian radiance fields with global-local depth normalization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20775–20785,

  24. [24]

    Variational multi-scale representation for estimating uncertainty in 3d gaussian splatting

    Ruiqi Li and Yiu-ming Cheung. Variational multi-scale representation for estimating uncertainty in 3d gaussian splatting. In Advances in Neural Information Processing Systems, pages 87934–87958, 2024.

  25. [25]

    Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision

    Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22160–22169, 2024.

  26. [26]

    Reconx: Reconstruct any scene from sparse views with video diffusion model

    Fangfu Liu, Wenqiang Sun, Hanyang Wang, Yikai Wang, Haowen Sun, Junliang Ye, Jun Zhang, and Yueqi Duan. Reconx: Reconstruct any scene from sparse views with video diffusion model. IEEE Transactions on Image Processing,

  27. [27]

    Zero-1-to-3: Zero-shot one image to 3d object

    Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9298–9309, 2023.

  28. [28]

    3dgs-enhancer: Enhancing unbounded 3d gaussian splatting with view-consistent 2d diffusion priors

    Xi Liu, Chaoyi Zhou, and Siyu Huang. 3dgs-enhancer: Enhancing unbounded 3d gaussian splatting with view-consistent 2d diffusion priors. Advances in Neural Information Processing Systems, 37:133305–133327, 2024.

  29. [29]

    Sparseneus: Fast generalizable neural surface reconstruction from sparse views

    Xiaoxiao Long, Cheng Lin, Peng Wang, Taku Komura, and Wenping Wang. Sparseneus: Fast generalizable neural surface reconstruction from sparse views. In European Conference on Computer Vision, pages 210–227. Springer, 2022.

  30. [30]

    Im-3d: Iterative multiview diffusion and reconstruction for high-quality 3d generation

    Luke Melas-Kyriazi, Iro Laina, Christian Rupprecht, Natalia Neverova, Andrea Vedaldi, Oran Gafni, and Filippos Kokkinos. Im-3d: Iterative multiview diffusion and reconstruction for high-quality 3d generation. International Conference on Machine Learning, 2024.

  31. [31]

    Nerf: Representing scenes as neural radiance fields for view synthesis

    Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.

  32. [32]

    Hardware acceleration of neural graphics

    Muhammad Husnain Mubarik, Ramakrishna Kanungo, Tobias Zirr, and Rakesh Kumar. Hardware acceleration of neural graphics. In Proceedings of the 50th Annual International Symposium on Computer Architecture, pages 1–12, 2023.

  33. [33]

    Regnerf: Regularizing neural radiance fields for view synthesis from sparse inputs

    Michael Niemeyer, Jonathan T Barron, Ben Mildenhall, Mehdi SM Sajjadi, Andreas Geiger, and Noha Radwan. Regnerf: Regularizing neural radiance fields for view synthesis from sparse inputs. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5480–5490, 2022.

  34. [34]

    Dreamfusion: Text-to-3d using 2d diffusion

    Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. In The Eleventh International Conference on Learning Representations, 2023.

  35. [35]

    Estimating 3d uncertainty field: Quantifying uncertainty for neural radiance fields

    Jianxiong Shen, Ruijie Ren, Adria Ruiz, and Francesc Moreno-Noguer. Estimating 3d uncertainty field: Quantifying uncertainty for neural radiance fields. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 2375–2381. IEEE, 2024.

  36. [36]

    MVDream: Multi-view Diffusion for 3D Generation

    Yichun Shi, Peng Wang, Jianglong Ye, Long Mai, Kejie Li, and Xiao Yang. Mvdream: Multi-view diffusion for 3d generation. arXiv:2308.16512, 2023.

  37. [37]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.

  38. [38]

    Sparsenerf: Distilling depth ranking for few-shot novel view synthesis

    Guangcong Wang, Zhaoxi Chen, Chen Change Loy, and Ziwei Liu. Sparsenerf: Distilling depth ranking for few-shot novel view synthesis. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9065–9076,

  39. [39]

    Image quality assessment: from error visibility to structural similarity

    Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.

  40. [40]

    Sparse2dgs: Geometry-prioritized gaussian splatting for surface reconstruction from sparse views

    Jiang Wu, Rui Li, Yu Zhu, Rong Guo, Jinqiu Sun, and Yanning Zhang. Sparse2dgs: Geometry-prioritized gaussian splatting for surface reconstruction from sparse views. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 11307–11316, 2025.

  41. [41]

    Difix3d+: Improving 3d reconstructions with single-step diffusion models

    Jay Zhangjie Wu, Yuxuan Zhang, Haithem Turki, Xuanchi Ren, Jun Gao, Mike Zheng Shou, Sanja Fidler, Zan Gojcic, and Huan Ling. Difix3d+: Improving 3d reconstructions with single-step diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26024–26035, 2025.

  42. [42]

    Reconfusion: 3d reconstruction with diffusion priors

    Rundi Wu, Ben Mildenhall, Philipp Henzler, Keunhong Park, Ruiqi Gao, Daniel Watson, Pratul P Srinivasan, Dor Verbin, Jonathan T Barron, Ben Poole, et al. Reconfusion: 3d reconstruction with diffusion priors. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21551–21561, 2024.

  43. [43]

    Genfusion: Closing the loop between reconstruction and generation via videos

    Sibo Wu, Congrong Xu, Binbin Huang, Andreas Geiger, and Anpei Chen. Genfusion: Closing the loop between reconstruction and generation via videos. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 6078–6088, 2025.

  44. [44]

    Depthsplat: Connecting gaussian splatting and depth

    Haofei Xu, Songyou Peng, Fangjinhua Wang, Hermann Blum, Daniel Barath, Andreas Geiger, and Marc Pollefeys. Depthsplat: Connecting gaussian splatting and depth. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 16453–16463, 2025.

  45. [45]

    Freenerf: Improving few-shot neural rendering with free frequency regularization

    Jiawei Yang, Marco Pavone, and Yue Wang. Freenerf: Improving few-shot neural rendering with free frequency regularization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8254–8263,

  46. [46]

    Gaussiandreamer: Fast generation from text to 3d gaussians by bridging 2d and 3d diffusion models

    Taoran Yi, Jiemin Fang, Junjie Wang, Guanjun Wu, Lingxi Xie, Xiaopeng Zhang, Wenyu Liu, Qi Tian, and Xinggang Wang. Gaussiandreamer: Fast generation from text to 3d gaussians by bridging 2d and 3d diffusion models. In CVPR,

  47. [47]

    Gsfixer: Improving 3d gaussian splatting with reference-guided video diffusion priors

    Xingyilang Yin, Qi Zhang, Jiahao Chang, Ying Feng, Qingnan Fan, Xi Yang, Chi-Man Pun, Huaqi Zhang, and Xiaodong Cun. Gsfixer: Improving 3d gaussian splatting with reference-guided video diffusion priors. arXiv preprint arXiv:2508.09667, 2025. 2

  48. [48]

    Plenoctrees for real-time rendering of neural radiance fields

    Alex Yu, Ruilong Li, Matthew Tancik, Hao Li, Ren Ng, and Angjoo Kanazawa. Plenoctrees for real-time rendering of neural radiance fields. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5752–5761, 2021. 2

  49. [49]

    Mip-splatting: Alias-free 3d gaussian splatting

    Zehao Yu, Anpei Chen, Binbin Huang, Torsten Sattler, and Andreas Geiger. Mip-splatting: Alias-free 3d gaussian splatting. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19447–19456,

  50. [50]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018. 6

  51. [51]

    Stable virtual camera: Generative view synthesis with diffusion models

    Jensen Zhou, Hang Gao, Vikram Voleti, Aaryaman Vasishta, Chun-Han Yao, Mark Boss, Philip Torr, Christian Rupprecht, and Varun Jampani. Stable virtual camera: Generative view synthesis with diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12405–12414, 2025. 2, 7, 1, 3

  52. [52]

    Fsgs: Real-time few-shot view synthesis using gaussian splatting

    Zehao Zhu, Zhiwen Fan, Yifan Jiang, and Zhangyang Wang. Fsgs: Real-time few-shot view synthesis using gaussian splatting. In European conference on computer vision, pages 145–163. Springer, 2024. 7

  53. [53]

    Surface splatting

    Matthias Zwicker, Hanspeter Pfister, Jeroen van Baar, and Markus Gross. Surface splatting. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, pages 371–378, New York, NY, USA, 2001. Association for Computing Machinery. 2

    HAD: Hallucination-Aware Diffusion Priors for 3D Reconstruction. Supplementary Material. Overv...

  54. [54]

    Hallucination in NVS via diffusion. We provide additional analysis to deepen understanding of the hallucination issue introduced by diffusion models in the NVS task. Specifically, we evaluate recent state-of-the-art methods from two representative paradigms: diffusion-assisted NVS with an explicit 3DGS model (e.g., Difix3D [41], GenFusion [43] and 3DGS-enhancer [2...

  55. [55]

    Details of Hallucination Scoring Network. Overview of hallucination score network. We provide a detailed model architecture in Fig. 6. Training dataset curation. We provide additional details on constructing the training dataset for the hallucination score network. For all training scenes, we follow the Difix3D [41] pipeline under the 9-view setting to first ...

  56. [56]

    Improving GenFusion. We integrate our hallucination scoring network into GenFusion [43], the state-of-the-art video-diffusion-assisted 3DGS training pipeline

    Generalizing to other diffusion models. 9.1. Improving GenFusion. We integrate our hallucination scoring network into GenFusion [43], the state-of-the-art video-diffusion-assisted 3DGS training pipeline. Importantly, we apply the same HAD model as in the main paper without any additional fine-tuning on video diffusion data. To ensure a fair comparison,...

  57. [57]

    Additional Qualitative Comparisons. We provide additional qualitative results on both the DL3DV, as shown in Fig

    More Results. 10.1. Additional Qualitative Comparisons. We provide additional qualitative results on both the DL3DV (Fig. 10 and Fig. 11) and MipNeRF360 (Fig. 12) datasets. Note that we also include the corresponding rendered videos on the project website, providing a clearer comparison across viewpoints. Both the qualitative results and video...