SyncFix: Fixing 3D Reconstructions via Multi-View Synchronization
Pith reviewed 2026-05-10 15:15 UTC · model grok-4.3
The pith
SyncFix refines 3D scene reconstructions by synchronizing multiple views through joint latent bridge matching in a diffusion process.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SyncFix formulates refinement of reconstructed scenes as a joint latent bridge matching problem that synchronizes distorted and clean representations across multiple views to fix semantic and geometric inconsistencies. It learns a joint conditional over multiple views to enforce consistency throughout the denoising trajectory. Training is performed solely on image pairs, yet the approach generalizes naturally to an arbitrary number of views at inference, and reconstruction quality continues to improve with additional views, although gains diminish at higher counts. Qualitative and quantitative results show that SyncFix produces higher-fidelity outputs than current baselines even without clean reference images, and fidelity improves further when sparse references are available.
What carries the argument
Joint latent bridge matching that synchronizes distorted and clean representations across views during diffusion denoising.
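The review describes a model trained only on distorted/clean pairs. A minimal sketch of a rectified-flow-style bridge between a distorted latent and its clean counterpart, in the spirit of latent bridge matching (Chadebec et al.); all names and the noise schedule here are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def bridge_sample(x_distorted, x_clean, t, sigma=0.0, rng=None):
    """Linear bridge point x_t between the distorted latent (t = 0)
    and the clean latent (t = 1), plus the constant target velocity
    (x_clean - x_distorted) that a drift network would regress.
    sigma controls optional Brownian-bridge noise (0 = deterministic)."""
    rng = rng or np.random.default_rng(0)
    noise = sigma * np.sqrt(t * (1.0 - t)) * rng.standard_normal(x_clean.shape)
    x_t = (1.0 - t) * x_distorted + t * x_clean + noise
    v_target = x_clean - x_distorted
    return x_t, v_target

# Toy check: with sigma = 0, the bridge endpoints match the inputs.
x0, x1 = np.zeros(4), np.ones(4)
xt_mid, v = bridge_sample(x0, x1, t=0.5)
```

Training would then minimize the squared error between a network's predicted velocity at `x_t` and `v_target`, averaged over random `t` and latent pairs.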
If this is right
- Reconstruction quality rises as more views are supplied, with diminishing returns after a moderate count.
- The method outperforms existing baselines even when no clean reference images are supplied.
- Higher fidelity is obtained when a small number of clean references can be included.
- The pair-trained model applies without retraining to scenes containing any number of input views.
Where Pith is reading between the lines
- Pairwise training may suffice for multi-view consistency tasks in other diffusion pipelines such as video or novel-view synthesis.
- The synchronization mechanism could be tested on reconstruction pipelines that use different base models or different noise schedules.
- If view count continues to help, the approach might reduce the need for dense capture setups in practical 3D scanning.
Load-bearing premise
Training only on image pairs will generalize to any number of views at inference and the joint matching process will remove inconsistencies without creating new artifacts or losing detail.
What would settle it
Running SyncFix on a multi-view dataset with progressively added views and observing that perceptual or geometric error metrics stop improving or begin to worsen beyond a small number of views.
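That experiment can be scripted directly: refine with k = 1…K views, record an error metric per view count, and flag the first count at which the marginal gain falls below a threshold. A sketch of the analysis step only; the scores themselves would come from the reconstruction pipeline, and the example values below are made up:

```python
def saturation_point(scores, min_gain=0.01):
    """Given error scores indexed by view count (lower is better),
    return the first view count whose marginal improvement over the
    previous count drops below min_gain, or None if gains persist."""
    for k in range(1, len(scores)):
        if scores[k - 1] - scores[k] < min_gain:
            return k + 1  # view counts are 1-indexed
    return None

# Hypothetical example: error falls quickly, then plateaus at 5 views.
errors = [0.50, 0.30, 0.20, 0.15, 0.148, 0.147]
```

If `saturation_point` returns a small count, the claim of diminishing returns holds; if errors ever rise again, the joint matching may be introducing artifacts.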
Figures
Original abstract
We present SyncFix, a framework that enforces cross-view consistency during the diffusion-based refinement of reconstructed scenes. SyncFix formulates refinement as a joint latent bridge matching problem, synchronizing distorted and clean representations across multiple views to fix the semantic and geometric inconsistencies. This means SyncFix learns a joint conditional over multiple views to enforce consistency throughout the denoising trajectory. Our training is done only on image pairs, but it generalizes naturally to an arbitrary number of views during inference. Moreover, reconstruction quality improves with additional views, with diminishing returns at higher view counts. Qualitative and quantitative results demonstrate that SyncFix consistently generates high-quality reconstructions and surpasses current state-of-the-art baselines, even in the absence of clean reference images. SyncFix achieves even higher fidelity when sparse references are available.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents SyncFix, a framework for enforcing cross-view consistency in diffusion-based refinement of 3D scene reconstructions. It formulates the refinement as a joint latent bridge matching problem that synchronizes distorted and clean representations across views to correct semantic and geometric inconsistencies. The method is trained exclusively on image pairs but is claimed to generalize naturally to an arbitrary number of views at inference, with reconstruction quality improving as more views are added (with diminishing returns). It asserts consistent outperformance over state-of-the-art baselines even without clean reference images and further gains when sparse references are available.
Significance. If the pair-to-multi-view generalization holds without introducing artifacts and the claimed improvements are empirically validated, SyncFix could meaningfully advance diffusion-based 3D reconstruction by offering a scalable consistency mechanism that does not require multi-view training data or clean references. The observation of quality gains with additional views would be a useful practical property for real-world capture scenarios.
major comments (2)
- [Abstract] The headline claim that 'training is done only on image pairs, but it generalizes naturally to an arbitrary number of views during inference' via 'joint latent bridge matching' is load-bearing for all performance assertions, yet no formulation is given for how the joint conditional is constructed when n>2 (single multi-view latent vs. repeated pairwise bridges), whether the denoising trajectory stays consistent, or how new artifacts are avoided.
- [Abstract] The assertions of 'consistent outperformance of SOTA baselines' and 'reconstruction quality improves with additional views' are presented without any quantitative metrics, datasets, baselines, error analysis, or implementation details, rendering the central empirical claims unverifiable.
minor comments (2)
- The term 'joint latent bridge matching' is introduced without definition, prior reference, or relation to existing bridge-matching or latent diffusion literature.
- The abstract states that 'qualitative and quantitative results demonstrate...' but supplies no information on the evaluation protocol, datasets, or metrics.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on the manuscript. We address each major comment point by point below, indicating where revisions will be incorporated.
Point-by-point responses
-
Referee: [Abstract] The headline claim that 'training is done only on image pairs, but it generalizes naturally to an arbitrary number of views during inference' via 'joint latent bridge matching' is load-bearing for all performance assertions, yet no formulation is given for how the joint conditional is constructed when n>2 (single multi-view latent vs. repeated pairwise bridges), whether the denoising trajectory stays consistent, or how new artifacts are avoided.
Authors: We agree that the abstract does not provide the requested formulation details for the n>2 case. In the revised version we will expand the abstract to briefly describe the construction of the joint conditional via a single synchronized multi-view latent representation (rather than repeated pairwise bridges), the maintenance of a consistent denoising trajectory through joint conditioning, and the avoidance of new artifacts via cross-view synchronization at each step. We will also ensure the method section supplies the corresponding mathematical details. revision: yes
-
Referee: [Abstract] The assertions of 'consistent outperformance of SOTA baselines' and 'reconstruction quality improves with additional views' are presented without any quantitative metrics, datasets, baselines, error analysis, or implementation details, rendering the central empirical claims unverifiable.
Authors: We concur that the abstract would be more informative with specific supporting details. In the revision we will add concise quantitative highlights (e.g., key performance gains and the observed trend with increasing view count), along with references to the datasets, baselines, and evaluation metrics used in the experiments section. revision: yes
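One plausible reading of the rebuttal's "single synchronized multi-view latent" is to run the pair-trained drift network over all ordered view pairs and average each view's predicted velocity before each update step, so every view's trajectory is conditioned on all others. This is a speculative sketch of that construction, not the paper's actual mechanism; `pair_drift` and `toy_drift` are hypothetical stand-ins:

```python
import numpy as np
from itertools import permutations

def joint_step(latents, pair_drift, t, dt):
    """One synchronized Euler step for n view latents using a drift
    network trained only on pairs. Each view's velocity is the average
    of the pair model's predictions over all pairs containing it."""
    n = len(latents)
    velocity = [np.zeros_like(z) for z in latents]
    counts = [0] * n
    for i, j in permutations(range(n), 2):
        v_i, v_j = pair_drift(latents[i], latents[j], t)
        velocity[i] += v_i; counts[i] += 1
        velocity[j] += v_j; counts[j] += 1
    return [z + dt * v / c for z, v, c in zip(latents, velocity, counts)]

# Toy drift: each latent is pulled toward the pair mean (a stand-in
# for a learned network, used only to exercise the synchronization).
def toy_drift(zi, zj, t):
    m = 0.5 * (zi + zj)
    return m - zi, m - zj
```

With two views and `dt = 1.0`, one step of `joint_step` with `toy_drift` drives both latents to their common mean, illustrating how per-step synchronization couples the trajectories.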
Circularity Check
No significant circularity; claims rest on empirical results rather than self-referential derivation
full rationale
The provided abstract and text assert that pair-trained joint latent bridge matching generalizes to arbitrary view counts at inference and improves with more views, but no equations, fitted parameters, or self-citations are quoted that would reduce any prediction or uniqueness claim to the inputs by construction. The central results (surpassing baselines, higher fidelity with references) are presented as externally validated outcomes, grounded in benchmarks rather than in the construction itself.
Axiom & Free-Parameter Ledger
invented entities (1)
-
joint latent bridge matching
no independent evidence
Forward citations
Cited by 1 Pith paper
-
Generalizable Sparse-View 3D Reconstruction from Unconstrained Images
GenWildSplat is a feed-forward model that reconstructs 3D Gaussians from sparse, unposed, unconstrained images by predicting depth and poses with learned priors, an appearance adapter, and semantic segmentation for transients.
Reference graph
Works this paper leans on
- [1] Ali, A., Bai, J., Bala, M., Balaji, Y., Blakeman, A., Cai, T., Cao, J., Cao, T., Cha, E., Chao, Y.W., et al.: World simulation with video foundation models for physical AI. arXiv preprint arXiv:2511.00062 (2025)
- [2] Asim, M., Wewer, C., Wimmer, T., Schiele, B., Lenssen, J.E.: MEt3R: Measuring multi-view consistency in generated images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6034–6044 (2025)
- [3] Barron, J.T., Mildenhall, B., Tancik, M., Hedman, P., Martin-Brualla, R., Srinivasan, P.P.: Mip-NeRF: A multiscale representation for anti-aliasing neural radiance fields. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5855–5864 (2021)
- [4] Chadebec, C., Tasar, O., Sreetharan, S., Aubin, B.: LBM: Latent bridge matching for fast image-to-image translation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 29086–29098 (2025)
- [5] Deng, K., Liu, A., Zhu, J.Y., Ramanan, D.: Depth-supervised NeRF: Fewer views and faster training for free. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12882–12891 (2022)
- [6] Fu, S., Tamir, N., Sundaram, S., Chai, L., Zhang, R., Dekel, T., Isola, P.: DreamSim: Learning new dimensions of human visual similarity using synthetic data. arXiv preprint arXiv:2306.09344 (2023)
- [7] Gao*, R., Holynski*, A., Henzler, P., Brussee, A., Martin-Brualla, R., Srinivasan, P.P., Barron, J.T., Poole*, B.: CAT3D: Create anything in 3D with multi-view diffusion models. Advances in Neural Information Processing Systems (2024)
- [8] Haque, A., Tancik, M., Efros, A., Holynski, A., Kanazawa, A.: Instruct-NeRF2NeRF: Editing 3D scenes with instructions. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2023)
- [9] Jiang, L., Mao, Y., Xu, L., Lu, T., Ren, K., Jin, Y., Xu, X., Yu, M., Pang, J., Zhao, F., et al.: AnySplat: Feed-forward 3D Gaussian splatting from unconstrained views. ACM Transactions on Graphics (TOG) 44(6), 1–16 (2025)
- [10] Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3D Gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics 42(4) (July 2023), https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/
- [11] Lin, C.H., Gao, J., Tang, L., Takikawa, T., Zeng, X., Huang, X., Kreis, K., Fidler, S., Liu, M.Y., Lin, T.Y.: Magic3D: High-resolution text-to-3D content creation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 300–309 (2023)
- [12] Lindenberger, P., Sarlin, P.E., Pollefeys, M.: LightGlue: Local feature matching at light speed. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 17627–17638 (2023)
- [13] Ling, L., Sheng, Y., Tu, Z., Zhao, W., Xin, C., Wan, K., Yu, L., Guo, Q., Yu, Z., Lu, Y., et al.: DL3DV-10K: A large-scale scene dataset for deep learning-based 3D vision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22160–22169 (2024)
- [14] Lipman, Y., Chen, R.T.Q., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. In: The Eleventh International Conference on Learning Representations (2023), https://openreview.net/forum?id=PqvMRDCJT9t
- [15] Liu, F., Sun, W., Wang, H., Wang, Y., Sun, H., Ye, J., Zhang, J., Duan, Y.: ReconX: Reconstruct any scene from sparse views with video diffusion model. arXiv preprint arXiv:2408.16767 (2024)
- [16] Liu, X., Zhou, C., Huang, S.: 3DGS-Enhancer: Enhancing unbounded 3D Gaussian splatting with view-consistent 2D diffusion priors. In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024), https://openreview.net/forum?id=P4s6FUpCbG
- [17] Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. In: The Eleventh International Conference on Learning Representations (2023), https://openreview.net/forum?id=XVjTT1nw5z
- [18] Liu, Y., Lin, C., Zeng, Z., Long, X., Liu, L., Komura, T., Wang, W.: SyncDreamer: Generating multiview-consistent images from a single-view image. In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=MN3yH2ovHb
- [19] Luo, Y., Zhou, S., Lan, Y., Pan, X., Loy, C.C.: 3DEnhancer: Consistent multi-view diffusion for 3D enhancement. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 16430–16440 (2025)
- [20] Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.Y., Ermon, S.: SDEdit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations (2022), https://openreview.net/forum?id=aBsCjcPu_tE
- [21] Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: NeRF: Representing scenes as neural radiance fields for view synthesis. In: ECCV (2020)
- [22] Niemeyer, M., Barron, J.T., Mildenhall, B., Sajjadi, M.S.M., Geiger, A., Radwan, N.: RegNeRF: Regularizing neural radiance fields for view synthesis from sparse inputs. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) (2022)
- [23] NVIDIA: Fixer: Official repository for the NVIDIA Fixer model. https://github.com/nv-tlabs/Fixer (2025), GitHub repository. Accessed: 2026-03-12
- [24] Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023)
- [25] Poole, B., Jain, A., Barron, J.T., Mildenhall, B.: DreamFusion: Text-to-3D using 2D diffusion. In: The Eleventh International Conference on Learning Representations (2023), https://openreview.net/forum?id=FjNys5c7VyY
- [26] Serrano-Lozano, D., Bhattad, A., Herranz, L., Lalonde, J.F., Vazquez-Corral, J.: SyncLight: Controllable and consistent multi-view relighting. arXiv preprint arXiv:2601.16981 (2026)
- [27] Shenoi, A., Lindenberger, P., Sarlin, P.E., Pollefeys, M.: RaCo: Ranking and covariance for practical learned keypoints. arXiv preprint arXiv:2602.15755 (2026)
- [28] Shi, Y., Wang, P., Ye, J., Mai, L., Li, K., Yang, X.: MVDream: Multi-view diffusion for 3D generation. In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=FUgrjq2pbB
- [29] Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., et al.: DINOv3. arXiv preprint arXiv:2508.10104 (2025)
- [30] Wang, G., Chen, Z., Loy, C.C., Liu, Z.: SparseNeRF: Distilling depth ranking for few-shot novel view synthesis. In: IEEE/CVF International Conference on Computer Vision (ICCV) (2023)
- [31] Wang, H., Du, X., Li, J., Yeh, R.A., Shakhnarovich, G.: Score Jacobian chaining: Lifting pretrained 2D diffusion models for 3D generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12619–12629 (2023)
- [32] Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: VGGT: Visual geometry grounded transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2025)
- [33] Wang, Z., Lu, C., Wang, Y., Bao, F., Li, C., Su, H., Zhu, J.: ProlificDreamer: High-fidelity and diverse text-to-3D generation with variational score distillation. In: Advances in Neural Information Processing Systems (NeurIPS) (2023)
- [34] Warburg, F., Weber, E., Tancik, M., Holynski, A., Kanazawa, A.: Nerfbusters: Removing ghostly artifacts from casually captured NeRFs. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 18120–18130 (2023)
- [35] Warburg*, F., Weber*, E., Tancik, M., Hołyński, A., Kanazawa, A.: Nerfbusters: Removing ghostly artifacts from casually captured NeRFs (2023)
- [36] Wickrema, C., Leary, S., Sarkar, S., Giglio, M., Bianchi, E., Mace, E., Twardowski, M.: Benchmarking image similarity metrics for novel view synthesis applications. arXiv preprint arXiv:2506.12563 (2025)
- [37] Wu, J.Z., Zhang, Y., Turki, H., Ren, X., Gao, J., Shou, M.Z., Fidler, S., Gojcic, Z., Ling, H.: Difix3D+: Improving 3D reconstructions with single-step diffusion models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 26024–26035 (2025)
- [38] Wu, R., Mildenhall, B., Henzler, P., Park, K., Gao, R., Watson, D., Srinivasan, P.P., Verbin, D., Barron, J.T., Poole, B., et al.: ReconFusion: 3D reconstruction with diffusion priors. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21551–21561 (2024)
- [39] Xu, H., Peng, S., Wang, F., Blum, H., Barath, D., Geiger, A., Pollefeys, M.: DepthSplat: Connecting Gaussian splatting and depth. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 16453–16463 (2025)
- [40] Yang, J., Pavone, M., Wang, Y.:
- [41] Ye, B., Chen, B., Xu, H., Barath, D., Pollefeys, M.: YoNoSplat: You only need one model for feedforward 3D Gaussian splatting. In: The Fourteenth International Conference on Learning Representations (2026), https://openreview.net/forum?id=ImRhA9xmay
- [42] Yu, W., Xing, J., Yuan, L., Hu, W., Li, X., Huang, Z., Gao, X., Wong, T.T., Shan, Y., Tian, Y.: ViewCrafter: Taming video diffusion models for high-fidelity novel view synthesis. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)
- [43] Zhou, H., Shao, Z., Miao, S., Wang, P., Bai, D., Liu, B., Liao, Y.: FreeFix: Boosting 3D Gaussian splatting via fine-tuning-free diffusion models. arXiv preprint arXiv:2601.20857 (2026)
Appendix excerpt (A.1 Number of Views Analysis)
SyncFix is trained on view pairs, but our permutation-invariant latent c...
Fig. 10 reveals a clear trend that the CVSC score improves with more degraded views, regardless of the number of reference views. The biggest gain occurs when moving from single-view to two-view inference. This finding supports our claim that jointly refining more views yields better cross-view consistency.
Fig. 11 shows that FID tends to increase as either the number of degraded views or the number of reference views grows. We attribute this behavior to the attention being distributed across a larger set of inputs during joint conditioning, which can introduce a moderate averaging/smoothing effect.
Fig. 12 indicates that increasing the number of input views, particularly the reference images, leads to a higher PSNR and plateaus with more views. We observe that the local details are better recovered in the regions that are visible in the reference views. We report the results with 5 degraded views and 5 reference views in our main text.