pith · machine review for the scientific record

arxiv: 2605.12169 · v1 · submitted 2026-05-12 · 💻 cs.CV

Recognition: no theorem link

UniFixer: A Universal Reference-Guided Fixer for Diffusion-Based View Synthesis

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 07:03 UTC · model grok-4.3

classification 💻 cs.CV
keywords diffusion models · view synthesis · novel view synthesis · reference-guided · coarse-to-fine · zero-shot · image restoration · stereo conversion

The pith

UniFixer repairs diffusion degradations in view synthesis with a reference-guided coarse-to-fine refiner.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that diffusion-based view synthesis produces spatial, temporal, and backbone-related degradations such as blur and geometric distortion. It introduces UniFixer as a plug-and-play module that takes one reference view and applies pre-alignment, structure anchoring, and detail injection to restore quality. A sympathetic reader would care because this allows any existing diffusion model to produce cleaner novel views without retraining or changing the backbone. The approach claims to generalize zero-shot across different degradation types and tasks including novel view synthesis and stereo conversion.

Core claim

UniFixer is a universal reference-guided framework that fixes diverse diffusion degradations via a coarse-to-fine strategy. A reference pre-alignment module first performs coarse alignment between the reference view and the degraded novel view. A global structure anchoring mechanism then rectifies geometric distortions, followed by a local detail injection module that recovers fine-grained texture details. This enables plug-and-play zero-shot fixing across diffusion backbones and achieves state-of-the-art performance on novel view synthesis and stereo conversion.
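
Read as a recipe, the core claim is a three-stage post-process around an unchanged backbone. A minimal sketch of that flow (prealign, anchor_structure, and inject_details are hypothetical names of ours, not the authors' published API):

import torch

# Hypothetical sketch of UniFixer's coarse-to-fine flow as described in the
# core claim; module names and signatures are ours, not the authors' API.
def unifixer(degraded: torch.Tensor, reference: torch.Tensor,
             prealign, anchor_structure, inject_details) -> torch.Tensor:
    """degraded/reference: (B, 3, H, W) images in [0, 1]."""
    # 1) Reference pre-alignment: coarsely warp the reference onto the
    #    degraded novel view (geometry- or flow-based; see Figure 4).
    warped_ref = prealign(reference, degraded)
    # 2) Global structure anchoring: rectify geometric distortions by
    #    aggregating structures shared with the warped reference.
    anchored = anchor_structure(degraded, warped_ref)
    # 3) Local detail injection: fuse multi-scale reference features to
    #    recover fine-grained texture.
    return inject_details(anchored, warped_ref)

# Plug-and-play use: run any diffusion view-synthesis model, then refine.
# novel_view = diffusion_backbone(inputs)            # backbone unchanged
# fixed_view = unifixer(novel_view, reference, ...)  # post-hoc refiner

The only point of the sketch is the data flow: the backbone is never retrained, and the single reference view threads through all three stages.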

What carries the argument

The coarse-to-fine refiner consisting of reference pre-alignment, global structure anchoring, and local detail injection modules that use a single reference view to correct degradations.

If this is right

  • Achieves state-of-the-art results on novel view synthesis benchmarks without task-specific retraining.
  • Extends directly to stereo conversion with the same reference-guided modules.
  • Operates zero-shot across different diffusion model architectures and scenes.
  • Reduces blurred details and geometric distortions while preserving structural fidelity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same modules could be tested on diffusion-based video generation to enforce temporal consistency across frames.
  • Performance may degrade when the reference view comes from an extreme viewpoint angle not covered in training.
  • Chaining multiple reference views could further reduce residual artifacts in complex scenes.

Load-bearing premise

A single reference view always supplies enough undistorted information to correct all three degradation types without introducing new artifacts, and the modules generalize zero-shot to unseen diffusion backbones and scenes.

What would settle it

Applying UniFixer to a new diffusion backbone on a scene where the reference view lacks key details and checking whether output artifacts persist or new distortions appear.
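
One hedged way to run that check, assuming ground-truth views exist for the held-out scene and some fixer callable is at hand (both are assumptions; nothing below is the authors' evaluation code):

import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def settle(gt: np.ndarray, degraded: np.ndarray, fixed: np.ndarray) -> dict:
    """All arrays: (H, W, 3) floats in [0, 1]; gt is the ground-truth view."""
    psnr_gain = (peak_signal_noise_ratio(gt, fixed, data_range=1.0)
                 - peak_signal_noise_ratio(gt, degraded, data_range=1.0))
    ssim_gain = (structural_similarity(gt, fixed, channel_axis=-1, data_range=1.0)
                 - structural_similarity(gt, degraded, channel_axis=-1, data_range=1.0))
    # The claim survives only if fixing helps without introducing anything
    # new: both gains should be positive on a backbone the fixer never saw.
    return {"psnr_gain": psnr_gain, "ssim_gain": ssim_gain}

Negative gains on scenes where the reference lacks key details would be exactly the failure mode the load-bearing premise rules out.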

Figures

Figures reproduced from arXiv: 2605.12169 by Christopher Schroers, Sihan Chen, Tunc Aydin, Xiang Zhang, Yang Zhang.

Figure 1. Existing diffusion-based view synthesis, including explicit/implicit novel view synthesis and stereo conversion approaches, often suffer from diffusion degradations (e.g., inconsistent textures and distorted structures) due to pixel-to-latent compression and diffusion hallucination. Moreover, diffusion degradation varies with different spatial resolutions, temporal dynamics, and diffusion backbones, posing…

Figure 2. Degradation analysis with t-SNE feature visualization of spatial (×2/×3/×4/×6), temporal (frame strides ∈ {1, 3, 6, 9}), and backbone-related (UNet [54] and DiT [19]) degradations on a DL3DV [28] scene. The shared setting (spatial ×4, temporal stride 1, and DiT-based backbone) is plotted once as backbone_DiT. Color families denote degradation types, and the red star marks the ground truth. Crosses/ci…

Figure 3. Pipeline of UniFixer. Given a degraded novel view and a high-quality reference view, we first perform coarse alignment by the reference pre-alignment module. Leveraging the warped reference, we apply the global structure anchoring module to aggregate shared structures via reference-mixed attention. The local detail injection module adaptively fuses multi-scale features for fine-grained texture enhancement…

Figure 4. Reference pre-alignment. For explicit view synthesis methods, i.e., DWI-based methods, we estimate depth information and perform geometry-based warping using camera poses. For implicit approaches, we perform flow-based warping with the optical flow estimated from the reference and the synthesized novel view. (A minimal flow-warping sketch follows this figure list.)

Figure 5. Visual results of applying novel view fixers (including DIFIX3D+ [48], MaRINeR [7], and ours) to improve diffusion-based view synthesis methods.

Figure 6. Visual results of ablation study on component design and referencing mechanism, respectively. The ID corresponds to the ID in Tab. 4a and 4b of the main paper.

Figure 7. DINOv3 feature map visualization. Panels: (a) input image with feature and attention maps at ×4/×8/×16; (b) DINOv3 feature maps at different scales. The same image is fed at three input resolutions (480×832 / 960×1664 / 1920×3328, denoted ×4/×8/×16). (A DINOv3 + t-SNE sketch follows the abstract below.)

Figure 8. t-SNE visualization under different degradation dimensions. Lighter colors indicate more severe degradation.

Figure 9. Additional degradation and fixing visual results. The visual quality consistently improves as the degradation severity decreases (across rows) and as more effective fixers are applied (across columns). These results are a subset of the samples used to produce the t-SNE visualization.

Figure 10. Flow-based warping failure case. When the viewpoint change between target and source view is large, SEA-RAFT can struggle to estimate accurate correspondences, which further leads to erroneous warping.

Figure 11. Warping and view-dependent effects discussion. For explicit DWI methods, we observe two main issues: (1) view-dependent effects can be baked into the estimated geometry and thus propagated by warping; (2) a single surface point is typically constrained to have a single depth, which cannot represent multi-depth structures (e.g., semi-transparent or layered regions), leading to warping errors. These error…

Figure 12. Visual results of applying plug-and-play fixers (including DIFIX3D+ [48], MaRINeR [7], and ours) to improve diffusion-based view synthesis methods.

Figure 12 (continued). Visual results of applying plug-and-play fixers.

Figure 13. Visual results of in-the-wild test.
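
For the flow-based branch of reference pre-alignment (Figures 4 and 10), the core operation is backward warping the reference image by an estimated optical flow. A minimal PyTorch sketch of that step alone; the flow estimator (SEA-RAFT [46] in the paper) is abstracted away, and the tensor conventions are our assumptions:

import torch
import torch.nn.functional as F

def flow_warp(reference: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """reference: (B, 3, H, W). flow: (B, 2, H, W) in pixels, giving for each
    pixel of the degraded view its source location in the reference."""
    _, _, h, w = reference.shape
    # Base grid of pixel coordinates (x, y).
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).to(reference.dtype).to(reference.device)
    coords = grid.unsqueeze(0) + flow  # displaced source coordinates
    # Normalize to [-1, 1] as grid_sample expects (x against W, y against H).
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    sample_grid = torch.stack((gx, gy), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(reference, sample_grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

As Figure 10 notes, the sketch inherits the flow's failure modes: a wrong correspondence warps the wrong pixel, which is why large viewpoint changes degrade the pre-alignment.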
read the original abstract

With the recent surge of generative models, diffusion-based approaches have become mainstream for view synthesis tasks, either in an explicit depth-warp-inpaint or in an implicit end-to-end manner. Despite their success, both paradigms often suffer from noticeable quality degradation, e.g., blurred details and distorted structures, caused by pixel-to-latent compression and diffusion hallucination. In this paper, we investigate diffusion degradation from three key dimensions (i.e., spatial, temporal, and backbone-related) and propose UniFixer, a universal reference-guided framework that fixes diverse degradation artifacts via a coarse-to-fine strategy. Specifically, a reference pre-alignment module is first designed to perform coarse alignment between the reference view and the degraded novel view. A global structure anchoring mechanism then rectifies geometric distortions to ensure structural fidelity, followed by a local detail injection module that recovers fine-grained texture details for high-quality view synthesis. Our UniFixer serves as a plug-and-play refiner that achieves zero-shot fixing across different types of diffusion degradation, and extensive experiments verify our state-of-the-art performance on novel view synthesis and stereo conversion.
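
The degradation analysis the abstract alludes to (visualized in Figures 2 and 8) amounts to embedding degraded renders with a self-supervised ViT and projecting the embeddings with t-SNE [30]. A sketch of that recipe follows; we load DINOv2 through torch.hub purely because its loading API is public and stable, where the paper uses DINOv3 [41], so treat the extractor as a stand-in:

import torch
from sklearn.manifold import TSNE

# Stand-in feature extractor (the paper uses DINOv3 patch tokens).
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

@torch.no_grad()
def embed(images: torch.Tensor) -> torch.Tensor:
    """images: (N, 3, H, W) with H, W divisible by the patch size (14)."""
    tokens = model.forward_features(images)["x_norm_patchtokens"]  # (N, P, D)
    return tokens.mean(dim=1)  # pool patch tokens into one vector per image

# degraded_sets: e.g. {"spatial_x4": imgs, "temporal_s3": imgs, ...}
# embs = torch.cat([embed(x) for x in degraded_sets.values()]).numpy()
# proj = TSNE(n_components=2, perplexity=10).fit_transform(embs)
# Separable clusters per color family would support the paper's taxonomy of
# spatial, temporal, and backbone-related degradations.

Whether mean-pooled patch tokens are the exact statistic the authors feed t-SNE is not stated in the excerpt; the pooling choice here is ours.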

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes UniFixer, a universal reference-guided framework for correcting degradation artifacts in diffusion-based view synthesis. It decomposes degradation into spatial, temporal, and backbone-related dimensions and employs a coarse-to-fine pipeline consisting of a reference pre-alignment module, a global structure anchoring mechanism, and a local detail injection module. The central claim is that this architecture acts as a plug-and-play, zero-shot refiner that achieves state-of-the-art performance on novel view synthesis and stereo conversion tasks.

Significance. If the zero-shot and SOTA claims are substantiated, the work would offer a practical, model-agnostic post-processing tool that mitigates common diffusion artifacts (blurring, geometric distortion) without retraining the underlying generative backbone. This could be broadly useful given the prevalence of diffusion models in view synthesis pipelines.

major comments (2)
  1. [Abstract] The assertion of 'state-of-the-art performance' and 'extensive experiments' is unsupported by any quantitative metrics, tables, error bars, ablation studies, or dataset descriptions, rendering the central empirical claim unverifiable from the provided text.
  2. [Abstract] The zero-shot generalization claim for the coarse-to-fine modules across unseen diffusion backbones rests on an untested assumption that degradation statistics are sufficiently similar; no evidence is supplied that the reference pre-alignment, global structure anchoring, or local detail injection steps avoid introducing new geometric or texture artifacts when latent spaces or sampling schedules differ.
minor comments (1)
  1. [Abstract] The three degradation dimensions (spatial, temporal, backbone-related) are introduced without a supporting citation or prior reference to establish the taxonomy.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the concerns about the abstract's empirical claims by clarifying the supporting evidence in the full manuscript and committing to revisions for better verifiability. Point-by-point responses follow.

read point-by-point responses
  1. Referee: [Abstract] The assertion of 'state-of-the-art performance' and 'extensive experiments' is unsupported by any quantitative metrics, tables, error bars, ablation studies, or dataset descriptions, rendering the central empirical claim unverifiable from the provided text.

    Authors: We acknowledge that the abstract, as a concise summary, omits the detailed metrics. The full manuscript contains quantitative tables with metrics such as PSNR/SSIM/LPIPS, error bars, ablation studies, and dataset descriptions for novel view synthesis and stereo conversion in the Experiments section. To improve verifiability, we will revise the abstract to include key performance highlights substantiating the SOTA claim. revision: yes

  2. Referee: [Abstract] The zero-shot generalization claim for the coarse-to-fine modules across unseen diffusion backbones rests on an untested assumption that degradation statistics are sufficiently similar; no evidence is supplied that the reference pre-alignment, global structure anchoring, or local detail injection steps avoid introducing new geometric or texture artifacts when latent spaces or sampling schedules differ.

    Authors: Our experiments evaluate UniFixer on multiple diffusion-based view synthesis pipelines with different backbones and sampling schedules, showing consistent zero-shot improvements without new artifacts via quantitative and qualitative results. The reference-guided design aims to be backbone-agnostic. We agree that explicit tests on additional unseen backbones would strengthen the claim, and we will add further analysis or experiments in the revision. revision: partial

Circularity Check

0 steps flagged

No circularity: independent architectural modules with no derivation reducing to inputs

full rationale

The paper introduces UniFixer as a plug-and-play refiner consisting of three explicitly designed modules (reference pre-alignment, global structure anchoring, local detail injection) that operate via a coarse-to-fine strategy on diffusion degradations. No equations, fitted parameters, predictions, or first-principles derivations appear in the abstract or described method. The central claim rests on the novelty of the reference-guided framework and its zero-shot applicability, verified by experiments rather than any self-referential reduction. No self-citation load-bearing, ansatz smuggling, or renaming of known results is present; the contribution is an engineering architecture independent of its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 3 invented entities

Abstract introduces three new modules (reference pre-alignment, global structure anchoring, local detail injection) as core components; no free parameters, mathematical axioms, or external benchmarks are mentioned.

invented entities (3)
  • reference pre-alignment module · no independent evidence
    purpose: perform coarse alignment between reference view and degraded novel view
    New module introduced to handle initial alignment step.
  • global structure anchoring mechanism · no independent evidence
    purpose: rectify geometric distortions to ensure structural fidelity
    New mechanism proposed for correcting structure.
  • local detail injection module · no independent evidence
    purpose: recover fine-grained texture details
    New module for injecting local textures.

pith-pipeline@v0.9.0 · 5505 in / 1211 out tokens · 100110 ms · 2026-05-13T07:03:37.613540+00:00 · methodology


Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · 10 internal anchors

1. An, J., Huang, S., Song, Y., Dou, D., Liu, W., Luo, J.: Artflow: Unbiased image style transfer via reversible neural flows. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 862–871 (2021)

2. Bahmani, S., Shen, T., Ren, J., Huang, J., Jiang, Y., Turki, H., Tagliasacchi, A., Lindell, D.B., Gojcic, Z., Fidler, S., et al.: Lyra: Generative 3d scene reconstruction via video diffusion model self-distillation. arXiv preprint arXiv:2509.19296 (2025)

3. Bai, J., Xia, M., Fu, X., Wang, X., Mu, L., Cao, J., Liu, Z., Hu, H., Bai, X., Wan, P., et al.: Recammaster: Camera-controlled generative rendering from a single video. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 14834–14844 (2025)

4. Behrens, T., Obukhov, A., Ke, B., Tosi, F., Poggi, M., Schindler, K.: Stereospace: Depth-free synthesis of stereo geometry via end-to-end diffusion in a canonical space. arXiv preprint arXiv:2512.10959 (2025)

5. Bernasconi, M., Djelouah, A., Zhang, Y., Gross, M., Schroers, C.: Rebair: Reference-based image restoration. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5489–5498 (2025)

6. Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)

7. Bösiger, L., Dusmanu, M., Pollefeys, M., Bauer, Z.: Mariner: Enhancing novel views by matching rendered images with nearby references. In: European Conference on Computer Vision. pp. 76–94. Springer (2024)

8. Cao, J., Liang, J., Zhang, K., Li, Y., Zhang, Y., Wang, W., Gool, L.V.: Reference-based image super-resolution with deformable attention transformer. In: European Conference on Computer Vision. pp. 325–342. Springer (2022)

9. Chu, R., He, Y., Chen, Z., Zhang, S., Xu, X., Xia, B., Wang, D., Yi, H., Liu, X., Zhao, H., et al.: Wan-move: Motion-controllable video generation via latent trajectory guidance. arXiv preprint arXiv:2512.08765 (2025)

10. Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., Wei, Y.: Deformable convolutional networks (2017), https://arxiv.org/abs/1703.06211

11. Ding, K., Ma, K., Wang, S., Simoncelli, E.P.: Image quality assessment: Unifying structure and texture similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence 44(5), 2567–2581 (2022)

12. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)

13. Fan, X., Girish, S., Ramanujan, V., Wang, C., Mirzaei, A., Sushko, P., Siarohin, A., Tulyakov, S., Krishna, R.: Omniview: An all-seeing diffusion model for 3d and 4d view synthesis (2025), https://arxiv.org/abs/2512.10940

14. Gao, R., Holynski, A., Henzler, P., Brussee, A., Martin-Brualla, R., Srinivasan, P., Barron, J.T., Poole, B.: Cat3d: Create anything in 3d with multi-view diffusion models. In: NeurIPS (2024)

15. Geyer, M., Tov, O., Jin, L., Tucker, R., Mosseri, I., Dekel, T., Snavely, N.: Eye2eye: A simple approach for monocular-to-stereo video synthesis. arXiv preprint arXiv:2505.00135 (2025)

16. He, H., Xu, Y., Guo, Y., Wetzstein, G., Dai, B., Li, H., Yang, C.: Cameractrl: Enabling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101 (2024)

17. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. In: Advances in Neural Information Processing Systems. pp. 6626–6637 (2017)

18. Izadimehr, M., Ghanbari, M., Chen, G., Zhou, W., Hao, X., Dasari, M., Timmerer, C., Amirpour, H.: Svd: Spatial video dataset. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 12988–12994 (2025)

19. Jiang, Z., Han, Z., Mao, C., Zhang, J., Pan, Y., Liu, Y.: Vace: All-in-one video creation and editing. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 17191–17202 (2025)

20. Ju, X., Liu, X., Wang, X., Bian, Y., Shan, Y., Xu, Q.: Brushnet: A plug-and-play image inpainting model with decomposed dual-branch diffusion. In: European Conference on Computer Vision. pp. 150–168. Springer (2024)

21. Ke, B., Obukhov, A., Huang, S., Metzger, N., Daudt, R.C., Schindler, K.: Repurposing diffusion-based image generators for monocular depth estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9492–9502 (2024)

22. Ke, J., Wang, Q., Wang, Y., Milanfar, P., Yang, F.: Musiq: Multi-scale image quality transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5148–5157 (2021)

23. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

24. Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013)

25. Kolkin, N., Kucera, M., Paris, S., Sykora, D., Shechtman, E., Shakhnarovich, G.: Neural neighbor style transfer. arXiv preprint arXiv:2203.13215 (2022)

26. Liang, H., Cao, J., Goel, V., Qian, G., Korolev, S., Terzopoulos, D., Plataniotis, K.N., Tulyakov, S., Ren, J.: Wonderland: Navigating 3d scenes from a single image. arXiv preprint arXiv:2412.12091 (2024)

27. Lin, H., Chen, S., Liew, J., Chen, D.Y., Li, Z., Shi, G., Feng, J., Kang, B.: Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647 (2025)

28. Ling, L., Sheng, Y., Tu, Z., Zhao, W., Xin, C., Wan, K., Yu, L., Guo, Q., Yu, Z., Lu, Y., et al.: Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22160–22169 (2024)

29. Lu, L., Li, W., Tao, X., Lu, J., Jia, J.: Masa-sr: Matching acceleration and spatial adaptation for reference-based image super-resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6368–6377 (2021)

30. Van der Maaten, L., Hinton, G.: Visualizing data using t-sne. Journal of Machine Learning Research 9(11) (2008)

31. Mehl, L., Bruhn, A., Gross, M., Schroers, C.: Stereo conversion with disparity-aware warping, compositing and inpainting. In: WACV. pp. 4260–4269 (2024)

32. Mehl, L., Schmalfuss, J., Jahedi, A., Nalivayko, Y., Bruhn, A.: Spring: A high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4981–4991 (2023)

33. Metzger, N., Truong, P., Bhat, G., Schindler, K., Tombari, F.: Elastic3d: Controllable stereo video conversion with guided latent decoding. arXiv preprint arXiv:2512.14236 (2025)

34. Niklaus, S., Liu, F.: Softmax splatting for video frame interpolation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5437–5446 (2020)

35. Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4195–4205 (2023)

36. Ren, X., Shen, T., Huang, J., Ling, H., Lu, Y., Nimier-David, M., Müller, T., Keller, A., Fidler, S., Gao, J.: Gen3c: 3d-informed world-consistent video generation with precise camera control. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6121–6132 (2025)

37. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 234–241. Springer (2015)

38. Sauer, A., Lorenz, D., Blattmann, A., Rombach, R.: Adversarial diffusion distillation. In: European Conference on Computer Vision. pp. 87–103. Springer (2024)

39. Shen, G., Du, Y., Ge, W., He, J., Chang, C., Zhou, D., Yang, Z., Wang, L., Tao, X., Chen, Y.C.: Stereopilot: Learning unified and efficient stereo conversion via generative priors. arXiv preprint arXiv:2512.16915 (2025)

40. Shvetsova, N., Bhat, G., Truong, P., Kuehne, H., Tombari, F.: M2svid: End-to-end inpainting and refinement for monocular-to-stereo video conversion. arXiv preprint arXiv:2505.16565 (2025)

41. Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., et al.: Dinov3. arXiv preprint arXiv:2508.10104 (2025)

42. Sitzmann, V., Rezchikov, S., Freeman, B., Tenenbaum, J., Durand, F.: Light field networks: Neural scene representations with single-evaluation rendering. Advances in Neural Information Processing Systems 34, 19313–19325 (2021)

43. Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)

44. Wang, J., Chan, K.C., Loy, C.C.: Exploring clip for assessing the look and feel of images. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 37, pp. 2555–2563 (2023)

45. Wang, L., Frisvad, J.R., Jensen, M.B., Bigdeli, S.A.: Stereodiffusion: Training-free stereo image generation using latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7416–7425 (2024)

46. Wang, Y., Lipson, L., Deng, J.: Sea-raft: Simple, efficient, accurate raft for optical flow. In: European Conference on Computer Vision. pp. 36–54. Springer (2024)

47. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13(4), 600–612 (2004)

48. Wu, J.Z., Zhang, Y., Turki, H., Ren, X., Gao, J., Shou, M.Z., Fidler, S., Gojcic, Z., Ling, H.: Difix3d+: Improving 3d reconstructions with single-step diffusion models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 26024–26035 (2025)

49. Xing, K., Jin, X., Li, L., Yin, Y., Liang, H., Luo, G., Fang, C., Wang, J., Plataniotis, K.N., Zhao, Y., et al.: Stereoworld: Geometry-aware monocular-to-stereo video generation. arXiv preprint arXiv:2512.09363 (2025)

50. Yang, L., Kang, B., Huang, Z., Zhao, Z., Xu, X., Feng, J., Zhao, H.: Depth anything v2. Advances in Neural Information Processing Systems 37, 21875–21911 (2024)

51. Yang, S., Wu, T., Shi, S., Lao, S.s., Gong, Y., Cao, M., Wang, J., Yang, Y.: Maniqa: Multi-dimension attention network for no-reference image quality assessment. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. pp. 1190–1199 (2022)

52. Yoo, J., Uh, Y., Chun, S., Kang, B., Ha, J.W.: Photorealistic style transfer via wavelet transforms. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9036–9045 (2019)

53. Yu, S., Chen, Y., Qi, Z., Xie, Z., Wang, Y., Wang, L., Shan, Y., Lu, H.: Mono2stereo: A benchmark and empirical study for stereo conversion. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 21847–21856 (2025)

54. Yu, W., Xing, J., Yuan, L., Hu, W., Li, X., Huang, Z., Gao, X., Wong, T.T., Shan, Y., Tian, Y.: Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis. arXiv preprint arXiv:2409.02048 (2024)

55. Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 586–595 (2018)

56. Zhang, X., Ke, B., Riemenschneider, H., Metzger, N., Obukhov, A., Gross, M., Schindler, K., Schroers, C.: Betterdepth: Plug-and-play diffusion refiner for zero-shot monocular depth estimation. Advances in Neural Information Processing Systems 37, 108674–108709 (2024)

57. Zhang, X., Zhang, Y., Mehl, L., Gross, M., Schroers, C.: High-fidelity novel view synthesis via splatting-guided diffusion. In: Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers. pp. 1–11 (2025)

58. Zhang, X., Zhang, Y., Mehl, L., Gross, M., Schroers, C.: Guardians of the hair: Rescuing soft boundaries in depth, stereo, and novel views. In: Proceedings of the Computer Vision and Pattern Recognition Conference (2026)

59. Zhao, S., Hu, W., Cun, X., Zhang, Y., Li, X., Kong, Z., Gao, X., Niu, M., Shan, Y.: Stereocrafter: Diffusion-based generation of long and high-fidelity stereoscopic 3d from monocular videos. arXiv preprint arXiv:2409.07447 (2024)

60. Zhou, J., Gao, H., Voleti, V., Vasishta, A., Yao, C.H., Boss, M., Torr, P., Rupprecht, C., Jampani, V.: Stable virtual camera: Generative view synthesis with diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 12405–12414 (2025)