pith · machine review for the scientific record

arxiv: 2605.12169 · v1 · submitted 2026-05-12 · 💻 cs.CV

Recognition: no theorem link

UniFixer: A Universal Reference-Guided Fixer for Diffusion-Based View Synthesis

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 07:03 UTC · model grok-4.3

classification 💻 cs.CV
keywords diffusion models · view synthesis · novel view synthesis · reference-guided · coarse-to-fine · zero-shot · image restoration · stereo conversion

The pith

UniFixer repairs diffusion degradations in view synthesis with a reference-guided coarse-to-fine refiner.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that diffusion-based view synthesis produces spatial, temporal, and backbone-related degradations such as blur and geometric distortion. It introduces UniFixer as a plug-and-play module that takes one reference view and applies pre-alignment, structure anchoring, and detail injection to restore quality. A sympathetic reader would care because this allows any existing diffusion model to produce cleaner novel views without retraining or changing the backbone. The approach claims to generalize zero-shot across different degradation types and tasks including novel view synthesis and stereo conversion.

Core claim

UniFixer is a universal reference-guided framework that fixes diverse diffusion degradations via a coarse-to-fine strategy. A reference pre-alignment module first performs coarse alignment between the reference view and the degraded novel view. A global structure anchoring mechanism then rectifies geometric distortions, followed by a local detail injection module that recovers fine-grained texture details. This enables plug-and-play zero-shot fixing across diffusion backbones and achieves state-of-the-art performance on novel view synthesis and stereo conversion.
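
Read as a recipe, the core claim is a three-stage post-process around an unchanged backbone. A minimal sketch of that flow (prealign, anchor_structure, and inject_details are hypothetical names of ours, not the authors' published API):

import torch

# Hypothetical sketch of UniFixer's coarse-to-fine flow as described in the
# core claim; module names and signatures are ours, not the authors' API.
def unifixer(degraded: torch.Tensor, reference: torch.Tensor,
             prealign, anchor_structure, inject_details) -> torch.Tensor:
    """degraded/reference: (B, 3, H, W) images in [0, 1]."""
    # 1) Reference pre-alignment: coarsely warp the reference onto the
    #    degraded novel view (geometry- or flow-based; see Figure 4).
    warped_ref = prealign(reference, degraded)
    # 2) Global structure anchoring: rectify geometric distortions by
    #    aggregating structures shared with the warped reference.
    anchored = anchor_structure(degraded, warped_ref)
    # 3) Local detail injection: fuse multi-scale reference features to
    #    recover fine-grained texture.
    return inject_details(anchored, warped_ref)

# Plug-and-play use: run any diffusion view-synthesis model, then refine.
# novel_view = diffusion_backbone(inputs)            # backbone unchanged
# fixed_view = unifixer(novel_view, reference, ...)  # post-hoc refiner

The only point of the sketch is the data flow: the backbone is never retrained, and the single reference view threads through all three stages.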

What carries the argument

The coarse-to-fine refiner consisting of reference pre-alignment, global structure anchoring, and local detail injection modules that use a single reference view to correct degradations.

If this is right

  • Achieves state-of-the-art results on novel view synthesis benchmarks without task-specific retraining.
  • Extends directly to stereo conversion with the same reference-guided modules.
  • Operates zero-shot across different diffusion model architectures and scenes.
  • Reduces blurred details and geometric distortions while preserving structural fidelity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same modules could be tested on diffusion-based video generation to enforce temporal consistency across frames.
  • Performance may degrade when the reference view comes from an extreme viewpoint angle not covered in training.
  • Chaining multiple reference views could further reduce residual artifacts in complex scenes.

Load-bearing premise

A single reference view always supplies enough undistorted information to correct all three degradation types without introducing new artifacts, and the modules generalize zero-shot to unseen diffusion backbones and scenes.

What would settle it

Applying UniFixer to a new diffusion backbone on a scene where the reference view lacks key details and checking whether output artifacts persist or new distortions appear.
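
One hedged way to run that check, assuming ground-truth views exist for the held-out scene and some fixer callable is at hand (both are assumptions; nothing below is the authors' evaluation code):

import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def settle(gt: np.ndarray, degraded: np.ndarray, fixed: np.ndarray) -> dict:
    """All arrays: (H, W, 3) floats in [0, 1]; gt is the ground-truth view."""
    psnr_gain = (peak_signal_noise_ratio(gt, fixed, data_range=1.0)
                 - peak_signal_noise_ratio(gt, degraded, data_range=1.0))
    ssim_gain = (structural_similarity(gt, fixed, channel_axis=-1, data_range=1.0)
                 - structural_similarity(gt, degraded, channel_axis=-1, data_range=1.0))
    # The claim survives only if fixing helps without introducing anything
    # new: both gains should be positive on a backbone the fixer never saw.
    return {"psnr_gain": psnr_gain, "ssim_gain": ssim_gain}

Negative gains on scenes where the reference lacks key details would be exactly the failure mode the load-bearing premise rules out.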

Figures

Figures reproduced from arXiv: 2605.12169 by Christopher Schroers, Sihan Chen, Tunc Aydin, Xiang Zhang, Yang Zhang.

Figure 1. Existing diffusion-based view synthesis, including explicit/implicit novel view synthesis and stereo conversion approaches, often suffer from diffusion degradations (e.g., inconsistent textures and distorted structures) due to pixel-to-latent compression and diffusion hallucination. Moreover, diffusion degradation varies with different spatial resolutions, temporal dynamics, and diffusion backbones, posing…

Figure 2. Degradation analysis with t-SNE feature visualization of spatial (×2/×3/×4/×6), temporal (frame strides ∈ {1, 3, 6, 9}), and backbone-related (UNet [54] and DiT [19]) degradations on a DL3DV [28] scene. The shared setting (spatial ×4, temporal stride 1, and DiT-based backbone) is plotted once as backbone_DiT. Color families denote degradation types, and the red star marks the ground truth. Crosses/ci…

Figure 3. Pipeline of UniFixer. Given a degraded novel view and a high-quality reference view, we first perform coarse alignment by the reference pre-alignment module. Leveraging the warped reference, we apply the global structure anchoring module to aggregate shared structures via reference-mixed attention. The local detail injection module adaptively fuses multi-scale features for fine-grained texture enhancement…

Figure 4. Reference pre-alignment. For explicit view synthesis methods, i.e., DWI-based methods, we estimate depth information and perform geometry-based warping using camera poses. For implicit approaches, we perform flow-based warping with the optical flow estimated from the reference and the synthesized novel view. (A minimal flow-warping sketch follows this figure list.)

Figure 5. Visual results of applying novel view fixers (including DIFIX3D+ [48], MaRINeR [7], and ours) to improve diffusion-based view synthesis methods.

Figure 6. Visual results of ablation study on component design and referencing mechanism, respectively. The ID corresponds to the ID in Tab. 4a and 4b of the main paper.

Figure 7. DINOv3 feature map visualization. Panels: (a) input image with feature and attention maps at ×4/×8/×16; (b) DINOv3 feature maps at different scales. The same image is fed at three input resolutions (480×832 / 960×1664 / 1920×3328, denoted ×4/×8/×16). (A DINOv3 + t-SNE sketch follows the abstract below.)

Figure 8. t-SNE visualization under different degradation dimensions. Lighter colors indicate more severe degradation.

Figure 9. Additional degradation and fixing visual results. The visual quality consistently improves as the degradation severity decreases (across rows) and as more effective fixers are applied (across columns). These results are a subset of the samples used to produce the t-SNE visualization.

Figure 10. Flow-based warping failure case. When the viewpoint change between target and source view is large, SEA-RAFT can struggle to estimate accurate correspondences, which further leads to erroneous warping.

Figure 11. Warping and view-dependent effects discussion. For explicit DWI methods, we observe two main issues: (1) view-dependent effects can be baked into the estimated geometry and thus propagated by warping; (2) a single surface point is typically constrained to have a single depth, which cannot represent multi-depth structures (e.g., semi-transparent or layered regions), leading to warping errors. These error…

Figure 12. Visual results of applying plug-and-play fixers (including DIFIX3D+ [48], MaRINeR [7], and ours) to improve diffusion-based view synthesis methods.

Figure 12 (continued). Visual results of applying plug-and-play fixers.

Figure 13. Visual results of in-the-wild test.
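
For the flow-based branch of reference pre-alignment (Figures 4 and 10), the core operation is backward warping the reference image by an estimated optical flow. A minimal PyTorch sketch of that step alone; the flow estimator (SEA-RAFT [46] in the paper) is abstracted away, and the tensor conventions are our assumptions:

import torch
import torch.nn.functional as F

def flow_warp(reference: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """reference: (B, 3, H, W). flow: (B, 2, H, W) in pixels, giving for each
    pixel of the degraded view its source location in the reference."""
    _, _, h, w = reference.shape
    # Base grid of pixel coordinates (x, y).
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).to(reference.dtype).to(reference.device)
    coords = grid.unsqueeze(0) + flow  # displaced source coordinates
    # Normalize to [-1, 1] as grid_sample expects (x against W, y against H).
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    sample_grid = torch.stack((gx, gy), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(reference, sample_grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

As Figure 10 notes, the sketch inherits the flow's failure modes: a wrong correspondence warps the wrong pixel, which is why large viewpoint changes degrade the pre-alignment.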
read the original abstract

With the recent surge of generative models, diffusion-based approaches have become mainstream for view synthesis tasks, either in an explicit depth-warp-inpaint or in an implicit end-to-end manner. Despite their success, both paradigms often suffer from noticeable quality degradation, e.g., blurred details and distorted structures, caused by pixel-to-latent compression and diffusion hallucination. In this paper, we investigate diffusion degradation from three key dimensions (i.e., spatial, temporal, and backbone-related) and propose UniFixer, a universal reference-guided framework that fixes diverse degradation artifacts via a coarse-to-fine strategy. Specifically, a reference pre-alignment module is first designed to perform coarse alignment between the reference view and the degraded novel view. A global structure anchoring mechanism then rectifies geometric distortions to ensure structural fidelity, followed by a local detail injection module that recovers fine-grained texture details for high-quality view synthesis. Our UniFixer serves as a plug-and-play refiner that achieves zero-shot fixing across different types of diffusion degradation, and extensive experiments verify our state-of-the-art performance on novel view synthesis and stereo conversion.
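
The degradation analysis the abstract alludes to (visualized in Figures 2 and 8) amounts to embedding degraded renders with a self-supervised ViT and projecting the embeddings with t-SNE [30]. A sketch of that recipe follows; we load DINOv2 through torch.hub purely because its loading API is public and stable, where the paper uses DINOv3 [41], so treat the extractor as a stand-in:

import torch
from sklearn.manifold import TSNE

# Stand-in feature extractor (the paper uses DINOv3 patch tokens).
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

@torch.no_grad()
def embed(images: torch.Tensor) -> torch.Tensor:
    """images: (N, 3, H, W) with H, W divisible by the patch size (14)."""
    tokens = model.forward_features(images)["x_norm_patchtokens"]  # (N, P, D)
    return tokens.mean(dim=1)  # pool patch tokens into one vector per image

# degraded_sets: e.g. {"spatial_x4": imgs, "temporal_s3": imgs, ...}
# embs = torch.cat([embed(x) for x in degraded_sets.values()]).numpy()
# proj = TSNE(n_components=2, perplexity=10).fit_transform(embs)
# Separable clusters per color family would support the paper's taxonomy of
# spatial, temporal, and backbone-related degradations.

Whether mean-pooled patch tokens are the exact statistic the authors feed t-SNE is not stated in the excerpt; the pooling choice here is ours.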

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes UniFixer, a universal reference-guided framework for correcting degradation artifacts in diffusion-based view synthesis. It decomposes degradation into spatial, temporal, and backbone-related dimensions and employs a coarse-to-fine pipeline consisting of a reference pre-alignment module, a global structure anchoring mechanism, and a local detail injection module. The central claim is that this architecture acts as a plug-and-play, zero-shot refiner that achieves state-of-the-art performance on novel view synthesis and stereo conversion tasks.

Significance. If the zero-shot and SOTA claims are substantiated, the work would offer a practical, model-agnostic post-processing tool that mitigates common diffusion artifacts (blurring, geometric distortion) without retraining the underlying generative backbone. This could be broadly useful given the prevalence of diffusion models in view synthesis pipelines.

major comments (2)
  1. [Abstract] The assertion of 'state-of-the-art performance' and 'extensive experiments' is unsupported by any quantitative metrics, tables, error bars, ablation studies, or dataset descriptions, rendering the central empirical claim unverifiable from the provided text.
  2. [Abstract] The zero-shot generalization claim for the coarse-to-fine modules across unseen diffusion backbones rests on an untested assumption that degradation statistics are sufficiently similar; no evidence is supplied that the reference pre-alignment, global structure anchoring, or local detail injection steps avoid introducing new geometric or texture artifacts when latent spaces or sampling schedules differ.
minor comments (1)
  1. [Abstract] The three degradation dimensions (spatial, temporal, backbone-related) are introduced without a supporting citation or prior reference to establish the taxonomy.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the concerns about the abstract's empirical claims by clarifying the supporting evidence in the full manuscript and committing to revisions for better verifiability. Point-by-point responses follow.

read point-by-point responses
  1. Referee: [Abstract] The assertion of 'state-of-the-art performance' and 'extensive experiments' is unsupported by any quantitative metrics, tables, error bars, ablation studies, or dataset descriptions, rendering the central empirical claim unverifiable from the provided text.

    Authors: We acknowledge that the abstract, as a concise summary, omits the detailed metrics. The full manuscript contains quantitative tables with metrics such as PSNR/SSIM/LPIPS, error bars, ablation studies, and dataset descriptions for novel view synthesis and stereo conversion in the Experiments section. To improve verifiability, we will revise the abstract to include key performance highlights substantiating the SOTA claim. revision: yes

  2. Referee: [Abstract] The zero-shot generalization claim for the coarse-to-fine modules across unseen diffusion backbones rests on an untested assumption that degradation statistics are sufficiently similar; no evidence is supplied that the reference pre-alignment, global structure anchoring, or local detail injection steps avoid introducing new geometric or texture artifacts when latent spaces or sampling schedules differ.

    Authors: Our experiments evaluate UniFixer on multiple diffusion-based view synthesis pipelines with different backbones and sampling schedules, showing consistent zero-shot improvements without new artifacts via quantitative and qualitative results. The reference-guided design aims to be backbone-agnostic. We agree that explicit tests on additional unseen backbones would strengthen the claim, and we will add further analysis or experiments in the revision. revision: partial

Circularity Check

0 steps flagged

No circularity: independent architectural modules with no derivation reducing to inputs

full rationale

The paper introduces UniFixer as a plug-and-play refiner consisting of three explicitly designed modules (reference pre-alignment, global structure anchoring, local detail injection) that operate via a coarse-to-fine strategy on diffusion degradations. No equations, fitted parameters, predictions, or first-principles derivations appear in the abstract or described method. The central claim rests on the novelty of the reference-guided framework and its zero-shot applicability, verified by experiments rather than any self-referential reduction. No self-citation load-bearing, ansatz smuggling, or renaming of known results is present; the contribution is an engineering architecture independent of its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 3 invented entities

Abstract introduces three new modules (reference pre-alignment, global structure anchoring, local detail injection) as core components; no free parameters, mathematical axioms, or external benchmarks are mentioned.

invented entities (3)
  • reference pre-alignment module · no independent evidence
    purpose: perform coarse alignment between reference view and degraded novel view
    New module introduced to handle initial alignment step.
  • global structure anchoring mechanism · no independent evidence
    purpose: rectify geometric distortions to ensure structural fidelity
    New mechanism proposed for correcting structure.
  • local detail injection module · no independent evidence
    purpose: recover fine-grained texture details
    New module for injecting local textures.

pith-pipeline@v0.9.0 · 5505 in / 1211 out tokens · 100110 ms · 2026-05-13T07:03:37.613540+00:00 · methodology


Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · 10 internal anchors

1. An, J., Huang, S., Song, Y., Dou, D., Liu, W., Luo, J.: Artflow: Unbiased image style transfer via reversible neural flows. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 862–871 (2021)

2. Bahmani, S., Shen, T., Ren, J., Huang, J., Jiang, Y., Turki, H., Tagliasacchi, A., Lindell, D.B., Gojcic, Z., Fidler, S., et al.: Lyra: Generative 3d scene reconstruction via video diffusion model self-distillation. arXiv preprint arXiv:2509.19296 (2025)

3. Bai, J., Xia, M., Fu, X., Wang, X., Mu, L., Cao, J., Liu, Z., Hu, H., Bai, X., Wan, P., et al.: Recammaster: Camera-controlled generative rendering from a single video. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 14834–14844 (2025)

4. Behrens, T., Obukhov, A., Ke, B., Tosi, F., Poggi, M., Schindler, K.: Stereospace: Depth-free synthesis of stereo geometry via end-to-end diffusion in a canonical space. arXiv preprint arXiv:2512.10959 (2025)

5. Bernasconi, M., Djelouah, A., Zhang, Y., Gross, M., Schroers, C.: Rebair: Reference-based image restoration. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5489–5498 (2025)

6. Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)

7. Bösiger, L., Dusmanu, M., Pollefeys, M., Bauer, Z.: Mariner: Enhancing novel views by matching rendered images with nearby references. In: European Conference on Computer Vision. pp. 76–94. Springer (2024)

8. Cao, J., Liang, J., Zhang, K., Li, Y., Zhang, Y., Wang, W., Gool, L.V.: Reference-based image super-resolution with deformable attention transformer. In: European Conference on Computer Vision. pp. 325–342. Springer (2022)

9. Chu, R., He, Y., Chen, Z., Zhang, S., Xu, X., Xia, B., Wang, D., Yi, H., Liu, X., Zhao, H., et al.: Wan-move: Motion-controllable video generation via latent trajectory guidance. arXiv preprint arXiv:2512.08765 (2025)

10. Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., Wei, Y.: Deformable convolutional networks (2017), https://arxiv.org/abs/1703.06211

11. Ding, K., Ma, K., Wang, S., Simoncelli, E.P.: Image quality assessment: Unifying structure and texture similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence 44(5), 2567–2581 (2022)

12. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)

13. Fan, X., Girish, S., Ramanujan, V., Wang, C., Mirzaei, A., Sushko, P., Siarohin, A., Tulyakov, S., Krishna, R.: Omniview: An all-seeing diffusion model for 3d and 4d view synthesis (2025), https://arxiv.org/abs/2512.10940

14. Gao, R., Holynski, A., Henzler, P., Brussee, A., Martin-Brualla, R., Srinivasan, P., Barron, J.T., Poole, B.: Cat3d: Create anything in 3d with multi-view diffusion models. In: NeurIPS (2024)

15. Geyer, M., Tov, O., Jin, L., Tucker, R., Mosseri, I., Dekel, T., Snavely, N.: Eye2eye: A simple approach for monocular-to-stereo video synthesis. arXiv preprint arXiv:2505.00135 (2025)

16. He, H., Xu, Y., Guo, Y., Wetzstein, G., Dai, B., Li, H., Yang, C.: Cameractrl: Enabling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101 (2024)

17. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. In: Advances in Neural Information Processing Systems. pp. 6626–6637 (2017)

18. Izadimehr, M., Ghanbari, M., Chen, G., Zhou, W., Hao, X., Dasari, M., Timmerer, C., Amirpour, H.: Svd: Spatial video dataset. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 12988–12994 (2025)

19. Jiang, Z., Han, Z., Mao, C., Zhang, J., Pan, Y., Liu, Y.: Vace: All-in-one video creation and editing. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 17191–17202 (2025)

20. Ju, X., Liu, X., Wang, X., Bian, Y., Shan, Y., Xu, Q.: Brushnet: A plug-and-play image inpainting model with decomposed dual-branch diffusion. In: European Conference on Computer Vision. pp. 150–168. Springer (2024)

21. Ke, B., Obukhov, A., Huang, S., Metzger, N., Daudt, R.C., Schindler, K.: Repurposing diffusion-based image generators for monocular depth estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9492–9502 (2024)

22. Ke, J., Wang, Q., Wang, Y., Milanfar, P., Yang, F.: Musiq: Multi-scale image quality transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5148–5157 (2021)

23. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

24. Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013)

25. Kolkin, N., Kucera, M., Paris, S., Sykora, D., Shechtman, E., Shakhnarovich, G.: Neural neighbor style transfer. arXiv preprint arXiv:2203.13215 (2022)

26. Liang, H., Cao, J., Goel, V., Qian, G., Korolev, S., Terzopoulos, D., Plataniotis, K.N., Tulyakov, S., Ren, J.: Wonderland: Navigating 3d scenes from a single image. arXiv preprint arXiv:2412.12091 (2024)

27. Lin, H., Chen, S., Liew, J., Chen, D.Y., Li, Z., Shi, G., Feng, J., Kang, B.: Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647 (2025)

28. Ling, L., Sheng, Y., Tu, Z., Zhao, W., Xin, C., Wan, K., Yu, L., Guo, Q., Yu, Z., Lu, Y., et al.: Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22160–22169 (2024)

29. Lu, L., Li, W., Tao, X., Lu, J., Jia, J.: Masa-sr: Matching acceleration and spatial adaptation for reference-based image super-resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6368–6377 (2021)

30. Van der Maaten, L., Hinton, G.: Visualizing data using t-sne. Journal of Machine Learning Research 9(11) (2008)

31. Mehl, L., Bruhn, A., Gross, M., Schroers, C.: Stereo conversion with disparity-aware warping, compositing and inpainting. In: WACV. pp. 4260–4269 (2024)

32. Mehl, L., Schmalfuss, J., Jahedi, A., Nalivayko, Y., Bruhn, A.: Spring: A high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4981–4991 (2023)

33. Metzger, N., Truong, P., Bhat, G., Schindler, K., Tombari, F.: Elastic3d: Controllable stereo video conversion with guided latent decoding. arXiv preprint arXiv:2512.14236 (2025)

34. Niklaus, S., Liu, F.: Softmax splatting for video frame interpolation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5437–5446 (2020)

35. Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4195–4205 (2023)

36. Ren, X., Shen, T., Huang, J., Ling, H., Lu, Y., Nimier-David, M., Müller, T., Keller, A., Fidler, S., Gao, J.: Gen3c: 3d-informed world-consistent video generation with precise camera control. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6121–6132 (2025)

37. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 234–241. Springer (2015)

38. Sauer, A., Lorenz, D., Blattmann, A., Rombach, R.: Adversarial diffusion distillation. In: European Conference on Computer Vision. pp. 87–103. Springer (2024)

39. Shen, G., Du, Y., Ge, W., He, J., Chang, C., Zhou, D., Yang, Z., Wang, L., Tao, X., Chen, Y.C.: Stereopilot: Learning unified and efficient stereo conversion via generative priors. arXiv preprint arXiv:2512.16915 (2025)

40. Shvetsova, N., Bhat, G., Truong, P., Kuehne, H., Tombari, F.: M2svid: End-to-end inpainting and refinement for monocular-to-stereo video conversion. arXiv preprint arXiv:2505.16565 (2025)

41. Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., et al.: Dinov3. arXiv preprint arXiv:2508.10104 (2025)

42. Sitzmann, V., Rezchikov, S., Freeman, B., Tenenbaum, J., Durand, F.: Light field networks: Neural scene representations with single-evaluation rendering. Advances in Neural Information Processing Systems 34, 19313–19325 (2021)

43. Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)

44. Wang, J., Chan, K.C., Loy, C.C.: Exploring clip for assessing the look and feel of images. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 37, pp. 2555–2563 (2023)

45. Wang, L., Frisvad, J.R., Jensen, M.B., Bigdeli, S.A.: Stereodiffusion: Training-free stereo image generation using latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7416–7425 (2024)

46. Wang, Y., Lipson, L., Deng, J.: Sea-raft: Simple, efficient, accurate raft for optical flow. In: European Conference on Computer Vision. pp. 36–54. Springer (2024)

47. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13(4), 600–612 (2004)

48. Wu, J.Z., Zhang, Y., Turki, H., Ren, X., Gao, J., Shou, M.Z., Fidler, S., Gojcic, Z., Ling, H.: Difix3d+: Improving 3d reconstructions with single-step diffusion models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 26024–26035 (2025)

49. Xing, K., Jin, X., Li, L., Yin, Y., Liang, H., Luo, G., Fang, C., Wang, J., Plataniotis, K.N., Zhao, Y., et al.: Stereoworld: Geometry-aware monocular-to-stereo video generation. arXiv preprint arXiv:2512.09363 (2025)

50. Yang, L., Kang, B., Huang, Z., Zhao, Z., Xu, X., Feng, J., Zhao, H.: Depth anything v2. Advances in Neural Information Processing Systems 37, 21875–21911 (2024)

51. Yang, S., Wu, T., Shi, S., Lao, S.s., Gong, Y., Cao, M., Wang, J., Yang, Y.: Maniqa: Multi-dimension attention network for no-reference image quality assessment. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. pp. 1190–1199 (2022)

52. Yoo, J., Uh, Y., Chun, S., Kang, B., Ha, J.W.: Photorealistic style transfer via wavelet transforms. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9036–9045 (2019)

53. Yu, S., Chen, Y., Qi, Z., Xie, Z., Wang, Y., Wang, L., Shan, Y., Lu, H.: Mono2stereo: A benchmark and empirical study for stereo conversion. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 21847–21856 (2025)

54. Yu, W., Xing, J., Yuan, L., Hu, W., Li, X., Huang, Z., Gao, X., Wong, T.T., Shan, Y., Tian, Y.: Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis. arXiv preprint arXiv:2409.02048 (2024)

55. Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 586–595 (2018)

56. Zhang, X., Ke, B., Riemenschneider, H., Metzger, N., Obukhov, A., Gross, M., Schindler, K., Schroers, C.: Betterdepth: Plug-and-play diffusion refiner for zero-shot monocular depth estimation. Advances in Neural Information Processing Systems 37, 108674–108709 (2024)

57. Zhang, X., Zhang, Y., Mehl, L., Gross, M., Schroers, C.: High-fidelity novel view synthesis via splatting-guided diffusion. In: Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers. pp. 1–11 (2025)

58. Zhang, X., Zhang, Y., Mehl, L., Gross, M., Schroers, C.: Guardians of the hair: Rescuing soft boundaries in depth, stereo, and novel views. In: Proceedings of the Computer Vision and Pattern Recognition Conference (2026)

59. Zhao, S., Hu, W., Cun, X., Zhang, Y., Li, X., Kong, Z., Gao, X., Niu, M., Shan, Y.: Stereocrafter: Diffusion-based generation of long and high-fidelity stereoscopic 3d from monocular videos. arXiv preprint arXiv:2409.07447 (2024)

60. Zhou, J., Gao, H., Voleti, V., Vasishta, A., Yao, C.H., Boss, M., Torr, P., Rupprecht, C., Jampani, V.: Stable virtual camera: Generative view synthesis with diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 12405–12414 (2025)