
arxiv: 2411.14295 · v3 · submitted 2024-11-21 · 💻 cs.CV

DissolveStereo: Coarse Depth Injection for Zero-Shot Stereo Video Generation

Pith reviewed 2026-05-23 17:14 UTC · model grok-4.3

classification 💻 cs.CV
keywords: stereo video generation, zero-shot synthesis, diffusion models, depth injection, view consistency, temporal coherence, epipolar geometry

The pith

DissolveStereo generates consistent stereo videos from monocular diffusion models by injecting coarse depth maps without paired training data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DissolveStereo to create stereo videos using video diffusion models trained only on regular video, without any stereo pairs. It initializes left and right views with a noisy restart to set up compatible starting points in latent space, then runs iterative refinement while feeding in simplified depth information. The simplification step, dissolving the depth maps, drops high-frequency details so the two views can align more easily during generation. This produces videos with stronger matching between left and right perspectives and less flickering across frames.
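
The noisy restart is described here only at a high level. As a minimal sketch, assuming it resembles standard forward-diffusion re-noising of an existing latent (an assumption, not something this summary confirms), the step could look like the following. The function `renoise`, the parameter `alpha_bar`, and the toy latent are hypothetical names chosen for illustration.

```python
import math
import random

def renoise(latent, alpha_bar, rng):
    """Mix a latent with Gaussian noise; alpha_bar in (0, 1] is the kept signal share."""
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    return [signal * x + noise * rng.gauss(0.0, 1.0) for x in latent]

rng = random.Random(0)                    # fixed seed so the sketch is repeatable
left_latent = [0.5, -1.0, 2.0, 0.0]       # toy stand-in for a flattened video latent
right_init = renoise(left_latent, alpha_bar=0.3, rng=rng)
```

A smaller `alpha_bar` discards more of the starting latent and leaves the sampler more freedom, while `alpha_bar=1.0` returns the input unchanged.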

Core claim

DissolveStereo shows that a noisy restart strategy combined with iterative refinement and dissolved depth maps allows zero-shot stereo video generation from video diffusion priors, producing higher epipolar consistency and temporal smoothness than prior approaches.

What carries the argument

Dissolved depth maps, which reduce high-frequency depth information to simplify and align latent representations across views during refinement.
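
To make "reducing high-frequency depth information" concrete, here is a minimal sketch that swaps in a plain box blur for the paper's dissolving step. This is not the authors' operator (the Figure 6 caption indicates the dissolved maps come from DepthCrafter's diffusion schedule); the helper names `box_blur` and `total_variation` are hypothetical. It only shows that smoothing a depth map lowers its edge energy.

```python
def box_blur(depth, radius=1):
    """Average each pixel with its neighbours inside a (2r+1) x (2r+1) window."""
    h, w = len(depth), len(depth[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            vals = [depth[j][i] for j in ys for i in xs]
            out[y][x] = sum(vals) / len(vals)
    return out

def total_variation(depth):
    """Sum of absolute horizontal and vertical differences (a proxy for fine detail)."""
    h, w = len(depth), len(depth[0])
    horiz = sum(abs(depth[y][x + 1] - depth[y][x]) for y in range(h) for x in range(w - 1))
    vert = sum(abs(depth[y + 1][x] - depth[y][x]) for y in range(h - 1) for x in range(w))
    return horiz + vert

# Toy depth map: a near/far checkerboard, i.e. almost pure high-frequency content.
depth = [[float((x + y) % 2) for x in range(6)] for y in range(6)]
coarse = box_blur(depth, radius=1)
print(total_variation(depth), total_variation(coarse))
```

A larger `radius` plays roughly the role that a larger dissolving level would: coarser depth that is easier to align, at the cost of lost structure.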

Load-bearing premise

That noisy restarts plus dissolved depth maps will reliably align left and right latent spaces without introducing new artifacts that later steps cannot fix.

What would settle it

A generated stereo video that exhibits visible left-right mismatches or increased flickering when the dissolved depth maps are replaced by full-resolution depth maps or when the noisy restart is removed.
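
The settling experiment amounts to a two-factor ablation. A small sketch of how the four configurations could be enumerated follows; the configuration keys and mode names are hypothetical and do not come from the released code.

```python
from itertools import product

depth_modes = ["dissolved", "full_resolution"]   # hypothetical labels
restart_modes = [True, False]                    # noisy restart on / off

def ablation_grid():
    """Enumerate every combination of the two components under test."""
    return [
        {"depth": depth, "noisy_restart": restart}
        for depth, restart in product(depth_modes, restart_modes)
    ]

for config in ablation_grid():
    print(config)
```

Running all four cells, rather than only removing one component at a time, would also expose any interaction between the dissolved depth maps and the noisy restart.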

Figures

Figures reproduced from arXiv: 2411.14295 by Jian Shi, Peter Wonka, Qian Wang, Ramzi Idoughi, Wenqing Cui, Zhenyu Li.

Figure 1: With just a single image and an associated text prompt (left), our method generates compelling stereo video sequences. The …
Figure 2: An overview of the StereoCrafter-Zero pipeline. Top: Our method is based on two main components: (1) Noisy Restart for a robust initial latent estimation (Sec. 3.1) and (2) Iterative Refinement for the latent refinement (Sec. 3.2) during the sampling step. Bottom: The proposed pipeline takes a conditioning image and text prompt as input, generating both left and right views that produce a strong stereoscop…
Figure 3: Illustration of the noisy start strategy. At selected steps, …
Figure 5: Illustration of the iterative refinement strategy. This pro…
Figure 6: Dissolved depth maps obtained with DepthCrafter (50-step schedule). The top row shows the gradual removal of high-frequency details. The bottom row demonstrates ghosting effects reduction, but highlights a potential trade-off: increased distortion at larger t. …oDiffusion (stereo image generation). Note that ImmersePro is a depth-free method that may create an arbitrary degree of stereoscopic effect withou…
Figure 7: Comparison with different stereo conversion methods including …
Figure 8: Impact of the noisy restart on stereo effects. This anaglyph visualization vividly demonstrates the improvement. Increasing noisy …
Figure 9: Impact of the iterative refinement. Without it, warping …
Figure 11: Demonstration of generating stereo videos with interpolated inputs. We input the starting and ending frames as our input and …
Figure 12: Demonstration of generating looped stereo videos.
Figure 13: Demonstration of the failure cases. Our method can fail …
Figure 16: Demo of repetitive latent warping for LVDMs. Left: …
Figure 15: Warp LVDM latent space with direct translation with …
Figure 17: A cross comparison of depth maps from the disparity propagated image-based models and the video depth models. The visual …
Figure 18: A cross comparison of different dissolving levels. Top: a simpler case with fewer intricate structures. Bottom: a complex case …
Figure 19: Demonstration of the differences between using DepthCrafter and Disparity Propagation method for obtaining disparity values.
Figure 20: We visualize the tensor of the first timeframe in the first …
Figure 21: A screenshot captured on Meta Quest 3, showcasing a …

Generating high-quality stereo videos requires consistent depth perception and temporal coherence across frames. Despite advances in image and video synthesis using diffusion models, producing high-quality stereo videos remains a challenging task due to the difficulty of maintaining consistent temporal and spatial coherence between left and right views. We introduce DissolveStereo, a novel framework for zero-shot stereo video generation that leverages video diffusion priors without requiring paired training data. Our key innovations include a noisy restart strategy to initialize stereo-aware latent representations and an iterative refinement process that progressively harmonizes the latent space, addressing issues like temporal flickering and view inconsistencies. Importantly, we propose the use of dissolved depth maps to streamline latent space operations by reducing high-frequency depth information. Our comprehensive evaluations, including quantitative metrics and user studies, demonstrate that DissolveStereo produces high-quality stereo videos with enhanced depth consistency and temporal smoothness. In terms of epipolar consistency, our method achieves an 11.7% improvement in MEt3R score over the current state of the art. Furthermore, user studies indicate strong perceptual gains over prior art, with 8.0% higher perceived frame quality and 10.9% higher perceived temporal coherence. Our code is available at https://github.com/shijianjian/DissolveStereo.
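
The abstract reports its gains as percentages without stating the formula. A minimal sketch of the usual relative-improvement computation is given below, assuming a lower-is-better metric (as MEt3R is typically reported; that reading is an assumption here). The function name and the input values are hypothetical and are not taken from the paper.

```python
def relative_improvement(baseline, ours, lower_is_better=True):
    """Return the fractional gain of `ours` over `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    delta = (baseline - ours) if lower_is_better else (ours - baseline)
    return delta / abs(baseline)

# Hypothetical scores for illustration only, not the paper's numbers.
print(f"{relative_improvement(0.200, 0.180):.1%}")
```

For the user-study scores, where higher is better, the same helper applies with `lower_is_better=False`.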

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces DissolveStereo, a zero-shot stereo video generation framework that builds on video diffusion priors without paired training data. It proposes a noisy restart strategy to initialize stereo-aware latent representations, an iterative refinement process to harmonize left/right views, and dissolved depth maps to reduce high-frequency depth information during latent operations. The central empirical claims are an 11.7% MEt3R improvement over prior state-of-the-art and 8.0–10.9% gains in user-study scores for frame quality and temporal coherence.

Significance. If the reported gains are reproducible and the mechanism is shown to be robust, the work would constitute a practical engineering contribution to consistent stereo video synthesis. The dissolved-depth injection and noisy-restart approach are concrete, testable ideas that could be adopted or extended by others working on view-consistent diffusion models.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (framework): the claim that dissolved depth maps plus noisy restart reliably harmonize latents without introducing uncorrectable artifacts rests on the reported MEt3R and user-study deltas, yet no ablation isolating the dissolved-depth component, no failure-case analysis, and no error bars are supplied; this makes the 11.7% figure difficult to interpret as load-bearing evidence.
  2. [§4] §4 (experiments): the MEt3R improvement and user-study percentages are presented without baseline implementation details, exact hyper-parameter settings for the iterative refinement, or cross-validation across multiple seeds; these omissions directly affect the defensibility of the quantitative claims.
minor comments (2)
  1. The GitHub link is given but the manuscript does not state whether the released code includes the exact evaluation scripts used for MEt3R and the user study.
  2. [§3] Notation for “dissolved depth map” is introduced without an explicit equation or pseudocode showing how high-frequency components are removed before injection.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for your thoughtful review and for recognizing the practical contributions of DissolveStereo. We address each major comment below and commit to revisions that strengthen the empirical support and reproducibility of the work.

  1. Referee: [Abstract and §3] Abstract and §3 (framework): the claim that dissolved depth maps plus noisy restart reliably harmonize latents without introducing uncorrectable artifacts rests on the reported MEt3R and user-study deltas, yet no ablation isolating the dissolved-depth component, no failure-case analysis, and no error bars are supplied; this makes the 11.7% figure difficult to interpret as load-bearing evidence.

    Authors: We acknowledge that an ablation isolating the dissolved-depth component would strengthen the evidence for its specific role. In the revised manuscript we will add a targeted ablation comparing the full model to a variant without dissolved depth maps. We will also add a dedicated limitations subsection discussing observed failure cases and potential artifacts. Regarding error bars, the primary experiments used fixed seeds for direct comparability; we will run additional trials with varied seeds and report standard deviations where feasible. revision: yes

  2. Referee: [§4] §4 (experiments): the MEt3R improvement and user-study percentages are presented without baseline implementation details, exact hyper-parameter settings for the iterative refinement, or cross-validation across multiple seeds; these omissions directly affect the defensibility of the quantitative claims.

    Authors: We agree that greater implementation transparency is needed. The revised §4 will include full baseline implementation details (including any code adaptations), the precise hyper-parameter values used for iterative refinement, and results across multiple random seeds to supply error bars and assess variability of the reported gains. revision: yes
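
Both responses promise multi-seed results with error bars. A minimal sketch of that reporting step, using only the Python standard library, follows; the per-seed values are hypothetical placeholders and `summarize` is an illustrative helper, not part of the authors' code.

```python
from statistics import mean, stdev

def summarize(scores_by_seed):
    """Return (mean, sample standard deviation) over per-seed scores."""
    values = list(scores_by_seed.values())
    if len(values) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    return mean(values), stdev(values)

# Hypothetical per-seed metric values, for illustration only.
scores = {0: 0.31, 1: 0.29, 2: 0.30}
avg, spread = summarize(scores)
print(f"{avg:.3f} ± {spread:.3f}")
```

Reporting the spread next to the mean lets a reader judge whether a gain such as the 11.7% figure exceeds seed-to-seed variation.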

Circularity Check

0 steps flagged

No significant circularity

The paper describes an empirical engineering framework for zero-shot stereo video generation via diffusion models, relying on a noisy restart strategy and dissolved depth maps. All reported gains (MEt3R, user scores) are presented as experimental outcomes from evaluations rather than predictions derived from equations or first-principles results. No load-bearing self-citations, uniqueness theorems, ansatzes, or fitted-input predictions appear in the abstract or framework description; the central claims rest on external benchmarks and perceptual studies that remain independent of internal parameter definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on the unstated premise that video diffusion priors already encode sufficient stereo geometry when guided by coarse depth; no new physical entities are postulated.

axioms (1)
  • domain assumption: Video diffusion models trained on monocular video contain latent representations that can be steered toward consistent stereo pairs by depth injection.
    Invoked when the method claims zero-shot operation without paired stereo data.
invented entities (1)
  • dissolved depth map (no independent evidence)
    purpose: Reduce high-frequency depth information to simplify latent-space operations.
    Introduced as a key innovation to address flickering and view inconsistency.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space

    cs.CV 2025-12 unverdicted novelty 7.0

    A viewpoint-conditioned diffusion model generates stereo image pairs from monocular input in a canonical rectified space without using depth or explicit warping.
