DissolveStereo: Coarse Depth Injection for Zero-Shot Stereo Video Generation

Jian Shi; Peter Wonka; Qian Wang; Ramzi Idoughi; Wenqing Cui; Zhenyu Li

arXiv: 2411.14295 · v3 · submitted 2024-11-21 · cs.CV

Jian Shi, Qian Wang, Zhenyu Li, Wenqing Cui, Ramzi Idoughi, Peter Wonka

Pith reviewed 2026-05-23 17:14 UTC · model grok-4.3

classification: cs.CV

keywords: stereo video generation, zero-shot synthesis, diffusion models, depth injection, view consistency, temporal coherence, epipolar geometry

The pith

DissolveStereo generates consistent stereo videos from monocular diffusion models by injecting coarse depth maps without paired training data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DissolveStereo to create stereo videos using video diffusion models trained only on regular video, without any stereo pairs. It initializes left and right views with a noisy restart to set up compatible starting points in latent space, then runs iterative refinement while feeding in simplified depth information. The simplification step, dissolving the depth maps, drops high-frequency details so the two views can align more easily during generation. This produces videos with stronger matching between left and right perspectives and less flickering across frames.
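One way to picture the noisy restart described above is the standard DDPM forward-noising step, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε, applied to a latent estimate so that sampling can resume from a noisier state. The sketch below is a minimal illustration under that assumption and is not the authors' implementation: the names `linear_alpha_bar` and `noisy_restart`, the linear beta schedule, and the toy latent shape are all hypothetical.

```python
import numpy as np

def linear_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def noisy_restart(latent, t, alpha_bar, rng):
    """Re-noise a latent estimate to timestep t with the DDPM forward process."""
    eps = rng.standard_normal(latent.shape)
    return np.sqrt(alpha_bar[t]) * latent + np.sqrt(1.0 - alpha_bar[t]) * eps

rng = np.random.default_rng(0)
alpha_bar = linear_alpha_bar()
left_latent = rng.standard_normal((4, 8, 8))  # toy latent: channels x height x width
restarted = noisy_restart(left_latent, t=600, alpha_bar=alpha_bar, rng=rng)
```

A larger `t` discards more of the current estimate and hands more freedom back to the diffusion prior; which steps are restarted, and how often, is a design choice this sketch leaves open.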

Core claim

DissolveStereo shows that a noisy restart strategy combined with iterative refinement and dissolved depth maps allows zero-shot stereo video generation from video diffusion priors, producing higher epipolar consistency and temporal smoothness than prior approaches.

What carries the argument

Dissolved depth maps, which strip high-frequency depth detail so that latent representations are simpler and easier to align across views during refinement.
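The effect of removing high-frequency depth detail can be illustrated with an ordinary low-pass filter. The sketch below uses a separable Gaussian blur purely as a stand-in: the Figure 6 caption indicates that the paper obtains dissolved depth maps through a DepthCrafter diffusion schedule, which this code does not reproduce. The function names `gaussian_kernel` and `low_pass_depth`, the value of `sigma`, and the synthetic depth map are assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel(sigma):
    """1-D Gaussian kernel, truncated at about 3 sigma and normalised to sum to 1."""
    radius = max(1, int(round(3 * sigma)))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()

def low_pass_depth(depth, sigma):
    """Blur a 2-D depth map with a separable Gaussian (a stand-in for dissolving)."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    padded = np.pad(depth, radius, mode="edge")  # replicate border values
    # Filter along rows first, then along columns.
    rows = np.apply_along_axis(np.convolve, 1, padded, kernel, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="valid")

rng = np.random.default_rng(0)
ramp = np.tile(np.linspace(1.0, 5.0, 32), (32, 1))   # smooth synthetic depth ramp
depth = ramp + 0.2 * rng.standard_normal((32, 32))   # add fine-grained detail
coarse = low_pass_depth(depth, sigma=2.0)
```

The coarse map keeps the overall ramp while suppressing pixel-level variation, which is the property the dissolved depth maps are described as providing.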

Load-bearing premise

That noisy restarts plus dissolved depth maps will reliably align left and right latent spaces without introducing new artifacts that later steps cannot fix.

What would settle it

A generated stereo video that exhibits visible left-right mismatches or increased flickering when the dissolved depth maps are replaced by full-resolution depth maps or when the noisy restart is removed.

Figures

Figures reproduced from arXiv: 2411.14295 by Jian Shi, Peter Wonka, Qian Wang, Ramzi Idoughi, Wenqing Cui, Zhenyu Li.

**Figure 1.** With just a single image and an associated text prompt (left), our method generates compelling stereo video sequences. The …

**Figure 2.** An overview of the StereoCrafter-Zero pipeline. Top: Our method is based on two main components: (1) Noisy Restart for a robust initial latent estimation (Sec. 3.1) and (2) Iterative Refinement for the latent refinement (Sec. 3.2) during the sampling step. Bottom: The proposed pipeline takes a conditioning image and text prompt as input, generating both left and right views that produce a strong stereoscop…

**Figure 3.** Illustration of the noisy start strategy. At selected steps, …

**Figure 5.** Illustration of the iterative refinement strategy. This pro…

**Figure 6.** Dissolved depth maps obtained with DepthCrafter (50-step schedule). The top row shows the gradual removal of high-frequency details. The bottom row demonstrates ghosting effects reduction, but highlights a potential trade-off: increased distortion at larger t.

**Figure 7.** Comparison with different stereo conversion methods including … oDiffusion (stereo image generation). Note that ImmersePro is a depth-free method that may create an arbitrary degree of stereoscopic effect withou…

**Figure 8.** Impact of the noisy restart on stereo effects. This anaglyph visualization vividly demonstrates the improvement. Increasing noisy …

**Figure 9.** Impact of the iterative refinement. Without it, warping …

**Figure 11.** Demonstration of generating stereo videos with interpolated inputs. We input the starting and ending frames as our input and …

**Figure 12.** Demonstration of generating looped stereo videos.

**Figure 13.** Demonstration of the failure cases. Our method can fail …

**Figure 15.** Warp LVDM latent space with direct translation with …

**Figure 16.** Demo of repetitive latent warping for LVDMs. Left: …

**Figure 17.** A cross comparison of depth maps from the disparity propagated image-based models and the video depth models. The visual …

**Figure 18.** A cross comparison of different dissolving levels. Top: a simpler case with fewer intricate structures. Bottom: a complex case …

**Figure 19.** Demonstration of the differences between using DepthCrafter and Disparity Propagation method for obtaining disparity values.

**Figure 20.** We visualize the tensor of the first timeframe in the first …

**Figure 21.** A screenshot captured on Meta Quest 3, showcasing a …

Original abstract

Generating high-quality stereo videos requires consistent depth perception and temporal coherence across frames. Despite advances in image and video synthesis using diffusion models, producing high-quality stereo videos remains a challenging task due to the difficulty of maintaining consistent temporal and spatial coherence between left and right views. We introduce DissolveStereo, a novel framework for zero-shot stereo video generation that leverages video diffusion priors without requiring paired training data. Our key innovations include a noisy restart strategy to initialize stereo-aware latent representations and an iterative refinement process that progressively harmonizes the latent space, addressing issues like temporal flickering and view inconsistencies. Importantly, we propose the use of dissolved depth maps to streamline latent space operations by reducing high-frequency depth information. Our comprehensive evaluations, including quantitative metrics and user studies, demonstrate that DissolveStereo produces high-quality stereo videos with enhanced depth consistency and temporal smoothness. In terms of epipolar consistency, our method achieves an 11.7% improvement in MEt3R score over the current state-of-the-art. Furthermore, user studies indicate strong perceptual gains over the previous arts, with an 8.0% higher perceived frame quality and 10.9% higher perceived temporal coherence. Our code is in https://github.com/shijianjian/DissolveStereo.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DissolveStereo is a practical engineering recipe for zero-shot stereo video generation with diffusion models. It combines three specific techniques and reports metric gains, but it is thin on component-level breakdowns.

The key takeaway is that the paper delivers a usable zero-shot pipeline for stereo video generation. It combines a noisy restart for the initial latents, iterative refinement to reduce flickering and view mismatch, and dissolved depth maps that drop high-frequency depth signals to simplify latent operations. The abstract frames the work as an empirical contribution built on existing video diffusion priors, with no need for paired stereo training data.

Referee Report

2 major / 2 minor

Summary. The paper introduces DissolveStereo, a zero-shot stereo video generation framework that builds on video diffusion priors without paired training data. It proposes a noisy restart strategy to initialize stereo-aware latent representations, an iterative refinement process to harmonize left/right views, and dissolved depth maps to reduce high-frequency depth information during latent operations. The central empirical claims are an 11.7% MEt3R improvement over prior state-of-the-art and 8.0–10.9% gains in user-study scores for frame quality and temporal coherence.
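Percentage gains such as those quoted in the summary are normally relative changes against a baseline score. The helper below is a small sketch of that arithmetic; the function name `relative_improvement` and the numbers are hypothetical and are not taken from the paper, and whether a given metric is lower-is-better is left as a parameter rather than assumed.

```python
def relative_improvement(baseline, ours, lower_is_better=True):
    """Relative gain over a baseline score, in percent."""
    if baseline == 0:
        raise ValueError("baseline score must be non-zero")
    delta = (baseline - ours) if lower_is_better else (ours - baseline)
    return 100.0 * delta / abs(baseline)

# Made-up scores for illustration only (not results from the paper).
gain = relative_improvement(baseline=0.20, ours=0.15)
print(f"{gain:.1f}% better than the baseline")
```

Reporting the underlying baseline and method scores next to the percentage makes a figure of this kind much easier to verify.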

Significance. If the reported gains are reproducible and the mechanism is shown to be robust, the work would constitute a practical engineering contribution to consistent stereo video synthesis. The dissolved-depth injection and noisy-restart approach are concrete, testable ideas that could be adopted or extended by others working on view-consistent diffusion models.

major comments (2)

[Abstract and §3 (framework)] The claim that dissolved depth maps plus noisy restart reliably harmonize latents without introducing uncorrectable artifacts rests on the reported MEt3R and user-study deltas, yet no ablation isolating the dissolved-depth component, no failure-case analysis, and no error bars are supplied; this makes the 11.7% figure difficult to interpret as load-bearing evidence.
[§4 (experiments)] The MEt3R improvement and user-study percentages are presented without baseline implementation details, exact hyper-parameter settings for the iterative refinement, or cross-validation across multiple seeds; these omissions directly affect the defensibility of the quantitative claims.

minor comments (2)

The GitHub link is given but the manuscript does not state whether the released code includes the exact evaluation scripts used for MEt3R and the user study.
[§3] Notation for “dissolved depth map” is introduced without an explicit equation or pseudocode showing how high-frequency components are removed before injection.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for your thoughtful review and for recognizing the practical contributions of DissolveStereo. We address each major comment below and commit to revisions that strengthen the empirical support and reproducibility of the work.

Referee: [Abstract and §3 (framework)] The claim that dissolved depth maps plus noisy restart reliably harmonize latents without introducing uncorrectable artifacts rests on the reported MEt3R and user-study deltas, yet no ablation isolating the dissolved-depth component, no failure-case analysis, and no error bars are supplied; this makes the 11.7% figure difficult to interpret as load-bearing evidence.

Authors: We acknowledge that an ablation isolating the dissolved-depth component would strengthen the evidence for its specific role. In the revised manuscript we will add a targeted ablation comparing the full model to a variant without dissolved depth maps. We will also add a dedicated limitations subsection discussing observed failure cases and potential artifacts. Regarding error bars, the primary experiments used fixed seeds for direct comparability; we will run additional trials with varied seeds and report standard deviations where feasible. Revision: yes.

Referee: [§4 (experiments)] The MEt3R improvement and user-study percentages are presented without baseline implementation details, exact hyper-parameter settings for the iterative refinement, or cross-validation across multiple seeds; these omissions directly affect the defensibility of the quantitative claims.

Authors: We agree that greater implementation transparency is needed. The revised §4 will include full baseline implementation details (including any code adaptations), the precise hyper-parameter values used for iterative refinement, and results across multiple random seeds to supply error bars and assess variability of the reported gains. Revision: yes.
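The error bars discussed in this exchange amount to reporting a mean and a standard deviation over repeated runs with different seeds. The sketch below shows one minimal way to compute them; the function name `summarize_runs` and the listed scores are hypothetical placeholders, not measurements from the paper.

```python
import statistics

def summarize_runs(scores):
    """Return (mean, sample standard deviation) of a metric over seeded runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return statistics.mean(scores), statistics.stdev(scores)

# Placeholder scores from five imaginary seeds (illustration only).
hypothetical_scores = [0.151, 0.148, 0.155, 0.150, 0.146]
mean, std = summarize_runs(hypothetical_scores)
print(f"{mean:.3f} ± {std:.3f} over {len(hypothetical_scores)} seeds")
```

The sample standard deviation (dividing by n − 1) is used here because a handful of seeds is a small sample of all possible runs.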

Circularity Check

0 steps flagged

No significant circularity

The paper describes an empirical engineering framework for zero-shot stereo video generation via diffusion models, relying on a noisy restart strategy and dissolved depth maps. All reported gains (MEt3R, user scores) are presented as experimental outcomes from evaluations rather than predictions derived from equations or first-principles results. No load-bearing self-citations, uniqueness theorems, ansatzes, or fitted-input predictions appear in the abstract or framework description; the central claims rest on external benchmarks and perceptual studies that remain independent of internal parameter definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on the unstated premise that video diffusion priors already encode sufficient stereo geometry when guided by coarse depth; no new physical entities are postulated.

axioms (1)

Domain assumption: Video diffusion models trained on monocular video contain latent representations that can be steered toward consistent stereo pairs by depth injection.
Invoked when the method claims zero-shot operation without paired stereo data.
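For context on how depth can steer one view toward the other, the textbook rectified-stereo relation converts depth Z to horizontal disparity d = f·B / Z, and a left-view pixel at column x then lands at column x − d in the right view. The sketch below applies that relation to a plain image array. It is an illustration under stated assumptions, not the paper's latent-space procedure: the function names, the focal length `focal_px`, and the `baseline` value are hypothetical.

```python
import numpy as np

def depth_to_disparity(depth, focal_px, baseline):
    """Pinhole stereo relation d = f * B / Z (disparity in pixels)."""
    return focal_px * baseline / depth

def forward_warp_left_to_right(image, disparity):
    """Shift each left-view pixel by its disparity; return the warp and a hole mask."""
    h, w = disparity.shape
    out = np.zeros_like(image)
    filled = np.zeros((h, w), dtype=bool)
    for y in range(h):
        # Visit far pixels (small disparity) first so that near pixels overwrite them.
        for x in np.argsort(disparity[y], kind="stable"):
            xr = int(np.rint(x - disparity[y, x]))
            if 0 <= xr < w:
                out[y, xr] = image[y, x]
                filled[y, xr] = True
    return out, ~filled

depth = np.full((1, 5), 2.0)                       # toy depth map, constant 2 units
disparity = depth_to_disparity(depth, focal_px=20.0, baseline=0.1)
left = np.array([[10, 20, 30, 40, 50]])
right, holes = forward_warp_left_to_right(left, disparity)
```

The hole mask marks disoccluded pixels that no source pixel maps to; in a generative pipeline these are the regions that have to be synthesized rather than copied.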

invented entities (1)

dissolved depth map (no independent evidence)
Purpose: reduce high-frequency depth information to simplify latent-space operations.
Introduced as a key innovation to address flickering and view inconsistency.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space
cs.CV 2025-12 unverdicted novelty 7.0

A viewpoint-conditioned diffusion model generates stereo image pairs from monocular input in a canonical rectified space without using depth or explicit warping.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · cited by 1 Pith paper · 5 internal anchors

[1] Mohammad Asim, Christopher Wewer, Thomas Wimmer, Bernt Schiele, and Jan Eric Lenssen. Met3r: Measuring multi-view consistency in generated images. arXiv preprint arXiv:2501.06336, 2025.
[2] Sherwin Bahmani, Ivan Skorokhodov, Victor Rong, Gordon Wetzstein, Leonidas Guibas, Peter Wonka, Sergey Tulyakov, Jeong Joon Park, Andrea Tagliasacchi, and David B. Lindell. 4d-fy: Text-to-4d generation using hybrid score distillation sampling. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[3] Jonathan T Barron and Jovan Popović. Structure-from-motion with oriented points. In IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015.
[4] Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R Richter, and Vladlen Koltun. Depth pro: Sharp monocular metric depth in less than a second. arXiv preprint arXiv:2410.02073, 2024.
[5] Shengqu Cai, Eric Ryan Chan, Songyou Peng, Mohamad Shahbazi, Anton Obukhov, Luc Van Gool, and Gordon Wetzstein. Diffdreamer: Towards consistent unsupervised single-view scene extrapolation with conditional diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2139–2150, 2023.
[6] Shengqu Cai, Duygu Ceylan, Matheus Gadelha, Chun-Hao Paul Huang, Tuanfeng Yang Wang, and Gordon Wetzstein. Generative rendering: Controllable 4d-guided video generation with 2d diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7611–7620, 2024.
[7] Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, and Yinqiang Zheng. Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22560–22570, 2023.
[8] Songyan Chen and Jiancheng Huang. Fec: Three finetuning-free methods to enhance consistency for real image editing. In 2023 International Conference on Image Processing, Computer Vision and Machine Learning (ICICML), pages 76–87. IEEE, 2023.
[9] Sili Chen, Hengkai Guo, Shengnan Zhu, Feihu Zhang, Zilong Huang, Jiashi Feng, and Bingyi Kang. Video depth anything: Consistent depth estimation for super-long videos. arXiv preprint arXiv:2501.12375, 2025.
[10] Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, and Hengshuang Zhao. Anydoor: Zero-shot object-level image customization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6593–6602, 2024.
[11] Jaeyoung Chung, Suyoung Lee, Hyeongjin Nam, Jaerin Lee, and Kyoung Mu Lee. Luciddreamer: Domain-free generation of 3d gaussian splatting scenes. arXiv preprint arXiv:2311.13384, 2023.
[12] Peng Dai, Feitong Tan, Qiangeng Xu, David Futschik, Ruofei Du, Sean Fanello, Xiaojuan Qi, and Yinda Zhang. Svg: 3d stereoscopic video generation via denoising frame matrix. arXiv preprint arXiv:2407.00367, 2024.
[13] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, 2020.
[14] Wenbo Hu, Xiangjun Gao, Xiaoyu Li, Sijie Zhao, Xiaodong Cun, Yong Zhang, Long Quan, and Ying Shan. Depthcrafter: Generating consistent long depth sequences for open-world videos. arXiv preprint arXiv:2409.02095, 2024.
[15] Jiancheng Huang, Yifan Liu, Jin Qin, and Shifeng Chen. Kv inversion: Kv embeddings learning for text-conditioned real image action editing. In Chinese Conference on Pattern Recognition and Computer Vision (PRCV), pages 172–184. Springer, 2023.
[16] Anant Khandelwal. Infusion: Inject and attention fusion for multi concept zero-shot text-based video editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3017–3026, 2023.
[17] Johannes Kopf, Xuejian Rong, and Jia-Bin Huang. Robust consistent video depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1611–1621, 2021.
[18] Zhengfei Kuang, Shengqu Cai, Hao He, Yinghao Xu, Hongsheng Li, Leonidas Guibas, and Gordon Wetzstein. Collaborative video diffusion: Consistent multi-video generation with camera control. In arXiv, 2024.
[19] Zhengqi Li, Qianqian Wang, Forrester Cole, Richard Tucker, and Noah Snavely. Dynibar: Neural dynamic image-based rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4273–4284, 2023.
[20] Yu-Lun Liu, Chen Gao, Andreas Meuleman, Hung-Yu Tseng, Ayush Saraf, Changil Kim, Yung-Yu Chuang, Johannes Kopf, and Jia-Bin Huang. Robust dynamic radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
[21] Xuan Luo, Jia-Bin Huang, Richard Szeliski, Kevin Matzen, and Johannes Kopf. Consistent video depth estimation. ACM Transactions on Graphics (ToG), 39(4):71–1, 2020.
[22] Lukas Mehl, Andrés Bruhn, Markus Gross, and Christopher Schroers. Stereo conversion with disparity-aware warping, compositing and inpainting. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 4260–4269, 2024.
[23] Chong Mou, Xintao Wang, Jiechong Song, Ying Shan, and Jian Zhang. Dragondiffusion: Enabling drag-style manipulation on diffusion models. arXiv preprint arXiv:2307.02421,
[24] Jisu Nam, Heesu Kim, DongJae Lee, Siyoon Jin, Seungryong Kim, and Seunggyu Chang. Dreammatcher: Appearance matching self-attention for semantically-consistent text-to-image personalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8100–8110, 2024.
[25] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
[26] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
[27] Rene Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3):1623–1637, 2022.
[28] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In arXiv preprint arXiv:2112.10752, 2022.
[29] Jian Shi, Zhenyu Li, and Peter Wonka. Immersepro: End-to-end stereo video synthesis via implicit disparity learning. arXiv preprint arXiv:2410.00262, 2024.
[30] Jian Shi, Pengyi Zhang, Ni Zhang, Hakim Ghazzai, and Peter Wonka. Dissolving is amplifying: Towards fine-grained anomaly detection. In European Conference on Computer Vision, pages 377–394. Springer, 2024.
[31] Meng-Li Shih, Shih-Yang Su, Johannes Kopf, and Jia-Bin Huang. 3d photography using context-aware layered depth inpainting. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[32] Jaidev Shriram, Alex Trevithick, Lingjie Liu, and Ravi Ramamoorthi. Realmdreamer: Text-driven 3d scene generation with inpainting and depth diffusion. arXiv preprint arXiv:2404.07199, 2024.
[33] Jingxiang Sun, Bo Zhang, Ruizhi Shao, Lizhen Wang, Wen Liu, Zhenda Xie, and Yebin Liu. Dreamcraft3d: Hierarchical 3d generation with bootstrapped diffusion prior. arXiv preprint arXiv:2310.16818, 2023.
[34] Junshu Tang, Tengfei Wang, Bo Zhang, Ting Zhang, Ran Yi, Lizhuang Ma, and Dong Chen. Make-it-3d: High-fidelity 3d creation from a single image with diffusion prior, 2023.
[35] Chaoyang Wang, Simon Lucey, Federico Perazzi, and Oliver Wang. Web stereo video supervision for depth prediction from dynamic scenes. In 2019 International Conference on 3D Vision (3DV), pages 348–357. IEEE, 2019.
[36] Lezhong Wang, Jeppe Revall Frisvad, Mark Bo Jensen, and Siavash Arjomand Bigdeli. Stereodiffusion: Training-free stereo image generation using latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7416–7425, 2024.
[37] Jamie Watson, Oisin Mac Aodha, Daniyar Turmukhambetov, Gabriel J. Brostow, and Michael Firman. Learning stereo from single images. In European Conference on Computer Vision (ECCV), 2020.
[38] Jamie Watson, Oisin Mac Aodha, Daniyar Turmukhambetov, Gabriel J Brostow, and Michael Firman. Learning stereo from single images. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16, pages 722–740. Springer, 2020.
[39] Junyuan Xie, Ross Girshick, and Ali Farhadi. Deep3d: Fully automatic 2d-to-3d video conversion with deep convolutional neural networks. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, pages 842–
[40] (continuation of [39]) Springer, 2016.
[41] Jinbo Xing, Menghan Xia, Yong Zhang, Haoxin Chen, Wangbo Yu, Hanyuan Liu, Xintao Wang, Tien-Tsin Wong, and Ying Shan. Dynamicrafter: Animating open-domain images with video diffusion priors. arXiv preprint arXiv:2310.12190, 2023.
[42] Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. arXiv preprint arXiv:2401.10891, 2024.
[43] Heng Yu, Chaoyang Wang, Peiye Zhuang, Willi Menapace, Aliaksandr Siarohin, Junli Cao, Laszlo A Jeni, Sergey Tulyakov, and Hsin-Ying Lee. 4real: Towards photorealistic 4d scene generation via video diffusion models, 2024.
[44] Wangbo Yu, Li Yuan, Yan-Pei Cao, Xiangjun Gao, Xiaoyu Li, Wenbo Hu, Long Quan, Ying Shan, and Yonghong Tian. Hifi-123: Towards high-fidelity one image to 3d content generation. arXiv preprint arXiv:2310.06744, 2023.
[45] Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, and Yonghong Tian. Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis. arXiv preprint arXiv:2409.02048, 2024.
[46] Haiyu Zhang, Xinyuan Chen, Yaohui Wang, Xihui Liu, Yunhong Wang, and Yu Qiao. 4diffusion: Multi-view video diffusion model for 4d generation. arXiv preprint arXiv:2405.20674, 2024.
[47] Jingbo Zhang, Xiaoyu Li, Ziyu Wan, Can Wang, and Jing Liao. Text2nerf: Text-driven 3d scene generation with neural radiance fields. IEEE Transactions on Visualization and Computer Graphics, 2024.
[48] Zheyu Zhang and Ronggang Wang. Temporal3d: 2d-to-3d video conversion network with multi-frame fusion. In 2022 4th International Conference on Advances in Computer Technology, Information Science and Communications (CTISC), pages 1–5. IEEE, 2022.
[49] Sijie Zhao, Wenbo Hu, Xiaodong Cun, Yong Zhang, Xiaoyu Li, Zhe Kong, Xiangjun Gao, Muyao Niu, and Ying Shan. Stereocrafter: Diffusion-based generation of long and high-fidelity stereoscopic 3d from monocular videos. arXiv preprint arXiv:2409.07447, 2024.
[50] SUN Zhengwentai. clip-score: CLIP Score for PyTorch. https://github.com/taited/clip-score, 2023. Version 0.1.1.

StereoCrafter-Zero: Zero-Shot Stereo Video Generation with Noisy Restart. Supplementary Material. A. Preliminaries. Image Warping for Stereo View Synthesis. Generating a stereo pair of a single input image includes warping the im-...
[51] Frame Quality (5 is best, 1 is poor)
[52] Temporal Coherence (5 is best, 1 is poor)
[53] Stereoscopic Effects (5 is best, 1 is poor)
[54] Overall Conformity (5 is best, 1 is poor). Specifically, we define frame quality as the quality of the generated 2D images, temporal coherence as the consistency between frames, and stereoscopic effects as a subjective experience of the stereo videos. For the overall conformity, while some stereo videos may exhibit stronger stereo effects, they can also induce discomfort or motion sickness in viewers

[1] [1]

Met3r: Measuring multi-view consistency in generated images

Mohammad Asim, Christopher Wewer, Thomas Wimmer, Bernt Schiele, and Jan Eric Lenssen. Met3r: Measuring multi-view consistency in generated images. arXiv preprint arXiv:2501.06336, 2025. 6

work page arXiv 2025

[2] Sherwin Bahmani, Ivan Skorokhodov, Victor Rong, Gordon Wetzstein, Leonidas Guibas, Peter Wonka, Sergey Tulyakov, Jeong Joon Park, Andrea Tagliasacchi, and David B. Lindell. 4d-fy: Text-to-4d generation using hybrid score distillation sampling. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

[3] Jonathan T Barron and Jovan Popović. Structure-from-motion with oriented points. In IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015.

[4] Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R Richter, and Vladlen Koltun. Depth pro: Sharp monocular metric depth in less than a second. arXiv preprint arXiv:2410.02073, 2024.

[5] Shengqu Cai, Eric Ryan Chan, Songyou Peng, Mohamad Shahbazi, Anton Obukhov, Luc Van Gool, and Gordon Wetzstein. Diffdreamer: Towards consistent unsupervised single-view scene extrapolation with conditional diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2139–2150, 2023.

[6] Shengqu Cai, Duygu Ceylan, Matheus Gadelha, Chun-Hao Paul Huang, Tuanfeng Yang Wang, and Gordon Wetzstein. Generative rendering: Controllable 4d-guided video generation with 2d diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7611–7620, 2024.

[7] Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, and Yinqiang Zheng. Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22560–22570, 2023.

[8] Songyan Chen and Jiancheng Huang. Fec: Three finetuning-free methods to enhance consistency for real image editing. In 2023 International Conference on Image Processing, Computer Vision and Machine Learning (ICICML), pages 76–87. IEEE, 2023.

[9] Sili Chen, Hengkai Guo, Shengnan Zhu, Feihu Zhang, Zilong Huang, Jiashi Feng, and Bingyi Kang. Video depth anything: Consistent depth estimation for super-long videos. arXiv preprint arXiv:2501.12375, 2025.

[10] Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, and Hengshuang Zhao. Anydoor: Zero-shot object-level image customization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6593–6602, 2024.

[11] Jaeyoung Chung, Suyoung Lee, Hyeongjin Nam, Jaerin Lee, and Kyoung Mu Lee. Luciddreamer: Domain-free generation of 3d gaussian splatting scenes. arXiv preprint arXiv:2311.13384, 2023.

[12] Peng Dai, Feitong Tan, Qiangeng Xu, David Futschik, Ruofei Du, Sean Fanello, Xiaojuan Qi, and Yinda Zhang. Svg: 3d stereoscopic video generation via denoising frame matrix. arXiv preprint arXiv:2407.00367, 2024.

[13] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, 2020.

[14] Wenbo Hu, Xiangjun Gao, Xiaoyu Li, Sijie Zhao, Xiaodong Cun, Yong Zhang, Long Quan, and Ying Shan. Depthcrafter: Generating consistent long depth sequences for open-world videos. arXiv preprint arXiv:2409.02095, 2024.

[15] Jiancheng Huang, Yifan Liu, Jin Qin, and Shifeng Chen. Kv inversion: Kv embeddings learning for text-conditioned real image action editing. In Chinese Conference on Pattern Recognition and Computer Vision (PRCV), pages 172–184. Springer, 2023.

[16] Anant Khandelwal. Infusion: Inject and attention fusion for multi concept zero-shot text-based video editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3017–3026, 2023.

[17] Johannes Kopf, Xuejian Rong, and Jia-Bin Huang. Robust consistent video depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1611–1621, 2021.

[18] Zhengfei Kuang, Shengqu Cai, Hao He, Yinghao Xu, Hongsheng Li, Leonidas Guibas, and Gordon Wetzstein. Collaborative video diffusion: Consistent multi-video generation with camera control. In arXiv, 2024.

[19] Zhengqi Li, Qianqian Wang, Forrester Cole, Richard Tucker, and Noah Snavely. Dynibar: Neural dynamic image-based rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4273–4284, 2023.

[20] Yu-Lun Liu, Chen Gao, Andreas Meuleman, Hung-Yu Tseng, Ayush Saraf, Changil Kim, Yung-Yu Chuang, Johannes Kopf, and Jia-Bin Huang. Robust dynamic radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.

[21] Xuan Luo, Jia-Bin Huang, Richard Szeliski, Kevin Matzen, and Johannes Kopf. Consistent video depth estimation. ACM Transactions on Graphics (ToG), 39(4):71–1, 2020.

[22] Lukas Mehl, Andrés Bruhn, Markus Gross, and Christopher Schroers. Stereo conversion with disparity-aware warping, compositing and inpainting. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 4260–4269, 2024.

[23] Chong Mou, Xintao Wang, Jiechong Song, Ying Shan, and Jian Zhang. Dragondiffusion: Enabling drag-style manipulation on diffusion models. arXiv preprint arXiv:2307.02421,

[24] Jisu Nam, Heesu Kim, DongJae Lee, Siyoon Jin, Seungryong Kim, and Seunggyu Chang. Dreammatcher: Appearance matching self-attention for semantically-consistent text-to-image personalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8100–8110, 2024.

[25] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

[26] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.

[27] Rene Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3):1623–1637, 2022.

[28] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In arXiv preprint arXiv:2112.10752, 2022.

[29] Jian Shi, Zhenyu Li, and Peter Wonka. Immersepro: End-to-end stereo video synthesis via implicit disparity learning. arXiv preprint arXiv:2410.00262, 2024.

[30] Jian Shi, Pengyi Zhang, Ni Zhang, Hakim Ghazzai, and Peter Wonka. Dissolving is amplifying: Towards fine-grained anomaly detection. In European Conference on Computer Vision, pages 377–394. Springer, 2024.

[31] Meng-Li Shih, Shih-Yang Su, Johannes Kopf, and Jia-Bin Huang. 3d photography using context-aware layered depth inpainting. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

[32] Jaidev Shriram, Alex Trevithick, Lingjie Liu, and Ravi Ramamoorthi. Realmdreamer: Text-driven 3d scene generation with inpainting and depth diffusion. arXiv preprint arXiv:2404.07199, 2024.

[33] Jingxiang Sun, Bo Zhang, Ruizhi Shao, Lizhen Wang, Wen Liu, Zhenda Xie, and Yebin Liu. Dreamcraft3d: Hierarchical 3d generation with bootstrapped diffusion prior. arXiv preprint arXiv:2310.16818, 2023.

[34] Junshu Tang, Tengfei Wang, Bo Zhang, Ting Zhang, Ran Yi, Lizhuang Ma, and Dong Chen. Make-it-3d: High-fidelity 3d creation from a single image with diffusion prior, 2023.

[35] Chaoyang Wang, Simon Lucey, Federico Perazzi, and Oliver Wang. Web stereo video supervision for depth prediction from dynamic scenes. In 2019 International Conference on 3D Vision (3DV), pages 348–357. IEEE, 2019.

[36] Lezhong Wang, Jeppe Revall Frisvad, Mark Bo Jensen, and Siavash Arjomand Bigdeli. Stereodiffusion: Training-free stereo image generation using latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7416–7425, 2024.

[37] Jamie Watson, Oisin Mac Aodha, Daniyar Turmukhambetov, Gabriel J. Brostow, and Michael Firman. Learning stereo from single images. In European Conference on Computer Vision (ECCV), 2020.

[38] Jamie Watson, Oisin Mac Aodha, Daniyar Turmukhambetov, Gabriel J Brostow, and Michael Firman. Learning stereo from single images. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16, pages 722–740. Springer, 2020.

[39] Junyuan Xie, Ross Girshick, and Ali Farhadi. Deep3d: Fully automatic 2d-to-3d video conversion with deep convolutional neural networks. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, pages 842– Springer, 2016.

[41] Jinbo Xing, Menghan Xia, Yong Zhang, Haoxin Chen, Wangbo Yu, Hanyuan Liu, Xintao Wang, Tien-Tsin Wong, and Ying Shan. Dynamicrafter: Animating open-domain images with video diffusion priors. arXiv preprint arXiv:2310.12190, 2023.

[42] Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. arXiv preprint arXiv:2401.10891, 2024.

[43] Heng Yu, Chaoyang Wang, Peiye Zhuang, Willi Menapace, Aliaksandr Siarohin, Junli Cao, Laszlo A Jeni, Sergey Tulyakov, and Hsin-Ying Lee. 4real: Towards photorealistic 4d scene generation via video diffusion models, 2024.

[44] Wangbo Yu, Li Yuan, Yan-Pei Cao, Xiangjun Gao, Xiaoyu Li, Wenbo Hu, Long Quan, Ying Shan, and Yonghong Tian. Hifi-123: Towards high-fidelity one image to 3d content generation. arXiv preprint arXiv:2310.06744, 2023.

[45] Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, and Yonghong Tian. Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis. arXiv preprint arXiv:2409.02048, 2024.

[46] Haiyu Zhang, Xinyuan Chen, Yaohui Wang, Xihui Liu, Yunhong Wang, and Yu Qiao. 4diffusion: Multi-view video diffusion model for 4d generation. arXiv preprint arXiv:2405.20674, 2024.

[47] Jingbo Zhang, Xiaoyu Li, Ziyu Wan, Can Wang, and Jing Liao. Text2nerf: Text-driven 3d scene generation with neural radiance fields. IEEE Transactions on Visualization and Computer Graphics, 2024.

[48] Zheyu Zhang and Ronggang Wang. Temporal3d: 2d-to-3d video conversion network with multi-frame fusion. In 2022 4th International Conference on Advances in Computer Technology, Information Science and Communications (CTISC), pages 1–5. IEEE, 2022.

[49] Sijie Zhao, Wenbo Hu, Xiaodong Cun, Yong Zhang, Xiaoyu Li, Zhe Kong, Xiangjun Gao, Muyao Niu, and Ying Shan. Stereocrafter: Diffusion-based generation of long and high-fidelity stereoscopic 3d from monocular videos. arXiv preprint arXiv:2409.07447, 2024.

[50] SUN Zhengwentai. clip-score: CLIP Score for PyTorch. https://github.com/taited/clip-score, 2023. Version 0.1.1.

StereoCrafter-Zero: Zero-Shot Stereo Video Generation with Noisy Restart
Supplementary Material

A. Preliminaries

Image Warping for Stereo View Synthesis. Generating a stereo pair of a single input image includes warping the im-...

Frame Quality (5 is best, 1 is poor)
Temporal Coherence (5 is best, 1 is poor)
Stereoscopic Effects (5 is best, 1 is poor)
Overall Conformity (5 is best, 1 is poor)

Specifically, we define frame quality as the quality of the generated 2D images, temporal coherence as the consistency between frames, and stereoscopic effects as a subjective experience of the stereo videos. For the overall conformity, while some stereo videos may exhibit stronger stereo effects, they can also induce discomfort or motion sickness in viewers...