pith. machine review for the scientific record. sign in

arxiv: 2309.03453 · v2 · submitted 2023-09-07 · 💻 cs.CV · cs.AI· cs.GR

Recognition: 2 theorem links

· Lean Theorem

SyncDreamer: Generating Multiview-consistent Images from a Single-view Image

Authors on Pith no claims yet

Pith reviewed 2026-05-16 10:31 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.GR
keywords multiview diffusionconsistent image generationnovel view synthesis3D-aware attentionimage-to-3Dsynchronized reverse process
0
0 comments X

The pith

SyncDreamer generates multiple consistent views of an object from one input image by synchronizing their diffusion process.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

SyncDreamer is a diffusion model that produces several images of the same object from different viewpoints, all derived from a single starting image. It works by modeling the joint probability of these views and keeping their generation states aligned at every reverse diffusion step. Alignment happens through a 3D-aware attention step that links matching features between the views. Earlier single-view generators like Zero123 often produced mismatched geometry and colors across outputs, which broke downstream 3D work. When the synchronization holds, the resulting set of images can be fed directly into novel-view synthesis, text-to-3D, or image-to-3D pipelines without extra consistency fixes.

Core claim

The paper claims that a synchronized multiview diffusion model, which synchronizes the intermediate states of all generated images at every step of the reverse process through a 3D-aware feature attention mechanism, generates multiview-consistent images from a single-view image.

What carries the argument

The 3D-aware feature attention mechanism that correlates corresponding features across different views while the joint reverse diffusion process runs.

If this is right

  • All generated views share consistent geometry and appearance because their diffusion trajectories remain coupled.
  • A single reverse process produces an entire set of usable views instead of requiring separate runs that later need alignment.
  • The outputs can be used directly as input for 3D reconstruction or generation pipelines that assume multi-view consistency.
  • Training on the joint distribution removes the need for post-hoc consistency losses or multi-stage refinement.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same synchronization idea could be applied to video frames to enforce temporal coherence without explicit motion models.
  • Pairing the method with text conditioning might allow generation of consistent multi-view scenes from descriptive prompts alone.
  • If the attention mechanism generalizes, the approach could reduce the number of real camera views needed for high-quality 3D capture.

Load-bearing premise

The 3D-aware attention step correctly links matching features across views without adding new geometric or color mismatches during joint generation.

What would settle it

Generate views of an object whose true 3D geometry is known, project that geometry into each target viewpoint, and measure whether pixel-level or depth-level discrepancy between generated images and projections is smaller than in independent baselines.

read the original abstract

In this paper, we present a novel diffusion model called that generates multiview-consistent images from a single-view image. Using pretrained large-scale 2D diffusion models, recent work Zero123 demonstrates the ability to generate plausible novel views from a single-view image of an object. However, maintaining consistency in geometry and colors for the generated images remains a challenge. To address this issue, we propose a synchronized multiview diffusion model that models the joint probability distribution of multiview images, enabling the generation of multiview-consistent images in a single reverse process. SyncDreamer synchronizes the intermediate states of all the generated images at every step of the reverse process through a 3D-aware feature attention mechanism that correlates the corresponding features across different views. Experiments show that SyncDreamer generates images with high consistency across different views, thus making it well-suited for various 3D generation tasks such as novel-view-synthesis, text-to-3D, and image-to-3D.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces SyncDreamer, a synchronized multiview diffusion model that generates consistent images across multiple views from a single input image. Building on pretrained 2D diffusion models such as Zero123, it models the joint distribution of multiview images by synchronizing intermediate diffusion states at each reverse step via a 3D-aware feature attention mechanism that correlates corresponding features across views, with the goal of improving geometric and color consistency for downstream tasks including novel-view synthesis, text-to-3D, and image-to-3D.

Significance. If the synchronization mechanism demonstrably produces samples from the true joint multiview distribution, the work would offer a practical advance over single-view-conditioned methods by reducing view inconsistencies without requiring explicit 3D supervision or multi-stage pipelines. This could streamline image-to-3D pipelines and improve reliability in applications that rely on consistent multiview inputs.

major comments (3)
  1. [§3.2] §3.2 (3D-aware Feature Attention): the mechanism is described as correlating 'corresponding features across different views' but the text provides no explicit construction for establishing 3D correspondences (camera intrinsics/extrinsics, epipolar geometry, or learned depth). Without this, the attention reduces to 2D feature mixing and cannot be guaranteed to resolve depth ambiguities or occlusions during the joint reverse process.
  2. [§4.1] §4.1 and Table 2: the quantitative consistency metrics (e.g., cross-view PSNR or LPIPS) are reported only against Zero123; no ablation isolating the synchronization module versus simple multi-view concatenation is shown, so it is impossible to attribute the claimed consistency gains specifically to the 3D-aware attention.
  3. [§4.3] §4.3 (qualitative results): several generated examples exhibit residual color shifts and silhouette mismatches between views; these contradict the central claim that the joint reverse process enforces high consistency and require either quantitative error analysis or a revised statement of the method's limitations.
minor comments (2)
  1. [§3.1] Notation for the synchronized noise schedule is introduced without a clear equation reference; add an explicit definition (e.g., Eq. (7)) linking the per-view noise levels to the shared synchronization step.
  2. [§4] The abstract states 'Experiments show...' but the experimental section lacks a dedicated limitations paragraph; add one summarizing failure cases (e.g., thin structures or reflective surfaces).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below and will revise the manuscript to improve clarity, add requested ablations, and provide a more balanced discussion of limitations.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (3D-aware Feature Attention): the mechanism is described as correlating 'corresponding features across different views' but the text provides no explicit construction for establishing 3D correspondences (camera intrinsics/extrinsics, epipolar geometry, or learned depth). Without this, the attention reduces to 2D feature mixing and cannot be guaranteed to resolve depth ambiguities or occlusions during the joint reverse process.

    Authors: We agree that §3.2 would benefit from greater detail on how 3D correspondences are constructed. The 3D-aware attention operates on features lifted into a shared 3D coordinate frame using the known camera intrinsics and extrinsics provided for each target view; correspondences are then sampled along epipolar lines and aggregated via cross-view attention. We will expand the section with an explicit algorithmic description, including the projection and sampling steps, to make this construction unambiguous. revision: yes

  2. Referee: [§4.1] §4.1 and Table 2: the quantitative consistency metrics (e.g., cross-view PSNR or LPIPS) are reported only against Zero123; no ablation isolating the synchronization module versus simple multi-view concatenation is shown, so it is impossible to attribute the claimed consistency gains specifically to the 3D-aware attention.

    Authors: We concur that an ablation isolating the synchronization module is necessary. In the revision we will add a controlled experiment in §4.1 that compares the full SyncDreamer model against a baseline using identical multi-view concatenation but without the 3D-aware attention (i.e., independent per-view diffusion with shared noise schedule only). The new results will be reported alongside the existing metrics to directly attribute gains to the proposed mechanism. revision: yes

  3. Referee: [§4.3] §4.3 (qualitative results): several generated examples exhibit residual color shifts and silhouette mismatches between views; these contradict the central claim that the joint reverse process enforces high consistency and require either quantitative error analysis or a revised statement of the method's limitations.

    Authors: We acknowledge the residual inconsistencies visible in some qualitative examples. While the joint reverse process substantially improves consistency relative to independent sampling, it does not eliminate all color shifts or silhouette mismatches, particularly under large viewpoint changes or complex geometry. In the revision we will (i) add quantitative measures of cross-view color variance and silhouette overlap, and (ii) revise the claims in §4.3 and the conclusion to state the achieved consistency level more precisely and explicitly list remaining limitations. revision: partial

Circularity Check

0 steps flagged

No circularity: new 3D-aware attention mechanism added to pretrained diffusion without self-referential reduction

full rationale

The derivation introduces a synchronized multiview diffusion process via 3D-aware feature attention to correlate cross-view features during joint reverse diffusion. This is presented as an architectural addition on top of existing 2D diffusion models (citing Zero123 externally), with no equations shown to be tautological, no fitted parameters renamed as predictions, and no load-bearing self-citations or imported uniqueness theorems. The joint distribution modeling is achieved through the explicit attention design rather than by construction from the target consistency property.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper relies on pretrained large-scale 2D diffusion models as a black-box starting point and assumes the 3D-aware attention can enforce geometric consistency without additional 3D supervision.

axioms (1)
  • domain assumption Pretrained 2D diffusion models provide a sufficiently rich prior for novel view synthesis when extended with cross-view synchronization.
    Invoked in the abstract when stating that SyncDreamer builds on Zero123 and uses pretrained models.

pith-pipeline@v0.9.0 · 5488 in / 1170 out tokens · 15548 ms · 2026-05-16T10:31:44.910240+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow

    cs.CV 2026-05 unverdicted novelty 7.0

    R-DMesh generates high-fidelity 4D meshes aligned to video by disentangling base mesh, motion, and a learned rectification jump offset inside a VAE, then using Triflow Attention and rectified-flow diffusion.

  2. Img2CADSeq: Image-to-CAD Generation via Sequence-Based Diffusion

    cs.CV 2026-05 unverdicted novelty 7.0

    Img2CADSeq generates standard CAD sequences from images via a multi-stage pipeline with three-level hierarchical codebook encoding, importance-guided compression, and contrastive point-cloud conditioning of a VQ-Diffu...

  3. ConFixGS: Learning to Fix Feedforward 3D Gaussian Splatting with Confidence-Aware Diffusion Priors in Driving Scenes

    cs.CV 2026-05 unverdicted novelty 7.0

    ConFixGS repairs feedforward 3D Gaussian Splatting with confidence-aware diffusion priors, delivering up to 3.68 dB PSNR gains and halved FID scores on Waymo, nuScenes, and KITTI novel view synthesis tasks.

  4. Geometrically Consistent Multi-View Scene Generation from Freehand Sketches

    cs.CV 2026-04 unverdicted novelty 7.0

    A framework generates consistent multi-view scenes from one freehand sketch via a ~9k-sample dataset, Parallel Camera-Aware Attention Adapters, and Sparse Correspondence Supervision Loss, outperforming baselines in re...

  5. SafeMind: A Risk-Aware Differentiable Control Framework for Adaptive and Safe Quadruped Locomotion

    cs.RO 2026-04 unverdicted novelty 7.0

    SafeMind is a differentiable framework that combines probabilistic control barrier functions, semantic context encoding, and meta-adaptive risk calibration to deliver safer, lower-energy quadruped locomotion under unc...

  6. Novel View Synthesis as Video Completion

    cs.CV 2026-04 unverdicted novelty 7.0

    Video diffusion models can be adapted into permutation-invariant generators for sparse novel view synthesis by treating the problem as video completion and removing temporal order cues.

  7. DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation

    cs.CV 2023-09 unverdicted novelty 7.0

    DreamGaussian creates high-quality textured 3D meshes from single-view images in 2 minutes via generative Gaussian Splatting with mesh extraction and UV refinement.

  8. REVIVE 3D: Refinement via Encoded Voluminous Inflated prior for Volume Enhancement

    cs.CV 2026-04 unverdicted novelty 6.0

    REVIVE 3D generates voluminous 3D assets from flat 2D images via an inflated prior construction followed by latent-space refinement, plus new metrics for volume and flatness validated by user study.

  9. FurnSet: Exploiting Repeats for 3D Scene Reconstruction

    cs.CV 2026-04 unverdicted novelty 6.0

    FurnSet improves single-view 3D scene reconstruction by using per-object CLS tokens and set-aware self-attention to group and jointly reconstruct repeated object instances, with added scene-object conditioning and lay...

  10. Camera Control for Text-to-Image Generation via Learning Viewpoint Tokens

    cs.CV 2026-04 unverdicted novelty 6.0

    Viewpoint tokens learned on a mixed 3D-rendered and photorealistic dataset enable precise camera control in text-to-image generation while factorizing geometry from appearance and transferring to unseen object categories.

  11. Repurposing 3D Generative Model for Autoregressive Layout Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    LaviGen turns 3D generative models into an autoregressive layout generator that models geometric and physical constraints, delivering 19% higher physical plausibility and 65% faster inference on the LayoutVLM benchmark.

  12. ReplicateAnyScene: Zero-Shot Video-to-3D Composition via Textual-Visual-Spatial Alignment

    cs.CV 2026-04 unverdicted novelty 6.0

    ReplicateAnyScene performs fully automated zero-shot video-to-compositional-3D reconstruction by cascading alignments of generic priors from vision foundation models across textual, visual, and spatial dimensions.

  13. Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories

    cs.CV 2026-04 unverdicted novelty 6.0

    A video diffusion model learns a joint distribution over videos and camera trajectories by representing cameras as pixel-aligned ray encodings (raxels) denoised jointly with video frames via decoupled attention.

  14. SegviGen: Repurposing 3D Generative Model for Part Segmentation

    cs.CV 2026-03 unverdicted novelty 6.0

    SegviGen shows pretrained 3D generative models can be repurposed for part segmentation via voxel colorization, beating prior methods by 40% interactively and 15% on full segmentation using only 0.32% of labeled data.

  15. InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models

    cs.CV 2024-04 unverdicted novelty 6.0

    InstantMesh produces diverse, high-quality 3D meshes from single images in seconds by combining a multi-view diffusion model with a sparse-view large reconstruction model and optimizing directly on meshes.

  16. Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    cs.CV 2023-11 conditional novelty 6.0

    Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results...

  17. R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow

    cs.CV 2026-05 unverdicted novelty 5.0

    R-DMesh uses a VAE with a learned rectification jump offset and Triflow Attention inside a rectified-flow diffusion transformer to produce video-aligned 4D meshes despite initial pose misalignment.

  18. Pose-Aware Diffusion for 3D Generation

    cs.CV 2026-05 unverdicted novelty 5.0

    PAD synthesizes 3D geometry in observation space via depth unprojection as anchor to eliminate pose ambiguity in image-to-3D generation.

  19. AnimateAnyMesh++: A Flexible 4D Foundation Model for High-Fidelity Text-Driven Mesh Animation

    cs.CV 2026-04 unverdicted novelty 4.0

    AnimateAnyMesh++ animates arbitrary 3D meshes from text using an expanded 300K-identity DyMesh-XL dataset, a power-law topology-aware DyMeshVAE-Flex, and a variable-length rectified-flow generator to produce semantica...

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · cited by 18 Pith papers · 5 internal anchors

  1. [1]

    Re-imagine the negative prompt algorithm: Transform 2d diffusion into 3d, alleviate janus problem and beyond

    Mohammadreza Armandpour, Huangjie Zheng, Ali Sadeghian, Amir Sadeghian, and Mingyuan Zhou. Re-imagine the negative prompt algorithm: Transform 2d diffusion into 3d, alleviate janus problem and beyond. arXiv preprint arXiv:2304.04968,

  2. [2]

    Multidiffusion: Fusing diffusion paths for controlled image generation

    Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. Multidiffusion: Fusing diffusion paths for controlled image generation. arXiv preprint arXiv:2302.08113,

  3. [3]

    ShapeNet: An Information-Rich 3D Model Repository

    Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012,

  4. [4]

    Single- stage diffusion nerf: A unified approach to 3d generation and reconstruction

    Hansheng Chen, Jiatao Gu, Anpei Chen, Wei Tian, Zhuowen Tu, Lingjie Liu, and Hao Su. Single- stage diffusion nerf: A unified approach to 3d generation and reconstruction. In ICCV, 2023a. Rui Chen, Yongwei Chen, Ningxin Jiao, and Kui Jia. Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation. arXiv preprint arXiv:23...

  5. [5]

    Objaverse-xl: A universe of 10m+ 3d objects

    Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram V oleti, Samir Yitzhak Gadre, et al. Objaverse-xl: A universe of 10m+ 3d objects. arXiv preprint arXiv:2307.05663, 2023a. Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Eh...

  6. [6]

    Hyperdiffusion: Generat- ing implicit neural fields with weight-space diffusion

    Ziya Erkoc ¸, Fangchang Ma, Qi Shan, Matthias Nießner, and Angela Dai. Hyperdiffusion: Generat- ing implicit neural fields with weight-space diffusion. arXiv preprint arXiv:2303.17015,

  7. [7]

    An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

    Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618,

  8. [8]

    Learning controllable 3d diffusion models from single-view images

    Jiatao Gu, Qingzhe Gao, Shuangfei Zhai, Baoquan Chen, Lingjie Liu, and Josh Susskind. Learning controllable 3d diffusion models from single-view images. arXiv preprint arXiv:2304.06700 , 2023a. Jiatao Gu, Alex Trevithick, Kai-En Lin, Joshua M Susskind, Christian Theobalt, Lingjie Liu, and Ravi Ramamoorthi. Nerfdiff: Single-image view synthesis with nerf-g...

  9. [9]

    3dgen: Triplane latent diffusion for textured mesh generation

    Anchit Gupta, Wenhan Xiong, Yixin Nie, Ian Jones, and Barlas O ˘guz. 3dgen: Triplane latent diffusion for textured mesh generation. arXiv preprint arXiv:2303.05371,

  10. [10]

    Dream- time: An improved optimization strategy for text-to-3d content creation

    Yukun Huang, Jianan Wang, Yukai Shi, Xianbiao Qi, Zheng-Jun Zha, and Lei Zhang. Dream- time: An improved optimization strategy for text-to-3d content creation. arXiv preprint arXiv:2306.12422,

  11. [11]

    Shap-e: Generating conditional 3d implicit functions

    Heewoo Jun and Alex Nichol. Shap-e: Generating conditional 3d implicit functions. arXiv preprint arXiv:2305.02463,

  12. [12]

    Neuralfield-ldm: Scene generation with hierarchi- cal latent diffusion models

    11 Published as a conference paper at ICLR 2024 Seung Wook Kim, Bradley Brown, Kangxue Yin, Karsten Kreis, Katja Schwarz, Daiqing Li, Robin Rombach, Antonio Torralba, and Sanja Fidler. Neuralfield-ldm: Scene generation with hierarchi- cal latent diffusion models. In CVPR,

  13. [13]

    Syncdiffusion: Coherent montage via synchronized joint diffusions

    Yuseung Lee, Kunho Kim, Hyunjin Kim, and Minhyuk Sung. Syncdiffusion: Coherent montage via synchronized joint diffusions. arXiv preprint arXiv:2306.05178,

  14. [14]

    One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimization

    Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Zexiang Xu, and Hao Su. One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimization. arXiv preprint arXiv:2306.16928, 2023a. Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl V ondrick. Zero-1-to-3: Zero-shot one image to 3d object. In ICCV, 2023b. Xi...

  15. [15]

    Point-E: A System for Generating 3D Point Clouds from Complex Prompts

    Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. Point-e: A system for generating 3d point clouds from complex prompts. arXiv preprint arXiv:2212.08751,

  16. [16]

    Autodecoding latent 3d diffusion models

    Evangelos Ntavelis, Aliaksandr Siarohin, Kyle Olszewski, Chaoyang Wang, Luc Van Gool, and Sergey Tulyakov. Autodecoding latent 3d diffusion models. arXiv preprint arXiv:2307.05445,

  17. [17]

    Magic123: One image to high-quality 3d object generation using both 2d and 3d diffusion priors

    Guocheng Qian, Jinjie Mai, Abdullah Hamdi, Jian Ren, Aliaksandr Siarohin, Bing Li, Hsin- Ying Lee, Ivan Skorokhodov, Peter Wonka, Sergey Tulyakov, et al. Magic123: One image to high-quality 3d object generation using both 2d and 3d diffusion priors. arXiv preprint arXiv:2306.17843,

  18. [18]

    Dreambooth3d: Subject-driven text-to-3d generation

    12 Published as a conference paper at ICLR 2024 Amit Raj, Srinivas Kaza, Ben Poole, Michael Niemeyer, Nataniel Ruiz, Ben Mildenhall, Shiran Zada, Kfir Aberman, Michael Rubinstein, Jonathan Barron, et al. Dreambooth3d: Subject-driven text-to-3d generation. arXiv preprint arXiv:2303.13508,

  19. [19]

    Vq3d: Learning a 3d-aware generative model on imagenet

    Kyle Sargent, Jing Yu Koh, Han Zhang, Huiwen Chang, Charles Herrmann, Pratul Srinivasan, Jiajun Wu, and Deqing Sun. Vq3d: Learning a 3d-aware generative model on imagenet. arXiv preprint arXiv:2302.06833,

  20. [20]

    Ditto-nerf: Diffusion-based itera- tive text to omni-directional 3d model

    Hoigi Seo, Hayeon Kim, Gwanghyun Kim, and Se Young Chun. Ditto-nerf: Diffusion-based itera- tive text to omni-directional 3d model. arXiv preprint arXiv:2304.02827, 2023a. Junyoung Seo, Wooseok Jang, Min-Seop Kwak, Jaehoon Ko, Hyeonsu Kim, Junho Kim, Jin-Hwa Kim, Jiyoung Lee, and Seungryong Kim. Let 2d diffusion model know 3d-consistency for robust text-t...

  21. [21]

    MVDream: Multi-view Diffusion for 3D Generation

    Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. Mvdream: Multi-view diffusion for 3d generation. arXiv preprint arXiv:2308.16512,

  22. [22]

    3d generation on imagenet

    Ivan Skorokhodov, Aliaksandr Siarohin, Yinghao Xu, Jian Ren, Hsin-Ying Lee, Peter Wonka, and Sergey Tulyakov. 3d generation on imagenet. arXiv preprint arXiv:2303.01416,

  23. [23]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502,

  24. [24]

    Viewset diffusion:(0-) image- conditioned 3d generative models from 2d data

    Stanislaw Szymanowicz, Christian Rupprecht, and Andrea Vedaldi. Viewset diffusion:(0-) image- conditioned 3d generative models from 2d data. arXiv preprint arXiv:2306.07881,

  25. [25]

    Make- it-3d: High-fidelity 3d creation from a single image with diffusion prior

    Junshu Tang, Tengfei Wang, Bo Zhang, Ting Zhang, Ran Yi, Lizhuang Ma, and Dong Chen. Make- it-3d: High-fidelity 3d creation from a single image with diffusion prior. In ICCV, 2023a. Shitao Tang, Fuyang Zhang, Jiacheng Chen, Peng Wang, and Yasutaka Furukawa. Mvdiffusion: En- abling holistic multi-view image generation with correspondence-aware diffusion. a...

  26. [26]

    State of the art on neural rendering

    13 Published as a conference paper at ICLR 2024 Ayush Tewari, Ohad Fried, Justus Thies, Vincent Sitzmann, Stephen Lombardi, Kalyan Sunkavalli, Ricardo Martin-Brualla, Tomas Simon, Jason Saragih, Matthias Nießner, et al. State of the art on neural rendering. In Computer Graphics Forum,

  27. [27]

    Diffusion with forward models: Solving stochastic inverse problems without direct supervision

    Ayush Tewari, Tianwei Yin, George Cazenavette, Semon Rezchikov, Joshua B Tenenbaum, Fr ´edo Durand, William T Freeman, and Vincent Sitzmann. Diffusion with forward models: Solving stochastic inverse problems without direct supervision. arXiv preprint arXiv:2306.11719,

  28. [28]

    Textmesh: Generation of realistic 3d meshes from text prompts.arXiv preprint arXiv:2304.12439,

    Christina Tsalicoglou, Fabian Manhardt, Alessio Tonioni, Michael Niemeyer, and Federico Tombari. Textmesh: Generation of realistic 3d meshes from text prompts.arXiv preprint arXiv:2304.12439,

  29. [29]

    Rodin: A generative model for sculpting 3d digital avatars using diffusion

    Tengfei Wang, Bo Zhang, Ting Zhang, Shuyang Gu, Jianmin Bao, Tadas Baltrusaitis, Jingjing Shen, Dong Chen, Fang Wen, Qifeng Chen, et al. Rodin: A generative model for sculpting 3d digital avatars using diffusion. In CVPR, 2023b. Yiming Wang, Qin Han, Marc Habermann, Kostas Daniilidis, Christian Theobalt, and Lingjie Liu. Neus2: Fast learning of neural imp...

  30. [30]

    Novel view synthesis with diffusion models.arXiv preprint arXiv:2210.04628,

    Daniel Watson, William Chan, Ricardo Martin-Brualla, Jonathan Ho, Andrea Tagliasacchi, and Mo- hammad Norouzi. Novel view synthesis with diffusion models.arXiv preprint arXiv:2210.04628,

  31. [31]

    Hd-fusion: Detailed text-to-3d generation leveraging multiple noise estimation.arXiv preprint arXiv:2307.16183,

    Jinbo Wu, Xiaobo Gao, Xing Liu, Zhengyang Shen, Chen Zhao, Haocheng Feng, Jingtuo Liu, and Errui Ding. Hd-fusion: Detailed text-to-3d generation leveraging multiple noise estimation.arXiv preprint arXiv:2307.16183,

  32. [32]

    V oxurf: V oxel-based efficient and accurate neural surface reconstruction

    Tong Wu, Jiaqi Wang, Xingang Pan, Xudong Xu, Christian Theobalt, Ziwei Liu, and Dahua Lin. V oxurf: V oxel-based efficient and accurate neural surface reconstruction. arXiv preprint arXiv:2208.12697,

  33. [33]

    3d-aware image generation using 2d diffusion models

    Jianfeng Xiang, Jiaolong Yang, Binbin Huang, and Xin Tong. 3d-aware image generation using 2d diffusion models. arXiv preprint arXiv:2303.17905,

  34. [34]

    Points-to-3d: Bridging the gap between sparse points and shape-controllable text-to-3d generation

    14 Published as a conference paper at ICLR 2024 Chaohui Yu, Qiang Zhou, Jingliang Li, Zhe Zhang, Zhibin Wang, and Fan Wang. Points-to-3d: Bridging the gap between sparse points and shape-controllable text-to-3d generation. arXiv preprint arXiv:2307.13908, 2023a. Jason J. Yu, Fereshteh Forghani, Konstantinos G. Derpanis, and Marcus A. Brubaker. Long-term p...

  35. [35]

    3dshape2vecset: A 3d shape representation for neural fields and generative diffusion models

    Biao Zhang, Jiapeng Tang, Matthias Niessner, and Peter Wonka. 3dshape2vecset: A 3d shape representation for neural fields and generative diffusion models. In SIGGRAPH, 2023a. Jingbo Zhang, Xiaoyu Li, Ziyu Wan, Can Wang, and Jing Liao. Text2nerf: Text-driven 3d scene generation with neural radiance fields. arXiv preprint arXiv:2305.11588, 2023b. Richard Zh...

  36. [36]

    Hifa: High-fidelity text-to-3d with advanced diffusion guidance

    Joseph Zhu and Peiye Zhuang. Hifa: High-fidelity text-to-3d with advanced diffusion guidance. arXiv preprint arXiv:2305.18766,

  37. [37]

    We chose these sizes because the latent feature map size of an image of256 × 256 in the Stable Diffusion Rombach et al

    We sample 48 depth planes for the view-frustum volume because the view may look into the volume from the diagonal direction. We chose these sizes because the latent feature map size of an image of256 × 256 in the Stable Diffusion Rombach et al. (2022)32×32. The elevation of the target views is set to 30◦ and the azimuth evenly distributes in [0◦, 360◦]. B...

  38. [38]

    The viewpoint difference is computed from the difference between the target view and the input view on their elevations and azimuths

    The learning rate is annealed from 5e-4 to 1e-5. The viewpoint difference is computed from the difference between the target view and the input view on their elevations and azimuths. Since we need an elevation of the input view to compute the viewpoint difference ∆v(n), we use the rendering elevation in training while we roughly estimate an elevation angl...

  39. [39]

    On each step, we sample 4096 rays and sample 128 points on each ray for training

    for 2k steps to reconstruct the shape, which costs about 10 mins. On each step, we sample 4096 rays and sample 128 points on each ray for training. Both the mask loss and the rendering loss are applied in training NeuS. The reconstruction process can be further sped up by faster reconstruction methods (Wang et al., 2023c; Guo, 2022; Wu et al.,

  40. [40]

    A.2 T EXT-TO-IMAGE -TO-3D By incorporating text2image models like Stable Diffusion (Rombach et al.,

    or generalizable SDF predictors (Long et al., 2022; Liu et al., 2023a) with priors. A.2 T EXT-TO-IMAGE -TO-3D By incorporating text2image models like Stable Diffusion (Rombach et al.,

  41. [41]

    Examples are shown in Fig

    or Imagen (Saharia et al., 2022), SyncDreamer enables generating 3D models from text. Examples are shown in Fig

  42. [42]

    Input view Good instance Failure instance Figure 9: Limitation on the generation quality

    Compared with existing text-to-3D distillation, our method gives more flexibility because users 1https://github.com/OPHoperHPO/image-background-remove-tool 15 Published as a conference paper at ICLR 2024 Input text Text to image Generated images Mesh Figure 8: Examples of using SyncDreamer to generate 3D models from texts. Input view Good instance Failure...

  43. [43]

    Especially, we notice that the generation quality is sensitive to the foreground object size in the image. The reason is that changing the foreground object size corresponds to adjusting the perspective patterns of the input camera and affects how the model perceives the geometry of the ob- ject. The training images of SyncDreamer have a predefined intrin...

  44. [44]

    In the figure, row 1 shows the generated images of SyncDreamer and row 2 shows the re-generated images of SyncDreamer using one of first-row images as its input image. Though the regenerated images are still plausible, they reasonably differ from the original input view A.7 N OVEL -VIEW RENDERINGS OF NEUS Though SyncDreamer can only generate images on fix...

  45. [45]

    MLP” NeuS and “Hash-grid

    19 Published as a conference paper at ICLR 2024 4 views 8 views 16 views Figure 14: Results of using fewer generated views of SyncDreamer for NeuS reconstruction. Odd columns show the renderings of NeuS whileeven columns show the reconstructed surfaces of NeuS. Input MLP Hash-grid Input MLP Hash-grid Figure 15: Surface reconstruction results using “MLP” N...

  46. [46]

    However, applying such an attention layer to all the feature maps of 16 images in our setting costs unaffordable GPU memory in training

    applies attention layers on all feature maps from multiview images, which also achieves promising results. However, applying such an attention layer to all the feature maps of 16 images in our setting costs unaffordable GPU memory in training. Finding a suitable network design for multiview-consistent image generation would still be an interesting and cha...