How to Spin an Object: First, Get the Shape Right
Pith reviewed 2026-05-23 06:50 UTC · model grok-4.3
The pith
Camera-relative object coordinates outperform depth maps and pointmaps as the intermediate geometry for two-stage image-to-3D generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By factorizing image-to-3D generation into a multiview-geometry prior and an appearance decoder, the unPIC framework identifies Camera-Relative Object Coordinates (CROCS) as the optimal intermediate representation because they are simpler to predict from images and provide stronger conditioning for consistent texture decoding across views, while also enabling direct 3D point cloud output.
What carries the argument
unPIC, a modular framework that separates the image-to-3D process into a multiview-geometry prior followed by an appearance decoder to enable controlled comparison of intermediate geometry representations.
If this is right
- CROCS enables fully feedforward 3D point cloud generation without requiring a separate post-hoc reconstruction step.
- CROCS serves as an effective conditioning signal that improves 360-degree multiview consistency during appearance decoding.
- The unPIC formulation with CROCS outperforms leading baselines including InstantMesh, Direct3D, CAT3D, Free3D, and EscherNet on real-world captures such as Google Scanned Objects.
- CROCS is easier for the first-stage geometry prior to predict than depth maps or other pointmap-based alternatives.
Where Pith is reading between the lines
- If camera-relative coordinates prove consistently superior, future two-stage models may shift away from depth or feature-based intermediates toward explicit relative 3D encodings.
- The factorization approach could be reused to benchmark geometry representations in adjacent tasks such as 4D reconstruction or video-to-3D lifting.
- Direct point cloud output from CROCS suggests that explicit coordinate prediction may reduce the need for implicit surface representations in feedforward 3D pipelines.
Load-bearing premise
The modular split between the multiview-geometry prior and appearance decoder in unPIC isolates the effect of the intermediate representation without interference from implementation differences in the prior or decoder.
What would settle it
A controlled test in which the geometry prior and appearance decoder are jointly optimized or replaced with alternative architectures, showing that CROCS no longer produces higher novel-view PSNR or lower geometric error than depth maps on the same real-world datasets.
Figures
read the original abstract
Image-to-3D models increasingly rely on hierarchical generation to disentangle geometry and texture. However, the design choices underlying these two-stage models--particularly the optimal choice of intermediate geometric representations--remain largely understudied. To investigate this, we introduce unPIC (undo-a-Picture), a modular framework for empirical analysis of image-to-3D pipelines. By factorizing the generation process into a multiview-geometry prior followed by an appearance decoder, unPIC enables a rigorous comparison of intermediate geometry representations. Through this framework, we identify that a specific representation, Camera-Relative Object Coordinates (CROCS), significantly outperforms alternatives such as depth maps, pretrained visual features, and other pointmap-based representations. We demonstrate that CROCS is not only easier for the first-stage geometry prior to predict, but also serves as an effective conditioning signal for ensuring 360-degree consistency during appearance decoding. Another advantage is that CROCS enables fully feedforward, direct 3D point cloud generation without requiring a separate post-hoc reconstruction step. Our unPIC formulation utilizing CROCS achieves superior novel-view quality, geometric accuracy, and multiview consistency; it outperforms leading baselines, including InstantMesh, Direct3D, CAT3D, Free3D, and EscherNet, on datasets of real-world 3D captures like Google Scanned Objects and the Digital Twin Catalog.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the unPIC framework, which factorizes image-to-3D generation into a multiview-geometry prior followed by an appearance decoder. This modular setup is used to empirically compare intermediate geometry representations, with the central finding that Camera-Relative Object Coordinates (CROCS) outperforms depth maps, pretrained visual features, and other pointmap representations. CROCS is claimed to be easier to predict, provide better conditioning for 360-degree consistency, enable direct point-cloud output, and yield superior novel-view quality, geometric accuracy, and multiview consistency. The CROCS-based unPIC model is reported to outperform baselines including InstantMesh, Direct3D, CAT3D, Free3D, and EscherNet on real-world datasets such as Google Scanned Objects and the Digital Twin Catalog.
Significance. If the comparisons are controlled such that performance differences can be causally attributed to the geometry representation rather than implementation variations, the work would offer a useful empirical guide for choosing intermediate representations in hierarchical 3D generation models.
major comments (2)
- [Experimental setup and results sections (around the description of unPIC and the comparison experiments)] The central claim that CROCS superiority is due to its intrinsic properties (easier prediction, better conditioning, direct output) within the unPIC setup requires that the multiview-geometry prior and appearance decoder use identical architectures, losses, training schedules, and hyperparameters across all tested representations, differing only in input/output tensor format and coordinate semantics. The manuscript must provide explicit confirmation and controls demonstrating this (e.g., in the experimental setup or ablations section); without it, performance gaps cannot be isolated from potential per-representation adaptations.
- [Results and evaluation sections (tables reporting metrics on GSO and DTC)] Quantitative support for the outperformance claims (novel-view quality, geometric accuracy, multiview consistency) is needed with full details including error bars, dataset splits, implementation specifics, and tables comparing all representations under the same unPIC configuration; the absence of these in the provided abstract raises the need for clear presentation in the main results.
minor comments (1)
- [Introduction or method section introducing CROCS] Clarify the exact definition and coordinate semantics of CROCS early in the paper to aid readers in understanding its distinction from other pointmap representations.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the concerns about experimental controls and quantitative reporting below. The revisions will make the controlled nature of the comparisons and the supporting details explicit.
read point-by-point responses
-
Referee: [Experimental setup and results sections (around the description of unPIC and the comparison experiments)] The central claim that CROCS superiority is due to its intrinsic properties (easier prediction, better conditioning, direct output) within the unPIC setup requires that the multiview-geometry prior and appearance decoder use identical architectures, losses, training schedules, and hyperparameters across all tested representations, differing only in input/output tensor format and coordinate semantics. The manuscript must provide explicit confirmation and controls demonstrating this (e.g., in the experimental setup or ablations section); without it, performance gaps cannot be isolated from potential per-representation adaptations.
Authors: The unPIC framework was explicitly designed so that the multiview-geometry prior and appearance decoder share identical architectures, losses, training schedules, and hyperparameters for every representation tested; the only differences are the tensor shapes and the semantic meaning of the coordinate channels. This design isolates the effect of the representation itself. To satisfy the request for explicit confirmation, we will insert a dedicated paragraph in the Experimental Setup section that states these controls verbatim and notes that no per-representation hyperparameter search or architectural modifications were performed. revision: yes
-
Referee: [Results and evaluation sections (tables reporting metrics on GSO and DTC)] Quantitative support for the outperformance claims (novel-view quality, geometric accuracy, multiview consistency) is needed with full details including error bars, dataset splits, implementation specifics, and tables comparing all representations under the same unPIC configuration; the absence of these in the provided abstract raises the need for clear presentation in the main results.
Authors: The full manuscript already contains tables reporting the relevant metrics on GSO and DTC, but we agree that additional transparency is warranted. In the revision we will augment the results section with (i) error bars obtained from three independent training runs, (ii) explicit statements of the train/validation/test splits, (iii) further implementation details (model parameter counts, optimizer settings, training wall-clock time), and (iv) a single consolidated table that directly compares depth maps, visual features, alternative pointmaps, and CROCS under the identical unPIC configuration. revision: yes
Circularity Check
No circularity in empirical comparison of geometry representations
full rationale
The paper introduces unPIC as a modular empirical framework to compare intermediate geometry representations (depth maps, visual features, pointmaps, CROCS) via a two-stage pipeline of multiview-geometry prior followed by appearance decoder. All reported claims of CROCS superiority rest on experimental metrics (novel-view quality, geometric accuracy, multiview consistency) across datasets, with no equations, derivations, or parameter-fitting steps that reduce outputs to inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked in the provided text to justify core results. The factorization is presented as enabling controlled comparison rather than as a derived necessity, and performance differences are attributed to observable properties of the representations themselves.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math Standard supervised learning assumptions (i.i.d. train/test splits, fixed random seeds for comparison) hold for the reported experiments.
invented entities (1)
-
CROCS
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Objectron: A large scale dataset of object-centric videos in the wild with pose an- notations
Adel Ahmadyan, Liangkai Zhang, Artsiom Ablavatski, Jian- ing Wei, and Matthias Grundmann. Objectron: A large scale dataset of object-centric videos in the wild with pose an- notations. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7822–7831,
-
[2]
Mohammadreza Armandpour, Huangjie Zheng, Ali Sadeghian, Amir Sadeghian, and Mingyuan Zhou. Re- imagine the negative prompt algorithm: Transform 2d diffusion into 3d, alleviate janus problem and beyond. arXiv preprint arXiv:2304.04968, 2023. 2
-
[3]
The ycb object and model set: Towards common benchmarks for manipu- lation research
Berk Calli, Arjun Singh, Aaron Walsman, Siddhartha Srini- vasa, Pieter Abbeel, and Aaron M Dollar. The ycb object and model set: Towards common benchmarks for manipu- lation research. In 2015 International Conference on Ad- vanced Robotics (ICAR), pages 510–517. IEEE, 2015. 4, 15
work page 2015
-
[4]
Abo: Dataset and benchmarks for real-world 3d object understand- ing
Jasmine Collins, Shubham Goel, Kenan Deng, Achlesh- war Luthra, Leon Xu, Erhan Gundogdu, Xi Zhang, Tomas F Yago Vicente, Thomas Dideriksen, Himanshu Arora, Matthieu Guillaumin, and Jitendra Malik. Abo: Dataset and benchmarks for real-world 3d object understand- ing. CVPR, 2022. 7
work page 2022
-
[5]
Objaverse: A universe of annotated 3d objects
Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13142–13153, 2023. 4
work page 2023
-
[6]
Objaverse-xl: A universe of 10m+ 3d objects
Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram V oleti, Samir Yitzhak Gadre, et al. Objaverse-xl: A universe of 10m+ 3d objects. Advances in Neural Informa- tion Processing Systems, 36, 2024. 4, 17
work page 2024
-
[7]
Strobl, Matthias Humt, and Rudolph Triebel
Maximilian Denninger, Dominik Winkelbauer, Martin Sun- dermeyer, Wout Boerdijk, Markus Knauer, Klaus H. Strobl, Matthias Humt, and Rudolph Triebel. Blenderproc2: A 10 procedural pipeline for photorealistic rendering. Journal of Open Source Software, 8(82):4901, 2023. 18
work page 2023
-
[8]
Laura Downs, Anthony Francis, Nate Koenig, Brandon Kin- man, Ryan Hickman, Krista Reymann, Thomas B. McHugh, and Vincent Vanhoucke. Google scanned objects: A high- quality dataset of 3d scanned household items, 2022. 7
work page 2022
-
[9]
Niladri Shekhar Dutt, Sanjeev Muralikrishnan, and Niloy J. Mitra. Diffusion 3d features (diff3f): Decorating untextured shapes with distilled semantic features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4494–4504, 2024. 3
work page 2024
-
[10]
Prob- ing the 3d awareness of visual foundation models
Mohamed El Banani, Amit Raj, Kevis-Kokitsi Maninis, Ab- hishek Kar, Yuanzhen Li, Michael Rubinstein, Deqing Sun, Leonidas Guibas, Justin Johnson, and Varun Jampani. Prob- ing the 3d awareness of visual foundation models. In CVPR, pages 21795–21806, 2024. 3
work page 2024
-
[11]
Geowiz- ard: Unleashing the diffusion priors for 3d geometry estima- tion from a single image
Xiao Fu, Wei Yin, Mu Hu, Kaixuan Wang, Yuexin Ma, Ping Tan, Shaojie Shen, Dahua Lin, and Xiaoxiao Long. Geowiz- ard: Unleashing the diffusion priors for 3d geometry estima- tion from a single image. In ECCV, 2024. 3
work page 2024
-
[12]
Ruiqi Gao*, Aleksander Holynski*, Philipp Henzler, Arthur Brussee, Ricardo Martin-Brualla, Pratul P. Srinivasan, Jonathan T. Barron, and Ben Poole*. Cat3d: Create any- thing in 3d with multi-view diffusion models. arXiv, 2024. 1, 2, 4, 15
work page 2024
-
[13]
Multiple view ge- ometry in computer vision
Richard Hartley and Andrew Zisserman. Multiple view ge- ometry in computer vision . Cambridge university press,
-
[14]
Lotus: Diffusion-based visual foundation model for high-quality dense prediction
Jing He, Haodong Li, Wei Yin, Yixun Liang, Leheng Li, Kaiqiang Zhou, Hongbo Liu, Bingbing Liu, and Ying- Cong Chen. Lotus: Diffusion-based visual foundation model for high-quality dense prediction. arXiv preprint arXiv:2409.18124, 2024. 3
-
[15]
Gans trained by a two time-scale update rule converge to a local nash equilib- rium
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilib- rium. Advances in neural information processing systems , 30, 2017. 7
work page 2017
-
[16]
Classifier-Free Diffusion Guidance
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022. 3
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[17]
Denoising diffu- sion probabilistic models
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu- sion probabilistic models. NeurIPS, 33:6840–6851, 2020. 1, 4
work page 2020
-
[18]
Lrm: Large reconstruction model for single image to 3d
Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. Lrm: Large reconstruction model for single image to 3d. ICLR, 2024. 2, 3, 6, 9, 15
work page 2024
-
[19]
sim- ple diffusion: End-to-end diffusion for high resolution im- ages
Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. sim- ple diffusion: End-to-end diffusion for high resolution im- ages. In International Conference on Machine Learning , pages 13213–13232. PMLR, 2023. 5
work page 2023
-
[20]
Gqa: A new dataset for real-world visual reasoning and compositional question answering
Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF con- ference on computer vision and pattern recognition , pages 6700–6709, 2019. 10
work page 2019
-
[21]
SODA: Bottleneck diffusion models for representation learning
Drew A Hudson, Daniel Zoran, Mateusz Malinowski, An- drew K Lampinen, Andrew Jaegle, James L McClelland, Loic Matthey, Felix Hill, and Alexander Lerchner. SODA: Bottleneck diffusion models for representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 23115–23127, 2024. 3
work page 2024
-
[22]
Allan Jabri, Sjoerd van Steenkiste, Emiel Hoogeboom, Mehdi S. M. Sajjadi, and Thomas Kipf. Dorsal: Diffusion for object-centric representations of scenes et al. In ICLR,
-
[23]
Shap-e: Generating condi- tional 3d implicit functions, 2023
Heewoo Jun and Alex Nichol. Shap-e: Generating condi- tional 3d implicit functions, 2023. 3
work page 2023
-
[24]
Leveraging VLM-based pipelines to annotate 3d objects
Rishabh Kabra, Loic Matthey, Alexander Lerchner, and Niloy Mitra. Leveraging VLM-based pipelines to annotate 3d objects. In Forty-first International Conference on Ma- chine Learning, 2024. 17
work page 2024
-
[25]
Repurpos- ing diffusion-based image generators for monocular depth estimation
Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Met- zger, Rodrigo Caye Daudt, and Konrad Schindler. Repurpos- ing diffusion-based image generators for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 3
work page 2024
-
[26]
3d gaussian splatting for real-time radiance field rendering
Bernhard Kerbl, Georgios Kopanas, Thomas Leimk ¨uhler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1,
-
[27]
Xin Kong, Shikun Liu, Xiaoyang Lyu, Marwan Taher, Xiao- juan Qi, and Andrew J. Davison. Eschernet: A generative model for scalable view synthesis. In CVPR, pages 9503– 9513, 2024. 1, 2
work page 2024
-
[28]
Omninocs: A unified nocs dataset and model for 3d lifting of 2d objects
Akshay Krishnan, Abhijit Kundu, Kevis-Kokitsi Maninis, James Hays, and Matthew Brown. Omninocs: A unified nocs dataset and model for 3d lifting of 2d objects. In European Conference on Computer Vision , pages 127–145. Springer,
-
[29]
Advances in 3d generation: A survey
Xiaoyu Li, Qi Zhang, Di Kang, Weihao Cheng, Yiming Gao, Jingbo Zhang, Zhihao Liang, Jing Liao, Yan-Pei Cao, and Ying Shan. Advances in 3d generation: A survey. arXiv preprint arXiv:2401.17807, 2024. 1
-
[30]
One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimiza- tion
Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Mukund Varma T, Zexiang Xu, and Hao Su. One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimiza- tion. Advances in Neural Information Processing Systems , 36, 2024. 15
work page 2024
-
[31]
Zero-1-to-3: Zero-shot one image to 3d object
Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tok- makov, Sergey Zakharov, and Carl V ondrick. Zero-1-to-3: Zero-shot one image to 3d object. In ICCV, pages 9298– 9309, 2023. 1, 2, 6
work page 2023
-
[32]
Nerf: Representing scenes as neural radiance fields for view syn- thesis
Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view syn- thesis. Communications of the ACM , 65(1):99–106, 2021. 2
work page 2021
-
[33]
Improved denoising diffusion probabilistic models
Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International conference on machine learning, pages 8162–8171. PMLR,
-
[34]
Muyao Niu, Xiaodong Cun, Xintao Wang, Yong Zhang, Ying Shan, and Yinqiang Zheng. Mofa-video: Controllable image animation via generative motion field adaptions in frozen image-to-video diffusion model. ECCV, 2024. 3 11
work page 2024
-
[35]
Maxime Oquab, Timoth ´ee Darcet, Th´eo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Je- gou, Julien Mairal, Patr...
work page 2024
-
[36]
Aria digital twin: A new benchmark dataset for egocentric 3d machine percep- tion, 2023
Xiaqing Pan, Nicholas Charron, Yongqian Yang, Scott Pe- ters, Thomas Whelan, Chen Kong, Omkar Parkhi, Richard Newcombe, and Carl Yuheng Ren. Aria digital twin: A new benchmark dataset for egocentric 3d machine percep- tion, 2023. 7
work page 2023
-
[37]
Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Milden- hall. Dreamfusion: Text-to-3d using 2d diffusion. ICLR,
-
[38]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. pages 8748–
-
[39]
Hierarchical Text-Conditional Image Generation with CLIP Latents
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image gener- ation with clip latents. arXiv preprint arXiv:2204.06125, 1 (2):3, 2022. 1
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[40]
High-resolution image syn- thesis with latent diffusion models, 2021
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj¨orn Ommer. High-resolution image syn- thesis with latent diffusion models, 2021. 6
work page 2021
-
[41]
Zeronvs: Zero- shot 360-degree view synthesis from a single image
Kyle Sargent, Zizhang Li, Tanmay Shah, Charles Herrmann, Hong-Xing Yu, Yunzhi Zhang, Eric Ryan Chan, Dmitry La- gun, Li Fei-Fei, Deqing Sun, and Jiajun Wu. Zeronvs: Zero- shot 360-degree view synthesis from a single image. In CVPR, pages 9420–9429, 2024. 2
work page 2024
-
[42]
Mental rotation of three-dimensional objects
Roger N Shepard and Jacqueline Metzler. Mental rotation of three-dimensional objects. Science, 171(3972):701–703,
-
[43]
Motion-i2v: Consistent and controllable image-to-video generation with explicit motion modeling
Xiaoyu Shi, Zhaoyang Huang, Fu-Yun Wang, Weikang Bian, Dasong Li, Yi Zhang, Manyuan Zhang, Ka Chun Cheung, Simon See, Hongwei Qin, et al. Motion-i2v: Consistent and controllable image-to-video generation with explicit motion modeling. In ACM SIGGRAPH 2024 Conference Papers , pages 1–11, 2024. 3
work page 2024
-
[44]
Mental rotations, a group test of three-dimensional spatial visualization
Steven G Vandenberg and Allan R Kuse. Mental rotations, a group test of three-dimensional spatial visualization. Per- ceptual and motor skills, 47(2):599–604, 1978. 15
work page 1978
-
[45]
Normalized object coordinate space for category-level 6d object pose and size estimation
He Wang, Srinath Sridhar, Jingwei Huang, Julien Valentin, Shuran Song, and Leonidas J Guibas. Normalized object coordinate space for category-level 6d object pose and size estimation. In CVPR, pages 2642–2651, 2019. 1, 4
work page 2019
-
[46]
Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A. Yeh, and Greg Shakhnarovich. Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In CVPR, pages 12619–12629, 2023. 2
work page 2023
-
[47]
Dust3r: Geometric 3d vi- sion made easy
Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vi- sion made easy. In CVPR, pages 20697–20709, 2024. 3, 13
work page 2024
-
[48]
Image quality assessment: from error visibility to structural similarity
Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Si- moncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004. 7
work page 2004
-
[49]
Novel view synthesis with diffusion models
Daniel Watson, William Chan, Ricardo Martin Bru- alla, Jonathan Ho, Andrea Tagliasacchi, and Mohammad Norouzi. Novel view synthesis with diffusion models. In The Eleventh International Conference on Learning Repre- sentations, 2023. 3
work page 2023
-
[50]
Srinivasan, Dor Verbin, Jonathan T
Rundi Wu, Ben Mildenhall, Philipp Henzler, Keunhong Park, Ruiqi Gao, Daniel Watson, Pratul P. Srinivasan, Dor Verbin, Jonathan T. Barron, Ben Poole, and Aleksander Ho?y?ski. Reconfusion: 3d reconstruction with diffusion priors. In CVPR, pages 21551–21561, 2024. 2
work page 2024
-
[51]
Neural assets: 3d-aware multi-object scene synthesis with image diffusion models
Ziyi Wu, Yulia Rubanova, Rishabh Kabra, Drew A Hud- son, Igor Gilitschenski, Yusuf Aytar, Sjoerd van Steenkiste, Kelsey R Allen, and Thomas Kipf. Neural assets: 3d-aware multi-object scene synthesis with image diffusion models. arXiv preprint arXiv:2406.09292, 2024. 3
-
[52]
Sparp: Fast 3d object reconstruction and pose estimation from sparse views
Chao Xu, Ang Li, Linghao Chen, Yulin Liu, Ruoxi Shi, Hao Su, and Minghua Liu. Sparp: Fast 3d object reconstruction and pose estimation from sparse views. ECCV, 2024. 3, 4
work page 2024
-
[53]
Dmv3d: Denoising multi-view diffusion using 3d large reconstruction model
Yinghao Xu, Hao Tan, Fujun Luan, Sai Bi, Peng Wang, Ji- ahao Li, Zifan Shi, Kalyan Sunkavalli, Gordon Wetzstein, Zexiang Xu, et al. Dmv3d: Denoising multi-view diffusion using 3d large reconstruction model. ICLR, 2024. 2
work page 2024
-
[54]
pixelNeRF: Neural radiance fields from one or few images
Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelNeRF: Neural radiance fields from one or few images. In CVPR, 2021. 2, 3
work page 2021
-
[55]
Adding conditional control to text-to-image diffusion models, 2023
Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models, 2023. 3
work page 2023
-
[56]
The unreasonable effectiveness of deep features as a perceptual metric
Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018. 7
work page 2018
-
[57]
Chuanxia Zheng and Andrea Vedaldi. Free3d: Consistent novel view synthesis without 3d representation. In Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9720–9731, 2024. 2 12 A. Layout The appendix is organized as follows: we describe CROCS in detail in Appendix B, discussing in particular the geomet- rica...
work page 2024
-
[58]
The error of the prior alone. This describes the geomet- rical inaccuracy (when the intermediate representation is CROCS) in predicting the 3D shape and pose of a given object. 15 (a) Ground truth (b) unPIC via CROCS (Ours) (c) One-2-3-45 (XL) (d) CAT3D Figure 10. Additional qualitative comparison on Objaverse-XL holdouts
-
[59]
The error of the decoder when it is fed ground-truth CROCS rather than the output of the prior. This de- scribes the difficulty of rendering the object (e.g., pre- dicting the object’s texture) at novel views when extrap- olating from a single source image. We report these components along with total hierarchi- cal error in Table 5. We find that the prior...
work page 2000
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.