pith. sign in

arxiv: 2412.10273 · v3 · submitted 2024-12-13 · 💻 cs.CV · cs.LG

How to Spin an Object: First, Get the Shape Right

Pith reviewed 2026-05-23 06:50 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords image-to-3Dgeometry representationnovel view synthesismultiview consistencypoint cloudsCROCSunPICtwo-stage generation
0
0 comments X

The pith

Camera-relative object coordinates outperform depth maps and pointmaps as the intermediate geometry for two-stage image-to-3D generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines design choices in hierarchical image-to-3D models that first predict geometry then decode appearance. It introduces the unPIC framework, which factors the pipeline into a fixed multiview-geometry prior followed by an appearance decoder, to isolate and compare different intermediate geometric representations. Experiments demonstrate that Camera-Relative Object Coordinates, or CROCS, are both easier for the geometry prior to predict and more effective at enforcing 360-degree consistency in the decoder. This representation also supports direct feedforward generation of 3D point clouds without extra reconstruction steps, leading to higher novel-view quality, geometric accuracy, and multiview consistency than depth maps, pretrained features, or other pointmap variants.

Core claim

By factorizing image-to-3D generation into a multiview-geometry prior and an appearance decoder, the unPIC framework identifies Camera-Relative Object Coordinates (CROCS) as the optimal intermediate representation because they are simpler to predict from images and provide stronger conditioning for consistent texture decoding across views, while also enabling direct 3D point cloud output.

What carries the argument

unPIC, a modular framework that separates the image-to-3D process into a multiview-geometry prior followed by an appearance decoder to enable controlled comparison of intermediate geometry representations.

If this is right

  • CROCS enables fully feedforward 3D point cloud generation without requiring a separate post-hoc reconstruction step.
  • CROCS serves as an effective conditioning signal that improves 360-degree multiview consistency during appearance decoding.
  • The unPIC formulation with CROCS outperforms leading baselines including InstantMesh, Direct3D, CAT3D, Free3D, and EscherNet on real-world captures such as Google Scanned Objects.
  • CROCS is easier for the first-stage geometry prior to predict than depth maps or other pointmap-based alternatives.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If camera-relative coordinates prove consistently superior, future two-stage models may shift away from depth or feature-based intermediates toward explicit relative 3D encodings.
  • The factorization approach could be reused to benchmark geometry representations in adjacent tasks such as 4D reconstruction or video-to-3D lifting.
  • Direct point cloud output from CROCS suggests that explicit coordinate prediction may reduce the need for implicit surface representations in feedforward 3D pipelines.

Load-bearing premise

The modular split between the multiview-geometry prior and appearance decoder in unPIC isolates the effect of the intermediate representation without interference from implementation differences in the prior or decoder.

What would settle it

A controlled test in which the geometry prior and appearance decoder are jointly optimized or replaced with alternative architectures, showing that CROCS no longer produces higher novel-view PSNR or lower geometric error than depth maps on the same real-world datasets.

Figures

Figures reproduced from arXiv: 2412.10273 by Drew A. Hudson, Joao Carreira, Niloy J. Mitra, Rishabh Kabra, Sjoerd van Steenkiste.

Figure 1
Figure 1. Figure 1: Top: A hierarchical approach to novel-view synthesis. A prior models multiview geometric features from a single image and are jointly decoded to the target novel-view images. Our intermediate features, CROCS, establish point-to-point correspondence across views. Bottom: Samples from the prior and decoder. Our model exhibits transferrable shape understanding having never seen a real-world pixel. shape under… view at source ↗
Figure 2
Figure 2. Figure 2: Schrodinger’s cup: two sets of valid novel views, fol- ¨ lowing different trajectories in representation space. The ob￾served view does not reveal whether the cup has a handle or not. 3.2. Target-Camera Poses for Arbitrary Source Poses Our aim is to generate novel views predicting an object’s appearance as it spins around its vertical axis. We want to do this for arbitrary source images. In keeping with pr… view at source ↗
Figure 3
Figure 3. Figure 3: Camera-Relative Object Coordinate Spaces. We show two data-points (Left and Right columns) obtained from one object. Top-left: The wireframe shows the RGB reference cube used to paint the object surface. The large camera denotes the source view, whereas the smaller cameras denote (3 of 7) novel views. Top-right: Say all camera locations are rotated by θ = 120 degrees around the vertical axis (the object st… view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative comparison. The left column shows the source image, while the remaining images in each row are predicted novel views. One-2-3-45 produces multiview inconsistencies. CAT3D can squash the shapes in unseen views. 8 [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 6
Figure 6. Figure 6: Diversity of unPIC outputs from an ambiguous image (top-left). All samples show a 90-degree rotation from the source view, and are shown here without cherry-picking. prior—this would be much likelier if the modules were co￾trained, as the prior would be spurred to output only the optimal intermediate representation for the final RGB loss. See [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: In-the-wild inference on natural images from two datasets: GQA [20] and Objectron [1]. As in [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: CROCS rescaling. We examine the effect of rotating the RGB reference cube used to paint the object’s surface, following the source camera as it moves around a fixed object. θ (the azimuthal angle) denotes the default camera position, while θ ′ denotes a new position. We consider the largest possible objects—a square in 2D or cube in 3D. 14 [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: CROCS when varying the camera elevation angle ϕ. As in [PITH_FULL_IMAGE:figures/full_fig_p015_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Additional qualitative comparison on Objaverse-XL holdouts. 2. The error of the decoder when it is fed ground-truth CROCS rather than the output of the prior. This de￾scribes the difficulty of rendering the object (e.g., pre￾dicting the object’s texture) at novel views when extrap￾olating from a single source image. We report these components along with total hierarchi￾cal error in [PITH_FULL_IMAGE:figur… view at source ↗
Figure 11
Figure 11. Figure 11: Additional qualitative comparison on Objaverse-XL holdouts. D. More Training and Evaluation Details Dataset. We use the following sets of assets from Obja￾verse: 1) the LVIS subset expanded to 83k examples from the same categories. We use object type labels predicted with high confidence from [24] to expand the LVIS sub￾set. 2) The KIUI subset 2 comprising 101k additional as￾2https://github.com/ashawkey/o… view at source ↗
Figure 12
Figure 12. Figure 12: Additional qualitative comparison on Objaverse-XL holdouts. To render NOCS images, we adapted a script from BlenderProc3 [7] which creates a special NOCS material for the surface of a given object. We also use their Blender set￾tings (the CYCLES engine with 1 diffuse bounce, 0 glossy bounces, and 0 ambient occlusion bounces) for rendering NOCS. We export them in the EXR format to ensure linear￾3https://gi… view at source ↗
Figure 13
Figure 13. Figure 13: Additional qualitative comparison on Objaverse-XL holdouts. the timestep and the source image. We apply CFG to the source image. The decoder model is additionally condi￾tioned on the prior’s output CROCS. We only apply CFG to the source image, as the decoder tracks CROCS pointmaps closely even without CFG. The architecture for the prior and decoder has just under 150M parameters. It takes about 6 minutes … view at source ↗
read the original abstract

Image-to-3D models increasingly rely on hierarchical generation to disentangle geometry and texture. However, the design choices underlying these two-stage models--particularly the optimal choice of intermediate geometric representations--remain largely understudied. To investigate this, we introduce unPIC (undo-a-Picture), a modular framework for empirical analysis of image-to-3D pipelines. By factorizing the generation process into a multiview-geometry prior followed by an appearance decoder, unPIC enables a rigorous comparison of intermediate geometry representations. Through this framework, we identify that a specific representation, Camera-Relative Object Coordinates (CROCS), significantly outperforms alternatives such as depth maps, pretrained visual features, and other pointmap-based representations. We demonstrate that CROCS is not only easier for the first-stage geometry prior to predict, but also serves as an effective conditioning signal for ensuring 360-degree consistency during appearance decoding. Another advantage is that CROCS enables fully feedforward, direct 3D point cloud generation without requiring a separate post-hoc reconstruction step. Our unPIC formulation utilizing CROCS achieves superior novel-view quality, geometric accuracy, and multiview consistency; it outperforms leading baselines, including InstantMesh, Direct3D, CAT3D, Free3D, and EscherNet, on datasets of real-world 3D captures like Google Scanned Objects and the Digital Twin Catalog.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces the unPIC framework, which factorizes image-to-3D generation into a multiview-geometry prior followed by an appearance decoder. This modular setup is used to empirically compare intermediate geometry representations, with the central finding that Camera-Relative Object Coordinates (CROCS) outperforms depth maps, pretrained visual features, and other pointmap representations. CROCS is claimed to be easier to predict, provide better conditioning for 360-degree consistency, enable direct point-cloud output, and yield superior novel-view quality, geometric accuracy, and multiview consistency. The CROCS-based unPIC model is reported to outperform baselines including InstantMesh, Direct3D, CAT3D, Free3D, and EscherNet on real-world datasets such as Google Scanned Objects and the Digital Twin Catalog.

Significance. If the comparisons are controlled such that performance differences can be causally attributed to the geometry representation rather than implementation variations, the work would offer a useful empirical guide for choosing intermediate representations in hierarchical 3D generation models.

major comments (2)
  1. [Experimental setup and results sections (around the description of unPIC and the comparison experiments)] The central claim that CROCS superiority is due to its intrinsic properties (easier prediction, better conditioning, direct output) within the unPIC setup requires that the multiview-geometry prior and appearance decoder use identical architectures, losses, training schedules, and hyperparameters across all tested representations, differing only in input/output tensor format and coordinate semantics. The manuscript must provide explicit confirmation and controls demonstrating this (e.g., in the experimental setup or ablations section); without it, performance gaps cannot be isolated from potential per-representation adaptations.
  2. [Results and evaluation sections (tables reporting metrics on GSO and DTC)] Quantitative support for the outperformance claims (novel-view quality, geometric accuracy, multiview consistency) is needed with full details including error bars, dataset splits, implementation specifics, and tables comparing all representations under the same unPIC configuration; the absence of these in the provided abstract raises the need for clear presentation in the main results.
minor comments (1)
  1. [Introduction or method section introducing CROCS] Clarify the exact definition and coordinate semantics of CROCS early in the paper to aid readers in understanding its distinction from other pointmap representations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the concerns about experimental controls and quantitative reporting below. The revisions will make the controlled nature of the comparisons and the supporting details explicit.

read point-by-point responses
  1. Referee: [Experimental setup and results sections (around the description of unPIC and the comparison experiments)] The central claim that CROCS superiority is due to its intrinsic properties (easier prediction, better conditioning, direct output) within the unPIC setup requires that the multiview-geometry prior and appearance decoder use identical architectures, losses, training schedules, and hyperparameters across all tested representations, differing only in input/output tensor format and coordinate semantics. The manuscript must provide explicit confirmation and controls demonstrating this (e.g., in the experimental setup or ablations section); without it, performance gaps cannot be isolated from potential per-representation adaptations.

    Authors: The unPIC framework was explicitly designed so that the multiview-geometry prior and appearance decoder share identical architectures, losses, training schedules, and hyperparameters for every representation tested; the only differences are the tensor shapes and the semantic meaning of the coordinate channels. This design isolates the effect of the representation itself. To satisfy the request for explicit confirmation, we will insert a dedicated paragraph in the Experimental Setup section that states these controls verbatim and notes that no per-representation hyperparameter search or architectural modifications were performed. revision: yes

  2. Referee: [Results and evaluation sections (tables reporting metrics on GSO and DTC)] Quantitative support for the outperformance claims (novel-view quality, geometric accuracy, multiview consistency) is needed with full details including error bars, dataset splits, implementation specifics, and tables comparing all representations under the same unPIC configuration; the absence of these in the provided abstract raises the need for clear presentation in the main results.

    Authors: The full manuscript already contains tables reporting the relevant metrics on GSO and DTC, but we agree that additional transparency is warranted. In the revision we will augment the results section with (i) error bars obtained from three independent training runs, (ii) explicit statements of the train/validation/test splits, (iii) further implementation details (model parameter counts, optimizer settings, training wall-clock time), and (iv) a single consolidated table that directly compares depth maps, visual features, alternative pointmaps, and CROCS under the identical unPIC configuration. revision: yes

Circularity Check

0 steps flagged

No circularity in empirical comparison of geometry representations

full rationale

The paper introduces unPIC as a modular empirical framework to compare intermediate geometry representations (depth maps, visual features, pointmaps, CROCS) via a two-stage pipeline of multiview-geometry prior followed by appearance decoder. All reported claims of CROCS superiority rest on experimental metrics (novel-view quality, geometric accuracy, multiview consistency) across datasets, with no equations, derivations, or parameter-fitting steps that reduce outputs to inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked in the provided text to justify core results. The factorization is presented as enabling controlled comparison rather than as a derived necessity, and performance differences are attributed to observable properties of the representations themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the assumption that the two-stage factorization cleanly separates geometry prediction from appearance; the abstract supplies no further free parameters, axioms, or invented entities beyond the newly proposed CROCS representation itself.

axioms (1)
  • standard math Standard supervised learning assumptions (i.i.d. train/test splits, fixed random seeds for comparison) hold for the reported experiments.
    Implicit background assumption in any empirical ML comparison.
invented entities (1)
  • CROCS no independent evidence
    purpose: Camera-Relative Object Coordinates used as intermediate geometric representation
    Newly introduced representation whose utility is demonstrated empirically in the paper.

pith-pipeline@v0.9.0 · 5794 in / 1369 out tokens · 44792 ms · 2026-05-23T06:50:03.024481+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 2 internal anchors

  1. [1]

    Objectron: A large scale dataset of object-centric videos in the wild with pose an- notations

    Adel Ahmadyan, Liangkai Zhang, Artsiom Ablavatski, Jian- ing Wei, and Matthias Grundmann. Objectron: A large scale dataset of object-centric videos in the wild with pose an- notations. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7822–7831,

  2. [2]

    Re- imagine the negative prompt algorithm: Transform 2d diffusion into 3d, alleviate janus problem and beyond

    Mohammadreza Armandpour, Huangjie Zheng, Ali Sadeghian, Amir Sadeghian, and Mingyuan Zhou. Re- imagine the negative prompt algorithm: Transform 2d diffusion into 3d, alleviate janus problem and beyond. arXiv preprint arXiv:2304.04968, 2023. 2

  3. [3]

    The ycb object and model set: Towards common benchmarks for manipu- lation research

    Berk Calli, Arjun Singh, Aaron Walsman, Siddhartha Srini- vasa, Pieter Abbeel, and Aaron M Dollar. The ycb object and model set: Towards common benchmarks for manipu- lation research. In 2015 International Conference on Ad- vanced Robotics (ICAR), pages 510–517. IEEE, 2015. 4, 15

  4. [4]

    Abo: Dataset and benchmarks for real-world 3d object understand- ing

    Jasmine Collins, Shubham Goel, Kenan Deng, Achlesh- war Luthra, Leon Xu, Erhan Gundogdu, Xi Zhang, Tomas F Yago Vicente, Thomas Dideriksen, Himanshu Arora, Matthieu Guillaumin, and Jitendra Malik. Abo: Dataset and benchmarks for real-world 3d object understand- ing. CVPR, 2022. 7

  5. [5]

    Objaverse: A universe of annotated 3d objects

    Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13142–13153, 2023. 4

  6. [6]

    Objaverse-xl: A universe of 10m+ 3d objects

    Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram V oleti, Samir Yitzhak Gadre, et al. Objaverse-xl: A universe of 10m+ 3d objects. Advances in Neural Informa- tion Processing Systems, 36, 2024. 4, 17

  7. [7]

    Strobl, Matthias Humt, and Rudolph Triebel

    Maximilian Denninger, Dominik Winkelbauer, Martin Sun- dermeyer, Wout Boerdijk, Markus Knauer, Klaus H. Strobl, Matthias Humt, and Rudolph Triebel. Blenderproc2: A 10 procedural pipeline for photorealistic rendering. Journal of Open Source Software, 8(82):4901, 2023. 18

  8. [8]

    McHugh, and Vincent Vanhoucke

    Laura Downs, Anthony Francis, Nate Koenig, Brandon Kin- man, Ryan Hickman, Krista Reymann, Thomas B. McHugh, and Vincent Vanhoucke. Google scanned objects: A high- quality dataset of 3d scanned household items, 2022. 7

  9. [9]

    Niladri Shekhar Dutt, Sanjeev Muralikrishnan, and Niloy J. Mitra. Diffusion 3d features (diff3f): Decorating untextured shapes with distilled semantic features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4494–4504, 2024. 3

  10. [10]

    Prob- ing the 3d awareness of visual foundation models

    Mohamed El Banani, Amit Raj, Kevis-Kokitsi Maninis, Ab- hishek Kar, Yuanzhen Li, Michael Rubinstein, Deqing Sun, Leonidas Guibas, Justin Johnson, and Varun Jampani. Prob- ing the 3d awareness of visual foundation models. In CVPR, pages 21795–21806, 2024. 3

  11. [11]

    Geowiz- ard: Unleashing the diffusion priors for 3d geometry estima- tion from a single image

    Xiao Fu, Wei Yin, Mu Hu, Kaixuan Wang, Yuexin Ma, Ping Tan, Shaojie Shen, Dahua Lin, and Xiaoxiao Long. Geowiz- ard: Unleashing the diffusion priors for 3d geometry estima- tion from a single image. In ECCV, 2024. 3

  12. [12]

    Srinivasan, Jonathan T

    Ruiqi Gao*, Aleksander Holynski*, Philipp Henzler, Arthur Brussee, Ricardo Martin-Brualla, Pratul P. Srinivasan, Jonathan T. Barron, and Ben Poole*. Cat3d: Create any- thing in 3d with multi-view diffusion models. arXiv, 2024. 1, 2, 4, 15

  13. [13]

    Multiple view ge- ometry in computer vision

    Richard Hartley and Andrew Zisserman. Multiple view ge- ometry in computer vision . Cambridge university press,

  14. [14]

    Lotus: Diffusion-based visual foundation model for high-quality dense prediction

    Jing He, Haodong Li, Wei Yin, Yixun Liang, Leheng Li, Kaiqiang Zhou, Hongbo Liu, Bingbing Liu, and Ying- Cong Chen. Lotus: Diffusion-based visual foundation model for high-quality dense prediction. arXiv preprint arXiv:2409.18124, 2024. 3

  15. [15]

    Gans trained by a two time-scale update rule converge to a local nash equilib- rium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilib- rium. Advances in neural information processing systems , 30, 2017. 7

  16. [16]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022. 3

  17. [17]

    Denoising diffu- sion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu- sion probabilistic models. NeurIPS, 33:6840–6851, 2020. 1, 4

  18. [18]

    Lrm: Large reconstruction model for single image to 3d

    Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. Lrm: Large reconstruction model for single image to 3d. ICLR, 2024. 2, 3, 6, 9, 15

  19. [19]

    sim- ple diffusion: End-to-end diffusion for high resolution im- ages

    Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. sim- ple diffusion: End-to-end diffusion for high resolution im- ages. In International Conference on Machine Learning , pages 13213–13232. PMLR, 2023. 5

  20. [20]

    Gqa: A new dataset for real-world visual reasoning and compositional question answering

    Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF con- ference on computer vision and pattern recognition , pages 6700–6709, 2019. 10

  21. [21]

    SODA: Bottleneck diffusion models for representation learning

    Drew A Hudson, Daniel Zoran, Mateusz Malinowski, An- drew K Lampinen, Andrew Jaegle, James L McClelland, Loic Matthey, Felix Hill, and Alexander Lerchner. SODA: Bottleneck diffusion models for representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 23115–23127, 2024. 3

  22. [22]

    Allan Jabri, Sjoerd van Steenkiste, Emiel Hoogeboom, Mehdi S. M. Sajjadi, and Thomas Kipf. Dorsal: Diffusion for object-centric representations of scenes et al. In ICLR,

  23. [23]

    Shap-e: Generating condi- tional 3d implicit functions, 2023

    Heewoo Jun and Alex Nichol. Shap-e: Generating condi- tional 3d implicit functions, 2023. 3

  24. [24]

    Leveraging VLM-based pipelines to annotate 3d objects

    Rishabh Kabra, Loic Matthey, Alexander Lerchner, and Niloy Mitra. Leveraging VLM-based pipelines to annotate 3d objects. In Forty-first International Conference on Ma- chine Learning, 2024. 17

  25. [25]

    Repurpos- ing diffusion-based image generators for monocular depth estimation

    Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Met- zger, Rodrigo Caye Daudt, and Konrad Schindler. Repurpos- ing diffusion-based image generators for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 3

  26. [26]

    3d gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimk ¨uhler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1,

  27. [27]

    Xin Kong, Shikun Liu, Xiaoyang Lyu, Marwan Taher, Xiao- juan Qi, and Andrew J. Davison. Eschernet: A generative model for scalable view synthesis. In CVPR, pages 9503– 9513, 2024. 1, 2

  28. [28]

    Omninocs: A unified nocs dataset and model for 3d lifting of 2d objects

    Akshay Krishnan, Abhijit Kundu, Kevis-Kokitsi Maninis, James Hays, and Matthew Brown. Omninocs: A unified nocs dataset and model for 3d lifting of 2d objects. In European Conference on Computer Vision , pages 127–145. Springer,

  29. [29]

    Advances in 3d generation: A survey

    Xiaoyu Li, Qi Zhang, Di Kang, Weihao Cheng, Yiming Gao, Jingbo Zhang, Zhihao Liang, Jing Liao, Yan-Pei Cao, and Ying Shan. Advances in 3d generation: A survey. arXiv preprint arXiv:2401.17807, 2024. 1

  30. [30]

    One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimiza- tion

    Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Mukund Varma T, Zexiang Xu, and Hao Su. One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimiza- tion. Advances in Neural Information Processing Systems , 36, 2024. 15

  31. [31]

    Zero-1-to-3: Zero-shot one image to 3d object

    Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tok- makov, Sergey Zakharov, and Carl V ondrick. Zero-1-to-3: Zero-shot one image to 3d object. In ICCV, pages 9298– 9309, 2023. 1, 2, 6

  32. [32]

    Nerf: Representing scenes as neural radiance fields for view syn- thesis

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view syn- thesis. Communications of the ACM , 65(1):99–106, 2021. 2

  33. [33]

    Improved denoising diffusion probabilistic models

    Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International conference on machine learning, pages 8162–8171. PMLR,

  34. [34]

    Mofa-video: Controllable image animation via generative motion field adaptions in frozen image-to-video diffusion model

    Muyao Niu, Xiaodong Cun, Xintao Wang, Yong Zhang, Ying Shan, and Yinqiang Zheng. Mofa-video: Controllable image animation via generative motion field adaptions in frozen image-to-video diffusion model. ECCV, 2024. 3 11

  35. [35]

    Maxime Oquab, Timoth ´ee Darcet, Th´eo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Je- gou, Julien Mairal, Patr...

  36. [36]

    Aria digital twin: A new benchmark dataset for egocentric 3d machine percep- tion, 2023

    Xiaqing Pan, Nicholas Charron, Yongqian Yang, Scott Pe- ters, Thomas Whelan, Chen Kong, Omkar Parkhi, Richard Newcombe, and Carl Yuheng Ren. Aria digital twin: A new benchmark dataset for egocentric 3d machine percep- tion, 2023. 7

  37. [37]

    Barron, and Ben Milden- hall

    Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Milden- hall. Dreamfusion: Text-to-3d using 2d diffusion. ICLR,

  38. [38]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. pages 8748–

  39. [39]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image gener- ation with clip latents. arXiv preprint arXiv:2204.06125, 1 (2):3, 2022. 1

  40. [40]

    High-resolution image syn- thesis with latent diffusion models, 2021

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj¨orn Ommer. High-resolution image syn- thesis with latent diffusion models, 2021. 6

  41. [41]

    Zeronvs: Zero- shot 360-degree view synthesis from a single image

    Kyle Sargent, Zizhang Li, Tanmay Shah, Charles Herrmann, Hong-Xing Yu, Yunzhi Zhang, Eric Ryan Chan, Dmitry La- gun, Li Fei-Fei, Deqing Sun, and Jiajun Wu. Zeronvs: Zero- shot 360-degree view synthesis from a single image. In CVPR, pages 9420–9429, 2024. 2

  42. [42]

    Mental rotation of three-dimensional objects

    Roger N Shepard and Jacqueline Metzler. Mental rotation of three-dimensional objects. Science, 171(3972):701–703,

  43. [43]

    Motion-i2v: Consistent and controllable image-to-video generation with explicit motion modeling

    Xiaoyu Shi, Zhaoyang Huang, Fu-Yun Wang, Weikang Bian, Dasong Li, Yi Zhang, Manyuan Zhang, Ka Chun Cheung, Simon See, Hongwei Qin, et al. Motion-i2v: Consistent and controllable image-to-video generation with explicit motion modeling. In ACM SIGGRAPH 2024 Conference Papers , pages 1–11, 2024. 3

  44. [44]

    Mental rotations, a group test of three-dimensional spatial visualization

    Steven G Vandenberg and Allan R Kuse. Mental rotations, a group test of three-dimensional spatial visualization. Per- ceptual and motor skills, 47(2):599–604, 1978. 15

  45. [45]

    Normalized object coordinate space for category-level 6d object pose and size estimation

    He Wang, Srinath Sridhar, Jingwei Huang, Julien Valentin, Shuran Song, and Leonidas J Guibas. Normalized object coordinate space for category-level 6d object pose and size estimation. In CVPR, pages 2642–2651, 2019. 1, 4

  46. [46]

    Yeh, and Greg Shakhnarovich

    Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A. Yeh, and Greg Shakhnarovich. Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In CVPR, pages 12619–12629, 2023. 2

  47. [47]

    Dust3r: Geometric 3d vi- sion made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vi- sion made easy. In CVPR, pages 20697–20709, 2024. 3, 13

  48. [48]

    Image quality assessment: from error visibility to structural similarity

    Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Si- moncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004. 7

  49. [49]

    Novel view synthesis with diffusion models

    Daniel Watson, William Chan, Ricardo Martin Bru- alla, Jonathan Ho, Andrea Tagliasacchi, and Mohammad Norouzi. Novel view synthesis with diffusion models. In The Eleventh International Conference on Learning Repre- sentations, 2023. 3

  50. [50]

    Srinivasan, Dor Verbin, Jonathan T

    Rundi Wu, Ben Mildenhall, Philipp Henzler, Keunhong Park, Ruiqi Gao, Daniel Watson, Pratul P. Srinivasan, Dor Verbin, Jonathan T. Barron, Ben Poole, and Aleksander Ho?y?ski. Reconfusion: 3d reconstruction with diffusion priors. In CVPR, pages 21551–21561, 2024. 2

  51. [51]

    Neural assets: 3d-aware multi-object scene synthesis with image diffusion models

    Ziyi Wu, Yulia Rubanova, Rishabh Kabra, Drew A Hud- son, Igor Gilitschenski, Yusuf Aytar, Sjoerd van Steenkiste, Kelsey R Allen, and Thomas Kipf. Neural assets: 3d-aware multi-object scene synthesis with image diffusion models. arXiv preprint arXiv:2406.09292, 2024. 3

  52. [52]

    Sparp: Fast 3d object reconstruction and pose estimation from sparse views

    Chao Xu, Ang Li, Linghao Chen, Yulin Liu, Ruoxi Shi, Hao Su, and Minghua Liu. Sparp: Fast 3d object reconstruction and pose estimation from sparse views. ECCV, 2024. 3, 4

  53. [53]

    Dmv3d: Denoising multi-view diffusion using 3d large reconstruction model

    Yinghao Xu, Hao Tan, Fujun Luan, Sai Bi, Peng Wang, Ji- ahao Li, Zifan Shi, Kalyan Sunkavalli, Gordon Wetzstein, Zexiang Xu, et al. Dmv3d: Denoising multi-view diffusion using 3d large reconstruction model. ICLR, 2024. 2

  54. [54]

    pixelNeRF: Neural radiance fields from one or few images

    Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelNeRF: Neural radiance fields from one or few images. In CVPR, 2021. 2, 3

  55. [55]

    Adding conditional control to text-to-image diffusion models, 2023

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models, 2023. 3

  56. [56]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018. 7

  57. [57]

    be- hind

    Chuanxia Zheng and Andrea Vedaldi. Free3d: Consistent novel view synthesis without 3d representation. In Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9720–9731, 2024. 2 12 A. Layout The appendix is organized as follows: we describe CROCS in detail in Appendix B, discussing in particular the geomet- rica...

  58. [58]

    This describes the geomet- rical inaccuracy (when the intermediate representation is CROCS) in predicting the 3D shape and pose of a given object

    The error of the prior alone. This describes the geomet- rical inaccuracy (when the intermediate representation is CROCS) in predicting the 3D shape and pose of a given object. 15 (a) Ground truth (b) unPIC via CROCS (Ours) (c) One-2-3-45 (XL) (d) CAT3D Figure 10. Additional qualitative comparison on Objaverse-XL holdouts

  59. [59]

    This de- scribes the difficulty of rendering the object (e.g., pre- dicting the object’s texture) at novel views when extrap- olating from a single source image

    The error of the decoder when it is fed ground-truth CROCS rather than the output of the prior. This de- scribes the difficulty of rendering the object (e.g., pre- dicting the object’s texture) at novel views when extrap- olating from a single source image. We report these components along with total hierarchi- cal error in Table 5. We find that the prior...