How to Spin an Object: First, Get the Shape Right

Drew A. Hudson; Joao Carreira; Niloy J. Mitra; Rishabh Kabra; Sjoerd van Steenkiste

arxiv: 2412.10273 · v3 · submitted 2024-12-13 · 💻 cs.CV · cs.LG

Rishabh Kabra, Drew A. Hudson, Sjoerd van Steenkiste, Joao Carreira, Niloy J. Mitra

Pith reviewed 2026-05-23 06:50 UTC · model grok-4.3

keywords image-to-3D · geometry representation · novel view synthesis · multiview consistency · point clouds · CROCS · unPIC · two-stage generation

The pith

Camera-relative object coordinates outperform depth maps and pointmaps as the intermediate geometry for two-stage image-to-3D generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines design choices in hierarchical image-to-3D models that first predict geometry then decode appearance. It introduces the unPIC framework, which factors the pipeline into a fixed multiview-geometry prior followed by an appearance decoder, to isolate and compare different intermediate geometric representations. Experiments demonstrate that Camera-Relative Object Coordinates, or CROCS, are both easier for the geometry prior to predict and more effective at enforcing 360-degree consistency in the decoder. This representation also supports direct feedforward generation of 3D point clouds without extra reconstruction steps, leading to higher novel-view quality, geometric accuracy, and multiview consistency than depth maps, pretrained features, or other pointmap variants.

Core claim

By factorizing image-to-3D generation into a multiview-geometry prior and an appearance decoder, the unPIC framework identifies Camera-Relative Object Coordinates (CROCS) as the optimal intermediate representation because they are simpler to predict from images and provide stronger conditioning for consistent texture decoding across views, while also enabling direct 3D point cloud output.

What carries the argument

unPIC, a modular framework that separates the image-to-3D process into a multiview-geometry prior followed by an appearance decoder to enable controlled comparison of intermediate geometry representations.
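As a rough illustration of the factorization described above, the sketch below wires a geometry prior and an appearance decoder into one pipeline object so that only the intermediate representation changes between variants. Everything here is hypothetical: `TwoStagePipeline`, `make_toy_pipeline`, the representation names, and the default of 7 views are illustrative stand-ins, and the toy callables are placeholders rather than the paper's models.

```python
from dataclasses import dataclass
from typing import Callable, List

Image = List[List[float]]  # toy stand-in for an image tensor


@dataclass(frozen=True)
class TwoStagePipeline:
    """Hypothetical container: a geometry prior followed by an appearance decoder."""
    representation: str                                   # e.g. "crocs" or "depth"
    prior: Callable[[Image], List[Image]]                 # source image -> per-view geometry maps
    decoder: Callable[[Image, List[Image]], List[Image]]  # (source, geometry) -> novel views

    def generate(self, source: Image) -> List[Image]:
        geometry = self.prior(source)          # stage 1: predict intermediate geometry
        return self.decoder(source, geometry)  # stage 2: decode appearance conditioned on it


def make_toy_pipeline(representation: str, num_views: int = 7) -> TwoStagePipeline:
    # Toy stand-ins, not trained models: the prior copies the source once per
    # view, and the decoder returns the geometry maps unchanged.
    def prior(source: Image) -> List[Image]:
        return [[row[:] for row in source] for _ in range(num_views)]

    def decoder(source: Image, geometry: List[Image]) -> List[Image]:
        return [[row[:] for row in view] for view in geometry]

    return TwoStagePipeline(representation, prior, decoder)


# Same construction for every variant; only the representation label differs.
pipelines = {name: make_toy_pipeline(name) for name in ("crocs", "depth", "pointmap")}
views = pipelines["crocs"].generate([[0.0, 1.0], [1.0, 0.0]])
```

Keeping the representation as the only argument that varies mirrors the controlled comparison the framework is meant to support.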

If this is right

CROCS enables fully feedforward 3D point cloud generation without requiring a separate post-hoc reconstruction step.
CROCS serves as an effective conditioning signal that improves 360-degree multiview consistency during appearance decoding.
The unPIC formulation with CROCS outperforms leading baselines including InstantMesh, Direct3D, CAT3D, Free3D, and EscherNet on real-world captures such as Google Scanned Objects.
CROCS is easier for the first-stage geometry prior to predict than depth maps or other pointmap-based alternatives.
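The first point can be made concrete with a small sketch. If each foreground pixel of a CROCS map stores a 3D coordinate, a point cloud is just the union of those pixels across views, with no separate reconstruction step. The value range [0, 1], the recentring to [-0.5, 0.5], the explicit foreground masks, and the name `crocs_to_points` are assumptions made for illustration; the paper's exact encoding is not given on this page.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]
CoordMap = List[List[Point]]   # H x W grid of (x, y, z) values, assumed in [0, 1]
Mask = List[List[bool]]        # H x W grid, True where the pixel lies on the object


def crocs_to_points(views: List[CoordMap], masks: List[Mask]) -> List[Point]:
    """Gather foreground pixels from every view into one list of 3D points."""
    points: List[Point] = []
    for view, mask in zip(views, masks):
        for row, mask_row in zip(view, mask):
            for (x, y, z), is_foreground in zip(row, mask_row):
                if is_foreground:
                    # Recentre so the reference cube is centred on the origin.
                    points.append((x - 0.5, y - 0.5, z - 0.5))
    return points
```

Because all views share one coordinate frame, points from different views can be concatenated directly.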

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If camera-relative coordinates prove consistently superior, future two-stage models may shift away from depth or feature-based intermediates toward explicit relative 3D encodings.
The factorization approach could be reused to benchmark geometry representations in adjacent tasks such as 4D reconstruction or video-to-3D lifting.
Direct point cloud output from CROCS suggests that explicit coordinate prediction may reduce the need for implicit surface representations in feedforward 3D pipelines.

Load-bearing premise

The modular split between the multiview-geometry prior and appearance decoder in unPIC isolates the effect of the intermediate representation without interference from implementation differences in the prior or decoder.

What would settle it

A controlled test in which the geometry prior and appearance decoder are jointly optimized or replaced with alternative architectures, showing that CROCS no longer produces higher novel-view PSNR or lower geometric error than depth maps on the same real-world datasets.
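For reference, PSNR, the image metric named in this test, has a standard definition: 10·log10(MAX² / MSE). The sketch below is a minimal version, assuming images are nested lists of floats in [0, 1]; the helper name `psnr` and this data layout are illustrative and are not the paper's evaluation code.

```python
import math
from typing import List


def psnr(pred: List[List[float]], target: List[List[float]], max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    flat_pred = [v for row in pred for v in row]
    flat_target = [v for row in target for v in row]
    if len(flat_pred) != len(flat_target) or not flat_pred:
        raise ValueError("images must be non-empty and have the same size")
    mse = sum((p - t) ** 2 for p, t in zip(flat_pred, flat_target)) / len(flat_pred)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher is better, so the test above would compare this value across representations on the same held-out views.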

Figures

Figures reproduced from arXiv: 2412.10273 by Drew A. Hudson, Joao Carreira, Niloy J. Mitra, Rishabh Kabra, Sjoerd van Steenkiste.

**Figure 1.** Top: A hierarchical approach to novel-view synthesis. A prior models multiview geometric features from a single image, which are jointly decoded to the target novel-view images. Our intermediate features, CROCS, establish point-to-point correspondence across views. Bottom: Samples from the prior and decoder. Our model exhibits transferable shape understanding having never seen a real-world pixel.

**Figure 2.** Schrödinger's cup: two sets of valid novel views, following different trajectories in representation space. The observed view does not reveal whether the cup has a handle or not.

Adjacent paper text (Section 3.2, Target-Camera Poses for Arbitrary Source Poses): Our aim is to generate novel views predicting an object's appearance as it spins around its vertical axis. We want to do this for arbitrary source images. In keeping with pr…

**Figure 3.** Camera-Relative Object Coordinate Spaces. We show two data-points (Left and Right columns) obtained from one object. Top-left: The wireframe shows the RGB reference cube used to paint the object surface. The large camera denotes the source view, whereas the smaller cameras denote (3 of 7) novel views. Top-right: Say all camera locations are rotated by θ = 120 degrees around the vertical axis (the object st…

**Figure 4.** Qualitative comparison. The left column shows the source image, while the remaining images in each row are predicted novel views. One-2-3-45 produces multiview inconsistencies. CAT3D can squash the shapes in unseen views.

**Figure 6.** Diversity of unPIC outputs from an ambiguous image (top-left). All samples show a 90-degree rotation from the source view, and are shown here without cherry-picking.

Adjacent paper text: … prior: this would be much likelier if the modules were cotrained, as the prior would be spurred to output only the optimal intermediate representation for the final RGB loss. See …

**Figure 7.** In-the-wild inference on natural images from two datasets: GQA [20] and Objectron [1]. As in …

**Figure 8.** CROCS rescaling. We examine the effect of rotating the RGB reference cube used to paint the object's surface, following the source camera as it moves around a fixed object. θ (the azimuthal angle) denotes the default camera position, while θ′ denotes a new position. We consider the largest possible objects: a square in 2D or a cube in 3D.

**Figure 9.** CROCS when varying the camera elevation angle ϕ. As in …

**Figure 10.** Additional qualitative comparison on Objaverse-XL holdouts.

Adjacent paper text: 2. The error of the decoder when it is fed ground-truth CROCS rather than the output of the prior. This describes the difficulty of rendering the object (e.g., predicting the object's texture) at novel views when extrapolating from a single source image. We report these components along with total hierarchical error in …

**Figure 11.** Additional qualitative comparison on Objaverse-XL holdouts.

Adjacent paper text (Appendix D, More Training and Evaluation Details): Dataset. We use the following sets of assets from Objaverse: 1) the LVIS subset expanded to 83k examples from the same categories. We use object type labels predicted with high confidence from [24] to expand the LVIS subset. 2) The KIUI subset (footnote 2: https://github.com/ashawkey/o…) comprising 101k additional as…

**Figure 12.** Additional qualitative comparison on Objaverse-XL holdouts.

Adjacent paper text: To render NOCS images, we adapted a script from BlenderProc [7] (footnote 3: https://gi…) which creates a special NOCS material for the surface of a given object. We also use their Blender settings (the CYCLES engine with 1 diffuse bounce, 0 glossy bounces, and 0 ambient occlusion bounces) for rendering NOCS. We export them in the EXR format to ensure linear…

**Figure 13.** Additional qualitative comparison on Objaverse-XL holdouts.

Adjacent paper text: … the timestep and the source image. We apply CFG to the source image. The decoder model is additionally conditioned on the prior's output CROCS. We only apply CFG to the source image, as the decoder tracks CROCS pointmaps closely even without CFG. The architecture for the prior and decoder has just under 150M parameters. It takes about 6 minutes …
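The captions for Figures 3 and 8 describe a reference cube that turns with the source camera about the vertical axis, and a rescaling motivated by the largest possible object. One geometric fact behind this is that a unit square rotated by an angle α needs an axis-aligned box of side |cos α| + |sin α| (at most √2) to contain it. The sketch below shows that calculation and one possible way to use it. The y-up axis convention, the sign of the rotation, the per-angle rescaling, and all function names are assumptions for illustration; this page does not state the paper's exact formula.

```python
import math
from typing import Tuple

Point = Tuple[float, float, float]


def rotate_about_vertical(point: Point, angle_rad: float) -> Point:
    """Rotate (x, y, z) about the y axis, assumed here to be the vertical axis."""
    x, y, z = point
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return (c * x + s * z, y, -s * x + c * z)


def horizontal_extent(angle_rad: float) -> float:
    """Side of the axis-aligned square enclosing a unit square rotated by angle_rad."""
    return abs(math.cos(angle_rad)) + abs(math.sin(angle_rad))


def camera_relative_coords(point: Point, source_azimuth_rad: float) -> Point:
    """Express an object-frame point in a frame that turns with the source camera,
    shrinking the horizontal axes so a rotated unit cube still fits in [-0.5, 0.5]."""
    x, y, z = rotate_about_vertical(point, -source_azimuth_rad)
    scale = horizontal_extent(source_azimuth_rad)
    return (x / scale, y, z / scale)
```

For example, a cube corner at (0.5, 0.5, 0.5) seen from a source camera at 45 degrees lands on the face of the rescaled cube rather than outside it.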

Original abstract

Image-to-3D models increasingly rely on hierarchical generation to disentangle geometry and texture. However, the design choices underlying these two-stage models--particularly the optimal choice of intermediate geometric representations--remain largely understudied. To investigate this, we introduce unPIC (undo-a-Picture), a modular framework for empirical analysis of image-to-3D pipelines. By factorizing the generation process into a multiview-geometry prior followed by an appearance decoder, unPIC enables a rigorous comparison of intermediate geometry representations. Through this framework, we identify that a specific representation, Camera-Relative Object Coordinates (CROCS), significantly outperforms alternatives such as depth maps, pretrained visual features, and other pointmap-based representations. We demonstrate that CROCS is not only easier for the first-stage geometry prior to predict, but also serves as an effective conditioning signal for ensuring 360-degree consistency during appearance decoding. Another advantage is that CROCS enables fully feedforward, direct 3D point cloud generation without requiring a separate post-hoc reconstruction step. Our unPIC formulation utilizing CROCS achieves superior novel-view quality, geometric accuracy, and multiview consistency; it outperforms leading baselines, including InstantMesh, Direct3D, CAT3D, Free3D, and EscherNet, on datasets of real-world 3D captures like Google Scanned Objects and the Digital Twin Catalog.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

unPIC is a useful modular framework for testing geometry representations, and CROCS appears to outperform the alternatives, though the experimental controls deserve close checking.

The paper's key point is that their unPIC framework makes it possible to compare intermediate geometry representations in image-to-3D generation, and they conclude that camera-relative object coordinates (CROCS) work better than depth maps or other options. They set up a two-stage model with a multiview geometry prior and an appearance decoder. This separation lets them swap representations and see the impact on prediction difficulty, consistency, and final quality. CROCS has the advantage of being directly usable for point cloud output and providing good conditioning for the decoder. They show gains over baselines like InstantMesh on Google Scanned Objects and similar data.

The framework itself is the clearest new piece. It addresses a gap where people pick representations without much comparison. The practical benefit of avoiding extra reconstruction steps is also noted.

The potential issue is in how controlled the experiments are. The claim of rigorous comparison rests on keeping the prior and decoder the same across tests. If small changes were made to fit each representation, the differences could come from those tweaks rather than the representation. The abstract gives no quantitative results or setup details, which makes it difficult to assess how strong the evidence is.

This paper is aimed at researchers in computer vision working on 3D reconstruction from images. Anyone tuning a hierarchical image-to-3D system might find the comparison useful. It deserves peer review. The modular analysis idea is worth discussing even if the CROCS result needs more backing on the controls.

Referee Report

2 major / 1 minor

Summary. The paper introduces the unPIC framework, which factorizes image-to-3D generation into a multiview-geometry prior followed by an appearance decoder. This modular setup is used to empirically compare intermediate geometry representations, with the central finding that Camera-Relative Object Coordinates (CROCS) outperforms depth maps, pretrained visual features, and other pointmap representations. CROCS is claimed to be easier to predict, provide better conditioning for 360-degree consistency, enable direct point-cloud output, and yield superior novel-view quality, geometric accuracy, and multiview consistency. The CROCS-based unPIC model is reported to outperform baselines including InstantMesh, Direct3D, CAT3D, Free3D, and EscherNet on real-world datasets such as Google Scanned Objects and the Digital Twin Catalog.

Significance. If the comparisons are controlled such that performance differences can be causally attributed to the geometry representation rather than implementation variations, the work would offer a useful empirical guide for choosing intermediate representations in hierarchical 3D generation models.

major comments (2)

[Experimental setup and results sections (around the description of unPIC and the comparison experiments)] The central claim that CROCS superiority is due to its intrinsic properties (easier prediction, better conditioning, direct output) within the unPIC setup requires that the multiview-geometry prior and appearance decoder use identical architectures, losses, training schedules, and hyperparameters across all tested representations, differing only in input/output tensor format and coordinate semantics. The manuscript must provide explicit confirmation and controls demonstrating this (e.g., in the experimental setup or ablations section); without it, performance gaps cannot be isolated from potential per-representation adaptations.
[Results and evaluation sections (tables reporting metrics on GSO and DTC)] Quantitative support for the outperformance claims (novel-view quality, geometric accuracy, multiview consistency) is needed with full details including error bars, dataset splits, implementation specifics, and tables comparing all representations under the same unPIC configuration; the absence of these in the provided abstract raises the need for clear presentation in the main results.
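The control requested in the first major comment can be checked mechanically: every pair of run configurations should differ in the representation field and nothing else. The sketch below is one way to express that check. `RunConfig`, its fields, and the placeholder values in the example are hypothetical and do not come from the paper.

```python
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical training configuration for one representation variant."""
    representation: str
    architecture: str
    loss: str
    learning_rate: float
    train_steps: int


def differing_fields(a: RunConfig, b: RunConfig) -> List[str]:
    """Names of the fields whose values differ between two configurations."""
    da, db = asdict(a), asdict(b)
    return sorted(key for key in da if da[key] != db[key])


def is_controlled(configs: List[RunConfig]) -> bool:
    """True if every config differs from the first only in `representation`."""
    base = configs[0]
    return all(set(differing_fields(base, c)) <= {"representation"} for c in configs[1:])


# Placeholder values, for illustration only.
base = RunConfig("crocs", "shared-arch", "shared-loss", 1e-4, 1000)
variant = RunConfig("depth", "shared-arch", "shared-loss", 1e-4, 1000)
```

A check like this makes any per-representation tweak visible as an extra field name in the diff.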

minor comments (1)

[Introduction or method section introducing CROCS] Clarify the exact definition and coordinate semantics of CROCS early in the paper to aid readers in understanding its distinction from other pointmap representations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the concerns about experimental controls and quantitative reporting below. The revisions will make the controlled nature of the comparisons and the supporting details explicit.

Referee: [Experimental setup and results sections (around the description of unPIC and the comparison experiments)] The central claim that CROCS superiority is due to its intrinsic properties (easier prediction, better conditioning, direct output) within the unPIC setup requires that the multiview-geometry prior and appearance decoder use identical architectures, losses, training schedules, and hyperparameters across all tested representations, differing only in input/output tensor format and coordinate semantics. The manuscript must provide explicit confirmation and controls demonstrating this (e.g., in the experimental setup or ablations section); without it, performance gaps cannot be isolated from potential per-representation adaptations.

Authors: The unPIC framework was explicitly designed so that the multiview-geometry prior and appearance decoder share identical architectures, losses, training schedules, and hyperparameters for every representation tested; the only differences are the tensor shapes and the semantic meaning of the coordinate channels. This design isolates the effect of the representation itself. To satisfy the request for explicit confirmation, we will insert a dedicated paragraph in the Experimental Setup section that states these controls verbatim and notes that no per-representation hyperparameter search or architectural modifications were performed.
revision: yes

Referee: [Results and evaluation sections (tables reporting metrics on GSO and DTC)] Quantitative support for the outperformance claims (novel-view quality, geometric accuracy, multiview consistency) is needed with full details including error bars, dataset splits, implementation specifics, and tables comparing all representations under the same unPIC configuration; the absence of these in the provided abstract raises the need for clear presentation in the main results.

Authors: The full manuscript already contains tables reporting the relevant metrics on GSO and DTC, but we agree that additional transparency is warranted. In the revision we will augment the results section with (i) error bars obtained from three independent training runs, (ii) explicit statements of the train/validation/test splits, (iii) further implementation details (model parameter counts, optimizer settings, training wall-clock time), and (iv) a single consolidated table that directly compares depth maps, visual features, alternative pointmaps, and CROCS under the identical unPIC configuration.
revision: yes
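Error bars over a handful of independent runs, as proposed in item (i), are commonly reported as the mean plus or minus the sample standard deviation. A minimal sketch, assuming one scalar metric per run; the function name `summarize_runs` is illustrative and no values from the paper are used.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize_runs(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of a metric across training runs."""
    if len(scores) < 2:
        raise ValueError("at least two runs are needed for a standard deviation")
    return mean(scores), stdev(scores)
```

With only three runs the standard deviation is itself a noisy estimate, so it is best read as a rough indication of run-to-run spread.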

Circularity Check

0 steps flagged

No circularity in empirical comparison of geometry representations

The paper introduces unPIC as a modular empirical framework to compare intermediate geometry representations (depth maps, visual features, pointmaps, CROCS) via a two-stage pipeline of multiview-geometry prior followed by appearance decoder. All reported claims of CROCS superiority rest on experimental metrics (novel-view quality, geometric accuracy, multiview consistency) across datasets, with no equations, derivations, or parameter-fitting steps that reduce outputs to inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked in the provided text to justify core results. The factorization is presented as enabling controlled comparison rather than as a derived necessity, and performance differences are attributed to observable properties of the representations themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that the two-stage factorization cleanly separates geometry prediction from appearance; the abstract supplies no further free parameters, axioms, or invented entities beyond the newly proposed CROCS representation itself.

axioms (1)

standard math: Standard supervised learning assumptions (i.i.d. train/test splits, fixed random seeds for comparison) hold for the reported experiments.
Implicit background assumption in any empirical ML comparison.

invented entities (1)

CROCS (no independent evidence)
purpose: Camera-Relative Object Coordinates used as intermediate geometric representation
Newly introduced representation whose utility is demonstrated empirically in the paper.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 2 internal anchors

[1] Adel Ahmadyan, Liangkai Zhang, Artsiom Ablavatski, Jianing Wei, and Matthias Grundmann. Objectron: A large scale dataset of object-centric videos in the wild with pose annotations. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7822–7831,
[2] Mohammadreza Armandpour, Huangjie Zheng, Ali Sadeghian, Amir Sadeghian, and Mingyuan Zhou. Re-imagine the negative prompt algorithm: Transform 2d diffusion into 3d, alleviate janus problem and beyond. arXiv preprint arXiv:2304.04968, 2023.
[3] Berk Calli, Arjun Singh, Aaron Walsman, Siddhartha Srinivasa, Pieter Abbeel, and Aaron M Dollar. The ycb object and model set: Towards common benchmarks for manipulation research. In 2015 International Conference on Advanced Robotics (ICAR), pages 510–517. IEEE, 2015.
[4] Jasmine Collins, Shubham Goel, Kenan Deng, Achleshwar Luthra, Leon Xu, Erhan Gundogdu, Xi Zhang, Tomas F Yago Vicente, Thomas Dideriksen, Himanshu Arora, Matthieu Guillaumin, and Jitendra Malik. Abo: Dataset and benchmarks for real-world 3d object understanding. CVPR, 2022.
[5] Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13142–13153, 2023.
[6] Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, et al. Objaverse-xl: A universe of 10m+ 3d objects. Advances in Neural Information Processing Systems, 36, 2024.
[7] Maximilian Denninger, Dominik Winkelbauer, Martin Sundermeyer, Wout Boerdijk, Markus Knauer, Klaus H. Strobl, Matthias Humt, and Rudolph Triebel. Blenderproc2: A procedural pipeline for photorealistic rendering. Journal of Open Source Software, 8(82):4901, 2023.
[8] Laura Downs, Anthony Francis, Nate Koenig, Brandon Kinman, Ryan Hickman, Krista Reymann, Thomas B. McHugh, and Vincent Vanhoucke. Google scanned objects: A high-quality dataset of 3d scanned household items, 2022.
[9] Niladri Shekhar Dutt, Sanjeev Muralikrishnan, and Niloy J. Mitra. Diffusion 3d features (diff3f): Decorating untextured shapes with distilled semantic features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4494–4504, 2024.
[10] Mohamed El Banani, Amit Raj, Kevis-Kokitsi Maninis, Abhishek Kar, Yuanzhen Li, Michael Rubinstein, Deqing Sun, Leonidas Guibas, Justin Johnson, and Varun Jampani. Probing the 3d awareness of visual foundation models. In CVPR, pages 21795–21806, 2024.
[11] Xiao Fu, Wei Yin, Mu Hu, Kaixuan Wang, Yuexin Ma, Ping Tan, Shaojie Shen, Dahua Lin, and Xiaoxiao Long. Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image. In ECCV, 2024.
[12] Ruiqi Gao*, Aleksander Holynski*, Philipp Henzler, Arthur Brussee, Ricardo Martin-Brualla, Pratul P. Srinivasan, Jonathan T. Barron, and Ben Poole*. Cat3d: Create anything in 3d with multi-view diffusion models. arXiv, 2024.
[13] Richard Hartley and Andrew Zisserman. Multiple view geometry in computer vision. Cambridge university press,
[14] Jing He, Haodong Li, Wei Yin, Yixun Liang, Leheng Li, Kaiqiang Zhou, Hongbo Liu, Bingbing Liu, and Ying-Cong Chen. Lotus: Diffusion-based visual foundation model for high-quality dense prediction. arXiv preprint arXiv:2409.18124, 2024.
[15] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
[16] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
[17] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, 33:6840–6851, 2020.
[18] Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. Lrm: Large reconstruction model for single image to 3d. ICLR, 2024.
[19] Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for high resolution images. In International Conference on Machine Learning, pages 13213–13232. PMLR, 2023.
[20] Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019.
[21] Drew A Hudson, Daniel Zoran, Mateusz Malinowski, Andrew K Lampinen, Andrew Jaegle, James L McClelland, Loic Matthey, Felix Hill, and Alexander Lerchner. SODA: Bottleneck diffusion models for representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23115–23127, 2024.
[22] Allan Jabri, Sjoerd van Steenkiste, Emiel Hoogeboom, Mehdi S. M. Sajjadi, and Thomas Kipf. Dorsal: Diffusion for object-centric representations of scenes et al. In ICLR,
[23] Heewoo Jun and Alex Nichol. Shap-e: Generating conditional 3d implicit functions, 2023.
[24] Rishabh Kabra, Loic Matthey, Alexander Lerchner, and Niloy Mitra. Leveraging VLM-based pipelines to annotate 3d objects. In Forty-first International Conference on Machine Learning, 2024.
[25] Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Konrad Schindler. Repurposing diffusion-based image generators for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[26] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1,
[27] Xin Kong, Shikun Liu, Xiaoyang Lyu, Marwan Taher, Xiaojuan Qi, and Andrew J. Davison. Eschernet: A generative model for scalable view synthesis. In CVPR, pages 9503–9513, 2024.
[28] Akshay Krishnan, Abhijit Kundu, Kevis-Kokitsi Maninis, James Hays, and Matthew Brown. Omninocs: A unified nocs dataset and model for 3d lifting of 2d objects. In European Conference on Computer Vision, pages 127–145. Springer,
[29] Xiaoyu Li, Qi Zhang, Di Kang, Weihao Cheng, Yiming Gao, Jingbo Zhang, Zhihao Liang, Jing Liao, Yan-Pei Cao, and Ying Shan. Advances in 3d generation: A survey. arXiv preprint arXiv:2401.17807, 2024.
[30] Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Mukund Varma T, Zexiang Xu, and Hao Su. One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimization. Advances in Neural Information Processing Systems, 36, 2024.
[31] Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object. In ICCV, pages 9298–9309, 2023.
[32] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
[33] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International conference on machine learning, pages 8162–8171. PMLR,
[34] Muyao Niu, Xiaodong Cun, Xintao Wang, Yong Zhang, Ying Shan, and Yinqiang Zheng. Mofa-video: Controllable image animation via generative motion field adaptions in frozen image-to-video diffusion model. ECCV, 2024.
[35] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patr...
[36] Xiaqing Pan, Nicholas Charron, Yongqian Yang, Scott Peters, Thomas Whelan, Chen Kong, Omkar Parkhi, Richard Newcombe, and Carl Yuheng Ren. Aria digital twin: A new benchmark dataset for egocentric 3d machine perception, 2023.
[37] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. ICLR,
[38] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. pages 8748–
[39] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
[40] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021.
[41]

Zeronvs: Zero-shot 360-degree view synthesis from a single image

Kyle Sargent, Zizhang Li, Tanmay Shah, Charles Herrmann, Hong-Xing Yu, Yunzhi Zhang, Eric Ryan Chan, Dmitry Lagun, Li Fei-Fei, Deqing Sun, and Jiajun Wu. Zeronvs: Zero-shot 360-degree view synthesis from a single image. In CVPR, pages 9420–9429, 2024. 2

[42]

Mental rotation of three-dimensional objects

Roger N Shepard and Jacqueline Metzler. Mental rotation of three-dimensional objects. Science, 171(3972):701–703,

[43]

Motion-i2v: Consistent and controllable image-to-video generation with explicit motion modeling

Xiaoyu Shi, Zhaoyang Huang, Fu-Yun Wang, Weikang Bian, Dasong Li, Yi Zhang, Manyuan Zhang, Ka Chun Cheung, Simon See, Hongwei Qin, et al. Motion-i2v: Consistent and controllable image-to-video generation with explicit motion modeling. In ACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024. 3

[44]

Mental rotations, a group test of three-dimensional spatial visualization

Steven G Vandenberg and Allan R Kuse. Mental rotations, a group test of three-dimensional spatial visualization. Perceptual and motor skills, 47(2):599–604, 1978. 15

[45]

Normalized object coordinate space for category-level 6d object pose and size estimation

He Wang, Srinath Sridhar, Jingwei Huang, Julien Valentin, Shuran Song, and Leonidas J Guibas. Normalized object coordinate space for category-level 6d object pose and size estimation. In CVPR, pages 2642–2651, 2019. 1, 4

[46]

Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation

Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A. Yeh, and Greg Shakhnarovich. Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In CVPR, pages 12619–12629, 2023. 2

[47]

Dust3r: Geometric 3d vision made easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In CVPR, pages 20697–20709, 2024. 3, 13

[48]

Image quality assessment: from error visibility to structural similarity

Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004. 7

[49]

Novel view synthesis with diffusion models

Daniel Watson, William Chan, Ricardo Martin Brualla, Jonathan Ho, Andrea Tagliasacchi, and Mohammad Norouzi. Novel view synthesis with diffusion models. In The Eleventh International Conference on Learning Representations, 2023. 3

[50]

Reconfusion: 3d reconstruction with diffusion priors

Rundi Wu, Ben Mildenhall, Philipp Henzler, Keunhong Park, Ruiqi Gao, Daniel Watson, Pratul P. Srinivasan, Dor Verbin, Jonathan T. Barron, Ben Poole, and Aleksander Hołyński. Reconfusion: 3d reconstruction with diffusion priors. In CVPR, pages 21551–21561, 2024. 2

[51]

Neural assets: 3d-aware multi-object scene synthesis with image diffusion models

Ziyi Wu, Yulia Rubanova, Rishabh Kabra, Drew A Hudson, Igor Gilitschenski, Yusuf Aytar, Sjoerd van Steenkiste, Kelsey R Allen, and Thomas Kipf. Neural assets: 3d-aware multi-object scene synthesis with image diffusion models. arXiv preprint arXiv:2406.09292, 2024. 3

[52]

Sparp: Fast 3d object reconstruction and pose estimation from sparse views

Chao Xu, Ang Li, Linghao Chen, Yulin Liu, Ruoxi Shi, Hao Su, and Minghua Liu. Sparp: Fast 3d object reconstruction and pose estimation from sparse views. ECCV, 2024. 3, 4

[53]

Dmv3d: Denoising multi-view diffusion using 3d large reconstruction model

Yinghao Xu, Hao Tan, Fujun Luan, Sai Bi, Peng Wang, Jiahao Li, Zifan Shi, Kalyan Sunkavalli, Gordon Wetzstein, Zexiang Xu, et al. Dmv3d: Denoising multi-view diffusion using 3d large reconstruction model. ICLR, 2024. 2

[54]

pixelNeRF: Neural radiance fields from one or few images

Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelNeRF: Neural radiance fields from one or few images. In CVPR, 2021. 2, 3

[55]

Adding conditional control to text-to-image diffusion models, 2023

Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models, 2023. 3

[56]

The unreasonable effectiveness of deep features as a perceptual metric

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018. 7

[57]

Free3d: Consistent novel view synthesis without 3d representation

Chuanxia Zheng and Andrea Vedaldi. Free3d: Consistent novel view synthesis without 3d representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9720–9731, 2024. 2

A. Layout

The appendix is organized as follows: we describe CROCS in detail in Appendix B, discussing in particular the geometrica...

[58]

The error of the prior alone. This describes the geometrical inaccuracy (when the intermediate representation is CROCS) in predicting the 3D shape and pose of a given object.

Figure 10. Additional qualitative comparison on Objaverse-XL holdouts. Panels: (a) Ground truth, (b) unPIC via CROCS (Ours), (c) One-2-3-45 (XL), (d) CAT3D.

[59]

The error of the decoder when it is fed ground-truth CROCS rather than the output of the prior. This describes the difficulty of rendering the object (e.g., predicting the object’s texture) at novel views when extrapolating from a single source image. We report these components along with total hierarchical error in Table 5. We find that the prior...
