
arxiv: 2605.18052 · v1 · pith:YAIQGPHXnew · submitted 2026-05-18 · 💻 cs.CV

Efficient 3D Content Reconstruction and Generation

Pith reviewed 2026-05-20 11:35 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D generation · 3D reconstruction · multi-view diffusion · structure-from-motion · feed-forward reconstruction · GPU optimization · pose estimation

The pith

Instant3D generates high-quality 3D assets from text or images in 5-20 seconds while FastMap speeds up structure-from-motion by up to 10 times with comparable accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to replace slow manual 3D modeling and scanning with automated systems that synthesize or recover assets directly from text or images. Instant3D pairs multi-view diffusion with feed-forward sparse-view reconstruction to deliver usable 3D models in seconds. FastMap replaces the second-order solvers of earlier structure-from-motion pipelines with first-order optimization and fused GPU kernels, cutting runtime by up to 10 times. A reader would care because these changes make rapid asset creation practical for games, virtual reality, robotics, and the collection of training data for larger 3D models.

Core claim

The thesis advances automatic 3D content creation with two systems. Instant3D combines multi-view diffusion with feed-forward sparse-view 3D reconstruction to produce high-quality assets in 5-20 seconds. FastMap is a structure-from-motion pipeline that achieves up to a 10x speedup over the prior state of the art by relying extensively on first-order optimization with fused GPU kernels, while maintaining comparable pose accuracy and downstream novel view synthesis quality.

What carries the argument

Instant3D's combination of multi-view diffusion followed by feed-forward reconstruction, and FastMap's replacement of conventional solvers with first-order optimization plus fused GPU kernels inside the structure-from-motion pipeline.
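
To make "first-order optimization" concrete, here is a minimal sketch, not FastMap's actual solver. It assumes a toy problem (aligning 2D point sets with one rotation angle and one translation) and uses plain gradient descent, so each step needs only the gradient and no Jacobian factorization or linear solve. All function names, the learning rate, and the step count are hypothetical choices for illustration.

```python
import math
import random


def make_toy_problem(n_points=50, theta_true=0.5, t_true=(0.3, -0.2), seed=0):
    """Hypothetical toy data: 2D points and their rigidly transformed copies."""
    rng = random.Random(seed)
    src = [(rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)) for _ in range(n_points)]
    c, s = math.cos(theta_true), math.sin(theta_true)
    dst = [(c * x - s * y + t_true[0], s * x + c * y + t_true[1]) for x, y in src]
    return src, dst


def mean_squared_residual(theta, tx, ty, src, dst):
    """Mean of ||R(theta) p + t - q||^2 over all point pairs."""
    c, s = math.cos(theta), math.sin(theta)
    total = 0.0
    for (x, y), (u, v) in zip(src, dst):
        rx = c * x - s * y + tx - u
        ry = s * x + c * y + ty - v
        total += rx * rx + ry * ry
    return total / len(src)


def gradient_step(theta, tx, ty, src, dst, lr):
    """One gradient-descent update using only first derivatives of the loss."""
    c, s = math.cos(theta), math.sin(theta)
    n = len(src)
    g_theta = g_tx = g_ty = 0.0
    for (x, y), (u, v) in zip(src, dst):
        rx = c * x - s * y + tx - u
        ry = s * x + c * y + ty - v
        # Derivative of R(theta) p with respect to theta.
        dx, dy = -s * x - c * y, c * x - s * y
        g_theta += 2.0 * (rx * dx + ry * dy) / n
        g_tx += 2.0 * rx / n
        g_ty += 2.0 * ry / n
    return theta - lr * g_theta, tx - lr * g_tx, ty - lr * g_ty


def fit_rigid_2d(src, dst, lr=0.1, steps=500):
    """Run a fixed number of gradient steps from the identity transform."""
    theta = tx = ty = 0.0
    for _ in range(steps):
        theta, tx, ty = gradient_step(theta, tx, ty, src, dst, lr)
    return theta, tx, ty
```

Each update is a cheap, uniform pass over the data, which is the kind of work that maps well onto GPU kernels. The trade-off is that many more iterations are typically needed than with a second-order method such as Levenberg-Marquardt.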

If this is right

  • Text or single-image inputs become sufficient to produce usable 3D assets without lengthy manual modeling.
  • Large image collections can be processed for camera poses and geometry at speeds that support real-time or near-real-time workflows.
  • Novel view synthesis quality stays comparable to slower pipelines, so downstream rendering and simulation tasks remain reliable.
  • Rapid iteration on 3D content becomes feasible for applications such as video-game prototyping and virtual-world construction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the two components are chained, a single short pipeline could turn text prompts into editable 3D scenes with minimal additional compute.
  • The GPU-kernel approach used for pose optimization may transfer to other iterative refinement steps common in multi-view geometry.
  • Wider adoption could lower the cost of curating large-scale 3D training sets for foundation models.

Load-bearing premise

The premise that first-order optimization with fused GPU kernels can deliver the claimed speedup in pose estimation without introducing accuracy losses that would degrade later novel view synthesis.

What would settle it

A side-by-side run of FastMap against a prior structure-from-motion method on a standard multi-view benchmark that shows either pose error rising or novel view synthesis quality dropping at the reported 10x speedup.
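
One way such a comparison could score pose error is sketched below. It assumes rotations are 3x3 matrices stored as nested lists and uses the standard geodesic angle between rotations plus the angle between translation directions. The function names and the thresholded summary are hypothetical, and the paper's own evaluation protocol may differ.

```python
import math


def _clamp(value, low=-1.0, high=1.0):
    """Guard acos against values slightly outside [-1, 1] due to rounding."""
    return max(low, min(high, value))


def rotation_error_deg(r_est, r_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    # trace(r_est^T r_gt) equals the sum of elementwise products.
    trace = sum(r_est[i][j] * r_gt[i][j] for i in range(3) for j in range(3))
    return math.degrees(math.acos(_clamp((trace - 1.0) / 2.0)))


def translation_direction_error_deg(t_est, t_gt):
    """Angle in degrees between two translation directions (scale is ignored)."""
    dot = sum(a * b for a, b in zip(t_est, t_gt))
    norm = math.sqrt(sum(a * a for a in t_est)) * math.sqrt(sum(b * b for b in t_gt))
    if norm == 0.0:
        raise ValueError("translation vectors must be non-zero")
    return math.degrees(math.acos(_clamp(dot / norm)))


def fraction_within(errors_deg, threshold_deg):
    """Share of errors at or below a threshold, a simple accuracy summary."""
    return sum(1 for e in errors_deg if e <= threshold_deg) / len(errors_deg)
```

Running both pipelines on the same scenes and comparing these errors alongside wall-clock time would show whether the speedup comes at a cost in accuracy.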

Figures

Figures reproduced from arXiv: 2605.18052 by Jiahao Li.

Figure 3.1: Our method generates high-quality 3D NeRF assets from the given text prompts …
Figure 3.2: Overview of our method. Given a text prompt (‘a car made out of sushi’), …
Figure 3.3: Architecture of our sparse-view reconstructor. The model applies a pretrained …
Figure 3.4: Qualitative comparisons on text-to-3D against previous methods.
Figure 3.5: Qualitative comparisons on results generated with and without Gaussian blob …
Figure 3.6: Our Carve3D algorithm steadily improves the 3D consistency of a multi-view …
Figure 3.7: Overview of Carve3D. Given a prompt sampled from our curated prompt set …
Figure 3.8: Qualitative correlation between MRC and multi-view inconsistency with in …
Figure 3.9: Quantitative correlation between MRC and multi-view inconsistency with in …
Figure 3.10: Comparing the IS and the SF versions of Carve3D reward curves on the testing …
Figure 3.11: We observe qualitative correlation between the KL value and the prompt …
Figure 3.12: Scaling law for Carve3D’s diffusion model RLFT algorithm. When we scale …
Figure 3.13: We conducted a user study with 20 randomly selected testing prompts and the …
Figure 3.14: Top left: our approach achieves fast 3D generation from text or single-image …
Figure 3.15: Overview of our method. We denoise multiple views (three shown in the figure; four used in experiments) for 3D generation. Our multi-view denoiser is a large transformer model that reconstructs a noise-free triplane NeRF from input noisy images with camera poses (parameterized by Plücker rays). During training, we supervise the triplane NeRF with a rendering loss at input and novel viewpoints. During in…
Figure 3.16: Qualitative comparisons on single-image reconstruction.
  Method     ViT-B/32 R-Prec   ViT-B/32 AP   ViT-L/14 R-Prec   ViT-L/14 AP
  Point-E    33.33             40.06         46.4              54.13
  Shap-E     38.39             46.02         51.40             58.03
  Ours       39.72             47.96         55.14             61.32
Figure 3.17: Qualitative comparison on Text-to-3D.
  … with different numbers (1, 2, 4, 6) of input views in Tab. 3.7. We can see that our model consistently achieves better quality when using more images, benefiting from capturing more shape and appearance information. However, the performance improvement of 6 views over four views is marginal, where some metrics (like PSNR) from the 4-view model is even better. We th…
Figure 3.18: Robustness on out-of-domain inputs of synthetic, real, and generated images.
  … a large margin. In general, the novel view supervision is crucial for our model to achieve meaningful 3D generation, avoiding the model to learn a local minima that merely recovers the sparse multi-view images. In addition, our design of Plücker coordinate-based camera conditioning is also effective, leading to better quantitat…
Figure 4.1: Timing of FastMap compared to COLMAP and GLOMAP (all with GPU acceleration on a single A6000) on scenes from eight datasets, excluding the matching stage for all methods. Note the logarithmic time scale. Lines represent a least squares power function fit to timing across multiple datasets, as a function of the number of matched keypoint pairs.
  … have been adopted to speed up LM optimization, such as the Sc…
Figure 4.2: An overview of FastMap. Input images are processed using feature extraction and matching. Given the matching results, FastMap estimates the intrinsics and extrinsics of the cameras. Finally a sparse point cloud is generated by triangulation.
  Matching: FastMap’s matching stage is identical to that of both COLMAP and GLOMAP: it involves first extracting and matching keypoints from the input images, followed…
Figure 4.3: Effect of distortion on focal length estimation. Curves in …
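
The Figure 4.1 caption mentions a least-squares power-function fit of runtime against the number of matched keypoint pairs. A common way to obtain such a fit, sketched here as an assumption rather than the author's exact procedure, is ordinary least squares on the logarithms of both quantities. This minimizes error in log space, not in raw seconds. The function name `fit_power_law` is hypothetical.

```python
import math


def fit_power_law(xs, ys):
    """Fit y ~ a * x**b by ordinary least squares on (log x, log y).

    Returns (a, b). All inputs must be positive.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_y) / len(log_y)
    sxx = sum((x - mean_x) ** 2 for x in log_x)
    if sxx == 0.0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
    b = sxy / sxx                     # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)  # intercept in log-log space = log(a)
    return a, b
```

On a log-log plot the fitted curve is a straight line with slope `b`, so comparing exponents across methods indicates how their runtime scales as scenes grow.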
Original abstract

Automatic 3D content creation seeks to replace labor-intensive modeling and scanning pipelines with systems that can synthesize or recover 3D assets directly from text or images. Its applications span video games, virtual reality, robotics, and simulation, enabling rapid asset prototyping, diverse interactive world generation, and efficient 3D data collection for training foundation models. Contemporary solutions largely follow two complementary paradigms: (i) text- or image-to-3D generation, which learns priors over 3D geometry and appearance to create novel assets from natural language or a single view image; and (ii) 3D reconstruction, which estimates camera poses and geometry from RGB images. This thesis advances both directions. On the generation side, I introduce Instant3D, which combines multi-view diffusion with feed-forward sparse-view 3D reconstruction to produce high-quality assets in 5-20 seconds. On the reconstruction side, I develop FastMap, a structure-from-motion pipeline that achieves up to 10x speedup over prior state-of-the-art by using first-order optimization with fused GPU kernels extensively, while maintaining comparable pose accuracy and downstream novel view synthesis quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces Instant3D, which combines multi-view diffusion with feed-forward sparse-view 3D reconstruction to generate high-quality 3D assets from text or images in 5-20 seconds, and FastMap, a structure-from-motion pipeline that uses first-order optimization with fused GPU kernels to achieve up to 10x speedup over prior state-of-the-art while maintaining comparable pose accuracy and downstream novel view synthesis quality.

Significance. If the empirical claims hold, the work would advance efficient 3D content creation by reducing generation and reconstruction times substantially, with direct relevance to VR, robotics, and large-scale 3D data pipelines. The practical focus on GPU-accelerated first-order methods for SfM represents a useful engineering contribution if accuracy preservation is rigorously demonstrated.

major comments (1)
  1. The FastMap description claims that first-order optimization with fused GPU kernels preserves rotation/translation errors and NVS quality at ~10x speedup relative to second-order baselines (e.g., Levenberg-Marquardt or bundle adjustment). No iteration counts, convergence curves, per-scene error tables, or ablation on step-size/conditioning are referenced, leaving open whether accuracy holds after controlling for initialization sensitivity and kernel artifacts.
minor comments (1)
  1. The abstract and introduction would benefit from explicit dataset names and scene counts used for the 5-20s generation and 10x speedup claims to allow direct comparison with prior work.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the major comment on FastMap below and will incorporate the suggested clarifications and additional results in the revised version.

read point-by-point responses
  1. Referee: The FastMap description claims that first-order optimization with fused GPU kernels preserves rotation/translation errors and NVS quality at ~10x speedup relative to second-order baselines (e.g., Levenberg-Marquardt or bundle adjustment). No iteration counts, convergence curves, per-scene error tables, or ablation on step-size/conditioning are referenced, leaving open whether accuracy holds after controlling for initialization sensitivity and kernel artifacts.

    Authors: We agree that the current presentation would benefit from more explicit empirical details. In the revised manuscript we will add iteration counts for our first-order method versus the second-order baselines, convergence curves across representative scenes, per-scene rotation/translation error tables, and ablations on step-size and conditioning. These will appear in Section 4.2. We already control for initialization by using identical COLMAP-derived starting poses for all compared methods and report mean and standard deviation over the full test set; we will make this protocol explicit. Numerical equivalence of the fused kernels to their unfused counterparts (within FP32 precision) is verified in the supplementary material, which we will reference in the main text. revision: yes

Circularity Check

0 steps flagged

No load-bearing circularity; empirical claims rest on proposed pipelines without self-referential definitions or fitted predictions

full rationale

The abstract and description introduce Instant3D (multi-view diffusion + feed-forward reconstruction) and FastMap (first-order optimization + fused kernels) as new combinations for 3D tasks. No equations, fitted parameters, or self-citations appear in the provided text that would define a result in terms of itself or rename an input fit as a prediction. Speedup and accuracy claims are stated as outcomes of the methods rather than derived by construction. This matches the default non-finding for a methods paper whose central content is independent of any visible self-referential math.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities are stated. The work appears to rely on standard assumptions from diffusion models and structure-from-motion literature.




Reference graph

Works this paper leans on

294 extracted references · 294 canonical work pages · 28 internal anchors

  1. [1]

    Learning representations and generative models for 3d point clouds

    Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and Leonidas Guibas. Learning representations and generative models for 3d point clouds. In Jennifer Dy and Andreas Krause, editors,Proceedings of the 35th International Conference on Machine Learn- ing, volume 80 ofProceedings of Machine Learning Research, pages 40–49. PMLR, 10–15 Jul 2018. URLhttps://...

  2. [2]

    Adobe firefly.https://www.adobe.com/sensei/generative-ai/firefly

    Adobe. Adobe firefly.https://www.adobe.com/sensei/generative-ai/firefly. html, 2023. Accessed: 2023-11-15

  3. [3]

    Building rome in a day

    Sameer Agarwal, Noah Snavely, Ian Simon, Steven M Seitz, and Richard Szeliski. Building rome in a day. In2009 IEEE 12th International Conference on Computer Vision, pages 72–79, September 2009. doi: 10.1109/ICCV.2009.5459148. URLhttp: //dx.doi.org/10.1109/ICCV.2009.5459148

  4. [4]

    Seitz, and Richard Szeliski

    Sameer Agarwal, Noah Snavely, Steven M. Seitz, and Richard Szeliski. Bundle adjust- ment in the large. InComputer Vision – ECCV 2010, pages 29–42, 2010. doi: 10.1007/ 978-3-642-15552-9 3. URLhttp://dx.doi.org/10.1007/978-3-642-15552-9_3

  5. [5]

    Seitz, and Richard Szeliski

    Sameer Agarwal, Yasutaka Furukawa, Noah Snavely, Ian Simon, Brian Curless, Steven M. Seitz, and Richard Szeliski. Building rome in a day.Communications of the ACM, 54(10):105–112, October 2011. doi: 10.1145/2001269.2001293. URL http://dx.doi.org/10.1145/2001269.2001293

  6. [6]

    Ceres Solver

    Sameer Agarwal, Keir Mierle, and The Ceres Solver Team. Ceres Solver. Software, 10

  7. [7]

    URLhttps://github.com/ceres-solver/ceres-solver. 88

  8. [8]

    Mixvpr: Feature mixing for visual place recognition

    Amar Ali-bey, Brahim Chaib-draa, and Philippe Gigu` ere. Mixvpr: Feature mixing for visual place recognition. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 2998–3007, January 2023

  9. [9]

    Renderdiffusion: Image diffusion for 3d recon- struction, inpainting and generation

    Titas Anciukeviˇ cius, Zexiang Xu, Matthew Fisher, Paul Henderson, Hakan Bilen, Niloy J Mitra, and Paul Guerrero. Renderdiffusion: Image diffusion for 3d recon- struction, inpainting and generation. InCVPR, 2023

  10. [10]

    Netvlad: Cnn architecture for weakly supervised place recognition

    Relja Arandjelovic, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. Netvlad: Cnn architecture for weakly supervised place recognition. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016

  11. [11]

    Training a helpful and harmless assistant with reinforcement learning from human feedback, 2022

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova Das- Sarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson...

  12. [12]

    Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, K...

  13. [13]

    A minimal solution for two-view focal-length estimation using two affine correspondences

    Daniel Barath, Tekla Toth, and Levente Hajder. A minimal solution for two-view focal-length estimation using two affine correspondences. InCVPR, 2017

  14. [14]

    Fundamental matrix for cameras with radial distortion

    Jo˜ ao Pedro Barreto and Kostas Daniilidis. Fundamental matrix for cameras with radial distortion. InICCV, 2005

  15. [15]

    Masked feature prediction for self-supervised visual pre-training

    Jonathan T. Barron, Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, and Peter Hed- man. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. InIEEE/CVF Con- ference on Computer Vision and Pattern Recognition (CVPR), pages 5460–5469, 2022. doi: 10.1109/CVPR52688.2022.00539. URLhttps://doi.org/10.1109/CVPR52688. 2022.00539

  16. [16]

    2023 , url =

    Jonathan T. Barron, Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, and Peter Hedman. Zip-nerf: Anti-aliased grid-based neural radiance fields. InIEEE/CVF In- ternational Conference on Computer Vision (ICCV), pages 19640–19648, 2023. doi: 10.1109/ICCV51070.2023.01804. URLhttps://doi.org/10.1109/ICCV51070.2023. 01804

  17. [17]

    Key.net: Keypoint detection by handcrafted and learned cnn filters, 2019

    Axel Barroso-Laguna, Edgar Riba, Daniel Ponsa, and Krystian Mikolajczyk. Key.net: Keypoint detection by handcrafted and learned cnn filters, 2019. URLhttps://arxiv. org/abs/1904.00889

  18. [18]

    Rethinking visual geo- localization for large-scale applications

    Gabriele Berton, Carlo Masone, and Barbara Caputo. Rethinking visual geo- localization for large-scale applications. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4878–4888, June 2022

  19. [19]

    DynaSLAM: Tracking, Mapping and Inpainting in Dynamic Scenes

    Berta Bescos, Jos´ e M. F´ acil, Javier Civera, and Jos´ e Neira. Dynaslam: Tracking, mapping and inpainting in dynamic scenes, 2018. URLhttps://arxiv.org/abs/ 1806.05620

  20. [20]

    ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth

    Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias M¨ uller. Zoedepth: Zero-shot transfer by combining relative and metric depth.arXiv preprint arXiv:2302.12288, 2023. 90

  21. [21]

    Training diffusion models with reinforcement learning, 2023

    Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning, 2023

  22. [22]

    Deep- calib: a deep learning approach for automatic intrinsic calibration of wide field-of-view cameras

    Oleksandr Bogdan, Viktor Eckstein, Francois Rameau, and Jean-Charles Bazin. Deep- calib: a deep learning approach for automatic intrinsic calibration of wide field-of-view cameras. InProceedings of the 15th ACM SIGGRAPH European Conference on Visual Media Production, 2018

  23. [23]

    Neural-guided ransac: Learning where to sample model hypotheses

    Eric Brachmann and Carsten Rother. Neural-guided ransac: Learning where to sample model hypotheses. In2019 IEEE/CVF International Conference on Computer Vision (ICCV), page 4321–4330. IEEE, October 2019. doi: 10.1109/iccv.2019.00442. URL http://dx.doi.org/10.1109/iccv.2019.00442

  24. [24]

    DSAC - Differentiable RANSAC for Camera Localization

    Eric Brachmann, Alexander Krull, Sebastian Nowozin, Jamie Shotton, Frank Michel, Stefan Gumhold, and Carsten Rother. Dsac - differentiable ransac for camera local- ization, 2018. URLhttps://arxiv.org/abs/1611.05705

  25. [25]

    Accelerated co- ordinate encoding: Learning to relocalize in minutes using rgb and poses

    Eric Brachmann, Tommaso Cavallari, and Victor Adrian Prisacariu. Accelerated co- ordinate encoding: Learning to relocalize in minutes using rgb and poses. InCVPR, 2023

  26. [26]

    Scene coordinate recon- struction: Posing of image collections via incremental learning of a relocalizer

    Eric Brachmann, Jamie Wynn, Shuai Chen, Tommaso Cavallari, ´Aron Monszpart, Daniyar Turmukhambetov, and Victor Adrian Prisacariu. Scene coordinate recon- struction: Posing of image collections via incremental learning of a relocalizer. In ECCV, 2024

  27. [27]

    Language models are few-shot learners.Advances in neural information processing systems, 33, 2020

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33, 2020

  28. [28]

    Springer International Publishing, 2020

    Bingyi Cao, Andr´ e Araujo, and Jack Sim.Unifying Deep Local and Global Features for Image Search, page 726–743. Springer International Publishing, 2020. ISBN 91 9783030585655. doi: 10.1007/978-3-030-58565-5 43. URLhttp://dx.doi.org/10. 1007/978-3-030-58565-5_43

  29. [29]

    Emerging properties in self-supervised vision trans- formers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Herv´ e J´ egou, Julien Mairal, Piotr Bo- janowski, and Armand Joulin. Emerging properties in self-supervised vision trans- formers. InProceedings of the International Conference on Computer Vision (ICCV), 2021

  30. [30]

    Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Bıyık, Anca Dragan, David Krueger, Dorsa Sadigh, and Dylan Hadfield-Menell

    Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, J´ er´ emy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Rapha¨ el Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud,...

  31. [31]

    Chan, Mariano Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein

    Eric R. Chan, Mariano Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein. pi-gan: Periodic implicit generative adversarial networks for 3d-aware image synthesis. InThe IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021

  32. [32]

    Chan, Connor Z

    Eric R. Chan, Connor Z. Lin, Matthew A. Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas Guibas, Jonathan Tremblay, Sameh Khamis, Tero Karras, and Gordon Wetzstein. Efficient geometry-aware 3d generative adversarial networks. InThe IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022

  33. [33]

    Efficient geometry-aware 3d generative adversarial networks

    Eric R Chan, Connor Z Lin, Matthew A Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas J Guibas, Jonathan Tremblay, Sameh Khamis, et al. Efficient geometry-aware 3d generative adversarial networks. InCVPR, 2022

  34. [34]

    Efficient and robust large-scale ro- tation averaging

    Avishek Chatterjee and Venu Madhav Govindu. Efficient and robust large-scale ro- tation averaging. In2013 IEEE International Conference on Computer Vision, pages 92 521–528, December 2013. doi: 10.1109/ICCV.2013.70. URLhttp://dx.doi.org/10. 1109/ICCV.2013.70

  35. [35]

    Robust relative rotation averaging

    Avishek Chatterjee and Venu Madhav Govindu. Robust relative rotation averaging. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):958–972, April

  36. [36]

    URLhttp://dx.doi.org/10.1109/TPAMI

    doi: 10.1109/TPAMI.2017.2693984. URLhttp://dx.doi.org/10.1109/TPAMI. 2017.2693984

  37. [37]

    Tensorf: Ten- sorial radiance fields

    Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. Tensorf: Ten- sorial radiance fields. InComputer Vision – ECCV 2022 – 17th European Confer- ence, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXXII, volume 13692 ofLecture Notes in Computer Science, pages 333–350. Springer, 2022. doi: 10.1007/ 978-3-031-19824-3 20. URLhttps://doi....

  38. [38]

    Single-stage diffusion nerf: A unified approach to 3d generation and reconstruction

    Hansheng Chen, Jiatao Gu, Anpei Chen, Wei Tian, Zhuowen Tu, Lingjie Liu, and Hao Su. Single-stage diffusion nerf: A unified approach to 3d generation and reconstruction. InICCV, 2023

  39. [39]

    Aspanformer: Detector-free image matching with adaptive span transformer, 2022

    Hongkai Chen, Zixin Luo, Lei Zhou, Yurun Tian, Mingmin Zhen, Tian Fang, David Mckinnon, Yanghai Tsin, and Long Quan. Aspanformer: Detector-free image matching with adaptive span transformer, 2022. URLhttps://arxiv.org/abs/2208.14201

  40. [40]

    Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation

    Rui Chen, Yongwei Chen, Ningxin Jiao, and Kui Jia. Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 20622– 20633, October 2023

  41. [41]

    Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation.arXiv preprint arXiv:2303.13873, 2023

    Rui Chen, Yongwei Chen, Ningxin Jiao, and Kui Jia. Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation.arXiv preprint arXiv:2303.13873, 2023

  42. [42]

    Learning implicit fields for generative shape modeling

    Zhiqin Chen and Hao Zhang. Learning implicit fields for generative shape modeling. InThe IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019. 93

  43. [43]

    V3d: Video diffusion models are effective 3d generators, 2024

    Zilong Chen, Yikai Wang, Feng Wang, Zhengyi Wang, and Huaping Liu. V3d: Video diffusion models are effective 3d generators, 2024. URLhttps://arxiv.org/abs/ 2403.06738

  44. [44]

    Luciddreamer: Domain-free generation of 3d gaussian splatting scenes.arXiv preprint arXiv:2311.13384, 2023

    Jaeyoung Chung, Suyoung Lee, Hyeongjin Nam, Jaerin Lee, and Kyoung Mu Lee. Luciddreamer: Domain-free generation of 3d gaussian splatting scenes.arXiv preprint arXiv:2311.13384, 2023

  45. [45]

    Abo: Dataset and benchmarks for real-world 3d object understanding

    Jasmine Collins, Shubham Goel, Kenan Deng, Achleshwar Luthra, Leon Xu, Erhan Gundogdu, Xi Zhang, Tomas F Yago Vicente, Thomas Dideriksen, Himanshu Arora, et al. Abo: Dataset and benchmarks for real-world 3d object understanding. InCVPR, pages 21126–21136, 2022

  46. [46]

    Discrete- continuous optimization for large-scale structure from motion

    David Crandall, Andrew Owens, Noah Snavely, and Dan Huttenlocher. Discrete- continuous optimization for large-scale structure from motion. InCVPR 2011, June

  47. [47]

    URLhttp://dx.doi.org/10.1109/CVPR

    doi: 10.1109/CVPR.2011.5995626. URLhttp://dx.doi.org/10.1109/CVPR. 2011.5995626

  48. [48]

    Global structure-from-motion by similarity averaging

    Zhaopeng Cui and Ping Tan. Global structure-from-motion by similarity averaging. In2015 IEEE International Conference on Computer Vision (ICCV), pages 864–872, December 2015. doi: 10.1109/ICCV.2015.105. URLhttp://dx.doi.org/10.1109/ ICCV.2015.105

  49. [49]

    Obja- verse: A universe of annotated 3d objects, 2022

    Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli Vander- Bilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Obja- verse: A universe of annotated 3d objects, 2022

  50. [50]

    Objaverse-XL: A Universe of 10M+ 3D Objects

    Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, et al. Objaverse-xl: A universe of 10m+ 3d objects.arXiv preprint arXiv:2307.05663, 2023

  51. [51]

    Obja- 94 verse: A universe of annotated 3d objects

    Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli Vander- Bilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Obja- 94 verse: A universe of annotated 3d objects. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13142–13153, 2023

  52. [52]

    Obja- verse: A universe of annotated 3d objects

    Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli Vander- Bilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Obja- verse: A universe of annotated 3d objects. InCVPR, pages 13142–13153, 2023

  53. [53]

    Superpoint: Self- supervised interest point detection and description

    Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superpoint: Self- supervised interest point detection and description. In2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), page 337–33712. IEEE, June 2018. doi: 10.1109/cvprw.2018.00060. URLhttp://dx.doi.org/10. 1109/cvprw.2018.00060

  54. [54]

    Theia: A library for structure-from-motion

    Theia Developers. Theia: A library for structure-from-motion. URL http://www.theia-sfm.org/, 2026. Accessed 2026-02-03

  55. [55]

    Reloc3r: Large-scale training of relative camera pose regression for generalizable, fast, and accurate visual localization

    Siyan Dong, Shuzhe Wang, Shaohui Liu, Lulu Cai, Qingnan Fan, Juho Kannala, and Yanchao Yang. Reloc3r: Large-scale training of relative camera pose regression for generalizable, fast, and accurate visual localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16739–16752, June 2025

  56. [56]

    Google scanned objects: A high-quality dataset of 3d scanned household items

    Laura Downs, Anthony Francis, Nate Koenig, Brandon Kinman, Ryan Hickman, Krista Reymann, Thomas B McHugh, and Vincent Vanhoucke. Google scanned objects: A high-quality dataset of 3d scanned household items. In 2022 International Conference on Robotics and Automation (ICRA), pages 2553–2560. IEEE, 2022

  57. [57]

    Close-range camera calibration

    C Brown Duane. Close-range camera calibration. Photogramm. Eng, 37(8), 1971

  58. [58]

    MASt3R-SfM: a Fully-Integrated Solution for Unconstrained Structure-from-Motion

    Bardienus Duisterhof, Lojze Zust, Philippe Weinzaepfel, Vincent Leroy, Yohann Cabon, and Jerome Revaud. MASt3R-SfM: A fully-integrated solution for unconstrained structure-from-motion. arXiv preprint arXiv:2409.19152, 2024

  59. [59]

    MASt3r-sfm: a fully-integrated solution for unconstrained structure-from-motion

    Bardienus Pieter Duisterhof, Lojze Zust, Philippe Weinzaepfel, Vincent Leroy, Yohann Cabon, and Jerome Revaud. MASt3r-sfm: a fully-integrated solution for unconstrained structure-from-motion. In International Conference on 3D Vision (3DV), 2025. URL https://openreview.net/forum?id=5uw1GRBFoT

  60. [60]

    D2-Net: A Trainable CNN for Joint Detection and Description of Local Features

    Mihai Dusmanu, Ignacio Rocco, Tomas Pajdla, Marc Pollefeys, Josef Sivic, Akihiko Torii, and Torsten Sattler. D2-net: A trainable cnn for joint detection and description of local features, 2019. URL https://arxiv.org/abs/1905.03561

  61. [62]

    Hyperparameters in reinforcement learning and how to tune them, 2023

    Theresa Eimer, Marius Lindauer, and Roberta Raileanu. Hyperparameters in reinforcement learning and how to tune them, 2023

  62. [63]

    Syncity: Training-free generation of 3d worlds

    Jakob Engstler, Gaisi Xia, Brandon Trabucco, Huicheng Yan, Rui Ding, Yun-Chun Chen, and Gordon Wetzstein. Syncity: Training-free generation of 3d worlds. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 18577–18587, 2025

  63. [64]

    Disentangled 3d scene generation with layout learning

    Dave Epstein, Alisa Kim, Xueyue Wang, Hsin-Ying Chen, Hung-Yu Tseng, Hsiao-Yu Fish Chen, Ming-Hsuan Yang, and Wei Wang. Disentangled 3d scene generation with layout learning. In Dylan Fortune, Paul Vicol, Serena Yeung, Hannah Zhang, Frank Hutter, and Hugo Larochelle, editors, Proceedings of the 41st International Conference on Machine Learning, volume...

  64. [65]

    A review of the one-parameter division undistortion model. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 1, 2021

    Bastian Erdnüß. A review of the one-parameter division undistortion model. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 1, 2021

  65. [66]

    Haoqiang Fan, Hao Su, and Leonidas J. Guibas. A point set generation network for 3d object reconstruction from a single image. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017

  66. [67]

    Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models, 2023

    Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models, 2023

  67. [68]

    Ctrl-room: Controllable text-to-3d room meshes generation with layout constraints

    Chuan Fang, Yuan Dong, Kunming Luo, Xiaotao Hu, Rakesh Shrestha, and Ping Tan. Ctrl-room: Controllable text-to-3d room meshes generation with layout constraints. In International Conference on 3D Vision (3DV), pages 692–701, 2025. doi: 10.1109/3DV66043.2025.00069

  68. [69]

    Three-dimensional computer vision: A geometric viewpoint

    Olivier Faugeras. Three-dimensional computer vision: A geometric viewpoint. MIT Press, 1993

  69. [70]

    Simultaneous linear estimation of multiple view geometry and lens distortion

    Andrew W Fitzgibbon. Simultaneous linear estimation of multiple view geometry and lens distortion. In CVPR, 2001

  70. [71]

    Get3d: A generative model of high quality 3d textured shapes learned from images. Advances in Neural Information Processing Systems, 35:31841–31854, 2022

    Jun Gao, Tianchang Shen, Zian Wang, Wenzheng Chen, Kangxue Yin, Daiqing Li, Or Litany, Zan Gojcic, and Sanja Fidler. Get3d: A generative model of high quality 3d textured shapes learned from images. Advances in Neural Information Processing Systems, 35:31841–31854, 2022

  71. [72]

    SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

    Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, and Ying Shan. SEED-X: multimodal models with unified multi-granularity comprehension and generation. CoRR, abs/2404.14396, 2024. doi: 10.48550/ARXIV.2404.14396. URL https://doi.org/10.48550/arXiv.2404.14396

  72. [73]

    Learning shape templates with structured implicit functions

    Kyle Genova, Ethan Highlander, Ozgur Sirin, Wonmin Liao, Kostas Daniilidis, and Thomas Funkhouser. Learning shape templates with structured implicit functions. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020

  73. [74]

    S2dnet: Learning accurate correspondences for sparse-to-dense feature matching, 2020

    Hugo Germain, Guillaume Bourmaud, and Vincent Lepetit. S2dnet: Learning accurate correspondences for sparse-to-dense feature matching, 2020. URL https://arxiv.org/abs/2004.01673

  74. [75]

    Vivid-1-to-3: Novel view synthesis with video diffusion models, 2023

    Jeong gi Kwak, Erqun Dong, Yuhe Jin, Hanseok Ko, Shweta Mahajan, and Kwang Moo Yi. Vivid-1-to-3: Novel view synthesis with video diffusion models, 2023. URL https://arxiv.org/abs/2312.01305

  75. [77]

    All for one, and one for all: Urbansyn dataset, the third musketeer of synthetic driving scenes. Neurocomputing, 637:130038, 2025

    Jose L Gómez, Manuel Silva, Antonio Seoane, Agnès Borrás, Mario Noriega, Germán Ros, Jose A Iglesias-Guitian, and Antonio M López. All for one, and one for all: Urbansyn dataset, the third musketeer of synthetic driving scenes. Neurocomputing, 637:130038, 2025

  76. [79]

    V. M. Govindu. Lie-algebraic averaging for globally consistent motion estimation. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004., volume 1, pages 684–691, 2004. doi: 10.1109/CVPR.2004.1315098. URL http://dx.doi.org/10.1109/CVPR.2004.1315098

  77. [80]

    Kubric: A scalable dataset generator

    Klaus Greff, Francois Belletti, Lucas Beyer, Carl Doersch, Yilun Du, Daniel Duckworth, David J Fleet, Dan Gnanapragasam, Florian Golemo, Charles Herrmann, et al. Kubric: A scalable dataset generator. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3749–3761, 2022

  78. [81]

    A papier-mache approach to learning 3d surface generation

    Thibault Groueix, Matthew Fisher, Vladimir G. Kim, Bryan C. Russell, and Mathieu Aubry. A papier-mache approach to learning 3d surface generation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018

  79. [82]

    Cascade cost volume for high-resolution multi-view stereo and stereo matching, 2020

    Xiaodong Gu, Zhiwen Fan, Zuozhuo Dai, Siyu Zhu, Feitong Tan, and Ping Tan. Cascade cost volume for high-resolution multi-view stereo and stereo matching, 2020. URL https://arxiv.org/abs/1912.06378

  80. [83]

    threestudio: A unified framework for 3d content generation. https://github.com/threestudio-project/threestudio, 2023

    Yuan-Chen Guo, Ying-Tian Liu, Ruizhi Shao, Christian Laforte, Vikram Voleti, Guan Luo, Chia-Hao Chen, Zi-Xin Zou, Chen Wang, Yan-Pei Cao, and Song-Hai Zhang. threestudio: A unified framework for 3d content generation. https://github.com/threestudio-project/threestudio, 2023

Showing first 80 references.