Unposed-to-3D: Learning Simulation-Ready Vehicles from Real-World Images
Pith reviewed 2026-05-10 03:29 UTC · model grok-4.3
The pith
Unposed-to-3D reconstructs 3D vehicle models from single real-world images without known camera poses by predicting the poses itself and using differentiable rendering as self-supervision.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that an image-to-3D reconstruction network can be trained on real-world driving images alone by inserting a camera prediction head whose output supplies the pose needed for differentiable rendering; the resulting photometric loss then acts as self-supervision. This process, together with explicit scale estimation and appearance harmonization, yields 3D vehicle models that remain pose-consistent, correctly scaled, and visually compatible when placed inside driving scenes.
What carries the argument
A camera prediction head that estimates camera parameters from unposed images to enable self-supervised photometric feedback through differentiable rendering.
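The mechanism can be sketched with a toy example in which a single pose parameter drives a renderer and the photometric loss alone identifies the correct pose. Everything here is illustrative and ours, not the paper's: the "vehicle" is a 2D point set, the camera is one rotation angle, and a grid search stands in for gradient descent through a differentiable renderer.

```python
import numpy as np

def render(points, theta, n_pixels=32):
    """Rotate points by the camera angle, then splat x-coords into a 1D image."""
    c, s = np.cos(theta), np.sin(theta)
    rotated = points @ np.array([[c, -s], [s, c]]).T
    image = np.zeros(n_pixels)
    idx = np.clip(((rotated[:, 0] + 1) / 2 * n_pixels).astype(int), 0, n_pixels - 1)
    np.add.at(image, idx, 1.0)
    return image

def photometric_loss(pred, target):
    """L1 photometric loss between a rendered and an observed image."""
    return np.abs(pred - target).mean()

rng = np.random.default_rng(0)
points = rng.uniform(-0.7, 0.7, size=(50, 2))   # toy stand-in for 3D geometry
true_theta = 0.8
target = render(points, true_theta)             # the "unposed" observation

# A camera prediction head would regress theta; here we only show that the
# photometric loss is minimized at the correct pose.
thetas = np.linspace(0, np.pi / 2, 181)
losses = [photometric_loss(render(points, t), target) for t in thetas]
best = thetas[int(np.argmin(losses))]
```

In the paper's setting the pose comes from a learned head and the gradient flows through a differentiable renderer; the toy grid search only shows that the photometric objective carries a pose signal.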
If this is right
- Models maintain consistent 3D geometry and appearance across multiple input views of the same vehicle.
- Predicted real-world scales allow the assets to be placed at correct sizes inside driving simulations.
- Harmonization adapts lighting and texture so generated vehicles blend into target scenes without obvious mismatches.
- The pipeline operates on ordinary image collections, removing the need for posed or synthetic training data.
- The resulting assets support downstream tasks such as scene composition and digital-twin construction.
Where Pith is reading between the lines
- The same self-supervised loop could be applied to other rigid objects commonly seen in driving footage, such as traffic signs or road furniture.
- If the camera head generalizes across datasets, collections of dashcam video could be turned into city-scale 3D vehicle libraries with minimal manual labeling.
- The scale-aware module might be replaced or augmented by direct depth cues from stereo pairs or LiDAR when those sensors are available in the source data.
Load-bearing premise
The camera prediction head can produce sufficiently accurate pose estimates from unposed images to supply reliable photometric supervision via differentiable rendering, without ground-truth camera parameters or additional labels.
What would settle it
Reconstruct a set of vehicles from unposed images, render them from held-out viewpoints or insert them into new scenes, and measure whether photometric error, scale error, or visual inconsistency exceeds the levels obtained when ground-truth poses are supplied during training.
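The proposed test maps onto standard metrics. A minimal sketch of the two quantitative checks (PSNR on held-out renders, relative scale error on predicted vehicle dimensions), with illustrative numbers of our own rather than values from the paper:

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio between a held-out render and ground truth."""
    mse = np.mean((pred - target) ** 2)
    return 20 * np.log10(max_val) - 10 * np.log10(mse)  # undefined if mse == 0

def scale_error(pred_dims, true_dims):
    """Mean relative error over (length, width, height), e.g. in metres."""
    pred_dims, true_dims = np.asarray(pred_dims), np.asarray(true_dims)
    return float(np.mean(np.abs(pred_dims - true_dims) / true_dims))

# Illustrative values: a uniform pixel error of 0.1 gives MSE 0.01, i.e. 20 dB.
target = np.full((8, 8), 0.5)
pred = target + 0.1
# A 10% error on length only, widths and heights exact.
err = scale_error([4.4, 1.8, 1.5], [4.0, 1.8, 1.5])
```

Running the same metrics once with predicted poses and once with ground-truth poses supplied during training would quantify exactly the gap the test describes.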
Original abstract
Creating realistic and simulation-ready 3D assets is crucial for autonomous driving research and virtual environment construction. However, existing 3D vehicle generation methods are often trained on synthetic data with significant domain gaps from real-world distributions. The generated models often exhibit arbitrary poses and undefined scales, resulting in poor visual consistency when integrated into driving scenes. In this paper, we present Unposed-to-3D, a novel framework that learns to reconstruct 3D vehicles from real-world driving images using image-only supervision. Our approach consists of two stages. In the first stage, we train an image-to-3D reconstruction network using posed images with known camera parameters. In the second stage, we remove camera supervision and use a camera prediction head that directly estimates the camera parameters from unposed images. The predicted pose is then used for differentiable rendering to provide self-supervised photometric feedback, enabling the model to learn 3D geometry purely from unposed images. To ensure simulation readiness, we further introduce a scale-aware module to predict real-world size information, and a harmonization module that adapts the generated vehicles to the target driving scene with consistent lighting and appearance. Extensive experiments demonstrate that Unposed-to-3D effectively reconstructs realistic, pose-consistent, and harmonized 3D vehicle models from real-world images, providing a scalable path toward creating high-quality assets for driving scene simulation and digital twin environments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Unposed-to-3D, a two-stage framework for reconstructing 3D vehicle models from real-world driving images under image-only supervision. Stage 1 trains an image-to-3D network on posed images with known camera parameters. Stage 2 removes explicit camera supervision, introduces a camera prediction head to estimate poses from unposed images, and uses differentiable rendering to supply self-supervised photometric losses for learning geometry and texture. Additional scale-aware and harmonization modules are added to produce simulation-ready assets with real-world scale and scene-consistent appearance. The central claim is that this yields realistic, pose-consistent, harmonized 3D vehicles scalable for driving simulation and digital twins.
Significance. If the self-supervised loop is stable and the predicted poses sufficiently accurate, the approach would address the synthetic-to-real domain gap in 3D asset generation and reduce reliance on expensive pose annotations, offering a scalable pipeline for high-quality simulation assets.
Major comments (2)
- [§3.2] §3.2 (Stage 2 training): The description of the joint optimization between the camera prediction head and reconstruction network via photometric rendering gradients provides no details on initialization, pose regularization, warm-start schedule, or independent validation of predicted poses against held-out ground truth. Without these, it is unclear whether the photometric loss can reliably supervise correct 3D geometry rather than degenerate solutions such as flattened shapes compensated by pose shifts.
- [§4] §4 Experiments: The claim that 'extensive experiments demonstrate effectiveness' is unsupported by any reported quantitative metrics (e.g., reconstruction PSNR, IoU, pose error, or FID), baselines, ablation studies on the camera head, or failure-case analysis. This absence makes it impossible to verify that the self-supervised loop produces accurate, simulation-ready models on real driving images with varying lighting and occlusion.
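Where partial ground truth exists, the pose validation the first comment asks for reduces to a standard angular metric between predicted and annotated rotations. A sketch with names of our choosing:

```python
import numpy as np

def rotation_geodesic_error(R_pred, R_gt):
    """Angle in radians of the relative rotation between two pose estimates."""
    cos_angle = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

def rot_z(theta):
    """Rotation about the vertical axis, the dominant pose axis for vehicles."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# A predicted yaw 0.5 rad away from ground truth has 0.5 rad geodesic error.
err = rotation_geodesic_error(rot_z(0.3), rot_z(0.8))
```

Reporting this error on any annotated subset would directly test whether the photometric loss converges to correct poses rather than to degenerate shape-pose trade-offs.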
Minor comments (2)
- The abstract and method overview would benefit from explicit dataset names and statistics (e.g., number of real-world images and sources) to contextualize the domain.
- Notation for the scale-aware module and harmonization loss could be clarified with a single equation reference rather than descriptive text only.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the opportunity to clarify and improve our manuscript. We address each major comment below and will revise the paper to incorporate additional details and evaluations where feasible.
Point-by-point responses
Referee: [§3.2] §3.2 (Stage 2 training): The description of the joint optimization between the camera prediction head and reconstruction network via photometric rendering gradients provides no details on initialization, pose regularization, warm-start schedule, or independent validation of predicted poses against held-out ground truth. Without these, it is unclear whether the photometric loss can reliably supervise correct 3D geometry rather than degenerate solutions such as flattened shapes compensated by pose shifts.
Authors: We agree that §3.2 would benefit from greater detail on the Stage 2 procedure. In the revised manuscript we will expand this section to describe: initialization of the camera prediction head from the corresponding branch trained in Stage 1; regularization terms including temporal smoothness on predicted poses and consistency with the scale-aware module; a warm-start schedule in which the reconstruction network is initially frozen while the camera head is trained before joint fine-tuning; and any available pose validation on subsets with partial ground-truth annotations. We will also add explicit discussion of how the combination of photometric self-supervision, scale prediction, and multi-view consistency discourages degenerate solutions such as flattening. Revision: yes.
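The warm-start schedule the rebuttal describes can be written down as a simple epoch gate; the epoch count and module names are illustrative assumptions, not values from the paper:

```python
def stage2_trainable(epoch, warmup_epochs=5):
    """Which modules receive gradients at a given Stage 2 epoch.

    First the camera head trains against the frozen Stage 1
    reconstruction network; afterwards both are fine-tuned jointly.
    """
    return {
        "camera_head": True,
        "reconstruction_net": epoch >= warmup_epochs,
    }
```

In a typical training loop this dictionary would gate gradient updates (e.g. `requires_grad` on each module's parameters), so the camera head stabilizes before it can drag the geometry toward degenerate solutions.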
Referee: [§4] §4 Experiments: The claim that 'extensive experiments demonstrate effectiveness' is unsupported by any reported quantitative metrics (e.g., reconstruction PSNR, IoU, pose error, or FID), baselines, ablation studies on the camera head, or failure-case analysis. This absence makes it impossible to verify that the self-supervised loop produces accurate, simulation-ready models on real driving images with varying lighting and occlusion.
Authors: We acknowledge that the current experimental section relies primarily on qualitative results. In the revision we will augment §4 with quantitative metrics including PSNR/SSIM for photometric reconstruction quality, IoU on projected geometry where feasible, pose prediction error on any available annotated subsets, and FID scores for harmonized outputs. We will add comparisons against relevant baselines, ablation studies isolating the camera prediction head and harmonization module, and a dedicated discussion of failure cases under occlusion and lighting variation. These additions will provide stronger, verifiable support for the claims. Revision: yes.
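Of the metrics listed, IoU on projected geometry is the one most specific to reconstructed shape; a minimal sketch of silhouette IoU (mask shapes and values are illustrative):

```python
import numpy as np

def silhouette_iou(mask_a, mask_b):
    """Intersection-over-union of two boolean silhouette masks."""
    mask_a, mask_b = np.asarray(mask_a, bool), np.asarray(mask_b, bool)
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return float(inter / union) if union else 1.0

# Two 4x4 masks whose 2x2 squares overlap in a 2-pixel strip: IoU = 2/6.
a = np.zeros((4, 4), bool); a[0:2, 0:2] = True
b = np.zeros((4, 4), bool); b[0:2, 1:3] = True
iou = silhouette_iou(a, b)
```

Comparing the rendered silhouette of a reconstructed vehicle against an instance mask from the source image would make the geometry claim quantitative without requiring any pose labels.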
- Independent validation of predicted poses against held-out ground truth is, however, impossible for the core unposed real-world images, which lack pose annotations by design.
Circularity Check
No significant circularity; two-stage self-supervision is standard and non-reductive
Full rationale
The abstract describes a two-stage pipeline: Stage 1 trains the reconstruction network with explicit known camera parameters; Stage 2 removes that supervision and substitutes a learned camera-prediction head whose outputs drive photometric rendering loss. This is a conventional self-supervised formulation (pose prediction + differentiable rendering) and does not equate the reconstruction output to its own inputs by definition. No quoted equation or module is defined in terms of the quantity it is supposed to predict, no self-citation is invoked as a uniqueness theorem, and no fitted parameter is relabeled as an independent prediction. The method therefore remains open to external validation on real driving images and receives the default non-circular finding.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: differentiable rendering can supply a usable photometric loss for 3D geometry learning when camera parameters are predicted rather than given.