pith. machine review for the scientific record.

arxiv: 2604.19257 · v1 · submitted 2026-04-21 · 💻 cs.CV

Recognition: unknown

Unposed-to-3D: Learning Simulation-Ready Vehicles from Real-World Images

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 03:29 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D vehicle reconstruction · self-supervised learning · differentiable rendering · camera pose prediction · simulation-ready assets · driving scene generation · image-to-3D

The pith

Unposed-to-3D reconstructs 3D vehicle models from single real-world images without known camera poses by predicting poses and using self-supervised rendering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a two-stage method that first trains an image-to-3D network on posed images with known camera data, then switches to unposed images by adding a head that predicts camera parameters. Those predicted poses drive differentiable rendering, which supplies photometric consistency checks to supervise the 3D geometry and appearance directly from real driving photos. Scale prediction and scene harmonization modules are added so the resulting models carry real-world dimensions and match target lighting. If the approach works, it supplies a route to large collections of simulation-ready vehicle assets drawn from ordinary image collections rather than synthetic datasets.
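
As a concrete illustration of the scale step only, here is a minimal sketch of what "carrying real-world dimensions" could mean in practice, assuming the network emits a single metric length per vehicle and the asset arrives in an arbitrary canonical frame; the function name and the longest-axis convention are illustrative, not taken from the paper.

```python
import numpy as np

def apply_predicted_scale(points: np.ndarray, predicted_length_m: float) -> np.ndarray:
    """Rescale an asset's point set so its longest bounding-box extent matches
    the predicted real-world vehicle length (in meters) before scene insertion."""
    extents = points.max(axis=0) - points.min(axis=0)  # per-axis bounding-box size
    scale = predicted_length_m / extents.max()          # assume the longest axis is the car's length
    return points * scale
```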

Core claim

The central claim is that an image-to-3D reconstruction network can be trained on real-world driving images alone by inserting a camera prediction head whose output supplies the pose needed for differentiable rendering; the resulting photometric loss then acts as self-supervision. This process, together with explicit scale estimation and appearance harmonization, yields 3D vehicle models that remain pose-consistent, correctly scaled, and visually compatible when placed inside driving scenes.
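
A minimal sketch of what such a self-supervised step could look like in a PyTorch-style setup. Everything here is a hypothetical stand-in rather than the paper's code: `recon_net`, `cam_head`, and `render_fn` are placeholders, and the loss is plain photometric MSE rather than whatever combination of terms the authors actually use.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CameraHead(nn.Module):
    """Hypothetical head: pooled image features -> low-dimensional camera parameters."""
    def __init__(self, feat_dim: int = 768, cam_dim: int = 7):
        super().__init__()
        # e.g., 3 rotation + 3 translation + 1 focal-length parameter (illustrative split)
        self.mlp = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, cam_dim))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.mlp(feats.mean(dim=1))  # pool patch tokens -> (B, cam_dim)

def stage2_step(recon_net, cam_head, render_fn, images, feats, optimizer):
    """One stage-2 step: the predicted pose drives differentiable rendering, and the
    photometric error against the input image is the only supervision signal."""
    assets = recon_net(feats)             # 3D representation (e.g., Gaussians)
    camera = cam_head(feats)              # predicted camera parameters
    rendered = render_fn(assets, camera)  # differentiable render at the predicted pose
    loss = F.mse_loss(rendered, images)   # photometric self-supervision
    optimizer.zero_grad()
    loss.backward()                       # gradients reach both geometry and pose
    optimizer.step()
    return loss.detach()
```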

What carries the argument

A camera prediction head that estimates camera parameters from unposed images to enable self-supervised photometric feedback through differentiable rendering.
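
One plausible way such a head could turn a few predicted scalars into a usable camera, assuming an object-centric look-at parametrization (azimuth, elevation, radius) with a y-up convention; the paper's actual camera construction (Figure 6) may be parametrized differently.

```python
import torch
import torch.nn.functional as F

def lookat_extrinsics(azimuth: torch.Tensor, elevation: torch.Tensor, radius: torch.Tensor):
    """Build a world-to-camera rotation R and translation t from predicted spherical
    coordinates, with the camera looking at the object origin (y-up assumed)."""
    cam_pos = torch.stack([
        radius * torch.cos(elevation) * torch.sin(azimuth),
        radius * torch.sin(elevation),
        radius * torch.cos(elevation) * torch.cos(azimuth),
    ], dim=-1)                                           # (B, 3) camera center in world frame
    forward = F.normalize(-cam_pos, dim=-1)              # viewing direction toward the origin
    up = torch.tensor([0.0, 1.0, 0.0]).expand_as(forward)
    right = F.normalize(torch.cross(forward, up, dim=-1), dim=-1)
    true_up = torch.cross(right, forward, dim=-1)
    R = torch.stack([right, true_up, -forward], dim=-2)  # (B, 3, 3) world-to-camera rotation
    t = -(R @ cam_pos.unsqueeze(-1)).squeeze(-1)         # (B, 3) translation
    return R, t
```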

If this is right

  • Models maintain consistent 3D geometry and appearance across multiple input views of the same vehicle.
  • Predicted real-world scales allow the assets to be placed at correct sizes inside driving simulations.
  • Harmonization adapts lighting and texture so generated vehicles blend into target scenes without obvious mismatches.
  • The pipeline operates on ordinary image collections, removing the need for posed or synthetic training data.
  • The resulting assets support downstream tasks such as scene composition and digital-twin construction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same self-supervised loop could be applied to other rigid objects commonly seen in driving footage, such as traffic signs or road furniture.
  • If the camera head generalizes across datasets, collections of dashcam video could be turned into city-scale 3D vehicle libraries with minimal manual labeling.
  • The scale-aware module might be replaced or augmented by direct depth cues from stereo pairs or LiDAR when those sensors are available in the source data.

Load-bearing premise

The camera prediction head can produce sufficiently accurate pose estimates from unposed images to supply reliable photometric supervision via differentiable rendering, without ground-truth camera parameters or additional labels.

What would settle it

Reconstruct a set of vehicles from unposed images, render them from held-out viewpoints or insert them into new scenes, and measure whether photometric error, scale error, or visual inconsistency exceeds the levels obtained when ground-truth poses are supplied during training.
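
A hedged sketch of how that comparison could be scored, assuming held-out-view renders are available from a model trained with predicted poses and from one trained with ground-truth poses; the PSNR formulation is standard, but the tolerance and the relative-scale metric are illustrative choices, not the paper's protocol.

```python
import torch

def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio over held-out renderings (images in [0, max_val])."""
    mse = torch.mean((pred - target) ** 2).clamp_min(1e-12)
    return float(10.0 * torch.log10(max_val ** 2 / mse))

def degradation_report(renders_pred_pose, renders_gt_pose, targets,
                       scale_pred, scale_gt, psnr_tol_db: float = 1.0) -> dict:
    """Compare the predicted-pose pipeline against the ground-truth-pose baseline on
    held-out viewpoints: photometric gap in dB and relative scale error."""
    psnr_pred = psnr(renders_pred_pose, targets)
    psnr_gt = psnr(renders_gt_pose, targets)
    scale_err = float(torch.mean(torch.abs(scale_pred - scale_gt) / scale_gt))
    return {
        "psnr_gap_db": psnr_gt - psnr_pred,           # > 0 means self-supervision is worse
        "relative_scale_error": scale_err,
        "within_tolerance": (psnr_gt - psnr_pred) <= psnr_tol_db,
    }
```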

Figures

Figures reproduced from arXiv: 2604.19257 by Bochao Zou, Cheng Bi, Chen Liu, Haochen Yu, Hongyuan Liu, Huimin Ma, Jianfei Jiang, Jiansheng Chen, Qiankun Liu, Qi Mei, Xueyang Zhang, Yifei Zhan, Zhao Wang.

Figure 1. (a) Our method reconstructs 3D assets directly from real-world images without camera pose annotations. It jointly estimates …
Figure 2. Given input images from arbitrary viewpoints, we first extract visual features using DINOv2 …
Figure 3. Qualitative comparison of reconstruction quality. For fair comparison, both our method and the baselines follow the single-image …
Figure 4. Qualitative comparison of harmonization results. We show multi-view renderings of the inserted assets before and after harmonization.
Figure 5. The two base modules of the backbone network.
Figure 6. Schematic illustration of the camera construction.
Figure 7. Qualitative generalization results of single-image reconstruction on autonomous driving datasets.
Figure 8. Qualitative reconstruction results under different input settings. We present results using a single input view as well as using two …
Figure 9. We visualize the Gaussian points to illustrate geometric quality. Compared to TRELLIS …
Figure 10. Harmonized Training Scene Construction.
Figure 11. Qualitative results of LGM, TGS, DGS on 3DRealCar …
Figure 12. Qualitative results of TRELLIS and ours on 3DRealCar …
Figure 13. Qualitative results of LGM, TGS, DGS on CFV …
Figure 14. Qualitative results of TRELLIS and ours on CFV …
Figure 15. Qualitative results of camera parameters prediction.
Figure 16. Qualitative results of harmonization. The first row presents the pre-harmonization outputs, while the second row shows the …
Original abstract

Creating realistic and simulation-ready 3D assets is crucial for autonomous driving research and virtual environment construction. However, existing 3D vehicle generation methods are often trained on synthetic data with significant domain gaps from real-world distributions. The generated models often exhibit arbitrary poses and undefined scales, resulting in poor visual consistency when integrated into driving scenes. In this paper, we present Unposed-to-3D, a novel framework that learns to reconstruct 3D vehicles from real-world driving images using image-only supervision. Our approach consists of two stages. In the first stage, we train an image-to-3D reconstruction network using posed images with known camera parameters. In the second stage, we remove camera supervision and use a camera prediction head that directly estimates the camera parameters from unposed images. The predicted pose is then used for differentiable rendering to provide self-supervised photometric feedback, enabling the model to learn 3D geometry purely from unposed images. To ensure simulation readiness, we further introduce a scale-aware module to predict real-world size information, and a harmonization module that adapts the generated vehicles to the target driving scene with consistent lighting and appearance. Extensive experiments demonstrate that Unposed-to-3D effectively reconstructs realistic, pose-consistent, and harmonized 3D vehicle models from real-world images, providing a scalable path toward creating high-quality assets for driving scene simulation and digital twin environments.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Unposed-to-3D, a two-stage framework for reconstructing 3D vehicle models from real-world driving images under image-only supervision. Stage 1 trains an image-to-3D network on posed images with known camera parameters. Stage 2 removes explicit camera supervision, introduces a camera prediction head to estimate poses from unposed images, and uses differentiable rendering to supply self-supervised photometric losses for learning geometry and texture. Additional scale-aware and harmonization modules are added to produce simulation-ready assets with real-world scale and scene-consistent appearance. The central claim is that this yields realistic, pose-consistent, harmonized 3D vehicles scalable for driving simulation and digital twins.

Significance. If the self-supervised loop is stable and the predicted poses sufficiently accurate, the approach would address the synthetic-to-real domain gap in 3D asset generation and reduce reliance on expensive pose annotations, offering a scalable pipeline for high-quality simulation assets.

major comments (2)
  1. §3.2 (Stage 2 training): The description of the joint optimization between the camera prediction head and the reconstruction network via photometric rendering gradients provides no details on initialization, pose regularization, warm-start schedule, or independent validation of predicted poses against held-out ground truth. Without these, it is unclear whether the photometric loss can reliably supervise correct 3D geometry rather than degenerate solutions such as flattened shapes compensated by pose shifts.
  2. §4 (Experiments): The claim that 'extensive experiments demonstrate effectiveness' is unsupported by any reported quantitative metrics (e.g., reconstruction PSNR, IoU, pose error, or FID), baselines, ablation studies on the camera head, or failure-case analysis. This absence makes it impossible to verify that the self-supervised loop produces accurate, simulation-ready models on real driving images with varying lighting and occlusion.
minor comments (2)
  1. The abstract and method overview would benefit from explicit dataset names and statistics (e.g., number of real-world images and sources) to contextualize the domain.
  2. Notation for the scale-aware module and harmonization loss could be clarified with a single equation reference rather than descriptive text only.

Simulated Authors' Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback and the opportunity to clarify and improve our manuscript. We address each major comment below and will revise the paper to incorporate additional details and evaluations where feasible.

point-by-point responses
  1. Referee: §3.2 (Stage 2 training): The description of the joint optimization between the camera prediction head and the reconstruction network via photometric rendering gradients provides no details on initialization, pose regularization, warm-start schedule, or independent validation of predicted poses against held-out ground truth. Without these, it is unclear whether the photometric loss can reliably supervise correct 3D geometry rather than degenerate solutions such as flattened shapes compensated by pose shifts.

    Authors: We agree that §3.2 would benefit from greater detail on the Stage 2 procedure. In the revised manuscript we will expand this section to describe: initialization of the camera prediction head from the corresponding branch trained in Stage 1; regularization terms including temporal smoothness on predicted poses and consistency with the scale-aware module; a warm-start schedule in which the reconstruction network is initially frozen while the camera head is trained before joint fine-tuning; and any available pose validation on subsets with partial ground-truth annotations. We will also add explicit discussion of how the combination of photometric self-supervision, scale prediction, and multi-view consistency discourages degenerate solutions such as flattening (a minimal sketch of such a warm-start schedule follows these point-by-point responses). revision: yes

  2. Referee: §4 (Experiments): The claim that 'extensive experiments demonstrate effectiveness' is unsupported by any reported quantitative metrics (e.g., reconstruction PSNR, IoU, pose error, or FID), baselines, ablation studies on the camera head, or failure-case analysis. This absence makes it impossible to verify that the self-supervised loop produces accurate, simulation-ready models on real driving images with varying lighting and occlusion.

    Authors: We acknowledge that the current experimental section relies primarily on qualitative results. In the revision we will augment §4 with quantitative metrics including PSNR/SSIM for photometric reconstruction quality, IoU on projected geometry where feasible, pose prediction error on any available annotated subsets, and FID scores for harmonized outputs. We will add comparisons against relevant baselines, ablation studies isolating the camera prediction head and harmonization module, and a dedicated discussion of failure cases under occlusion and lighting variation. These additions will provide stronger, verifiable support for the claims. revision: yes
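
To make the warm-start schedule from response 1 concrete, here is a minimal sketch assuming standard PyTorch parameter freezing; the phase lengths, optimizer, and learning rates are placeholders, not values from the paper.

```python
import torch

def stage2_phases(recon_net, cam_head, warmup_epochs: int = 5, joint_epochs: int = 20, lr: float = 1e-4):
    """Yield the two stage-2 phases: first train the camera head alone against a frozen
    reconstruction network, then unfreeze everything for joint fine-tuning."""
    for p in recon_net.parameters():
        p.requires_grad_(False)                        # phase 1: camera head only
    warmup_opt = torch.optim.Adam(cam_head.parameters(), lr=lr)
    yield "warmup", warmup_epochs, warmup_opt

    for p in recon_net.parameters():
        p.requires_grad_(True)                         # phase 2: joint fine-tuning
    joint_params = list(recon_net.parameters()) + list(cam_head.parameters())
    yield "joint", joint_epochs, torch.optim.Adam(joint_params, lr=lr * 0.1)
```

Each yielded phase would then wrap an inner loop of steps like the stage-2 training sketch shown earlier on this page.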

standing simulated objections not resolved
  • Independent validation of predicted poses against held-out ground truth is not possible for the core unposed real-world images that lack any pose annotations by design.

Circularity Check

0 steps flagged

No significant circularity; two-stage self-supervision is standard and non-reductive

full rationale

The abstract describes a two-stage pipeline: Stage 1 trains the reconstruction network with explicit known camera parameters; Stage 2 removes that supervision and substitutes a learned camera-prediction head whose outputs drive photometric rendering loss. This is a conventional self-supervised formulation (pose prediction + differentiable rendering) and does not equate the reconstruction output to its own inputs by definition. No quoted equation or module is defined in terms of the quantity it is supposed to predict, no self-citation is invoked as a uniqueness theorem, and no fitted parameter is relabeled as an independent prediction. The method therefore remains open to external validation on real driving images and receives the default non-circular finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on standard assumptions of differentiable rendering and neural-network capacity for joint geometry and pose prediction; no explicit free parameters, new physical entities, or ad-hoc axioms are stated in the abstract.

axioms (1)
  • domain assumption: Differentiable rendering can supply a usable photometric loss for 3D geometry learning when camera parameters are predicted rather than given.
    Invoked in the second training stage to close the self-supervision loop.

pith-pipeline@v0.9.0 · 5590 in / 1375 out tokens · 59516 ms · 2026-05-10T03:29:18.344645+00:00 · methodology


Reference graph

Works this paper leans on

69 extracted references · 17 canonical work pages · 7 internal anchors

  1. [1]

    nuScenes: A multimodal dataset for autonomous driving

    Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621–11631, 2020.

  2. [2]

    Car full view dataset: Fine-grained predictions of car orientation from images

    Andy Catruna, Pavel Betiu, Emanuel Tertes, Vladimir Ghita, Emilian Radoi, Irina Mocanu, and Mihai Dascalu. Car full view dataset: Fine-grained predictions of car orientation from images. Electronics, 12(24):4947, 2023.

  3. [3]

    ShapeNet: An Information-Rich 3D Model Repository

    Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.

  4. [4]

    GeoSim: Realistic video simulation via geometry-aware composition for self-driving

    Yun Chen, Frieda Rong, Shivam Duggal, Shenlong Wang, Xinchen Yan, Sivabalan Manivasagam, Shangjie Xue, Ersin Yumer, and Raquel Urtasun. GeoSim: Realistic video simulation via geometry-aware composition for self-driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7230–7240, 2021.

  5. [5]

    Objaverse-XL: A universe of 10M+ 3D objects

    Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, et al. Objaverse-XL: A universe of 10M+ 3D objects. Advances in Neural Information Processing Systems, 36:35799–35813, 2023.

  6. [6]

    Voxel R-CNN: Towards high performance voxel-based 3D object detection

    Jiajun Deng, Shaoshuai Shi, Peiwei Li, Wengang Zhou, Yanyong Zhang, and Houqiang Li. Voxel R-CNN: Towards high performance voxel-based 3D object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1201–1209, 2021.

  7. [7]

    DreamCar: Leveraging car-specific prior for in-the-wild 3D car reconstruction

    Xiaobiao Du, Haiyang Sun, Ming Lu, Tianqing Zhu, and Xin Yu. DreamCar: Leveraging car-specific prior for in-the-wild 3D car reconstruction. IEEE Robotics and Automation Letters, 2024.

  8. [8]

    3DRealCar: An in-the-wild RGB-D car dataset with 360-degree views

    Xiaobiao Du, Yida Wang, Haiyang Sun, Zhuojie Wu, Hongwei Sheng, Shuyun Wang, Jiaying Ying, Ming Lu, Tianqing Zhu, Kun Zhan, et al. 3DRealCar: An in-the-wild RGB-D car dataset with 360-degree views. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 26488–26498, 2025.

  9. [9]

    GET3D: A generative model of high quality 3D textured shapes learned from images

    Jun Gao, Tianchang Shen, Zian Wang, Wenzheng Chen, Kangxue Yin, Daiqing Li, Or Litany, Zan Gojcic, and Sanja Fidler. GET3D: A generative model of high quality 3D textured shapes learned from images. Advances in Neural Information Processing Systems, 35:31841–31854, 2022.

  10. [10]

    nuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles

    H. Caesar, J. Kabzan, K. Tan, et al. nuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles. In CVPR ADP3 Workshop, 2021.

  11. [11]

    LRM: Large Reconstruction Model for Single Image to 3D

    Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. LRM: Large reconstruction model for single image to 3D. arXiv preprint arXiv:2311.04400, 2023.

  12. [12]

    No pose at all: Self-supervised pose-free 3D Gaussian splatting from sparse views

    Ranran Huang and Krystian Mikolajczyk. No pose at all: Self-supervised pose-free 3D Gaussian splatting from sparse views. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 27947–27957, 2025.

  13. [13]

    MVSMamba: Multi-view stereo with state space model

    Jianfei Jiang, Qiankun Liu, Hongyuan Liu, Haochen Yu, Liyong Wang, Jiansheng Chen, and Huimin Ma. MVSMamba: Multi-view stereo with state space model. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

  14. [14]

    Ultralytics YOLO, 2023

    Glenn Jocher, Jing Qiu, and Ayush Chaurasia. Ultralytics YOLO, 2023.

  15. [15]

    Shap-E: Generating conditional 3D implicit functions

    Heewoo Jun and Alex Nichol. Shap-E: Generating conditional 3D implicit functions. arXiv preprint arXiv:2305.02463, 2023.

  16. [16]

    MADrive: Memory-augmented driving scene modeling

    Polina Karpikova, Daniil Selikhanovych, Kirill Struminsky, Ruslan Musaev, Maria Golitsyna, and Dmitry Baranchuk. MADrive: Memory-augmented driving scene modeling. arXiv preprint arXiv:2506.21520, 2025.

  17. [17]

    GSNet: Joint vehicle pose and shape reconstruction with geometrical and scene-aware supervision

    Lei Ke, Shichao Li, Yanan Sun, Yu-Wing Tai, and Chi-Keung Tang. GSNet: Joint vehicle pose and shape reconstruction with geometrical and scene-aware supervision. In European Conference on Computer Vision, pages 515–532. Springer, 2020.

  18. [18]

    3D Gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), 2023.

  19. [19]

    Habitat Synthetic Scenes Dataset (HSSD-200): An analysis of 3D scene scale and realism tradeoffs for ObjectGoal navigation

    Mukul Khanna, Yongsen Mao, Hanxiao Jiang, Sanjay Haresh, Brennan Shacklett, Dhruv Batra, Alexander Clegg, Eric Undersander, Angel X. Chang, and Manolis Savva. Habitat Synthetic Scenes Dataset (HSSD-200): An analysis of 3D scene scale and realism tradeoffs for ObjectGoal navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

  20. [20]

    Instant3D: Fast text-to-3D with sparse-view generation and large reconstruction model

    Jiahao Li, Hao Tan, Kai Zhang, Zexiang Xu, Fujun Luan, Yinghao Xu, Yicong Hong, Kalyan Sunkavalli, Greg Shakhnarovich, and Sai Bi. Instant3D: Fast text-to-3D with sparse-view generation and large reconstruction model. arXiv preprint arXiv:2311.06214, 2023.

  21. [21]

    Photorealistic object insertion with diffusion-guided inverse rendering

    Ruofan Liang, Zan Gojcic, Merlin Nimier-David, David Acuna, Nandita Vijaykumar, Sanja Fidler, and Zian Wang. Photorealistic object insertion with diffusion-guided inverse rendering. In European Conference on Computer Vision, pages 446–465. Springer, 2024.

  22. [22]

    LucidDreamer: Towards high-fidelity text-to-3D generation via interval score matching

    Yixun Liang, Xin Yang, Jiantao Lin, Haodong Li, Xiaogang Xu, and Yingcong Chen. LucidDreamer: Towards high-fidelity text-to-3D generation via interval score matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6517–6526, 2024.

  23. [23]

    Magic3D: High-resolution text-to-3D content creation

    Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3D: High-resolution text-to-3D content creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 300–309, 2023.

  24. [24]

    InstDrive: Instance-aware 3D Gaussian splatting for driving scenes

    Hongyuan Liu, Haochen Yu, Bochao Zou, Jianfei Jiang, Qiankun Liu, Jiansheng Chen, and Huimin Ma. InstDrive: Instance-aware 3D Gaussian splatting for driving scenes. arXiv preprint arXiv:2508.12015, 2025.

  25. [25]

    ProtoCar: Learning 3D vehicle prototypes from single-view and unconstrained driving scene images

    Hongyuan Liu, Haochen Yu, Bochao Zou, Juntao Lyu, Qi Mei, Jiansheng Chen, and Huimin Ma. ProtoCar: Learning 3D vehicle prototypes from single-view and unconstrained driving scene images. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5460–5468, 2025.

  26. [26]

    One-2-3-45++: Fast single image to 3D objects with consistent multi-view generation and 3D diffusion

    Minghua Liu, Ruoxi Shi, Linghao Chen, Zhuoyang Zhang, Chao Xu, Xinyue Wei, Hansheng Chen, Chong Zeng, Jiayuan Gu, and Hao Su. One-2-3-45++: Fast single image to 3D objects with consistent multi-view generation and 3D diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10072–10083, 2024.

  27. [27]

    Zero-1-to-3: Zero-shot one image to 3D object

    Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3D object. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9298–9309, 2023.

  28. [28]

    Car-Studio: Learning car radiance fields from single-view and unlimited in-the-wild images

    Tianyu Liu, Hao Zhao, Yang Yu, Guyue Zhou, and Ming Liu. Car-Studio: Learning car radiance fields from single-view and unlimited in-the-wild images. IEEE Robotics and Automation Letters, 9(3), 2024.

  29. [29]

    R3D2: Realistic 3D Asset Insertion via Diffusion for Autonomous Driving Simulation

    William Ljungbergh, Bernardo Taveira, Wenzhao Zheng, Adam Tonderski, Chensheng Peng, Fredrik Kahl, Christoffer Petersson, Michael Felsberg, Kurt Keutzer, Masayoshi Tomizuka, et al. R3D2: Realistic 3D asset insertion via diffusion for autonomous driving simulation. arXiv preprint arXiv:2506.07826, 2025.

  30. [30]

    UrbanCAD: Towards highly controllable and photorealistic 3D vehicles for urban scene simulation

    Yichong Lu, Yichi Cai, Shangzhan Zhang, Hongyu Zhou, Haoji Hu, Huimin Yu, Andreas Geiger, and Yiyi Liao. UrbanCAD: Towards highly controllable and photorealistic 3D vehicles for urban scene simulation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 27519–27530, 2025.

  31. [31]

    NeRF: Representing scenes as neural radiance fields for view synthesis

    Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.

  32. [32]

    3D bounding box estimation using deep learning and geometry

    Arsalan Mousavian, Dragomir Anguelov, John Flynn, and Jana Kosecka. 3D bounding box estimation using deep learning and geometry. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7074–7082, 2017.

  33. [33]

    AutoRF: Learning 3D object radiance fields from single view observations

    Norman Müller, Andrea Simonelli, Lorenzo Porzi, Samuel Rota Bulò, Matthias Nießner, and Peter Kontschieder. AutoRF: Learning 3D object radiance fields from single view observations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3971–3980, 2022.

  34. [34]

    Point-E: A System for Generating 3D Point Clouds from Complex Prompts

    Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. Point-E: A system for generating 3D point clouds from complex prompts. arXiv preprint arXiv:2212.08751, 2022.

  35. [35]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

  36. [36]

    Neural scene graphs for dynamic scenes

    Julian Ost, Fahim Mannan, Nils Thuerey, Julian Knodt, and Felix Heide. Neural scene graphs for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2856–2865, 2021.

  37. [37]

    One-step image translation with text-to-image models

    Gaurav Parmar, Taesung Park, Srinivasa Narasimhan, and Jun-Yan Zhu. One-step image translation with text-to-image models. arXiv preprint arXiv:2403.12036, 2024.

  38. [38]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.

  39. [39]

    DreamFusion: Text-to-3D using 2D Diffusion

    Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. DreamFusion: Text-to-3D using 2D diffusion. arXiv preprint arXiv:2209.14988, 2022.

  40. [40]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.

  41. [41]

    Photorealistic text-to-image diffusion models with deep language understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.

  42. [42]

    Structure-from-motion revisited

    Johannes L. Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4104–4113, 2016.

  43. [43]

    GINA-3D: Learning to generate implicit neural assets in the wild

    Bokui Shen, Xinchen Yan, Charles R. Qi, Mahyar Najibi, Boyang Deng, Leonidas Guibas, Yin Zhou, and Dragomir Anguelov. GINA-3D: Learning to generate implicit neural assets in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4913–4926, 2023.

  44. [44]

    Scalability in perception for autonomous driving: Waymo Open Dataset

    Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Ettinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Yu Zhang, Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov. Scalability in perception for autonomous driving: Waymo Open Dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.

  45. [45]

    DreamGaussian: Generative Gaussian splatting for efficient 3D content creation

    Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. DreamGaussian: Generative Gaussian splatting for efficient 3D content creation. arXiv preprint arXiv:2309.16653, 2023.

  46. [46]

    LGM: Large multi-view Gaussian model for high-resolution 3D content creation

    Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang, Gang Zeng, and Ziwei Liu. LGM: Large multi-view Gaussian model for high-resolution 3D content creation. In European Conference on Computer Vision, pages 1–18. Springer, 2024.

  47. [47]

    OpenPCDet: An open-source toolbox for 3D object detection from point clouds

    OpenPCDet Development Team. OpenPCDet: An open-source toolbox for 3D object detection from point clouds. https://github.com/open-mmlab/OpenPCDet, 2020.

  48. [48]

    CADSim: Robust and scalable in-the-wild 3D reconstruction for controllable sensor simulation

    Jingkang Wang, Sivabalan Manivasagam, Yun Chen, Ze Yang, Ioan Andrei Bârsan, Anqi Joyce Yang, Wei-Chiu Ma, and Raquel Urtasun. CADSim: Robust and scalable in-the-wild 3D reconstruction for controllable sensor simulation. arXiv preprint arXiv:2311.01447, 2023.

  49. [49]

    VGGSfM: Visual geometry grounded deep structure from motion

    Jianyuan Wang, Nikita Karaev, Christian Rupprecht, and David Novotny. VGGSfM: Visual geometry grounded deep structure from motion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21686–21697, 2024.

  50. [50]

    VGGT: Visual Geometry Grounded Transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025.

  51. [51]

    DUSt3R: Geometric 3D vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20697–20709, 2024.

  52. [52]

    ProlificDreamer: High-fidelity and diverse text-to-3D generation with variational score distillation

    Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. ProlificDreamer: High-fidelity and diverse text-to-3D generation with variational score distillation. Advances in Neural Information Processing Systems, 36:8406–8441, 2023.

  53. [53]

    Orient Anything: Learning robust object orientation estimation from rendering 3D models

    Zehan Wang, Ziang Zhang, Tianyu Pang, Chao Du, Hengshuang Zhao, and Zhou Zhao. Orient Anything: Learning robust object orientation estimation from rendering 3D models. arXiv preprint arXiv:2412.18605, 2024.

  54. [54]

    DyCrowd: Towards dynamic crowd reconstruction from a large-scene video

    Hao Wen, Hongbo Kang, Jian Ma, Jing Huang, Yuanwang Yang, Haozhe Lin, Yu-Kun Lai, and Kun Li. DyCrowd: Towards dynamic crowd reconstruction from a large-scene video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.

  55. [55]

    MARS: An instance-aware, modular and realistic simulator for autonomous driving

    Zirui Wu, Tianyu Liu, Liyi Luo, Zhide Zhong, Jianteng Chen, Hongmin Xiao, Chao Hou, Haozhe Lou, Yuantao Chen, Runyi Yang, et al. MARS: An instance-aware, modular and realistic simulator for autonomous driving. In CAAI International Conference on Artificial Intelligence, 2023.

  56. [56]

    Structured 3D latents for scalable and versatile 3D generation

    Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, and Jiaolong Yang. Structured 3D latents for scalable and versatile 3D generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21469–21480, 2025.

  57. [57]

    Data-driven 3D voxel patterns for object category recognition

    Yu Xiang, Wongun Choi, Yuanqing Lin, and Silvio Savarese. Data-driven 3D voxel patterns for object category recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1903–1911, 2015.

  58. [58]

    PandaSet: Advanced sensor suite dataset for autonomous driving

    Pengchuan Xiao, Zhenlei Shao, Steven Hao, Zishuo Zhang, Xiaolin Chai, Judy Jiao, Zesong Li, Jian Wu, Kai Sun, Kun Jiang, et al. PandaSet: Advanced sensor suite dataset for autonomous driving. In 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), pages 3095–3101. IEEE, 2021.

  59. [59]

    InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models

    Jiale Xu, Weihao Cheng, Yiming Gao, Xintao Wang, Shenghua Gao, and Ying Shan. InstantMesh: Efficient 3D mesh generation from a single image with sparse-view large reconstruction models. arXiv preprint arXiv:2404.07191, 2024.

  60. [60]

    Street Gaussians: Modeling dynamic urban scenes with Gaussian splatting

    Yunzhi Yan, Haotong Lin, Chenxu Zhou, Weijie Wang, Haiyang Sun, Kun Zhan, Xianpeng Lang, Xiaowei Zhou, and Sida Peng. Street Gaussians: Modeling dynamic urban scenes with Gaussian splatting. In European Conference on Computer Vision, pages 156–173. Springer, 2024.

  61. [61]

    UniSim: A neural closed-loop sensor simulator

    Ze Yang, Yun Chen, Jingkang Wang, Sivabalan Manivasagam, Wei-Chiu Ma, Anqi Joyce Yang, and Raquel Urtasun. UniSim: A neural closed-loop sensor simulator. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1389–1399, 2023.

  62. [62]

    GET3DGS: Generate 3D Gaussians based on points deformation fields

    Haochen Yu, Weixi Gong, Jiansheng Chen, and Huimin Ma. GET3DGS: Generate 3D Gaussians based on points deformation fields. IEEE Transactions on Circuits and Systems for Video Technology, 2024.

  63. [63]

    XYZCylinder: Towards compatible feed-forward 3D Gaussian splatting for driving scenes via unified cylinder lifting method

    Haochen Yu, Qiankun Liu, Hongyuan Liu, Jianfei Jiang, Juntao Lyu, Jiansheng Chen, and Huimin Ma. XYZCylinder: Towards compatible feed-forward 3D Gaussian splatting for driving scenes via unified cylinder lifting method. arXiv preprint arXiv:2510.07856, 2025.

  64. [64]

    Faster Segment Anything: Towards lightweight SAM for mobile applications

    Chaoning Zhang, Dongshen Han, Yu Qiao, Jung Uk Kim, Sung-Ho Bae, Seungkyu Lee, and Choong Seon Hong. Faster Segment Anything: Towards lightweight SAM for mobile applications. arXiv preprint arXiv:2306.14289, 2023.

  65. [65]

    HUGSIM: A real-time, photo-realistic and closed-loop simulator for autonomous driving

    Hongyu Zhou, Longzhong Lin, Jiabao Wang, Yichong Lu, Dongfeng Bai, Bingbing Liu, Yue Wang, Andreas Geiger, and Yiyi Liao. HUGSIM: A real-time, photo-realistic and closed-loop simulator for autonomous driving. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.

  66. [66]

    Triplane meets Gaussian splatting: Fast and generalizable single-view 3D reconstruction with transformers

    Zi-Xin Zou, Zhipeng Yu, Yuan-Chen Guo, Yangguang Li, Ding Liang, Yan-Pei Cao, and Song-Hai Zhang. Triplane meets Gaussian splatting: Fast and generalizable single-view 3D reconstruction with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10324–10335, 2024.

  67. [67]

    Implementation Details — Network Architectures (internal anchor)

    Appendix §6.1, Network Architectures: the two base modules of the backbone network, a texture block and a geometry block (Figure 5). Aggregation Module. We adopt the DINOv2-base [35] encoder and uniformly resize all input images to 224×224. A learnable camera embedding is introduced as a single to...

  68. [68]

    More Results — Zero-shot in Driving Scenario (internal anchor)

    As shown in Figure 7, we provide additional qualitative results on autonomous driving datasets. Relying solely on single-view inputs, our method is able to recover high-quality vehicle models from real driving environments, demonstrating its strong generalization capability and practical applicability. More C...

  69. [69]

    Limitations and Future Works (internal anchor)

    Our method is unable to handle input images that exhibit geometric distortions; severe non-uniform stretching or compression typically leads to corresponding deformations in the reconstructed 3D model. In addition, we do not explicitly model shadows cast by environmental illumination. The realism of asset insertion can st...