Objaverse-XL: A Universe of 10M+ 3D Objects
Pith reviewed 2026-05-17 12:56 UTC · model grok-4.3
The pith
Objaverse-XL supplies over 10 million 3D objects that let models like Zero123 reach strong zero-shot generalization on novel view synthesis.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Objaverse-XL comprises over 10 million deduplicated 3D objects drawn from manually designed models, photogrammetry scans of landmarks and everyday items, and professional scans of historic artifacts. Training Zero123 on novel view synthesis with more than 100 million multi-view rendered images from this collection yields strong zero-shot generalization abilities.
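For concreteness, here is a minimal sketch of the relative-pose conditioning used by Zero123-style novel-view-synthesis models, assuming the standard spherical camera parameterization (polar angle, azimuth, radius); the function name and the example values are illustrative, not taken from the paper.

```python
import math

def relative_pose_embedding(theta_src, phi_src, r_src,
                            theta_tgt, phi_tgt, r_tgt):
    """4-vector relative-camera conditioning in the style of Zero123.

    theta: polar (elevation-like) angle in radians, phi: azimuth in
    radians, r: camera distance. The azimuth delta is passed through
    sin/cos so the embedding stays continuous across the 0 / 2*pi seam.
    """
    return [
        theta_tgt - theta_src,
        math.sin(phi_tgt - phi_src),
        math.cos(phi_tgt - phi_src),
        r_tgt - r_src,
    ]

# e.g., a target view rotated 90 degrees in azimuth at the same elevation/radius:
cond = relative_pose_embedding(0.5, 0.0, 1.8, 0.5, math.pi / 2, 1.8)
```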
What carries the argument
The Objaverse-XL dataset itself: over 10 million diverse 3D objects, supplying the volume and variety needed to render more than 100 million multi-view training images for 3D models.
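A minimal sketch of how multi-view renders are typically produced at this scale: sample camera positions on a sphere around a normalized object and build look-at extrinsics. The paper renders with Blender; the NumPy version below only illustrates the pose geometry, and the sampling ranges are assumptions rather than the paper's documented settings.

```python
import numpy as np

def look_at(eye, target, up):
    """World-to-camera extrinsics [R | t] for a camera at `eye` facing `target`."""
    forward = target - eye
    forward /= np.linalg.norm(forward)
    right = np.cross(forward, up)
    right /= np.linalg.norm(right)
    true_up = np.cross(right, forward)
    R = np.stack([right, true_up, -forward])    # rows are camera axes in world coords
    return np.hstack([R, (-R @ eye)[:, None]])  # 3x4 extrinsic matrix

def sample_camera(rng, r_min=1.5, r_max=2.2):
    """Sample one camera pose on a sphere around a unit-normalized object."""
    azimuth = rng.uniform(0.0, 2.0 * np.pi)
    elevation = rng.uniform(-np.pi / 6, np.pi / 3)  # assumed, not the paper's range
    radius = rng.uniform(r_min, r_max)
    eye = radius * np.array([
        np.cos(elevation) * np.cos(azimuth),
        np.cos(elevation) * np.sin(azimuth),
        np.sin(elevation),
    ])
    return look_at(eye, target=np.zeros(3), up=np.array([0.0, 0.0, 1.0]))

rng = np.random.default_rng(0)
views = [sample_camera(rng) for _ in range(12)]  # e.g., 12 views per object
```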
If this is right
- Zero123 exhibits stronger zero-shot performance on novel view synthesis when trained at the scale enabled by Objaverse-XL.
- 3D vision models gain access to training volumes previously unavailable, mirroring data-scaling benefits seen in 2D vision.
- Diversity across object sources supports generalization to varied object types and scanning styles.
- The dataset supports additional large-scale experiments that were previously limited by data availability.
Where Pith is reading between the lines
- The same scale of rendered multi-view data could be tested on other 3D tasks such as reconstruction or conditional generation.
- Pretrained models derived from this data volume may transfer to downstream applications like robotics or AR that require 3D understanding.
- Future work could measure how much further performance improves when the dataset size grows beyond the current 10 million objects.
Load-bearing premise
Observed gains in zero-shot performance stem primarily from the scale and diversity of the 3D objects rather than from other training choices or evaluation details that were not isolated.
What would settle it
Retraining Zero123 with the same procedure but on a much smaller subset of the objects, and finding that zero-shot generalization remains comparable, would falsify the central claim.
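A sketch of how such a subset ablation could be set up, holding everything but object count fixed; `train_zero123`, `evaluate_zero_shot_nvs`, and the config names in the comments are hypothetical placeholders, not the authors' code. Only the nested-subset logic is the point.

```python
import random

def object_subsets(object_ids, sizes=(10_000, 100_000, 1_000_000), seed=0):
    """Nested random subsets: each smaller training set is contained in the
    larger ones, so successive runs differ only in object count."""
    rng = random.Random(seed)
    shuffled = list(object_ids)
    rng.shuffle(shuffled)
    return {n: shuffled[:n] for n in sizes if n <= len(shuffled)}

# Hypothetical driver: rendering, optimizer, and evaluation configs stay fixed
# across runs, so any change in zero-shot NVS metrics tracks object count alone.
# for n, ids in object_subsets(all_ids).items():
#     model = train_zero123(ids, render_cfg=FIXED_RENDER, optim_cfg=FIXED_OPTIM)
#     report(n, evaluate_zero_shot_nvs(model, FIXED_EVAL_SET))
```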
Original abstract
Natural language processing and 2D vision models have attained remarkable proficiency on many tasks primarily by escalating the scale of training data. However, 3D vision tasks have not seen the same progress, in part due to the challenges of acquiring high-quality 3D data. In this work, we present Objaverse-XL, a dataset of over 10 million 3D objects. Our dataset comprises deduplicated 3D objects from a diverse set of sources, including manually designed objects, photogrammetry scans of landmarks and everyday items, and professional scans of historic and antique artifacts. Representing the largest scale and diversity in the realm of 3D datasets, Objaverse-XL enables significant new possibilities for 3D vision. Our experiments demonstrate the improvements enabled with the scale provided by Objaverse-XL. We show that by training Zero123 on novel view synthesis, utilizing over 100 million multi-view rendered images, we achieve strong zero-shot generalization abilities. We hope that releasing Objaverse-XL will enable further innovations in the field of 3D vision at scale.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Objaverse-XL, a dataset of over 10 million deduplicated 3D objects drawn from manual designs, photogrammetry scans of landmarks and everyday items, and professional scans of historic artifacts. It positions this as the largest and most diverse 3D dataset to date. The central empirical demonstration renders over 100 million multi-view images from these objects and trains Zero123 on novel view synthesis, reporting strong zero-shot generalization.
Significance. If the reported gains are attributable to dataset scale, Objaverse-XL supplies a valuable public resource that could enable the same kind of scaling progress in 3D vision that large corpora have produced in NLP and 2D vision. The open release of 10M+ objects together with the 100M+ rendered views is a concrete community asset; the authors deserve credit for the curation effort and for making the data available.
Major comments (2)
- [Experiments] Experiments section: the claim that 'improvements enabled with the scale provided by Objaverse-XL' are demonstrated by training Zero123 on >100M renders is not supported by a controlled comparison. No ablation is described that holds the rendering pipeline (camera sampling, lighting, resolution), optimizer, and evaluation protocol fixed while varying only the source dataset or its size (e.g., original Objaverse vs. Objaverse-XL subsets). Consequently the zero-shot gains cannot be isolated from unablated training or rendering choices.
- [Experiments] Experiments section: quantitative results for the Zero123 zero-shot novel-view-synthesis task lack baselines with exactly matching settings, ablation controls, and measures of statistical significance (e.g., standard errors or multiple runs). This weakens the scaling demonstration.
Minor comments (2)
- [Abstract] Abstract: the quantitative improvements (e.g., specific metrics on zero-shot NVS) are not stated; adding one or two headline numbers would strengthen the summary.
- [Dataset] Dataset section: the deduplication procedure and any quantitative measure of diversity (e.g., category coverage or geometric variation statistics) should be described more explicitly.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review of our manuscript. We address each major comment below and indicate the revisions we will make to strengthen the presentation of our experimental results.
Point-by-point responses
Referee: [Experiments] Experiments section: the claim that 'improvements enabled with the scale provided by Objaverse-XL' are demonstrated by training Zero123 on >100M renders is not supported by a controlled comparison. No ablation is described that holds the rendering pipeline (camera sampling, lighting, resolution), optimizer, and evaluation protocol fixed while varying only the source dataset or its size (e.g., original Objaverse vs. Objaverse-XL subsets). Consequently the zero-shot gains cannot be isolated from unablated training or rendering choices.
Authors: We appreciate the referee highlighting the value of a controlled ablation. The manuscript's experiments focus on demonstrating what becomes possible at the scale of Objaverse-XL by training Zero123 on over 100 million rendered views, achieving strong zero-shot generalization. While we did not include an explicit ablation that retrains with identical rendering, optimization, and evaluation settings on the original Objaverse versus Objaverse-XL subsets, the primary contribution is the public release of this much larger and more diverse dataset. In the revised manuscript we will add a dedicated paragraph in the experiments section that (1) explicitly compares the data scale and object diversity to the original Objaverse used in prior Zero123 work and (2) clarifies that all rendering parameters (camera sampling, lighting, resolution) are fully documented so that future controlled studies can be performed. We believe this addresses the isolation concern without overstating the current evidence.
Revision: yes
Referee: [Experiments] Experiments section: quantitative results for the Zero123 zero-shot novel-view-synthesis task lack baselines with exactly matching settings, ablation controls, and measures of statistical significance (e.g., standard errors or multiple runs). This weakens the scaling demonstration.
Authors: We agree that clearer reporting of baselines and any available measures of variability would improve the manuscript. The current results reflect the performance obtained when training on the full Objaverse-XL scale. In the revision we will expand the experimental details to include a side-by-side numerical comparison with the original Zero123 numbers, explicitly noting any differences in training settings or data volume. Because retraining the full 100-million-image model multiple times is computationally prohibitive, we will add an explicit limitations paragraph stating that results are from single runs and that statistical significance testing was not performed; we will also report any variance observed in smaller-scale pilot experiments, where available. These changes will make the strength and limitations of the scaling demonstration more transparent.
Revision: yes
Circularity Check
No circularity: empirical dataset release and scaling observation
Full rationale
The paper releases Objaverse-XL (10M+ 3D objects from diverse sources) and reports that training Zero123 on >100M multi-view renders from it yields strong zero-shot novel view synthesis. No derivation chain, equations, or 'predictions' are claimed. The central claim is an empirical scaling result, not a reduction of any output to fitted inputs or self-citations by construction. No self-definitional steps, uniqueness theorems, or ansatzes appear. The contribution is self-contained as a data release plus observed performance gains.
Axiom & Free-Parameter Ledger
Axioms (1)
- [Domain assumption] Deduplication across heterogeneous 3D sources removes near-duplicates without discarding useful diversity.
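As one concrete reading of this axiom, here is a minimal sketch of exact-duplicate removal by content hashing. The paper's actual deduplication procedure is not described on this page, and catching near-duplicates (e.g., re-exports or lightly edited meshes) would require geometric fingerprints beyond this sketch.

```python
import hashlib

def content_key(mesh_bytes: bytes) -> str:
    """Exact-duplicate key: SHA-256 over the raw mesh file bytes."""
    return hashlib.sha256(mesh_bytes).hexdigest()

def deduplicate(paths):
    """Keep the first file seen for each content hash; drop byte-identical copies."""
    seen, kept = set(), []
    for path in paths:
        with open(path, "rb") as f:
            key = content_key(f.read())
        if key not in seen:
            seen.add(key)
            kept.append(path)
    return kept
```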
Forward citations
Cited by 16 Pith papers
- Mix3R: Mixing Feed-forward Reconstruction and Generative 3D Priors for Joint Multi-view Aligned 3D Reconstruction and Pose Estimation. Mix3R mixes feed-forward reconstruction and generative 3D priors via Mixture-of-Transformers and overlap-based attention bias to achieve better-aligned 3D shapes and more accurate poses than either approach alone.
- Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors. A video generation approach conditions a base model with multi-scale 3D latent features and a cross-attention adapter to produce geometrically realistic and consistent orbital videos from one image.
- MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos. MoCapAnything reconstructs asset-specific BVH animations from monocular video by predicting 3D joint trajectories, then applying constraint-aware inverse kinematics guided by a reference prompt encoder.
- DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation. DreamGaussian creates high-quality textured 3D meshes from single-view images in 2 minutes via generative Gaussian Splatting with mesh extraction and UV refinement.
- Velox: Learning Representations of 4D Geometry and Appearance. Velox compresses dynamic point clouds into latent tokens that support geometry via 4D surface modeling and appearance via 3D Gaussians, showing strong results on video-to-4D generation, tracking, and image-to-4D cloth...
- 3D-ReGen: A Unified 3D Geometry Regeneration Framework. 3D-ReGen is a conditioned 3D regenerator using VecSet that learns a regeneration prior from unlabeled 3D datasets via self-supervised tasks and achieves state-of-the-art results on controllable 3D geometry tasks.
- Reconstruction by Generation: 3D Multi-Object Scene Reconstruction from Sparse Observations. RecGen achieves state-of-the-art 3D multi-object scene reconstruction from sparse RGB-D views by combining compositional synthetic scene generation with strong 3D shape priors, outperforming SAM3D by 30%+ in shape quality.
- PhyEdit: Towards Real-World Object Manipulation via Physically-Grounded Image Editing. PhyEdit improves physical accuracy in image object manipulation by using explicit geometric simulation as 3D-aware guidance combined with joint 2D-3D supervision.
- CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation. CamCo equips image-to-video generators with Plücker-coordinate camera inputs and epipolar attention to improve 3D consistency and camera controllability.
- DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset. DROID is a new 76k-trajectory in-the-wild robot manipulation dataset spanning 564 scenes and 84 tasks that improves policy performance and generalization when used for training.
- Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets. Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results.
- SyncDreamer: Generating Multiview-consistent Images from a Single-view Image. SyncDreamer produces multiview-consistent images from a single input image by jointly modeling their distribution and synchronizing intermediate diffusion states via 3D-aware attention.
- MVDream: Multi-view Diffusion for 3D Generation. MVDream is a multi-view diffusion model that functions as a generalizable 3D prior, enabling more consistent text-to-3D generation and few-shot 3D concept learning from 2D examples.
- Syn4D: A Multiview Synthetic 4D Dataset. Syn4D is a new multiview synthetic 4D dataset supplying dense ground-truth annotations for dynamic scene reconstruction, tracking, and human pose estimation.
- Zero123++: a Single Image to Consistent Multi-view Diffusion Base Model. Zero123++ produces high-quality 3D-consistent multi-view images from a single input by fine-tuning Stable Diffusion with targeted conditioning and training methods.
- LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation. This review organizes literature on large multimodal models and object-centric vision into four themes (understanding, referring segmentation, editing, and generation) while summarizing paradigms, strategies, and challenges.
Reference graph
Works this paper leans on
- [1] Common Crawl. URL https://commoncrawl.org/the-data/.
- [3] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
- [4] L. Biewald. Experiment tracking with Weights and Biases, 2020. URL https://www.wandb.com/.
- [5] Blender Online Community. Blender - a 3D modelling and rendering package. https://www.blender.org, 2023.
- [6] M. Bostock, V. Ogievetsky, and J. Heer. D3: Data-driven documents. IEEE Transactions on Visualization and Computer Graphics, 2011.
- [9] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.
- [10] W. Chen, H. Ling, J. Gao, E. Smith, J. Lehtinen, A. Jacobson, and S. Fidler. Learning to predict 3D objects with an interpolation-based differentiable renderer. Advances in Neural Information Processing Systems, 32, 2019.
- [12] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. In Computer Vision - ECCV 2016, Part VIII, pages 628–644. Springer, 2016.
- [13] J. Collins, S. Goel, K. Deng, A. Luthra, L. Xu, E. Gundogdu, X. Zhang, T. F. Y. Vicente, T. Dideriksen, H. Arora, et al. ABO: Dataset and benchmarks for real-world 3D object understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21126–21136, 2022.
- [15] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
- [16] K. Deng, A. Liu, J.-Y. Zhu, and D. Ramanan. Depth-supervised NeRF: Fewer views and faster training for free. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12882–12891, 2022.
- [18] W. Falcon and The PyTorch Lightning team. PyTorch Lightning, Mar. 2019. URL https://github.com/Lightning-AI/lightning.
- [19] H. Fu, R. Jia, L. Gao, M. Gong, B. Zhao, S. Maybank, and D. Tao. 3D-FUTURE: 3D furniture shape with texture. International Journal of Computer Vision, 129:3313–3337, 2021.
- [22] G. Gkioxari, J. Malik, and J. Johnson. Mesh R-CNN. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9785–9795, 2019.
- [23] C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, et al. Array programming with NumPy. Nature, 585:357–362, 2020.
- [24] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
- [25] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
- [26] J. D. Hunter. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3):90–95, 2007. doi: 10.1109/MCSE.2007.55.
- [27] A. Jain, M. Tancik, and P. Abbeel. Putting NeRF on a diet: Semantically consistent few-shot view synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5885–5894, 2021.
- [28] H. Jun and A. Nichol. Shap-E: Generating conditional 3D implicit functions. arXiv preprint arXiv:2305.02463, 2023.
- [29] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
- [30] H. Kato, Y. Ushiku, and T. Harada. Neural 3D mesh renderer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3907–3916, 2018.
- [31] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al. Segment Anything. arXiv preprint arXiv:2304.02643, 2023.
- [32] J. J. Lim, H. Pirsiavash, and A. Torralba. Parsing IKEA objects: Fine pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2992–2999, 2013.
- [33] C.-H. Lin, J. Gao, L. Tang, T. Takikawa, X. Zeng, X. Huang, K. Kreis, S. Fidler, M.-Y. Liu, and T.-Y. Lin. Magic3D: High-resolution text-to-3D content creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 300–309, 2023.
- [36] R. Liu, R. Wu, B. Van Hoorick, P. Tokmakov, S. Zakharov, and C. Vondrick. Zero-1-to-3: Zero-shot one image to 3D object, 2023.
- [38] L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger. Occupancy networks: Learning 3D reconstruction in function space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4460–4470, 2019.
- [39] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
- [40] G. A. Miller. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41, 1995.
- [41] D. Morrison, P. Corke, and J. Leitner. EGAD! An evolved grasping analysis dataset for diversity and reproducibility in robotic manipulation. IEEE Robotics and Automation Letters, 5(3):4368–4375, 2020.
- [42] A. Nichol, H. Jun, P. Dhariwal, P. Mishkin, and M. Chen. Point-E: A system for generating 3D point clouds from complex prompts. arXiv preprint arXiv:2212.08751, 2022.
- [45] The pandas development team. pandas-dev/pandas: Pandas, Feb. 2020. URL https://doi.org/10.5281/zenodo.3509134.
- [46] K. Park, K. Rematas, A. Farhadi, and S. M. Seitz. PhotoShape: Photorealistic materials for large-scale shape collections. arXiv preprint arXiv:1809.09761, 2018.
- [48] B. Poole, A. Jain, J. T. Barron, and B. Mildenhall. DreamFusion: Text-to-3D using 2D diffusion. arXiv preprint arXiv:2209.14988, 2022.
- [49] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
- [50] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- [51] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
- [52] N. Ravi, J. Reizenstein, D. Novotny, T. Gordon, W.-Y. Lo, J. Johnson, and G. Gkioxari. Accelerating 3D deep learning with PyTorch3D. arXiv preprint arXiv:2007.08501, 2020.
- [53] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28, 2015.
- [54] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
- [55] C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022.
- [56] J. Tang. Stable-DreamFusion: Text-to-3D with Stable Diffusion, 2022. https://github.com/ashawkey/stable-dreamfusion.
- [57] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- [58] H. Wang, X. Du, J. Li, R. A. Yeh, and G. Shakhnarovich. Score Jacobian chaining: Lifting pretrained 2D diffusion models for 3D generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12619–12629, 2023.
- [59] N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu, and Y.-G. Jiang. Pixel2Mesh: Generating 3D mesh models from single RGB images. In Proceedings of the European Conference on Computer Vision (ECCV), pages 52–67, 2018.
- [60] Q. Wang, Z. Wang, K. Genova, P. P. Srinivasan, H. Zhou, J. T. Barron, R. Martin-Brualla, N. Snavely, and T. Funkhouser. IBRNet: Learning multi-view image-based rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2021.
- [61] M. L. Waskom. seaborn: statistical data visualization. Journal of Open Source Software, 6(60):3021, 2021. doi: 10.21105/joss.03021. URL https://doi.org/10.21105/joss.03021.
- [63] T. Wu, J. Zhang, X. Fu, Y. Wang, J. Ren, L. Pan, W. Wu, L. Yang, J. Wang, C. Qian, D. Lin, and Z. Liu. OmniObject3D: Large-vocabulary 3D object dataset for realistic perception, reconstruction and generation. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
- [64] A. Yu, V. Ye, M. Tancik, and A. Kanazawa. pixelNeRF: Neural radiance fields from one or few images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4578–4587, 2021.
- [66] C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat, P. Yu, L. Yu, et al. LIMA: Less is more for alignment. arXiv preprint arXiv:2305.11206, 2023.
- [67] Q. Zhou and A. Jacobson. Thingi10K: A dataset of 10,000 3D-printing models. arXiv preprint arXiv:1605.04797, 2016.
From the paper's appendix (A.1 Zero123-XL implementation details): a batch size of 2048 is used during training with a learning rate of 1e-4; different from the original paper [36], a second-stage finetuning with a smaller learning rate of 5e-5 was performed on a h...