arxiv: 2505.22394 · v2 · pith:7RBHWKFMnew · submitted 2025-05-28 · 💻 cs.CV

PacTure: Efficient PBR Texture Generation on Packed Views with Visual Autoregressive Models

Pith reviewed 2026-05-19 13:22 UTC · model grok-4.3

classification 💻 cs.CV
keywords PBR texture generation, view packing, text-to-3D texture, visual autoregressive models, multi-view generation, 3D mesh texturing, physically based materials, efficient texture synthesis

The pith

PacTure packs multiple views into single images to generate consistent high-resolution PBR textures for 3D meshes from text more efficiently.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PacTure to create physically-based rendering textures on untextured 3D meshes from a text prompt. Current approaches either generate one view after another, which takes a long time and produces textures that do not match across the object, or generate all views together but at lower detail per view. PacTure instead packs several views together into one image so that the model sees them at higher effective resolution while keeping surface points that are close on the 3D shape close together in 2D. It then adapts a visual autoregressive model to control and produce several material properties at once. If this works, texturing 3D models for games or design could become both faster and more reliable without needing new model training.

Core claim

PacTure shows that packing multiple rendered views of a mesh into one composite image raises the usable resolution for each view during multi-view generation while preserving the spatial relationships the image model needs to keep textures coherent across the 3D surface. When this packing is combined with fine-grained control inside a next-scale autoregressive backbone, the same model can output multiple PBR channels together at lower total inference cost than sequential per-view or cross-attention methods.

What carries the argument

View packing: the arrangement of several camera views of the mesh into one composite 2D image so that nearby surface regions remain near each other in the layout, allowing any standard 2D generative model to reason about texture continuity without retraining or added cost.
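
As a rough illustration of the idea, the sketch below crops each view to a rectangle and places the rectangles on horizontal shelves in a shared atlas. It is not the paper's packing operator, which this page does not define; the shelf heuristic, the `shelf_pack` name, and the `padding` parameter are assumptions made for illustration. Each view stays a contiguous rectangle, so pixels that are neighbors within a view remain neighbors in the atlas.

```python
from typing import List, Tuple

Rect = Tuple[int, int]                  # (width, height) of one cropped view
Placement = Tuple[int, int, int, int]   # (x, y, width, height) inside the atlas


def shelf_pack(rects: List[Rect], atlas_width: int,
               padding: int = 0) -> Tuple[List[Placement], int]:
    """Pack rectangles left to right on horizontal shelves (hypothetical helper).

    Returns the placements in input order and the atlas height that was used.
    """
    # Tallest rectangles first, so each shelf wastes less vertical space.
    order = sorted(range(len(rects)), key=lambda i: rects[i][1], reverse=True)
    placements: List[Placement] = [(0, 0, 0, 0)] * len(rects)
    x = y = shelf_height = 0
    for i in order:
        w, h = rects[i]
        cell_w, cell_h = w + 2 * padding, h + 2 * padding
        if cell_w > atlas_width:
            raise ValueError("rectangle is wider than the atlas")
        if x + cell_w > atlas_width:    # no room left: open a new shelf
            y += shelf_height
            x = shelf_height = 0
        placements[i] = (x + padding, y + padding, w, h)
        x += cell_w
        shelf_height = max(shelf_height, cell_h)
    return placements, y + shelf_height


# Four hypothetical view crops packed into a 512-pixel-wide atlas.
views = [(300, 500), (200, 400), (400, 300), (100, 100)]
placements, used_height = shelf_pack(views, atlas_width=512)
```

A non-zero `padding` leaves a gutter around every view, which is one possible way to keep neighboring views from bleeding into each other.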

If this is right

  • Global texture consistency improves because the model processes multiple views in one forward pass and can reason about continuity across packed regions.
  • Inference time drops because several views are handled in fewer model evaluations while each view still receives higher effective resolution.
  • The method remains compatible with any existing 2D generative model without architectural changes or retraining.
  • Multiple PBR properties such as albedo, normals, roughness, and metallic can be produced together through the same autoregressive steps.
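
The resolution point in the second bullet can be made concrete with a small calculation. The sketch assumes a canvas of fixed size and that all views are scaled uniformly; the coverage values are hypothetical and do not come from the paper. If foreground pixels cover a fraction c of the canvas, the number of useful texels scales with c, so the linear texel density scales with the square root of c.

```python
import math


def linear_resolution_gain(tiled_coverage: float, packed_coverage: float) -> float:
    """Ratio of linear texel density, packed layout over regular tiling.

    Both arguments are the fraction of canvas pixels that land on the object
    (hypothetical inputs; measure them on your own renders).
    """
    for c in (tiled_coverage, packed_coverage):
        if not 0.0 < c <= 1.0:
            raise ValueError("coverage must be in (0, 1]")
    return math.sqrt(packed_coverage / tiled_coverage)


# Hypothetical example: 25% foreground when tiled, 64% when packed.
gain = linear_resolution_gain(0.25, 0.64)
```

With these illustrative numbers the packed layout gives about 1.6 times the linear resolution per view on the same canvas, at the same generation cost.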

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same packing idea could be tested on other multi-view tasks such as consistent lighting or material editing across a mesh.
  • If the packing layout were made adaptive to mesh shape rather than fixed, it might reduce wasted space on highly irregular objects.
  • Applying the technique to higher base resolutions could show whether the efficiency gain scales or whether packing density eventually limits quality.

Load-bearing premise

That packing views together keeps the spatial proximity the model needs for coherent generation and does not create new boundary artifacts or force the model to be retrained.

What would settle it

Generate textures for a simple closed mesh such as a sphere using a single text prompt, unwrap the result, and inspect whether seams or color shifts appear exactly along the boundaries where the packed views meet on the 3D surface.
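
One way to turn this inspection into a number is to compare the color jump across view boundaries with the jump inside a single view. The sketch below assumes a single-channel texture stored as nested lists and a hypothetical label map that records which view supplied each texel; neither format is prescribed by the paper.

```python
from typing import List, Tuple


def seam_vs_interior(texture: List[List[float]],
                     labels: List[List[int]]) -> Tuple[float, float]:
    """Mean absolute difference between adjacent texels, split into pairs
    that straddle a view boundary and pairs that lie inside one view."""
    seam_sum = interior_sum = 0.0
    seam_n = interior_n = 0
    height, width = len(texture), len(texture[0])
    for y in range(height):
        for x in range(width):
            # Compare with the right and bottom neighbors only,
            # so every adjacent pair is counted exactly once.
            for dy, dx in ((0, 1), (1, 0)):
                ny, nx = y + dy, x + dx
                if ny >= height or nx >= width:
                    continue
                diff = abs(texture[y][x] - texture[ny][nx])
                if labels[y][x] != labels[ny][nx]:
                    seam_sum += diff
                    seam_n += 1
                else:
                    interior_sum += diff
                    interior_n += 1
    seam = seam_sum / seam_n if seam_n else 0.0
    interior = interior_sum / interior_n if interior_n else 0.0
    return seam, interior


# Toy example: two views with a visible brightness shift between them.
texture = [[0.2, 0.2, 0.8, 0.8],
           [0.2, 0.2, 0.8, 0.8]]
labels = [[0, 0, 1, 1],
          [0, 0, 1, 1]]
seam, interior = seam_vs_interior(texture, labels)
```

A seam value well above the interior value suggests a visible boundary; similar values suggest that the views blend smoothly.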

Figures

Figures reproduced from arXiv: 2505.22394 by Boxin Shi, Fan Fei, Fei-Peng Tian, Jiajun Tang, Ping Tan.

Figure 1: We propose view packing to compactly pack multi-view maps onto the atlas as the condition and target maps for image generative models used in texturing. This technique significantly increases the effective resolution for each view without increasing generation cost by reducing background pixels that do not contribute to the final texture. Unlike the compact but hard-to-read UV maps, the packed views retain…

Figure 2: An overview of our pipeline comprising the following steps. 1) Given the input untextured…

Figure 3: Comparison of our proposed view packing and the traditional regular view tiling.

Figure 4: Qualitative comparison between PacTure and baseline texturing methods. We show the…

Original abstract

We present PacTure, a novel framework for generating physically-based rendering (PBR) material textures for an untextured 3D mesh from a text description. Existing 2D generation-based texturing approaches either generate textures sequentially from different views, resulting in long inference times and globally inconsistent textures, or adopt multi-view generation with cross-view attention to enhance global consistency, which, however, limits the resolution for each view. In response to these weaknesses, we first introduce view packing, a novel technique that significantly increases the effective resolution for each view during multi-view generation, without imposing additional inference cost. Unlike UV mapping, it preserves the spatial proximity essential for image generation and maintains full compatibility with current 2D generative models. To further reduce the inferencing cost, we enable fine-grained control and multi-domain generation within the next-scale prediction autoregressive framework, creating an efficient multi-view PBR generation backbone. Extensive experiments show that PacTure outperforms state-of-the-art methods in both quality and efficiency.
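
The efficiency argument rests on next-scale prediction, where an autoregressive model emits a whole token map per step, from coarse to fine, instead of one token per step. The sketch below counts steps under that scheme. The schedule of token-map side lengths is a hypothetical example, not a value taken from the paper.

```python
from typing import Sequence, Tuple


def next_scale_summary(side_lengths: Sequence[int]) -> Tuple[int, int, int]:
    """Return (next-scale steps, total tokens over all scales,
    steps a raster-order model would need for the final map alone)."""
    steps = len(side_lengths)                      # one step per scale
    total_tokens = sum(s * s for s in side_lengths)
    raster_steps = side_lengths[-1] ** 2           # one step per final token
    return steps, total_tokens, raster_steps


# Hypothetical coarse-to-fine schedule ending in a 16 x 16 token map.
schedule = (1, 2, 3, 4, 5, 6, 8, 10, 13, 16)
steps, total_tokens, raster_steps = next_scale_summary(schedule)
```

For this schedule the model runs 10 sequential steps over 680 tokens in total, whereas a token-by-token model would need 256 sequential steps for the final map alone.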

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents PacTure, a framework for text-driven PBR texture generation on untextured 3D meshes. It proposes view packing to increase per-view resolution in multi-view generation while remaining compatible with existing 2D autoregressive models, combined with fine-grained control in a next-scale prediction backbone for efficient multi-domain (albedo, normal, roughness, metallic) output. The central claim is that this yields higher quality and lower inference cost than prior sequential or cross-attention multi-view methods.

Significance. If the view-packing operator demonstrably avoids boundary artifacts while delivering the claimed resolution and consistency gains, the work would offer a practical efficiency improvement for PBR texturing pipelines in computer graphics and 3D content creation.

major comments (2)
  1. [§3] §3 (View Packing): the claim that packing 'preserves the spatial proximity essential for image generation' and introduces no artifacts is load-bearing for the efficiency-without-retraining argument, yet the description provides no explicit operator definition or boundary-handling mechanism; without this, it is unclear whether seams are mitigated or merely assumed harmless for the autoregressive backbone.
  2. [§4] §4 / Table 2 (Experiments): the abstract and results section assert outperformance in quality and efficiency, but no quantitative values (e.g., FID, LPIPS, cross-view consistency, or seam-visibility scores) or ablation on packing density are referenced; this prevents verification that the packing operator actually resolves the spatial-coherence risk raised by the stress-test note.
minor comments (2)
  1. [§3.1] Clarify the precise packing layout (e.g., grid size, padding strategy) and its compatibility guarantees with the specific autoregressive tokenizer used.
  2. [§2] Add a short paragraph contrasting view packing with UV-based alternatives in terms of inference cost and model compatibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We provide detailed responses to each major comment and outline the revisions we will make to address the concerns about the view packing operator and experimental quantification.

Point-by-point responses
  1. Referee: [§3] §3 (View Packing): the claim that packing 'preserves the spatial proximity essential for image generation' and introduces no artifacts is load-bearing for the efficiency-without-retraining argument, yet the description provides no explicit operator definition or boundary-handling mechanism; without this, it is unclear whether seams are mitigated or merely assumed harmless for the autoregressive backbone.

    Authors: We are grateful to the referee for this insightful comment. The view packing technique is central to our approach, and we acknowledge that the current description in Section 3 could benefit from greater precision. We will revise the manuscript to include an explicit definition of the packing operator, specifying how multiple views are arranged in a grid within a single canvas while maintaining their relative spatial positions. Regarding boundary handling, we will detail the use of zero-padding or edge-aware blending to minimize potential seam artifacts, ensuring that the autoregressive model's next-scale prediction does not suffer from discontinuities. This addition will clarify that the method avoids introducing harmful artifacts and supports the efficiency claims without requiring model retraining.

    Revision: yes

  2. Referee: [§4] §4 / Table 2 (Experiments): the abstract and results section assert outperformance in quality and efficiency, but no quantitative values (e.g., FID, LPIPS, cross-view consistency, or seam-visibility scores) or ablation on packing density are referenced; this prevents verification that the packing operator actually resolves the spatial-coherence risk raised by the stress-test note.

    Authors: We thank the referee for noting this gap in the presentation of results. Although our experiments in Section 4 and Table 2 compare PacTure against baselines in terms of visual quality and inference speed, we agree that incorporating standard quantitative metrics would enhance verifiability. In the revised version, we will report specific scores including FID and LPIPS for texture quality, as well as metrics for cross-view consistency and seam visibility. We will also add an ablation experiment on different packing densities to demonstrate how it affects spatial coherence and mitigates the risks highlighted in the stress-test. These changes will provide stronger evidence for the superiority of our method in both quality and efficiency.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; core contributions are independent techniques

Full rationale

The paper presents view packing and fine-grained autoregressive control as novel, independent techniques that increase effective resolution and reduce inference cost while preserving compatibility with existing 2D models. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described framework. The claimed improvements in quality and efficiency for PBR texture generation do not reduce by construction to prior inputs; the derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms or invented entities; the framework relies on standard assumptions of existing 2D generative models and autoregressive training.

pith-pipeline@v0.9.0 · 5717 in / 1004 out tokens · 40397 ms · 2026-05-19T13:22:35.966410+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

93 extracted references · 93 canonical work pages · 5 internal anchors

  1. [1]

    Pharr, W

    M. Pharr, W. Jakob, and G. Humphreys,Physically Based Rendering: From Theory to Implementation. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 3rd ed., 2016

  2. [2]

    TEXTure: Text-guided texturing of 3D shapes,

    E. Richardson, G. Metzer, Y . Alaluf, R. Giryes, and D. Cohen-Or, “TEXTure: Text-guided texturing of 3D shapes,” inProc. of the ACM SIGGRAPH Conference and Exhibition On Computer Graphics and Interactive Techniques (SIGGRAPH), 2023

  3. [3]

    Text2Tex: Text-driven texture synthesis via diffusion models,

    D. Z. Chen, Y . Siddiqui, H. Lee, S. Tulyakov, and M. Nießner, “Text2Tex: Text-driven texture synthesis via diffusion models,” inProc. of IEEE/CVF International Conference on Computer Vision (ICCV), 2023

  4. [4]

    Paint3D: Paint anything 3D with lighting-less texture diffusion models,

    X. Zeng, X. Chen, Z. Qi, W. Liu, Z. Zhao, Z. Wang, B. Fu, Y . Liu, and G. Yu, “Paint3D: Paint anything 3D with lighting-less texture diffusion models,” inProc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  5. [5]

    TexFusion: Synthesizing 3D textures with text-guided image diffusion models,

    T. Cao, K. Kreis, S. Fidler, N. Sharp, and K. Yin, “TexFusion: Synthesizing 3D textures with text-guided image diffusion models,” inProc. of IEEE/CVF International Conference on Computer Vision (ICCV), 2023

  6. [6]

    Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation

    Z. Zhao, Z. Lai, Q. Lin, Y . Zhao, H. Liu, S. Yang, Y . Feng, M. Yang, S. Zhang, X. Yang, H. Shi, S. Liu, J. Wu, Y . Lian, F. Yang, R. Tang, Z. He, X. Wang, J. Liu, X. Zuo, Z. Chen, B. Lei, H. Weng, J. Xu, Y . Zhu, X. Liu, L. Xu, C. Hu, T. Huang, L. Wang, J. Zhang, M. Chen, L. Dong, Y . Jia, Y . Cai, J. Yu, Y . Tang, H. Zhang, Z. Ye, P. He, R. Wu, C. Zhan...

  7. [7]

    GenesisTex: Adapting image denoising diffusion to texture space,

    C. Gao, B. Jiang, X. Li, Y . Zhang, and Q. Yu, “GenesisTex: Adapting image denoising diffusion to texture space,” inProc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  8. [8]

    InTeX: Interactive text-to-texture synthesis via unified depth-aware inpainting,

    J. Tang, R. Lu, X. Chen, X. Wen, G. Zeng, and Z. Liu, “InTeX: Interactive text-to-texture synthesis via unified depth-aware inpainting,”arXiv preprint arXiv:2403.11878, 2024

  9. [9]

    Meta 3D TextureGen: Fast and consistent texture generation for 3D objects,

    R. Bensadoun, Y . Kleiman, I. Azuri, O. Harosh, A. Vedaldi, N. Neverova, and O. Gafni, “Meta 3D TextureGen: Fast and consistent texture generation for 3D objects,”arXiv preprint arXiv:2407.02430, 2024

  10. [10]

    Meta 3D AssetGen: Text-to-mesh generation with high- quality geometry, texture, and PBR materials,

    Y . Siddiqui, T. Monnier, F. Kokkinos, M. Kariya, Y . Kleiman, E. Garreau, O. Gafni, N. Neverova, A. Vedaldi, R. Shapovalov, and D. Novotný, “Meta 3D AssetGen: Text-to-mesh generation with high- quality geometry, texture, and PBR materials,” inProc. of Advances in Neural Information Processing Systems (NeurIPS), 2024

  11. [11]

    VCD-Texture: Variance alignment based 3D-2D co-denoising for text-guided texturing,

    S. Liu, C. Yu, C. Cao, W. Qian, and F. Wang, “VCD-Texture: Variance alignment based 3D-2D co-denoising for text-guided texturing,” inProc. of European Conference on Computer Vision (ECCV), 2024

  12. [12]

    CLAY: A controllable large-scale generative model for creating high-quality 3D assets,

    L. Zhang, Z. Wang, Q. Zhang, Q. Qiu, A. Pang, H. Jiang, W. Yang, L. Xu, and J. Yu, “CLAY: A controllable large-scale generative model for creating high-quality 3D assets,”ACM Transactions on Graphics (TOG), 2024

  13. [13]

    TexGen: Text-guided 3D texture generation with multi-view sampling and resampling,

    D. Huo, Z. Guo, X. Zuo, Z. Shi, J. Lu, P. Dai, S. Xu, L. Cheng, and Y . Yang, “TexGen: Text-guided 3D texture generation with multi-view sampling and resampling,” inProc. of European Conference on Computer Vision (ECCV), 2024

  14. [14]

    Make-A- Texture: Fast shape-aware texture generation in 3 seconds,

    X. Xiang, L. S. Gorelik, Y . Fan, O. Armstrong, F. N. Iandola, Y . Li, I. Lifshitz, and R. Ranjan, “Make-A- Texture: Fast shape-aware texture generation in 3 seconds,” inProc. of IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025

  15. [15]

    Jointly generating multi-view consistent PBR textures using collaborative control,

    S. Vainer, K. Kutsy, D. D. Nigris, C. Rowles, S. Elizarov, and S. Donné, “Jointly generating multi-view consistent PBR textures using collaborative control,”arXiv preprint arXiv:2410.06985, 2024

  16. [16]

    MVPaint: Synchronized multi-view diffusion for painting anything 3D,

    W. Cheng, J. Mu, X. Zeng, X. Chen, A. Pang, C. Zhang, Z. Wang, B. Fu, G. Yu, Z. Liu, and L. Pan, “MVPaint: Synchronized multi-view diffusion for painting anything 3D,” inProc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  17. [17]

    MCMat: Multiview-consistent and physically accurate PBR material generation,

    S. Zhu, L. Qiu, X. Gu, Z. Zhao, C. Xu, Y . He, Z. Li, X. Han, Y . Yao, X. Cao, S. Zhu, W. Yuan, Z. Dong, and H. Zhu, “MCMat: Multiview-consistent and physically accurate PBR material generation,”arXiv preprint arXiv:2412.14148, 2024

  18. [18]

    TexPainter: Generative mesh texturing with multi-view consistency,

    H. Zhang, Z. Pan, C. Zhang, L. Zhu, and X. Gao, “TexPainter: Generative mesh texturing with multi-view consistency,” inProc. of the ACM SIGGRAPH Conference and Exhibition On Computer Graphics and Interactive Techniques (SIGGRAPH), 2024

  19. [19]

    Pandora3d: A comprehensive framework for high-quality 3d shape and texture generation

    J. Yang, T. Shang, W. Sun, X. Song, Z. Cheng, S. Wang, S. Chen, W. Liu, H. Li, and P. Ji, “Pan- dora3D: A comprehensive framework for high-quality 3D shape and texture generation,”arXiv preprint arXiv:2502.14247, 2025

  20. [20]

    Text-guided texturing by synchronized multi-view diffusion,

    Y . Liu, M. Xie, H. Liu, and T. Wong, “Text-guided texturing by synchronized multi-view diffusion,” inProc. of the ACM SIGGRAPH Conference and Exhibition on Computer Graphics and Interactive Techniques in Asia (SIGGRAPH Asia), 2024

  21. [21]

    Texture generation on 3D meshes with Point-UV diffusion,

    X. Yu, P. Dai, W. Li, L. Ma, Z. Liu, and X. Qi, “Texture generation on 3D meshes with Point-UV diffusion,” inProc. of IEEE/CVF International Conference on Computer Vision (ICCV), 2023

  22. [22]

    TEXGen: a generative diffusion model for mesh textures,

    X. Yu, Z. Yuan, Y . Guo, Y . Liu, J. Liu, Y . Li, Y . Cao, D. Liang, and X. Qi, “TEXGen: a generative diffusion model for mesh textures,”ACM Transactions on Graphics (TOG), 2024

  23. [23]

    UV-free texture generation with denoising and geodesic heat diffusions,

    S. Foti, S. Zafeiriou, and T. Birdal, “UV-free texture generation with denoising and geodesic heat diffusions,” arXiv preprint arXiv:2408.16762, 2024

  24. [24]

    TexOct: Generating textures of 3D models with octree-based diffusion,

    J. Liu, C. Wu, X. Liu, X. Liu, J. Wu, H. Peng, C. Zhao, H. Feng, J. Liu, and E. Ding, “TexOct: Generating textures of 3D models with octree-based diffusion,” inProc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 11

  25. [25]

    TexGaussian: Generating high-quality PBR material via octree-based 3D Gaussian splatting,

    B. Xiong, J. Liu, J. Hu, C. Wu, J. Wu, X. Liu, C. Zhao, E. Ding, and Z. Lian, “TexGaussian: Generating high-quality PBR material via octree-based 3D Gaussian splatting,” inProc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  26. [26]

    High-resolution image synthesis with latent diffusion models,

    R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” inProc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

  27. [27]

    Scaling rectified flow transformers for high- resolution image synthesis,

    P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y . Levi, D. Lorenz, A. Sauer, F. Boesel, D. Podell, T. Dockhorn, Z. English, and R. Rombach, “Scaling rectified flow transformers for high- resolution image synthesis,” inProc. of International Conference on Machine Learning (ICML), 2024

  28. [28]

    SDXL: improving latent diffusion models for high-resolution image synthesis,

    D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach, “SDXL: improving latent diffusion models for high-resolution image synthesis,” inProc. of International Conference on Learning Representations (ICLR), 2024

  29. [29]

    arXiv preprint arXiv:2309.15807 , year=

    X. Dai, J. Hou, C. Ma, S. S. Tsai, J. Wang, R. Wang, P. Zhang, S. Vandenhende, X. Wang, A. Dubey, M. Yu, A. Kadian, F. Radenovic, D. Mahajan, K. Li, Y . Zhao, V . Petrovic, M. K. Singh, S. Motwani, Y . Wen, Y . Song, R. Sumbaly, V . Ramanathan, Z. He, P. Vajda, and D. Parikh, “Emu: Enhancing image generation models using photogenic needles in a haystack,”...

  30. [30]

    FlashTex: Fast relightable mesh texturing with LightControlNet,

    K. Deng, T. Omernick, A. Weiss, D. Ramanan, J. Zhu, T. Zhou, and M. Agrawala, “FlashTex: Fast relightable mesh texturing with LightControlNet,” inProc. of European Conference on Computer Vision (ECCV), 2024

  31. [31]

    TexDreamer: Towards zero-shot high-fidelity 3D human texture generation,

    Y . Liu, J. Zhu, J. Tang, S. Zhang, J. Zhang, W. Cao, C. Wang, Y . Wu, and D. Huang, “TexDreamer: Towards zero-shot high-fidelity 3D human texture generation,” inProc. of European Conference on Computer Vision (ECCV), 2024

  32. [32]

    Least squares conformal maps for automatic texture atlas generation,

    B. Lévy, S. Petitjean, N. Ray, and J. Maillot, “Least squares conformal maps for automatic texture atlas generation,”ACM Transactions on Graphics (TOG), vol. 21, no. 3, pp. 362–371, 2002

  33. [33]

    Two-dimensional finite bin-packing algorithms,

    J. O. Berkey and P. Y . Wang, “Two-dimensional finite bin-packing algorithms,”Journal of the Operational Research Society, 1987

  34. [34]

    A thousand ways to pack the bin – a practical approach to two-dimensional rectangle bin packing,

    J. Jylänki, “A thousand ways to pack the bin – a practical approach to two-dimensional rectangle bin packing,” 2010

  35. [35]

    Visual autoregressive modeling: Scalable image generation via next-scale prediction,

    K. Tian, Y . Jiang, Z. Yuan, B. Peng, and L. Wang, “Visual autoregressive modeling: Scalable image generation via next-scale prediction,” inProc. of Advances in Neural Information Processing Systems (NeurIPS), 2024

  36. [36]

    Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis,

    J. Han, J. Liu, Y . Jiang, B. Yan, Y . Zhang, Z. Yuan, B. Peng, and X. Liu, “Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis,” inProc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  37. [37]

    Pixart-Σ: Weak-to- strong training of diffusion transformer for 4K text-to-image generation,

    J. Chen, C. Ge, E. Xie, Y . Wu, L. Yao, X. Ren, Z. Wang, P. Luo, H. Lu, and Z. Li, “Pixart-Σ: Weak-to- strong training of diffusion transformer for 4K text-to-image generation,” inProc. of European Conference on Computer Vision (ECCV), 2024

  38. [38]

    Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

    P. Sun, Y . Jiang, S. Chen, S. Zhang, B. Peng, P. Luo, and Z. Yuan, “Autoregressive model beats diffusion: Llama for scalable image generation,”arXiv preprint arXiv:2406.06525, 2024

  39. [39]

    HART: efficient visual generation with hybrid autoregressive transformer,

    H. Tang, Y . Wu, S. Yang, E. Xie, J. Chen, J. Chen, Z. Zhang, H. Cai, Y . Lu, and S. Han, “HART: efficient visual generation with hybrid autoregressive transformer,”arXiv preprint arXiv:2410.10812, 2024

  40. [40]

    Text2Mesh: Text-driven neural stylization for meshes,

    O. Michel, R. Bar-On, R. Liu, S. Benaim, and R. Hanocka, “Text2Mesh: Text-driven neural stylization for meshes,” inProc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

  41. [41]

    Texturify: Generating textures on 3D shape surfaces,

    Y . Siddiqui, J. Thies, F. Ma, Q. Shan, M. Nießner, and A. Dai, “Texturify: Generating textures on 3D shape surfaces,” inProc. of European Conference on Computer Vision (ECCV), 2022

  42. [42]

    GET3D: A generative model of high quality 3D textured shapes learned from images,

    J. Gao, T. Shen, Z. Wang, W. Chen, K. Yin, D. Li, O. Litany, Z. Gojcic, and S. Fidler, “GET3D: A generative model of high quality 3D textured shapes learned from images,” inProc. of Advances in Neural Information Processing Systems (NeurIPS), 2022

  43. [43]

    CLIP-Mesh: Generating textured meshes from text using pretrained image-text models,

    N. M. Khalid, T. Xie, E. Belilovsky, and T. Popa, “CLIP-Mesh: Generating textured meshes from text using pretrained image-text models,” inProc. of the ACM SIGGRAPH Conference and Exhibition on Computer Graphics and Interactive Techniques in Asia (SIGGRAPH Asia), 2022. 12

  44. [44]

    Latent-NeRF for shape-guided generation of 3D shapes and textures,

    G. Metzer, E. Richardson, O. Patashnik, R. Giryes, and D. Cohen-Or, “Latent-NeRF for shape-guided generation of 3D shapes and textures,” inProc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

  45. [45]

    Paint-it: Text-to-texture synthesis via deep convolutional texture map optimization and physically-based rendering,

    K. Youwang, T. Oh, and G. Pons-Moll, “Paint-it: Text-to-texture synthesis via deep convolutional texture map optimization and physically-based rendering,” inProc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  46. [46]

    TextureDreamer: Image-guided texture synthesis through geometry-aware diffusion,

    Y . Yeh, J. Huang, C. Kim, L. Xiao, T. Nguyen-Phuoc, N. Khan, C. Zhang, M. Chandraker, C. S. Marshall, Z. Dong, and Z. Li, “TextureDreamer: Image-guided texture synthesis through geometry-aware diffusion,” inProc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  47. [47]

    DreamMat: High-quality PBR material generation with geometry- and light-aware diffusion models,

    Y . Zhang, Y . Liu, Z. Xie, L. Yang, Z. Liu, M. Yang, R. Zhang, Q. Kou, C. Lin, W. Wang, and X. Jin, “DreamMat: High-quality PBR material generation with geometry- and light-aware diffusion models,” ACM Transactions on Graphics (TOG), 2024

  48. [48]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” inProc. of International Conference on Machine Learning (ICML), 2021

  49. [49]

    Generative Adversarial Networks

    I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y . Bengio, “Generative adversarial networks,”arXiv preprint arXiv:1406.2661, 2014

  50. [50]

    DreamFusion: Text-to-3D using 2D diffusion,

    B. Poole, A. Jain, J. T. Barron, and B. Mildenhall, “DreamFusion: Text-to-3D using 2D diffusion,” inProc. of International Conference on Learning Representations (ICLR), 2023

  51. [51]

    Adding conditional control to text-to-image diffusion models,

    L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” in Proc. of IEEE/CVF International Conference on Computer Vision (ICCV), 2023

  52. [52]

    MatAtlas: Text-driven consistent geometry texturing and material assignment,

    D. Ceylan, V . Deschaintre, T. Groueix, R. Martin, C. P. Huang, R. Rouffet, V . G. Kim, and G. Las- sagne, “MatAtlas: Text-driven consistent geometry texturing and material assignment,”arXiv preprint arXiv:2404.02899, 2024

  53. [53]

    Make-it-Real: Unleashing large multi- modal model’s ability for painting 3D objects with realistic materials,

    Y . Fang, Z. Sun, T. Wu, J. Wang, Z. Liu, G. Wetzstein, and D. Lin, “Make-it-Real: Unleashing large multi- modal model’s ability for painting 3D objects with realistic materials,”arXiv preprint arXiv:2404.16829, 2024

  54. [54]

    MaterialSeg3D: Segmenting dense materials from 2D priors for 3D assets,

    Z. Li, R. Gan, C. Luo, Y . Wang, J. Liu, Z. Zhu, Q. Li, X. Yin, M. Zhang, Z. Zhang, and J. Peng, “MaterialSeg3D: Segmenting dense materials from 2D priors for 3D assets,” inProc. of ACM International Conference on Multimedia (ACM MM), 2024

  55. [55]

    MaPa: Text-driven photorealistic material painting for 3D shapes,

    S. Zhang, S. Peng, T. Xu, Y . Yang, T. Chen, N. Xue, Y . Shen, H. Bao, R. Hu, and X. Zhou, “MaPa: Text-driven photorealistic material painting for 3D shapes,” inProc. of the ACM SIGGRAPH Conference and Exhibition On Computer Graphics and Interactive Techniques (SIGGRAPH), 2024

  56. [56]

    U-Net: Convolutional networks for biomedical image seg- mentation,

    O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image seg- mentation,” inProc. of International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015

  57. [57]

    T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models,

    C. Mou, X. Wang, L. Xie, Y . Wu, J. Zhang, Z. Qi, and Y . Shan, “T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models,” inProc. of Association for the Advancement of Artificial Intelligence, 2024

  58. [58]

    ControlNeXt: Powerful and efficient control for image and video generation,

    B. Peng, J. Wang, Y . Zhang, W. Li, M. Yang, and J. Jia, “ControlNeXt: Powerful and efficient control for image and video generation,”arXiv preprint arXiv:2408.06070, 2024

  59. [59]

    OminiControl: Minimal and universal control for diffusion transformer,

    Z. Tan, S. Liu, X. Yang, Q. Xue, and X. Wang, “OminiControl: Minimal and universal control for diffusion transformer,”arXiv preprint arXiv:2411.15098, 2024

  60. [60]

    Uni-ControlNet: All-in-one control to text-to-image diffusion models,

    S. Zhao, D. Chen, Y. Chen, J. Bao, S. Hao, L. Yuan, and K. K. Wong, “Uni-ControlNet: All-in-one control to text-to-image diffusion models,” in Proc. of Advances in Neural Information Processing Systems (NeurIPS), 2023

  61. [61]

    UniControl: A unified diffusion model for controllable visual generation in the wild,

    C. Qin, S. Zhang, N. Yu, Y. Feng, X. Yang, Y. Zhou, H. Wang, J. C. Niebles, C. Xiong, S. Savarese, S. Ermon, Y. Fu, and R. Xu, “UniControl: A unified diffusion model for controllable visual generation in the wild,” in Proc. of Advances in Neural Information Processing Systems (NeurIPS), 2023

  62. [62]

    ControlNet++: Improving conditional controls with efficient consistency feedback,

    M. Li, T. Yang, H. Kuang, J. Wu, Z. Wang, X. Xiao, and C. Chen, “ControlNet++: Improving conditional controls with efficient consistency feedback,” in Proc. of European Conference on Computer Vision (ECCV), 2024

  63. [63]

    ControlAR: Controllable image generation with autoregressive models,

    Z. Li, T. Cheng, S. Chen, P. Sun, H. Shen, L. Ran, X. Chen, W. Liu, and X. Wang, “ControlAR: Controllable image generation with autoregressive models,” arXiv preprint arXiv:2410.02705, 2024

  64. [64]

    Dinov2: Learning robust visual features without supervision,

    M. Oquab, T. Darcet, T. Moutakanni, H. V. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P. Huang, S. Li, I. Misra, M. Rabbat, V. Sharma, G. Synnaeve, H. Xu, H. Jégou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski, “Dinov2: Learning robust visual features without supervi...

  65. [65]

    Pixart-δ: Fast and controllable image generation with latent consistency models,

    J. Chen, Y. Wu, S. Luo, E. Xie, S. Paul, P. Luo, H. Zhao, and Z. Li, “Pixart-δ: Fast and controllable image generation with latent consistency models,” arXiv preprint arXiv:2401.05252, 2024

  66. [66]

    Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis,

    J. Chen, J. Yu, C. Ge, L. Yao, E. Xie, Z. Wang, J. T. Kwok, P. Luo, H. Lu, and Z. Li, “Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis,” in Proc. of International Conference on Learning Representations (ICLR), 2024

  67. [67]

    ControlVAR: Exploring controllable visual autoregressive modeling,

    X. Li, K. Qiu, H. Chen, J. Kuen, Z. Lin, R. Singh, and B. Raj, “ControlVAR: Exploring controllable visual autoregressive modeling,” arXiv preprint arXiv:2406.09750, 2024

  68. [68]

    CAR: controllable autoregressive modeling for visual generation,

    Z. Yao, J. Li, Y. Zhou, Y. Liu, X. Jiang, C. Wang, F. Zheng, Y. Zou, and L. Li, “CAR: controllable autoregressive modeling for visual generation,” arXiv preprint arXiv:2410.04671, 2024

  69. [69]

    Neural discrete representation learning,

    A. van den Oord, O. Vinyals, and K. Kavukcuoglu, “Neural discrete representation learning,” in Proc. of Advances in Neural Information Processing Systems (NeurIPS), 2017

  70. [70]

    IntrinsicAnything: Learning diffusion priors for inverse rendering under unknown illumination,

    X. Chen, S. Peng, D. Yang, Y. Liu, B. Pan, C. Lv, and X. Zhou, “IntrinsicAnything: Learning diffusion priors for inverse rendering under unknown illumination,” in Proc. of European Conference on Computer Vision (ECCV), 2024

  71. [71]

    Kaolin: A pytorch library for accelerating 3D deep learning research,

    K. M. Jatavallabhula, E. J. Smith, J. Lafleche, C. F. Tsang, A. Rozantsev, W. Chen, T. Xiang, R. Lebaredian, and S. Fidler, “Kaolin: A pytorch library for accelerating 3D deep learning research,” arXiv preprint arXiv:1911.05063, 2019

  72. [72]

    DRCT: saving image super-resolution away from information bottleneck,

    C. Hsu, C. Lee, and Y. Chou, “DRCT: saving image super-resolution away from information bottleneck,” in Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024

  73. [73]

    MV-Adapter: Multi-view consistent image generation made easy,

    Z. Huang, Y. Guo, H. Wang, R. Yi, L. Ma, Y. Cao, and L. Sheng, “MV-Adapter: Multi-view consistent image generation made easy,” arXiv preprint arXiv:2412.03632, 2024

  74. [74]

    Zero-1-to-3: Zero-shot one image to 3D object,

    R. Liu, R. Wu, B. V. Hoorick, P. Tokmakov, S. Zakharov, and C. Vondrick, “Zero-1-to-3: Zero-shot one image to 3D object,” in Proc. of IEEE/CVF International Conference on Computer Vision (ICCV), 2023

  75. [75]

    Zero123++: a single image to consistent multi-view diffusion base model,

    R. Shi, H. Chen, Z. Zhang, M. Liu, C. Xu, X. Wei, L. Chen, C. Zeng, and H. Su, “Zero123++: a single image to consistent multi-view diffusion base model,” arXiv preprint arXiv:2310.15110, 2023

  76. [76]

    Objaverse: A universe of annotated 3D objects,

    M. Deitke, D. Schwenk, J. Salvador, L. Weihs, O. Michel, E. VanderBilt, L. Schmidt, K. Ehsani, A. Kembhavi, and A. Farhadi, “Objaverse: A universe of annotated 3D objects,” in Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

  77. [77]

    Objaverse-XL: A universe of 10M+ 3D objects,

    M. Deitke, R. Liu, M. Wallingford, H. Ngo, O. Michel, A. Kusupati, A. Fan, C. Laforte, V. Voleti, S. Y. Gadre, E. VanderBilt, A. Kembhavi, C. Vondrick, G. Gkioxari, K. Ehsani, L. Schmidt, and A. Farhadi, “Objaverse-XL: A universe of 10M+ 3D objects,” in Proc. of Advances in Neural Information Processing Systems (NeurIPS), 2023

  78. [78]

    IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models,

    H. Ye, J. Zhang, S. Liu, X. Han, and W. Yang, “IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models,” arXiv preprint arXiv:2308.06721, 2023

  79. [79]

    SciPy 1.0: fundamental algorithms for scientific computing in Python,

    P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, et al., “SciPy 1.0: fundamental algorithms for scientific computing in Python,” Nature Methods, 2020

  80. [80]

    Taming transformers for high-resolution image synthesis,

    P. Esser, R. Rombach, and B. Ommer, “Taming transformers for high-resolution image synthesis,” in Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021
