PacTure: Efficient PBR Texture Generation on Packed Views with Visual Autoregressive Models

Boxin Shi; Fan Fei; Fei-Peng Tian; Jiajun Tang; Ping Tan

arxiv: 2505.22394 · v2 · pith:7RBHWKFMnew · submitted 2025-05-28 · 💻 cs.CV

PacTure: Efficient PBR Texture Generation on Packed Views with Visual Autoregressive Models

Fan Fei , Jiajun Tang , Fei-Peng Tian , Boxin Shi , Ping Tan This is my paper

Pith reviewed 2026-05-19 13:22 UTC · model grok-4.3

classification 💻 cs.CV

keywords PBR texture generationview packingtext-to-3D texturevisual autoregressive modelsmulti-view generation3D mesh texturingphysically based materialsefficient texture synthesis

0 comments

The pith

PacTure packs multiple views into single images to generate consistent high-resolution PBR textures for 3D meshes from text more efficiently.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PacTure to create physically-based rendering textures on untextured 3D meshes from a text prompt. Current approaches either generate one view after another, which takes a long time and produces textures that do not match across the object, or generate all views together but at lower detail per view. PacTure instead packs several views together into one image so that the model sees them at higher effective resolution while keeping surface points that are close on the 3D shape close together in 2D. It then adapts a visual autoregressive model to control and produce several material properties at once. If this works, texturing 3D models for games or design could become both faster and more reliable without needing new model training.

Core claim

PacTure shows that packing multiple rendered views of a mesh into one composite image raises the usable resolution for each view during multi-view generation while preserving the spatial relationships the image model needs to keep textures coherent across the 3D surface. When this packing is combined with fine-grained control inside a next-scale autoregressive backbone, the same model can output multiple PBR channels together at lower total inference cost than sequential per-view or cross-attention methods.

What carries the argument

View packing: the arrangement of several camera views of the mesh into one composite 2D image so that nearby surface regions remain near each other in the layout, allowing any standard 2D generative model to reason about texture continuity without retraining or added cost.

If this is right

Global texture consistency improves because the model processes multiple views in one forward pass and can reason about continuity across packed regions.
Inference time drops because several views are handled in fewer model evaluations while each view still receives higher effective resolution.
The method remains compatible with any existing 2D generative model without architectural changes or retraining.
Multiple PBR properties such as albedo, normals, roughness, and metallic can be produced together through the same autoregressive steps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same packing idea could be tested on other multi-view tasks such as consistent lighting or material editing across a mesh.
If the packing layout were made adaptive to mesh shape rather than fixed, it might reduce wasted space on highly irregular objects.
Applying the technique to higher base resolutions could show whether the efficiency gain scales or whether packing density eventually limits quality.

Load-bearing premise

That packing views together keeps the spatial proximity the model needs for coherent generation and does not create new boundary artifacts or force the model to be retrained.

What would settle it

Generate textures for a simple closed mesh such as a sphere using a single text prompt, unwrap the result, and inspect whether seams or color shifts appear exactly along the boundaries where the packed views meet on the 3D surface.

Figures

Figures reproduced from arXiv: 2505.22394 by Boxin Shi, Fan Fei, Fei-Peng Tian, Jiajun Tang, Ping Tan.

**Figure 1.** Figure 1: We propose view packing to compactly pack multi-view maps onto the atlas as the condition and target maps for image generative models used in texturing. This technique significantly increases the effective resolution for each view without increasing generation cost by reducing background pixels that do not contribute to the final texture. Unlike the compact but hard-to-read UV maps, the packed views retain… view at source ↗

**Figure 2.** Figure 2: An overview of our pipeline comprising the following steps. 1) Given the input untextured [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Comparison of our proposed view packing and the traditional regular view tiling. [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Qualitative comparison between PacTure and baseline texturing methods. We show the [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗

read the original abstract

We present PacTure, a novel framework for generating physically-based rendering (PBR) material textures for an untextured 3D mesh from a text description. Existing 2D generation-based texturing approaches either generate textures sequentially from different views, resulting in long inference times and globally inconsistent textures, or adopt multi-view generation with cross-view attention to enhance global consistency, which, however, limits the resolution for each view. In response to these weaknesses, we first introduce view packing, a novel technique that significantly increases the effective resolution for each view during multi-view generation, without imposing additional inference cost. Unlike UV mapping, it preserves the spatial proximity essential for image generation and maintains full compatibility with current 2D generative models. To further reduce the inferencing cost, we enable fine-grained control and multi-domain generation within the next-scale prediction autoregressive framework, creating an efficient multi-view PBR generation backbone. Extensive experiments show that PacTure outperforms state-of-the-art methods in both quality and efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PacTure's view packing lets them run higher-res multi-view PBR generation inside a standard autoregressive backbone without extra cost or retraining, but the seam consistency claim needs the actual numbers to hold up.

read the letter

The main thing to know is that this paper packs multiple rendered views of a mesh into one canvas so the next-scale autoregressive model can produce PBR maps at better per-view resolution while staying compatible with off-the-shelf 2D generators. They add fine-grained multi-domain control to handle albedo, normals, roughness, and metallic together in one pass, which cuts inference time compared to sequential view generation or heavy cross-view attention setups.

Referee Report

2 major / 2 minor

Summary. The manuscript presents PacTure, a framework for text-driven PBR texture generation on untextured 3D meshes. It proposes view packing to increase per-view resolution in multi-view generation while remaining compatible with existing 2D autoregressive models, combined with fine-grained control in a next-scale prediction backbone for efficient multi-domain (albedo, normal, roughness, metallic) output. The central claim is that this yields higher quality and lower inference cost than prior sequential or cross-attention multi-view methods.

Significance. If the view-packing operator demonstrably avoids boundary artifacts while delivering the claimed resolution and consistency gains, the work would offer a practical efficiency improvement for PBR texturing pipelines in computer graphics and 3D content creation.

major comments (2)

[§3] §3 (View Packing): the claim that packing 'preserves the spatial proximity essential for image generation' and introduces no artifacts is load-bearing for the efficiency-without-retraining argument, yet the description provides no explicit operator definition or boundary-handling mechanism; without this, it is unclear whether seams are mitigated or merely assumed harmless for the autoregressive backbone.
[§4] §4 / Table 2 (Experiments): the abstract and results section assert outperformance in quality and efficiency, but no quantitative values (e.g., FID, LPIPS, cross-view consistency, or seam-visibility scores) or ablation on packing density are referenced; this prevents verification that the packing operator actually resolves the spatial-coherence risk raised by the stress-test note.

minor comments (2)

[§3.1] Clarify the precise packing layout (e.g., grid size, padding strategy) and its compatibility guarantees with the specific autoregressive tokenizer used.
[§2] Add a short paragraph contrasting view packing with UV-based alternatives in terms of inference cost and model compatibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We provide detailed responses to each major comment and outline the revisions we will make to address the concerns about the view packing operator and experimental quantification.

read point-by-point responses

Referee: [§3] §3 (View Packing): the claim that packing 'preserves the spatial proximity essential for image generation' and introduces no artifacts is load-bearing for the efficiency-without-retraining argument, yet the description provides no explicit operator definition or boundary-handling mechanism; without this, it is unclear whether seams are mitigated or merely assumed harmless for the autoregressive backbone.

Authors: We are grateful to the referee for this insightful comment. The view packing technique is central to our approach, and we acknowledge that the current description in Section 3 could benefit from greater precision. We will revise the manuscript to include an explicit definition of the packing operator, specifying how multiple views are arranged in a grid within a single canvas while maintaining their relative spatial positions. Regarding boundary handling, we will detail the use of zero-padding or edge-aware blending to minimize potential seam artifacts, ensuring that the autoregressive model's next-scale prediction does not suffer from discontinuities. This addition will clarify that the method avoids introducing harmful artifacts and supports the efficiency claims without requiring model retraining. revision: yes
Referee: [§4] §4 / Table 2 (Experiments): the abstract and results section assert outperformance in quality and efficiency, but no quantitative values (e.g., FID, LPIPS, cross-view consistency, or seam-visibility scores) or ablation on packing density are referenced; this prevents verification that the packing operator actually resolves the spatial-coherence risk raised by the stress-test note.

Authors: We thank the referee for noting this gap in the presentation of results. Although our experiments in Section 4 and Table 2 compare PacTure against baselines in terms of visual quality and inference speed, we agree that incorporating standard quantitative metrics would enhance verifiability. In the revised version, we will report specific scores including FID and LPIPS for texture quality, as well as metrics for cross-view consistency and seam visibility. We will also add an ablation experiment on different packing densities to demonstrate how it affects spatial coherence and mitigates the risks highlighted in the stress-test. These changes will provide stronger evidence for the superiority of our method in both quality and efficiency. revision: yes

Circularity Check

0 steps flagged

No significant circularity; core contributions are independent techniques

full rationale

The paper presents view packing and fine-grained autoregressive control as novel, independent techniques that increase effective resolution and reduce inference cost while preserving compatibility with existing 2D models. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described framework. The claimed improvements in quality and efficiency for PBR texture generation do not reduce by construction to prior inputs; the derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms or invented entities; the framework relies on standard assumptions of existing 2D generative models and autoregressive training.

pith-pipeline@v0.9.0 · 5717 in / 1004 out tokens · 40397 ms · 2026-05-19T13:22:35.966410+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We formulate the arrangement of multi-view maps as a 2D rectangle bin packing problem... employs binary search to determine the maximal enlargement ratio
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean LogicNat induction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

next-scale prediction autoregressive framework... multi-domain generation within the VAR T2I model

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

93 extracted references · 93 canonical work pages · 5 internal anchors

[1]

Pharr, W

M. Pharr, W. Jakob, and G. Humphreys,Physically Based Rendering: From Theory to Implementation. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 3rd ed., 2016

work page 2016
[2]

TEXTure: Text-guided texturing of 3D shapes,

E. Richardson, G. Metzer, Y . Alaluf, R. Giryes, and D. Cohen-Or, “TEXTure: Text-guided texturing of 3D shapes,” inProc. of the ACM SIGGRAPH Conference and Exhibition On Computer Graphics and Interactive Techniques (SIGGRAPH), 2023

work page 2023
[3]

Text2Tex: Text-driven texture synthesis via diffusion models,

D. Z. Chen, Y . Siddiqui, H. Lee, S. Tulyakov, and M. Nießner, “Text2Tex: Text-driven texture synthesis via diffusion models,” inProc. of IEEE/CVF International Conference on Computer Vision (ICCV), 2023

work page 2023
[4]

Paint3D: Paint anything 3D with lighting-less texture diffusion models,

X. Zeng, X. Chen, Z. Qi, W. Liu, Z. Zhao, Z. Wang, B. Fu, Y . Liu, and G. Yu, “Paint3D: Paint anything 3D with lighting-less texture diffusion models,” inProc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024
[5]

TexFusion: Synthesizing 3D textures with text-guided image diffusion models,

T. Cao, K. Kreis, S. Fidler, N. Sharp, and K. Yin, “TexFusion: Synthesizing 3D textures with text-guided image diffusion models,” inProc. of IEEE/CVF International Conference on Computer Vision (ICCV), 2023

work page 2023
[6]

Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation

Z. Zhao, Z. Lai, Q. Lin, Y . Zhao, H. Liu, S. Yang, Y . Feng, M. Yang, S. Zhang, X. Yang, H. Shi, S. Liu, J. Wu, Y . Lian, F. Yang, R. Tang, Z. He, X. Wang, J. Liu, X. Zuo, Z. Chen, B. Lei, H. Weng, J. Xu, Y . Zhu, X. Liu, L. Xu, C. Hu, T. Huang, L. Wang, J. Zhang, M. Chen, L. Dong, Y . Jia, Y . Cai, J. Yu, Y . Tang, H. Zhang, Z. Ye, P. He, R. Wu, C. Zhan...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

GenesisTex: Adapting image denoising diffusion to texture space,

C. Gao, B. Jiang, X. Li, Y . Zhang, and Q. Yu, “GenesisTex: Adapting image denoising diffusion to texture space,” inProc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024
[8]

InTeX: Interactive text-to-texture synthesis via unified depth-aware inpainting,

J. Tang, R. Lu, X. Chen, X. Wen, G. Zeng, and Z. Liu, “InTeX: Interactive text-to-texture synthesis via unified depth-aware inpainting,”arXiv preprint arXiv:2403.11878, 2024

work page arXiv 2024
[9]

Meta 3D TextureGen: Fast and consistent texture generation for 3D objects,

R. Bensadoun, Y . Kleiman, I. Azuri, O. Harosh, A. Vedaldi, N. Neverova, and O. Gafni, “Meta 3D TextureGen: Fast and consistent texture generation for 3D objects,”arXiv preprint arXiv:2407.02430, 2024

work page arXiv 2024
[10]

Meta 3D AssetGen: Text-to-mesh generation with high- quality geometry, texture, and PBR materials,

Y . Siddiqui, T. Monnier, F. Kokkinos, M. Kariya, Y . Kleiman, E. Garreau, O. Gafni, N. Neverova, A. Vedaldi, R. Shapovalov, and D. Novotný, “Meta 3D AssetGen: Text-to-mesh generation with high- quality geometry, texture, and PBR materials,” inProc. of Advances in Neural Information Processing Systems (NeurIPS), 2024

work page 2024
[11]

VCD-Texture: Variance alignment based 3D-2D co-denoising for text-guided texturing,

S. Liu, C. Yu, C. Cao, W. Qian, and F. Wang, “VCD-Texture: Variance alignment based 3D-2D co-denoising for text-guided texturing,” inProc. of European Conference on Computer Vision (ECCV), 2024

work page 2024
[12]

CLAY: A controllable large-scale generative model for creating high-quality 3D assets,

L. Zhang, Z. Wang, Q. Zhang, Q. Qiu, A. Pang, H. Jiang, W. Yang, L. Xu, and J. Yu, “CLAY: A controllable large-scale generative model for creating high-quality 3D assets,”ACM Transactions on Graphics (TOG), 2024

work page 2024
[13]

TexGen: Text-guided 3D texture generation with multi-view sampling and resampling,

D. Huo, Z. Guo, X. Zuo, Z. Shi, J. Lu, P. Dai, S. Xu, L. Cheng, and Y . Yang, “TexGen: Text-guided 3D texture generation with multi-view sampling and resampling,” inProc. of European Conference on Computer Vision (ECCV), 2024

work page 2024
[14]

Make-A- Texture: Fast shape-aware texture generation in 3 seconds,

X. Xiang, L. S. Gorelik, Y . Fan, O. Armstrong, F. N. Iandola, Y . Li, I. Lifshitz, and R. Ranjan, “Make-A- Texture: Fast shape-aware texture generation in 3 seconds,” inProc. of IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025

work page 2025
[15]

Jointly generating multi-view consistent PBR textures using collaborative control,

S. Vainer, K. Kutsy, D. D. Nigris, C. Rowles, S. Elizarov, and S. Donné, “Jointly generating multi-view consistent PBR textures using collaborative control,”arXiv preprint arXiv:2410.06985, 2024

work page arXiv 2024
[16]

MVPaint: Synchronized multi-view diffusion for painting anything 3D,

W. Cheng, J. Mu, X. Zeng, X. Chen, A. Pang, C. Zhang, Z. Wang, B. Fu, G. Yu, Z. Liu, and L. Pan, “MVPaint: Synchronized multi-view diffusion for painting anything 3D,” inProc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

work page 2025
[17]

MCMat: Multiview-consistent and physically accurate PBR material generation,

S. Zhu, L. Qiu, X. Gu, Z. Zhao, C. Xu, Y . He, Z. Li, X. Han, Y . Yao, X. Cao, S. Zhu, W. Yuan, Z. Dong, and H. Zhu, “MCMat: Multiview-consistent and physically accurate PBR material generation,”arXiv preprint arXiv:2412.14148, 2024

work page arXiv 2024
[18]

TexPainter: Generative mesh texturing with multi-view consistency,

H. Zhang, Z. Pan, C. Zhang, L. Zhu, and X. Gao, “TexPainter: Generative mesh texturing with multi-view consistency,” inProc. of the ACM SIGGRAPH Conference and Exhibition On Computer Graphics and Interactive Techniques (SIGGRAPH), 2024

work page 2024
[19]

Pandora3d: A comprehensive framework for high-quality 3d shape and texture generation

J. Yang, T. Shang, W. Sun, X. Song, Z. Cheng, S. Wang, S. Chen, W. Liu, H. Li, and P. Ji, “Pan- dora3D: A comprehensive framework for high-quality 3D shape and texture generation,”arXiv preprint arXiv:2502.14247, 2025

work page arXiv 2025
[20]

Text-guided texturing by synchronized multi-view diffusion,

Y . Liu, M. Xie, H. Liu, and T. Wong, “Text-guided texturing by synchronized multi-view diffusion,” inProc. of the ACM SIGGRAPH Conference and Exhibition on Computer Graphics and Interactive Techniques in Asia (SIGGRAPH Asia), 2024

work page 2024
[21]

Texture generation on 3D meshes with Point-UV diffusion,

X. Yu, P. Dai, W. Li, L. Ma, Z. Liu, and X. Qi, “Texture generation on 3D meshes with Point-UV diffusion,” inProc. of IEEE/CVF International Conference on Computer Vision (ICCV), 2023

work page 2023
[22]

TEXGen: a generative diffusion model for mesh textures,

X. Yu, Z. Yuan, Y . Guo, Y . Liu, J. Liu, Y . Li, Y . Cao, D. Liang, and X. Qi, “TEXGen: a generative diffusion model for mesh textures,”ACM Transactions on Graphics (TOG), 2024

work page 2024
[23]

UV-free texture generation with denoising and geodesic heat diffusions,

S. Foti, S. Zafeiriou, and T. Birdal, “UV-free texture generation with denoising and geodesic heat diffusions,” arXiv preprint arXiv:2408.16762, 2024

work page arXiv 2024
[24]

TexOct: Generating textures of 3D models with octree-based diffusion,

J. Liu, C. Wu, X. Liu, X. Liu, J. Wu, H. Peng, C. Zhao, H. Feng, J. Liu, and E. Ding, “TexOct: Generating textures of 3D models with octree-based diffusion,” inProc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 11

work page 2024
[25]

TexGaussian: Generating high-quality PBR material via octree-based 3D Gaussian splatting,

B. Xiong, J. Liu, J. Hu, C. Wu, J. Wu, X. Liu, C. Zhao, E. Ding, and Z. Lian, “TexGaussian: Generating high-quality PBR material via octree-based 3D Gaussian splatting,” inProc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

work page 2025
[26]

High-resolution image synthesis with latent diffusion models,

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” inProc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

work page 2022
[27]

Scaling rectified flow transformers for high- resolution image synthesis,

P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y . Levi, D. Lorenz, A. Sauer, F. Boesel, D. Podell, T. Dockhorn, Z. English, and R. Rombach, “Scaling rectified flow transformers for high- resolution image synthesis,” inProc. of International Conference on Machine Learning (ICML), 2024

work page 2024
[28]

SDXL: improving latent diffusion models for high-resolution image synthesis,

D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach, “SDXL: improving latent diffusion models for high-resolution image synthesis,” inProc. of International Conference on Learning Representations (ICLR), 2024

work page 2024
[29]

arXiv preprint arXiv:2309.15807 , year=

X. Dai, J. Hou, C. Ma, S. S. Tsai, J. Wang, R. Wang, P. Zhang, S. Vandenhende, X. Wang, A. Dubey, M. Yu, A. Kadian, F. Radenovic, D. Mahajan, K. Li, Y . Zhao, V . Petrovic, M. K. Singh, S. Motwani, Y . Wen, Y . Song, R. Sumbaly, V . Ramanathan, Z. He, P. Vajda, and D. Parikh, “Emu: Enhancing image generation models using photogenic needles in a haystack,”...

work page arXiv 2023
[30]

FlashTex: Fast relightable mesh texturing with LightControlNet,

K. Deng, T. Omernick, A. Weiss, D. Ramanan, J. Zhu, T. Zhou, and M. Agrawala, “FlashTex: Fast relightable mesh texturing with LightControlNet,” inProc. of European Conference on Computer Vision (ECCV), 2024

work page 2024
[31]

TexDreamer: Towards zero-shot high-fidelity 3D human texture generation,

Y . Liu, J. Zhu, J. Tang, S. Zhang, J. Zhang, W. Cao, C. Wang, Y . Wu, and D. Huang, “TexDreamer: Towards zero-shot high-fidelity 3D human texture generation,” inProc. of European Conference on Computer Vision (ECCV), 2024

work page 2024
[32]

Least squares conformal maps for automatic texture atlas generation,

B. Lévy, S. Petitjean, N. Ray, and J. Maillot, “Least squares conformal maps for automatic texture atlas generation,”ACM Transactions on Graphics (TOG), vol. 21, no. 3, pp. 362–371, 2002

work page 2002
[33]

Two-dimensional finite bin-packing algorithms,

J. O. Berkey and P. Y . Wang, “Two-dimensional finite bin-packing algorithms,”Journal of the Operational Research Society, 1987

work page 1987
[34]

A thousand ways to pack the bin – a practical approach to two-dimensional rectangle bin packing,

J. Jylänki, “A thousand ways to pack the bin – a practical approach to two-dimensional rectangle bin packing,” 2010

work page 2010
[35]

Visual autoregressive modeling: Scalable image generation via next-scale prediction,

K. Tian, Y . Jiang, Z. Yuan, B. Peng, and L. Wang, “Visual autoregressive modeling: Scalable image generation via next-scale prediction,” inProc. of Advances in Neural Information Processing Systems (NeurIPS), 2024

work page 2024
[36]

Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis,

J. Han, J. Liu, Y . Jiang, B. Yan, Y . Zhang, Z. Yuan, B. Peng, and X. Liu, “Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis,” inProc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

work page 2025
[37]

Pixart-Σ: Weak-to- strong training of diffusion transformer for 4K text-to-image generation,

J. Chen, C. Ge, E. Xie, Y . Wu, L. Yao, X. Ren, Z. Wang, P. Luo, H. Lu, and Z. Li, “Pixart-Σ: Weak-to- strong training of diffusion transformer for 4K text-to-image generation,” inProc. of European Conference on Computer Vision (ECCV), 2024

work page 2024
[38]

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

P. Sun, Y . Jiang, S. Chen, S. Zhang, B. Peng, P. Luo, and Z. Yuan, “Autoregressive model beats diffusion: Llama for scalable image generation,”arXiv preprint arXiv:2406.06525, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[39]

HART: efficient visual generation with hybrid autoregressive transformer,

H. Tang, Y . Wu, S. Yang, E. Xie, J. Chen, J. Chen, Z. Zhang, H. Cai, Y . Lu, and S. Han, “HART: efficient visual generation with hybrid autoregressive transformer,”arXiv preprint arXiv:2410.10812, 2024

work page arXiv 2024
[40]

Text2Mesh: Text-driven neural stylization for meshes,

O. Michel, R. Bar-On, R. Liu, S. Benaim, and R. Hanocka, “Text2Mesh: Text-driven neural stylization for meshes,” inProc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

work page 2022
[41]

Texturify: Generating textures on 3D shape surfaces,

Y . Siddiqui, J. Thies, F. Ma, Q. Shan, M. Nießner, and A. Dai, “Texturify: Generating textures on 3D shape surfaces,” inProc. of European Conference on Computer Vision (ECCV), 2022

work page 2022
[42]

GET3D: A generative model of high quality 3D textured shapes learned from images,

J. Gao, T. Shen, Z. Wang, W. Chen, K. Yin, D. Li, O. Litany, Z. Gojcic, and S. Fidler, “GET3D: A generative model of high quality 3D textured shapes learned from images,” inProc. of Advances in Neural Information Processing Systems (NeurIPS), 2022

work page 2022
[43]

CLIP-Mesh: Generating textured meshes from text using pretrained image-text models,

N. M. Khalid, T. Xie, E. Belilovsky, and T. Popa, “CLIP-Mesh: Generating textured meshes from text using pretrained image-text models,” inProc. of the ACM SIGGRAPH Conference and Exhibition on Computer Graphics and Interactive Techniques in Asia (SIGGRAPH Asia), 2022. 12

work page 2022
[44]

Latent-NeRF for shape-guided generation of 3D shapes and textures,

G. Metzer, E. Richardson, O. Patashnik, R. Giryes, and D. Cohen-Or, “Latent-NeRF for shape-guided generation of 3D shapes and textures,” inProc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

work page 2023
[45]

Paint-it: Text-to-texture synthesis via deep convolutional texture map optimization and physically-based rendering,

K. Youwang, T. Oh, and G. Pons-Moll, “Paint-it: Text-to-texture synthesis via deep convolutional texture map optimization and physically-based rendering,” inProc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024
[46]

TextureDreamer: Image-guided texture synthesis through geometry-aware diffusion,

Y . Yeh, J. Huang, C. Kim, L. Xiao, T. Nguyen-Phuoc, N. Khan, C. Zhang, M. Chandraker, C. S. Marshall, Z. Dong, and Z. Li, “TextureDreamer: Image-guided texture synthesis through geometry-aware diffusion,” inProc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024
[47]

DreamMat: High-quality PBR material generation with geometry- and light-aware diffusion models,

Y . Zhang, Y . Liu, Z. Xie, L. Yang, Z. Liu, M. Yang, R. Zhang, Q. Kou, C. Lin, W. Wang, and X. Jin, “DreamMat: High-quality PBR material generation with geometry- and light-aware diffusion models,” ACM Transactions on Graphics (TOG), 2024

work page 2024
[48]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” inProc. of International Conference on Machine Learning (ICML), 2021

work page 2021
[49]

Generative Adversarial Networks

I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y . Bengio, “Generative adversarial networks,”arXiv preprint arXiv:1406.2661, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[50]

DreamFusion: Text-to-3D using 2D diffusion,

B. Poole, A. Jain, J. T. Barron, and B. Mildenhall, “DreamFusion: Text-to-3D using 2D diffusion,” inProc. of International Conference on Learning Representations (ICLR), 2023

work page 2023
[51]

Adding conditional control to text-to-image diffusion models,

L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” in Proc. of IEEE/CVF International Conference on Computer Vision (ICCV), 2023

work page 2023
[52]

MatAtlas: Text-driven consistent geometry texturing and material assignment,

D. Ceylan, V . Deschaintre, T. Groueix, R. Martin, C. P. Huang, R. Rouffet, V . G. Kim, and G. Las- sagne, “MatAtlas: Text-driven consistent geometry texturing and material assignment,”arXiv preprint arXiv:2404.02899, 2024

work page arXiv 2024
[53]

Make-it-Real: Unleashing large multi- modal model’s ability for painting 3D objects with realistic materials,

Y . Fang, Z. Sun, T. Wu, J. Wang, Z. Liu, G. Wetzstein, and D. Lin, “Make-it-Real: Unleashing large multi- modal model’s ability for painting 3D objects with realistic materials,”arXiv preprint arXiv:2404.16829, 2024

work page arXiv 2024
[54]

MaterialSeg3D: Segmenting dense materials from 2D priors for 3D assets,

Z. Li, R. Gan, C. Luo, Y . Wang, J. Liu, Z. Zhu, Q. Li, X. Yin, M. Zhang, Z. Zhang, and J. Peng, “MaterialSeg3D: Segmenting dense materials from 2D priors for 3D assets,” inProc. of ACM International Conference on Multimedia (ACM MM), 2024

work page 2024
[55]

MaPa: Text-driven photorealistic material painting for 3D shapes,

S. Zhang, S. Peng, T. Xu, Y . Yang, T. Chen, N. Xue, Y . Shen, H. Bao, R. Hu, and X. Zhou, “MaPa: Text-driven photorealistic material painting for 3D shapes,” inProc. of the ACM SIGGRAPH Conference and Exhibition On Computer Graphics and Interactive Techniques (SIGGRAPH), 2024

work page 2024
[56]

U-Net: Convolutional networks for biomedical image seg- mentation,

O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image seg- mentation,” inProc. of International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015

work page 2015
[57]

T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models,

C. Mou, X. Wang, L. Xie, Y . Wu, J. Zhang, Z. Qi, and Y . Shan, “T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models,” inProc. of Association for the Advancement of Artificial Intelligence, 2024

work page 2024
[58]

ControlNeXt: Powerful and efficient control for image and video generation,

B. Peng, J. Wang, Y . Zhang, W. Li, M. Yang, and J. Jia, “ControlNeXt: Powerful and efficient control for image and video generation,”arXiv preprint arXiv:2408.06070, 2024

work page arXiv 2024
[59]

OminiControl: Minimal and universal control for diffusion transformer,

Z. Tan, S. Liu, X. Yang, Q. Xue, and X. Wang, “OminiControl: Minimal and universal control for diffusion transformer,”arXiv preprint arXiv:2411.15098, 2024

work page arXiv 2024
[60]

Uni-ControlNet: All-in-one control to text-to-image diffusion models,

S. Zhao, D. Chen, Y . Chen, J. Bao, S. Hao, L. Yuan, and K. K. Wong, “Uni-ControlNet: All-in-one control to text-to-image diffusion models,” inProc. of Advances in Neural Information Processing Systems (NeurIPS), 2023

work page 2023
[61]

UniControl: A unified diffusion model for controllable visual generation in the wild,

C. Qin, S. Zhang, N. Yu, Y . Feng, X. Yang, Y . Zhou, H. Wang, J. C. Niebles, C. Xiong, S. Savarese, S. Ermon, Y . Fu, and R. Xu, “UniControl: A unified diffusion model for controllable visual generation in the wild,” inProc. of Advances in Neural Information Processing Systems (NeurIPS), 2023

work page 2023
[62]

ControlNet++: Improving conditional controls with efficient consistency feedback,

M. Li, T. Yang, H. Kuang, J. Wu, Z. Wang, X. Xiao, and C. Chen, “ControlNet++: Improving conditional controls with efficient consistency feedback,” inProc. of European Conference on Computer Vision (ECCV), 2024. 13

work page 2024
[63]

ControlAR: Controllable image generation with autoregressive models,

Z. Li, T. Cheng, S. Chen, P. Sun, H. Shen, L. Ran, X. Chen, W. Liu, and X. Wang, “ControlAR: Controllable image generation with autoregressive models,”arXiv preprint arXiv:2410.02705, 2024

work page arXiv 2024
[64]

Dinov2: Learning robust visual features without supervision,

M. Oquab, T. Darcet, T. Moutakanni, H. V . V o, M. Szafraniec, V . Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P. Huang, S. Li, I. Misra, M. Rabbat, V . Sharma, G. Synnaeve, H. Xu, H. Jégou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski, “Dinov2: Learning robust visual features without supervi...

work page 2024
[65]

Pixart-{\delta}: Fast and controllable image generation with latent consistency models.arXiv preprint arXiv:2401.05252, 2024

J. Chen, Y . Wu, S. Luo, E. Xie, S. Paul, P. Luo, H. Zhao, and Z. Li, “Pixart-δ: Fast and controllable image generation with latent consistency models,”arXiv preprint arXiv:2401.05252, 2024

work page arXiv 2024
[66]

Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis,

J. Chen, J. Yu, C. Ge, L. Yao, E. Xie, Z. Wang, J. T. Kwok, P. Luo, H. Lu, and Z. Li, “Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis,” inProc. of International Conference on Learning Representations (ICLR), 2024

work page 2024
[67]

arXiv preprint arXiv:2406.09750 (2024)

X. Li, K. Qiu, H. Chen, J. Kuen, Z. Lin, R. Singh, and B. Raj, “ControlV AR: Exploring controllable visual autoregressive modeling,”arXiv preprint arXiv:2406.09750, 2024

work page arXiv 2024
[68]

CAR: controllable autoregressive modeling for visual generation,

Z. Yao, J. Li, Y . Zhou, Y . Liu, X. Jiang, C. Wang, F. Zheng, Y . Zou, and L. Li, “CAR: controllable autoregressive modeling for visual generation,”arXiv preprint arXiv:2410.04671, 2024

work page arXiv 2024
[69]

Neural discrete representation learning,

A. van den Oord, O. Vinyals, and K. Kavukcuoglu, “Neural discrete representation learning,” inProc. of Advances in Neural Information Processing Systems (NeurIPS), 2017

work page 2017
[70]

IntrinsicAnything: Learning diffusion priors for inverse rendering under unknown illumination,

X. Chen, S. Peng, D. Yang, Y . Liu, B. Pan, C. Lv, and X. Zhou, “IntrinsicAnything: Learning diffusion priors for inverse rendering under unknown illumination,” inProc. of European Conference on Computer Vision (ECCV), 2024

work page 2024
[71]

Kaolin: A pytorch library for accelerating 3D deep learning research,

K. M. Jatavallabhula, E. J. Smith, J. Lafleche, C. F. Tsang, A. Rozantsev, W. Chen, T. Xiang, R. Lebaredian, and S. Fidler, “Kaolin: A pytorch library for accelerating 3D deep learning research,”arXiv preprint arXiv:1911.05063, 2019

work page arXiv 1911
[72]

DRCT: saving image super-resolution away from information bottleneck,

C. Hsu, C. Lee, and Y . Chou, “DRCT: saving image super-resolution away from information bottleneck,” inProc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024

work page 2024
[73]

Huang, Y

Z. Huang, Y . Guo, H. Wang, R. Yi, L. Ma, Y . Cao, and L. Sheng, “MV-Adapter: Multi-view consistent image generation made easy,”arXiv preprint arXiv:2412.03632, 2024

work page arXiv 2024
[74]

Zero-1-to-3: Zero-shot one image to 3D object,

R. Liu, R. Wu, B. V . Hoorick, P. Tokmakov, S. Zakharov, and C. V ondrick, “Zero-1-to-3: Zero-shot one image to 3D object,” inProc. of IEEE/CVF International Conference on Computer Vision (ICCV), 2023

work page 2023
[75]

Zero123++: a Single Image to Consistent Multi-view Diffusion Base Model

R. Shi, H. Chen, Z. Zhang, M. Liu, C. Xu, X. Wei, L. Chen, C. Zeng, and H. Su, “Zero123++: a single image to consistent multi-view diffusion base model,”arXiv preprint arXiv:2310.15110, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[76]

Objaverse: A universe of annotated 3D objects,

M. Deitke, D. Schwenk, J. Salvador, L. Weihs, O. Michel, E. VanderBilt, L. Schmidt, K. Ehsani, A. Kem- bhavi, and A. Farhadi, “Objaverse: A universe of annotated 3D objects,” inProc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

work page 2023
[77]

Objaverse-XL: A universe of 10M+ 3D objects,

M. Deitke, R. Liu, M. Wallingford, H. Ngo, O. Michel, A. Kusupati, A. Fan, C. Laforte, V . V oleti, S. Y . Gadre, E. VanderBilt, A. Kembhavi, C. V ondrick, G. Gkioxari, K. Ehsani, L. Schmidt, and A. Farhadi, “Objaverse-XL: A universe of 10M+ 3D objects,” inProc. of Advances in Neural Information Processing Systems (NeurIPS), 2023

work page 2023
[78]

IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

H. Ye, J. Zhang, S. Liu, X. Han, and W. Yang, “IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models,”arXiv preprint arXiv:2308.06721, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[79]

SciPy 1.0: fundamental algorithms for scientific computing in Python,

P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright,et al., “SciPy 1.0: fundamental algorithms for scientific computing in Python,” Nature methods, 2020

work page 2020
[80]

Taming transformers for high-resolution image synthesis,

P. Esser, R. Rombach, and B. Ommer, “Taming transformers for high-resolution image synthesis,” inProc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021

work page 2021

Showing first 80 references.

[1] [1]

Pharr, W

M. Pharr, W. Jakob, and G. Humphreys,Physically Based Rendering: From Theory to Implementation. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 3rd ed., 2016

work page 2016

[2] [2]

TEXTure: Text-guided texturing of 3D shapes,

E. Richardson, G. Metzer, Y . Alaluf, R. Giryes, and D. Cohen-Or, “TEXTure: Text-guided texturing of 3D shapes,” inProc. of the ACM SIGGRAPH Conference and Exhibition On Computer Graphics and Interactive Techniques (SIGGRAPH), 2023

work page 2023

[3] [3]

Text2Tex: Text-driven texture synthesis via diffusion models,

D. Z. Chen, Y . Siddiqui, H. Lee, S. Tulyakov, and M. Nießner, “Text2Tex: Text-driven texture synthesis via diffusion models,” inProc. of IEEE/CVF International Conference on Computer Vision (ICCV), 2023

work page 2023

[4] [4]

Paint3D: Paint anything 3D with lighting-less texture diffusion models,

X. Zeng, X. Chen, Z. Qi, W. Liu, Z. Zhao, Z. Wang, B. Fu, Y . Liu, and G. Yu, “Paint3D: Paint anything 3D with lighting-less texture diffusion models,” inProc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024

[5] [5]

TexFusion: Synthesizing 3D textures with text-guided image diffusion models,

T. Cao, K. Kreis, S. Fidler, N. Sharp, and K. Yin, “TexFusion: Synthesizing 3D textures with text-guided image diffusion models,” inProc. of IEEE/CVF International Conference on Computer Vision (ICCV), 2023

work page 2023

[6] [6]

Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation

Z. Zhao, Z. Lai, Q. Lin, Y . Zhao, H. Liu, S. Yang, Y . Feng, M. Yang, S. Zhang, X. Yang, H. Shi, S. Liu, J. Wu, Y . Lian, F. Yang, R. Tang, Z. He, X. Wang, J. Liu, X. Zuo, Z. Chen, B. Lei, H. Weng, J. Xu, Y . Zhu, X. Liu, L. Xu, C. Hu, T. Huang, L. Wang, J. Zhang, M. Chen, L. Dong, Y . Jia, Y . Cai, J. Yu, Y . Tang, H. Zhang, Z. Ye, P. He, R. Wu, C. Zhan...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[7] [7]

GenesisTex: Adapting image denoising diffusion to texture space,

C. Gao, B. Jiang, X. Li, Y . Zhang, and Q. Yu, “GenesisTex: Adapting image denoising diffusion to texture space,” inProc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024

[8] [8]

InTeX: Interactive text-to-texture synthesis via unified depth-aware inpainting,

J. Tang, R. Lu, X. Chen, X. Wen, G. Zeng, and Z. Liu, “InTeX: Interactive text-to-texture synthesis via unified depth-aware inpainting,”arXiv preprint arXiv:2403.11878, 2024

work page arXiv 2024

[9] [9]

Meta 3D TextureGen: Fast and consistent texture generation for 3D objects,

R. Bensadoun, Y . Kleiman, I. Azuri, O. Harosh, A. Vedaldi, N. Neverova, and O. Gafni, “Meta 3D TextureGen: Fast and consistent texture generation for 3D objects,”arXiv preprint arXiv:2407.02430, 2024

work page arXiv 2024

[10] [10]

Meta 3D AssetGen: Text-to-mesh generation with high- quality geometry, texture, and PBR materials,

Y . Siddiqui, T. Monnier, F. Kokkinos, M. Kariya, Y . Kleiman, E. Garreau, O. Gafni, N. Neverova, A. Vedaldi, R. Shapovalov, and D. Novotný, “Meta 3D AssetGen: Text-to-mesh generation with high- quality geometry, texture, and PBR materials,” inProc. of Advances in Neural Information Processing Systems (NeurIPS), 2024

work page 2024

[11] [11]

VCD-Texture: Variance alignment based 3D-2D co-denoising for text-guided texturing,

S. Liu, C. Yu, C. Cao, W. Qian, and F. Wang, “VCD-Texture: Variance alignment based 3D-2D co-denoising for text-guided texturing,” inProc. of European Conference on Computer Vision (ECCV), 2024

work page 2024

[12] [12]

CLAY: A controllable large-scale generative model for creating high-quality 3D assets,

L. Zhang, Z. Wang, Q. Zhang, Q. Qiu, A. Pang, H. Jiang, W. Yang, L. Xu, and J. Yu, “CLAY: A controllable large-scale generative model for creating high-quality 3D assets,”ACM Transactions on Graphics (TOG), 2024

work page 2024

[13] [13]

TexGen: Text-guided 3D texture generation with multi-view sampling and resampling,

D. Huo, Z. Guo, X. Zuo, Z. Shi, J. Lu, P. Dai, S. Xu, L. Cheng, and Y . Yang, “TexGen: Text-guided 3D texture generation with multi-view sampling and resampling,” inProc. of European Conference on Computer Vision (ECCV), 2024

work page 2024

[14] [14]

Make-A- Texture: Fast shape-aware texture generation in 3 seconds,

X. Xiang, L. S. Gorelik, Y . Fan, O. Armstrong, F. N. Iandola, Y . Li, I. Lifshitz, and R. Ranjan, “Make-A- Texture: Fast shape-aware texture generation in 3 seconds,” inProc. of IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025

work page 2025

[15] [15]

Jointly generating multi-view consistent PBR textures using collaborative control,

S. Vainer, K. Kutsy, D. D. Nigris, C. Rowles, S. Elizarov, and S. Donné, “Jointly generating multi-view consistent PBR textures using collaborative control,”arXiv preprint arXiv:2410.06985, 2024

work page arXiv 2024

[16] [16]

MVPaint: Synchronized multi-view diffusion for painting anything 3D,

W. Cheng, J. Mu, X. Zeng, X. Chen, A. Pang, C. Zhang, Z. Wang, B. Fu, G. Yu, Z. Liu, and L. Pan, “MVPaint: Synchronized multi-view diffusion for painting anything 3D,” inProc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

work page 2025

[17] [17]

MCMat: Multiview-consistent and physically accurate PBR material generation,

S. Zhu, L. Qiu, X. Gu, Z. Zhao, C. Xu, Y . He, Z. Li, X. Han, Y . Yao, X. Cao, S. Zhu, W. Yuan, Z. Dong, and H. Zhu, “MCMat: Multiview-consistent and physically accurate PBR material generation,”arXiv preprint arXiv:2412.14148, 2024

work page arXiv 2024

[18] [18]

TexPainter: Generative mesh texturing with multi-view consistency,

H. Zhang, Z. Pan, C. Zhang, L. Zhu, and X. Gao, “TexPainter: Generative mesh texturing with multi-view consistency,” inProc. of the ACM SIGGRAPH Conference and Exhibition On Computer Graphics and Interactive Techniques (SIGGRAPH), 2024

work page 2024

[19] [19]

Pandora3d: A comprehensive framework for high-quality 3d shape and texture generation

J. Yang, T. Shang, W. Sun, X. Song, Z. Cheng, S. Wang, S. Chen, W. Liu, H. Li, and P. Ji, “Pan- dora3D: A comprehensive framework for high-quality 3D shape and texture generation,”arXiv preprint arXiv:2502.14247, 2025

work page arXiv 2025

[20] [20]

Text-guided texturing by synchronized multi-view diffusion,

Y . Liu, M. Xie, H. Liu, and T. Wong, “Text-guided texturing by synchronized multi-view diffusion,” inProc. of the ACM SIGGRAPH Conference and Exhibition on Computer Graphics and Interactive Techniques in Asia (SIGGRAPH Asia), 2024

work page 2024

[21] [21]

Texture generation on 3D meshes with Point-UV diffusion,

X. Yu, P. Dai, W. Li, L. Ma, Z. Liu, and X. Qi, “Texture generation on 3D meshes with Point-UV diffusion,” inProc. of IEEE/CVF International Conference on Computer Vision (ICCV), 2023

work page 2023

[22] [22]

TEXGen: a generative diffusion model for mesh textures,

X. Yu, Z. Yuan, Y . Guo, Y . Liu, J. Liu, Y . Li, Y . Cao, D. Liang, and X. Qi, “TEXGen: a generative diffusion model for mesh textures,”ACM Transactions on Graphics (TOG), 2024

work page 2024

[23] [23]

UV-free texture generation with denoising and geodesic heat diffusions,

S. Foti, S. Zafeiriou, and T. Birdal, “UV-free texture generation with denoising and geodesic heat diffusions,” arXiv preprint arXiv:2408.16762, 2024

work page arXiv 2024

[24] [24]

TexOct: Generating textures of 3D models with octree-based diffusion,

J. Liu, C. Wu, X. Liu, X. Liu, J. Wu, H. Peng, C. Zhao, H. Feng, J. Liu, and E. Ding, “TexOct: Generating textures of 3D models with octree-based diffusion,” inProc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 11

work page 2024

[25] [25]

TexGaussian: Generating high-quality PBR material via octree-based 3D Gaussian splatting,

B. Xiong, J. Liu, J. Hu, C. Wu, J. Wu, X. Liu, C. Zhao, E. Ding, and Z. Lian, “TexGaussian: Generating high-quality PBR material via octree-based 3D Gaussian splatting,” inProc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

work page 2025

[26] [26]

High-resolution image synthesis with latent diffusion models,

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” inProc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

work page 2022

[27] [27]

Scaling rectified flow transformers for high- resolution image synthesis,

P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y . Levi, D. Lorenz, A. Sauer, F. Boesel, D. Podell, T. Dockhorn, Z. English, and R. Rombach, “Scaling rectified flow transformers for high- resolution image synthesis,” inProc. of International Conference on Machine Learning (ICML), 2024

work page 2024

[28] [28]

SDXL: improving latent diffusion models for high-resolution image synthesis,

D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach, “SDXL: improving latent diffusion models for high-resolution image synthesis,” inProc. of International Conference on Learning Representations (ICLR), 2024

work page 2024

[29] [29]

arXiv preprint arXiv:2309.15807 , year=

X. Dai, J. Hou, C. Ma, S. S. Tsai, J. Wang, R. Wang, P. Zhang, S. Vandenhende, X. Wang, A. Dubey, M. Yu, A. Kadian, F. Radenovic, D. Mahajan, K. Li, Y . Zhao, V . Petrovic, M. K. Singh, S. Motwani, Y . Wen, Y . Song, R. Sumbaly, V . Ramanathan, Z. He, P. Vajda, and D. Parikh, “Emu: Enhancing image generation models using photogenic needles in a haystack,”...

work page arXiv 2023

[30] [30]

FlashTex: Fast relightable mesh texturing with LightControlNet,

K. Deng, T. Omernick, A. Weiss, D. Ramanan, J. Zhu, T. Zhou, and M. Agrawala, “FlashTex: Fast relightable mesh texturing with LightControlNet,” inProc. of European Conference on Computer Vision (ECCV), 2024

work page 2024

[31] [31]

TexDreamer: Towards zero-shot high-fidelity 3D human texture generation,

Y . Liu, J. Zhu, J. Tang, S. Zhang, J. Zhang, W. Cao, C. Wang, Y . Wu, and D. Huang, “TexDreamer: Towards zero-shot high-fidelity 3D human texture generation,” inProc. of European Conference on Computer Vision (ECCV), 2024

work page 2024

[32] [32]

Least squares conformal maps for automatic texture atlas generation,

B. Lévy, S. Petitjean, N. Ray, and J. Maillot, “Least squares conformal maps for automatic texture atlas generation,”ACM Transactions on Graphics (TOG), vol. 21, no. 3, pp. 362–371, 2002

work page 2002

[33] [33]

Two-dimensional finite bin-packing algorithms,

J. O. Berkey and P. Y . Wang, “Two-dimensional finite bin-packing algorithms,”Journal of the Operational Research Society, 1987

work page 1987

[34] [34]

A thousand ways to pack the bin – a practical approach to two-dimensional rectangle bin packing,

J. Jylänki, “A thousand ways to pack the bin – a practical approach to two-dimensional rectangle bin packing,” 2010

work page 2010

[35] [35]

Visual autoregressive modeling: Scalable image generation via next-scale prediction,

K. Tian, Y . Jiang, Z. Yuan, B. Peng, and L. Wang, “Visual autoregressive modeling: Scalable image generation via next-scale prediction,” inProc. of Advances in Neural Information Processing Systems (NeurIPS), 2024

work page 2024

[36] [36]

Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis,

J. Han, J. Liu, Y . Jiang, B. Yan, Y . Zhang, Z. Yuan, B. Peng, and X. Liu, “Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis,” inProc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

work page 2025

[37] [37]

Pixart-Σ: Weak-to- strong training of diffusion transformer for 4K text-to-image generation,

J. Chen, C. Ge, E. Xie, Y . Wu, L. Yao, X. Ren, Z. Wang, P. Luo, H. Lu, and Z. Li, “Pixart-Σ: Weak-to- strong training of diffusion transformer for 4K text-to-image generation,” inProc. of European Conference on Computer Vision (ECCV), 2024

work page 2024

[38] [38]

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

P. Sun, Y . Jiang, S. Chen, S. Zhang, B. Peng, P. Luo, and Z. Yuan, “Autoregressive model beats diffusion: Llama for scalable image generation,”arXiv preprint arXiv:2406.06525, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[39] [39]

HART: efficient visual generation with hybrid autoregressive transformer,

H. Tang, Y . Wu, S. Yang, E. Xie, J. Chen, J. Chen, Z. Zhang, H. Cai, Y . Lu, and S. Han, “HART: efficient visual generation with hybrid autoregressive transformer,”arXiv preprint arXiv:2410.10812, 2024

work page arXiv 2024

[40] [40]

Text2Mesh: Text-driven neural stylization for meshes,

O. Michel, R. Bar-On, R. Liu, S. Benaim, and R. Hanocka, “Text2Mesh: Text-driven neural stylization for meshes,” inProc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

work page 2022

[41] [41]

Texturify: Generating textures on 3D shape surfaces,

Y . Siddiqui, J. Thies, F. Ma, Q. Shan, M. Nießner, and A. Dai, “Texturify: Generating textures on 3D shape surfaces,” inProc. of European Conference on Computer Vision (ECCV), 2022

work page 2022

[42] [42]

GET3D: A generative model of high quality 3D textured shapes learned from images,

J. Gao, T. Shen, Z. Wang, W. Chen, K. Yin, D. Li, O. Litany, Z. Gojcic, and S. Fidler, “GET3D: A generative model of high quality 3D textured shapes learned from images,” inProc. of Advances in Neural Information Processing Systems (NeurIPS), 2022

work page 2022

[43] [43]

CLIP-Mesh: Generating textured meshes from text using pretrained image-text models,

N. M. Khalid, T. Xie, E. Belilovsky, and T. Popa, “CLIP-Mesh: Generating textured meshes from text using pretrained image-text models,” inProc. of the ACM SIGGRAPH Conference and Exhibition on Computer Graphics and Interactive Techniques in Asia (SIGGRAPH Asia), 2022. 12

work page 2022

[44] [44]

Latent-NeRF for shape-guided generation of 3D shapes and textures,

G. Metzer, E. Richardson, O. Patashnik, R. Giryes, and D. Cohen-Or, “Latent-NeRF for shape-guided generation of 3D shapes and textures,” inProc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

work page 2023

[45] [45]

Paint-it: Text-to-texture synthesis via deep convolutional texture map optimization and physically-based rendering,

K. Youwang, T. Oh, and G. Pons-Moll, “Paint-it: Text-to-texture synthesis via deep convolutional texture map optimization and physically-based rendering,” inProc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024

[46] [46]

TextureDreamer: Image-guided texture synthesis through geometry-aware diffusion,

Y . Yeh, J. Huang, C. Kim, L. Xiao, T. Nguyen-Phuoc, N. Khan, C. Zhang, M. Chandraker, C. S. Marshall, Z. Dong, and Z. Li, “TextureDreamer: Image-guided texture synthesis through geometry-aware diffusion,” inProc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024

[47] [47]

DreamMat: High-quality PBR material generation with geometry- and light-aware diffusion models,

Y . Zhang, Y . Liu, Z. Xie, L. Yang, Z. Liu, M. Yang, R. Zhang, Q. Kou, C. Lin, W. Wang, and X. Jin, “DreamMat: High-quality PBR material generation with geometry- and light-aware diffusion models,” ACM Transactions on Graphics (TOG), 2024

work page 2024

[48] [48]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” inProc. of International Conference on Machine Learning (ICML), 2021

work page 2021

[49] [49]

Generative Adversarial Networks

I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y . Bengio, “Generative adversarial networks,”arXiv preprint arXiv:1406.2661, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[50] [50]

DreamFusion: Text-to-3D using 2D diffusion,

B. Poole, A. Jain, J. T. Barron, and B. Mildenhall, “DreamFusion: Text-to-3D using 2D diffusion,” inProc. of International Conference on Learning Representations (ICLR), 2023

work page 2023

[51] [51]

Adding conditional control to text-to-image diffusion models,

L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” in Proc. of IEEE/CVF International Conference on Computer Vision (ICCV), 2023

work page 2023

[52] [52]

MatAtlas: Text-driven consistent geometry texturing and material assignment,

D. Ceylan, V . Deschaintre, T. Groueix, R. Martin, C. P. Huang, R. Rouffet, V . G. Kim, and G. Las- sagne, “MatAtlas: Text-driven consistent geometry texturing and material assignment,”arXiv preprint arXiv:2404.02899, 2024

work page arXiv 2024

[53] [53]

Make-it-Real: Unleashing large multi- modal model’s ability for painting 3D objects with realistic materials,

Y . Fang, Z. Sun, T. Wu, J. Wang, Z. Liu, G. Wetzstein, and D. Lin, “Make-it-Real: Unleashing large multi- modal model’s ability for painting 3D objects with realistic materials,”arXiv preprint arXiv:2404.16829, 2024

work page arXiv 2024

[54] [54]

MaterialSeg3D: Segmenting dense materials from 2D priors for 3D assets,

Z. Li, R. Gan, C. Luo, Y . Wang, J. Liu, Z. Zhu, Q. Li, X. Yin, M. Zhang, Z. Zhang, and J. Peng, “MaterialSeg3D: Segmenting dense materials from 2D priors for 3D assets,” inProc. of ACM International Conference on Multimedia (ACM MM), 2024

work page 2024

[55] [55]

MaPa: Text-driven photorealistic material painting for 3D shapes,

S. Zhang, S. Peng, T. Xu, Y . Yang, T. Chen, N. Xue, Y . Shen, H. Bao, R. Hu, and X. Zhou, “MaPa: Text-driven photorealistic material painting for 3D shapes,” inProc. of the ACM SIGGRAPH Conference and Exhibition On Computer Graphics and Interactive Techniques (SIGGRAPH), 2024

work page 2024

[56] [56]

U-Net: Convolutional networks for biomedical image seg- mentation,

O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image seg- mentation,” inProc. of International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015

work page 2015

[57] [57]

T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models,

C. Mou, X. Wang, L. Xie, Y . Wu, J. Zhang, Z. Qi, and Y . Shan, “T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models,” inProc. of Association for the Advancement of Artificial Intelligence, 2024

work page 2024

[58] [58]

ControlNeXt: Powerful and efficient control for image and video generation,

B. Peng, J. Wang, Y . Zhang, W. Li, M. Yang, and J. Jia, “ControlNeXt: Powerful and efficient control for image and video generation,”arXiv preprint arXiv:2408.06070, 2024

work page arXiv 2024

[59] [59]

OminiControl: Minimal and universal control for diffusion transformer,

Z. Tan, S. Liu, X. Yang, Q. Xue, and X. Wang, “OminiControl: Minimal and universal control for diffusion transformer,”arXiv preprint arXiv:2411.15098, 2024

work page arXiv 2024

[60] [60]

Uni-ControlNet: All-in-one control to text-to-image diffusion models,

S. Zhao, D. Chen, Y . Chen, J. Bao, S. Hao, L. Yuan, and K. K. Wong, “Uni-ControlNet: All-in-one control to text-to-image diffusion models,” inProc. of Advances in Neural Information Processing Systems (NeurIPS), 2023

work page 2023

[61] [61]

UniControl: A unified diffusion model for controllable visual generation in the wild,

C. Qin, S. Zhang, N. Yu, Y . Feng, X. Yang, Y . Zhou, H. Wang, J. C. Niebles, C. Xiong, S. Savarese, S. Ermon, Y . Fu, and R. Xu, “UniControl: A unified diffusion model for controllable visual generation in the wild,” inProc. of Advances in Neural Information Processing Systems (NeurIPS), 2023

work page 2023

[62] [62]

ControlNet++: Improving conditional controls with efficient consistency feedback,

M. Li, T. Yang, H. Kuang, J. Wu, Z. Wang, X. Xiao, and C. Chen, “ControlNet++: Improving conditional controls with efficient consistency feedback,” inProc. of European Conference on Computer Vision (ECCV), 2024. 13

work page 2024

[63] [63]

ControlAR: Controllable image generation with autoregressive models,

Z. Li, T. Cheng, S. Chen, P. Sun, H. Shen, L. Ran, X. Chen, W. Liu, and X. Wang, “ControlAR: Controllable image generation with autoregressive models,”arXiv preprint arXiv:2410.02705, 2024

work page arXiv 2024

[64] [64]

Dinov2: Learning robust visual features without supervision,

M. Oquab, T. Darcet, T. Moutakanni, H. V . V o, M. Szafraniec, V . Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P. Huang, S. Li, I. Misra, M. Rabbat, V . Sharma, G. Synnaeve, H. Xu, H. Jégou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski, “Dinov2: Learning robust visual features without supervi...

work page 2024

[65] [65]

Pixart-{\delta}: Fast and controllable image generation with latent consistency models.arXiv preprint arXiv:2401.05252, 2024

J. Chen, Y . Wu, S. Luo, E. Xie, S. Paul, P. Luo, H. Zhao, and Z. Li, “Pixart-δ: Fast and controllable image generation with latent consistency models,”arXiv preprint arXiv:2401.05252, 2024

work page arXiv 2024

[66] [66]

Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis,

J. Chen, J. Yu, C. Ge, L. Yao, E. Xie, Z. Wang, J. T. Kwok, P. Luo, H. Lu, and Z. Li, “Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis,” inProc. of International Conference on Learning Representations (ICLR), 2024

work page 2024

[67] [67]

arXiv preprint arXiv:2406.09750 (2024)

X. Li, K. Qiu, H. Chen, J. Kuen, Z. Lin, R. Singh, and B. Raj, “ControlV AR: Exploring controllable visual autoregressive modeling,”arXiv preprint arXiv:2406.09750, 2024

work page arXiv 2024

[68] [68]

CAR: controllable autoregressive modeling for visual generation,

Z. Yao, J. Li, Y . Zhou, Y . Liu, X. Jiang, C. Wang, F. Zheng, Y . Zou, and L. Li, “CAR: controllable autoregressive modeling for visual generation,”arXiv preprint arXiv:2410.04671, 2024

work page arXiv 2024

[69] [69]

Neural discrete representation learning,

A. van den Oord, O. Vinyals, and K. Kavukcuoglu, “Neural discrete representation learning,” inProc. of Advances in Neural Information Processing Systems (NeurIPS), 2017

work page 2017

[70] [70]

IntrinsicAnything: Learning diffusion priors for inverse rendering under unknown illumination,

X. Chen, S. Peng, D. Yang, Y . Liu, B. Pan, C. Lv, and X. Zhou, “IntrinsicAnything: Learning diffusion priors for inverse rendering under unknown illumination,” inProc. of European Conference on Computer Vision (ECCV), 2024

work page 2024

[71] [71]

Kaolin: A pytorch library for accelerating 3D deep learning research,

K. M. Jatavallabhula, E. J. Smith, J. Lafleche, C. F. Tsang, A. Rozantsev, W. Chen, T. Xiang, R. Lebaredian, and S. Fidler, “Kaolin: A pytorch library for accelerating 3D deep learning research,”arXiv preprint arXiv:1911.05063, 2019

work page arXiv 1911

[72] [72]

DRCT: saving image super-resolution away from information bottleneck,

C. Hsu, C. Lee, and Y . Chou, “DRCT: saving image super-resolution away from information bottleneck,” inProc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024

work page 2024

[73] [73]

Huang, Y

Z. Huang, Y . Guo, H. Wang, R. Yi, L. Ma, Y . Cao, and L. Sheng, “MV-Adapter: Multi-view consistent image generation made easy,”arXiv preprint arXiv:2412.03632, 2024

work page arXiv 2024

[74] [74]

Zero-1-to-3: Zero-shot one image to 3D object,

R. Liu, R. Wu, B. V . Hoorick, P. Tokmakov, S. Zakharov, and C. V ondrick, “Zero-1-to-3: Zero-shot one image to 3D object,” inProc. of IEEE/CVF International Conference on Computer Vision (ICCV), 2023

work page 2023

[75] [75]

Zero123++: a Single Image to Consistent Multi-view Diffusion Base Model

R. Shi, H. Chen, Z. Zhang, M. Liu, C. Xu, X. Wei, L. Chen, C. Zeng, and H. Su, “Zero123++: a single image to consistent multi-view diffusion base model,”arXiv preprint arXiv:2310.15110, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[76] [76]

Objaverse: A universe of annotated 3D objects,

M. Deitke, D. Schwenk, J. Salvador, L. Weihs, O. Michel, E. VanderBilt, L. Schmidt, K. Ehsani, A. Kem- bhavi, and A. Farhadi, “Objaverse: A universe of annotated 3D objects,” inProc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

work page 2023

[77] [77]

Objaverse-XL: A universe of 10M+ 3D objects,

M. Deitke, R. Liu, M. Wallingford, H. Ngo, O. Michel, A. Kusupati, A. Fan, C. Laforte, V . V oleti, S. Y . Gadre, E. VanderBilt, A. Kembhavi, C. V ondrick, G. Gkioxari, K. Ehsani, L. Schmidt, and A. Farhadi, “Objaverse-XL: A universe of 10M+ 3D objects,” inProc. of Advances in Neural Information Processing Systems (NeurIPS), 2023

work page 2023

[78] [78]

IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

H. Ye, J. Zhang, S. Liu, X. Han, and W. Yang, “IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models,”arXiv preprint arXiv:2308.06721, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[79] [79]

SciPy 1.0: fundamental algorithms for scientific computing in Python,

P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright,et al., “SciPy 1.0: fundamental algorithms for scientific computing in Python,” Nature methods, 2020

work page 2020

[80] [80]

Taming transformers for high-resolution image synthesis,

P. Esser, R. Rombach, and B. Ommer, “Taming transformers for high-resolution image synthesis,” inProc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021

work page 2021