pith. machine review for the scientific record.

arxiv: 2512.07834 · v2 · submitted 2025-12-08 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Voxify3D: Pixel Art Meets Volumetric Rendering

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 23:57 UTC · model grok-4.3

classification 💻 cs.CV
keywords voxel art generation · 3D mesh stylization · differentiable rendering · CLIP semantic alignment · Gumbel-Softmax quantization · pixel art aesthetics · volumetric rendering · discrete optimization

The pith

Voxify3D converts 3D meshes into voxel art by aligning straight-on pixel renders with CLIP patches and using differentiable palette quantization to keep semantics intact under discretization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a two-stage differentiable framework that turns smooth 3D meshes into blocky voxel art while preserving what the object represents and restricting output to small color sets. It relies on rendering the model from straight-on orthographic views to match exact pixel positions, matching small image patches to text descriptions to hold onto meaning across detail loss, and a soft quantization step that lets the system pick discrete colors from a chosen palette during training. This combination tackles the core tension in voxel stylization where geometry must simplify drastically yet still read as the original shape with coherent limited colors. A sympathetic reader would care because it provides an automated route to game-style assets that follow strict pixel-art rules without manual cleanup or loss of recognizable identity.
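
To see why the straight-on views matter: with an axis-aligned orthographic camera and one pixel per voxel, every pixel composites exactly one voxel column, so supervision gradients land on unambiguous voxels. A minimal NumPy sketch of that compositing, an illustration rather than the paper's renderer:

```python
# Minimal sketch: axis-aligned orthographic compositing gives exact
# pixel-voxel alignment, since parallel rays keep pixel (i, j) on the
# single voxel column (i, j, :). Illustrative only, not the paper's code.
import numpy as np

def orthographic_composite(density, color):
    """density: (N, N, N) occupancies; color: (N, N, N, 3) RGB.
    Returns an (N, N, 3) image; pixel (i, j) depends only on column (i, j, :)."""
    alpha = 1.0 - np.exp(-density)                       # per-voxel opacity
    trans = np.cumprod(1.0 - alpha + 1e-10, axis=2)      # cumulative transmittance
    trans = np.concatenate([np.ones_like(trans[..., :1]), trans[..., :-1]], axis=2)
    weights = alpha * trans                              # front-to-back weights
    return (weights[..., None] * color).sum(axis=2)

rng = np.random.default_rng(0)
N = 16
img = orthographic_composite(rng.random((N, N, N)), rng.random((N, N, N, 3)))
print(img.shape)  # (16, 16, 3): one pixel per voxel column, no perspective bleed
```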

Core claim

Voxify3D bridges 3D mesh optimization with 2D pixel art supervision by integrating orthographic pixel art supervision that removes perspective distortion for precise voxel-pixel alignment, patch-based CLIP alignment that preserves semantics across discretization levels, and palette-constrained Gumbel-Softmax quantization that enables differentiable optimization over discrete color spaces with controllable palette strategies. This integration directly addresses semantic preservation under extreme discretization and pixel-art aesthetics through volumetric rendering while supporting end-to-end discrete optimization.

What carries the argument

The argument is carried by the synergistic integration of orthographic pixel art supervision, patch-based CLIP alignment, and palette-constrained Gumbel-Softmax quantization, which together allow gradient-based optimization of voxel grids from meshes while enforcing pixel-level aesthetics and semantic consistency.
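
A minimal sketch of the quantization piece, assuming a fixed palette tensor and per-voxel logits over palette entries; the authors' actual losses, temperature schedule, and palette handling may differ:

```python
# Hedged sketch of palette-constrained Gumbel-Softmax quantization, not the
# authors' implementation. Each voxel holds logits over a K-color palette;
# sampling stays differentiable (soft), or discrete-forward with hard=True.
import torch
import torch.nn.functional as F

K = 4                                          # palette size (e.g. 2-8 colors)
palette = torch.rand(K, 3)                     # assumed fixed palette, e.g. from k-means
logits = torch.randn(16, 16, 16, K, requires_grad=True)  # per-voxel color logits

def quantize(logits, palette, tau=1.0, hard=False):
    # Straight-through Gumbel-Softmax: discrete palette pick in the forward
    # pass when hard=True, soft mixture gradients in the backward pass.
    probs = F.gumbel_softmax(logits, tau=tau, hard=hard, dim=-1)  # (..., K)
    return probs @ palette                     # (..., 3) voxel colors

colors = quantize(logits, palette, tau=0.5, hard=True)
loss = colors.mean()                           # stand-in for render/CLIP losses
loss.backward()                                # gradients flow to the logits
print(colors.shape, logits.grad is not None)
```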

If this is right

  • Voxel outputs can be generated controllably at abstraction levels from 2-8 colors and 20x-50x resolutions while remaining recognizable.
  • End-to-end training replaces separate post-processing steps for color quantization and geometry abstraction in voxel pipelines.
  • The framework produces higher semantic alignment scores and stronger user preference than prior mesh-to-voxel methods across characters.
  • Palette strategies become tunable parameters that directly influence the final discrete color coherence without retraining (a minimal extraction sketch follows below).
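
For the extraction step itself, Figure 3 notes the palette can come, e.g., from k-means over the orthographic supervision views. A hedged sketch using scikit-learn's KMeans with hypothetical stand-in views (the authors also list Max-Min, Median Cut, and Simulated Annealing strategies):

```python
# Minimal palette-extraction sketch, assuming (as Figure 3 suggests) the
# palette comes from k-means over pixels of the supervision views.
import numpy as np
from sklearn.cluster import KMeans

def extract_palette(views, k=4, seed=0):
    """views: list of (H, W, 3) float images in [0, 1]; returns a (k, 3) palette."""
    pixels = np.concatenate([v.reshape(-1, 3) for v in views], axis=0)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(pixels)
    return km.cluster_centers_                 # k representative colors

rng = np.random.default_rng(0)
views = [rng.random((64, 64, 3)) for _ in range(6)]  # stand-ins for 6 ortho views
print(extract_palette(views, k=4))
```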

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same supervision stack could be tested on generating voxel versions of static environments or props to check if semantic preservation generalizes beyond characters.
  • Combining this discretization approach with rigged 3D models might enable direct creation of animated voxel characters for games.
  • The reliance on patch CLIP suggests the method could adapt to other extreme stylizations such as low-poly or sprite-based outputs from 3D sources.

Load-bearing premise

Straight-on orthographic renders plus patch-level CLIP matches will keep the original meaning and desired pixel-art look intact once the mesh is locked into a coarse grid and tiny color palette.
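
A hedged sketch of what patch-level CLIP matching could look like, using OpenAI's clip package; the patch size, prompt, and omission of CLIP's input normalization are assumptions for brevity, not the paper's protocol:

```python
# Hedged sketch of patch-based CLIP alignment, not the paper's exact loss.
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cpu"  # fp32; on CUDA the model loads as fp16 and inputs must match
model, _ = clip.load("ViT-B/32", device=device)
text = clip.tokenize(["a voxel art knight"]).to(device)  # hypothetical prompt

def patch_clip_loss(render, text_tokens, patch=112):
    """render: (3, H, W) in [0, 1]; returns 1 - mean patch/text cosine similarity.
    CLIP's channel normalization is omitted here to keep the sketch short."""
    p = render.unfold(1, patch, patch).unfold(2, patch, patch)  # tile into patches
    p = p.permute(1, 2, 0, 3, 4).reshape(-1, 3, patch, patch)
    p = F.interpolate(p, size=224, mode="bilinear", align_corners=False)
    img_feat = model.encode_image(p)            # one embedding per patch
    txt_feat = model.encode_text(text_tokens)   # shared text embedding
    sim = F.cosine_similarity(img_feat, txt_feat, dim=-1)
    return 1.0 - sim.mean()                     # low loss = patches match the text

render = torch.rand(3, 224, 224)                # stand-in orthographic render
print(patch_clip_loss(render, text).item())
```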

What would settle it

Run the method on standard test meshes such as a human character or simple object at 20x resolution with a 4-color palette and measure whether human viewers can still correctly identify the source object from the voxel output at rates above chance.
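
Concretely, "above chance" reduces to a one-sided binomial test on identification counts; all numbers below are hypothetical, not from the paper:

```python
# One-sided binomial test for "viewers identify the source above chance".
from scipy.stats import binomtest

n_viewers = 50      # hypothetical participants
n_correct = 22      # hypothetical correct identifications
chance = 0.1        # e.g. picking the source from 10 candidate objects

result = binomtest(n_correct, n_viewers, p=chance, alternative="greater")
print(f"p-value = {result.pvalue:.2e}")  # small p => identification above chance
```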

Figures

Figures reproduced from arXiv: 2512.07834 by Hao-Jen Chien, Jiewen Chan, Yi-Chuan Huang, Yu-Lun Liu.

Figure 1
Figure 1: Stylized voxel art with controllable abstraction. Voxify3D converts 3D meshes into stylized voxel art using discrete color palettes, pixel art supervision, and voxel-based radiance fields. This teaser showcases the flexibility and quality of our method. (a) Diverse voxel art outputs across object types and use cases. (b) Comparison of different palette selection methods. (c) Control over the resolution of … view at source ↗
Figure 2
Figure 2: Existing methods often miss key features in voxelization. While IN2N [30], Vox-E [83], and Blender (Geometry Nodes) generate outputs that are coarse, blurry, or semantically inconsistent, they frequently lose critical elements such as facial features. In contrast, our method preserves structural details and produces visually appealing voxel art with sharp abstraction. view at source ↗
Figure 3
Figure 3: Our two-stage voxel art generation pipeline. (a) Coarse voxel grid training: Given a 3D mesh, we render multi-view images and optimize a voxel-based radiance field (DVGO [87]) using MSE loss to learn coarse RGB and density. (b) Orthographic pixel art fine-tuning: We refine the voxel grid using six orthographic pixel art views, which also serve to extract a discrete color palette (e.g., via k-means). Optimi… view at source ↗
Figure 4
Figure 4: Perspective vs. Orthographic. (Left) Six-view pixel art pipeline. (Right) Perspective views (red) misalign pixels, while six orthographic views (green) enable precise pixel-voxel alignment. The six-view setup compactly covers the major surfaces of the object, while orthographic rendering formulates parallel ray castin… view at source ↗
Figure 5
Figure 5: Qualitative comparisons on character models from the Rodin [97] dataset. We compare our voxel art results with a pixel-art-to-3D extension, IN2N [30], Vox-E [83], and Blender's voxelization. Our method produces stylized yet consistent voxel representations with pixel art aesthetics. view at source ↗
Figure 6
Figure 6: Effect of Palette Selection and Color Count. Each row corresponds to a different palette extraction method: K-means, Max-Min, Median Cut, and Simulated Annealing. Each column shows increasing color counts (2, 3, 4, 8). Each method produces unique color clustering effects. view at source ↗
Figure 7
Figure 7: Ablation study on model components. We show outputs after removing key modules: pixel art supervision, orthographic projection, grid initialization, depth loss, CLIP loss, and Gumbel-Softmax. Each row shows a different input; columns compare ablations. The full model yields coherent stylization, while removals cause distortions, color artifacts, or semantic loss. view at source ↗
Figure 8
Figure 8: Fabrication: LEGO render. Rendered using KeyShot 2023. Our method extends to LEGO applications, where achieving rich visual results within the limited color palette is crucial for practical fabrication. view at source ↗
Figure 9
Figure 9: Ablation user study of Gumbel. Four representative examples comparing results with and without Gumbel-Softmax. Without Gumbel-Softmax, voxel colors become blurred and features less distinct. view at source ↗
Figure 10
Figure 10: Downsample vs. Pixel Art. Stage 1 (Coarse voxelization): the voxel grid is optimized with MSE reconstruction loss, regularized by density and background terms: L_total = L_render + λ_d L_density + λ_b L_bg, where L_render is MSE between rendered and target colors, L_density applies density regularization and total variation smoothing, and L_bg uses entropy to suppress background noise. This stage provides a stable in… view at source ↗
Figure 11
Figure 11: Greyscale examples. Grayscale voxel renderings. view at source ↗
Figure 12
Figure 12: User study UI. Voxel art appeal: "Which version looks most visually appealing as a voxel art character, like something you might see in Minecraft or a stylized game?" For the grayscale examples, participants answered on geometry preservation: "Which grayscale voxel rendering more closely resembles the original 3D mesh in terms of overall geometry?" Expert study on color preference. We further conducted… view at source ↗
Figure 13
Figure 13: Additional qualitative comparisons with baselines. Eight representative examples compared with Pixel, IN2N, Vox-E, and Blender Geometry Nodes. view at source ↗
Figure 14
Figure 14: Results with varying palette settings. Examples using different palette extraction strategies and palette sizes. view at source ↗
Figure 15
Figure 15: Results under different voxel sizes. view at source ↗
Figure 16
Figure 16: Comparison with Gemini 3 [27]. While Gemini 3 can generate voxel art through code, it lacks precise control over resolution, palette, and visual fidelity to input references. view at source ↗
Figure 17
Figure 17: Comparison with Rodin [97]. Rodin excels at image-to-mesh but is not tailored for voxel art, often yielding non-voxel outputs (right) or flat geometry (left). view at source ↗
Figure 18
Figure 18: Representative failure cases. Complex shapes with fine-grained geometric details are difficult to represent under limited voxel resolution, resulting in loss of intricate structures. view at source ↗
read the original abstract

Voxel art is a distinctive stylization widely used in games and digital media, yet automated generation from 3D meshes remains challenging due to conflicting requirements of geometric abstraction, semantic preservation, and discrete color coherence. Existing methods either over-simplify geometry or fail to achieve the pixel-precise, palette-constrained aesthetics of voxel art. We introduce Voxify3D, a differentiable two-stage framework bridging 3D mesh optimization with 2D pixel art supervision. Our core innovation lies in the synergistic integration of three components: (1) orthographic pixel art supervision that eliminates perspective distortion for precise voxel-pixel alignment; (2) patch-based CLIP alignment that preserves semantics across discretization levels; (3) palette-constrained Gumbel-Softmax quantization enabling differentiable optimization over discrete color spaces with controllable palette strategies. This integration addresses fundamental challenges: semantic preservation under extreme discretization, pixel-art aesthetics through volumetric rendering, and end-to-end discrete optimization. Experiments show superior performance (37.12 CLIP-IQA, 77.90% user preference) across diverse characters and controllable abstraction (2-8 colors, 20x-50x resolutions). Project page: https://yichuanh.github.io/Voxify-3D/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. Voxify3D is a differentiable two-stage framework for converting 3D meshes into voxel art. It combines orthographic pixel art supervision to eliminate perspective distortion, patch-based CLIP alignment to preserve semantics under discretization, and palette-constrained Gumbel-Softmax quantization to enable end-to-end optimization over discrete color palettes. Experiments report superior results with 37.12 CLIP-IQA and 77.90% user preference across characters at 2-8 colors and 20x-50x resolutions.

Significance. If the central claims hold after proper validation, the work could advance automated voxel stylization for games and digital media by providing a practical pipeline that jointly handles geometric abstraction, semantic fidelity, and palette constraints. The explicit use of orthographic supervision and differentiable quantization offers a concrete route to controllable discrete 3D outputs.

major comments (2)
  1. [Abstract] Abstract: the reported superiority (37.12 CLIP-IQA, 77.90% user preference) is presented without any description of baselines, ablation studies, or the precise experimental protocol (view selection, number of test meshes, evaluation views). This information is load-bearing for the claim that the three-component integration outperforms prior art.
  2. [Abstract] Method description (abstract): the claim that orthographic + patch-CLIP supervision plus Gumbel-Softmax produces 3D-consistent semantics after discretization lacks an explicit cross-view consistency term. Nothing in the pipeline is shown to penalize a voxel assignment that is coherent only on the training orthographic axes but collapses or recolors under rotation, which directly undermines the semantic-preservation guarantee.
minor comments (1)
  1. [Abstract] Abstract: no statement on code, data, or model availability is provided despite the project page link.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed comments. We have revised the manuscript to address the concerns about experimental context in the abstract and to clarify the mechanisms supporting 3D semantic consistency. Point-by-point responses follow.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the reported superiority (37.12 CLIP-IQA, 77.90% user preference) is presented without any description of baselines, ablation studies, or the precise experimental protocol (view selection, number of test meshes, evaluation views). This information is load-bearing for the claim that the three-component integration outperforms prior art.

    Authors: We agree that the abstract would benefit from additional context to substantiate the reported metrics. In the revised version we have expanded the abstract to note that results are compared against prior voxel stylization and mesh-to-voxel methods, that the test set comprises 20 diverse character meshes, that evaluation uses six orthographic views per mesh, and that ablations isolating each of the three components appear in Section 4.2. The full protocol (view selection, palette sizes, and user-study details) is already described in Section 4.1; we now reference it explicitly from the abstract. revision: yes

  2. Referee: [Abstract] Method description (abstract): the claim that orthographic + patch-CLIP supervision plus Gumbel-Softmax produces 3D-consistent semantics after discretization lacks an explicit cross-view consistency term. Nothing in the pipeline is shown to penalize a voxel assignment that is coherent only on the training orthographic axes but collapses or recolors under rotation, which directly undermines the semantic-preservation guarantee.

    Authors: The shared 3D voxel grid is optimized jointly from multiple orthographic axes; because voxel colors and occupancies are viewpoint-independent, any assignment that is coherent on the supervised axes remains coherent under arbitrary rotation by construction. Volumetric rendering further enforces this property. We have added a short paragraph in the method section and a clarifying sentence in the abstract that makes this implicit consistency explicit. We also include new supplementary visualizations rendered from random viewpoints to demonstrate preservation of semantics and palette under rotation. An additional explicit consistency loss is not required for the current claims but could be explored in future work. revision: partial
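
The rebuttal's viewpoint-independence argument can be illustrated with a toy check: since colors and occupancies live on one shared grid, any view only re-samples that grid's palette. A sketch, not the paper's renderer:

```python
# Toy check: a shared voxel grid means every orthographic view draws only
# colors from the grid's palette, regardless of viewing axis (or rotation).
import numpy as np

rng = np.random.default_rng(0)
N, K = 8, 4
palette = rng.random((K, 3))
labels = rng.integers(0, K, size=(N, N, N))    # per-voxel palette index
occupied = rng.random((N, N, N)) > 0.5         # binary occupancy

def first_hit_render(axis):
    """Nearest occupied voxel along one orthographic axis."""
    hit = occupied.argmax(axis=axis)           # index of first occupied voxel
    idx = np.take_along_axis(labels, np.expand_dims(hit, axis), axis).squeeze(axis)
    return palette[idx]                        # (N, N, 3) image

for axis in (0, 1, 2):
    used = np.unique(first_hit_render(axis).reshape(-1, 3), axis=0)
    assert all(any(np.allclose(c, p) for p in palette) for c in used)
print("every view draws only palette colors")
```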

Circularity Check

0 steps flagged

No circularity: independent optimization pipeline with no self-referential derivations

full rationale

The paper presents Voxify3D as a two-stage differentiable framework that combines orthographic pixel art supervision, patch-based CLIP alignment, and palette-constrained Gumbel-Softmax quantization. No equations, loss formulations, or uniqueness theorems are shown that reduce claimed outputs to inputs by construction. Reported metrics (CLIP-IQA, user preference) are experimental results rather than predictions forced by fitted parameters or self-citations. The method description remains self-contained against external benchmarks with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that the three listed components can be combined differentiably without loss of the target aesthetic; no new physical entities or ad-hoc constants are introduced beyond standard optimization hyperparameters.

free parameters (2)
  • palette size
    Controllable parameter (2-8 colors) chosen per run to match desired abstraction level.
  • voxel resolution
    Controllable parameter (20x-50x) set by user to control discretization granularity.
axioms (2)
  • standard math Gumbel-Softmax provides a differentiable approximation to discrete sampling
    Invoked to enable gradient flow through color quantization.
  • domain assumption CLIP features on patches capture semantic content at multiple discretization levels
    Used to justify semantic preservation under voxelization.
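
For reference, the standard Gumbel-Softmax relaxation invoked by the first axiom (Jang et al. [39]; Maddison et al. [64]); this is the standard notation, not necessarily the paper's:

```latex
y_i = \frac{\exp\!\big((\log \pi_i + g_i)/\tau\big)}
           {\sum_{j=1}^{K} \exp\!\big((\log \pi_j + g_j)/\tau\big)},
\qquad g_i = -\log(-\log u_i), \quad u_i \sim \mathrm{Uniform}(0,1).
```

As the temperature τ approaches 0, samples approach one-hot vectors over the K palette entries, recovering discrete color selection while the relaxation keeps gradients well-defined.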

pith-pipeline@v0.9.0 · 5526 in / 1411 out tokens · 27430 ms · 2026-05-16T23:57:50.689161+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

114 extracted references · 114 canonical work pages · 6 internal anchors

  1. [1]

    Making data matter: Voxel printing for the digital fabrication of data across scales and domains. Science Advances, 4(5):eaas8652, 2018

    Christoph Bader, Dominik Kolb, James C Weaver, Sunanda Sharma, Ahmed Hosny, João Costa, and Neri Oxman. Making data matter: Voxel printing for the digital fabrication of data across scales and domains. Science Advances, 4(5):eaas8652, 2018.

  2. [2]

    Edify 3d: Scalable high-quality 3d asset generation. arXiv preprint arXiv:2411.07135, 2024

    Maciej Bala, Yin Cui, Yifan Ding, Yunhao Ge, Zekun Hao, Jon Hasselgren, Jacob Huffman, Jingyi Jin, JP Lewis, Zhaoshuo Li, et al. Edify 3d: Scalable high-quality 3d asset generation. arXiv preprint arXiv:2411.07135, 2024.

  3. [3]

    SD-πXL: Generating low-resolution quantized imagery via score distillation

    Alexandre Binninger and Olga Sorkine-Hornung. SD-πXL: Generating low-resolution quantized imagery via score distillation. In SIGGRAPH Asia Conference Papers, pages 1–12, 2024.

  4. [4]

    Proxylessnas: Direct neural architecture search on target task and hardware

    Han Cai, Ligeng Zhu, and Song Han. Proxylessnas: Direct neural architecture search on target task and hardware. In International Conference on Learning Representations (ICLR), 2019.

  5. [5]

    Denoising likelihood score matching for conditional score-based data generation. arXiv preprint arXiv:2203.14206, 2022

    Chen-Hao Chao, Wei-Fang Sun, Bo-Wun Cheng, Yi-Chen Lo, Chia-Che Chang, Yu-Lun Liu, Yu-Lin Chang, Chia-Ping Chen, and Chun-Yi Lee. Denoising likelihood score matching for conditional score-based data generation. arXiv preprint arXiv:2203.14206, 2022.

  6. [6]

    Tensorf: Tensorial radiance fields

    Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. Tensorf: Tensorial radiance fields. In European Conference on Computer Vision, pages 333–350. Springer, 2022.

  7. [7]

    Improving robustness for joint optimization of camera pose and decomposed low-rank tensorial radiance fields

    Bo-Yu Chen, Wei-Chen Chiu, and Yu-Lun Liu. Improving robustness for joint optimization of camera pose and decomposed low-rank tensorial radiance fields. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 990–1000, 2024.

  8. [8]

    Clip-driven open-vocabulary 3d scene graph generation via cross-modality contrastive learning

    Lianggangxu Chen, Xuejiao Wang, Jiale Lu, Shaohui Lin, Changbo Wang, and Gaoqi He. Clip-driven open-vocabulary 3d scene graph generation via cross-modality contrastive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27863–27873, 2024.

  9. [9]

    Q-dit: Accurate post-training quantization for diffusion transformers

    Lei Chen, Yuan Meng, Chen Tang, Xinzhu Ma, Jingyan Jiang, Xin Wang, Zhi Wang, and Wenwu Zhu. Q-dit: Accurate post-training quantization for diffusion transformers. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 28306–28315, 2025.

  10. [10]

    Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation

    Rui Chen, Yongwei Chen, Ningxin Jiao, and Kui Jia. Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22246–22256, 2023.

  11. [11]

    Ortho-nerf: generating a true digital orthophoto map using the neural radiance field from unmanned aerial vehicle images. Geo-spatial Information Science, 28(2):741–760, 2025

    Shihan Chen, Qingsong Yan, Yingjie Qu, Wang Gao, Junxing Yang, and Fei Deng. Ortho-nerf: generating a true digital orthophoto map using the neural radiance field from unmanned aerial vehicle images. Geo-spatial Information Science, 28(2):741–760, 2025.

  12. [12]

    Voxelnext: Fully sparse voxelnet for 3d object detection and tracking

    Yukang Chen, Jianhui Liu, Xiangyu Zhang, Xiaojuan Qi, and Jiaya Jia. Voxelnext: Fully sparse voxelnet for 3d object detection and tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21674–21683, 2023.

  13. [13]

    Stylecity: Large-scale 3d urban scenes stylization

    Yingshu Chen, Huajian Huang, Tuan-Anh Vu, Ka Chun Shum, and Sai-Kit Yeung. Stylecity: Large-scale 3d urban scenes stylization. In European Conference on Computer Vision, pages 395–413. Springer, 2024.

  14. [14]

    Content-adaptive image downscaling

    Sungjoon Choi and Munchurl Kim. Content-adaptive image downscaling. In ICCV, pages 261–269, 2015.

  15. [15]

    3dstyleglip: Part-tailored text-guided 3d neural stylization

    SeungJeh Chung, JooHyun Park, and HyeongYeop Kang. 3dstyleglip: Part-tailored text-guided 3d neural stylization. arXiv preprint arXiv:2404.02634, 2024.

  16. [16]

    Regularization of voxel art

    David Coeurjolly, Pierre Gueth, and Jacques-Olivier Lachaud. Regularization of voxel art. In ACM SIGGRAPH 2018 Talks, pages 1–2, 2018.

  17. [17]

    Generating pixel art character sprites using gans. arXiv preprint arXiv:2208.06413, 2022

    Flávio Coutinho and Luiz Chaimowicz. Generating pixel art character sprites using gans. arXiv preprint arXiv:2208.06413, 2022. Submitted to SBGames 2022.

  18. [18]

    3d paintbrush: Local stylization of 3d shapes with cascaded score distillation

    Dale Decatur, Itai Lang, Kfir Aberman, and Rana Hanocka. 3d paintbrush: Local stylization of 3d shapes with cascaded score distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4473–4483, 2024.

  19. [19]

    Taming transformers for high-resolution image synthesis

    Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12873–12883, 2021.

  20. [20]

    Clipdraw: Exploring text-to-drawing synthesis through language-image encoders. Advances in Neural Information Processing Systems, 35:5207–5218, 2022

    Kevin Frans, Lisa Soros, and Olaf Witkowski. Clipdraw: Exploring text-to-drawing synthesis through language-image encoders. Advances in Neural Information Processing Systems, 35:5207–5218, 2022.

  21. [21]

    Plenoxels: Radiance fields without neural networks

    Sara Fridovich-Keil, Alex Yu, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5501–5510, 2022.

  22. [22]

    Style-nerf2nerf: 3d style transfer from style-aligned multi-view images

    Haruo Fujiwara, Yusuke Mukuta, and Tatsuya Harada. Style-nerf2nerf: 3d style transfer from style-aligned multi-view images. In SIGGRAPH Asia 2024 Conference Papers, pages 1–10, 2024.

  23. [23]

    Fastnerf: High-fidelity neural rendering at 200fps

    Stephan J Garbin, Marek Kowalski, Matthew Johnson, Jamie Shotton, and Julien Valentin. Fastnerf: High-fidelity neural rendering at 200fps. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 14346–14355, 2021.

  24. [24]

    Learn to create simple lego micro buildings. ACM Transactions on Graphics (TOG), 43(6):1–13, 2024

    Jiahao Ge, Mingjun Zhou, and Chi-wing Fu. Learn to create simple lego micro buildings. ACM Transactions on Graphics (TOG), 43(6):1–13, 2024.

  25. [25]

    Pixelated image abstraction with integrated user constraints. Computers & Graphics, 37(5):333–347, 2013

    Timothy Gerstner, Doug DeCarlo, Marc Alexa, Adam Finkelstein, Yotam Gingold, and Andrew Nealen. Pixelated image abstraction with integrated user constraints. Computers & Graphics, 37(5):333–347, 2013.

  26. [26]

    Controllable neural style transfer for dynamic meshes

    Guilherme Gomes Haetinger, Jingwei Tang, Raphael Ortiz, Paul Kanyuk, and Vinicius Azevedo. Controllable neural style transfer for dynamic meshes. In ACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024.

  27. [27]

    Gemini models: Product overview

    Google DeepMind. Gemini models: Product overview. https://deepmind.google/technologies/gemini/, 2025. Accessed: November 21, 2025.

  28. [28]

    Deep unsupervised pixelization. ACM Transactions on Graphics (SIGGRAPH Asia 2018 issue), 37(6):243:1–243:11, 2018

    Chu Han, Qiang Wen, Shengfeng He, Qianshu Zhu, Yinjie Tan, Guoqiang Han, and Tien-Tsin Wong. Deep unsupervised pixelization. ACM Transactions on Graphics (SIGGRAPH Asia 2018 issue), 37(6):243:1–243:11, 2018.

  29. [29]

    Deep unsupervised pixelization. ACM Transactions on Graphics (TOG), 37(6):1–11, 2018

    Chu Han, Qiang Wen, Shengfeng He, Qianshu Zhu, Yinjie Tan, Guoqiang Han, and Tien-Tsin Wong. Deep unsupervised pixelization. ACM Transactions on Graphics (TOG), 37(6):1–11, 2018.

  30. [30]

    Instruct-nerf2nerf: Editing 3d scenes with instructions

    Ayaan Haque, Matthew Tancik, Alexei A. Efros, Aleksander Holynski, and Angjoo Kanazawa. Instruct-nerf2nerf: Editing 3d scenes with instructions. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 8874–8884, 2023.

  31. [31]

    LRM: Large Reconstruction Model for Single Image to 3D

    Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. Lrm: Large reconstruction model for single image to 3d. arXiv preprint arXiv:2311.04400, 2023.

  32. [32]

    Plankassembly: Robust 3d reconstruction from three orthographic views with learnt shape programs

    Wentao Hu, Jia Zheng, Zixin Zhang, Xiaojun Yuan, Jian Yin, and Zihan Zhou. Plankassembly: Robust 3d reconstruction from three orthographic views with learnt shape programs. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18495–18505, 2023.

  33. [33]

    Quantart: Quantizing image style transfer towards high visual fidelity

    Siyu Huang, Jie An, Donglai Wei, Jiebo Luo, and Hanspeter Pfister. Quantart: Quantizing image style transfer towards high visual fidelity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5947–5956, 2023.

  34. [34]

    Part-aware shape generation with latent 3d diffusion of neural voxel fields. IEEE Transactions on Visualization and Computer Graphics, 2025

    Yuhang Huang, Shilong Zou, Xinwang Liu, and Kai Xu. Part-aware shape generation with latent 3d diffusion of neural voxel fields. IEEE Transactions on Visualization and Computer Graphics, 2025.

  35. [35]

    Spar3d: Stable point-aware reconstruction of 3d objects from single images. arXiv preprint arXiv:2501.04689, 2025

    Zixuan Huang, Mark Boss, Aaryaman Vasishta, James M Rehg, and Varun Jampani. Spar3d: Stable point-aware reconstruction of 3d objects from single images. arXiv preprint arXiv:2501.04689, 2025.

  36. [36]

    Pixel art adaptation for handicraft fabrication. Computer Graphics Forum (Pacific Graphics 2022), 41(7):489–494, 2022

    Yuki Igarashi and Takeo Igarashi. Pixel art adaptation for handicraft fabrication. Computer Graphics Forum (Pacific Graphics 2022), 41(7):489–494, 2022.

  37. [37]

    Image-to-image translation with conditional adversarial networks

    Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In CVPR, pages 1125–1134, 2017.

  38. [38]

    Vectorfusion: Text-to-svg by abstracting pixel-based diffusion models. arXiv preprint arXiv:2211.13845, 2022

    Ajay Jain, Amber Xie, and Pieter Abbeel. Vectorfusion: Text-to-svg by abstracting pixel-based diffusion models. arXiv preprint arXiv:2211.13845, 2022.

  39. [39]

    Categorical reparameterization with gumbel-softmax

    Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. In Proceedings of the 5th International Conference on Learning Representations (ICLR), 2017.

  40. [40]

    Perceptual Losses for Real-Time Style Transfer and Super-Resolution

    Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. arXiv preprint arXiv:1603.08155, 2016.

  41. [41]

    Minecraftify: Minecraft style image generation with text-guided image editing for in-game application. arXiv preprint arXiv:2402.05448, 2024

    Bumsoo Kim, Sanghyun Byun, Yonghoon Jung, Wonseop Shin, Sareer UI Amin, and Sanghyun Seo. Minecraftify: Minecraft style image generation with text-guided image editing for in-game application. arXiv preprint arXiv:2402.05448, 2024.

  42. [42]

    Diffusionclip: Text-guided image manipulation using diffusion models. arXiv preprint arXiv:2110.02711, 2022

    Jeong Joon Kim, Youngjung Hwang, Jaesung Park, Jaejun Choi, and Taesup Kim. Diffusionclip: Text-guided image manipulation using diffusion models. arXiv preprint arXiv:2110.02711, 2022.

  43. [43]

    Palettenerf: Palette-based appearance editing of neural radiance fields

    Zhengfei Kuang, Fujun Luan, Sai Bi, Zhixin Shu, Gordon Wetzstein, and Kalyan Sunkavalli. Palettenerf: Palette-based appearance editing of neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20691–20700, 2023.

  44. [44]

    Ice-nerf: Interactive color editing of nerfs via decomposition-aware weight optimization

    Jae-Hyeok Lee and Dae-Shik Kim. Ice-nerf: Interactive color editing of nerfs via decomposition-aware weight optimization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3491–3501, 2023.

  45. [45]

    Voxelizing google earth: A pipeline for new virtual worlds

    Ryan Hardesty Lewis. Voxelizing google earth: A pipeline for new virtual worlds. In ACM SIGGRAPH 2024 Labs, pages 1–2, 2024.

  46. [46]

    Compressing volumetric radiance fields to 1 mb

    Lingzhi Li, Zhen Shen, Zhongshu Wang, Li Shen, and Liefeng Bo. Compressing volumetric radiance fields to 1 mb. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4222–4231, 2023.

  47. [47]

    Diffusion-sdf: Text-to-shape via voxelized diffusion

    Muheng Li, Yueqi Duan, Jie Zhou, and Jiwen Lu. Diffusion-sdf: Text-to-shape via voxelized diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12642–12651, 2023.

  48. [48]

    Genrc: Generative 3d room completion from sparse image collections

    Ming-Feng Li, Yueh-Feng Ku, Hong-Xuan Yen, Chi Liu, Yu-Lun Liu, Albert YC Chen, Cheng-Hao Kuo, and Min Sun. Genrc: Generative 3d room completion from sparse image collections. In European Conference on Computer Vision, pages 146–163. Springer, 2024.

  49. [49]

    Blended latent diffusion: Text-driven editing of natural images. arXiv preprint arXiv:2301.11093, 2023

    Yuxuan Li, Robin Rombach, Yiqin Zhang, Xiaohang Zhan, Wenqiang Xu, Patrick Esser, and Björn Ommer. Blended latent diffusion: Text-driven editing of natural images. arXiv preprint arXiv:2301.11093, 2023.

  50. [50]

    Voxformer: Sparse voxel transformer for camera-based 3d semantic scene completion

    Yiming Li, Zhiding Yu, Christopher Choy, Chaowei Xiao, Jose M Alvarez, Sanja Fidler, Chen Feng, and Anima Anandkumar. Voxformer: Sparse voxel transformer for camera-based 3d semantic scene completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9087–9098, 2023.

  51. [51]

    Svdtree: Semantic voxel diffusion for single image tree reconstruction

    Yuan Li, Zhihao Liu, Bedrich Benes, Xiaopeng Zhang, and Jianwei Guo. Svdtree: Semantic voxel diffusion for single image tree reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4692–4702, 2024.

  52. [52]

    Magic3d: High-resolution text-to-3d content creation

    Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3d: High-resolution text-to-3d content creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 300–309, 2023.

  53. [53]

    Frugalnerf: Fast convergence for extreme few-shot novel view synthesis without learned priors

    Chin-Yang Lin, Chung-Ho Wu, Chang-Han Yeh, Shih-Han Yen, Cheng Sun, and Yu-Lun Liu. Frugalnerf: Fast convergence for extreme few-shot novel view synthesis without learned priors. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 11227–11238, 2025.

  54. [54]

    Darts: Differentiable architecture search

    Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search. In International Conference on Learning Representations (ICLR), 2019.

  55. [55]

    Stylerf: Zero-shot 3d style transfer of neural radiance fields

    Kunhao Liu, Fangneng Zhan, Yiwen Chen, Jiahui Zhang, Yingchen Yu, Abdulmotaleb El Saddik, Shijian Lu, and Eric P Xing. Stylerf: Zero-shot 3d style transfer of neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8338–8348, 2023.

  56. [56]

    Editing neural radiance fields by scene operations

    Lingjie Liu et al. Editing neural radiance fields by scene operations. In CVPR, 2022.

  57. [57]

    One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimization. Advances in Neural Information Processing Systems, 36, 2024

    Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Mukund Varma T, Zexiang Xu, and Hao Su. One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimization. Advances in Neural Information Processing Systems, 36, 2024.

  58. [58]

    Wir3d: Visually-informed and geometry-aware 3d shape abstraction. arXiv preprint arXiv:2505.04813, 2025

    Richard Liu, Daniel Fu, Noah Tan, Itai Lang, and Rana Hanocka. Wir3d: Visually-informed and geometry-aware 3d shape abstraction. arXiv preprint arXiv:2505.04813, 2025.

  59. [59]

    Content-aware radiance fields: Aligning model complexity with scene intricacy through learned bitwidth quantization

    Weihang Liu, Xue Xian Zheng, Jingyi Yu, and Xin Lou. Content-aware radiance fields: Aligning model complexity with scene intricacy through learned bitwidth quantization. In European Conference on Computer Vision, pages 239–

  60. [60]

    Robust dynamic radiance fields

    Yu-Lun Liu, Chen Gao, Andreas Meuleman, Hung-Yu Tseng, Ayush Saraf, Changil Kim, Yung-Yu Chuang, Johannes Kopf, and Jia-Bin Huang. Robust dynamic radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13–23, 2023.

  61. [61]

    Material palette: Extraction of materials from a single image

    Ivan Lopes, Fabio Pizzati, and Raoul de Charette. Material palette: Extraction of materials from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4379–4388, 2024.

  62. [62]

    Simulating water and smoke with an octree data structure. ACM Transactions on Graphics (TOG), 23(3):457–462, 2004

    Frank Losasso, Frédéric Gibou, and Ronald Fedkiw. Simulating water and smoke with an octree data structure. ACM Transactions on Graphics (TOG), 23(3):457–462, 2004.

  63. [63]

    Differentiable voxelization and mesh morphing. arXiv preprint arXiv:2407.11272, 2024

    Yihao Luo, Yikai Wang, Zhengrui Xiang, Yuliang Xiu, Guang Yang, and ChoonHwai Yap. Differentiable voxelization and mesh morphing. arXiv preprint arXiv:2407.11272, 2024.

  64. [64]

    The concrete distribution: A continuous relaxation of discrete random variables

    Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. In Proceedings of the 5th International Conference on Learning Representations (ICLR), 2017.

  65. [65]

    Pulse: Self-supervised photo upsampling via latent space exploration of generative models

    Sachin Menon, Alex Damian, Shijia Hu, Namkug Ravi, and Cynthia Rudin. Pulse: Self-supervised photo upsampling via latent space exploration of generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2437–2445, 2020.

  66. [66]

    Progressively optimized local radiance fields for robust view synthesis

    Andreas Meuleman, Yu-Lun Liu, Chen Gao, Jia-Bin Huang, Changil Kim, Min H Kim, and Johannes Kopf. Progressively optimized local radiance fields for robust view synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16539–16548, 2023.

  67. [67]

    Text2mesh: Text-driven neural stylization for meshes

    Oscar Michel, Roi Bar-On, Richard Liu, Sagie Benaim, and Rana Hanocka. Text2mesh: Text-driven neural stylization for meshes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13492–13502, 2022.

  68. [68]

    Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.

  69. [69]

    Dit-3d: Exploring plain diffusion transformers for 3d shape generation. Advances in Neural Information Processing Systems, 36:67960–67971, 2023

    Shentong Mo, Enze Xie, Ruihang Chu, Lanqing Hong, Matthias Niessner, and Zhenguo Li. Dit-3d: Exploring plain diffusion transformers for 3d shape generation. Advances in Neural Information Processing Systems, 36:67960–67971, 2023.

  70. [70]

    Clipcap: Clip prefix for image captioning

    Ron Mokady, Amir Hertz, and Amit H Bermano. Clipcap: Clip prefix for image captioning. In European Conference on Computer Vision (ECCV), pages 531–547. Springer, 2022.

  71. [71]

    Instant neural graphics primitives with a multiresolution hash encoding

    Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. In ACM Transactions on Graphics (TOG), pages 102:1–102:15. ACM, 2022.

  72. [72]

    Vdb: High-resolution sparse volumes with dynamic topology. ACM Transactions on Graphics (TOG), 32(3):1–22, 2013

    Ken Museth. Vdb: High-resolution sparse volumes with dynamic topology. ACM Transactions on Graphics (TOG), 32(3):1–22, 2013.

  73. [73]

    Styleclip: Text-driven manipulation of stylegan imagery. arXiv preprint arXiv:2103.17249, 2021

    Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. Styleclip: Text-driven manipulation of stylegan imagery. arXiv preprint arXiv:2103.17249, 2021.

  74. [74]

    Geoscaler: Geometry and rendering-aware downsampling of 3d mesh textures

    Sai Karthikey Pentapati, Anshul Rai, Arkady Ten, Chaitanya Atluru, and Alan Bovik. Geoscaler: Geometry and rendering-aware downsampling of 3d mesh textures. In 2025 IEEE International Conference on Image Processing (ICIP), pages 1007–1012. IEEE, 2025.

  75. [75]

    DreamFusion: Text-to-3D using 2D Diffusion

    Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022.

  76. [76]

    Generating physically stable and buildable lego designs from text. arXiv preprint arXiv:2505.05469, 2025

    Ava Pun, Kangle Deng, Ruixuan Liu, Deva Ramanan, Changliu Liu, and Jun-Yan Zhu. Generating physically stable and buildable lego designs from text. arXiv preprint arXiv:2505.05469, 2025.

  77. [77]

    Modular procedural generation for voxel maps

    Adarsh Pyarelal, Aditya Banerjee, and Kobus Barnard. Modular procedural generation for voxel maps. In AAAI Fall Symposium, pages 85–101. Springer, 2021.

  78. [78]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

  79. [79]

    Kilonerf: Scalable neural radiance fields with thousands of tiny mlps

    Claudius Reiser, Gernot Riegler, Anton S Kaplanyan, and Marc Pollefeys. Kilonerf: Scalable neural radiance fields with thousands of tiny mlps. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 14325–14334, 2021.

  80. [80]

    Xcube: Large-scale 3d generative modeling using sparse voxel hierarchies

    Xuanchi Ren, Jiahui Huang, Xiaohui Zeng, Ken Museth, Sanja Fidler, and Francis Williams. Xcube: Large-scale 3d generative modeling using sparse voxel hierarchies. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4209–4219, 2024.

Showing first 80 references.