pith. machine review for the scientific record.

arxiv: 2512.07834 · v2 · submitted 2025-12-08 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Voxify3D: Pixel Art Meets Volumetric Rendering

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 23:57 UTC · model grok-4.3

classification 💻 cs.CV
keywords voxel art generation · 3D mesh stylization · differentiable rendering · CLIP semantic alignment · Gumbel-Softmax quantization · pixel art aesthetics · volumetric rendering · discrete optimization

The pith

Voxify3D converts 3D meshes into voxel art by aligning straight-on pixel renders with CLIP patches and using differentiable palette quantization to keep semantics intact under discretization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a two-stage differentiable framework that turns smooth 3D meshes into blocky voxel art while preserving what the object represents and restricting output to small color sets. It relies on rendering the model from straight-on orthographic views to match exact pixel positions, matching small image patches to text descriptions to hold onto meaning across detail loss, and a soft quantization step that lets the system pick discrete colors from a chosen palette during training. This combination tackles the core tension in voxel stylization where geometry must simplify drastically yet still read as the original shape with coherent limited colors. A sympathetic reader would care because it provides an automated route to game-style assets that follow strict pixel-art rules without manual cleanup or loss of recognizable identity.
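
To see why the straight-on views matter: with an axis-aligned orthographic camera and one pixel per voxel, every pixel composites exactly one voxel column, so supervision gradients land on unambiguous voxels. A minimal NumPy sketch of that compositing, an illustration rather than the paper's renderer:

```python
# Minimal sketch: axis-aligned orthographic compositing gives exact
# pixel-voxel alignment, since parallel rays keep pixel (i, j) on the
# single voxel column (i, j, :). Illustrative only, not the paper's code.
import numpy as np

def orthographic_composite(density, color):
    """density: (N, N, N) occupancies; color: (N, N, N, 3) RGB.
    Returns an (N, N, 3) image; pixel (i, j) depends only on column (i, j, :)."""
    alpha = 1.0 - np.exp(-density)                       # per-voxel opacity
    trans = np.cumprod(1.0 - alpha + 1e-10, axis=2)      # cumulative transmittance
    trans = np.concatenate([np.ones_like(trans[..., :1]), trans[..., :-1]], axis=2)
    weights = alpha * trans                              # front-to-back weights
    return (weights[..., None] * color).sum(axis=2)

rng = np.random.default_rng(0)
N = 16
img = orthographic_composite(rng.random((N, N, N)), rng.random((N, N, N, 3)))
print(img.shape)  # (16, 16, 3): one pixel per voxel column, no perspective bleed
```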

Core claim

Voxify3D bridges 3D mesh optimization with 2D pixel art supervision by integrating orthographic pixel art supervision that removes perspective distortion for precise voxel-pixel alignment, patch-based CLIP alignment that preserves semantics across discretization levels, and palette-constrained Gumbel-Softmax quantization that enables differentiable optimization over discrete color spaces with controllable palette strategies. This integration directly addresses semantic preservation under extreme discretization and pixel-art aesthetics through volumetric rendering while supporting end-to-end discrete optimization.

What carries the argument

The argument is carried by the synergistic integration of orthographic pixel art supervision, patch-based CLIP alignment, and palette-constrained Gumbel-Softmax quantization, which together allow gradient-based optimization of voxel grids from meshes while enforcing pixel-level aesthetics and semantic consistency.
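
A minimal sketch of the quantization piece, assuming a fixed palette tensor and per-voxel logits over palette entries; the authors' actual losses, temperature schedule, and palette handling may differ:

```python
# Hedged sketch of palette-constrained Gumbel-Softmax quantization, not the
# authors' implementation. Each voxel holds logits over a K-color palette;
# sampling stays differentiable (soft), or discrete-forward with hard=True.
import torch
import torch.nn.functional as F

K = 4                                          # palette size (e.g. 2-8 colors)
palette = torch.rand(K, 3)                     # assumed fixed palette, e.g. from k-means
logits = torch.randn(16, 16, 16, K, requires_grad=True)  # per-voxel color logits

def quantize(logits, palette, tau=1.0, hard=False):
    # Straight-through Gumbel-Softmax: discrete palette pick in the forward
    # pass when hard=True, soft mixture gradients in the backward pass.
    probs = F.gumbel_softmax(logits, tau=tau, hard=hard, dim=-1)  # (..., K)
    return probs @ palette                     # (..., 3) voxel colors

colors = quantize(logits, palette, tau=0.5, hard=True)
loss = colors.mean()                           # stand-in for render/CLIP losses
loss.backward()                                # gradients flow to the logits
print(colors.shape, logits.grad is not None)
```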

If this is right

  • Voxel outputs can be generated controllably at abstraction levels from 2-8 colors and 20x-50x resolutions while remaining recognizable.
  • End-to-end training replaces separate post-processing steps for color quantization and geometry abstraction in voxel pipelines.
  • The framework produces higher semantic alignment scores and stronger user preference than prior mesh-to-voxel methods across characters.
  • Palette strategies become tunable parameters that directly influence the final discrete color coherence without retraining (a minimal extraction sketch follows below).
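
For the extraction step itself, Figure 3 notes the palette can come, e.g., from k-means over the orthographic supervision views. A hedged sketch using scikit-learn's KMeans with hypothetical stand-in views (the authors also list Max-Min, Median Cut, and Simulated Annealing strategies):

```python
# Minimal palette-extraction sketch, assuming (as Figure 3 suggests) the
# palette comes from k-means over pixels of the supervision views.
import numpy as np
from sklearn.cluster import KMeans

def extract_palette(views, k=4, seed=0):
    """views: list of (H, W, 3) float images in [0, 1]; returns a (k, 3) palette."""
    pixels = np.concatenate([v.reshape(-1, 3) for v in views], axis=0)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(pixels)
    return km.cluster_centers_                 # k representative colors

rng = np.random.default_rng(0)
views = [rng.random((64, 64, 3)) for _ in range(6)]  # stand-ins for 6 ortho views
print(extract_palette(views, k=4))
```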

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same supervision stack could be tested on generating voxel versions of static environments or props to check if semantic preservation generalizes beyond characters.
  • Combining this discretization approach with rigged 3D models might enable direct creation of animated voxel characters for games.
  • The reliance on patch CLIP suggests the method could adapt to other extreme stylizations such as low-poly or sprite-based outputs from 3D sources.

Load-bearing premise

Straight-on orthographic renders plus patch-level CLIP matches will keep the original meaning and desired pixel-art look intact once the mesh is locked into a coarse grid and tiny color palette.
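
A hedged sketch of what patch-level CLIP matching could look like, using OpenAI's clip package; the patch size, prompt, and omission of CLIP's input normalization are assumptions for brevity, not the paper's protocol:

```python
# Hedged sketch of patch-based CLIP alignment, not the paper's exact loss.
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cpu"  # fp32; on CUDA the model loads as fp16 and inputs must match
model, _ = clip.load("ViT-B/32", device=device)
text = clip.tokenize(["a voxel art knight"]).to(device)  # hypothetical prompt

def patch_clip_loss(render, text_tokens, patch=112):
    """render: (3, H, W) in [0, 1]; returns 1 - mean patch/text cosine similarity.
    CLIP's channel normalization is omitted here to keep the sketch short."""
    p = render.unfold(1, patch, patch).unfold(2, patch, patch)  # tile into patches
    p = p.permute(1, 2, 0, 3, 4).reshape(-1, 3, patch, patch)
    p = F.interpolate(p, size=224, mode="bilinear", align_corners=False)
    img_feat = model.encode_image(p)            # one embedding per patch
    txt_feat = model.encode_text(text_tokens)   # shared text embedding
    sim = F.cosine_similarity(img_feat, txt_feat, dim=-1)
    return 1.0 - sim.mean()                     # low loss = patches match the text

render = torch.rand(3, 224, 224)                # stand-in orthographic render
print(patch_clip_loss(render, text).item())
```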

What would settle it

Run the method on standard test meshes such as a human character or simple object at 20x resolution with a 4-color palette and measure whether human viewers can still correctly identify the source object from the voxel output at rates above chance.
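
Concretely, "above chance" reduces to a one-sided binomial test on identification counts; all numbers below are hypothetical, not from the paper:

```python
# One-sided binomial test for "viewers identify the source above chance".
from scipy.stats import binomtest

n_viewers = 50      # hypothetical participants
n_correct = 22      # hypothetical correct identifications
chance = 0.1        # e.g. picking the source from 10 candidate objects

result = binomtest(n_correct, n_viewers, p=chance, alternative="greater")
print(f"p-value = {result.pvalue:.2e}")  # small p => identification above chance
```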

Figures

Figures reproduced from arXiv: 2512.07834 by Hao-Jen Chien, Jiewen Chan, Yi-Chuan Huang, Yu-Lun Liu.

Figure 1
Figure 1: Stylized voxel art with controllable abstraction. Voxify3D converts 3D meshes into stylized voxel art using discrete color palettes, pixel art supervision, and voxel-based radiance fields. This teaser showcases the flexibility and quality of our method. (a) Diverse voxel art outputs across object types and use cases. (b) Comparison of different palette selection methods. (c) Control over the resolution of … view at source ↗
Figure 2
Figure 2: Existing methods often miss key features in voxelization. While IN2N [30], Vox-E [83], and Blender (Geometry Nodes) generate outputs that are coarse, blurry, or semantically inconsistent, they frequently lose critical elements such as facial features. In contrast, our method preserves structural details and produces visually appealing voxel art with sharp abstraction. view at source ↗
Figure 3
Figure 3: Our two-stage voxel art generation pipeline. (a) Coarse voxel grid training: Given a 3D mesh, we render multi-view images and optimize a voxel-based radiance field (DVGO [87]) using MSE loss to learn coarse RGB and density. (b) Orthographic pixel art fine-tuning: We refine the voxel grid using six orthographic pixel art views, which also serve to extract a discrete color palette (e.g., via k-means). Optimi… view at source ↗
Figure 4
Figure 4: Perspective vs. Orthographic. (Left) Six-view pixel art pipeline. (Right) Perspective views (red) misalign pixels, while six orthographic views (green) enable precise pixel-voxel alignment. The six-view setup compactly covers the major surfaces of the object, while orthographic rendering formulates parallel ray castin… view at source ↗
Figure 5
Figure 5: Qualitative comparisons on character models from the Rodin [97] dataset. We compare our voxel art results with a pixel-art-to-3D extension, IN2N [30], Vox-E [83], and Blender's voxelization. Our method produces stylized yet consistent voxel representations with pixel art aesthetics. view at source ↗
Figure 6
Figure 6: Effect of Palette Selection and Color Count. Each row corresponds to a different palette extraction method: K-means, Max-Min, Median Cut, and Simulated Annealing. Each column shows increasing color counts (2, 3, 4, 8). Each method produces unique color clustering effects. view at source ↗
Figure 7
Figure 7: Ablation study on model components. We show outputs after removing key modules: pixel art supervision, orthographic projection, grid initialization, depth loss, CLIP loss, and Gumbel-Softmax. Each row shows a different input; columns compare ablations. The full model yields coherent stylization, while removals cause distortions, color artifacts, or semantic loss. view at source ↗
Figure 8
Figure 8: Fabrication: LEGO render. Rendered using KeyShot 2023. Our method extends to LEGO applications, where achieving rich visual results within the limited color palette is crucial for practical fabrication. view at source ↗
Figure 9
Figure 9: Ablation user study of Gumbel. Four representative examples comparing results with and without Gumbel-Softmax. Without Gumbel-Softmax, voxel colors become blurred and features less distinct. view at source ↗
Figure 10
Figure 10: Downsample vs. Pixel Art. Stage 1 (Coarse voxelization): the voxel grid is optimized with MSE reconstruction loss, regularized by density and background terms: L_total = L_render + λ_d L_density + λ_b L_bg, where L_render is MSE between rendered and target colors, L_density applies density regularization and total variation smoothing, and L_bg uses entropy to suppress background noise. This stage provides a stable in… view at source ↗
Figure 11
Figure 11: Greyscale examples. Grayscale voxel renderings. view at source ↗
Figure 12
Figure 12: User study UI. Voxel art appeal: "Which version looks most visually appealing as a voxel art character, like something you might see in Minecraft or a stylized game?" For the grayscale examples, participants answered on geometry preservation: "Which grayscale voxel rendering more closely resembles the original 3D mesh in terms of overall geometry?" Expert study on color preference. We further conducted… view at source ↗
Figure 13
Figure 13: Additional qualitative comparisons with baselines. Eight representative examples compared with Pixel, IN2N, Vox-E, and Blender Geometry Nodes. view at source ↗
Figure 14
Figure 14: Results with varying palette settings. Examples using different palette extraction strategies and palette sizes. view at source ↗
Figure 15
Figure 15: Results under different voxel sizes. view at source ↗
Figure 16
Figure 16: Comparison with Gemini 3 [27]. While Gemini 3 can generate voxel art through code, it lacks precise control over resolution, palette, and visual fidelity to input references. view at source ↗
Figure 17
Figure 17: Comparison with Rodin [97]. Rodin excels at image-to-mesh but is not tailored for voxel art, often yielding non-voxel outputs (right) or flat geometry (left). view at source ↗
Figure 18
Figure 18: Representative failure cases. Complex shapes with fine-grained geometric details are difficult to represent under limited voxel resolution, resulting in loss of intricate structures. view at source ↗
read the original abstract

Voxel art is a distinctive stylization widely used in games and digital media, yet automated generation from 3D meshes remains challenging due to conflicting requirements of geometric abstraction, semantic preservation, and discrete color coherence. Existing methods either over-simplify geometry or fail to achieve the pixel-precise, palette-constrained aesthetics of voxel art. We introduce Voxify3D, a differentiable two-stage framework bridging 3D mesh optimization with 2D pixel art supervision. Our core innovation lies in the synergistic integration of three components: (1) orthographic pixel art supervision that eliminates perspective distortion for precise voxel-pixel alignment; (2) patch-based CLIP alignment that preserves semantics across discretization levels; (3) palette-constrained Gumbel-Softmax quantization enabling differentiable optimization over discrete color spaces with controllable palette strategies. This integration addresses fundamental challenges: semantic preservation under extreme discretization, pixel-art aesthetics through volumetric rendering, and end-to-end discrete optimization. Experiments show superior performance (37.12 CLIP-IQA, 77.90% user preference) across diverse characters and controllable abstraction (2-8 colors, 20x-50x resolutions). Project page: https://yichuanh.github.io/Voxify-3D/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. Voxify3D is a differentiable two-stage framework for converting 3D meshes into voxel art. It combines orthographic pixel art supervision to eliminate perspective distortion, patch-based CLIP alignment to preserve semantics under discretization, and palette-constrained Gumbel-Softmax quantization to enable end-to-end optimization over discrete color palettes. Experiments report superior results with 37.12 CLIP-IQA and 77.90% user preference across characters at 2-8 colors and 20x-50x resolutions.

Significance. If the central claims hold after proper validation, the work could advance automated voxel stylization for games and digital media by providing a practical pipeline that jointly handles geometric abstraction, semantic fidelity, and palette constraints. The explicit use of orthographic supervision and differentiable quantization offers a concrete route to controllable discrete 3D outputs.

major comments (2)
  1. [Abstract] Abstract: the reported superiority (37.12 CLIP-IQA, 77.90% user preference) is presented without any description of baselines, ablation studies, or the precise experimental protocol (view selection, number of test meshes, evaluation views). This information is load-bearing for the claim that the three-component integration outperforms prior art.
  2. [Abstract] Method description (abstract): the claim that orthographic + patch-CLIP supervision plus Gumbel-Softmax produces 3D-consistent semantics after discretization lacks an explicit cross-view consistency term. Nothing in the pipeline is shown to penalize a voxel assignment that is coherent only on the training orthographic axes but collapses or recolors under rotation, which directly undermines the semantic-preservation guarantee.
minor comments (1)
  1. [Abstract] Abstract: no statement on code, data, or model availability is provided despite the project page link.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed comments. We have revised the manuscript to address the concerns about experimental context in the abstract and to clarify the mechanisms supporting 3D semantic consistency. Point-by-point responses follow.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the reported superiority (37.12 CLIP-IQA, 77.90% user preference) is presented without any description of baselines, ablation studies, or the precise experimental protocol (view selection, number of test meshes, evaluation views). This information is load-bearing for the claim that the three-component integration outperforms prior art.

    Authors: We agree that the abstract would benefit from additional context to substantiate the reported metrics. In the revised version we have expanded the abstract to note that results are compared against prior voxel stylization and mesh-to-voxel methods, that the test set comprises 20 diverse character meshes, that evaluation uses six orthographic views per mesh, and that ablations isolating each of the three components appear in Section 4.2. The full protocol (view selection, palette sizes, and user-study details) is already described in Section 4.1; we now reference it explicitly from the abstract. revision: yes

  2. Referee: [Abstract] Method description (abstract): the claim that orthographic + patch-CLIP supervision plus Gumbel-Softmax produces 3D-consistent semantics after discretization lacks an explicit cross-view consistency term. Nothing in the pipeline is shown to penalize a voxel assignment that is coherent only on the training orthographic axes but collapses or recolors under rotation, which directly undermines the semantic-preservation guarantee.

    Authors: The shared 3D voxel grid is optimized jointly from multiple orthographic axes; because voxel colors and occupancies are viewpoint-independent, any assignment that is coherent on the supervised axes remains coherent under arbitrary rotation by construction. Volumetric rendering further enforces this property. We have added a short paragraph in the method section and a clarifying sentence in the abstract that makes this implicit consistency explicit. We also include new supplementary visualizations rendered from random viewpoints to demonstrate preservation of semantics and palette under rotation. An additional explicit consistency loss is not required for the current claims but could be explored in future work. revision: partial
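
The rebuttal's viewpoint-independence argument can be illustrated with a toy check: since colors and occupancies live on one shared grid, any view only re-samples that grid's palette. A sketch, not the paper's renderer:

```python
# Toy check: a shared voxel grid means every orthographic view draws only
# colors from the grid's palette, regardless of viewing axis (or rotation).
import numpy as np

rng = np.random.default_rng(0)
N, K = 8, 4
palette = rng.random((K, 3))
labels = rng.integers(0, K, size=(N, N, N))    # per-voxel palette index
occupied = rng.random((N, N, N)) > 0.5         # binary occupancy

def first_hit_render(axis):
    """Nearest occupied voxel along one orthographic axis."""
    hit = occupied.argmax(axis=axis)           # index of first occupied voxel
    idx = np.take_along_axis(labels, np.expand_dims(hit, axis), axis).squeeze(axis)
    return palette[idx]                        # (N, N, 3) image

for axis in (0, 1, 2):
    used = np.unique(first_hit_render(axis).reshape(-1, 3), axis=0)
    assert all(any(np.allclose(c, p) for p in palette) for c in used)
print("every view draws only palette colors")
```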

Circularity Check

0 steps flagged

No circularity: independent optimization pipeline with no self-referential derivations

full rationale

The paper presents Voxify3D as a two-stage differentiable framework that combines orthographic pixel art supervision, patch-based CLIP alignment, and palette-constrained Gumbel-Softmax quantization. No equations, loss formulations, or uniqueness theorems are shown that reduce claimed outputs to inputs by construction. Reported metrics (CLIP-IQA, user preference) are experimental results rather than predictions forced by fitted parameters or self-citations. The method description remains self-contained against external benchmarks with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that the three listed components can be combined differentiably without loss of the target aesthetic; no new physical entities or ad-hoc constants are introduced beyond standard optimization hyperparameters.

free parameters (2)
  • palette size
    Controllable parameter (2-8 colors) chosen per run to match desired abstraction level.
  • voxel resolution
    Controllable parameter (20x-50x) set by user to control discretization granularity.
axioms (2)
  • standard math Gumbel-Softmax provides a differentiable approximation to discrete sampling
    Invoked to enable gradient flow through color quantization.
  • domain assumption CLIP features on patches capture semantic content at multiple discretization levels
    Used to justify semantic preservation under voxelization.
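
For reference, the standard Gumbel-Softmax relaxation invoked by the first axiom (Jang et al. [39]; Maddison et al. [64]); this is the standard notation, not necessarily the paper's:

```latex
y_i = \frac{\exp\!\big((\log \pi_i + g_i)/\tau\big)}
           {\sum_{j=1}^{K} \exp\!\big((\log \pi_j + g_j)/\tau\big)},
\qquad g_i = -\log(-\log u_i), \quad u_i \sim \mathrm{Uniform}(0,1).
```

As the temperature τ approaches 0, samples approach one-hot vectors over the K palette entries, recovering discrete color selection while the relaxation keeps gradients well-defined.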

pith-pipeline@v0.9.0 · 5526 in / 1411 out tokens · 27430 ms · 2026-05-16T23:57:50.689161+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

114 extracted references · 114 canonical work pages · 6 internal anchors

  1. [1]

    Making data matter: Voxel printing for the digital fabrication of data across scales and domains. Science Advances, 4(5):eaas8652, 2018

    Christoph Bader, Dominik Kolb, James C Weaver, Sunanda Sharma, Ahmed Hosny, João Costa, and Neri Oxman. Making data matter: Voxel printing for the digital fabrication of data across scales and domains. Science Advances, 4(5):eaas8652, 2018.

  2. [2]

    Edify 3d: Scalable high-quality 3d asset generation. arXiv preprint arXiv:2411.07135, 2024

    Maciej Bala, Yin Cui, Yifan Ding, Yunhao Ge, Zekun Hao, Jon Hasselgren, Jacob Huffman, Jingyi Jin, JP Lewis, Zhaoshuo Li, et al. Edify 3d: Scalable high-quality 3d asset generation. arXiv preprint arXiv:2411.07135, 2024.

  3. [3]

    SD-πXL: Generating low-resolution quantized imagery via score distillation

    Alexandre Binninger and Olga Sorkine-Hornung. SD-πXL: Generating low-resolution quantized imagery via score distillation. In SIGGRAPH Asia Conference Papers, pages 1–12, 2024.

  4. [4]

    Proxylessnas: Direct neural architecture search on target task and hardware

    Han Cai, Ligeng Zhu, and Song Han. Proxylessnas: Direct neural architecture search on target task and hardware. In International Conference on Learning Representations (ICLR), 2019.

  5. [5]

    Denoising likelihood score matching for conditional score-based data generation. arXiv preprint arXiv:2203.14206, 2022

    Chen-Hao Chao, Wei-Fang Sun, Bo-Wun Cheng, Yi-Chen Lo, Chia-Che Chang, Yu-Lun Liu, Yu-Lin Chang, Chia-Ping Chen, and Chun-Yi Lee. Denoising likelihood score matching for conditional score-based data generation. arXiv preprint arXiv:2203.14206, 2022.

  6. [6]

    Tensorf: Tensorial radiance fields

    Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. Tensorf: Tensorial radiance fields. In European Conference on Computer Vision, pages 333–350. Springer, 2022.

  7. [7]

    Improving robustness for joint optimization of camera pose and decomposed low-rank tensorial radiance fields

    Bo-Yu Chen, Wei-Chen Chiu, and Yu-Lun Liu. Improving robustness for joint optimization of camera pose and decomposed low-rank tensorial radiance fields. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 990–1000, 2024.

  8. [8]

    Clip-driven open-vocabulary 3d scene graph generation via cross-modality contrastive learning

    Lianggangxu Chen, Xuejiao Wang, Jiale Lu, Shaohui Lin, Changbo Wang, and Gaoqi He. Clip-driven open-vocabulary 3d scene graph generation via cross-modality contrastive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27863–27873, 2024.

  9. [9]

    Q-dit: Accurate post-training quantization for diffusion transformers

    Lei Chen, Yuan Meng, Chen Tang, Xinzhu Ma, Jingyan Jiang, Xin Wang, Zhi Wang, and Wenwu Zhu. Q-dit: Accurate post-training quantization for diffusion transformers. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 28306–28315, 2025.

  10. [10]

    Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation

    Rui Chen, Yongwei Chen, Ningxin Jiao, and Kui Jia. Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22246–22256, 2023.

  11. [11]

    Ortho-nerf: generating a true digital orthophoto map using the neural radiance field from unmanned aerial vehicle images. Geo-spatial Information Science, 28(2):741–760, 2025

    Shihan Chen, Qingsong Yan, Yingjie Qu, Wang Gao, Junxing Yang, and Fei Deng. Ortho-nerf: generating a true digital orthophoto map using the neural radiance field from unmanned aerial vehicle images. Geo-spatial Information Science, 28(2):741–760, 2025.

  12. [12]

    Voxelnext: Fully sparse voxelnet for 3d object detection and tracking

    Yukang Chen, Jianhui Liu, Xiangyu Zhang, Xiaojuan Qi, and Jiaya Jia. Voxelnext: Fully sparse voxelnet for 3d object detection and tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21674–21683, 2023.

  13. [13]

    Stylecity: Large-scale 3d urban scenes stylization

    Yingshu Chen, Huajian Huang, Tuan-Anh Vu, Ka Chun Shum, and Sai-Kit Yeung. Stylecity: Large-scale 3d urban scenes stylization. In European Conference on Computer Vision, pages 395–413. Springer, 2024.

  14. [14]

    Content-adaptive image downscaling

    Sungjoon Choi and Munchurl Kim. Content-adaptive image downscaling. In ICCV, pages 261–269, 2015.

  15. [15]

    3dstyleglip: Part-tailored text-guided 3d neural stylization

    SeungJeh Chung, JooHyun Park, and HyeongYeop Kang. 3dstyleglip: Part-tailored text-guided 3d neural stylization. arXiv preprint arXiv:2404.02634, 2024.

  16. [16]

    Regularization of voxel art

    David Coeurjolly, Pierre Gueth, and Jacques-Olivier Lachaud. Regularization of voxel art. In ACM SIGGRAPH 2018 Talks, pages 1–2, 2018.

  17. [17]

    Generating pixel art character sprites using gans. arXiv preprint arXiv:2208.06413, 2022

    Flávio Coutinho and Luiz Chaimowicz. Generating pixel art character sprites using gans. arXiv preprint arXiv:2208.06413, 2022. Submitted to SBGames 2022.

  18. [18]

    3d paintbrush: Local stylization of 3d shapes with cascaded score distillation

    Dale Decatur, Itai Lang, Kfir Aberman, and Rana Hanocka. 3d paintbrush: Local stylization of 3d shapes with cascaded score distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4473–4483, 2024.

  19. [19]

    Taming transformers for high-resolution image synthesis

    Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12873–12883, 2021.

  20. [20]

    Clipdraw: Exploring text-to-drawing synthesis through language-image encoders. Advances in Neural Information Processing Systems, 35:5207–5218, 2022

    Kevin Frans, Lisa Soros, and Olaf Witkowski. Clipdraw: Exploring text-to-drawing synthesis through language-image encoders. Advances in Neural Information Processing Systems, 35:5207–5218, 2022.

  21. [21]

    Plenoxels: Radiance fields without neural networks

    Sara Fridovich-Keil, Alex Yu, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5501–5510, 2022.

  22. [22]

    Style-nerf2nerf: 3d style transfer from style-aligned multi-view images

    Haruo Fujiwara, Yusuke Mukuta, and Tatsuya Harada. Style-nerf2nerf: 3d style transfer from style-aligned multi-view images. In SIGGRAPH Asia 2024 Conference Papers, pages 1–10, 2024.

  23. [23]

    Fastnerf: High-fidelity neural rendering at 200fps

    Stephan J Garbin, Marek Kowalski, Matthew Johnson, Jamie Shotton, and Julien Valentin. Fastnerf: High-fidelity neural rendering at 200fps. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 14346–14355, 2021.

  24. [24]

    Learn to create simple lego micro buildings. ACM Transactions on Graphics (TOG), 43(6):1–13, 2024

    Jiahao Ge, Mingjun Zhou, and Chi-wing Fu. Learn to create simple lego micro buildings. ACM Transactions on Graphics (TOG), 43(6):1–13, 2024.

  25. [25]

    Pixelated image abstraction with integrated user constraints. Computers & Graphics, 37(5):333–347, 2013

    Timothy Gerstner, Doug DeCarlo, Marc Alexa, Adam Finkelstein, Yotam Gingold, and Andrew Nealen. Pixelated image abstraction with integrated user constraints. Computers & Graphics, 37(5):333–347, 2013.

  26. [26]

    Controllable neural style transfer for dynamic meshes

    Guilherme Gomes Haetinger, Jingwei Tang, Raphael Ortiz, Paul Kanyuk, and Vinicius Azevedo. Controllable neural style transfer for dynamic meshes. In ACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024.

  27. [27]

    Gemini models: Product overview

    Google DeepMind. Gemini models: Product overview. https://deepmind.google/technologies/gemini/, 2025. Accessed: November 21, 2025.

  28. [28]

    Deep unsupervised pixelization. ACM Transactions on Graphics (SIGGRAPH Asia 2018 issue), 37(6):243:1–243:11, 2018

    Chu Han, Qiang Wen, Shengfeng He, Qianshu Zhu, Yinjie Tan, Guoqiang Han, and Tien-Tsin Wong. Deep unsupervised pixelization. ACM Transactions on Graphics (SIGGRAPH Asia 2018 issue), 37(6):243:1–243:11, 2018.

  29. [29]

    Deep unsupervised pixelization. ACM Transactions on Graphics (TOG), 37(6):1–11, 2018

    Chu Han, Qiang Wen, Shengfeng He, Qianshu Zhu, Yinjie Tan, Guoqiang Han, and Tien-Tsin Wong. Deep unsupervised pixelization. ACM Transactions on Graphics (TOG), 37(6):1–11, 2018.

  30. [30]

    Instruct-nerf2nerf: Editing 3d scenes with instructions

    Ayaan Haque, Matthew Tancik, Alexei A. Efros, Aleksander Holynski, and Angjoo Kanazawa. Instruct-nerf2nerf: Editing 3d scenes with instructions. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 8874–8884, 2023.

  31. [31]

    LRM: Large Reconstruction Model for Single Image to 3D

    Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. Lrm: Large reconstruction model for single image to 3d. arXiv preprint arXiv:2311.04400, 2023.

  32. [32]

    Plankassembly: Robust 3d reconstruction from three orthographic views with learnt shape programs

    Wentao Hu, Jia Zheng, Zixin Zhang, Xiaojun Yuan, Jian Yin, and Zihan Zhou. Plankassembly: Robust 3d reconstruction from three orthographic views with learnt shape programs. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18495–18505, 2023.

  33. [33]

    Quantart: Quantizing image style transfer towards high visual fidelity

    Siyu Huang, Jie An, Donglai Wei, Jiebo Luo, and Hanspeter Pfister. Quantart: Quantizing image style transfer towards high visual fidelity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5947–5956, 2023.

  34. [34]

    Part-aware shape generation with latent 3d diffusion of neural voxel fields. IEEE Transactions on Visualization and Computer Graphics, 2025

    Yuhang Huang, Shilong Zou, Xinwang Liu, and Kai Xu. Part-aware shape generation with latent 3d diffusion of neural voxel fields. IEEE Transactions on Visualization and Computer Graphics, 2025.

  35. [35]

    Spar3d: Stable point-aware reconstruction of 3d objects from single images. arXiv preprint arXiv:2501.04689, 2025

    Zixuan Huang, Mark Boss, Aaryaman Vasishta, James M Rehg, and Varun Jampani. Spar3d: Stable point-aware reconstruction of 3d objects from single images. arXiv preprint arXiv:2501.04689, 2025.

  36. [36]

    Pixel art adaptation for handicraft fabrication. Computer Graphics Forum (Pacific Graphics 2022), 41(7):489–494, 2022

    Yuki Igarashi and Takeo Igarashi. Pixel art adaptation for handicraft fabrication. Computer Graphics Forum (Pacific Graphics 2022), 41(7):489–494, 2022.

  37. [37]

    Image-to-image translation with conditional adversarial networks

    Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In CVPR, pages 1125–1134, 2017.

  38. [38]

    Vectorfusion: Text-to-svg by abstracting pixel-based diffusion models. arXiv preprint arXiv:2211.13845, 2022

    Ajay Jain, Amber Xie, and Pieter Abbeel. Vectorfusion: Text-to-svg by abstracting pixel-based diffusion models. arXiv preprint arXiv:2211.13845, 2022.

  39. [39]

    Categorical reparameterization with gumbel-softmax

    Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. In Proceedings of the 5th International Conference on Learning Representations (ICLR), 2017.

  40. [40]

    Perceptual Losses for Real-Time Style Transfer and Super-Resolution

    Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. arXiv preprint arXiv:1603.08155, 2016.

  41. [41]

    Minecraftify: Minecraft style image generation with text-guided image editing for in-game application. arXiv preprint arXiv:2402.05448, 2024

    Bumsoo Kim, Sanghyun Byun, Yonghoon Jung, Wonseop Shin, Sareer UI Amin, and Sanghyun Seo. Minecraftify: Minecraft style image generation with text-guided image editing for in-game application. arXiv preprint arXiv:2402.05448, 2024.

  42. [42]

    Diffusionclip: Text-guided image manipulation using diffusion models. arXiv preprint arXiv:2110.02711, 2022

    Jeong Joon Kim, Youngjung Hwang, Jaesung Park, Jaejun Choi, and Taesup Kim. Diffusionclip: Text-guided image manipulation using diffusion models. arXiv preprint arXiv:2110.02711, 2022.

  43. [43]

    Palettenerf: Palette-based appearance editing of neural radiance fields

    Zhengfei Kuang, Fujun Luan, Sai Bi, Zhixin Shu, Gordon Wetzstein, and Kalyan Sunkavalli. Palettenerf: Palette-based appearance editing of neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20691–20700, 2023.

  44. [44]

    Ice-nerf: Interactive color editing of nerfs via decomposition-aware weight optimization

    Jae-Hyeok Lee and Dae-Shik Kim. Ice-nerf: Interactive color editing of nerfs via decomposition-aware weight optimization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3491–3501, 2023.

  45. [45]

    Voxelizing google earth: A pipeline for new virtual worlds

    Ryan Hardesty Lewis. Voxelizing google earth: A pipeline for new virtual worlds. In ACM SIGGRAPH 2024 Labs, pages 1–2, 2024.

  46. [46]

    Compressing volumetric radiance fields to 1 mb

    Lingzhi Li, Zhen Shen, Zhongshu Wang, Li Shen, and Liefeng Bo. Compressing volumetric radiance fields to 1 mb. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4222–4231, 2023.

  47. [47]

    Diffusion-sdf: Text-to-shape via voxelized diffusion

    Muheng Li, Yueqi Duan, Jie Zhou, and Jiwen Lu. Diffusion-sdf: Text-to-shape via voxelized diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12642–12651, 2023.

  48. [48]

    Genrc: Generative 3d room completion from sparse image collections

    Ming-Feng Li, Yueh-Feng Ku, Hong-Xuan Yen, Chi Liu, Yu-Lun Liu, Albert YC Chen, Cheng-Hao Kuo, and Min Sun. Genrc: Generative 3d room completion from sparse image collections. In European Conference on Computer Vision, pages 146–163. Springer, 2024.

  49. [49]

    Blended latent diffusion: Text-driven editing of natural images. arXiv preprint arXiv:2301.11093, 2023

    Yuxuan Li, Robin Rombach, Yiqin Zhang, Xiaohang Zhan, Wenqiang Xu, Patrick Esser, and Björn Ommer. Blended latent diffusion: Text-driven editing of natural images. arXiv preprint arXiv:2301.11093, 2023.

  50. [50]

    Voxformer: Sparse voxel transformer for camera-based 3d semantic scene completion

    Yiming Li, Zhiding Yu, Christopher Choy, Chaowei Xiao, Jose M Alvarez, Sanja Fidler, Chen Feng, and Anima Anandkumar. Voxformer: Sparse voxel transformer for camera-based 3d semantic scene completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9087–9098, 2023.

  51. [51]

    Svdtree: Semantic voxel diffusion for single image tree reconstruction

    Yuan Li, Zhihao Liu, Bedrich Benes, Xiaopeng Zhang, and Jianwei Guo. Svdtree: Semantic voxel diffusion for single image tree reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4692–4702, 2024.

  52. [52]

    Magic3d: High-resolution text-to-3d content creation

    Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3d: High-resolution text-to-3d content creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 300–309, 2023.

  53. [53]

    Frugalnerf: Fast convergence for extreme few-shot novel view synthesis without learned priors

    Chin-Yang Lin, Chung-Ho Wu, Chang-Han Yeh, Shih-Han Yen, Cheng Sun, and Yu-Lun Liu. Frugalnerf: Fast convergence for extreme few-shot novel view synthesis without learned priors. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 11227–11238, 2025.

  54. [54]

    Darts: Differentiable architecture search

    Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search. In International Conference on Learning Representations (ICLR), 2019.

  55. [55]

    Stylerf: Zero-shot 3d style transfer of neural radiance fields

    Kunhao Liu, Fangneng Zhan, Yiwen Chen, Jiahui Zhang, Yingchen Yu, Abdulmotaleb El Saddik, Shijian Lu, and Eric P Xing. Stylerf: Zero-shot 3d style transfer of neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8338–8348, 2023.

  56. [56]

    Editing neural radiance fields by scene operations

    Lingjie Liu et al. Editing neural radiance fields by scene operations. In CVPR, 2022.

  57. [57]

    One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimization. Advances in Neural Information Processing Systems, 36, 2024

    Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Mukund Varma T, Zexiang Xu, and Hao Su. One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimization. Advances in Neural Information Processing Systems, 36, 2024.

  58. [58]

    Wir3d: Visually-informed and geometry-aware 3d shape abstraction. arXiv preprint arXiv:2505.04813, 2025

    Richard Liu, Daniel Fu, Noah Tan, Itai Lang, and Rana Hanocka. Wir3d: Visually-informed and geometry-aware 3d shape abstraction. arXiv preprint arXiv:2505.04813, 2025.

  59. [59]

    Content-aware radiance fields: Aligning model complexity with scene intricacy through learned bitwidth quantization

    Weihang Liu, Xue Xian Zheng, Jingyi Yu, and Xin Lou. Content-aware radiance fields: Aligning model complexity with scene intricacy through learned bitwidth quantization. In European Conference on Computer Vision, pages 239–

  60. [60]

    Robust dynamic radiance fields

    Yu-Lun Liu, Chen Gao, Andreas Meuleman, Hung-Yu Tseng, Ayush Saraf, Changil Kim, Yung-Yu Chuang, Johannes Kopf, and Jia-Bin Huang. Robust dynamic radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13–23, 2023.

  61. [61]

    Material palette: Extraction of materials from a single image

    Ivan Lopes, Fabio Pizzati, and Raoul de Charette. Material palette: Extraction of materials from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4379–4388, 2024.

  62. [62]

    Simulating water and smoke with an octree data structure. ACM Transactions on Graphics (TOG), 23(3):457–462, 2004

    Frank Losasso, Frédéric Gibou, and Ronald Fedkiw. Simulating water and smoke with an octree data structure. ACM Transactions on Graphics (TOG), 23(3):457–462, 2004.

  63. [63]

    Differentiable voxelization and mesh morphing. arXiv preprint arXiv:2407.11272, 2024

    Yihao Luo, Yikai Wang, Zhengrui Xiang, Yuliang Xiu, Guang Yang, and ChoonHwai Yap. Differentiable voxelization and mesh morphing. arXiv preprint arXiv:2407.11272, 2024.

  64. [64]

    The concrete distribution: A continuous relaxation of discrete random variables

    Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. In Proceedings of the 5th International Conference on Learning Representations (ICLR), 2017.

  65. [65]

    Pulse: Self-supervised photo upsampling via latent space exploration of generative models

    Sachin Menon, Alex Damian, Shijia Hu, Namkug Ravi, and Cynthia Rudin. Pulse: Self-supervised photo upsampling via latent space exploration of generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2437–2445, 2020.

  66. [66]

    Progressively optimized local radiance fields for robust view synthesis

    Andreas Meuleman, Yu-Lun Liu, Chen Gao, Jia-Bin Huang, Changil Kim, Min H Kim, and Johannes Kopf. Progressively optimized local radiance fields for robust view synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16539–16548, 2023.

  67. [67]

    Text2mesh: Text-driven neural stylization for meshes

    Oscar Michel, Roi Bar-On, Richard Liu, Sagie Benaim, and Rana Hanocka. Text2mesh: Text-driven neural stylization for meshes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13492–13502, 2022.

  68. [68]

    Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.

  69. [69]

    Dit-3d: Exploring plain diffusion transformers for 3d shape generation. Advances in Neural Information Processing Systems, 36:67960–67971, 2023

    Shentong Mo, Enze Xie, Ruihang Chu, Lanqing Hong, Matthias Niessner, and Zhenguo Li. Dit-3d: Exploring plain diffusion transformers for 3d shape generation. Advances in Neural Information Processing Systems, 36:67960–67971, 2023.

  70. [70]

    Clipcap: Clip prefix for image captioning

    Ron Mokady, Amir Hertz, and Amit H Bermano. Clipcap: Clip prefix for image captioning. In European Conference on Computer Vision (ECCV), pages 531–547. Springer, 2022.

  71. [71]

    Instant neural graphics primitives with a multiresolution hash encoding

    Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. In ACM Transactions on Graphics (TOG), pages 102:1–102:15. ACM, 2022.

  72. [72]

    Vdb: High-resolution sparse volumes with dynamic topology. ACM Transactions on Graphics (TOG), 32(3):1–22, 2013

    Ken Museth. Vdb: High-resolution sparse volumes with dynamic topology. ACM Transactions on Graphics (TOG), 32(3):1–22, 2013.

  73. [73]

    Styleclip: Text-driven manipulation of stylegan imagery. arXiv preprint arXiv:2103.17249, 2021

    Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. Styleclip: Text-driven manipulation of stylegan imagery. arXiv preprint arXiv:2103.17249, 2021.

  74. [74]

    Geoscaler: Geometry and rendering-aware downsampling of 3d mesh textures

    Sai Karthikey Pentapati, Anshul Rai, Arkady Ten, Chaitanya Atluru, and Alan Bovik. Geoscaler: Geometry and rendering-aware downsampling of 3d mesh textures. In 2025 IEEE International Conference on Image Processing (ICIP), pages 1007–1012. IEEE, 2025.

  75. [75]

    DreamFusion: Text-to-3D using 2D Diffusion

    Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022.

  76. [76]

    Generating physically stable and buildable lego designs from text. arXiv preprint arXiv:2505.05469, 2025

    Ava Pun, Kangle Deng, Ruixuan Liu, Deva Ramanan, Changliu Liu, and Jun-Yan Zhu. Generating physically stable and buildable lego designs from text. arXiv preprint arXiv:2505.05469, 2025.

  77. [77]

    Modular procedural generation for voxel maps

    Adarsh Pyarelal, Aditya Banerjee, and Kobus Barnard. Modular procedural generation for voxel maps. In AAAI Fall Symposium, pages 85–101. Springer, 2021.

  78. [78]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

  79. [79]

    Kilonerf: Scalable neural radiance fields with thousands of tiny mlps

    Claudius Reiser, Gernot Riegler, Anton S Kaplanyan, and Marc Pollefeys. Kilonerf: Scalable neural radiance fields with thousands of tiny mlps. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 14325–14334, 2021.

  80. [80]

    Xcube: Large-scale 3d generative modeling using sparse voxel hierarchies

    Xuanchi Ren, Jiahui Huang, Xiaohui Zeng, Ken Museth, Sanja Fidler, and Francis Williams. Xcube: Large-scale 3d generative modeling using sparse voxel hierarchies. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4209–4219, 2024.

Showing first 80 references.