SEIG uses staged VLM prompting to output executable Blender programs that reconstruct editable 3D scenes from single images, showing improved fidelity over non-staged baselines.
arXiv preprint arXiv:2505.05469 , year =
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 4roles
background 1polarities
background 1representative citing papers
Text Encoded Extrusions (TEE) lets LLMs generate and edit manifold 3D meshes by learning sequences of face extrusions from decomposed quadrilateral meshes.
Voxify3D generates voxel art from 3D meshes via orthographic pixel supervision, patch-based CLIP alignment, and palette-constrained Gumbel-Softmax quantization, achieving 37.12 CLIP-IQA and 77.90% user preference.
CG-MLLM is a multimodal LLM using a Mixture-of-Transformer architecture with separate TokenAR and BlockAR components integrated with a pre-trained vision-language backbone and 3D VAE to enable 3D captioning and high-fidelity generation.
citing papers explorer
-
Thinking in Blender: Staged Executable Inverse Graphics with Vision-Language Models
SEIG uses staged VLM prompting to output executable Blender programs that reconstruct editable 3D scenes from single images, showing improved fidelity over non-staged baselines.
-
CG-MLLM: Captioning and Generating 3D content via Multi-modal Large Language Models
CG-MLLM is a multimodal LLM using a Mixture-of-Transformer architecture with separate TokenAR and BlockAR components integrated with a pre-trained vision-language backbone and 3D VAE to enable 3D captioning and high-fidelity generation.