Thinking in Blender: Staged Executable Inverse Graphics with Vision-Language Models

Guangzhao He; Hadar Averbuch-Elor; Rundong Luo; Wei-Chiu Ma

arxiv: 2606.02580 · v1 · pith:4MDLVEJ2new · submitted 2026-06-01 · 💻 cs.CV


Pith reviewed 2026-06-28 15:01 UTC · model grok-4.3

classification 💻 cs.CV

keywords: inverse graphics, vision-language models, Blender, 3D reconstruction, staged reasoning, executable code, single-image reconstruction, scene editing


The pith

Pretrained vision-language models can turn single images into editable Blender scenes by writing code in successive refinement stages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether general-purpose vision-language models can solve inverse graphics by generating executable Blender programs that recreate a full 3D scene from one photo. Rather than asking the model to produce the entire program at once, the method breaks the work into ordered stages that refine geometry, then materials, then composition, then lighting. Experiments across varied scenes show that the staged version produces scenes with higher pixel, perceptual, and semantic agreement to the input image. A sympathetic reader would care because the result points to a way of obtaining manipulable 3D content without 3D-specific models or multi-view data.

Core claim

Staged Executable Inverse Graphics (SEIG) enables pretrained VLMs to perform executable inverse graphics from a single image by progressively refining scene factors including geometry, materials, composition, and lighting directly in executable Blender code space. The experiments demonstrate that this staged reconstruction substantially improves reconstruction fidelity over non-staged baselines.

What carries the argument

The SEIG framework that decomposes reconstruction into sequential refinements of geometry, materials, composition, and lighting inside executable Blender code.
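
One way to picture this decomposition is as a loop over stages, where each stage alternates a generator call and a verifier call until the verifier accepts or a retry budget runs out. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation: the stage names come from the text, while `run_staged`, the `generate` and `verify` callables (stand-ins for VLM calls), the retry budget, and the toy stubs are all hypothetical.

```python
from typing import Callable, List, Tuple

# Stage order taken from the text; all other names here are hypothetical.
STAGES = ["geometry", "materials", "composition", "lighting"]

Generator = Callable[[str, str, str], str]         # (stage, program, feedback) -> new program
Verifier = Callable[[str, str], Tuple[bool, str]]  # (stage, program) -> (accepted, feedback)


def run_staged(initial_program: str, generate: Generator, verify: Verifier,
               max_rounds: int = 3) -> Tuple[str, List[str]]:
    """Refine one scene factor at a time, committing each stage before the next."""
    program = initial_program
    log: List[str] = []
    for stage in STAGES:
        feedback = ""
        for round_idx in range(max_rounds):
            candidate = generate(stage, program, feedback)
            accepted, feedback = verify(stage, candidate)
            log.append(f"{stage}:{round_idx}:{'ok' if accepted else 'retry'}")
            if accepted:
                program = candidate  # commit; later stages build on this
                break
        # If no candidate is accepted, the previously committed program is kept.
    return program, log


# Toy stand-ins for the VLM calls (assumptions for illustration only).
def toy_generate(stage: str, program: str, feedback: str) -> str:
    suffix = " (revised)" if feedback else ""
    return program + f"\n# {stage} pass{suffix}"


def toy_verify(stage: str, program: str) -> Tuple[bool, str]:
    # Reject the first lighting attempt to exercise the retry path.
    if stage == "lighting" and "(revised)" not in program.splitlines()[-1]:
        return False, "scene too dark"
    return True, ""


final_program, log = run_staged("import bpy", toy_generate, toy_verify)
```

The program is carried as plain source text, so nothing here requires Blender to be installed; a real pipeline would presumably execute and render the candidate before the verifier sees it.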

If this is right

Staged decomposition raises reconstruction fidelity across pixel-level, perceptual, and semantic metrics.
The resulting Blender programs support direct rendering, relighting, and object manipulation.
The method operates with only a single image and off-the-shelf VLMs, without specialized 3D models or multi-view supervision.
Reconstructed scenes enable downstream editing and rendering applications inside standard Blender.
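
The relighting claim follows from lights being separate entries in the scene program, so they can be swapped without touching geometry. The sketch below illustrates that idea under assumptions: the `Scene` and `Light` containers and the `relight` and `emit_script` helpers are hypothetical, and the emitted text uses standard Blender Python calls (`bpy.data.lights.new`, `bpy.data.objects.new`, `bpy.context.collection.objects.link`) but is only generated as a string here, not executed.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class Light:
    name: str
    kind: str                     # Blender light type: 'POINT', 'SUN', 'SPOT', or 'AREA'
    energy: float                 # watts for point/spot/area lights
    location: Tuple[float, float, float]


@dataclass(frozen=True)
class Scene:
    geometry_code: str            # code produced by the earlier stages (opaque here)
    lights: Tuple[Light, ...]


def relight(scene: Scene, new_lights: List[Light]) -> Scene:
    """Swap the lights while leaving the geometry code untouched."""
    return replace(scene, lights=tuple(new_lights))


def emit_script(scene: Scene) -> str:
    """Render the scene description as Blender Python source text."""
    lines = ["import bpy", scene.geometry_code]
    for light in scene.lights:
        lines += [
            f"data = bpy.data.lights.new(name={light.name!r}, type={light.kind!r})",
            f"data.energy = {light.energy}",
            f"obj = bpy.data.objects.new({light.name!r}, data)",
            f"obj.location = {light.location}",
            "bpy.context.collection.objects.link(obj)",
        ]
    return "\n".join(lines)


# Hypothetical scene: one cube lit by an area light, then relit with a point lamp.
scene = Scene("bpy.ops.mesh.primitive_cube_add()",
              (Light("Key", "AREA", 500.0, (2.0, -2.0, 3.0)),))
warm = relight(scene, [Light("Lamp", "POINT", 80.0, (0.0, 0.0, 2.5))])
script = emit_script(warm)
```

Keeping the scene immutable means the original lighting stays available for comparison after each relight.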

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If VLM code-generation accuracy continues to rise, the same staged approach could handle more cluttered or dynamic scenes.
The decomposition pattern may transfer to other tasks that require generating long, structured code outputs from visual input.
One could measure whether adding further intermediate stages beyond the four described continues to improve results or saturates.
Editable Blender outputs could serve as starting points for interactive tools that let users adjust scenes through natural language.

Load-bearing premise

That pretrained vision-language models already contain enough spatial reasoning and code-generation ability to write correct Blender programs for diverse real scenes from a single view without 3D training or extra images.

What would settle it

Generate Blender programs from a test set of real single images, render them, and compare the renders with the input images in geometry and appearance. If the staged outputs match the inputs no better than non-staged baselines do, the central claim fails.
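
A pixel-level version of this test could be scored with PSNR and a paired win rate, as in the hedged sketch below. The function names and toy pixel values are assumptions made for illustration; they are not data or code from the paper, and a fuller check would add perceptual and semantic metrics alongside the pixel one.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], rendered: Sequence[float], peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized, flattened images."""
    if len(reference) != len(rendered) or not reference:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


def staged_win_rate(references, staged_renders, baseline_renders) -> float:
    """Fraction of test images on which the staged render scores strictly higher PSNR."""
    wins = sum(
        psnr(ref, s) > psnr(ref, b)
        for ref, s, b in zip(references, staged_renders, baseline_renders)
    )
    return wins / len(references)


# Toy three-pixel "images" with intensities in [0, 1] (made-up values).
refs = [[0.0, 0.5, 1.0], [0.2, 0.2, 0.2]]
staged = [[0.0, 0.5, 0.9], [0.2, 0.3, 0.2]]
baseline = [[0.1, 0.4, 0.8], [0.2, 0.2, 0.2]]
rate = staged_win_rate(refs, staged, baseline)
```

On real data, a win rate at or below one half would be the outcome that counts against the staging claim.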

Figures

Figures reproduced from arXiv: 2606.02580 by Guangzhao He, Hadar Averbuch-Elor, Rundong Luo, Wei-Chiu Ma.

**Figure 1.** From a single reference image (leftmost inset), SEIG reconstructs an editable Blender scene through a staged generator–verifier loop driven entirely by…

**Figure 2.** Method overview. Our agentic reconstruction pipeline decomposes inverse graphics into four sequential stages of Blender code generation, each consisting of a generator step followed by a verifier step. An initialization stage samples four coarse scene initializations and labels a scene graph for all objects and parts, which is then passed to all subsequent stages for object- and part-centric refinement. …

**Figure 4.** Relighting. Two reconstructed scenes (top: pendant lamps; bottom: sailboat) re-rendered under two new lighting configurations. Since lights are separately stored in Blender, new illumination can be applied by adding or reconfiguring light sources without re-running any part of the pipeline.

**Figure 3.** Qualitative comparison across methods. Each block shows a different reference image (leftmost column) together with the reconstructions produced by VIGA (VLM-only), VIGA (full), and our pipeline. Within each method's column, the large rendering is taken from the reference viewpoint, and the two smaller ones show the same reconstructed scene rendered from alternate viewpoints, revealing the underlying 3D structure. …

**Figure 5.** Object editing. Two reconstructed scenes (top: aircraft; bottom: castle), each shown alongside two example edits performed directly in Blender on the recovered scene graph: part duplication and texture editing for the aircraft; shape manipulation and object composition for the castle.

**Figure 6.** Physics simulation. Two reconstructed scenes used as input to Blender's built-in physics engine. Top (rigid-body dynamics): the recovered mugs and saucers slide and rattle when the table is given an external acceleration (shake the table). Bottom (soft-body dynamics): a ball is dropped onto the recovered cushion, which deforms accordingly (drop the ball). Both simulations run directly on the reconstructed sce…

**Figure 7.** Intermediate outputs across pipeline stages. Two examples showing the rendered scene through our pipeline: starting from a coarse initial scaffold (Initialization), through the four stages (Geometry, Material, Composition, and Lighting), each closed by its own generator–verifier loop, and the final image rendered from a VLM-determined camera (Camera-adjustment, rightmost column). Each stage commits its output…

**Figure 8.** Gallery of Blender scenes created by SEIG from in-the-wild and synthetic reference images. The synthetic scenes correspond to the examples shown in…

Original abstract

Inverse graphics is a longstanding and highly underconstrained problem that seeks to reconstruct images as editable 3D scenes which can be rendered, relit, and manipulated. In this work, we investigate whether pretrained vision-language models (VLMs) can perform executable inverse graphics directly from a single image by reconstructing a scene as an editable Blender program, without relying on specialized 2D or 3D foundation models, differentiable rendering, or multi-view supervision. We introduce Staged Executable Inverse Graphics (SEIG), an agentic framework that reconstructs a 3D scene from a single image by progressively refining scene factors including geometry, materials, composition, and lighting directly in executable Blender code space. We evaluate our framework across diverse scenes using a range of reconstruction metrics spanning pixel-level, perceptual, and semantic fidelity. Our experiments show that staged reconstruction substantially improves reconstruction fidelity, highlighting the importance of task decomposition for executable inverse graphics with general-purpose VLMs. Finally, we showcase various downstream applications enabled by the reconstructed editable Blender scenes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Staged VLM prompting to generate executable Blender code for single-image inverse graphics is the new piece here, but the fidelity gains rest on unevaluated claims with no numbers or call-count controls.


The main thing is that they show a VLM can produce a full Blender Python script from one photo by breaking the task into stages—geometry, then materials, composition, lighting—and outputting runnable code at each step. This keeps the output directly editable in a standard tool without training any 3D model or using differentiable rendering.

The approach is new in routing the entire reconstruction through executable code space with explicit task decomposition. Prior work on VLM 3D reasoning or inverse graphics tends to stop at descriptions or use specialized components; this one stays general and produces something you can immediately render and tweak. If the full paper includes the actual metric tables and scene examples, that would make the method easy to try.

The evaluation side is thin. The abstract states that staged reconstruction improves pixel, perceptual, and semantic fidelity but gives no numbers, no direct baselines, and no breakdown of where errors occur. The call-budget concern is the serious one: without an ablation that holds total VLM calls or tokens roughly constant, any lift could come from extra inference steps rather than the staging structure itself. The core assumption, that a pretrained VLM has reliable enough spatial and code-generation ability for diverse real scenes, also needs concrete failure cases or success rates to be convincing.

This is for people working on VLM agents for graphics or low-barrier 3D content creation. A reader who wants a no-training route to manipulable scenes would get the method description and the downstream demos.

It should go to peer review. The idea is distinct enough and the setup practical enough that referees can usefully ask for the missing quantitative controls and ablations.

Referee Report

3 major / 0 minor

Summary. The paper introduces Staged Executable Inverse Graphics (SEIG), an agentic framework in which pretrained vision-language models generate executable Blender programs to reconstruct editable 3D scenes from single images. The reconstruction proceeds by progressive refinement of scene factors (geometry, materials, composition, lighting) directly in code space, without specialized 3D models, differentiable rendering, or multi-view supervision. The central claim is that this staged decomposition substantially improves reconstruction fidelity over direct (non-staged) VLM prompting, as measured by pixel-level, perceptual, and semantic metrics, while also enabling downstream editing applications.

Significance. If the quantitative results and ablations hold, the work would provide evidence that general-purpose VLMs can be orchestrated for practical inverse-graphics tasks via executable code output, reducing dependence on 3D-specific foundation models. The emphasis on editability and the absence of invented parameters or closed-form derivations are consistent with an empirical agentic approach.

major comments (3)

[Abstract] The assertion that 'staged reconstruction substantially improves reconstruction fidelity' is presented without any numerical results, baseline descriptions, error analysis, or statistical tests; this is load-bearing for the central claim.
[Evaluation / Experiments] No ablation equalizes total VLM calls, token budget, or wall-clock inference cost between the staged SEIG pipeline and the direct baseline, so any measured lift cannot yet be attributed to task decomposition rather than extra inference opportunities.
[Method / Evaluation] The claim that pretrained VLMs suffice for correct Blender programs on diverse real scenes (without 3D training or multi-view data) is not accompanied by reported success rates, failure-case analysis, or comparison against 3D-specific baselines, leaving the weakest assumption untested.
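
The call-budget control raised in the second comment can be set up as simple bookkeeping: compute how many model calls the staged pipeline may spend, then cap the direct baseline at the same ceiling. The sketch below is an illustration under assumptions. The four stages come from the paper's description, while `CallBudget`, the rounds per stage, the one-generator-plus-one-verifier cost per round, and the token counts are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class CallBudget:
    """Tracks how many model calls and tokens a pipeline has spent."""
    max_calls: int
    calls: int = 0
    tokens: int = 0

    def charge(self, tokens_used: int) -> bool:
        """Record one call; return False (and record nothing) once the budget is spent."""
        if self.calls >= self.max_calls:
            return False
        self.calls += 1
        self.tokens += tokens_used
        return True


def staged_call_count(num_stages: int, rounds_per_stage: int) -> int:
    # Each round is assumed to cost one generator call plus one verifier call.
    return num_stages * rounds_per_stage * 2


def matched_budgets(num_stages: int, rounds_per_stage: int) -> Dict[str, CallBudget]:
    """Give the direct baseline the same call ceiling as the staged pipeline."""
    total = staged_call_count(num_stages, rounds_per_stage)
    return {"staged": CallBudget(total), "direct": CallBudget(total)}


budgets = matched_budgets(num_stages=4, rounds_per_stage=3)

# The direct baseline spends its calls on whole-program refinement rounds:
# one generator call (made-up 1000 tokens) and one verifier call (made-up 200).
direct_rounds = 0
while budgets["direct"].charge(tokens_used=1000) and budgets["direct"].charge(tokens_used=200):
    direct_rounds += 1
```

With both arms capped this way, a remaining fidelity gap would be easier to attribute to the decomposition than to extra inference; matching tokens or wall-clock time would need the same treatment.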

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and indicate planned revisions.


Referee: [Abstract] The assertion that 'staged reconstruction substantially improves reconstruction fidelity' is presented without any numerical results, baseline descriptions, error analysis, or statistical tests; this is load-bearing for the central claim.

Authors: We agree the abstract should better support the central claim with concrete evidence. We will revise it to include key quantitative results (e.g., improvements in PSNR, LPIPS, and semantic metrics) and brief baseline descriptions drawn from the Experiments section. Revision: yes.

Referee: [Evaluation / Experiments] No ablation equalizes total VLM calls, token budget, or wall-clock inference cost between the staged SEIG pipeline and the direct baseline, so any measured lift cannot yet be attributed to task decomposition rather than extra inference opportunities.

Authors: This is a valid point; the staged pipeline uses multiple VLM calls by design. We will add a controlled ablation that matches total VLM calls and token budget (e.g., by granting the direct baseline equivalent refinement iterations) to isolate the effect of decomposition. Revision: yes.

Referee: [Method / Evaluation] The claim that pretrained VLMs suffice for correct Blender programs on diverse real scenes (without 3D training or multi-view data) is not accompanied by reported success rates, failure-case analysis, or comparison against 3D-specific baselines, leaving the weakest assumption untested.

Authors: We will add explicit success rates for valid executable program generation and a failure-case analysis section. Direct comparisons to 3D-specific baselines fall outside our focus on general-purpose VLMs without specialized training or multi-view data, but we will expand the discussion of related work for context. Revision: partial.

Circularity Check

0 steps flagged

No circularity: empirical agentic framework with independent evaluation


The paper describes an empirical agentic pipeline (SEIG) that uses pretrained VLMs to generate staged Blender code from single images, with results measured on pixel, perceptual, and semantic metrics. No equations, derivations, or parameter-fitting steps are present that reduce outputs to inputs by construction. No self-citation chains or uniqueness theorems are invoked as load-bearing premises. The central claim rests on experimental comparisons rather than any closed mathematical reduction, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the untested premise that current VLMs can reliably produce runnable Blender code for geometry, materials, composition, and lighting from image descriptions alone.

axioms (1)

Domain assumption: Pretrained VLMs contain sufficient implicit 3D and code-generation knowledge to solve inverse graphics when prompted in stages.
Invoked in the description of SEIG as the mechanism that replaces specialized 3D models and differentiable rendering.



[1] [1]

European Conference on Computer Vision , year =

NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis , author =. European Conference on Computer Vision , year =

[2] [2]

ACM Transactions on Graphics , year =

3D Gaussian Splatting for Real-Time Radiance Field Rendering , author =. ACM Transactions on Graphics , year =

[3] [3]

IEEE/CVF Conference on Computer Vision and Pattern Recognition , year=

Deep 3d capture: Geometry and reflectance from sparse multi-view images , author=. IEEE/CVF Conference on Computer Vision and Pattern Recognition , year=

[4] [4]

ACM Transactions on Graphics , year=

Practical svbrdf acquisition of 3d objects with unstructured flash photography , author=. ACM Transactions on Graphics , year=

[5] [5]

ACM Transactions on Graphics , year=

Appearance-from-motion: Recovering spatially varying surface reflectance under unknown lighting , author=. ACM Transactions on Graphics , year=

[6] [6]

ACM Transactions on Graphics , year=

Learning to reconstruct shape and spatially-varying reflectance from a single image , author=. ACM Transactions on Graphics , year=

[7] [7]

IEEE/CVF Conference on Computer Vision and Pattern Recognition , year=

Csgnet: Neural shape parser for constructive solid geometry , author=. IEEE/CVF Conference on Computer Vision and Pattern Recognition , year=

[8] [8]

Advances in Neural Information Processing Systems , year=

Differentiable blocks world: Qualitative 3d decomposition by rendering primitives , author=. Advances in Neural Information Processing Systems , year=

[9] [9]

IEEE/CVF Winter Conference on Applications of Computer Vision , year=

Volumetric disentanglement for 3d scene manipulation , author=. IEEE/CVF Winter Conference on Applications of Computer Vision , year=

[10] [10]

IEEE/CVF International Conference on Computer Vision , year=

Learning object-compositional neural radiance field for editable scene rendering , author=. IEEE/CVF International Conference on Computer Vision , year=

[11] [11]

IEEE/CVF Conference on Computer Vision and Pattern Recognition , year=

Dynamic lidar re-simulation using compositional neural fields , author=. IEEE/CVF Conference on Computer Vision and Pattern Recognition , year=

[12] [12]

International Conference on Learning Representations , year=

Omnire: Omni urban scene reconstruction , author=. International Conference on Learning Representations , year=

[13] [13]

Shape from shading: A method for obtaining the shape of a smooth opaque object from one view , author=

[14] [14]

Computer Vision Systems , year=

Recovering intrinsic scene characteristics , author=. Computer Vision Systems , year=

[15] [15]

1963 , school=

Machine perception of three-dimensional solids , author=. 1963 , school=

1963

[16] [16]

arXiv preprint arXiv:2601.11109 , year =

Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning , author =. arXiv preprint arXiv:2601.11109 , year =

Pith/arXiv arXiv

[17] [17]

arXiv preprint arXiv:2410.13882 , year =

Articulate-Anything: Automatic Modeling of Articulated Objects via a Vision-Language Foundation Model , author =. arXiv preprint arXiv:2410.13882 , year =

arXiv

[18] [18]

arXiv preprint arXiv:2505.05469 , year =

Generating Physically Stable and Buildable Brick Structures from Text , author =. arXiv preprint arXiv:2505.05469 , year =

arXiv

[19] [19]

arXiv preprint arXiv:2307.05663 , year =

Objaverse-XL: A Universe of 10M+ 3D Objects , author =. arXiv preprint arXiv:2307.05663 , year =

Pith/arXiv arXiv

[20] [20]

Inverse Rendering for Computer Graphics , author =

[21] [21]

ACM SIGGRAPH , year =

A Signal-Processing Framework for Inverse Rendering , author =. ACM SIGGRAPH , year =

[22] [22]

Advances in Neural Information Processing Systems , year =

Deep Convolutional Inverse Graphics Network , author =. Advances in Neural Information Processing Systems , year =

[23] [23]

arXiv preprint arXiv:2404.15228 , year =

Re-Thinking Inverse Graphics With Large Language Models , author =. arXiv preprint arXiv:2404.15228 , year =

arXiv

[24] [24]

arXiv preprint arXiv:2405.14871 , year =

NeRF-Casting: Improved View-Dependent Appearance with Consistent Reflections , author =. arXiv preprint arXiv:2405.14871 , year =

arXiv

[25] [25]

arXiv preprint arXiv:2304.12461 , year =

TensoIR: Tensorial Inverse Rendering , author =. arXiv preprint arXiv:2304.12461 , year =

arXiv

[26] [26]

IEEE/CVF International Conference on Computer Vision , year =

Instruct-NeRF2NeRF: Editing 3D Scenes with Instructions , author =. IEEE/CVF International Conference on Computer Vision , year =

[27] [27]

IEEE/CVF Conference on Computer Vision and Pattern Recognition , year =

DRAWER: Digital Reconstruction and Articulation With Environment Realism , author =. IEEE/CVF Conference on Computer Vision and Pattern Recognition , year =

[28] [28]

Advances in Neural Information Processing Systems , year =

Visual Instruction Tuning , author =. Advances in Neural Information Processing Systems , year =

[29] [29]

arXiv preprint arXiv:2409.12191 , year =

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution , author =. arXiv preprint arXiv:2409.12191 , year =

Pith/arXiv arXiv

[30] [30]

arXiv preprint arXiv:2508.08228 , year =

LL3M: Large Language 3D Modelers , author =. arXiv preprint arXiv:2508.08228 , year =

arXiv

[31] [31]

arXiv preprint arXiv:2512.11061 , year =

VDAWorld: World Modelling via VLM-Directed Abstraction and Simulation , author =. arXiv preprint arXiv:2512.11061 , year =

Pith/arXiv arXiv

[32] [32]

ACM Transactions on Graphics , year =

NeRFactor: Neural Factorization of Shape and Reflectance Under an Unknown Illumination , author =. ACM Transactions on Graphics , year =

[33] [33]

IEEE/CVF Conference on Computer Vision and Pattern Recognition , year=

VGGT: Visual Geometry Grounded Transformer , author=. IEEE/CVF Conference on Computer Vision and Pattern Recognition , year=

[34] The Scene Language: Representing Scenes with Programs, Words, and Embeddings. 2024.

[35] CAT3D: Create Anything in 3D with Multi-View Diffusion Models. Advances in Neural Information Processing Systems.

[36] Dor Verbin, Peter Hedman, Ben Mildenhall, Todd Zickler, Jonathan T. Barron, and Pratul P. Srinivasan.

[37] PhySG: Inverse Rendering with Spherical Gaussians for Physics-based Material Editing and Relighting. IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[38] Extracting Triangular 3D Models, Materials, and Lighting From Images. IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[39] GS-IR: 3D Gaussian Splatting for Inverse Rendering. IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[40] GPT-4V(ision) System Card.

[41] Gemini: A Family of Highly Capable Multimodal Models.

[42] SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities. IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[43] Claude Opus 4.7. 2025.

[44] Unsupervised Discovery of Object-Centric Neural Fields. Transactions on Machine Learning Research.

[45] 3DAxiesPrompts: Unleashing the 3D Spatial Task Capabilities of GPT-4V. arXiv preprint arXiv:2312.09738.

[46] SceneCraft: An LLM Agent for Synthesizing 3D Scenes as Blender Code. International Conference on Machine Learning.

[47] BlenderGym: Benchmarking Foundational Model Systems for Graphics Editing. IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[48] MeshCoder: LLM-Powered Structured Mesh Code Generation from Point Clouds. arXiv preprint arXiv:2508.14879.

[49] IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering. arXiv preprint arXiv:2506.23329.

[50] VoxHammer: Training-Free Precise and Coherent 3D Editing in Native 3D Space. International Conference on 3D Vision.

[51] Segment Anything. IEEE/CVF International Conference on Computer Vision.

[52] SAM 3D: 3Dfy Anything in Images. arXiv preprint arXiv:2511.16624.

[53] DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data. Advances in Neural Information Processing Systems.

[54] Learning Transferable Visual Models From Natural Language Supervision. International Conference on Machine Learning.

[55] Non-rigid Point Cloud Registration with Neural Deformation Pyramid. Advances in Neural Information Processing Systems.

[56] DINOv2: Learning Robust Visual Features without Supervision. Transactions on Machine Learning Research.