arxiv: 2606.02580 · v1 · pith:4MDLVEJ2new · submitted 2026-06-01 · 💻 cs.CV

Thinking in Blender: Staged Executable Inverse Graphics with Vision-Language Models

Pith reviewed 2026-06-28 15:01 UTC · model grok-4.3

classification 💻 cs.CV
keywords: inverse graphics · vision-language models · Blender · 3D reconstruction · staged reasoning · executable code · single-image reconstruction · scene editing

The pith

Pretrained vision-language models can turn single images into editable Blender scenes by writing code in successive refinement stages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether general-purpose vision-language models can solve inverse graphics by generating executable Blender programs that recreate a full 3D scene from one photo. Rather than asking the model to produce the entire program at once, the method breaks the work into ordered stages that refine geometry, then materials, then composition, then lighting. Experiments across varied scenes show that the staged version produces scenes with higher pixel, perceptual, and semantic agreement to the input image. A sympathetic reader would care because the result points to a way of obtaining manipulable 3D content without 3D-specific models or multi-view data.

Core claim

Staged Executable Inverse Graphics (SEIG) enables pretrained VLMs to perform executable inverse graphics from a single image by progressively refining scene factors including geometry, materials, composition, and lighting directly in executable Blender code space. The experiments demonstrate that this staged reconstruction substantially improves reconstruction fidelity over non-staged baselines.

What carries the argument

The SEIG framework that decomposes reconstruction into sequential refinements of geometry, materials, composition, and lighting inside executable Blender code.
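
The staged refinement described above can be sketched as a loop over the four scene factors, where each stage proposes a revised program and commits it only if a check passes. This is a minimal illustration, not the authors' implementation: the `generate` and `verify` callables, the `max_rounds` cap, and the policy of keeping the previous program when a stage never passes are all assumptions made for the example.

```python
from typing import Callable, List, Tuple

# Stage order as stated in the summary: geometry, materials, composition, lighting.
STAGES = ["geometry", "materials", "composition", "lighting"]

# Hypothetical interfaces (not from the paper):
Generator = Callable[[str, str], str]   # (stage, current_program) -> revised program
Verifier = Callable[[str, str], bool]   # (stage, candidate_program) -> accept?


def run_staged(program: str, generate: Generator, verify: Verifier,
               max_rounds: int = 3) -> Tuple[str, List[str]]:
    """Refine `program` one stage at a time; commit a stage only when verified."""
    log: List[str] = []
    for stage in STAGES:
        for round_idx in range(max_rounds):
            candidate = generate(stage, program)
            if verify(stage, candidate):
                program = candidate
                log.append(f"{stage}: accepted in round {round_idx + 1}")
                break
        else:
            # Assumed fallback: no candidate passed, so keep the earlier program.
            log.append(f"{stage}: kept previous program")
    return program, log


# Toy stand-ins for a VLM generator and verifier, for illustration only.
def toy_generate(stage: str, program: str) -> str:
    return program + f"# {stage} refined\n"


def toy_verify(stage: str, candidate: str) -> bool:
    return candidate.endswith(f"# {stage} refined\n")


final_program, stage_log = run_staged("# init scaffold\n", toy_generate, toy_verify)
```

In a real pipeline the generator would call a VLM with the reference image and the current program, and the verifier would inspect a render; both are reduced to string operations here so the control flow is easy to follow.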

If this is right

  • Staged decomposition raises reconstruction fidelity across pixel-level, perceptual, and semantic metrics.
  • The resulting Blender programs support direct rendering, relighting, and object manipulation.
  • The method operates with only a single image and off-the-shelf VLMs, without specialized 3D models or multi-view supervision.
  • Reconstructed scenes enable downstream editing and rendering applications inside standard Blender.
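
Because the output is a Blender program, relighting can in principle be expressed as a short follow-up script that touches only light objects. The sketch below builds such a script as text, mirroring the idea of working in code space. The `relight_script` helper and its dictionary format are hypothetical; the emitted calls (`bpy.data.lights.new`, `bpy.data.objects.new`, `bpy.data.objects.remove`, `bpy.context.collection.objects.link`) are standard Blender Python API calls, but the emitted script has to be run inside Blender to have any effect.

```python
def relight_script(lights):
    """Return Blender Python source that replaces all lights in the scene.

    `lights` is a list of dicts with keys: name, type, energy, color, location.
    This format is an assumption made for the example.
    """
    lines = [
        "import bpy",
        "",
        "# Remove existing light objects; meshes and materials are untouched.",
        "for obj in [o for o in bpy.data.objects if o.type == 'LIGHT']:",
        "    bpy.data.objects.remove(obj, do_unlink=True)",
        "",
    ]
    for spec in lights:
        lines += [
            f"data = bpy.data.lights.new(name={spec['name']!r}, type={spec['type']!r})",
            f"data.energy = {float(spec['energy'])!r}",
            f"data.color = {tuple(spec['color'])!r}",
            f"obj = bpy.data.objects.new({spec['name']!r}, data)",
            f"obj.location = {tuple(spec['location'])!r}",
            "bpy.context.collection.objects.link(obj)",
            "",
        ]
    return "\n".join(lines)


script = relight_script([
    {"name": "Key", "type": "AREA", "energy": 500,
     "color": (1.0, 0.9, 0.8), "location": (2, -2, 3)},
])
# Syntax check only: compiling does not import bpy or run the script.
compile(script, "<relight>", "exec")
```

Generating the script as text keeps the example runnable without Blender installed; inside Blender the same statements could be executed directly.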

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If VLM code-generation accuracy continues to rise, the same staged approach could handle more cluttered or dynamic scenes.
  • The decomposition pattern may transfer to other tasks that require generating long, structured code outputs from visual input.
  • One could measure whether adding further intermediate stages beyond the four described continues to improve results or saturates.
  • Editable Blender outputs could serve as starting points for interactive tools that let users adjust scenes through natural language.

Load-bearing premise

That pretrained vision-language models already contain enough spatial reasoning and code-generation ability to write correct Blender programs for diverse real scenes from a single view without 3D training or extra images.

What would settle it

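Generate Blender programs from a test set of real single images with both the staged pipeline and a non-staged baseline, render them, and compare each render against its input image. The claim would be undercut if the staged outputs match the inputs in geometry and appearance no better than the non-staged baseline does.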
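
One way to make this comparison concrete at the pixel level is a per-image PSNR difference between the staged and non-staged renders. The sketch below is only an illustration: PSNR is one common pixel-level metric and is not confirmed here as the paper's exact choice, and images are assumed to be flat sequences of intensities in [0, 1]. The helper names `psnr` and `mean_gain` are hypothetical.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio (dB) between two equally sized pixel sequences."""
    if len(reference) != len(rendered) or not reference:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mean_gain(references, staged_renders, baseline_renders):
    """Average per-image PSNR difference (staged minus baseline), in dB."""
    gains = [
        psnr(ref, staged) - psnr(ref, base)
        for ref, staged, base in zip(references, staged_renders, baseline_renders)
    ]
    return sum(gains) / len(gains)


# Toy 4-pixel "images", for illustration only.
ref = [0.0, 0.5, 1.0, 0.5]
staged = [0.0, 0.5, 1.0, 0.4]
baseline = [0.2, 0.5, 0.8, 0.5]
gain_db = mean_gain([ref], [staged], [baseline])
```

A mean gain near zero or below it across a real test set would correspond to the "no better than the baseline" outcome; perceptual and semantic metrics would need the same paired treatment.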

Figures

Figures reproduced from arXiv: 2606.02580 by Guangzhao He, Hadar Averbuch-Elor, Rundong Luo, Wei-Chiu Ma.

Figure 1. From a single reference image (leftmost inset), SEIG reconstructs an editable Blender scene through a staged generator–verifier loop driven entirely by …
Figure 2. Method overview. Our agentic reconstruction pipeline decomposes inverse graphics into four sequential stages of Blender code generation, each consisting of a generator step followed by a verifier step. An initialization stage samples four coarse scene initializations and labels a scene graph for all objects and parts, which is then passed to all subsequent stages for object- and part-centric refinement. Th…
Figure 4. Relighting. Two reconstructed scenes (top: pendant lamps; bottom: sailboat) re-rendered under two new lighting configurations. Since lights are separately stored in Blender, new illumination can be applied by adding or reconfiguring light sources without re-running any part of the pipeline.
Figure 3. Qualitative comparison across methods. Each block shows a different reference image (leftmost column) together with the reconstructions produced by VIGA (VLM-only), VIGA (full), and our pipeline. Within each method's column, the large rendering is taken from the reference viewpoint, and the two smaller renderings show the same reconstructed scene from alternate viewpoints, revealing the underlying 3D structure. A…
Figure 5. Object editing. Two reconstructed scenes (top: aircraft; bottom: castle), each shown alongside two example edits performed directly in Blender on the recovered scene graph: part duplication and texture editing for the aircraft; shape manipulation and object composition for the castle.
Figure 6. Physics simulation. Two reconstructed scenes used as input to Blender's built-in physics engine. Top (rigid-body dynamics): the recovered mugs and saucers slide and rattle when the table is given an external acceleration (shake the table). Bottom (soft-body dynamics): a ball is dropped onto the recovered cushion, which deforms accordingly (drop the ball). Both simulations run directly on the reconstructed sce…
Figure 7. Intermediate outputs across pipeline stages. Two examples showing the rendered scene through our pipeline: starting from a coarse initial scaffold (Initialization), through the four stages (Geometry, Material, Composition, and Lighting), each closed by its own generator–verifier loop, and the final image rendered from a VLM-determined camera (Camera-adjustment, rightmost column). Each stage commits its output…
Figure 8. Gallery of Blender scenes created by SEIG from in-the-wild and synthetic reference images. The synthetic scenes correspond to the examples shown in …
Original abstract

Inverse graphics is a longstanding and highly underconstrained problem that seeks to reconstruct images as editable 3D scenes which can be rendered, relit, and manipulated. In this work, we investigate whether pretrained vision-language models (VLMs) can perform executable inverse graphics directly from a single image by reconstructing a scene as an editable Blender program, without relying on specialized 2D or 3D foundation models, differentiable rendering, or multi-view supervision. We introduce Staged Executable Inverse Graphics (SEIG), an agentic framework that reconstructs a 3D scene from a single image by progressively refining scene factors including geometry, materials, composition, and lighting directly in executable Blender code space. We evaluate our framework across diverse scenes using a range of reconstruction metrics spanning pixel-level, perceptual, and semantic fidelity. Our experiments show that staged reconstruction substantially improves reconstruction fidelity, highlighting the importance of task decomposition for executable inverse graphics with general-purpose VLMs. Finally, we showcase various downstream applications enabled by the reconstructed editable Blender scenes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The paper introduces Staged Executable Inverse Graphics (SEIG), an agentic framework in which pretrained vision-language models generate executable Blender programs to reconstruct editable 3D scenes from single images. The reconstruction proceeds by progressive refinement of scene factors (geometry, materials, composition, lighting) directly in code space, without specialized 3D models, differentiable rendering, or multi-view supervision. The central claim is that this staged decomposition substantially improves reconstruction fidelity over direct (non-staged) VLM prompting, as measured by pixel-level, perceptual, and semantic metrics, while also enabling downstream editing applications.

Significance. If the quantitative results and ablations hold, the work would provide evidence that general-purpose VLMs can be orchestrated for practical inverse-graphics tasks via executable code output, reducing dependence on 3D-specific foundation models. The emphasis on editability and the absence of invented parameters or closed-form derivations are consistent with an empirical agentic approach.

major comments (3)
  1. [Abstract] The assertion that 'staged reconstruction substantially improves reconstruction fidelity' is presented without any numerical results, baseline descriptions, error analysis, or statistical tests; this is load-bearing for the central claim.
  2. [Evaluation / Experiments] No ablation equalizes total VLM calls, token budget, or wall-clock inference cost between the staged SEIG pipeline and the direct baseline, so any measured lift cannot yet be attributed to task decomposition rather than extra inference opportunities.
  3. [Method / Evaluation] The claim that pretrained VLMs suffice for correct Blender programs on diverse real scenes (without 3D training or multi-view data) is not accompanied by reported success rates, failure-case analysis, or comparison against 3D-specific baselines, leaving the weakest assumption untested.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and indicate the planned revisions.

  1. Referee: [Abstract] The assertion that 'staged reconstruction substantially improves reconstruction fidelity' is presented without any numerical results, baseline descriptions, error analysis, or statistical tests; this is load-bearing for the central claim.

    Authors: We agree the abstract should better support the central claim with concrete evidence. We will revise it to include key quantitative results (e.g., improvements in PSNR, LPIPS, and semantic metrics) and brief baseline descriptions drawn from the Experiments section. revision: yes

  2. Referee: [Evaluation / Experiments] No ablation equalizes total VLM calls, token budget, or wall-clock inference cost between the staged SEIG pipeline and the direct baseline, so any measured lift cannot yet be attributed to task decomposition rather than extra inference opportunities.

    Authors: This is a valid point; the staged pipeline uses multiple VLM calls by design. We will add a controlled ablation that matches total VLM calls and token budget (e.g., by granting the direct baseline equivalent refinement iterations) to isolate the effect of decomposition. revision: yes

  3. Referee: [Method / Evaluation] The claim that pretrained VLMs suffice for correct Blender programs on diverse real scenes (without 3D training or multi-view data) is not accompanied by reported success rates, failure-case analysis, or comparison against 3D-specific baselines, leaving the weakest assumption untested.

    Authors: We will add explicit success rates for valid executable program generation and a failure-case analysis section. Direct comparisons to 3D-specific baselines fall outside our focus on general-purpose VLMs without specialized training or multi-view data, but we will expand the discussion of related work for context. revision: partial
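
The budget-matched ablation discussed in point 2 could be set up by wrapping the model in a counter that enforces the same call and token caps on both pipelines. The sketch below is an assumption-laden illustration: `BudgetedModel`, the whitespace token proxy, and the cap values are hypothetical and are not taken from the paper or the rebuttal.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    calls: int = 0
    tokens: int = 0


class BudgetExceeded(RuntimeError):
    """Raised when a call would exceed the call or token cap."""


class BudgetedModel:
    """Wraps a prompt -> text callable and enforces fixed call and token caps."""

    def __init__(self, model, max_calls, max_tokens):
        self.model = model
        self.max_calls = max_calls
        self.max_tokens = max_tokens
        self.usage = Usage()

    def __call__(self, prompt):
        cost = len(prompt.split())  # crude whitespace proxy for token count (assumption)
        if self.usage.calls >= self.max_calls or self.usage.tokens + cost > self.max_tokens:
            raise BudgetExceeded("budget exhausted")
        self.usage.calls += 1
        self.usage.tokens += cost
        return self.model(prompt)


def refine_until_exhausted(model, program):
    """Direct baseline: keep refining the whole program until the budget runs out."""
    rounds = 0
    while True:
        try:
            program = model(f"refine: {program}")
        except BudgetExceeded:
            return program, rounds
        rounds += 1


# Illustrative caps only; a real ablation would copy them from the staged run's logs.
baseline_model = BudgetedModel(lambda prompt: "scene", max_calls=8, max_tokens=1000)
result, rounds = refine_until_exhausted(baseline_model, "init")
```

With both pipelines run through the same wrapper and caps, any remaining fidelity gap is easier to attribute to the decomposition itself rather than to extra inference.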

Circularity Check

0 steps flagged

No circularity: empirical agentic framework with independent evaluation

The paper describes an empirical agentic pipeline (SEIG) that uses pretrained VLMs to generate staged Blender code from single images, with results measured on pixel, perceptual, and semantic metrics. No equations, derivations, or parameter-fitting steps are present that reduce outputs to inputs by construction. No self-citation chains or uniqueness theorems are invoked as load-bearing premises. The central claim rests on experimental comparisons rather than any closed mathematical reduction, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the untested premise that current VLMs can reliably produce runnable Blender code for geometry, materials, composition, and lighting from image descriptions alone.

axioms (1)
  • Domain assumption: pretrained VLMs contain sufficient implicit 3D and code-generation knowledge to solve inverse graphics when prompted in stages.
    Invoked in the description of SEIG as the mechanism that replaces specialized 3D models and differentiable rendering.

pith-pipeline@v0.9.1-grok · 5718 in / 1195 out tokens · 19994 ms · 2026-06-28T15:01:31.544834+00:00 · methodology
