pith. machine review for the scientific record. sign in

arxiv: 2601.11109 · v3 · submitted 2026-01-16 · 💻 cs.CV · cs.AI· cs.GR

Recognition: unknown

Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning

Authors on Pith no claims yet
classification 💻 cs.CV cs.AIcs.GR
keywords vigamultimodalvision-as-inverse-graphicsagentblenderbenchchallengingframeworkinterleaved
0
0 comments X
read the original abstract

Vision-as-inverse-graphics, the concept of reconstructing images into editable programs, remains challenging for Vision-Language Models (VLMs), which inherently lack fine-grained spatial grounding in one-shot settings. To address this, we introduce VIGA (Vision-as-Inverse-Graphics Agent), an interleaved multimodal reasoning framework where symbolic logic and visual perception actively cross-verify each other. VIGA operates through a tightly coupled code-render-inspect loop: synthesizing symbolic programs, projecting them into visual states, and inspecting discrepancies to guide iterative edits. Equipped with high-level semantic skills and an evolving multimodal memory, VIGA sustains evidence-based modifications over long horizons. This training-free, task-agnostic framework seamlessly supports 2D document generation, 3D reconstruction, multi-step 3D editing, and 4D physical interaction. Finally, we introduce BlenderBench, a challenging visual-to-code benchmark. Empirically, VIGA substantially improves accuracy compared with one-shot baselines in BlenderGym (35.32%), SlideBench (117.17%) and our proposed BlenderBench (124.70%).

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SceneOrchestra: Efficient Agentic 3D Scene Synthesis via Full Tool-Call Trajectory Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    SceneOrchestra trains an orchestrator to generate full tool-call trajectories for 3D scene synthesis and uses a discriminator during training to select high-quality plans, yielding state-of-the-art results with lower runtime.

  2. Text-Guided 6D Object Pose Rearrangement via Closed-Loop VLM Agents

    cs.CV 2026-04 unverdicted novelty 6.0

    Closed-loop VLM agents using multi-view reasoning, object-centered visualization, and single-axis rotation prediction achieve superior text-guided 6D pose rearrangement for target objects in scenes.

  3. LychSim: A Controllable and Interactive Simulation Framework for Vision Research

    cs.CV 2026-05 unverdicted novelty 4.0

    LychSim introduces a controllable simulation platform on Unreal Engine 5 with Python API, procedural generation, and LLM integration for vision research tasks.