PhotoFlow: Agentic 3D Virtual Photography Missions

Haojia Wei; Hongjie Zhang; Jiarui Guo; Xue Yang; Yifei Liu; Yiming Zhang; Yuning Gong; Zhihang Zhong

arXiv: 2605.23771 · v1 · pith:XURTZ3P4new · submitted 2026-05-22 · 💻 cs.CV · cs.AI · cs.MA

Jiarui Guo, Haojia Wei, Yiming Zhang, Yifei Liu, Yuning Gong, Hongjie Zhang, Xue Yang, Zhihang Zhong

Pith reviewed 2026-05-25 04:31 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.MA

keywords virtual photography · 3D scene agents · camera pose selection · vision-language models · Blender rendering · closed-loop search · aesthetic judgment

The pith

A Director-Reviewer-Reflector agent searches camera poses in 3D scenes to match language intents, outperforming one-shot and random baselines under a six-render budget.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PhotoFlow as a closed-loop agent that lets an LLM direct, critique, and reflect on camera placements inside arbitrary Blender scenes to produce photographs matching a language goal. It builds VPhotoBench with 47 scenes and 141 missions covering placement, composition, and style. Experiments show the full agent reaches the highest quality-alignment score and success rate compared with one-shot prediction, single-chain reflection, anchor selection, and random search. The work frames virtual photography as a solvable agent task that stresses both 3D spatial reasoning and aesthetic choice together.

Core claim

PhotoFlow's Director proposes diverse candidate cameras from a soft blueprint, the Reviewer applies rule checks, visual critique, and pairwise selection to keep the best incumbent, and the Reflector turns failures into region memory and dead-zone suppression for relocation; this loop produces the strongest external quality-alignment composite and success rate on held-out VPhotoBench missions under a six-round rendering budget.
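
The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `Camera`, `RegionMemory`, `director`, `reviewer`, `reflector`, and `toy_score` are hypothetical names, the camera is reduced to a 2D position, and `toy_score` stands in for the render-plus-critique step that PhotoFlow performs with Blender and a vision-language model.

```python
import math
import random
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Camera:
    # Hypothetical, simplified camera state: a 2D position only.
    x: float
    y: float


@dataclass
class RegionMemory:
    # Each dead zone is (center_x, center_y, radius); values are illustrative.
    dead_zones: List[Tuple[float, float, float]] = field(default_factory=list)

    def blocked(self, cam: Camera) -> bool:
        return any(math.hypot(cam.x - cx, cam.y - cy) < r
                   for cx, cy, r in self.dead_zones)


def director(rng: random.Random, incumbent: Optional[Camera],
             memory: RegionMemory, n: int = 4, extent: float = 10.0) -> List[Camera]:
    """Propose up to n candidates: local seeds near the incumbent plus one far sample."""
    proposals: List[Camera] = []
    attempts = 0
    while len(proposals) < n and attempts < 1000:
        attempts += 1
        # The last slot (or every slot before an incumbent exists) explores globally.
        explore = incumbent is None or len(proposals) == n - 1
        if explore:
            cam = Camera(rng.uniform(-extent, extent), rng.uniform(-extent, extent))
        else:
            cam = Camera(incumbent.x + rng.gauss(0, 1.0), incumbent.y + rng.gauss(0, 1.0))
        if not memory.blocked(cam):  # dead-zone suppression
            proposals.append(cam)
    return proposals


def reviewer(score_fn: Callable[[Camera], float], candidates: List[Camera],
             incumbent: Optional[Camera], incumbent_score: float):
    """Score candidates and keep the better of the best candidate and the incumbent."""
    scored = [(score_fn(c), c) for c in candidates]
    if not scored:
        return incumbent, incumbent_score, scored
    best_score, best_cam = max(scored, key=lambda t: t[0])
    if incumbent is None or best_score > incumbent_score:
        return best_cam, best_score, scored
    return incumbent, incumbent_score, scored


def reflector(memory: RegionMemory, scored, fail_below: float, radius: float = 1.5) -> None:
    """Turn clearly failed candidates into dead zones for later rounds."""
    for score, cam in scored:
        if score < fail_below:
            memory.dead_zones.append((cam.x, cam.y, radius))


def closed_loop_search(score_fn, rounds: int = 6, seed: int = 0, fail_below: float = -8.0):
    rng = random.Random(seed)
    memory = RegionMemory()
    incumbent, incumbent_score = None, float("-inf")
    history = []
    for _ in range(rounds):
        candidates = director(rng, incumbent, memory)
        incumbent, incumbent_score, scored = reviewer(
            score_fn, candidates, incumbent, incumbent_score)
        reflector(memory, scored, fail_below)
        history.append(incumbent_score)
    return incumbent, history


def toy_score(cam: Camera) -> float:
    # Stand-in for rendering and critique: closer to an arbitrary target is better.
    return -math.hypot(cam.x - 3.0, cam.y + 2.0)


best, history = closed_loop_search(toy_score)
```

Because the Reviewer in this sketch replaces the incumbent only when a candidate scores strictly higher, the per-round best score in `history` never decreases across the six rounds.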

What carries the argument

The Director-Reviewer-Reflector agent that converts language intent and scene information into executable camera parameters through iterative proposal, critique, and memory-guided relocation.

If this is right

Language-conditioned photography missions become executable agent tasks in arbitrary prepared 3D scenes without preselected poses.
The agent architecture directly handles subject placement, relational composition, and atmosphere or style intents on the introduced benchmark.
Closed-loop reflection with region memory improves outcomes over open-loop or single-pass methods when rendering budget is limited.
The same structure can be applied to any 3D environment where camera parameters must be chosen to satisfy both geometric and aesthetic constraints.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Extending the memory mechanism to track successful regions as well as failures could further reduce wasted renders on repeated missions.
Replacing the underlying vision-language model with a stronger spatial reasoner would likely raise the ceiling on scene complexity the agent can handle.
The same Director-Reviewer-Reflector pattern could transfer to real-world robot camera placement once sim-to-real gaps in rendering and perception are addressed.

Load-bearing premise

The combination of rule checks, visual critique, pairwise selection, and region memory with dead-zone suppression suffices to overcome the joint difficulties of 3D spatial understanding and abstract aesthetic judgment within six rendering rounds on the benchmark scenes.

What would settle it

A new test set of Blender scenes or language intents where PhotoFlow's success rate or quality-alignment score falls below at least one of the one-shot, single-chain, or random baselines under the same six-round limit.

Figures

Figures reproduced from arXiv: 2605.23771 by Haojia Wei, Hongjie Zhang, Jiarui Guo, Xue Yang, Yifei Liu, Yiming Zhang, Yuning Gong, Zhihang Zhong.

**Figure 1.** Virtual photography as spatial-aesthetic decision making. Given a controllable 3D scene and a language instruction, the agent must choose an executable camera state that satisfies spatial constraints, semantic intent, and photographic quality. The benchmark evaluates the final rendered image together with the search process that produced it.

**Figure 2.** PhotoFlow pipeline. The system first scouts the scene and constructs a soft photographic blueprint. The Director proposes candidate cameras from global anchors, region-memory-guided seeds, and a forced high-explore lane. Candidate previews are rendered in parallel, scored by a structured Reviewer, and summarized by a Reflector that updates search bias and forbidden regions for the next round.

**Figure 3.** Task boundary with VLN. VLN is a useful neighboring formulation because both settings make language-conditioned 3D decisions, but VLN evaluates navigation paths while virtual photography evaluates the final executable camera state and rendered view.

**Figure 4.** Search-process diagnostic. Internal cumulative best score during search. The horizontal axis is feedback round for iterative methods and evaluated-candidate index for one-shot candidate pools.

**Figure 5.** High-explore as a switchable safeguard. A representative case where forced high-explore helps the search leave a locally acceptable but weak viewpoint and find a stronger composition. This component is intended as an escape route from local collapse, not as a universally better proposal source.

**Figure 6.** Successful qualitative cases. Each row is organized as prompt, iterative previews, and final render. The three examples cover a city/island composition, a courtyard architecture view, and a stylized bicycle subject, showing how PhotoFlow turns language into a sequence of rendered camera hypotheses and a final executable camera state across different scene scales and visual styles.

**Figure 7.** Failure qualitative cases. Each row is organized as prompt, iterative previews, and final render. The top row is ‘037_attic_hideout_atmosphere_style’: the search collapses into a dark, low-quality atmospheric view and receives a hard-failure tag with constraint satisfaction 0.0 (Mqs = .244). The bottom row is ‘031_medieval_ship_ocean_scene_subject_placement’: the final camera fails the requested subject/fr…

Original abstract

Virtual photography asks an agent to enter a prepared 3D scene with no preselected camera pose or reference image, infer a suitable shot from scene information and a language intent, choose executable camera parameters, and render the final photograph. Recent progress in vision-language models makes this kind of spatial agent increasingly plausible, but the task stresses two capabilities that remain hard to evaluate together: complex 3D spatial understanding and abstract aesthetic judgment. We introduce PhotoFlow, a Director-Reviewer-Reflector agent for closed-loop camera search. The Director builds a soft photographic blueprint and proposes diverse candidate cameras; the Reviewer combines rule checks, visual critique, and pairwise incumbent selection; and the Reflector converts failures into region memory, dead-zone suppression, and high-explore relocation. We also introduce VPhotoBench, a benchmark of 47 open-license Blender scenes and 141 language-conditioned photography missions spanning subject placement, relational composition, and atmosphere/style. On held-out experiments, PhotoFlow achieves the strongest external quality-alignment composite and success rate among one-shot prediction, single-chain reflection, anchor-bank selection, and random search under a six-round rendering budget. To our knowledge, this is the first work to make language-conditioned virtual photography in arbitrary Blender scenes an executable agent task, and our results show that an LLM-centered spatial agent can already produce strong photographs in a setting designed to challenge both 3D reasoning and aesthetic choice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

PhotoFlow is the first to frame language-driven virtual photography in Blender scenes as an agent task and ships a 47-scene benchmark, but missing module ablations leave the contribution of the Reviewer and Reflector unclear.

The main takeaway is that this paper defines a new agent task for virtual photography and releases VPhotoBench, then shows its Director-Reviewer-Reflector loop beating four baselines on success rate and a quality-alignment score under a six-round budget. That framing and the benchmark are genuinely new; no prior work is cited as having done the same with arbitrary scenes and language intent.

The architecture itself is sensible: the Director generates diverse camera proposals, the Reviewer applies rules plus visual critique and pairwise selection, and the Reflector turns failures into memory and dead-zone suppression. Those pieces directly target the dual difficulties of 3D layout and aesthetic choice, and the closed-loop design fits the six-round constraint better than one-shot or pure random search. The benchmark spans subject placement, relational composition, and style, which gives it some breadth for a first release.

The soft spots are exactly where the stress-test note flags them. There are no ablations that remove or simplify the Reviewer or Reflector, so we cannot tell whether the gains come from those mechanisms, from generating more varied proposals, or from the underlying LLM. The baselines do not include a Director-only variant, and 47 scenes is small enough that scene selection could matter. The abstract gives no numbers, variance, or scoring details, though the full paper presumably supplies them. The evaluation still rests on external judges or composites rather than fully objective metrics, which is reasonable for aesthetics but needs a transparent protocol.

This work is aimed at researchers building LLM agents for spatial and creative tasks in 3D environments. Anyone already working on vision-language agents or procedural camera control will find the task definition and the benchmark worth looking at.

It deserves a serious referee because the task is new, the comparison is against sensible alternatives, and the core loop is coherent even if the evidence for each module is incomplete. I would send it to review with a request for module ablations and clearer evaluation details.

Referee Report

2 major / 2 minor

Summary. The paper introduces PhotoFlow, a Director-Reviewer-Reflector agent for closed-loop camera search in language-conditioned virtual photography tasks within arbitrary 3D Blender scenes. It also presents VPhotoBench, a new benchmark of 47 open-license scenes and 141 missions covering subject placement, relational composition, and atmosphere/style. The central empirical claim is that, under a six-round rendering budget on held-out experiments, PhotoFlow attains the highest external quality-alignment composite and success rate relative to one-shot prediction, single-chain reflection, anchor-bank selection, and random search baselines.

Significance. If the results hold after addressing evaluation gaps, the work is significant for formalizing virtual photography as an executable agent task that jointly stresses 3D spatial reasoning and aesthetic judgment, and for releasing the first dedicated benchmark. The agent architecture, with rule checks, visual critique, pairwise selection, region memory, and dead-zone suppression, represents a concrete attempt to close the loop on these challenges. The paper explicitly positions itself as the first to treat the problem in this agentic form, which is a clear contribution to the emerging area of LLM-based spatial agents.

major comments (2)

[held-out experiments] Held-out experiments paragraph: The reported superiority rests on comparisons to the four listed baselines, yet no ablations of PhotoFlow itself (Director-only, without Reviewer, or without Reflector) are presented. This leaves open whether the performance edge derives from the specific Reviewer/Reflector mechanisms or simply from proposal diversity and the fixed six-round budget, directly undermining the claim that the full loop overcomes the 3D+aesthetic difficulties.
[VPhotoBench] VPhotoBench description: The benchmark uses only 47 scenes. No scene-level variance analysis, statistical significance tests on the observed gaps, or sensitivity checks to scene selection are reported, so it is unclear whether the composite-score and success-rate advantages generalize beyond the chosen set or could be artifacts of the particular 47 scenes.

minor comments (2)

[Abstract] Abstract: The terms 'external quality-alignment composite' and 'success rate' are introduced without definition or pointer to the evaluation protocol; these must be defined with explicit scoring procedures in the main text to support reproducibility.
[Method] Method overview: The interfaces and data flow between Director, Reviewer, and Reflector are described at a high level; a pseudocode listing or explicit state diagram would clarify how region memory and dead-zone suppression are implemented and updated across rounds.
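
To make the Abstract comment concrete, the following Python sketch shows the level of detail a scoring definition would need. It is purely illustrative: the field names, the equal weighting, and the 0.6 threshold are assumptions made for this example and are not taken from the paper, whose actual composite and success criteria may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MissionResult:
    quality: float        # hypothetical external image-quality score in [0, 1]
    alignment: float      # hypothetical external intent-alignment score in [0, 1]
    constraints_ok: bool  # whether the hard constraint check passed


def composite(result: MissionResult, w_quality: float = 0.5) -> float:
    """Weighted mean of quality and alignment (weights are an assumption)."""
    return w_quality * result.quality + (1.0 - w_quality) * result.alignment


def success_rate(results: List[MissionResult], threshold: float = 0.6) -> float:
    """Fraction of missions that pass constraints and reach the composite threshold."""
    if not results:
        return 0.0
    passed = sum(1 for r in results if r.constraints_ok and composite(r) >= threshold)
    return passed / len(results)


# Placeholder inputs for illustration only; these are not results from the paper.
examples = [
    MissionResult(quality=0.8, alignment=0.6, constraints_ok=True),
    MissionResult(quality=0.9, alignment=0.9, constraints_ok=False),
    MissionResult(quality=0.2, alignment=0.4, constraints_ok=True),
]
rate = success_rate(examples)
```

In this sketch a hard-constraint failure blocks success without zeroing the composite, so a failed mission can still carry a nonzero quality score. Whatever the paper's real definitions are, stating them at this granularity would resolve the comment.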

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important gaps in our evaluation. We address each point below and commit to revisions that strengthen the empirical claims without altering the core contributions.

Referee: The reported superiority rests on comparisons to the four listed baselines, yet no ablations of PhotoFlow itself (Director-only, without Reviewer, or without Reflector) are presented. This leaves open whether the performance edge derives from the specific Reviewer/Reflector mechanisms or simply from proposal diversity and the fixed six-round budget, directly undermining the claim that the full loop overcomes the 3D+aesthetic difficulties.

Authors: We agree that component ablations are necessary to isolate the contribution of the Reviewer and Reflector. The current baselines (one-shot, single-chain reflection, anchor-bank, random) do not fully substitute for internal variants of PhotoFlow. In the revision we will add Director-only and Director+Reviewer ablations under the same six-round budget, reporting the same composite and success metrics. This will directly test whether the full loop provides gains beyond proposal diversity.
Revision: yes
Referee: The benchmark uses only 47 scenes. No scene-level variance analysis, statistical significance tests on the observed gaps, or sensitivity checks to scene selection are reported, so it is unclear whether the composite-score and success-rate advantages generalize beyond the chosen set or could be artifacts of the particular 47 scenes.

Authors: The limited scene count is a valid concern for generalization claims. While we cannot expand the benchmark to hundreds of scenes in the current revision without substantial new data collection, we will add: (1) per-scene score distributions and variance, (2) statistical significance tests (paired Wilcoxon signed-rank) on the gaps versus baselines, and (3) sensitivity checks by reporting results on random subsets of 30 scenes. These additions will quantify robustness within the existing 47-scene set.
Revision: yes
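
The checks promised in the second response can be sketched as follows. This is a hedged illustration using only the Python standard library: the per-scene scores are synthetic placeholders generated for the example (not results from the paper), the function names are hypothetical, and only the signed-rank sums are computed. In practice a library routine such as `scipy.stats.wilcoxon` would supply the p-value.

```python
import random
from statistics import mean
from typing import List, Sequence, Tuple


def wilcoxon_signed_rank(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (W_plus, W_minus) for paired samples.

    Zero differences are dropped and tied absolute differences share average ranks.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus


def subset_gap_range(a: Sequence[float], b: Sequence[float],
                     k: int = 30, trials: int = 200, seed: int = 0) -> Tuple[float, float]:
    """Mean paired gap on random k-scene subsets; returns (smallest, largest) gap seen."""
    rng = random.Random(seed)
    gaps: List[float] = []
    for _ in range(trials):
        subset = rng.sample(range(len(a)), k)
        gaps.append(mean(a[i] - b[i] for i in subset))
    return min(gaps), max(gaps)


# Synthetic per-scene scores for 47 scenes, for illustration only.
data_rng = random.Random(1)
baseline = [data_rng.uniform(0.3, 0.7) for _ in range(47)]
agent = [s + data_rng.gauss(0.05, 0.05) for s in baseline]

w_plus, w_minus = wilcoxon_signed_rank(agent, baseline)
lo, hi = subset_gap_range(agent, baseline)
```

If the smallest subset gap `lo` stays above zero across resamples, the advantage is unlikely to hinge on a handful of scenes; a gap range that straddles zero would support the referee's concern about scene selection.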

Circularity Check

0 steps flagged

No circularity in derivation chain

The paper introduces an agentic system (Director-Reviewer-Reflector) and evaluates it empirically on a new benchmark (VPhotoBench) against listed baselines under a fixed rendering budget. No equations, fitted parameters, or mathematical derivations are present. Central claims rest on comparative success rates and quality metrics on held-out scenes, with no reduction to self-defined quantities or load-bearing self-citations. The work is self-contained as an experimental demonstration of an LLM-based spatial agent.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are identifiable from the abstract alone.


[1] [1]

Advanced composition in virtual camera control

Rafid Abdullah, Marc Christie, Guy Schofield, Christophe Lino, and Patrick Olivier. Advanced composition in virtual camera control. InSmart Graphics, volume 6815 ofLecture Notes in Computer Science, pages 13–24, 2011. doi: 10.1007/978-3-642-22571-0_2

work page doi:10.1007/978-3-642-22571-0_2 2011

[2] [2]

Multi-robot task planning under individual and collaborative temporal logic specifications

Hadi AlZayer, Hubert Lin, and Kavita Bala. AutoPhoto: Aesthetic photo capture using reinforcement learning. InProceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 944–951, 2021. doi: 10.1109/IROS51168.2021.9636788

work page doi:10.1109/iros51168.2021.9636788 2021

[3] [3]

On Evaluation of Embodied Navigation Agents

Peter Anderson, Angel X. Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, and Amir R. Zamir. On evaluation of embodied navigation agents, 2018. arXiv:1807.06757

work page internal anchor Pith review Pith/arXiv arXiv 2018

[4] [4]

Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments

Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018

work page 2018

[5] [5]

Blend swap.https://www.blendswap.com/, 2025

Blend Swap. Blend swap.https://www.blendswap.com/, 2025. Accessed 2026-05-04

work page 2025

[6] [6]

Blender demo files

Blender Foundation. Blender demo files. https://www.blender.org/download/ demo-files/, 2025. Accessed 2026-05-04

work page 2025

[7] [7]

Autonomous aerial cinematography in unstructured environments with learned artistic decision-making.Journal of Field Robotics, 37(4):606–641, 2020

Rogerio Bonatti, Wenshan Wang, Cherie Ho, Aayush Ahuja, Mirko Gschwindt, Efe Camci, Erdal Kayacan, Sanjiban Choudhury, and Sebastian Scherer. Autonomous aerial cinematography in unstructured environments with learned artistic decision-making.Journal of Field Robotics, 37(4):606–641, 2020. doi: 10.1002/rob.21931

work page doi:10.1002/rob.21931 2020

[8] [8]

Grimm, and William D

Zachary Byers, Michael Dixon, Kevin Goodier, Cindy M. Grimm, and William D. Smart. An autonomous robot photographer. InProceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2636–2641, 2003. doi: 10.1109/IROS.2003.1249268

work page doi:10.1109/iros.2003.1249268 2003

[9] [9]

UniPercept: Towards unified perceptual-level image understanding across aesthetics, quality, structure, and texture, 2025

Shuo Cao, Jiayang Li, Xiaohui Li, Yuandong Pu, Kaiwen Zhu, Yuanting Gao, Siqi Luo, Yi Xin, Qi Qin, Yu Zhou, Xiangyu Chen, Wenlong Zhang, Bin Fu, Yu Qiao, and Yihao Liu. UniPercept: Towards unified perceptual-level image understanding across aesthetics, quality, structure, and texture, 2025. arXiv:2512.21675

work page arXiv 2025

[10] Angel X. Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Nießner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3D: Learning from RGB-D data in indoor environments. In International Conference on 3D Vision, 2017

[11] Marc Christie, Rumesh Machap, Jean-Marie Normand, Patrick Olivier, and Jonathan Pickering. Virtual camera planning: A survey. In Smart Graphics, volume 3638 of Lecture Notes in Computer Science, pages 40–52. Springer, 2005. doi: 10.1007/11536482_4

[12] Ritendra Datta, Dhiraj Joshi, Jia Li, and James Z. Wang. Studying aesthetics in photographic images using a computational approach. In European Conference on Computer Vision, 2006

[13] Hui Fang and Meng Zhang. Creatism: A deep-learning photographer capable of creating professional work. arXiv preprint arXiv:1707.03491, 2017.

[14] Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A. Smith, Wei-Chiu Ma, and Ranjay Krishna. BLINK: Multimodal large language models can see but not perceive. In European Conference on Computer Vision, 2024

[15] Li-Wei He, Michael F. Cohen, and David H. Salesin. The virtual cinematographer: A paradigm for automatic real-time camera control and directing. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, pages 217–224, 1996. doi: 10.1145/237170.237259

[16] Vihan Jain, Gabriel Magalhaes, Alexander Ku, Ashish Vaswani, Eugene Ie, and Jason Baldridge. Stay on the path: Instruction fidelity in vision-and-language navigation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2019

[17] Amita Kamath, Jack Hessel, and Kai-Wei Chang. What’s “up” with vision-language models? investigating their struggle with spatial reasoning, 2023. arXiv:2310.19785

[18] Hao Kang, Jianming Zhang, Haoxiang Li, Zhe Lin, TJ Rhodes, and Bedrich Benes. LeRoP: A learning-based modular robot photography framework, 2019. arXiv:1911.12470

[19] Yifan Lin, Sophie Ziyu Liu, Ran Qi, George Z. Xue, Xinping Song, Chao Qin, and Hugh H.-T. Liu. Agentic aerial cinematography: From dialogue cues to cinematic trajectories, 2025. arXiv:2509.16176

[20] Fangyu Liu, Guy Emerson, and Nigel Collier. Visual spatial reasoning. Transactions of the Association for Computational Linguistics, 11:635–651, 2023

[21] Xinhang Liu, Yu-Wing Tai, and Chi-Keung Tang. ChatCam: Empowering camera control through conversational ai. In Advances in Neural Information Processing Systems, volume 37, pages 54483–54506, 2024. doi: 10.52202/079017-1726

[22] Naila Murray, Luca Marchesotti, and Florent Perronnin. AVA: A large-scale database for aesthetic visual analysis. In IEEE Conference on Computer Vision and Pattern Recognition, 2012

[23] Tobias Nägeli, Lukas Meier, Alexander Domahidi, Javier Alonso-Mora, and Otmar Hilliges. Real-time planning for automated multi-view drone cinematography. ACM Transactions on Graphics, 36(4):1–10, 2017. doi: 10.1145/3072959.3073712

[24] Shufeng Nan, Mengtian Li, Sixiao Zheng, Yuwei Lu, Han Zhang, and Yanwei Fu. Mind-of-director: Multi-modal agent-driven film previsualization via collaborative decision-making,

[25] Pablo Pueyo, Juan Dendarieta, Eduardo Montijano, Ana C. Murillo, and Mac Schwager. CineMPC: A fully autonomous drone cinematography system incorporating zoom, focus, pose, and scene composition. IEEE Transactions on Robotics, 40:1740–1757, 2024. doi: 10.1109/TRO.2024.3353550. arXiv:2401.05272

[26] Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. Habitat: A platform for embodied ai research. In IEEE/CVF International Conference on Computer Vision, 2019

[27] Ilias Stogiannidis, Steven McDonagh, and Sotirios A. Tsaftaris. Mind the gap: Benchmarking spatial reasoning in vision-language models, 2025. arXiv:2503.19707

[28] Hossein Talebi and Peyman Milanfar. NIMA: Neural image assessment. IEEE Transactions on Image Processing, 2018

[29] Sheyang Tang, Armin Shafiee Sarvestani, Jialu Xu, Xiaoyu Xu, and Zhou Wang. Aesthetic camera viewpoint suggestion with 3d aesthetic field, 2026. arXiv:2602.20363

[30] Jiayu Wang, Yifei Ming, Zhenmei Shi, Vibhav Vineet, Xin Wang, Yixuan Li, and Neel Joshi. Is a picture worth a thousand words? delving into spatial reasoning for vision language models, arXiv:2406.14852; NeurIPS 2024.

[32] Fei Xia, Amir R. Zamir, Zhiyang He, Alexander Sax, Jitendra Malik, and Silvio Savarese. Gibson Env: Real-world perception for embodied agents. In IEEE Conference on Computer Vision and Pattern Recognition, 2018

[33] Zhenran Xu, Longyue Wang, Jifang Wang, Zhouyi Li, Senbao Shi, Xue Yang, Yiyu Wang, Baotian Hu, Jun Yu, and Min Zhang. FilmAgent: A multi-agent framework for end-to-end film automation in virtual 3d spaces, 2025. arXiv:2501.12909

[34] Yuzhe Yang, Liwu Xu, Leida Li, Nan Qie, Yaqian Li, Peng Zhang, and Yandong Guo. Personalized image aesthetics assessment with rich attributes. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19861–19869, 2022

[35] Gengze Zhou, Yicong Hong, and Qi Wu. NavGPT: Explicit reasoning in vision-and-language navigation with large language models, 2023. arXiv:2305.16986.

Figure 6: Successful qualitative cases. Each row is organized as prompt, iterative previews, and final render. The three examples cover a city/island composition, a courtyard architecture view, and a styliz...