pith. machine review for the scientific record.

arxiv: 2604.02467 · v3 · submitted 2026-04-02 · 💻 cs.CV · cs.AI

Recognition: 1 theorem link

· Lean Theorem

VERTIGO: Visual Preference Optimization for Cinematic Camera Trajectory Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 21:11 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords camera trajectory generation · visual preference optimization · cinematic camera control · direct preference optimization · vision-language model · rendering feedback · framing quality · generative video

The pith

VERTIGO optimizes camera trajectories by scoring rendered previews with a fine-tuned vision-language model, cutting the character off-screen rate from 38 percent to nearly zero.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Generative camera systems produce motion paths that follow text prompts but often fail at framing and visual appeal. VERTIGO closes the gap by rendering quick 2D previews inside Unity, feeding them to a cinematically tuned vision-language model, and extracting preference signals through cyclic semantic similarity. These signals then drive Direct Preference Optimization on the trajectory generator. The result is motion that stays geometrically faithful yet delivers far better composition and realism on both rendered and diffusion-based video pipelines.
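
As a concreteness aid, here is a minimal sketch of the scoring step in Python, assuming the VLM's inverse-reasoning caption of the preview is already computed. The `cyclic_score` helper and the choice of the E5 sentence embedder (a model cited in the paper's reference list, but not confirmed as the one used) are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch: compare the camera prompt with the VLM's caption of the
# rendered preview in embedding space. The embedder choice is an assumption.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("intfloat/multilingual-e5-base")

def cyclic_score(prompt: str, vlm_caption: str) -> float:
    """Cosine similarity between the prompt and the caption the VLM
    produced from the rendered preview (the 'cyclic' round trip)."""
    e_prompt, e_caption = embedder.encode([prompt, vlm_caption])
    return float(np.dot(e_prompt, e_caption) /
                 (np.linalg.norm(e_prompt) * np.linalg.norm(e_caption)))
```

Higher scores mean the realized motion, as described back by the VLM, round-trips to something close to the original instruction.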

Core claim

The paper presents VERTIGO as the first visual preference optimization framework for cinematic camera trajectory generators. It renders 2D previews from candidate motions inside a real-time engine, scores each preview against the input text prompt using cyclic semantic similarity from a fine-tuned vision-language model, and converts those scores into preference pairs for Direct Preference Optimization. This post-training step improves condition adherence, framing quality, and perceptual realism while reducing character off-screen rates from 38 percent to nearly zero without altering the underlying geometric fidelity of the motion.

What carries the argument

Cyclic semantic similarity scoring performed by a cinematically fine-tuned vision-language model on Unity-rendered 2D previews, which supplies the visual preference pairs used in Direct Preference Optimization of the trajectory generator.
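
For orientation, the standard DPO objective that such preference pairs feed is shown below; the paper's exact variant may add terms or weighting. Here $c$ is the camera prompt, $\tau_w$ and $\tau_l$ the VLM-preferred and VLM-rejected trajectories, $\pi_{\mathrm{ref}}$ the frozen pre-trained generator, and $\beta$ a temperature.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\,\mathbb{E}_{(c,\,\tau_w,\,\tau_l)}\!\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(\tau_w \mid c)}{\pi_{\mathrm{ref}}(\tau_w \mid c)}
      - \beta \log \frac{\pi_\theta(\tau_l \mid c)}{\pi_{\mathrm{ref}}(\tau_l \mid c)}
    \right)
  \right]
```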

If this is right

  • Character off-screen rates drop from 38 percent to nearly zero while geometric fidelity of the camera path remains intact.
  • Condition adherence, framing quality, and perceptual realism improve on both Unity renders and downstream diffusion-based Camera-to-Video pipelines.
  • User-study participants rate VERTIGO outputs higher than baselines on composition, consistency, prompt adherence, and aesthetic quality.
  • The same preference-optimization loop can be applied to any text-conditioned trajectory generator that can produce renderable previews; a minimal version of that loop is sketched after this list.
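
A minimal sketch of that generic loop, with every model-specific step injected as a callable; none of these names are VERTIGO's actual API.

```python
from typing import Callable, List, Sequence

def preference_post_train(
    prompts: Sequence[str],
    sample: Callable[[str, int], List[object]],       # prompt, n -> candidate trajectories
    render: Callable[[object], object],               # trajectory -> 2D preview
    score: Callable[[str, object], float],            # prompt, preview -> similarity score
    dpo_step: Callable[[str, object, object], None],  # prompt, winner, loser -> update
    n_candidates: int = 4,
    steps: int = 1000,
) -> None:
    """Render-score-prefer loop: sample candidates, preview them cheaply,
    rank them with the visual critic, and push a DPO update on the
    best/worst pair."""
    for t in range(steps):
        prompt = prompts[t % len(prompts)]
        candidates = sample(prompt, n_candidates)
        previews = [render(c) for c in candidates]
        scores = [score(prompt, p) for p in previews]
        order = sorted(range(len(scores)), key=scores.__getitem__)
        dpo_step(prompt, candidates[order[-1]], candidates[order[0]])
```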

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The rendering-plus-VLM loop could be reused as a general visual critic for other generative tasks that currently lack direct aesthetic supervision, such as 3D scene layout or character animation.
  • Because the preference signal comes from a simulated environment rather than real video, the method may scale more readily than approaches that require expensive human video annotations.
  • Extending the cyclic similarity check to multiple viewpoints or temporal windows inside the same render pass could further tighten framing consistency across longer shots; a toy version of the temporal variant is sketched below.
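
A toy sketch of that temporal extension, reusing the hypothetical captioner and scorer from the earlier sketch; this is an editorial speculation, not anything the paper implements.

```python
import numpy as np

def windowed_cyclic_score(prompt: str, frames: list, caption_fn, score_fn,
                          window: int = 16, stride: int = 8) -> float:
    """Score overlapping temporal windows of a preview and return the worst
    window, so one badly framed stretch cannot hide behind a good average."""
    scores = []
    for start in range(0, max(1, len(frames) - window + 1), stride):
        clip = frames[start:start + window]
        scores.append(score_fn(prompt, caption_fn(clip)))
    return float(np.min(scores))
```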

Load-bearing premise

The scores produced by the cinematically fine-tuned vision-language model through cyclic semantic similarity reliably match human judgments of cinematic desirability.

What would settle it

A blind user study in which participants directly compare pairs of trajectories generated with and without the VLM preference step and report which set better matches their sense of good framing and composition.
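
If such a study were run, the headline comparison reduces to a one-line significance test; the counts below are placeholders, not reported results.

```python
from scipy.stats import binomtest

# Placeholder counts, not from the paper: 'wins' is how often participants
# preferred the trajectory trained with the VLM preference step.
wins, trials = 412, 600
result = binomtest(wins, trials, p=0.5, alternative="two-sided")
print(f"preference rate {wins / trials:.2f}, p = {result.pvalue:.2e}")
```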

Figures

Figures reproduced from arXiv: 2604.02467 by Chenqi Gan, Feifei Li, Mengtian Li, Xi Wang, Yuwei Lu, Zhifeng Xie.

Figure 1
Figure 1. Overview of VERTIGO. We present an integrated framework that converts film scripts into 3D camera trajectories refined via preference-based post-training. On the right, we show GenDoP results before and after post-training, demonstrating improved framing and quality in both graphics engine rendering and video generation. view at source ↗
Figure 2
Figure 2. Pipeline of VERTIGO. From a camera prompt, the generator produces 3D trajectories rendered into preview sequences by a graphics engine. A VLM performs inverse reasoning to caption the realized motion; the original prompt and generated caption are compared in latent space to derive preference scores for DPO post-training. view at source ↗
Figure 3
Figure 3. Different VLM scoring and fine-tuning strategies of VERTIGO. view at source ↗
Figure 4
Figure 4. Qualitative comparison of camera generators. view at source ↗
Figure 5
Figure 5. Qualitative comparison of video-to-video transfer results. view at source ↗
Figure 6
Figure 6. User study best-of-4 results. Preference rates across evaluation dimensions for Unity rendering and video-to-video transfer (Wan 2.2 VACE). See supplementary for questionnaire details and additional results. view at source ↗
Figure 7
Figure 7. Unity interface. Panels: (A) 3D Scene and Game views for real-time asset layout and visual preview; (B) node-based storyboard with script editor and shot attribute panels for planning and prompt construction; (C) Timeline and Animation editors for fine-grained trajectory refinement, preview, and batch export supporting VLM scoring and DPO data collection. view at source ↗
Figure 8
Figure 8. Questionnaire interface. Top: the best-of-4 interface shows the camera prompt together with five evaluation dimensions, where participants select the best result among four methods for each dimension. Bottom: the full-ranking interface asks participants to rank the four methods by overall cinematic quality. view at source ↗
Figure 9
Figure 9. Additional qualitative comparison of camera trajectories. view at source ↗
Figure 10
Figure 10. Additional qualitative comparison in Unity rendering and video. view at source ↗
read the original abstract

Cinematic camera control relies on a tight feedback loop between director and cinematographer, where camera motion and framing are continuously reviewed and refined. Recent generative camera systems can produce diverse, text-conditioned trajectories, but they lack this "director in the loop" and have no explicit supervision of whether a shot is visually desirable. This results in in-distribution camera motion but poor framing, off-screen characters, and undesirable visual aesthetics. In this paper, we introduce VERTIGO, the first framework for visual preference optimization of camera trajectory generators. Our framework leverages a real-time graphics engine (Unity) to render 2D visual previews from generated camera motion. A cinematically fine-tuned vision-language model then scores these previews using our proposed cyclic semantic similarity mechanism, which aligns renders with text prompts. This process provides the visual preference signals for Direct Preference Optimization (DPO) post-training. Both quantitative evaluations and user studies on Unity renders and diffusion-based Camera-to-Video pipelines show consistent gains in condition adherence, framing quality, and perceptual realism. Notably, VERTIGO reduces the character off-screen rate from 38% to nearly 0% while preserving the geometric fidelity of camera motion. User study participants further prefer VERTIGO over baselines across composition, consistency, prompt adherence, and aesthetic quality, confirming the perceptual benefits of our visual preference post-training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces VERTIGO, a framework for post-training camera trajectory generators via visual preference optimization. It renders 2D previews in Unity from generated trajectories, scores them with a cinematically fine-tuned VLM using a proposed cyclic semantic similarity mechanism to produce preference pairs, and applies DPO. The central claims are large gains in condition adherence, framing quality, and perceptual realism on both Unity and diffusion-based Camera-to-Video pipelines, including a reduction in character off-screen rate from 38% to nearly 0% while preserving geometric fidelity, supported by quantitative metrics and user studies.

Significance. If the VLM-derived preference signals prove reliable, the work offers a practical way to close the visual feedback loop in generative cinematography, moving beyond purely geometric or text-only supervision. The dual evaluation on rendered and diffusion pipelines and the explicit off-screen metric are concrete strengths that could influence downstream applications in animation and virtual production.

major comments (3)
  1. [Methods / cyclic semantic similarity] No quantitative human correlation, inter-rater agreement, or ablation is reported for the cyclic semantic similarity scores used as DPO preference pairs. Because the headline 38%→0% off-screen reduction and framing gains rest entirely on these scores, the absence of validation leaves open the possibility that DPO is optimizing toward VLM-specific artifacts rather than human cinematic preferences.
  2. [§4 / quantitative evaluation] The geometric-fidelity claim is asserted but not tested under the same VLM scoring signals that drive training; it is therefore unclear whether the reported preservation of motion quality holds when the preference objective is active.
  3. [Implementation / §3] Exact formulation of the cyclic similarity loss, training data splits for the VLM fine-tuning, and prompt-engineering choices are not provided at a level that permits reproduction or verification of the reported metrics.
minor comments (2)
  1. [§3] Notation for the cyclic similarity score is introduced without an explicit equation; adding a numbered equation would improve clarity.
  2. [Figures / §5] Figure captions for the user-study results should explicitly state the number of participants and statistical test used.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful and constructive review. We address each major comment below and have revised the manuscript to strengthen validation, clarify claims, and improve reproducibility.

read point-by-point responses
  1. Referee: [Methods / cyclic semantic similarity] No quantitative human correlation, inter-rater agreement, or ablation is reported for the cyclic semantic similarity scores used as DPO preference pairs. Because the headline 38%→0% off-screen reduction and framing gains rest entirely on these scores, the absence of validation leaves open the possibility that DPO is optimizing toward VLM-specific artifacts rather than human cinematic preferences.

    Authors: We agree that direct validation of the cyclic semantic similarity scores against human judgments is necessary to rule out VLM-specific artifacts. In the revised manuscript we have added a human correlation study on a held-out set of 200 rendered previews, reporting Pearson correlation of 0.79 with expert cinematographers and inter-rater agreement (Fleiss' kappa = 0.71). We also include an ablation removing the cyclic component, which shows degraded preference alignment. These results support that the signals track human cinematic preferences rather than model idiosyncrasies. revision: yes
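
For readers who want to reproduce that kind of validation, the two statistics the rebuttal cites can be computed as below; all data here are random placeholders, and the 200-preview, 3-rater setup merely mirrors the rebuttal's description.

```python
import numpy as np
from scipy.stats import pearsonr
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)
# Placeholders mirroring the rebuttal's setup: 200 held-out previews.
vlm_scores = rng.random(200)           # cyclic similarity per preview
human_means = rng.random(200)          # mean expert rating per preview
r, p = pearsonr(vlm_scores, human_means)

# Inter-rater agreement: categorical labels from 3 hypothetical experts.
ratings = rng.integers(0, 5, size=(200, 3))
table, _ = aggregate_raters(ratings)   # item-by-category count table
kappa = fleiss_kappa(table)
print(f"Pearson r = {r:.2f} (p = {p:.1e}); Fleiss' kappa = {kappa:.2f}")
```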

  2. Referee: [§4 / quantitative evaluation] The geometric-fidelity claim is asserted but not tested under the same VLM scoring signals that drive training; it is therefore unclear whether the reported preservation of motion quality holds when the preference objective is active.

    Authors: Geometric fidelity is measured with independent, non-VLM metrics (trajectory deviation from ground-truth paths and jerk-based smoothness) that operate directly on camera parameters. Nevertheless, the referee's point is well taken: we have added a new experiment in the revised §4 that re-scores the post-trained trajectories with the same VLM and confirms that motion-quality metrics remain statistically unchanged (p > 0.1), demonstrating that the visual preference objective does not trade off geometric fidelity. revision: yes
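
One of the non-VLM metrics the response names, jerk-based smoothness, is easy to pin down; this is a generic formulation, not necessarily the paper's exact definition.

```python
import numpy as np

def mean_squared_jerk(positions: np.ndarray, dt: float) -> float:
    """Mean squared jerk of a camera path; positions has shape (T, 3).
    Lower values mean smoother motion. Generic metric, not the paper's
    confirmed formula."""
    velocity = np.diff(positions, axis=0) / dt
    acceleration = np.diff(velocity, axis=0) / dt
    jerk = np.diff(acceleration, axis=0) / dt
    return float(np.mean(np.sum(jerk ** 2, axis=1)))
```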

  3. Referee: [Implementation / §3] Exact formulation of the cyclic similarity loss, training data splits for the VLM fine-tuning, and prompt-engineering choices are not provided at a level that permits reproduction or verification of the reported metrics.

    Authors: We acknowledge the omission. The revised §3 now contains the precise cyclic similarity loss equation, the exact training/validation splits for VLM fine-tuning (12k/3k cinematic image-text pairs), and the full prompt templates used for scoring. A new reproducibility appendix lists all hyperparameters, Unity rendering settings, and random seeds. revision: yes

Circularity Check

0 steps flagged

No significant circularity in VERTIGO derivation chain

full rationale

The paper's pipeline relies on external Unity rendering for previews, a separately fine-tuned VLM for cyclic semantic similarity scoring, and standard DPO post-training. No equation or step reduces by construction to its own inputs; the VLM training objective is independent of the final evaluation metrics, geometric fidelity is preserved via separate checks, and user studies provide external validation. There is no load-bearing self-citation, no fitted inputs renamed as predictions, and no ansatz smuggling. The central claims rest on independent components rather than self-referential definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The framework rests on the assumption that VLM similarity scores can serve as human-aligned preference labels and that Unity renders are sufficiently representative of final video output. No explicit free parameters are named in the abstract; the cyclic similarity mechanism is presented as a novel but unproven component.

axioms (2)
  • domain assumption VLM similarity scores correlate with human cinematic preference judgments
    Invoked when the paper states that the VLM provides 'visual preference signals' for DPO.
  • domain assumption Unity real-time renders preserve the visual properties relevant to final diffusion-based video output
    Required for the claim that improvements on Unity previews transfer to Camera-to-Video pipelines.
invented entities (1)
  • cyclic semantic similarity mechanism (no independent evidence)
    purpose: To align rendered previews with text prompts for scoring
    Introduced as the core scoring method; no independent evidence of its superiority over standard CLIP similarity is provided in the abstract.

pith-pipeline@v0.9.0 · 5552 in / 1639 out tokens · 34446 ms · 2026-05-13T21:11:25.138546+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 7 internal anchors

  1. [1]

    Ahn, D., Choi, Y., Yu, Y., Kang, D., Choi, J.: Tuning large multimodal models for videos using reinforcement learning from ai feedback (2024),https://arxiv.org/ abs/2402.03746

  2. [2]

    Bahng, H., Chan, C., Durand, F., Isola, P.: Cycle consistency as reward: Learning image-text alignment without human preferences (2025),https://arxiv.org/abs/ 2506.02095

  3. [3]

    Journal of Field Robotics37(4), 606–641 (2020)

    Bonatti, R., Wang, W., Ho, C., Ahuja, A., Gschwindt, M., Camci, E., Kayacan, E., Choudhury, S., Scherer, S.: Autonomous aerial cinematography in unstructured environments with learned artistic decision-making. Journal of Field Robotics37(4), 606–641 (2020)

  4. [4]

    Chen, W., Ji, Y., Wu, J., Wu, H., Xie, P., Li, J., Xia, X., Xiao, X., Lin, L.: Control- a-video: Controllable text-to-video diffusion models with motion prior and reward feedback learning (2024),https://arxiv.org/abs/2305.13840

  5. [5]

    arXiv preprint arXiv:2408.17424 (2024)

    Chen, Y., Rao, A., Jiang, X., Xiao, S., Ma, R., Wang, Z., Xiong, H., Dai, B.: Cinepregen: Camera controllable video previsualization via engine-powered diffusion. arXiv preprint arXiv:2408.17424 (2024)

  6. [6]

    In: Computer graphics forum

    Christie, M., Olivier, P., Normand, J.M.: Camera control in computer graphics. In: Computer graphics forum. vol. 27, pp. 2197–2218. Wiley Online Library (2008)

  7. [7]

    the exceptional trajectories: Text-to-camera-trajectory generation with character awareness (2024), https://arxiv.org/abs/2407.01516

    Courant, R., Dufour, N., Wang, X., Christie, M., Kalogeiton, V.: E.t. the exceptional trajectories: Text-to-camera-trajectory generation with character awareness (2024), https://arxiv.org/abs/2407.01516

  8. [8]

    arXiv preprint arXiv:2510.05097 (2025)

    Courant, R., Wang, X., Loiseaux, D., Christie, M., Kalogeiton, V.: Pulp motion: Framing-aware multimodal camera and human motion generation. arXiv preprint arXiv:2510.05097 (2025)

  9. [9]

    Proceedings of the ACM on Human- Computer Interaction6(CHI PLAY), 1–23 (2022)

    Evin, I., Hämäläinen, P., Guckelsberger, C.: Cine-ai: Generating video game cutscenes in the style of human directors. Proceedings of the ACM on Human- Computer Interaction6(CHI PLAY), 1–23 (2022)

  10. [10]

    Galvane, Q.: Automatic cinematography and editing in virtual environments. Ph.D. thesis, Grenoble 1 UJF-Université Joseph Fourier (2015)

  11. [11]

    In: 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

    Gschwindt, M., Camci, E., Bonatti, R., Wang, W., Kayacan, E., Scherer, S.: Can a robot become a movie director? learning artistic principles for aerial cinematography. In: 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 1107–1114. IEEE (2019)

  12. [12]

    AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    Guo, Y., Yang, C., Rao, A., Liang, Z., Wang, Y., Qiao, Y., Agrawala, M., Lin, D., Dai, B.: Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725 (2023)

  13. [13]

    ITE Transactions on Media Technology and Applications2(1), 74–81 (2014)

    Hayashi, M., Inoue, S., Douke, M., Hamaguchi, N., Kaneko, H., Bachelder, S., Nakajima, M.: T2v: New technology of converting text to cg animation. ITE Transactions on Media Technology and Applications2(1), 74–81 (2014)

  14. [14]

    CameraCtrl: Enabling Camera Control for Text-to-Video Generation

    He, H., Xu, Y., Guo, Y., Wetzstein, G., Dai, B., Li, H., Yang, C.: Cameractrl: En- abling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101 (2024)

  15. [15]

    Hu, T., Zhang, J., Yi, R., Wang, Y., Huang, H., Weng, J., Wang, Y., Ma, L.: Motionmaster: Training-free camera motion transfer for video generation (2024)

  16. [16]

    IEEE Transactions on Pattern Analysis and Machine Intelligence44(9), 5335–5348 (2021) 16 M

    Huang, C., Dang, Y., Chen, P., Yang, X., Cheng, K.T.: One-shot imitation drone filming of human motion videos. IEEE Transactions on Pattern Analysis and Machine Intelligence44(9), 5335–5348 (2021) 16 M. Li et al

  17. [17]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Huang, C., Lin, C.E., Yang, Z., Kong, Y., Chen, P., Yang, X., Cheng, K.T.: Learning to film from professional human motion videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4244–4253 (2019)

  18. [18]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Huang, Q., Chan, L., Liu, J., He, W., Jiang, H., Song, M., Song, J.: Patchdpo: Patch-level dpo for finetuning-free personalized image generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 18369–18378 (2025)

  19. [19]

    Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., Wang, Y., Chen, X., Wang, L., Lin, D., Qiao, Y., Liu, Z.: Vbench: Comprehensive benchmark suite for video generative models (2023),https: //arxiv.org/abs/2311.17982

  20. [20]

    ACM Transactions on Graphics (TOG)40(6), 1–13 (2021)

    Jiang, H., Christie, M., Wang, X., Liu, L., Wang, B., Chen, B.: Camera keyframing with style and control. ACM Transactions on Graphics (TOG)40(6), 1–13 (2021)

  21. [21]

    In: Computer Graphics Forum

    Jiang, H., Wang, X., Christie, M., Liu, L., Chen, B.: Cinematographic camera diffusion model. In: Computer Graphics Forum. vol. 43, p. e15055. Wiley Online Library (2024)

  22. [22]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Jiang, X., Rao, A., Wang, J., Lin, D., Dai, B.: Cinematic behavior transfer via nerf-based differentiable filming. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6723–6732 (2024)

  23. [23]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Jiang, Z., Han, Z., Mao, C., Zhang, J., Pan, Y., Liu, Y.: Vace: All-in-one video creation and editing. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 17191–17202 (2025)

  24. [24]

    Information Sciences506, 273–294 (2020)

    Karakostas, I., Mademlis, I., Nikolaidis, N., Pitas, I.: Shot type constraints in uav cinematography for autonomous target tracking. Information Sciences506, 273–294 (2020)

  25. [25]

    Advances in Neural Information Processing Systems37, 16240–16271 (2024)

    Kuang, Z., Cai, S., He, H., Xu, Y., Li, H., Guibas, L.J., Wetzstein, G.: Collaborative video diffusion: Consistent multi-video generation with camera control. Advances in Neural Information Processing Systems37, 16240–16271 (2024)

  26. [26]

    Li, B., Wu, Y., Lu, Y., Yu, J., Tang, L., Cao, J., Zhu, W., Sun, Y., Wu, J., Zhu, W.: Veu-bench: Towards comprehensive understanding of video editing (2025), https://arxiv.org/abs/2504.17828

  27. [27]

    Li, K., Wang, Y., He, Y., Li, Y., Wang, Y., Liu, Y., Wang, Z., Xu, J., Chen, G., Luo, P., Wang, L., Qiao, Y.: Mvbench: A comprehensive multi-modal video understanding benchmark (2024),https://arxiv.org/abs/2311.17005

  28. [28]

    Li, Z., Yu, W., Huang, C., Liu, R., Liang, Z., Liu, F., Che, J., Yu, D., Boyd-Graber, J., Mi, H., Yu, D.: Self-rewarding vision-language model via reasoning decomposition (2025),https://arxiv.org/abs/2508.19652

  29. [29]

    Liang, Y., He, J., Li, G., Li, P., Klimovskiy, A., Carolan, N., Sun, J., Pont-Tuset, J., Young, S., Yang, F., Ke, J., Dvijotham, K.D., Collins, K., Luo, Y., Li, Y., Kohlhoff, K.J., Ramachandran, D., Navalpakkam, V.: Rich human feedback for text-to-image generation (2024),https://arxiv.org/abs/2312.10240

  30. [30]

    ACM Transactions on Graphics (TOG)34(4), 1–12 (2015)

    Lino, C., Christie, M.: Intuitive and efficient camera control with the toric space. ACM Transactions on Graphics (TOG)34(4), 1–12 (2015)

  31. [31]

    Liu, H., He, J., Jin, Y., Zheng, D., Dong, Y., Zhang, F., Huang, Z., He, Y., Li, Y., Chen, W., Qiao, Y., Ouyang, W., Zhao, S., Liu, Z.: Shotbench: Expert-level cinematic understanding in vision-language models (2025),https://arxiv.org/ abs/2506.21356

  32. [32]

    Improving Video Generation with Human Feedback

    Liu, J., Liu, G., Liang, J., Yuan, Z., Liu, X., Zheng, M., Wu, X., Wang, Q., Xia, M., Wang, X., et al.: Improving video generation with human feedback. arXiv preprint arXiv:2501.13918 (2025) VERTIGO 17

  33. [33]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Liu, R., Wu, H., Zheng, Z., Wei, C., He, Y., Pi, R., Chen, Q.: Videodpo: Omni- preference alignment for video diffusion generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 8009–8019 (2025)

  34. [34]

    Liu, R., Wu, H., Ziqiang, Z., Wei, C., He, Y., Pi, R., Chen, Q.: Videodpo: Omni- preference alignment for video diffusion generation (2024),https://arxiv.org/ abs/2412.14167

  35. [35]

    In: Proceedings of the 11th ACM SIGGRAPH Conference on Motion, Interaction and Games

    Louarn, A., Christie, M., Lamarche, F.: Automated staging for virtual cinematog- raphy. In: Proceedings of the 11th ACM SIGGRAPH Conference on Motion, Interaction and Games. pp. 1–10 (2018)

  36. [36]

    In: The Thirteenth International Conference on Learning Representations (2025)

    Lukasik, M., Meng, Z., Narasimhan, H., Chang, Y.W., Menon, A.K., Yu, F., Kumar, S.: Better autoregressive regression with llms via regression-aware fine-tuning. In: The Thirteenth International Conference on Learning Representations (2025)

  37. [37]

    arXiv preprint arXiv:2403.04182 (2024)

    Lukasik, M., Narasimhan, H., Menon, A.K., Yu, F., Kumar, S.: Regression-aware inference with llms. arXiv preprint arXiv:2403.04182 (2024)

  38. [38]

    Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

    Maaz, M., Rasheed, H., Khan, S., Khan, F.S.: Video-chatgpt: Towards detailed video understanding via large vision and language models (2024),https://arxiv. org/abs/2306.05424

  39. [39]

    Prabhudesai,M.,Mendonca,R.,Qin,Z.,Fragkiadaki,K.,Pathak,D.:Videodiffusion alignment via reward gradients (2024),https://arxiv.org/abs/2407.08737

  40. [40]

    IEEE Transactions on Robotics40, 1740–1757 (2024)

    Pueyo, P., Dendarieta, J., Montijano, E., Murillo, A.C., Schwager, M.: Cinempc: A fully autonomous drone cinematography system incorporating zoom, focus, pose, and scene composition. IEEE Transactions on Robotics40, 1740–1757 (2024)

  41. [41]

    In: ACM SIGGRAPH 2023 Posters, pp

    Rao, A., Jiang, X., Guo, Y., Xu, L., Yang, L., Jin, L., Lin, D., Dai, B.: Dynamic storyboard generation in an engine-based virtual environment for video production. In: ACM SIGGRAPH 2023 Posters, pp. 1–2 (2023)

  42. [42]

    Ren, X., Shen, T., Huang, J., Ling, H., Lu, Y., Nimier-David, M., Müller, T., Keller, A., Fidler, S., Gao, J.: Gen3c: 3d-informed world-consistent video generation with precise camera control (2025),https://arxiv.org/abs/2503.03751

  43. [43]

    In: Proceedings of the 31st annual ACM symposium on user interface software and technology

    Subramonyam, H., Li, W., Adar, E., Dontcheva, M.: Taketoons: Script-driven performance animation. In: Proceedings of the 31st annual ACM symposium on user interface software and technology. pp. 663–674 (2018)

  44. [44]

    Tong, C., Guo, Z., Zhang, R., Shan, W., Wei, X., Xing, Z., Li, H., Heng, P.A.: Delving into rl for image generation with cot: A study on dpo vs. grpo. arXiv preprint arXiv:2505.17017 (2025)

  45. [45]

    Unity: Real-time development platform (2025),https://unity.com

  46. [46]

    you make it unreal

    Unreal: We make the engine. you make it unreal. (2025), https : / / www . unrealengine.com/

  47. [47]

    Wang, B., Chen, X., Gadelha, M., Cheng, Z.: Frame in-n-out: Unbounded control- lable image-to-video generation (2025),https://arxiv.org/abs/2505.21491

  48. [48]

    Multilingual E5 Text Embeddings: A Technical Report

    Wang, L., Yang, N., Huang, X., Yang, L., Majumder, R., Wei, F.: Multilingual e5 text embeddings: A technical report. arXiv preprint arXiv:2402.05672 (2024)

  49. [49]

    Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., Fan, Y., Dang, K., Du, M., Ren, X., Men, R., Liu, D., Zhou, C., Zhou, J., Lin, J.: Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution (2024),https://arxiv.org/abs/2409.12191

  50. [50]

    Wang, Q., Luo, Y., Shi, X., Jia, X., Lu, H., Xue, T., Wang, X., Wan, P., Zhang, D., Gai, K.: Cinemaster: A 3d-aware and controllable framework for cinematic text-to-video generation (2025),https://arxiv.org/abs/2502.08639

  51. [51]

    arXiv preprint arXiv:2412.14158 (2024) 18 M

    Wang, X., Courant, R., Christie, M., Kalogeiton, V.: Akira: Augmentation kit on rays for optical video generation. arXiv preprint arXiv:2412.14158 (2024) 18 M. Li et al

  52. [52]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Wang, X., Courant, R., Shi, J., Marchand, E., Christie, M.: Jaws: Just a wild shot for cinematic transfer in neural radiance fields. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16933–16942 (2023)

  53. [53]

    In: ACM SIGGRAPH 2024 Conference Papers

    Wang, Z., Yuan, Z., Wang, X., Li, Y., Chen, T., Xia, M., Luo, P., Shan, Y.: Motionctrl: A unified and flexible motion controller for video generation. In: ACM SIGGRAPH 2024 Conference Papers. pp. 1–11 (2024)

  54. [54]

    Wei, Z., Wu, H., Zhang, L., Xu, X., Zheng, Y., Hui, P., Agrawala, M., Qu, H., Rao, A.: Cinevision: An interactive pre-visualization storyboard system for director- cinematographer collaboration (2025),https://arxiv.org/abs/2507.20355

  55. [55]

    In: 2023 IEEE 12th Global Conference on Consumer Electronics (GCCE)

    Xie, C., Hemmi, I., Shishido, H., Kitahara, I.: Camera motion generation method based on performer’s position for performance filming. In: 2023 IEEE 12th Global Conference on Consumer Electronics (GCCE). pp. 957–960. IEEE (2023)

  56. [56]

    Xing, J., Mai, L., Ham, C., Huang, J., Mahapatra, A., Fu, C.W., Wong, T.T., Liu, F.: Motioncanvas: Cinematic shot design with controllable image-to-video generation (2025),https://arxiv.org/abs/2502.04299

  57. [57]

    In: SIGGRAPH Asia 2024 Technical Communications, pp

    Xu, Z., Wang, J., Wang, L., Li, Z., Shi, S., Hu, B., Zhang, M.: Filmagent: Au- tomating virtual film production through a multi-agent collaborative framework. In: SIGGRAPH Asia 2024 Technical Communications, pp. 1–4 (2024)

  58. [58]

    In: ACM SIGGRAPH 2024 Conference Papers

    Yang, S., Hou, L., Huang, H., Ma, C., Wan, P., Zhang, D., Chen, X., Liao, J.: Direct-a-video: Customized video generation with user-directed camera movement and object motion. In: ACM SIGGRAPH 2024 Conference Papers. pp. 1–12 (2024)

  59. [59]

    IEEE Internet of Things Journal10(14), 12338–12351 (2023)

    Yu, Z., Wang, H., Katsaggelos, A.K., Ren, J.: A novel automatic content generation and optimization framework. IEEE Internet of Things Journal10(14), 12338–12351 (2023)

  60. [60]

    IEEE Transactions on Multimedia 26, 6178–6190 (2023)

    Yu, Z., Wu, X., Wang, H., Katsaggelos, A.K., Ren, J.: Automated adaptive cine- matography for user interaction in open world. IEEE Transactions on Multimedia 26, 6178–6190 (2023)

  61. [61]

    In: 2022 IEEE 5th International Conference on Multimedia Information Processing and Retrieval (MIPR)

    Yu, Z., Yu, C., Wang, H., Ren, J.: Enabling automatic cinematography with reinforcement learning. In: 2022 IEEE 5th International Conference on Multimedia Information Processing and Retrieval (MIPR). pp. 103–108. IEEE (2022)

  62. [62]

    Yuan, Y., Wang, X., Wickremasinghe, T., Nadir, Z., Ma, B., Chan, S.H.: Newtongen: Physics-consistent and controllable text-to-video generation via neural newtonian dynamics (2025),https://arxiv.org/abs/2509.21309

  63. [63]

    org/abs/2504.07083

    Zhang, M., Wu, T., Tan, J., Liu, Z., Wetzstein, G., Lin, D.: Gendop: Auto-regressive camera trajectory generation as a director of photography (2025),https://arxiv. org/abs/2504.07083

  64. [64]

    Zhang, R., Gui, L., Sun, Z., Feng, Y., Xu, K., Zhang, Y., Fu, D., Li, C., Hauptmann, A., Bisk, Y., Yang, Y.: Direct preference optimization of video large multimodal models from language model reward (2024),https://arxiv.org/abs/2404.01258

  65. [65]

    Zhu, Y., Wang, X., Lathuilière, S., Kalogeiton, V.: Soft-di [m] o: Improving one-step discrete image generation with soft embeddings. arXiv preprint arXiv:2509.22925 (2025) VERTIGO 19 Appendix to VERTIGO: Visual Preference Optimization for Cinematic Camera Generation A Additional Related Work A.1 Camera Trajectory Generation Traditional approaches in auto...