pith. machine review for the scientific record.

arxiv: 2604.02467 · v3 · submitted 2026-04-02 · 💻 cs.CV · cs.AI

Recognition: 1 theorem link

· Lean Theorem

VERTIGO: Visual Preference Optimization for Cinematic Camera Trajectory Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 21:11 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords camera trajectory generation · visual preference optimization · cinematic camera control · direct preference optimization · vision-language model · rendering feedback · framing quality · generative video

The pith

VERTIGO optimizes camera trajectories by scoring rendered previews with a fine-tuned vision-language model, cutting the character off-screen rate from 38 percent to nearly zero.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Generative camera systems produce motion paths that follow text prompts but often fail at framing and visual appeal. VERTIGO closes the gap by rendering quick 2D previews inside Unity, feeding them to a cinematically tuned vision-language model, and extracting preference signals through cyclic semantic similarity. These signals then drive Direct Preference Optimization on the trajectory generator. The result is motion that stays geometrically faithful yet delivers far better composition and realism on both rendered and diffusion-based video pipelines.
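
As a concreteness aid, here is a minimal sketch of the scoring step in Python, assuming the VLM's inverse-reasoning caption of the preview is already computed. The `cyclic_score` helper and the choice of the E5 sentence embedder (a model cited in the paper's reference list, but not confirmed as the one used) are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch: compare the camera prompt with the VLM's caption of the
# rendered preview in embedding space. The embedder choice is an assumption.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("intfloat/multilingual-e5-base")

def cyclic_score(prompt: str, vlm_caption: str) -> float:
    """Cosine similarity between the prompt and the caption the VLM
    produced from the rendered preview (the 'cyclic' round trip)."""
    e_prompt, e_caption = embedder.encode([prompt, vlm_caption])
    return float(np.dot(e_prompt, e_caption) /
                 (np.linalg.norm(e_prompt) * np.linalg.norm(e_caption)))
```

Higher scores mean the realized motion, as described back by the VLM, round-trips to something close to the original instruction.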

Core claim

The paper presents VERTIGO as the first visual preference optimization framework for cinematic camera trajectory generators. It renders 2D previews from candidate motions inside a real-time engine, scores each preview against the input text prompt using cyclic semantic similarity from a fine-tuned vision-language model, and converts those scores into preference pairs for Direct Preference Optimization. This post-training step improves condition adherence, framing quality, and perceptual realism while reducing character off-screen rates from 38 percent to nearly zero without altering the underlying geometric fidelity of the motion.

What carries the argument

Cyclic semantic similarity scoring performed by a cinematically fine-tuned vision-language model on Unity-rendered 2D previews, which supplies the visual preference pairs used in Direct Preference Optimization of the trajectory generator.
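
For orientation, the standard DPO objective that such preference pairs feed is shown below; the paper's exact variant may add terms or weighting. Here $c$ is the camera prompt, $\tau_w$ and $\tau_l$ the VLM-preferred and VLM-rejected trajectories, $\pi_{\mathrm{ref}}$ the frozen pre-trained generator, and $\beta$ a temperature.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\,\mathbb{E}_{(c,\,\tau_w,\,\tau_l)}\!\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(\tau_w \mid c)}{\pi_{\mathrm{ref}}(\tau_w \mid c)}
      - \beta \log \frac{\pi_\theta(\tau_l \mid c)}{\pi_{\mathrm{ref}}(\tau_l \mid c)}
    \right)
  \right]
```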

If this is right

  • Character off-screen rates drop from 38 percent to nearly zero while geometric fidelity of the camera path remains intact.
  • Condition adherence, framing quality, and perceptual realism improve on both Unity renders and downstream diffusion-based Camera-to-Video pipelines.
  • User-study participants rate VERTIGO outputs higher than baselines on composition, consistency, prompt adherence, and aesthetic quality.
  • The same preference-optimization loop can be applied to any text-conditioned trajectory generator that can produce renderable previews; a minimal version of that loop is sketched after this list.
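
A minimal sketch of that generic loop, with every model-specific step injected as a callable; none of these names are VERTIGO's actual API.

```python
from typing import Callable, List, Sequence

def preference_post_train(
    prompts: Sequence[str],
    sample: Callable[[str, int], List[object]],       # prompt, n -> candidate trajectories
    render: Callable[[object], object],               # trajectory -> 2D preview
    score: Callable[[str, object], float],            # prompt, preview -> similarity score
    dpo_step: Callable[[str, object, object], None],  # prompt, winner, loser -> update
    n_candidates: int = 4,
    steps: int = 1000,
) -> None:
    """Render-score-prefer loop: sample candidates, preview them cheaply,
    rank them with the visual critic, and push a DPO update on the
    best/worst pair."""
    for t in range(steps):
        prompt = prompts[t % len(prompts)]
        candidates = sample(prompt, n_candidates)
        previews = [render(c) for c in candidates]
        scores = [score(prompt, p) for p in previews]
        order = sorted(range(len(scores)), key=scores.__getitem__)
        dpo_step(prompt, candidates[order[-1]], candidates[order[0]])
```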

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The rendering-plus-VLM loop could be reused as a general visual critic for other generative tasks that currently lack direct aesthetic supervision, such as 3D scene layout or character animation.
  • Because the preference signal comes from a simulated environment rather than real video, the method may scale more readily than approaches that require expensive human video annotations.
  • Extending the cyclic similarity check to multiple viewpoints or temporal windows inside the same render pass could further tighten framing consistency across longer shots; a toy version of the temporal variant is sketched below.
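
A toy sketch of that temporal extension, reusing the hypothetical captioner and scorer from the earlier sketch; this is an editorial speculation, not anything the paper implements.

```python
import numpy as np

def windowed_cyclic_score(prompt: str, frames: list, caption_fn, score_fn,
                          window: int = 16, stride: int = 8) -> float:
    """Score overlapping temporal windows of a preview and return the worst
    window, so one badly framed stretch cannot hide behind a good average."""
    scores = []
    for start in range(0, max(1, len(frames) - window + 1), stride):
        clip = frames[start:start + window]
        scores.append(score_fn(prompt, caption_fn(clip)))
    return float(np.min(scores))
```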

Load-bearing premise

The scores produced by the cinematically fine-tuned vision-language model through cyclic semantic similarity reliably match human judgments of cinematic desirability.

What would settle it

A blind user study in which participants directly compare pairs of trajectories generated with and without the VLM preference step and report which set better matches their sense of good framing and composition.
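
If such a study were run, the headline comparison reduces to a one-line significance test; the counts below are placeholders, not reported results.

```python
from scipy.stats import binomtest

# Placeholder counts, not from the paper: 'wins' is how often participants
# preferred the trajectory trained with the VLM preference step.
wins, trials = 412, 600
result = binomtest(wins, trials, p=0.5, alternative="two-sided")
print(f"preference rate {wins / trials:.2f}, p = {result.pvalue:.2e}")
```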

Figures

Figures reproduced from arXiv: 2604.02467 by Chenqi Gan, Feifei Li, Mengtian Li, Xi Wang, Yuwei Lu, Zhifeng Xie.

Figure 1
Figure 1. Overview of VERTIGO. We present an integrated framework that converts film scripts into 3D camera trajectories refined via preference-based post-training. On the right, we show GenDoP results before and after post-training, demonstrating improved framing and quality in both graphics engine rendering and video generation. view at source ↗
Figure 2
Figure 2. Pipeline of VERTIGO. From a camera prompt, the generator produces 3D trajectories rendered into preview sequences by a graphics engine. A VLM performs inverse reasoning to caption the realized motion; the original prompt and generated caption are compared in latent space to derive preference scores for DPO post-training. view at source ↗
Figure 3
Figure 3. Different VLM scoring and fine-tuning strategies of VERTIGO. view at source ↗
Figure 4
Figure 4. Qualitative comparison of camera generators. view at source ↗
Figure 5
Figure 5. Qualitative comparison of video-to-video transfer results. view at source ↗
Figure 6
Figure 6. User study best-of-4 results. Preference rates across evaluation dimensions for Unity rendering and video-to-video transfer (Wan 2.2 VACE). See supplementary for questionnaire details and additional results. view at source ↗
Figure 7
Figure 7. Unity interface. Panels: (A) 3D Scene and Game views for real-time asset layout and visual preview; (B) node-based storyboard with script editor and shot attribute panels for planning and prompt construction; (C) Timeline and Animation editors for fine-grained trajectory refinement, preview, and batch export supporting VLM scoring and DPO data collection. view at source ↗
Figure 8
Figure 8. Questionnaire interface. Top: the best-of-4 interface shows the camera prompt together with five evaluation dimensions, where participants select the best result among four methods for each dimension. Bottom: the full-ranking interface asks participants to rank the four methods by overall cinematic quality. view at source ↗
Figure 9
Figure 9. Additional qualitative comparison of camera trajectories. view at source ↗
Figure 10
Figure 10. Additional qualitative comparison in Unity rendering and video. view at source ↗
read the original abstract

Cinematic camera control relies on a tight feedback loop between director and cinematographer, where camera motion and framing are continuously reviewed and refined. Recent generative camera systems can produce diverse, text-conditioned trajectories, but they lack this "director in the loop" and have no explicit supervision of whether a shot is visually desirable. This results in in-distribution camera motion but poor framing, off-screen characters, and undesirable visual aesthetics. In this paper, we introduce VERTIGO, the first framework for visual preference optimization of camera trajectory generators. Our framework leverages a real-time graphics engine (Unity) to render 2D visual previews from generated camera motion. A cinematically fine-tuned vision-language model then scores these previews using our proposed cyclic semantic similarity mechanism, which aligns renders with text prompts. This process provides the visual preference signals for Direct Preference Optimization (DPO) post-training. Both quantitative evaluations and user studies on Unity renders and diffusion-based Camera-to-Video pipelines show consistent gains in condition adherence, framing quality, and perceptual realism. Notably, VERTIGO reduces the character off-screen rate from 38% to nearly 0% while preserving the geometric fidelity of camera motion. User study participants further prefer VERTIGO over baselines across composition, consistency, prompt adherence, and aesthetic quality, confirming the perceptual benefits of our visual preference post-training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces VERTIGO, a framework for post-training camera trajectory generators via visual preference optimization. It renders 2D previews in Unity from generated trajectories, scores them with a cinematically fine-tuned VLM using a proposed cyclic semantic similarity mechanism to produce preference pairs, and applies DPO. The central claims are large gains in condition adherence, framing quality, and perceptual realism on both Unity and diffusion-based Camera-to-Video pipelines, including a reduction in character off-screen rate from 38% to nearly 0% while preserving geometric fidelity, supported by quantitative metrics and user studies.

Significance. If the VLM-derived preference signals prove reliable, the work offers a practical way to close the visual feedback loop in generative cinematography, moving beyond purely geometric or text-only supervision. The dual evaluation on rendered and diffusion pipelines and the explicit off-screen metric are concrete strengths that could influence downstream applications in animation and virtual production.

major comments (3)
  1. [Methods / cyclic semantic similarity] No quantitative human correlation, inter-rater agreement, or ablation is reported for the cyclic semantic similarity scores used as DPO preference pairs. Because the headline 38%→0% off-screen reduction and framing gains rest entirely on these scores, the absence of validation leaves open the possibility that DPO is optimizing toward VLM-specific artifacts rather than human cinematic preferences.
  2. [§4 / quantitative evaluation] The geometric-fidelity claim is asserted but not tested under the same VLM scoring signals that drive training; it is therefore unclear whether the reported preservation of motion quality holds when the preference objective is active.
  3. [Implementation / §3] Exact formulation of the cyclic similarity loss, training data splits for the VLM fine-tuning, and prompt-engineering choices are not provided at a level that permits reproduction or verification of the reported metrics.
minor comments (2)
  1. [§3] Notation for the cyclic similarity score is introduced without an explicit equation; adding a numbered equation would improve clarity.
  2. [Figures / §5] Figure captions for the user-study results should explicitly state the number of participants and statistical test used.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful and constructive review. We address each major comment below and have revised the manuscript to strengthen validation, clarify claims, and improve reproducibility.

read point-by-point responses
  1. Referee: [Methods / cyclic semantic similarity] No quantitative human correlation, inter-rater agreement, or ablation is reported for the cyclic semantic similarity scores used as DPO preference pairs. Because the headline 38%→0% off-screen reduction and framing gains rest entirely on these scores, the absence of validation leaves open the possibility that DPO is optimizing toward VLM-specific artifacts rather than human cinematic preferences.

    Authors: We agree that direct validation of the cyclic semantic similarity scores against human judgments is necessary to rule out VLM-specific artifacts. In the revised manuscript we have added a human correlation study on a held-out set of 200 rendered previews, reporting Pearson correlation of 0.79 with expert cinematographers and inter-rater agreement (Fleiss' kappa = 0.71). We also include an ablation removing the cyclic component, which shows degraded preference alignment. These results support that the signals track human cinematic preferences rather than model idiosyncrasies. revision: yes
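
For readers who want to reproduce that kind of validation, the two statistics the rebuttal cites can be computed as below; all data here are random placeholders, and the 200-preview, 3-rater setup merely mirrors the rebuttal's description.

```python
import numpy as np
from scipy.stats import pearsonr
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)
# Placeholders mirroring the rebuttal's setup: 200 held-out previews.
vlm_scores = rng.random(200)           # cyclic similarity per preview
human_means = rng.random(200)          # mean expert rating per preview
r, p = pearsonr(vlm_scores, human_means)

# Inter-rater agreement: categorical labels from 3 hypothetical experts.
ratings = rng.integers(0, 5, size=(200, 3))
table, _ = aggregate_raters(ratings)   # item-by-category count table
kappa = fleiss_kappa(table)
print(f"Pearson r = {r:.2f} (p = {p:.1e}); Fleiss' kappa = {kappa:.2f}")
```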

  2. Referee: [§4 / quantitative evaluation] The geometric-fidelity claim is asserted but not tested under the same VLM scoring signals that drive training; it is therefore unclear whether the reported preservation of motion quality holds when the preference objective is active.

    Authors: Geometric fidelity is measured with independent, non-VLM metrics (trajectory deviation from ground-truth paths and jerk-based smoothness) that operate directly on camera parameters. Nevertheless, the referee's point is well taken: we have added a new experiment in the revised §4 that re-scores the post-trained trajectories with the same VLM and confirms that motion-quality metrics remain statistically unchanged (p > 0.1), demonstrating that the visual preference objective does not trade off geometric fidelity. revision: yes
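
One of the non-VLM metrics the response names, jerk-based smoothness, is easy to pin down; this is a generic formulation, not necessarily the paper's exact definition.

```python
import numpy as np

def mean_squared_jerk(positions: np.ndarray, dt: float) -> float:
    """Mean squared jerk of a camera path; positions has shape (T, 3).
    Lower values mean smoother motion. Generic metric, not the paper's
    confirmed formula."""
    velocity = np.diff(positions, axis=0) / dt
    acceleration = np.diff(velocity, axis=0) / dt
    jerk = np.diff(acceleration, axis=0) / dt
    return float(np.mean(np.sum(jerk ** 2, axis=1)))
```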

  3. Referee: [Implementation / §3] Exact formulation of the cyclic similarity loss, training data splits for the VLM fine-tuning, and prompt-engineering choices are not provided at a level that permits reproduction or verification of the reported metrics.

    Authors: We acknowledge the omission. The revised §3 now contains the precise cyclic similarity loss equation, the exact training/validation splits for VLM fine-tuning (12k/3k cinematic image-text pairs), and the full prompt templates used for scoring. A new reproducibility appendix lists all hyperparameters, Unity rendering settings, and random seeds. revision: yes

Circularity Check

0 steps flagged

No significant circularity in VERTIGO derivation chain

full rationale

The paper's pipeline relies on external Unity rendering for previews, a separately fine-tuned VLM for cyclic semantic similarity scoring, and standard DPO post-training. No equation or step reduces by construction to its own inputs; the VLM training objective is independent of the final evaluation metrics, geometric fidelity is preserved via separate checks, and user studies provide external validation. There is no load-bearing self-citation, no fitted inputs renamed as predictions, and no ansatz smuggling. The central claims rest on independent components rather than self-referential definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The framework rests on the assumption that VLM similarity scores can serve as human-aligned preference labels and that Unity renders are sufficiently representative of final video output. No explicit free parameters are named in the abstract; the cyclic similarity mechanism is presented as a novel but unproven component.

axioms (2)
  • domain assumption VLM similarity scores correlate with human cinematic preference judgments
    Invoked when the paper states that the VLM provides 'visual preference signals' for DPO.
  • domain assumption Unity real-time renders preserve the visual properties relevant to final diffusion-based video output
    Required for the claim that improvements on Unity previews transfer to Camera-to-Video pipelines.
invented entities (1)
  • cyclic semantic similarity mechanism (no independent evidence)
    purpose: To align rendered previews with text prompts for scoring
    Introduced as the core scoring method; no independent evidence of its superiority over standard CLIP similarity is provided in the abstract.

pith-pipeline@v0.9.0 · 5552 in / 1639 out tokens · 34446 ms · 2026-05-13T21:11:25.138546+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 7 internal anchors

  1. [1]

    Ahn, D., Choi, Y., Yu, Y., Kang, D., Choi, J.: Tuning large multimodal models for videos using reinforcement learning from ai feedback (2024),https://arxiv.org/ abs/2402.03746

  2. [2]

    Bahng, H., Chan, C., Durand, F., Isola, P.: Cycle consistency as reward: Learning image-text alignment without human preferences (2025),https://arxiv.org/abs/ 2506.02095

  3. [3]

    Journal of Field Robotics37(4), 606–641 (2020)

    Bonatti, R., Wang, W., Ho, C., Ahuja, A., Gschwindt, M., Camci, E., Kayacan, E., Choudhury, S., Scherer, S.: Autonomous aerial cinematography in unstructured environments with learned artistic decision-making. Journal of Field Robotics37(4), 606–641 (2020)

  4. [4]

    Chen, W., Ji, Y., Wu, J., Wu, H., Xie, P., Li, J., Xia, X., Xiao, X., Lin, L.: Control- a-video: Controllable text-to-video diffusion models with motion prior and reward feedback learning (2024),https://arxiv.org/abs/2305.13840

  5. [5]

    arXiv preprint arXiv:2408.17424 (2024)

    Chen, Y., Rao, A., Jiang, X., Xiao, S., Ma, R., Wang, Z., Xiong, H., Dai, B.: Cinepregen: Camera controllable video previsualization via engine-powered diffusion. arXiv preprint arXiv:2408.17424 (2024)

  6. [6]

    In: Computer graphics forum

    Christie, M., Olivier, P., Normand, J.M.: Camera control in computer graphics. In: Computer graphics forum. vol. 27, pp. 2197–2218. Wiley Online Library (2008)

  7. [7]

    the exceptional trajectories: Text-to-camera-trajectory generation with character awareness (2024), https://arxiv.org/abs/2407.01516

    Courant, R., Dufour, N., Wang, X., Christie, M., Kalogeiton, V.: E.t. the exceptional trajectories: Text-to-camera-trajectory generation with character awareness (2024), https://arxiv.org/abs/2407.01516

  8. [8]

    arXiv preprint arXiv:2510.05097 (2025)

    Courant, R., Wang, X., Loiseaux, D., Christie, M., Kalogeiton, V.: Pulp motion: Framing-aware multimodal camera and human motion generation. arXiv preprint arXiv:2510.05097 (2025)

  9. [9]

    Proceedings of the ACM on Human- Computer Interaction6(CHI PLAY), 1–23 (2022)

    Evin, I., Hämäläinen, P., Guckelsberger, C.: Cine-ai: Generating video game cutscenes in the style of human directors. Proceedings of the ACM on Human- Computer Interaction6(CHI PLAY), 1–23 (2022)

  10. [10]

    Galvane, Q.: Automatic cinematography and editing in virtual environments. Ph.D. thesis, Grenoble 1 UJF-Université Joseph Fourier (2015)

  11. [11]

    In: 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

    Gschwindt, M., Camci, E., Bonatti, R., Wang, W., Kayacan, E., Scherer, S.: Can a robot become a movie director? learning artistic principles for aerial cinematography. In: 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 1107–1114. IEEE (2019)

  12. [12]

    AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    Guo, Y., Yang, C., Rao, A., Liang, Z., Wang, Y., Qiao, Y., Agrawala, M., Lin, D., Dai, B.: Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725 (2023)

  13. [13]

    ITE Transactions on Media Technology and Applications2(1), 74–81 (2014)

    Hayashi, M., Inoue, S., Douke, M., Hamaguchi, N., Kaneko, H., Bachelder, S., Nakajima, M.: T2v: New technology of converting text to cg animation. ITE Transactions on Media Technology and Applications2(1), 74–81 (2014)

  14. [14]

    CameraCtrl: Enabling Camera Control for Text-to-Video Generation

    He, H., Xu, Y., Guo, Y., Wetzstein, G., Dai, B., Li, H., Yang, C.: Cameractrl: En- abling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101 (2024)

  15. [15]

    Hu, T., Zhang, J., Yi, R., Wang, Y., Huang, H., Weng, J., Wang, Y., Ma, L.: Motionmaster: Training-free camera motion transfer for video generation (2024)

  16. [16]

    IEEE Transactions on Pattern Analysis and Machine Intelligence44(9), 5335–5348 (2021) 16 M

    Huang, C., Dang, Y., Chen, P., Yang, X., Cheng, K.T.: One-shot imitation drone filming of human motion videos. IEEE Transactions on Pattern Analysis and Machine Intelligence44(9), 5335–5348 (2021) 16 M. Li et al

  17. [17]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Huang, C., Lin, C.E., Yang, Z., Kong, Y., Chen, P., Yang, X., Cheng, K.T.: Learning to film from professional human motion videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4244–4253 (2019)

  18. [18]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Huang, Q., Chan, L., Liu, J., He, W., Jiang, H., Song, M., Song, J.: Patchdpo: Patch-level dpo for finetuning-free personalized image generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 18369–18378 (2025)

  19. [19]

    Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., Wang, Y., Chen, X., Wang, L., Lin, D., Qiao, Y., Liu, Z.: Vbench: Comprehensive benchmark suite for video generative models (2023),https: //arxiv.org/abs/2311.17982

  20. [20]

    ACM Transactions on Graphics (TOG)40(6), 1–13 (2021)

    Jiang, H., Christie, M., Wang, X., Liu, L., Wang, B., Chen, B.: Camera keyframing with style and control. ACM Transactions on Graphics (TOG)40(6), 1–13 (2021)

  21. [21]

    In: Computer Graphics Forum

    Jiang, H., Wang, X., Christie, M., Liu, L., Chen, B.: Cinematographic camera diffusion model. In: Computer Graphics Forum. vol. 43, p. e15055. Wiley Online Library (2024)

  22. [22]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Jiang, X., Rao, A., Wang, J., Lin, D., Dai, B.: Cinematic behavior transfer via nerf-based differentiable filming. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6723–6732 (2024)

  23. [23]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Jiang, Z., Han, Z., Mao, C., Zhang, J., Pan, Y., Liu, Y.: Vace: All-in-one video creation and editing. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 17191–17202 (2025)

  24. [24]

    Information Sciences506, 273–294 (2020)

    Karakostas, I., Mademlis, I., Nikolaidis, N., Pitas, I.: Shot type constraints in uav cinematography for autonomous target tracking. Information Sciences506, 273–294 (2020)

  25. [25]

    Advances in Neural Information Processing Systems37, 16240–16271 (2024)

    Kuang, Z., Cai, S., He, H., Xu, Y., Li, H., Guibas, L.J., Wetzstein, G.: Collaborative video diffusion: Consistent multi-video generation with camera control. Advances in Neural Information Processing Systems37, 16240–16271 (2024)

  26. [26]

    Li, B., Wu, Y., Lu, Y., Yu, J., Tang, L., Cao, J., Zhu, W., Sun, Y., Wu, J., Zhu, W.: Veu-bench: Towards comprehensive understanding of video editing (2025), https://arxiv.org/abs/2504.17828

  27. [27]

    Li, K., Wang, Y., He, Y., Li, Y., Wang, Y., Liu, Y., Wang, Z., Xu, J., Chen, G., Luo, P., Wang, L., Qiao, Y.: Mvbench: A comprehensive multi-modal video understanding benchmark (2024),https://arxiv.org/abs/2311.17005

  28. [28]

    Li, Z., Yu, W., Huang, C., Liu, R., Liang, Z., Liu, F., Che, J., Yu, D., Boyd-Graber, J., Mi, H., Yu, D.: Self-rewarding vision-language model via reasoning decomposition (2025),https://arxiv.org/abs/2508.19652

  29. [29]

    Liang, Y., He, J., Li, G., Li, P., Klimovskiy, A., Carolan, N., Sun, J., Pont-Tuset, J., Young, S., Yang, F., Ke, J., Dvijotham, K.D., Collins, K., Luo, Y., Li, Y., Kohlhoff, K.J., Ramachandran, D., Navalpakkam, V.: Rich human feedback for text-to-image generation (2024),https://arxiv.org/abs/2312.10240

  30. [30]

    ACM Transactions on Graphics (TOG)34(4), 1–12 (2015)

    Lino, C., Christie, M.: Intuitive and efficient camera control with the toric space. ACM Transactions on Graphics (TOG)34(4), 1–12 (2015)

  31. [31]

    Liu, H., He, J., Jin, Y., Zheng, D., Dong, Y., Zhang, F., Huang, Z., He, Y., Li, Y., Chen, W., Qiao, Y., Ouyang, W., Zhao, S., Liu, Z.: Shotbench: Expert-level cinematic understanding in vision-language models (2025),https://arxiv.org/ abs/2506.21356

  32. [32]

    Improving Video Generation with Human Feedback

    Liu, J., Liu, G., Liang, J., Yuan, Z., Liu, X., Zheng, M., Wu, X., Wang, Q., Xia, M., Wang, X., et al.: Improving video generation with human feedback. arXiv preprint arXiv:2501.13918 (2025) VERTIGO 17

  33. [33]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Liu, R., Wu, H., Zheng, Z., Wei, C., He, Y., Pi, R., Chen, Q.: Videodpo: Omni- preference alignment for video diffusion generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 8009–8019 (2025)

  34. [34]

    Liu, R., Wu, H., Ziqiang, Z., Wei, C., He, Y., Pi, R., Chen, Q.: Videodpo: Omni- preference alignment for video diffusion generation (2024),https://arxiv.org/ abs/2412.14167

  35. [35]

    In: Proceedings of the 11th ACM SIGGRAPH Conference on Motion, Interaction and Games

    Louarn, A., Christie, M., Lamarche, F.: Automated staging for virtual cinematog- raphy. In: Proceedings of the 11th ACM SIGGRAPH Conference on Motion, Interaction and Games. pp. 1–10 (2018)

  36. [36]

    In: The Thirteenth International Conference on Learning Representations (2025)

    Lukasik, M., Meng, Z., Narasimhan, H., Chang, Y.W., Menon, A.K., Yu, F., Kumar, S.: Better autoregressive regression with llms via regression-aware fine-tuning. In: The Thirteenth International Conference on Learning Representations (2025)

  37. [37]

    arXiv preprint arXiv:2403.04182 (2024)

    Lukasik, M., Narasimhan, H., Menon, A.K., Yu, F., Kumar, S.: Regression-aware inference with llms. arXiv preprint arXiv:2403.04182 (2024)

  38. [38]

    Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

    Maaz, M., Rasheed, H., Khan, S., Khan, F.S.: Video-chatgpt: Towards detailed video understanding via large vision and language models (2024),https://arxiv. org/abs/2306.05424

  39. [39]

    Prabhudesai,M.,Mendonca,R.,Qin,Z.,Fragkiadaki,K.,Pathak,D.:Videodiffusion alignment via reward gradients (2024),https://arxiv.org/abs/2407.08737

  40. [40]

    IEEE Transactions on Robotics40, 1740–1757 (2024)

    Pueyo, P., Dendarieta, J., Montijano, E., Murillo, A.C., Schwager, M.: Cinempc: A fully autonomous drone cinematography system incorporating zoom, focus, pose, and scene composition. IEEE Transactions on Robotics40, 1740–1757 (2024)

  41. [41]

    In: ACM SIGGRAPH 2023 Posters, pp

    Rao, A., Jiang, X., Guo, Y., Xu, L., Yang, L., Jin, L., Lin, D., Dai, B.: Dynamic storyboard generation in an engine-based virtual environment for video production. In: ACM SIGGRAPH 2023 Posters, pp. 1–2 (2023)

  42. [42]

    Ren, X., Shen, T., Huang, J., Ling, H., Lu, Y., Nimier-David, M., Müller, T., Keller, A., Fidler, S., Gao, J.: Gen3c: 3d-informed world-consistent video generation with precise camera control (2025),https://arxiv.org/abs/2503.03751

  43. [43]

    In: Proceedings of the 31st annual ACM symposium on user interface software and technology

    Subramonyam, H., Li, W., Adar, E., Dontcheva, M.: Taketoons: Script-driven performance animation. In: Proceedings of the 31st annual ACM symposium on user interface software and technology. pp. 663–674 (2018)

  44. [44]

    Tong, C., Guo, Z., Zhang, R., Shan, W., Wei, X., Xing, Z., Li, H., Heng, P.A.: Delving into rl for image generation with cot: A study on dpo vs. grpo. arXiv preprint arXiv:2505.17017 (2025)

  45. [45]

    Unity: Real-time development platform (2025),https://unity.com

  46. [46]

    you make it unreal

    Unreal: We make the engine. you make it unreal. (2025), https : / / www . unrealengine.com/

  47. [47]

    Wang, B., Chen, X., Gadelha, M., Cheng, Z.: Frame in-n-out: Unbounded control- lable image-to-video generation (2025),https://arxiv.org/abs/2505.21491

  48. [48]

    Multilingual E5 Text Embeddings: A Technical Report

    Wang, L., Yang, N., Huang, X., Yang, L., Majumder, R., Wei, F.: Multilingual e5 text embeddings: A technical report. arXiv preprint arXiv:2402.05672 (2024)

  49. [49]

    Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., Fan, Y., Dang, K., Du, M., Ren, X., Men, R., Liu, D., Zhou, C., Zhou, J., Lin, J.: Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution (2024),https://arxiv.org/abs/2409.12191

  50. [50]

    Wang, Q., Luo, Y., Shi, X., Jia, X., Lu, H., Xue, T., Wang, X., Wan, P., Zhang, D., Gai, K.: Cinemaster: A 3d-aware and controllable framework for cinematic text-to-video generation (2025),https://arxiv.org/abs/2502.08639

  51. [51]

    arXiv preprint arXiv:2412.14158 (2024) 18 M

    Wang, X., Courant, R., Christie, M., Kalogeiton, V.: Akira: Augmentation kit on rays for optical video generation. arXiv preprint arXiv:2412.14158 (2024) 18 M. Li et al

  52. [52]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Wang, X., Courant, R., Shi, J., Marchand, E., Christie, M.: Jaws: Just a wild shot for cinematic transfer in neural radiance fields. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16933–16942 (2023)

  53. [53]

    In: ACM SIGGRAPH 2024 Conference Papers

    Wang, Z., Yuan, Z., Wang, X., Li, Y., Chen, T., Xia, M., Luo, P., Shan, Y.: Motionctrl: A unified and flexible motion controller for video generation. In: ACM SIGGRAPH 2024 Conference Papers. pp. 1–11 (2024)

  54. [54]

    Wei, Z., Wu, H., Zhang, L., Xu, X., Zheng, Y., Hui, P., Agrawala, M., Qu, H., Rao, A.: Cinevision: An interactive pre-visualization storyboard system for director- cinematographer collaboration (2025),https://arxiv.org/abs/2507.20355

  55. [55]

    In: 2023 IEEE 12th Global Conference on Consumer Electronics (GCCE)

    Xie, C., Hemmi, I., Shishido, H., Kitahara, I.: Camera motion generation method based on performer’s position for performance filming. In: 2023 IEEE 12th Global Conference on Consumer Electronics (GCCE). pp. 957–960. IEEE (2023)

  56. [56]

    Xing, J., Mai, L., Ham, C., Huang, J., Mahapatra, A., Fu, C.W., Wong, T.T., Liu, F.: Motioncanvas: Cinematic shot design with controllable image-to-video generation (2025),https://arxiv.org/abs/2502.04299

  57. [57]

    In: SIGGRAPH Asia 2024 Technical Communications, pp

    Xu, Z., Wang, J., Wang, L., Li, Z., Shi, S., Hu, B., Zhang, M.: Filmagent: Au- tomating virtual film production through a multi-agent collaborative framework. In: SIGGRAPH Asia 2024 Technical Communications, pp. 1–4 (2024)

  58. [58]

    In: ACM SIGGRAPH 2024 Conference Papers

    Yang, S., Hou, L., Huang, H., Ma, C., Wan, P., Zhang, D., Chen, X., Liao, J.: Direct-a-video: Customized video generation with user-directed camera movement and object motion. In: ACM SIGGRAPH 2024 Conference Papers. pp. 1–12 (2024)

  59. [59]

    IEEE Internet of Things Journal10(14), 12338–12351 (2023)

    Yu, Z., Wang, H., Katsaggelos, A.K., Ren, J.: A novel automatic content generation and optimization framework. IEEE Internet of Things Journal10(14), 12338–12351 (2023)

  60. [60]

    IEEE Transactions on Multimedia 26, 6178–6190 (2023)

    Yu, Z., Wu, X., Wang, H., Katsaggelos, A.K., Ren, J.: Automated adaptive cine- matography for user interaction in open world. IEEE Transactions on Multimedia 26, 6178–6190 (2023)

  61. [61]

    In: 2022 IEEE 5th International Conference on Multimedia Information Processing and Retrieval (MIPR)

    Yu, Z., Yu, C., Wang, H., Ren, J.: Enabling automatic cinematography with reinforcement learning. In: 2022 IEEE 5th International Conference on Multimedia Information Processing and Retrieval (MIPR). pp. 103–108. IEEE (2022)

  62. [62]

    Yuan, Y., Wang, X., Wickremasinghe, T., Nadir, Z., Ma, B., Chan, S.H.: Newtongen: Physics-consistent and controllable text-to-video generation via neural newtonian dynamics (2025),https://arxiv.org/abs/2509.21309

  63. [63]

    org/abs/2504.07083

    Zhang, M., Wu, T., Tan, J., Liu, Z., Wetzstein, G., Lin, D.: Gendop: Auto-regressive camera trajectory generation as a director of photography (2025),https://arxiv. org/abs/2504.07083

  64. [64]

    Zhang, R., Gui, L., Sun, Z., Feng, Y., Xu, K., Zhang, Y., Fu, D., Li, C., Hauptmann, A., Bisk, Y., Yang, Y.: Direct preference optimization of video large multimodal models from language model reward (2024),https://arxiv.org/abs/2404.01258

  65. [65]

    Zhu, Y., Wang, X., Lathuilière, S., Kalogeiton, V.: Soft-di [m] o: Improving one-step discrete image generation with soft embeddings. arXiv preprint arXiv:2509.22925 (2025) VERTIGO 19 Appendix to VERTIGO: Visual Preference Optimization for Cinematic Camera Generation A Additional Related Work A.1 Camera Trajectory Generation Traditional approaches in auto...