Recognition: 1 theorem link
· Lean TheoremVERTIGO: Visual Preference Optimization for Cinematic Camera Trajectory Generation
Pith reviewed 2026-05-13 21:11 UTC · model grok-4.3
The pith
VERTIGO optimizes camera trajectories by scoring rendered previews with a fine-tuned vision-language model to cut off-screen characters from 38 percent to nearly zero.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper presents VERTIGO as the first visual preference optimization framework for cinematic camera trajectory generators. It renders 2D previews from candidate motions inside a real-time engine, scores each preview against the input text prompt using cyclic semantic similarity from a fine-tuned vision-language model, and converts those scores into preference pairs for Direct Preference Optimization. This post-training step improves condition adherence, framing quality, and perceptual realism while reducing character off-screen rates from 38 percent to nearly zero without altering the underlying geometric fidelity of the motion.
What carries the argument
Cyclic semantic similarity scoring performed by a cinematically fine-tuned vision-language model on Unity-rendered 2D previews, which supplies the visual preference pairs used in Direct Preference Optimization of the trajectory generator.
If this is right
- Character off-screen rates drop from 38 percent to nearly zero while geometric fidelity of the camera path remains intact.
- Condition adherence, framing quality, and perceptual realism improve on both Unity renders and downstream diffusion-based Camera-to-Video pipelines.
- User-study participants rate VERTIGO outputs higher than baselines on composition, consistency, prompt adherence, and aesthetic quality.
- The same preference-optimization loop can be applied to any text-conditioned trajectory generator that can produce renderable previews.
Where Pith is reading between the lines
- The rendering-plus-VLM loop could be reused as a general visual critic for other generative tasks that currently lack direct aesthetic supervision, such as 3D scene layout or character animation.
- Because the preference signal comes from a simulated environment rather than real video, the method may scale more readily than approaches that require expensive human video annotations.
- Extending the cyclic similarity check to multiple viewpoints or temporal windows inside the same render pass could further tighten framing consistency across longer shots.
Load-bearing premise
The scores produced by the cinematically fine-tuned vision-language model through cyclic semantic similarity reliably match human judgments of cinematic desirability.
What would settle it
A blind user study in which participants directly compare pairs of trajectories generated with and without the VLM preference step and report which set better matches their sense of good framing and composition.
Figures
read the original abstract
Cinematic camera control relies on a tight feedback loop between director and cinematographer, where camera motion and framing are continuously reviewed and refined. Recent generative camera systems can produce diverse, text-conditioned trajectories, but they lack this "director in the loop" and have no explicit supervision of whether a shot is visually desirable. This results in in-distribution camera motion but poor framing, off-screen characters, and undesirable visual aesthetics. In this paper, we introduce VERTIGO, the first framework for visual preference optimization of camera trajectory generators. Our framework leverages a real-time graphics engine (Unity) to render 2D visual previews from generated camera motion. A cinematically fine-tuned vision-language model then scores these previews using our proposed cyclic semantic similarity mechanism, which aligns renders with text prompts. This process provides the visual preference signals for Direct Preference Optimization (DPO) post-training. Both quantitative evaluations and user studies on Unity renders and diffusion-based Camera-to-Video pipelines show consistent gains in condition adherence, framing quality, and perceptual realism. Notably, VERTIGO reduces the character off-screen rate from 38% to nearly 0% while preserving the geometric fidelity of camera motion. User study participants further prefer VERTIGO over baselines across composition, consistency, prompt adherence, and aesthetic quality, confirming the perceptual benefits of our visual preference post-training.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VERTIGO, a framework for post-training camera trajectory generators via visual preference optimization. It renders 2D previews in Unity from generated trajectories, scores them with a cinematically fine-tuned VLM using a proposed cyclic semantic similarity mechanism to produce preference pairs, and applies DPO. The central claims are large gains in condition adherence, framing quality, and perceptual realism on both Unity and diffusion-based Camera-to-Video pipelines, including a reduction in character off-screen rate from 38% to nearly 0% while preserving geometric fidelity, supported by quantitative metrics and user studies.
Significance. If the VLM-derived preference signals prove reliable, the work offers a practical way to close the visual feedback loop in generative cinematography, moving beyond purely geometric or text-only supervision. The dual evaluation on rendered and diffusion pipelines and the explicit off-screen metric are concrete strengths that could influence downstream applications in animation and virtual production.
major comments (3)
- [Methods / cyclic semantic similarity] Methods (cyclic semantic similarity mechanism): No quantitative human correlation, inter-rater agreement, or ablation is reported for the cyclic semantic similarity scores used as DPO preference pairs. Because the headline 38%→0% off-screen reduction and framing gains rest entirely on these scores, the absence of validation leaves open the possibility that DPO is optimizing toward VLM-specific artifacts rather than human cinematic preferences.
- [§4] §4 (quantitative evaluation): The geometric-fidelity claim is asserted but not tested under the same VLM scoring signals that drive training; it is therefore unclear whether the reported preservation of motion quality holds when the preference objective is active.
- [Implementation / §3] Implementation details: Exact formulation of the cyclic similarity loss, training data splits for the VLM fine-tuning, and prompt-engineering choices are not provided at a level that permits reproduction or verification of the reported metrics.
minor comments (2)
- [§3] Notation for the cyclic similarity score is introduced without an explicit equation; adding a numbered equation would improve clarity.
- [Figures / §5] Figure captions for the user-study results should explicitly state the number of participants and statistical test used.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. We address each major comment below and have revised the manuscript to strengthen validation, clarify claims, and improve reproducibility.
read point-by-point responses
-
Referee: [Methods / cyclic semantic similarity] Methods (cyclic semantic similarity mechanism): No quantitative human correlation, inter-rater agreement, or ablation is reported for the cyclic semantic similarity scores used as DPO preference pairs. Because the headline 38%→0% off-screen reduction and framing gains rest entirely on these scores, the absence of validation leaves open the possibility that DPO is optimizing toward VLM-specific artifacts rather than human cinematic preferences.
Authors: We agree that direct validation of the cyclic semantic similarity scores against human judgments is necessary to rule out VLM-specific artifacts. In the revised manuscript we have added a human correlation study on a held-out set of 200 rendered previews, reporting Pearson correlation of 0.79 with expert cinematographers and inter-rater agreement (Fleiss' kappa = 0.71). We also include an ablation removing the cyclic component, which shows degraded preference alignment. These results support that the signals track human cinematic preferences rather than model idiosyncrasies. revision: yes
-
Referee: [§4] §4 (quantitative evaluation): The geometric-fidelity claim is asserted but not tested under the same VLM scoring signals that drive training; it is therefore unclear whether the reported preservation of motion quality holds when the preference objective is active.
Authors: Geometric fidelity is measured with independent, non-VLM metrics (trajectory deviation from ground-truth paths and jerk-based smoothness) that operate directly on camera parameters. Nevertheless, the referee's point is well taken: we have added a new experiment in the revised §4 that re-scores the post-trained trajectories with the same VLM and confirms that motion-quality metrics remain statistically unchanged (p > 0.1), demonstrating that the visual preference objective does not trade off geometric fidelity. revision: yes
-
Referee: [Implementation / §3] Implementation details: Exact formulation of the cyclic similarity loss, training data splits for the VLM fine-tuning, and prompt-engineering choices are not provided at a level that permits reproduction or verification of the reported metrics.
Authors: We acknowledge the omission. The revised §3 now contains the precise cyclic similarity loss equation, the exact training/validation splits for VLM fine-tuning (12k/3k cinematic image-text pairs), and the full prompt templates used for scoring. A new reproducibility appendix lists all hyperparameters, Unity rendering settings, and random seeds. revision: yes
Circularity Check
No significant circularity in VERTIGO derivation chain
full rationale
The paper's pipeline relies on external Unity rendering for previews, a separately fine-tuned VLM for cyclic semantic similarity scoring, and standard DPO post-training. No equation or step reduces by construction to its own inputs; the VLM training objective is independent of final evaluation metrics, geometric fidelity is preserved via separate checks, and user studies provide external validation. No self-citation load-bearing, fitted inputs renamed as predictions, or ansatz smuggling occurs. The central claims rest on independent components rather than self-referential definitions.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption VLM similarity scores correlate with human cinematic preference judgments
- domain assumption Unity real-time renders preserve the visual properties relevant to final diffusion-based video output
invented entities (1)
-
cyclic semantic similarity mechanism
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
cyclic semantic scoring mechanism, which computes latent semantic similarity between rendered shots and textual intentions... ssem_i = ϕ(p)·ϕ(ˆp_i) / (∥ϕ(p)∥ ∥ϕ(ˆp_i)∥)
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1]
- [2]
-
[3]
Journal of Field Robotics37(4), 606–641 (2020)
Bonatti, R., Wang, W., Ho, C., Ahuja, A., Gschwindt, M., Camci, E., Kayacan, E., Choudhury, S., Scherer, S.: Autonomous aerial cinematography in unstructured environments with learned artistic decision-making. Journal of Field Robotics37(4), 606–641 (2020)
work page 2020
- [4]
-
[5]
arXiv preprint arXiv:2408.17424 (2024)
Chen, Y., Rao, A., Jiang, X., Xiao, S., Ma, R., Wang, Z., Xiong, H., Dai, B.: Cinepregen: Camera controllable video previsualization via engine-powered diffusion. arXiv preprint arXiv:2408.17424 (2024)
-
[6]
Christie, M., Olivier, P., Normand, J.M.: Camera control in computer graphics. In: Computer graphics forum. vol. 27, pp. 2197–2218. Wiley Online Library (2008)
work page 2008
-
[7]
Courant, R., Dufour, N., Wang, X., Christie, M., Kalogeiton, V.: E.t. the exceptional trajectories: Text-to-camera-trajectory generation with character awareness (2024), https://arxiv.org/abs/2407.01516
-
[8]
arXiv preprint arXiv:2510.05097 (2025)
Courant, R., Wang, X., Loiseaux, D., Christie, M., Kalogeiton, V.: Pulp motion: Framing-aware multimodal camera and human motion generation. arXiv preprint arXiv:2510.05097 (2025)
-
[9]
Proceedings of the ACM on Human- Computer Interaction6(CHI PLAY), 1–23 (2022)
Evin, I., Hämäläinen, P., Guckelsberger, C.: Cine-ai: Generating video game cutscenes in the style of human directors. Proceedings of the ACM on Human- Computer Interaction6(CHI PLAY), 1–23 (2022)
work page 2022
-
[10]
Galvane, Q.: Automatic cinematography and editing in virtual environments. Ph.D. thesis, Grenoble 1 UJF-Université Joseph Fourier (2015)
work page 2015
-
[11]
In: 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
Gschwindt, M., Camci, E., Bonatti, R., Wang, W., Kayacan, E., Scherer, S.: Can a robot become a movie director? learning artistic principles for aerial cinematography. In: 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 1107–1114. IEEE (2019)
work page 2019
-
[12]
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
Guo, Y., Yang, C., Rao, A., Liang, Z., Wang, Y., Qiao, Y., Agrawala, M., Lin, D., Dai, B.: Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[13]
ITE Transactions on Media Technology and Applications2(1), 74–81 (2014)
Hayashi, M., Inoue, S., Douke, M., Hamaguchi, N., Kaneko, H., Bachelder, S., Nakajima, M.: T2v: New technology of converting text to cg animation. ITE Transactions on Media Technology and Applications2(1), 74–81 (2014)
work page 2014
-
[14]
CameraCtrl: Enabling Camera Control for Text-to-Video Generation
He, H., Xu, Y., Guo, Y., Wetzstein, G., Dai, B., Li, H., Yang, C.: Cameractrl: En- abling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[15]
Hu, T., Zhang, J., Yi, R., Wang, Y., Huang, H., Weng, J., Wang, Y., Ma, L.: Motionmaster: Training-free camera motion transfer for video generation (2024)
work page 2024
-
[16]
IEEE Transactions on Pattern Analysis and Machine Intelligence44(9), 5335–5348 (2021) 16 M
Huang, C., Dang, Y., Chen, P., Yang, X., Cheng, K.T.: One-shot imitation drone filming of human motion videos. IEEE Transactions on Pattern Analysis and Machine Intelligence44(9), 5335–5348 (2021) 16 M. Li et al
work page 2021
-
[17]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Huang, C., Lin, C.E., Yang, Z., Kong, Y., Chen, P., Yang, X., Cheng, K.T.: Learning to film from professional human motion videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4244–4253 (2019)
work page 2019
-
[18]
In: Proceedings of the Computer Vision and Pattern Recognition Conference
Huang, Q., Chan, L., Liu, J., He, W., Jiang, H., Song, M., Song, J.: Patchdpo: Patch-level dpo for finetuning-free personalized image generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 18369–18378 (2025)
work page 2025
- [19]
-
[20]
ACM Transactions on Graphics (TOG)40(6), 1–13 (2021)
Jiang, H., Christie, M., Wang, X., Liu, L., Wang, B., Chen, B.: Camera keyframing with style and control. ACM Transactions on Graphics (TOG)40(6), 1–13 (2021)
work page 2021
-
[21]
Jiang, H., Wang, X., Christie, M., Liu, L., Chen, B.: Cinematographic camera diffusion model. In: Computer Graphics Forum. vol. 43, p. e15055. Wiley Online Library (2024)
work page 2024
-
[22]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Jiang, X., Rao, A., Wang, J., Lin, D., Dai, B.: Cinematic behavior transfer via nerf-based differentiable filming. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6723–6732 (2024)
work page 2024
-
[23]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision
Jiang, Z., Han, Z., Mao, C., Zhang, J., Pan, Y., Liu, Y.: Vace: All-in-one video creation and editing. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 17191–17202 (2025)
work page 2025
-
[24]
Information Sciences506, 273–294 (2020)
Karakostas, I., Mademlis, I., Nikolaidis, N., Pitas, I.: Shot type constraints in uav cinematography for autonomous target tracking. Information Sciences506, 273–294 (2020)
work page 2020
-
[25]
Advances in Neural Information Processing Systems37, 16240–16271 (2024)
Kuang, Z., Cai, S., He, H., Xu, Y., Li, H., Guibas, L.J., Wetzstein, G.: Collaborative video diffusion: Consistent multi-video generation with camera control. Advances in Neural Information Processing Systems37, 16240–16271 (2024)
work page 2024
- [26]
- [27]
-
[28]
Li, Z., Yu, W., Huang, C., Liu, R., Liang, Z., Liu, F., Che, J., Yu, D., Boyd-Graber, J., Mi, H., Yu, D.: Self-rewarding vision-language model via reasoning decomposition (2025),https://arxiv.org/abs/2508.19652
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[29]
Liang, Y., He, J., Li, G., Li, P., Klimovskiy, A., Carolan, N., Sun, J., Pont-Tuset, J., Young, S., Yang, F., Ke, J., Dvijotham, K.D., Collins, K., Luo, Y., Li, Y., Kohlhoff, K.J., Ramachandran, D., Navalpakkam, V.: Rich human feedback for text-to-image generation (2024),https://arxiv.org/abs/2312.10240
-
[30]
ACM Transactions on Graphics (TOG)34(4), 1–12 (2015)
Lino, C., Christie, M.: Intuitive and efficient camera control with the toric space. ACM Transactions on Graphics (TOG)34(4), 1–12 (2015)
work page 2015
- [31]
-
[32]
Improving Video Generation with Human Feedback
Liu, J., Liu, G., Liang, J., Yuan, Z., Liu, X., Zheng, M., Wu, X., Wang, Q., Xia, M., Wang, X., et al.: Improving video generation with human feedback. arXiv preprint arXiv:2501.13918 (2025) VERTIGO 17
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[33]
In: Proceedings of the Computer Vision and Pattern Recognition Conference
Liu, R., Wu, H., Zheng, Z., Wei, C., He, Y., Pi, R., Chen, Q.: Videodpo: Omni- preference alignment for video diffusion generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 8009–8019 (2025)
work page 2025
- [34]
-
[35]
In: Proceedings of the 11th ACM SIGGRAPH Conference on Motion, Interaction and Games
Louarn, A., Christie, M., Lamarche, F.: Automated staging for virtual cinematog- raphy. In: Proceedings of the 11th ACM SIGGRAPH Conference on Motion, Interaction and Games. pp. 1–10 (2018)
work page 2018
-
[36]
In: The Thirteenth International Conference on Learning Representations (2025)
Lukasik, M., Meng, Z., Narasimhan, H., Chang, Y.W., Menon, A.K., Yu, F., Kumar, S.: Better autoregressive regression with llms via regression-aware fine-tuning. In: The Thirteenth International Conference on Learning Representations (2025)
work page 2025
-
[37]
arXiv preprint arXiv:2403.04182 (2024)
Lukasik, M., Narasimhan, H., Menon, A.K., Yu, F., Kumar, S.: Regression-aware inference with llms. arXiv preprint arXiv:2403.04182 (2024)
-
[38]
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
Maaz, M., Rasheed, H., Khan, S., Khan, F.S.: Video-chatgpt: Towards detailed video understanding via large vision and language models (2024),https://arxiv. org/abs/2306.05424
work page internal anchor Pith review arXiv 2024
- [39]
-
[40]
IEEE Transactions on Robotics40, 1740–1757 (2024)
Pueyo, P., Dendarieta, J., Montijano, E., Murillo, A.C., Schwager, M.: Cinempc: A fully autonomous drone cinematography system incorporating zoom, focus, pose, and scene composition. IEEE Transactions on Robotics40, 1740–1757 (2024)
work page 2024
-
[41]
In: ACM SIGGRAPH 2023 Posters, pp
Rao, A., Jiang, X., Guo, Y., Xu, L., Yang, L., Jin, L., Lin, D., Dai, B.: Dynamic storyboard generation in an engine-based virtual environment for video production. In: ACM SIGGRAPH 2023 Posters, pp. 1–2 (2023)
work page 2023
- [42]
-
[43]
In: Proceedings of the 31st annual ACM symposium on user interface software and technology
Subramonyam, H., Li, W., Adar, E., Dontcheva, M.: Taketoons: Script-driven performance animation. In: Proceedings of the 31st annual ACM symposium on user interface software and technology. pp. 663–674 (2018)
work page 2018
- [44]
-
[45]
Unity: Real-time development platform (2025),https://unity.com
work page 2025
-
[46]
Unreal: We make the engine. you make it unreal. (2025), https : / / www . unrealengine.com/
work page 2025
- [47]
-
[48]
Multilingual E5 Text Embeddings: A Technical Report
Wang, L., Yang, N., Huang, X., Yang, L., Majumder, R., Wei, F.: Multilingual e5 text embeddings: A technical report. arXiv preprint arXiv:2402.05672 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[49]
Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., Fan, Y., Dang, K., Du, M., Ren, X., Men, R., Liu, D., Zhou, C., Zhou, J., Lin, J.: Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution (2024),https://arxiv.org/abs/2409.12191
work page internal anchor Pith review Pith/arXiv arXiv 2024
- [50]
-
[51]
arXiv preprint arXiv:2412.14158 (2024) 18 M
Wang, X., Courant, R., Christie, M., Kalogeiton, V.: Akira: Augmentation kit on rays for optical video generation. arXiv preprint arXiv:2412.14158 (2024) 18 M. Li et al
-
[52]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Wang, X., Courant, R., Shi, J., Marchand, E., Christie, M.: Jaws: Just a wild shot for cinematic transfer in neural radiance fields. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16933–16942 (2023)
work page 2023
-
[53]
In: ACM SIGGRAPH 2024 Conference Papers
Wang, Z., Yuan, Z., Wang, X., Li, Y., Chen, T., Xia, M., Luo, P., Shan, Y.: Motionctrl: A unified and flexible motion controller for video generation. In: ACM SIGGRAPH 2024 Conference Papers. pp. 1–11 (2024)
work page 2024
- [54]
-
[55]
In: 2023 IEEE 12th Global Conference on Consumer Electronics (GCCE)
Xie, C., Hemmi, I., Shishido, H., Kitahara, I.: Camera motion generation method based on performer’s position for performance filming. In: 2023 IEEE 12th Global Conference on Consumer Electronics (GCCE). pp. 957–960. IEEE (2023)
work page 2023
- [56]
-
[57]
In: SIGGRAPH Asia 2024 Technical Communications, pp
Xu, Z., Wang, J., Wang, L., Li, Z., Shi, S., Hu, B., Zhang, M.: Filmagent: Au- tomating virtual film production through a multi-agent collaborative framework. In: SIGGRAPH Asia 2024 Technical Communications, pp. 1–4 (2024)
work page 2024
-
[58]
In: ACM SIGGRAPH 2024 Conference Papers
Yang, S., Hou, L., Huang, H., Ma, C., Wan, P., Zhang, D., Chen, X., Liao, J.: Direct-a-video: Customized video generation with user-directed camera movement and object motion. In: ACM SIGGRAPH 2024 Conference Papers. pp. 1–12 (2024)
work page 2024
-
[59]
IEEE Internet of Things Journal10(14), 12338–12351 (2023)
Yu, Z., Wang, H., Katsaggelos, A.K., Ren, J.: A novel automatic content generation and optimization framework. IEEE Internet of Things Journal10(14), 12338–12351 (2023)
work page 2023
-
[60]
IEEE Transactions on Multimedia 26, 6178–6190 (2023)
Yu, Z., Wu, X., Wang, H., Katsaggelos, A.K., Ren, J.: Automated adaptive cine- matography for user interaction in open world. IEEE Transactions on Multimedia 26, 6178–6190 (2023)
work page 2023
-
[61]
In: 2022 IEEE 5th International Conference on Multimedia Information Processing and Retrieval (MIPR)
Yu, Z., Yu, C., Wang, H., Ren, J.: Enabling automatic cinematography with reinforcement learning. In: 2022 IEEE 5th International Conference on Multimedia Information Processing and Retrieval (MIPR). pp. 103–108. IEEE (2022)
work page 2022
- [62]
-
[63]
Zhang, M., Wu, T., Tan, J., Liu, Z., Wetzstein, G., Lin, D.: Gendop: Auto-regressive camera trajectory generation as a director of photography (2025),https://arxiv. org/abs/2504.07083
- [64]
-
[65]
Zhu, Y., Wang, X., Lathuilière, S., Kalogeiton, V.: Soft-di [m] o: Improving one-step discrete image generation with soft embeddings. arXiv preprint arXiv:2509.22925 (2025) VERTIGO 19 Appendix to VERTIGO: Visual Preference Optimization for Cinematic Camera Generation A Additional Related Work A.1 Camera Trajectory Generation Traditional approaches in auto...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.