Rewriting Video: Text-Driven Reauthoring of Video Footage
Pith reviewed 2026-05-16 15:01 UTC · model grok-4.3
The pith
Video footage can be reauthored by turning it into editable text prompts that creators rewrite directly.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our approach involves two technical contributions: a generative reconstruction algorithm that reverse-engineers video into an editable text prompt, and an interactive probe, Rewrite Kit, that allows creators to manipulate these prompts. A technical evaluation of the algorithm reveals a critical human-AI perceptual gap. A probe study with 12 creators surfaced novel use cases such as virtual reshooting, synthetic continuity, and aesthetic restyling. It also highlighted key tensions around coherence, control, and creative alignment in this new paradigm.
What carries the argument
The generative reconstruction algorithm that reverse-engineers video footage into an editable text prompt, paired with the Rewrite Kit interface for direct manipulation of that prompt.
If this is right
- Creators can virtually reshoot scenes by editing only the generated text description.
- Synthetic continuity becomes possible by adjusting prompts across separate video clips.
- Aesthetic restyling of footage no longer requires traditional video-editing expertise.
- Tensions in coherence and control point to specific requirements for future co-creative video systems.
Where Pith is reading between the lines
- If the perceptual gap shrinks, the same prompt-based workflow could scale to non-expert users in everyday video tools.
- The method may extend naturally to other sequential media such as audio narration or animation sequences.
- Real-time multi-user editing sessions could emerge where several people rewrite the same underlying prompt.
Load-bearing premise
Generative models can produce text prompts from video that remain faithful enough and editable enough to support real creative reauthoring even when human and AI perceptions of the content differ.
What would settle it
A test in which creators attempt a concrete reauthoring goal, such as changing the scene setting or mood, through prompt edits alone. Either the resulting video matches the stated intent without breaking the original continuity, or it does not.
Original abstract
Video is a powerful medium for communication and storytelling, yet reauthoring existing footage remains challenging. Even simple edits often demand expertise, time, and careful planning, constraining how creators envision and shape their narratives. Recent advances in generative AI suggest a new paradigm: what if editing a video were as straightforward as rewriting text? To investigate this, we present a tech probe and a study on text-driven video reauthoring. Our approach involves two technical contributions: (1) a generative reconstruction algorithm that reverse-engineers video into an editable text prompt, and (2) an interactive probe, Rewrite Kit, that allows creators to manipulate these prompts. A technical evaluation of the algorithm reveals a critical human-AI perceptual gap. A probe study with 12 creators surfaced novel use cases such as virtual reshooting, synthetic continuity, and aesthetic restyling. It also highlighted key tensions around coherence, control, and creative alignment in this new paradigm. Our work contributes empirical insights into the opportunities and challenges of text-driven video reauthoring, offering design implications for future co-creative video tools.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a generative reconstruction algorithm that reverse-engineers video footage into editable text prompts, paired with an interactive probe called Rewrite Kit that lets creators manipulate those prompts for reauthoring. It reports a technical evaluation that identifies a human-AI perceptual gap and a probe study with 12 creators that surfaces use cases (virtual reshooting, synthetic continuity, aesthetic restyling) along with tensions around coherence, control, and creative alignment, yielding design implications for co-creative video tools.
Significance. If the reconstruction algorithm can be shown to produce sufficiently faithful prompts, the work would offer a novel paradigm for text-driven video editing that lowers barriers for non-experts and enables new narrative workflows. The empirical observations from the probe study provide concrete design implications that could inform future HCI systems combining generative models with video. The explicit acknowledgment of the perceptual gap is a strength, as it frames open challenges rather than overclaiming.
major comments (2)
- [Technical Evaluation] Technical Evaluation section: the evaluation is described only at a high level and supplies no quantitative metrics (e.g., CLIP similarity between source video and generated prompts, caption accuracy, or inter-rater agreement on fidelity). Without these, it is impossible to determine whether the acknowledged human-AI perceptual gap is small enough to preserve editability, which is load-bearing for the claim that prompt manipulation supports reliable reauthoring.
- [Probe Study] Probe Study section: the study reports findings from 12 creators but provides no details on recruitment criteria, task protocol, data collection instruments, or qualitative analysis method. This absence directly weakens the reliability of the reported use cases and tensions, which form the primary empirical contribution.
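To make the first major comment concrete, here is a minimal sketch of an embedding-based fidelity score of the kind the referee suggests. It assumes per-frame embeddings have already been computed by some image encoder (CLIP is one option the report names); the function names `cosine_similarity` and `mean_frame_similarity` and the toy vectors are hypothetical illustrations, not anything taken from the manuscript.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must have non-zero norm")
    return dot / (norm_a * norm_b)


def mean_frame_similarity(source_frames, regenerated_frames):
    """Average cosine similarity over aligned frame embeddings.

    Assumption: both lists hold one embedding per sampled frame, in the
    same temporal order, produced by the same (unspecified) encoder.
    """
    if not source_frames or len(source_frames) != len(regenerated_frames):
        raise ValueError("need two non-empty, equally long frame lists")
    sims = [cosine_similarity(s, r)
            for s, r in zip(source_frames, regenerated_frames)]
    return sum(sims) / len(sims)


# Toy 2-D "embeddings": the first frame pair matches, the second is orthogonal.
source = [[1.0, 0.0], [0.0, 1.0]]
regenerated = [[1.0, 0.0], [1.0, 0.0]]
score = mean_frame_similarity(source, regenerated)
```

A score like this only captures what the chosen encoder treats as similar, so it would complement rather than replace the human fidelity judgments the comment also asks for.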
minor comments (1)
- [Abstract] The abstract states that the technical evaluation 'reveals a critical human-AI perceptual gap' but does not briefly characterize the gap or its measured impact, leaving readers without context before the body of the paper.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of the work's significance and for the constructive feedback. We will revise the manuscript to expand both the Technical Evaluation and Probe Study sections with additional details and metrics as requested.
Point-by-point responses
- Referee: [Technical Evaluation] Technical Evaluation section: the evaluation is described only at a high level and supplies no quantitative metrics (e.g., CLIP similarity between source video and generated prompts, caption accuracy, or inter-rater agreement on fidelity). Without these, it is impossible to determine whether the acknowledged human-AI perceptual gap is small enough to preserve editability, which is load-bearing for the claim that prompt manipulation supports reliable reauthoring.
  Authors: We agree that the Technical Evaluation section would be strengthened by quantitative metrics. In the revision we will add CLIP similarity scores between source video frames and frames generated from the reconstructed prompts, caption accuracy measures, and inter-rater agreement statistics from the human fidelity assessments. These additions will better quantify the perceptual gap and support the editability claims.
  Revision: yes
- Referee: [Probe Study] Probe Study section: the study reports findings from 12 creators but provides no details on recruitment criteria, task protocol, data collection instruments, or qualitative analysis method. This absence directly weakens the reliability of the reported use cases and tensions, which form the primary empirical contribution.
  Authors: We acknowledge that the Probe Study section lacks sufficient methodological detail. The revised manuscript will include explicit information on recruitment criteria (professional video creators with at least two years of experience), the full task protocol (including session structure and participant instructions), data collection instruments (interview guides and observation logs), and the qualitative analysis method (thematic analysis with inter-coder reliability reporting). These additions will improve the reliability of the reported use cases and tensions.
  Revision: yes
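Both responses promise agreement statistics (inter-rater agreement on fidelity, inter-coder reliability for the thematic analysis) without naming a measure. As an illustration only, the sketch below computes Cohen's kappa for two raters; choosing kappa is an assumption made here, and the function name `cohens_kappa` and the sample labels are hypothetical.

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_observed - p_expected) / (1 - p_expected), where
    p_expected is the agreement expected by chance given each rater's
    own label frequencies.
    """
    if not rater_a or len(rater_a) != len(rater_b):
        raise ValueError("need two non-empty, equally long label lists")
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    p_expected = sum(
        (rater_a.count(label) / n) * (rater_b.count(label) / n)
        for label in labels
    )
    if p_expected == 1.0:
        # Both raters used a single identical label for every item.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical fidelity judgments on four reconstructed clips.
rater_1 = ["faithful", "faithful", "unfaithful", "unfaithful"]
rater_2 = ["faithful", "unfaithful", "unfaithful", "unfaithful"]
kappa = cohens_kappa(rater_1, rater_2)
```

In this toy case the raters agree on three of four clips (observed agreement 0.75) against a chance agreement of 0.5, which gives a kappa of 0.5. With more than two raters or coders, a different statistic would be needed.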
Circularity Check
No significant circularity detected
The paper presents a generative reconstruction algorithm and Rewrite Kit probe as novel technical contributions, evaluated via technical assessment and a study with 12 creators. No equations, fitted parameters, self-citations, or ansatzes are described that reduce any claimed result to its own inputs by construction. The approach relies on new implementation details and empirical observations rather than any self-referential derivation chain, rendering the work self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Generative AI models can reverse-engineer video footage into semantically rich and editable text prompts with sufficient fidelity for reauthoring.