arxiv: 2601.08565 · v2 · submitted 2026-01-13 · 💻 cs.HC · cs.AI

Rewriting Video: Text-Driven Reauthoring of Video Footage

Sitong Wang , Anh Truong , Lydia B. Chilton , Dingzeyu Li This is my paper

Pith reviewed 2026-05-16 15:01 UTC · model grok-4.3

classification 💻 cs.HC cs.AI

keywords video reauthoringtext-driven editinggenerative AIhuman-AI perceptual gapcreative toolsprompt manipulationvideo editing

0 comments

The pith

Video footage can be reauthored by turning it into editable text prompts that creators rewrite directly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether video editing can work like text editing by building a system that converts existing footage into text prompts. It contributes a generative algorithm that reconstructs video as editable text and an interactive tool called Rewrite Kit that lets users change those prompts. Evaluation of the algorithm identifies a human-AI perceptual gap, while a study with twelve creators shows practical uses such as virtual reshooting, synthetic continuity, and aesthetic restyling. The work surfaces tensions around coherence, control, and creative alignment and draws design implications for future tools.

Core claim

Our approach involves two technical contributions: a generative reconstruction algorithm that reverse-engineers video into an editable text prompt, and an interactive probe, Rewrite Kit, that allows creators to manipulate these prompts. A technical evaluation of the algorithm reveals a critical human-AI perceptual gap. A probe study with 12 creators surfaced novel use cases such as virtual reshooting, synthetic continuity, and aesthetic restyling. It also highlighted key tensions around coherence, control, and creative alignment in this new paradigm.

What carries the argument

The generative reconstruction algorithm that reverse-engineers video footage into an editable text prompt, paired with the Rewrite Kit interface for direct manipulation of that prompt.

If this is right

Creators can virtually reshoot scenes by editing only the generated text description.
Synthetic continuity becomes possible by adjusting prompts across separate video clips.
Aesthetic restyling of footage no longer requires traditional video-editing expertise.
Tensions in coherence and control point to specific requirements for future co-creative video systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the perceptual gap shrinks, the same prompt-based workflow could scale to non-expert users in everyday video tools.
The method may extend naturally to other sequential media such as audio narration or animation sequences.
Real-time multi-user editing sessions could emerge where several people rewrite the same underlying prompt.

Load-bearing premise

Generative models can produce text prompts from video that remain faithful enough and editable enough to support real creative reauthoring even when human and AI perceptions of the content differ.

What would settle it

Creators attempt a concrete reauthoring goal such as changing the scene setting or mood through prompt edits alone and the resulting video either matches the stated intent without breaking original continuity or it does not.

Figures

Figures reproduced from arXiv: 2601.08565 by Anh Truong, Dingzeyu Li, Lydia B. Chilton, Sitong Wang.

**Figure 1.** Figure 1: Workflow of our text-as-interface approach for video reauthoring. A generative reconstruction algorithm extracts an [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗

**Figure 2.** Figure 2: The average similarity across iterations and the rationale for early stopping. The solid line shows the mean similarity [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: The user interface of the Rewrite Kit technology probe. A creator’s three-part workflow: reverse-engineer, rewrite, [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Use case 12: Camera angle change. Change the fairy shot to a low angle from the ground. [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Use case 14: Generate a smooth transition between two clips. Note that the middle two rows of transition are all 100% [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Use case 9: Stylizing a vlog as pixel art. Guided by a single reference image, Rewrite Kit restyles the original [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Use case 10: Expanding a short food clip into a rich vlog. The input video, shot from a single static viewpoint, lacks [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 8.** Figure 8: Use case 11: Adding a yellow slug to an animation. For a participant wanting to insert a novel character from their [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗

**Figure 9.** Figure 9: Use case 4: Recreating a viral “historical selfie” with a new character. Inspired by a popular video of a historical figure [PITH_FULL_IMAGE:figures/full_fig_p009_9.png] view at source ↗

read the original abstract

Video is a powerful medium for communication and storytelling, yet reauthoring existing footage remains challenging. Even simple edits often demand expertise, time, and careful planning, constraining how creators envision and shape their narratives. Recent advances in generative AI suggest a new paradigm: what if editing a video were as straightforward as rewriting text? To investigate this, we present a tech probe and a study on text-driven video reauthoring. Our approach involves two technical contributions: (1) a generative reconstruction algorithm that reverse-engineers video into an editable text prompt, and (2) an interactive probe, Rewrite Kit, that allows creators to manipulate these prompts. A technical evaluation of the algorithm reveals a critical human-AI perceptual gap. A probe study with 12 creators surfaced novel use cases such as virtual reshooting, synthetic continuity, and aesthetic restyling. It also highlighted key tensions around coherence, control, and creative alignment in this new paradigm. Our work contributes empirical insights into the opportunities and challenges of text-driven video reauthoring, offering design implications for future co-creative video tools.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper sketches a text-to-video reauthoring loop via prompt reconstruction and a creator probe, but the evaluation gives no numbers to show the perceptual gap is small enough for reliable edits.

read the letter

The main takeaway is that this work tries to make video editing feel like text editing by first turning footage into editable prompts and then letting creators tweak those prompts in a tool called Rewrite Kit. The probe study with 12 creators surfaces concrete use cases such as virtual reshooting and synthetic continuity, along with the expected frictions around coherence and control. That user-facing part is the clearest addition—it moves beyond pure generation papers by showing how people might actually apply the idea in practice.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces a generative reconstruction algorithm that reverse-engineers video footage into editable text prompts, paired with an interactive probe called Rewrite Kit that lets creators manipulate those prompts for reauthoring. It reports a technical evaluation that identifies a human-AI perceptual gap and a probe study with 12 creators that surfaces use cases (virtual reshooting, synthetic continuity, aesthetic restyling) along with tensions around coherence, control, and creative alignment, yielding design implications for co-creative video tools.

Significance. If the reconstruction algorithm can be shown to produce sufficiently faithful prompts, the work would offer a novel paradigm for text-driven video editing that lowers barriers for non-experts and enables new narrative workflows. The empirical observations from the probe study provide concrete design implications that could inform future HCI systems combining generative models with video. The explicit acknowledgment of the perceptual gap is a strength, as it frames open challenges rather than overclaiming.

major comments (2)

[Technical Evaluation] Technical Evaluation section: the evaluation is described only at a high level and supplies no quantitative metrics (e.g., CLIP similarity between source video and generated prompts, caption accuracy, or inter-rater agreement on fidelity). Without these, it is impossible to determine whether the acknowledged human-AI perceptual gap is small enough to preserve editability, which is load-bearing for the claim that prompt manipulation supports reliable reauthoring.
[Probe Study] Probe Study section: the study reports findings from 12 creators but provides no details on recruitment criteria, task protocol, data collection instruments, or qualitative analysis method. This absence directly weakens the reliability of the reported use cases and tensions, which form the primary empirical contribution.

minor comments (1)

[Abstract] The abstract states that the technical evaluation 'reveals a critical human-AI perceptual gap' but does not briefly characterize the gap or its measured impact, leaving readers without context before the body of the paper.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of the work's significance and for the constructive feedback. We will revise the manuscript to expand both the Technical Evaluation and Probe Study sections with additional details and metrics as requested.

read point-by-point responses

Referee: [Technical Evaluation] Technical Evaluation section: the evaluation is described only at a high level and supplies no quantitative metrics (e.g., CLIP similarity between source video and generated prompts, caption accuracy, or inter-rater agreement on fidelity). Without these, it is impossible to determine whether the acknowledged human-AI perceptual gap is small enough to preserve editability, which is load-bearing for the claim that prompt manipulation supports reliable reauthoring.

Authors: We agree that the Technical Evaluation section would be strengthened by quantitative metrics. In the revision we will add CLIP similarity scores between source video frames and frames generated from the reconstructed prompts, caption accuracy measures, and inter-rater agreement statistics from the human fidelity assessments. These additions will better quantify the perceptual gap and support the editability claims. revision: yes
Referee: [Probe Study] Probe Study section: the study reports findings from 12 creators but provides no details on recruitment criteria, task protocol, data collection instruments, or qualitative analysis method. This absence directly weakens the reliability of the reported use cases and tensions, which form the primary empirical contribution.

Authors: We acknowledge that the Probe Study section lacks sufficient methodological detail. The revised manuscript will include explicit information on recruitment criteria (professional video creators with at least two years of experience), the full task protocol (including session structure and participant instructions), data collection instruments (interview guides and observation logs), and the qualitative analysis method (thematic analysis with inter-coder reliability reporting). These additions will improve the reliability of the reported use cases and tensions. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents a generative reconstruction algorithm and Rewrite Kit probe as novel technical contributions, evaluated via technical assessment and a study with 12 creators. No equations, fitted parameters, self-citations, or ansatzes are described that reduce any claimed result to its own inputs by construction. The approach relies on new implementation details and empirical observations rather than any self-referential derivation chain, rendering the work self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the domain assumption that generative models can perform high-fidelity video-to-editable-text reconstruction suitable for creative control.

axioms (1)

domain assumption Generative AI models can reverse-engineer video footage into semantically rich and editable text prompts with sufficient fidelity for reauthoring
This assumption is required for the reconstruction algorithm to enable the claimed editing workflow and is not demonstrated in the provided abstract.

pith-pipeline@v0.9.0 · 5494 in / 1244 out tokens · 78875 ms · 2026-05-16T15:01:26.967468+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 1 internal anchor

[1]

Kirsten Boehner, Janet Vertesi, Phoebe Sengers, and Paul Dourish. 2007. How HCI interprets the probes. InProceedings of the SIGCHI conference on Human 11 factors in computing systems. 1077–1086

work page 2007
[2]

Stephen Brade, Bryan Wang, Mauricio Sousa, Sageev Oore, and Tovi Gross- man. 2023. Promptify: Text-to-image generation through interactive prompt exploration with large language models. InProceedings of the 36th Annual ACM Symposium on User Interface Software and Technology. 1–14

work page 2023
[3]

Virginia Braun and Victoria Clarke. 2006. Using thematic analysis in psychology. Qualitative research in psychology3, 2 (2006), 77–101

work page 2006
[4]

Coalition for Content Provenance and Authenticity (C2PA)

Coalition for Content Provenance and Authenticity (C2PA) 2025.C2PA Specifi- cation, Version 2.2. Coalition for Content Provenance and Authenticity (C2PA). https://spec.c2pa.org/specifications/specifications/2.2/index.html Accessed: 7 October 2025

work page 2025
[5]

Hai Dang, Lukas Mecke, Florian Lehmann, Sven Goller, and Daniel Buschek. 2022. How to prompt? Opportunities and challenges of zero-and few-shot learning for human-AI interaction in creative applications of generative models.arXiv preprint arXiv:2209.01390(2022)

work page arXiv 2022
[6]

Ohad Fried, Ayush Tewari, Michael Zollhöfer, Adam Finkelstein, Eli Shecht- man, Dan B Goldman, Kyle Genova, Zeyu Jin, Christian Theobalt, and Maneesh Agrawala. 2019. Text-based editing of talking-head video.ACM Transactions on Graphics (TOG)38, 4 (2019), 1–14

work page 2019
[7]

Google AI. 2025. Generate videos with Veo 3 in Gemini API: prompt guide. https: //ai.google.dev/gemini-api/docs/video?example=dialogue#prompt-guide. Last updated 2025-10-06 (UTC)

work page 2025
[8]

Yuwei Guo, Ceyuan Yang, Anyi Rao, Chenlin Meng, Omer Bar-Tal, Shuangrui Ding, Maneesh Agrawala, Dahua Lin, and Bo Dai. 2025. Keyframe-Guided Creative Video Inpainting. InProceedings of the Computer Vision and Pattern Recognition Conference. 13009–13020

work page 2025
[9]

Bernd Huber, Hijung Valentina Shin, Bryan Russell, Oliver Wang, and Gautham J Mysore. 2019. B-script: Transcript-based b-roll video editing with recommenda- tions. InProceedings of the 2019 CHI Conference on Human Factors in Computing Systems. 1–11

work page 2019
[10]

Hilary Hutchinson, Wendy Mackay, Bo Westerlund, Benjamin B Bederson, Al- lison Druin, Catherine Plaisant, Michel Beaudouin-Lafon, Stéphane Conversy, Helen Evans, Heiko Hansen, et al. 2003. Technology probes: inspiring design for and with families. InProceedings of the SIGCHI conference on Human factors in computing systems. 17–24

work page 2003
[11]

Mackenzie Leake, Abe Davis, Anh Truong, and Maneesh Agrawala. 2017. Com- putational video editing for dialogue-driven scenes.ACM Trans. Graph.36, 4 (2017), 130–1

work page 2017
[12]

Mina Lee, Percy Liang, and Qian Yang. 2022. Coauthor: Designing a human- ai collaborative writing dataset for exploring language model capabilities. In Proceedings of the 2022 CHI conference on human factors in computing systems. 1–19

work page 2022
[13]

Vivian Liu and Lydia B Chilton. 2022. Design guidelines for prompt engineering text-to-image generative models. InProceedings of the 2022 CHI conference on human factors in computing systems. 1–23

work page 2022
[14]

OpenAI. 2025. Sora: Creating Video from Text. https://openai.com/index/sora/. Accessed: 2025-10-07

work page 2025
[15]

Julien Porquet, Sitong Wang, and Lydia B Chilton. 2025. Copying style, Extracting value: Illustrators’ Perception of AI Style Transfer and its Impact on Creative Labor. InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems. 1–16

work page 2025
[16]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. InInternational conference on machine learning. PmLR, 8748–8763

work page 2021
[17]

Runway Research. 2025. Introducing Runway Gen-4. https://runwayml.com/ research/introducing-runway-gen-4. Accessed: 2025-10-07

work page 2025
[18]

Steve Rubin, Floraine Berthouzoz, Gautham J Mysore, Wilmot Li, and Maneesh Agrawala. 2013. Content-based tools for editing audio stories. InProceedings of the 26th annual ACM symposium on User interface software and technology. 113–122

work page 2013
[19]

Runway ML. n.d.. Creating with Aleph. https://help.runwayml.com/hc/en- us/articles/43176400374419-Creating-with-Aleph. Accessed: 2025-10-07

work page arXiv 2025
[20]

Ben Shneiderman. 1983. Direct manipulation: A step beyond programming languages.Computer16, 08 (1983), 57–69

work page 1983
[21]

The New York Times. 2025. A.I. Videos Have Never Been Better. Can You Tell What’s Real?The New York Times(June 29 2025). https://www.nytimes.com/ 2025/06/29/technology/ai-videos-real-or-fake.html Accessed: 2025-10-07

work page 2025
[22]

Sitong Wang, Zheng Ning, Anh Truong, Mira Dontcheva, Dingzeyu Li, and Lydia B Chilton. 2024. PodReels: Human-AI Co-Creation of Video Podcast Teasers. InProceedings of the 2024 ACM Designing Interactive Systems Conference. 958–974

work page 2024
[23]

Zhijie Wang, Yuheng Huang, Da Song, Lei Ma, and Tianyi Zhang. 2024. Promptcharm: Text-to-image generation through multi-modal prompting and refinement. InProceedings of the 2024 CHI Conference on Human Factors in Com- puting Systems. 1–21

work page 2024
[24]

Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, and Robert Geirhos. 2025. Video models are zero-shot learners and reasoners.arXiv preprint arXiv:2509.20328(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

Ji Xie, Trevor Darrell, Luke Zettlemoyer, and XuDong Wang. 2025. Reconstruction alignment improves unified multimodal models.arXiv preprint arXiv:2509.07295 (2025)

work page arXiv 2025
[26]

Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. InPro- ceedings of the IEEE international conference on computer vision. 2223–2232. 12

work page 2017