pith. sign in

arxiv: 2504.14776 · v2 · submitted 2025-04-21 · 💻 cs.HC

Script2Screen: Supporting Dialogue Scriptwriting with Interactive Audiovisual Generation

Pith reviewed 2026-05-22 19:32 UTC · model grok-4.3

classification 💻 cs.HC
keywords scriptwritingdialogue writingAI-assisted creationaudiovisual generationinteractive interfacesuser studyhuman-computer interactioniterative refinement
0
0 comments X

The pith

Script2Screen links text dialogue scripts to AI-generated audiovisual scenes so writers can see and adjust emotional speech, gestures, and camera angles during iteration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Script2Screen as a tool that unifies scriptwriting with real-time audiovisual generation focused on dialogues. It uses a text-to-audiovisual pipeline to create scenes with animated characters delivering emotional speeches, then lets users fine-tune gestures, emotions, and camera angles through the interface. A user study with novice and professional writers across domains found that this setup supports iterative refinement of scripts while complementing rather than replacing the writer's creative process. Traditional text-only scriptwriting gives only partial sense of the final audiovisual result, so connecting the two modalities can aid ideation especially for dialogue-heavy work.

Core claim

Script2Screen integrates scriptwriting with audiovisual scene creation through a novel text-to-audiovisual-scene pipeline that generates expressive scenes featuring emotional speeches and animated characters; the interface supplies fine-grained controls over character gestures, speech emotions, and camera angles, and a user study with novice and professional writers from various domains shows the approach enhances the scriptwriting process by enabling iterative refinement while complementing creative efforts.

What carries the argument

The text-to-audiovisual-scene pipeline that converts dialogue text into synchronized scenes with emotional speech and animated characters, paired with an interface for direct adjustment of gestures, emotions, and camera angles.

If this is right

  • Writers gain the ability to iterate on dialogue by immediately viewing and tweaking the corresponding audiovisual elements rather than relying on imagination alone.
  • The interactive controls allow fine adjustments to emotional tone and visual framing without leaving the scriptwriting environment.
  • Both novice and experienced writers can use the system to explore scene ideas while retaining full creative control over the final text.
  • The workflow treats audiovisual generation as a feedback aid that supports refinement cycles instead of an automated replacement for human authorship.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending the same synchronized pipeline to non-dialogue elements like action descriptions or scene transitions could broaden the tool's reach beyond its current dialogue focus.
  • Embedding the controls into existing professional screenwriting software might reduce the learning curve and increase adoption among working writers.
  • Collecting logged failure cases from actual use sessions would let developers target specific improvements in generation quality that the current formative study left unspecified.

Load-bearing premise

The generated audiovisual outputs for speech emotion, gestures, and camera angles are accurate and controllable enough to give useful feedback instead of misleading or distracting the writer.

What would settle it

A controlled study in which writers using the tool report that inaccurate or uncontrollable generations led to more wasted time or poorer final scripts than text-only writing would show the central claim is false.

Figures

Figures reproduced from arXiv: 2504.14776 by Bryan Wang, Eitan Grinspun, Jiaju Ma, Tovi Grossman, Zhecheng Wang.

Figure 1
Figure 1. Figure 1: The Script2Screen interface, captured from a writing session. Script2Screen allows screenwriters to seamlessly visualize and hear their scenes. Writers can select a workspace from the Scenes navigation bar (A), where the corresponding script loads into the Script Editor (B). After the script is processed, the Screen Preview (C) offers a live 3D animation with synced audio, complete with customizable camera… view at source ↗
Figure 2
Figure 2. Figure 2: Design space of audiovisual storytelling systems positioned by gran [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: In each dialogue card, when the user hovers over the text area, [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Overview of the Script2Screen text-to-audiovisual generation pipeline. The pipeline begins with the user’s written script (1), which is processed by the [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: The text annotation module parses the script using Large Language Models (LLMs) to extract dialogue, speaker names, and other narrative elements [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: The system generates rhythmic character gestures synchronized with spoken dialogue, enhancing the expressiveness of the scene. The highlighted [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Comparison of average ratings between baseline and Script2Screen across the Creative Support Index. Error bars represent the standard deviation. [PITH_FULL_IMAGE:figures/full_fig_p010_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: This bar plot illustrates the feedback from participants on various fea [PITH_FULL_IMAGE:figures/full_fig_p011_9.png] view at source ↗
read the original abstract

Scriptwriting has traditionally been text-centric, a modality that only partially conveys the produced audiovisual experience. A formative study with professional writers informed us that connecting textual and audiovisual modalities can aid ideation and iteration, especially for writing dialogues. In this work, we present Script2Screen, an AI-assisted tool that integrates scriptwriting with audiovisual scene creation in a unified, synchronized workflow. Focusing on dialogues in scripts, Script2Screen generates expressive scenes with emotional speeches and animated characters through a novel text-to-audiovisual-scene pipeline. The user interface provides fine-grained controls, allowing writers to fine-tune audiovisual elements such as character gestures, speech emotions, and camera angles. A user study with both novice and professional writers from various domains demonstrated that Script2Screen's interactive audiovisual generation enhances the scriptwriting process, facilitating iterative refinement while complementing, rather than replacing, their creative efforts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents Script2Screen, an AI-assisted tool that integrates textual scriptwriting with a text-to-audiovisual-scene pipeline to generate expressive dialogue scenes featuring emotional speech, animated characters, gestures, and camera angles. Fine-grained UI controls allow writers to adjust these elements. A formative study with professional writers informed the design, and a subsequent user study with both novice and professional writers from various domains is claimed to demonstrate that the interactive audiovisual generation enhances the scriptwriting process, supports iterative refinement, and complements rather than replaces creative efforts.

Significance. If the empirical claims hold under closer scrutiny, the work could contribute to HCI by showing how synchronized multimodal AI generation can support creative workflows in scriptwriting, a domain traditionally limited to text. The emphasis on user-controllable elements like gestures and emotions represents a practical strength that could generalize to other narrative tools.

major comments (2)
  1. [User Study] User Study section: The abstract and available description state that the user study demonstrated enhancement of the scriptwriting process, but provide no sample size, quantitative metrics, study design details (e.g., within- or between-subjects control condition such as text-only baseline), objective measures (e.g., expert-rated dialogue quality, revision counts, or completion time), or statistical tests. This leaves the central claim dependent on unblinded self-report and vulnerable to novelty or demand effects.
  2. [Abstract and System Description] Abstract and System Description: The claim that AI-generated audiovisual outputs (emotional speech, gestures, camera angles) provide useful feedback rests on an assumption of sufficient accuracy and controllability, informed by a formative study. However, no details on generation quality, failure cases, or validation of the pipeline's reliability are supplied, which is load-bearing for whether the interactive loop aids rather than misleads writers.
minor comments (2)
  1. [Related Work] Consider expanding the related work section to include more citations on prior multimodal creative tools or dialogue generation systems to better contextualize the novelty of the synchronized workflow.
  2. [Figures] Ensure that any figures depicting the UI or pipeline include detailed captions that label all interactive controls and generation stages for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below and indicate planned revisions to improve the clarity, rigor, and transparency of the work.

read point-by-point responses
  1. Referee: [User Study] User Study section: The abstract and available description state that the user study demonstrated enhancement of the scriptwriting process, but provide no sample size, quantitative metrics, study design details (e.g., within- or between-subjects control condition such as text-only baseline), objective measures (e.g., expert-rated dialogue quality, revision counts, or completion time), or statistical tests. This leaves the central claim dependent on unblinded self-report and vulnerable to novelty or demand effects.

    Authors: We agree that the user study presentation would benefit from greater detail and explicit discussion of its limitations. The study was exploratory and qualitative in focus, involving both novice and professional writers who completed dialogue writing tasks with the tool to observe iterative refinement. Data were gathered primarily through interviews, think-aloud protocols, and session observations rather than through controlled quantitative metrics or statistical testing. We will revise the User Study section to specify the participant count and backgrounds, clarify the within-subjects structure with opportunities for comparison to text-only writing, report any observable patterns (such as iteration frequency), and add an explicit discussion of reliance on self-reports along with potential biases including novelty and demand effects. This will better contextualize the claims without overstating the evidence. revision: partial

  2. Referee: [Abstract and System Description] Abstract and System Description: The claim that AI-generated audiovisual outputs (emotional speech, gestures, camera angles) provide useful feedback rests on an assumption of sufficient accuracy and controllability, informed by a formative study. However, no details on generation quality, failure cases, or validation of the pipeline's reliability are supplied, which is load-bearing for whether the interactive loop aids rather than misleads writers.

    Authors: The formative study with professional writers directly shaped the choice of controllable parameters in the text-to-audiovisual pipeline. Subsequent user study participants reported that the generated outputs supported their creative iteration despite imperfections. We acknowledge that the manuscript currently lacks explicit discussion of generation quality, failure modes, and pipeline validation. We will add a dedicated subsection (likely within System Description or Limitations) that describes the underlying models, reports observed quality characteristics from the studies, and enumerates common failure cases such as emotion-speech mismatches or gesture desynchronization, along with how the fine-grained UI controls allow users to correct or work around them. This will make the assumptions more transparent and demonstrate how the interactive loop remains useful even when outputs are imperfect. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

This is an applied HCI system paper describing an interactive tool (Script2Screen) and its evaluation via a user study with novice and professional writers. The abstract and available text contain no equations, mathematical derivations, fitted parameters, predictions, or first-principles results. Central claims rest on empirical user feedback about process enhancement and iterative refinement rather than any derivational chain. No self-citation load-bearing steps, ansatz smuggling, or reductions of outputs to inputs by construction are present. The paper is self-contained against external benchmarks (user study results) and receives a normal non-finding for circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper rests on the domain assumption that current generative models can produce usable expressive audiovisual scenes from dialogue text, and that user studies can validate creative support without detailed quantitative validation in the provided summary.

axioms (1)
  • domain assumption Connecting textual and audiovisual modalities aids ideation and iteration in dialogue scriptwriting
    Stated as informed by the formative study with professional writers.

pith-pipeline@v0.9.0 · 5687 in / 1281 out tokens · 48525 ms · 2026-05-22T19:32:10.908902+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · 2 internal anchors

  1. [1]

    Bennett, Kori Inkpen, Jaime Teevan, Ruth Kikin-Gil, and Eric Horvitz

    Saleema Amershi, Dan Weld, Mihaela Vorvoreanu, Adam Fourney, Besmira Nushi, Penny Collisson, Jina Suh, Shamsi Iqbal, Paul N. Bennett, Kori Inkpen, Jaime Tee- van, Ruth Kikin-Gil, and Eric Horvitz. 2019. Guidelines for Human-AI Interaction. InProceedings of the 2019 CHI Conference on Human Factors in Computing Systems (Glasgow, Scotland Uk)(CHI ’19). Assoc...

  2. [2]

    Tenglong Ao, Qingzhe Gao, Yuke Lou, Baoquan Chen, and Libin Liu. 2022. Rhyth- mic Gesticulator: Rhythm-Aware Co-Speech Gesture Synthesis with Hierarchical Neural Embeddings.ACM Trans. Graph.41, 6, Article 209 (nov 2022), 19 pages. https://doi.org/10.1145/3550454.3555435

  3. [3]

    Bernstein, Greg Little, Robert C

    Michael S. Bernstein, Greg Little, Robert C. Miller, Björn Hartmann, Mark S. Ackerman, David R. Karger, David Crowell, and Katrina Panovich. 2015. Soylent: a word processor with a crowd inside.Commun. ACM58, 8 (July 2015), 85–94. https://doi.org/10.1145/2791285

  4. [4]

    Stephen Brade, Bryan Wang, Mauricio Sousa, Sageev Oore, and Tovi Grossman

  5. [5]

    InProceedings of the 36th Annual ACM Symposium on User Interface Software and Technology(San Francisco, CA, USA) (UIST ’23)

    Promptify: Text-to-Image Generation through Interactive Prompt Explo- ration with Large Language Models. InProceedings of the 36th Annual ACM Symposium on User Interface Software and Technology(San Francisco, CA, USA) (UIST ’23). Association for Computing Machinery, New York, NY, USA, Article 96, 14 pages. https://doi.org/10.1145/3586183.3606725

  6. [6]

    2010.Sketching user experiences: getting the design right and the right design

    Bill Buxton. 2010.Sketching user experiences: getting the design right and the right design. Morgan kaufmann

  7. [7]

    Michelle Cavaleri and Saib Dianati. 2016. You want me to check your grammar again?: The usefulness of an online grammar checker as perceived by students. Journal of Academic Language and Learning10, 1 (2016), A223–A236

  8. [8]

    Kang Chen, Zhipeng Tan, Jin Lei, Song-Hai Zhang, Yuan-Chen Guo, Weidong Zhang, and Shi-Min Hu. 2021. ChoreoMaster: choreography-oriented music- driven dance synthesis.ACM Trans. Graph.40, 4, Article 145 (July 2021), 13 pages. https://doi.org/10.1145/3450626.3459932

  9. [9]

    Erin Cherry and Celine Latulipe. 2014. Quantifying the Creativity Support of Digital Tools through the Creativity Support Index.ACM Trans. Comput.-Hum. Interact.21, 4, Article 21 (June 2014), 25 pages. https://doi.org/10.1145/2617588

  10. [10]

    Jean-Peïc Chou, Alexa Fay Siu, Nedim Lipka, Ryan Rossi, Franck Dernoncourt, and Maneesh Agrawala. 2023. TaleStream: Supporting Story Ideation with Trope Knowledge. InProceedings of the 36th Annual ACM Symposium on User Interface Software and Technology. 1–12

  11. [11]

    John Joon Young Chung, Wooseok Kim, Kang Min Yoo, Hwaran Lee, Eytan Adar, and Minsuk Chang. 2022. TaleBrush: Sketching Stories with Generative Pretrained Language Models. InProceedings of the 2022 CHI Conference on Human Factors in Computing Systems(New Orleans, LA, USA)(CHI ’22). Association for Computing Machinery, New York, NY, USA, Article 209, 19 pag...

  12. [12]

    John Joon Young Chung and Max Kreminski. 2024. Patchview: LLM-Powered Worldbuilding with Generative Dust and Magnet Visualization.arXiv preprint arXiv:2408.04112(2024)

  13. [13]

    Marc Davis. 2003. Active capture: integrating human-computer interaction and computer vision/audition to automate media capture. In2003 International Conference on Multimedia and Expo. ICME’03. Proceedings (Cat. No. 03TH8698), Vol. 2. IEEE, II–185

  14. [14]

    Rudresh Dwivedi, Devam Dave, Het Naik, Smiti Singhal, Rana Omer, Pankesh Patel, Bin Qian, Zhenyu Wen, Tejal Shah, Graham Morgan, et al. 2023. Explainable AI (XAI): Core ideas, techniques, and solutions.Comput. Surveys55, 9 (2023), 1–33

  15. [15]

    ElevenLabs. 2024. ElevenLabs. https://elevenlabs.io TTS generation web service

  16. [16]

    Michael L Epstein, Amber D Lazarus, Tammy B Calvano, Kelly A Matthews, Rachel A Hendel, Beth B Epstein, and Gary M Brosvic. 2002. Immediate feedback assessment technique promotes learning and corrects inaccurate first responses. The Psychological Record52 (2002), 187–201

  17. [17]

    2005.Screenplay: The foundations of screenwriting

    Syd Field. 2005.Screenplay: The foundations of screenwriting. Delta

  18. [18]

    Troje, and Marc-André Carbonneau

    Saeed Ghorbani, Ylva Ferstl, Daniel Holden, Nikolaus F. Troje, and Marc-André Carbonneau. 2023. ZeroEGGS: Zero-shot Example-based Gesture Generation from Speech.Computer Graphics Forum42, 1 (2023), 206–216. https://doi.org/10. 1111/cgf.14734 arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1111/cgf.14734

  19. [19]

    Chieh-Yang Huang, Shih-Hong Huang, and Ting-Hao Kenneth Huang. 2020. Heteroglossia: In-situ story ideation with the crowd. InProceedings of the 2020 CHI Conference on Human Factors in Computing Systems. 1–12

  20. [20]

    Qingqiu Huang, Yu Xiong, Anyi Rao, Jiaze Wang, and Dahua Lin. 2020. MovieNet: A Holistic Dataset for Movie Understanding. InComputer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV(Glasgow, United Kingdom). Springer-Verlag, Berlin, Heidelberg, 709–727. https://doi.org/10.1007/978-3-030-58548-8_41

  21. [21]

    Paul Huffman and James Hutson. 2024. Enhancing History Education with Google NotebookLM: Case Study of Mary Easton Sibley’s Diary for Multimedia Content and Podcast Creation.ISRG Journal of Arts, Humanities and Social Sciences2, 5 (2024)

  22. [22]

    Maurice Jakesch, Advait Bhat, Daniel Buschek, Lior Zalmanson, and Mor Naaman

  23. [23]

    In Proceedings of the 2023 CHI conference on human factors in computing systems

    Co-writing with opinionated language models affects users’ views. In Proceedings of the 2023 CHI conference on human factors in computing systems. 1–15

  24. [24]

    Zeyu Jin, Gautham J Mysore, Stephen Diverdi, Jingwan Lu, and Adam Finkelstein

  25. [25]

    Voco: Text-based insertion and replacement in audio narration.ACM Transactions on Graphics (TOG)36, 4 (2017), 1–13

  26. [26]

    Hye-Young Jo, Ryo Suzuki, and Yoonji Kim. 2024. CollageVis: Rapid Previsualiza- tion Tool for Indie Filmmaking using Video Collages. InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems(Honolulu, HI, USA) (CHI ’24). Association for Computing Machinery, New York, NY, USA, Article 164, 16 pages. https://doi.org/10.1145/3613904.3642575

  27. [27]

    Jun Kato, Kenta Hara, and Nao Hirasawa. 2024. Griffith: A Storyboarding Tool Designed with Japanese Animation Professionals. InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems(Honolulu, HI, USA)(CHI ’24). Association for Computing Machinery, New York, NY, USA, Article 233, 14 pages. https://doi.org/10.1145/3613904.3642121

  28. [28]

    Script2Screen: Supporting Dialogue Scriptwriting with Interactive Audiovisual Generation•15

    Eugene Kharitonov, Damien Vincent, Zalán Borsos, Raphaël Marinier, Sertan Girgin, Olivier Pietquin, Matt Sharifi, Marco Tagliasacchi, and Neil Zeghidour. Script2Screen: Supporting Dialogue Scriptwriting with Interactive Audiovisual Generation•15

  29. [29]

    Speak, read and prompt: High-fidelity text-to-speech with minimal super- vision.Transactions of the Association for Computational Linguistics11 (2023), 1703–1718

  30. [30]

    Han-Jong Kim, Chang Min Kim, and Tek-Jin Nam. 2018. SketchStudio: Experience Prototyping with 2.5-Dimensional Animated Design Scenarios. InProceedings of the 2018 Designing Interactive Systems Conference(Hong Kong, China)(DIS ’18). Association for Computing Machinery, New York, NY, USA, 831–843. https: //doi.org/10.1145/3196709.3196736

  31. [31]

    Philippe Laban, Elicia Ye, Srujay Korlakunta, John Canny, and Marti Hearst. 2022. Newspod: Automatic and interactive news podcasts. InProceedings of the 27th International Conference on Intelligent User Interfaces. 691–706

  32. [32]

    Stefan Larsson and Fredrik Heintz. 2020. Transparency in artificial intelligence. Internet policy review9, 2 (2020)

  33. [33]

    Mackenzie Leake, Abe Davis, Anh Truong, and Maneesh Agrawala. 2017. Com- putational video editing for dialogue-driven scenes.ACM Trans. Graph.36, 4, Article 130 (July 2017), 14 pages. https://doi.org/10.1145/3072959.3073653

  34. [34]

    Kim, and Maneesh Agrawala

    Mackenzie Leake, Hijung Valentina Shin, Joy O. Kim, and Maneesh Agrawala. 2020. Generating Audio-Visual Slideshows from Text Articles Using Word Concreteness. InProceedings of the 2020 CHI Conference on Human Factors in Computing Systems (Honolulu, HI, USA)(CHI ’20). Association for Computing Machinery, New York, NY, USA, 1–11. https://doi.org/10.1145/331...

  35. [35]

    Mina Lee, Percy Liang, and Qian Yang. 2022. CoAuthor: Designing a Human-AI Collaborative Writing Dataset for Exploring Language Model Capabilities. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (New Orleans, LA, USA)(CHI ’22). Association for Computing Machinery, New York, NY, USA, Article 388, 19 pages. https://doi.org/1...

  36. [36]

    Susan Lin, Jeremy Warner, JD Zamfirescu-Pereira, Matthew G Lee, Sauhard Jain, Shanqing Cai, Piyawat Lertvittayakumjorn, Michael Xuelin Huang, Shumin Zhai, Björn Hartmann, et al . 2024. Rambler: Supporting Writing With Speech via LLM-Assisted Gist Manipulation. InProceedings of the CHI Conference on Human Factors in Computing Systems. 1–19

  37. [37]

    Vivian Liu, Tao Long, Nathan Raw, and Lydia Chilton. 2023. Generative disco: Text-to-video generation for music visualization.arXiv preprint arXiv:2304.08551 (2023)

  38. [38]

    Jiaju Ma, Li-Yi Wei, and Rubaiat Habib Kazi. 2022. A Layered Authoring Tool for Stylized 3D animations. InProceedings of the 2022 CHI Conference on Human Factors in Computing Systems(New Orleans, LA, USA)(CHI ’22). Association for Computing Machinery, New York, NY, USA, Article 383, 14 pages. https: //doi.org/10.1145/3491102.3501894

  39. [39]

    Marcel Marti, Jodok Vieli, Wojciech Witoń, Rushit Sanghrajka, Daniel Inversini, Diana Wotruba, Isabel Simo, Sasha Schriber, Mubbasir Kapadia, and Markus Gross

  40. [40]

    InProceedings of the 23rd International Conference on Intelligent User Interfaces(Tokyo, Japan) (IUI ’18)

    CARDINAL: Computer Assisted Authoring of Movie Scripts. InProceedings of the 23rd International Conference on Intelligent User Interfaces(Tokyo, Japan) (IUI ’18). Association for Computing Machinery, New York, NY, USA, 509–519. https://doi.org/10.1145/3172944.3172972

  41. [41]

    Justin Matejka, Michael Glueck, Erin Bradner, Ali Hashemi, Tovi Grossman, and George Fitzmaurice. 2018. Dream Lens: Exploration and Visualization of Large-Scale Generative Design Datasets. InProceedings of the 2018 CHI Conference on Human Factors in Computing Systems(Montreal QC, Canada) (CHI ’18). Association for Computing Machinery, New York, NY, USA, 1...

  42. [42]

    Robert McKee. 1997. Substance, structure, style, and the principles of screenwrit- ing.Alba Editorial(1997)

  43. [43]

    There’s so much responsibility on users right now:

    Piotr Mirowski, Kory W. Mathewson, Jaylen Pittman, and Richard Evans. 2023. Co-Writing Screenplays and Theatre Scripts with Language Models: Evaluation by Industry Professionals. InProceedings of the 2023 CHI Conference on Human Factors in Computing Systems(Hamburg, Germany)(CHI ’23). Association for Computing Machinery, New York, NY, USA, Article 355, 34...

  44. [44]

    Bryan, Juan-Pablo Caceres, and Bryan Pardo

    Max Morrison, Lucas Rencker, Zeyu Jin, Nicholas J. Bryan, Juan-Pablo Caceres, and Bryan Pardo. 2021. Context-Aware Prosody Correction for Text-Based Speech Editing. arXiv:2102.08328 [eess.AS] https://arxiv.org/abs/2102.08328

  45. [45]

    Daniel Naber et al. 2003. A rule-based style and grammar checker. (2003)

  46. [46]

    OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL]

  47. [47]

    Amy Pavel, Gabriel Reyes, and Jeffrey P Bigham. 2020. Rescribe: Authoring and automatically editing audio descriptions. InProceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology. 747–759

  48. [48]

    Anyi Rao, Jean-Peïc Chou, and Maneesh Agrawala. 2024. ScriptViz: A Vi- sualization Tool to Aid Scriptwriting based on a Large Movie Database. arXiv:2410.03224 [cs.HC] https://arxiv.org/abs/2410.03224

  49. [49]

    Anyi Rao, Jean-Peïc Chou, and Maneesh Agrawala. 2024. ScriptViz: A Visualiza- tion Tool to Aid Scriptwriting based on a Large Movie Database. InProceedings of the 37th Annual ACM Symposium on User Interface Software and Technology

  50. [50]

    Anyi Rao, Xuekun Jiang, Yuwei Guo, Linning Xu, Lei Yang, Libiao Jin, Dahua Lin, and Bo Dai. 2023. Dynamic Storyboard Generation in an Engine-based Virtual Environment for Video Production. InACM SIGGRAPH 2023 Posters(Los Angeles, CA, USA)(SIGGRAPH ’23). Association for Computing Machinery, New York, NY, USA, Article 40, 2 pages. https://doi.org/10.1145/35...

  51. [51]

    Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2019. Fastspeech: Fast, robust and controllable text to speech.Advances in neural information processing systems32 (2019)

  52. [52]

    Mohi Reza, Nathan M Laundry, Ilya Musabirov, Peter Dushniku, Zhi Yuan “Michael” Yu, Kashish Mittal, Tovi Grossman, Michael Liut, Anastasia Kuzminykh, and Joseph Jay Williams. 2024. ABScribe: Rapid Exploration & Organization of Multiple Writing Variations in Human-AI Co-Writing Tasks using Large Language Models. InProceedings of the 2024 CHI Conference on ...

  53. [53]

    Steve Rubin and Maneesh Agrawala. 2014. Generating emotionally relevant musical scores for audio stories. InProceedings of the 27th annual ACM symposium on User interface software and technology. 439–448

  54. [54]

    Steve Rubin, Floraine Berthouzoz, Gautham J Mysore, and Maneesh Agrawala

  55. [55]

    InProceedings of the 28th Annual ACM Symposium on User Interface Software & Technology

    Capture-time feedback for recording scripted narration. InProceedings of the 28th Annual ACM Symposium on User Interface Software & Technology. 191–199

  56. [56]

    Steve Rubin, Floraine Berthouzoz, Gautham J Mysore, Wilmot Li, and Maneesh Agrawala. 2013. Content-based tools for editing audio stories. InProceedings of the 26th annual ACM symposium on User interface software and technology. 113–122

  57. [57]

    Prem Seetharaman, Gautham Mysore, Bryan Pardo, Paris Smaragdis, and Celso Gomes. 2019. Voiceassist: Guiding users to high-quality voice recordings. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. 1–6

  58. [58]

    Hijung Valentina Shin, Wilmot Li, and Frédo Durand. 2016. Dynamic authoring of audio with linked scripts. InProceedings of the 29th Annual Symposium on User Interface Software and Technology. 509–516

  59. [59]

    Daxin Tan, Liqun Deng, Yu Ting Yeung, Xin Jiang, Xiao Chen, and Tan Lee

  60. [60]

    In2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)

    Editspeech: A text based speech editing system using partial inference and bidirectional fusion. In2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 626–633

  61. [61]

    Shrikant Venkataramani, Paris Smaragdis, and Gautham Mysore. 2017. AutoDub: Automatic Redubbing for Voiceover Editing. InProceedings of the 30th Annual ACM Symposium on User Interface Software and Technology. 533–538

  62. [62]

    Bryan Wang, Zeyu Jin, and Gautham Mysore. 2022. Record Once, Post Every- where: Automatic Shortening of Audio Stories for Social Media. InProceedings of the 35th Annual ACM Symposium on User Interface Software and Technology. 1–11

  63. [63]

    Bryan Wang, Yuliang Li, Zhaoyang Lv, Haijun Xia, Yan Xu, and Raj Sodhi. 2024. LAVE: LLM-Powered Agent Assistance and Language Augmentation for Video Editing. InProceedings of the 29th International Conference on Intelligent User Interfaces(Greenville, SC, USA)(IUI ’24). Association for Computing Machinery, New York, NY, USA, 699–714. https://doi.org/10.11...

  64. [64]

    Bryan Wang, Meng Yu Yang, and Tovi Grossman. 2021. Soloist: Generating mixed-initiative tutorials from existing guitar instructional videos through au- dio processing. InProceedings of the 2021 CHI Conference on Human Factors in Computing Systems. 1–14

  65. [65]

    Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, et al . 2017. Tacotron: Towards end-to-end speech synthesis.arXiv preprint arXiv:1703.10135 (2017)

  66. [66]

    Zhijie Wang, Yuheng Huang, Da Song, Lei Ma, and Tianyi Zhang. 2024. PromptCharm: Text-to-Image Generation through Multi-modal Prompting and Refinement. InProceedings of the 2024 CHI Conference on Human Factors in Com- puting Systems

  67. [67]

    Haijun Xia. 2020. Crosspower: Bridging Graphics and Linguistics. InProceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology. 722–734

  68. [68]

    Haijun Xia, Jennifer Jacobs, and Maneesh Agrawala. 2020. Crosscast: Adding Visuals to Audio Travel Podcasts. InProceedings of the 33rd Annual ACM Sympo- sium on User Interface Software and Technology(Virtual Event, USA)(UIST ’20). Association for Computing Machinery, New York, NY, USA, 735–746. https: //doi.org/10.1145/3379337.3415882

  69. [69]

    Ann Yuan, Andy Coenen, Emily Reif, and Daphne Ippolito. 2022. Wordcraft: story writing with large language models. InProceedings of the 27th International Conference on Intelligent User Interfaces. 841–852

  70. [70]

    id " :

    Yeshuang Zhu, Shichao Yue, Chun Yu, and Yuanchun Shi. 2017. CEPT: Collabora- tive editing tool for non-native authors. InProceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing. 273–285. 16•Wang, et al. A PROMPT DESIGN AND EXAMPLES In this appendix, we present the detailed prompts and examples used for suggesting...