pith. machine review for the scientific record.

arxiv: 2601.11848 · v2 · submitted 2026-01-17 · 💻 cs.HC

Recognition: no theorem link

Compass vs Railway Tracks: Unpacking User Mental Models for Communicating Long-Horizon Work to Humans vs. AI

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 14:06 UTC · model grok-4.3

classification 💻 cs.HC
keywords user mental models · human-AI collaboration · long-horizon tasks · delegation strategies · prompting · AI interfaces

The pith

Users communicate long-horizon tasks to AI through rigid, exhaustive instructions, unlike the flexible, high-level intent they give to humans.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how 16 professionals draft specifications for long-horizon work when delegating to a human colleague versus an AI system. Participants offered high-level goals to humans to support flexible exploration but supplied detailed step-by-step instructions to AI to reduce ambiguity and deviation. This split reflects a view that current AI cannot reliably infer intent, set priorities, or exercise independent judgment. The study also surfaces a desired future state in which AI combines its efficiency and context capacity with human-like critical thinking and agency. These patterns point toward concrete ways to redesign AI interfaces for extended tasks.

Core claim

Participants treated human delegation as a compass, offering high-level intent to encourage flexible exploration. In contrast, communication with AI resembled painstakingly laying down railway tracks: rigid, exhaustive instructions to minimize ambiguity and deviation. This reflected a perception that current AI struggles to infer intent, prioritize, and make judgments on its own. When envisioning an ideal AI collaborator, users desired a hybrid blending AI efficiency and large context window with the critical thinking and agency of a human colleague.

What carries the argument

The compass versus railway tracks mental models, which describe how users adapt their communication style based on whether the recipient is a human or an AI.

Load-bearing premise

The mental models and communication patterns seen in this sample of 16 professionals will hold for other users and will continue to match AI's actual capabilities rather than just current perceptions of its limits.

What would settle it

A follow-up study in which the same participants interact with a more capable AI on long-horizon tasks: if they shift toward providing high-level intent instead of exhaustive instructions, the observed divergence reflects perceived rather than inherent limits.

Figures

Figures reproduced from arXiv: 2601.11848 by Alexander J. Fiannaca, Carrie J. Cai, Michael Terry, Michael Xieyang Liu, Savvas Petridis.

Figure 1. Illustrative excerpts from two example specifications from our study, written by participant P8 for a human (left).
Original abstract

As AI systems grow increasingly capable of operating for hours or days at a time, users' prompts are transforming into elaborate specifications for the AI to autonomously work on. While prompting for bounded, single-turn tasks has been extensively studied, less is known about how people communicate specifications for long-horizon tasks. We conducted a qualitative study in which 16 professionals drafted specifications for both a human colleague and an AI, revealing a core divergence: participants treated human delegation as a "compass", offering high-level intent to encourage flexible exploration. In contrast, communication with AI resembled painstakingly laying down "railway tracks": rigid, exhaustive instructions to minimize ambiguity and deviation. This reflected a perception that current AI struggles to infer intent, prioritize, and make judgments on its own. When envisioning an ideal AI collaborator, users desired a hybrid: a collaborator blending AI's efficiency and large context window with the critical thinking and agency of a human colleague. We discuss design implications for future AI systems, proposing that they align on outcomes through generated rough drafts, verify feasibility via end-to-end "test runs," and monitor execution through intelligent check-ins -- ultimately transforming AI from a passive instruction-follower into a reliable collaborator for ambiguous, long-horizon tasks.
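The abstract's proposed workflow, aligning on outcomes through generated rough drafts, verifying feasibility via end-to-end test runs, and monitoring execution through intelligent check-ins, can be sketched as a control loop. This is an illustrative sketch only; all class, function, and variable names are assumptions, not APIs or an implementation from the paper.

```python
# Hypothetical sketch of the three-stage workflow the abstract proposes:
# (1) align on outcomes via a generated rough draft, (2) execute an
# end-to-end test run, (3) pause for intelligent check-ins on ambiguity.
# Every name below is illustrative, not from the paper.
from dataclasses import dataclass, field

@dataclass
class LongHorizonTask:
    goal: str                       # high-level intent (the "compass")
    approved_draft: str = ""        # outcome-alignment artifact
    log: list = field(default_factory=list)

def align_on_outcome(task, draft_fn, approve_fn):
    """Stage 1: generate a rough draft of the outcome and get user sign-off."""
    draft = draft_fn(task.goal)
    while not approve_fn(draft):
        draft = draft_fn(task.goal + " (revised)")
    task.approved_draft = draft
    return task

def run_with_checkins(task, step_fn, checkin_fn, steps):
    """Stages 2-3: run each step end to end, checking in when ambiguous."""
    for step in steps:
        result, ambiguous = step_fn(step, task.approved_draft)
        if ambiguous:
            result = checkin_fn(step, result)  # intelligent check-in
        task.log.append((step, result))
    return task

# Toy usage with stub callables standing in for the AI and the user.
task = LongHorizonTask(goal="summarize Q3 research findings")
task = align_on_outcome(
    task,
    draft_fn=lambda g: f"outline for: {g}",
    approve_fn=lambda d: True,      # user approves the first draft
)
task = run_with_checkins(
    task,
    step_fn=lambda s, d: (f"done:{s}", s == "pick sources"),
    checkin_fn=lambda s, r: r + " (confirmed with user)",
    steps=["pick sources", "draft summary"],
)
```

In this sketch the check-in fires only on the step flagged ambiguous, which mirrors the paper's aim of replacing exhaustive up-front "railway tracks" with targeted mid-execution alignment.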

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper reports a qualitative study with 16 professionals who drafted specifications for a long-horizon task once for a human colleague and once for an AI. It identifies a core divergence in mental models: human delegation framed as a 'compass' providing high-level intent to support flexible exploration, versus AI communication as 'railway tracks' with rigid, exhaustive instructions to minimize ambiguity due to perceived AI limitations in inferring intent, prioritizing, and exercising judgment. The authors envision an ideal AI as a hybrid blending efficiency with human-like critical thinking and propose design implications including generated rough drafts, end-to-end test runs, and intelligent check-ins.

Significance. If the observed patterns hold, the work contributes to HCI and AI interaction research by surfacing distinct user strategies for long-horizon delegation, an increasingly relevant area as AI handles extended autonomous tasks. The compass/railway metaphor and hybrid collaborator vision provide concrete, actionable insights for interface and system design. The qualitative approach directly grounds claims in participant statements rather than modeled assumptions.

major comments (3)
  1. [Methods] Methods section: The manuscript provides no details on the interview protocol, how the hypothetical drafting task was presented, the qualitative analysis and coding process, or inter-rater reliability. This absence makes it impossible to assess the robustness of the central compass/railway divergence claim.
  2. [Findings] Findings and Discussion: The reported mental models rest entirely on self-reported strategies from a single hypothetical scenario without validation against actual task execution, real AI interactions, or behavioral outcomes, weakening the evidential link to underlying user cognition.
  3. [Limitations] Limitations: The small sample (n=16) of professionals and restriction to one hypothetical task domain limit claims about generalizability to broader user populations or real-world long-horizon AI usage, yet the discussion does not sufficiently qualify the scope of the proposed design implications.
minor comments (1)
  1. [Abstract] Abstract: Could briefly note the sample size and qualitative nature to better set reader expectations for the strength of evidence.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive feedback. We appreciate the opportunity to strengthen the manuscript. Below we respond to each major comment and indicate the revisions we will make.

Point-by-point responses
  1. Referee: [Methods] Methods section: The manuscript provides no details on the interview protocol, how the hypothetical drafting task was presented, the qualitative analysis and coding process, or inter-rater reliability. This absence makes it impossible to assess the robustness of the central compass/railway divergence claim.

    Authors: We agree that the methods section lacks sufficient detail. In the revised manuscript, we will expand the Methods section to include a full description of the semi-structured interview protocol, the exact wording and presentation of the hypothetical task to participants, the thematic analysis process including how codes were developed and applied, and any measures taken for reliability such as multiple coders reviewing transcripts. This will allow readers to better evaluate the findings. revision: yes

  2. Referee: [Findings] Findings and Discussion: The reported mental models rest entirely on self-reported strategies from a single hypothetical scenario without validation against actual task execution, real AI interactions, or behavioral outcomes, weakening the evidential link to underlying user cognition.

    Authors: This is a valid concern regarding the scope of our claims. Our study was designed as an exploratory qualitative investigation into reported mental models through hypothetical scenarios, which is a common approach in HCI for surfacing initial insights. However, we recognize that self-reports may not fully capture actual behavior. In the revision, we will strengthen the Discussion to explicitly note this limitation and frame the findings as hypotheses for future empirical validation with real tasks and AI systems. We will also add suggestions for follow-up studies involving behavioral measures. revision: partial

  3. Referee: [Limitations] Limitations: The small sample (n=16) of professionals and restriction to one hypothetical task domain limit claims about generalizability to broader user populations or real-world long-horizon AI usage, yet the discussion does not sufficiently qualify the scope of the proposed design implications.

    Authors: We concur that the limitations section should more explicitly address generalizability. We will revise the Limitations and Discussion sections to better qualify the scope, emphasizing that the findings are based on a small sample of professionals in specific domains and one task type, and that the design implications are speculative and intended as starting points for future work rather than broadly generalizable recommendations. revision: yes

Circularity Check

0 steps flagged

No circularity: claims rest on direct qualitative data

Full rationale

The paper reports a qualitative study with 16 professionals who drafted specifications for human vs. AI collaborators in a hypothetical long-horizon task. Core claims (compass vs. railway tracks mental models) are presented as patterns observed in participant statements and behaviors, without equations, fitted parameters, predictive models, or self-citations that reduce any result to its own inputs by construction. No uniqueness theorems, ansatzes, or renamings of known results appear. The derivation chain is self-contained against the collected interview and drafting data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a qualitative empirical study with no mathematical derivations, fitted parameters, or postulated entities.

pith-pipeline@v0.9.0 · 5538 in / 938 out tokens · 33105 ms · 2026-05-16T14:06:23.156837+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Omakase: proactive assistance with actionable suggestions for evolving scientific research projects

cs.HC · 2026-04 · unverdicted · novelty 4.0

    Omakase monitors project documents to infer timely queries and distills research reports into actionable suggestions that users rated significantly more useful than raw reports.

Reference graph

Works this paper leans on

96 extracted references · 96 canonical work pages · cited by 1 Pith paper · 19 internal anchors
