pith. machine review for the scientific record.

arxiv: 2601.11848 · v2 · submitted 2026-01-17 · 💻 cs.HC

Recognition: no theorem link

Compass vs Railway Tracks: Unpacking User Mental Models for Communicating Long-Horizon Work to Humans vs. AI

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 14:06 UTC · model grok-4.3

classification 💻 cs.HC
keywords user mental models · human-AI collaboration · long-horizon tasks · delegation strategies · prompting · AI interfaces

The pith

Users communicate long-horizon tasks to AI through rigid, exhaustive instructions, unlike the flexible, high-level intent they give to humans.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how 16 professionals draft specifications for long-horizon work when delegating to a human colleague versus an AI system. Participants offered high-level goals to humans to support flexible exploration but supplied detailed step-by-step instructions to AI to reduce ambiguity and deviation. This split reflects a view that current AI cannot reliably infer intent, set priorities, or exercise independent judgment. The study also surfaces a desired future state in which AI combines its efficiency and context capacity with human-like critical thinking and agency. These patterns point toward concrete ways to redesign AI interfaces for extended tasks.

Core claim

Participants treated human delegation as a compass, offering high-level intent to encourage flexible exploration. In contrast, communication with AI resembled painstakingly laying down railway tracks: rigid, exhaustive instructions to minimize ambiguity and deviation. This reflected a perception that current AI struggles to infer intent, prioritize, and make judgments on its own. When envisioning an ideal AI collaborator, users desired a hybrid blending AI efficiency and large context window with the critical thinking and agency of a human colleague.

What carries the argument

The compass versus railway tracks mental models, which describe how users adapt their communication style based on whether the recipient is a human or an AI.

Load-bearing premise

The mental models and communication patterns seen in this sample of 16 professionals will hold for other users and will continue to match AI's actual capabilities rather than just current perceptions of its limits.

What would settle it

A follow-up study in which the same participants interact with a more capable AI on long-horizon tasks: if they shift toward providing high-level intent instead of exhaustive instructions, the observed divergence reflects perceived rather than inherent limits.

Figures

Figures reproduced from arXiv: 2601.11848 by Alexander J. Fiannaca, Carrie J. Cai, Michael Terry, Michael Xieyang Liu, Savvas Petridis.

Figure 1. Illustrative excerpts from two example specifications from our study, written by participant P8 for a human (left).
Original abstract

As AI systems grow increasingly capable of operating for hours or days at a time, users' prompts are transforming into elaborate specifications for the AI to autonomously work on. While prompting for bounded, single-turn tasks has been extensively studied, less is known about how people communicate specifications for long-horizon tasks. We conducted a qualitative study in which 16 professionals drafted specifications for both a human colleague and an AI, revealing a core divergence: participants treated human delegation as a "compass", offering high-level intent to encourage flexible exploration. In contrast, communication with AI resembled painstakingly laying down "railway tracks": rigid, exhaustive instructions to minimize ambiguity and deviation. This reflected a perception that current AI struggles to infer intent, prioritize, and make judgments on its own. When envisioning an ideal AI collaborator, users desired a hybrid: a collaborator blending AI's efficiency and large context window with the critical thinking and agency of a human colleague. We discuss design implications for future AI systems, proposing that they align on outcomes through generated rough drafts, verify feasibility via end-to-end "test runs," and monitor execution through intelligent check-ins -- ultimately transforming AI from a passive instruction-follower into a reliable collaborator for ambiguous, long-horizon tasks.
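The abstract's proposed workflow, aligning on outcomes through generated rough drafts, verifying feasibility via end-to-end test runs, and monitoring execution through intelligent check-ins, can be sketched as a control loop. This is an illustrative sketch only; all class, function, and variable names are assumptions, not APIs or an implementation from the paper.

```python
# Hypothetical sketch of the three-stage workflow the abstract proposes:
# (1) align on outcomes via a generated rough draft, (2) execute an
# end-to-end test run, (3) pause for intelligent check-ins on ambiguity.
# Every name below is illustrative, not from the paper.
from dataclasses import dataclass, field

@dataclass
class LongHorizonTask:
    goal: str                       # high-level intent (the "compass")
    approved_draft: str = ""        # outcome-alignment artifact
    log: list = field(default_factory=list)

def align_on_outcome(task, draft_fn, approve_fn):
    """Stage 1: generate a rough draft of the outcome and get user sign-off."""
    draft = draft_fn(task.goal)
    while not approve_fn(draft):
        draft = draft_fn(task.goal + " (revised)")
    task.approved_draft = draft
    return task

def run_with_checkins(task, step_fn, checkin_fn, steps):
    """Stages 2-3: run each step end to end, checking in when ambiguous."""
    for step in steps:
        result, ambiguous = step_fn(step, task.approved_draft)
        if ambiguous:
            result = checkin_fn(step, result)  # intelligent check-in
        task.log.append((step, result))
    return task

# Toy usage with stub callables standing in for the AI and the user.
task = LongHorizonTask(goal="summarize Q3 research findings")
task = align_on_outcome(
    task,
    draft_fn=lambda g: f"outline for: {g}",
    approve_fn=lambda d: True,      # user approves the first draft
)
task = run_with_checkins(
    task,
    step_fn=lambda s, d: (f"done:{s}", s == "pick sources"),
    checkin_fn=lambda s, r: r + " (confirmed with user)",
    steps=["pick sources", "draft summary"],
)
```

In this sketch the check-in fires only on the step flagged ambiguous, which mirrors the paper's aim of replacing exhaustive up-front "railway tracks" with targeted mid-execution alignment.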

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper reports a qualitative study with 16 professionals who drafted specifications for a long-horizon task once for a human colleague and once for an AI. It identifies a core divergence in mental models: human delegation framed as a 'compass' providing high-level intent to support flexible exploration, versus AI communication as 'railway tracks' with rigid, exhaustive instructions to minimize ambiguity due to perceived AI limitations in inferring intent, prioritizing, and exercising judgment. The authors envision an ideal AI as a hybrid blending efficiency with human-like critical thinking and propose design implications including generated rough drafts, end-to-end test runs, and intelligent check-ins.

Significance. If the observed patterns hold, the work contributes to HCI and AI interaction research by surfacing distinct user strategies for long-horizon delegation, an increasingly relevant area as AI handles extended autonomous tasks. The compass/railway metaphor and hybrid collaborator vision provide concrete, actionable insights for interface and system design. The qualitative approach directly grounds claims in participant statements rather than modeled assumptions.

major comments (3)
  1. [Methods] Methods section: The manuscript provides no details on the interview protocol, how the hypothetical drafting task was presented, the qualitative analysis and coding process, or inter-rater reliability. This absence makes it impossible to assess the robustness of the central compass/railway divergence claim.
  2. [Findings] Findings and Discussion: The reported mental models rest entirely on self-reported strategies from a single hypothetical scenario without validation against actual task execution, real AI interactions, or behavioral outcomes, weakening the evidential link to underlying user cognition.
  3. [Limitations] Limitations: The small sample (n=16) of professionals and restriction to one hypothetical task domain limit claims about generalizability to broader user populations or real-world long-horizon AI usage, yet the discussion does not sufficiently qualify the scope of the proposed design implications.
minor comments (1)
  1. [Abstract] Abstract: Could briefly note the sample size and qualitative nature to better set reader expectations for the strength of evidence.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive feedback. We appreciate the opportunity to strengthen the manuscript. Below we respond to each major comment and indicate the revisions we will make.

Point-by-point responses
  1. Referee: [Methods] Methods section: The manuscript provides no details on the interview protocol, how the hypothetical drafting task was presented, the qualitative analysis and coding process, or inter-rater reliability. This absence makes it impossible to assess the robustness of the central compass/railway divergence claim.

    Authors: We agree that the methods section lacks sufficient detail. In the revised manuscript, we will expand the Methods section to include a full description of the semi-structured interview protocol, the exact wording and presentation of the hypothetical task to participants, the thematic analysis process including how codes were developed and applied, and any measures taken for reliability such as multiple coders reviewing transcripts. This will allow readers to better evaluate the findings. revision: yes

  2. Referee: [Findings] Findings and Discussion: The reported mental models rest entirely on self-reported strategies from a single hypothetical scenario without validation against actual task execution, real AI interactions, or behavioral outcomes, weakening the evidential link to underlying user cognition.

    Authors: This is a valid concern regarding the scope of our claims. Our study was designed as an exploratory qualitative investigation into reported mental models through hypothetical scenarios, which is a common approach in HCI for surfacing initial insights. However, we recognize that self-reports may not fully capture actual behavior. In the revision, we will strengthen the Discussion to explicitly note this limitation and frame the findings as hypotheses for future empirical validation with real tasks and AI systems. We will also add suggestions for follow-up studies involving behavioral measures. revision: partial

  3. Referee: [Limitations] Limitations: The small sample (n=16) of professionals and restriction to one hypothetical task domain limit claims about generalizability to broader user populations or real-world long-horizon AI usage, yet the discussion does not sufficiently qualify the scope of the proposed design implications.

    Authors: We concur that the limitations section should more explicitly address generalizability. We will revise the Limitations and Discussion sections to better qualify the scope, emphasizing that the findings are based on a small sample of professionals in specific domains and one task type, and that the design implications are speculative and intended as starting points for future work rather than broadly generalizable recommendations. revision: yes

Circularity Check

0 steps flagged

No circularity: claims rest on direct qualitative data

Full rationale

The paper reports a qualitative study with 16 professionals who drafted specifications for human vs. AI collaborators in a hypothetical long-horizon task. Core claims (compass vs. railway tracks mental models) are presented as patterns observed in participant statements and behaviors, without equations, fitted parameters, predictive models, or self-citations that reduce any result to its own inputs by construction. No uniqueness theorems, ansatzes, or renamings of known results appear. The derivation chain is self-contained against the collected interview and drafting data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a qualitative empirical study with no mathematical derivations, fitted parameters, or postulated entities.

pith-pipeline@v0.9.0 · 5538 in / 938 out tokens · 33105 ms · 2026-05-16T14:06:23.156837+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Omakase: proactive assistance with actionable suggestions for evolving scientific research projects

cs.HC · 2026-04 · unverdicted · novelty 4.0

    Omakase monitors project documents to infer timely queries and distills research reports into actionable suggestions that users rated significantly more useful than raw reports.

Reference graph

Works this paper leans on

96 extracted references · 96 canonical work pages · cited by 1 Pith paper · 19 internal anchors
