The Invisible Mentor: Inferring User Actions from Screen Recordings to Recommend Better Workflows

Andrew Head; Chris Parnin; Emerson Murphy-Hill; Ken Milne; Litao Yan; Sumit Gulwani; Vu Le

arxiv: 2509.26557 · v1 · submitted 2025-09-30 · 💻 cs.HC

Litao Yan, Andrew Head, Ken Milne, Vu Le, Sumit Gulwani, Chris Parnin, Emerson Murphy-Hill

Pith reviewed 2026-05-18 11:49 UTC · model grok-4.3

classification 💻 cs.HC

keywords: screen recordings · workflow recommendations · vision-language models · AI assistants · user action inference · software efficiency · human-computer interaction

The pith

Screen recordings can be turned into precise workflow suggestions using a vision-language model pipeline

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces InvisibleMentor, a system that processes screen recordings of users working in tools like Excel to automatically spot inefficient patterns such as repetitive edits and then recommend better alternatives. It does this through a two-stage process without requiring the user to describe their goals or problems. A vision-language model first extracts actions and context from the raw video, after which a language model produces structured suggestions. This matters because many users miss more efficient methods in complex software, and current AI assistants depend on imprecise or effortful prompts from the user. In testing, the system correctly identified workflow issues and participants rated its output as more actionable, tailored, and useful for learning than suggestions from a prompt-based spreadsheet assistant.

Core claim

InvisibleMentor turns screen recordings of task completion into vision-grounded reflections on tasks. It detects issues such as repetitive edits and recommends more efficient alternatives based on observed behavior. Unlike prior systems that rely on logs, APIs, or user prompts, InvisibleMentor operates directly on screen recordings. It uses a two-stage pipeline: a vision-language model reconstructs actions and context, and a language model generates structured, high-fidelity suggestions.

What carries the argument

Two-stage pipeline in which a vision-language model reconstructs user actions and task context directly from screen recordings, followed by a language model that produces structured suggestions for more efficient workflows.
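
The two-stage structure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Action` schema, the function names, and both stub stages are hypothetical, and a real system would replace the stubs with calls to a vision-language model and a language model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Action:
    """One reconstructed user action (hypothetical schema, not taken from the paper)."""
    timestamp_s: float  # seconds from the start of the recording
    name: str           # e.g. "edit_cell"
    target: str         # e.g. "B2"


# Both stages are injected as callables so any model client can be plugged in.
ReconstructFn = Callable[[Sequence[bytes]], List[Action]]  # stage 1: frames -> actions
SuggestFn = Callable[[List[Action]], List[str]]            # stage 2: actions -> suggestions


def run_pipeline(frames: Sequence[bytes], reconstruct: ReconstructFn, suggest: SuggestFn) -> List[str]:
    """Run stage 1, order the actions by time, then run stage 2 on the result."""
    actions = sorted(reconstruct(frames), key=lambda a: a.timestamp_s)
    return suggest(actions)


def stub_reconstruct(frames: Sequence[bytes]) -> List[Action]:
    """Stand-in for a vision-language model: pretends each frame is one cell edit."""
    return [Action(float(i), "edit_cell", f"B{i + 2}") for i in range(len(frames))]


def stub_suggest(actions: List[Action]) -> List[str]:
    """Stand-in for a language model: flags three or more manual cell edits."""
    edits = [a for a in actions if a.name == "edit_cell"]
    if len(edits) >= 3:
        return [f"{len(edits)} manual cell edits observed; a fill or formula may be faster."]
    return []


print(run_pipeline([b"f0", b"f1", b"f2"], stub_reconstruct, stub_suggest))
```

Passing the stages in as callables keeps the sketch independent of any particular model API, which the source does not specify.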

If this is right

Users receive tailored efficiency suggestions without having to articulate their goals or problems.
Repetitive or inefficient actions visible in behavior can be automatically detected and addressed.
Suggestions are judged more actionable and helpful for learning than those produced by prompt-based spreadsheet assistants.
The approach works directly from video input and does not require access to application logs or APIs.
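
As a rough illustration of what detecting repetitive actions can mean, a run-length heuristic can flag repeated steps in an already reconstructed action sequence. This is not the paper's method (InvisibleMentor delegates detection to a language model); the function name, the action labels, and the `min_run` threshold are assumptions made for this sketch.

```python
from itertools import groupby
from typing import List, Tuple


def find_repetitive_runs(action_names: List[str], min_run: int = 3) -> List[Tuple[str, int, int]]:
    """Return (action, start_index, run_length) for every run of identical
    consecutive actions that is at least `min_run` long."""
    runs = []
    index = 0
    for name, group in groupby(action_names):
        length = sum(1 for _ in group)  # size of this consecutive run
        if length >= min_run:
            runs.append((name, index, length))
        index += length
    return runs


# Hypothetical action labels for a short session.
session = ["select", "edit", "edit", "edit", "edit", "save"]
print(find_repetitive_runs(session))
```

A heuristic like this only sees exact consecutive repeats; it would miss repetition that is interleaved with other actions, which is one reason a model-based detector may be preferable.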

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same video-based reconstruction could be tested in other feature-rich applications such as presentation or data-visualization tools.
Running the pipeline on continuous screen capture might support ongoing rather than post-task guidance.
Errors in action reconstruction could be reduced by allowing users to correct the inferred steps before suggestions are generated.

Load-bearing premise

A vision-language model can reliably reconstruct precise user actions and task context from raw screen recordings without substantial errors or loss of detail.

What would settle it

A study that supplies screen recordings of tasks with known optimal versus suboptimal workflows, then measures how often the reconstructed actions match independent manual annotations and whether the generated suggestions correctly address the identified inefficiencies.
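
One way to score the first half of such a study is step-level agreement between reconstructed and manually annotated actions. The sketch below assumes the two sequences are already aligned step by step, which a real evaluation would have to establish first; the function name and action labels are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def action_agreement(predicted: List[str], annotated: List[str]) -> Tuple[float, Dict[str, float]]:
    """Return overall agreement and per-action recall against manual annotations.

    Assumes both lists are aligned: predicted[i] and annotated[i] describe the same step.
    """
    if len(predicted) != len(annotated):
        raise ValueError("sequences must be aligned step by step")
    if not annotated:
        raise ValueError("need at least one annotated action")
    total = Counter(annotated)
    correct = Counter(a for p, a in zip(predicted, annotated) if p == a)
    overall = sum(correct.values()) / len(annotated)
    per_action = {name: correct[name] / count for name, count in total.items()}
    return overall, per_action


overall, per_action = action_agreement(
    ["copy", "paste", "edit", "edit"],   # reconstructed by the model
    ["copy", "paste", "edit", "sort"],   # independent manual annotation
)
print(overall, per_action)
```

Per-action recall matters here because a high overall score can hide an action type that the model almost never recovers.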

Figures

Figures reproduced from arXiv: 2509.26557 by Andrew Head, Chris Parnin, Emerson Murphy-Hill, Ken Milne, Litao Yan, Sumit Gulwani, Vu Le.

**Figure 1.** InvisibleMentor transforms ordinary screen recordings into more than just sequences of user actions. It interprets the ...

**Figure 2.** InvisibleMentor’s pipeline for generating suggestions from a screen recording.

**Figure 3.** User interface of a spreadsheet assistant that provides structured workflow guidance.

**Figure 4.** Participant ratings of InvisibleMentor’s suggestions.

**Figure 5.** Participants’ comparative preferences between In...

**Figure 6.** Relationship between video duration and VLM pro...

Original abstract

Many users struggle to notice when a more efficient workflow exists in feature-rich tools like Excel. Existing AI assistants offer help only after users describe their goals or problems, which can be effortful and imprecise. We present InvisibleMentor, a system that turns screen recordings of task completion into vision-grounded reflections on tasks. It detects issues such as repetitive edits and recommends more efficient alternatives based on observed behavior. Unlike prior systems that rely on logs, APIs, or user prompts, InvisibleMentor operates directly on screen recordings. It uses a two-stage pipeline: a vision-language model reconstructs actions and context, and a language model generates structured, high-fidelity suggestions. In evaluation, InvisibleMentor accurately identified inefficient workflows, and participants found its suggestions more actionable, tailored, and more helpful for learning and improvement compared to a prompt-based spreadsheet assistant.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

InvisibleMentor sketches a VLM-then-LM pipeline on screen recordings for proactive workflow fixes in tools like Excel, but the abstract gives no numbers or study details to check if it works.

The main takeaway is a system that watches screen recordings of users working in feature-rich software, reconstructs their actions with a vision-language model, and then uses a language model to flag inefficiencies like repetitive edits and suggest better alternatives. It positions itself as more automatic than assistants that wait for explicit prompts or need logs and APIs. That direct visual approach is the clearest new element here, and it targets a genuine everyday problem where users miss simpler ways to do things in spreadsheets or similar apps. The framing of turning passive recordings into structured reflections is straightforward and practical on paper.

The evaluation claim is that it spotted issues accurately and that participants rated the suggestions higher for actionability and learning value than a prompt-based baseline. Without any participant counts, success rates, task details, or controls in the abstract, though, those results stay hard to assess. The central assumption that the vision model can pull precise actions and context from raw video without major loss or error also sits untested in what we have.

This would interest HCI researchers building AI productivity tools or anyone exploring vision-based interfaces for software assistance. A reader already working on workflow mining or proactive agents could pick up the two-stage pipeline idea and adapt it. If the full paper supplies the missing methods, quantitative metrics, and study protocol, it deserves a serious referee to evaluate the reconstruction fidelity and user study design. Otherwise the claims rest mostly on description. I would send the complete version out for peer review because the core concept is worth proper testing even if the current evidence is thin.

Referee Report

2 major / 1 minor

Summary. The paper introduces InvisibleMentor, a two-stage pipeline that processes screen recordings of user tasks in feature-rich applications such as Excel. A vision-language model first reconstructs actions and task context from raw video, after which a language model produces structured suggestions for more efficient workflows. The abstract claims that the system accurately detects issues like repetitive edits and that, in an evaluation, participants rated its suggestions as more actionable, tailored, and helpful for learning than those from a prompt-based spreadsheet assistant.

Significance. If the evaluation claims are substantiated with rigorous controls and metrics, the work could offer a meaningful advance in passive, observation-driven AI assistance for workflow discovery, reducing reliance on explicit user prompts or application logs. The approach aligns with HCI goals of lowering the effort required to identify inefficiencies in complex tools. However, the absence of any quantitative results, participant counts, or protocol details in the provided manuscript prevents a firm assessment of its potential impact or generalizability.

major comments (2)

[Abstract] The central claim that 'InvisibleMentor accurately identified inefficient workflows' is presented without any supporting metrics, error rates, participant numbers, or description of the evaluation protocol. This omission is load-bearing because the paper's contribution rests on demonstrating superior performance over the prompt-based baseline.
[Abstract] The weakest assumption, that a vision-language model can reliably reconstruct precise user actions and task context from raw screen recordings without substantial errors, is stated but not accompanied by any fidelity measures, failure cases, or validation against ground-truth action logs. This directly affects the credibility of the downstream suggestions.

minor comments (1)

[Abstract] The comparison baseline is described only as 'a prompt-based spreadsheet assistant'; clarifying its exact capabilities and prompting strategy would help readers understand the strength of the reported advantage.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We agree that additional details are needed to substantiate the claims and will revise the manuscript accordingly.

Referee: [Abstract] The central claim that 'InvisibleMentor accurately identified inefficient workflows' is presented without any supporting metrics, error rates, participant numbers, or description of the evaluation protocol. This omission is load-bearing because the paper's contribution rests on demonstrating superior performance over the prompt-based baseline.

Authors: We agree that the abstract would benefit from more specific details to support the central claim. The current version summarizes the evaluation outcomes at a high level for brevity. In the revised manuscript we will update the abstract to include participant counts, key comparative metrics against the prompt-based baseline, and a concise description of the evaluation protocol.
revision: yes

Referee: [Abstract] The weakest assumption, that a vision-language model can reliably reconstruct precise user actions and task context from raw screen recordings without substantial errors, is stated but not accompanied by any fidelity measures, failure cases, or validation against ground-truth action logs. This directly affects the credibility of the downstream suggestions.

Authors: We acknowledge that explicit validation of the VLM reconstruction step strengthens the paper. The abstract currently focuses on the end-to-end system rather than intermediate fidelity metrics. We will revise the abstract to reference the VLM validation approach and will ensure the full manuscript includes quantitative fidelity measures, selected failure cases, and comparison to ground-truth logs.
revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The provided abstract describes a two-stage VLM+LM pipeline for inferring actions from screen recordings and generating workflow suggestions, along with qualitative evaluation results from a user study. No equations, parameters, derivations, or self-citations appear in the text. The claims rest on system behavior and external participant feedback rather than any reduction of outputs to fitted inputs or self-referential premises by construction. This is a standard non-circular system paper whose central assertions are evaluated independently of internal fitting.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the unverified performance of current vision-language models for action reconstruction and on the assumption that observed behavior in recordings is representative of real workflow problems.

axioms (1)

domain assumption: Vision-language models can accurately reconstruct user actions and context from screen recordings
The first stage of the pipeline explicitly relies on this capability of VLMs.

invented entities (1)

InvisibleMentor two-stage pipeline (no independent evidence)
purpose: To convert screen recordings into structured workflow suggestions
The system itself is the primary new artifact introduced by the paper.

pith-pipeline@v0.9.0 · 5662 in / 1140 out tokens · 32280 ms · 2026-05-18T11:49:41.992075+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

Paper passage: VLM achieved over 90% accuracy on recovering 14 common spreadsheet actions across 25 real sessions

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

A Prompt Templates

This appendix provides the prompt templates used in each phase of InvisibleMentor’s architecture. The prompts were constructed to guide the vision-languag...

Group related actions into workflows (steps accomplishing a specific task)
For each workflow, set "Optimal" to true/false based on efficiency
For suboptimal workflows ("Optimal": false):
- "ActionList": List actions starting with "It looks like you..."
- "Reason": Main inefficiency (be specific) starting with "You ..."
- "Suggestion": Provide ONE actionable solution using Excel features:
  - Give step-by-step instructions with exact Ribbon paths/shortcuts
  - Include detailed examples with realisti...
Focus on efficiency and maintainability, not just task completion
Only include 3 most impactful suboptimal workflows and rank them by importance
Use proper formatting: backticks (`) around Excel functions, formulas, keyboard shortcuts, and feature names, and triple backticks (```) for multi-line formulas or step-by-step code examples
Create plausible placeholders for unclear data references

Output JSON format:

{
  "Workflows": [
    {
      "ActionList": ["Action 1", "Action 2"],
      "Optimal": true/false,
      "Reason": "Brief explanation",
      "Suggestion": "Step-by-step actionable solution"
    }
  ]
}

B User Scenario ...
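
A consumer of the JSON output format quoted in the prompt template might validate the response and keep only the suboptimal workflows. The sketch below is an assumption about how that could be done, not code from the paper: the function name is hypothetical, and it assumes that optimal workflows may omit "Reason" and "Suggestion" and that the model has already ranked entries by importance.

```python
import json
from typing import Any, Dict, List

# Keys the quoted prompt asks for on every suboptimal workflow.
SUBOPTIMAL_KEYS = {"ActionList", "Reason", "Suggestion"}


def suboptimal_workflows(raw: str, limit: int = 3) -> List[Dict[str, Any]]:
    """Parse the model's JSON reply and return up to `limit` suboptimal workflows,
    preserving the order in which the model listed them."""
    data = json.loads(raw)
    workflows = data.get("Workflows") if isinstance(data, dict) else None
    if not isinstance(workflows, list):
        raise ValueError('expected a top-level "Workflows" list')
    selected = []
    for wf in workflows:
        if not isinstance(wf, dict) or "Optimal" not in wf:
            raise ValueError('each workflow must be an object with an "Optimal" flag')
        if wf["Optimal"] is False:
            missing = SUBOPTIMAL_KEYS - wf.keys()
            if missing:
                raise ValueError(f"suboptimal workflow is missing keys: {sorted(missing)}")
            selected.append(wf)
    return selected[:limit]


# Hypothetical reply following the quoted format.
reply = json.dumps({
    "Workflows": [
        {"ActionList": ["It looks like you sorted column A"], "Optimal": True},
        {
            "ActionList": ["It looks like you typed each total by hand"],
            "Optimal": False,
            "Reason": "You repeated the same edit in every row",
            "Suggestion": "Use `SUM` and fill down",
        },
    ]
})
print(suboptimal_workflows(reply))
```

Validating before display guards against malformed model output, which is a common failure mode when a prompt requests structured JSON.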

[1] [1]

David Akers, Matthew Simpson, Robin Jeffries, and Terry Winograd. 2009. Undo and erase events as indicators of usability problems. InProceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 659–668

work page 2009

[2] [2]

Mohammad Alahmadi, Abdulkarim Malkadi, and Sonia Haiduc. 2020. UI Screens Identification and Extraction from Mobile Programming Screencasts. InPro- ceedings of the 28th International Conference on Program Comprehension. ACM, Litao Yan, Andrew Head, Ken Milne, Vu Le, Sumit Gulwani, Chris Parnin, and Emerson Murphy-Hill 319–330

work page 2020

[3] [3]

Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, and Abhanshu Sharma

work page

[4] [4]

ScreenAI: A Vision-Language Model for UI and Infographics Understand- ing

work page

[5] [5]

Carlos Bernal-Cárdenas, Nathan Cooper, Madeleine Havranek, Kevin Moran, Oscar Chaparro, Denys Poshyvanyk, and Andrian Marcus. 2023. Translating Video Recordings of Complex Mobile App UI Gestures into Replayable Scenarios. IEEE Transactions on Software Engineering49 (2023), 1782–1803

work page 2023

[6] Ann Blandford, Dominic Furniss, and Stephann Makri. 2016. Qualitative HCI research: Going behind the scenes. Morgan & Claypool Publishers.

[7] David A. Bradbard, Charles Alvis, and Richard Morris. 2014. Spreadsheet usage by management accountants: An exploratory study. Journal of Accounting Education (2014), 24–30.

[8] Tyson Bulmer, Lloyd Montgomery, and Daniela Damian. 2018. Predicting developers’ IDE commands with machine learning. In Proceedings of the 15th International Conference on Mining Software Repositories. ACM, 82–85.

[9] George Chalhoub and Advait Sarkar. 2022. “It’s Freedom to Put Things Where My Mind Wants”: Understanding and Improving the User Experience of Structuring Data in Spreadsheets. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. ACM, Article 585, 24 pages.

[10] Sibei Chen, Yeye He, Weiwei Cui, Ju Fan, Song Ge, Haidong Zhang, Dongmei Zhang, and Surajit Chaudhuri. 2024. Auto-Formula: Recommend Formulas in Spreadsheets using Contrastive Learning for Table Representations. Proceedings of the ACM on Management of Data, Article 122 (2024), 27 pages.

[11] Yanting Chen, Yi Ren, Xiaoting Qin, Jue Zhang, Kehong Yuan, Lu Han, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan, and Qi Zhang. 2024. Sharingan: Extract User Action Sequence from Desktop Recordings.

[12] Parmit K. Chilana, Amy J. Ko, and Jacob O. Wobbrock. 2012. LemonAid: selection-based crowdsourced contextual help for web applications. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 1549–1558.

[13] Allen Cypher. 1993. Eager: programming repetitive tasks by demonstration. MIT Press, Cambridge, MA, USA, 205–217.

[14] Robert DeLine, Amir Khella, Mary Czerwinski, and George Robertson. 2005. Towards understanding programs through wear-based filtering. In Proceedings of the 2005 ACM Symposium on Software Visualization. ACM, 183–192.

[15] Travis Faas, Lynn Dombrowski, Alyson Young, and Andrew D. Miller. 2018. Watch Me Code: Programming Mentorship Communities on Twitch.tv. Proceedings of the ACM on Human-Computer Interaction, Article 50 (2018), 18 pages.

[16] Leah Findlater and Joanna McGrenere. 2004. A comparison of static, adaptive, and adaptable menus. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 89–96.

[17] Adam Fourney, Richard Mann, and Michael Terry. 2011. Query-feature graphs: bridging user vocabulary and system functionality. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology. ACM, 207–216.

[18] C. Ailie Fraser, Mira Dontcheva, Holger Winnemöller, Sheryl Ehrlich, and Scott Klemmer. 2016. DiscoverySpace: Suggesting Actions in Complex Software. In Proceedings of the 2016 ACM Conference on Designing Interactive Systems. ACM, 1221–1232.

[19] Tovi Grossman and George Fitzmaurice. 2010. ToolClips: an investigation of contextual video assistance for functionality understanding. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 1515–1524.

[20] Sumit Gulwani, William R. Harris, and Rishabh Singh. 2012. Spreadsheet data manipulation using examples. Commun. ACM (2012), 97–105.

[21] Björn Hartmann, Daniel MacDougall, Joel Brandt, and Scott R. Klemmer. 2010. What would other programmers do: suggesting solutions to error messages. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 1019–1028.

[22] Sture Holm. 1979. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics (1979), 65–70.

[23] Forrest Huang, Gang Li, Tao Li, and Yang Li. 2024. Automatic Macro Mining from Interaction Traces at Scale. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. ACM, New York, NY, USA, Article 1038, 16 pages.

[24] Yue Jiang, Eldon Schoop, Amanda Swearngin, and Jeffrey Nichols. 2025. ILuvUI: Instruction-tuned LangUage-Vision modeling of UIs from Machine Conversations. In Proceedings of the 30th International Conference on Intelligent User Interfaces. ACM, 861–877.

[25] Yiqiao Jin, Stefano Petrangeli, Yu Shen, and Gang Wu. 2025. ScreenLLM: Stateful Screen Schema for Efficient Action Understanding and Prediction. In Companion Proceedings of the ACM on Web Conference 2025. ACM, 2008–2013.

[26] Anjali Khurana, Xiaotian Su, April Yi Wang, and Parmit K Chilana. 2025. Do It For Me vs. Do It With Me: Investigating User Perceptions of Different Paradigms of Automation in Copilots for Feature-Rich Software. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems. ACM, Article 880, 18 pages.

[27] Anjali Khurana, Hariharan Subramonyam, and Parmit K Chilana. 2024. Why and When LLM-Based Assistants Can Go Wrong: Investigating the Effectiveness of Prompt-Based Interactions for Software Help-Seeking. In Proceedings of the 29th International Conference on Intelligent User Interfaces. ACM, 288–303.

[28] Andrea Kohlhase, Michael Kohlhase, and Ana Guseva. 2015. Context in Spreadsheet Comprehension. In SEMS@ICSE. 21–27.

[29] Benjamin Lafreniere, Andrea Bunt, Matthew Lount, Filip Krynicki, and Michael A. Terry. 2011. AdaptableGIMP: designing a socially-adaptable interface. In Proceedings of the 24th Annual ACM Symposium Adjunct on User Interface Software and Technology. ACM, 89–90.

[30] Benjamin Lafreniere, Parmit K. Chilana, Adam Fourney, and Michael A. Terry. 2015. These Aren’t the Commands You’re Looking For: Addressing False Feedforward in Feature-Rich Software. In Proceedings of the 28th Annual ACM Symposium on User Interface Software & Technology. ACM, 619–628.

[32] Jenny T. Liang, Aayush Kumar, Yasharth Bajpai, Sumit Gulwani, Vu Le, Chris Parnin, Arjun Radhakrishna, Ashish Tiwari, Emerson Murphy-Hill, and Gustavo Soares. 2025. TableTalk: Scaffolding Spreadsheet Development with a Language Agent. ACM Transactions on Computer-Human Interaction (2025).

[33] Wendy E. Mackay. 1990. Patterns of sharing customizable software. In Proceedings of the 1990 ACM Conference on Computer-Supported Cooperative Work. ACM.

[34] Abdulkarim Malkadi, Ahmad Tayeb, and Sonia Haiduc. 2023. Improving Code Extraction from Coding Screencasts Using a Code-Aware Encoder-Decoder Model. In 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE). 1492–1504.

[35] Justin Matejka, Tovi Grossman, and George Fitzmaurice. 2011. Ambient help. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2751–2760.

[36] Justin Matejka, Tovi Grossman, and George Fitzmaurice. 2013. Patina: dynamic heatmaps for visualizing application usage. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 3227–3236.

[37] Justin Matejka, Wei Li, Tovi Grossman, and George Fitzmaurice. 2009. CommunityCommands: command recommendations for software applications. In Proceedings of the 22nd Annual ACM Symposium on User Interface Software and Technology. ACM, 193–202.

[38] Emerson Murphy-Hill, Rahul Jiresal, and Gail C. Murphy. 2012. Improving software developers’ fluency by recommending development environment commands. In Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering. ACM, Article 42, 11 pages.

[39] Emerson Murphy-Hill, Da Young Lee, Gail C Murphy, and Joanna McGrenere. 2015. How do users discover new tools in software development and beyond? Computer Supported Cooperative Work (CSCW) 24, 5 (2015), 389–422.

[41] Aadhavan M. Nambhi, Bhanu Prakash Reddy, Aarsh Prakash Agarwal, Gaurav Verma, Harvineet Singh, and Iftikhar Ahamath Burhanuddin. 2019. Stuck? No worries! Task-aware Command Recommendation and Proactive Help for Analysts. In Proceedings of the 27th ACM Conference on User Modeling, Adaptation and Personalization. ACM, 271–275.

[42] Jakob Nielsen. 1994. Usability Engineering. Morgan Kaufmann Publishers Inc.

[43] Chris Parnin and Robert DeLine. 2010. Evaluating cues for resuming interrupted programming tasks. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 93–102.

[44] Qing Xia, Advait Sarkar, Duncan Brumby, and Anna Cox. 2025. "How do you even know that stuff?": Barriers to expertise sharing among spreadsheet users.

[45] Vidya Ramesh, Charlie Hsu, Maneesh Agrawala, and Björn Hartmann. 2011. ShowMeHow: translating user interface instructions between applications. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology. ACM, 127–134.

[46] Franklin E. Satterthwaite. 1946. An approximate distribution of estimates of variance components. Biometrics Bulletin 2, 6 (1946), 110–114.

[47] Rishabh Singh and Sumit Gulwani. 2016. Transforming spreadsheet data types using examples. POPL ’16: Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (2016), 343–356.

[48] Ananya Singha, Bhavya Chopra, Anirudh Khatry, Sumit Gulwani, Austin Henley, Vu Le, Chris Parnin, Mukul Singh, and Gust Verbruggen. 2024. Semantically Aligned Question and Code Generation for Automated Insight Generation. In Proceedings of the 1st International Workshop on Large Language Models for Code. ACM, 127–134.

[49] Sruti Srinivasa Ragavan, Advait Sarkar, and Andrew D Gordon. 2021. Spreadsheet Comprehension: Guesswork, Giving Up and Going Back to the Author. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. ACM, Article 181, 21 pages.

[50] Michael B. Twidale. 2005. Over the shoulder learning: supporting brief informal learning. Computer Supported Cooperative Work (CSCW) 14, 6 (2005), 505–547.

[51] Xu Wang, Benjamin Lafreniere, and Tovi Grossman. 2018. Leveraging Community-Generated Videos and Command Logs to Classify and Recommend Software Workflows. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. ACM, 1–13.

[52] Frank Wilcoxon. 1992. Individual comparisons by ranking methods. In Breakthroughs in Statistics: Methodology and Distribution. Springer, 196–202.

[53] Qinzhuo Wu, Weikai Xu, Wei Liu, Tao Tan, Jianfeng Liu, Ang Li, Jian Luan, Bin Wang, and Shuo Shang. 2024. MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding.

[54] Qing Xia, Advait Sarkar, Duncan Brumby, and Anna Cox. 2025. How do you even know that stuff?: Barriers to expertise sharing among spreadsheet users.

[55] J.D. Zamfirescu-Pereira, Richmond Y. Wong, Bjoern Hartmann, and Qian Yang. 2023. Why Johnny Can’t Prompt: How Non-AI Experts Try (and Fail) to Design LLM Prompts. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. ACM, Article 437, 21 pages.

[57] Dehai Zhao, Zhenchang Xing, Chunyang Chen, Xin Xia, and Guoqiang Li. 2019. ActionNet: Vision-Based Workflow Action Recognition From Programming Screencasts. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). 350–361.

[58] Dehai Zhao, Zhenchang Xing, Xin Xia, Deheng Ye, Xiwei Xu, and Liming Zhu. 2023. SeeHow: Workflow Extraction from Programming Screencasts through Action-Aware Video Analytics. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). 1946–1957.

A Prompt Templates

This appendix provides the prompt templates used in each phase of InvisibleMentor’s architecture. The prompts were constructed to guide the vision-languag...

- Group related actions into workflows (steps accomplishing a specific task)
- For each workflow, set "Optimal" to true/false based on efficiency
- For suboptimal workflows ("Optimal": false):
  - "ActionList": List actions starting with "It looks like you..."
  - "Reason": Main inefficiency (be specific) starting with "You ..."
  - "Suggestion": Provide ONE actionable solution using Excel features:
    - Give step-by-step instructions with exact Ribbon paths/shortcuts
    - Include detailed examples with realisti...
- Focus on efficiency and maintainability, not just task completion
- Only include 3 most impactful suboptimal workflows and rank them by importance
- Use proper formatting: backticks (`) around Excel functions, formulas, keyboard shortcuts, and feature names, and triple backticks (```) for multi-line formulas or step-by-step code examples
- Create plausible placeholders for unclear data references

Output JSON format:

{
  "Workflows": [
    {
      "ActionList": ["Action 1", "Action 2"],
      "Optimal": true/false,
      "Reason": "Brief explanation",
      "Suggestion": "Step-by-step actionable solution"
    }
  ]
}

B User Scenario ...
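As an illustration of the output JSON format requested by the Appendix A prompt, the following minimal Python sketch parses such a response and keeps at most three suboptimal workflows in the order the model ranked them. It is not the authors' implementation: the names `Workflow`, `parse_workflows`, `top_suboptimal`, and the sample response are hypothetical, and only the field names (`Workflows`, `ActionList`, `Optimal`, `Reason`, `Suggestion`) come from the prompt.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Workflow:
    """One entry of the "Workflows" array (class name is hypothetical)."""
    action_list: List[str]
    optimal: bool
    reason: str
    suggestion: str


def parse_workflows(raw: str) -> List[Workflow]:
    """Parse a response shaped like the prompt's output JSON format."""
    data = json.loads(raw)
    workflows = []
    for item in data["Workflows"]:
        if not isinstance(item["Optimal"], bool):
            raise ValueError('"Optimal" must be a JSON boolean')
        workflows.append(
            Workflow(
                action_list=list(item["ActionList"]),
                optimal=item["Optimal"],
                # Assumption: optimal workflows may omit these two fields.
                reason=item.get("Reason", ""),
                suggestion=item.get("Suggestion", ""),
            )
        )
    return workflows


def top_suboptimal(workflows: List[Workflow], limit: int = 3) -> List[Workflow]:
    """Keep at most `limit` suboptimal workflows, preserving the given order."""
    return [w for w in workflows if not w.optimal][:limit]


# Hypothetical response, invented for illustration only.
EXAMPLE_RESPONSE = json.dumps({
    "Workflows": [
        {
            "ActionList": ["It looks like you retyped the same formula in each row"],
            "Optimal": False,
            "Reason": "You repeated an identical edit by hand",
            "Suggestion": "Enter the formula once, then drag the fill handle down",
        },
        {
            "ActionList": ["It looks like you sorted the table by date"],
            "Optimal": True,
        },
    ]
})

workflows = parse_workflows(EXAMPLE_RESPONSE)
flagged = top_suboptimal(workflows)
for w in flagged:
    print(w.reason, "->", w.suggestion)
```

Checking that `Optimal` is a real boolean guards against a model echoing the template's literal `true/false` placeholder, which `json.loads` would reject anyway as invalid JSON.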