The Invisible Mentor: Inferring User Actions from Screen Recordings to Recommend Better Workflows
Pith reviewed 2026-05-18 11:49 UTC · model grok-4.3
The pith
Screen recordings can be turned into precise workflow suggestions using a vision-language model pipeline
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
InvisibleMentor turns screen recordings of task completion into vision-grounded reflections on tasks. It detects issues such as repetitive edits and recommends more efficient alternatives based on observed behavior. Unlike prior systems that rely on logs, APIs, or user prompts, InvisibleMentor operates directly on screen recordings. It uses a two-stage pipeline: a vision-language model reconstructs actions and context, and a language model generates structured, high-fidelity suggestions.
What carries the argument
Two-stage pipeline in which a vision-language model reconstructs user actions and task context directly from screen recordings, followed by a language model that produces structured suggestions for more efficient workflows.
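The data flow described above can be sketched as two composable stages. This is a minimal illustration only: the `Action` schema, `run_pipeline`, and both toy stages are hypothetical stand-ins, not the paper's models or interfaces.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Action:
    """One reconstructed user action (hypothetical schema, not from the paper)."""
    timestamp_s: float
    description: str


# Stage 1 maps raw frames to actions; stage 2 maps actions to suggestions.
VisionStage = Callable[[Sequence[bytes]], List[Action]]
LanguageStage = Callable[[List[Action]], List[str]]


def run_pipeline(frames: Sequence[bytes],
                 vision_stage: VisionStage,
                 language_stage: LanguageStage) -> List[str]:
    """Run stage 1, then feed its output to stage 2."""
    actions = vision_stage(frames)
    return language_stage(actions)


def fake_vision_stage(frames: Sequence[bytes]) -> List[Action]:
    # Toy stand-in for a vision-language model: one "edit" per frame.
    return [Action(float(i), f"edit cell A{i + 1}") for i, _ in enumerate(frames)]


def fake_language_stage(actions: List[Action]) -> List[str]:
    # Toy stand-in for a language model: flag three or more cell edits.
    edits = [a for a in actions if a.description.startswith("edit cell")]
    if len(edits) >= 3:
        return ["Repeated manual edits detected; consider a fill or formula."]
    return []


suggestions = run_pipeline([b"f0", b"f1", b"f2"], fake_vision_stage, fake_language_stage)
```

Keeping the stages behind plain callables means the reconstruction step can be evaluated or swapped out without touching suggestion generation.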
If this is right
- Users receive tailored efficiency suggestions without having to articulate their goals or problems.
- Repetitive or inefficient actions visible in behavior can be automatically detected and addressed.
- Suggestions are judged more actionable and helpful for learning than those produced by prompt-based spreadsheet assistants.
- The approach works directly from video input and does not require access to application logs or APIs.
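To make "repetitive actions" concrete, here is a rule-based sketch that flags runs of identical consecutive actions in an already reconstructed action list. It is not how InvisibleMentor detects issues (the paper delegates that to a language model); the function name, the `min_run` threshold, and the action labels are assumptions for illustration.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def repeated_runs(actions: Sequence[str], min_run: int = 3) -> List[Tuple[str, int, int]]:
    """Return (action, start_index, run_length) for each run of at least min_run repeats."""
    runs = []
    index = 0
    for name, group in groupby(actions):
        length = sum(1 for _ in group)  # size of this run of identical actions
        if length >= min_run:
            runs.append((name, index, length))
        index += length
    return runs


log = ["type value", "type value", "type value", "type value", "save"]
runs = repeated_runs(log)
```

For this log the only run that meets the threshold is the four consecutive `type value` actions starting at index 0.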
Where Pith is reading between the lines
- The same video-based reconstruction could be tested in other feature-rich applications such as presentation or data-visualization tools.
- Running the pipeline on continuous screen capture might support ongoing rather than post-task guidance.
- Errors in action reconstruction could be reduced by allowing users to correct the inferred steps before suggestions are generated.
Load-bearing premise
A vision-language model can reliably reconstruct precise user actions and task context from raw screen recordings without substantial errors or loss of detail.
What would settle it
A study that supplies screen recordings of tasks with known optimal versus suboptimal workflows, then measures how often the reconstructed actions match independent manual annotations and whether the generated suggestions correctly address the identified inefficiencies.
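One way such a study could score reconstruction fidelity is to align the reconstructed action sequence with the manual annotation and report precision and recall. The sketch below assumes actions are comparable labels and uses Python's `difflib.SequenceMatcher` as a heuristic alignment; it is not guaranteed to find the optimal alignment, and the function name and labels are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Sequence, Tuple


def action_agreement(predicted: Sequence[str], annotated: Sequence[str]) -> Tuple[float, float]:
    """Return (precision, recall) of predicted actions against manual annotations."""
    matcher = SequenceMatcher(a=predicted, b=annotated, autojunk=False)
    # Count elements that fall inside in-order matching blocks.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(annotated) if annotated else 0.0
    return precision, recall


predicted = ["select", "copy", "paste", "paste"]
annotated = ["select", "copy", "paste", "paste", "paste"]
precision, recall = action_agreement(predicted, annotated)
```

Here every predicted action is matched (precision 1.0) but one annotated paste is missed (recall 0.8), which is the kind of detail loss the load-bearing premise is concerned with.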
Original abstract
Many users struggle to notice when a more efficient workflow exists in feature-rich tools like Excel. Existing AI assistants offer help only after users describe their goals or problems, which can be effortful and imprecise. We present InvisibleMentor, a system that turns screen recordings of task completion into vision-grounded reflections on tasks. It detects issues such as repetitive edits and recommends more efficient alternatives based on observed behavior. Unlike prior systems that rely on logs, APIs, or user prompts, InvisibleMentor operates directly on screen recordings. It uses a two-stage pipeline: a vision-language model reconstructs actions and context, and a language model generates structured, high-fidelity suggestions. In evaluation, InvisibleMentor accurately identified inefficient workflows, and participants found its suggestions more actionable, tailored, and more helpful for learning and improvement compared to a prompt-based spreadsheet assistant.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces InvisibleMentor, a two-stage pipeline that processes screen recordings of user tasks in feature-rich applications such as Excel. A vision-language model first reconstructs actions and task context from raw video, after which a language model produces structured suggestions for more efficient workflows. The abstract claims that the system accurately detects issues like repetitive edits and that, in an evaluation, participants rated its suggestions as more actionable, tailored, and helpful for learning than those from a prompt-based spreadsheet assistant.
Significance. If the evaluation claims are substantiated with rigorous controls and metrics, the work could offer a meaningful advance in passive, observation-driven AI assistance for workflow discovery, reducing reliance on explicit user prompts or application logs. The approach aligns with HCI goals of lowering the effort required to identify inefficiencies in complex tools. However, the absence of any quantitative results, participant counts, or protocol details in the provided manuscript prevents a firm assessment of its potential impact or generalizability.
major comments (2)
- [Abstract] The central claim that 'InvisibleMentor accurately identified inefficient workflows' is presented without any supporting metrics, error rates, participant numbers, or description of the evaluation protocol. This omission is load-bearing because the paper's contribution rests on demonstrating superior performance over the prompt-based baseline.
- [Abstract] The weakest assumption (that a vision-language model can reliably reconstruct precise user actions and task context from raw screen recordings without substantial errors) is stated but not accompanied by any fidelity measures, failure cases, or validation against ground-truth action logs. This directly affects the credibility of the downstream suggestions.
minor comments (1)
- [Abstract] The comparison baseline is described only as 'a prompt-based spreadsheet assistant'; clarifying its exact capabilities and prompting strategy would help readers understand the strength of the reported advantage.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract. We agree that additional details are needed to substantiate the claims and will revise the manuscript accordingly.
Point-by-point responses
- Referee: [Abstract] The central claim that 'InvisibleMentor accurately identified inefficient workflows' is presented without any supporting metrics, error rates, participant numbers, or description of the evaluation protocol. This omission is load-bearing because the paper's contribution rests on demonstrating superior performance over the prompt-based baseline.
  Authors: We agree that the abstract would benefit from more specific details to support the central claim. The current version summarizes the evaluation outcomes at a high level for brevity. In the revised manuscript we will update the abstract to include participant counts, key comparative metrics against the prompt-based baseline, and a concise description of the evaluation protocol.
  Revision: yes
- Referee: [Abstract] The weakest assumption (that a vision-language model can reliably reconstruct precise user actions and task context from raw screen recordings without substantial errors) is stated but not accompanied by any fidelity measures, failure cases, or validation against ground-truth action logs. This directly affects the credibility of the downstream suggestions.
  Authors: We acknowledge that explicit validation of the VLM reconstruction step strengthens the paper. The abstract currently focuses on the end-to-end system rather than intermediate fidelity metrics. We will revise the abstract to reference the VLM validation approach and will ensure the full manuscript includes quantitative fidelity measures, selected failure cases, and comparison to ground-truth logs.
  Revision: yes
Circularity Check
No significant circularity detected
Full rationale
The provided abstract describes a two-stage VLM+LM pipeline for inferring actions from screen recordings and generating workflow suggestions, along with qualitative evaluation results from a user study. No equations, parameters, derivations, or self-citations appear in the text. The claims rest on system behavior and external participant feedback rather than any reduction of outputs to fitted inputs or self-referential premises by construction. This is a standard non-circular system paper whose central assertions are evaluated independently of internal fitting.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Vision-language models can accurately reconstruct user actions and context from screen recordings
invented entities (1)
- InvisibleMentor two-stage pipeline (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean: alexander_duality_circle_linking (unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  VLM achieved over 90% accuracy on recovering 14 common spreadsheet actions across 25 real sessions
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
A Prompt Templates
This appendix provides the prompt templates used in each phase of InvisibleMentor’s architecture. The prompts were constructed to guide the vision-languag...
- Group related actions into workflows (steps accomplishing a specific task)
- For each workflow, set "Optimal" to true/false based on efficiency
- For suboptimal workflows ("Optimal": false):
  - "ActionList": List actions starting with "It looks like you..."
  - "Reason": Main inefficiency (be specific) starting with "You ..."
  - "Suggestion": Provide ONE actionable solution using Excel features:
    - Give step-by-step instructions with exact Ribbon paths/shortcuts
    - Include detailed examples with realisti...
- Focus on efficiency and maintainability, not just task completion
- Only include 3 most impactful suboptimal workflows and rank them by importance
- Use proper formatting: backticks (`) around Excel functions, formulas, keyboard shortcuts, and feature names, and triple backticks (```) for multi-line formulas or step-by-step code examples
- Create plausible placeholders for unclear data references
Output JSON format:
  {
    "Workflows": [
      {
        "ActionList": ["Action 1", "Action 2"],
        "Optimal": true/false,
        "Reason": "Brief explanation",
        "Suggestion": "Step-by-step actionable solution"
      }
    ]
  }
B User Scenario ...
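A consumer of the output JSON format in the prompt template above would likely want to validate it before showing suggestions. The sketch below checks the four documented keys and the limit of three suboptimal workflows. The function name `parse_workflows`, the error messages, and the choice to require "Reason" and "Suggestion" only when "Optimal" is false are assumptions, not part of the paper.

```python
import json
from typing import Any, Dict, List


def parse_workflows(raw: str, max_suboptimal: int = 3) -> List[Dict[str, Any]]:
    """Validate model output and return the suboptimal workflows in the given order."""
    data = json.loads(raw)
    workflows = data.get("Workflows") if isinstance(data, dict) else None
    if not isinstance(workflows, list):
        raise ValueError('expected a top-level "Workflows" list')

    suboptimal = []
    for i, wf in enumerate(workflows):
        if not isinstance(wf, dict):
            raise ValueError(f"workflow {i} is not an object")
        if not isinstance(wf.get("ActionList"), list):
            raise ValueError(f'workflow {i}: "ActionList" must be a list')
        if not isinstance(wf.get("Optimal"), bool):
            raise ValueError(f'workflow {i}: "Optimal" must be true or false')
        if not wf["Optimal"]:
            # Assumption: explanatory fields are mandatory only for suboptimal workflows.
            for key in ("Reason", "Suggestion"):
                if not isinstance(wf.get(key), str) or not wf[key].strip():
                    raise ValueError(f'workflow {i}: "{key}" must be a non-empty string')
            suboptimal.append(wf)

    if len(suboptimal) > max_suboptimal:
        raise ValueError(f"more than {max_suboptimal} suboptimal workflows")
    return suboptimal


raw = json.dumps({"Workflows": [
    {"ActionList": ["It looks like you typed each total by hand"],
     "Optimal": False,
     "Reason": "You repeated the same edit in every row",
     "Suggestion": "Use `SUM` and fill down"},
    {"ActionList": ["It looks like you sorted the table"], "Optimal": True},
]})
flagged = parse_workflows(raw)
```

Malformed JSON surfaces as `json.JSONDecodeError`, which is a subclass of `ValueError`, so callers can catch a single exception type.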