DigitalCoach: Communication and Grounding Gaps in Human and Agentic Computer Use Coaching

Alane Suhr; Amy Pavel; Anya Ji; David M. Chan; Meng Chen; Tobias Maringgele; Tsung-Han Wu

arxiv: 2606.31980 · v1 · pith:QDLQYJEZnew · submitted 2026-06-30 · 💻 cs.CL · cs.AI · cs.HC

Meng Chen, Anya Ji, Tsung-Han Wu, Tobias Maringgele, David M. Chan, Alane Suhr, Amy Pavel

Pith reviewed 2026-07-01 05:32 UTC · model grok-4.3

classification 💻 cs.CL cs.AI cs.HC

keywords DigitalCoach, computer use coaching, visual grounding, multimodal dataset, human-AI interaction, dialogue evaluation, software tutoring, agent coaching

The pith

Models coach computer users with more direct instructions but fewer explanations and with advice poorly grounded in screen visuals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DigitalCoach, a dataset of 72 recorded human expert-novice coaching sessions across five software applications, to measure whether current models can teach people how to use computers. Automated analysis of the sessions shows models favor straightforward commands over the explanations, error diagnoses, and knowledge-check questions that human coaches use more often. When the coaching style is held constant, model utterances still fail to reference the actual visual elements on screen. Interactive tests with learners confirm that model coaches produce passive following of steps rather than active engagement with the material. The dataset therefore supplies a concrete benchmark for building agents that coach in shared visual software environments.

Core claim

DigitalCoach is a multimodal dataset of 72 human expert-novice computer use coaching sessions consisting of 22,752 dialogue turns grounded in 28.1 hours of screen and input event recordings across five software applications. Automated evaluation shows that models differ from humans in how they coach: models provide more direct instructions, but fewer explanations, error diagnoses, and knowledge-check questions. When we fix the coaching method, models produce utterances similar to human references yet poorly grounded in visual context. Interactive evaluation confirms that model coaches cause learners to passively follow instructions without deeper engagement and fall short in visual grounding.
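The three headline totals imply some per-session averages that the summary does not state. The short calculation below derives them from those totals alone; the resulting averages are back-of-envelope values, not figures reported by the paper, and individual sessions presumably vary in length.

```python
# Totals as reported in the core claim above.
SESSIONS = 72
TURNS = 22_752
HOURS = 28.1

# Derived averages (assumption: simple means over all sessions and turns).
turns_per_session = TURNS / SESSIONS          # dialogue turns per session
minutes_per_session = HOURS * 60 / SESSIONS   # recording length per session
seconds_per_turn = HOURS * 3600 / TURNS       # recording time per dialogue turn

print(f"{turns_per_session:.0f} turns per session")
print(f"{minutes_per_session:.1f} minutes per session")
print(f"{seconds_per_turn:.1f} seconds of recording per turn")
```

On these numbers a typical session runs a little over 23 minutes and contains roughly 316 turns, which is a dense, fast-moving dialogue rather than occasional prompting.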

What carries the argument

DigitalCoach dataset of human coaching sessions with screen recordings, used to run both automated metric comparisons and interactive learner studies against model-generated coaching.

If this is right

Coaching agents will require stronger visual grounding to produce instructions that reference actual screen content.
Agents will need to increase production of explanations, error diagnoses, and knowledge-check questions to align with observed human coaching patterns.
Learner engagement metrics will improve only when models shift away from direct-instruction dominance toward more interactive dialogue.
The dataset supplies training data for agents that support collaborative rather than purely directive computer-use assistance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The grounding failures may stem from training regimes that do not explicitly link generated language to dynamic screen states during coaching.
Similar evaluation methods could be applied to other agent tasks that require referencing a shared visual workspace, such as remote technical support.
Expanding the dataset to additional applications and novice populations would test whether the communication differences hold more broadly.

Load-bearing premise

The patterns observed in the 72 sessions across five applications represent typical differences between human and model coaching behavior.

What would settle it

A controlled study using a new collection of coaching sessions in which state-of-the-art models match human rates of explanations, error diagnoses, knowledge-check questions, and visual grounding accuracy would falsify the reported gaps.

Figures

Figures reproduced from arXiv: 2606.31980 by Alane Suhr, Amy Pavel, Anya Ji, David M. Chan, Meng Chen, Tobias Maringgele, Tsung-Han Wu.

**Figure 1.** An example from DigitalCoach. The human coach gives guidance grounded in the user's screen, with explanations, while the model coach gives instructions without explaining how or why.

**Figure 2.** Collection setup, illustrated in the CAD soft…

**Figure 3.** Mean token length across dialogue progress of…

**Figure 4.** Dialogue act distributions of sampled 18 ses…

**Figure 5.** Coaching method distributions of sampled…

**Figure 7.** Mean normalized ranking (↑) and ratings (↑) for human (H), Gemini-3.1-Pro Vanilla prompt (G(V)), and Oracle prompt (G(O)) coaching utterances (* = p < 0.05, ** = p < 0.01, Friedman test followed by pairwise Wilcoxon signed-rank tests, Bonferroni correction).

**Figure 9.** Representative (A) communication and (B)…

**Figure 8.** (A) Token length per utterance and (B) dis…
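The Figure 7 caption names a Friedman test with Bonferroni-corrected pairwise follow-ups across three conditions (H, G(V), G(O)). The sketch below illustrates that kind of omnibus test in plain Python. It is not the authors' analysis code: the ratings are made up, the function names are hypothetical, no tie correction is applied, and the pairwise Wilcoxon signed-rank step is omitted, so `bonferroni` simply scales whatever pairwise p-values are supplied to it.

```python
import math

def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks

def friedman_three_conditions(rows):
    """Friedman statistic and asymptotic p-value for exactly three conditions.

    With k = 3 the chi-square reference distribution has 2 degrees of
    freedom, whose survival function is exp(-q / 2).
    """
    n, k = len(rows), 3
    rank_sums = [0.0] * k
    for row in rows:
        if len(row) != k:
            raise ValueError("each row needs exactly three ratings")
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    q = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    return q, math.exp(-q / 2.0)

def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Made-up ratings (higher is better), one row per rater, ordered H, G(V), G(O).
ratings = [
    [5, 3, 4],
    [4, 2, 3],
    [5, 2, 4],
    [4, 3, 5],
    [5, 1, 3],
    [4, 2, 3],
]
q, p = friedman_three_conditions(ratings)
print(f"Friedman Q = {q:.3f}, p = {p:.4f}")
```

The omnibus test only says that at least one condition differs; the pairwise tests named in the caption are what identify which pairs differ, which is why the correction for multiple comparisons matters.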

Original abstract

Agents are increasingly capable of automating software tasks, but can they teach humans how to use software themselves? We introduce DigitalCoach, a multimodal dataset of 72 human expert-novice computer use coaching sessions consisting of 22,752 dialogue turns grounded in 28.1 hours of screen and input event recordings across five software applications. We use DigitalCoach to evaluate whether state-of-the-art models can teach humans how to use computers. Automated evaluation shows that models differ from humans in how they coach: models provide more direct instructions, but fewer explanations, error diagnoses, and knowledge-check questions. When we fix the coaching method, models produce utterances similar to human references yet poorly grounded in visual context. Interactive evaluation confirms that model coaches cause learners to passively follow instructions without deeper engagement and fall short in visual grounding. DigitalCoach lays a foundation for collaborative and proactive computer use coaching agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper brings a new dataset of coaching sessions that shows concrete model-human differences, but the 72-session sample is too narrow to support broad claims.

The main takeaway is a multimodal dataset of 72 expert-novice coaching sessions across five apps, with screen recordings and input logs, plus some automated and interactive comparisons showing models coach more directly and ground their advice worse than humans.

The dataset is the clearest addition. Recording actual screen and event data lets them measure visual grounding failures in a way most prior work skips. The split between direct instructions versus explanations, error diagnosis, and knowledge checks comes from that data, and the interactive part where learners turn passive with model coaches adds a practical angle.

The soft spot is scale and representativeness. Seventy-two sessions is a reasonable start for an initial study, but treating them as the basis for general statements about model versus human coaching behavior assumes the five applications and the error types that appeared are diverse enough. Without cross-application tests or checks on the automated utterance classifiers, the observed deltas could be tied to this particular set of tasks rather than a robust pattern.

This is for researchers building agents that teach software use or study human-AI tutoring interactions. It deserves a serious referee because it supplies real interaction traces on a practical gap instead of staying in simulation, even if the conclusions will need more data to hold up.

Referee Report

2 major / 1 minor

Summary. The paper introduces DigitalCoach, a multimodal dataset of 72 expert-novice computer-use coaching sessions (22,752 turns, 28.1 hours) across five applications. It uses automated and interactive evaluations to claim that state-of-the-art models differ from humans by providing more direct instructions but fewer explanations, error diagnoses, and knowledge-check questions; when coaching method is fixed, model utterances match human references in style but are poorly grounded in visual context, resulting in passive learner engagement.

Significance. If the empirical distinctions hold after methodological clarification, the work would usefully document communication and visual-grounding gaps in current agents for coaching tasks and supply a grounded dataset that could benchmark future proactive coaching systems. The scale of the collected sessions and the dual automated-plus-interactive evaluation protocol are concrete strengths.

major comments (2)

[Evaluation sections] The central quantitative claims rest on automated labeling of utterance types (direct instructions, explanations, error diagnoses, knowledge-check questions) and visual-grounding metrics, yet the manuscript provides no definitions of these categories, no inter-annotator agreement figures, and no accuracy/validation results for the automated classifiers. These omissions are load-bearing for the reported model-human differences.
[Dataset description] The generalizability claim—that models systematically differ from humans in coaching style and grounding—rests on 72 sessions across only five applications. No cross-application statistical tests, diversity analysis of novice error types, or external validation of the sample are reported, leaving open the possibility that observed deltas are artifacts of the particular UI patterns or task structures chosen.
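The first major comment asks for validation of the automated utterance labels. One common way to report it is raw agreement plus Cohen's kappa between the automated labels and a human annotator, alongside a confusion matrix. The sketch below shows that computation on made-up labels; the label strings are shorthand for the four categories named in the comment, and nothing here reflects the paper's actual classifier or data.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two label sequences, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both raters pick the same category independently.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical category throughout
    return (observed - expected) / (1.0 - expected)

def confusion_matrix(reference, predicted):
    """Count (reference label, predicted label) pairs."""
    return Counter(zip(reference, predicted))

# Hypothetical labels for eight coach utterances.
human = ["instruction", "explanation", "instruction", "diagnosis",
         "question", "instruction", "explanation", "instruction"]
auto = ["instruction", "instruction", "instruction", "diagnosis",
        "question", "instruction", "explanation", "explanation"]

accuracy = sum(h == a for h, a in zip(human, auto)) / len(human)
kappa = cohen_kappa(human, auto)
matrix = confusion_matrix(human, auto)
print(f"accuracy = {accuracy:.2f}, kappa = {kappa:.2f}")
```

Kappa is lower than raw accuracy here because "instruction" dominates both label sets, so a fair share of the agreement would occur by chance. That is the situation the referee is worried about, since the reported gaps concern the less frequent categories.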

minor comments (1)

[Abstract] The abstract would benefit from a single sentence stating the number and identity of the models evaluated and the main baselines used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. The comments highlight important areas for methodological clarification and we will revise the manuscript to address them directly.

Referee: [Evaluation sections] The central quantitative claims rest on automated labeling of utterance types (direct instructions, explanations, error diagnoses, knowledge-check questions) and visual-grounding metrics, yet the manuscript provides no definitions of these categories, no inter-annotator agreement figures, and no accuracy/validation results for the automated classifiers. These omissions are load-bearing for the reported model-human differences.

Authors: We agree that the manuscript currently lacks explicit definitions of the utterance categories and validation details for the automated classifiers. In the revision we will add a new subsection that (1) provides formal definitions for each category, (2) reports inter-annotator agreement from a human validation study performed on a random subset of turns, and (3) presents accuracy and confusion-matrix results comparing the automated labels to human judgments. These additions will make the quantitative claims fully reproducible and address the load-bearing concern.
revision: yes

Referee: [Dataset description] The generalizability claim—that models systematically differ from humans in coaching style and grounding—rests on 72 sessions across only five applications. No cross-application statistical tests, diversity analysis of novice error types, or external validation of the sample are reported, leaving open the possibility that observed deltas are artifacts of the particular UI patterns or task structures chosen.

Authors: The five applications were deliberately selected to span distinct UI paradigms, yet we acknowledge that the absence of cross-application analyses leaves generalizability open to question. We will add (a) per-application breakdowns and statistical tests for consistency of the model-human differences and (b) a summary of novice error-type diversity across the five domains. While expanding the dataset to additional applications is outside the scope of the present work, these new analyses will directly test whether the observed patterns are robust or UI-specific.
revision: partial
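A per-application breakdown of the kind promised in the second response could look like the sketch below: for each application, compare the share of explanation utterances from human and model coaches with a 2x2 chi-square test, correct for the number of applications, and check that the difference points the same way everywhere. The application names and counts are invented, and treating utterances as independent observations is a simplifying assumption, since turns within one session are correlated.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns the statistic and its p-value; with 1 degree of freedom the
    survival function is erfc(sqrt(x / 2)).
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("every row and column needs at least one count")
    stat = n * (a * d - b * c) ** 2 / denom
    return stat, math.erfc(math.sqrt(stat / 2.0))

# Hypothetical counts per application:
# (human explanations, human other, model explanations, model other)
counts = {
    "app_1": (40, 60, 20, 80),
    "app_2": (35, 65, 25, 75),
    "app_3": (30, 70, 28, 72),
}

results = {}
for app, (h_yes, h_no, m_yes, m_no) in counts.items():
    human_rate = h_yes / (h_yes + h_no)
    model_rate = m_yes / (m_yes + m_no)
    stat, p = chi_square_2x2(h_yes, h_no, m_yes, m_no)
    p_adj = min(1.0, p * len(counts))  # Bonferroni across applications
    results[app] = (human_rate - model_rate, stat, p_adj)
    print(f"{app}: rate difference = {human_rate - model_rate:+.2f}, adjusted p = {p_adj:.4f}")

# True only if humans explain more often than models in every application.
same_direction = all(diff > 0 for diff, _, _ in results.values())
```

A consistent direction with uneven significance across applications would suggest a real but UI-dependent gap, while a sign flip in any application would support the referee's concern that the pooled result reflects the chosen tasks.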

Circularity Check

0 steps flagged

No circularity: empirical dataset collection and direct comparison

The paper collects a new dataset of 72 coaching sessions and reports automated metrics plus interactive evaluations comparing model vs. human behavior. No equations, fitted parameters, ansatzes, or derivations appear in the provided text. Central claims rest on direct observation of the collected data rather than any reduction to prior self-citations or inputs by construction. Self-citations (if any) are not invoked as uniqueness theorems or load-bearing justifications for the evaluation methodology. This is standard empirical work whose validity hinges on data representativeness, not on definitional or self-referential loops.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities can be identified from the provided text.
