pith. machine review for the scientific record.

arxiv: 2605.11700 · v1 · submitted 2026-05-12 · 💻 cs.HC

Recognition: no theorem link

MindMirror: A Local-First Multimodal State-Aware Support System for Digital Workers

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 05:48 UTC · model grok-4.3

classification 💻 cs.HC
keywords local-first design · multimodal interaction · facial expression recognition · digital workers · state reflection · local LLM · productivity support · structured workflow

The pith

MindMirror combines local facial expression detection (a fine-tuned model reaching 94.49 percent benchmark accuracy) with structured reflection to help digital workers manage fatigue and task blockage.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

MindMirror addresses fatigue, anxiety, and task blockage in prolonged computer work by providing a local-first multimodal system that detects states via camera-based facial cues, text, and optional speech. It creates a closed workflow where users check their state, manually correct detections, articulate blockages in a structured way, receive local LLM suggestions, and review daily or weekly reports. This design avoids the need for clear prompts required by general chatbots and keeps data on the user's device. Evaluation on an independent benchmark of 6,767 images shows the fine-tuned model raises accuracy from 59.66 percent to 94.49 percent. A small formative study with six workers indicates they value the local approach, correction mechanism, and reflection structure as a lightweight tool for self-support.

Core claim

The central claim is that a local-first multimodal system integrating camera-based facial expression recognition, user inputs, structured reflection, and a local LLM can form an effective closed workflow for state-aware support, demonstrated by a 34.83 percentage point accuracy gain in the emotion recognition module on a seven-class benchmark and positive user feedback on the prototype's design features.
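For orientation: the stated gain is the plain absolute difference between the two reported accuracies, 94.49 − 59.66 = 34.83 percentage points, i.e. an absolute improvement over the non-fine-tuned checkpoint rather than a relative one.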

What carries the argument

The closed workflow of state checking via multimodal inputs, manual correction, structured articulation of blockage, local LLM-based suggestion generation, and daily/weekly review reports, supported by a fine-tuned Hugging Face emotion model and Ollama-hosted Qwen LLM.
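The paper describes this loop only in prose; the sketch below is a minimal rendering of how such a cycle could be wired together on one machine, assuming a plain Python script. The function names, record file, prompt wording, and the qwen2.5 model tag are illustrative assumptions, not the authors' implementation; only the broad shape (detect, confirm or correct, reflect, query a locally hosted Qwen model through Ollama, log locally) follows the paper.

```python
import datetime
import json

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint
RECORDS = "mindmirror_records.json"                  # hypothetical on-device session log


def check_state(detected: str) -> str:
    """Show the detected emotion and let the user confirm or override it (manual correction)."""
    answer = input(f"Detected state '{detected}'. Press Enter to accept or type a correction: ").strip()
    return answer or detected


def reflect() -> dict:
    """Structured blockage reflection: three guiding questions, mirroring the paper's workflow."""
    return {
        "stuck_on": input("What are you stuck on? "),
        "tried": input("What have you already tried? "),
        "needed": input("What would help you move forward? "),
    }


def suggest(state: str, reflection: dict, model: str = "qwen2.5") -> str:
    """Ask a locally hosted Qwen model (via Ollama) for one supportive, non-diagnostic suggestion."""
    prompt = (
        f"The user reports feeling '{state}'. Blockage notes: {json.dumps(reflection)}. "
        "Suggest one concrete, supportive next step. Do not diagnose."
    )
    resp = requests.post(OLLAMA_URL, json={"model": model, "prompt": prompt, "stream": False}, timeout=120)
    resp.raise_for_status()
    return resp.json().get("response", "")


def log_session(state: str, reflection: dict, suggestion: str) -> None:
    """Append the session to a local JSON file so later daily/weekly reports stay on-device."""
    try:
        with open(RECORDS) as f:
            records = json.load(f)
    except FileNotFoundError:
        records = []
    records.append({"ts": datetime.datetime.now().isoformat(),
                    "state": state, "reflection": reflection, "suggestion": suggestion})
    with open(RECORDS, "w") as f:
        json.dump(records, f, indent=2)


if __name__ == "__main__":
    state = check_state(detected="tired")  # in MindMirror this would come from the camera-based recognizer
    notes = reflect()
    advice = suggest(state, notes)
    log_session(state, notes, advice)
    print(advice)
```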

If this is right

  • The local-first architecture keeps all processing on-device, reducing privacy risks compared to cloud-based alternatives.
  • Manual correction gives users direct control to override automated detections before suggestions are generated.
  • The structured reflection step helps users articulate vague feelings of blockage into concrete inputs for the LLM.
  • Daily and weekly reports enable users to track patterns in their states over time without external data sharing, as sketched below.
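A weekly report of the kind this bullet describes reduces, at its simplest, to counting confirmed states per day from the local log. The sketch below does only that; the file name and record fields match the hypothetical log used in the workflow sketch above and are assumptions, not the paper's storage schema (the prototype reportedly uses local JSON/LocalStorage records with Chart.js on top).

```python
import json
from collections import Counter, defaultdict
from datetime import date, timedelta

RECORDS = "mindmirror_records.json"  # hypothetical on-device session log (see the workflow sketch)


def load_records(path: str = RECORDS) -> list:
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return []


def weekly_report(records: list, days: int = 7) -> dict:
    """Count confirmed states per day over the last `days` days, entirely on-device."""
    cutoff = date.today() - timedelta(days=days)
    per_day = defaultdict(Counter)
    for rec in records:
        day = rec["ts"][:10]  # ISO timestamp -> YYYY-MM-DD
        if date.fromisoformat(day) >= cutoff:
            per_day[day][rec["state"]] += 1
    return per_day


if __name__ == "__main__":
    for day, counts in sorted(weekly_report(load_records()).items()):
        print(day, dict(counts))
```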

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adding passive signals such as typing speed or application switches could strengthen state detection without extra user effort.
  • The workflow might transfer to domains like student study sessions or remote team meetings where similar fatigue patterns occur.
  • Over repeated use the reflection prompts could serve as a lightweight training mechanism for better self-awareness.

Load-bearing premise

That detected facial expressions and user-reported states accurately reflect internal experience in real work settings and that the local LLM suggestions will be perceived as helpful rather than intrusive or generic.

What would settle it

A longitudinal study measuring whether users show sustained changes in self-reported fatigue or task completion rates when using the full MindMirror workflow versus a control condition without the structured reflection and suggestions.

Figures

Figures reproduced from arXiv: 2605.11700 by Changbo Wang, Wenqi Luo, Yan Wang.

Figure 1
Figure 1: Overall architecture of MindMirror. The system connects user-side interaction, a Web frontend, Flask backend/API routing, local AI engines, and local session storage. The frontend captures a video frame with Canvas and converts it into a base64 image, which is sent to the backend emotion-analysis endpoint. The backend performs emotion recognition and returns an emotion label. At the interaction level, Mind…
Figure 2
Figure 2: End-to-end workflow of MindMirror. The system supports local-first multimodal state checking, user confirmation or correction of emotion cues, structured reflection, local LLM-based suggestion generation, and local review/history management.
Figure 3
Figure 3: Representative user interfaces of MindMirror: (a) the homepage dashboard and quick-entry panel, and (b) the state-check page with camera input, assistant status, recognition result area, privacy notice, and manual state options.
Figure 4
Figure 4: Representative interaction screens after state checking: (a) the structured blockage-reflection page with three guiding questions and chat support, and (b) the review-report page with state distribution, key statistics, blockage summary, and recovery suggestion areas.
read the original abstract

Digital workers often experience fatigue, anxiety, reduced attention, and task blockage during prolonged computer-based work. Existing productivity tools mainly focus on task completion, while general-purpose AI chatbots require users to formulate clear prompts before receiving useful help. This paper presents MindMirror, a local-first multimodal state-aware support system for digital workers. MindMirror integrates camera-based facial expression cues, text input, optional speech interaction, structured blockage reflection, local large language model (LLM)-based response generation, and daily/weekly review reports. The system forms a closed workflow of state checking, manual correction, structured articulation, suggestion generation, and state review. The current prototype follows a local-first design, while optional speech services may rely on third-party APIs when enabled. It is implemented with a Web frontend, Flask backend, an emotion recognition model, an Ollama-hosted Qwen model, Chart.js visualization, and local JSON/LocalStorage records. We evaluate the emotion recognition module on an independent seven-class image-level facial expression benchmark containing 6,767 images. The fine-tuned Hugging Face model improves accuracy from 59.66% to 94.49% over a non-fine-tuned checkpoint baseline, an absolute gain of 34.83 percentage points. We further validate the prototype through endpoint-level reliability tests, voice-interaction latency tests, and a small formative user feedback study with six digital workers. Results suggest that users value the local-first design, manual correction mechanism, and structured reflection workflow. MindMirror is not intended for psychological diagnosis; instead, it serves as a lightweight, user-controllable tool for state reflection and supportive interaction.
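The abstract and Figure 1 describe the recognition path as a browser frame encoded to base64, posted to a Flask endpoint, and answered with an emotion label. The sketch below shows one plausible shape for that endpoint; the route name and request fields are assumptions, and the checkpoint is the Hugging Face model cited in the reference list, loaded here through the standard transformers image-classification pipeline rather than whatever serving code the authors actually use.

```python
import base64
import io

from flask import Flask, jsonify, request
from PIL import Image
from transformers import pipeline

app = Flask(__name__)

# Load the cited facial-expression checkpoint once at startup so inference stays local
# and individual requests do not pay the model-loading cost.
classifier = pipeline("image-classification",
                      model="mo-thecreator/vit-Facial-Expression-Recognition")


@app.route("/api/emotion", methods=["POST"])  # hypothetical route name
def analyze_emotion():
    payload = request.get_json(force=True)
    # The frontend sends a Canvas frame as a base64 data URL; strip the header if present.
    b64 = payload["image"].split(",")[-1]
    image = Image.open(io.BytesIO(base64.b64decode(b64))).convert("RGB")
    top = classifier(image)[0]  # highest-scoring of the seven expression classes
    return jsonify({"emotion": top["label"], "score": round(top["score"], 4)})


if __name__ == "__main__":
    app.run(port=5000)
```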

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents MindMirror, a local-first multimodal system designed to support digital workers by integrating camera-based facial expression recognition, text and optional speech inputs, structured reflection on task blockages, and local LLM-generated suggestions, culminating in daily/weekly review reports. It claims a substantial improvement in seven-class facial expression recognition accuracy from 59.66% to 94.49% on an independent benchmark of 6,767 images and reports positive qualitative feedback from a formative study involving six digital workers who valued the local-first design, manual correction, and reflection workflow.

Significance. If the system's effectiveness in reducing fatigue and improving productivity is confirmed, MindMirror could contribute meaningfully to the field of human-computer interaction by providing a privacy-preserving, user-controllable alternative to general-purpose AI tools for managing work-related mental states. The technical demonstration of fine-tuning for high-accuracy local emotion recognition and the closed workflow design are notable strengths.

major comments (2)
  1. [Evaluation section (user study)] The formative user study with only six participants reports that users 'valued' the local-first design, manual correction mechanism, and structured reflection workflow, but provides no quantitative pre/post measures of internal states (e.g., fatigue, anxiety), task performance metrics, or comparisons to baseline tools. This limits the ability to substantiate the central claim that MindMirror serves as an effective state-aware support system.
  2. [Emotion recognition evaluation] The accuracy of 94.49% is reported on a benchmark dataset of posed images; the manuscript does not include an evaluation on in-situ webcam feeds captured during actual prolonged work sessions, which may differ in expression naturalness, lighting, and head pose, potentially affecting real-world performance.
minor comments (2)
  1. [Abstract] The abstract mentions 'endpoint-level reliability tests' and 'voice-interaction latency tests' but provides no specific results or metrics for these tests, which would help assess the prototype's robustness (a sketch of what such a measurement could look like follows this list).
  2. [Implementation] The description of the system architecture (Web frontend, Flask backend, Ollama-hosted Qwen model) is high-level; including a diagram or more detailed component interactions would improve clarity.
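For concreteness, an endpoint-level reliability and latency check of the kind the abstract mentions can be as small as the loop below; the URL, payload, and call count are assumptions (the URL matches the hypothetical Flask sketch shown earlier on this page), and the authors' actual test protocol is not specified.

```python
import statistics
import time

import requests

URL = "http://localhost:5000/api/emotion"         # hypothetical endpoint (see the earlier Flask sketch)
PAYLOAD = {"image": "data:image/png;base64,..."}  # one fixed, pre-encoded test frame

latencies, failures = [], 0
for _ in range(100):
    start = time.perf_counter()
    try:
        resp = requests.post(URL, json=PAYLOAD, timeout=10)
        resp.raise_for_status()
    except requests.RequestException:
        failures += 1
        continue
    latencies.append(time.perf_counter() - start)

print(f"success rate over 100 calls: {(100 - failures) / 100:.0%}")
if latencies:
    print(f"median latency: {statistics.median(latencies) * 1000:.0f} ms, "
          f"max: {max(latencies) * 1000:.0f} ms")
```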

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the evaluation of MindMirror. The comments highlight important distinctions between formative exploration and efficacy validation, which we will address by clarifying scope and adding explicit limitations in the revised manuscript.

read point-by-point responses
  1. Referee: [Evaluation section (user study)] The formative user study with only six participants reports that users 'valued' the local-first design, manual correction mechanism, and structured reflection workflow, but provides no quantitative pre/post measures of internal states (e.g., fatigue, anxiety), task performance metrics, or comparisons to baseline tools. This limits the ability to substantiate the central claim that MindMirror serves as an effective state-aware support system.

    Authors: We agree that the study is formative and qualitative, with N=6 providing initial user feedback rather than quantitative evidence of effectiveness. The manuscript already describes the study as 'formative user feedback' and makes no claims of fatigue reduction or productivity gains; it only reports that participants valued the local-first design, manual correction, and reflection workflow. In revision we will expand the Evaluation section with additional protocol details (e.g., session structure, thematic analysis method) and insert a dedicated Limitations subsection that explicitly notes the absence of pre/post internal-state measures, task metrics, and baseline comparisons, while outlining plans for future controlled experiments. This directly incorporates the referee's concern by better bounding our claims. revision: yes

  2. Referee: [Emotion recognition evaluation] The accuracy of 94.49% is reported on a benchmark dataset of posed images; the manuscript does not include an evaluation on in-situ webcam feeds captured during actual prolonged work sessions, which may differ in expression naturalness, lighting, and head pose, potentially affecting real-world performance.

    Authors: The 94.49% figure is obtained on a standard independent seven-class benchmark of 6,767 images, chosen to enable direct comparison with published baselines and to demonstrate the benefit of fine-tuning. We concur that posed benchmark images differ from naturalistic webcam footage in lighting, pose, and expression subtlety. In the revised manuscript we will add a paragraph in the Evaluation and/or Discussion sections acknowledging this domain gap and stating that real-world performance remains to be validated. Because the current work focuses on prototype implementation and benchmark validation, we cannot retroactively supply in-situ results; such data collection would require a separate study. revision: partial

Circularity Check

0 steps flagged

No circularity; claims rest on external benchmark and direct user observations

full rationale

The paper evaluates its emotion recognition module on an independent external benchmark of 6,767 images, reporting a standard accuracy lift from fine-tuning (59.66% to 94.49%). User feedback comes from direct observation in a small formative study rather than any derived or fitted quantity. No equations, self-referential definitions, load-bearing self-citations, or renamed empirical patterns appear in the derivation chain. The workflow and results are validated against external data sources.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard domain assumptions about facial expression validity and LLM helpfulness rather than new free parameters or invented entities.

axioms (2)
  • domain assumption Facial expressions provide a usable proxy for internal emotional and attentional states during computer work.
    Invoked in the state-checking and emotion recognition module description.
  • domain assumption Local LLMs can generate contextually appropriate supportive suggestions from structured user input.
    Assumed in the response generation step of the workflow.

pith-pipeline@v0.9.0 · 5598 in / 1394 out tokens · 55346 ms · 2026-05-13T05:48:09.917637+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 1 internal anchor

  1. [1]

    R. W. Picard. Affective Computing. MIT Press, 1997

  2. [2]

    R. A. Calvo and S. D’Mello. Affect detection: An interdisciplinary review of models, methods, and their applications. IEEE Transactions on Affective Computing, 1(1):18–37, 2010

  3. [3]

    Mental health at work

    World Health Organization. Mental health at work. WHO Fact Sheet, 2024. Available at: https://www.who.int/news-room/fact-sheets/detail/mental-health-at-work

  4. [4]

    I. J. Goodfellow, D. Erhan, P. L. Carrier, A. Courville, M. Mirza, B. Hamner, W. Cukierski, Y. Tang, D. Thaler, D.-H. Lee, and others. Challenges in representation learning: A report on three machine learning contests. arXiv preprint arXiv:1307.0414, 2013

  5. [5]

    The extended Cohn-Kanade dataset (CK+)

    P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews. The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 94–101, 2010

  6. [6]

    Deep facial expression recognition: A survey

    S. Li and W. Deng. Deep facial expression recognition: A survey. IEEE Transactions on Affective Computing, 13(3):1195–1215, 2022

  7. [7]

    F. Xue, Q. Wang, and G. Guo. TransFER: Learning relation-aware facial expression representations with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3601–3610, 2021

  8. [8]

    H. Li, M. Sui, F. Zhao, Z. Zha, and F. Wu. MVT: Mask vision transformer for facial expression recognition in the wild. arXiv preprint arXiv:2106.04520, 2021

  9. [9]

    H. Lian, C. Lu, S. Li, Y. Zhao, C. Tang, and Y. Zong. A survey of deep learning-based multimodal emotion recognition: Speech, text, and face. Entropy, 25(10):1440, 2023

  10. [10]

    CMU-MOSEI dataset and interpretable dynamic fusion graph

    A. Zadeh, P. P. Liang, S. Poria, E. Cambria, and L.-P. Morency. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pp. 2236–2246, 2018

  11. [11]

    Z. Guo, A. Lai, J. H. Thygesen, J. Farrington, T. Keen, and K. Li. Large language models for mental health applications: Systematic review. JMIR Mental Health, 11:e57400, 2024

  12. [12]

    K. K. Fitzpatrick, A. Darcy, and M. Vierhile. Delivering cognitive behavior therapy to young adults with symptoms of depression and anxiety using a fully automated conversational agent: A randomized controlled trial. JMIR Mental Health, 4(2):e19, 2017

  13. [13]

    I. Li, A. K. Dey, and J. Forlizzi. A stage-based model of personal informatics systems. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 557–566, 2010

  14. [14]

    E. P. S. Baumer. Reflective informatics: Conceptual dimensions for designing technologies of reflection. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pp. 585–594, 2015

  15. [15]

    Guidelines for human-AI interaction

    S. Amershi, D. Weld, M. Vorvoreanu, A. Fourney, B. Nushi, P. Collisson, J. Suh, S. Iqbal, P. N. Bennett, K. Inkpen, and others. Guidelines for human-AI interaction. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 1–13, 2019

  16. [16]

    Local-first software

    M. Kleppmann, A. Wiggins, P. van Hardenberg, and M. McGranaghan. Local-first software: You own your data, in spite of the cloud. In Proceedings of the 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software, pp. 154–178, 2019

  17. [17]

    Leakage in data mining

    S. Kaufman, S. Rosset, C. Perlich, and O. Stitelman. Leakage in data mining: Formulation, detection, and avoidance. ACM Transactions on Knowledge Discovery from Data, 6(4):1–21, 2012

  18. [18]

    vit-Facial-Expression-Recognition

    mo-thecreator. vit-Facial-Expression-Recognition. Hugging Face model card. Available at: https://huggingface.co/mo-thecreator/vit-Facial-Expression-Recognition

  19. [19]

    Qwen2.5 Technical Report

    Qwen Team. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024

  20. [20]

    Ollama documentation

    Ollama. Ollama documentation. Available at: https://ollama.com

  21. [21]

    Base the response on the user’s confirmed state and reflection content

  22. [22]

    Provide specific and actionable suggestions

  23. [23]

    Use warm and supportive language

  24. [24]

    Do not diagnose mental disorders

  25. [25]

    Do not provide medical or therapeutic treatment

  26. [26]

    If the user expresses severe or persistent distress, recommend professional help. Output format: Step 1: Immediate action - Action: <one concrete action> - Explanation: <short explanation> Step 2: Short-term strategy - Action: <one short-term work strategy> - Explanation: <short explanation> Step 3: Longer-term reminder - Action: <one reflection or planni...

  27. [27]

    The state-check workflow was easy to understand

  28. [28]

    Manual correction made the system feel more controllable

  29. [29]

    The three-question reflection helped me articulate my blockage

  30. [30]

    The generated suggestions were specific enough to be actionable

  31. [31]

    The local-first/no-account design increased my trust in the system.