
arxiv: 2604.19541 · v1 · submitted 2026-04-21 · 💻 cs.MA · cs.HC

FOCAL: Filtered On-device Continuous Activity Logging for Efficient Personal Desktop Summarization

Pith reviewed 2026-05-10 00:50 UTC · model grok-4.3

classification 💻 cs.MA cs.HC
keywords desktop activity logging · on-device summarization · multi-agent pipeline · vision-language models · task interruption handling · personal informatics · noise filtering · privacy-preserving AI

The pith

FOCAL's cascaded agents filter desktop screenshots on-device to cut token use by 60% and maintain high recall even during task switches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a privacy-first system that processes continuous desktop screenshots into organized personal logs without sending data off-device. It uses a chain of specialized agents to skip irrelevant images, attribute actions to tasks, and summarize only what matters. Experiments on hundreds of complex sessions show major drops in compute cost alongside gains in accuracy, especially when users switch between tasks. A sympathetic reader would care because this approach makes long-term personal activity tracking feasible on ordinary laptops instead of requiring cloud resources or constant user attention.

Core claim

FOCAL implements a unified filter-plan-log pipeline: a lightweight Filter Agent suppresses noise in the screenshot stream, a text-only Brain Agent assigns screenshots to tasks, a Record Agent performs selective visual reasoning only when needed, and a task-isolated Memory Agent produces coherent multi-perspective summaries. On DesktopBench, with 2,572 screenshots across 420 sessions, this yields 60.4% lower token consumption and 72.3% fewer VLM calls than a baseline while raising Key Information Recall (KIR) from 0.38 to 0.61. Under A→B→A interruptions, FOCAL sustains Task Accuracy 0.81 and KIR 0.80, whereas the baseline's Task Accuracy falls to 0.03.

What carries the argument

The filter-plan-log architecture that cascades a lightweight Filter Agent for noise suppression, a text-only Brain Agent for task attribution, a selective Record Agent for visual reasoning, and a task-isolated Memory Agent for context-coherent summarization.
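
As a rough illustration of the cascade described above, the following Python sketch wires four stand-in stages together. Everything in it is hypothetical: the `Frame` fields, the "screen changed" rule in `filter_agent`, the window-title heuristic in `brain_agent`, and the string-producing `record_agent` are placeholders chosen for the example, not the paper's implementation, and no real VLM is called.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Frame:
    """Hypothetical screenshot record: GUI metadata only, no pixels."""
    index: int
    window_title: str
    changed: bool  # assumed flag: did the screen change since the last frame?


def filter_agent(frame: Frame) -> bool:
    """Stand-in noise filter: keep only frames whose content changed."""
    return frame.changed


def brain_agent(frame: Frame) -> tuple[str, bool]:
    """Stand-in text-only attribution: the window title acts as the task id.

    Returns (task_id, needs_vision). The needs_vision rule is an assumption.
    """
    task_id = frame.window_title
    needs_vision = "editor" in frame.window_title.lower()
    return task_id, needs_vision


def record_agent(frame: Frame, task_id: str) -> str:
    """Stand-in for the VLM call; a real system would inspect the image."""
    return f"[vlm] frame {frame.index} in task {task_id}"


def run_pipeline(frames):
    memory = defaultdict(list)  # task-isolated memory: one log per task
    vlm_calls = 0
    for frame in frames:
        if not filter_agent(frame):
            continue  # dropped before any model sees it
        task_id, needs_vision = brain_agent(frame)
        if needs_vision:
            vlm_calls += 1
            entry = record_agent(frame, task_id)
        else:
            entry = f"[text] frame {frame.index} in task {task_id}"
        memory[task_id].append(entry)
    return dict(memory), vlm_calls


# Toy A→B→A session: editor, browser, then back to the editor.
frames = [
    Frame(0, "Editor - report.docx", True),
    Frame(1, "Editor - report.docx", False),  # unchanged, filtered out
    Frame(2, "Browser - flights", True),
    Frame(3, "Editor - report.docx", True),   # return to the first task
]
memory, vlm_calls = run_pipeline(frames)
```

In this toy session the editor task ends up with its own two-entry log and the browser task with a separate one-entry log, because memory is keyed by task rather than by position in the stream. The filtered frame never reaches a model, which is the source of the savings and also the point at which a wrong filter decision becomes unrecoverable.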

If this is right

  • Total VLM token consumption drops 60.4% and call count drops 72.3% compared with exhaustive processing.
  • Task accuracy and key-information recall remain at or above 0.80 even when users interrupt one task with another and return.
  • All processing stays on-device, eliminating the need to transmit raw screenshots or logs to external servers.
  • Multi-perspective personal logs become practical for long-running desktop use without manual curation.
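
To make the percentages concrete, the arithmetic can be worked through in a few lines of Python. The absolute baseline count used here is an assumption: the source reports only relative reductions, so the sketch supposes an exhaustive baseline that issues exactly one VLM call per screenshot (2,572 calls). The helper name `remaining_after_reduction` is made up for the example.

```python
def remaining_after_reduction(baseline: float, reduction_pct: float) -> float:
    """Quantity left after cutting `baseline` by `reduction_pct` percent."""
    return baseline * (1.0 - reduction_pct / 100.0)


# Assumption: one VLM call per screenshot in the baseline (not stated as a count).
baseline_calls = 2572
focal_calls = remaining_after_reduction(baseline_calls, 72.3)

# Fraction of the baseline token budget that remains after a 60.4% reduction.
token_fraction = remaining_after_reduction(1.0, 60.4)
```

Under that assumption, roughly 712 VLM calls would remain across the whole benchmark, and token use would sit at about 39.6% of the baseline.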

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same filter-first design could be applied to mobile screen recordings or browser history streams with only minor changes to the input format.
  • If the Filter Agent generalizes across different operating systems, the approach might replace cloud-dependent activity trackers in productivity tools.
  • Deploying the pipeline on lower-power devices would test whether latency remains acceptable when the VLM is more aggressively quantized.

Load-bearing premise

The lightweight Filter Agent can discard screenshots without losing any that contain task-critical information, and the full agent pipeline stays fast enough on ordinary consumer hardware.

What would settle it

Measure Key Information Recall and per-session latency on a fresh collection of 500 desktop sessions containing frequent A→B→A switches; if recall drops below 0.5 or average latency exceeds typical VLM inference time on the same hardware, the central efficiency claim does not hold.
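
The decision rule above can be written down as a small check. This is a sketch under two assumptions that the text leaves open: recall is averaged over sessions, and "typical VLM inference time" is taken to be the mean per-session latency of an exhaustive baseline measured on the same hardware. The function name and all input values are hypothetical.

```python
from statistics import mean


def claim_survives(kir_per_session, focal_latency_s, baseline_latency_s,
                   kir_floor=0.5):
    """Return True if neither failure condition of the test is triggered.

    kir_per_session:    Key Information Recall for each session
    focal_latency_s:    per-session latency of the filtered pipeline (seconds)
    baseline_latency_s: per-session latency of exhaustive VLM processing
    """
    recall_ok = mean(kir_per_session) >= kir_floor
    latency_ok = mean(focal_latency_s) <= mean(baseline_latency_s)
    return recall_ok and latency_ok


# Hypothetical measurements, for illustration only.
verdict = claim_survives(
    kir_per_session=[0.6, 0.7],
    focal_latency_s=[1.0, 1.2],
    baseline_latency_s=[2.0, 2.1],
)
```

A stricter variant could require every session, not just the mean, to clear the recall floor; which aggregation is appropriate depends on how the fresh collection is sampled.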

Figures

Figures reproduced from arXiv: 2604.19541 by Bo Yuan, Haoran Yin, Jiannong Cao, Ruosong Yang, Zhiyuan Wen.

Figure 1: Overview of FOCAL. Agent then uses GUI metadata to perform task attribution and adaptive planning; a VLM-based Record Agent writes intent-centric log entries within task scope; a Memory Agent enforces task-isolated context management; and a Summary Agent composes task-level memories into coherent multi-perspective summaries. By front-loading selection and routing, FOCAL reduces unnecessary local visual pr…
Figure 2: DesktopBench construction pipeline with Multi-task and Interruption splits.
Figure 3: DesktopBench statistics. (a) Task prefix counts (Multi-task). (b) Session pattern distribution. (c) Task-type transitions.
Figure 4: Token efficiency on the Multi-task and Interruption
Figure 5: Summary quality on the Multi-task and Interrup…
Figure 6: Ablation of the Brain Agent and memory strategy
Figure 7: Case study of Session 258. FOCAL recovers three task spans, while the naive baseline over-segments the workflow.
Original abstract

Desktop interaction streams provide a continuous, privacy-sensitive record of interleaved user tasks. Transforming these streams into task-organized personal logs on-device faces two main challenges: exhaustive Vision-Language Model (VLM) processing strains local resources, and global stream processing causes cross-task context pollution. We present FOCAL (Filtered On-device Continuous Activity Logging), a privacy-first multi-agent system utilizing a unified filter-plan-log architecture. It cascades a lightweight Filter Agent for noise suppression, a text-only Brain Agent for task attribution, a Record Agent for selective visual reasoning, and a task-isolated Memory Agent for context-coherent summarization. Experiments on DesktopBench (comprising 2,572 screenshots across 420 complex sessions) show FOCAL reduces total token consumption by 60.4% and VLM call count by 72.3% versus a baseline, while boosting Key Information Recall (KIR) from 0.38 to 0.61. Crucially, under A→B→A task interruptions, FOCAL maintains Task Acc 0.81 and KIR 0.80, whereas the baseline collapses to Task Acc 0.03. FOCAL pioneers the efficient, on-device summarization of instruction-free desktop streams into multi-perspective personal logs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces FOCAL, a privacy-first multi-agent cascade for on-device summarization of continuous desktop screenshot streams. It employs a lightweight Filter Agent for noise suppression, a text-only Brain Agent for task attribution, a Record Agent for selective visual reasoning, and a task-isolated Memory Agent for coherent logging. On the DesktopBench benchmark (2,572 screenshots across 420 sessions), FOCAL is reported to cut token consumption by 60.4% and VLM calls by 72.3% relative to a baseline while raising Key Information Recall (KIR) from 0.38 to 0.61. Under A→B→A interruptions it maintains Task Acc 0.81 and KIR 0.80, whereas the baseline's Task Acc collapses to 0.03.

Significance. If the empirical claims hold, the work offers a practical route to efficient, on-device personal activity logging that respects privacy and consumer hardware limits. The emphasis on interruption robustness and token reduction addresses real deployment constraints in continuous desktop monitoring.

major comments (3)
  1. [Experiments] Experiments section: aggregate metrics on DesktopBench are presented without per-session filter precision/recall, error analysis of discarded frames, or ablation restoring filtered screenshots. This directly bears on the central robustness claim under A→B→A interruptions, because any false-negative Filter Agent decision removes visual evidence before the Record or Memory Agents can process it.
  2. [Experiments] Evaluation: the reported 60.4% token reduction, 72.3% VLM-call reduction, and KIR/Task-Acc figures lack error bars, statistical tests, or variance across sessions, and the baseline implementation details (e.g., whether it applies any filtering) are insufficient to interpret the magnitude of the gains.
  3. [Architecture] System description: the Filter Agent's decision criteria, model size, and any training procedure are described at a high level only, leaving open whether its noise-suppression behavior reliably preserves task-critical frames (especially return-to-task-A frames) rather than introducing selective loss.
minor comments (2)
  1. [Abstract] The abstract and introduction use both 'filter-plan-log' and 'unified filter-plan-log architecture'; a single consistent phrasing would improve clarity.
  2. [Experiments] No mention of whether DesktopBench or the FOCAL implementation will be released, which would strengthen reproducibility claims.
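
The per-session filter analysis requested in major comment 1 could be computed along the following lines. This is a minimal sketch that assumes ground-truth labels marking which frames are task-critical are available for each session; the source does not say that DesktopBench provides such labels, and the session ids and frame indices below are invented.

```python
def filter_precision_recall(kept: set, critical: set):
    """Precision and recall of the filter's keep decisions for one session.

    kept:     indices of frames the filter passed downstream
    critical: indices of frames labeled task-critical (assumed ground truth)
    """
    true_pos = len(kept & critical)
    precision = true_pos / len(kept) if kept else 1.0
    recall = true_pos / len(critical) if critical else 1.0
    return precision, recall


def missed_frames(kept: set, critical: set) -> set:
    """Task-critical frames the filter discarded (its false negatives)."""
    return critical - kept


# Invented sessions: (kept frames, task-critical frames).
sessions = {
    "s1": ({0, 2, 3}, {0, 3}),
    "s2": ({1, 4}, {1, 4, 5}),
}
report = {sid: filter_precision_recall(k, c) for sid, (k, c) in sessions.items()}
misses = {sid: missed_frames(k, c) for sid, (k, c) in sessions.items()}
```

Recall is the number that bears on the robustness claim: a session such as `s2`, where a critical frame is discarded, loses that evidence before any later agent can see it, so listing `misses` by session type would double as the requested error analysis.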

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback on our manuscript. We address each of the major comments below, providing clarifications and committing to revisions that enhance the experimental analysis and architectural details.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: aggregate metrics on DesktopBench are presented without per-session filter precision/recall, error analysis of discarded frames, or ablation restoring filtered screenshots. This directly bears on the central robustness claim under A→B→A interruptions, because any false-negative Filter Agent decision removes visual evidence before the Record or Memory Agents can process it.

    Authors: We agree that a more granular analysis of the Filter Agent would strengthen the robustness claims, particularly regarding potential loss of critical frames during task switches. In the revised version, we will include per-session filter precision and recall metrics, a detailed error analysis of discarded frames (categorized by session type and interruption points), and an ablation experiment that restores filtered screenshots to quantify their contribution to KIR and Task Accuracy under A→B→A conditions. These additions will demonstrate that the Filter Agent preserves task-critical information without selective loss. revision: yes

  2. Referee: [Experiments] Evaluation: the reported 60.4% token reduction, 72.3% VLM-call reduction, and KIR/Task-Acc figures lack error bars, statistical tests, or variance across sessions, and the baseline implementation details (e.g., whether it applies any filtering) are insufficient to interpret the magnitude of the gains.

    Authors: The baseline is implemented as an exhaustive VLM processing pipeline without any filtering or task attribution steps, directly applying visual reasoning to every screenshot. We will revise the evaluation section to report variance and standard deviations across the 420 sessions, include error bars on all reported metrics, and conduct appropriate statistical significance tests (such as paired t-tests) to validate the improvements. This will provide a clearer interpretation of the efficiency gains and performance boosts. revision: yes

  3. Referee: [Architecture] System description: the Filter Agent's decision criteria, model size, and any training procedure are described at a high level only, leaving open whether its noise-suppression behavior reliably preserves task-critical frames (especially return-to-task-A frames) rather than introducing selective loss.

    Authors: We agree that more detail on the Filter Agent is warranted. In the revised manuscript, we will describe its decision criteria in full (a prompt-based classifier operating on textual activity descriptions to identify noise vs. relevant frames), specify the underlying model size, and clarify that it relies on the base model's capabilities without custom training. This elaboration will address how the agent is prompted to retain task-critical frames, particularly those signaling returns to previous tasks like A in A→B→A scenarios. revision: yes
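
The paired t-test mentioned in the second response compares the two systems session by session. As a sketch of what that involves, the statistic can be computed with the standard library alone; the per-session KIR values below are invented, and the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(focal, baseline):
    """Paired t statistic and degrees of freedom for matched per-session scores."""
    if len(focal) != len(baseline) or len(focal) < 2:
        raise ValueError("need two equal-length samples with at least 2 sessions")
    diffs = [f - b for f, b in zip(focal, baseline)]
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = mean(diffs) / (sd / sqrt(len(diffs)))
    return t, len(diffs) - 1


# Invented per-session KIR scores for four sessions.
focal_kir = [0.6, 0.7, 0.5, 0.8]
baseline_kir = [0.4, 0.3, 0.5, 0.4]
t_stat, dof = paired_t_statistic(focal_kir, baseline_kir)
```

Turning the statistic into a p-value needs the t-distribution's CDF, which the standard library does not provide; SciPy's `scipy.stats.ttest_rel` returns both in one call. Because per-session recall is bounded in [0, 1] and may be far from normal, a Wilcoxon signed-rank test would be a reasonable complement.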

Circularity Check

0 steps flagged

No circularity: empirical results on external benchmark

Full rationale

The paper describes a multi-agent cascade architecture (Filter, Brain, Record, Memory Agents) for on-device desktop stream summarization and reports performance via direct experiments on the independently constructed DesktopBench (2,572 screenshots, 420 sessions). All headline metrics (60.4% token reduction, 72.3% fewer VLM calls, KIR 0.38→0.61, Task Acc 0.81 vs. 0.03 under A→B→A interruptions) are presented as measured outcomes on this external test set rather than outputs of any equation, fitted parameter, or self-referential definition. No mathematical derivations, uniqueness theorems, or ansatzes appear in the provided text; the claims rest on benchmark evaluation, not on any reduction to the system's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations or fitted constants are described; the system rests on standard assumptions that lightweight models can filter visual streams and that task attribution can be done from text alone.

pith-pipeline@v0.9.0 · 5547 in / 1187 out tokens · 82185 ms · 2026-05-10T00:50:12.842859+00:00 · methodology

