FOCAL: Filtered On-device Continuous Activity Logging for Efficient Personal Desktop Summarization
Pith reviewed 2026-05-10 00:50 UTC · model grok-4.3
The pith
FOCAL's cascaded agents filter desktop screenshots on-device to cut token use by 60% and maintain high recall even during task switches.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FOCAL implements a unified filter-plan-log pipeline: a lightweight Filter Agent suppresses noise in the screenshot stream, a text-only Brain Agent assigns screenshots to tasks, a Record Agent performs selective visual reasoning only when needed, and a task-isolated Memory Agent produces coherent multi-perspective summaries. On DesktopBench with 2,572 screenshots across 420 sessions, this yields 60.4% lower token consumption and 72.3% fewer VLM calls than a baseline while raising Key Information Recall (KIR) from 0.38 to 0.61. Under A→B→A interruptions, FOCAL sustains Task Accuracy 0.81 and KIR 0.80, whereas the baseline's Task Accuracy falls to 0.03.
What carries the argument
The filter-plan-log architecture that cascades a lightweight Filter Agent for noise suppression, a text-only Brain Agent for task attribution, a selective Record Agent for visual reasoning, and a task-isolated Memory Agent for context-coherent summarization.
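The cascade described above can be pictured with a minimal sketch. Everything named here (`Frame`, its fields, `run_cascade`, and the three callables standing in for the Filter, Brain, and Record Agents) is hypothetical and not taken from the paper; the sketch only illustrates the routing order and the per-task memory, not how the actual agents decide.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Frame:
    """One captured screenshot plus cheap text metadata (hypothetical fields)."""
    frame_id: int
    text: str            # e.g. window title or OCR text; an assumption
    needs_vision: bool   # stand-in for whatever triggers visual reasoning


def run_cascade(
    frames: List[Frame],
    is_noise: Callable[[Frame], bool],
    assign_task: Callable[[Frame], str],
    describe_visually: Callable[[Frame], str],
) -> Dict[str, List[str]]:
    """Route frames through filter -> plan -> log, keeping one log per task."""
    memory: Dict[str, List[str]] = defaultdict(list)
    for frame in frames:
        if is_noise(frame):            # Filter Agent: drop before any VLM cost
            continue
        task = assign_task(frame)      # Brain Agent: text-only task attribution
        if frame.needs_vision:         # Record Agent: VLM call only when needed
            note = describe_visually(frame)
        else:
            note = frame.text
        memory[task].append(note)      # Memory Agent: task-isolated log
    return dict(memory)
```

Because each task has its own list, a return to task A after an interruption by task B appends to A's existing log rather than to a shared global context, which is the property the interruption results depend on.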
If this is right
- Total VLM token consumption drops 60.4% and call count drops 72.3% compared with exhaustive processing.
- Task accuracy and key-information recall remain at or above 0.8 (0.81 and 0.80) even when users interrupt one task with another and return.
- All processing stays on-device, eliminating the need to transmit raw screenshots or logs to external servers.
- Multi-perspective personal logs become practical for long-running desktop use without manual curation.
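As a quick reading aid, the reported percentages can be restated as ratios. The arithmetic below uses only the figures quoted above; the symbols are labels chosen here, not notation from the paper.

```latex
\[
\frac{\text{FOCAL tokens}}{\text{baseline tokens}} = 1 - 0.604 = 0.396,
\qquad
\frac{\text{FOCAL VLM calls}}{\text{baseline VLM calls}} = 1 - 0.723 = 0.277,
\]
\[
\frac{\mathrm{KIR}_{\text{FOCAL}} - \mathrm{KIR}_{\text{baseline}}}{\mathrm{KIR}_{\text{baseline}}}
= \frac{0.61 - 0.38}{0.38} \approx 0.605 .
\]
```

Put differently, the baseline spends roughly 2.5 times the tokens and 3.6 times the VLM calls, while FOCAL's KIR is about 60% higher in relative terms.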
Where Pith is reading between the lines
- The same filter-first design could be applied to mobile screen recordings or browser history streams with only minor changes to the input format.
- If the Filter Agent generalizes across different operating systems, the approach might replace cloud-dependent activity trackers in productivity tools.
- Deploying the pipeline on lower-power devices would test whether the latency remains acceptable when the VLM is further quantized.
Load-bearing premise
The lightweight Filter Agent can discard screenshots without losing any that contain task-critical information, and the full agent pipeline stays fast enough on ordinary consumer hardware.
What would settle it
Measure Key Information Recall and per-session latency on a fresh collection of 500 desktop sessions containing frequent A to B to A switches; if recall drops below 0.5 or average latency exceeds typical VLM inference time on the same hardware, the central efficiency claim does not hold.
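The decision rule in this test can be written down explicitly. The sketch below is one possible reading: it assumes recall and latency are averaged over sessions and that a single reference latency for plain VLM inference on the same hardware is available. The function name and parameters are hypothetical.

```python
from statistics import mean
from typing import Sequence


def efficiency_claim_holds(
    kir_per_session: Sequence[float],
    latency_s_per_session: Sequence[float],
    vlm_reference_latency_s: float,
    min_recall: float = 0.5,
) -> bool:
    """Return True if neither falsification condition is triggered.

    Assumption: both quantities are aggregated as a mean over sessions.
    """
    if not kir_per_session or len(kir_per_session) != len(latency_s_per_session):
        raise ValueError("need one recall and one latency value per session")
    recall_ok = mean(kir_per_session) >= min_recall
    latency_ok = mean(latency_s_per_session) <= vlm_reference_latency_s
    return recall_ok and latency_ok
```

A stricter variant could require the conditions to hold on the A→B→A subset alone, since that is where filtering errors are most likely to matter.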
Abstract
Desktop interaction streams provide a continuous, privacy-sensitive record of interleaved user tasks. Transforming these streams into task-organized personal logs on-device faces two main challenges: exhaustive Vision-Language Model (VLM) processing strains local resources, and global stream processing causes cross-task context pollution. We present FOCAL (Filtered On-device Continuous Activity Logging), a privacy-first multi-agent system utilizing a unified filter-plan-log architecture. It cascades a lightweight Filter Agent for noise suppression, a text-only Brain Agent for task attribution, a Record Agent for selective visual reasoning, and a task-isolated Memory Agent for context-coherent summarization. Experiments on DesktopBench (comprising 2,572 screenshots across 420 complex sessions) show FOCAL reduces total token consumption by 60.4% and VLM call count by 72.3% versus a baseline, while boosting Key Information Recall (KIR) from 0.38 to 0.61. Crucially, under $A{\to}B{\to}A$ task interruptions, FOCAL maintains Task Acc 0.81 and KIR 0.80, whereas the baseline collapses to Task Acc 0.03. FOCAL pioneers the efficient, on-device summarization of instruction-free desktop streams into multi-perspective personal logs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FOCAL, a privacy-first multi-agent cascade for on-device summarization of continuous desktop screenshot streams. It employs a lightweight Filter Agent for noise suppression, a text-only Brain Agent for task attribution, a Record Agent for selective visual reasoning, and a task-isolated Memory Agent for coherent logging. On the DesktopBench benchmark (2,572 screenshots across 420 sessions), FOCAL is reported to cut token consumption by 60.4% and VLM calls by 72.3% relative to a baseline while raising Key Information Recall (KIR) from 0.38 to 0.61. Under A→B→A interruptions it maintains Task Acc 0.81 and KIR 0.80, whereas the baseline's Task Acc collapses to 0.03.
Significance. If the empirical claims hold, the work offers a practical route to efficient, on-device personal activity logging that respects privacy and consumer hardware limits. The emphasis on interruption robustness and token reduction addresses real deployment constraints in continuous desktop monitoring.
major comments (3)
- [Experiments] Experiments section: aggregate metrics on DesktopBench are presented without per-session filter precision/recall, error analysis of discarded frames, or ablation restoring filtered screenshots. This directly bears on the central robustness claim under A→B→A interruptions, because any false-negative Filter Agent decision removes visual evidence before the Record or Memory Agents can process it.
- [Experiments] Evaluation: the reported 60.4% token reduction, 72.3% VLM-call reduction, and KIR/Task-Acc figures lack error bars, statistical tests, or variance across sessions, and the baseline implementation details (e.g., whether it applies any filtering) are insufficient to interpret the magnitude of the gains.
- [Architecture] System description: the Filter Agent's decision criteria, model size, and any training procedure are described at a high level only, leaving open whether its noise-suppression behavior reliably preserves task-critical frames (especially return-to-task-A frames) rather than introducing selective loss.
minor comments (2)
- [Abstract] The abstract and introduction use both 'filter-plan-log' and 'unified filter-plan-log architecture'; a single consistent phrasing would improve clarity.
- [Experiments] No mention of whether DesktopBench or the FOCAL implementation will be released, which would strengthen reproducibility claims.
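The per-session filter precision and recall requested in the first major comment could be computed along the following lines. This is a sketch under stated assumptions: each session comes with a set of frame ids the filter kept and a set of frame ids annotated as task-critical, and a "positive" is a kept frame. The function names and the empty-set conventions are choices made here, not details from the paper.

```python
from typing import Dict, Set, Tuple


def filter_precision_recall(kept: Set[int], critical: Set[int]) -> Tuple[float, float]:
    """Precision: share of kept frames that are critical.
    Recall: share of critical frames that were kept.
    Convention (an assumption): an empty denominator scores 1.0.
    """
    true_pos = len(kept & critical)
    precision = true_pos / len(kept) if kept else 1.0
    recall = true_pos / len(critical) if critical else 1.0
    return precision, recall


def missed_critical(kept: Set[int], critical: Set[int]) -> Set[int]:
    """Critical frames the filter discarded (input for an error analysis)."""
    return critical - kept


def per_session_report(
    sessions: Dict[str, Tuple[Set[int], Set[int]]],
) -> Dict[str, Tuple[float, float]]:
    """Map session id -> (precision, recall) from (kept, critical) sets."""
    return {sid: filter_precision_recall(k, c) for sid, (k, c) in sessions.items()}
```

Recall is the number that bears on the robustness claim: any frame returned by `missed_critical` never reaches the Record or Memory Agents.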
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback on our manuscript. We address each of the major comments below, providing clarifications and committing to revisions that enhance the experimental analysis and architectural details.
- Referee: [Experiments] Experiments section: aggregate metrics on DesktopBench are presented without per-session filter precision/recall, error analysis of discarded frames, or ablation restoring filtered screenshots. This directly bears on the central robustness claim under A→B→A interruptions, because any false-negative Filter Agent decision removes visual evidence before the Record or Memory Agents can process it.
  Authors: We agree that a more granular analysis of the Filter Agent would strengthen the robustness claims, particularly regarding potential loss of critical frames during task switches. In the revised version, we will include per-session filter precision and recall metrics, a detailed error analysis of discarded frames (categorized by session type and interruption points), and an ablation experiment that restores filtered screenshots to quantify their contribution to KIR and Task Accuracy under A→B→A conditions. These additions will demonstrate that the Filter Agent preserves task-critical information without selective loss.
  Revision: yes
- Referee: [Experiments] Evaluation: the reported 60.4% token reduction, 72.3% VLM-call reduction, and KIR/Task-Acc figures lack error bars, statistical tests, or variance across sessions, and the baseline implementation details (e.g., whether it applies any filtering) are insufficient to interpret the magnitude of the gains.
  Authors: The baseline is implemented as an exhaustive VLM processing pipeline without any filtering or task attribution steps, directly applying visual reasoning to every screenshot. We will revise the evaluation section to report variance and standard deviations across the 420 sessions, include error bars on all reported metrics, and conduct appropriate statistical significance tests (such as paired t-tests) to validate the improvements. This will provide a clearer interpretation of the efficiency gains and performance boosts.
  Revision: yes
- Referee: [Architecture] System description: the Filter Agent's decision criteria, model size, and any training procedure are described at a high level only, leaving open whether its noise-suppression behavior reliably preserves task-critical frames (especially return-to-task-A frames) rather than introducing selective loss.
  Authors: We agree that more detail on the Filter Agent is warranted. In the revised manuscript, we will describe its decision criteria in full (a prompt-based classifier operating on textual activity descriptions to identify noise vs. relevant frames), specify the underlying model size, and clarify that it relies on the base model's capabilities without custom training. This elaboration will address how the agent is tuned to retain task-critical frames, particularly those signaling returns to previous tasks like A in A→B→A scenarios.
  Revision: yes
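The paired t-test mentioned in the second response compares FOCAL and the baseline on the same sessions. A minimal standard-library sketch of the test statistic follows; the function name and inputs are hypothetical, and it returns only the t value and degrees of freedom, not a p-value.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t_statistic(
    focal: Sequence[float], baseline: Sequence[float]
) -> Tuple[float, int]:
    """Return (t, degrees_of_freedom) for per-session paired scores."""
    if len(focal) != len(baseline) or len(focal) < 2:
        raise ValueError("need at least two paired sessions of equal length")
    diffs = [f - b for f, b in zip(focal, baseline)]
    sd = stdev(diffs)                      # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    t = mean(diffs) / (sd / sqrt(len(diffs)))
    return t, len(diffs) - 1
```

With the 420 DesktopBench sessions the test would have 419 degrees of freedom. Turning t into a p-value needs the t-distribution CDF, which a library routine such as `scipy.stats.ttest_rel` provides directly if SciPy is available.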
Circularity Check
No circularity: empirical results on external benchmark
The paper describes a multi-agent cascade architecture (Filter, Brain, Record, Memory Agents) for on-device desktop stream summarization and reports performance via direct experiments on the independently constructed DesktopBench (2,572 screenshots, 420 sessions). All headline metrics (60.4% token reduction, 72.3% fewer VLM calls, KIR 0.38→0.61, Task Acc 0.81 vs. 0.03 under A→B→A interruptions) are presented as measured outcomes on this external test set rather than outputs of any equation, fitted parameter, or self-referential definition. No mathematical derivations, uniqueness theorems, or ansatzes appear in the provided text; the claims rest on benchmark evaluation, not on any reduction to the system's own inputs by construction.