Perceive Before Reasoning: A Pre-Reasoning Perception Framework for Efficient and Reliable Proactive Mobile Agents


arxiv: 2606.03236 · v1 · pith:OARR4PAGnew · submitted 2026-06-02 · 💻 cs.AI

Zhijie Ding (1, 2), Weinan Hong (1, 3), Zicheng Zhu (1, 4), Lei Li (1), Dezhi Kong (1), Hao Wang (1), Peng Zhou (1), Xuchu Jiang (1), Jiaming Xu (1), and others

(1) HyperAI Team, Xiaomi Corporation; (2) Zhongnan University of Economics and Law; (3) Jilin University; (4) The Chinese University of Hong Kong, Shenzhen

Pith reviewed 2026-06-28 10:01 UTC · model grok-4.3

classification 💻 cs.AI

keywords: proactive mobile agents, multimodal LLMs, intervention gating, context compression, pre-reasoning framework, false trigger rate, success rate

The pith

A pre-reasoning perceptor decides when a mobile agent should intervene, activating full reasoning only when needed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes the Pre-Reasoning Perception Framework (PRPF) to separate the decision of when to assist from how to assist in proactive mobile agents. A lightweight Multimodal Proactive Perceptor (MPP) handles intervention gating and context compression. The Proactive Agent Reasoner (PAR) is activated only when intervention is warranted. This addresses goal misalignment and redundant inference in unified MLLM pipelines. On the ProactiveMobile benchmark, it reduces false trigger rates while improving success rates and efficiency.

Core claim

By perceiving before reasoning, PRPF uses a lightweight perceptor to gate interventions and compress context, activating the full reasoner only when appropriate, which reduces false trigger rates, improves success rates, and increases inference efficiency over the ProactiveMobile baseline.

What carries the argument

The Multimodal Proactive Perceptor (MPP), a lightweight model for intervention gating and context compression that precedes the Proactive Agent Reasoner (PAR).
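The gate-then-reason control flow can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the `Perception` type, the `run_prpf` function, the `threshold` parameter, and the toy perceptor and reasoner are all hypothetical names introduced here, and the assumption that the MPP emits a score in [0, 1] is not stated in the source.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Perception:
    """Hypothetical MPP output: a trigger score plus a compressed context."""
    trigger_score: float      # assumed to lie in [0, 1]
    compressed_context: str


def run_prpf(
    observation: str,
    perceptor: Callable[[str], Perception],
    reasoner: Callable[[str], str],
    threshold: float = 0.5,
) -> Optional[str]:
    """Return assistance text, or None when the agent should stay silent."""
    perception = perceptor(observation)
    if perception.trigger_score < threshold:
        return None  # gate closed: the heavy reasoner is never called
    # Gate open: the reasoner sees only the compressed context.
    return reasoner(perception.compressed_context)


# Toy stand-ins for the two models, for illustration only.
def toy_perceptor(observation: str) -> Perception:
    score = 0.9 if "low battery" in observation else 0.1
    return Perception(trigger_score=score, compressed_context=observation[:40])


reasoner_calls: List[str] = []


def toy_reasoner(context: str) -> str:
    reasoner_calls.append(context)  # record each (expensive) reasoner call
    return f"Suggest action for: {context}"
```

Calling `run_prpf("user scrolling feed", toy_perceptor, toy_reasoner)` returns `None` without touching the reasoner, which is the source of the claimed efficiency gain; raising `threshold` makes the gate more conservative.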

If this is right

False trigger rates decrease because the perceptor can be tuned for conservative intervention.
Success rates increase as the reasoner focuses on cases where assistance is truly needed.
Inference efficiency improves by avoiding full model calls when no intervention is required.
The framework decouples conservative filtering from comprehensive assistance generation.
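The efficiency point in the list above can be made concrete with a simple cost model. The sketch below assumes that the perceptor runs on every sample, that the reasoner runs only on triggered samples, and that a unified pipeline pays the full reasoner cost on every sample; the function names and the cost units are hypothetical, and none of the numbers come from the paper.

```python
def expected_cost(trigger_rate: float, mpp_cost: float, par_cost: float) -> float:
    """Expected per-sample cost of a gated pipeline.

    The perceptor always runs; the reasoner runs with probability trigger_rate.
    """
    if not 0.0 <= trigger_rate <= 1.0:
        raise ValueError("trigger_rate must be in [0, 1]")
    return mpp_cost + trigger_rate * par_cost


def break_even_trigger_rate(mpp_cost: float, par_cost: float) -> float:
    """Trigger rate below which gating is cheaper than always running the reasoner.

    Solves mpp_cost + p * par_cost < par_cost for p.
    """
    if par_cost <= 0:
        raise ValueError("par_cost must be positive")
    return 1.0 - mpp_cost / par_cost
```

Under these assumptions, a perceptor that costs 1 unit in front of a 20-unit reasoner pays off whenever fewer than 95% of samples trigger the reasoner; the saving shrinks as the perceptor grows heavier or interventions become more frequent.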

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar separation could benefit other agent systems that must decide on action versus inaction.
Training the perceptor on compressed context might allow deployment on edge devices for always-on monitoring.
If the perceptor's error rate is low, overall system reliability could exceed that of unified models even with occasional misses.

Load-bearing premise

The lightweight perceptor can accurately determine when intervention is needed using compressed context alone, without missing key cases or creating offsetting errors.

What would settle it

Measure the perceptor's accuracy on intervention decisions against full model or human labels; if missed interventions exceed the reported gains in false triggers, the net benefit disappears.
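A minimal sketch of that check follows. It assumes binary gate decisions and binary reference labels (from a full model or from humans), defines FTR as false triggers over all samples that needed no intervention, and weights a missed intervention equally with a false trigger; the paper's exact metric definitions are not given in the text above, so every definition and function name here is an assumption.

```python
from typing import Dict, Sequence


def gate_confusion(predicted: Sequence[bool], needed: Sequence[bool]) -> Dict[str, int]:
    """Count gate outcomes against reference labels."""
    if len(predicted) != len(needed):
        raise ValueError("predicted and needed must have the same length")
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for fired, should_fire in zip(predicted, needed):
        if fired and should_fire:
            counts["tp"] += 1
        elif fired:
            counts["fp"] += 1   # false trigger
        elif should_fire:
            counts["fn"] += 1   # missed intervention
        else:
            counts["tn"] += 1
    return counts


def false_trigger_rate(counts: Dict[str, int]) -> float:
    """Assumed definition: false triggers / samples needing no intervention."""
    negatives = counts["fp"] + counts["tn"]
    return counts["fp"] / negatives if negatives else 0.0


def miss_rate(counts: Dict[str, int]) -> float:
    """Missed interventions / samples needing an intervention."""
    positives = counts["fn"] + counts["tp"]
    return counts["fn"] / positives if positives else 0.0


def net_errors_avoided(baseline_false_triggers: int, gated: Dict[str, int]) -> int:
    """False triggers removed by the gate minus interventions it missed.

    Equal weighting of the two error types is an assumption; a positive
    value means the gate removes more errors than it introduces.
    """
    return (baseline_false_triggers - gated["fp"]) - gated["fn"]
```

A non-positive `net_errors_avoided` would correspond to the case described above, where missed interventions cancel the reduction in false triggers.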

Figures

Figures reproduced from arXiv: 2606.03236 by Zhijie Ding, Weinan Hong, Zicheng Zhu, Lei Li, Dezhi Kong, Hao Wang, Peng Zhou, Xuchu Jiang, Jiaming Xu, and others.

**Figure 2.** Overall framework of PRPF. Adjacent text from the paper: with ProactiveMobile as the evaluation setting, PRPF focuses on the architectural separation between lightweight pre-reasoning intervention perception and heavy VLM-based assistance reasoning. From §2.2 (GUI Perception and Efficient Reasoning): mobile and GUI agents provide the perception and execution substrate for proactive assistance, but most existing systems remain reactive. Prior work ha…

**Figure 4.** Per-modality breakdown of PRPF outcomes on ProactiveMobile under SR scoring. From §4.6 (Case Study): to localize PRPF's failures, the authors partition every test sample into one of five mutually exclusive outcomes under SR scoring and report the per-modality breakdown.

**Figure 5.** Sensitivity analysis of trigger-gate performance under different thresholds.

Original abstract

Multimodal large language models (MLLMs) have substantially advanced mobile agents, yet proactive mobile assistance remains challenging because agents must decide when to intervene before determining how to assist. Existing systems often implement these two decisions within a unified MLLM-based pipeline, leading to goal misalignment between conservative intervention filtering and comprehensive assistance generation, as well as redundant inference when the agent should remain silent. To address these limitations, we propose the Pre-Reasoning Perception Framework (PRPF), a two-stage framework built on perceiving before reasoning. PRPF introduces a lightweight Multimodal Proactive Perceptor (MPP) for intervention gating and context compression, and activates the Proactive Agent Reasoner (PAR) only when intervention is warranted. Experiments on the ProactiveMobile benchmark show that PRPF substantially reduces false trigger rates (FTR) while improving success rates (SR) and inference efficiency over the ProactiveMobile baseline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

PRPF splits perception from reasoning with a lightweight gate to cut unnecessary triggers in mobile agents, but the abstract gives no numbers or ablations so the gains stay unproven.

The core idea is simple and practical: run a small Multimodal Proactive Perceptor first to decide whether to wake the full Proactive Agent Reasoner, instead of feeding everything through one big MLLM every time. That explicit two-stage split with context compression is the main new framing, and it directly targets the goal misalignment and wasted inference the authors flag in unified pipelines.

They do a clean job laying out why conservative filtering and full assistance generation pull in different directions, and the architecture follows logically from that diagnosis. If the perceptor really keeps recall high while dropping false triggers, the efficiency win would be real for on-device agents.

The soft spots are the missing evidence. The abstract claims lower FTR, higher SR, and better efficiency on ProactiveMobile, yet supplies no tables, no ablations on the gate, no error bars, and no check on whether the compressed context causes missed interventions. The stress-test concern lands: false negatives from the lightweight gate would hurt real success rates even if the reported FTR looks good, and nothing in the description shows they measured that tradeoff. Without those results the central claim cannot be assessed.

No circularity or invented math appears. Citations track the usual MLLM-agent line.

This is for teams shipping proactive mobile agents who already run MLLM baselines and want a modular efficiency tweak. It is not reshaping the broader field. A serious referee could look at it once the experiments and recall analysis are added; right now the manuscript is too thin for that step.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes the Pre-Reasoning Perception Framework (PRPF) for proactive mobile agents. It decouples intervention gating and context compression into a lightweight Multimodal Proactive Perceptor (MPP) from downstream assistance generation in the Proactive Agent Reasoner (PAR), activating the latter only when the MPP signals intervention. Experiments on the ProactiveMobile benchmark are reported to show that PRPF substantially reduces false trigger rates (FTR) while improving success rates (SR) and inference efficiency relative to the ProactiveMobile baseline.

Significance. If the empirical results hold under detailed scrutiny, the two-stage separation directly targets the goal misalignment between conservative filtering and comprehensive generation that arises in unified MLLM pipelines, while also cutting redundant inference. The architectural choice of a dedicated lightweight perceptor for early gating is a concrete, falsifiable response to the 'when to intervene' problem and could generalize to other proactive agent settings.

major comments (2)

[Abstract and §4 (Experiments)] The headline claim of simultaneous FTR reduction and SR improvement is load-bearing, yet the manuscript supplies no numerical results, error bars, dataset splits, ablation tables, or statistical tests. Without these, the magnitude and robustness of the reported gains cannot be assessed.
[§3 (Framework) and §4.2 (Evaluation metrics)] The SR improvement presupposes that MPP intervention decisions made from compressed context achieve high recall relative to what the full PAR would decide. Reporting only FTR (false positives) leaves unaddressed the possibility that false-negative gating misses interventions that would have succeeded, directly undermining the net SR claim.

minor comments (1)

[Abstract] Notation for MPP and PAR is introduced in the abstract before full expansion; a parenthetical expansion on first use would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our Pre-Reasoning Perception Framework (PRPF). The comments highlight important aspects of empirical reporting and metric coverage that we address point-by-point below. We believe the two-stage design remains a substantive contribution and will strengthen the manuscript accordingly.

Referee: [Abstract and §4 (Experiments)] The headline claim of simultaneous FTR reduction and SR improvement is load-bearing, yet the manuscript supplies no numerical results, error bars, dataset splits, ablation tables, or statistical tests. Without these, the magnitude and robustness of the reported gains cannot be assessed.

Authors: We agree that the abstract presents only qualitative claims and that §4 would benefit from additional statistical detail. The full paper does contain tables in §4 reporting concrete FTR, SR, and efficiency numbers against the ProactiveMobile baseline and ablations in §4.3; however, error bars, explicit train/test splits, and significance tests are indeed absent. We will revise the abstract to include key numerical deltas and expand §4 with error bars (from 3 independent runs), dataset split descriptions, and statistical tests where appropriate.

Revision: yes

Referee: [§3 (Framework) and §4.2 (Evaluation metrics)] The SR improvement presupposes that MPP intervention decisions made from compressed context achieve high recall relative to what the full PAR would decide. Reporting only FTR (false positives) leaves unaddressed the possibility that false-negative gating misses interventions that would have succeeded, directly undermining the net SR claim.

Authors: The referee correctly identifies that FTR alone does not fully characterize gating quality. Because SR is measured on end-to-end task success (which requires both triggering when needed and providing correct assistance), an improvement in SR over the always-on baseline already provides indirect evidence that the false-negative rate is not catastrophic. That said, we acknowledge the value of direct recall analysis. In revision we will add a targeted evaluation on a held-out subset comparing MPP gating decisions against an oracle PAR, reporting recall and F1 for the intervention signal.

Revision: yes
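The recall and F1 analysis promised in this response could take roughly the following form. This is a sketch under stated assumptions, not the authors' evaluation code: it assumes the MPP exposes a per-sample trigger score, that the oracle PAR decision is available as a boolean label, and that a sample triggers when its score meets the threshold. The function name `gating_prf` and its parameters are hypothetical.

```python
from typing import Dict, Sequence


def gating_prf(
    scores: Sequence[float],
    oracle: Sequence[bool],
    threshold: float,
) -> Dict[str, float]:
    """Precision, recall, and F1 of thresholded gate scores against oracle labels."""
    if len(scores) != len(oracle):
        raise ValueError("scores and oracle must have the same length")
    tp = fp = fn = 0
    for score, should_fire in zip(scores, oracle):
        fired = score >= threshold
        if fired and should_fire:
            tp += 1
        elif fired:
            fp += 1
        elif should_fire:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Evaluating the same scores at several thresholds shows the tradeoff at issue: lowering the threshold raises recall (fewer missed interventions) at the cost of more false triggers.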

Circularity Check

0 steps flagged

No circularity: architectural proposal with empirical evaluation

The paper presents PRPF as a two-stage architectural framework (MPP for gating/compression + PAR activation) evaluated on the ProactiveMobile benchmark. No equations, fitted parameters renamed as predictions, self-citations, or uniqueness theorems appear in the provided text. Claims of reduced FTR, improved SR, and efficiency are presented as direct experimental outcomes against the baseline, with no reduction to self-referential definitions or inputs. This is a standard non-circular empirical systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Based on abstract only; the central claim rests on the untested premise that perception and reasoning decisions can be cleanly decoupled and that the lightweight perceptor preserves necessary context.

axioms (1)

Domain assumption: intervention decision and assistance generation can be separated into independent stages without performance loss.
The two-stage design depends on this separation being effective.

invented entities (2)

Multimodal Proactive Perceptor (MPP): no independent evidence
Purpose: lightweight intervention gating and context compression.
New module introduced to handle the "when" decision separately.

Proactive Agent Reasoner (PAR): no independent evidence
Purpose: full reasoning, activated conditionally.
New module introduced for the "how" decision.


work page arXiv

[46] [46]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities , author=. arXiv preprint arXiv:2507.06261 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[47] [47]

International Conference on Learning Representations , volume=

Flashattention-2: Faster attention with better parallelism and work partitioning , author=. International Conference on Learning Representations , volume=

[48] [48]

MSAO: Adaptive Modality Sparsity-Aware Offloading with Edge-Cloud Collaboration for Efficient Multimodal LLM Inference

MSAO: Adaptive Modality Sparsity-Aware Offloading with Edge-Cloud Collaboration for Efficient Multimodal LLM Inference , author=. arXiv preprint arXiv:2604.02945 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[49] [49]

2026 , month =

Claude Sonnet 4.6 , author =. 2026 , month =

2026

[50] [50]

Perceiver

Andrew Jaegle and Sebastian Borgeaud and Jean-Baptiste Alayrac and Carl Doersch and Catalin Ionescu and David Ding and Skanda Koppula and Daniel Zoran and Andrew Brock and Evan Shelhamer and Olivier J Henaff and Matthew Botvinick and Andrew Zisserman and Oriol Vinyals and Joao Carreira , booktitle=. Perceiver

[51] [51]

ArXiv , year=

Progressive Multimodal Reasoning via Active Retrieval , author=. ArXiv , year=

[52] [52]

ArXiv , year=

LCIRC: A Recurrent Compression Approach for Efficient Long-form Context and Query Dependent Modeling in LLMs , author=. ArXiv , year=

[53] [53]

USENIX Annual Technical Conference , year=

Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention , author=. USENIX Annual Technical Conference , year=

[54] [54]

2026 , eprint=

ProAgent: Harnessing On-Demand Sensory Contexts for Proactive LLM Agent Systems in the Wild , author=. 2026 , eprint=

2026

[55] [55]

Advances in Neural Information Processing Systems , volume=

Openvlthinker: Complex vision-language reasoning via iterative sft-rl cycles , author=. Advances in Neural Information Processing Systems , volume=

[56] [56]

Pattern recognition letters , volume=

An introduction to ROC analysis , author=. Pattern recognition letters , volume=. 2006 , publisher=

2006

[57] [57]

Neural Information Processing Systems , year=

Attention is all you need , author=. Neural Information Processing Systems , year=

[58] [58]

Focal Loss for Dense Object Detection , year=

Lin, Tsung-Yi and Goyal, Priya and Girshick, Ross and He, Kaiming and Dollár, Piotr , booktitle=. Focal Loss for Dense Object Detection , year=