ColorBrowserAgent: Complex Long-Horizon Browser Agent with Adaptive Knowledge Evolution

Huarong Deng; Jiamu Zhou; Jihong Wang; Jun Wang; Teng Wang; Weiming Zhang; Weinan Zhang; Weiwen Liu; Xingyu Lou; Zhuosheng Zhang

arxiv: 2601.07262 · v3 · submitted 2026-01-12 · 💻 cs.HC

ColorBrowserAgent: Complex Long-Horizon Browser Agent with Adaptive Knowledge Evolution

Jihong Wang , Jiamu Zhou , Weiming Zhang , Teng Wang , Weiwen Liu , Zhuosheng Zhang , Xingyu Lou , Weinan Zhang

show 2 more authors

Huarong Deng Jun Wang

This is my paper

Pith reviewed 2026-05-16 15:43 UTC · model grok-4.3

classification 💻 cs.HC

keywords agentcolorbrowseragentknowledgelong-horizonautomationbrowserchallengescomplex

0 comments

The pith

ColorBrowserAgent reaches 71.2% success on WebArena by using human-in-the-loop knowledge adaptation and progressive summarization to handle long-horizon browser tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The work focuses on web agents that must click, type, and navigate across many pages without losing track. Two ideas are central: when a person gives quick feedback on a mistake, the system stores it as general knowledge for similar future sites instead of forgetting it. Second, instead of keeping every past action, it creates running summaries that keep only what matters for the next steps. This combination is tested on standard benchmarks where it beats other agents and also in live company software where users report higher satisfaction.

Core claim

ColorBrowserAgent consistently outperforms strong baselines. It achieves a state-of-the-art success rate of 71.2% on WebArena and maintains 47.4% performance under zero-shot transfer setting on WebChoreArena. In commercial deployment, it improves user satisfaction by 19.3% relatively.

Load-bearing premise

That sparse human feedback can be reliably transformed into reusable domain knowledge across heterogeneous sites and that progressive summarization prevents decision drift without discarding critical task information.

read the original abstract

With the advancement of vision-language models, web automation has made significant progress. However, deploying autonomous agents in real-world settings remains challenging, primarily due to site heterogeneity, where generalist models lack domain-specific priors for diverse interfaces, and long-horizon instability, characterized by the accumulation of decision drift over extended interactions. To address these challenges, we introduce ColorBrowserAgent (Complex Long-Horizon Browser Agent), a knowledge-evolving agent for robust web automation. Our approach addresses these challenges through two synergistic mechanisms: human-in-the-loop knowledge adaptation that transforms sparse human feedback into reusable domain knowledge, and knowledge-aligned progressive summarization that stabilizes long interactions through memory compression. Extensive experiments on WebArena, WebChoreArena and industrial deployment show that ColorBrowserAgent consistently outperforms strong baselines. It achieves a state-of-the-art success rate of 71.2% on WebArena and maintains 47.4% performance under zero-shot transfer setting on WebChoreArena. In commercial deployment, it improves user satisfaction by 19.3% relatively, verifying its robustness in real-world scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ColorBrowserAgent shows a workable mix of human feedback loops and progressive summarization to stabilize long browser sessions, with ablations and deployment data that hold up.

read the letter

The main thing to know is that this paper gives a concrete recipe for reducing decision drift in long-horizon web agents. It turns occasional human feedback into reusable site-specific knowledge and pairs that with knowledge-aligned summarization to keep memory manageable without dropping key details. The reported results are 71.2% success on WebArena, 47.4% zero-shot transfer on WebChoreArena, and a 19.3% relative lift in user satisfaction from a real commercial deployment with an explicit user-study protocol.

Referee Report

0 major / 3 minor

Summary. The paper introduces ColorBrowserAgent, a browser agent for complex long-horizon web tasks that combines human-in-the-loop knowledge adaptation (converting sparse feedback into reusable domain knowledge) with knowledge-aligned progressive summarization to address site heterogeneity and decision drift. It reports state-of-the-art results of 71.2% success rate on WebArena, 47.4% under zero-shot transfer on WebChoreArena, and a 19.3% relative gain in user satisfaction from commercial deployment, supported by ablation studies and released code.

Significance. If the reported gains hold under the described protocols, the work provides a concrete, reproducible mechanism for injecting domain priors into generalist vision-language agents while stabilizing long interactions. The ablation tables isolating each component, the explicit user-study protocol for the deployment metric, and the release of code and prompts are notable strengths that support verification and extension.

minor comments (3)

[Methods] §4 (Methods): the pseudocode for human-feedback-to-knowledge conversion is helpful but leaves the exact mapping from sparse signals to site-specific priors underspecified; a short example of a feedback item and the resulting knowledge entry would clarify reusability across heterogeneous interfaces.
[Experiments] Table 2 (Ablations): the zero-shot WebChoreArena row reports 47.4% but does not state the number of trials or confidence intervals; adding these would strengthen the transfer claim.
[Experiments] §5.3 (Deployment): the 19.3% relative satisfaction improvement is presented with a user-study protocol, yet the exact questionnaire items and response scale are not reproduced; including them would aid interpretability.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review and recommendation for minor revision. The referee's summary accurately captures the core contributions of ColorBrowserAgent, including the human-in-the-loop knowledge adaptation and knowledge-aligned progressive summarization, along with the reported results on WebArena, WebChoreArena, and the commercial deployment. No specific major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The manuscript contains no equations, derivations, or parameter-fitting procedures that could reduce predictions to inputs by construction. All performance claims (71.2% WebArena success, 47.4% zero-shot WebChoreArena, 19.3% user-satisfaction lift) are presented as direct empirical outcomes from benchmark runs, ablation tables, and a documented user study. The two core mechanisms—human-in-the-loop knowledge adaptation and progressive summarization—are described via pseudocode and isolated in ablations rather than being defined in terms of the metrics they are claimed to improve. No self-citation chain is invoked to establish uniqueness or forbid alternatives, and the work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond the high-level agent name and two mechanisms.

pith-pipeline@v0.9.0 · 5518 in / 1011 out tokens · 20834 ms · 2026-05-16T15:43:32.576494+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark
cs.AI 2026-04 unverdicted novelty 7.0

WebForge is an automated multi-agent framework that creates realistic and reproducible browser agent benchmarks at scale, demonstrated via a 934-task benchmark that reveals distinct model capability profiles through m...