ColorBrowserAgent: Complex Long-Horizon Browser Agent with Adaptive Knowledge Evolution
Pith reviewed 2026-05-16 15:43 UTC · model grok-4.3
The pith
ColorBrowserAgent reaches 71.2% success on WebArena by using human-in-the-loop knowledge adaptation and progressive summarization to handle long-horizon browser tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ColorBrowserAgent consistently outperforms strong baselines. It achieves a state-of-the-art success rate of 71.2% on WebArena and maintains 47.4% performance under zero-shot transfer setting on WebChoreArena. In commercial deployment, it improves user satisfaction by 19.3% relatively.
Load-bearing premise
That sparse human feedback can be reliably transformed into reusable domain knowledge across heterogeneous sites and that progressive summarization prevents decision drift without discarding critical task information.
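To make this premise concrete, here is a minimal sketch of what turning one sparse feedback item into a reusable, site-scoped knowledge entry could look like. Everything in it is a hypothetical illustration and not the paper's data model or code: the `FeedbackItem` and `KnowledgeEntry` types, the `KnowledgeStore` class, the `shop.example` site key, and the example hint text are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FeedbackItem:
    """One sparse human correction (hypothetical schema)."""
    site: str        # site the failed episode ran on
    task: str        # task description the agent was given
    correction: str  # what the human said the agent should have done


@dataclass(frozen=True)
class KnowledgeEntry:
    """A reusable, site-scoped hint derived from feedback (hypothetical schema)."""
    site: str
    hint: str


@dataclass
class KnowledgeStore:
    """Maps a site key to the hints collected for it."""
    entries: dict[str, list[KnowledgeEntry]] = field(default_factory=dict)

    def adapt(self, item: FeedbackItem) -> KnowledgeEntry:
        """Store the correction as a site-level hint, skipping exact duplicates."""
        entry = KnowledgeEntry(site=item.site, hint=item.correction.strip())
        bucket = self.entries.setdefault(item.site, [])
        if entry not in bucket:
            bucket.append(entry)
        return entry

    def retrieve(self, site: str) -> list[str]:
        """Return hints for a site; unknown sites get no priors."""
        return [e.hint for e in self.entries.get(site, [])]


store = KnowledgeStore()
feedback = FeedbackItem(
    site="shop.example",  # hypothetical site key
    task="Find the total spent in March",
    correction="Order history is under My Account, not the search bar.",
)
store.adapt(feedback)
hints = store.retrieve("shop.example")
```

Keying hints by site is the simplest guard against applying one site's prior to another. It does not address the harder part of the premise, which is whether a hint learned on one interface generalizes to a heterogeneous one.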
Original abstract
With the advancement of vision-language models, web automation has made significant progress. However, deploying autonomous agents in real-world settings remains challenging, primarily due to site heterogeneity, where generalist models lack domain-specific priors for diverse interfaces, and long-horizon instability, characterized by the accumulation of decision drift over extended interactions. To address these challenges, we introduce ColorBrowserAgent (Complex Long-Horizon Browser Agent), a knowledge-evolving agent for robust web automation. Our approach addresses these challenges through two synergistic mechanisms: human-in-the-loop knowledge adaptation that transforms sparse human feedback into reusable domain knowledge, and knowledge-aligned progressive summarization that stabilizes long interactions through memory compression. Extensive experiments on WebArena, WebChoreArena and industrial deployment show that ColorBrowserAgent consistently outperforms strong baselines. It achieves a state-of-the-art success rate of 71.2% on WebArena and maintains 47.4% performance under zero-shot transfer setting on WebChoreArena. In commercial deployment, it improves user satisfaction by 19.3% relatively, verifying its robustness in real-world scenarios.
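The memory-compression idea in the abstract can be sketched as a rolling window: recent steps stay verbatim and older steps are folded into a running summary. This is an illustration under stated assumptions and not the authors' implementation. The `ProgressiveMemory` class, the `window` parameter, and the `naive_summarize` placeholder are hypothetical, and the sketch does not attempt the "knowledge-aligned" part of the method.

```python
from collections import deque
from typing import Callable, Deque, List


def naive_summarize(summary: str, step: str) -> str:
    """Placeholder summarizer: append the step and cap the length.

    A real agent would presumably call a language model here.
    """
    merged = f"{summary}; {step}" if summary else step
    return merged[-200:]  # keep only the most recent 200 characters


class ProgressiveMemory:
    """Keeps the last `window` steps verbatim and compresses older ones."""

    def __init__(self, window: int,
                 summarize: Callable[[str, str], str] = naive_summarize):
        self.window = window
        self.summarize = summarize
        self.summary = ""
        self.recent: Deque[str] = deque()

    def add(self, step: str) -> None:
        """Record a step; fold anything beyond the window into the summary."""
        self.recent.append(step)
        while len(self.recent) > self.window:
            oldest = self.recent.popleft()
            self.summary = self.summarize(self.summary, oldest)

    def context(self) -> List[str]:
        """Return the compressed history to place in the agent's prompt."""
        head = [f"[summary] {self.summary}"] if self.summary else []
        return head + list(self.recent)


memory = ProgressiveMemory(window=2)
for action in ["open login page", "type username", "type password", "click submit"]:
    memory.add(action)
ctx = memory.context()
# ctx == ["[summary] open login page; type username",
#         "type password", "click submit"]
```

The length cap in `naive_summarize` is where the trade-off shows up: the context stays bounded however long the interaction runs, but whatever the summarizer drops is no longer available to later decisions.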
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ColorBrowserAgent, a browser agent for complex long-horizon web tasks that combines human-in-the-loop knowledge adaptation (converting sparse feedback into reusable domain knowledge) with knowledge-aligned progressive summarization to address site heterogeneity and decision drift. It reports state-of-the-art results of 71.2% success rate on WebArena, 47.4% under zero-shot transfer on WebChoreArena, and a 19.3% relative gain in user satisfaction from commercial deployment, supported by ablation studies and released code.
Significance. If the reported gains hold under the described protocols, the work provides a concrete, reproducible mechanism for injecting domain priors into generalist vision-language agents while stabilizing long interactions. The ablation tables isolating each component, the explicit user-study protocol for the deployment metric, and the release of code and prompts are notable strengths that support verification and extension.
Minor comments (3)
- [Methods] §4 (Methods): the pseudocode for human-feedback-to-knowledge conversion is helpful but leaves the exact mapping from sparse signals to site-specific priors underspecified; a short example of a feedback item and the resulting knowledge entry would clarify reusability across heterogeneous interfaces.
- [Experiments] Table 2 (Ablations): the zero-shot WebChoreArena row reports 47.4% but does not state the number of trials or confidence intervals; adding these would strengthen the transfer claim.
- [Experiments] §5.3 (Deployment): the 19.3% relative satisfaction improvement is presented with a user-study protocol, yet the exact questionnaire items and response scale are not reproduced; including them would aid interpretability.
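As a rough illustration of what the second comment asks for, the sketch below computes a Wilson score interval for a binomial success rate. Only the 47.4% rate comes from the report. The trial count `n = 500` is a made-up placeholder, since the comment notes the actual number is not stated, and the `wilson_interval` helper is a hypothetical name.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


n = 500                       # hypothetical trial count, NOT from the paper
successes = round(0.474 * n)  # 47.4% success rate as reported
low, high = wilson_interval(successes, n)
print(f"{successes}/{n} successes, 95% interval: [{low:.3f}, {high:.3f}]")
```

The interval narrows as `n` grows, which is why the transfer claim is hard to weigh without the trial count.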
Simulated Author's Rebuttal
We thank the referee for the positive review and recommendation for minor revision. The referee's summary accurately captures the core contributions of ColorBrowserAgent, including the human-in-the-loop knowledge adaptation and knowledge-aligned progressive summarization, along with the reported results on WebArena, WebChoreArena, and the commercial deployment. No specific major comments were provided in the report.
Circularity Check
No significant circularity
The manuscript contains no equations, derivations, or parameter-fitting procedures that could reduce predictions to inputs by construction. All performance claims (71.2% WebArena success, 47.4% zero-shot WebChoreArena, 19.3% user-satisfaction lift) are presented as direct empirical outcomes from benchmark runs, ablation tables, and a documented user study. The two core mechanisms—human-in-the-loop knowledge adaptation and progressive summarization—are described via pseudocode and isolated in ablations rather than being defined in terms of the metrics they are claimed to improve. No self-citation chain is invoked to establish uniqueness or forbid alternatives, and the work is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 1 Pith paper
- WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark
WebForge is an automated multi-agent framework that creates realistic and reproducible browser agent benchmarks at scale, demonstrated via a 934-task benchmark that reveals distinct model capability profiles through m...