ExpSeek: Self-Triggered Experience Seeking for Web Agents
Pith reviewed 2026-05-16 14:46 UTC · model grok-4.3
The pith
Web agents can seek step-level experience proactively by monitoring their own entropy signals.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ExpSeek shifts experience intervention to step-level proactive seeking. It estimates entropy thresholds from the model's intrinsic signals to decide when to intervene, and it designs experience content tailored to each step. The paper reports absolute improvements of 9.3 percent on the 8B model and 7.5 percent on the 32B model across four challenging web agent benchmarks.
What carries the argument
Step-level entropy thresholds computed from the agent's own output signals, used both to time the intervention and to select or generate matching experience content.
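As a rough illustration of what a step-level entropy signal could look like, the sketch below averages token-level Shannon entropy over the tokens generated in one agent step. The function names and the choice of mean aggregation are assumptions made for illustration; this summary does not state the paper's exact definition.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def step_entropy(token_dists: Sequence[Sequence[float]]) -> float:
    """Mean token entropy over one agent step (assumed aggregation rule)."""
    if not token_dists:
        raise ValueError("a step must contain at least one token")
    return sum(token_entropy(d) for d in token_dists) / len(token_dists)
```

A step whose tokens are all near-deterministic scores close to zero, while a step with many near-uniform token choices scores high, which is the property a trigger would rely on.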
If this is right
- Experience can be supplied dynamically during interaction instead of only as global context before the task begins.
- A 4B-scale experience model is sufficient to improve substantially larger agent models.
- Entropy serves as a self-contained signal that removes dependence on post-hoc tuning or external supervision.
- The same trigger mechanism applies across multiple web-agent benchmarks without benchmark-specific adjustments.
Where Pith is reading between the lines
- The entropy trigger may extend to non-web agent settings such as code generation or multi-step planning where uncertainty also accumulates step by step.
- Focusing experience storage only on high-entropy steps could reduce memory requirements for long-horizon agents.
- Combining ExpSeek with continued pre-training on high-entropy traces might further shrink the gap between small and large agent models.
- If entropy thresholds prove stable across model families, the method could be applied zero-shot to new agent architectures without retraining the trigger.
Load-bearing premise
Entropy values produced by the model at each step reliably mark the moments when experience should be sought and what content is useful, without external labels or extra tuning.
What would settle it
Replace the entropy-threshold trigger with fixed or random timing on the same four benchmarks and check whether the reported gains of 9.3 percent and 7.5 percent disappear.
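The ablation described here could be set up roughly as follows. This is a minimal sketch, assuming per-step entropies are already logged; `entropy_trigger`, `random_trigger`, and `fixed_trigger` are hypothetical helper names, not functions from the released code.

```python
import random
from typing import List, Sequence


def entropy_trigger(entropies: Sequence[float], threshold: float) -> List[bool]:
    """Fire at every step whose entropy exceeds the threshold."""
    return [h > threshold for h in entropies]


def random_trigger(n_steps: int, k: int, seed: int = 0) -> List[bool]:
    """Fire at k steps sampled uniformly without replacement."""
    rng = random.Random(seed)
    chosen = set(rng.sample(range(n_steps), k))
    return [i in chosen for i in range(n_steps)]


def fixed_trigger(n_steps: int, every: int) -> List[bool]:
    """Fire at every `every`-th step, regardless of entropy."""
    return [(i + 1) % every == 0 for i in range(n_steps)]
```

Setting `k` to the number of steps at which the entropy trigger fired keeps the intervention budget equal across conditions, so any remaining gap can be attributed to timing rather than to the amount of guidance.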
Original abstract
Experience intervention in web agents emerges as a promising technical paradigm, enhancing agent interaction capabilities by providing valuable insights from accumulated experiences. However, existing methods predominantly inject experience passively as global context before task execution, struggling to adapt to dynamically changing contextual observations during agent-environment interaction. We propose ExpSeek, which shifts experience toward step-level proactive seeking: (1) estimating step-level entropy thresholds to determine intervention timing using the model's intrinsic signals; (2) designing step-level tailored experience content. Experiments on Qwen3-8B and 32B models across four challenging web agent benchmarks demonstrate that ExpSeek achieves absolute improvements of 9.3% and 7.5%, respectively. Our experiments validate the feasibility and advantages of entropy as a self-triggering signal, reveal that even a small-scale 4B experience model can significantly boost the performance of larger agent models. The code is released at https://github.com/WYRipple/ExpSeek.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ExpSeek, a method for web agents that shifts experience intervention from passive global context to step-level proactive seeking. It estimates step-level entropy thresholds using the agent's intrinsic model signals to decide intervention timing and designs tailored experience content for each step. Experiments on Qwen3-8B and 32B models across four web agent benchmarks report absolute gains of 9.3% and 7.5%, respectively, while also showing that a small 4B experience model can boost larger agents. The code is released publicly.
Significance. If the central performance claims hold after verification that thresholds are computed without benchmark-specific fitting, the work would advance web agent autonomy by demonstrating a viable intrinsic self-triggering mechanism based on entropy. It provides concrete evidence that step-level experience seeking outperforms passive approaches and that small auxiliary models can be effective, with potential implications for scalable agent training. The public code release strengthens reproducibility.
major comments (2)
- [§3.2] The threshold estimation procedure is described as using running entropy statistics, but the manuscript does not explicitly state whether the quantile or cutoff value is fixed globally across all tasks or selected via any form of validation or percentile fitting on the four evaluation benchmarks. Because this choice directly determines both timing and content selection, any benchmark-specific calibration would undermine the claim of purely intrinsic, self-triggered intervention.
- [§4] Experimental results (abstract and §4): Absolute improvements of 9.3% and 7.5% are reported without accompanying details on baseline definitions, statistical significance tests, exact entropy threshold computation (including any hyperparameters), or data exclusion rules. These omissions leave the central performance claim only partially supported and prevent independent assessment of robustness.
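One way the missing significance check could be run is a paired bootstrap over per-task outcomes. The sketch below is an assumption about how such a test might be set up, not the authors' procedure; it takes 0/1 success flags for the baseline and for the ExpSeek-style agent on the same tasks, and `paired_bootstrap_not_better_rate` is a hypothetical name.

```python
import random
from typing import Sequence


def paired_bootstrap_not_better_rate(
    base: Sequence[int],
    treat: Sequence[int],
    n_resamples: int = 10_000,
    seed: int = 0,
) -> float:
    """Fraction of task resamples in which the treatment does not beat the baseline."""
    if len(base) != len(treat) or not base:
        raise ValueError("need two non-empty outcome lists of equal length")
    rng = random.Random(seed)
    diffs = [t - b for b, t in zip(base, treat)]  # per-task paired difference
    n = len(diffs)
    not_better = 0
    for _ in range(n_resamples):
        # Resample tasks with replacement, keeping each pair together.
        total = sum(diffs[rng.randrange(n)] for _ in range(n))
        if total <= 0:
            not_better += 1
    return not_better / n_resamples
```

A small returned value suggests the gain is unlikely to be an artifact of which tasks happened to be sampled; it says nothing about sensitivity to the threshold hyperparameters, which would need a separate sweep.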
minor comments (1)
- The manuscript would benefit from a clear pseudocode or algorithmic box for the full ExpSeek loop, including how entropy is computed at each step and how the experience model is queried.
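To make the request concrete, the loop might look something like the sketch below. Everything here is hypothetical: `StepResult`, `run_episode`, and the three callables stand in for the agent, the tool environment, and the experience model, and the control flow (guidance appended after the tool result, a high-entropy answer sent back for another round) is an assumed reading rather than the paper's algorithm.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class StepResult:
    response: str   # agent thought plus tool call or answer
    entropy: float  # step-level entropy of the response
    is_final: bool  # True when the response contains a final answer


def run_episode(
    question: str,
    agent_step: Callable[[List[str]], StepResult],
    call_tool: Callable[[str], str],
    seek_experience: Callable[[List[str]], str],
    threshold: float,
    max_steps: int = 10,
) -> Tuple[Optional[str], int]:
    """Return (final response or None, number of experience interventions)."""
    history = [question]
    interventions = 0
    for _ in range(max_steps):
        step = agent_step(history)
        history.append(step.response)
        needs_help = step.entropy > threshold
        if step.is_final and not needs_help:
            return step.response, interventions
        if not step.is_final:
            history.append(call_tool(step.response))  # tool observation
        if needs_help:
            history.append(seek_experience(history))  # step-level guidance
            interventions += 1
    return None, interventions
```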
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to incorporate the requested clarifications and details.
-
Referee: [§3.2] The threshold estimation procedure is described as using running entropy statistics, but the manuscript does not explicitly state whether the quantile or cutoff value is fixed globally across all tasks or selected via any form of validation or percentile fitting on the four evaluation benchmarks. Because this choice directly determines both timing and content selection, any benchmark-specific calibration would undermine the claim of purely intrinsic, self-triggered intervention.
Authors: We thank the referee for highlighting this point. The threshold in §3.2 is computed solely from running entropy statistics based on the agent's intrinsic model signals during interaction. We maintain a running mean and standard deviation of observed entropy values and set the cutoff as a fixed global multiplier (1.5 standard deviations above the mean) determined from preliminary runs on a small set of non-benchmark web tasks. No quantile, percentile, or any form of validation/fitting was performed on the four evaluation benchmarks. We will revise §3.2 to include the exact formula, the multiplier value, initialization details, and explicit confirmation that no benchmark data influenced the threshold to reinforce the intrinsic nature of the mechanism.
Revision: yes
-
Referee: [§4] Experimental results (abstract and §4): Absolute improvements of 9.3% and 7.5% are reported without accompanying details on baseline definitions, statistical significance tests, exact entropy threshold computation (including any hyperparameters), or data exclusion rules. These omissions leave the central performance claim only partially supported and prevent independent assessment of robustness.
Authors: We agree that additional experimental details are needed for full transparency and reproducibility. In the revised §4, we will add: precise definitions and adaptations of all baselines; results of statistical significance tests (e.g., paired bootstrap or McNemar's test with p-values); the complete mathematical specification and hyperparameters for entropy threshold computation (including running statistics initialization and the multiplier); and any data exclusion or filtering rules applied. We will also cross-reference the publicly released code for implementation details. These changes will allow independent assessment of the reported gains.
Revision: yes
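The running-statistics rule described in the first response (cutoff at 1.5 standard deviations above the running mean) could be sketched as below. This follows the simulated rebuttal rather than the paper itself; the class name, the Welford-style update, and the `warmup` period are assumptions, since initialization details are exactly what the referee found missing.

```python
import math


class RunningEntropyThreshold:
    """Tracks running mean/std of step entropy and flags unusually high steps."""

    def __init__(self, multiplier: float = 1.5, warmup: int = 5) -> None:
        self.multiplier = multiplier  # 1.5 per the simulated rebuttal
        self.warmup = warmup          # assumed: no triggers until enough data
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0                # sum of squared deviations (Welford)

    def cutoff(self) -> float:
        """Current threshold: mean + multiplier * sample standard deviation."""
        if self.count < 2:
            return math.inf
        std = math.sqrt(self._m2 / (self.count - 1))
        return self.mean + self.multiplier * std

    def update(self, entropy: float) -> bool:
        """Decide whether this step triggers, then fold it into the statistics."""
        fire = self.count >= self.warmup and entropy > self.cutoff()
        self.count += 1
        delta = entropy - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (entropy - self.mean)
        return fire
```

Deciding before updating means a spike is judged against the statistics of earlier steps only, so a single extreme value cannot raise its own cutoff.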
Circularity Check
No significant circularity detected in derivation chain
The paper claims that step-level entropy thresholds are computed from the agent model's intrinsic signals via running statistics (Section 3.2) to decide intervention timing, with experience content then tailored accordingly. This construction uses native model outputs rather than fitting any derived quantity back to the reported benchmark gains (9.3% / 7.5%). No equation reduces the threshold selection to a post-hoc fit on the same evaluation distributions, no self-citation supplies a uniqueness theorem that forbids alternatives, and the central result is presented as an empirical outcome on held-out web-agent tasks. The method is therefore self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Model output entropy can be used to estimate step-level intervention thresholds from intrinsic signals.
Forward citations
Cited by 3 Pith papers
- Context Training with Active Information Seeking: Adding active search tools to LLM context optimization works only when combined with a multi-candidate search-based training procedure that prunes contexts, delivering gains across low-resource translation, health, an...
- GAM: Hierarchical Graph-based Agentic Memory for LLM Agents: GAM decouples event-level memory encoding from topic-level consolidation in LLM agents using hierarchical graphs to reduce interference and improve long-term coherence and retrieval.
- Context Training with Active Information Seeking: Active information seeking via search tools, when combined with multi-candidate context pruning during training, produces consistent gains on translation, health, and reasoning tasks over naive tool addition or no-too...