ExpSeek: Self-Triggered Experience Seeking for Web Agents

Bingli Wu; Haiyang Yu; Juwei Yue; Shuaiyi Nie; Tingwen Liu; Wenyuan Zhang; Xinghua Zhang; Yongbin Li

arXiv: 2601.08605 · v2 · submitted 2026-01-13 · 💻 cs.CL · cs.AI

Wenyuan Zhang, Xinghua Zhang, Haiyang Yu, Shuaiyi Nie, Bingli Wu, Juwei Yue, Tingwen Liu, Yongbin Li

Pith reviewed 2026-05-16 14:46 UTC · model grok-4.3

keywords: web agents · experience seeking · entropy thresholds · self-triggered intervention · agent benchmarks · Qwen models · proactive experience

The pith

Web agents can seek step-level experience proactively by monitoring their own entropy signals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

ExpSeek moves experience use in web agents from passive pre-task injection to active, step-by-step seeking. The method estimates entropy thresholds directly from the agent's intrinsic output signals to decide intervention timing and then crafts tailored experience content for that step. Tests on Qwen3-8B and Qwen3-32B across four web agent benchmarks report absolute gains of 9.3 percent and 7.5 percent, respectively. A small 4B experience model is shown to lift performance of the larger agents. The approach treats entropy as a self-contained trigger that removes the need for external labels or post-hoc tuning.

Core claim

ExpSeek shifts experience intervention to step-level proactive seeking by estimating entropy thresholds from the model's intrinsic signals to determine intervention timing and by designing step-level tailored experience content, achieving absolute improvements of 9.3 percent on the 8B model and 7.5 percent on the 32B model across four challenging web agent benchmarks.

What carries the argument

Step-level entropy thresholds computed from the agent's own output signals, used both to time the intervention and to select or generate matching experience content.
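
As a rough illustration of the signal described above, the sketch below computes a step-level entropy as the mean Shannon entropy of the per-token output distributions in one agent step. The averaging rule and the names `token_entropy` and `step_entropy` are assumptions made for illustration; this page does not give the paper's exact definition.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def step_entropy(token_distributions: Sequence[Sequence[float]]) -> float:
    """Mean token entropy over all tokens generated in a single agent step.

    Averaging over tokens is an assumption made for this sketch.
    """
    if not token_distributions:
        raise ValueError("a step must contain at least one generated token")
    total = sum(token_entropy(d) for d in token_distributions)
    return total / len(token_distributions)


# Hypothetical distributions: a confident step (peaked) and an uncertain one (flat).
confident = [[0.97, 0.01, 0.01, 0.01], [0.99, 0.01, 0.0, 0.0]]
uncertain = [[0.25, 0.25, 0.25, 0.25], [0.4, 0.3, 0.2, 0.1]]
print(step_entropy(confident) < step_entropy(uncertain))
```

In practice the distributions would come from the agent model's own logits, which is what makes the signal intrinsic: no external label is consulted.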

If this is right

Experience can be supplied dynamically during interaction instead of only as global context before the task begins.
A 4B-scale experience model is sufficient to improve substantially larger agent models.
Entropy serves as a self-contained signal that removes dependence on post-hoc tuning or external supervision.
The same trigger mechanism applies across multiple web-agent benchmarks without benchmark-specific adjustments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The entropy trigger may extend to non-web agent settings such as code generation or multi-step planning where uncertainty also accumulates step by step.
Focusing experience storage only on high-entropy steps could reduce memory requirements for long-horizon agents.
Combining ExpSeek with continued pre-training on high-entropy traces might further shrink the gap between small and large agent models.
If entropy thresholds prove stable across model families, the method could be applied zero-shot to new agent architectures without retraining the trigger.

Load-bearing premise

Entropy values produced by the model at each step reliably mark the moments when experience should be sought and what content is useful, without external labels or extra tuning.

What would settle it

Replace the entropy-threshold trigger with fixed or random timing on the same four benchmarks and check whether the reported gains of 9.3 percent and 7.5 percent disappear.
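
The control conditions named above can be sketched as interchangeable trigger functions. This is a minimal sketch, not the paper's evaluation code: the function names, the toy entropies, and the idea of matching the number of interventions across conditions are all assumptions introduced here.

```python
import random
from typing import List, Sequence


def entropy_trigger(entropies: Sequence[float], threshold: float) -> List[bool]:
    """Intervene at every step whose entropy exceeds the threshold."""
    return [h > threshold for h in entropies]


def random_trigger(num_steps: int, budget: int, seed: int = 0) -> List[bool]:
    """Control: intervene at `budget` steps chosen uniformly at random."""
    rng = random.Random(seed)
    chosen = set(rng.sample(range(num_steps), budget))
    return [i in chosen for i in range(num_steps)]


def fixed_trigger(num_steps: int, every: int) -> List[bool]:
    """Control: intervene at every `every`-th step regardless of entropy."""
    return [(i + 1) % every == 0 for i in range(num_steps)]


# Hypothetical per-step entropies for one trajectory (not from the paper).
entropies = [0.2, 0.9, 0.3, 1.4, 0.1, 0.8]
by_entropy = entropy_trigger(entropies, threshold=0.75)
budget = sum(by_entropy)  # keep the intervention count equal across conditions
by_chance = random_trigger(len(entropies), budget)
print(by_entropy, sum(by_chance) == budget)
```

Matching the budget matters for the comparison: if the random control intervened more or less often than the entropy trigger, a difference in accuracy could reflect intervention frequency and not timing.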

Figures

Figures reproduced from arXiv: 2601.08605 by Bingli Wu, Haiyang Yu, Juwei Yue, Shuaiyi Nie, Tingwen Liu, Wenyuan Zhang, Xinghua Zhang, Yongbin Li.

**Figure 2.** The overall architecture of ExpSeek, including experience base construction and actively seeking

**Figure 3.** Entropy distributions of process and answer

**Figure 4.** Entropy distributions of process and answer steps for Qwen3-8B before and after applying ExpSeek across

**Figure 5.** Scaling Law of experience model Me.

| Method | GAIA | xbench |
| --- | --- | --- |
| Qwen3-8B ← E-8B / 32B | 36.89 / 35.60 | 37.20 / 36.00 |
| Qwen3-32B ← E-32B / 8B | 43.88 / 40.33 | 42.00 / 37.20 |

**Figure 6.** Cross-comparison results of performance and efficiency after adjusting intervention intensity.

**Figure 7.** Correlation between repository size and per

**Figure 8.** Entropy distributions of process and answer

**Figure 9.** Entropy distributions of process and answer steps for Qwen3-32B before and after applying ExpSeek

Original abstract

Experience intervention in web agents emerges as a promising technical paradigm, enhancing agent interaction capabilities by providing valuable insights from accumulated experiences. However, existing methods predominantly inject experience passively as global context before task execution, struggling to adapt to dynamically changing contextual observations during agent-environment interaction. We propose ExpSeek, which shifts experience toward step-level proactive seeking: (1) estimating step-level entropy thresholds to determine intervention timing using the model's intrinsic signals; (2) designing step-level tailored experience content. Experiments on Qwen3-8B and 32B models across four challenging web agent benchmarks demonstrate that ExpSeek achieves absolute improvements of 9.3% and 7.5%, respectively. Our experiments validate the feasibility and advantages of entropy as a self-triggering signal and reveal that even a small-scale 4B experience model can significantly boost the performance of larger agent models. The code is released at https://github.com/WYRipple/ExpSeek.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

ExpSeek shifts web agents to step-level proactive experience seeking via intrinsic entropy triggers and reports solid gains, though the thresholds may need closer checks for true self-triggering.

The main thing to know is that this paper moves experience use in web agents from a passive global prefix to a proactive step-by-step fetch, with the agent's own entropy deciding when to pull in tailored content. That is a clear framing change from the methods it cites. The experiments back it up with absolute gains of 9.3% on the 8B model and 7.5% on the 32B model across four benchmarks, and the note that a 4B experience model can still help larger agents is a useful practical detail. Releasing the code also makes it easier to inspect the implementation.

The soft spot sits in the threshold procedure. If the entropy cutoff or quantile is chosen or validated on the same task distributions used for final numbers, the self-triggered claim weakens and the gains could partly reflect hidden calibration rather than purely intrinsic signals. The abstract does not spell this out, so the full text needs to show the statistics are fixed in advance without benchmark peeking. Minor issues like missing statistical significance details or baseline definitions are common at this stage but should be tightened.

This work is for people actively building or tuning web agents who want a concrete way to make experience injection more dynamic. It stays inside the current agent paradigm and does not claim new theory, but the method is described in enough detail to test and the results are on standard tasks. I would send it to peer review because the core mechanism is replicable and the reported improvements are large enough to matter for follow-up work.

Referee Report

2 major / 1 minor

Summary. The paper proposes ExpSeek, a method for web agents that shifts experience intervention from passive global context to step-level proactive seeking. It estimates step-level entropy thresholds using the agent's intrinsic model signals to decide intervention timing and designs tailored experience content for each step. Experiments on Qwen3-8B and 32B models across four web agent benchmarks report absolute gains of 9.3% and 7.5%, respectively, while also showing that a small 4B experience model can boost larger agents. The code is released publicly.

Significance. If the central performance claims hold after verification that thresholds are computed without benchmark-specific fitting, the work would advance web agent autonomy by demonstrating a viable intrinsic self-triggering mechanism based on entropy. It provides concrete evidence that step-level experience seeking outperforms passive approaches and that small auxiliary models can be effective, with potential implications for scalable agent training. The public code release strengthens reproducibility.

major comments (2)

[§3.2] The threshold estimation procedure is described as using running entropy statistics, but the manuscript does not explicitly state whether the quantile or cutoff value is fixed globally across all tasks or selected via any form of validation or percentile fitting on the four evaluation benchmarks. Because this choice directly determines both timing and content selection, any benchmark-specific calibration would undermine the claim of purely intrinsic, self-triggered intervention.
[§4] Experimental results (abstract and §4): Absolute improvements of 9.3% and 7.5% are reported without accompanying details on baseline definitions, statistical significance tests, exact entropy threshold computation (including any hyperparameters), or data exclusion rules. These omissions leave the central performance claim only partially supported and prevent independent assessment of robustness.

minor comments (1)

The manuscript would benefit from a clear pseudocode or algorithmic box for the full ExpSeek loop, including how entropy is computed at each step and how the experience model is queried.
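
To make the request concrete, here is one possible shape such an algorithm box could take, written as runnable Python. It is a sketch under stated assumptions and not the authors' algorithm: the callables `agent_step`, `run_tool`, and `seek_experience`, the `<guidance>` wrapper, and the rule that an uncertain answer step continues the episode are all hypothetical choices made for illustration.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces, named here only for illustration:
#   agent_step(history)      -> (response, step_entropy, is_final_answer)
#   run_tool(response)       -> observation text
#   seek_experience(history) -> guidance text from the experience model
AgentStep = Callable[[List[str]], Tuple[str, float, bool]]


def run_episode(
    question: str,
    agent_step: AgentStep,
    run_tool: Callable[[str], str],
    seek_experience: Callable[[List[str]], str],
    threshold: float,
    max_steps: int = 10,
) -> List[str]:
    """Run one episode, appending guidance after high-entropy steps."""
    history = [question]
    for _ in range(max_steps):
        response, entropy, is_final = agent_step(history)
        history.append(response)
        triggered = entropy > threshold
        if is_final and not triggered:
            break  # a confident answer ends the episode
        if not is_final:
            history.append(run_tool(response))
        if triggered:
            # Guidance is visible to the agent before its next response.
            history.append("<guidance>" + seek_experience(history) + "</guidance>")
    return history


# Toy stubs so the sketch runs end to end.
def toy_agent(history: List[str]) -> Tuple[str, float, bool]:
    guided = any(h.startswith("<guidance>") for h in history)
    return ("answer" if guided else "search", 0.2 if guided else 1.5, guided)


trace = run_episode(
    "q", toy_agent, lambda r: "obs", lambda h: "recheck the question", threshold=1.0
)
print(trace)
```

A box like this would also force the manuscript to state where the threshold comes from, since `threshold` has to be supplied before the loop starts.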

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to incorporate the requested clarifications and details.

Referee: [§3.2] The threshold estimation procedure is described as using running entropy statistics, but the manuscript does not explicitly state whether the quantile or cutoff value is fixed globally across all tasks or selected via any form of validation or percentile fitting on the four evaluation benchmarks. Because this choice directly determines both timing and content selection, any benchmark-specific calibration would undermine the claim of purely intrinsic, self-triggered intervention.

Authors: We thank the referee for highlighting this point. The threshold in §3.2 is computed solely from running entropy statistics based on the agent's intrinsic model signals during interaction. We maintain a running mean and standard deviation of observed entropy values and set the cutoff as a fixed global multiplier (1.5 standard deviations above the mean) determined from preliminary runs on a small set of non-benchmark web tasks. No quantile, percentile, or any form of validation/fitting was performed on the four evaluation benchmarks. We will revise §3.2 to include the exact formula, the multiplier value, initialization details, and explicit confirmation that no benchmark data influenced the threshold to reinforce the intrinsic nature of the mechanism. revision: yes
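
The running-statistics cutoff described in this simulated response can be sketched as follows. The 1.5 multiplier is the value quoted in the response; everything else (the class name `RunningEntropyThreshold`, Welford's update, the population form of the standard deviation, and the `warmup` count) is an assumption for illustration, since the response does not give the exact formula or initialization.

```python
import math


class RunningEntropyThreshold:
    """Tracks mean and standard deviation of entropies with Welford's method."""

    def __init__(self, multiplier: float = 1.5, warmup: int = 2) -> None:
        self.multiplier = multiplier  # 1.5 is the value quoted in the response
        self.warmup = warmup          # assumed: minimum samples before triggering
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, entropy: float) -> None:
        self.count += 1
        delta = entropy - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (entropy - self.mean)

    @property
    def std(self) -> float:
        # Population standard deviation; the source does not say which form is used.
        return math.sqrt(self._m2 / self.count) if self.count else 0.0

    @property
    def cutoff(self) -> float:
        return self.mean + self.multiplier * self.std

    def should_intervene(self, entropy: float) -> bool:
        """Compare against the cutoff from past steps, then record the new value."""
        triggered = self.count >= self.warmup and entropy > self.cutoff
        self.update(entropy)
        return triggered


tracker = RunningEntropyThreshold()
decisions = [tracker.should_intervene(h) for h in [0.5, 0.7, 0.6, 2.0, 0.6]]
print(decisions)
```

Comparing each value against statistics from earlier steps only, before folding it in, keeps a single spike from raising its own cutoff.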
Referee: [§4] Experimental results (abstract and §4): Absolute improvements of 9.3% and 7.5% are reported without accompanying details on baseline definitions, statistical significance tests, exact entropy threshold computation (including any hyperparameters), or data exclusion rules. These omissions leave the central performance claim only partially supported and prevent independent assessment of robustness.

Authors: We agree that additional experimental details are needed for full transparency and reproducibility. In the revised §4, we will add: precise definitions and adaptations of all baselines; results of statistical significance tests (e.g., paired bootstrap or McNemar's test with p-values); the complete mathematical specification and hyperparameters for entropy threshold computation (including running statistics initialization and the multiplier); and any data exclusion or filtering rules applied. We will also cross-reference the publicly released code for implementation details. These changes will allow independent assessment of the reported gains. revision: yes
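
The two tests named in this response can be sketched with the standard library alone. This is an illustrative sketch on made-up per-task outcomes, not a reproduction of any reported number; the function names and the toy data are assumptions.

```python
import math
import random
from typing import Sequence, Tuple


def mcnemar_exact(baseline: Sequence[bool], method: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value on paired per-task correctness."""
    gained = sum(1 for b, m in zip(baseline, method) if not b and m)
    lost = sum(1 for b, m in zip(baseline, method) if b and not m)
    n = gained + lost  # only discordant pairs carry information
    if n == 0:
        return 1.0
    tail = sum(math.comb(n, k) for k in range(min(gained, lost) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def paired_bootstrap_ci(
    baseline: Sequence[bool],
    method: Sequence[bool],
    resamples: int = 2000,
    seed: int = 0,
) -> Tuple[float, float]:
    """95% percentile interval for the accuracy difference (method - baseline)."""
    rng = random.Random(seed)
    n = len(baseline)
    diffs = []
    for _ in range(resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample tasks, keep pairs intact
        diffs.append(sum(method[i] - baseline[i] for i in idx) / n)
    diffs.sort()
    return diffs[int(0.025 * resamples)], diffs[int(0.975 * resamples) - 1]


# Hypothetical outcomes on 20 tasks: 8 tasks gained, 1 lost, 11 unchanged.
baseline = [True] * 6 + [False] * 8 + [True] * 1 + [False] * 5
method = [True] * 6 + [True] * 8 + [False] * 1 + [False] * 5
p_value = mcnemar_exact(baseline, method)
low, high = paired_bootstrap_ci(baseline, method)
print(p_value, low, high)
```

Both tests operate on the same paired per-task outcomes, so reporting them requires only that the per-task correctness of baseline and method be logged for every benchmark item.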

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

The paper claims that step-level entropy thresholds are computed from the agent model's intrinsic signals via running statistics (Section 3.2) to decide intervention timing, with experience content then tailored accordingly. This construction uses native model outputs rather than fitting any derived quantity back to the reported benchmark gains (9.3% / 7.5%). No equation reduces the threshold selection to a post-hoc fit on the same evaluation distributions, no self-citation supplies a uniqueness theorem that forbids alternatives, and the central result is presented as an empirical outcome on held-out web-agent tasks. The method is therefore self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review performed on abstract only; full methods, equations, and experimental details unavailable. The approach rests on the domain assumption that entropy from model outputs serves as a reliable uncertainty signal for intervention timing. No explicit free parameters or invented entities are named in the abstract.

axioms (1)

Domain assumption: Model output entropy can be used to estimate step-level intervention thresholds using intrinsic signals.
Central to determining when experience seeking should occur.

pith-pipeline@v0.9.0 · 5480 in / 1267 out tokens · 45441 ms · 2026-05-16T14:46:05.413471+00:00 · methodology

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Context Training with Active Information Seeking
cs.CL 2026-05 unverdicted novelty 6.0

Adding active search tools to LLM context optimization works only when combined with a multi-candidate search-based training procedure that prunes contexts, delivering gains across low-resource translation, health, an...

GAM: Hierarchical Graph-based Agentic Memory for LLM Agents
cs.AI 2026-04 unverdicted novelty 6.0

GAM decouples event-level memory encoding from topic-level consolidation in LLM agents using hierarchical graphs to reduce interference and improve long-term coherence and retrieval.

Context Training with Active Information Seeking
cs.CL 2026-05 unverdicted novelty 5.0

Active information seeking via search tools, when combined with multi-candidate context pruning during training, produces consistent gains on translation, health, and reasoning tasks over naive tool addition or no-too...


[1] [1]

Agentic Reinforced Policy Optimization

Self-guided function calling in large language models via stepwise experience recall. InFindings of the Association for Computational Linguistics: EMNLP 2025, pages 10842–10854, Suzhou, China. Association for Computational Linguistics. Guanting Dong, Hangyu Mao, Kai Ma, Licheng Bao, Yifei Chen, Zhongyuan Wang, Zhongxia Chen, Ji- azhen Du, Huiyang Wang, Fu...

[2] [2]

Memory in the Age of AI Agents

Memory in the age of AI agents. Preprint, arXiv:2512.13564. Minsoo Kim, Victor Bursztyn, Eunyee Koh, Shunan Guo, and Seung-won Hwang. 2024. RaDA: Retrieval-augmented web agent planning with LLMs. In Findings of the Association for Computational Linguistics: ACL 2024, pages 13511–13525, Bangkok, Thailand. Association for Computational Linguistics. Shas...

[3] [3]

SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models

SealQA: Raising the bar for reasoning in search-augmented language models. Preprint, arXiv:2506.01062. Changle Qu, Sunhao Dai, Xiaochi Wei, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Jun Xu, and Ji-rong Wen. 2025. Tool learning with large language models: a survey. Frontiers of Computer Science, 19(8). Harsh Raj, Vipul Gupta, Domenic Rosati, and Subhabrata...

[4] [4]

AGENT KB: A hierarchical memory framework for cross-domain agentic problem solving

AGENT KB: A hierarchical memory framework for cross-domain agentic problem solving. In ICML 2025 Workshop on Collaborative and Federated Agentic Workflows. Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, and 1 others

[5] [5]

Kimi K2: Open Agentic Intelligence

Kimi K2: Open agentic intelligence. arXiv preprint arXiv:2507.20534. Huaijie Wang, Shibo Hao, Hanze Dong, Shenao Zhang, Yilin Bao, Ziran Yang, and Yi Wu. 2025a. Offline reinforcement learning for LLM multi-step reasoning. In Findings of the Association for Computational Linguistics: ACL 2025, pages 8881–8893, Vienna, Austria. Association for Computational L...

[6] [8]

The key information comes from the Reddit post and the The Numbers website. According to the Reddit post, "The Super Mario Bros. Movie" has grossed $1.36B, and "Jurassic World Dominion" has grossed $1.001B. From the The Numbers website, there is a list of the top Universal movies by worldwide box office, and the most recent ones include "The Super Mario B...

[7] [9]

[8 Movies That Broke $1 Billion at the Box Office Since Covid] Avatar: The Way of Water’s box office comes in at over $2.3 billion

[8] [10]

[6 Movies That Have Made $1 Billion At The Box Office Since Covid] 6 Jurassic World: Dominion (2022) · 5 The Super Mario Bros. Movie (2023) · 4 Spider-Man: No Way Home (2021) · 3 Avatar: The Way of Water (2022) · 2 ... ...... Agent Action: <think> From the search results, I can see that "The Super Mario Bros. Movie" (2023) and "Jurassic World: Dominion" (2...

[9] [11]

[Universal - The First Half of the 2020s In Review : r/boxoffice] Universal - The First Half of the 2020s In Review · The Super Mario Bros. Movie - $1.36B · Jurassic World Dominion - $1.001B · Oppenheimer - $975.6M

[10] [12]

[6 Movies That Have Made $1 Billion At The Box Office Since Covid] 6 Jurassic World: Dominion (2022) · 5 The Super Mario Bros. Movie (2023) · 4 Spider-Man: No Way Home (2021) · 3 Avatar: The Way of Water (2022) · 2 ... ...... Agent Action: <think> Based on the search results, I need to determine which Universal Studios films have grossed over $1 billion w...

[11] [13]

The key information comes from the Reddit post and the The Numbers website. According to the Reddit post, "The Super Mario Bros. Movie" has grossed $1.36B, and "Jurassic World Dominion" has grossed $1.001B. From the The Numbers website, there is a list of the top Universal movies by worldwide box office, and the most recent ones include "The Super Mario B...

[12] [14]

[8 Movies That Broke $1 Billion at the Box Office Since Covid] Avatar: The Way of Water’s box office comes in at over $2.3 billion

[13] [15]

[6 Movies That Have Made $1 Billion At The Box Office Since Covid] 6 Jurassic World: Dominion (2022) · 5 The Super Mario Bros. Movie (2023) · 4 Spider-Man: No Way Home (2021) · 3 Avatar: The Way of Water (2022) · 2 ... ...... <guidance> Review the core of the question: you need to count films released by Universal Studios with worldwide box office over $1...

[14] [16]

search: Input a keyword you think needs to be searched, returns multiple website links and corresponding summaries

[15] [17]

visit: Input a website link and the problem you hope to solve by visiting that site, returns the problem’s answer and a snippet of the original content from the site corresponding to the answer. **Detailed tool call format** {search_format} {visit_format} **Model workflow description**

[16] [18]

Model input consists of the historical loop of interactions, including the model’s generated thoughts and tool calls, and the user’s returned tool results. The history may have multiple rounds, all designed to solve a given problem

[17] [19]

The model must output two parts: thinking process, and tool call or answer, each enclosed by specific angle bracket position markers: - If thinking, follow the format: <thought> here is the thinking process </thought> - If calling a tool, follow the format: <tool_call> tool call here </tool_call> - If determining the final answer, follow the format: <answ...

[18] [20]

The user will respond with tool call results or occasionally provide guidance. If there is guidance, you should carefully consider whether the user’s ideas are reasonable and try to follow them: - If responding to a tool call, you will see the format: <tool_response> here is the tool’s return value </tool_response> - If the user provides careful guidance,...

[19] [21]

Every problem must have an answer; during multi-round resolution, do not forget your past planning and process results, and do not forget the details embedded in the problem

[20] [22]

The search tool’s return value is only website links and snippet summaries; they are hardly reliable references and can only serve as search direction. To obtain accurate information, you must call the visit tool to visit a site

[21] [23]

If the interaction includes a resolution plan, follow the plan. Do not blindly ignore key constraints in the plan to avoid potential cascading errors

[22] [24]

In tool_call generation, the format must follow the above definitions and be valid JSON. An incorrect format will cause tool calls to fail. *8*. **IMPORTANT**: If the user provides guidance after the answer, prioritize regenerating <tool_call></tool_call> to continue searching for missing clues, or provide only when you are absolutely certain of the answe...

[23] [25]

**You must generate the position markers** in accordance with the requirements stated above (<thought></thought>; <tool_call></tool_call> or <answer></answer>). 2. In particular, do not forget to generate the closing tags: </thought>, </tool_call>, OR </answer>
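The format rules in the ReAct prompt (one thinking block, then either a tool call or an answer, all with closing tags, and a tool call body that is valid JSON) can be checked mechanically. The following is a minimal sketch of such a check, not the authors' code. The tag names come from the prompt text; the function name `check_agent_output` and the error messages are hypothetical.

```python
import json
import re

# Tag names follow the prompt text; everything else here is illustrative.
_BLOCK = re.compile(r"<(thought|tool_call|answer)>(.*?)</\1>", re.DOTALL)
_ANY_TAG = re.compile(r"</?(?:thought|tool_call|answer)>")


def check_agent_output(text: str) -> list[str]:
    """Return a list of format problems; an empty list means the output passes."""
    problems = []
    blocks = _BLOCK.findall(text)  # list of (tag name, body) pairs

    # Every marker must belong to a well-formed pair. This catches both
    # missing closing tags and extra position markers.
    if len(_ANY_TAG.findall(text)) != 2 * len(blocks):
        problems.append("unpaired or extra position marker")

    if sum(1 for name, _ in blocks if name == "thought") != 1:
        problems.append("expected exactly one <thought> block")

    actions = [(name, body) for name, body in blocks if name != "thought"]
    if len(actions) != 1:
        problems.append("expected exactly one <tool_call> or <answer> block")

    for name, body in actions:
        if name == "tool_call":
            try:
                json.loads(body)
            except json.JSONDecodeError:
                problems.append("tool_call body is not valid JSON")
    return problems
```

The check only verifies that the tool call parses as JSON. It does not validate keys, because the concrete `{search_format}` and `{visit_format}` templates are not shown in this listing.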

[24] [26]

You must not generate extra angle bracket position markers.

Table 11: ReAct system prompt.

Prompt for generating experience triplets. # Questions for Students to Solve {question} # Standard Answer for the Question {answer} # This is a complete trajectory that ultimately got the correct answer as your reference: ˋˋˋ{true_traj}ˋˋˋ # This is a complete tr...

[25] [27]

Define a STEP as R_i+O_i, but the last STEP only has R_N
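The STEP definition can be made concrete with a small helper. This is a sketch under the assumption that a trajectory is stored as a flat, alternating list `[R_1, O_1, ..., R_N]`; the `Step` class and `split_into_steps` function are hypothetical names, not part of the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    index: int                  # 1-based STEP number
    response: str               # R_i: the agent's thought plus tool call or answer
    observation: Optional[str]  # O_i: the tool result; None for the last STEP


def split_into_steps(turns: list[str]) -> list[Step]:
    """Group an alternating [R_1, O_1, ..., R_N] list into STEPs."""
    if len(turns) % 2 == 0:
        raise ValueError("expected an odd number of turns ending with R_N")
    steps = []
    for i in range(0, len(turns), 2):
        obs = turns[i + 1] if i + 1 < len(turns) else None
        steps.append(Step(index=i // 2 + 1, response=turns[i], observation=obs))
    return steps
```

For a five-turn trajectory this yields three STEPs, and only the last one has no observation.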

[26] [28]

Each R is a student’s response, attempting to call tools to further solve the problem, but the second trajectory with wrong answer always has some issues

[27] [29]

Your core task is to answer this question for each STEP:ˋˋˋIn order to avoid the final error, if guidance is provided after this STEP ends, what should be done to make the agent perform better?ˋˋˋ

[28] [30]

Of course, a complete guidance is a triplet <student’s current state, reason why this STEP leads to the final error, what to say before the next STEP to improve the current state> ˋˋˋExplanation of the triplet: - Student’s current state: A relatively general description, introducing what the student saw and what they did. The description does not involve ...

[29] [31]

The guidance opinion in the triplet generated for STEP_i will be concatenated after O_i, which means the student can see it before generating R_i+1
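One way to read this rule is as a simple string concatenation on the observation of the matching STEP. The sketch below assumes observations are kept in a list ordered by STEP. The opening `<guidance>` tag appears in the case-study excerpt of this listing; the closing tag, the newline separator, and the function name `attach_guidance` are assumptions made for illustration.

```python
def attach_guidance(observations: list[str], guidance: dict[int, str]) -> list[str]:
    """Append the guidance for STEP_i (1-based key) to the end of O_i.

    The student then sees the guidance before generating the next response.
    """
    out = []
    for i, obs in enumerate(observations, start=1):
        if i in guidance:
            # Wrapper tag and separator are illustrative assumptions.
            obs = f"{obs}\n<guidance>{guidance[i]}</guidance>"
        out.append(obs)
    return out
```

STEPs without an entry in `guidance` pass through unchanged, which matches the prompt's statement that not every STEP needs guidance.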

[30] [32]

Not every STEP necessarily needs guidance, you can skip after analysis, but since the trajectory is wrong, **there must be at least one STEP that has issues and can be summarized into a triplet**
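The two constraints stated in the prompt (one analysis per STEP, and at least one triplet for a failed trajectory) are easy to enforce when parsing model output. The following is a hedged sketch: the `Triplet` fields mirror the three parts of the triplet described in the prompt, while the class, the function `check_triplets`, and the use of `None` for a skipped STEP are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Triplet:
    state: str     # what the student saw and did
    reason: str    # why this STEP leads to the final error
    guidance: str  # what to say before the next STEP


def check_triplets(per_step: list[Optional[Triplet]], step_num: int) -> None:
    """Raise ValueError if the annotation of a failed trajectory is incomplete."""
    if len(per_step) != step_num:
        raise ValueError(f"expected {step_num} STEPs, got {len(per_step)}")
    # A wrong trajectory must contain at least one problematic STEP.
    if all(t is None for t in per_step):
        raise ValueError("a failed trajectory needs at least one triplet")
```

A caller could retry generation when the check fails instead of storing an incomplete annotation.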

[31] [33]

Finally, briefly summarize what three good pieces of advice could be given before working on this problem

[32] [34]

**!!Must Note!!** The total number of rounds you analyze in the trajectory is **{step_num}**, you must generate the corresponding number of STEPs before you can continue to generate TOTAL! # Output Format (strictly follow the markdown format I give you) ˋˋˋ # STEP 1: ## Analysis - Write analysis content here ## Triplet (If there is no error, directly writ...

[33] [35]

Reuse: Do not change any current labels, and select an existing label for the new behavior (recognizing the existing classification)

[34] [36]

Create: Do not change existing labels, create a new label for the new behavior (existing classification is incomplete)

[35] [37]

Modify: Modify certain current labels, and assign that label to the new behavior (existing classification is inaccurate)

# Detailed Requirements

[36] [38]

Each label must be concise and clear, but needs to have certain semantic information that allows people to understand the characteristics of the current behavior + mistake without explanation. It should be at least a dozen or dozens of words (e.g., in the pattern of xxx: xxx xxx xx)

[37] [39]

There cannot be too many labels; each label should have distinguishability in scenario content

[38] [40]

One label can correspond to multiple behaviors, so you must ensure their textual content is consistent

[39] [41]

Use the given id as the unique identifier for behaviors. When outputting, you need to output the ids and labels of all existing behaviors and new behaviors
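The three labeling operations (Reuse, Create, Modify) can be pictured as updates to a mapping from behavior id to label. The sketch below is one possible reading, not the authors' implementation. The function `apply_label_op`, its parameters, and the lowercase operation names are hypothetical. For Modify it is assumed that every behavior carrying the old label receives the rewritten text, so that behaviors sharing a label keep identical wording.

```python
from typing import Optional


def apply_label_op(
    labels: dict[str, str],
    op: str,
    new_id: str,
    new_label: str,
    old_label: Optional[str] = None,
) -> dict[str, str]:
    """Return a new id -> label mapping that also covers the new behavior."""
    updated = dict(labels)
    existing = set(labels.values())
    if op == "reuse":
        if new_label not in existing:
            raise ValueError("reuse requires an existing label")
    elif op == "create":
        if new_label in existing:
            raise ValueError("create requires a label that does not exist yet")
    elif op == "modify":
        if old_label not in existing:
            raise ValueError("modify requires an existing label to rewrite")
        # Rewrite every behavior that carried the old label so the text stays consistent.
        for behavior_id, label in labels.items():
            if label == old_label:
                updated[behavior_id] = new_label
    else:
        raise ValueError(f"unknown operation: {op}")
    updated[new_id] = new_label
    return updated
```

The returned mapping contains all existing behaviors plus the new one, which corresponds to the requirement to output the ids and labels of every behavior.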

[40] [42]

Try to keep the number of different labels balanced. # List of Behaviors Already Given Labels {exp_list} # List of New Behaviors {new_exp_list} # Output Format: ˋˋˋ {output_format} ˋˋˋ

Table 13: Prompt for iteratively generating topics.

Prompt for experience model topic selection stage. # Overall Instructions You are a teacher who is very good at guidi...

[41] [43]

You need to combine the student’s current state and select **3** topics from the several potential error topics I give you. After you carefully state the reasons for selection, just output the idx of the selected topics

[42] [44]

The student may not have actually made a mistake, but your subsequent guidance can prevent problems before they occur. # Output Format Strictly follow the markdown format below for output ˋˋˋ # Analysis of the Current Step Write your analysis here # Selected Topic idx (separated by spaces) idx1 idx2 idx3 ˋˋˋ Output: Table 14: Prompt for experience model t...
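The topic-selection output format (an analysis section followed by a line of space-separated indices) lends itself to a small parser. This is a minimal sketch, assuming the model reproduces the `# Selected Topic idx` heading from the prompt. The function name `parse_selected_topics` and the `expected` parameter are hypothetical; the default of 3 follows the prompt's instruction to select 3 topics.

```python
import re


def parse_selected_topics(output: str, expected: int = 3) -> list[int]:
    """Extract the topic indices listed under the 'Selected Topic idx' heading."""
    # Skip the rest of the heading line, then capture the next non-empty line.
    match = re.search(r"#\s*Selected Topic idx[^\n]*\n\s*([^\n]+)", output)
    if match is None:
        raise ValueError("missing 'Selected Topic idx' section")
    indices = [int(tok) for tok in match.group(1).split()]
    if len(indices) != expected:
        raise ValueError(f"expected {expected} indices, got {len(indices)}")
    return indices
```

A non-numeric token on the index line raises `ValueError` from `int`, so malformed output fails loudly rather than being silently truncated.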

[43] [48]

The guidance you provide will be given to the student together with the tool call results after this step ends. - If the student generates a tool call in this step, the guidance will be given to the student together with the tool return value - If the student generates an answer in this step, the guidance will be given directly to the student, and the stu...

[44] [51]

Guidance should be clear and easy to understand. If necessary, you can encourage students to continue calling tools or switch tools

[45] [52]

The guidance you provide should try to imitate the previous guidance patterns, don’t improvise freely. # Output Format Strictly follow the markdown format below for output ˋˋˋ # Analysis combining student’s current behavior and previous experience to provide appropriate guidance for the present moment Write your detailed analysis here # Guidance Content W...

[46] [53]

The question must have an answer. If the student thinks there is insufficient evidence, it must be because they haven’t found the evidence. After careful analysis, provide your guidance for the student’s current step, with the goal of helping the student actually answer the question correctly

[47] [54]

Since this step may not necessarily be wrong, please carefully choose your wording to prevent your guidance from introducing bias

[48] [55]

Your analysis must include a brief review of the **problem** that the student needs to solve, emphasizing the content of the problem to the student to prevent answering off-topic

[49] [56]

The guidance you provide will be given to the student together with the tool call results after this step ends - If the student generates a tool call in this step, the guidance will be given to the student together with the tool return value - If the student generates an answer in this step, the guidance will be given directly to the student, and the stud...

[50] [57]

It is forbidden to find answers on behalf of the student, and it is forbidden to hint at what the answer is under any circumstances. **You are a teacher, not someone helping students cheat**

[51] [58]

It is forbidden to provide **direct clues** to students. Your purpose is only to **guide**

[52] [59]

Guidance should be clear and easy to understand. If necessary, you can encourage students to continue calling tools or switch tools. # Output Format Strictly follow the markdown format below for output ˋˋˋ # Analysis combining student’s current behavior to provide appropriate guidance for the present moment Write your detailed analysis here # Guidance Con...