pith. machine review for the scientific record.

arxiv: 2604.27221 · v1 · submitted 2026-04-29 · 💻 cs.AI

Recognition: unknown

Web2BigTable: A Bi-Level Multi-Agent LLM System for Internet-Scale Information Search and Extraction

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 09:30 UTC · model grok-4.3

classification 💻 cs.AI
keywords: multi-agent LLM · web information extraction · structured data aggregation · agent orchestration · closed-loop reasoning · shared workspace

The pith

A bi-level multi-agent system coordinates parallel workers through a shared workspace to extract consistent tables from the web for both broad and deep queries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Web2BigTable as a way to handle two kinds of web information tasks that current systems find difficult. An upper-level orchestrator splits the job into parts for many worker agents to tackle at once. Workers use a shared workspace to share what they find, cut down on repeated work, and resolve disagreements, while a run-verify-reflect loop improves the whole process over time using stored memory. This matters because it could let AI reliably turn messy online information into clean, usable tables without missing pieces or contradicting itself across sources.

Core claim

Web2BigTable claims that its bi-level architecture enables effective handling of both breadth-oriented structured aggregation and depth-oriented coherent reasoning in web search: an orchestrator decomposes each task for parallel worker agents, which coordinate via a shared workspace and improve through a closed-loop run-verify-reflect process backed by persistent external memory.

What carries the argument

The bi-level architecture: an upper-level orchestrator handles task decomposition while lower-level workers execute in parallel, supported by shared-workspace coordination and a closed-loop run-verify-reflect process with self-evolving single-agent updates.
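The mechanism above can be sketched schematically. This is a minimal illustration, not the paper's implementation: `decompose` and `solve` are hypothetical stand-ins for the orchestrator and worker LLM calls, and all names are invented for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor

def decompose(query: str, n_workers: int) -> list[str]:
    # Stand-in for the orchestrator LLM: split the query into shards
    # (by entity, category, or source in the paper's case studies).
    return [f"{query} [shard {i}]" for i in range(n_workers)]

def solve(subtask: str) -> list[dict]:
    # Stand-in for a worker agent: returns the table rows it found
    # for its shard of the query.
    return [{"subtask": subtask, "row": 0}]

def run(query: str, n_workers: int = 4) -> list[dict]:
    # Bi-level loop: orchestrator decomposes, workers run in parallel,
    # partial outputs are aggregated into one table.
    subtasks = decompose(query, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(solve, subtasks))
    return [row for part in partials for row in part]
```

The real system adds skill banks, verification, and reflection on top of this skeleton; the sketch only shows the decompose-then-parallel-execute shape.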

If this is right

  • It produces schema-aligned outputs with wide coverage and cross-entity consistency on breadth tasks.
  • Workers can follow long, branching search paths coherently on depth tasks.
  • The system reduces redundant exploration and reconciles conflicting evidence through visible partial findings.
  • Performance improves over time via updates to the external memory and agent behaviors.
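The third bullet's mechanism, visible partial findings that suppress redundant exploration and surface conflicts, can be illustrated with a small shared-workspace sketch. This is an assumption-laden toy, not the paper's Workboard API; the class and method names are invented here.

```python
import threading

class Workboard:
    """Toy shared workspace: workers post partial rows keyed by entity.
    Duplicates are dropped and conflicting values are logged for later
    reconciliation. Illustrative only."""

    def __init__(self) -> None:
        self._lock = threading.Lock()  # analogous to the paper's file locks
        self.rows: dict[str, dict] = {}
        self.conflicts: list[tuple[str, dict, dict]] = []

    def post(self, key: str, row: dict) -> bool:
        with self._lock:
            existing = self.rows.get(key)
            if existing is None:
                self.rows[key] = row
                return True           # new finding: worth keeping
            if existing != row:
                # Conflicting evidence becomes visible instead of silently lost.
                self.conflicts.append((key, existing, row))
            return False              # duplicate: skip redundant exploration

    def claimed(self, key: str) -> bool:
        with self._lock:
            return key in self.rows
```

A worker that checks `claimed(key)` before searching an entity avoids re-crawling what a peer has already covered, which is the redundancy-reduction claim in miniature.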

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The shared workspace mechanism could be adapted to other collaborative agent scenarios like distributed problem solving in science.
  • Human-readable memory might enable easy auditing or intervention in high-stakes information extraction.
  • If the coordination scales, it could handle searches across thousands of entities without proportional increases in errors.

Load-bearing premise

That the gains in table extraction accuracy arise chiefly from the bi-level setup, shared workspace coordination, and closed-loop reflection rather than from other unmentioned aspects of the implementation.

What would settle it

Running an ablation study that disables the shared workspace or the reflect step and measures the drop in success rate and F1 scores on the WideSearch benchmark.
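Such an ablation needs a fixed metric to compare against. The paper does not define Row F1 in the text available here; a minimal sketch under one plausible reading (exact-match rows with set semantics) would be:

```python
def row_f1(predicted: list[tuple], gold: list[tuple]) -> float:
    # One plausible reading of Row F1: a predicted row counts as correct
    # only if it exactly matches a gold row; F1 over the two row sets.
    pred, ref = set(predicted), set(gold)
    tp = len(pred & ref)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Running each ablated variant through the same scorer makes the per-mechanism drop directly comparable.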

Figures

Figures reproduced from arXiv: 2604.27221 by Huichi Zhou, Jun Wang, Ka Yiu Lee, Meng Fang, Weilin Luo, Yihang Chen, Yuxiang Chen, Yuxuan Huang, Zhiyuan He.

Figure 1: Web2BigTable interface during query execution. The left panel displays the user query…
Figure 2: Architecture of Web2BigTable, a unified training and inference framework. The upper-level orchestrator decomposes each query into subtasks using the Orchestrator Skills S_o, and lower-level workers execute them in parallel, drawing execution skills from the shared Worker Skills S_w and coordinating asynchronously through the Workboard m_e, protected by file locks and tag partitioning. Red arrows denote the ad…
Figure 3: Training (self-evolving) flow of Web2BigTable over one episode k. For each training query q_k, Stage 1 reads the long-term orchestrator skills S_o and decomposes q_k into subtasks τ. Stage 2 dispatches these subtasks to N parallel workers, which read execution skills from S_w and read/write the short-term workboard m_e until convergence. Stage 3 verifies the aggregated output X_k against the gold reference, pro…
Figure 4: Inference flow of Web2BigTable on an unseen user query q. Using the trained skill banks S*_o and S*_w as frozen read-only inputs, Stage 1 decomposes q into subtasks τ. Stage 2 runs N parallel workers that resolve execution skills from S*_w and coordinate through the shared workboard m_e (per-query, short-term); their partial outputs {x_i} are aggregated into the structured big table X. No verification, r…
Figure 5: Performance landscape on WideSearch (Avg@4). Scatter points show multi-agent…
Figure 6: Accuracy on XBench-DeepSearch.
Figure 7: Case study on task ws_en_006 (Taylor Swift concerts, 534 ground-truth rows, 6 tours). Workers collectively retrieve 653 raw rows; after deduplication and aggregation, 556 unique rows are submitted for evaluation. The auto-generated orchestrator skill selects entity-based decomposition with adaptive region splitting, achieving 93.8% Row F1 vs. 12.8%–26.8% for single-agent and skill-less baselines.
Figure 8: Case B: task ws_en_091 (AMD Zen processors, 331 ground-truth rows, 12 columns). Workers collectively retrieve ∼350 raw rows; after deduplication, ∼334 unique rows are submitted. Single agents retrieve fewer than 50 rows with Item F1 below 26%. Web2BigTable applies a learned product-line decomposition with dedicated spec-verification workers, achieving 89% Row F1 and 96% Item F1.
Figure 9: Case C: task ws_zh_069 (LLM papers from ByteDance Seed and DeepSeek, ∼130 ground-truth rows across two asymmetric sources). Single agents retrieve fewer than 30 rows with Item F1 below 40%. Web2BigTable applies a learned source-based decomposition with a dedicated arXiv verification worker, achieving 91% Row F1 and 94% Item F1.
Original abstract

Agentic web search increasingly faces two distinct demands: deep reasoning over a single target, and structured aggregation across many entities and heterogeneous sources. Current systems struggle on both fronts. Breadth-oriented tasks demand schema-aligned outputs with wide coverage and cross-entity consistency, while depth-oriented tasks require coherent reasoning over long, branching search trajectories. We introduce Web2BigTable, a multi-agent framework for web-to-table search that supports both regimes. Web2BigTable adopts a bi-level architecture in which an upper-level orchestrator decomposes the task into sub-problems and lower-level worker agents solve them in parallel. Through a closed-loop run-verify-reflect process, the framework jointly improves decomposition and execution over time via persistent, human-readable external memory, with self-evolving updates to each single agent. During execution, workers coordinate through a shared workspace that makes partial findings visible, allowing them to reduce redundant exploration, reconcile conflicting evidence, and adapt to emerging coverage gaps. Web2BigTable sets a new state of the art on WideSearch, reaching an Avg@4 Success Rate of 38.50 (7.5× the second best at 5.10), Row F1 of 63.53 (+25.03 over the second best), and Item F1 of 80.12 (+14.42 over the second best). It also generalises to depth-oriented search on XBench-DeepSearch, achieving 73.0 accuracy. Code is available at https://github.com/web2bigtable/web2bigtable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Web2BigTable, a bi-level multi-agent LLM framework for web-to-table search and extraction. An upper-level orchestrator decomposes tasks into sub-problems solved in parallel by lower-level worker agents; coordination occurs via a shared workspace, and a closed-loop run-verify-reflect process with persistent human-readable memory enables iterative improvement. The system is claimed to achieve new state-of-the-art results on the breadth-oriented WideSearch benchmark (Avg@4 Success Rate 38.50, 7.5× the second-best system; Row F1 63.53; Item F1 80.12) and to generalize to depth-oriented search on XBench-DeepSearch (73.0 accuracy). Code is released.

Significance. If the large reported gains prove robust and attributable to the bi-level orchestration, shared workspace, and closed-loop reflection rather than to unablated factors, the work would represent a meaningful advance in scalable agentic systems for structured web information extraction. The public code release is a clear strength that aids reproducibility and follow-up research.

major comments (2)
  1. [§4, Experimental Evaluation] The manuscript reports large benchmark gains on WideSearch (e.g., an Avg@4 Success Rate of 38.50 vs. 5.10) but supplies no descriptions of the baselines, number of runs, error bars, or statistical significance tests. Without these, the reliability and magnitude of the claimed improvements cannot be assessed.
  2. [§4, Experimental Evaluation] No ablation studies are presented that disable the upper-level orchestrator, the shared workspace visibility, or the reflect step while holding the underlying LLM, base prompts, and retrieval tools fixed. This absence prevents attribution of the 7.5× and F1 gains to the proposed bi-level architecture and coordination mechanisms rather than to other factors such as prompt tuning or benchmark characteristics.
minor comments (1)
  1. [Abstract and §3] The abstract and §3 would benefit from an explicit definition or formula for 'Avg@4 Success Rate' and 'Row F1' to ensure readers can interpret the metrics without ambiguity.
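The undefined metric admits a natural reading. A minimal sketch, assuming (this is an interpretation, not the paper's stated formula) that Avg@4 averages binary task success over four independent runs and then over tasks:

```python
def avg_at_k(run_successes: list[list[bool]], k: int = 4) -> float:
    # One plausible reading of Avg@k Success Rate: for each task, average
    # success over the first k independent runs, then average over tasks.
    # Reported as a percentage, matching the 38.50-style figures.
    per_task = [sum(runs[:k]) / k for runs in run_successes]
    return 100 * sum(per_task) / len(per_task)
```

If the paper instead counts a task as successful only when all k runs succeed (or any one does), the numbers change substantially, which is exactly why the referee asks for an explicit definition.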

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential significance of the bi-level multi-agent approach. We agree that the experimental evaluation requires additional rigor to support the reported gains and will revise §4 accordingly.

Point-by-point responses
  1. Referee: [§4, Experimental Evaluation] The manuscript reports large benchmark gains on WideSearch (e.g., an Avg@4 Success Rate of 38.50 vs. 5.10) but supplies no descriptions of the baselines, number of runs, error bars, or statistical significance tests. Without these, the reliability and magnitude of the claimed improvements cannot be assessed.

    Authors: We agree that these details are necessary for proper assessment. In the revised manuscript we will expand §4 with complete descriptions of every baseline, including their architectures, prompting, and tool configurations as implemented in the released code. We will also report the number of independent runs performed for each system, include error bars (standard deviation across runs), and add statistical significance tests (e.g., paired t-tests) comparing Web2BigTable against the strongest baselines. These additions will be directly supported by the public repository. revision: yes

  2. Referee: [§4, Experimental Evaluation] No ablation studies are presented that disable the upper-level orchestrator, the shared workspace visibility, or the reflect step while holding the underlying LLM, base prompts, and retrieval tools fixed. This absence prevents attribution of the 7.5× and F1 gains to the proposed bi-level architecture and coordination mechanisms rather than to other factors such as prompt tuning or benchmark characteristics.

    Authors: We concur that ablations are required to isolate the contribution of each proposed component. In the revised version we will add a new subsection presenting three controlled ablations performed with identical LLM, base prompts, and retrieval tools: (1) removal of the upper-level orchestrator (single-agent baseline), (2) disabling shared workspace visibility (independent workers), and (3) removal of the reflect step (single-pass run-verify only). Results will quantify the performance drop attributable to each mechanism. revision: yes
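The three promised variants form a simple one-factor-at-a-time grid. A hypothetical sketch (the flag names are invented here, not taken from the released code) of how such an ablation grid might be declared so that each variant differs from the full system in exactly one mechanism:

```python
# Each variant toggles one mechanism; LLM, prompts, and tools stay fixed.
ABLATIONS: dict[str, dict[str, bool]] = {
    "full":            dict(orchestrator=True,  shared_workspace=True,  reflect=True),
    "no_orchestrator": dict(orchestrator=False, shared_workspace=True,  reflect=True),
    "no_workspace":    dict(orchestrator=True,  shared_workspace=False, reflect=True),
    "no_reflect":      dict(orchestrator=True,  shared_workspace=True,  reflect=False),
}

def changed_mechanisms(name: str) -> list[str]:
    # Sanity check: which flags differ from the full configuration?
    full, variant = ABLATIONS["full"], ABLATIONS[name]
    return [flag for flag in full if full[flag] != variant[flag]]
```

Verifying that each ablation changes exactly one flag is what licenses attributing its score drop to that mechanism.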

Circularity Check

0 steps flagged

No circularity: empirical system description with benchmark results, no derivation chain or self-referential reductions

Full rationale

The paper introduces Web2BigTable as a bi-level multi-agent framework and reports direct empirical results on WideSearch and XBench-DeepSearch benchmarks. No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claims consist of measured success rates, F1 scores, and accuracy figures obtained from running the system; these are not constructed by re-expressing the architecture definition or prior self-citations as outputs. Absence of ablations is a separate evidence-strength issue, not a circularity reduction. The derivation chain is therefore self-contained as an engineering proposal plus experimental evaluation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Only the abstract is available, so the ledger reflects assumptions implied by the high-level description rather than explicit statements in the full text.

free parameters (1)
  • LLM prompt templates and hyperparameters
    Agent systems of this type typically require tuned prompts and sampling parameters whose values are not reported in the abstract.
axioms (1)
  • Domain assumption: lower-level worker agents can reliably solve decomposed sub-tasks when given access to web search and a shared workspace.
    The bi-level design rests on this capability of current LLMs.

pith-pipeline@v0.9.0 · 5622 in / 1310 out tokens · 78287 ms · 2026-05-07T09:30:57.905092+00:00 · methodology

discussion (0)



    Performance Comparison Row F1 Item F1 Rows GPT-5 mini (single agent) 22.4% 39.6% 29 Gemini 3 Flash (single agent) 18.7% 34.2% 23 Web2BigTable (w/o skills) 41% 58% 67 Web2BigTable (full) 91% 94% 134 ∆(full vs best single agent) +68.6 +54.4 +105 ∆(full vs w/o skills) +50.0 +36.0 +67 Key insight:The two sources have asymmetric catalogue sizes (∼120 vs∼10). S...