pith. machine review for the scientific record.

arxiv: 2604.27221 · v1 · submitted 2026-04-29 · 💻 cs.AI

Recognition: unknown

Web2BigTable: A Bi-Level Multi-Agent LLM System for Internet-Scale Information Search and Extraction

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 09:30 UTC · model grok-4.3

classification 💻 cs.AI
keywords: multi-agent LLM · web information extraction · structured data aggregation · agent orchestration · closed-loop reasoning · shared workspace

The pith

A bi-level multi-agent system coordinates parallel workers through a shared workspace to extract consistent tables from the web for both broad and deep queries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Web2BigTable as a way to handle two kinds of web information tasks that current systems find difficult. An upper-level orchestrator splits the job into parts for many worker agents to tackle at once. Workers use a shared workspace to share what they find, cut down on repeated work, and resolve disagreements, while a run-verify-reflect loop improves the whole process over time using stored memory. This matters because it could let AI reliably turn messy online information into clean, usable tables without missing pieces or contradicting itself across sources.

Core claim

Web2BigTable claims that its bi-level architecture enables effective handling of both breadth-oriented structured aggregation and depth-oriented coherent reasoning in web search: an orchestrator decomposes each task for parallel worker agents, which coordinate via a shared workspace and improve through a closed-loop run-verify-reflect process backed by persistent external memory.

What carries the argument

The bi-level architecture: an upper-level orchestrator handles task decomposition while lower-level workers execute in parallel, supported by shared-workspace coordination and a closed-loop run-verify-reflect process with self-evolving single-agent updates.
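The mechanism above can be sketched schematically. This is a minimal illustration, not the paper's implementation: `decompose` and `solve` are hypothetical stand-ins for the orchestrator and worker LLM calls, and all names are invented for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor

def decompose(query: str, n_workers: int) -> list[str]:
    # Stand-in for the orchestrator LLM: split the query into shards
    # (by entity, category, or source in the paper's case studies).
    return [f"{query} [shard {i}]" for i in range(n_workers)]

def solve(subtask: str) -> list[dict]:
    # Stand-in for a worker agent: returns the table rows it found
    # for its shard of the query.
    return [{"subtask": subtask, "row": 0}]

def run(query: str, n_workers: int = 4) -> list[dict]:
    # Bi-level loop: orchestrator decomposes, workers run in parallel,
    # partial outputs are aggregated into one table.
    subtasks = decompose(query, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(solve, subtasks))
    return [row for part in partials for row in part]
```

The real system adds skill banks, verification, and reflection on top of this skeleton; the sketch only shows the decompose-then-parallel-execute shape.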

If this is right

  • It produces schema-aligned outputs with wide coverage and cross-entity consistency on breadth tasks.
  • Workers can follow long, branching search paths coherently on depth tasks.
  • The system reduces redundant exploration and reconciles conflicting evidence through visible partial findings.
  • Performance improves over time via updates to the external memory and agent behaviors.
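The third bullet's mechanism, visible partial findings that suppress redundant exploration and surface conflicts, can be illustrated with a small shared-workspace sketch. This is an assumption-laden toy, not the paper's Workboard API; the class and method names are invented here.

```python
import threading

class Workboard:
    """Toy shared workspace: workers post partial rows keyed by entity.
    Duplicates are dropped and conflicting values are logged for later
    reconciliation. Illustrative only."""

    def __init__(self) -> None:
        self._lock = threading.Lock()  # analogous to the paper's file locks
        self.rows: dict[str, dict] = {}
        self.conflicts: list[tuple[str, dict, dict]] = []

    def post(self, key: str, row: dict) -> bool:
        with self._lock:
            existing = self.rows.get(key)
            if existing is None:
                self.rows[key] = row
                return True           # new finding: worth keeping
            if existing != row:
                # Conflicting evidence becomes visible instead of silently lost.
                self.conflicts.append((key, existing, row))
            return False              # duplicate: skip redundant exploration

    def claimed(self, key: str) -> bool:
        with self._lock:
            return key in self.rows
```

A worker that checks `claimed(key)` before searching an entity avoids re-crawling what a peer has already covered, which is the redundancy-reduction claim in miniature.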

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The shared workspace mechanism could be adapted to other collaborative agent scenarios like distributed problem solving in science.
  • Human-readable memory might enable easy auditing or intervention in high-stakes information extraction.
  • If the coordination scales, it could handle searches across thousands of entities without proportional increases in errors.

Load-bearing premise

That the gains in table extraction accuracy arise chiefly from the bi-level setup, shared workspace coordination, and closed-loop reflection rather than from other unmentioned aspects of the implementation.

What would settle it

Running an ablation study that disables the shared workspace or the reflect step and measures the drop in success rate and F1 scores on the WideSearch benchmark.
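Such an ablation needs a fixed metric to compare against. The paper does not define Row F1 in the text available here; a minimal sketch under one plausible reading (exact-match rows with set semantics) would be:

```python
def row_f1(predicted: list[tuple], gold: list[tuple]) -> float:
    # One plausible reading of Row F1: a predicted row counts as correct
    # only if it exactly matches a gold row; F1 over the two row sets.
    pred, ref = set(predicted), set(gold)
    tp = len(pred & ref)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Running each ablated variant through the same scorer makes the per-mechanism drop directly comparable.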

Figures

Figures reproduced from arXiv: 2604.27221 by Huichi Zhou, Jun Wang, Ka Yiu Lee, Meng Fang, Weilin Luo, Yihang Chen, Yuxiang Chen, Yuxuan Huang, Zhiyuan He.

Figure 1: Web2BigTable interface during query execution. The left panel displays the user query…
Figure 2: Architecture of Web2BigTable, a unified training and inference framework. The upper-level orchestrator decomposes each query into subtasks using the Orchestrator Skills S_o, and lower-level workers execute them in parallel, drawing execution skills from the shared Worker Skills S_w and coordinating asynchronously through the Workboard m_e, protected by file locks and tag partitioning. Red arrows denote the ad…
Figure 3: Training (self-evolving) flow of Web2BigTable over one episode k. For each training query q_k, Stage 1 reads the long-term orchestrator skills S_o and decomposes q_k into subtasks τ. Stage 2 dispatches these subtasks to N parallel workers, which read execution skills from S_w and read/write the short-term workboard m_e until convergence. Stage 3 verifies the aggregated output X_k against the gold reference, pro…
Figure 4: Inference flow of Web2BigTable on an unseen user query q. Using the trained skill banks S*_o and S*_w as frozen read-only inputs, Stage 1 decomposes q into subtasks τ. Stage 2 runs N parallel workers that resolve execution skills from S*_w and coordinate through the shared workboard m_e (per-query, short-term); their partial outputs {x_i} are aggregated into the structured big table X. No verification, r…
Figure 5: Performance landscape on WideSearch (Avg@4). Scatter points show multi-agent…
Figure 6: Accuracy on XBench-DeepSearch.
Figure 7: Case study on task ws_en_006 (Taylor Swift concerts, 534 ground-truth rows, 6 tours). Workers collectively retrieve 653 raw rows; after deduplication and aggregation, 556 unique rows are submitted for evaluation. The auto-generated orchestrator skill selects entity-based decomposition with adaptive region splitting, achieving 93.8% Row F1 vs. 12.8%–26.8% for single-agent and skill-less baselines.
Figure 8: Case B: task ws_en_091 (AMD Zen processors, 331 ground-truth rows, 12 columns). Workers collectively retrieve ∼350 raw rows; after deduplication, ∼334 unique rows are submitted. Single agents retrieve fewer than 50 rows with Item F1 below 26%. Web2BigTable applies a learned product-line decomposition with dedicated spec-verification workers, achieving 89% Row F1 and 96% Item F1.
Figure 9: Case C: task ws_zh_069 (LLM papers from ByteDance Seed and DeepSeek, ∼130 ground-truth rows across two asymmetric sources). Single agents retrieve fewer than 30 rows with Item F1 below 40%. Web2BigTable applies a learned source-based decomposition with a dedicated arXiv verification worker, achieving 91% Row F1 and 94% Item F1.
Original abstract

Agentic web search increasingly faces two distinct demands: deep reasoning over a single target, and structured aggregation across many entities and heterogeneous sources. Current systems struggle on both fronts. Breadth-oriented tasks demand schema-aligned outputs with wide coverage and cross-entity consistency, while depth-oriented tasks require coherent reasoning over long, branching search trajectories. We introduce Web2BigTable, a multi-agent framework for web-to-table search that supports both regimes. Web2BigTable adopts a bi-level architecture in which an upper-level orchestrator decomposes the task into sub-problems and lower-level worker agents solve them in parallel. Through a closed-loop run-verify-reflect process, the framework jointly improves decomposition and execution over time via persistent, human-readable external memory, with self-evolving updates to each single agent. During execution, workers coordinate through a shared workspace that makes partial findings visible, allowing them to reduce redundant exploration, reconcile conflicting evidence, and adapt to emerging coverage gaps. Web2BigTable sets a new state of the art on WideSearch, reaching an Avg@4 Success Rate of 38.50 (7.5× the second best at 5.10), Row F1 of 63.53 (+25.03 over the second best), and Item F1 of 80.12 (+14.42 over the second best). It also generalises to depth-oriented search on XBench-DeepSearch, achieving 73.0 accuracy. Code is available at https://github.com/web2bigtable/web2bigtable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Web2BigTable, a bi-level multi-agent LLM framework for web-to-table search and extraction. An upper-level orchestrator decomposes tasks into sub-problems solved in parallel by lower-level worker agents; coordination occurs via a shared workspace, and a closed-loop run-verify-reflect process with persistent human-readable memory enables iterative improvement. The system is claimed to achieve new state-of-the-art results on the breadth-oriented WideSearch benchmark (Avg@4 Success Rate 38.50, 7.5× the second-best system; Row F1 63.53; Item F1 80.12) and to generalize to depth-oriented search on XBench-DeepSearch (73.0 accuracy). Code is released.

Significance. If the large reported gains prove robust and attributable to the bi-level orchestration, shared workspace, and closed-loop reflection rather than to unablated factors, the work would represent a meaningful advance in scalable agentic systems for structured web information extraction. The public code release is a clear strength that aids reproducibility and follow-up research.

major comments (2)
  1. [§4, Experimental Evaluation] The manuscript reports large benchmark gains on WideSearch (e.g., an Avg@4 Success Rate of 38.50 vs. 5.10) but supplies no descriptions of the baselines, number of runs, error bars, or statistical significance tests. Without these, the reliability and magnitude of the claimed improvements cannot be assessed.
  2. [§4, Experimental Evaluation] No ablation studies are presented that disable the upper-level orchestrator, the shared workspace visibility, or the reflect step while holding the underlying LLM, base prompts, and retrieval tools fixed. This absence prevents attribution of the 7.5× and F1 gains to the proposed bi-level architecture and coordination mechanisms rather than to other factors such as prompt tuning or benchmark characteristics.
minor comments (1)
  1. [Abstract and §3] The abstract and §3 would benefit from an explicit definition or formula for 'Avg@4 Success Rate' and 'Row F1' to ensure readers can interpret the metrics without ambiguity.
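The undefined metric admits a natural reading. A minimal sketch, assuming (this is an interpretation, not the paper's stated formula) that Avg@4 averages binary task success over four independent runs and then over tasks:

```python
def avg_at_k(run_successes: list[list[bool]], k: int = 4) -> float:
    # One plausible reading of Avg@k Success Rate: for each task, average
    # success over the first k independent runs, then average over tasks.
    # Reported as a percentage, matching the 38.50-style figures.
    per_task = [sum(runs[:k]) / k for runs in run_successes]
    return 100 * sum(per_task) / len(per_task)
```

If the paper instead counts a task as successful only when all k runs succeed (or any one does), the numbers change substantially, which is exactly why the referee asks for an explicit definition.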

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential significance of the bi-level multi-agent approach. We agree that the experimental evaluation requires additional rigor to support the reported gains and will revise §4 accordingly.

Point-by-point responses
  1. Referee: [§4, Experimental Evaluation] The manuscript reports large benchmark gains on WideSearch (e.g., an Avg@4 Success Rate of 38.50 vs. 5.10) but supplies no descriptions of the baselines, number of runs, error bars, or statistical significance tests. Without these, the reliability and magnitude of the claimed improvements cannot be assessed.

    Authors: We agree that these details are necessary for proper assessment. In the revised manuscript we will expand §4 with complete descriptions of every baseline, including their architectures, prompting, and tool configurations as implemented in the released code. We will also report the number of independent runs performed for each system, include error bars (standard deviation across runs), and add statistical significance tests (e.g., paired t-tests) comparing Web2BigTable against the strongest baselines. These additions will be directly supported by the public repository. revision: yes

  2. Referee: [§4, Experimental Evaluation] No ablation studies are presented that disable the upper-level orchestrator, the shared workspace visibility, or the reflect step while holding the underlying LLM, base prompts, and retrieval tools fixed. This absence prevents attribution of the 7.5× and F1 gains to the proposed bi-level architecture and coordination mechanisms rather than to other factors such as prompt tuning or benchmark characteristics.

    Authors: We concur that ablations are required to isolate the contribution of each proposed component. In the revised version we will add a new subsection presenting three controlled ablations performed with identical LLM, base prompts, and retrieval tools: (1) removal of the upper-level orchestrator (single-agent baseline), (2) disabling shared workspace visibility (independent workers), and (3) removal of the reflect step (single-pass run-verify only). Results will quantify the performance drop attributable to each mechanism. revision: yes
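The three promised variants form a simple one-factor-at-a-time grid. A hypothetical sketch (the flag names are invented here, not taken from the released code) of how such an ablation grid might be declared so that each variant differs from the full system in exactly one mechanism:

```python
# Each variant toggles one mechanism; LLM, prompts, and tools stay fixed.
ABLATIONS: dict[str, dict[str, bool]] = {
    "full":            dict(orchestrator=True,  shared_workspace=True,  reflect=True),
    "no_orchestrator": dict(orchestrator=False, shared_workspace=True,  reflect=True),
    "no_workspace":    dict(orchestrator=True,  shared_workspace=False, reflect=True),
    "no_reflect":      dict(orchestrator=True,  shared_workspace=True,  reflect=False),
}

def changed_mechanisms(name: str) -> list[str]:
    # Sanity check: which flags differ from the full configuration?
    full, variant = ABLATIONS["full"], ABLATIONS[name]
    return [flag for flag in full if full[flag] != variant[flag]]
```

Verifying that each ablation changes exactly one flag is what licenses attributing its score drop to that mechanism.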

Circularity Check

0 steps flagged

No circularity: empirical system description with benchmark results, no derivation chain or self-referential reductions

Full rationale

The paper introduces Web2BigTable as a bi-level multi-agent framework and reports direct empirical results on WideSearch and XBench-DeepSearch benchmarks. No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claims consist of measured success rates, F1 scores, and accuracy figures obtained from running the system; these are not constructed by re-expressing the architecture definition or prior self-citations as outputs. Absence of ablations is a separate evidence-strength issue, not a circularity reduction. The derivation chain is therefore self-contained as an engineering proposal plus experimental evaluation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Only the abstract is available, so the ledger reflects assumptions implied by the high-level description rather than explicit statements in the full text.

free parameters (1)
  • LLM prompt templates and hyperparameters
    Agent systems of this type typically require tuned prompts and sampling parameters whose values are not reported in the abstract.
axioms (1)
  • Domain assumption: lower-level worker agents can reliably solve decomposed sub-tasks when given access to web search and a shared workspace.
    The bi-level design rests on this capability of current LLMs.

pith-pipeline@v0.9.0 · 5622 in / 1310 out tokens · 78287 ms · 2026-05-07T09:30:57.905092+00:00 · methodology

discussion (0)



    Performance Comparison Row F1 Item F1 Rows GPT-5 mini (single agent) 22.4% 39.6% 29 Gemini 3 Flash (single agent) 18.7% 34.2% 23 Web2BigTable (w/o skills) 41% 58% 67 Web2BigTable (full) 91% 94% 134 ∆(full vs best single agent) +68.6 +54.4 +105 ∆(full vs w/o skills) +50.0 +36.0 +67 Key insight:The two sources have asymmetric catalogue sizes (∼120 vs∼10). S...