Web2BigTable: A Bi-Level Multi-Agent LLM System for Internet-Scale Information Search and Extraction
Pith reviewed 2026-05-07 09:30 UTC · model grok-4.3
The pith
A bi-level multi-agent system coordinates parallel workers through a shared workspace to extract consistent tables from the web for both broad and deep queries.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Web2BigTable claims that its bi-level architecture, in which an orchestrator decomposes tasks for parallel worker agents that coordinate via a shared workspace and improve through a closed-loop run-verify-reflect process with persistent external memory, can handle both breadth-oriented structured aggregation and depth-oriented coherent reasoning in web search.
What carries the argument
An upper-level orchestrator that decomposes the task, lower-level workers that execute sub-tasks in parallel, shared-workspace coordination, and a closed-loop run-verify-reflect process with self-evolving single-agent updates.
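The bi-level control flow can be sketched as follows. This is an illustrative skeleton, not the paper's actual API: `decompose`, `run_worker`, and `orchestrate` are placeholder names, and the worker is a stub where a real implementation would search the web.

```python
# Hedged sketch of the bi-level flow: an orchestrator decomposes a table
# query into sub-tasks, workers run in parallel, and partial results are
# merged into one table. All names here are illustrative assumptions.
from concurrent.futures import ThreadPoolExecutor

def decompose(query: str, n_workers: int) -> list[str]:
    """Upper level: split the query into sub-problems (placeholder split)."""
    return [f"{query} [shard {i}]" for i in range(n_workers)]

def run_worker(subtask: str) -> list[dict]:
    """Lower level: a real worker would search the web; this returns stub rows."""
    return [{"entity": subtask, "value": len(subtask)}]

def orchestrate(query: str, n_workers: int = 4) -> list[dict]:
    subtasks = decompose(query, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = pool.map(run_worker, subtasks)
    # Merge worker outputs into a single table (list of rows).
    return [row for rows in partials for row in rows]

table = orchestrate("AMD Zen processors", n_workers=3)  # three stub rows
```

The point of the sketch is the separation of levels: the orchestrator never touches the web, and workers never see the full query, only their shard.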
If this is right
- It produces schema-aligned outputs with wide coverage and cross-entity consistency on breadth tasks.
- Workers can follow long, branching search paths coherently on depth tasks.
- The system reduces redundant exploration and reconciles conflicting evidence through visible partial findings.
- Performance improves over time via updates to the external memory and agent behaviors.
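The third bullet's redundancy-reduction claim can be made concrete with a toy shared workspace: workers claim entities before exploring them, so overlapping sub-queries are skipped rather than re-fetched. The paper excerpt does not specify the workspace schema; `claim` and `post` are hypothetical names for illustration.

```python
# Hedged sketch of shared-workspace coordination: partial findings are
# visible to all workers, so redundant lookups are skipped. The real
# system's workspace format is not given; this is a toy stand-in.
workspace: dict[str, dict] = {}  # entity -> row, visible to all workers

def claim(entity: str) -> bool:
    """Reserve an entity; False means another worker already owns it."""
    if entity in workspace:
        return False
    workspace[entity] = {"status": "in_progress"}
    return True

def post(entity: str, row: dict) -> None:
    """Publish a finished row so other workers can see and reconcile it."""
    workspace[entity] = {"status": "done", **row}

def worker(entities: list[str]) -> int:
    """Fill rows for unclaimed entities; return the count of skipped lookups."""
    skipped = 0
    for e in entities:
        if not claim(e):
            skipped += 1  # another worker already covers this entity
            continue
        post(e, {"value": e.upper()})  # stand-in for a web lookup
    return skipped

worker(["ryzen 5", "epyc 7k"])                 # claims both entities
skipped = worker(["ryzen 5", "threadripper"])  # the overlap is skipped
```

A single shared dict is enough here because the sketch is sequential; a parallel version would need the claim step to be atomic.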
Where Pith is reading between the lines
- The shared workspace mechanism could be adapted to other collaborative agent scenarios like distributed problem solving in science.
- Human-readable memory might enable easy auditing or intervention in high-stakes information extraction.
- If the coordination scales, it could handle searches across thousands of entities without proportional increases in errors.
Load-bearing premise
That the gains in table extraction accuracy arise chiefly from the bi-level setup, shared workspace coordination, and closed-loop reflection rather than from other unmentioned aspects of the implementation.
What would settle it
Running an ablation study that disables the shared workspace or the reflect step and measures the drop in success rate and F1 scores on the WideSearch benchmark.
read the original abstract
Agentic web search increasingly faces two distinct demands: deep reasoning over a single target, and structured aggregation across many entities and heterogeneous sources. Current systems struggle on both fronts. Breadth-oriented tasks demand schema-aligned outputs with wide coverage and cross-entity consistency, while depth-oriented tasks require coherent reasoning over long, branching search trajectories. We introduce Web2BigTable, a multi-agent framework for web-to-table search that supports both regimes. Web2BigTable adopts a bi-level architecture in which an upper-level orchestrator decomposes the task into sub-problems and lower-level worker agents solve them in parallel. Through a closed-loop run-verify-reflect process, the framework jointly improves decomposition and execution over time via persistent, human-readable external memory, with self-evolving updates to each single agent. During execution, workers coordinate through a shared workspace that makes partial findings visible, allowing them to reduce redundant exploration, reconcile conflicting evidence, and adapt to emerging coverage gaps. Web2BigTable sets a new state of the art on WideSearch, reaching an Avg@4 Success Rate of 38.50 (7.5× the second best at 5.10), Row F1 of 63.53 (+25.03 over the second best), and Item F1 of 80.12 (+14.42 over the second best). It also generalises to depth-oriented search on XBench-DeepSearch, achieving 73.0 accuracy. Code is available at https://github.com/web2bigtable/web2bigtable.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Web2BigTable, a bi-level multi-agent LLM framework for web-to-table search and extraction. An upper-level orchestrator decomposes tasks into sub-problems solved in parallel by lower-level worker agents; coordination occurs via a shared workspace, and a closed-loop run-verify-reflect process with persistent human-readable memory enables iterative improvement. The system is claimed to achieve new state-of-the-art results on the breadth-oriented WideSearch benchmark (Avg@4 Success Rate 38.50, 7.5× the second-best system; Row F1 63.53; Item F1 80.12) and to generalize to depth-oriented search on XBench-DeepSearch (73.0 accuracy). Code is released.
Significance. If the large reported gains prove robust and attributable to the bi-level orchestration, shared workspace, and closed-loop reflection rather than to unablated factors, the work would represent a meaningful advance in scalable agentic systems for structured web information extraction. The public code release is a clear strength that aids reproducibility and follow-up research.
major comments (2)
- [§4] §4 (Experimental Evaluation): The manuscript reports large benchmark gains on WideSearch (e.g., Avg@4 Success Rate of 38.50 vs. 5.10) but supplies no descriptions of the baselines, number of runs, error bars, or statistical significance tests. Without these, the reliability and magnitude of the claimed improvements cannot be assessed.
- [§4] §4 (Experimental Evaluation): No ablation studies are presented that disable the upper-level orchestrator, the shared workspace visibility, or the reflect step while holding the underlying LLM, base prompts, and retrieval tools fixed. This absence prevents attribution of the 7.5× and F1 gains to the proposed bi-level architecture and coordination mechanisms rather than to other factors such as prompt tuning or benchmark characteristics.
minor comments (1)
- [Abstract and §3] The abstract and §3 would benefit from an explicit definition or formula for 'Avg@4 Success Rate' and 'Row F1' to ensure readers can interpret the metrics without ambiguity.
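One plausible reading of the two metrics, sketched for concreteness. The benchmark's exact rules (cell normalisation, partial credit, the value of k) are not stated in the excerpt, so these definitions are assumptions, not the paper's.

```python
# Assumed metric definitions: Avg@k as mean binary success over k runs,
# Row F1 as set-level F1 under exact row matching. Both are guesses at
# what the benchmark computes, written down so readers can test a reading.
def avg_at_k(successes: list[bool]) -> float:
    """Mean binary task success over k independent runs, as a percentage."""
    return 100.0 * sum(successes) / len(successes)

def row_f1(pred_rows: set, gold_rows: set) -> float:
    """F1 between predicted and gold row sets under exact row matching."""
    if not pred_rows or not gold_rows:
        return 0.0
    tp = len(pred_rows & gold_rows)  # rows matching a gold row exactly
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred_rows), tp / len(gold_rows)
    return 2 * precision * recall / (precision + recall)

avg_at_k([True, False, False, False])               # -> 25.0
row_f1({("a", 1), ("b", 2)}, {("b", 2), ("c", 3)})  # -> 0.5
```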
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the potential significance of the bi-level multi-agent approach. We agree that the experimental evaluation requires additional rigor to support the reported gains and will revise §4 accordingly.
read point-by-point responses
Referee: [§4] §4 (Experimental Evaluation): The manuscript reports large benchmark gains on WideSearch (e.g., Avg@4 Success Rate of 38.50 vs. 5.10) but supplies no descriptions of the baselines, number of runs, error bars, or statistical significance tests. Without these, the reliability and magnitude of the claimed improvements cannot be assessed.
Authors: We agree that these details are necessary for proper assessment. In the revised manuscript we will expand §4 with complete descriptions of every baseline, including their architectures, prompting, and tool configurations as implemented in the released code. We will also report the number of independent runs performed for each system, include error bars (standard deviation across runs), and add statistical significance tests (e.g., paired t-tests) comparing Web2BigTable against the strongest baselines. These additions will be directly supported by the public repository.
revision: yes
Referee: [§4] §4 (Experimental Evaluation): No ablation studies are presented that disable the upper-level orchestrator, the shared workspace visibility, or the reflect step while holding the underlying LLM, base prompts, and retrieval tools fixed. This absence prevents attribution of the 7.5× and F1 gains to the proposed bi-level architecture and coordination mechanisms rather than to other factors such as prompt tuning or benchmark characteristics.
Authors: We concur that ablations are required to isolate the contribution of each proposed component. In the revised version we will add a new subsection presenting three controlled ablations performed with identical LLM, base prompts, and retrieval tools: (1) removal of the upper-level orchestrator (single-agent baseline), (2) disabling shared workspace visibility (independent workers), and (3) removal of the reflect step (single-pass run-verify only). Results will quantify the performance drop attributable to each mechanism.
revision: yes
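The three promised ablations can be expressed as explicit flags, so each run differs from the full system in exactly one component. The flag names are illustrative; the released code may organise its configuration differently.

```python
# Hypothetical ablation grid for the three proposed component removals.
# Each ablation flips exactly one flag relative to the full system.
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    orchestrator: bool = True      # bi-level decomposition vs single agent
    shared_workspace: bool = True  # workers see each other's partial findings
    reflect: bool = True           # closed loop vs single-pass run-verify

FULL = Config()
ABLATIONS = {
    "no_orchestrator": Config(orchestrator=False),   # ablation (1)
    "no_workspace": Config(shared_workspace=False),  # ablation (2)
    "no_reflect": Config(reflect=False),             # ablation (3)
}
```

Holding the LLM, prompts, and retrieval tools fixed while sweeping this grid is what would let the drop in Success Rate and F1 be attributed to a single mechanism.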
Circularity Check
No circularity: empirical system description with benchmark results, no derivation chain or self-referential reductions
full rationale
The paper introduces Web2BigTable as a bi-level multi-agent framework and reports direct empirical results on WideSearch and XBench-DeepSearch benchmarks. No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claims consist of measured success rates, F1 scores, and accuracy figures obtained from running the system; these are not constructed by re-expressing the architecture definition or prior self-citations as outputs. Absence of ablations is a separate evidence-strength issue, not a circularity reduction. The derivation chain is therefore self-contained as an engineering proposal plus experimental evaluation.
Axiom & Free-Parameter Ledger
free parameters (1)
- LLM prompt templates and hyperparameters
axioms (1)
- Domain assumption: lower-level worker agents can reliably solve decomposed sub-tasks when given access to web search and a shared workspace.
Reference graph
Works this paper leans on
- [1] Nikolas Belle, Dakota Barnes, Alfonso Amayuelas, Ivan Bercovich, Xin Eric Wang, and William Wang. Agents of change: Self-evolving LLM agents for strategic planning. arXiv preprint arXiv:2506.04651, 2025.
- [2] William Brach, Francesco Zuppichini, Marco Vinciguerra, and Lorenzo Padoan. ScrapeGraphAI-100k: A large-scale dataset for LLM-based web information extraction. arXiv preprint arXiv:2602.15189, 2026.
- [3] Kaiyuan Chen, Yixin Ren, Yang Liu, et al. xbench: Tracking agents productivity scaling with profession-aligned real-world evaluations, 2025.
- [4] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2Web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36:64335–64366, 2023.
- [5] Saman Forouzandeh, Wei Peng, Parham Moradi, Xinghuo Yu, and Mahdi Jalili. Learning hierarchical procedural memory for LLM agents through Bayesian selection and contrastive refinement. arXiv preprint arXiv:2512.18950, 2025.
- [6] Huan-ang Gao, Jiayi Geng, Wenyue Hua, Mengkang Hu, Xinzhe Juan, Hongzhang Liu, Shilong Liu, Jiahao Qiu, Xuan Qi, Yiran Wu, et al. A survey of self-evolving agents: What, when, how, and where to evolve on the path to artificial super intelligence. arXiv preprint arXiv:2507.21046, 2025.
- [7] Yubin Ge, Salvatore Romeo, Jason Cai, Monica Sunkara, and Yi Zhang. SAMULE: Self-learning agents enhanced by multi-level reflection. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025.
- [8] Shengran Hu, Cong Lu, and Jeff Clune. Automated design of agentic systems. In The Thirteenth International Conference on Learning Representations (ICLR), 2025.
- [9] Yuxuan Huang, Yihang Chen, Haozheng Zhang, Kang Li, Huichi Zhou, Meng Fang, Linyi Yang, Xiaoguang Li, Lifeng Shang, Songcen Xu, et al. Deep research agents: A systematic examination and roadmap. arXiv preprint arXiv:2506.18096, 2025.
- [10] Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-R1: Training LLMs to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516, 2025.
- [11] Ka Yiu Lee, Yuxuan Huang, Zhiyuan He, Huichi Zhou, Weilin Luo, Kun Shao, Meng Fang, and Jun Wang. InfoSeeker: A scalable hierarchical parallel agent framework for web information seeking. arXiv preprint arXiv:2604.02971, 2026.
- [12] Xiaoxi Li, Jiajie Jin, Guanting Dong, Hongjin Qian, Yutao Zhu, Yongkang Wu, Ji-Rong Wen, and Zhicheng Dou. WebThinker: Empowering large reasoning models with deep research capability. In Advances in Neural Information Processing Systems (NeurIPS), 2025.
- [13] Shuodi Liu, Yingzhuo Liu, Zi Wang, Yusheng Wang, Huijia Wu, Liuyu Xiang, and Zhaofeng He. Select-then-decompose: Adaptive selection strategy for task decomposition in large language models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025.
- [14] Xiao Liu, Hanyu Lai, Hao Yu, Yifan Xu, Aohan Zuo, Yuxiao Dong, and Jie Tang. WebGLM: Towards an efficient web-enhanced question answering system with human preferences. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 4549–4560, 2023.
- [15] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021.
- [16] Chen Qian, Zihao Xie, Yifei Wang, Wei Liu, Kunlun Zhu, Hanchen Xia, Yufan Dang, Zhuoyun Du, Weize Chen, Cheng Yang, Zhiyuan Liu, and Maosong Sun. Scaling large-language-model-based multi-agent collaboration. In Proceedings of the International Conference on Learning Representations (ICLR), 2025.
- [17] Jiongxiao Wang, Qiaojing Yan, Yawei Wang, Yijun Tian, Soumya Smruti Mishra, Zhichao Xu, Megha Gandhi, Panpan Xu, and Lin Lee Cheong. Reinforcement learning for self-improving agent with skill library. arXiv preprint arXiv:2512.17102, 2025.
- [18] Jun Wang. Memento-II: Learning by stateful reflective memory. arXiv preprint arXiv:2512.22716, 2025.
- [19] Ryan Wong, Jiawei Wang, Junjie Zhao, Li Chen, Yan Gao, Long Zhang, Xuan Zhou, Zuo Wang, Kai Xiang, Ge Zhang, Wenhao Huang, Yang Wang, and Ke Wang. WideSearch: Benchmarking agentic broad info-seeking. arXiv preprint arXiv:2508.07999, 2025.
- [20] Rong Wu, Xiaoman Wang, Jianbiao Mei, Pinlong Cai, Daocheng Fu, Cheng Yang, Licheng Wen, Xuemeng Yang, Yufan Shen, Yuxin Wang, and Botian Shi. EvolveR: Self-evolving LLM agents through an experience-driven lifecycle. arXiv preprint arXiv:2510.16079, 2025.
- [21] Shanglin Wu et al. Memory in LLM-based multi-agent systems: Mechanisms, challenges, and collective intelligence. TechRxiv preprint, 2025. doi: 10.36227/techrxiv.176539617.79044553.
- [22] Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, et al. SkillRL: Evolving agents via recursive skill-augmented reinforcement learning. arXiv preprint arXiv:2602.08234, 2026.
- [23] Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-MEM: Agentic memory for LLM agents. In Advances in Neural Information Processing Systems (NeurIPS), 2025.
- [24] Sikuan Yan, Xiufeng Yang, Zuchao Huang, Ercong Nie, Zifeng Ding, Zonggen Li, Xiaowen Ma, Jinhe Bi, Kristian Kersting, Jeff Z. Pan, Hinrich Schütze, Volker Tresp, and Yunpu Ma. Memory-R1: Enhancing large language model agents to manage and utilize memories via reinforcement learning. arXiv preprint arXiv:2508.19828, 2025.
- [25] Cheng Yang, Xuemeng Yang, Licheng Wen, Daocheng Fu, Jianbiao Mei, Rong Wu, Pinlong Cai, Yufan Shen, Nianchen Deng, Botian Shi, et al. Learning on the job: An experience-driven self-evolving agent for long-horizon tasks. arXiv preprint arXiv:2510.08002, 2025.
- [26] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models, 2023. arXiv:2210.03629.
- [27] Geunbin Yu. Adaptorch: Task-adaptive multi-agent orchestration in the era of LLM performance convergence. arXiv preprint arXiv:2602.16873, 2026.
- [28] Guibin Zhang et al. G-Memory: Tracing hierarchical memory for multi-agent systems. arXiv preprint arXiv:2506.07398, 2025.
- [29] Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xiong-Hui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, Bingnan Zheng, Bang Liu, Yuyu Luo, and Chenglin Wu. AFlow: Automating agentic workflow generation. In The Thirteenth International Conference on Learning Representations (ICLR), 2025.
- [30] Zeyu Zhang, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Quanyu Dai, Jieming Zhu, Zhenhua Dong, and Ji-Rong Wen. A survey on the memory mechanism of large language model-based agents. ACM Transactions on Information Systems, 43(6):155, 2025. doi: 10.1145/3748302.
- [31] Huichi Zhou, Yihang Chen, Siyuan Guo, Xue Yan, Kin Hei Lee, Zihan Wang, Ka Yiu Lee, Guchun Zhang, Kun Shao, Linyi Yang, and Jun Wang. Memento: Fine-tuning LLM agents without fine-tuning LLMs. Preprint, 2025.
- [32] Huichi Zhou, Siyuan Guo, Anjie Liu, Zhongwei Yu, Ziqin Gong, Bowen Zhao, Zhixun Chen, Menglong Zhang, Yihang Chen, Jinsong Li, et al. Memento-Skills: Let agents design agents. arXiv preprint arXiv:2603.18743, 2026.
- [33] Shuyan Zhou, Frank F Hou, Xingyao Cheng, Hongyuan Mao, Bowen Peng, Shengjia Zhong, Yihao Tai, Edward Corcodel, Rui Zhang, Uri Alon, et al. WebArena: A realistic web environment for building autonomous agents. In The Twelfth International Conference on Learning Representations, 2023.
Case studies (extracted from the paper's appendix)

Case study 1: breadth query ws_en_091

User query: List all AMD processors with Zen architecture released from Lisa Su becoming CEO (2014) to 2024 inclusive. Columns: Time, Product Series, Processor Model, Core Architecture, Manufacturing Process (nm), Cores, Threads, Core Frequency (GHz), L2 Cache (MB), L3 Cache (MB), Graphics Model, Number of Graphics Cores. Output "NA" if informa...

Baseline decomposition (without learned orchestrator skills). Strategy: split-by-time-period (default LLM reasoning). The orchestrator divided the 10-year span into coarse time windows, mixing product lines (Ryzen Desktop, EPYC Server, Threadripper, Mobile, Embedded, PRO) within each window. Workers queried broad terms such as "AMD Zen processors 2019–2021", ...

Learned orchestrator skill (auto-generated in Phase 1). The task-router classified this query as split-by-category. The decompose skill partitions by product line, not by time: "Split-by-category decomposition: when the query covers MULTIPLE PRODUCT LINES across a long time span, split by product category, NOT by year. Each worker specialises in one produ..."

Web2BigTable decomposition (applying learned skills). Strategy: split-by-category (product line) + split-by-generation for large lines. Worker 0: Ryzen Desktop (all generations); Worker 1: EPYC Server (Zen 1–2); Worker 2: EPYC Server (Zen 3–5); Worker 3: Ryzen Mobile (U/H/HS/HX series); Worker 4: Threadripper + Threadripper PRO; Worker 5: Ryzen PRO (Desktop + Mo... Result: Row F1 = 89%, Item F1 = 96%.

Performance comparison:

| System | Row F1 | Item F1 | Rows |
|---|---|---|---|
| GPT-5 mini (single agent) | 10.5% | 25.3% | 48 |
| Gemini 3 Flash (single agent) | 8.2% | 21.7% | 35 |
| Web2BigTable (w/o skills) | 18% | 32% | 137 |
| Web2BigTable (full) | 89% | 96% | 334 |
| Δ (full vs best single agent) | +78.5 | +70.7 | +286 |
| Δ (full vs w/o skills) | +71.0 | +64.0 | +197 |

Note: single-agent scores are for this specific task only, not the benchmar...
Key insight: with 12 columns per row, single agents retrieve fewer than 50 rows and fill barely 20% of cells correctly.

Case study 2: breadth query ws_zh_069

User query: Compile all large-model-related papers published by the ByteDance Seed team and DeepSeek between 1 January 2023 and 30 June 2025. Search the official websites of both organisations (any paper with Seed-team participation counts). For each paper, include the publication date (yyyy-mm-dd), title, and primary authors. If two records re...

Baseline decomposition (without learned orchestrator skills). Strategy: split-by-time-period (default LLM reasoning). The orchestrator divided the 30-month window into five 6-month chunks and dispatched workers without distinguishing source organisations. Workers issued generic queries such as "ByteDance Seed papers 2024 H1" and "DeepSeek papers 2024 H2", re...

Learned orchestrator skill (auto-generated in Phase 1). The task-router classified this query as split-by-source. The decompose skill partitions by source organisation, with a dedicated worker for cross-source temporal verification (arXiv date takes precedence): "Split-by-source decomposition: when the query enumerates records from MULTIPLE NAMED SOURCES (e.g., specific organisations, re..."

Web2BigTable decomposition (applying learned skills). Strategy: split-by-source + split-by-year for the larger source + a dedicated verification worker. Worker 0: Seed team, 2023 (full year); Worker 1: Seed team, 2024 (Jan–Jun); Worker 2: Seed team, 2024 (Jul–Dec); Worker 3: Seed team, 2025 (Jan–Jun); Worker 4: DeepSeek, entire 30-month window; Worker 5: arXiv cro... Result: 134 rows retrieved. Row F1 = 91%, Item F1 = 94%.

Performance comparison:

| System | Row F1 | Item F1 | Rows |
|---|---|---|---|
| GPT-5 mini (single agent) | 22.4% | 39.6% | 29 |
| Gemini 3 Flash (single agent) | 18.7% | 34.2% | 23 |
| Web2BigTable (w/o skills) | 41% | 58% | 67 |
| Web2BigTable (full) | 91% | 94% | 134 |
| Δ (full vs best single agent) | +68.6 | +54.4 | +105 |
| Δ (full vs w/o skills) | +50.0 | +36.0 | +67 |

Key insight: single agents retrieve fewer than 30 rows, missing the majority of Seed papers and most DeepSeek entries entirely. The two sources have asymmetric catalogue sizes (~120 vs ~10). S...
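The routing step these case studies describe can be caricatured as a strategy dispatch over query features. The keyword rules below are toy stand-ins for the learned skills, not the system's actual classifier.

```python
# Illustrative task-router: pick a decomposition strategy for a table
# query. The real router is learned; these keyword rules are assumptions
# chosen only to mirror the two case studies above.
def route(query: str) -> str:
    """Pick a decomposition strategy for a table query (toy heuristic)."""
    q = query.lower()
    if "organisations" in q or "websites" in q:
        return "split-by-source"      # multiple named sources
    if "processors" in q or "product" in q:
        return "split-by-category"    # multiple product lines, long time span
    return "split-by-time-period"     # default fallback

route("List all AMD processors with Zen architecture")            # -> "split-by-category"
route("papers from the official websites of both organisations")  # -> "split-by-source"
```

The interesting property is the failure mode it encodes: when no learned skill fires, the system falls back to the split-by-time-period default that both baselines used.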