
arxiv: 2605.20425 · v1 · pith:2CFVY7URnew · submitted 2026-05-19 · 💻 cs.AI

AgentCo-op: Retrieval-Based Synthesis of Interoperable Multi-Agent Workflows

Pith reviewed 2026-05-21 07:01 UTC · model grok-4.3

classification 💻 cs.AI
keywords multi-agent workflows · retrieval-based synthesis · agent interoperability · workflow composition · genomics · local repair · scientific agents · typed handoffs

The pith

AgentCo-op assembles independent agents and tools into genomics workflows through typed handoffs and local repair.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that retrieval can synthesize executable multi-agent workflows from existing scientific agents and tool repositories in open domains that lack standard interfaces or evaluation metrics. It connects components via typed artifact handoffs so that data flows correctly between independently built pieces, then applies bounded local repair only to the parts that fail during a run. A reader would care because this sidesteps the usual need to redesign agents or optimize an entire graph topology for each new task. The genomics demonstrations show the method coordinating agents on spatial transcriptomics and single-cell multiome analysis while keeping the resulting workflows auditable. The same framework also improves benchmark performance and lowers per-task cost relative to other multi-agent setups.

Core claim

AgentCo-op is a retrieval-based synthesis framework that composes reusable skills, tools, and external agents into executable workflows through typed artifact handoffs and then applies bounded self-guided local repair to implicated components when execution evidence indicates failure. In two open-world genomics case studies it assembles independently developed scientific agents and external tool repositories into auditable workflows without redesigning them or running global topology search. One workflow coordinates agents for spatial transcriptomics and gene-set interpretation; the other builds a parallel workflow for cross-modality marker analysis on single-cell multiome data. The method can also import a searched workflow as a structural prior and improve it by grounding nodes with retrieved components and applying local repair.

What carries the argument

Typed artifact handoffs, which specify the data types exchanged between agents to guarantee interoperability, together with bounded self-guided local repair that targets only failing components instead of rewriting the whole workflow.
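
To make the two mechanisms concrete, here is a minimal sketch, not the authors' implementation. It assumes a linear chain of nodes rather than the paper's general workflow graph, and every name in it (`Node`, `check_handoffs`, `execute`, the `repair` callback, `max_repairs`) is hypothetical. Artifact types are plain string labels; the paper does not say how its types are represented.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Node:
    """One workflow step (hypothetical structure, not from the paper)."""
    name: str
    consumes: str                 # artifact type label expected on input
    produces: str                 # artifact type label promised on output
    run: Callable[[dict], dict]   # the wrapped agent or tool call


def check_handoffs(nodes: List[Node]) -> List[str]:
    """Return one message per adjacent pair whose artifact types disagree."""
    problems = []
    for upstream, downstream in zip(nodes, nodes[1:]):
        if upstream.produces != downstream.consumes:
            problems.append(
                f"{upstream.name} -> {downstream.name}: "
                f"{upstream.produces} != {downstream.consumes}"
            )
    return problems


def execute(nodes: List[Node], payload: dict,
            repair: Callable[[Node, Exception], Node],
            max_repairs: int = 2) -> Tuple[dict, List[str]]:
    """Run nodes in order; on failure, repair only the failing node.

    The repair budget is per node, so a failure never triggers a rewrite
    of the rest of the chain. Once the budget is spent the error propagates.
    """
    nodes = list(nodes)  # work on a copy so the caller's list is untouched
    log = []
    for i in range(len(nodes)):
        attempts = 0
        while True:
            try:
                payload = nodes[i].run(payload)
                log.append(f"{nodes[i].name}: ok")
                break
            except Exception as exc:
                if attempts >= max_repairs:
                    raise
                attempts += 1
                log.append(f"{nodes[i].name}: repair {attempts}")
                nodes[i] = repair(nodes[i], exc)
    return payload, log
```

In this sketch the type check runs before execution and the repair loop runs during it. Whether AgentCo-op separates the two phases this way is an assumption.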

If this is right

  • Independently developed scientific agents can be reused across tasks without redesign or interface changes.
  • Workflows remain auditable because each step uses known, retrieved components rather than opaque generated code.
  • A previously searched workflow can serve as a structural prior that retrieval then grounds with concrete components and repairs.
  • Per-task cost drops compared with standard multi-agent baselines while matching or exceeding accuracy on coding, math, and QA benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same retrieval-plus-local-repair pattern could apply to other heterogeneous scientific fields that already have scattered agent repositories.
  • Keeping workflows built from explicit, typed components may make AI-assisted scientific pipelines easier to reproduce and audit by human experts.
  • Hybrid systems that alternate between global search and retrieval-based grounding might handle even larger or more open-ended discovery tasks.
  • If local repair proves insufficient in new domains, the method would need extensions such as automatic interface synthesis to stay fully automatic.

Load-bearing premise

The premise is that existing agents and tools can be made to work together once typed artifact handoffs are defined, and that fixing problems locally will usually be enough to produce a working workflow in open scientific settings.

What would settle it

A new genomics task in which repeated local repairs on retrieved components still leave persistent handoff or execution failures that require manual redesign of agent interfaces or the overall topology.

Figures

Figures reproduced from arXiv: 2605.20425 by Jian Ma, Mingqian Ma, Shike Wang, Shuaike Shen, Wenduo Cheng.

Figure 1: Overview of AGENTCO-OP. AGENTCO-OP synthesizes multi-agent workflows through five main stages: Planning, Retrieval, Synthesis, Execution, and Review. Given a typed task specification x = (g, c, r, Ω), the system retrieves relevant knowledge, skills, tools, repositories, and datasets, then synthesizes an executable workflow graph G = (V, E). The synthesis stage includes initial graph construction, Dockerfil…
Figure 3: Sche builds isolated Docker containers, registers each container
Figure 3: AGENTCO-OP coordinates external tools for cross-modal marker discovery. AGENTCO-OP registers external Seurat and Signac tool nodes, runs parallel RNA and ATAC marker-discovery branches, validates typed artifacts, evaluates marker support against CellMarker 2.0 and PanglaoDB, and integrates the evidence into a final report.
Original abstract

Designing multi-agent workflows is especially difficult in open-ended scientific settings where tasks lack curated training sets, reliable scalar evaluation metrics, and standardized interfaces between existing tools and agents. We propose AgentCo-op, a retrieval-based synthesis framework that composes reusable skills, tools, and external agents into executable workflows through typed artifact handoffs, then applies bounded self-guided local repair to implicated components when execution evidence indicates failure. In two open-world genomics case studies, AgentCo-op composes independently developed scientific agents and external tool repositories into auditable workflows without redesigning them or running global topology search. It coordinates specialized agents for spatial transcriptomics and gene-set interpretation to enable collaborative discovery from spatial transcriptomics data, and builds a parallel workflow for cross-modality marker analysis on single-cell multiome data. AgentCo-op can also import a searched workflow as a structural prior and improve it by grounding nodes with retrieved components and applying local repair, showing that synthesis and search are complementary. On six coding, math, and question-answering benchmarks, AgentCo-op achieves the best result on four benchmarks and the best average score under a unified backbone setting, while consistently reducing per-task cost relative to multi-agent baselines. Together, these results suggest that retrieval-based synthesis can extend automated agentic workflow design beyond benchmark-optimized agent graphs to open-world workflows built from existing agents, tools, and typed artifacts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes AgentCo-op, a retrieval-based synthesis framework that composes reusable skills, tools, and external agents into executable multi-agent workflows via typed artifact handoffs, followed by bounded self-guided local repair when execution evidence indicates failure. It reports results from two open-world genomics case studies (spatial transcriptomics collaboration and cross-modality marker analysis on single-cell multiome data) in which independently developed agents and external tool repositories are assembled into auditable workflows without redesign or global topology search. The paper also shows that the method can improve a searched workflow by grounding nodes with retrieved components plus local repair, and reports top performance on four of six coding/math/QA benchmarks with reduced per-task cost under a unified backbone.

Significance. If the central claims are supported by the missing implementation details and validation, the work would usefully extend automated agentic workflow design from benchmark-optimized graphs to open scientific settings that lack curated training data, scalar metrics, and standardized interfaces. The explicit demonstration that retrieval-based synthesis and search-based methods are complementary, together with the reported cost reductions, are concrete strengths that could be leveraged by follow-on research.

major comments (2)
  1. [Genomics case studies] Genomics case studies section: the central claim that AgentCo-op composes independently developed scientific agents and external tool repositories into auditable workflows without redesign or global topology search depends on the sufficiency of typed artifact handoffs plus bounded local self-guided repair. The manuscript supplies no concrete description of the artifact types employed, the discovery or matching procedure to existing agent interfaces, the repair triggers and bounds, whether repair ever escalates to node replacement, or any failure-resolution traces from the two case studies. Without these, the claim that the method works in open scientific settings reduces to an untested assumption about interface compatibility.
  2. [Experimental evaluation] Benchmark results: the abstract states that AgentCo-op achieves the best result on four benchmarks and the best average score, yet the reported evaluation lacks statistical validation, error analysis, or per-run variance, which weakens the support for the claim of consistent superiority and cost reduction relative to multi-agent baselines.
minor comments (2)
  1. [Abstract] The abstract refers to 'six coding, math, and question-answering benchmarks' without naming them; listing the specific benchmarks (e.g., HumanEval, GSM8K, etc.) would improve clarity and reproducibility.
  2. Ensure that any workflow diagrams or tables in the case-study sections are accompanied by explicit legends that define the typed artifacts and repair actions shown.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments identify key areas where additional clarity and validation would strengthen the manuscript. We address each major comment below and will revise the manuscript to incorporate the requested details and improvements.

Point-by-point responses
  1. Referee: [Genomics case studies] Genomics case studies section: the central claim that AgentCo-op composes independently developed scientific agents and external tool repositories into auditable workflows without redesign or global topology search depends on the sufficiency of typed artifact handoffs plus bounded local self-guided repair. The manuscript supplies no concrete description of the artifact types employed, the discovery or matching procedure to existing agent interfaces, the repair triggers and bounds, whether repair ever escalates to node replacement, or any failure-resolution traces from the two case studies. Without these, the claim that the method works in open scientific settings reduces to an untested assumption about interface compatibility.

    Authors: We agree that the manuscript currently lacks sufficient concrete implementation details on these elements, which are necessary to fully substantiate the claims regarding interoperability in open scientific settings. In the revised manuscript, we will add a dedicated subsection (and supporting appendix material) that explicitly describes the artifact types employed in the genomics workflows, the discovery and matching procedure for agent interfaces, the repair triggers and iteration bounds, confirmation that repairs remain strictly local and do not escalate to node replacement, and selected failure-resolution traces from the two case studies. These additions will provide the missing evidence that typed handoffs combined with bounded local repair enable the reported workflows without redesign or global search. revision: yes

  2. Referee: [Experimental evaluation] Benchmark results: the abstract states that AgentCo-op achieves the best result on four benchmarks and the best average score, yet the reported evaluation lacks statistical validation, error analysis, or per-run variance, which weakens the support for the claim of consistent superiority and cost reduction relative to multi-agent baselines.

    Authors: We acknowledge that the current benchmark evaluation reports point estimates from single runs without variance measures or statistical tests, which limits the robustness of the superiority and cost-reduction claims. In the revised manuscript, we will augment the experimental evaluation section with results aggregated over multiple independent runs, including mean scores, standard deviations, and appropriate statistical significance tests (e.g., paired comparisons against baselines) to better support the reported performance advantages and cost savings. revision: yes
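
As an illustration of the multi-run reporting promised in the second response, the sketch below computes a mean and standard deviation per system and runs a paired comparison. It is one possible analysis, not the authors' protocol. The exact sign-flip permutation test is a stand-in chosen because it needs only the standard library; the function names are hypothetical, and any scores passed in would be per-run results paired by seed or task split.

```python
from itertools import product
from statistics import mean, stdev


def summarize(scores):
    """Return (mean, sample standard deviation) over independent runs."""
    return mean(scores), stdev(scores)


def paired_sign_flip_pvalue(system, baseline):
    """Exact two-sided sign-flip permutation test on paired differences.

    Under the null hypothesis each paired difference is equally likely to
    be positive or negative, so every sign assignment is enumerated and the
    fraction at least as extreme as the observed total is returned.
    Enumeration is 2**n, so this suits a handful of runs only.
    """
    diffs = [s - b for s, b in zip(system, baseline)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(sign * d for sign, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # tolerance for float rounding
            extreme += 1
    return extreme / total
```

With five paired runs the smallest attainable two-sided p-value is 2/32, so more runs are needed before a difference can clear a 0.05 threshold under this test.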

Circularity Check

0 steps flagged

No circularity: empirical results from case studies and benchmarks

Full rationale

The paper describes a retrieval-based synthesis method using typed artifact handoffs and bounded local repair, then reports outcomes from two genomics case studies and six benchmarks as experimental findings. No equations, fitted parameters, or derivations are presented that reduce by construction to the inputs. Claims about interoperability and repair are validated externally via application to open-world tasks rather than self-definition or self-citation chains. The derivation chain remains self-contained through method description plus independent evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on domain assumptions about agent interoperability and repair sufficiency, with no explicit free parameters or invented entities detailed in the abstract; the framework itself is the primary contribution.

axioms (1)
  • Domain assumption: Existing scientific agents and tools possess compatible typed interfaces that enable artifact handoffs without redesign.
    Invoked to support composition of independently developed components in the genomics case studies.


