AgensFlow: A Coordination-Policy Substrate for Multi-Agent Systems

Nicole Koenigstein

arxiv: 2605.27466 · v1 · pith:N6OJDSS5new · submitted 2026-05-26 · 💻 cs.MA · cs.AI· cs.LG· stat.ML

AgensFlow: A Coordination-Policy Substrate for Multi-Agent Systems

Nicole Koenigstein This is my paper

Pith reviewed 2026-07-01 16:05 UTC · model grok-4.3

classification 💻 cs.MA cs.AIcs.LGstat.ML

keywords multi-agent systemsLLM coordinationpolicy learningpartial observabilityrouting policiesagent workflowsonline learningtopology compression

0 comments

The pith

AgensFlow treats multi-agent coordination choices as an online policy-learning problem under partial observability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multi-agent LLM systems require frequent coordination decisions on roles, models, topologies, and step inclusion that resist fixed a priori design. AgensFlow reframes these decisions as observable, learnable policies updated from repeated trajectories rather than static wiring. The framework is tested on distributed-systems incident tasks and security-advisory tasks, where learned routing reaches higher-quality points than fixed baselines. Additional results isolate topology compression and show that warm-started policies cut exploration cost while holding plateau quality. A reader would care because the shift from brittle pipelines to adaptive routing addresses a core scalability barrier in deployed agent collectives.

Core claim

AgensFlow is an open-source framework that models multi-agent coordination as an online policy-learning problem under partial observability, rendering skill protocols, role assignments, model bindings, interaction topologies, and evaluation choices observable and improvable across trajectories instead of fixing them as pipeline constants. On distributed-systems incident and security-advisory corpora, learned routing attains higher-quality operating points than fixed baselines on coordination-heavy classes; skip mechanisms isolate topology compression as a distinct substrate benefit; and warm-started policy graphs lower exploration cost while preserving final quality.

What carries the argument

The coordination-policy substrate that renders coordination decisions observable and subject to online learning under partial observability.

If this is right

Learned routing reaches higher-quality operating points than fixed pipeline baselines on coordination-heavy task classes.
Topology compression via skip mechanisms forms a meaningful, isolable component of the substrate.
Warm-started policy graphs reduce exploration cost while preserving plateau quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The substrate could support continual adaptation when task regimes or operational constraints shift after deployment.
Auditable policy graphs may ease debugging and compliance review compared with opaque static pipelines.
Integration with existing agent orchestration layers could lower the manual tuning burden for new application domains.

Load-bearing premise

Coordination decisions remain sufficiently observable and repeatable across trajectories to support effective online policy learning under partial observability.

What would settle it

A head-to-head run on the same two task corpora in which learned policies achieve no higher quality than the fixed-pipeline baseline or fail to converge because decision outcomes prove non-repeatable.

Figures

Figures reproduced from arXiv: 2605.27466 by Nicole Koenigstein.

**Figure 1.** Figure 1: Per-class quality lift under 3-judge audit. Learned routing improves most strongly on coordination-heavy classes, especially C3 cross-document multi-vendor reasoning, C7 mitigation correctness, and C8 cross-vendor pair tasks. Procedural classes are ties or narrow trades. Taken together, these developments shift the central technical bottleneck from isolated agent capabilities to dynamic coordination. A rob… view at source ↗

**Figure 2.** Figure 2: summarizes the runtime lifecycle and the persistent substrate components before the individual design principles are unpacked below [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: shows the cold-start learning trajectory for the main run before the aggregate results are reported [PITH_FULL_IMAGE:figures/full_fig_p011_3.png] view at source ↗

**Figure 4.** Figure 4: Warm-start transfer. Warm-starting from the distributed-systems policy graph reduces early exploration cost on the synthetic security-advisory corpus while preserving plateau quality under cross-family audit. The figure also shows why the single-judge result requires audit: quality differences are modest under 3-judge scoring, while token compression is judgeindependent. Warm-starting from the distributed… view at source ↗

read the original abstract

Multi-agent systems built on large language models (LLMs) require many coordination choices that are difficult to fix a priori: which skill protocol to invoke, which agent role should perform a subtask, which model to bind to each role, how roles should interact, when to use retrieval or verification, and when to omit a step entirely. These choices interact with task regime and operational constraints, so static pipelines and one-off model comparisons provide only a limited view of the design space. This paper introduces AgensFlow, an open-source framework that treats multi-agent coordination as an online policy-learning problem under partial observability. The framework makes coordination decisions observable and learnable from repeated trajectories, rather than treating skill, role, model, topology, and evaluation choices as fixed pipeline design. AgensFlow is evaluated on two corpora: distributed-systems incident tasks and security-advisory tasks. The evaluation shows three main results: learned routing reaches a higher-quality operating point than a fixed pipeline baseline on coordination-heavy classes; skip:X isolates topology compression as a meaningful part of the substrate; and warm-started policy graphs can reduce exploration cost while preserving plateau quality. Overall, the results support that learned, auditable routing can improve coordination-heavy multi-agent workflows over static wiring.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AgensFlow turns multi-agent coordination into an online policy-learning problem, which is a clean reframing, but the reported results lack the methods detail needed to judge their strength.

read the letter

AgensFlow treats coordination choices in LLM multi-agent systems as an online policy learning task under partial observability. The framework makes decisions about roles, models, topologies, and steps observable from trajectories so they can be learned rather than fixed at design time.

The paper does a solid job explaining why static pipelines are limited when tasks and constraints vary. It ships an open-source substrate and evaluates on two corpora, reporting three outcomes: learned routing reaches better operating points than fixed baselines on coordination-heavy tasks, skip:X isolates the contribution of topology compression, and warm-started policies reduce exploration cost while keeping plateau quality. These directions make sense for the problem.

The soft spot is the evaluation presentation. The abstract states the three results but supplies no methods, data description, baseline construction, or variance measures. That leaves the central claim hard to assess from the given text. The assumption that coordination decisions are observable and repeatable enough for policy learning is exactly what the substrate targets, so it is not an unstated flaw, but the results still need the full experimental record to carry weight.

This work is aimed at researchers and engineers building multi-agent LLM systems who want to experiment with adaptive, auditable coordination instead of hand-tuned pipelines. A reader looking for a concrete starting point on learnable routing would find the framework description useful.

It deserves peer review. The idea is coherent, the problem is real, and the claims are testable. A referee can check the implementation and data details and ask for tighter evidence where needed.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces AgensFlow, an open-source framework that treats coordination choices in LLM-based multi-agent systems (skill protocols, roles, model bindings, topologies, retrieval/verification steps) as an online policy-learning problem under partial observability. Decisions are made observable and learnable from repeated trajectories rather than fixed a priori. Evaluation on distributed-systems incident tasks and security-advisory tasks reports three results: learned routing reaches higher-quality operating points than fixed-pipeline baselines on coordination-heavy classes; skip:X isolates topology compression; and warm-started policy graphs reduce exploration cost while preserving plateau quality. The central claim is that learned, auditable routing improves coordination-heavy multi-agent workflows over static wiring.

Significance. If the empirical results hold with adequate experimental detail, the work supplies a concrete substrate for adaptive coordination in LLM multi-agent systems, shifting emphasis from static design to trajectory-based policy learning with explicit observability. The open-source release and dual-corpus evaluation are constructive contributions to the multi-agent systems literature.

major comments (2)

Abstract: the three reported evaluation results (learned routing improvement, skip:X isolation, warm-start benefits) are presented without any description of methods, data construction, baseline definitions, metrics, or statistical reporting, so the central empirical claim cannot be assessed from the supplied text.
Evaluation (implied by abstract results): no information is given on how partial observability is handled during policy learning, what reward or quality signals are used, how trajectories are collected and replayed, or how the fixed-pipeline baseline is constructed, rendering the reported operating-point improvements unverifiable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. The comments highlight the need for greater explicitness in describing the evaluation methodology. We address each point below and will revise the manuscript to improve verifiability while preserving the core contributions.

read point-by-point responses

Referee: Abstract: the three reported evaluation results (learned routing improvement, skip:X isolation, warm-start benefits) are presented without any description of methods, data construction, baseline definitions, metrics, or statistical reporting, so the central empirical claim cannot be assessed from the supplied text.

Authors: We agree that the abstract, due to its length constraints, presents results at a high level without methodological specifics. The full manuscript contains dedicated Evaluation and Methods sections that define the corpora, metrics (task quality scores), baselines (static pipelines with fixed skill/role/model/topology choices), and statistical reporting (means and variances over repeated runs). To directly address the concern, we will revise by expanding the abstract with one additional sentence summarizing the evaluation setup and by adding a short 'Evaluation Overview' subsection early in the paper that lists data construction, metrics, and baseline definitions. revision: yes
Referee: Evaluation (implied by abstract results): no information is given on how partial observability is handled during policy learning, what reward or quality signals are used, how trajectories are collected and replayed, or how the fixed-pipeline baseline is constructed, rendering the reported operating-point improvements unverifiable.

Authors: The manuscript describes partial observability as arising from incomplete trajectory state (only observable coordination decisions and final task outcomes), with policy learning performed via online updates on repeated task executions. Reward signals are derived from task-specific quality metrics (e.g., incident resolution accuracy and advisory completeness scores). Trajectories are collected by running the system on the two corpora and replayed for policy gradient-style updates; the fixed-pipeline baseline is constructed by freezing all coordination choices to their most common static configuration observed in initial runs. We acknowledge that these elements could be stated more explicitly and will revise the Evaluation section to include a dedicated paragraph on observability handling, reward formulation, trajectory collection/replay procedure, and baseline construction, along with any additional pseudocode or parameter tables needed for full reproducibility. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper introduces a framework for treating coordination as an online policy-learning problem and supports its claims via empirical evaluation on two task corpora. No equations, fitted parameters, self-referential definitions, or load-bearing self-citations appear in the provided text. The central claim rests on reported evaluation outcomes (learned routing improvement, topology isolation, warm-start benefits) rather than any derivation that reduces to its own inputs by construction. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; insufficient information to populate ledger entries.

pith-pipeline@v0.9.1-grok · 5753 in / 909 out tokens · 32789 ms · 2026-07-01T16:05:49.015040+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

24 extracted references · 17 canonical work pages · 10 internal anchors

[1]

Finite-time Analysis of the Multi- armed Bandit Problem

Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. “Finite-time Analysis of the Multi- armed Bandit Problem”. In:Machine Learning47.2–3 (2002), pp. 235–256.doi:10.1023/ A:1013689704352

2002
[2]

Antoine Bigeard, Langston Nashold, Rayan Krishnan, and Shirley Wu.Finance Agent Benchmark: Benchmarking LLMs on Real-world Financial Research Tasks. 2025. arXiv: 2508.00828 [cs.CE].url:https://arxiv.org/abs/2508.00828

work page arXiv 2025
[3]

Accessed: 2026-03-22

Kyle Brown and OpenPipe Contributors.RULER: Relative-Universal LLM-Elicited Re- wards.https://github.com/OpenPipe/ART. Accessed: 2026-03-22. 2025

2026
[4]

Recurrent Independent Mechanisms

Anirudh Goyal, Alex Lamb, Jordan Hoffmann, Shagun Sodhani, Sergey Levine, Yoshua Bengio, and Bernhard Schölkopf. “Recurrent Independent Mechanisms”. In:International Conference on Learning Representations (ICLR)(2021)

2021
[5]

Yubin Kim, Ken Gu, Chanwoo Park, Chunjong Park, Samuel Schmidgall, A. Ali Heydari, Yao Yan, Zhihan Zhang, Yuchen Zhuang, Yun Liu, Mark Malhotra, Paul Pu Liang, Hae Won Park, Yuzhe Yang, Xuhai Xu, Yilun Du, Shwetak Patel, Tim Althoff, Daniel Mc- Duff, and Xin Liu.Towards a Science of Scaling Agent Systems. 2026. arXiv:2512.08296 [cs.AI].url:https://arxiv.o...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[6]

Yubin Kim, Xuhai Xu, Daniel McDuff, Cynthia Breazeal, and Hae Won Park.Health- LLM: Large Language Models for Health Prediction via Wearable Sensor Data. 2024. arXiv: 2401.06866 [cs.CL].url:https://arxiv.org/abs/2401.06866

work page arXiv 2024
[7]

Dynamic and Context-Dependent Stock Price Prediction Using At- tention Modules and News Sentiment

Nicole Koenigstein. “Dynamic and Context-Dependent Stock Price Prediction Using At- tention Modules and News Sentiment”. In:Digital Finance5.3 (Dec. 2023), pp. 449–481. doi:10.1007/s42521- 023- 00089- 7.url:https://doi.org/10.1007/s42521- 023- 00089-7

work page doi:10.1007/s42521- 2023
[8]

CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society

Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. “CAMEL: Communicative Agents for “Mind” Exploration of Large Language Model Society”. In:arXiv preprint arXiv:2303.17760(2023). 15

work page internal anchor Pith review Pith/arXiv arXiv 2023
[9]

Towards a Unified Theory of State Abstraction for MDPs

Lihong Li, Thomas J. Walsh, and Michael L. Littman. “Towards a Unified Theory of State Abstraction for MDPs”. In:ISAIM(2006)

2006
[10]

Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, Shuyi Wang, Binxu Li, Qunhong Zeng, Di Wang, Xuandong Zhao, Yuanli Wang, Roey Ben Chaim, Zonglin Di, Yipeng Gao, Junwei He, Yizhuo He, Liqiang Jing, Luyang Kong, Xin Lan, Jiachen Li, Songlin Li, Yijiang Li, Yueqian Lin, Xinyi Liu, X...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[11]

George Ling, Shanshan Zhong, and Richard Huang.Agent Skills: A Data-Driven Analysis of Claude Skills for Extending Large Language Model Functionality. 2026. arXiv:2602. 08004 [cs.SE].url:https://arxiv.org/abs/2602.08004

work page arXiv 2026
[12]

Landsness, Daniel L

Ludovico Mitchener, Angela Yiu, Benjamin Chang, Mathieu Bourdenx, Tyler Nadolski, Arvis Sulovari, Eric C. Landsness, Daniel L. Barabasi, Siddharth Narayanan, Nicky Evans, Shriya Reddy, Martha Foiani, Aizad Kamal, Leah P. Shriver, Fang Cao, Asmamaw T. Wassie, Jon M. Laurent, Edwin Melville-Green, Mayk Caldas, Albert Bou, Kaleigh F. Roberts, Sladjana Zagora...
[13]

arXiv:2511.02824 [cs.AI].url:https://arxiv.org/abs/2511.02824

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Izunna Okpala, Ashkan Golgoon, and Arjun Ravi Kannan.Agentic AI Systems Applied to tasks in Financial Services: Modeling and model risk management crews. 2025. arXiv: 2502.05439 [cs.AI].url:https://arxiv.org/abs/2502.05439

work page arXiv 2025
[15]

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom.Toolformer: Language Models Can Teach Themselves to Use Tools. 2023. arXiv:2302.04761 [cs.CL].url:https://arxiv. org/abs/2302.04761

work page internal anchor Pith review Pith/arXiv arXiv 2023
[16]

Bissyande.CodeAgent: Autonomous Communicative Agents for Code Review

Xunzhu Tang, Kisub Kim, Yewei Song, Cedric Lothritz, Bei Li, Saad Ezzini, Haoye Tian, Jacques Klein, and Tegawende F. Bissyande.CodeAgent: Autonomous Communicative Agents for Code Review. 2024. arXiv:2402.02172 [cs.SE].url:https://arxiv.org/ abs/2402.02172

work page arXiv 2024
[17]

A survey on large language model based autonomous agents

Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Jirong Wen. “A survey on large language model based autonomous agents”. In:Frontiers of Computer Science18.6 (Mar. 2024).issn: 2095-2236.doi:10.1007/s11704- 024- 40231- 1.url: http://dx.doi.org/10.1007/s1170...

work page doi:10.1007/s11704- 2024
[18]

Yihao Wang, Haoran Xu, Renjie Gu, Yixuan Ye, Xinyi Chen, Xinyu Mu, Yuan Gao, Chunxiao Guo, Peng Wei, Jinjie Gu, Huan Li, Ke Chen, and Lidan Shou.MedMemory- Bench: Benchmarking Agent Memory in Personalized Healthcare. 2026. arXiv:2605.11814 [cs.AI].url:https://arxiv.org/abs/2605.11814

work page internal anchor Pith review Pith/arXiv arXiv 2026
[19]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

JasonWei,XuezhiWang,DaleSchuurmans,MaartenBosma,BrianIchter,FeiXia,EdChi, Quoc V. Le, and Denny Zhou. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”. In:Advances in Neural Information Processing Systems (NeurIPS). 2022. 16

2022
[20]

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed H. Awadallah, Ryen W. White, Doug Burger, and Chi Wang. “AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation”. In:arXiv preprint arXiv:2308.08155(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[21]

TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

Frank F. Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z. Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, Mingyang Yang, Hao Yang Lu, Amaad Martin, Zhe Su, Leander Maben, Raj Mehta, Wayne Chi, Lawrence Jang, Yiqing Xie, Shuyan Zhou, and Graham Neubig.TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks. 2025. ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foer- ster, Jeff Clune, and David Ha.The AI Scientist-v2: Workshop-Level Automated Scien- tific Discovery via Agentic Tree Search. 2025. arXiv:2504.08066 [cs.AI].url:https: //arxiv.org/abs/2504.08066

work page internal anchor Pith review Pith/arXiv arXiv 2025
[23]

SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press.SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. 2024. arXiv:2405.15793 [cs.SE].url:https://arxiv.org/abs/ 2405.15793

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

ReAct: Synergizing Reasoning and Acting in Language Models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. “ReAct: Synergizing Reasoning and Acting in Language Models”. In:Interna- tional Conference on Learning Representations (ICLR). 2023. 17

2023

[1] [1]

Finite-time Analysis of the Multi- armed Bandit Problem

Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. “Finite-time Analysis of the Multi- armed Bandit Problem”. In:Machine Learning47.2–3 (2002), pp. 235–256.doi:10.1023/ A:1013689704352

2002

[2] [2]

Antoine Bigeard, Langston Nashold, Rayan Krishnan, and Shirley Wu.Finance Agent Benchmark: Benchmarking LLMs on Real-world Financial Research Tasks. 2025. arXiv: 2508.00828 [cs.CE].url:https://arxiv.org/abs/2508.00828

work page arXiv 2025

[3] [3]

Accessed: 2026-03-22

Kyle Brown and OpenPipe Contributors.RULER: Relative-Universal LLM-Elicited Re- wards.https://github.com/OpenPipe/ART. Accessed: 2026-03-22. 2025

2026

[4] [4]

Recurrent Independent Mechanisms

Anirudh Goyal, Alex Lamb, Jordan Hoffmann, Shagun Sodhani, Sergey Levine, Yoshua Bengio, and Bernhard Schölkopf. “Recurrent Independent Mechanisms”. In:International Conference on Learning Representations (ICLR)(2021)

2021

[5] [5]

Yubin Kim, Ken Gu, Chanwoo Park, Chunjong Park, Samuel Schmidgall, A. Ali Heydari, Yao Yan, Zhihan Zhang, Yuchen Zhuang, Yun Liu, Mark Malhotra, Paul Pu Liang, Hae Won Park, Yuzhe Yang, Xuhai Xu, Yilun Du, Shwetak Patel, Tim Althoff, Daniel Mc- Duff, and Xin Liu.Towards a Science of Scaling Agent Systems. 2026. arXiv:2512.08296 [cs.AI].url:https://arxiv.o...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[6] [6]

Yubin Kim, Xuhai Xu, Daniel McDuff, Cynthia Breazeal, and Hae Won Park.Health- LLM: Large Language Models for Health Prediction via Wearable Sensor Data. 2024. arXiv: 2401.06866 [cs.CL].url:https://arxiv.org/abs/2401.06866

work page arXiv 2024

[7] [7]

Dynamic and Context-Dependent Stock Price Prediction Using At- tention Modules and News Sentiment

Nicole Koenigstein. “Dynamic and Context-Dependent Stock Price Prediction Using At- tention Modules and News Sentiment”. In:Digital Finance5.3 (Dec. 2023), pp. 449–481. doi:10.1007/s42521- 023- 00089- 7.url:https://doi.org/10.1007/s42521- 023- 00089-7

work page doi:10.1007/s42521- 2023

[8] [8]

CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society

Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. “CAMEL: Communicative Agents for “Mind” Exploration of Large Language Model Society”. In:arXiv preprint arXiv:2303.17760(2023). 15

work page internal anchor Pith review Pith/arXiv arXiv 2023

[9] [9]

Towards a Unified Theory of State Abstraction for MDPs

Lihong Li, Thomas J. Walsh, and Michael L. Littman. “Towards a Unified Theory of State Abstraction for MDPs”. In:ISAIM(2006)

2006

[10] [10]

Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, Shuyi Wang, Binxu Li, Qunhong Zeng, Di Wang, Xuandong Zhao, Yuanli Wang, Roey Ben Chaim, Zonglin Di, Yipeng Gao, Junwei He, Yizhuo He, Liqiang Jing, Luyang Kong, Xin Lan, Jiachen Li, Songlin Li, Yijiang Li, Yueqian Lin, Xinyi Liu, X...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[11] [11]

George Ling, Shanshan Zhong, and Richard Huang.Agent Skills: A Data-Driven Analysis of Claude Skills for Extending Large Language Model Functionality. 2026. arXiv:2602. 08004 [cs.SE].url:https://arxiv.org/abs/2602.08004

work page arXiv 2026

[12] [12]

Landsness, Daniel L

Ludovico Mitchener, Angela Yiu, Benjamin Chang, Mathieu Bourdenx, Tyler Nadolski, Arvis Sulovari, Eric C. Landsness, Daniel L. Barabasi, Siddharth Narayanan, Nicky Evans, Shriya Reddy, Martha Foiani, Aizad Kamal, Leah P. Shriver, Fang Cao, Asmamaw T. Wassie, Jon M. Laurent, Edwin Melville-Green, Mayk Caldas, Albert Bou, Kaleigh F. Roberts, Sladjana Zagora...

[13] [13]

arXiv:2511.02824 [cs.AI].url:https://arxiv.org/abs/2511.02824

work page internal anchor Pith review Pith/arXiv arXiv

[14] [14]

Izunna Okpala, Ashkan Golgoon, and Arjun Ravi Kannan.Agentic AI Systems Applied to tasks in Financial Services: Modeling and model risk management crews. 2025. arXiv: 2502.05439 [cs.AI].url:https://arxiv.org/abs/2502.05439

work page arXiv 2025

[15] [15]

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom.Toolformer: Language Models Can Teach Themselves to Use Tools. 2023. arXiv:2302.04761 [cs.CL].url:https://arxiv. org/abs/2302.04761

work page internal anchor Pith review Pith/arXiv arXiv 2023

[16] [16]

Bissyande.CodeAgent: Autonomous Communicative Agents for Code Review

Xunzhu Tang, Kisub Kim, Yewei Song, Cedric Lothritz, Bei Li, Saad Ezzini, Haoye Tian, Jacques Klein, and Tegawende F. Bissyande.CodeAgent: Autonomous Communicative Agents for Code Review. 2024. arXiv:2402.02172 [cs.SE].url:https://arxiv.org/ abs/2402.02172

work page arXiv 2024

[17] [17]

A survey on large language model based autonomous agents

Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Jirong Wen. “A survey on large language model based autonomous agents”. In:Frontiers of Computer Science18.6 (Mar. 2024).issn: 2095-2236.doi:10.1007/s11704- 024- 40231- 1.url: http://dx.doi.org/10.1007/s1170...

work page doi:10.1007/s11704- 2024

[18] [18]

Yihao Wang, Haoran Xu, Renjie Gu, Yixuan Ye, Xinyi Chen, Xinyu Mu, Yuan Gao, Chunxiao Guo, Peng Wei, Jinjie Gu, Huan Li, Ke Chen, and Lidan Shou.MedMemory- Bench: Benchmarking Agent Memory in Personalized Healthcare. 2026. arXiv:2605.11814 [cs.AI].url:https://arxiv.org/abs/2605.11814

work page internal anchor Pith review Pith/arXiv arXiv 2026

[19] [19]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

JasonWei,XuezhiWang,DaleSchuurmans,MaartenBosma,BrianIchter,FeiXia,EdChi, Quoc V. Le, and Denny Zhou. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”. In:Advances in Neural Information Processing Systems (NeurIPS). 2022. 16

2022

[20] [20]

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed H. Awadallah, Ryen W. White, Doug Burger, and Chi Wang. “AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation”. In:arXiv preprint arXiv:2308.08155(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[21] [21]

TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

Frank F. Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z. Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, Mingyang Yang, Hao Yang Lu, Amaad Martin, Zhe Su, Leander Maben, Raj Mehta, Wayne Chi, Lawrence Jang, Yiqing Xie, Shuyan Zhou, and Graham Neubig.TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks. 2025. ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[22] [22]

Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foer- ster, Jeff Clune, and David Ha.The AI Scientist-v2: Workshop-Level Automated Scien- tific Discovery via Agentic Tree Search. 2025. arXiv:2504.08066 [cs.AI].url:https: //arxiv.org/abs/2504.08066

work page internal anchor Pith review Pith/arXiv arXiv 2025

[23] [23]

SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press.SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. 2024. arXiv:2405.15793 [cs.SE].url:https://arxiv.org/abs/ 2405.15793

work page internal anchor Pith review Pith/arXiv arXiv 2024

[24] [24]

ReAct: Synergizing Reasoning and Acting in Language Models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. “ReAct: Synergizing Reasoning and Acting in Language Models”. In:Interna- tional Conference on Learning Representations (ICLR). 2023. 17

2023