Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints

Dawei Zhou; Elynn Chen; Enpei Zhang; Jiayu Li; Yujun Yan

arxiv: 2605.19140 · v1 · pith:EH5OV6S7new · submitted 2026-05-18 · 💻 cs.AI

Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints

Jiayu Li , Enpei Zhang , Dawei Zhou , Elynn Chen , Yujun Yan This is my paper

Pith reviewed 2026-05-20 10:02 UTC · model grok-4.3

classification 💻 cs.AI

keywords decentralized Q-learninginterface-constrained SMDPfinite-sample boundsmulti-agent workflowsneural function approximationhandoff mechanismsapproximate information states

0 comments

The pith

IC-Q delivers a finite-sample error bound for neural Q-learning in decentralized multi-agent handoff workflows using only local observations and single-scalar coordination.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies workflow learning among specialized agents that hand off control through a shared artifact, where each agent observes only a local function of the artifact plus its private state and no learner sees joint trajectories. It formalizes the setting as an interface-constrained semi-Markov decision process and introduces IC-Q, an asynchronous decentralized Q-learning algorithm that exchanges exactly one scalar at each handoff. The central result is a finite-sample bound for neural IC-Q that separates into three independently controllable terms: neural function-approximation error, interface representation gap, and mixing-time residual under random option-duration discount. This bound is obtained by extending the approximate information state framework from single-agent MDPs to multi-agent SMDPs while handling random-duration Markovian noise. A reader would care because the result supplies the first such guarantee for neural Q-learning under decentralized partial observability and applies directly to multi-LLM pipelines and similar trust-boundary systems.

Core claim

The authors establish a finite-sample bound for neural IC-Q in IC-SMDPs that decomposes into neural function-approximation error, interface representation gap, and mixing-time residual under the random option-duration discount. Establishing the bound requires lifting the approximate information state framework from single-agent primitive-step MDPs to multi-agent SMDPs and controlling Markovian noise under random durations, steps not previously achieved. The result yields the first finite-sample guarantee for neural Q-learning under decentralized partial observability, with experiments confirming that IC-Q matches a centralized oracle without any agent observing joint trajectories.

What carries the argument

IC-Q, the asynchronous decentralized Q-learning algorithm that coordinates agents at handoffs using exactly one scalar in an interface-constrained semi-Markov decision process, with error controlled via the lifted approximate information state framework.

If this is right

Each of the three error sources can be reduced separately by enlarging the neural network, improving the interface representation, or adjusting the random-duration discount.
IC-Q reaches centralized-oracle performance on multi-LLM mathematical reasoning, multi-agent routing, and multi-agent CPU programming tasks.
Learning remains convergent even though no agent ever observes joint trajectories or receives centralized data.
The bound holds for variable handoff times because the random option-duration discount accounts for the mixing-time residual.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same error decomposition could guide interface design in other decentralized systems such as distributed robotics or sensor networks that share only partial observations.
Organizations could adopt single-scalar handoff protocols to achieve convergent workflow learning while keeping data sharing minimal.
Testing the bound at larger scale in production LLM agent pipelines would directly check whether the three terms remain independently controllable.

Load-bearing premise

The approximate information state framework can be lifted from single-agent MDPs to multi-agent SMDPs while controlling Markovian noise under random option durations.

What would settle it

A controlled synthetic IC-SMDP experiment in which the three error terms fail to scale independently when neural network capacity, interface dimension, and discount factor are varied one at a time.

read the original abstract

We study workflow learning in a setting where specialized agents hand off control through a shared artifact, each agent observes only a local function of that artifact and its own private state, and no centralized learner accesses joint trajectories -- the operating regime of multi-agent LLM pipelines that span organizational, vendor, or trust boundaries. We formalize this regime as an interface-constrained semi-Markov decision process (IC-SMDP), whose decision epochs occur at handoff times, and design IC-$Q$, an asynchronous decentralized $Q$-learning algorithm in which cross-agent coordination at every handoff is exactly one scalar. Our main result is a finite-sample bound for neural IC-$Q$ that decomposes into three independently controllable error sources: neural function-approximation error, interface representation gap, and a mixing-time residual, under the random option-duration discount. Establishing this bound requires lifting the approximate information state (AIS) framework from single-agent primitive-step MDPs to multi-agent SMDPs and controlling Markovian noise under random duration, neither of which has been done in prior work. To our knowledge this is the first finite-sample guarantee for neural $Q$-learning under decentralized partial observability. Four experiments: a controlled synthetic IC-SMDP that validates the bound term-by-term, multi-LLM mathematical reasoning, multi-agent routing, and multi-agent CPU programming, show that IC-$Q$ matches a centralized oracle without any agent observing joint trajectories, with each of the three error sources scaling along its corresponding axis as the bound predicts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a finite-sample bound for neural Q-learning in decentralized handoff workflows via a new IC-SMDP model and AIS lifting, with experiments that check the terms, but the extension for random durations may not fully control the claimed independent errors.

read the letter

This paper's main takeaway is a finite-sample bound for neural Q-learning in decentralized multi-agent workflows where agents hand off via a shared artifact without seeing full joint states. They formalize the setting as an interface-constrained semi-Markov decision process, or IC-SMDP, and propose IC-Q, an asynchronous decentralized algorithm that coordinates with a single scalar at each handoff. The bound splits into neural function-approximation error, interface representation gap, and mixing-time residual under random option-duration discount. This requires lifting the approximate information state framework to multi-agent SMDPs while handling Markovian noise from random durations, which they say is new. The paper does well on the experimental side. The synthetic IC-SMDP experiment validates each error term separately as predicted. On multi-LLM mathematical reasoning, agent routing, and CPU programming tasks, IC-Q matches a centralized oracle even though no agent observes joint trajectories, and the errors scale as expected with the three axes. The soft spot is the lifting itself. The stress-test note raises a fair point: random durations could create extra dependence between the interface gap and the mixing residual that the three-term decomposition doesn't absorb. Without seeing the full derivation steps and precise assumptions, it's difficult to confirm that the approximate information state remains sufficient under local observations and stochastic handoff times. This makes the soundness harder to assess from the abstract alone. The work targets researchers in decentralized RL and multi-agent LLM systems, especially those concerned with privacy-preserving or cross-organizational pipelines. A reader focused on finite-sample analysis in partial observability settings would get concrete value from the model and bound structure. I would send this to peer review. The novelty in the formalization and the specific claim justify referee time to check the proofs and the extension details.

Referee Report

2 major / 2 minor

Summary. The manuscript formalizes multi-agent handoff workflows through a shared artifact (with each agent observing only a local function of the artifact and its private state, and no centralized access to joint trajectories) as an interface-constrained semi-Markov decision process (IC-SMDP). It introduces the asynchronous decentralized IC-Q algorithm, which uses exactly one scalar for cross-agent coordination at each handoff. The central claim is a finite-sample bound for neural IC-Q that decomposes into three independently controllable error sources—neural function-approximation error, interface representation gap, and mixing-time residual—under the random option-duration discount. This bound is obtained by lifting the approximate information state (AIS) framework from single-agent primitive-step MDPs to multi-agent SMDPs while controlling Markovian noise under random durations. Experiments include term-by-term validation on a synthetic IC-SMDP plus three applied tasks (multi-LLM mathematical reasoning, multi-agent routing, and multi-agent CPU programming) showing IC-Q matches a centralized oracle.

Significance. If the bound and its decomposition hold, the result would be the first finite-sample guarantee for neural Q-learning under decentralized partial observability in an SMDP setting, directly relevant to multi-LLM pipelines spanning organizational boundaries. The explicit three-term decomposition, the term-by-term synthetic validation, and the reproducible experimental design that isolates each error source along its predicted axis are notable strengths. The work also supplies the first extension of the AIS framework to this multi-agent, random-duration regime.

major comments (2)

[§4 (main finite-sample bound, Theorem 1)] §4 (main finite-sample bound, Theorem 1): The claimed decomposition into three independently controllable terms requires that the lifted AIS framework fully absorbs Markovian noise induced by random option durations so that no residual correlation appears between the interface representation gap and the mixing-time residual. The provided derivation steps do not explicitly verify this independence under stochastic handoff times; if such dependence remains, the three-term separation fails.
[§3.2 (AIS lifting)] §3.2 (AIS lifting): The extension of the approximate information state from single-agent MDPs to multi-agent IC-SMDPs must ensure the local observation of the shared artifact remains a sufficient statistic when handoffs occur at random durations. The manuscript states this lifting has not been done previously, yet the precise assumptions on the duration distribution and the resulting Markovian noise control are not shown in sufficient detail to confirm sufficiency.

minor comments (2)

[Abstract] Abstract: Expand 'IC-SMDP' and 'IC-Q' on first use and briefly define the random option-duration discount for readers unfamiliar with the SMDP literature.
[Figure 2 (synthetic validation)] Figure 2 (synthetic validation): The caption should explicitly state which experimental axis corresponds to each of the three error terms in the bound to make the term-by-term match immediately visible.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading, the positive assessment of the significance, and the constructive comments on the proof details. We address each major comment below and will revise the manuscript to incorporate additional explicit verification and assumptions as requested.

read point-by-point responses

Referee: [§4 (main finite-sample bound, Theorem 1)] §4 (main finite-sample bound, Theorem 1): The claimed decomposition into three independently controllable terms requires that the lifted AIS framework fully absorbs Markovian noise induced by random option durations so that no residual correlation appears between the interface representation gap and the mixing-time residual. The provided derivation steps do not explicitly verify this independence under stochastic handoff times; if such dependence remains, the three-term separation fails.

Authors: We thank the referee for highlighting this aspect of the decomposition. The random option-duration discount is introduced precisely to ensure that stochastic handoff times induce Markovian noise that is absorbed entirely into the mixing-time residual term once the AIS is lifted; under this discount the interface state remains a sufficient statistic, so that cross terms between the interface representation gap and the mixing-time residual are zero. While the current derivation relies on this property, we agree that an explicit verification step would strengthen the presentation. In the revision we will insert a supporting lemma immediately preceding Theorem 1 that bounds any potential cross-covariance to zero under the random-duration assumption. revision: yes
Referee: [§3.2 (AIS lifting)] §3.2 (AIS lifting): The extension of the approximate information state from single-agent MDPs to multi-agent IC-SMDPs must ensure the local observation of the shared artifact remains a sufficient statistic when handoffs occur at random durations. The manuscript states this lifting has not been done previously, yet the precise assumptions on the duration distribution and the resulting Markovian noise control are not shown in sufficient detail to confirm sufficiency.

Authors: We agree that the assumptions on the duration distribution and the resulting noise control deserve a more explicit treatment to confirm sufficiency of the local observation. The lifting proceeds by showing that, under the random option-duration discount and standard moment conditions on the duration random variable (finite mean and independence from the artifact process), the local function of the shared artifact together with the private state forms an approximate information state at each decision epoch. In the revision we will expand §3.2 with a dedicated paragraph stating these assumptions and a short derivation verifying that the local observation remains sufficient when handoffs occur at random times. revision: yes

Circularity Check

0 steps flagged

Derivation self-contained; no reduction to inputs by construction

full rationale

The paper derives a finite-sample bound for neural IC-Q by lifting the AIS framework to multi-agent IC-SMDPs and controlling Markovian noise under random option durations, then decomposing the bound into three separately controllable terms (neural approximation error, interface gap, mixing residual). The abstract and description explicitly present this lifting as novel and previously undone, with no indication that any term is defined via a fitted parameter from the target bound, a self-citation chain, or an ansatz smuggled from prior author work. Experiments are described as validating the bound term-by-term on a synthetic IC-SMDP, confirming the derivation stands on its stated assumptions rather than reducing to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim rests on the new IC-SMDP formalization and the validity of lifting the single-agent AIS framework to multi-agent SMDPs with random durations; no explicit free parameters are named, and the invented modeling constructs are the primary additions beyond standard MDP theory.

axioms (1)

domain assumption Standard technical conditions for finite-sample Q-learning bounds hold after lifting to the IC-SMDP setting
The bound derivation requires these conditions to carry over from single-agent to the new multi-agent interface-constrained regime.

invented entities (2)

IC-SMDP no independent evidence
purpose: Formal model capturing handoff epochs, local observations, and interface constraints in multi-agent workflows
New modeling construct introduced to represent the operating regime of decentralized LLM pipelines.
IC-Q no independent evidence
purpose: Asynchronous decentralized Q-learning algorithm using single-scalar coordination at handoffs
New algorithm designed specifically for the IC-SMDP setting.

pith-pipeline@v0.9.0 · 5813 in / 1546 out tokens · 43010 ms · 2026-05-20T10:02:48.399421+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

finite-sample bound ... decomposes into three independently controllable error sources: neural function-approximation error, interface representation gap, and a mixing-time residual, under the random option-duration discount
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean LogicNat ≃ Nat recovery unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

lifting the approximate information state (AIS) framework from single-agent primitive-step MDPs to multi-agent SMDPs

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 6 internal anchors

[1]

Introducing the Model Context Protocol

Anthropic. Introducing the Model Context Protocol. https://www.anthropic.com/news/ model-context-protocol, 2024

work page 2024
[2]

The option-critic architecture

Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. InProceedings of the AAAI Conference on Artificial Intelligence, 2017

work page 2017
[3]

Bernstein, Robert Givan, Neil Immerman, and Shlomo Zilberstein

Daniel S. Bernstein, Robert Givan, Neil Immerman, and Shlomo Zilberstein. The complexity of decentralized control of Markov decision processes.Mathematics of Operations Research, 27(4):819–840, 2002

work page 2002
[4]

A finite time analysis of temporal difference learning with linear function approximation

Jalaj Bhandari, Daniel Russo, and Raghav Singal. A finite time analysis of temporal difference learning with linear function approximation. InProceedings of the 31st Conference on Learning Theory, volume 75 ofProceedings of Machine Learning Research, pages 1691–1692. PMLR, 2018

work page 2018
[5]

Bradtke and Michael O

Steven J. Bradtke and Michael O. Duff. Reinforcement learning methods for continuous-time Markov decision problems. InAdvances in Neural Information Processing Systems (NeurIPS), 1994

work page 1994
[6]

Lee, and Zhaoran Wang

Qi Cai, Zhuoran Yang, Jason D. Lee, and Zhaoran Wang. Neural temporal difference and Q learning provably converge to global optima.Mathematics of Operations Research, 49(1):619– 651, 2023

work page 2023
[7]

Deep transferq-learning for offline non-stationary reinforcement learning.arXiv preprint arXiv:2501.04870,

Jinhang Chai, Elynn Chen, and Jianqing Fan. Deep transfer q-learning for offline non-stationary reinforcement learning.arXiv preprint arXiv:2501.04870, 2025

work page arXiv 2025
[8]

Transfer Q-learning with composite MDP structures

Jinhang Chai, Elynn Chen, and Lin Yang. Transfer Q-learning with composite MDP structures. InProceedings of the 42nd International Conference on Machine Learning (ICML), volume 267 ofProceedings of Machine Learning Research, pages 7089–7106. PMLR, 2025

work page 2025
[9]

Optimistic transfer under task shift via Bellman alignment.arXiv preprint arXiv:2601.21924, 2026

Jinhang Chai, Enpei Zhang, Elynn Chen, and Yujun Yan. Optimistic transfer under task shift via Bellman alignment.arXiv preprint arXiv:2601.21924, 2026

work page arXiv 2026
[10]

Data-driven knowledge transfer in batch q˚ learning

Elynn Chen, Xi Chen, and Wenbo Jing. Data-driven knowledge transfer in batch q˚ learning. Journal of the American Statistical Association, 2026. Accepted; published online 05 Jan 2026

work page 2026
[11]

High-dimensional linear bandits under stochastic latent heterogeneity.arXiv preprint arXiv:2502.00423, 2025

Elynn Chen, Xi Chen, Wenbo Jing, and Xiao Liu. High-dimensional linear bandits under stochastic latent heterogeneity.arXiv preprint arXiv:2502.00423, 2025

work page arXiv 2025
[12]

Transfer learning for contextual joint assortment-pricing under cross-market heterogeneity.arXiv preprint arXiv:2603.18114, 2026

Elynn Chen, Xi Chen, and Yi Zhang. Transfer learning for contextual joint assortment-pricing under cross-market heterogeneity.arXiv preprint arXiv:2603.18114, 2026

work page arXiv 2026
[13]

Elynn Chen, Sai Li, and Michael I. Jordan. Transfer Q-learning for finite-horizon Markov decision processes.Electronic Journal of Statistics, 19(2):5289–5312, 2025

work page 2025
[14]

Elynn Chen, Rui Song, and Michael I. Jordan. Reinforcement learning in latent heterogeneous environments.Journal of the American Statistical Association, 119(548):3113–3126, 2024

work page 2024
[15]

arXiv preprint arXiv:2411.05451 , year=

Shengda Fan, Xin Cong, Yuepeng Fu, Zhong Zhang, Shuyan Zhang, Yuanwei Liu, Yesai Wu, Yankai Lin, Zhiyuan Liu, and Maosong Sun. WorkflowLLM: Enhancing workflow orchestration capability of large language models.arXiv preprint arXiv:2411.05451, 2024

work page arXiv 2024
[16]

Counterfactual multi-agent policy gradients

Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. InProceedings of the AAAI Conference on Artificial Intelligence, 2018

work page 2018
[17]

Agent2Agent (A2A) protocol specification, 2025

Google. Agent2Agent (A2A) protocol specification, 2025

work page 2025
[18]

Towards an AI co-scientist

Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, et al. Towards an AI co-scientist.arXiv preprint arXiv:2502.18864, 2025. 11

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, et al. MetaGPT: Meta programming for multi-agent collaborative framework.arXiv preprint arXiv:2308.00352, 2023. Cited as Hong et al., 2024 per ICLR publication

work page internal anchor Pith review Pith/arXiv arXiv 2023
[20]

Automated Design of Agentic Systems

Shengran Hu, Cong Lu, and Jeff Clune. Automated design of agentic systems.arXiv preprint arXiv:2408.08435, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation

Dong Huang, Qingwen Bu, Jie M. Zhang, Michael Luck, and Heming Cui. AgentCoder: Multi-agent-based code generation with iterative testing and optimisation.arXiv preprint arXiv:2312.13010, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[22]

Actor-attention-critic for multi-agent reinforcement learning

Shariq Iqbal and Fei Sha. Actor-attention-critic for multi-agent reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning (ICML), volume 97 of Proceedings of Machine Learning Research, pages 2961–2970. PMLR, 2019

work page 2019
[23]

Common information based approximate state representations in multi-agent reinforcement learning

Hsu Kao and Vijay Subramanian. Common information based approximate state representations in multi-agent reinforcement learning. InArtificial Intelligence and Statistics (AISTATS), 2022

work page 2022
[24]

Convergence of finite memory Q-learning for POMDPs and near optimality of learned policies under filter stability.Mathematics of Operations Research, 48(4):2066–2093, 2022

Ali Devran Kara and Serdar Yüksel. Convergence of finite memory Q-learning for POMDPs and near optimality of learned policies under filter stability.Mathematics of Operations Research, 48(4):2066–2093, 2022

work page 2066
[25]

Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch

Ryan Lowe, Yi I. Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. InAdvances in Neural Information Processing Systems (NeurIPS), 2017

work page 2017
[26]

Networked distributed POMDPs: A synthesis of distributed constraint optimization and POMDPs

Ranjit Nair, Pradeep Varakantham, Milind Tambe, and Makoto Yokoo. Networked distributed POMDPs: A synthesis of distributed constraint optimization and POMDPs. InProceedings of the 20th National Conference on Artificial Intelligence (AAAI), pages 133–139, 2005

work page 2005
[27]

Oliehoek and Christopher Amato.A Concise Introduction to Decentralized POMDPs

Frans A. Oliehoek and Christopher Amato.A Concise Introduction to Decentralized POMDPs. Springer, 2016

work page 2016
[28]

Oliehoek, Matthijs T

Frans A. Oliehoek, Matthijs T. J. Spaan, Shimon Whiteson, and Nikos Vlassis. Exploiting locality of interaction in factored Dec-POMDPs. InProceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 517–524, 2008

work page 2008
[29]

PhD thesis, University of Massachusetts Amherst, 2000

Doina Precup.Temporal Abstraction in Reinforcement Learning. PhD thesis, University of Massachusetts Amherst, 2000

work page 2000
[30]

Monotonic value function factorisation for deep multi-agent reinforcement learning.Journal of Machine Learning Research, 21(178):1–51, 2020

Tabish Rashid, Mikayel Samvelyan, Christian Schroeder De Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. Monotonic value function factorisation for deep multi-agent reinforcement learning.Journal of Machine Learning Research, 21(178):1–51, 2020

work page 2020
[31]

Periodic agent-state based Q-learning for POMDPs

Amit Sinha, Matthieu Geist, and Aditya Mahajan. Periodic agent-state based Q-learning for POMDPs. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

work page 2024
[32]

Agent-state based policies in POMDPs: Beyond belief-state MDPs

Amit Sinha and Aditya Mahajan. Agent-state based policies in POMDPs: Beyond belief-state MDPs. InIEEE Conference on Decision and Control (CDC), 2024

work page 2024
[33]

Difficulty-aware agentic orchestration for query-specific multi-agent workflows

Jinwei Su, Qizhen Lan, Yinghui Xia, Lifan Sun, Weiyou Tian, Tianyu Shi, and Lewei He. Difficulty-aware agentic orchestration for query-specific multi-agent workflows. pages 2060– 2070, 2026

work page 2060
[34]

Approximate information state for approximate planning and reinforcement learning in partially observed systems.Journal of Machine Learning Research, 23(12):1–83, 2022

Jayakumar Subramanian, Amit Sinha, Raihan Seraj, and Aditya Mahajan. Approximate information state for approximate planning and reinforcement learning in partially observed systems.Journal of Machine Learning Research, 23(12):1–83, 2022. arXiv:2010.08843

work page arXiv 2022
[35]

Leibo, Karl Tuyls, et al

Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z. Leibo, Karl Tuyls, et al. Value- decomposition networks for cooperative multi-agent learning based on team reward. InAu- tonomous Agents and Multi-Agent Systems (AAMAS), 2018. 12

work page 2018
[36]

Sutton, Doina Precup, and Satinder Singh

Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning.Artificial Intelligence, 112(1– 2):181–211, 1999

work page 1999
[37]

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W. White, Doug Burger, and Chi Wang. AutoGen: Enabling next-gen LLM applications via multi-agent conversation. arXiv preprint arXiv:2308.08155, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[38]

A finite-time analysis of q-learning with neural network function approximation

Pan Xu and Quanquan Gu. A finite-time analysis of q-learning with neural network function approximation. InProceedings of the 37th International Conference on Machine Learning, volume 119 ofProceedings of Machine Learning Research, pages 10555–10565. PMLR, 2020

work page 2020
[39]

Agentnet: Decentralized evolutionary coordination for llm-based multi-agent sys- tems.ArXiv, abs/2504.00587, 2025

Yingxuan Yang, Huacan Chai, Shuai Shao, Yuanyi Song, Siyuan Qi, Renting Rui, and Weinan Zhang. AgentNet: Decentralized evolutionary coordination for LLM-based multi-agent systems. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. arXiv:2504.00587

work page arXiv 2025
[40]

ReAct: Synergizing reasoning and acting in language models.International Conference on Learning Representations (ICLR), 2023

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models.International Conference on Learning Representations (ICLR), 2023

work page 2023
[41]

AFlow: Automating Agentic Workflow Generation

Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xiongwei Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, et al. AFlow: Automating agentic workflow genera- tion. InInternational Conference on Learning Representations (ICLR), 2025. arXiv:2410.10762

work page internal anchor Pith review Pith/arXiv arXiv 2025
[42]

Transfer faster, price smarter: Minimax dynamic pricing under cross-market preference shift

Yi Zhang, Elynn Chen, and Yujun Yan. Transfer faster, price smarter: Minimax dynamic pricing under cross-market preference shift. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. Spotlight; arXiv:2505.17203

work page arXiv 2025
[43]

Prior-aligned meta-RL: Thompson sampling with learned priors and guarantees in finite-horizon MDPs.arXiv preprint arXiv:2510.05446, 2025

Runlin Zhou, Chixiang Chen, and Elynn Chen. Prior-aligned meta-RL: Thompson sampling with learned priors and guarantees in finite-horizon MDPs.arXiv preprint arXiv:2510.05446, 2025

work page arXiv 2025
[44]

GPTSwarm: Language agents as optimizable graphs

Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, and Jürgen Schmidhuber. GPTSwarm: Language agents as optimizable graphs. InProceedings of the 41st International Conference on Machine Learning (ICML), 2024. 13

work page 2024

[1] [1]

Introducing the Model Context Protocol

Anthropic. Introducing the Model Context Protocol. https://www.anthropic.com/news/ model-context-protocol, 2024

work page 2024

[2] [2]

The option-critic architecture

Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. InProceedings of the AAAI Conference on Artificial Intelligence, 2017

work page 2017

[3] [3]

Bernstein, Robert Givan, Neil Immerman, and Shlomo Zilberstein

Daniel S. Bernstein, Robert Givan, Neil Immerman, and Shlomo Zilberstein. The complexity of decentralized control of Markov decision processes.Mathematics of Operations Research, 27(4):819–840, 2002

work page 2002

[4] [4]

A finite time analysis of temporal difference learning with linear function approximation

Jalaj Bhandari, Daniel Russo, and Raghav Singal. A finite time analysis of temporal difference learning with linear function approximation. InProceedings of the 31st Conference on Learning Theory, volume 75 ofProceedings of Machine Learning Research, pages 1691–1692. PMLR, 2018

work page 2018

[5] [5]

Bradtke and Michael O

Steven J. Bradtke and Michael O. Duff. Reinforcement learning methods for continuous-time Markov decision problems. InAdvances in Neural Information Processing Systems (NeurIPS), 1994

work page 1994

[6] [6]

Lee, and Zhaoran Wang

Qi Cai, Zhuoran Yang, Jason D. Lee, and Zhaoran Wang. Neural temporal difference and Q learning provably converge to global optima.Mathematics of Operations Research, 49(1):619– 651, 2023

work page 2023

[7] [7]

Deep transferq-learning for offline non-stationary reinforcement learning.arXiv preprint arXiv:2501.04870,

Jinhang Chai, Elynn Chen, and Jianqing Fan. Deep transfer q-learning for offline non-stationary reinforcement learning.arXiv preprint arXiv:2501.04870, 2025

work page arXiv 2025

[8] [8]

Transfer Q-learning with composite MDP structures

Jinhang Chai, Elynn Chen, and Lin Yang. Transfer Q-learning with composite MDP structures. InProceedings of the 42nd International Conference on Machine Learning (ICML), volume 267 ofProceedings of Machine Learning Research, pages 7089–7106. PMLR, 2025

work page 2025

[9] [9]

Optimistic transfer under task shift via Bellman alignment.arXiv preprint arXiv:2601.21924, 2026

Jinhang Chai, Enpei Zhang, Elynn Chen, and Yujun Yan. Optimistic transfer under task shift via Bellman alignment.arXiv preprint arXiv:2601.21924, 2026

work page arXiv 2026

[10] [10]

Data-driven knowledge transfer in batch q˚ learning

Elynn Chen, Xi Chen, and Wenbo Jing. Data-driven knowledge transfer in batch q˚ learning. Journal of the American Statistical Association, 2026. Accepted; published online 05 Jan 2026

work page 2026

[11] [11]

High-dimensional linear bandits under stochastic latent heterogeneity.arXiv preprint arXiv:2502.00423, 2025

Elynn Chen, Xi Chen, Wenbo Jing, and Xiao Liu. High-dimensional linear bandits under stochastic latent heterogeneity.arXiv preprint arXiv:2502.00423, 2025

work page arXiv 2025

[12] [12]

Transfer learning for contextual joint assortment-pricing under cross-market heterogeneity.arXiv preprint arXiv:2603.18114, 2026

Elynn Chen, Xi Chen, and Yi Zhang. Transfer learning for contextual joint assortment-pricing under cross-market heterogeneity.arXiv preprint arXiv:2603.18114, 2026

work page arXiv 2026

[13] [13]

Elynn Chen, Sai Li, and Michael I. Jordan. Transfer Q-learning for finite-horizon Markov decision processes.Electronic Journal of Statistics, 19(2):5289–5312, 2025

work page 2025

[14] [14]

Elynn Chen, Rui Song, and Michael I. Jordan. Reinforcement learning in latent heterogeneous environments.Journal of the American Statistical Association, 119(548):3113–3126, 2024

work page 2024

[15] [15]

arXiv preprint arXiv:2411.05451 , year=

Shengda Fan, Xin Cong, Yuepeng Fu, Zhong Zhang, Shuyan Zhang, Yuanwei Liu, Yesai Wu, Yankai Lin, Zhiyuan Liu, and Maosong Sun. WorkflowLLM: Enhancing workflow orchestration capability of large language models.arXiv preprint arXiv:2411.05451, 2024

work page arXiv 2024

[16] [16]

Counterfactual multi-agent policy gradients

Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. InProceedings of the AAAI Conference on Artificial Intelligence, 2018

work page 2018

[17] [17]

Agent2Agent (A2A) protocol specification, 2025

Google. Agent2Agent (A2A) protocol specification, 2025

work page 2025

[18] [18]

Towards an AI co-scientist

Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, et al. Towards an AI co-scientist.arXiv preprint arXiv:2502.18864, 2025. 11

work page internal anchor Pith review Pith/arXiv arXiv 2025

[19] [19]

MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, et al. MetaGPT: Meta programming for multi-agent collaborative framework.arXiv preprint arXiv:2308.00352, 2023. Cited as Hong et al., 2024 per ICLR publication

work page internal anchor Pith review Pith/arXiv arXiv 2023

[20] [20]

Automated Design of Agentic Systems

Shengran Hu, Cong Lu, and Jeff Clune. Automated design of agentic systems.arXiv preprint arXiv:2408.08435, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[21] [21]

AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation

Dong Huang, Qingwen Bu, Jie M. Zhang, Michael Luck, and Heming Cui. AgentCoder: Multi-agent-based code generation with iterative testing and optimisation.arXiv preprint arXiv:2312.13010, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[22] [22]

Actor-attention-critic for multi-agent reinforcement learning

Shariq Iqbal and Fei Sha. Actor-attention-critic for multi-agent reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning (ICML), volume 97 of Proceedings of Machine Learning Research, pages 2961–2970. PMLR, 2019

work page 2019

[23] [23]

Common information based approximate state representations in multi-agent reinforcement learning

Hsu Kao and Vijay Subramanian. Common information based approximate state representations in multi-agent reinforcement learning. InArtificial Intelligence and Statistics (AISTATS), 2022

work page 2022

[24] [24]

Convergence of finite memory Q-learning for POMDPs and near optimality of learned policies under filter stability.Mathematics of Operations Research, 48(4):2066–2093, 2022

Ali Devran Kara and Serdar Yüksel. Convergence of finite memory Q-learning for POMDPs and near optimality of learned policies under filter stability.Mathematics of Operations Research, 48(4):2066–2093, 2022

work page 2066

[25] [25]

Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch

Ryan Lowe, Yi I. Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. InAdvances in Neural Information Processing Systems (NeurIPS), 2017

work page 2017

[26] [26]

Networked distributed POMDPs: A synthesis of distributed constraint optimization and POMDPs

Ranjit Nair, Pradeep Varakantham, Milind Tambe, and Makoto Yokoo. Networked distributed POMDPs: A synthesis of distributed constraint optimization and POMDPs. InProceedings of the 20th National Conference on Artificial Intelligence (AAAI), pages 133–139, 2005

work page 2005

[27] [27]

Oliehoek and Christopher Amato.A Concise Introduction to Decentralized POMDPs

Frans A. Oliehoek and Christopher Amato.A Concise Introduction to Decentralized POMDPs. Springer, 2016

work page 2016

[28] [28]

Oliehoek, Matthijs T

Frans A. Oliehoek, Matthijs T. J. Spaan, Shimon Whiteson, and Nikos Vlassis. Exploiting locality of interaction in factored Dec-POMDPs. InProceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 517–524, 2008

work page 2008

[29] [29]

PhD thesis, University of Massachusetts Amherst, 2000

Doina Precup.Temporal Abstraction in Reinforcement Learning. PhD thesis, University of Massachusetts Amherst, 2000

work page 2000

[30] [30]

Monotonic value function factorisation for deep multi-agent reinforcement learning.Journal of Machine Learning Research, 21(178):1–51, 2020

Tabish Rashid, Mikayel Samvelyan, Christian Schroeder De Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. Monotonic value function factorisation for deep multi-agent reinforcement learning.Journal of Machine Learning Research, 21(178):1–51, 2020

work page 2020

[31] [31]

Periodic agent-state based Q-learning for POMDPs

Amit Sinha, Matthieu Geist, and Aditya Mahajan. Periodic agent-state based Q-learning for POMDPs. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

work page 2024

[32] [32]

Agent-state based policies in POMDPs: Beyond belief-state MDPs

Amit Sinha and Aditya Mahajan. Agent-state based policies in POMDPs: Beyond belief-state MDPs. InIEEE Conference on Decision and Control (CDC), 2024

work page 2024

[33] [33]

Difficulty-aware agentic orchestration for query-specific multi-agent workflows

Jinwei Su, Qizhen Lan, Yinghui Xia, Lifan Sun, Weiyou Tian, Tianyu Shi, and Lewei He. Difficulty-aware agentic orchestration for query-specific multi-agent workflows. pages 2060– 2070, 2026

work page 2060

[34] [34]

Approximate information state for approximate planning and reinforcement learning in partially observed systems.Journal of Machine Learning Research, 23(12):1–83, 2022

Jayakumar Subramanian, Amit Sinha, Raihan Seraj, and Aditya Mahajan. Approximate information state for approximate planning and reinforcement learning in partially observed systems.Journal of Machine Learning Research, 23(12):1–83, 2022. arXiv:2010.08843

work page arXiv 2022

[35] [35]

Leibo, Karl Tuyls, et al

Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z. Leibo, Karl Tuyls, et al. Value- decomposition networks for cooperative multi-agent learning based on team reward. InAu- tonomous Agents and Multi-Agent Systems (AAMAS), 2018. 12

work page 2018

[36] [36]

Sutton, Doina Precup, and Satinder Singh

Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning.Artificial Intelligence, 112(1– 2):181–211, 1999

work page 1999

[37] [37]

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W. White, Doug Burger, and Chi Wang. AutoGen: Enabling next-gen LLM applications via multi-agent conversation. arXiv preprint arXiv:2308.08155, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[38] [38]

A finite-time analysis of q-learning with neural network function approximation

Pan Xu and Quanquan Gu. A finite-time analysis of q-learning with neural network function approximation. InProceedings of the 37th International Conference on Machine Learning, volume 119 ofProceedings of Machine Learning Research, pages 10555–10565. PMLR, 2020

work page 2020

[39] [39]

Agentnet: Decentralized evolutionary coordination for llm-based multi-agent sys- tems.ArXiv, abs/2504.00587, 2025

Yingxuan Yang, Huacan Chai, Shuai Shao, Yuanyi Song, Siyuan Qi, Renting Rui, and Weinan Zhang. AgentNet: Decentralized evolutionary coordination for LLM-based multi-agent systems. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. arXiv:2504.00587

work page arXiv 2025

[40] [40]

ReAct: Synergizing reasoning and acting in language models.International Conference on Learning Representations (ICLR), 2023

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models.International Conference on Learning Representations (ICLR), 2023

work page 2023

[41] [41]

AFlow: Automating Agentic Workflow Generation

Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xiongwei Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, et al. AFlow: Automating agentic workflow genera- tion. InInternational Conference on Learning Representations (ICLR), 2025. arXiv:2410.10762

work page internal anchor Pith review Pith/arXiv arXiv 2025

[42] [42]

Transfer faster, price smarter: Minimax dynamic pricing under cross-market preference shift

Yi Zhang, Elynn Chen, and Yujun Yan. Transfer faster, price smarter: Minimax dynamic pricing under cross-market preference shift. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. Spotlight; arXiv:2505.17203

work page arXiv 2025

[43] [43]

Prior-aligned meta-RL: Thompson sampling with learned priors and guarantees in finite-horizon MDPs.arXiv preprint arXiv:2510.05446, 2025

Runlin Zhou, Chixiang Chen, and Elynn Chen. Prior-aligned meta-RL: Thompson sampling with learned priors and guarantees in finite-horizon MDPs.arXiv preprint arXiv:2510.05446, 2025

work page arXiv 2025

[44] [44]

GPTSwarm: Language agents as optimizable graphs

Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, and Jürgen Schmidhuber. GPTSwarm: Language agents as optimizable graphs. InProceedings of the 41st International Conference on Machine Learning (ICML), 2024. 13

work page 2024