arxiv: 2605.19140 · v1 · pith:EH5OV6S7new · submitted 2026-05-18 · 💻 cs.AI

Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints

Pith reviewed 2026-05-20 10:02 UTC · model grok-4.3

classification 💻 cs.AI
keywords decentralized Q-learning · interface-constrained SMDP · finite-sample bounds · multi-agent workflows · neural function approximation · handoff mechanisms · approximate information states

The pith

IC-Q delivers a finite-sample error bound for neural Q-learning in decentralized multi-agent handoff workflows using only local observations and single-scalar coordination.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies workflow learning among specialized agents that hand off control through a shared artifact, where each agent observes only a local function of the artifact plus its private state and no learner sees joint trajectories. It formalizes the setting as an interface-constrained semi-Markov decision process (IC-SMDP) and introduces IC-Q, an asynchronous decentralized Q-learning algorithm that exchanges exactly one scalar at each handoff. The central result is a finite-sample bound for neural IC-Q that separates into three independently controllable terms: neural function-approximation error, interface representation gap, and a mixing-time residual under the random option-duration discount. This bound is obtained by extending the approximate information state (AIS) framework from single-agent MDPs to multi-agent SMDPs while handling Markovian noise under random durations. A reader would care because, according to the authors, the result supplies the first such guarantee for neural Q-learning under decentralized partial observability and applies directly to multi-LLM pipelines and similar trust-boundary systems.

Core claim

The authors establish a finite-sample bound for neural IC-Q in IC-SMDPs that decomposes into neural function-approximation error, interface representation gap, and mixing-time residual under the random option-duration discount. Establishing the bound requires lifting the approximate information state framework from single-agent primitive-step MDPs to multi-agent SMDPs and controlling Markovian noise under random durations, steps not previously achieved. The result yields the first finite-sample guarantee for neural Q-learning under decentralized partial observability, with experiments confirming that IC-Q matches a centralized oracle without any agent observing joint trajectories.
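
As a reading aid, the three-term structure can be written schematically. The symbols below (the error norm, the three epsilon terms, and the update count T) are placeholders chosen for illustration and are not the paper's notation; the actual rates, constants, and norm are those of Theorem 1 in the paper and are not reproduced here.

```latex
% Schematic only: placeholder symbols, not the paper's statement of Theorem 1.
\[
\underbrace{\mathbb{E}\,\bigl\| Q_T - Q^\star \bigr\|}_{\text{IC-$Q$ error after $T$ updates}}
\;\lesssim\;
\underbrace{\varepsilon_{\mathrm{NN}}}_{\text{neural function approximation}}
\;+\;
\underbrace{\varepsilon_{\mathrm{IF}}}_{\text{interface representation gap}}
\;+\;
\underbrace{\varepsilon_{\mathrm{mix}}(T)}_{\text{mixing-time residual}}
\]
```

Read this way, "independently controllable" means each term depends on its own knob (network capacity, interface quality, or sample count and mixing) and contains no product with the other two.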

What carries the argument

IC-Q, the asynchronous decentralized Q-learning algorithm that coordinates agents at handoffs using exactly one scalar in an interface-constrained semi-Markov decision process, with error controlled via the lifted approximate information state framework.
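
The summary states only that IC-Q is asynchronous, decentralized, and exchanges exactly one scalar per handoff. The sketch below shows one way such an update could look. It rests on two assumptions of ours: that the scalar is the receiving agent's greedy value estimate for its own local observation, and that the discount applied is `gamma ** duration`, as in standard SMDP Q-learning. It is tabular rather than neural, and the class and method names (`HandoffAgent`, `value`, `update`) are hypothetical, not the authors' implementation.

```python
import random
from collections import defaultdict


class HandoffAgent:
    """Tabular stand-in for one decentralized learner (the paper uses neural Q-functions)."""

    def __init__(self, actions, alpha=0.1, gamma=0.9):
        self.actions = list(actions)
        self.alpha = alpha          # step size
        self.gamma = gamma          # per-primitive-step discount
        self.q = defaultdict(float)  # keyed by (local_observation, action)

    def value(self, obs):
        """The one scalar exposed at a handoff (assumed: max_a Q(obs, a))."""
        return max(self.q[(obs, a)] for a in self.actions)

    def act(self, obs, epsilon=0.1, rng=random):
        """Epsilon-greedy choice using only this agent's local observation."""
        if rng.random() < epsilon:
            return rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(obs, a)])

    def update(self, obs, action, reward, duration, next_scalar):
        """SMDP-style update: bootstrap from the scalar received at the handoff."""
        target = reward + (self.gamma ** duration) * next_scalar
        td_error = target - self.q[(obs, action)]
        self.q[(obs, action)] += self.alpha * td_error
        return td_error


# Usage: agent A hands off to agent B after holding control for 2 steps.
sender = HandoffAgent(["keep", "send"], alpha=0.5, gamma=0.9)
receiver = HandoffAgent(["finish"])
receiver.q[("review", "finish")] = 2.0       # pretend B has already learned this
scalar = receiver.value("review")            # the only thing A learns about B
sender.update("draft", "send", reward=1.0, duration=2, next_scalar=scalar)
```

In this sketch the sender never sees the receiver's observation, action set, or Q-table, which is the sense in which no agent observes joint trajectories.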

If this is right

  • Each of the three error sources can be reduced separately by enlarging the neural network, improving the interface representation, or adjusting the random-duration discount.
  • IC-Q reaches centralized-oracle performance on multi-LLM mathematical reasoning, multi-agent routing, and multi-agent CPU programming tasks.
  • Learning remains convergent even though no agent ever observes joint trajectories or receives centralized data.
  • The bound holds for variable handoff times because each decision epoch is discounted by its random option duration, with the Markovian noise this induces captured in the mixing-time residual.
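
To make the random option-duration discount concrete: in the standard options/SMDP setting, an option that lasts tau primitive steps discounts the next value by gamma**tau, so a random duration makes the discount itself random. The helper below computes its expectation for a finite duration distribution. The function names and the example distribution are illustrative assumptions; the paper's exact definition of the discount is not given in this summary.

```python
def effective_discount(gamma, duration_pmf):
    """Return E[gamma ** tau] for durations given as {tau: probability}."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    if abs(sum(duration_pmf.values()) - 1.0) > 1e-9:
        raise ValueError("duration probabilities must sum to 1")
    return sum(p * gamma ** tau for tau, p in duration_pmf.items())


def mean_duration(duration_pmf):
    """Return E[tau] for the same distribution."""
    return sum(p * tau for tau, p in duration_pmf.items())


# Handoffs take 1 step or 3 steps with equal probability (made-up numbers).
pmf = {1: 0.5, 3: 0.5}
random_discount = effective_discount(0.9, pmf)      # about 0.8145
fixed_discount = 0.9 ** mean_duration(pmf)          # about 0.81
```

Because gamma**tau is convex in tau, the expected discount is never smaller than the discount at the mean duration (Jensen's inequality), so replacing a random duration by its average understates the weight placed on the next handoff.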

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same error decomposition could guide interface design in other decentralized systems such as distributed robotics or sensor networks that share only partial observations.
  • Organizations could adopt single-scalar handoff protocols to achieve convergent workflow learning while keeping data sharing minimal.
  • Testing the bound at larger scale in production LLM agent pipelines would directly check whether the three terms remain independently controllable.

Load-bearing premise

The approximate information state framework can be lifted from single-agent MDPs to multi-agent SMDPs while controlling Markovian noise under random option durations.

What would settle it

A controlled synthetic IC-SMDP experiment in which the three error terms fail to scale independently when neural network capacity, interface dimension, and discount factor are varied one at a time.
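
A one-at-a-time design of this kind can be laid out as below. This is only a sketch of the experimental grid, not of the paper's synthetic environment: the axis names (`hidden_width`, `interface_dim`, `gamma`) and all values are hypothetical stand-ins for network capacity, interface dimension, and discount factor.

```python
def one_at_a_time(baseline, sweeps):
    """Build run configs that each differ from the baseline along exactly one axis."""
    runs = []
    for axis, values in sweeps.items():
        if axis not in baseline:
            raise KeyError(f"unknown axis: {axis}")
        for value in values:
            config = dict(baseline)   # copy so the baseline is never mutated
            config[axis] = value
            runs.append((axis, config))
    return runs


# Hypothetical knobs; the paper's actual settings are not listed in this summary.
baseline = {"hidden_width": 64, "interface_dim": 4, "gamma": 0.9}
sweeps = {
    "hidden_width": [16, 64, 256],
    "interface_dim": [1, 4, 16],
    "gamma": [0.8, 0.9, 0.99],
}
runs = one_at_a_time(baseline, sweeps)
```

If the decomposition holds, the measured error along each axis should track only the term tied to that axis; a change in a term whose axis was held at baseline would be the kind of evidence described above.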

Original abstract

We study workflow learning in a setting where specialized agents hand off control through a shared artifact, each agent observes only a local function of that artifact and its own private state, and no centralized learner accesses joint trajectories -- the operating regime of multi-agent LLM pipelines that span organizational, vendor, or trust boundaries. We formalize this regime as an interface-constrained semi-Markov decision process (IC-SMDP), whose decision epochs occur at handoff times, and design IC-$Q$, an asynchronous decentralized $Q$-learning algorithm in which cross-agent coordination at every handoff is exactly one scalar. Our main result is a finite-sample bound for neural IC-$Q$ that decomposes into three independently controllable error sources: neural function-approximation error, interface representation gap, and a mixing-time residual, under the random option-duration discount. Establishing this bound requires lifting the approximate information state (AIS) framework from single-agent primitive-step MDPs to multi-agent SMDPs and controlling Markovian noise under random duration, neither of which has been done in prior work. To our knowledge this is the first finite-sample guarantee for neural $Q$-learning under decentralized partial observability. Four experiments: a controlled synthetic IC-SMDP that validates the bound term-by-term, multi-LLM mathematical reasoning, multi-agent routing, and multi-agent CPU programming, show that IC-$Q$ matches a centralized oracle without any agent observing joint trajectories, with each of the three error sources scaling along its corresponding axis as the bound predicts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript formalizes multi-agent handoff workflows through a shared artifact (with each agent observing only a local function of the artifact and its private state, and no centralized access to joint trajectories) as an interface-constrained semi-Markov decision process (IC-SMDP). It introduces the asynchronous decentralized IC-Q algorithm, which uses exactly one scalar for cross-agent coordination at each handoff. The central claim is a finite-sample bound for neural IC-Q that decomposes into three independently controllable error sources—neural function-approximation error, interface representation gap, and mixing-time residual—under the random option-duration discount. This bound is obtained by lifting the approximate information state (AIS) framework from single-agent primitive-step MDPs to multi-agent SMDPs while controlling Markovian noise under random durations. Experiments include term-by-term validation on a synthetic IC-SMDP plus three applied tasks (multi-LLM mathematical reasoning, multi-agent routing, and multi-agent CPU programming) showing IC-Q matches a centralized oracle.

Significance. If the bound and its decomposition hold, the result would be the first finite-sample guarantee for neural Q-learning under decentralized partial observability in an SMDP setting, directly relevant to multi-LLM pipelines spanning organizational boundaries. The explicit three-term decomposition, the term-by-term synthetic validation, and the reproducible experimental design that isolates each error source along its predicted axis are notable strengths. The work also supplies the first extension of the AIS framework to this multi-agent, random-duration regime.

major comments (2)
  1. [§4 (main finite-sample bound, Theorem 1)] The claimed decomposition into three independently controllable terms requires that the lifted AIS framework fully absorbs Markovian noise induced by random option durations so that no residual correlation appears between the interface representation gap and the mixing-time residual. The provided derivation steps do not explicitly verify this independence under stochastic handoff times; if such dependence remains, the three-term separation fails.
  2. [§3.2 (AIS lifting)] The extension of the approximate information state from single-agent MDPs to multi-agent IC-SMDPs must ensure the local observation of the shared artifact remains a sufficient statistic when handoffs occur at random durations. The manuscript states this lifting has not been done previously, yet the precise assumptions on the duration distribution and the resulting Markovian noise control are not shown in sufficient detail to confirm sufficiency.
minor comments (2)
  1. [Abstract] Expand 'IC-SMDP' and 'IC-Q' on first use and briefly define the random option-duration discount for readers unfamiliar with the SMDP literature.
  2. [Figure 2 (synthetic validation)] The caption should explicitly state which experimental axis corresponds to each of the three error terms in the bound to make the term-by-term match immediately visible.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading, the positive assessment of the significance, and the constructive comments on the proof details. We address each major comment below and will revise the manuscript to incorporate additional explicit verification and assumptions as requested.

Point-by-point responses
  1. Referee: [§4 (main finite-sample bound, Theorem 1)] The claimed decomposition into three independently controllable terms requires that the lifted AIS framework fully absorbs Markovian noise induced by random option durations so that no residual correlation appears between the interface representation gap and the mixing-time residual. The provided derivation steps do not explicitly verify this independence under stochastic handoff times; if such dependence remains, the three-term separation fails.

    Authors: We thank the referee for highlighting this aspect of the decomposition. The random option-duration discount is introduced precisely so that the Markovian noise induced by stochastic handoff times is absorbed entirely into the mixing-time residual term once the AIS is lifted; under this discount the interface state remains a sufficient statistic, so the cross terms between the interface representation gap and the mixing-time residual are zero. While the current derivation relies on this property, we agree that an explicit verification step would strengthen the presentation. In the revision we will insert a supporting lemma immediately preceding Theorem 1 showing that the cross-covariance between these two terms vanishes under the random-duration assumption. revision: yes

  2. Referee: [§3.2 (AIS lifting)] The extension of the approximate information state from single-agent MDPs to multi-agent IC-SMDPs must ensure the local observation of the shared artifact remains a sufficient statistic when handoffs occur at random durations. The manuscript states this lifting has not been done previously, yet the precise assumptions on the duration distribution and the resulting Markovian noise control are not shown in sufficient detail to confirm sufficiency.

    Authors: We agree that the assumptions on the duration distribution and the resulting noise control deserve a more explicit treatment to confirm sufficiency of the local observation. The lifting proceeds by showing that, under the random option-duration discount and standard moment conditions on the duration random variable (finite mean and independence from the artifact process), the local function of the shared artifact together with the private state forms an approximate information state at each decision epoch. In the revision we will expand §3.2 with a dedicated paragraph stating these assumptions and a short derivation verifying that the local observation remains sufficient when handoffs occur at random times. revision: yes
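
For orientation, the single-agent notion being lifted can be paraphrased as follows, after Subramanian, Sinha, Seraj, and Mahajan (Journal of Machine Learning Research, 2022). The notation is illustrative and simplified (time-invariant epsilon and delta, a generic integral probability metric); it is not the paper's IC-SMDP statement, which this summary does not reproduce.

```latex
% Single-agent (\varepsilon,\delta)-AIS, paraphrased; notation is illustrative.
% \sigma_t : history compression,  \hat r : reward model,
% \hat P : AIS update kernel,      d_{\mathfrak F} : an integral probability metric.
\begin{align*}
\text{(reward)}\quad
  & \bigl|\, \mathbb{E}[R_t \mid H_t = h,\ A_t = a] - \hat r(\sigma_t(h), a) \,\bigr| \le \varepsilon, \\
\text{(prediction)}\quad
  & d_{\mathfrak F}\bigl( \Pr(Z_{t+1} \in \cdot \mid H_t = h,\ A_t = a),\;
      \hat P(\cdot \mid \sigma_t(h), a) \bigr) \le \delta,
  \qquad Z_{t+1} = \sigma_{t+1}(H_{t+1}).
\end{align*}
```

In the lifted setting described in the response, the index would presumably run over handoff epochs rather than primitive steps, and the compression would be each agent's local function of the artifact together with its private state; how the duration distribution enters these two conditions is exactly what the referee asks the authors to spell out.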

Circularity Check

0 steps flagged

Derivation self-contained; no reduction to inputs by construction

Full rationale

The paper derives a finite-sample bound for neural IC-Q by lifting the AIS framework to multi-agent IC-SMDPs and controlling Markovian noise under random option durations, then decomposing the bound into three separately controllable terms (neural approximation error, interface gap, mixing residual). The abstract and description explicitly present this lifting as novel and previously undone, with no indication that any term is defined via a fitted parameter from the target bound, a self-citation chain, or an ansatz smuggled from prior author work. Experiments are described as validating the bound term-by-term on a synthetic IC-SMDP, confirming the derivation stands on its stated assumptions rather than reducing to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the new IC-SMDP formalization and the validity of lifting the single-agent AIS framework to multi-agent SMDPs with random durations; no explicit free parameters are named, and the invented modeling constructs are the primary additions beyond standard MDP theory.

axioms (1)
  • domain assumption: Standard technical conditions for finite-sample Q-learning bounds hold after lifting to the IC-SMDP setting.
    The bound derivation requires these conditions to carry over from single-agent to the new multi-agent interface-constrained regime.
invented entities (2)
  • IC-SMDP (no independent evidence)
    purpose: Formal model capturing handoff epochs, local observations, and interface constraints in multi-agent workflows.
    New modeling construct introduced to represent the operating regime of decentralized LLM pipelines.
  • IC-Q (no independent evidence)
    purpose: Asynchronous decentralized Q-learning algorithm using single-scalar coordination at handoffs.
    New algorithm designed specifically for the IC-SMDP setting.

pith-pipeline@v0.9.0 · 5813 in / 1546 out tokens · 43010 ms · 2026-05-20T10:02:48.399421+00:00 · methodology


