arxiv: 2602.11574 · v3 · pith:LDADS4PWnew · submitted 2026-02-12 · 💻 cs.AI

Learning to Configure Agentic AI Systems

Pith reviewed 2026-05-22 11:25 UTC · model grok-4.3

classification 💻 cs.AI
keywords agent configuration · hierarchical policy · semi-Markov decision process · LLM agents · tool use · reasoning benchmarks · adaptive configuration · query-specific selection

The pith

Dynamically learning per-query agent configurations improves LLM accuracy on reasoning and tool-use tasks over fixed templates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that current LLM-based agent systems rely on fixed templates or hand-tuned heuristics for choosing workflows, tools, budgets, and prompts, which leads to brittle results and wasted compute because the same setup is applied regardless of query difficulty. It reformulates configuration selection as a semi-Markov decision process in which each full configuration acts as a temporally extended option, then introduces a lightweight hierarchical policy that learns to pick the right option for each incoming query. If this holds, agents would match their resource use and strategy to the specific demands of the query instead of over- or under-configuring, producing higher success rates without raising the overall compute budget. A sympathetic reader would care because this turns a manual design problem into a learnable one that could make tool-using and reasoning agents more reliable across varied inputs.

Core claim

ARC formulates agent configuration as a semi-Markov decision process where each possible combination of workflow, tools, token budget, and prompt serves as a temporally extended option, then trains a lightweight hierarchical policy to select the option best suited to the current query. Across reasoning, tool-use, and agentic benchmarks this learned selection raises average reasoning accuracy by 31.3 percent, tool-use accuracy by 13.95 percent, and doubles Pass^1 success on the tau-Bench Airline task from 9.0 percent to 18.0 percent relative to budget-matched tool-augmented baselines. The results establish that replacing one-size-fits-all designs with query-specific learned configurations is,

What carries the argument

ARC, the lightweight hierarchical policy that selects query-specific agent configurations treated as temporally extended options inside a semi-Markov decision process.
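
To make the selection step concrete, here is a minimal sketch. It is not the paper's ARC policy: it swaps in an epsilon-greedy contextual bandit for the learned hierarchical policy. The `Config` fields mirror the four choices named above (workflow, tools, token budget, prompt). Every concrete value, the `bucket_fn` query featurizer, and the class name `EpsilonGreedySelector` are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Config:
    """One option: a full agent configuration (all field values are illustrative)."""
    workflow: str
    tools: Tuple[str, ...]
    token_budget: int
    prompt_style: str


class EpsilonGreedySelector:
    """Toy stand-in for a learned policy: running mean reward per (query bucket, option)."""

    def __init__(self, options: List[Config], bucket_fn: Callable[[str], str],
                 epsilon: float = 0.1, seed: int = 0) -> None:
        self.options = options
        self.bucket_fn = bucket_fn      # hypothetical featurizer: query -> coarse bucket
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        # (bucket, option) -> (number of observations, mean reward)
        self.stats: Dict[Tuple[str, Config], Tuple[int, float]] = {}

    def select(self, query: str) -> Config:
        # Explore with probability epsilon, otherwise pick the best-known option.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.options)
        bucket = self.bucket_fn(query)
        return max(self.options,
                   key=lambda o: self.stats.get((bucket, o), (0, 0.0))[1])

    def update(self, query: str, option: Config, reward: float) -> None:
        # The option runs to completion before any reward is observed,
        # which is the sense in which it is "temporally extended".
        key = (self.bucket_fn(query), option)
        n, mean = self.stats.get(key, (0, 0.0))
        n += 1
        self.stats[key] = (n, mean + (reward - mean) / n)
```

In use, one would call `select(query)`, run the agent system under the returned configuration, and feed the resulting reward (for example, correctness minus a cost penalty, which is an assumption here) back through `update`.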

If this is right

  • ARC raises average reasoning accuracy by 31.3 percent while staying within the same compute budget as the baselines.
  • Tool-use accuracy increases by 13.95 percent through query-specific choice of workflows and tools.
  • Pass^1 success on the tau-Bench Airline task doubles from 9.0 percent to 18.0 percent.
  • Per-query adaptation replaces hand-tuned heuristics with a policy that matches configuration effort to query difficulty.
  • The approach demonstrates that treating configurations as learnable options in an SMDP can outperform static agent designs.
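
For readers unfamiliar with the Pass^1 figure in the list above, the sketch below assumes the pass^k definition used by the tau-Bench benchmark: the chance that all k independent trials of a task succeed, estimated per task as C(c, k) / C(n, k) for c successes in n trials and averaged over tasks. The function name `pass_hat_k` and the toy inputs are hypothetical, and this summary does not state how many trials per task the paper ran.

```python
from math import comb
from typing import List, Tuple


def pass_hat_k(results: List[Tuple[int, int]], k: int) -> float:
    """Estimate pass^k from per-task (n_trials, n_successes) pairs.

    Each task contributes C(c, k) / C(n, k); the estimate is the mean over tasks.
    """
    if not results:
        raise ValueError("need at least one task")
    total = 0.0
    for n, c in results:
        if k < 1 or k > n:
            raise ValueError("k must be between 1 and the number of trials per task")
        total += comb(c, k) / comb(n, k)   # comb(c, k) is 0 when c < k
    return total / len(results)
```

With k = 1 the estimator reduces to the plain average success rate. On that reading, the move from 9.0 percent to 18.0 percent is a gain of 9.0 percentage points, or 100 percent in relative terms.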

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the discretization into options remains tractable, the same hierarchical selection idea could be applied to other combinatorial choices such as prompt ensembles or model routing.
  • Deployed systems might automatically lower token budgets on easy queries and raise them only when needed, producing measurable savings at scale.
  • A direct follow-up experiment would test whether a policy trained on one underlying language model transfers to a different model without retraining.

Load-bearing premise

The space of agent configurations can be discretized into a manageable collection of reusable options whose values can be learned by a hierarchical policy without prohibitive sample complexity or overfitting to the training queries.

What would settle it

Training ARC on one collection of queries and then evaluating it on a fresh collection drawn from a noticeably different distribution; if the accuracy and success-rate gains over fixed configurations largely disappear, the claim that the learned policy generalizes usefully would be falsified.
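
The check described above can be sketched as a small calculation. This is an illustration under stated assumptions, not the paper's evaluation protocol: the function names, the idea of comparing against the single best fixed configuration, and the "retention" ratio are all hypothetical.

```python
from typing import Dict, List


def accuracy(outcomes: List[bool]) -> float:
    """Fraction of queries answered correctly."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(outcomes) / len(outcomes)


def gain_over_best_fixed(learned: List[bool], fixed: Dict[str, List[bool]]) -> float:
    """Accuracy of the learned policy minus that of the best single fixed configuration.

    Picking the best fixed configuration on the evaluation set itself favors the
    baseline, so a positive gain here is a conservative signal.
    """
    best_fixed = max(accuracy(o) for o in fixed.values())
    return accuracy(learned) - best_fixed


def retention(gain_in_dist: float, gain_shifted: float) -> float:
    """Share of the in-distribution gain that survives the distribution shift."""
    if gain_in_dist <= 0:
        raise ValueError("no in-distribution gain to retain")
    return gain_shifted / gain_in_dist
```

A retention near 1 would support the generalization claim; a retention near 0 would mean the gains largely disappear on the shifted queries.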

Figures

Figures reproduced from arXiv: 2602.11574 by Aditya Taparia, Ransalu Senanayake, Som Sagar.

Figure 1
Figure 1: (a) Shows how our method learns to configure optimal configuration across thousands of possibilities for the given input. (b) Shows improvement by our method over multiple datasets. (These results are for Qwen 2.5 7B Instruct model.) … performance degrades in long contexts due to the “lost-in-the-middle” phenomenon, where models fail to attend to relevant information (Liu et al., 2024; Hong et al., 2025). Se…
Figure 2
Figure 2: Training pipeline. The structure policy selects workflows, tools, and budgets while the prompt policy composes instructions. During RL training, episodes are stored in a memory buffer. After RL converges, high-reward episodes are filtered and used for supervised fine-tuning (SFT), which consolidates successful strategies and improves consistency. … optimization tractable, we decompose π into two levels in a …
Figure 3
Figure 3: Action masking reduces the effective action-sequence within the RL policy. … a single reasoning agent, yet A_struct permits allocation of tools and budgets to an additional agent dimension. To avoid such wasteful exploration of such infeasible configurations, we employ action masking to prune invalid action combinations, reducing |A_struct| to 41,904 valid configurations (a 32.6% reduction, …
Figure 5
Figure 5: Scaling trends of model accuracy with capacity. Accuracy as a function of model size for the Qwen 2.5 family (7B, 32B, 72B) across four benchmarks. Performance improves consistently with scale, with gains varying by task complexity. … ods, achieving high accuracy at lower cost. This indicates that instance-specific adaptation yields more efficient accuracy–cost trade-offs than uniform resource allocation o…
Figure 6
Figure 6: Error distribution across benchmarks. Reasoning tasks exhibit primarily reasoning errors, while tool-use tasks are dominated by knowledge gap errors. Policy configuration errors remain minimal (<10%) across all datasets. … for tractably navigating the combinatorial design space. 4.6. Error Analysis: We categorize errors across all benchmarks into four types: (1) policy configuration errors, where the learned …
Figure 7
Figure 7: Overview of the nine agentic workflows: Direct (0), Reason+Ans (1), Reason+Verify+Ans (2), Routing (3), Parallel-Sectioning (4), Parallel-Voting (5), Orchestrator-Workers (6), Evaluator-Optimizer (7), and Autonomous-Agent (8). Each workflow defines a distinct pattern of LLM calls and agent interactions. … Our framework supports nine agentic workflows, ranging from single-call baselines to multi-agent orchest…
Figure 8
Figure 8: Overall score vs. runtime vs. multimodality with the Pareto frontier under a three-dimensional dominance criterion (runtime ↓, score ↑, multimodality ↑). Circles denote text-only models and triangles denote multimodal models. The selected multimodal model (MetaCLIP-H14) is highlighted. … Clustering Quality (ARI). We evaluate how well embeddings group semantically similar questions using Adjusted Rand Index…
Figure 9
Figure 9: Training dynamics of ARC across datasets. Left: cumulative reward over episodes, showing steady improvement as the policy discovers higher-value configurations on GSM8K, DROP, MedQA, HotpotQA, and GAIA. Middle: rolling mean ± standard deviation of per-episode reward, indicating reduced variance and stabilization over time. Right: running validation accuracy, demonstrating that reward gains translate into i…
Figure 10
Figure 10: Tool usage during training. Running average number of tools used per episode for each dataset. ARC quickly learns sparse tool usage and gradually adjusts invocation patterns, with different steady-state levels reflecting task-specific reliance on tools.
Figure 11
Figure 11: Evolution of workflow selection during training. Stacked area plots show, for GSM8K, HotpotQA, and GAIA, the fraction of episodes assigned to each workflow as training progresses. The structure policy quickly prunes suboptimal patterns and concentrates mass on a small set of task-appropriate workflows (e.g., Evaluator–Optimizer on GSM8K, Orchestrator–Workers on HotpotQA). … rather than indiscriminately call…
Figure 12
Figure 12: Accuracy by workflow and dataset. Each bar shows the average accuracy of a fixed workflow on a given benchmark. Performance varies substantially across workflows and tasks (no single workflow is uniformly optimal), highlighting the importance of learning query-adaptive configurations rather than relying on a fixed architecture.
Figure 13
Figure 13: Reward distribution by workflow. Violin plots show the distribution of per-episode rewards for each workflow across datasets. Higher-performing workflows exhibit both higher central reward and tighter spread, illustrating that certain structural patterns not only achieve better returns but also yield more stable behavior during training. … variance. On GSM8K, GRPO achieved 81.2% accuracy after 2,000 episode…
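
The action masking mentioned in the Figure 3 caption can be illustrated with a toy example. This is a sketch under stated assumptions, not the paper's action space: the three workflows, the budget values, and the validity rule (single-agent workflows may not allocate a budget to a second agent) are hypothetical stand-ins, and the toy counts bear no relation to the 41,904 figure quoted above.

```python
import math
from itertools import product
from typing import List, Optional, Tuple

# Hypothetical, tiny structural action space.
WORKFLOWS = ["direct", "reason_answer", "orchestrator_workers"]
SINGLE_AGENT = {"direct", "reason_answer"}
BUDGETS = [256, 1024]
SECOND_AGENT_BUDGETS: List[Optional[int]] = [None, 256, 1024]

Action = Tuple[str, int, Optional[int]]
ALL_ACTIONS: List[Action] = list(product(WORKFLOWS, BUDGETS, SECOND_AGENT_BUDGETS))


def is_valid(action: Action) -> bool:
    """Single-agent workflows must leave the second-agent budget unset, and vice versa."""
    workflow, _, second_budget = action
    if workflow in SINGLE_AGENT:
        return second_budget is None
    return second_budget is not None


def masked_softmax(logits: List[float], valid: List[bool]) -> List[float]:
    """Softmax that assigns exactly zero probability to masked-out actions."""
    if not any(valid):
        raise ValueError("at least one action must be valid")
    peak = max(l for l, ok in zip(logits, valid) if ok)   # for numerical stability
    exps = [math.exp(l - peak) if ok else 0.0 for l, ok in zip(logits, valid)]
    total = sum(exps)
    return [e / total for e in exps]


MASK = [is_valid(a) for a in ALL_ACTIONS]
```

In this toy space the mask keeps 8 of 18 combinations, so the policy never spends exploration on configurations that cannot be executed.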
Original abstract

Configuring LLM-based agent systems involves choosing workflows, tools, token budgets, and prompts from a large combinatorial design space, and is typically handled today by fixed templates or hand-tuned heuristics that apply the same configuration regardless of query difficulty, leading to brittle behavior and wasted compute. To address this, we formulate agent configuration as a semi-Markov decision process (SMDP) where each configuration acts as a temporally extended option that determines how an agent system processes a query, and introduce introduce ARC (Agentic Resource & Configuration learner), a lightweight hierarchical policy that dynamically selects query-specific agent configurations. Across reasoning, tool-use, and agentic benchmarks, ARC consistently improves over budget-matched tool-augmented LLMs, increasing average reasoning accuracy by 31.3%, tool-use accuracy by 13.95%, and doubling τ-Bench (Airline) Pass^1 success from 9.0% to 18.0%. These results demonstrate that learning per-query agent configurations is a powerful alternative to "one size fits all" designs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript formulates agent configuration as a semi-Markov decision process (SMDP) in which each choice of workflow, tools, token budget, and prompt is a temporally extended option, and introduces ARC, a lightweight hierarchical policy that learns to select query-specific configurations. It reports that ARC improves over budget-matched tool-augmented LLM baselines, raising average reasoning accuracy by 31.3 %, tool-use accuracy by 13.95 %, and doubling τ-Bench (Airline) Pass^1 success from 9.0 % to 18.0 %.

Significance. If the empirical gains are shown to arise from genuine per-query adaptation rather than selection of a strong fixed configuration, the work would demonstrate a practical alternative to static heuristics in combinatorial agent design spaces and could influence how resource allocation is handled in deployed LLM agents.

major comments (2)
  1. [Abstract] Abstract: the reported numeric improvements (31.3 % reasoning, 13.95 % tool-use, doubling Pass^1 from 9.0 % to 18.0 %) are presented without any description of the training procedure, the cardinality of the discretized option set, the number of training queries, the specific RL algorithm for the hierarchical policy, run-to-run variance, or statistical tests; these omissions prevent assessment of whether the gains reflect dynamic adaptation or a strong fixed configuration.
  2. [SMDP formulation and ARC] SMDP formulation and ARC description: the central claim rests on the assumption that the space of agent configurations can be discretized into a manageable set of options whose values are learnable by the hierarchical policy without prohibitive sample complexity or overfitting to the training query distribution; the manuscript provides no explicit size of the option set, number of training examples, or empirical checks for generalization versus overfitting.
minor comments (1)
  1. [Abstract] Abstract contains a duplicated word: 'introduce introduce ARC'.
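
The run-to-run variance and statistical tests requested in the first major comment could take several forms. One common option, shown here purely as an illustration and not as anything the paper reports, is a paired bootstrap over per-query correctness for two systems evaluated on the same queries. The function name, the default resample count, and the percentile-interval choice are assumptions.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(a: List[int], b: List[int], n_resamples: int = 2000,
                        alpha: float = 0.05, seed: int = 0) -> Tuple[float, float, float]:
    """Percentile bootstrap interval for the mean accuracy difference a - b.

    a, b: per-query 0/1 correctness of two systems on the same queries.
    Returns (observed difference, lower bound, upper bound).
    """
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty outcome lists")
    rng = random.Random(seed)
    n = len(a)
    diffs = [x - y for x, y in zip(a, b)]          # pairing keeps query difficulty fixed
    observed = sum(diffs) / n
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]   # resample queries with replacement
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return observed, lower, upper
```

If the resulting interval for a learned policy versus the strongest fixed configuration excludes zero, that would speak to the referee's concern about whether the gains reflect per-query adaptation; variance across training seeds would need a separate analysis.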

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, agreeing that additional details on training and the option space are needed for clarity. Revisions will be made to incorporate these elements without altering the core claims.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the reported numeric improvements (31.3 % reasoning, 13.95 % tool-use, doubling Pass^1 from 9.0 % to 18.0 %) are presented without any description of the training procedure, the cardinality of the discretized option set, the number of training queries, the specific RL algorithm for the hierarchical policy, run-to-run variance, or statistical tests; these omissions prevent assessment of whether the gains reflect dynamic adaptation or a strong fixed configuration.

    Authors: We agree that the abstract lacks sufficient context on the experimental setup. In the revised manuscript we will expand the abstract with a concise description of the training procedure, the cardinality of the discretized option set, the number of training queries, the specific RL algorithm used for the hierarchical policy, run-to-run variance across seeds, and the statistical tests performed. Corresponding details will also be added to the main text and a new appendix to demonstrate that gains arise from per-query adaptation rather than a fixed configuration.

    revision: yes

  2. Referee: [SMDP formulation and ARC] SMDP formulation and ARC description: the central claim rests on the assumption that the space of agent configurations can be discretized into a manageable set of options whose values are learnable by the hierarchical policy without prohibitive sample complexity or overfitting to the training query distribution; the manuscript provides no explicit size of the option set, number of training examples, or empirical checks for generalization versus overfitting.

    Authors: We acknowledge that the manuscript would benefit from more explicit statements on these points. The revised version will state the size of the option set, the number of training examples, and include empirical checks such as performance on held-out queries to show generalization and address potential overfitting. These additions will strengthen the justification for the SMDP formulation and the learnability of the hierarchical policy.

    revision: yes

Circularity Check

0 steps flagged

No circularity: empirical gains measured on held-out benchmarks against external baselines

full rationale

The paper formulates agent configuration as an SMDP and introduces the ARC hierarchical policy as a method to select query-specific configurations. Reported improvements (31.3% reasoning accuracy, 13.95% tool-use accuracy, doubling of τ-Bench success) are obtained by direct comparison to budget-matched tool-augmented LLMs on standard benchmarks. No equations, fitted parameters, or self-citations are shown that reduce these gains to quantities defined or optimized inside the same experiment; the evaluation uses held-out test distributions and external baselines, keeping the derivation chain self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on standard RL modeling assumptions plus the empirical claim that the learned policy generalizes; no new physical entities or ad-hoc constants are introduced.

free parameters (1)
  • Policy hyperparameters and option discretization granularity
    Training of the hierarchical policy requires choices for learning rate, discount factor, and how the configuration space is turned into discrete options.
axioms (1)
  • [domain assumption] Agent configuration choices can be represented as temporally extended options in a semi-Markov decision process.
    This modeling step is stated at the start of the method and is required for the hierarchical policy to be applicable.

pith-pipeline@v0.9.0 · 5709 in / 1315 out tokens · 37475 ms · 2026-05-22T11:25:35.192508+00:00 · methodology
