Learning to Configure Agentic AI Systems

Aditya Taparia; Ransalu Senanayake; Som Sagar

arxiv: 2602.11574 · v3 · pith:LDADS4PWnew · submitted 2026-02-12 · 💻 cs.AI

Learning to Configure Agentic AI Systems

Aditya Taparia , Som Sagar , Ransalu Senanayake This is my paper

Pith reviewed 2026-05-22 11:25 UTC · model grok-4.3

classification 💻 cs.AI

keywords agent configurationhierarchical policysemi-Markov decision processLLM agentstool usereasoning benchmarksadaptive configurationquery-specific selection

0 comments

The pith

Dynamically learning per-query agent configurations improves LLM accuracy on reasoning and tool-use tasks over fixed templates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that current LLM-based agent systems rely on fixed templates or hand-tuned heuristics for choosing workflows, tools, budgets, and prompts, which leads to brittle results and wasted compute because the same setup is applied regardless of query difficulty. It reformulates configuration selection as a semi-Markov decision process in which each full configuration acts as a temporally extended option, then introduces a lightweight hierarchical policy that learns to pick the right option for each incoming query. If this holds, agents would match their resource use and strategy to the specific demands of the query instead of over- or under-configuring, producing higher success rates without raising the overall compute budget. A sympathetic reader would care because this turns a manual design problem into a learnable one that could make tool-using and reasoning agents more reliable across varied inputs.

Core claim

ARC formulates agent configuration as a semi-Markov decision process where each possible combination of workflow, tools, token budget, and prompt serves as a temporally extended option, then trains a lightweight hierarchical policy to select the option best suited to the current query. Across reasoning, tool-use, and agentic benchmarks this learned selection raises average reasoning accuracy by 31.3 percent, tool-use accuracy by 13.95 percent, and doubles Pass^1 success on the tau-Bench Airline task from 9.0 percent to 18.0 percent relative to budget-matched tool-augmented baselines. The results establish that replacing one-size-fits-all designs with query-specific learned configurations is,

What carries the argument

ARC, the lightweight hierarchical policy that selects query-specific agent configurations treated as temporally extended options inside a semi-Markov decision process.

If this is right

ARC raises average reasoning accuracy by 31.3 percent while staying within the same compute budget as the baselines.
Tool-use accuracy increases by 13.95 percent through query-specific choice of workflows and tools.
Pass^1 success on the tau-Bench Airline task doubles from 9.0 percent to 18.0 percent.
Per-query adaptation replaces hand-tuned heuristics with a policy that matches configuration effort to query difficulty.
The approach demonstrates that treating configurations as learnable options in an SMDP can outperform static agent designs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the discretization into options remains tractable, the same hierarchical selection idea could be applied to other combinatorial choices such as prompt ensembles or model routing.
Deployed systems might automatically lower token budgets on easy queries and raise them only when needed, producing measurable savings at scale.
A direct follow-up experiment would test whether a policy trained on one underlying language model transfers to a different model without retraining.

Load-bearing premise

The space of agent configurations can be discretized into a manageable collection of reusable options whose values can be learned by a hierarchical policy without prohibitive sample complexity or overfitting to the training queries.

What would settle it

Training ARC on one collection of queries and then evaluating it on a fresh collection drawn from a noticeably different distribution; if the accuracy and success-rate gains over fixed configurations largely disappear, the claim that the learned policy generalizes usefully would be falsified.

Figures

Figures reproduced from arXiv: 2602.11574 by Aditya Taparia, Ransalu Senanayake, Som Sagar.

**Figure 1.** Figure 1: (a) Shows how our method learns to configure optimal configuration across thousands of possibilities for the given input. (b) Shows improvement by our method over multiple datasets. (These results are for Qwen 2.5 7B Instruct model.) performance degrades in long contexts due to the “lost-inthe-middle” phenomenon, where models fail to attend to relevant information (Liu et al., 2024; Hong et al., 2025). Se… view at source ↗

**Figure 2.** Figure 2: Training pipeline. The structure policy selects workflows, tools, and budgets while the prompt policy composes instructions. During RL training, episodes are stored in a memory buffer. After RL converges, high-reward episodes are filtered and used for supervised fine-tuning (SFT), which consolidates successful strategies and improves consistency. optimization tractable, we decompose π into two levels in a … view at source ↗

**Figure 3.** Figure 3: Action masking reduces the effective actionsequence within the RL policy. a single reasoning agent, yet Astruct permits allocation of tools and budgets to an additional agent dimension. To avoid such wasteful exploration of such infeasible configurations, we employ action masking to prune invalid action combinations, reducing |Astruct| to 41,904 valid configurations (a 32.6% reduction, [PITH_FULL_I… view at source ↗

**Figure 5.** Figure 5: Scaling trends of model accuracy with capacity. Accuracy as a function of model size for the Qwen 2.5 family (7B, 32B, 72B) across four benchmarks. Performance improves consistently with scale, with gains varying by task complexity. ods, achieving high accuracy at lower cost. This indicates that instance-specific adaptation yields more efficient accuracy–cost trade-offs than uniform resource allocation o… view at source ↗

**Figure 6.** Figure 6: Error distribution across benchmarks. Reasoning tasks exhibit primarily reasoning errors, while tool-use tasks are dominated by knowledge gap errors. Policy configuration errors remain minimal (<10%) across all datasets. for tractably navigating the combinatorial design space. 4.6. Error Analysis We categorize errors across all benchmarks into four types: (1) policy configuration errors, where the learned … view at source ↗

**Figure 7.** Figure 7: Overview of the nine agentic workflows: Direct (0), Reason+Ans (1), Reason+Verify+Ans (2), Routing (3), Parallel-Sectioning (4), Parallel-Voting (5), Orchestrator-Workers (6), Evaluator-Optimizer (7), and Autonomous-Agent (8). Each workflow defines a distinct pattern of LLM calls and agent interactions. Our framework supports nine agentic workflows, ranging from single-call baselines to multi-agent orchest… view at source ↗

**Figure 8.** Figure 8: Overall score vs. runtime vs. multimodality with the Pareto frontier under a three-dimensional dominance criterion (runtime ↓, score ↑, multimodality ↑). Circles denote text-only models and triangles denote multimodal models. The selected multimodal model (MetaCLIP-H14) is highlighted. Clustering Quality (ARI). We evaluate how well embeddings group semantically similar questions using Adjusted Rand Index… view at source ↗

**Figure 9.** Figure 9: Training dynamics of ARC across datasets. Left: cumulative reward over episodes, showing steady improvement as the policy discovers higher-value configurations on GSM8K, DROP, MedQA, HotpotQA, and GAIA. Middle: rolling mean ± standard deviation of per-episode reward, indicating reduced variance and stabilization over time. Right: running validation accuracy, demonstrating that reward gains translate into i… view at source ↗

**Figure 10.** Figure 10: Tool usage during training. Running average number of tools used per episode for each dataset. ARC quickly learns sparse tool usage and gradually adjusts invocation patterns, with different steady-state levels reflecting task-specific reliance on tools [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗

**Figure 11.** Figure 11: Evolution of workflow selection during training. Stacked area plots show, for GSM8K, HotpotQA, and GAIA, the fraction of episodes assigned to each workflow as training progresses. The structure policy quickly prunes suboptimal patterns and concentrates mass on a small set of task-appropriate workflows (e.g., Evaluator–Optimizer on GSM8K, Orchestrator–Workers on HotpotQA). rather than indiscriminately call… view at source ↗

**Figure 12.** Figure 12: Accuracy by workflow and dataset. Each bar shows the average accuracy of a fixed workflow on a given benchmark. Performance varies substantially across workflows and tasks no single workflow is uniformly optimal—highlighting the importance of learning query-adaptive configurations rather than relying on a fixed architecture [PITH_FULL_IMAGE:figures/full_fig_p018_12.png] view at source ↗

**Figure 13.** Figure 13: Reward distribution by workflow. Violin plots show the distribution of per-episode rewards for each workflow across datasets. Higher-performing workflows exhibit both higher central reward and tighter spread, illustrating that certain structural patterns not only achieve better returns but also yield more stable behavior during training. variance. On GSM8K, GRPO achieved 81.2% accuracy after 2,000 episode… view at source ↗

read the original abstract

Configuring LLM-based agent systems involves choosing workflows, tools, token budgets, and prompts from a large combinatorial design space, and is typically handled today by fixed templates or hand-tuned heuristics that apply the same configuration regardless of query difficulty, leading to brittle behavior and wasted compute. To address this, we formulate agent configuration as a semi-Markov decision process (SMDP) where each configuration acts as a temporally extended option that determines how an agent system processes a query, and introduce introduce ARC (Agentic Resource & Configuration learner), a lightweight hierarchical policy that dynamically selects query-specific agent configurations. Across reasoning, tool-use, and agentic benchmarks, ARC consistently improves over budget-matched tool-augmented LLMs, increasing average reasoning accuracy by 31.3%, tool-use accuracy by 13.95%, and doubling {\tau}-Bench (Airline) Pass^1 success from 9.0% to 18.0%. These results demonstrate that learning per-query agent configurations is a powerful alternative to "one size fits all" designs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ARC gets solid benchmark lifts by treating agent config as an SMDP and learning a hierarchical policy to pick per-query setups, but the writeup leaves the option space, training details, and generalization checks too thin to tell real adaptation from a strong fixed choice.

read the letter

The core move is to model choosing workflows, tools, token budgets, and prompts as temporally extended options in a semi-Markov decision process, then train a lightweight hierarchical policy that picks the right one for each incoming query. That framing is new in this exact combination, and the reported results show consistent gains over budget-matched baselines: 31.3% higher average reasoning accuracy, 13.95% better tool-use accuracy, and a doubling of Pass^1 success on tau-Bench Airline from 9% to 18%. Those deltas are large enough to notice in mixed workloads where some queries are easy and others are not.

Referee Report

2 major / 1 minor

Summary. The manuscript formulates agent configuration as a semi-Markov decision process (SMDP) in which each choice of workflow, tools, token budget, and prompt is a temporally extended option, and introduces ARC, a lightweight hierarchical policy that learns to select query-specific configurations. It reports that ARC improves over budget-matched tool-augmented LLM baselines, raising average reasoning accuracy by 31.3 %, tool-use accuracy by 13.95 %, and doubling τ-Bench (Airline) Pass^1 success from 9.0 % to 18.0 %.

Significance. If the empirical gains are shown to arise from genuine per-query adaptation rather than selection of a strong fixed configuration, the work would demonstrate a practical alternative to static heuristics in combinatorial agent design spaces and could influence how resource allocation is handled in deployed LLM agents.

major comments (2)

[Abstract] Abstract: the reported numeric improvements (31.3 % reasoning, 13.95 % tool-use, doubling Pass^1 from 9.0 % to 18.0 %) are presented without any description of the training procedure, the cardinality of the discretized option set, the number of training queries, the specific RL algorithm for the hierarchical policy, run-to-run variance, or statistical tests; these omissions prevent assessment of whether the gains reflect dynamic adaptation or a strong fixed configuration.
[SMDP formulation and ARC] SMDP formulation and ARC description: the central claim rests on the assumption that the space of agent configurations can be discretized into a manageable set of options whose values are learnable by the hierarchical policy without prohibitive sample complexity or overfitting to the training query distribution; the manuscript provides no explicit size of the option set, number of training examples, or empirical checks for generalization versus overfitting.

minor comments (1)

[Abstract] Abstract contains a duplicated word: 'introduce introduce ARC'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, agreeing that additional details on training and the option space are needed for clarity. Revisions will be made to incorporate these elements without altering the core claims.

read point-by-point responses

Referee: [Abstract] Abstract: the reported numeric improvements (31.3 % reasoning, 13.95 % tool-use, doubling Pass^1 from 9.0 % to 18.0 %) are presented without any description of the training procedure, the cardinality of the discretized option set, the number of training queries, the specific RL algorithm for the hierarchical policy, run-to-run variance, or statistical tests; these omissions prevent assessment of whether the gains reflect dynamic adaptation or a strong fixed configuration.

Authors: We agree that the abstract lacks sufficient context on the experimental setup. In the revised manuscript we will expand the abstract with a concise description of the training procedure, the cardinality of the discretized option set, the number of training queries, the specific RL algorithm used for the hierarchical policy, run-to-run variance across seeds, and the statistical tests performed. Corresponding details will also be added to the main text and a new appendix to demonstrate that gains arise from per-query adaptation rather than a fixed configuration. revision: yes
Referee: [SMDP formulation and ARC] SMDP formulation and ARC description: the central claim rests on the assumption that the space of agent configurations can be discretized into a manageable set of options whose values are learnable by the hierarchical policy without prohibitive sample complexity or overfitting to the training query distribution; the manuscript provides no explicit size of the option set, number of training examples, or empirical checks for generalization versus overfitting.

Authors: We acknowledge that the manuscript would benefit from more explicit statements on these points. The revised version will state the size of the option set, the number of training examples, and include empirical checks such as performance on held-out queries to show generalization and address potential overfitting. These additions will strengthen the justification for the SMDP formulation and the learnability of the hierarchical policy. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical gains measured on held-out benchmarks against external baselines

full rationale

The paper formulates agent configuration as an SMDP and introduces the ARC hierarchical policy as a method to select query-specific configurations. Reported improvements (31.3% reasoning accuracy, 13.95% tool-use accuracy, doubling of τ-Bench success) are obtained by direct comparison to budget-matched tool-augmented LLMs on standard benchmarks. No equations, fitted parameters, or self-citations are shown that reduce these gains to quantities defined or optimized inside the same experiment; the evaluation uses held-out test distributions and external baselines, keeping the derivation chain self-contained.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The approach rests on standard RL modeling assumptions plus the empirical claim that the learned policy generalizes; no new physical entities or ad-hoc constants are introduced.

free parameters (1)

Policy hyperparameters and option discretization granularity
Training of the hierarchical policy requires choices for learning rate, discount factor, and how the configuration space is turned into discrete options.

axioms (1)

domain assumption Agent configuration choices can be represented as temporally extended options in a semi-Markov decision process.
This modeling step is stated at the start of the method and is required for the hierarchical policy to be applicable.

pith-pipeline@v0.9.0 · 5709 in / 1315 out tokens · 37475 ms · 2026-05-22T11:25:35.192508+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We formulate agent configuration as a semi-Markov decision process (SMDP) where each configuration acts as a temporally extended option... ARC (Agentic Resource & Configuration learner), a lightweight hierarchical policy that dynamically selects query-specific agent configurations.
IndisputableMonolith/Foundation/BranchSelection.lean branch_selection unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Reward R = α·I[correct] − βs·nsteps − βt·ntokens/Tmax + η·Rtool

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · 12 internal anchors

[1]

GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

Agrawal, L. A., Tan, S., Soylu, D., Ziems, N., Khare, R., Opsahl-Ong, K., Singhvi, A., Shandilya, H., Ryan, M. J., Jiang, M., et al. Gepa: Reflective prompt evolution can outperform reinforcement learning.arXiv preprint arXiv:2507.19457,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Chase, H

Accessed: 2025-01-14. Chase, H. Langchain,

work page 2025
[3]

FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

URL https:// github.com/langchain-ai/langchain. Ac- cessed: 2025-01-12. Chen, L., Zaharia, M., and Zou, J. Frugalgpt: How to use large language models while reducing cost and improving performance.arXiv preprint arXiv:2305.05176,

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Training Verifiers to Solve Math Word Problems

Cobbe, K., Kosaraju, V ., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs

Dua, D., Wang, Y ., Dasigi, P., Stanovsky, G., Singh, S., and Gardner, M. Drop: A reading comprehension bench- mark requiring discrete reasoning over paragraphs.arXiv preprint arXiv:1903.00161,

work page internal anchor Pith review Pith/arXiv arXiv 1903
[7]

Token-budget-aware llm reasoning

Han, T., Wang, Z., Fang, C., Zhao, S., Ma, S., and Chen, Z. Token-budget-aware llm reasoning. InFindings of the Association for Computational Linguistics: ACL 2025, pp. 24842–24855,

work page 2025
[8]

Llmlingua: Compressing prompts for accelerated inference of large language models

URL https: //research.trychroma.com/context-rot. Jiang, H., Wu, Q., Lin, C.-Y ., Yang, Y ., and Qiu, L. Llmlin- gua: Compressing prompts for accelerated inference of large language models.arXiv preprint arXiv:2310.05736,

work page arXiv
[9]

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

Khattab, O., Singhvi, A., Maheshwari, P., Zhang, Z., San- thanam, K., Vardhamanan, S., Haq, S., Sharma, A., Joshi, T. T., Moazam, H., et al. Dspy: Compiling declarative language model calls into self-improving pipelines.arXiv preprint arXiv:2310.03714,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

AgentBench: Evaluating LLMs as Agents

Liu, X., Yu, H., Zhang, H., Xu, Y ., Lei, X., Lai, H., Gu, Y ., Ding, H., Men, K., Yang, K., et al. Agentbench: Evalu- ating llms as agents.arXiv preprint arXiv:2308.03688,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

H., Constant, N., Ma, J., Hall, K., Cer, D., and Yang, Y

Ni, J., Abrego, G. H., Constant, N., Ma, J., Hall, K., Cer, D., and Yang, Y . Sentence-t5: Scalable sentence encoders from pre-trained text-to-text models. InFindings of the association for computational linguistics: ACL 2022, pp. 1864–1874,

work page 2022
[12]

Y ., Yuan, W., Cho, K., He, H., Sukhbaatar, S., and Weston, J

Pang, R. Y ., Yuan, W., Cho, K., He, H., Sukhbaatar, S., and Weston, J. Iterative reasoning preference optimization. In arXiv preprint arXiv:2404.19733,

work page arXiv
[13]

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

9 Learning to Configure Agentic AI Systems Qin, Y ., Liang, S., Ye, Y ., Zhu, K., Yan, L., Lu, Y ., Lin, Y ., Cong, X., Tang, X., Qian, B., et al. Toolllm: Facilitating large language models to master 16000+ real-world apis. arXiv preprint arXiv:2307.16789,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Proximal Policy Optimization Algorithms

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. In arXiv preprint arXiv:1707.06347,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Text Embeddings by Weakly-Supervised Contrastive Pre-training

Wang, L., Yang, N., Huang, X., Jiao, B., Yang, L., Jiang, D., Majumder, R., and Wei, F. Text embeddings by weakly-supervised contrastive pre-training.arXiv preprint arXiv:2212.03533,

work page internal anchor Pith review Pith/arXiv arXiv
[16]

Qwen2.5 Technical Report

Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., Lin, H., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J., Dang, K., Lu, K., Bao, K., Yang, K., Yu, L., Li, M., Xue, M., Zhang, P., Zhu, Q., Men, R., Lin, R., Li, T., Xia, T., Ren, X., Ren, X., Fan, Y ., Su, Y ., Zhang, Y ., Wan, Y ., Liu, Y ....

work page internal anchor Pith review Pith/arXiv arXiv
[17]

E., Wettig, A., Lieret, K., Yao, S., Narasimhan, K., and Press, O

Yang, J., Jimenez, C. E., Wettig, A., Lieret, K., Yao, S., Narasimhan, K., and Press, O. Swe-agent: Agent- computer interfaces enable automated software engineer- ing.Advances in Neural Information Processing Systems, 37:50528–50652, 2024b. Yang, Z., Qi, P., Zhang, S., Bengio, Y ., Cohen, W., Salakhut- dinov, R., and Manning, C. D. Hotpotqa: A dataset for...

work page 2018
[18]

WebArena: A Realistic Web Environment for Building Autonomous Agents

Zhou, S., Xu, F. F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., Cheng, X., Ou, T., Bisk, Y ., Fried, D., et al. Webarena: A realistic web environment for building autonomous agents.arXiv preprint arXiv:2307.13854,

work page internal anchor Pith review Pith/arXiv arXiv
[19]

inventing

(21) =τ.(22) 13 Learning to Configure Agentic AI Systems C.3.1. DISCUSSION: WHYTHESEGUARANTEESMATTER Support Restrictionensures that the refined policy only proposes configurations that were successful during training. This prevents the policy from “inventing” novel, untested configurations at deployment time, which could lead to unpredictable behavior. R...

work page 2022
[20]

Decompose the problem

Embedder Mode ARI Cls. Acc Complexity Decision Overall Time (s) sentence-t5-base (768D) native 0.5603± 0.1019 0.8733± 0.0048 0.9261± 0.0099 0.7221± 0.0070 0.7704± 0.0305 38.39± 1.24 sentence-t5-base (768D) projected 0.4867± 0.1017 0.8635± 0.0084 0.9189± 0.0109 0.7138± 0.0018 0.7457± 0.0229 37.56± 0.89 MetaCLIP-H14 (1024D) native 0.4074± 0.0525 0.8514± 0.0...

work page 2000

[1] [1]

GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

Agrawal, L. A., Tan, S., Soylu, D., Ziems, N., Khare, R., Opsahl-Ong, K., Singhvi, A., Shandilya, H., Ryan, M. J., Jiang, M., et al. Gepa: Reflective prompt evolution can outperform reinforcement learning.arXiv preprint arXiv:2507.19457,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Chase, H

Accessed: 2025-01-14. Chase, H. Langchain,

work page 2025

[3] [3]

FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

URL https:// github.com/langchain-ai/langchain. Ac- cessed: 2025-01-12. Chen, L., Zaharia, M., and Zou, J. Frugalgpt: How to use large language models while reducing cost and improving performance.arXiv preprint arXiv:2305.05176,

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

Training Verifiers to Solve Math Word Problems

Cobbe, K., Kosaraju, V ., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs

Dua, D., Wang, Y ., Dasigi, P., Stanovsky, G., Singh, S., and Gardner, M. Drop: A reading comprehension bench- mark requiring discrete reasoning over paragraphs.arXiv preprint arXiv:1903.00161,

work page internal anchor Pith review Pith/arXiv arXiv 1903

[7] [7]

Token-budget-aware llm reasoning

Han, T., Wang, Z., Fang, C., Zhao, S., Ma, S., and Chen, Z. Token-budget-aware llm reasoning. InFindings of the Association for Computational Linguistics: ACL 2025, pp. 24842–24855,

work page 2025

[8] [8]

Llmlingua: Compressing prompts for accelerated inference of large language models

URL https: //research.trychroma.com/context-rot. Jiang, H., Wu, Q., Lin, C.-Y ., Yang, Y ., and Qiu, L. Llmlin- gua: Compressing prompts for accelerated inference of large language models.arXiv preprint arXiv:2310.05736,

work page arXiv

[9] [9]

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

Khattab, O., Singhvi, A., Maheshwari, P., Zhang, Z., San- thanam, K., Vardhamanan, S., Haq, S., Sharma, A., Joshi, T. T., Moazam, H., et al. Dspy: Compiling declarative language model calls into self-improving pipelines.arXiv preprint arXiv:2310.03714,

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

AgentBench: Evaluating LLMs as Agents

Liu, X., Yu, H., Zhang, H., Xu, Y ., Lei, X., Lai, H., Gu, Y ., Ding, H., Men, K., Yang, K., et al. Agentbench: Evalu- ating llms as agents.arXiv preprint arXiv:2308.03688,

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

H., Constant, N., Ma, J., Hall, K., Cer, D., and Yang, Y

Ni, J., Abrego, G. H., Constant, N., Ma, J., Hall, K., Cer, D., and Yang, Y . Sentence-t5: Scalable sentence encoders from pre-trained text-to-text models. InFindings of the association for computational linguistics: ACL 2022, pp. 1864–1874,

work page 2022

[12] [12]

Y ., Yuan, W., Cho, K., He, H., Sukhbaatar, S., and Weston, J

Pang, R. Y ., Yuan, W., Cho, K., He, H., Sukhbaatar, S., and Weston, J. Iterative reasoning preference optimization. In arXiv preprint arXiv:2404.19733,

work page arXiv

[13] [13]

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

9 Learning to Configure Agentic AI Systems Qin, Y ., Liang, S., Ye, Y ., Zhu, K., Yan, L., Lu, Y ., Lin, Y ., Cong, X., Tang, X., Qian, B., et al. Toolllm: Facilitating large language models to master 16000+ real-world apis. arXiv preprint arXiv:2307.16789,

work page internal anchor Pith review Pith/arXiv arXiv

[14] [14]

Proximal Policy Optimization Algorithms

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. In arXiv preprint arXiv:1707.06347,

work page internal anchor Pith review Pith/arXiv arXiv

[15] [15]

Text Embeddings by Weakly-Supervised Contrastive Pre-training

Wang, L., Yang, N., Huang, X., Jiao, B., Yang, L., Jiang, D., Majumder, R., and Wei, F. Text embeddings by weakly-supervised contrastive pre-training.arXiv preprint arXiv:2212.03533,

work page internal anchor Pith review Pith/arXiv arXiv

[16] [16]

Qwen2.5 Technical Report

Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., Lin, H., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J., Dang, K., Lu, K., Bao, K., Yang, K., Yu, L., Li, M., Xue, M., Zhang, P., Zhu, Q., Men, R., Lin, R., Li, T., Xia, T., Ren, X., Ren, X., Fan, Y ., Su, Y ., Zhang, Y ., Wan, Y ., Liu, Y ....

work page internal anchor Pith review Pith/arXiv arXiv

[17] [17]

E., Wettig, A., Lieret, K., Yao, S., Narasimhan, K., and Press, O

Yang, J., Jimenez, C. E., Wettig, A., Lieret, K., Yao, S., Narasimhan, K., and Press, O. Swe-agent: Agent- computer interfaces enable automated software engineer- ing.Advances in Neural Information Processing Systems, 37:50528–50652, 2024b. Yang, Z., Qi, P., Zhang, S., Bengio, Y ., Cohen, W., Salakhut- dinov, R., and Manning, C. D. Hotpotqa: A dataset for...

work page 2018

[18] [18]

WebArena: A Realistic Web Environment for Building Autonomous Agents

Zhou, S., Xu, F. F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., Cheng, X., Ou, T., Bisk, Y ., Fried, D., et al. Webarena: A realistic web environment for building autonomous agents.arXiv preprint arXiv:2307.13854,

work page internal anchor Pith review Pith/arXiv arXiv

[19] [19]

inventing

(21) =τ.(22) 13 Learning to Configure Agentic AI Systems C.3.1. DISCUSSION: WHYTHESEGUARANTEESMATTER Support Restrictionensures that the refined policy only proposes configurations that were successful during training. This prevents the policy from “inventing” novel, untested configurations at deployment time, which could lead to unpredictable behavior. R...

work page 2022

[20] [20]

Decompose the problem

Embedder Mode ARI Cls. Acc Complexity Decision Overall Time (s) sentence-t5-base (768D) native 0.5603± 0.1019 0.8733± 0.0048 0.9261± 0.0099 0.7221± 0.0070 0.7704± 0.0305 38.39± 1.24 sentence-t5-base (768D) projected 0.4867± 0.1017 0.8635± 0.0084 0.9189± 0.0109 0.7138± 0.0018 0.7457± 0.0229 37.56± 0.89 MetaCLIP-H14 (1024D) native 0.4074± 0.0525 0.8514± 0.0...

work page 2000