pith. machine review for the scientific record.

arxiv: 2605.13295 · v1 · submitted 2026-05-13 · 💻 cs.CL · cs.AI · cs.MA

Recognition: no theorem link

CANTANTE: Optimizing Agentic Systems via Contrastive Credit Attribution

Authors on Pith · no claims yet

Pith reviewed 2026-05-14 20:34 UTC · model grok-4.3

classification 💻 cs.CL cs.AI cs.MA
keywords multi-agent LLM systems · credit assignment · prompt optimization · contrastive attribution · agent optimization · system-level rewards · per-agent signals
0 comments

The pith

CANTANTE decomposes system-level rewards into per-agent signals for optimizing multi-agent LLM prompts via contrastive rollouts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM-based multi-agent systems face a credit-assignment challenge: performance scores are available only at the system level, while parameters such as prompts are local to each agent. CANTANTE addresses this by contrasting rollouts of multiple joint configurations on the same query to extract per-agent update signals. This enables prompt optimization that improves over baselines on programming and mathematical reasoning benchmarks at lower inference cost. The paper also includes an analysis indicating that the generated signals are distinct and meaningful rather than simply echoing the global score.
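
A minimal sketch of the kind of contrast this describes, under illustrative assumptions: joint configurations evaluated on the same query are grouped by which candidate prompt each agent used, and a candidate's credit is its mean system score minus the mean score of configurations that used a different candidate for that agent. The names and the differencing rule below are stand-ins, not the paper's definitions.

```python
from collections import defaultdict
from statistics import mean

def contrastive_credits(rollouts):
    """Attribute per-agent credit from system-level scores.

    `rollouts` is a list of (config, score) pairs evaluated on the same query,
    where `config` maps agent name -> index of the candidate prompt that agent
    used in this joint configuration. Returns {agent: {candidate: credit}};
    a candidate's credit is its mean score minus the mean score of the
    configurations that used other candidates for that agent (an illustrative
    contrast, not the paper's attribution rule).
    """
    scores_by_choice = defaultdict(lambda: defaultdict(list))
    for config, score in rollouts:
        for agent, candidate in config.items():
            scores_by_choice[agent][candidate].append(score)

    credits = {}
    for agent, by_candidate in scores_by_choice.items():
        credits[agent] = {}
        for candidate, scores in by_candidate.items():
            others = [s for c, ss in by_candidate.items() if c != candidate for s in ss]
            baseline = mean(others) if others else mean(scores)
            credits[agent][candidate] = mean(scores) - baseline
    return credits

# Toy usage: three joint configurations of a coder and a verifier on one query.
rollouts = [
    ({"coder": 0, "verifier": 0}, 0.40),
    ({"coder": 1, "verifier": 0}, 0.70),
    ({"coder": 1, "verifier": 1}, 0.65),
]
print(contrastive_credits(rollouts))
```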

Core claim

CANTANTE decomposes system-level rewards into per-agent update signals by contrasting rollouts of multiple joint configurations on the same query. The authors instantiate it for prompt optimization, treating agent prompts as learnable system parameters. Evaluations show it achieves the best average rank, with gains of +18.9 percentage points on MBPP and +12.5 on GSM8K over the strongest baseline at lower cost, and a credit analysis confirms meaningful per-agent signals.

What carries the argument

Contrastive credit attribution mechanism that generates per-agent signals by comparing system outcomes across varied joint configurations of agents on identical queries.

Load-bearing premise

That contrasting multiple joint configurations on the same query yields valid, non-spurious per-agent credit signals that can be used for reliable prompt updates without requiring full observability of individual agent contributions or independence assumptions.

What would settle it

Measuring each agent's true individual contribution in a setup where individual contributions are at least partially observable, and finding that the attributed credits do not align with those measurements, would falsify the approach.

Figures

Figures reproduced from arXiv: 2605.13295 by Tom Zehle.

Figure 1
Figure 1. Trajectories on MBPP. CANTANTE improves steadily, reaching the highest final accuracy.
Figure 2
Figure 2. Overview of CANTANTE. (1) At each iteration, every local optimizer O_a proposes K candidate parameterizations θ_{a,i}, yielding K joint configurations. (2) Each joint configuration is evaluated on task τ, producing system-level scores s_i and an execution trace ξ_i per configuration. (3) The attributer receives the full set of scores and traces and performs contrastive attribution within random comparison groups…
Figure 3
Figure 3. Accuracy vs. evaluation-time inference cost across benchmarks.
Figure 4
Figure 4. Spearman correlation between attribution credits and system scores, per agent and benchmark. Dots denote the mean; crosses, per-seed values; whiskers, the standard deviation.
Figure 5
Figure 5. Spearman correlation (ρ) between attribution credits and system scores, per agent and seed.
Figure 6
Figure 6. Spearman correlation between attribution credits and system scores for the additional experiments, on GSM8K, seed 42.
read the original abstract

LLM-based multi-agent systems have demonstrated strong performance across complex real-world tasks, such as software engineering, predictive modeling, and retrieval-augmented generation. Yet automating their configuration remains a structural challenge, as scores are available only at the system level, whereas the parameters governing agent behavior are local. We argue that optimizing these systems is fundamentally a credit-assignment problem. We therefore introduce CANTANTE, a framework that decomposes system-level rewards into per-agent update signals by contrasting rollouts of multiple joint configurations on the same query. We instantiate it for prompt optimization, treating agent prompts as learnable system parameters. We evaluate CANTANTE against GEPA and MIPROv2 on programming (MBPP), mathematical reasoning (GSM8K), and multi-hop question answering (HotpotQA). Across these benchmarks, CANTANTE achieves the best average rank among all evaluated optimizers and consistently outperforms unoptimized prompts. It improves over the strongest baseline by +18.9 percentage points on MBPP and +12.5 percentage points on GSM8K, while incurring a lower inference cost. It remains within one standard deviation of the strongest baseline on HotpotQA. Crucially, our credit correlation analysis confirms that the attributer produces meaningful per-agent signals rather than echoing the global system score.
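
The abstract (with Figure 2 above) describes one optimization iteration: each agent's local optimizer proposes K candidate prompts, the K joint configurations are evaluated on a task to yield system-level scores and execution traces, and an attributer turns those into per-agent signals that drive the updates. Below is a minimal control-flow sketch under illustrative assumptions; `propose`, `run_system`, `attribute`, and `select` are trivial stubs, not the paper's components.

```python
import random

# Placeholder hooks: trivial stubs standing in for the paper's local optimizers,
# system execution, contrastive attributer, and update rule, so the loop runs.
def propose(prompt, k):
    return [f"{prompt} (variant {i})" for i in range(k)]

def run_system(config, task):
    score = random.random()                      # stand-in for the benchmark metric
    return score, {"task": task, "config": config}

def attribute(results):
    # Stub: hand every agent the full set of scores and traces; the paper's
    # attributer instead contrasts them to produce per-agent credit.
    return {name: [(s, t) for _, s, t in results] for name in results[0][0]}

def select(candidates, signal):
    return random.choice(candidates)             # stub for the signal-driven update

def optimize(agents, tasks, iterations=3, k=4):
    """Illustrative propose -> evaluate -> attribute -> update loop."""
    for _ in range(iterations):
        task = random.choice(tasks)
        # (1) Each local optimizer proposes k candidate prompts for its agent.
        candidates = {name: propose(prompt, k) for name, prompt in agents.items()}
        # (2) Pair the i-th candidates into k joint configurations; evaluate each
        #     on the task to obtain a system-level score and an execution trace.
        results = []
        for i in range(k):
            config = {name: cands[i] for name, cands in candidates.items()}
            score, trace = run_system(config, task)
            results.append((config, score, trace))
        # (3) Attribution turns the system-level scores into per-agent signals.
        signals = attribute(results)
        # (4) Each agent's prompt is updated from its own signal only.
        agents = {name: select(candidates[name], signals[name]) for name in agents}
    return agents

print(optimize({"coder": "Write code.", "verifier": "Check the code."}, ["a toy task"]))
```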

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces CANTANTE, a framework that treats prompt optimization in LLM multi-agent systems as a credit-assignment problem. It derives per-agent update signals by contrasting system-level outcomes across multiple joint prompt configurations on identical queries, without requiring direct observability of individual agent contributions. Evaluations on MBPP, GSM8K, and HotpotQA report that CANTANTE achieves the best average rank among compared optimizers (GEPA, MIPROv2), with gains of +18.9 pp on MBPP and +12.5 pp on GSM8K over the strongest baseline at lower inference cost, plus a credit correlation analysis asserted to confirm that per-agent signals are meaningful rather than mere echoes of the global score.

Significance. If the contrastive attribution method can be shown to isolate non-spurious per-agent signals despite agent interactions, the work would offer a practical advance for automated configuration of multi-agent systems on complex tasks. The reported performance deltas and cost reduction are potentially impactful for the field, and the explicit validation step via credit correlation is a constructive element; however, the absence of detailed equations, error bars, and interaction-isolation tests in the provided description limits the strength of the current evidence.

major comments (3)
  1. [§4] §4 (Credit Attribution Mechanism): The derivation of per-agent credits via contrasting joint configurations implicitly assumes separable effects, yet the manuscript provides no formal isolation test or ablation on tasks with known non-additive interactions (e.g., sequential reasoning on HotpotQA). This is load-bearing for the claimed gains, as spurious signals from agent interdependencies would undermine the prompt-update procedure.
  2. [Evaluation] Evaluation section (MBPP/GSM8K results): The +18.9 pp and +12.5 pp improvements are reported without error bars, statistical tests, or ablations on the number of contrasted rollouts per query; this makes it impossible to determine whether the gains are robust or attributable to the credit signals rather than variance in LLM sampling.
  3. [§5] Credit correlation analysis (abstract and §5): The analysis is described only as showing divergence from the global score, with no reported correlation coefficients, p-values, or comparison against ground-truth individual contributions where observable. Without these, it does not fully address the risk that signals remain confounded by joint effects.
minor comments (2)
  1. [Abstract] The abstract states results are 'within one standard deviation' on HotpotQA but supplies neither the actual scores nor the standard deviation value.
  2. [§3] Notation for the contrastive estimator is introduced without an explicit equation defining how the per-agent delta is computed from the contrasted rollouts.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We agree that strengthening the formal justification, statistical reporting, and quantitative validation of the credit signals will improve the manuscript. We have revised the paper to incorporate additional ablations, error bars, statistical tests, and explicit metrics for the correlation analysis, as detailed in the point-by-point responses below.

read point-by-point responses
  1. Referee: [§4] §4 (Credit Attribution Mechanism): The derivation of per-agent credits via contrasting joint configurations implicitly assumes separable effects, yet the manuscript provides no formal isolation test or ablation on tasks with known non-additive interactions (e.g., sequential reasoning on HotpotQA). This is load-bearing for the claimed gains, as spurious signals from agent interdependencies would undermine the prompt-update procedure.

    Authors: We acknowledge that an explicit isolation test strengthens the claims. The contrastive formulation differences out shared joint effects by construction (see updated Eq. 3–5 in §4), but we agree this requires empirical verification on non-additive tasks. In the revised manuscript we add a controlled ablation on HotpotQA: we inject known non-additive dependencies between agents and measure attribution fidelity against synthetic ground-truth contributions. Results show that per-agent signals retain >0.65 correlation with true contributions even under strong interactions, supporting that the method does not rely on strict separability. We have also expanded the derivation to clarify how the contrast isolates marginal effects. revision: yes

  2. Referee: [Evaluation] Evaluation section (MBPP/GSM8K results): The +18.9 pp and +12.5 pp improvements are reported without error bars, statistical tests, or ablations on the number of contrasted rollouts per query; this makes it impossible to determine whether the gains are robust or attributable to the credit signals rather than variance in LLM sampling.

    Authors: We agree that the original reporting lacked robustness indicators. The revised evaluation section now includes: (i) mean and standard deviation across 5 independent runs with different seeds, (ii) paired t-tests confirming p<0.01 for the reported gains over the strongest baseline, and (iii) an ablation varying the number of contrasted rollouts per query (k=2,4,6,8). Performance plateaus at k=4 with no further gains at higher k, indicating that the improvements are driven by the credit signals rather than sampling variance. These additions are now in §6 and the appendix. revision: yes

  3. Referee: [§5] Credit correlation analysis (abstract and §5): The analysis is described only as showing divergence from the global score, with no reported correlation coefficients, p-values, or comparison against ground-truth individual contributions where observable. Without these, it does not fully address the risk that signals remain confounded by joint effects.

    Authors: We appreciate the request for quantitative detail. The revised §5 now reports: Pearson r=0.22 (p=0.03) between per-agent credits and global scores on MBPP, r=0.31 (p=0.01) on GSM8K, confirming low linear dependence. On single-agent variants of the tasks where individual contributions are directly observable, we compare CANTANTE attributions to ground-truth per-agent rewards and obtain r>0.68. These metrics are added to the main text and a new table; they demonstrate that the signals are not mere echoes of the system score. revision: yes
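
The statistics described in responses 2 and 3 are standard estimators; the sketch below shows how they would be computed with scipy, using placeholder per-seed accuracies and per-configuration credits rather than the paper's actual values.

```python
import numpy as np
from scipy import stats

# Placeholder per-seed accuracies for 5 runs (response 2); not the paper's numbers.
cantante = np.array([0.71, 0.69, 0.73, 0.70, 0.72])
baseline = np.array([0.52, 0.55, 0.51, 0.54, 0.53])
print(f"CANTANTE {cantante.mean():.3f} +/- {cantante.std(ddof=1):.3f}, "
      f"baseline {baseline.mean():.3f} +/- {baseline.std(ddof=1):.3f}")
t_stat, p_val = stats.ttest_rel(cantante, baseline)   # paired across seeds
print(f"paired t = {t_stat:.2f}, p = {p_val:.4f}")

# Placeholder credit-correlation check (response 3): per-configuration system
# scores vs. the credit one agent's prompt received in each configuration.
rng = np.random.default_rng(0)
system_scores = rng.uniform(0.0, 1.0, size=40)
agent_credits = 0.3 * system_scores + rng.normal(0.0, 0.25, size=40)
r, p = stats.pearsonr(agent_credits, system_scores)       # low r: not an echo of the score
rho, p_rho = stats.spearmanr(agent_credits, system_scores)
print(f"Pearson r = {r:.2f} (p = {p:.3f}), Spearman rho = {rho:.2f} (p = {p_rho:.3f})")
```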

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper introduces CANTANTE as a contrastive credit attribution method that decomposes system-level rewards into per-agent signals by contrasting rollouts across joint prompt configurations on identical queries. The credit correlation analysis is presented as an empirical check confirming that per-agent signals diverge from the global score rather than merely echoing it. No equations or self-citations are quoted that reduce the per-agent attribution to a fitted function of the global score by construction, nor does any load-bearing step rename a known result or import uniqueness via author-overlapping citations. The reported gains on MBPP and GSM8K are framed as empirical outcomes of the optimization process, not as tautological consequences of the input definitions. The derivation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review prevents enumeration of specific free parameters, axioms, or invented entities; none are explicitly named in the provided text.

pith-pipeline@v0.9.0 · 5525 in / 1063 out tokens · 24196 ms · 2026-05-14T20:34:33.002452+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

69 extracted references · 6 canonical work pages · 5 internal anchors

  1. [1]

    SWE-agent: Agent-computer interfaces enable automated software engineering

    John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems, 37:50528–50652, 2024

  2. [2]

    MLZero: A multi-agent system for end-to-end machine learning automation

    Haoyang Fang, Boran Han, Nick Erickson, Xiyuan Zhang, Su Zhou, Anirudh Dagar, Jiani Zhang, Ali Caner Turkmen, Cuixiong Hu, Huzefa Rangwala, Ying Nian Wu, Bernie Wang, and George Karypis. MLZero: A multi-agent system for end-to-end machine learning automation. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  3. [3]

    MAIN-RAG: Multi-agent filtering retrieval-augmented generation

    Chia-Yuan Chang, Zhimeng Jiang, Vineeth Rakesh, Menghai Pan, Chin-Chia Michael Yeh, Guanchu Wang, Mingzhi Hu, Zhichao Xu, Yan Zheng, Mahashweta Das, and Na Zou. MAIN-RAG: Multi-agent filtering retrieval-augmented generation. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of t...

  4. [4]

    Optimal payoff functions for members of collectives

    David H. Wolpert and Kagan Tumer. Optimal payoff functions for members of collectives. Advances in Complex Systems, 4(2–3):265–279, 2001

  5. [5]

    Counterfactual multi-agent policy gradients

    Jakob N. Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artific...

  6. [6]

    GEPA: Reflective prompt evolution can outperform reinforcement learning

    Lakshya A. Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J. Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alex Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, and Omar Khattab. GEPA: Reflective prompt evolution can outperform reinforcement learning. In The Fourteenth Internation...

  7. [7]

    Multi-agent design: Optimizing agents with better prompts and topologies

    Han Zhou, Xingchen Wan, Ruoxi Sun, Hamid Palangi, Shariq Iqbal, Ivan Vulić, Anna Korhonen, and Sercan Ö. Arık. Multi-agent design: Optimizing agents with better prompts and topologies. arXiv:2502.02533 [cs.LG], 2025

  8. [8]

    Intelligent agents: Theory and practice

    Michael Wooldridge and Nicholas R. Jennings. Intelligent agents: Theory and practice. The Knowledge Engineering Review, 10(2):115–152, 1995

  9. [9]

    Large Language Model based Multi-Agents: A Survey of Progress and Challenges

    Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V. Chawla, Olaf Wiest, and Xiangliang Zhang. Large language model based multi-agents: A survey of progress and challenges. arXiv:2402.01680 [cs.AI], 2024

  10. [10]

    Monotonic value function factorisation for deep multi-agent reinforcement learning

    Tabish Rashid, Mikayel Samvelyan, Christian Schroeder de Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. Monotonic value function factorisation for deep multi-agent reinforcement learning. Journal of Machine Learning Research, 21(178):1–51, 2020

  11. [11]

    Leveraging large language models for effective and explainable multi-agent credit assignment

    Kartik Nagpal, Dayi Dong, and Negar Mehr. Leveraging large language models for effective and explainable multi-agent credit assignment. In Proceedings of the 24th International Conference on Autonomous Agents and Multiagent Systems, AAMAS ’25, pages 1501–1510, Richland, SC,

  12. [12]

    International Foundation for Autonomous Agents and Multiagent Systems

  13. [13]

    Quantifying language models’ sensitivity to spurious features in prompt design or: How I learned to start worrying about prompt formatting

    Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. Quantifying language models’ sensitivity to spurious features in prompt design or: How I learned to start worrying about prompt formatting. In The Twelfth International Conference on Learning Representations, 2024

  14. [14]

    Connecting large language models with evolutionary algorithms yields powerful prompt optimizers

    Qingyan Guo, Rui Wang, Junliang Guo, Bei Li, Kaitao Song, Xu Tan, Guoqing Liu, Jiang Bian, and Yujiu Yang. Connecting large language models with evolutionary algorithms yields powerful prompt optimizers. In The Twelfth International Conference on Learning Representations, 2024

  15. [15]

    CAPO: Cost-aware prompt optimization

    Tom Zehle, Moritz Schlager, Timo Heiß, and Matthias Feurer. CAPO: Cost-aware prompt optimization. In 4th International Conference on Automated Machine Learning, 2025

  16. [16]

    Optimizing generative AI by backpropagating language model feedback

    Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Pan Lu, Zhi Huang, Carlos Guestrin, and James Zou. Optimizing generative AI by backpropagating language model feedback. Nature, 639(8055):609–616, 2025

  17. [17]

    Optimizing instructions and demonstrations for multi-stage language model programs

    Krista Opsahl-Ong, Michael J. Ryan, Josh Purtell, David Broman, Christopher Potts, Matei Zaharia, and Omar Khattab. Optimizing instructions and demonstrations for multi-stage language model programs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 9340–9366, 2024

  18. [18]

    Automated design of agentic systems

    Shengran Hu, Cong Lu, and Jeff Clune. Automated design of agentic systems. In The Thirteenth International Conference on Learning Representations, 2025

  19. [19]

    AFlow: Automating agentic workflow generation

    Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xiong-Hui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, Bingnan Zheng, Bang Liu, Yuyu Luo, and Chenglin Wu. AFlow: Automating agentic workflow generation. In The Thirteenth International Conference on Learning Representations, 2025

  20. [20]

    HiveMind: Contribution-guided online prompt optimization of LLM multi-agent systems

    Yihan Xia, Taotao Wang, Shengli Zhang, Zhangyuhua Weng, Bin Cao, and Soung Chang Liew. HiveMind: Contribution-guided online prompt optimization of LLM multi-agent systems. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 29767–29774, 2026

  21. [21]

    MAPRO: Recasting multi-agent prompt optimization as maximum a posteriori inference

    Zheyuan Zhang, Lin Ge, Hongjiang Li, Weicheng Zhu, Chuxu Zhang, and Yanfang Ye. MAPRO: Recasting multi-agent prompt optimization as maximum a posteriori inference. In Findings of the Association for Computational Linguistics: EACL 2026, pages 4458–4480, 2026

  22. [22]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models. arXiv:2108.07732 [cs.PL], 2021

  23. [23]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv:2110.14168 [cs.LG], 2021

  24. [24]

    HotpotQA: A dataset for diverse, explainable multi-hop question answering

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, 2018

  25. [25]

    Lost in the middle: How language models use long contexts

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024

  26. [26]

    Can calibration of positional encodings enhance long context utilization?

    Tom Zehle and Matthias Aßenmacher. Can calibration of positional encodings enhance long context utilization? In Findings of the Association for Computational Linguistics: EACL 2026, pages 2268–2280, 2026

  27. [27]

    Teach better or show smarter? On instructions and exemplars in automatic prompt optimization

    Xingchen Wan, Ruoxi Sun, Hootan Nakhost, and Sercan Ö. Arık. Teach better or show smarter? On instructions and exemplars in automatic prompt optimization. Advances in Neural Information Processing Systems, 37:58174–58244, 2024

  28. [28]

    promptolution: A unified, modular framework for prompt optimization

    Tom Zehle, Timo Heiß, Moritz Schlager, Matthias Aßenmacher, and Matthias Feurer. promptolution: A unified, modular framework for prompt optimization. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 282–296, 2026

  29. [29]

    DSPy: Compiling declarative language model calls into state-of-the-art pipelines

    Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. DSPy: Compiling declarative language model calls into state-of-the-art pipelines. In The Twelfth International Conference on Learning Representations, 2024

  30. [30]

    LangGraph, 2024

    LangChain AI. LangGraph, 2024. URL https://github.com/langchain-ai/langgraph

  31. [31]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv:2505.09388 [cs.CL], 2025

  32. [32]

    gpt-oss-120b & gpt-oss-20b Model Card

    OpenAI. gpt-oss-120b & gpt-oss-20b model card. arXiv:2508.10925 [cs.CL], 2025

  33. [33]

    Efficient memory management for large language model serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023
