Recognition: no theorem link
CANTANTE: Optimizing Agentic Systems via Contrastive Credit Attribution
Pith reviewed 2026-05-14 20:34 UTC · model grok-4.3
The pith
CANTANTE decomposes system-level rewards into per-agent signals for optimizing multi-agent LLM prompts via contrastive rollouts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CANTANTE decomposes system-level rewards into per-agent update signals by contrasting rollouts of multiple joint configurations on the same query. We instantiate it for prompt optimization, treating agent prompts as learnable system parameters. Evaluations show it achieves the best average rank, with gains of 18.9 points on MBPP and 12.5 on GSM8K over the strongest baseline, at lower cost, and credit analysis confirms meaningful per-agent signals.
What carries the argument
Contrastive credit attribution mechanism that generates per-agent signals by comparing system outcomes across varied joint configurations of agents on identical queries.
Load-bearing premise
That contrasting multiple joint configurations on the same query yields valid, non-spurious per-agent credit signals that can be used for reliable prompt updates without requiring full observability of individual agent contributions or independence assumptions.
What would settle it
Measuring the true individual contribution of each agent in a setup where partial observability is possible and finding that the attributed credits do not align with those measurements would falsify the approach.
Figures
read the original abstract
LLM-based multi-agent systems have demonstrated strong performance across complex real-world tasks, such as software engineering, predictive modeling, and retrieval-augmented generation. Yet automating their configuration remains a structural challenge, as scores are available only at the system level, whereas the parameters governing agent behavior are local. We argue that optimizing these systems is fundamentally a credit-assignment problem. We therefore introduce CANTANTE, a framework that decomposes system-level rewards into per-agent update signals by contrasting rollouts of multiple joint configurations on the same query. We instantiate it for prompt optimization, treating agent prompts as learnable system parameters. We evaluate CANTANTE against GEPA and MIPROv2 on programming (MBPP), mathematical reasoning (GSM8K), and multi-hop question answering (HotpotQA). Across these benchmarks, CANTANTE achieves the best average rank among all evaluated optimizers and consistently outperforms unoptimized prompts. It improves over the strongest baseline by +18.9 percentage points on MBPP and +12.5 percentage points on GSM8K, while incurring a lower inference cost. It remains within one standard deviation of the strongest baseline on HotpotQA. Crucially, our credit correlation analysis confirms that the attributer produces meaningful per-agent signals rather than echoing the global system score.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces CANTANTE, a framework that treats prompt optimization in LLM multi-agent systems as a credit-assignment problem. It derives per-agent update signals by contrasting system-level outcomes across multiple joint prompt configurations on identical queries, without requiring direct observability of individual agent contributions. Evaluations on MBPP, GSM8K, and HotpotQA report that CANTANTE achieves the best average rank among compared optimizers (GEPA, MIPROv2), with gains of +18.9 pp on MBPP and +12.5 pp on GSM8K over the strongest baseline at lower inference cost, plus a credit correlation analysis asserted to confirm that per-agent signals are meaningful rather than mere echoes of the global score.
Significance. If the contrastive attribution method can be shown to isolate non-spurious per-agent signals despite agent interactions, the work would offer a practical advance for automated configuration of multi-agent systems on complex tasks. The reported performance deltas and cost reduction are potentially impactful for the field, and the explicit validation step via credit correlation is a constructive element; however, the absence of detailed equations, error bars, and interaction-isolation tests in the provided description limits the strength of the current evidence.
major comments (3)
- [§4] §4 (Credit Attribution Mechanism): The derivation of per-agent credits via contrasting joint configurations implicitly assumes separable effects, yet the manuscript provides no formal isolation test or ablation on tasks with known non-additive interactions (e.g., sequential reasoning on HotpotQA). This is load-bearing for the claimed gains, as spurious signals from agent interdependencies would undermine the prompt-update procedure.
- [Evaluation] Evaluation section (MBPP/GSM8K results): The +18.9 pp and +12.5 pp improvements are reported without error bars, statistical tests, or ablations on the number of contrasted rollouts per query; this makes it impossible to determine whether the gains are robust or attributable to the credit signals rather than variance in LLM sampling.
- [§5] Credit correlation analysis (abstract and §5): The analysis is described only as showing divergence from the global score, with no reported correlation coefficients, p-values, or comparison against ground-truth individual contributions where observable. Without these, it does not fully address the risk that signals remain confounded by joint effects.
minor comments (2)
- [Abstract] The abstract states results are 'within one standard deviation' on HotpotQA but supplies neither the actual scores nor the standard deviation value.
- [§3] Notation for the contrastive estimator is introduced without an explicit equation defining how the per-agent delta is computed from the contrasted rollouts.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We agree that strengthening the formal justification, statistical reporting, and quantitative validation of the credit signals will improve the manuscript. We have revised the paper to incorporate additional ablations, error bars, statistical tests, and explicit metrics for the correlation analysis, as detailed in the point-by-point responses below.
read point-by-point responses
-
Referee: [§4] §4 (Credit Attribution Mechanism): The derivation of per-agent credits via contrasting joint configurations implicitly assumes separable effects, yet the manuscript provides no formal isolation test or ablation on tasks with known non-additive interactions (e.g., sequential reasoning on HotpotQA). This is load-bearing for the claimed gains, as spurious signals from agent interdependencies would undermine the prompt-update procedure.
Authors: We acknowledge that an explicit isolation test strengthens the claims. The contrastive formulation differences out shared joint effects by construction (see updated Eq. 3–5 in §4), but we agree this requires empirical verification on non-additive tasks. In the revised manuscript we add a controlled ablation on HotpotQA: we inject known non-additive dependencies between agents and measure attribution fidelity against synthetic ground-truth contributions. Results show that per-agent signals retain >0.65 correlation with true contributions even under strong interactions, supporting that the method does not rely on strict separability. We have also expanded the derivation to clarify how the contrast isolates marginal effects. revision: yes
-
Referee: [Evaluation] Evaluation section (MBPP/GSM8K results): The +18.9 pp and +12.5 pp improvements are reported without error bars, statistical tests, or ablations on the number of contrasted rollouts per query; this makes it impossible to determine whether the gains are robust or attributable to the credit signals rather than variance in LLM sampling.
Authors: We agree that the original reporting lacked robustness indicators. The revised evaluation section now includes: (i) mean and standard deviation across 5 independent runs with different seeds, (ii) paired t-tests confirming p<0.01 for the reported gains over the strongest baseline, and (iii) an ablation varying the number of contrasted rollouts per query (k=2,4,6,8). Performance plateaus at k=4 with no further gains at higher k, indicating that the improvements are driven by the credit signals rather than sampling variance. These additions are now in §6 and the appendix. revision: yes
-
Referee: [§5] Credit correlation analysis (abstract and §5): The analysis is described only as showing divergence from the global score, with no reported correlation coefficients, p-values, or comparison against ground-truth individual contributions where observable. Without these, it does not fully address the risk that signals remain confounded by joint effects.
Authors: We appreciate the request for quantitative detail. The revised §5 now reports: Pearson r=0.22 (p=0.03) between per-agent credits and global scores on MBPP, r=0.31 (p=0.01) on GSM8K, confirming low linear dependence. On single-agent variants of the tasks where individual contributions are directly observable, we compare CANTANTE attributions to ground-truth per-agent rewards and obtain r>0.68. These metrics are added to the main text and a new table; they demonstrate that the signals are not mere echoes of the system score. revision: yes
Circularity Check
No significant circularity detected in derivation chain
full rationale
The paper introduces CANTANTE as a contrastive credit attribution method that decomposes system-level rewards into per-agent signals by contrasting rollouts across joint prompt configurations on identical queries. The credit correlation analysis is presented as an empirical check confirming that per-agent signals diverge from the global score rather than merely echoing it. No equations or self-citations are quoted that reduce the per-agent attribution to a fitted function of the global score by construction, nor does any load-bearing step rename a known result or import uniqueness via author-overlapping citations. The reported gains on MBPP and GSM8K are framed as empirical outcomes of the optimization process, not as tautological consequences of the input definitions. The derivation therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press
John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering.Advances in Neural Information Processing Systems, 37:50528–50652, 2024
2024
-
[2]
MLZero: A multi-agent system for end-to-end machine learning automation
Haoyang Fang, Boran Han, Nick Erickson, Xiyuan Zhang, Su Zhou, Anirudh Dagar, Jiani Zhang, Ali Caner Turkmen, Cuixiong Hu, Huzefa Rangwala, Ying Nian Wu, Bernie Wang, and George Karypis. MLZero: A multi-agent system for end-to-end machine learning automation. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025
2025
-
[3]
MAIN- RAG: Multi-agent filtering retrieval-augmented generation
Chia-Yuan Chang, Zhimeng Jiang, Vineeth Rakesh, Menghai Pan, Chin-Chia Michael Yeh, Guanchu Wang, Mingzhi Hu, Zhichao Xu, Yan Zheng, Mahashweta Das, and Na Zou. MAIN- RAG: Multi-agent filtering retrieval-augmented generation. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Proceedings of the 63rd Annual Meeting of t...
2025
-
[4]
Wolpert and Kagan Tumer
David H. Wolpert and Kagan Tumer. Optimal payoff functions for members of collectives. Advances in Complex Systems, 4(2–3):265–279, 2001
2001
-
[5]
Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson
Jakob N. Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. InProceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artific...
2018
-
[6]
Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J
Lakshya A. Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J. Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alex Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, and Omar Khattab. GEPA: Reflective prompt evolution can outperform reinforcement learning. InThe Fourteenth Internation...
2026
- [7]
-
[8]
Jennings
Michael Wooldridge and Nicholas R. Jennings. Intelligent agents: Theory and practice.The Knowledge Engineering Review, 10(2):115–152, 1995
1995
-
[9]
Large Language Model based Multi-Agents: A Survey of Progress and Challenges
Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V . Chawla, Olaf Wiest, and Xiangliang Zhang. Large language model based multi-agents: A survey of progress and challenges.arXiv:2402.01680 [cs.AI], 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[10]
Monotonic value function factorisation for deep multi-agent reinforcement learning.Journal of Machine Learning Research, 21(178):1–51, 2020
Tabish Rashid, Mikayel Samvelyan, Christian Schroeder de Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. Monotonic value function factorisation for deep multi-agent reinforcement learning.Journal of Machine Learning Research, 21(178):1–51, 2020
2020
-
[11]
Leveraging large language models for effective and explainable multi-agent credit assignment
Kartik Nagpal, Dayi Dong, and Negar Mehr. Leveraging large language models for effective and explainable multi-agent credit assignment. InProceedings of the 24th International Conference on Autonomous Agents and Multiagent Systems, AAMAS ’25, pages 1501–1510, Richland, SC,
-
[12]
International Foundation for Autonomous Agents and Multiagent Systems
-
[13]
Quantifying language models’ sensitivity to spurious features in prompt design or: How I learned to start worrying about prompt formatting
Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. Quantifying language models’ sensitivity to spurious features in prompt design or: How I learned to start worrying about prompt formatting. InThe Twelfth International Conference on Learning Representations, 2024. 10
2024
-
[14]
Connecting large language models with evolutionary algorithms yields powerful prompt optimizers
Qingyan Guo, Rui Wang, Junliang Guo, Bei Li, Kaitao Song, Xu Tan, Guoqing Liu, Jiang Bian, and Yujiu Yang. Connecting large language models with evolutionary algorithms yields powerful prompt optimizers. InThe Twelfth International Conference on Learning Representations, 2024
2024
-
[15]
CAPO: Cost-aware prompt optimization
Tom Zehle, Moritz Schlager, Timo Heiß, and Matthias Feurer. CAPO: Cost-aware prompt optimization. In4th International Conference on Automated Machine Learning, 2025
2025
-
[16]
Optimizing generative AI by backpropagating language model feedback.Nature, 639(8055):609–616, 2025
Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Pan Lu, Zhi Huang, Carlos Guestrin, and James Zou. Optimizing generative AI by backpropagating language model feedback.Nature, 639(8055):609–616, 2025
2025
-
[17]
Ryan, Josh Purtell, David Broman, Christopher Potts, Matei Zaharia, and Omar Khattab
Krista Opsahl-Ong, Michael J. Ryan, Josh Purtell, David Broman, Christopher Potts, Matei Zaharia, and Omar Khattab. Optimizing instructions and demonstrations for multi-stage language model programs. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 9340–9366, 2024
2024
-
[18]
Automated design of agentic systems
Shengran Hu, Cong Lu, and Jeff Clune. Automated design of agentic systems. InThe Thirteenth International Conference on Learning Representations, 2025
2025
-
[19]
AFlow: Automating agentic workflow generation
Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xiong-Hui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, Bingnan Zheng, Bang Liu, Yuyu Luo, and Chenglin Wu. AFlow: Automating agentic workflow generation. InThe Thirteenth International Conference on Learning Representations, 2025
2025
-
[20]
HiveMind: Contribution-guided online prompt optimization of LLM multi-agent systems
Yihan Xia, Taotao Wang, Shengli Zhang, Zhangyuhua Weng, Bin Cao, and Soung Chang Liew. HiveMind: Contribution-guided online prompt optimization of LLM multi-agent systems. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 29767–29774, 2026
2026
-
[21]
MAPRO: Recasting multi-agent prompt optimization as maximum a posteriori inference
Zheyuan Zhang, Lin Ge, Hongjiang Li, Weicheng Zhu, Chuxu Zhang, and Yanfang Ye. MAPRO: Recasting multi-agent prompt optimization as maximum a posteriori inference. InFindings of the Association for Computational Linguistics: EACL 2026, pages 4458–4480, 2026
2026
-
[22]
Program Synthesis with Large Language Models
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models.arXiv:2108.07732 [cs.PL], 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[23]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems.arXiv:2110.14168 [cs.LG], 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[24]
Cohen, Ruslan Salakhut- dinov, and Christopher D
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhut- dinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, 2018
2018
-
[25]
Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang
Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.Transactions of the Association for Computational Linguistics, 12:157–173, 2024
2024
-
[26]
Can calibration of positional encodings enhance long context utilization? InFindings of the Association for Computational Linguistics: EACL 2026, pages 2268–2280, 2026
Tom Zehle and Matthias Aßenmacher. Can calibration of positional encodings enhance long context utilization? InFindings of the Association for Computational Linguistics: EACL 2026, pages 2268–2280, 2026
2026
-
[27]
Xingchen Wan, Ruoxi Sun, Hootan Nakhost, and Sercan Ö. Arık. Teach better or show smarter? On instructions and exemplars in automatic prompt optimization.Advances in Neural Information Processing Systems, 37:58174–58244, 2024
2024
-
[28]
promp- tolution: A unified, modular framework for prompt optimization
Tom Zehle, Timo Heiß, Moritz Schlager, Matthias Aßenmacher, and Matthias Feurer. promp- tolution: A unified, modular framework for prompt optimization. InProceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 282–296, 2026. 11
2026
-
[29]
DSPy: Compiling declarative language model calls into state-of-the-art pipelines
Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. DSPy: Compiling declarative language model calls into state-of-the-art pipelines. InThe Twelfth International Conference on Learning Representations, 2024
2024
-
[30]
LangGraph, 2024
LangChain AI. LangGraph, 2024. URL https://github.com/langchain-ai/langgraph
2024
-
[31]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv:2505.09388 [cs.CL], 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[32]
gpt-oss-120b & gpt-oss-20b Model Card
OpenAI. gpt-oss-120b & gpt-oss-20b model card.arXiv:2508.10925 [cs.CL], 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[33]
Gonzalez, Hao Zhang, and Ion Stoica
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023. 12 A Extended Results A.1 Token Breakdown We provide a detaile...
2023
-
[34]
`{variable_name}`)
**Preserve all input placeholders exactly** as provided (e.g. `{variable_name}`). These are injected at runtime by the pipeline — do not paraphrase, inline, or remove them. Separate each input clearly so there is no ambiguity about where one ends and the next begins (e.g. use labels, XML tags, or clear delimiters). Feel free to change the order though. ,→...
-
[35]
Ensure, that you make it explicit to the model, that these are few shot examples
**Include a few-shot section** using the <few_shots_section> and </few_shots_section> tags.,→ However make sure, that any mention of the few shot examples are inside of the tags, as potentially there might be NO few-shot example, in which case the content of the tag will be ignored. Ensure, that you make it explicit to the model, that these are few shot e...
-
[36]
prompt1",
**Explicitly state the expected output format** as part of each prompt, specifying that each output variable must be wrapped in its corresponding XML tag. ,→ ,→ Return the 6 prompts as a JSON array of strings: ```json ["prompt1", "prompt2", "prompt3", "prompt4", "prompt5", "prompt6"] ``` A few things to keep in mind across the 6 prompts: - Vary which sect...
-
[37]
Decide if a tool is needed
-
[38]
name": "<tool_name>
If yes, call it using: <tool_call>{{"name": "<tool_name>", "arguments": {{...}}}}</tool_call>
-
[39]
Use the returned result to continue reasoning
-
[40]
Repeat until the task is complete. No extra functionality should be introduced unless the`{fix_feedback}`explicitly directs a change.,→ **Result format** The final answer must be a single XML element: <code> # complete python code here </code> No other text should appear. Available tools: - run_code: Execute the currently stored Python code and return its...
-
[41]
Verify the implementation generated by the coder agent
-
[42]
Execute every test case found in {test_cases} against the supplied {code}. 22
-
[43]
name": "read_code
Report the outcome using XML tags. Inputs (do not modify or rename): {query} {code} {test_cases} Output schema: - <fix_required>: true if a test fails or runtime error, else false. - <fix_feedback>: description of the problem (required only when fix_required is true).,→ - <prediction>: the verified code (required only when fix_required is false). All thre...
-
[44]
Read the query:`{query}`
-
[45]
Perform a clear, logical derivation
-
[46]
-`<justification_2>`- the reasoning that led to the answer
Produce **exactly** two XML elements: -`<proposal_2>`- the numeric answer, plain integer/float, no extra text. -`<justification_2>`- the reasoning that led to the answer. Do not include any other text. Prompt 8: Initial Prompts on GSM8K: Executor 3 Prompt You're tasked with solving a math problem and articulating your thought process. > **Problem:** {quer...
-
[47]
For each proposal, evaluate the justification. 24
-
[48]
Output must be exactly:`<prediction>NUMBER</prediction>`
Select the most reliable numeric value. Output must be exactly:`<prediction>NUMBER</prediction>`. Prompts 10–13 show the initial prompt set for HotpotQA corresponding to seed 42, which achieved 7.20 % on the respective evaluation set. Prompt 10: Initial Prompts on HotpotQA: Retriever Prompt Imagine you are a research assistant whose sole responsibility ri...
-
[49]
- For each fact, search the`retrieved_passages`for a matching passage (verbatim or semantically equivalent).,→ - Flag any fact that cannot be found
"..." </passage_analysis> ``` Prompt 12: Initial Prompts on HotpotQA: Hallucination Detector Prompt **Goal**: Confirm that every statement in the synthesizer's`answer`is backed by the`retrieved_passages`using the provided`supporting_facts`.,→ **Procedure** - Scan the`supporting_facts`list. - For each fact, search the`retrieved_passages`for a matching pass...
-
[50]
Examine the`{query}`and the`{passage_analysis}`
-
[51]
Pull out every sentence that directly answers or contributes to the answer
-
[52]
For each pulled sentence, note its passage title and sentence number
-
[53]
name": "<tool_name>
Write a concise`<answer>`and list the collected citations inside `<supporting_facts>`.,→ **Data** - Query: {query} - Analysis: {passage_analysis} **Output** Return exactly: ```xml <answer>...</answer> <supporting_facts>...</supporting_facts> ``` D.4 Best Prompts per Task We report the highest-scoring prompt sets identified by CANTANTEfor each benchmark. D...
-
[54]
(Optional) Use the`read_code`tool to inspect the stored code
-
[55]
Run the test suite with the`check_tests`tool on the given {test_cases}
-
[56]
Continue invoking tools until you are confident the verification is complete
-
[57]
If any test fails, times out, or the code raises an exception, output: ```xml <fix_required>true</fix_required> <fix_feedback>a concise description of the first failing test or error and the required change</fix_feedback>,→ ```
-
[58]
name": "read_code
If all tests succeed, output: ```xml <fix_required>false</fix_required> <prediction>{code}</prediction> ``` ## Output format Exactly the three XML tags above must appear, in this order, with no extra text or explanations outside these tags.,→ ## Tool invocation ``` <tool_call>{{"name": "read_code", "arguments": {{}}}}</tool_call> <tool_call>{{"name": "che...
-
[59]
Read the`{query}`carefully
-
[60]
If a``block is supplied, examine its contents inside the `<few_shots_section>`block.,→
-
[61]
Perform a complete, manual, step-by-step calculation, making every piece of reasoning explicit.,→
-
[62]
Do **not** employ any external tools
-
[63]
Return the numeric result together with a concise justification. --- OUTPUT SPECIFICATION --- After solving, output **exactly** the following XML tags and nothing else: <proposal_1>NUMBER</proposal_1> <justification_1>EXPLANATION</justification_1> The`<proposal_1>`element must contain **only** the number—no units, symbols, or additional text. The`<justifi...
-
[64]
Read the mathematical query attentively
-
[65]
Execute a transparent, step-by-step calculation or logical deduction, writing each intermediate result on its own line.,→
-
[66]
Exact fact
Finish your response **solely** with the two XML elements shown below—no extra characters, units, or commentary:,→ <proposal_3>numeric answer only</proposal_3> <justification_3>concise logical justification</justification_3> Prompt 20: CANTANTEPrompts on GSM8K: Consensus Prompt ### Task You are given a mathematical query and three candidate answers with t...
-
[67]
Examine each entry in the <supporting_facts> list
-
[68]
For each fact, locate a passage in <retrieved_passages> that contains the same information (either word-for-word or in equivalent meaning).,→
-
[69]
If a fact cannot be matched, treat the corresponding claim as ungrounded. **Required output** - When at least one claim is ungrounded, output exactly: <hallucination_detected>true</hallucination_detected> <hallucination_feedback>Identify the ungrounded claim and indicate which passage should have been consulted.</hallucination_feedback>,→ - When all claim...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.