FAPO: Fully Automated Prompt Optimization of Multi-Step LLM Pipelines
Pith reviewed 2026-06-26 19:48 UTC · model grok-4.3
The pith
FAPO automates inspection and scoped edits of multi-step LLM pipelines to beat prompt-only baselines.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FAPO lets Claude Code repeatedly evaluate a pipeline, inspect its intermediate results, diagnose whether a failure is prompt-level or structural, propose a scoped edit, and validate the variant against the score function; it prefers prompt edits and only escalates to structural changes when attribution shows a structural bottleneck. Across six benchmarks and three task models this procedure beats the GEPA baseline in 15 of 18 comparisons, with non-overlapping mean-plus-std-dev intervals in 11 cases and an average gain of 14.1 percentage points; on the six HoVer and IFBench runs that required structural edits the mean gain rises to 33.8 points. The same procedure also raises accuracy on the s
What carries the argument
An iterative diagnosis-and-edit loop that first attempts prompt-level changes and escalates to permitted structural changes only after attributing failure to a chain-level bottleneck.
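The loop described above can be sketched roughly as follows. This is an illustration under assumptions, not the authors' implementation: the `Pipeline` type, the `evaluate`, `diagnose`, and `propose_edit` callables, the `allow_structural` flag, and the accept-only-on-strict-improvement rule are all hypothetical names and choices introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Literal

Cause = Literal["prompt", "structural"]


@dataclass(frozen=True)
class Pipeline:
    """Hypothetical stand-in for one pipeline variant."""
    prompts: tuple[str, ...]
    steps: tuple[str, ...]


def optimize(
    pipeline: Pipeline,
    evaluate: Callable[[Pipeline], float],
    diagnose: Callable[[Pipeline], Cause],
    propose_edit: Callable[[Pipeline, Cause], Pipeline],
    rounds: int = 10,
    allow_structural: bool = True,
) -> tuple[Pipeline, float]:
    """Prompt-first search that escalates only on a structural diagnosis."""
    best, best_score = pipeline, evaluate(pipeline)
    for _ in range(rounds):
        cause = diagnose(best)
        # Stay at prompt scope unless attribution says "structural"
        # and structural edits are permitted (assumed policy).
        scope: Cause = cause if (cause == "prompt" or allow_structural) else "prompt"
        candidate = propose_edit(best, scope)
        score = evaluate(candidate)
        # Assumed acceptance rule: keep a variant only if it scores higher.
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score
```

Passing `allow_structural=False` turns the same function into a prompt-only search, which is one way to frame the comparison the load-bearing premise depends on.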
If this is right
- On tasks whose bottlenecks are prompt-level, FAPO still improves accuracy but by smaller margins than on tasks that need structural repair.
- Security-oriented pipelines such as CVE-to-CWE mapping become more accurate without manual prompt engineering.
- The method is model-agnostic in the sense that the same optimization loop works for three different task models.
- When structural changes are triggered, the performance delta over prompt-only search more than doubles.
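The "more than doubles" reading can be checked with simple arithmetic. The sketch below assumes, and the source does not state this, that the +14.1 pp figure is an unweighted mean over all 18 comparisons and that the six structural-edit runs are among them. Under that assumption the mean gain on the remaining 12 comparisons can be backed out.

```python
def implied_remaining_mean(overall_mean: float, n_total: int,
                           subset_mean: float, n_subset: int) -> float:
    """Mean of the items outside the subset, given both reported means."""
    total = overall_mean * n_total
    subset_total = subset_mean * n_subset
    return (total - subset_total) / (n_total - n_subset)


# Figures quoted in the summary: +14.1 pp over 18 runs, +33.8 pp over 6 runs.
remaining = implied_remaining_mean(14.1, 18, 33.8, 6)
ratio_vs_overall = 33.8 / 14.1

print(f"implied mean gain on the other 12 runs: {remaining:.2f} pp")
print(f"structural-run gain relative to the overall mean: {ratio_vs_overall:.2f}x")
```

If the assumption holds, the implied gain on the 12 non-escalated comparisons is about 4.25 pp, and the structural-run mean is about 2.4 times the overall mean.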
Where Pith is reading between the lines
- If the diagnosis step is reliable, the same loop could be pointed at any codebase that exposes intermediate outputs and a scalar score, not just the six benchmarks tested.
- The escalation rule implies that purely prompt-based optimizers will systematically under-perform on pipelines whose errors are architectural rather than linguistic.
- A cheaper or open-source substitute for the diagnosis LLM would let the same workflow run at lower cost on the same benchmarks.
Load-bearing premise
The LLM performing the diagnosis can correctly identify whether a failure is caused by a prompt or by pipeline structure and can propose edits that raise the target score without creating new undetected errors.
What would settle it
A controlled run on one of the six benchmarks in which every edit proposed by the optimizer either leaves the score unchanged or lowers it while the reported accuracy still rises.
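One way to make this test concrete, sketched under assumptions: suppose each optimizer run logs the validation score before and after every proposed edit. The `EditRecord` fields and the log format are hypothetical; the source does not describe what FAPO records.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class EditRecord:
    """Hypothetical log entry for one proposed edit."""
    score_before: float
    score_after: float


def no_edit_helped(log: Sequence[EditRecord]) -> bool:
    """True when every logged edit left the score unchanged or lowered it."""
    return all(r.score_after <= r.score_before for r in log)


def is_contradictory(log: Sequence[EditRecord],
                     reported_start: float, reported_end: float) -> bool:
    """True when reported accuracy rose although no logged edit raised the score."""
    return reported_end > reported_start and no_edit_helped(log)
```

A run flagged by `is_contradictory` would indicate that the reported gain came from somewhere other than the logged edits.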
Original abstract
Multi-step LLM pipelines fail through interactions among retrieval, reasoning, and formatting steps, so prompt-only optimization can miss bottlenecks in the chain. We present Fully Automated Prompt Optimization (FAPO), a framework that lets Claude Code optimize an LLM pipeline inside a standardized codebase. FAPO evaluates a pipeline, inspects intermediate steps, diagnoses failures, proposes scoped changes, and validates variants repeatedly to optimize against a score function. It first tries prompt edits and, only when prompt optimization appears insufficient, changes chain structure within the permitted scope when attribution identifies a structural bottleneck. Across six benchmarks and three task models, FAPO beats the baseline GEPA in 15 of 18 model-benchmark comparisons. In 11 model-benchmark comparisons, FAPO wins with non-overlapping mean $\pm$ trial-standard-deviation ranges, and the mean FAPO-GEPA gain is +14.1 pp. In the six HoVer and IFBench comparisons where prompt-first search escalated to structural changes, FAPO wins all six with a mean gain of +33.8 pp. FAPO also improves performance on security tasks: on CTIBench-RCM, a security CVE-to-CWE task, prompt-only FAPO lifts test accuracy by +4.0 pp on GPT-5, +7.1 pp on Foundation-Sec-8B-Instruct, and +2.0 pp on Foundation-Sec-8B-Reasoning. These results position FAPO as a state-of-the-art pipeline optimization technique for both general-purpose and security-focused tasks.
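The abstract's "non-overlapping mean ± trial-standard-deviation ranges" criterion can be illustrated with a short check. This is a sketch only: the trial scores in the usage note are made up, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, since the abstract does not say which estimator was used.

```python
from statistics import mean, stdev
from typing import Sequence


def interval(scores: Sequence[float]) -> tuple[float, float]:
    """Return (mean - std, mean + std) over per-trial scores."""
    m, s = mean(scores), stdev(scores)
    return m - s, m + s


def non_overlapping_win(method_scores: Sequence[float],
                        baseline_scores: Sequence[float]) -> bool:
    """True when the method's whole range sits above the baseline's range."""
    method_low, _ = interval(method_scores)
    _, baseline_high = interval(baseline_scores)
    return method_low > baseline_high
```

For example, with hypothetical trials `[70, 72, 71]` for the method and `[60, 62, 61]` for the baseline, the ranges are (70, 72) and (60, 62), so the check returns `True`.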
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FAPO, a framework that uses Claude Code to automatically optimize multi-step LLM pipelines. It evaluates pipelines, inspects intermediate outputs, attributes failures to prompt or structural causes, proposes scoped edits (starting with prompts and escalating to structure only when needed), and iterates against a score function. Empirical claims state that across six benchmarks and three task models, FAPO outperforms the GEPA baseline in 15 of 18 comparisons, with mean gains of +14.1 pp overall and +33.8 pp in the six cases involving structural changes; additional gains are reported on security tasks such as CTIBench-RCM.
Significance. If the empirical results and the reliability of the automated diagnosis/editing loop hold after proper controls and verification, FAPO would represent a meaningful advance in automated optimization of multi-step LLM systems by addressing interactions across retrieval, reasoning, and formatting steps that prompt-only methods miss. The integration of failure attribution with scoped structural edits within a standardized codebase is a distinctive contribution relative to prior prompt optimization work.
major comments (2)
- [Abstract] Abstract: The headline numerical claims (15/18 wins, mean +14.1 pp gain, non-overlapping mean ± trial-std ranges in 11 cases, +33.8 pp on structural-escalation cases) are presented without any reported trial counts per comparison, definition of the score function, data splits, controls for optimizer LLM stochasticity, or statistical tests. These omissions make it impossible to evaluate whether the reported superiority is robust or could be driven by variability in the Claude Code optimizer itself.
- [Abstract] Abstract (FAPO framework description): The six structural-change wins on HoVer and IFBench, which drive the largest reported gains, rest on the unverified assumption that Claude Code can reliably inspect intermediate outputs, correctly attribute failures to prompt versus structural causes, and generate scoped edits that improve the target score without introducing new undetected errors. No ablation study, inter-rater agreement metric, or independent audit of the attribution accuracy is supplied, rendering these results vulnerable to the possibility that gains arise from stochastic optimizer behavior rather than the FAPO loop.
minor comments (1)
- The manuscript should clarify whether the standardized codebase and any evaluation harness are released, as this directly affects reproducibility of the reported pipeline optimizations.
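The first major comment notes the absence of statistical tests. As one illustrative option, and not a test the paper is reported to use, a one-sided sign test on the 15-of-18 win count can be computed with the standard library. It assumes the 18 comparisons are independent and free of ties, which is doubtful because they share benchmarks and task models, so the result is only a rough reference point.

```python
from math import comb


def sign_test_p_value(wins: int, n: int) -> float:
    """One-sided P(X >= wins) for X ~ Binomial(n, 0.5)."""
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


# Win count quoted in the report: 15 of 18 comparisons.
p = sign_test_p_value(15, 18)
print(f"one-sided sign-test p-value: {p:.4f}")
```

Under those assumptions the p-value is 988/262144, or about 0.0038. A test that accounts for the per-trial variance would still be needed to address the referee's concern about optimizer stochasticity.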
Simulated Author's Rebuttal
We appreciate the referee's careful reading and constructive suggestions for improving the clarity and verifiability of our results. We address the major comments point by point below.
- Referee: [Abstract] Abstract: The headline numerical claims (15/18 wins, mean +14.1 pp gain, non-overlapping mean ± trial-std ranges in 11 cases, +33.8 pp on structural-escalation cases) are presented without any reported trial counts per comparison, definition of the score function, data splits, controls for optimizer LLM stochasticity, or statistical tests. These omissions make it impossible to evaluate whether the reported superiority is robust or could be driven by variability in the Claude Code optimizer itself.
  Authors: We agree that the abstract would benefit from additional methodological details to allow readers to better assess the robustness of the claims. We will revise the abstract to include the number of trials per comparison, a brief definition of the score function, information on data splits, controls for the optimizer LLM's stochasticity, and any statistical tests. The full details are elaborated in the experimental sections of the manuscript.
  Revision: yes
- Referee: [Abstract] Abstract (FAPO framework description): The six structural-change wins on HoVer and IFBench, which drive the largest reported gains, rest on the unverified assumption that Claude Code can reliably inspect intermediate outputs, correctly attribute failures to prompt versus structural causes, and generate scoped edits that improve the target score without introducing new undetected errors. No ablation study, inter-rater agreement metric, or independent audit of the attribution accuracy is supplied, rendering these results vulnerable to the possibility that gains arise from stochastic optimizer behavior rather than the FAPO loop.
  Authors: While the end-to-end performance gains provide indirect support for the effectiveness of the attribution and editing process, we acknowledge that direct verification through ablation or audit would strengthen the claims. We will add an ablation study isolating the impact of structural changes and include examples of the diagnosis and attribution process in the revised manuscript to address concerns about potential stochastic effects.
  Revision: yes
Circularity Check
No circularity: empirical comparisons only
The paper describes an empirical optimization framework (FAPO) that runs Claude Code to edit prompts and (when needed) structure, then reports measured accuracy gains versus the external baseline GEPA on six benchmarks. No equations, fitted parameters, first-principles derivations, or self-referential definitions appear in the provided text. Performance numbers are direct experimental outcomes, not quantities defined in terms of themselves or forced by construction. No load-bearing self-citations or uniqueness theorems are invoked. The central claims therefore remain independent of the circularity patterns enumerated in the instructions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: An LLM (Claude Code) can inspect intermediate outputs and correctly diagnose whether failures are prompt-level or structural.
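An audit of this assumption could compare the optimizer's prompt-versus-structural labels with labels from an independent reviewer, in line with the referee's request for an inter-rater agreement metric. The sketch below computes Cohen's kappa for two label sequences. The labels in the usage note are invented for illustration; the source reports no such audit.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Chance-corrected agreement between two raters on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    # Agreement expected by chance from each rater's label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

With hypothetical optimizer labels `["prompt", "prompt", "structural", "structural"]` and reviewer labels `["prompt", "structural", "structural", "structural"]`, observed agreement is 0.75, chance agreement is 0.5, and kappa is 0.5.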
invented entities (1)
- FAPO framework: no independent evidence