MOCHA: Multi-Objective Chebyshev Annealing for Agent Skill Optimization

Anlan Zhang; Branislav Kveton; Jayakumar Subramanian; Md Mehrab Tanjim; Somdeb Sarkhel; Subhojyoti Mukherjee; Sunav Choudhury; Sungchul Kim; Xiang Chen

arXiv: 2605.19330 · v1 · pith:QBZP7G3Tnew · submitted 2026-05-19 · 💻 cs.AI · cs.LG · cs.SE


Pith reviewed 2026-05-20 06:05 UTC · model grok-4.3

classification 💻 cs.AI · cs.LG · cs.SE

keywords multi-objective optimization · Chebyshev scalarization · LLM agent skills · Pareto front · annealing · prompt optimization · skill discovery · platform constraints


The pith

MOCHA optimizes LLM agent skills across conflicting platform constraints by using Chebyshev scalarization to cover the full Pareto front plus annealing to shift from exploration to exploitation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that skills for LLM agents are multi-field objects forced into trade-offs by real platform limits such as truncated descriptions, compacted instructions, and shared context windows. Standard prompt optimizers either ignore those limits or fold them into a single weighted score, so they miss good solutions in non-convex regions and often make no progress at all. MOCHA instead scalarizes the objectives with the Chebyshev metric to reach every part of the Pareto surface and applies exponential annealing to move from broad search to precise refinement. On six tasks where every method receives the same mutation operator and per-objective feedback, this approach improves mean correctness on all tasks while surfacing twice as many Pareto-optimal skill variants. The result matters because agent deployments live inside tight resource budgets, and any method that reliably finds better feasible skill sets directly raises performance without extra hardware.

Core claim

MOCHA replaces single-objective selection with Chebyshev scalarization that covers the full Pareto front, including non-convex regions, combined with exponential annealing that transitions from exploration to exploitation. Across six diverse agent skills, all methods share the identical multi-objective mutation operator and baselines receive identical per-objective textual feedback; existing optimizers fail to improve the seed skill on four of the six tasks after 1000 rollouts, while MOCHA improves on every task, with a 7.5 percent relative gain in mean correctness over the strongest baseline and twice as many Pareto-optimal variants.

What carries the argument

Chebyshev scalarization, which minimizes the maximum weighted deviation from ideal per-objective values so that non-convex parts of the Pareto front remain reachable, paired with an exponential annealing schedule that gradually tightens the search from exploration to exploitation.
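A minimal sketch of the scalarization described above may help make it concrete. It assumes every objective is maximized and normalized to [0, 1] with an ideal point of all ones; the function name `chebyshev_score`, the weights, and the candidate scores are illustrative assumptions, not values or code from the paper.

```python
from typing import Sequence


def chebyshev_score(values: Sequence[float],
                    weights: Sequence[float],
                    ideal: Sequence[float]) -> float:
    """Maximum weighted deviation from the ideal point (lower is better)."""
    if not (len(values) == len(weights) == len(ideal)):
        raise ValueError("values, weights and ideal must have equal length")
    return max(w * abs(z - v) for v, w, z in zip(values, weights, ideal))


# Hypothetical candidates scored on (correctness, compliance), both maximized.
candidates = {
    "accurate_but_long": (0.9, 0.2),
    "balanced": (0.6, 0.6),
    "short_but_weak": (0.2, 0.9),
}
ideal = (1.0, 1.0)      # assumed ideal value for each objective
weights = (0.5, 0.5)    # assumed equal preference over objectives

# Select the candidate whose worst weighted shortfall is smallest.
best = min(candidates,
           key=lambda name: chebyshev_score(candidates[name], weights, ideal))
```

Because the score is driven by the worst objective rather than the sum, a candidate cannot compensate for a large shortfall on one objective with a surplus on another; changing the weights moves the preferred trade-off along the front.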

If this is right

Skill libraries for agents can be maintained as explicit Pareto sets rather than single best prompts, letting deployers pick variants that fit different context budgets.
Multi-objective mutation plus Chebyshev selection can be dropped into existing agent frameworks without changing the mutation code or the feedback format.
Tasks that previously showed zero progress under weighted-sum or single-objective optimizers become solvable once the full non-convex front is searched.
The annealing schedule provides a controllable knob between discovering diverse skill variants and converging on high-correctness ones for a given deployment.
Platform constraints such as description length and instruction compaction become first-class objectives instead of after-the-fact filters.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same Chebyshev-plus-annealing pattern could be applied to other LLM tuning problems that trade accuracy against latency or cost, even outside agent skill design.
If the annealing temperature is made adaptive to the observed spread of objective values rather than fixed, further reductions in the number of wasted rollouts may be possible.
Extending the method to include dynamic context-window resizing as an additional objective would test whether the Pareto front itself moves during deployment.
Open-sourcing the discovered Pareto skill sets would let downstream researchers measure how much of the reported gain transfers to new model families or new task distributions.

Load-bearing premise

That giving every optimizer the same mutation operator and the same per-objective textual feedback isolates the benefit to the selection mechanism, and that the six chosen tasks represent the hard platform constraints typical in actual LLM deployments.

What would settle it

Re-running the identical experimental protocol but on a new set of tasks whose context-window or truncation limits are twice as severe, then checking whether MOCHA still improves correctness on every task and still returns at least twice the number of Pareto-optimal variants.

Figures

Figures reproduced from arXiv: 2605.19330 by Anlan Zhang, Branislav Kveton, Jayakumar Subramanian, Md Mehrab Tanjim, Somdeb Sarkhel, Subhojyoti Mukherjee, Sunav Choudhury, Sungchul Kim, Xiang Chen.

**Figure 2.** Optimization dynamics across six skills. Correctness vs. iteration (mean ± 1 std, 5 seeds). MOCHA (blue) consistently improves beyond the initial prompt, while baselines plateau early or remain stuck at the seed skill. Dashed grey: seed skill performance.

**Figure 3.** 2D Pareto front (correctness × body compliance): MOCHA (blue, HV=.563) sits balanced between w/o HVC (exploitation, purple) and w/o Annealing (exploration, green). Baselines cluster at a single operating point. HV values in legend.

**Figure 4.** FEVER qualitative comparison. Grey: shared YAML fields. Red: baseline skill (all three baselines returned the seed template unchanged). Green: MOCHA-optimized skill with structured rules and explicit reasoning. Per-task comparisons in Section C.6.

**Figure 5.** 2D Pareto fronts (correctness × body compliance) for all six skills. Three baselines (TextGrad, ProTeGi, GEPA) and three MOCHA variants are shown. Shaded regions indicate dominated hypervolume. MOCHA variants consistently explore multiple non-dominated operating points while baselines remain near the initial prompt.

**Figure 6.** 2D Pareto fronts (correctness × description compliance) for all six skills. The same pattern holds: MOCHA discovers diverse non-dominated skill variants spanning the correctness–description compliance frontier, while baselines cluster at a single operating point.

**Figure 7.** 2D Pareto fronts (correctness × overall compliance, i.e., average of body and description compliance) for all six skills. The pattern is consistent across all three compliance views: MOCHA’s multi-objective selection enables Pareto front exploration that single-objective baselines cannot achieve.

**Figure 8.** Prompt evolution trees for MOCHA across all six skills (shown for one seed). Each node is a committed skill variant; node labels show candidate ID and mean test score (%). Blue node = best test correctness; blue edges = path from root. Metric annotations (C/D/B) at root and best node reveal how MOCHA trades compliance for correctness gains. Grey nodes = other committed candidates.

**Figure 9.** GPQA: Seed skill vs. MOCHA-optimized. The seed skill (top, red) is a single-line template returned unchanged by all three baselines. MOCHA (bottom, green) discovers a 6-step expert verification protocol with adversarial self-checking and domain-specific error patterns for organic chemistry, physics, and genetics. Correctness improves from .59 to .71.

**Figure 10.** TheoremQA: Seed skill vs. MOCHA-optimized. Baselines partially optimize but produce verbose, loosely structured output. MOCHA discovers a lean skill with theorem identification, sign/unit tracking, domain-specific templates, and strict formatting rules. Correctness improves from .53 to .82.

**Figure 11.** HoVer: Seed skill vs. MOCHA-optimized. The seed skill (top, red) is returned unchanged by all baselines. MOCHA (bottom, green) discovers a 7-step verification procedure with “default toward SUPPORTED” bias and retriever-augmented gap filling. Correctness improves from .62 to .67.

**Figure 12.** HotpotQA: Seed skill vs. MOCHA-optimized. Both baselines and MOCHA partially optimize this task. MOCHA discovers a skill emphasizing verbatim extraction (exact name forms, location qualifiers) with explicit good/bad formatting examples. Correctness improves from .34 to .66.

**Figure 13.** DebugBench: Seed skill vs. MOCHA-optimized. The seed template (top, red) provides no debugging strategy. MOCHA (bottom, green) develops a category-aware protocol: classify by bug type, apply type-specific heuristics (reference → scope check, logic → boundary check, multiple → count 2–4), and follow a “conservative fixing principle” that prevents over-correction on multi-bug inputs.

**Figure 14.** Ablation heatmap: Correctness ∆ over GEPA for each MOCHA variant across six skills. All MOCHA variants achieve substantial gains on TheoremQA and FEVER. Removing HVC gating shifts toward exploitation (highest per-task correctness); removing annealing shifts toward exploration (highest Pareto diversity).
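Several captions above report hypervolume (HV) and shade the dominated region of a 2D front. As a rough illustration of what that number measures, the sketch below computes the area dominated by a set of two-objective points relative to a reference point. It assumes both objectives are maximized and uses a reference point of (0, 0); the paper's actual reference point and normalization are not given on this page, and the example points are hypothetical.

```python
def hypervolume_2d(points, ref=(0.0, 0.0)):
    """Area dominated by `points` and bounded below by `ref` (maximization)."""
    # Ignore points that do not strictly improve on the reference point.
    inside = [p for p in points if p[0] > ref[0] and p[1] > ref[1]]
    area = 0.0
    best_y = ref[1]
    # Sweep from the largest first objective to the smallest; each point adds
    # a horizontal slab only if it raises the best second objective so far.
    for x, y in sorted(inside, key=lambda p: (-p[0], -p[1])):
        if y > best_y:
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area


# Hypothetical (correctness, compliance) points; not results from the paper.
front = [(0.7, 0.3), (0.6, 0.6), (0.3, 0.8)]
hv = hypervolume_2d(front)
```

Dominated points contribute nothing to the sweep, so adding one leaves the value unchanged, while a new non-dominated point always increases it. That property is why hypervolume is often used to compare the coverage of two fronts.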

Original abstract

LLM agents organize behavior through skills - structured natural-language specifications governing how an agent reasons, retrieves, and responds. Unlike monolithic prompts, skills are multi-field artifacts subject to hard platform constraints: description fields are truncated for routing, instruction bodies are compacted via progressive disclosure, and co-resident skills compete for limited context windows. These constraints make skill optimization inherently multi-objective: a skill must simultaneously maximize task performance and satisfy platform limits. Yet existing prompt optimizers either ignore these trade-offs or collapse them into a weighted sum, missing Pareto-optimal variants in non-convex objective regions. We introduce MOCHA (Multi-Objective Chebyshev Annealing), which replaces single-objective selection with Chebyshev scalarization - covering the full Pareto front, including non-convex regions - combined with exponential annealing that transitions from exploration to exploitation. In our experiments across six diverse agent skills - where all methods share the same multi-objective mutation operator and baselines receive identical per-objective textual feedback - existing optimizers fail to improve the seed skill on 4 of 6 tasks: 1000 rollouts yield zero progress. MOCHA breaks through on every task, achieving 7.5% relative improvement in mean correctness over the strongest baseline (up to 14.9% on FEVER and 10.4% on TheoremQA) while discovering twice as many more Pareto-optimal skill variants.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MOCHA pairs Chebyshev scalarization with annealing for LLM skill optimization under platform constraints and reports gains over baselines, but the evidence is abstract-level and the baseline fairness needs checking.


The main point to know is that this paper introduces MOCHA, which combines Chebyshev scalarization with exponential annealing for optimizing multi-objective skills in LLM agents. These skills have to balance performance with hard constraints like description truncation and limited context windows. The experiments show it improving mean correctness by 7.5% over the best baseline across six tasks, with bigger lifts on FEVER and TheoremQA, and it finds twice as many Pareto-optimal skill variants.

The new part is applying this specific pairing to the domain of agent skill optimization, where previous prompt optimizers either ignore the trade-offs or use simple weighted sums that miss non-convex parts of the front. By using Chebyshev, it covers more of the Pareto front. The setup gives all methods the same multi-objective mutation and per-objective feedback, which is a reasonable way to focus on the selection mechanism. It does a good job highlighting a practical issue in deploying agents on platforms with real limits. The fact that baselines fail to improve on most tasks while MOCHA succeeds on all suggests the method has some edge in exploration-exploitation balance via annealing.

On the soft side, the abstract gives relative improvements but skips variance, significance, and rollout counts, so the strength of the empirical result is not fully clear yet. The concern about whether the shared mutation operator fairly tests the selection is valid. If the mutation produces proposals that work better under Chebyshev selection, then part of the advantage might not be from the scalarization itself. Without a targeted ablation, it's tough to say the 7.5% is cleanly due to the new selection.

This work is for researchers and engineers working on LLM agents that need to operate under platform constraints. Anyone thinking about multi-objective optimization in AI systems could find it relevant. I think it deserves a serious referee. The core idea is grounded in established techniques but applied usefully, and the results point to something worth checking out even if more evidence is needed.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces MOCHA, which applies Chebyshev scalarization combined with exponential annealing to optimize LLM agent skills as multi-objective artifacts subject to platform constraints such as truncation and context limits. It claims that, when all methods share the same multi-objective mutation operator and per-objective feedback, MOCHA improves mean correctness by 7.5% relative to the strongest baseline (with peaks of 14.9% on FEVER and 10.4% on TheoremQA), discovers twice as many Pareto-optimal variants, and succeeds on all six tasks while baselines fail on four even after 1000 rollouts.

Significance. If the reported gains prove robust under statistical controls and the experimental isolation of the selection mechanism holds, the work would meaningfully advance multi-objective prompt and skill optimization for constrained LLM agents by addressing non-convex Pareto fronts without weighted-sum collapse. The concrete task-specific numbers and the emphasis on platform constraints provide a practical contribution, though the current empirical presentation limits immediate impact.

major comments (2)

[Abstract] Abstract and experimental results: the reported 7.5% relative improvement in mean correctness (and task-specific gains) is presented without variance estimates, statistical significance tests, exact rollout counts per method, or a precise definition and measurement procedure for Pareto optimality. This omission makes the central empirical claim difficult to evaluate and requires additional tables or reporting to substantiate.
[Experiments] Experimental setup: the design asserts that sharing the identical multi-objective mutation operator and per-objective textual feedback across methods isolates the benefit of Chebyshev scalarization plus annealing. However, without an ablation that swaps only the selection rule while holding mutation fixed, performance differences could arise from asymmetric interactions between mutation proposals and selection dynamics rather than the claimed MOCHA components; this assumption is load-bearing for attributing the 7.5% lift and doubled Pareto count.

minor comments (1)

[Abstract] Abstract: the phrasing 'twice as many more Pareto-optimal skill variants' is imprecise and should be replaced with exact counts and a clear definition of how Pareto optimality is determined in the skill space.
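For readers who want the standard definition the referee is asking for: a variant is Pareto-optimal within a set if no other variant is at least as good on every objective and strictly better on at least one. The sketch below is one conventional way to count such variants, assuming all objectives are maximized. It is not the paper's measurement procedure, and the score tuples are hypothetical.

```python
def dominates(a, b):
    """True if `a` is no worse than `b` everywhere and strictly better somewhere."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(points):
    """Return the points that no other point in the set dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (correctness, compliance) scores for five skill variants.
variants = [(0.7, 0.3), (0.6, 0.6), (0.5, 0.5), (0.3, 0.8), (0.3, 0.7)]
front = pareto_front(variants)
n_pareto_optimal = len(front)
```

Reporting `n_pareto_optimal` per method, together with the objectives used in the dominance test, would make a claim such as "twice as many" directly checkable.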

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We address each of the major comments below and have made revisions to improve the clarity and robustness of our empirical claims.


Referee: [Abstract] Abstract and experimental results: the reported 7.5% relative improvement in mean correctness (and task-specific gains) is presented without variance estimates, statistical significance tests, exact rollout counts per method, or a precise definition and measurement procedure for Pareto optimality. This omission makes the central empirical claim difficult to evaluate and requires additional tables or reporting to substantiate.

Authors: We agree with the referee that the empirical claims would benefit from additional statistical rigor and precise reporting. In the revised manuscript, we have included variance estimates from multiple independent runs, conducted statistical significance tests (such as paired t-tests with p-values reported), specified the exact number of rollouts for each method, and added a clear definition and measurement procedure for identifying Pareto-optimal variants. These details are now presented in a new supplementary table and expanded experimental section.
revision: yes

Referee: [Experiments] Experimental setup: the design asserts that sharing the identical multi-objective mutation operator and per-objective textual feedback across methods isolates the benefit of Chebyshev scalarization plus annealing. However, without an ablation that swaps only the selection rule while holding mutation fixed, performance differences could arise from asymmetric interactions between mutation proposals and selection dynamics rather than the claimed MOCHA components; this assumption is load-bearing for attributing the 7.5% lift and doubled Pareto count.

Authors: We thank the referee for highlighting this important point about experimental isolation. Our original design ensured that the multi-objective mutation operator and per-objective feedback are identical across all compared methods, with the only varying component being the selection mechanism. This directly attributes differences to the Chebyshev scalarization and annealing in MOCHA. To further strengthen this isolation, we have added an explicit ablation experiment in the revised manuscript where we hold the mutation operator fixed and vary only the selection rule, demonstrating that the performance improvements stem from MOCHA's selection strategy rather than interactions.
revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation and claims are self-contained


The paper presents MOCHA as an algorithmic combination of Chebyshev scalarization for Pareto coverage and exponential annealing for exploration-exploitation transition. The central claims of improved mean correctness and doubled Pareto-optimal variants are supported by empirical results on six tasks under a shared mutation operator. No equation, selection rule, or performance metric reduces by construction to a fitted parameter, self-citation chain, or input definition. The experimental isolation of selection benefit is an assumption about fairness rather than a definitional tautology, and the derivation does not invoke uniqueness theorems or ansatzes from prior self-work that would force the outcome.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on standard properties of Chebyshev scalarization in multi-objective optimization and the effectiveness of annealing schedules for transitioning from exploration to exploitation; no new entities are postulated.

free parameters (1)

annealing rate and Chebyshev parameter
The exponential annealing schedule and any scalarization weighting parameter are likely tuned or chosen to control the exploration-exploitation transition and Pareto coverage.
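To illustrate how a single rate parameter can control the exploration-exploitation transition, here is a small sketch of an exponentially decaying temperature driving a softmax over candidate scores. The softmax selection rule, the default values `t0=1.0` and `rate=0.01`, and the scores are assumptions made for illustration; this page does not specify how the paper uses its schedule.

```python
import math


def temperature(step, t0=1.0, rate=0.01):
    """Exponentially decaying temperature: t0 * exp(-rate * step)."""
    return t0 * math.exp(-rate * step)


def selection_probs(scores, temp):
    """Softmax over scores (higher is better) at the given temperature."""
    top = max(scores)
    # Subtract the maximum before exponentiating for numerical stability.
    exps = [math.exp((s - top) / temp) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scalarized scores for three candidate skills.
scores = [0.50, 0.55, 0.60]
early = selection_probs(scores, temperature(0))    # high temperature
late = selection_probs(scores, temperature(500))   # low temperature
```

At step 0 the three candidates are chosen with nearly equal probability, which favors exploration; by step 500 almost all probability mass sits on the best-scoring candidate. A larger `rate` makes that shift happen earlier.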

axioms (1)

domain assumption: Chebyshev scalarization can cover the full Pareto front, including non-convex regions
Invoked when claiming the method finds variants missed by weighted-sum approaches.
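A toy example shows why this property matters. With three hypothetical two-objective points (both maximized), where C is non-dominated but lies in a concave dent of the front, a weighted sum never selects C for any weight, while a weighted Chebyshev distance to an assumed ideal point of (1, 1) does. The points and the weight grid are illustrative assumptions, not data from the paper.

```python
# Three mutually non-dominated points; C sits below the line joining A and B.
points = {"A": (1.0, 0.0), "B": (0.0, 1.0), "C": (0.4, 0.4)}


def weighted_sum_winner(w):
    """Best point under w * f1 + (1 - w) * f2 (higher is better)."""
    return max(points,
               key=lambda k: w * points[k][0] + (1 - w) * points[k][1])


def chebyshev_winner(w, ideal=(1.0, 1.0)):
    """Best point under the weighted Chebyshev distance to `ideal` (lower is better)."""
    return min(points,
               key=lambda k: max(w * (ideal[0] - points[k][0]),
                                 (1 - w) * (ideal[1] - points[k][1])))


# Sweep the weight on the first objective from 0 to 1.
grid = [i / 100 for i in range(101)]
weighted_sum_winners = {weighted_sum_winner(w) for w in grid}
chebyshev_winners = {chebyshev_winner(w) for w in grid}
```

Under the weighted sum, A scores `w`, B scores `1 - w`, and C scores 0.4, so one of A or B always reaches at least 0.5 and C is never chosen. Under Chebyshev with equal weights, C has a worst-case shortfall of 0.3 against 0.5 for A and B, so it is selected.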



Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 6 internal anchors

[1]

Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J

Lakshya A. Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J. Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alexandros G. Dimakis, Dan Klein, Ion Stoica, Matei Zaharia, and Omar Khattab. GEPA: Reflective prompt evolution can outperform reinforcement learning. InICLR, 2026

work page 2026
[2]

Extend claude with skills

Anthropic. Extend claude with skills. https://code.claude.com/docs/en/skills. Ac- cessed: 2026-04-25

work page 2026
[3]

Approximation quality of the hypervolume indicator

Karl Bringmann and Tobias Friedrich. Approximation quality of the hypervolume indicator. Artificial Intelligence, 195:265–290, 2013

work page 2013
[4]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. InNeurIPS, volume 33, pages 1877–1901, 2020

work page 1901
[5]

TheoremQA: A theorem-driven question answering dataset

Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, and Tony Xia. TheoremQA: A theorem-driven question answering dataset. InEMNLP, pages 7889–7901, 2023

work page 2023
[6]

Trace is the next autodiff: Generative optimization with rich feedback, execution traces, and LLMs.arXiv preprint arXiv:2406.16218, 2024

Ching-An Cheng, Allen Nie, and Adith Swaminathan. Trace is the next autodiff: Generative optimization with rich feedback, execution traces, and LLMs.arXiv preprint arXiv:2406.16218, 2024

work page arXiv 2024
[7]

Differentiable expected hypervolume improvement for parallel multi-objective Bayesian optimization

Samuel Daulton, Maximilian Balandat, and Eytan Bakshy. Differentiable expected hypervolume improvement for parallel multi-objective Bayesian optimization. InNeurIPS, volume 33, pages 9851–9864, 2020

work page 2020
[8]

A fast and elitist multiobjective genetic algorithm: Nsga-ii.IEEE Transactions on Evolutionary Computation, 6 (2):182–197, 2002

Kalyanmoy Deb, Amrit Pratap, Sameer Agarwal, and TAMT Meyarivan. A fast and elitist multiobjective genetic algorithm: Nsga-ii.IEEE Transactions on Evolutionary Computation, 6 (2):182–197, 2002

work page 2002
[9]

Xing, and Zhiting Hu

Mingkai Deng, Jianyu Wang, Cheng-Ping Hsieh, Yihan Wang, Han Guo, Tianmin Shu, Meng Song, Eric P. Xing, and Zhiting Hu. RLPrompt: Optimizing discrete text prompts with reinforcement learning. InEMNLP, 2022

work page 2022
[10]

Michael T. M. Emmerich and Andr ´e H. Deutz. A tutorial on multiobjective optimization: Fundamentals and evolutionary methods.Natural Computing, 17(3):585–609, 2018

work page 2018
[11]

Guerreiro, Carlos M

Andreia P. Guerreiro, Carlos M. Fonseca, and Lu ´ıs Paquete. The hypervolume indicator: Problems and algorithms.ACM Computing Surveys, 54(6):1–42, 2021

work page 2021
[12]

EvoPrompt: Connecting LLMs with Evolutionary Algorithms Yields Powerful Prompt Optimizers

Qingyan Guo, Rui Wang, Junliang Guo, Bei Li, Kaitao Song, Xu Tan, Guoqing Liu, Jiang Bian, and Yujiu Yang. Connecting large language models with evolutionary algorithms yields powerful prompt optimizers.arXiv preprint arXiv:2309.08532, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

HoVer: A dataset for many-hop fact extraction and claim verification

Yichen Jiang, Shikha Bordia, Zheng Zhong, Charles Dognin, Maneesh Singh, and Mohit Bansal. HoVer: A dataset for many-hop fact extraction and claim verification. InFindings of EMNLP, 2020

work page 2020
[14]

Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts

Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. DSPy: Compiling declarative language model calls into self-improving pipelines. InICLR, 2024

work page 2024
[15]

Meta-Harness: End-to-End Optimization of Model Harnesses

Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn. Meta-harness: End-to-end optimization of model harnesses.arXiv preprint arXiv:2603.28052, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[16]

Smooth tchebycheff scalarization for multi-objective optimization

Xi Lin, Xiaoyuan Zhang, Zhiyuan Yang, Fei Liu, Zhenkun Wang, and Qingfu Zhang. Smooth tchebycheff scalarization for multi-objective optimization. InICML, 2024. 10

work page 2024
[17]

Eureka: Human-level reward design via coding large language models

Yecheng Jason Ma, William Liang, Guanzhi Wang, De-An Huang, Osbert Bastani, Dinesh Jayaraman, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Eureka: Human-level reward design via coding large language models. InICLR, 2024

work page 2024
[18]

Springer, Boston, MA, 1999

Kaisa Miettinen.Nonlinear Multiobjective Optimization. Springer, Boston, MA, 1999

work page 1999
[19]

Multi-objective alignment of large language models through hypervolume maximization

Subhojyoti Mukherjee, Anusha Lalitha, Sailik Sengupta, Aniket Deshmukh, and Branislav Kve- ton. Multi-objective alignment of large language models through hypervolume maximization. arXiv preprint arXiv:2412.05469, 2024

work page arXiv 2024
[20]

Ryan, Josh Purtell, David Broman, Christopher Potts, Matei Zaharia, and Omar Khattab

Krista Opsahl-Ong, Michael J. Ryan, Josh Purtell, David Broman, Christopher Potts, Matei Zaharia, and Omar Khattab. Optimizing instructions and demonstrations for multi-stage language model programs. InEMNLP, 2024

work page 2024
[21]

gradient descent

Reid Pryzant, Dan Iter, Jerry Li, Yin Tat Lee, Chenguang Zhu, and Michael Zeng. Automatic prompt optimization with “gradient descent” and beam search. InEMNLP, 2023

work page 2023
[22]

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. GPQA: A graduate-level google-proof q&a benchmark.arXiv preprint arXiv:2311.12022, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

FEVER: a large-scale dataset for fact extraction and VERification

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. FEVER: a large-scale dataset for fact extraction and VERification. InNAACL-HLT, pages 809–819, 2018

work page 2018
[24]

DebugBench: Evaluating debugging capability of large language models

Runchu Tian, Yining Ye, Yujia Qin, Xin Cong, Yankai Lin, Yinxu Pan, Yesai Wu, Haotian Hui, Weichuan Liu, Zhiyuan Liu, and Maosong Sun. DebugBench: Evaluating debugging capability of large language models. InFindings of ACL, pages 4173–4198, 2024

work page 2024
[25]

Voyager: An Open-Ended Embodied Agent with Large Language Models

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. V oyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[26]

Arithmetic control of LLMs for diverse user preferences: Directional preference alignment with multi-objective rewards.arXiv preprint arXiv:2402.18571, 2024

Haoxiang Wang, Yong Lin, Wei Xiong, Rui Yang, Shizhe Diao, Shuang Qiu, Han Zhao, and Tong Zhang. Arithmetic control of LLMs for diverse user preferences: Directional preference alignment with multi-objective rewards.arXiv preprint arXiv:2402.18571, 2024

work page arXiv 2024
[27]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, volume 35, pages 24824–24837, 2022

work page 2022
[28]

SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning

Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, Zeyu Zheng, Cihang Xie, and Huaxiu Yao. Skillrl: Evolving agents via recursive skill-augmented reinforcement learning.arXiv preprint arXiv:2602.08234, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[29]

Large language models as optimizers

Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. InICLR, 2024

work page 2024
[30]

Cohen, Ruslan Salakhut- dinov, and Christopher D

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhut- dinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. InEMNLP, pages 2369–2380, 2018

work page 2018
[31]

TextGrad: Automatic "Differentiation" via Text

Mert Y¨uksekg¨on¨ul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. TextGrad: Automatic “differentiation” via text.arXiv preprint arXiv:2406.07496, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[32]

Large language models are human-level prompt engineers

Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. Large language models are human-level prompt engineers. InICLR, 2023

work page 2023
[33]

Beyond one-preference-fits-all alignment: Multi-objective direct preference optimization

Zhanhui Zhou, Jie Liu, Jing Shao, Xiangyu Yue, Chao Yang, Wanli Ouyang, and Yu Qiao. Beyond one-preference-fits-all alignment: Multi-objective direct preference optimization. In Findings of ACL, 2024. 11

work page 2024
[34]

Multiobjective evolutionary algorithms: a comparative case study and the strength Pareto approach

Eckart Zitzler and Lothar Thiele. Multiobjective evolutionary algorithms: a comparative case study and the strength Pareto approach. IEEE Transactions on Evolutionary Computation, 3(4):257–271, 1999

[35]

Performance assessment of multiobjective optimizers: An analysis and review

Eckart Zitzler, Lothar Thiele, Marco Laumanns, Carlos M Fonseca, and Viviane Grunert Da Fonseca. Performance assessment of multiobjective optimizers: An analysis and review. IEEE Transactions on Evolutionary Computation, 7(2):117–132, 2003

A Background: Scalarization and Hypervolume Theory

We provide extended background on the theoretical foundations ...

Identify the domain and relevant theorem(s): State which theorem, formula, or principle applies.

Define all variables and given quantities explicitly: Write out every given value with correct signs and units.

Apply the theorem step by step: Show each algebraic/logical step. Double-check:
• Signs: Pay extreme attention to negative signs. Never drop them.
• Powers of 10: Verify exponent arithmetic carefully.
• Units: Track throughout. Convert as needed but CHECK expected units.

Domain-specific rules:
• Resistance with geometry: Axial: $R = \rho L / (\pi (R_o^2 - R_i^2))$. Radial: $R = (\rho / 2\pi L) \ln(R_o / R_i)$.
• Stopping times: $T$ is stopping time iff $\{T \le t\} \in \mathcal{F}_t$. Sum of non-negative stopping times IS a stopping time.
• Iteration methods: For Aitken’s $\Delta^2$, count iterations of the ACCELERATED method only.

CRITICAL Formatting Rules:
• If multiple sub-parts, retu...
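The two resistance formulas quoted in the domain-specific rules above can be checked numerically. The following is a minimal Python sketch for a hollow cylinder. The function names `axial_resistance` and `radial_resistance`, the argument names, and the example values are illustrative assumptions and do not come from the paper.

```python
import math


def _check_geometry(r_inner: float, r_outer: float, length: float) -> None:
    """Reject geometries for which the hollow-cylinder formulas are undefined."""
    if not (0 < r_inner < r_outer):
        raise ValueError("require 0 < r_inner < r_outer")
    if length <= 0:
        raise ValueError("require length > 0")


def axial_resistance(rho: float, length: float, r_inner: float, r_outer: float) -> float:
    """Current flows along the axis: R = rho * L / (pi * (R_o^2 - R_i^2))."""
    _check_geometry(r_inner, r_outer, length)
    return rho * length / (math.pi * (r_outer**2 - r_inner**2))


def radial_resistance(rho: float, length: float, r_inner: float, r_outer: float) -> float:
    """Current flows from the inner to the outer surface: R = rho / (2 pi L) * ln(R_o / R_i)."""
    _check_geometry(r_inner, r_outer, length)
    return rho / (2 * math.pi * length) * math.log(r_outer / r_inner)


if __name__ == "__main__":
    # Hypothetical example values in SI units (ohm-metres and metres).
    rho, length, r_i, r_o = 1.7e-8, 1.0, 0.01, 0.02
    print(f"axial:  {axial_resistance(rho, length, r_i, r_o):.3e} ohm")
    print(f"radial: {radial_resistance(rho, length, r_i, r_o):.3e} ohm")
```

Keeping the two cases as separate functions mirrors the rule's distinction: the axial case divides by the annular cross-section, while the radial case integrates over cylindrical shells and therefore depends only on the ratio of the radii.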

Binary output only: exactly SUPPORTED or NOT SUPPORTED. Never “PARTIALLY SUPPORTED” or any other value.

Default toward SUPPORTED when evidence is consistent. Do NOT require every detail to be explicitly stated---implicit support and reasonable inference count.

Only NOT SUPPORTED when evidence actively contradicts the claim.

Reasoning Strategy:
Step 1: Decompose claim into atomic sub-claims.
Step 2: Map evidence to sub-claims. Note direct vs. inferential support.
Step 3: Use retriever tool to fill gaps with targeted queries.
Step 4: Chain reasoning across passages. Follow entity links completely.
Step 5: Check for...

Decompose: Break question into sub-questions. Identify what entity/fact each hop requires.

Extract: Read every evidence piece. Extract all names (full formal names), dates, nicknames, roles, locations---even from parenthetical remarks.

Retrieve: If evidence is insufficient, call retriever with targeted queries. Do NOT give up.

Chain: Connect facts across passages. Entity A in passage 1 → Entity B in passage 2.

Synthesize: Determine final answer.

Critical Rules for Answer Field:
• Short exact phrase---name, date, number, place, or brief noun phrase.
• EXACT form from evidence: “Jerral Wayne Jones Sr.” NOT “Jerry Jones”. “Dayton, Ohio” NOT “Dayton”.
• Copy verbatim whenever possible. Preserve location qualifiers.
• Use the most complete, formal name version f...

Understand bug type first---it determines fixing strategy:
• reference error: Wrong variable/function/method name. Fix ONLY the incorrect reference(s).
• syntax error: Missing colon, semicolon, bracket, wrong operator syntax. Fix ONLY syntax.
• logic error: Off-by-one, wrong comparison, wrong return, wrong condition. Fix ONLY logic.
• type error: Wrong type u...

Conservative fixing principle: When uncertain, do NOT change. A wrong fix is worse than a missing fix.

Reproduce the rest EXACTLY---preserve all indentation, spacing, comments, structure.

Reasoning Process:
Step 1: Read bug type. Single-category or multiple?
Step 2: Understand algorithm PURPOSE before making changes.
Step 3: For each bug, state: (a) exact line, (b) what is wrong, (c) fix, (d) why it is definitely a bug.
Step 4: For multiple error---count b...

