When Gradients Collide: Failure Modes of Multi-Objective Prompt Optimization for LLM Judges

Abhishek Divekar; Parth Darshan

arXiv: 2605.26046 · v2 · pith:IWHFYRJXnew · submitted 2026-05-25 · 💻 cs.CL · cs.AI · cs.LG · cs.MA · cs.SE

Pith reviewed 2026-06-29 21:11 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG · cs.MA · cs.SE

keywords multi-objective optimization · textual gradients · LLM judges · prompt optimization · gradient dilution · instruction interference · failure modes

The pith

Multi-objective prompt optimization for LLM judges fails due to gradient dilution at optimization and instruction interference at inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper extends single-objective textual gradient methods to multiple evaluation criteria for LLM judges. It evaluates four modes of decomposing the optimization by varying information sharing across objectives in the loss, gradient, and optimizer LLMs. When the gradient LLM handles multiple criteria jointly, its task focus drops from 9.0 to 3.7 out of 10, a 59% reduction. Combining prompts optimized separately for each criterion lowers Spearman rho from 0.305 to 0.220. These findings point to two distinct failure modes that limit multi-objective judge optimization using textual feedback.
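The two headline figures can be checked with a few lines of arithmetic. The sketch below only recomputes the numbers quoted above; the variable and function names are illustrative and do not come from the paper.

```python
# Figures quoted in the summary above.
focus_single, focus_joint = 9.0, 3.7      # task-focus on a 1-10 scale
rho_separate, rho_combined = 0.305, 0.220  # Spearman rho before and after merging


def relative_drop(before: float, after: float) -> float:
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before


focus_drop = relative_drop(focus_single, focus_joint)
rho_delta = rho_combined - rho_separate

print(f"task-focus drop: {focus_drop:.1%}")      # rounds to the reported 59%
print(f"Spearman rho change: {rho_delta:+.3f}")
```

The task-focus drop is a relative change, while the rho figure is an absolute difference, so the two numbers are not on the same footing.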

Core claim

Extending TextGrad to the multi-objective setting shows that gradient task-focus drops substantially when the gradient LLM must address multiple criteria at once, and that merging single-objective optimized instructions into one prompt reduces correlation performance, identifying optimization-time gradient dilution and inference-time instruction interference as separable failure modes.

What carries the argument

Four decomposition modes of textual gradient optimizers that vary cross-objective information sharing among the loss, gradient, and optimizer LLMs.
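One way to picture this design space is a flag per stage saying whether that stage sees all criteria jointly or one criterion at a time. The sketch below is a hypothetical encoding, not the paper's implementation: the class name, field names, and the call-count helper are assumptions, and this page does not say which combinations correspond to the paper's four modes.

```python
from dataclasses import dataclass
from itertools import product

# Illustrative criterion names (the four SummEval dimensions).
CRITERIA = ["fluency", "relevance", "coherence", "consistency"]


@dataclass(frozen=True)
class DecompositionMode:
    """Hypothetical encoding: True means the stage sees all criteria jointly."""

    loss_joint: bool
    gradient_joint: bool
    optimizer_joint: bool

    def calls_per_step(self, n_criteria: int) -> dict:
        """Number of LLM calls each stage makes in one optimization step."""

        def per_stage(joint: bool) -> int:
            return 1 if joint else n_criteria

        return {
            "loss": per_stage(self.loss_joint),
            "gradient": per_stage(self.gradient_joint),
            "optimizer": per_stage(self.optimizer_joint),
        }


# All 2^3 = 8 flag combinations; which of them the paper tests is not stated here.
ALL_MODES = [DecompositionMode(*flags) for flags in product([False, True], repeat=3)]

fully_separate = DecompositionMode(False, False, False)
fully_joint = DecompositionMode(True, True, True)
print(fully_separate.calls_per_step(len(CRITERIA)))
print(fully_joint.calls_per_step(len(CRITERIA)))
```

Under this encoding, moving a stage from separate to joint cuts its calls per step but forces one LLM call to reason about every criterion at once, which is the condition under which the paper reports reduced task-focus.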

If this is right

The gradient LLM's ability to focus on individual tasks decreases markedly in joint multi-objective feedback.
Naive combination of single-objective prompts leads to degraded evaluation correlation.
The design space for multi-objective textual gradient optimization is constrained by these two failure modes.
Separable failure modes suggest that optimization and inference stages require distinct handling strategies.
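The "naive combination" in the second point can be pictured as plain concatenation of per-criterion instructions into a single judge prompt. This is a minimal sketch under that assumption; the function name, section layout, and placeholder instructions are hypothetical, and the paper's actual merging procedure may differ.

```python
def combine_instructions(per_criterion: dict[str, str]) -> str:
    """Naively concatenate per-criterion instructions into one judge prompt."""
    sections = [
        f"## {name}\n{text.strip()}" for name, text in per_criterion.items()
    ]
    return "\n\n".join(sections)


# Placeholder instructions; real ones would come from separate optimization runs.
optimized = {
    "fluency": "Rate grammar and readability from 1 to 5.",
    "coherence": "Rate how well the sentences fit together from 1 to 5.",
}
combined_prompt = combine_instructions(optimized)
print(combined_prompt)
```

Nothing in this merge step reconciles overlapping or conflicting guidance between sections, which is one plausible route to the inference-time interference described above.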

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Optimizers could benefit from processing objectives separately during gradient computation to avoid dilution.
Prompt composition techniques might reduce interference when merging optimized instructions.
Similar issues may arise in other multi-criteria LLM optimization tasks beyond judges.

Load-bearing premise

The four decomposition modes provide a valid test of multi-objective textual gradient behavior without introducing uncontrolled biases from the underlying LLMs themselves.

What would settle it

Observing no drop in gradient task-focus when using joint multi-criteria feedback, or no degradation in Spearman rho when combining single-objective prompts, would falsify the identified failure modes.
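The second check compares Spearman rho between judge scores and human scores under the two prompt setups. The sketch below is a dependency-free implementation of Spearman rho (Pearson correlation on average ranks); the score lists are made up for illustration and are not data from the paper, and constant inputs (zero variance) are not handled.

```python
from statistics import mean


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the rank-transformed inputs."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up scores for illustration only.
human = [1, 2, 3, 4, 5]
judge_separate = [1, 2, 3, 5, 4]
judge_combined = [2, 1, 3, 5, 4]
rho_sep = spearman_rho(human, judge_separate)
rho_comb = spearman_rho(human, judge_combined)
print(rho_sep, rho_comb)
```

A falsifying result would be `rho_comb` matching or exceeding `rho_sep` on the real evaluation set, within run-to-run variation.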

Figures

Figures reproduced from arXiv: 2605.26046 by Abhishek Divekar, Parth Darshan.

**Figure 2.** Per-task Spearman ρ at each optimization step on SUMMEVAL with Qwen3. We average over N = 3 runs (shaded bands show min to max). Each column shows one of the five decomposition modes. On the top row we apply validation-MAE to gate prompts at each step, while the bottom row has no gating. Gray line indicates task-averaged ρ; stars mark best step. Black diamonds (right axis) denote the hypervolume indicator fo…

**Figure 3.** Gradient specificity (1 to 10 scale, higher is …

**Figure 4.** Gradient specificity for SSC vs. SCC after …

**Figure 5.** Per-task Spearman ρ at each optimization step on SUMMEVAL with DeepSeek v4. Notation same as …

**Figure 6.** Gradient specificity by decomposition mode …
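The Figure 2 caption mentions gating prompts by validation MAE at each step. A minimal sketch of what such a gate could look like follows; the acceptance rule (keep a candidate only if validation MAE does not get worse), the prompt names, and the scores are all assumptions for illustration, not the paper's procedure or data.

```python
def mean_absolute_error(predicted, target):
    """Average absolute difference between judge scores and reference scores."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


def gate(current_prompt, current_mae, candidate_prompt, candidate_mae):
    """Keep the candidate only if it does not worsen validation MAE (assumed rule)."""
    if candidate_mae <= current_mae:
        return candidate_prompt, candidate_mae
    return current_prompt, current_mae


human = [3, 4, 2, 5]
scores_by_prompt = {  # hypothetical judge outputs on a validation split
    "prompt_v0": [2, 4, 3, 3],
    "prompt_v1": [3, 4, 3, 4],
    "prompt_v2": [1, 2, 2, 5],
}

prompt = "prompt_v0"
mae = mean_absolute_error(scores_by_prompt[prompt], human)
for candidate in ["prompt_v1", "prompt_v2"]:
    cand_mae = mean_absolute_error(scores_by_prompt[candidate], human)
    prompt, mae = gate(prompt, mae, candidate, cand_mae)
print(prompt, mae)
```

In this toy run the first candidate lowers the error and is accepted, while the second raises it again and is rejected, so the gate holds on to the better prompt.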

Original abstract

Customizing an LLM judge to a specific problem or domain often involves optimizing its prompt across multiple evaluation criteria simultaneously. Textual gradient methods automate this for a single judge criterion; however, they produce natural-language critiques, not numerical vectors. Thus, the conflict-resolution toolkit of multi-task learning (PCGrad, MGDA) does not apply to this multi-objective textual gradient setting. We extend TextGrad to the multi-objective setting and test four decomposition modes of textual gradient optimizers by varying how much cross-objective information the loss, gradient, and optimizer LLMs share. We find the gradient's task-focus drops by 59% (9.0 to 3.7 out of 10) when the gradient LLM must provide feedback on multiple criteria jointly. Separately, we observe that naively combining single-objective optimized instructions into a single prompt degrades Spearman rho from 0.305 to 0.220 (-0.085). These results identify two separable failure modes: optimization-time gradient dilution and inference-time instruction interference, which together constrain the design space for multi-objective judge optimization using textual feedback.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper quantifies two failure modes in multi-objective textual gradient optimization for LLM judges but leaves open whether the drops are method-specific or just LLM limits on joint instructions.

The punchline is that extending TextGrad to multiple criteria produces a 59% drop in gradient task-focus and a 0.085 drop in Spearman rho when single-objective prompts are combined, and these appear as two distinct problems.

The work tests four decomposition modes that vary how much the loss, gradient, and optimizer LLMs share cross-objective information. It reports the concrete numbers above and argues the modes separate optimization-time dilution from inference-time interference. That separation and the specific measurements are the main addition relative to the single-objective TextGrad papers it cites.

The numbers are useful for anyone who has tried to optimize an LLM judge prompt and watched performance on one criterion collapse when a second is added. The paper stays empirical and does not claim a general theory, which keeps the claims grounded.

The soft spot is the stress-test concern: every component is an LLM, so the observed drops could simply reflect the base models' difficulty parsing multi-criterion instructions rather than anything particular to textual gradients. The abstract gives no evidence that the results survive model swaps or that the task-focus metric itself is stable. Without dataset details, statistical tests, or ablation on the underlying LLMs, it is hard to know how cleanly the four modes isolate the claimed effects.

This is for groups building or evaluating LLM judges who need practical guardrails on prompt optimization. It is narrow in scope but the quantitative observations are new enough that a serious referee should see the full manuscript and ask for the missing controls.

Referee Report

2 major / 2 minor

Summary. The paper extends TextGrad to multi-objective prompt optimization for LLM judges by testing four decomposition modes that vary the degree of cross-objective information sharing among the loss, gradient, and optimizer LLMs. It reports two main empirical findings: a 59% drop in the gradient LLM's task-focus score (9.0 to 3.7 out of 10) when feedback must address multiple criteria jointly, and a drop in Spearman rho from 0.305 to 0.220 when single-objective optimized instructions are naively combined at inference time. These are interpreted as separable failure modes of optimization-time gradient dilution and inference-time instruction interference that constrain the design space for multi-objective textual-gradient judge optimization.

Significance. If the reported quantitative drops prove robust after controls for base-LLM confounds, the work provides a concrete empirical constraint on multi-objective textual gradient methods, showing that standard multi-task learning conflict-resolution tools cannot be directly ported and that new mechanisms for handling objective interference in natural-language feedback are needed. The identification of two distinct failure modes (one at optimization time, one at inference time) is a useful organizing observation for future prompt-optimization research.

major comments (2)

[Methods / Experimental Design] The central claim that the four decomposition modes isolate multi-objective textual-gradient effects rests on the untested assumption that observed drops (task-focus 9.0→3.7; rho 0.305→0.220) are not primarily caused by the base LLMs' limited ability to parse or generate multi-criterion instructions. No ablation that swaps the underlying models or validates the task-focus metric against human judgments is described, which directly undermines the attribution to gradient mechanics rather than base-model limitations.
[Results] The quantitative results that support the two failure modes are presented without dataset descriptions, number of evaluation instances, number of optimization runs, statistical tests, or variance estimates. Because the abstract itself states these specific numbers (59% drop, -0.085 rho), the absence of these details makes it impossible to determine whether the measured effects are reliable enough to ground the design-space constraint claim.

minor comments (2)

[Abstract / Results] The abstract and results sections should explicitly define how the task-focus score (out of 10) is computed and whether it is itself produced by an LLM judge, as this metric is load-bearing for the gradient-dilution claim.
[Methods] Clarify the exact information-sharing protocol for each of the four decomposition modes (e.g., what text is passed between loss/gradient/optimizer LLMs) so that the modes can be reproduced or extended.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point-by-point below, providing the strongest honest defense of the manuscript while acknowledging where clarification or additions are warranted.

Referee: [Methods / Experimental Design] The central claim that the four decomposition modes isolate multi-objective textual-gradient effects rests on the untested assumption that observed drops (task-focus 9.0→3.7; rho 0.305→0.220) are not primarily caused by the base LLMs' limited ability to parse or generate multi-criterion instructions. No ablation that swaps the underlying models or validates the task-focus metric against human judgments is described, which directly undermines the attribution to gradient mechanics rather than base-model limitations.

Authors: The four decomposition modes hold the base LLMs fixed while varying only the degree of cross-objective information sharing in the loss, gradient, and optimizer stages. Therefore, differences in task-focus (9.0 vs. 3.7) and downstream Spearman rho are attributable to the decomposition strategy rather than base-model limitations. The task-focus score is an internal rating produced by the same gradient LLM under single- versus multi-criterion prompts, providing a controlled comparison. We will revise the Methods section to state explicitly that base models are constant across conditions and to note the lack of external human validation of the task-focus metric as a limitation, with model-swap experiments planned for future work. revision: partial

Referee: [Results] The quantitative results that support the two failure modes are presented without dataset descriptions, number of evaluation instances, number of optimization runs, statistical tests, or variance estimates. Because the abstract itself states these specific numbers (59% drop, -0.085 rho), the absence of these details makes it impossible to determine whether the measured effects are reliable enough to ground the design-space constraint claim.

Authors: We agree that the Results section omitted these details. The revised manuscript will add: dataset descriptions, exact counts of evaluation instances, number of independent optimization runs, statistical tests performed, and variance estimates (means ± standard deviation). These additions will allow readers to evaluate the reliability of the reported 59% task-focus drop and −0.085 rho change. revision: yes
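The variance reporting promised in this response could take a form like the sketch below, which summarizes a metric over independent runs as mean ± sample standard deviation plus the min-to-max range used in the Figure 2 caption. The function name and the per-run values are hypothetical and are not results from the paper.

```python
from statistics import mean, stdev


def summarize(runs):
    """Mean, sample standard deviation, and min-max range over independent runs."""
    return {
        "mean": mean(runs),
        "std": stdev(runs),
        "min": min(runs),
        "max": max(runs),
    }


# Hypothetical per-run Spearman rho values for illustration only.
rho_runs = [0.30, 0.32, 0.28]
s = summarize(rho_runs)
print(f"rho = {s['mean']:.3f} ± {s['std']:.3f} (range {s['min']:.2f} to {s['max']:.2f})")
```

With only three runs the standard deviation is a rough guide at best, so reporting the raw per-run values alongside it would let readers judge the −0.085 change against run-to-run noise.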

Circularity Check

0 steps flagged

No circularity: purely empirical measurements of LLM prompt optimization

The paper reports direct experimental observations from four decomposition modes of textual gradient optimizers, measuring drops in task-focus (9.0 to 3.7) and Spearman rho (0.305 to 0.220). No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided abstract or description. The central claims are falsifiable empirical outcomes on specific LLM behaviors, not reductions to inputs by construction. This is self-contained empirical work with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract.
