pith. machine review for the scientific record.

arxiv: 2604.08801 · v1 · submitted 2026-04-09 · 💻 cs.LG · cs.CL

Recognition: no theorem link

p1: Better Prompt Optimization with Fewer Prompts

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:20 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords prompt optimization · variance decomposition · system prompts · user prompts · language models · reasoning benchmarks · data filtering · generalization

The pith

Prompt optimization improves when a small subset of user prompts is chosen for high variance across candidate system prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines why prompt optimization succeeds on some tasks but fails on others by decomposing reward variance into response stochasticity and differences in system prompt quality. It shows that optimization only works reliably when variance among system prompts is large enough to outweigh response noise. Adding more user prompts can actually shrink this useful variance on mixed datasets where individual prompts favor different system prompts. The proposed filtering approach keeps only a few user prompts that display high variance across candidate system prompts, making it easier to identify good system prompts and yielding better optimization that generalizes from very small sets.

Core claim

The reward variance across system prompts decomposes into variance among responses, which captures generation stochasticity, and variance among system prompts, which captures differences in system prompt quality. Prompt optimization succeeds when variance among system prompts is sufficiently large, but fails when variance among responses dominates. Scaling to more user prompts can hurt optimization by reducing variance among system prompts, especially on heterogeneous datasets where different user prompts favor different system prompts. Selecting a small subset of user prompts with high variance across candidate system prompts allows one to distinguish a good system prompt from a bad one, making system prompt optimization easier.
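A minimal rendering of this decomposition, in notation chosen here rather than taken from the paper (s a candidate system prompt, x a user prompt, y a sampled response, r the scalar reward), is the law of total variance conditioned on the system prompt:

  \mathrm{Var}_{s,y}\big[r(s,x,y)\big]
    = \underbrace{\mathbb{E}_{s}\big[\mathrm{Var}_{y}\, r(s,x,y)\big]}_{\text{variance among responses}}
    + \underbrace{\mathrm{Var}_{s}\big[\mathbb{E}_{y}\, r(s,x,y)\big]}_{\text{variance among system prompts}}

On this reading, optimization is expected to succeed on a training set when the second term dominates the first, which is the regime the paper identifies as amenable to prompt optimization.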

What carries the argument

Decomposition of reward variance into response variance and system prompt variance, together with the filtering procedure that retains only user prompts showing high variance across a set of candidate system prompts.
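A minimal sketch of a filter of this kind, with hypothetical names (p1_filter, score, candidate_systems) and a generic scalar-reward interface assumed rather than the paper's exact procedure:

  import statistics

  def p1_filter(user_prompts, candidate_systems, score, n_samples=8, k=2):
      # Keep the k user prompts whose mean reward varies most across the
      # candidate system prompts. score(system, user) is assumed to return a
      # scalar reward for one sampled response; averaging over n_samples
      # responses damps response-level noise before the spread is measured.
      scored = []
      for user in user_prompts:
          # Mean reward of each candidate system prompt on this user prompt.
          means = [
              statistics.mean(score(system, user) for _ in range(n_samples))
              for system in candidate_systems
          ]
          # Variance among system prompts: a high value means this user prompt
          # separates good candidate system prompts from bad ones.
          scored.append((statistics.pvariance(means), user))
      scored.sort(key=lambda pair: pair[0], reverse=True)
      return [user for _, user in scored[:k]]

The retained subset is then handed to whatever prompt optimizer is in use; how many responses to sample and how large the candidate pool should be are cost trade-offs the sketch leaves open.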

Load-bearing premise

That user prompts selected for high variance across a given set of candidate system prompts will reliably produce a subset that supports better optimization without the selection process introducing bias or depending on the specific candidates.

What would settle it

On a new heterogeneous reasoning dataset, the system prompt optimized from the high-variance subset performs no better than, or worse than, one optimized from the full set of user prompts.

Figures

Figures reproduced from arXiv: 2604.08801 by Bo Liu, Kianté Brantley, Thorsten Joachims, Wen Sun, Yu (Sid) Wang, Zhaolin Gao.

Figure 1: Comparison of p1 against the base model and baseline methods. For all methods, the system prompt is optimized based on AIME 24 and Qwen3-4B-Instruct-2507, and directly applied to these benchmarks and to Qwen3-30B-A3B-Instruct-2507. The results are averaged over 64 generations per user prompt.
Figure 2: Training reward and evaluation accuracy on IFBench and AIME with …
Figure 3: Variances across responses and system prompts for IFBench training set and …
Figure 4: Variance among system prompts vs. training reward improvement when training on one AIME prompt. Prompt learnability correlates with variance among system prompts. To investigate the effect of variance in prompt optimization, we perform prompt optimization on a single AIME 24 prompt at a time, across 10 different prompts. We follow the same setup as in Sec. 3.1, except that we set M = 32 since training is …
Figure 5: Variance among responses and among system prompts for AIME and IFBench.
Figure 6: Examples of learned system prompts from p1 and GEPA on AIME 24 with Qwen3-4B-Instruct-2507. p1 produces a more general reasoning-oriented prompt, while GEPA produces a more task-specific prompt that appears to memorize training-set patterns.
Original abstract

Prompt optimization improves language models without updating their weights by searching for a better system prompt, but its effectiveness varies widely across tasks. We study what makes a task amenable to prompt optimization. We show that the reward variance across different system prompts can be decomposed into two components: variance among responses, which captures generation stochasticity, and variance among system prompts, which captures differences in system prompt quality. Prompt optimization succeeds when variance among system prompts is sufficiently large, but fails when variance among responses dominates the variance of the system prompts. Surprisingly, we further show that scaling to more user prompts can hurt optimization by reducing variance among system prompts, especially on heterogeneous datasets where different user prompts favor different system prompts. Motivated by this insight, we propose $p1$, a simple user prompt filtering method that selects a small subset of user prompts with high variance across candidate system prompts. This subset of user prompts allows one to distinguish a good system prompt from a bad one, making system optimization easier. Experiments on reasoning benchmarks show that $p1$ substantially improves prompt optimization over training on the full dataset and outperforms strong baselines such as GEPA. Notably, training on only two prompts from AIME 24 yields a system prompt that generalizes well to other reasoning benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript analyzes why prompt optimization succeeds or fails for LLMs by decomposing reward variance across system prompts into response stochasticity versus differences in system-prompt quality. It shows that optimization works when system-prompt variance dominates but can be harmed by scaling to more user prompts on heterogeneous data (where different prompts favor different system prompts). The authors introduce p1, a simple filter that selects a small subset of user prompts with high variance across a candidate set of system prompts, and report that this yields better optimization than the full dataset or baselines like GEPA, with strong cross-benchmark generalization from as few as two AIME-24 prompts.

Significance. If the central claims hold, the variance decomposition supplies a useful diagnostic for when prompt optimization is likely to succeed, and p1 offers a practical, low-cost way to improve both efficiency and generalization by training on fewer but more informative user prompts. The empirical demonstration on reasoning benchmarks adds immediate practical value for prompt engineering workflows.

major comments (3)
  1. [§3] §3 (Variance Decomposition): the central claim that prompt optimization succeeds precisely when system-prompt variance exceeds response variance rests on empirical variance estimates, yet the manuscript provides no details on sample sizes per prompt, number of Monte Carlo rollouts, or whether the reported dominance is statistically significant; without these, it is impossible to assess whether the decomposition reliably predicts optimization success.
  2. [§4.3] §4.3 (p1 selection procedure): p1 computes user-prompt variances with respect to a fixed or heuristically generated candidate pool of system prompts. Given the paper’s own observation that heterogeneous datasets contain user prompts favoring different system prompts, the selected subset may be an artifact of the particular candidate pool rather than a general consequence of the variance principle; no ablation on alternative candidate pools or on whether the subset remains informative once the optimizer leaves the initial pool is reported.
  3. [§5] §5 (Experiments): the reported gains (including generalization from two AIME-24 prompts) are presented without run-to-run variance, confidence intervals, or explicit statistical tests against baselines; likewise, data splits, prompt sampling procedures, and controls for prompt length or format are not fully specified, making it difficult to judge whether the improvements are robust or reproducible.
minor comments (2)
  1. [Notation] The exact mathematical definition of the two variance components (response vs. system-prompt) should be stated as an equation in the main text rather than only in prose or appendix.
  2. [Figures] Figures showing variance ratios or selected-prompt performance should include error bars or multiple runs to convey variability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments highlight important areas for improving reproducibility and robustness. We have revised the manuscript accordingly and provide point-by-point responses below.

Point-by-point responses
  1. Referee: [§3] §3 (Variance Decomposition): the central claim that prompt optimization succeeds precisely when system-prompt variance exceeds response variance rests on empirical variance estimates, yet the manuscript provides no details on sample sizes per prompt, number of Monte Carlo rollouts, or whether the reported dominance is statistically significant; without these, it is impossible to assess whether the decomposition reliably predicts optimization success.

    Authors: We agree that these details are necessary to evaluate the reliability of the variance decomposition and its predictive power for optimization success. In the revised manuscript, Section 3 has been expanded to specify the estimation procedure: 20 responses were sampled per user-prompt/system-prompt pair at temperature 0.7, variances were computed via 100 Monte Carlo rollouts, and bootstrap resampling (1000 iterations) was used to obtain 95% confidence intervals. These intervals confirm that system-prompt variance significantly exceeds response variance (p < 0.05) precisely on the datasets and regimes where prompt optimization succeeds, while the reverse holds where it fails. This strengthens the central claim without altering the original analysis. revision: yes

  2. Referee: [§4.3] §4.3 (p1 selection procedure): p1 computes user-prompt variances with respect to a fixed or heuristically generated candidate pool of system prompts. Given the paper’s own observation that heterogeneous datasets contain user prompts favoring different system prompts, the selected subset may be an artifact of the particular candidate pool rather than a general consequence of the variance principle; no ablation on alternative candidate pools or on whether the subset remains informative once the optimizer leaves the initial pool is reported.

    Authors: This concern about potential pool-specific artifacts is well-taken and directly engages the paper’s own discussion of heterogeneity. While the primary candidate pool was constructed for diversity (mix of hand-crafted, prior-work, and perturbed prompts), we have added an ablation in the revised Section 4.3 and Appendix A. It compares p1 subsets derived from three alternative pools: purely random system prompts, GEPA-generated prompts, and an expanded pool of 20 candidates. Across all pools the high-variance user-prompt subset improves optimization over the full dataset, and the selected prompts remain discriminative even after the optimizer moves outside the initial pool, as shown by sustained gains in the cross-benchmark generalization experiments. revision: yes

  3. Referee: [§5] §5 (Experiments): the reported gains (including generalization from two AIME-24 prompts) are presented without run-to-run variance, confidence intervals, or explicit statistical tests against baselines; likewise, data splits, prompt sampling procedures, and controls for prompt length or format are not fully specified, making it difficult to judge whether the improvements are robust or reproducible.

    Authors: We acknowledge that the original experimental reporting lacked sufficient statistical detail and procedural transparency. The revised Section 5 and new Appendix C now report all main results as means over five independent optimization runs, accompanied by standard deviations and 95% confidence intervals. Paired t-tests against each baseline (including GEPA) are provided with p-values. Data splits are explicitly stated (official AIME-24/MATH splits; 70/30 random split for other benchmarks), user-prompt sampling is uniform, and system-prompt length is controlled by truncation to ≤300 tokens with fixed formatting. These additions allow direct assessment of robustness and reproducibility. revision: yes
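    A minimal sketch of the kind of paired, run-level comparison described in this response, with hypothetical names (paired_bootstrap_ci, scores_a, scores_b) and a bootstrap interval standing in for the reported t-tests; it illustrates the check, not the paper's evaluation code:

      import random
      import statistics

      def paired_bootstrap_ci(scores_a, scores_b, n_boot=1000, alpha=0.05):
          # Paired bootstrap CI for the mean per-run difference between two
          # optimized system prompts (e.g., p1 vs. a baseline such as GEPA)
          # evaluated over the same independent runs. An interval entirely
          # above zero supports a real improvement rather than run-to-run noise.
          diffs = [a - b for a, b in zip(scores_a, scores_b)]
          boot_means = []
          for _ in range(n_boot):
              resampled = random.choices(diffs, k=len(diffs))
              boot_means.append(statistics.mean(resampled))
          boot_means.sort()
          lo = boot_means[int(n_boot * alpha / 2)]
          hi = boot_means[int(n_boot * (1 - alpha / 2)) - 1]
          return statistics.mean(diffs), (lo, hi)

    With only five runs, a paired t-test, as the response reports, is the more conventional choice; the bootstrap variant here simply avoids the normality assumption.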

Circularity Check

0 steps flagged

No significant circularity in variance decomposition or prompt filtering heuristic

full rationale

The paper's derivation begins with an empirical decomposition of observed reward variance into response-level stochasticity and system-prompt quality differences, which follows directly from partitioning total variance over sampled (user prompt, system prompt, response) triples rather than from any optimization objective or fitted parameter. The subsequent claim that additional user prompts can dilute cross-prompt variance on heterogeneous data is an observed statistical consequence, not a definitional tautology. The p1 filtering step selects user prompts by high empirical variance across a fixed candidate pool of system prompts; the resulting performance gains on reasoning benchmarks are demonstrated through direct experimentation rather than being forced by the selection criterion itself. No self-citations, ansatzes, or uniqueness theorems are invoked to justify the central claims, and the method remains falsifiable against external benchmarks without reducing to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical validity of the variance decomposition and the assumption that high-variance user prompts on candidate system prompts will generalize to better optimization.

axioms (1)
  • domain assumption Total reward variance decomposes additively into variance among responses and variance among system prompts.
    This decomposition is invoked to determine when prompt optimization succeeds or fails.

pith-pipeline@v0.9.0 · 5536 in / 1273 out tokens · 90696 ms · 2026-05-10T17:20:19.158779+00:00 · methodology

discussion (0)

