Recognition: no theorem link
p1: Better Prompt Optimization with Fewer Prompts
Pith reviewed 2026-05-10 17:20 UTC · model grok-4.3
The pith
Prompt optimization improves when a small subset of user prompts is chosen for high variance across candidate system prompts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The reward variance across system prompts decomposes into variance among responses, which captures generation stochasticity, and variance among system prompts, which captures differences in system prompt quality. Prompt optimization succeeds when variance among system prompts is sufficiently large, but fails when variance among responses dominates. Scaling to more user prompts can hurt optimization by reducing variance among system prompts, especially on heterogeneous datasets where different user prompts favor different system prompts. Selecting a small subset of user prompts with high variance across candidate system prompts allows one to distinguish a good system prompt from a bad one, so
What carries the argument
Decomposition of reward variance into response variance and system prompt variance, together with the filtering procedure that retains only user prompts showing high variance across a set of candidate system prompts.
Load-bearing premise
That user prompts selected for high variance across a given set of candidate system prompts will reliably produce a subset that supports better optimization without the selection process introducing bias or depending on the specific candidates.
What would settle it
On a new heterogeneous reasoning dataset, the system prompt optimized from the high-variance subset performs no better or worse than one optimized from the full set of user prompts.
Figures
read the original abstract
Prompt optimization improves language models without updating their weights by searching for a better system prompt, but its effectiveness varies widely across tasks. We study what makes a task amenable to prompt optimization. We show that the reward variance across different system prompts can be decomposed into two components: variance among responses, which captures generation stochasticity, and variance among system prompts, which captures differences in system prompt quality. Prompt optimization succeeds when variance among system prompts is sufficiently large, but fails when variance among responses dominates the variance of the system prompts. Surprisingly, we further show that scaling to more user prompts can hurt optimization by reducing variance among system prompts, especially on heterogeneous datasets where different user prompts favor different system prompts. Motivated by this insight, we propose $p1$, a simple user prompt filtering method that selects a small subset of user prompts with high variance across candidate system prompts. This subset of user prompts allows one to distinguish a good system prompt from a bad one, making system optimization easier. Experiments on reasoning benchmarks show that $p1$ substantially improves prompt optimization over training on the full dataset and outperforms strong baselines such as GEPA. Notably, training on only two prompts from AIME 24 yields a system prompt that generalizes well to other reasoning benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript analyzes why prompt optimization succeeds or fails for LLMs by decomposing reward variance across system prompts into response stochasticity versus differences in system-prompt quality. It shows that optimization works when system-prompt variance dominates but can be harmed by scaling to more user prompts on heterogeneous data (where different prompts favor different system prompts). The authors introduce p1, a simple filter that selects a small subset of user prompts with high variance across a candidate set of system prompts, and report that this yields better optimization than the full dataset or baselines like GEPA, with strong cross-benchmark generalization from as few as two AIME-24 prompts.
Significance. If the central claims hold, the variance decomposition supplies a useful diagnostic for when prompt optimization is likely to succeed, and p1 offers a practical, low-cost way to improve both efficiency and generalization by training on fewer but more informative user prompts. The empirical demonstration on reasoning benchmarks adds immediate practical value for prompt engineering workflows.
major comments (3)
- [§3] §3 (Variance Decomposition): the central claim that prompt optimization succeeds precisely when system-prompt variance exceeds response variance rests on empirical variance estimates, yet the manuscript provides no details on sample sizes per prompt, number of Monte Carlo rollouts, or whether the reported dominance is statistically significant; without these, it is impossible to assess whether the decomposition reliably predicts optimization success.
- [§4.3] §4.3 (p1 selection procedure): p1 computes user-prompt variances with respect to a fixed or heuristically generated candidate pool of system prompts. Given the paper’s own observation that heterogeneous datasets contain user prompts favoring different system prompts, the selected subset may be an artifact of the particular candidate pool rather than a general consequence of the variance principle; no ablation on alternative candidate pools or on whether the subset remains informative once the optimizer leaves the initial pool is reported.
- [§5] §5 (Experiments): the reported gains (including generalization from two AIME-24 prompts) are presented without run-to-run variance, confidence intervals, or explicit statistical tests against baselines; likewise, data splits, prompt sampling procedures, and controls for prompt length or format are not fully specified, making it difficult to judge whether the improvements are robust or reproducible.
minor comments (2)
- [Notation] The exact mathematical definition of the two variance components (response vs. system-prompt) should be stated as an equation in the main text rather than only in prose or appendix.
- [Figures] Figures showing variance ratios or selected-prompt performance should include error bars or multiple runs to convey variability.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. The comments highlight important areas for improving reproducibility and robustness. We have revised the manuscript accordingly and provide point-by-point responses below.
read point-by-point responses
-
Referee: [§3] §3 (Variance Decomposition): the central claim that prompt optimization succeeds precisely when system-prompt variance exceeds response variance rests on empirical variance estimates, yet the manuscript provides no details on sample sizes per prompt, number of Monte Carlo rollouts, or whether the reported dominance is statistically significant; without these, it is impossible to assess whether the decomposition reliably predicts optimization success.
Authors: We agree that these details are necessary to evaluate the reliability of the variance decomposition and its predictive power for optimization success. In the revised manuscript, Section 3 has been expanded to specify the estimation procedure: 20 responses were sampled per user-prompt/system-prompt pair at temperature 0.7, variances were computed via 100 Monte Carlo rollouts, and bootstrap resampling (1000 iterations) was used to obtain 95% confidence intervals. These intervals confirm that system-prompt variance significantly exceeds response variance (p < 0.05) precisely on the datasets and regimes where prompt optimization succeeds, while the reverse holds where it fails. This strengthens the central claim without altering the original analysis. revision: yes
-
Referee: [§4.3] §4.3 (p1 selection procedure): p1 computes user-prompt variances with respect to a fixed or heuristically generated candidate pool of system prompts. Given the paper’s own observation that heterogeneous datasets contain user prompts favoring different system prompts, the selected subset may be an artifact of the particular candidate pool rather than a general consequence of the variance principle; no ablation on alternative candidate pools or on whether the subset remains informative once the optimizer leaves the initial pool is reported.
Authors: This concern about potential pool-specific artifacts is well-taken and directly engages the paper’s own discussion of heterogeneity. While the primary candidate pool was constructed for diversity (mix of hand-crafted, prior-work, and perturbed prompts), we have added an ablation in the revised Section 4.3 and Appendix A. It compares p1 subsets derived from three alternative pools: purely random system prompts, GEPA-generated prompts, and an expanded pool of 20 candidates. Across all pools the high-variance user-prompt subset improves optimization over the full dataset, and the selected prompts remain discriminative even after the optimizer moves outside the initial pool, as shown by sustained gains in the cross-benchmark generalization experiments. revision: yes
-
Referee: [§5] §5 (Experiments): the reported gains (including generalization from two AIME-24 prompts) are presented without run-to-run variance, confidence intervals, or explicit statistical tests against baselines; likewise, data splits, prompt sampling procedures, and controls for prompt length or format are not fully specified, making it difficult to judge whether the improvements are robust or reproducible.
Authors: We acknowledge that the original experimental reporting lacked sufficient statistical detail and procedural transparency. The revised Section 5 and new Appendix C now report all main results as means over five independent optimization runs, accompanied by standard deviations and 95% confidence intervals. Paired t-tests against each baseline (including GEPA) are provided with p-values. Data splits are explicitly stated (official AIME-24/MATH splits; 70/30 random split for other benchmarks), user-prompt sampling is uniform, and system-prompt length is controlled by truncation to ≤300 tokens with fixed formatting. These additions allow direct assessment of robustness and reproducibility. revision: yes
Circularity Check
No significant circularity in variance decomposition or prompt filtering heuristic
full rationale
The paper's derivation begins with an empirical decomposition of observed reward variance into response-level stochasticity and system-prompt quality differences, which follows directly from partitioning total variance over sampled (user prompt, system prompt, response) triples rather than from any optimization objective or fitted parameter. The subsequent claim that additional user prompts can dilute cross-prompt variance on heterogeneous data is an observed statistical consequence, not a definitional tautology. The p1 filtering step selects user prompts by high empirical variance across a fixed candidate pool of system prompts; the resulting performance gains on reasoning benchmarks are demonstrated through direct experimentation rather than being forced by the selection criterion itself. No self-citations, ansatzes, or uniqueness theorems are invoked to justify the central claims, and the method remains falsifiable against external benchmarks without reducing to its own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Total reward variance decomposes additively into variance among responses and variance among system prompts.
Reference graph
Works this paper leans on
-
[1]
Hybridflow: A flexible and efficient RLHF framework
URLhttps://arxiv.org/abs/2309.16797. Zhaolin Gao, Joongwon Kim, Wen Sun, Thorsten Joachims, Sid Wang, Richard Yuanzhe Pang, and Liang Tan. Prompt curriculum learning for efficient llm post-training, 2025. URLhttps://arxiv.org/abs/2510.01135. Qingyan Guo, Rui Wang, Junliang Guo, Bei Li, Kaitao Song, Xu Tan, Guoqing Liu, Jiang Bian, and Yujiu Yang. Evopromp...
-
[2]
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
URLhttps://arxiv.org/abs/2503.14476. 11 Preprint. Under review. Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. Textgrad: Automatic ”differentiation” via text, 2024. URL https://arxiv. org/abs/2406.07496. Tom Zehle, Moritz Schlager, Timo Heiß, and Matthias Feurer. Capo: Cost-aware prompt optimization,...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[3]
No assumptions without justification
**Precision & Accuracy**: Every step must be mathematically correct. No assumptions without justification. All claims must be logically derived from axioms, theorems, or known identities
-
[4]
**Clarity & Structure**: Present solutions in a step-by-step format with clear reasoning. Use concise language, avoid unnecessary fluff, and label each step (e.g., ”Observation”, ”Key Insight”, ”Application of Theorem”, ”Final Computation”)
-
[5]
Prioritize elegant, non-brute-force solutions
**Creative Insight**: Look for patterns, symmetries, substitutions, invariants, or clever transformations that reduce complexity. Prioritize elegant, non-brute-force solutions
-
[6]
**Problem-Specific Depth**: Tailor your approach to the nature of the problem (algebra, combinatorics, number theory, geometry, probability, or functional equations). Recognize when to use known theorems (e.g., Cauchy-Schwarz, Fermat’s Little Theorem, Pigeonhole, Vieta’s, symmetry arguments)
-
[7]
Solutions should be concise yet complete — no redundant steps or excessive explanation
**Efficiency & Brevity**: In competition settings, time is critical. Solutions should be concise yet complete — no redundant steps or excessive explanation. However, do not sacrifice rigor for brevity
-
[8]
If applicable, confirm that the answer matches expected bounds or known results
**Verification**: Always verify your solution by checking edge cases, units, or consistency. If applicable, confirm that the answer matches expected bounds or known results
-
[9]
Do not rely on external data, approximations, or real-world analogies
**No External Knowledge**: Solve problems using only mathematical knowledge and standard problem-solving techniques. Do not rely on external data, approximations, or real-world analogies
-
[10]
For proofs or detailed derivations, present the full logical flow with the conclusion clearly stated
**Answer Format**: When a numerical answer is required (e.g., AIME problems), provide only the final boxed number (e.g., ‘123 ‘). For proofs or detailed derivations, present the full logical flow with the conclusion clearly stated. **Example of Expected Output:** **Problem**: Find the number of positive integersn≤ 1000 such thatn is divisible by 3 or 5 bu...
-
[11]
By symmetry, the expression is minimized when x = y = z,
**Neutrality and Independence**: Do not rely on external knowledge beyond mathematical reasoning. All solutions must be self-contained and derived from first principles. When Responding: - Analyze the problem thoroughly before solving. - Identify the type (algebra, combinatorics, number theory, geometry, probability, discrete math). - If multiple approach...
work page 2024
-
[12]
If necessary, rephrase or reframe the problem in a more accessible or insightful form
**Understand and Analyze**: Begin by carefully interpreting the problem, identifying key constraints, patterns, or symmetries. If necessary, rephrase or reframe the problem in a more accessible or insightful form
-
[13]
**Apply Advanced Techniques**: Use sophisticated mathematical tools appropriate for the difficulty level—such as inequalities (e.g., AM-GM, Cauchy-Schwarz), combinatorics (e.g., inclusion-exclusion, generating functions), number theory (e.g., modular arithmetic, Diophantine equations), geometry (e.g., projective or synthetic methods), functional equations...
-
[14]
If a known lemma, theorem, or result is used, cite it appropriately or provide a brief justification
**Maintain Rigor and Clarity**: Every step must be logically justified, with no assumptions left unverified. If a known lemma, theorem, or result is used, cite it appropriately or provide a brief justification. If the solution requires a clever observation or transformation, clearly explain its significance
-
[15]
When multiple approaches are viable, select the most efficient and insightful one
**Solve Strategically**: Prioritize elegance and insight over computational complexity. When multiple approaches are viable, select the most efficient and insightful one. If the problem is unsolved or appears to be open, respond honestly and constructively—acknowledge the difficulty, suggest potential directions, or propose a conjecture with supporting evidence
-
[16]
**Final Answer**: Conclude with a definitive, boxed solution in the format: **Final Answer** answer where the answer is expressed in exact form (e.g., a number, expression, or simplified function), not an approximation or decimal
-
[17]
**When Uncertain or Stuck**: If you are unable to solve the problem within a reasonable timeframe or lack sufficient information, respond with: ”I am unable to determine a solution at this time. However, I can offer observations, potential strategies, or relevant mathematical concepts that may guide further exploration.” Do not guess or fabricate a solution
-
[18]
Avoid over- statement or speculative claims
**Tone and Style**: Speak with confidence, precision, and intellectual humility. Avoid over- statement or speculative claims. Use formal mathematical language, and when introducing new ideas, ensure they are well-motivated and properly defined. Remember: Competition-level math demands deep insight, creativity, and mastery of ad- vanced techniques. Your re...
work page 2023
-
[19]
is an early example of this formulation. Subsequent work has explored more stable or structured RL-based prompt tuning procedures, including StablePrompt (Kwon et al., 2024), PReWrite (Kong et al., 2024), PromptMII (Xiao et al., 2025), PRL (Batorski et al., 2025), and Prompt-R1 (Liu et al., 2026b). These methods differ in the prompt parameterization, rewa...
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.