SAGE: Stochastic Prompt Optimization via Agent-Guided Exploration

Jinghong Chen; Luka Smyth; Saki Shinoda; Ziyi Zhu

arxiv: 2606.18902 · v1 · pith:BDGZMYK6new · submitted 2026-06-17 · 💻 cs.CL

Pith reviewed 2026-06-26 20:50 UTC · model grok-4.3

classification 💻 cs.CL

keywords prompt optimization · stochastic search · agentic systems · dialogue systems · black-box optimization · A/B testing · mental health chatbot · context engineering

The pith

Coupling qualitative diagnosis with quantitative validation via multi-agent stochastic search makes prompt optimization effective for open-ended dialogue.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats automatic prompt optimization as black-box search because textual gradients do not behave like real gradients. It introduces the SPO framework and tests three strategies of increasing complexity, culminating in SAGE, which uses multiple agents plus code execution to diagnose errors while searching prompt space. Benchmarks show that no strategy wins outright; results hinge on how the task landscape interacts with the kinds of errors present. In a live mental-health chatbot, SAGE runs repeated A/B tests and compounds their noisy signals into a clear lift in next-day retention. The core argument is that pairing diagnostic insight with measured outcomes is what allows agentic search to succeed on open-ended, task-oriented dialogue.

Core claim

SAGE performs stochastic prompt optimization through a multi-agent pipeline that executes diagnostic code; this couples error diagnosis with quantitative A/B validation, enabling the system to compound individually noisy test cycles into statistically robust retention gains when applied continuously to an open-ended mental-health dialogue task.

What carries the argument

SAGE multi-agent pipeline with diagnostic code execution, which conducts agent-guided exploration over prompt space while generating and testing qualitative diagnoses.

If this is right

Effectiveness of any prompt-search strategy depends on the interaction between the task landscape structure and the dominant error types.
Running optimization as a continuous sequence of A/B tests allows noisy individual results to compound into reliable performance lifts.
Agentic methods that generate and act on qualitative diagnoses outperform purely quantitative search on open-ended dialogue tasks.
Black-box stochastic search can improve deployed dialogue systems without any parameter updates.
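The compounding claim in the second point above can be made concrete with a small sketch. It assumes each cycle produces a retention delta with a standard error, and pools them by inverse-variance weighting. The abstract does not say which test the authors use, and cycles measured against different incumbents may not be directly poolable, so the pooling rule, the function name `pool_deltas`, and every number below are hypothetical.

```python
import math

def pool_deltas(deltas, std_errs):
    """Inverse-variance (fixed-effect) pooling of per-cycle deltas.

    Returns the pooled delta, its standard error, and the z-score.
    This is an illustrative choice, not the paper's stated analysis.
    """
    if len(deltas) != len(std_errs) or not deltas:
        raise ValueError("need equally sized, non-empty inputs")
    weights = [1.0 / (se * se) for se in std_errs]
    total = sum(weights)
    pooled = sum(w * d for w, d in zip(weights, deltas)) / total
    pooled_se = math.sqrt(1.0 / total)
    return pooled, pooled_se, pooled / pooled_se

# Hypothetical numbers: eight cycles, each a +1 point delta with SE 1.5 points.
deltas = [0.01] * 8
std_errs = [0.015] * 8
pooled, pooled_se, z = pool_deltas(deltas, std_errs)
single_z = deltas[0] / std_errs[0]
print(f"single-cycle z = {single_z:.2f}, pooled z = {z:.2f}")
```

With identical deltas and standard errors, the pooled z-score is the single-cycle z-score multiplied by the square root of the number of cycles, which is the sense in which individually inconclusive cycles can add up. It also shows why pre-specification matters: the same arithmetic inflates significance if unfavourable cycles are dropped.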

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same diagnosis-plus-validation loop could be applied to prompt optimization in other open-ended domains such as customer support or tutoring.
Production systems might adopt automated cycles of prompt search and live A/B testing to reduce reliance on manual engineering.
Future deployments could test whether the approach remains effective when the underlying model or user population changes.
The method points toward hybrid human-AI loops where agents surface interpretable reasons for prompt changes that humans can review.

Load-bearing premise

The A/B-test retention gains observed in the mental-health chatbot deployment are produced by the prompt changes rather than by external time trends or selective choice of which cycles to report.

What would settle it

A controlled replication of the deployment in which next-day retention shows no net improvement once external calendar effects are removed and all cycles are included without post-hoc selection.

Figures

Figures reproduced from arXiv: 2606.18902 by Jinghong Chen, Luka Smyth, Saki Shinoda, Ziyi Zhu.

**Figure 2.** Best training accuracy over K=10 iterations (mean with shaded IQR across runs). Iteration 1 shows the initial prompt before optimization.

**Figure 4.** Continuous optimization of D1 retention on …

**Figure 5.** Validation accuracy over K=10 iterations (mean with shaded IQR). On AppWorld (c), SPO-GA's validation accuracy declines sharply in later iterations even as training accuracy continues to rise, yet its best-validation prompt achieves the highest test accuracy, suggesting that early stopping matters more than final convergence on this task.

**Figure 6.** Best prompt token count over iterations (mean with shaded IQR).

**Figure 7.** Per-cycle A/B delta for the promoted arm with …

Abstract

Context engineering has emerged as a primary lever for improving AI systems without parameter updates. Recent work showing that textual gradients do not function as real gradients motivates treating automatic prompt optimization (APO) as black-box search. We introduce SPO (Stochastic Prompt Optimization), a framework for stochastic search over prompt space, and compare three strategies of increasing sophistication: error-informed random search, a genetic algorithm with evolutionary operators, and SAGE (SPO via Agent-Guided Exploration), a multi-agent pipeline with diagnostic code execution. Across three benchmarks, no single strategy dominates; effectiveness depends on the interaction of landscape structure with error type. We further deploy SAGE on a mental-health chatbot under a continuous optimization paradigm, where it compounds eight cycles of individually-noisy A/B tests into a statistically robust gain in next-day retention. We argue that coupling qualitative diagnosis with quantitative validation is what makes agentic optimization effective for open-ended task-oriented dialogue.
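To make "black-box search over prompt space" tangible, here is a minimal keep-the-best search loop. It is a sketch under loud assumptions: the proposal step is plain random mutation rather than the error-informed, genetic, or agent-guided operators the abstract names, and `stochastic_prompt_search`, `propose`, `score`, `POOL`, and `USEFUL` are all hypothetical stand-ins for an LLM-backed proposer and a real evaluation set.

```python
import random

def stochastic_prompt_search(initial, propose, score,
                             iterations=10, candidates=4, seed=0):
    """Propose variants of the current best prompt and keep any improvement."""
    rng = random.Random(seed)
    best, best_score = initial, score(initial)
    history = [best_score]  # best score before optimization, then per iteration
    for _ in range(iterations):
        for _ in range(candidates):
            cand = propose(best, rng)
            cand_score = score(cand)
            if cand_score > best_score:
                best, best_score = cand, cand_score
        history.append(best_score)
    return best, best_score, history

# Toy stand-ins (hypothetical): a prompt is a tuple of instruction strings,
# and the "evaluation" rewards two specific instructions.
POOL = ["cite sources", "answer step by step", "be brief", "ask a follow-up"]
USEFUL = {"answer step by step", "ask a follow-up"}

def propose(prompt, rng):
    # Randomly toggle one instruction from the pool on or off.
    item = rng.choice(POOL)
    if item in prompt:
        return tuple(p for p in prompt if p != item)
    return prompt + (item,)

def score(prompt):
    # Fraction of useful instructions present, minus a small length penalty.
    return len(USEFUL & set(prompt)) / len(USEFUL) - 0.05 * len(prompt)

best, best_score, history = stochastic_prompt_search((), propose, score)
print(best, round(best_score, 2), history)
```

The loop never touches model parameters and only needs a score per candidate, which is what lets the same skeleton host different proposal strategies.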

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SAGE layers a multi-agent diagnostic step onto stochastic prompt search and shows retention gains in one chatbot deployment, but the results do not separate the agent contribution from repeated testing.

The paper's main concrete output is a deployed system that runs eight cycles of A/B tests on a mental-health chatbot and reports a next-day retention improvement. It also compares three black-box search strategies on three benchmarks and finds that none dominates across the board.

The comparison is useful because it shows effectiveness depends on the prompt landscape and error type. Treating prompt optimization as search rather than gradient descent is already in the literature, but the specific multi-agent pipeline with code-execution diagnostics is a new combination they actually ran.

The deployment section is the weakest part. The claim that qualitative diagnosis plus quantitative validation is what makes the method work rests on those eight noisy cycles, yet the abstract supplies no sample sizes, test statistics, pre-specification details, or checks for external retention drivers. The gains could come from simply running more tests or from post-hoc selection. The stress-test note is right on this point.

This is for teams already doing prompt work on production dialogue systems who want an off-the-shelf search recipe. It is not reshaping the field. A serious referee could ask for the missing statistical controls and an ablation that turns the diagnostic agents on and off; without those the central argument stays under-supported. I would send it to review rather than desk-reject.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Stochastic Prompt Optimization (SPO) as a black-box search framework for automatic prompt optimization (APO), motivated by limitations of textual gradients. It compares three strategies of increasing complexity—error-informed random search, a genetic algorithm with evolutionary operators, and SAGE (a multi-agent pipeline incorporating diagnostic code execution)—across three benchmarks. No single strategy dominates; performance depends on interactions between landscape structure and error type. The paper further deploys SAGE in a continuous-optimization setting on a mental-health chatbot, where eight cycles of noisy A/B tests compound into a reported statistically robust gain in next-day retention, arguing that coupling qualitative diagnosis with quantitative validation is key to effective agentic optimization for open-ended task-oriented dialogue.

Significance. If the benchmark interactions and deployment results hold under rigorous statistical controls, the work would usefully demonstrate that agent-guided exploration can compound noisy signals in open-ended dialogue settings where simpler search methods fall short. The explicit finding that effectiveness is landscape- and error-dependent is a strength, as is the shift from single-shot optimization to continuous deployment.

major comments (2)

[Deployment section] Deployment section: the claim that eight A/B cycles compound into a 'statistically robust' next-day retention gain is load-bearing for the central argument that qualitative diagnosis (rather than repeated A/B testing alone) drives the result. The manuscript provides no per-cycle sample sizes, exact hypothesis test, multiple-comparison correction, pre-specification of cycles, or controls for external retention drivers; without these, the observed gain cannot be isolated from post-hoc selection or unmodeled confounders.
[Benchmark experiments] Benchmark results (across the three tasks): the statement that 'no single strategy dominates' and that effectiveness depends on landscape/error-type interactions is presented as a key empirical takeaway, yet the manuscript does not report quantitative measures of landscape structure (e.g., modality, noise level, or basin geometry) that would allow readers to reproduce or generalize the interaction claim.
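One of the controls requested in the first major comment is a multiple-comparison correction. As an illustration of what such a control looks like, the sketch below applies the Holm step-down procedure to a set of per-arm p-values. Holm is our choice for the example; the manuscript is not known to use it, and the p-values are hypothetical.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure: returns a list of booleans (True = reject)."""
    m = len(p_values)
    # Visit hypotheses from the smallest p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, every larger p-value fails too
    return reject

p_values = [0.004, 0.030, 0.020, 0.200]  # hypothetical per-arm p-values
print(holm_reject(p_values))  # [True, False, False, False]
```

Two of these arms would pass an uncorrected 0.05 threshold but not the corrected one, which is the kind of difference the referee is asking the authors to rule out.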

minor comments (2)

[Method] Notation for the three SPO variants is introduced in the abstract but not consistently carried through the method section; a single table mapping names to algorithmic components would improve clarity.
[Introduction] The abstract states that 'textual gradients do not function as real gradients' but cites no specific prior result or section where this is demonstrated or referenced.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below, indicating where revisions will be made to strengthen the manuscript.

Referee: [Deployment section] Deployment section: the claim that eight A/B cycles compound into a 'statistically robust' next-day retention gain is load-bearing for the central argument that qualitative diagnosis (rather than repeated A/B testing alone) drives the result. The manuscript provides no per-cycle sample sizes, exact hypothesis test, multiple-comparison correction, pre-specification of cycles, or controls for external retention drivers; without these, the observed gain cannot be isolated from post-hoc selection or unmodeled confounders.

Authors: We agree that the deployment section requires additional statistical detail to support the claim. In the revised manuscript we will report per-cycle sample sizes, the exact hypothesis test and any multiple-comparison corrections applied, confirmation that the eight cycles were pre-specified, and a discussion of potential external retention drivers. These additions will clarify how the compounding effect is isolated from the agent-guided process. revision: yes
Referee: [Benchmark experiments] Benchmark results (across the three tasks): the statement that 'no single strategy dominates' and that effectiveness depends on landscape/error-type interactions is presented as a key empirical takeaway, yet the manuscript does not report quantitative measures of landscape structure (e.g., modality, noise level, or basin geometry) that would allow readers to reproduce or generalize the interaction claim.

Authors: The claim that no single strategy dominates is grounded in the empirical performance patterns observed across the three benchmarks, where relative effectiveness varied systematically with task and error characteristics. We maintain that these results suffice to demonstrate the interaction without explicit quantitative landscape descriptors; adding such metrics would require new experiments outside the current scope. We therefore do not intend to revise this section. revision: no

Circularity Check

0 steps flagged

No derivations or fitted quantities; empirical results are independent of method definition

The paper introduces SPO/SAGE as black-box search strategies and reports benchmark comparisons plus an A/B deployment outcome. No equations, parameters, or predictions appear that reduce to inputs by construction. The central argument rests on observed retention gains from external A/B tests rather than any self-referential fit or self-citation chain. The work is therefore self-contained against its stated empirical benchmarks with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only view yields no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated premise that the three benchmarks and the chatbot A/B tests constitute valid evaluations of prompt-optimization effectiveness.

pith-pipeline@v0.9.1-grok · 5687 in / 1055 out tokens · 23069 ms · 2026-06-26T20:50:59.736923+00:00 · methodology

Reference graph

Works this paper leans on

17 extracted references · 3 linked inside Pith

[1]

In Advances in Neural Information Processing Systems (NeurIPS)

Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems (NeurIPS). James C. Spall. 2003. Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control. Wiley. Mirac Suzgun, Mert Yuksekgonul, Federico Bianchi, Dan Jurafsky, and James Zou. 2026. Dynamic cheatsheet: Te...

Pith/arXiv arXiv 2003
[2]

In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

AppWorld: A controllable world of apps and people for benchmarking interactive coding agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Xingchen Wan, Ruoxi Sun, Hootan Nakhost, and Sercan O. Arik. 2024. Teach better or show smarter? on instructions and exemplars in automatic promp...

Pith/arXiv arXiv 2024
[3]

differentiation

Large language models as optimizers. In International Conference on Learning Representations (ICLR). Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR). Haoran Ye, Xuning He, Vincent Ar...

arXiv 2023
[4]

ICLR 2026

Agentic context engineering: Evolving contexts for self-improving language models. Preprint, arXiv:2510.04618. ICLR 2026. Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. 2023. Large language models are human-level prompt engineers. In International Conference on Learning Representations (ICLR). Ziyi ...

Pith/arXiv arXiv 2026
[5]

Automated overview: Run analysis scripts with --prompt all for structured error breakdowns across all prompts
[6]

Which prompts have this error? What instruction in the better prompt prevented it?

Cross-prompt comparison: For each error category, check how each prompt performs. Which prompts have this error? What instruction in the better prompt prevented it?
[7]

Count exact occurrences

Deep-dive: Examine 5-10 specific error cases, focusing on the model's reasoning trace. Count exact occurrences
[8]

Hypothesis formation: Each hypothesis must be distinct and include: observation (with sample indices and counts), cross-prompt comparison, mechanism, predicted fix, suggested parent prompt, and estimated impact. ## Output A numbered list of exactly {num_max_hypotheses} hypotheses ordered by estimated impact, each with: observation, cross-prompt note, hyp...
[9]

Quantify precisely: Count exactly how many error cases match this pattern using scripts or targeted one-liners
[10]

Quote the exact reasoning showing where the model went wrong

Examine reasoning traces: Read 5-10 specific error cases. Quote the exact reasoning showing where the model went wrong
[11]

Check correct cases: Do correctly-answered samples of the same type avoid this pattern?
[12]

If so, what instruction helped?

Compare across prompts: Check if other prompts handle this error better. If so, what instruction helped?
[13]

if X then Y

Assess the fix: Would the proposed change actually help? Are there edge cases where it might hurt? ## Output Verdict (Supported/Partially supported/Not supported), evidence with concrete examples, cross-prompt comparison, exact prevalence count, confidence level, implementation-ready recommended fix, and suggested parent prompt. Generator. Produces a c...
[14]

Deploy the current best prompt to real users and collect production conversations
[15]

Run SAGE on the collected conversations to identify failure patterns and generate Q improved prompt candidates
[16]

A/B test all candidates alongside the current best prompt simultaneously, running for ∼48 hours to capture full D1 retention results
[17]

why do I keep doing this?

Promote the best-performing prompt as the new main; repeat. A/B test analysis. Within a cycle, every arm a is assigned a disjoint, randomly-sampled slice of incoming users, and we record its mean D1 retention μ_a (a rate in [0, 1]) together with the standard error σ_a of that mean over its n_a enrolled users. We report each test arm relative to the incumben...

1990
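The A/B test analysis excerpt in entry [17] describes a per-arm mean D1 retention and its standard error, reported relative to the incumbent. The excerpt is truncated before it says how the comparison is computed, so the following is only a sketch of one conventional reading: a binomial standard error per arm and a difference of independent means. The function names and all counts are hypothetical.

```python
import math

def arm_summary(retained, enrolled):
    """Mean D1 retention and the binomial standard error of that mean."""
    if enrolled <= 0:
        raise ValueError("an arm needs at least one enrolled user")
    mu = retained / enrolled
    sigma = math.sqrt(mu * (1.0 - mu) / enrolled)
    return mu, sigma

def delta_vs_incumbent(arm, incumbent):
    """Difference in retention, its standard error, and the z-score.

    Assumes the arms are independent, which matches disjoint random slices.
    """
    (mu_a, sigma_a), (mu_0, sigma_0) = arm, incumbent
    delta = mu_a - mu_0
    se = math.sqrt(sigma_a ** 2 + sigma_0 ** 2)
    z = delta / se if se > 0 else float("nan")
    return delta, se, z

# Hypothetical counts for one cycle.
incumbent = arm_summary(retained=300, enrolled=1000)
candidate = arm_summary(retained=330, enrolled=1000)
delta, se, z = delta_vs_incumbent(candidate, incumbent)
print(f"delta = {delta:+.3f}, se = {se:.4f}, z = {z:.2f}")
```

With these counts, a three-point lift on 1,000 users per arm lands well short of conventional significance, which is consistent with the abstract's description of individual cycles as noisy.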
