pith. machine review for the scientific record.

arxiv: 2604.06699 · v1 · submitted 2026-04-08 · 💻 cs.CL · cs.LG

Recognition: no theorem link

Adaptive Prompt Structure Factorization: A Framework for Self-Discovering and Optimizing Compositional Prompt Programs

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:53 UTC · model grok-4.3

classification 💻 cs.CL · cs.LG

keywords prompt optimization · semantic factorization · interventional updates · compositional prompts · LLM reasoning · adaptive optimization · factor discovery

The pith

By factoring prompts into semantic components and updating them individually through interventions, aPSF achieves higher accuracy at substantially lower optimization cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Adaptive Prompt Structure Factorization (aPSF) to make automated prompt optimization more efficient for large language models. Instead of editing whole prompts, an Architect model identifies distinct semantic factors within the prompt. Each factor then receives targeted updates based on how changing it alone shifts validation performance and on the currently dominant error type. The result is higher accuracy on reasoning tasks together with large savings in the number of tokens spent on optimization.

Core claim

aPSF employs an Architect model to discover task-specific prompt structures as semantic factors. It then applies interventional single-factor updates, where each factor's marginal contribution is estimated by changes in validation performance, and error-guided selection directs updates to the dominant failure source. This compositional approach outperforms monolithic prompt optimizers on reasoning benchmarks.
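To make the mechanism concrete, here is a minimal Python sketch of the two-phase loop as the review describes it: discover factors, then repeat error-guided selection and interventional single-factor updates. Every callable name here (architect_factorize, diagnose_errors, propose_edit, worker_accuracy) is a hypothetical stand-in for the paper's Architect and Worker calls, not the authors' released code.

```python
# A minimal sketch of the aPSF loop described above; all callables are
# hypothetical stand-ins for the paper's Architect/Worker API calls.
from typing import Callable

def apsf_optimize(
    initial_prompt: str,
    architect_factorize: Callable[[str], list[str]],          # Architect: prompt -> semantic factors
    diagnose_errors: Callable[[list[str]], tuple[str, int]],  # dominant error type + blamed factor index
    propose_edit: Callable[[int, list[str], str], str],       # Architect: new text for one factor
    worker_accuracy: Callable[[str], float],                  # Worker + evaluator: prompt -> validation accuracy
    steps: int = 5,
) -> list[str]:
    # Phase 1: structure discovery.
    factors = architect_factorize(initial_prompt)
    best_acc = worker_accuracy("\n\n".join(factors))

    # Phase 2: iterative factor optimization.
    for _ in range(steps):
        # Error-guided selection: route the update to the dominant failure source.
        error_type, k = diagnose_errors(factors)
        candidate = propose_edit(k, factors, error_type)

        # Interventional single-factor update: change factor k alone, freeze
        # the rest, and credit it with the resulting validation delta.
        trial = factors[:k] + [candidate] + factors[k + 1:]
        delta = worker_accuracy("\n\n".join(trial)) - best_acc
        if delta > 0:  # keep the edit only if validation accuracy improves
            factors, best_acc = trial, best_acc + delta
    return factors
```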

What carries the argument

The Architect model for self-discovering semantic factors together with interventional factor-level scoring that isolates each factor's contribution.

Load-bearing premise

The factors discovered by the Architect model represent truly independent components of the prompt, allowing their individual effects on performance to be measured and optimized in isolation.

What would settle it

A controlled test in which changing one discovered factor produces performance shifts that cannot be isolated to that factor alone, or in which aPSF shows no reliable accuracy gain over strong baselines across repeated trials.
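A minimal sketch of that test, assuming arbitrary factor combinations can be re-evaluated on the validation set: if the discovered factors were truly independent, the joint effect of editing two factors would roughly equal the sum of their single-factor effects, so a large residual signals interaction and hence misattributed credit. The edits dict and accuracy callable are illustrative, not from the paper.

```python
# A sketch of a pairwise interaction check; residuals near zero are
# consistent with the independence premise, large residuals refute it.
from itertools import combinations
from typing import Callable

def interaction_residuals(
    factors: list[str],
    edits: dict[int, str],                   # hypothetical candidate edit per factor index
    accuracy: Callable[[list[str]], float],  # validation accuracy of a composed prompt
) -> dict[tuple[int, int], float]:
    base = accuracy(factors)

    def with_edits(idxs: tuple[int, ...]) -> float:
        # Apply the candidate edits at the given indices, freezing the rest.
        trial = [edits[i] if i in idxs else f for i, f in enumerate(factors)]
        return accuracy(trial)

    # Single-factor deltas, as in the paper's interventional scoring.
    single = {i: with_edits((i,)) - base for i in edits}
    # Residual = joint delta minus the sum of single deltas (~0 under independence).
    return {
        (i, j): with_edits((i, j)) - base - single[i] - single[j]
        for i, j in combinations(sorted(edits), 2)
    }
```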

Figures

Figures reproduced from arXiv: 2604.06699 by Haoran Shou, Haoyue Liu, Xiaoying Tang, Yongxin Guo, Zhichao Wang.

Figure 1. Monolithic prompt optimization vs. aPSF. Top: monolithic API-only prompt optimizers iteratively edit a single prompt. Bottom: aPSF decomposes the prompt into semantic factors and updates selected factors while freezing the rest. view at source ↗
Figure 2. Overview of aPSF. Given a dataset of sampled examples and an optional initial prompt, aPSF proceeds in two phases. (1) Structure Discovery: the Architect LLM analyzes the task and decomposes the prompt into K semantic factors {F_k}, k = 1, …, K (task-specific factorization, rather than a fixed scaffold). (2) Iterative Factor Optimization: the Worker LLM executes the composed prompt; an evaluator returns scores … view at source ↗
Figure 3. Per-task accuracy on BBH benchmarks. aPSF (red) achieves dominant performance on arithmetic and … view at source ↗
Figure 4. Factor selection patterns and performance. (A) Selection-rate heatmap grouped by semantic category (original names in Appendix K). (B) Test accuracy and dominant factors per task (color indicates domain). view at source ↗
Figure 5. Optimization efficiency on MultiArith. view at source ↗
Figure 6. Factor-level ablation on GSM-Hard. Bars show performance drop when each factor is removed. view at source ↗
Figure 7. Meta-prompt for From-Scratch Structure Discovery. It enforces a strict four-section output format to ensure reliable parsing of the factor structure. view at source ↗
Figure 8. Meta-prompt for Initial-Prompt Analysis. It guides the Architect to preserve the core reasoning style of the user's input while expanding it into a full factorized program. view at source ↗
Figure 9. Meta-prompt for Factor-Wise Editing. It provides error analysis as reference while explicitly preventing overfitting through general-purpose constraints. view at source ↗
Figure 10. Meta-prompt for Step 1: Open-Ended Error Diagnosis. It enables unbiased identification of root causes and failure mechanisms prior to targeted prompt optimization. view at source ↗
Figure 11. Meta-prompt for Step 2: Factor Selection. It maps diagnosed errors to specific factors, with history-aware recommendations to encourage balanced exploration when a single factor is repeatedly selected. view at source ↗
Figure 12. Performance profile on BBH sub-tasks. aPSF (red) exhibits the steepest curve and highest plateau, indicating it most frequently achieves near-optimal performance across tasks. At τ = 0.1 (within 10% of best), aPSF succeeds on 76% of tasks vs. 65% for OPRO. Empty-CoT denotes the initial-prompt baseline without prompt optimization. view at source ↗
Figure 14. Step-wise optimization trajectory on BBH. view at source ↗
Figure 15. Real-world execution trace of the Error-Guided Factor Selection mechanism. view at source ↗
Figure 16. Qualitative comparison of reasoning traces. view at source ↗
read the original abstract

Automated prompt optimization is crucial for eliciting reliable reasoning from large language models (LLMs), yet most API-only prompt optimizers iteratively edit monolithic prompts, coupling components and obscuring credit assignment, limiting controllability, and wasting tokens. We propose Adaptive Prompt Structure Factorization (aPSF), an API-only framework (prompt-in/text-out; no access to model internals) that uses an Architect model to discover task-specific prompt structures as semantic factors. aPSF then performs interventional, single-factor updates: interventional factor-level scoring estimates each factor's marginal contribution via validation-performance changes, and error-guided factor selection routes updates to the current dominant failure source for more sample-efficient optimization. Across multiple advanced reasoning benchmarks, aPSF outperforms strong baselines including principle-aware optimizers, improving accuracy by up to +2.16 percentage points on average, and reduces optimization cost by 45--87% tokens on MultiArith while reaching peak validation in 1 step.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes Adaptive Prompt Structure Factorization (aPSF), an API-only framework for prompt optimization in LLMs. An Architect model discovers task-specific semantic factors from prompts; these are then optimized via interventional single-factor updates that estimate marginal contributions from validation-performance deltas and route updates to dominant failure sources. The central claims are that aPSF outperforms strong baselines (including principle-aware optimizers) by up to +2.16 percentage points on average across reasoning benchmarks while cutting optimization token cost by 45–87% on MultiArith and reaching peak validation performance in a single step.

Significance. If the empirical results and underlying isolation assumption hold, aPSF would represent a meaningful advance in controllable, sample-efficient prompt optimization under API-only constraints. The approach directly addresses credit-assignment opacity in monolithic prompt editing and could reduce token waste in iterative optimization loops. The reported gains and speed-ups, if reproducible, would be of practical interest to the prompt-engineering community.

major comments (1)
  1. The interventional single-factor scoring procedure (described in the abstract and methodology) rests on the untested assumption that the Architect-discovered semantic factors are sufficiently independent for marginal credit assignment. No factor-correlation statistics, joint-update ablations, or sensitivity analysis on factor granularity are reported; if factors interact (e.g., a reasoning-template factor modulating example-selection usage), the single-factor delta will misattribute gains, directly undermining both the +2.16 pp accuracy claim and the 45-87% token-reduction claim.
minor comments (2)
  1. Abstract: performance numbers are stated without accompanying details on experimental setup, number of runs, statistical tests, or variance, making it impossible to assess whether the reported gains are robust.
  2. The manuscript should include explicit definitions or pseudocode for the Architect factorization step and the error-guided factor-selection rule to allow replication.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the thoughtful and detailed review. The concern about the independence assumption in our interventional scoring is well-taken, and we address it directly below while committing to strengthen the manuscript with additional analyses.

read point-by-point responses
  1. Referee: The interventional single-factor scoring procedure (described in the abstract and methodology) rests on the untested assumption that the Architect-discovered semantic factors are sufficiently independent for marginal credit assignment. No factor-correlation statistics, joint-update ablations, or sensitivity analysis on factor granularity are reported; if factors interact (e.g., a reasoning-template factor modulating example-selection usage), the single-factor delta will misattribute gains, directly undermining both the +2.16 pp accuracy claim and the 45-87% token-reduction claim.

    Authors: We agree that the single-factor interventional updates rely on the assumption that Architect-discovered semantic factors exhibit limited interactions, enabling reliable marginal credit assignment through validation-performance deltas. The current manuscript does not report explicit pairwise factor-correlation statistics, joint-update ablations, or granularity sensitivity analyses. However, the consistent accuracy gains (up to +2.16 pp) and token reductions (45-87% on MultiArith) across multiple reasoning benchmarks provide indirect empirical support that the discovered factors permit effective, sample-efficient optimization in practice. To directly address the concern, we will revise the manuscript to include: (1) quantitative statistics on factor correlations derived from co-occurrence patterns in the optimized prompts, (2) an ablation study comparing single-factor updates against joint multi-factor updates, and (3) sensitivity analysis varying the Architect's factor-granularity prompt. These additions will test the independence assumption and clarify the conditions under which marginal scoring remains valid, thereby reinforcing the reported accuracy and efficiency claims. revision: yes
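For instance, rebuttal item (1) could be implemented as a correlation analysis over a log of which factors were selected at each optimization step. The sketch below assumes such a selection log exists; it is an illustration, not the authors' planned analysis.

```python
# A minimal sketch, assuming a per-step factor-selection log is available.
# Columns with a constant selection pattern produce NaN correlations and
# can be ignored.
import numpy as np

def selection_correlations(selection_log: list[set[int]], num_factors: int) -> np.ndarray:
    # One row per optimization step, one column per factor (1 = selected that step).
    X = np.zeros((len(selection_log), num_factors))
    for t, chosen in enumerate(selection_log):
        X[t, list(chosen)] = 1.0
    # Pairwise Pearson correlations between factor-selection indicators;
    # strong off-diagonal values suggest factors are not selected independently.
    return np.corrcoef(X, rowvar=False)
```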

Circularity Check

0 steps flagged

No circularity: empirical framework with independent validation

full rationale

The paper presents aPSF as an empirical API-only method: an Architect model discovers semantic factors, followed by interventional single-factor scoring on validation performance deltas and error-guided updates. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claims (accuracy gains, token savings) rest on reported benchmark comparisons to external baselines rather than reducing to definitions or prior self-work by construction. The independence assumption for factors is an untested modeling choice (correctness risk) but does not create definitional or fitted-input circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient detail to identify specific free parameters, axioms, or invented entities beyond the high-level framework description.

pith-pipeline@v0.9.0 · 5478 in / 1049 out tokens · 54071 ms · 2026-05-10T18:53:20.198857+00:00 · methodology


    AnswerSelection: Select correct option 1 → 2 → 3 → 4 Table 14: Factor structures discovered by aPSF. The factor names (e.g., ComponentAnalysis, CalculationExecu- tion) correspond to the domain-specific terminology cataloged in Appendix K. Parameter Value Temperature 0.0 (deterministic) Top-p (nucleus sampling) 1.0 (disabled) Top-k Not applied Max output to...