
arxiv: 2603.19732 · v2 · pith:GRYQWRJ2new · submitted 2026-03-20 · 💻 cs.MA

Helix: A Dual-Helix Co-Evolutionary Multi-Agent System for Prompt Optimization and Question Reformulation

Pith reviewed 2026-05-19 18:23 UTC · model grok-4.3

classification 💻 cs.MA
keywords prompt optimization · question reformulation · multi-agent system · co-evolutionary framework · large language models · automated prompt optimization · LLM task performance

The pith

Helix jointly optimizes question reformulation and prompt instructions via a three-stage co-evolutionary multi-agent framework.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that prompt optimization has been held back by treating user questions as fixed inputs while only refining the surrounding instructions. Question phrasing and prompt design actually influence each other, so optimizing one without the other leaves performance on the table. Helix addresses this by breaking the joint goal into coupled objectives, then running two specialized agent tracks that iteratively refine and critique each other's outputs, and finally generating strong question versions for final use. Experiments across twelve benchmarks show this dual approach yields gains up to 3.95 percent over prior single-sided methods while keeping the search efficient. A reader would care because most current ways of adapting large language models still require separate, manual tuning of questions and prompts.

Core claim

We propose a unified multi-agent system (Helix) that jointly optimizes question reformulation and prompt instructions through a structured three-stage co-evolutionary framework. Helix integrates (1) planner-guided decomposition that breaks optimization into coupled question-prompt objectives, (2) dual-track co-evolution where specialized agents iteratively refine and critique each other to produce complementary improvements, and (3) strategy-driven question generation that instantiates high-quality reformulations for robust inference. Extensive experiments on 12 benchmarks against 6 strong baselines demonstrate the effectiveness of Helix, achieving up to 3.95% performance improvements across tasks with favorable optimization efficiency.

What carries the argument

The dual-track co-evolution inside the three-stage framework, where one agent track refines question reformulations while the other refines prompt instructions and each critiques the other to generate mutual gains.
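
To make the alternating structure concrete, here is a minimal sketch of a dual-track loop. It is not the authors' implementation: the function names, the accept-only-if-the-score-improves rule, and the toy critics and scorer are all assumptions made for illustration, and real tracks would call an LLM where the stubs return fixed strings.

```python
from typing import Callable, Tuple

Critic = Callable[[str, str], str]    # (prompt, question_strategy) -> critique text
Refiner = Callable[[str, str], str]   # (current_text, critique) -> revised text
Scorer = Callable[[str, str], float]  # (prompt, question_strategy) -> joint score


def co_evolve(prompt: str, q_strategy: str, critique_prompt: Critic,
              critique_question: Critic, refine: Refiner, score: Scorer,
              rounds: int = 2) -> Tuple[str, str, float]:
    """Alternate between tracks; keep a revision only if the joint score improves.

    The acceptance rule is an assumption of this sketch, not taken from the paper.
    """
    best = score(prompt, q_strategy)
    for _ in range(rounds):
        # Prompt track: revise the prompt using a critique from the question side.
        candidate = refine(prompt, critique_prompt(prompt, q_strategy))
        candidate_score = score(candidate, q_strategy)
        if candidate_score > best:
            prompt, best = candidate, candidate_score
        # Question track: revise the question strategy using a critique from the prompt side.
        candidate = refine(q_strategy, critique_question(prompt, q_strategy))
        candidate_score = score(prompt, candidate)
        if candidate_score > best:
            q_strategy, best = candidate, candidate_score
    return prompt, q_strategy, best


# Toy stand-ins (hypothetical): each critic names one missing phrase.
def toy_score(prompt: str, q_strategy: str) -> float:
    return float("step by step" in prompt) + float("label premises" in q_strategy)


def toy_critique_prompt(prompt: str, q_strategy: str) -> str:
    return "" if "step by step" in prompt else "step by step"


def toy_critique_question(prompt: str, q_strategy: str) -> str:
    return "" if "label premises" in q_strategy else "label premises"


def toy_refine(current: str, critique: str) -> str:
    return f"{current}; {critique}" if critique else current


final_prompt, final_strategy, final_score = co_evolve(
    "Answer the question", "Keep wording",
    toy_critique_prompt, toy_critique_question, toy_refine, toy_score,
)
```

In this toy run both tracks converge in the first round, and the second round proposes no accepted change.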

If this is right

  • Coupled optimization of questions and prompts yields complementary gains unavailable when either is held fixed.
  • The method delivers measurable accuracy lifts of up to 3.95 percent on twelve standard benchmarks relative to six prior approaches.
  • Optimization stays efficient while expanding the search space through mutual critique between the two tracks.
  • The framework makes the overall process more adaptable because better prompts surface clearer ways to restate queries.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar dual-track co-evolution could be applied to other paired tasks such as data curation paired with model fine-tuning.
  • Users might experience lower sensitivity to exact phrasing if the system automatically explores reformulations during inference.
  • The planner-guided decomposition step could be reused as a modular component in other multi-objective LLM pipelines.

Load-bearing premise

Question formulation and prompt design are interdependent enough that letting specialized agents co-evolve both sides produces improvements single-sided optimization cannot reach.

What would settle it

If single-sided prompt-only or question-only optimization matches or exceeds Helix performance on the same twelve benchmarks, the claim that dual co-evolution supplies unavailable complementary gains would not hold.
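
The test described above can be phrased as a small per-benchmark comparison. The sketch below assumes scores are stored as dictionaries keyed by benchmark name; the function name, the `tolerance` parameter, and every number are placeholders invented for illustration, not results from the paper.

```python
from typing import Dict, List


def benchmarks_where_single_sided_suffices(
    helix: Dict[str, float],
    single_sided: Dict[str, Dict[str, float]],
    tolerance: float = 0.0,
) -> List[str]:
    """Return benchmarks where the best single-sided method matches or beats Helix."""
    hits = []
    for name, helix_score in helix.items():
        best_single = max(scores[name] for scores in single_sided.values())
        if best_single >= helix_score - tolerance:
            hits.append(name)
    return sorted(hits)


# Placeholder scores (NOT from the paper), in accuracy points.
helix_scores = {"task_a": 80.0, "task_b": 70.0}
single_sided_scores = {
    "prompt_only": {"task_a": 78.0, "task_b": 71.0},
    "question_only": {"task_a": 75.0, "task_b": 66.0},
}

counterexamples = benchmarks_where_single_sided_suffices(helix_scores, single_sided_scores)
```

If such a list covered most or all of the twelve benchmarks, the complementary-gains claim would be in trouble; an empty list would be consistent with it.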

Figures

Figures reproduced from arXiv: 2603.19732 by Kewen Zhu, Liping Yi, Qinghua Hu, Xiang Li, Zhiming Zhao.

Figure 1: Comparison of three strategies for pronoun disambiguation. Left: original question without prompt instructions yields an incorrect answer. Middle: adding a CoT prompt to the original question still fails. Right: Helix jointly optimizes question formulation and prompt instructions, producing the correct prediction.
Figure 2: Overview of the Helix framework including 6 LLM-based agents for joint optimization of question reformulation and prompt instructions. The ① Planner decomposes the task into a sequence of helix objectives, dual-helix co-evolution alternates between ② Prompt-Architect and ③ Question-Architect with ④ Mediator validation, and the ⑤ Question-Generator together with the ⑥ Question-Judge produces validated…
Figure 4: Prompt efficiency (PE) comparison on four representative BBH tasks, where Helix consistently achieves the highest performance per optimization cost. Dual-role architects reduce agent redundancy, mediator validation prunes incompatible updates early, and joint optimization across question and prompt spaces accelerates convergence. In practice, most helix objectives converge within a single co-evolution r…
Figure 6: Inference-stage strategy-driven question generation on the Formal Fallacies task. The Question-Generator reformulates the original query following the learned strategy, while the Question-Judge validates semantic preservation and structural quality, producing an optimized question for final model inference.
Figure 7: Training process of Helix (Part 1/2): Planner-guided decomposition and dual-helix co-evolution Round 1. (Continued in …)
Figure 8: Training process of Helix (Part 2/2): Dual-helix co-evolution Round 2.
Figure 9: Inference process of Helix (Part 1/2): Strategy-driven question generation Round 1. (Continued in …)
Figure 10: Inference process of Helix (Part 2/2): Strategy-driven question generation Round 2.
Original abstract

Automated prompt optimization (APO) aims to improve large language model performance by refining prompt instructions. However, existing methods are largely constrained by fixed prompt templates, limited search spaces, or single-sided optimization that treats user questions as immutable inputs. In practice, question formulation and prompt design are inherently interdependent: clearer question structures facilitate focused reasoning and task understanding, while effective prompts reveal better ways to organize and restate queries. Ignoring this coupling fundamentally limits the effectiveness and adaptability of current APO approaches. We propose a unified multi-agent system (Helix) that jointly optimizes question reformulation and prompt instructions through a structured three-stage co-evolutionary framework. Helix integrates (1) planner-guided decomposition that breaks optimization into coupled question-prompt objectives, (2) dual-track co-evolution where specialized agents iteratively refine and critique each other to produce complementary improvements, and (3) strategy-driven question generation that instantiates high-quality reformulations for robust inference. Extensive experiments on 12 benchmarks against 6 strong baselines demonstrate the effectiveness of Helix, achieving up to 3.95% performance improvements across tasks with favorable optimization efficiency.
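
Stage (3) pairs a generator with a validation step. The sketch below shows one way such a generate-then-validate loop could look. It is an assumption-laden illustration: the in-order token check stands in for an LLM-based judge, and the fallback to the original question is a design choice of this sketch rather than something the abstract specifies.

```python
from typing import Callable


def preserves_wording(original: str, rewritten: str) -> bool:
    """Stand-in judge: every original token must appear in the rewrite, in order."""
    remaining = iter(rewritten.split())
    # `tok in remaining` advances the iterator, so this is a subsequence check.
    return all(tok in remaining for tok in original.split())


def generate_validated(
    question: str,
    generator: Callable[[str], str],
    judge: Callable[[str, str], bool] = preserves_wording,
    max_attempts: int = 2,
) -> str:
    """Return the first reformulation the judge accepts, else the original question."""
    for _ in range(max_attempts):
        candidate = generator(question)
        if judge(question, candidate):
            return candidate
    return question  # assumed fallback: never send an unvalidated rewrite to the model


# Hypothetical generators standing in for an LLM call.
def add_label(question: str) -> str:
    return f"Question: {question} (Answer with one option.)"


def lossy_rewrite(question: str) -> str:
    return question.split()[0]


original_question = "Is the argument valid or invalid ?"
accepted = generate_validated(original_question, add_label)
rejected = generate_validated(original_question, lossy_rewrite)
```

The first generator only adds structure around the original wording and passes the check; the second drops words, so the original question is returned unchanged.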

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Helix, a unified multi-agent system for automated prompt optimization that jointly optimizes prompt instructions and question reformulation via a three-stage co-evolutionary framework: (1) planner-guided decomposition into coupled objectives, (2) dual-track co-evolution with specialized agents that iteratively refine and critique each other, and (3) strategy-driven generation of reformulations. It reports up to 3.95% performance gains over 6 baselines across 12 benchmarks along with favorable optimization efficiency.

Significance. If the dual-track co-evolution demonstrably yields complementary gains unavailable from independent or single-sided optimization, the work would meaningfully extend multi-agent approaches to prompt engineering by explicitly modeling the interdependence of questions and instructions. The scale of evaluation across 12 benchmarks provides a reasonable empirical foundation for assessing practical utility, though the absence of isolating controls limits the strength of the mechanistic claims.

major comments (2)
  1. [§4 (Experiments)] The central claim attributes observed gains (up to 3.95%) to the co-evolutionary coupling between question and prompt tracks, yet the experimental evaluation provides no ablation that disables the iterative critique loop while preserving total LLM calls, search breadth, and joint optimization of both variables. This leaves open whether any non-interactive joint optimization would suffice.
  2. [§4 (Experiments)] No statistical testing, variance reporting, or hyperparameter sensitivity analysis is described for the performance deltas across the 12 benchmarks, making it difficult to assess whether the reported improvements are robust or attributable to the proposed mechanism rather than baseline variability or tuning differences.
minor comments (2)
  1. [Abstract] The abstract states 'favorable optimization efficiency' without accompanying metrics (e.g., LLM call counts, wall-clock time, or convergence curves) or direct comparisons to baselines on this dimension.
  2. [§3 (Method)] Notation for the three-stage framework (planner decomposition, dual-track refinement, strategy-driven generation) could be clarified with a diagram or pseudocode to make the agent interaction flow more precise.
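
The ablation requested in major comment 1 depends on holding the LLM call budget fixed across variants. A minimal sketch of how that constraint could be enforced follows; `BudgetedLLM`, both optimizer functions, and the fake model are hypothetical names introduced here, not components of Helix.

```python
from typing import Callable, Tuple


class BudgetedLLM:
    """Wrap any text-in/text-out callable and refuse calls beyond a fixed budget."""

    def __init__(self, llm: Callable[[str], str], max_calls: int) -> None:
        self._llm = llm
        self.max_calls = max_calls
        self.calls = 0

    def __call__(self, text: str) -> str:
        if self.calls >= self.max_calls:
            raise RuntimeError("LLM call budget exhausted")
        self.calls += 1
        return self._llm(text)


def interactive(llm: Callable[[str], str], prompt: str, question: str,
                rounds: int) -> Tuple[str, str]:
    """Each track sees the other's latest output before revising."""
    for _ in range(rounds):
        prompt = llm(f"refine prompt given question [{question}]: {prompt}")
        question = llm(f"refine question given prompt [{prompt}]: {question}")
    return prompt, question


def non_interactive(llm: Callable[[str], str], prompt: str, question: str,
                    rounds: int) -> Tuple[str, str]:
    """Both variables are still optimized, but neither track sees the other."""
    for _ in range(rounds):
        prompt = llm(f"refine prompt: {prompt}")
    for _ in range(rounds):
        question = llm(f"refine question: {question}")
    return prompt, question


def fake_llm(text: str) -> str:
    # Deterministic placeholder for a real model call.
    return f"v{len(text)}"


rounds = 3
llm_a = BudgetedLLM(fake_llm, max_calls=2 * rounds)
llm_b = BudgetedLLM(fake_llm, max_calls=2 * rounds)
interactive(llm_a, "p0", "q0", rounds)
non_interactive(llm_b, "p0", "q0", rounds)
```

With both variants drawing from identical budgets, any score difference can be attributed to the information flow between tracks rather than to extra model calls.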

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We appreciate the referee's focus on strengthening the empirical support for the co-evolutionary mechanism in Helix. We address each major comment below and outline the revisions we will make to the experimental section.

Point-by-point responses
  1. Referee: [§4 (Experiments)] The central claim attributes observed gains (up to 3.95%) to the co-evolutionary coupling between question and prompt tracks, yet the experimental evaluation provides no ablation that disables the iterative critique loop while preserving total LLM calls, search breadth, and joint optimization of both variables. This leaves open whether any non-interactive joint optimization would suffice.

    Authors: We agree that an ablation isolating the iterative critique loop is necessary to more rigorously attribute gains to the co-evolutionary coupling rather than joint optimization alone. In the revised manuscript, we will introduce a new baseline that performs joint optimization of question reformulation and prompt instructions without the dual-track iterative critique and refinement process. This baseline will be configured to use an equivalent total number of LLM calls and comparable search breadth (e.g., by matching the number of candidate generations and evaluations). We will report the performance of this non-interactive joint optimizer alongside the full Helix results to demonstrate the incremental benefit of the co-evolution. revision: yes

  2. Referee: [§4 (Experiments)] No statistical testing, variance reporting, or hyperparameter sensitivity analysis is described for the performance deltas across the 12 benchmarks, making it difficult to assess whether the reported improvements are robust or attributable to the proposed mechanism rather than baseline variability or tuning differences.

    Authors: We acknowledge this limitation in the current presentation of results. In the revision, we will add standard deviation values computed over at least three independent runs for all reported metrics. We will also include statistical significance tests (paired t-tests where assumptions hold, or Wilcoxon signed-rank tests otherwise) comparing Helix against each baseline on a per-benchmark basis. Additionally, we will provide a sensitivity analysis for key hyperparameters, including the number of co-evolution iterations and the temperature settings for the specialized agents, to show that performance gains remain stable across reasonable ranges. revision: yes
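
The statistics promised in the second response are straightforward to compute. The sketch below uses only the Python standard library and placeholder numbers that are not results from the paper; it returns test statistics only, and p-values would come from a statistics package or reference tables.

```python
import math
from statistics import mean, stdev
from typing import List, Sequence


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """t statistic for paired samples: mean difference over its standard error."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


def wilcoxon_w(a: Sequence[float], b: Sequence[float]) -> float:
    """Wilcoxon signed-rank statistic W = min(W+, W-).

    Zero differences are dropped; tied absolute differences get average ranks.
    """
    diffs: List[float] = [x - y for x, y in zip(a, b) if x != y]
    abs_sorted = sorted(abs(d) for d in diffs)

    def avg_rank(value: float) -> float:
        first = abs_sorted.index(value) + 1  # 1-based rank of the first tie
        count = abs_sorted.count(value)
        return first + (count - 1) / 2

    w_plus = sum(avg_rank(abs(d)) for d in diffs if d > 0)
    w_minus = sum(avg_rank(abs(d)) for d in diffs if d < 0)
    return min(w_plus, w_minus)


# Placeholder accuracies (NOT from the paper), one entry per benchmark.
helix_acc = [81.0, 74.0, 69.0, 90.0]
baseline_acc = [79.0, 75.0, 66.0, 86.0]

t_stat = paired_t_statistic(helix_acc, baseline_acc)
w_stat = wilcoxon_w(helix_acc, baseline_acc)

# Mean and standard deviation over three independent runs of one benchmark.
runs = [80.0, 82.0, 81.0]
run_mean, run_std = mean(runs), stdev(runs)
```

The Wilcoxon statistic depends only on the signs and ranks of the differences, which is why it is the usual fallback when the normality assumption behind the paired t-test is doubtful.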

Circularity Check

0 steps flagged

Empirical multi-agent system proposal with no derivation chain or fitted predictions

full rationale

The manuscript describes Helix as an empirical multi-agent framework for joint prompt optimization and question reformulation, evaluated via experiments on 12 benchmarks against 6 baselines. No equations, self-referential predictions, fitted parameters renamed as outputs, or uniqueness theorems appear in the provided text. The three-stage co-evolutionary process is presented as a design choice justified by performance gains rather than a closed mathematical reduction. External comparisons and ablation-style claims rest on observed results, not on inputs that are redefined as outputs. This is a standard empirical system paper with independent validation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, mathematical axioms, or invented entities; the contribution is an architectural description rather than a formal model with fitted constants or new postulates.

pith-pipeline@v0.9.0 · 5736 in / 1201 out tokens · 66923 ms · 2026-05-19T18:23:40.956352+00:00 · methodology



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Prompt Optimization Is a Coin Flip: Diagnosing When It Helps in Compound AI Systems

    cs.AI 2026-04 unverdicted novelty 6.0

    Prompt optimization in compound AI systems is statistically indistinguishable from random chance except when tasks have exploitable output structure; a two-stage diagnostic predicts success.
