pith. machine review for the scientific record.

arxiv: 2604.17937 · v1 · submitted 2026-04-20 · 💻 cs.AI

Recognition: unknown

ContraPrompt: Contrastive Prompt Optimization via Dyadic Reasoning Trace Analysis

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 04:13 UTC · model grok-4.3

classification 💻 cs.AI
keywords prompt optimization · contrastive analysis · chain of thought · reasoning traces · dyadic comparison · decision tree · benchmark evaluation

The pith

ContraPrompt extracts prompt-improvement rules by contrasting full reasoning traces from a model's failure and its later success on the exact same input.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Traditional prompt optimization either examines individual failures alone or compares different prompt versions across separate examples. ContraPrompt instead uses an agentic retry loop to collect pairs of chain-of-thought traces for the identical question—one that initially fails and one that succeeds after receiving feedback. The differences between these two complete traces highlight the specific reasoning adjustments and error corrections that turn failure into success. These differences are converted into rules and arranged in a decision tree that selects instructions based on observable input traits. This dyadic approach delivers measurable gains over previous methods on four different benchmarks.
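To make that loop concrete, here is a minimal sketch of how the dyadic pairs might be collected. Everything named here is assumed for illustration: solve, is_correct, and make_feedback are hypothetical callables, not interfaces from the paper; only the shape of the loop (fail once, retry on the same input with feedback appended, keep the resulting (τ−, τ+) pair) follows the description above.

```python
# Hypothetical helpers (not from the paper):
#   solve(prompt, x) -> (answer, chain_of_thought_trace)
#   is_correct(answer, gold) -> bool
#   make_feedback(x, answer, gold) -> error-feedback string
def collect_dyadic_pairs(examples, base_prompt, solve, is_correct, make_feedback,
                         max_attempts=3):
    pairs, all_fail = [], []
    for x, gold in examples:
        answer, tau_minus = solve(base_prompt, x)
        if is_correct(answer, gold):
            continue  # only initial failures can seed a contrastive pair
        feedback = make_feedback(x, answer, gold)
        tau_plus = None
        for _ in range(max_attempts):
            # retry on the *same* input with the error feedback appended
            answer, trace = solve(base_prompt + "\n" + feedback, x)
            if is_correct(answer, gold):
                tau_plus = trace
                break
        if tau_plus is None:
            all_fail.append(x)                      # all-fail bucket
        else:
            pairs.append((x, tau_minus, tau_plus))  # dyadic pair on one input
    return pairs, all_fail
```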

Core claim

The paper establishes that the contrast between two reasoning traces generated by the same model on the same input—one failing and one succeeding after feedback—supplies an optimization signal distinct from those available in single-trace or cross-example analyses. By instrumenting a multi-attempt solving phase to produce these dyadic pairs automatically, the method extracts rules that are structured into an input-aware decision tree. This leads to improved performance on reasoning and compliance tasks without requiring human-annotated contrast data.

What carries the argument

Dyadic reasoning trace analysis, which isolates the reasoning strategy differences and appended feedback effects between paired failure and success traces that share model, input, and base prompt.
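A toy version of that contrast step, under the simplifying assumption that a trace is a newline-separated list of reasoning steps and that "difference" means exact step-level set difference; the paper's actual alignment procedure is not specified here and is surely richer than this.

```python
def contrast_traces(tau_minus: str, tau_plus: str) -> dict:
    """Toy dyadic contrast: which steps appear only in the successful trace?"""
    neg = [s.strip() for s in tau_minus.splitlines() if s.strip()]
    pos = [s.strip() for s in tau_plus.splitlines() if s.strip()]
    return {
        "inserted": [s for s in pos if s not in neg],  # steps that carry the success
        "dropped":  [s for s in neg if s not in pos],  # steps abandoned after feedback
    }
```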

Load-bearing premise

The differences between the two reasoning traces on a single input genuinely carry optimization information that is not already available from single traces or from comparisons across examples.

What would settle it

Running the optimization without using the paired traces for contrast—such as by analyzing only successes or random pairs—and observing whether the benchmark improvements vanish.
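A sketch of what that falsification run could look like; the pairing modes are our labels, not the paper's, and the scoring harness that consumes these pairs is left abstract.

```python
import random

def build_pairs_for_ablation(dyadic_pairs, mode="dyadic", seed=0):
    """Return the trace pairs an optimizer would mine, under three pairing regimes."""
    if mode == "dyadic":
        return dyadic_pairs                                        # (x, τ−, τ+) on the same input
    if mode == "success_only":
        return [(x, None, pos) for x, _neg, pos in dyadic_pairs]   # no contrast signal at all
    if mode == "random":
        negs = [neg for _x, neg, _pos in dyadic_pairs]
        random.Random(seed).shuffle(negs)                          # break the same-input link
        return [(x, neg, pos) for (x, _old, pos), neg in zip(dyadic_pairs, negs)]
    raise ValueError(f"unknown mode: {mode}")
```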

Figures

Figures reproduced from arXiv: 2604.17937 by Pushpak Pujari, Pushpendre Rastogi, Rishav Rishav.

Figure 1. Monadic vs. dyadic trace analysis. Left: prior methods consume one failed trace and produce a generic diagnosis with no positive target. Right: ContraPrompt consumes a trace pair (τ−, τ+) produced by the same model on the same input, identifying the inserted reasoning steps that carry the success (highlighted, steps 1’–3’). The extracted rule targets that specific step, not generic thoroughness.
Figure 2. System overview. Training (top): the instrumented retry loop produces contrastive trace pairs (τ−, τ+) and an all-fail bucket; rules extracted dyadically feed the input-aware tree, which is injected and checkpointed each outer iteration. Inference (bottom): features φ(x) route the input through the tree, so the augmented prompt P ⊕ R(x) ⊕ x carries only the rule subset relevant to that input class.
Figure 3. Average relative performance drop when each component is removed.
Figure 4. A representative contrastive pair on HotPotQA. Both traces share model, input, and base prompt.
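To make the inference path in Figure 2 concrete, here is a small sketch of the P ⊕ R(x) ⊕ x assembly, assuming a dictionary-encoded tree and a feature extractor φ; both data structures are illustrative stand-ins, not the paper's.

```python
def augment_prompt(base_prompt: str, x: str, tree: dict, phi) -> str:
    """Route input x through an assumed decision tree and inject only the matched rules."""
    features = phi(x)                      # observable input traits
    node = tree
    while "rules" not in node:             # internal nodes hold a branch test
        node = node["children"][node["test"](features)]
    rules = "\n".join(node["rules"])       # rule subset R(x) for this input class
    return f"{base_prompt}\n{rules}\n{x}"  # P ⊕ R(x) ⊕ x
```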
Original abstract

Prompt optimization methods either analyze individual failures in isolation or compare prompt variants across examples, operating on single execution traces with no access to the reasoning process distinguishing success from failure on the same input. We introduce ContraPrompt, built on the observation that when a model fails but succeeds on a retry with feedback, the difference between its two chain-of-thought traces constitutes an optimization signal not captured by prior methods. Unlike prior contrastive methods, we compare complete intermediate reasoning processes: the two traces share model, input, and base prompt, so remaining differences reflect reasoning strategy and appended error feedback -- we call this dyadic reasoning trace analysis. The multi-attempt solving phase is an instrumented agentic retry loop that generates contrastive data automatically without human annotation. Extracted rules are organized into an input-aware decision tree routing instructions by observable input characteristics. On four reasoning and compliance benchmarks, ContraPrompt outperforms GEPA (Agrawal et al., 2026) on all four, with absolute gains of +8.29 pp on HotPotQA (+20.8% rel.), +2.21 pp on GDPR-Bench (+18.2% rel.), +7.14 pp on GPQA Diamond (+10.6% rel.), and +0.74 pp on BBH (+0.85% rel.). Ablations confirm dyadic trace contrastivity is the critical component, with a -16% relative average drop upon its removal. On 53 EvalSet black-box optimization problems, ContraPrompt beats GEPA on 11, ties on 41, and loses on 1 at equal budget. On FiNER-139 financial named entity recognition (Loukas et al., 2022), ContraPrompt achieves +7.77 pp over the unoptimized baseline (+11.6% rel.) and +1.94 pp over GEPA (+2.66% rel.), with branch conditions aligning with standard US GAAP financial-instrument categories.
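The absolute and relative gains quoted above jointly pin down the implied GEPA baselines (baseline ≈ absolute gain / relative gain). The short check below only rearranges the abstract's own numbers, so small mismatches reflect rounding in the reported figures.

```python
# Implied baselines from the abstract's paired (+pp, %rel.) figures.
reported = {
    "HotPotQA":     (8.29, 0.208),
    "GDPR-Bench":   (2.21, 0.182),
    "GPQA Diamond": (7.14, 0.106),
    "BBH":          (0.74, 0.0085),
}
for task, (abs_pp, rel) in reported.items():
    baseline = abs_pp / rel
    print(f"{task}: implied GEPA ≈ {baseline:.1f}, ContraPrompt ≈ {baseline + abs_pp:.1f}")
```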

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces ContraPrompt, a prompt optimization technique that generates contrastive data via an instrumented agentic retry loop. When a model fails on an input but succeeds after feedback, the differences between the two chain-of-thought traces on the identical input are analyzed to extract optimization rules; these rules are then organized into an input-aware decision tree that routes instructions by observable input features. The method is evaluated on HotPotQA, GDPR-Bench, GPQA Diamond, and BBH, where it reports consistent outperformance over GEPA at equal budget, an ablation isolating the dyadic contrast component, results on 53 black-box optimization problems, and gains on FiNER-139.

Significance. If the reported gains and ablation hold after the methodological details are clarified, the work would offer a practically useful advance in automated prompt engineering by supplying an annotation-free optimization signal derived from paired reasoning traces rather than single traces or cross-example comparisons. The -16% relative drop when the contrastive component is removed and the 11-win/41-tie/1-loss record on the 53-problem suite provide concrete empirical support for the central assumption.

major comments (2)
  1. The multi-attempt solving phase and retry loop are described at a high level, but no information is given on how the additional inference budget incurred by the loop is matched in the GEPA baseline runs. Without an explicit compute-matched comparison (e.g., equal total tokens or equal number of model calls), the absolute gains of +8.29 pp on HotPotQA and +7.14 pp on GPQA Diamond cannot be unambiguously attributed to dyadic trace analysis rather than extra compute.
  2. Rule extraction from the dyadic traces and the subsequent construction of the input-aware decision tree are not specified algorithmically. The manuscript does not state the criteria used to identify differing reasoning strategies, how rules are distilled from trace pairs, or the procedure for learning branch conditions from observable input characteristics. These steps are load-bearing for the claim that the method supplies a signal unavailable to prior single-trace or cross-example approaches.
minor comments (1)
  1. The ablation table would be clearer if it reported absolute as well as relative drops and included the per-benchmark breakdown rather than only the average -16% figure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications and commit to revisions that strengthen methodological transparency and ensure fair comparisons.

Point-by-point responses
  1. Referee: The multi-attempt solving phase and retry loop are described at a high level, but no information is given on how the additional inference budget incurred by the loop is matched in the GEPA baseline runs. Without an explicit compute-matched comparison (e.g., equal total tokens or equal number of model calls), the absolute gains of +8.29 pp on HotPotQA and +7.14 pp on GPQA Diamond cannot be unambiguously attributed to dyadic trace analysis rather than extra compute.

    Authors: We acknowledge that the manuscript does not provide explicit budget-matching details for the four main benchmarks, although equal-budget results are reported for the 53-problem suite. The retry loop generates contrastive pairs only on failure cases and is integral to the method, but this does incur extra calls relative to a single-pass GEPA run. In the revision we will add a dedicated paragraph with measured totals (model calls and approximate tokens) for both ContraPrompt and GEPA on each benchmark. We will also run and report new GEPA variants given an equivalent extra budget (multiple independent prompt evaluations per input) so that the reported gains can be attributed to dyadic analysis rather than compute. revision: yes

  2. Referee: Rule extraction from the dyadic traces and the subsequent construction of the input-aware decision tree are not specified algorithmically. The manuscript does not state the criteria used to identify differing reasoning strategies, how rules are distilled from trace pairs, or the procedure for learning branch conditions from observable input characteristics. These steps are load-bearing for the claim that the method supplies a signal unavailable to prior single-trace or cross-example approaches.

    Authors: We agree that the algorithmic details of rule extraction and decision-tree construction were insufficiently specified. In the revised manuscript we will expand the method section with pseudocode and concrete criteria: differing strategies are identified by step-wise alignment of the failed and successful traces, flagging segments present only in the successful trace that directly address the appended feedback; rules are distilled by prompting the model to output concise if-then statements summarizing the corrective reasoning difference; branch conditions are learned by extracting observable input features (entity types, question length, presence of numerical or legal terms) and fitting a decision-tree classifier whose leaves correspond to applicable rules. Illustrative trace pairs and resulting rules from each benchmark will be included. revision: yes
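Response 2 amounts to a fairly standard recipe, and a minimal sketch of its branch-learning step might look like the following. The specific feature set and the use of scikit-learn are our assumptions for illustration, not details given in the manuscript.

```python
from sklearn.tree import DecisionTreeClassifier

def featurize(question: str) -> list:
    """Crude stand-ins for 'observable input characteristics'."""
    q = question.lower()
    return [
        len(question.split()),                                        # question length
        int(any(ch.isdigit() for ch in question)),                    # numerical content
        int(any(w in q for w in ("gdpr", "article", "regulation"))),  # legal terms
    ]

def fit_rule_router(questions, rule_ids, max_depth=3):
    """Fit a shallow tree whose leaves point at applicable extracted rules."""
    X = [featurize(q) for q in questions]
    return DecisionTreeClassifier(max_depth=max_depth).fit(X, rule_ids)

# At inference, router.predict([featurize(x)])[0] selects the rule to inject for input x.
```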

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The manuscript describes an empirical prompt optimization technique relying on automated generation of dyadic (failure vs. success-after-feedback) reasoning traces on identical inputs, followed by rule extraction into an input-aware decision tree. Performance is evaluated via direct comparisons against the external baseline GEPA on four held-out benchmarks plus a 53-problem suite and one additional NER task, with an ablation isolating the dyadic contrast component. No equations, first-principles derivations, or parameter-fitting steps appear that would reduce the reported gains to quantities defined by or fitted on the same evaluation data. The central claim is therefore supported by external empirical measurements rather than by construction or self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that differences between successful and failed reasoning traces on identical inputs supply an optimization signal unavailable to single-trace or cross-example methods. No free parameters or invented entities are identifiable from the abstract alone.

axioms (1)
  • domain assumption: Differences between two chain-of-thought traces on the same input (failure vs. success after feedback) constitute a usable optimization signal not captured by prior methods.
    Explicitly stated as the observation on which ContraPrompt is built.

pith-pipeline@v0.9.0 · 5664 in / 1378 out tokens · 52479 ms · 2026-05-10T04:13:12.027982+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

13 extracted references · 8 canonical work pages · 6 internal anchors

  1. [1]

    Learning from Contrastive Prompts: Automated Optimization and Adaptation

    Mingqi Li, Karan Aggarwal, Yong Xie, Aitzaz Ahmad, and Stephen Lau. Learning from contrastive prompts: Automated optimization and adaptation. arXiv preprint arXiv:2409.15199.

  2. [2]

    Let's Verify Step by Step

    Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. arXiv preprint arXiv:2305.20050.

  3. [3]

    Automatic Prompt Optimization with "Gradient Descent" and Beam Search

    Reid Pryzant, Dan Iter, Jerry Li, Yin Tat Lee, Chenguang Zhu, and Michael Zeng. Automatic prompt optimization with "gradient descent" and beam search. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 7957–7968.

  4. [4]

    GDPR-Bench-Android: A Benchmark for Evaluating Automated GDPR Compliance Detection in Android

    Huaijin Ran, Haoyi Zhang, and Xunzhu Tang. GDPR-Bench-Android: A benchmark for evaluating automated GDPR compliance detection in Android. arXiv preprint arXiv:2511.00619.

  5. [5]

    GPQA: A Graduate-Level Google-Proof Q&A Benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. arXiv preprint arXiv:2311.12022.

  6. [6]

    Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

    Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, and Jason Wei. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261.

  7. [7]

    Large Language Models as Optimizers

    Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. arXiv preprint arXiv:2309.03409.

  8. [8]

    HotpotQA: A Dataset for Diverse, Explainable Multi-Hop Question Answering

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2369–2380.

  9. [9]

    TextGrad: Automatic "Differentiation" via Text

    Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. TextGrad: Automatic "differentiation" via text. arXiv preprint arXiv:2406.07496.

  10. [10]

    Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models

    Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, Urmish Thakker, James Zou, and Kunle Olukotun. Agentic context engineering: Evolving contexts for self-improving language models. URL https://arxiv.org/abs/2510.04618.

  11. [11]

    Large Language Models Are Human-Level Prompt Engineers

    Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. Large language models are human-level prompt engineers. In International Conference on Learning Representations.

  12. [12]

    Average relative drop

    Internal anchor: Appendix A, Algorithm 1 (ContraPrompt: Contrastive Prompt Optimization). Require: training set D_train, base module M, max iterations T, max attempts A, patience P. Ensure: optimized module M*. 1: P ← ExtractInstructions(M); R_all ← ∅; s* ← −1; wait ← 0. 2: for t = 1 to T do. 3: {Phase 1: instrumented agentic retry loop}. 4: A ← ∅ {attempt records across all training examples}. 5: …

  13. [13]

    Ulysses S. Grant, 1877

    Internal anchor: Figure 4 (representative contrastive pair on HotPotQA). Successful-trace steps: (4) combine cited facts: successor-of-Grant = Hayes; (5) strip prefix and return only the answer span; output a+ "Ulysses S. Grant, 1877". Extracted rule Δ(τ−, τ+): "When the question requires combining facts from multiple paragraphs, state which paragraph provides each individual fact before combining them, rather than combining inferences directly."

    4 Combine cited facts: successor-of-Grant = Hayes. 5 Strip prefix; return only the answer span. output a+ “Ulysses S. Grant, 1877” EXTRACTED RULE Δ(τ−, τ+) “When the question requires combining facts from multiple paragraphs, state which paragraph provides each individual fact before combining them, rather than combining inferences directly.” Figure 4: A ...