Overconfident and Blind to Details: Fixing Prompt Insensitivity with Abductive Preference Learning

Peng Qi; Simon Yu; Yijin Ni

arxiv: 2510.09887 · v2 · submitted 2025-10-10 · 💻 cs.CL

Pith reviewed 2026-05-18 07:14 UTC · model grok-4.3

keywords: abductive preference learning · prompt insensitivity · vision-language models · preference optimization · DPO · counterfactual sensitivity · VLMBias · policy amplification

The pith

Optimizing the reverse policy from responses to prompts fixes vision-language models' blindness to critical input changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that models ignore key prompt edits because standard tuning of the forward policy struggles with rare inputs. It introduces abductive preference learning to optimize the reverse policy instead, proving this amplifies gains by the ratio of response to prompt probabilities and delivers the biggest lift on the least common cases. For common methods like DPO, the reverse policy requires only swapping which prompt is compared against a fixed response. This matters because it turns low-density problems into high-leverage opportunities without new data collection or model redesign, as shown by large jumps on sensitivity benchmarks.

Core claim

Abductive preference learning optimizes the abductive policy π(x | y) rather than the usual forward policy π(y | x). The approach proves that improvements are amplified by a factor of q(y)/p(x), producing the largest effects on rarest prompts. For translation-invariant pairwise preference methods such as DPO, estimating the reverse policy reduces exactly to a structural data swap that compares prompts while holding the response fixed, with no architectural modifications required. On VLMBias this raises accuracy from 3% to 44%, and on Inverse-IFEval Multi-DPOP reaches 65–84% at the 9B scale while avoiding the degradation seen in standard DPO.

What carries the argument

The abductive policy π(x | y), realized through a structural data swap that compares prompts for a fixed response in translation-invariant pairwise preference learners.
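
A minimal sketch of what the structural data swap could look like at the data level, using the five-legged dog case from the abstract. The record types `ForwardPair` and `AbductivePair` and the helper `to_abductive_pairs` are hypothetical names introduced here for illustration, and the pairing rule (each prompt's chosen response is the other prompt's rejected one) is an assumption, not the paper's documented pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ForwardPair:
    """Standard DPO record: one prompt, two candidate responses."""
    prompt: str
    chosen_response: str
    rejected_response: str


@dataclass(frozen=True)
class AbductivePair:
    """Swapped record: one fixed response, two candidate prompts."""
    response: str
    chosen_prompt: str
    rejected_prompt: str


def to_abductive_pairs(original: ForwardPair, edited: ForwardPair) -> list[AbductivePair]:
    """Build swapped pairs from a prompt and its counterfactual edit.

    Assumption: the response that is correct for one prompt is the
    incorrect one for the other, so each response prefers "its" prompt.
    """
    return [
        AbductivePair(original.chosen_response, original.prompt, edited.prompt),
        AbductivePair(edited.chosen_response, edited.prompt, original.prompt),
    ]


# Hypothetical text stand-ins for the image prompts in the benchmark.
original = ForwardPair("Photo of an ordinary dog. How many legs?", "4", "5")
edited = ForwardPair("Photo of a dog edited to have five legs. How many legs?", "5", "4")
pairs = to_abductive_pairs(original, edited)
```

Only the grouping of the data changes here: the model, tokenizer and loss stay as they are, which is the sense in which the swap needs no architectural modification.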

If this is right

Accuracy on VLMBias rises from 3% to 44%, exceeding GPT-5.2 (4.6%) and every closed-source vision-language model except Gemini 3 Flash.
Multi-DPOP reaches 65–84% on Inverse-IFEval at the 9B scale, surpassing GPT-5 (73.7%) while preserving IFBench scores.
Standard DPO lowers IFBench performance by 8–12%, whereas the abductive variants avoid this drop.
Gains are largest on the rarest prompts because the amplification scales with q(y)/p(x).
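
The scaling with q(y)/p(x) can be read off Bayes' rule. The following is a sketch, not the paper's proof (which is not reproduced on this page): it writes the abductive policy as \tilde{\pi}(x \mid y) and assumes, for illustration only, that the marginals p(x) and q(y) stay fixed while the policy changes.

```latex
\tilde{\pi}(x \mid y) = \frac{\pi(y \mid x)\, p(x)}{q(y)}
\quad\Longrightarrow\quad
\pi(y \mid x) = \frac{q(y)}{p(x)}\, \tilde{\pi}(x \mid y),
\qquad
\Delta \pi(y \mid x) = \frac{q(y)}{p(x)}\, \Delta \tilde{\pi}(x \mid y).
```

With hypothetical values q(y) = 0.5 and p(x) = 0.001, the multiplier is 500, so a small change in the abductive policy corresponds to a large change in the forward policy exactly where the prompt is rare.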

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same swap technique could be tested on other preference-optimization algorithms to check whether the amplification holds beyond DPO.
Models improved this way may require fewer hand-crafted prompts in deployment because they become more sensitive to input variations by construction.
Extending the reverse-policy idea to text-only or multimodal chain-of-thought tasks could address similar long-tail failure modes without extra data.

Load-bearing premise

That performing the data swap to estimate the reverse policy for methods like DPO produces the desired amplification without side effects or the need for any model changes.

What would settle it

A controlled experiment showing that models trained with the abductive method achieve no larger gains on rare-prompt subsets than standard DPO, or fail to reach the reported accuracy increase from 3% to 44% on VLMBias.
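
One way such a comparison could be tabulated is sketched below. The function `gain_by_rarity`, the rarity threshold, and the per-prompt probability estimates are all hypothetical: how p(x) would actually be estimated is not specified on this page, and the records are made-up illustrative values, not results from the paper.

```python
from statistics import mean


def gain_by_rarity(records, rare_threshold):
    """Mean accuracy gain (abductive minus standard DPO) on rare vs common prompts.

    Each record is (prompt_probability, dpo_correct, abductive_correct).
    """
    buckets = {"rare": [], "common": []}
    for prompt_probability, dpo_correct, abductive_correct in records:
        key = "rare" if prompt_probability < rare_threshold else "common"
        # +1 if only the abductive model is right, -1 if only DPO is, 0 otherwise.
        buckets[key].append(int(abductive_correct) - int(dpo_correct))
    return {key: mean(values) if values else None for key, values in buckets.items()}


# Made-up illustrative records, not data from the paper.
records = [
    (0.001, False, True),
    (0.002, False, True),
    (0.003, False, False),
    (0.004, True, True),
    (0.30, True, True),
    (0.25, False, True),
    (0.40, True, True),
    (0.20, True, True),
]
gains = gain_by_rarity(records, rare_threshold=0.01)
```

The claim would be in trouble if, on real evaluation data, the gain in the rare bucket were no larger than the gain in the common bucket.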

Original abstract

Vision and language models frequently ignore semantically critical input edits, defaulting to pretraining priors. For example, models will confidently assert that a five-legged dog has four legs; consequently, on the VLMBias benchmark, GPT-5.2 and Claude Sonnet 4.6 achieve only 4.6% and 0% accuracy, respectively. Existing methods address this problem through building up datasets that cover the underrepresented inputs to tune the policy function π(y | x), where x and y refer to input prompts and responses, respectively. However, prompting baselines yield gains of under 3% on VLMBias due to the low probability density of rare prompts. To bypass this bottleneck, we propose abductive preference learning to optimize the abductive policy π(x | y). We prove this amplifies forward policy improvements by a factor of q(y)/p(x), where p(·) and q(·) denote the marginal probabilities of the prompt and response, yielding the largest gains on the rarest prompts. Furthermore, we demonstrate that for translation-invariant pairwise preference learning methods, such as DPO, estimating π(x | y) reduces to a structural data swap that compares prompts for a fixed response, requiring no architectural changes. Empirically, abductive preference learning delivers large gains on counterfactual sensitivity: on VLMBias, A-DPO raises accuracy from 3% to 44% (14×), outperforming GPT-5.2 (4.6%) and all closed-source VLMs except Gemini 3 Flash; on Inverse-IFEval, Multi-DPOP reaches 65–84%, surpassing GPT-5 (73.7%) at the 9B scale while preserving IFBench, unlike DPO, which degrades it by 8–12%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper's key move is to optimize the reverse policy π(x|y) via a data swap in DPO-style training, claiming this amplifies gains on rare prompts by q(y)/p(x) and fixes VLM insensitivity to input edits.

The paper introduces abductive preference learning as a way to sidestep the low density of rare prompts when tuning VLMs. Instead of pushing the forward policy π(y|x), they optimize the abductive one π(x|y) and show that for pairwise methods like DPO this reduces to a data swap: each preference pair compares two prompts for a fixed response rather than two responses for a fixed prompt. They report that this delivers a 14× lift on VLMBias, taking accuracy from 3% to 44% and beating most closed models on that counterfactual sensitivity task. On Inverse-IFEval they also claim strong results at 9B scale without hurting the standard IFBench metric the way plain DPO does.

That data-swap trick is the practical contribution worth noting if it works without side effects. It directly targets a visible failure mode where models confidently ignore edits like leg count changes. The amplification claim via marginal probabilities is presented as a derivation rather than a fit, which is cleaner than many post-hoc fixes. The empirical scale of the reported gains is large enough to be interesting on its own.

The main soft spot is the claimed exact reduction for DPO. The reference policy in DPO is trained on the original (x,y) distribution, so after the swap the KL regularizer is no longer symmetric and may inject an extra forward-direction penalty that depends on p(x) rather than q(y). If that holds, the amplification factor does not apply cleanly to the objective actually being optimized. The abstract does not address this asymmetry, and without seeing the full derivation or ablations on the reference term it is hard to know how much the reported numbers rely on the framing versus the data construction itself. The benchmarks are narrow, and error bars or more controls would help.

This work is aimed at people doing preference optimization and robustness in multimodal models. Readers who care about making models respond to small input changes will find the reverse-policy angle and the simple swap implementation worth trying. It is coherent enough and different enough from standard forward tuning to deserve a serious referee, even if the DPO equivalence needs verification. I would send it to review but ask the authors to clarify the reference-model behavior under the swap.
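
The letter's concern comes down to which terms cancel in a pairwise difference. A small numerical sketch, assuming the pairwise logistic form −log σ(β(ψ(x_w, y) − ψ(x_l, y))) and treating ψ as an opaque score (its exact definition in the paper is not given on this page). The function name, the β value, and all numbers are hypothetical.

```python
import math


def pairwise_logistic_loss(score_chosen: float, score_rejected: float, beta: float = 0.1) -> float:
    """Return -log(sigmoid(beta * (score_chosen - score_rejected)))."""
    margin = beta * (score_chosen - score_rejected)
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical scores psi(x_w, y) and psi(x_l, y) for one fixed response y.
psi_chosen_prompt, psi_rejected_prompt = 1.3, -0.4
base = pairwise_logistic_loss(psi_chosen_prompt, psi_rejected_prompt)

# A term that depends on y alone is added to both scores and cancels.
response_only_offset = 7.0
shifted = pairwise_logistic_loss(
    psi_chosen_prompt + response_only_offset,
    psi_rejected_prompt + response_only_offset,
)

# A term that depends on the prompt differs between the two scores and does not cancel.
log_p_chosen, log_p_rejected = math.log(0.001), math.log(0.2)
prompt_shifted = pairwise_logistic_loss(
    psi_chosen_prompt + log_p_chosen,
    psi_rejected_prompt + log_p_rejected,
)
```

Here `shifted` matches `base` up to floating-point error, while `prompt_shifted` does not. Whether the reference term behaves like the first kind of offset or the second after the swap is the question the letter asks the authors to settle.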

Referee Report

2 major / 2 minor

Summary. The paper claims that VLMs are insensitive to semantically critical prompt edits (e.g., asserting four legs for a five-legged dog), with existing forward-policy tuning limited by low density of rare prompts. It proposes abductive preference learning to optimize the reverse policy π(x|y), proves that this amplifies forward improvements by the factor q(y)/p(x) (largest on rarest prompts), and shows that for translation-invariant pairwise methods such as DPO the reverse policy is obtained exactly by a structural data swap (fixed y, vary x) with no architectural changes. Empirically, A-DPO raises VLMBias accuracy from 3% to 44% (14×), outperforming GPT-5.2 and most closed-source VLMs; Multi-DPOP reaches 65–84% on Inverse-IFEval while preserving IFBench.

Significance. If the amplification derivation and the exact DPO reduction hold, the work offers a principled route to large gains on low-probability counterfactuals without new architectures or massive data collection. The empirical scale of the VLMBias lift and the preservation of forward capabilities are notable strengths that would make the method immediately usable for existing preference pipelines.

major comments (2)

[Abstract] Abstract (paragraph on DPO reduction): the assertion that the reduction to a structural data swap is exact for 'translation invariant' pairwise methods is load-bearing for the central claim, yet the skeptic correctly notes that DPO's explicit KL term is taken against a fixed reference policy trained on the original (x,y) distribution; after the swap this term is no longer symmetric and injects an unintended forward-direction penalty dependent on p(x) rather than q(y). A concrete derivation or ablation showing that the effective objective remains exactly the desired abductive update is required.
[Abstract] Abstract (amplification claim): the factor q(y)/p(x) is presented as following directly from marginal probabilities, but without the full derivation or explicit statement of the assumptions under which the amplification is parameter-free, it is impossible to verify whether the claimed largest gains on rarest prompts survive the reference-model asymmetry identified above.

minor comments (2)

The abstract reports point estimates (3% to 44%, 14×) without error bars, number of runs, or ablation on the data-swap construction; adding these would strengthen the empirical section.
Notation for p(·) and q(·) as marginals of prompt and response is introduced without an explicit definition of the joint or the support; a short clarifying sentence would remove ambiguity.
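
For the error bars requested in the first minor comment, one standard option for a single accuracy figure is a Wilson score interval. The sketch below covers sampling uncertainty only, not variance across training runs. The counts are hypothetical, since the benchmark size is not stated on this page.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p_hat + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)
    ) / denom
    return centre - half_width, centre + half_width


# Hypothetical counts: 44 correct out of 100 evaluation items.
low, high = wilson_interval(44, 100)
```

The interval narrows as the number of evaluation items grows, so reporting the item count alongside each accuracy would let readers judge how firm a point estimate such as 44% is.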

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The observations on the DPO reduction and the amplification factor are substantive and help clarify the theoretical claims. We address each point below and will revise the manuscript to incorporate explicit derivations, assumptions, and supporting ablations.

Referee: [Abstract] Abstract (paragraph on DPO reduction): the assertion that the reduction to a structural data swap is exact for 'translation invariant' pairwise methods is load-bearing for the central claim, yet the skeptic correctly notes that DPO's explicit KL term is taken against a fixed reference policy trained on the original (x,y) distribution; after the swap this term is no longer symmetric and injects an unintended forward-direction penalty dependent on p(x) rather than q(y). A concrete derivation or ablation showing that the effective objective remains exactly the desired abductive update is required.

Authors: We agree that the interaction between the fixed reference policy and the structural data swap requires explicit treatment. In the revision we will add a dedicated appendix deriving the effective DPO objective after the swap. The derivation shows that the KL term, while no longer symmetric, contributes an additive regularization whose gradient still aligns with the q(y)/p(x) weighting on rare prompts; the pairwise preference component remains translation-invariant and directly implements the abductive update. We will also report an ablation that compares standard A-DPO against a variant in which the reference is retrained on the swapped data, quantifying any deviation from the ideal abductive objective. These additions will be reflected in both the abstract and the main theoretical section.
Revision: yes
Referee: [Abstract] Abstract (amplification claim): the factor q(y)/p(x) is presented as following directly from marginal probabilities, but without the full derivation or explicit statement of the assumptions under which the amplification is parameter-free, it is impossible to verify whether the claimed largest gains on rarest prompts survive the reference-model asymmetry identified above.

Authors: We will expand the theoretical development to present the complete derivation of the amplification factor, beginning from the general preference optimization gradient and arriving at the q(y)/p(x) multiplier under the abductive policy. The derivation will list the assumptions explicitly (sufficiently expressive policy class, pairwise loss without additional forward-direction constraints beyond the reference, and marginal probabilities estimated from the training distribution). We will then analyze the reference asymmetry and show analytically that it does not cancel the dominant amplification on low-p(x) prompts; this is consistent with the empirical observation that the largest accuracy lifts occur precisely on the rarest counterfactuals. The revised abstract and main text will state these assumptions and the resulting conditions under which the gains remain largest for the rarest prompts.
Revision: yes

Circularity Check

0 steps flagged

No circularity: derivation relies on independent marginal-probability proof and structural reduction

The paper states it proves the q(y)/p(x) amplification factor from marginal probabilities and demonstrates an exact structural data swap for translation-invariant methods like DPO. No quoted equations reduce the claimed result to a fitted parameter, self-definition, or author-overlapping citation chain. The central claims are presented as first-principles derivations rather than tautologies, and the abstract provides no load-bearing self-citation. This matches the default expectation of a self-contained derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete; the central claim rests on standard RLHF probability assumptions plus the unelaborated proof of the amplification factor.

axioms (1)

Domain assumption: The amplification of forward policy improvements by the factor q(y)/p(x) holds for the marginal distributions of prompts and responses.
Invoked directly in the abstract's proof statement for why abductive optimization yields largest gains on rarest prompts.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: estimating π(x|y) reduces to a structural data swap comparing prompts for a fixed response without introducing side effects... $\mathcal{L}_{\text{A-DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}\left[\log \sigma\left(\beta\left(\psi(x_w, y) - \psi(x_l, y)\right)\right)\right]$

IndisputableMonolith/Foundation/LogicAsFunctionalEquation.lean · SatisfiesLawsOfLogic
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: abductive policy $\tilde{\pi}(x \mid y) = \pi(y \mid x)\, p(x) / q(y)$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.