Efficient Preference Poisoning Attack on Offline RLHF

Chenye Yang; Lifeng Lai; Weiyu Xu

arxiv: 2605.02495 · v2 · pith:TA5KJQNTnew · submitted 2026-05-04 · 💻 cs.LG · cs.AI · stat.ML

Pith reviewed 2026-05-08 19:25 UTC · model grok-4.3

keywords: preference poisoning, offline RLHF, DPO, label flip attack, sparse approximation, lattice reduction, binary matching pursuit, gradient shift

The pith

Flipping one preference label in log-linear DPO creates a parameter-independent gradient shift that turns targeted poisoning into a binary sparse approximation problem.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that in offline RLHF using log-linear Direct Preference Optimization, flipping a single preference label produces a fixed shift in the gradient vector that does not depend on the current model parameters. This independence lets the authors reformulate the problem of selecting which labels to flip as a structured binary sparse approximation task over a dictionary of gradient directions. They introduce two solution methods: BAL-A, which embeds the selection into a binary-aware lattice and applies lattice reduction plus nearest-plane search, and BMP-A, which adapts binary matching pursuit to deliver coherence-based recovery guarantees and impossibility certificates for a given flip budget. A reader would care because the result demonstrates how minimal changes to a pre-collected preference dataset can steer the trained policy with theoretical efficiency and success conditions.

Core claim

In log-linear DPO, flipping one preference label induces a parameter-independent shift in the DPO gradient. This converts the targeted poisoning problem into a structured binary sparse approximation problem, which BAL-A and BMP-A solve using lattice reduction and binary matching pursuit with sufficient conditions, coherence-based guarantees, and robustness certificates.
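
The parameter-independence claim can be checked numerically. The sketch below assumes the per-sample loss has the form -log σ(o·β·Δψᵀ(θ − θ_μ)) with label o ∈ {+1, −1}, which is consistent with the derivation excerpts quoted in the figure captions but is not shown in full on this page. The function names and the random test values are illustrative, not taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_sample(theta, dpsi, o, beta, theta_mu):
    """Gradient of l(theta) = -log sigmoid(o * beta * dpsi @ (theta - theta_mu)).

    Assumed loss form; dpsi is the feature difference of one preference pair.
    """
    z = o * beta * dpsi @ (theta - theta_mu)
    return -o * beta * dpsi * sigmoid(-z)

def flip_shift(theta, dpsi, o, beta, theta_mu):
    """Change in the per-sample gradient when the label o is flipped to -o."""
    flipped = grad_sample(theta, dpsi, -o, beta, theta_mu)
    original = grad_sample(theta, dpsi, o, beta, theta_mu)
    return flipped - original

rng = np.random.default_rng(0)
d, beta, o = 5, 0.1, 1
dpsi = rng.normal(size=d)
theta_mu = rng.normal(size=d)

# Evaluate the shift at three unrelated parameter vectors.
shifts = [flip_shift(rng.normal(size=d), dpsi, o, beta, theta_mu) for _ in range(3)]
for s in shifts:
    print(np.allclose(s, o * beta * dpsi))
```

Under this assumed loss, every evaluated shift matches o·β·Δψ regardless of where θ is sampled, because the two sigmoid terms in the difference sum to one.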

What carries the argument

The parameter-independent gradient shift induced by a single preference label flip, which reduces the poisoning attack to binary sparse approximation over a non-normalized gradient dictionary.

If this is right

BAL-A recovers the minimum number of flips when the lattice reduction and nearest-plane steps satisfy the stated sufficient conditions for binary coefficients.
BMP-A provides coherence-based recovery guarantees and impossibility certificates that bound attack success for any K-flip budget.
Attack effectiveness is governed by the geometry of the gradient dictionary constructed from the preference data.
The same reduction applies to any log-linear preference optimization objective that admits an additive gradient contribution per sample.
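
To make the greedy selection idea concrete, here is a minimal sketch of a binary matching pursuit over a non-normalized dictionary. It is not the authors' BMP-A: the selection rule (pick the unused column that most reduces the residual when its coefficient is fixed to 1), the stopping tests, and the names `binary_matching_pursuit`, `t_max`, and `eps` are assumptions chosen for illustration.

```python
import numpy as np

def binary_matching_pursuit(V, y, t_max, eps=1e-3):
    """Greedy 0/1 selection of columns of V to approximate y (illustrative only)."""
    n = V.shape[1]
    x = np.zeros(n, dtype=int)
    r = y.astype(float).copy()
    col_sq_norms = np.sum(V * V, axis=0)
    for _ in range(t_max):
        if np.linalg.norm(r) <= eps:
            break
        # ||r - v_i||^2 = ||r||^2 - (2 <r, v_i> - ||v_i||^2)
        gain = 2.0 * (V.T @ r) - col_sq_norms
        gain[x == 1] = -np.inf      # a label can be flipped at most once
        i = int(np.argmax(gain))
        if gain[i] <= 0:            # no remaining column shrinks the residual
            break
        x[i] = 1
        r = r - V[:, i]
    return x, float(np.linalg.norm(r))

# Toy dictionary with orthogonal columns of different norms.
V = np.diag([1.0, 2.0, 0.5, 3.0])
x_true = np.array([1, 0, 0, 1])
y = V @ x_true
x_hat, res = binary_matching_pursuit(V, y, t_max=4)
print(x_hat, res)
```

With orthogonal columns the greedy rule recovers the support exactly. Correlated columns are the case where coherence conditions of the kind the paper reports would matter.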

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If preference datasets are collected from public or crowdsourced sources, an adversary could pre-compute the dictionary once and reuse the lattice or pursuit solver for multiple target policies.
Defenses that add small non-linear regularizers or switch to non-log-linear objectives would invalidate the parameter-independence step and thereby block this family of attacks.
The coherence measure that controls BMP-A recovery could be used as a dataset-quality metric to identify preference collections that are naturally harder to poison.
Extending the lattice construction to include higher-order interactions among flips might yield tighter bounds when multiple labels affect overlapping gradient directions.
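
As an illustration of the coherence idea in the list above, the following sketch computes the largest absolute normalized correlation between distinct dictionary columns. Whether the paper defines coherence for its non-normalized dictionary in exactly this way is an assumption; the function name `mutual_coherence` is hypothetical.

```python
import numpy as np

def mutual_coherence(V):
    """Largest absolute normalized correlation between distinct columns of V.

    Assumes V has no all-zero column.
    """
    norms = np.linalg.norm(V, axis=0)
    U = V / norms                   # unit-norm columns
    G = np.abs(U.T @ U)             # pairwise normalized correlations
    np.fill_diagonal(G, 0.0)        # ignore self-correlation
    return float(G.max())

orthogonal = np.eye(3)
skewed = np.array([[1.0, 1.0],
                   [0.0, 1.0]])
print(mutual_coherence(orthogonal), mutual_coherence(skewed))
```

A lower value means the candidate gradient directions are closer to orthogonal, which is the regime where greedy pursuit methods are generally expected to behave well.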

Load-bearing premise

The DPO objective must be strictly log-linear in the parameters so the gradient shift from any single label flip remains independent of the current parameter vector.

What would settle it

Observe that the gradient shift after one label flip changes with the current parameter values when the model is trained with a non-linear preference objective or with regularization terms that break log-linearity.
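
A small experiment along these lines can be sketched as follows. It assumes a per-sample loss of the form -log σ(o·β·h(θ)), sets the reference parameters to zero for brevity, and compares a log-linear margin h(θ) = Δψᵀθ with a hypothetical non-linear margin h(θ) = tanh(Δψᵀθ). The tanh margin is purely illustrative and does not come from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def flip_shift(theta, o, beta, margin, margin_grad):
    """Gradient change when o -> -o in the loss -log sigmoid(o * beta * margin(theta))."""
    h, dh = margin(theta), margin_grad(theta)
    g_orig = -o * beta * dh * sigmoid(-o * beta * h)
    g_flip = o * beta * dh * sigmoid(o * beta * h)
    return g_flip - g_orig

dpsi = np.array([1.0, -2.0, 0.5])
o, beta = 1, 0.1
theta_a, theta_b = np.zeros(3), np.ones(3)

# Log-linear margin: the gradient of h is the constant vector dpsi.
lin = (lambda th: dpsi @ th, lambda th: dpsi)
# Hypothetical non-linear margin: the gradient of h now depends on theta.
non = (lambda th: np.tanh(dpsi @ th),
       lambda th: (1.0 - np.tanh(dpsi @ th) ** 2) * dpsi)

lin_same = np.allclose(flip_shift(theta_a, o, beta, *lin),
                       flip_shift(theta_b, o, beta, *lin))
non_same = np.allclose(flip_shift(theta_a, o, beta, *non),
                       flip_shift(theta_b, o, beta, *non))
print(lin_same, non_same)
```

In this setup the shift equals o·β·∇h(θ), so it stays fixed only when ∇h does not depend on θ.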

Figures

Figures reproduced from arXiv: 2605.02495 by Chenye Yang, Lifeng Lai, Weiyu Xu.

**Figure 1.** TPR of BAL-A on synthetic $V$ as a function of $M$.

**Figure 2.** TPR of BMP-A on synthetic $V$ as a function of $K^\star$.

**Figure 3.** True positive rate of BAL-A on $V$ from SHP as a function of $M$.

**Figure 4.** $\ell_2$ distance between learned parameters and $\ell_1$ distance between learned policies, comparing training on the BAL-A attacked subset $\tilde{D}(\hat{x})$ versus training on the ground-truth attacked subset $\tilde{D}(x^\star)$, as a function of $M$.

**Figure 5.** Histogram of pairwise normalized correlations for two subsets of SHP: a random subset and a low-coherence subset.

Accompanying text (excerpt): … $y = V x^\star$, then run BMP-A with tolerance $\varepsilon = 10^{-3}$ up to budget $t_{\max} = 15$ with 200 trials. The low-coherence subset yields consistently higher TPR as the budget increases and drives the residual down faster, often reaching a near-zero residual around $K^\star$. In contrast, on the random subset BMP-A …

**Figure 6.** True positive rate and residual of BMP-A on $V$ from different subsets of SHP as a function of budget $K$.

Accompanying text (excerpt): … and similarly for $\mu = \pi_{\theta_\mu}$. Therefore,

$$\log\frac{\pi_\theta(a_i \mid s_i)}{\mu(a_i \mid s_i)} - \log\frac{\pi_\theta(a_i' \mid s_i)}{\mu(a_i' \mid s_i)} = \psi(s_i, a_i)^\top\theta - \psi(s_i, a_i)^\top\theta_\mu - \psi(s_i, a_i')^\top\theta + \psi(s_i, a_i')^\top\theta_\mu = \Delta\psi_i^\top(\theta - \theta_\mu),$$

where $\Delta\psi_i := \psi(s_i, a_i) - \psi(s_i, a_i') \in \mathbb{R}^d$. Plugging into (1) gives per-sample loss $\ell_i(\theta) = -\log\sigma$ …

**Figure 7.** $\ell_2$ distance between learned parameters and $\ell_1$ distance between learned policies, comparing training on the BMP-A attacked $\tilde{D}(\hat{x})$ versus training on the ground-truth attacked $\tilde{D}(x^\star)$, as a function of budget $K$.

Accompanying text (excerpt): Using the sigmoid identity $\sigma(x) = 1 - \sigma(-x)$, we have $\Delta g_i(\theta) = o_i \beta \Delta\psi_i =: \Delta g_i$. Here we see that the gradient shift caused by flipping the label of one sample is a constant vector, independent of the …

Original abstract

Offline Reinforcement Learning from Human Feedback (RLHF) pipelines such as Direct Preference Optimization (DPO) train on a pre-collected preference dataset, which makes them vulnerable to preference poisoning attacks. We study label flip attacks against log-linear DPO. We first illustrate that flipping one preference label induces a parameter-independent shift in the DPO gradient. Using this key property, we can then convert the targeted poisoning problem into a structured binary sparse approximation problem. To solve this problem, we develop two attack methods: Binary-Aware Lattice Attack (BAL-A) and Binary Matching Pursuit Attack (BMP-A). BAL-A embeds the binary flip selection problem into a binary-aware lattice and applies Lenstra-Lenstra-Lovász reduction and Babai's nearest plane algorithm; we provide sufficient conditions that enforce binary coefficients and recover the minimum-flip objective. BMP-A adapts binary matching pursuit to our non-normalized gradient dictionary and yields coherence-based recovery guarantees and robustness (impossibility) certificates for $K$-flip budgets. Experiments on synthetic dictionaries and the Stanford Human Preferences dataset validate the theory and highlight how dictionary geometry governs attack success.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper reduces label-flip poisoning in log-linear DPO to binary sparse approximation and supplies two algorithms with recovery conditions.

The main point is that flipping a preference label in log-linear DPO produces a fixed gradient shift proportional to that sample's feature difference vector ($\Delta g_i = o_i \beta \Delta\psi_i$), independent of the current parameters. This turns the problem of choosing which labels to flip into a binary sparse approximation task, which the authors solve with BAL-A (lattice reduction via LLL, followed by Babai's nearest-plane algorithm) and BMP-A (an adapted binary matching pursuit). BAL-A comes with sufficient conditions for recovery, and BMP-A with coherence-based guarantees under a stated flip budget.
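
One way to write the binary sparse approximation task mentioned here, as a sketch rather than the paper's exact statement: stack the per-sample shifts as the columns of a dictionary $V$, let $y$ be the attacker's target gradient shift, and let $x$ mark which labels are flipped. The symbol $n$ (number of samples) and the two objective forms below are assumptions made for illustration; only $V$, the relation $y = V x^\star$, and the notion of a $K$-flip budget appear in the excerpts on this page.

```latex
% Minimum-flip form (assumed): fewest flips that reproduce the target shift.
\min_{x \in \{0,1\}^n} \ \mathbf{1}^\top x
\quad \text{s.t.} \quad V x = y,
\qquad V = [\Delta g_1, \dots, \Delta g_n], \quad \Delta g_i = o_i \beta \Delta\psi_i .

% Budgeted form (assumed): best approximation using at most K flips.
\min_{x \in \{0,1\}^n} \ \| V x - y \|_2
\quad \text{s.t.} \quad \mathbf{1}^\top x \le K .
```

The binary constraint on $x$ is what separates this from standard sparse recovery, where coefficients are real-valued.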

Referee Report

0 major / 4 minor

Summary. The paper claims that for log-linear DPO (linear reward model under Bradley-Terry), flipping one preference label produces a parameter-independent shift in the DPO gradient: a fixed vector proportional to the sample's feature difference. This property converts the targeted poisoning attack into a structured binary sparse approximation problem over a gradient dictionary. The authors introduce two solvers. BAL-A embeds the problem in a binary-aware lattice and applies LLL reduction plus Babai's nearest-plane algorithm, with sufficient conditions guaranteeing binary coefficients and minimum-flip recovery. BMP-A adapts binary matching pursuit to the non-normalized dictionary and supplies coherence-based recovery guarantees plus K-flip robustness certificates. Experiments on synthetic dictionaries and the Stanford Human Preferences dataset are used to validate the theory and illustrate the role of dictionary geometry.
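
For readers unfamiliar with the nearest-plane step named in the summary, the sketch below shows a generic Babai nearest-plane routine on a small hand-picked basis. It omits LLL reduction and does not reproduce the paper's binary-aware lattice embedding, which is not described on this page. The basis `B`, the target `t`, and the function names are illustrative assumptions.

```python
import numpy as np

def gram_schmidt(B):
    """Unnormalized Gram-Schmidt orthogonalization of the columns of B."""
    Bs = np.zeros_like(B, dtype=float)
    for i in range(B.shape[1]):
        v = B[:, i].astype(float)
        for j in range(i):
            mu = (B[:, i] @ Bs[:, j]) / (Bs[:, j] @ Bs[:, j])
            v = v - mu * Bs[:, j]
        Bs[:, i] = v
    return Bs

def babai_nearest_plane(B, t):
    """Integer coefficients c such that B @ c is a lattice point near t."""
    Bs = gram_schmidt(B)
    w = t.astype(float).copy()
    c = np.zeros(B.shape[1], dtype=int)
    for i in reversed(range(B.shape[1])):
        # Round the coordinate of w along the i-th orthogonalized direction.
        c[i] = int(np.rint((w @ Bs[:, i]) / (Bs[:, i] @ Bs[:, i])))
        w = w - c[i] * B[:, i]
    return c

B = np.array([[2, 1, 0],
              [0, 3, 1],
              [0, 0, 4]])
c_true = np.array([1, 0, 1])
t = B @ c_true + np.array([0.1, -0.2, 0.3])   # lattice point plus a small offset
c_hat = babai_nearest_plane(B, t)
print(c_hat)
```

Here the recovered coefficients happen to be binary because the target sits close to a lattice point with 0/1 coordinates. In general the routine returns arbitrary integers, which is presumably why the paper needs its sufficient conditions to enforce binary coefficients.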

Significance. If the central reduction and guarantees hold, the work supplies a theoretically grounded, computationally efficient attack on offline RLHF that directly exploits the structure of the DPO loss rather than relying on black-box optimization. The explicit derivation of the constant gradient shift, the conversion to sparse approximation, the provision of sufficient conditions for BAL-A, and the coherence-based certificates for BMP-A constitute clear strengths. The empirical results on real preference data further demonstrate that dictionary geometry governs attack success, which is useful for both attack design and potential defenses.

minor comments (4)

Abstract and introduction should explicitly restate the scope limitation to log-linear DPO at the outset so readers immediately understand that the parameter-independence result does not apply to non-linear reward models or additional regularizers.
In the BAL-A section, the sufficient conditions for enforcing binary coefficients after lattice reduction should be illustrated with a small numerical example or a remark on how often they are satisfied in practice for typical preference feature dimensions.
The BMP-A coherence bound and robustness certificates would benefit from a brief comparison table showing how the achieved coherence values on the Stanford dataset relate to the theoretical thresholds for exact recovery.
Figure captions for the synthetic dictionary experiments should include the precise values of dimension, number of atoms, and sparsity level used, to allow direct reproduction of the reported success rates.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment, accurate summary of our contributions, and recommendation for minor revision. We are pleased that the central reduction of label-flip attacks to structured binary sparse approximation, the BAL-A and BMP-A solvers with their respective recovery guarantees, and the role of dictionary geometry are recognized as strengths.

Circularity Check

0 steps flagged

No significant circularity identified

The paper's core derivation begins from the explicit log-linear DPO loss under the Bradley-Terry model and directly computes the gradient difference induced by a single label flip. The result is a constant shift vector $\Delta g_i = o_i \beta \Delta\psi_i$, the sample's feature difference scaled by $\beta$ and the label sign, which is independent of $\theta$ because the two sigmoid terms in the difference sum to one. This property is then used to recast the poisoning objective as a binary sparse approximation problem over the dictionary of these shift vectors. The subsequent BAL-A and BMP-A algorithms apply standard lattice reduction (LLL + Babai) and matching pursuit, respectively, whose sufficient conditions and coherence guarantees are imported from the external literature on sparse approximation rather than being fitted or self-referenced within the paper. No load-bearing step equates a claimed result to its own inputs by construction, renames a fitted quantity as a prediction, or relies on self-citation chains; the analysis remains self-contained within the stated scope of log-linear DPO.
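
The cancellation can be written out in a few lines. This is a sketch in the notation of the derivation excerpts quoted with the figures ($\Delta\psi_i$, $o_i$, $\beta$, $\theta_\mu$). It assumes the per-sample loss takes the form shown in the first line, with $o_i \in \{+1, -1\}$ and a flip meaning $o_i \to -o_i$; the excerpt on this page is truncated before the loss is stated in full, so that form is an assumption.

```latex
\begin{align*}
z_i(\theta) &:= o_i \beta \, \Delta\psi_i^\top (\theta - \theta_\mu), \\
\ell_i(\theta) &= -\log \sigma\big(z_i(\theta)\big)
  && \text{(assumed loss, original label)} \\
\nabla_\theta \ell_i(\theta) &= -\,o_i \beta \, \Delta\psi_i \, \sigma\big(-z_i(\theta)\big), \\
\nabla_\theta \tilde{\ell}_i(\theta) &= +\,o_i \beta \, \Delta\psi_i \, \sigma\big(z_i(\theta)\big)
  && \text{(label flipped, } o_i \to -o_i\text{)} \\
\Delta g_i(\theta) &= \nabla_\theta \tilde{\ell}_i(\theta) - \nabla_\theta \ell_i(\theta)
  = o_i \beta \, \Delta\psi_i \big[\sigma(z_i) + \sigma(-z_i)\big]
  = o_i \beta \, \Delta\psi_i .
\end{align*}
```

The last equality uses $\sigma(z) + \sigma(-z) = 1$, which is the sigmoid identity the excerpt cites.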

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on the log-linear structure of DPO and the parameter-independence of single-flip gradient shifts; no new entities are postulated and no parameters are fitted inside the attack construction itself.

axioms (2)

domain assumption: DPO loss is log-linear in the model parameters
Explicitly stated as the setting for the attack analysis.
domain assumption: Single label flip produces a parameter-independent gradient shift
Described as the key property that enables the reduction to sparse approximation.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Cost.FunctionalEquation washburn_uniqueness_aczel (relation to the paper passage: unclear)

Paper passage: "flipping one preference label induces a parameter-independent shift in the DPO gradient ... convert the targeted poisoning problem into a structured binary sparse approximation problem"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.