Efficient Preference Poisoning Attack on Offline RLHF
Pith reviewed 2026-05-08 19:25 UTC · model grok-4.3
The pith
Flipping one preference label in log-linear DPO creates a parameter-independent gradient shift that turns targeted poisoning into a binary sparse approximation problem.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In log-linear DPO, flipping one preference label induces a parameter-independent shift in the DPO gradient. This converts the targeted poisoning problem into a structured binary sparse approximation problem, which BAL-A and BMP-A solve using lattice reduction and binary matching pursuit with sufficient conditions, coherence-based guarantees, and robustness certificates.
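The parameter-independence claim can be checked numerically. The sketch below is a minimal illustration, not the paper's code: it assumes a per-sample loss of the form -log sigmoid(beta * <theta, delta>) with delta the feature difference of the preferred and rejected responses, omits any reference-policy offset, and uses hypothetical names (`dpo_grad`, `flip_shift`).

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_grad(theta, delta, beta, flipped=False):
    """Gradient w.r.t. theta of -log sigmoid(s * beta * <theta, delta>),
    where s = +1 for the original label and s = -1 for the flipped label."""
    sign = -1.0 if flipped else 1.0
    z = sign * beta * sum(t * d for t, d in zip(theta, delta))
    coeff = -sign * beta * sigmoid(-z)
    return [coeff * d for d in delta]

def flip_shift(theta, delta, beta):
    """Change in the per-sample gradient caused by flipping the label."""
    g_orig = dpo_grad(theta, delta, beta, flipped=False)
    g_flip = dpo_grad(theta, delta, beta, flipped=True)
    return [f - o for f, o in zip(g_flip, g_orig)]

rng = random.Random(0)
delta = [rng.gauss(0, 1) for _ in range(4)]  # hypothetical feature difference
beta = 0.5                                   # assumed DPO temperature

# Evaluate the shift at three unrelated parameter vectors.
shifts = [
    flip_shift([rng.gauss(0, 1) for _ in range(4)], delta, beta)
    for _ in range(3)
]
```

Under these assumptions every entry of `shifts` equals `beta * delta` up to rounding, whatever `theta` was drawn, which is the behavior the core claim describes.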
What carries the argument
The parameter-independent gradient shift induced by a single preference label flip, which reduces the poisoning attack to binary sparse approximation over a non-normalized gradient dictionary.
If this is right
- BAL-A recovers the minimum number of flips when the lattice reduction and nearest-plane steps satisfy the stated sufficient conditions for binary coefficients.
- BMP-A provides coherence-based recovery guarantees and impossibility certificates that bound attack success for any K-flip budget.
- Attack effectiveness is governed by the geometry of the gradient dictionary constructed from the preference data.
- The same reduction applies to any log-linear preference optimization objective that admits an additive gradient contribution per sample.
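To make the binary sparse approximation view concrete, here is a minimal greedy sketch over a non-normalized dictionary. It is not the paper's BMP-A: the selection rule (pick the unused atom that most reduces the squared residual norm), the stopping rule, and the names `greedy_binary_pursuit`, `atoms`, and `target` are assumptions made for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def greedy_binary_pursuit(atoms, target, budget):
    """Pick at most `budget` distinct atoms (coefficients in {0, 1}) whose
    sum approximates `target`. Each step adds the atom that most reduces
    the squared residual norm and stops when no atom helps."""
    residual = list(target)
    chosen = []
    for _ in range(budget):
        best_idx, best_gain = None, 0.0
        for i, atom in enumerate(atoms):
            if i in chosen:
                continue
            # ||r||^2 - ||r - d||^2 = 2<r, d> - ||d||^2 (no normalization).
            gain = 2.0 * dot(residual, atom) - dot(atom, atom)
            if gain > best_gain:
                best_idx, best_gain = i, gain
        if best_idx is None:
            break
        chosen.append(best_idx)
        residual = [r - a for r, a in zip(residual, atoms[best_idx])]
    return chosen, residual

# Toy dictionary: each atom stands in for one flip's gradient shift.
atoms = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 0.5], [1.0, 1.0, 0.0]]
target = [1.0, 2.0, 0.5]
flips, residual = greedy_binary_pursuit(atoms, target, budget=3)
```

On this toy input the greedy rule selects atoms 1, 0, and 2 in that order and drives the residual to zero. On correlated dictionaries a greedy rule of this kind can fail, which is where coherence conditions become relevant.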
Where Pith is reading between the lines
- If preference datasets are collected from public or crowdsourced sources, an adversary could pre-compute the dictionary once and reuse the lattice or pursuit solver for multiple target policies.
- Defenses that add small non-linear regularizers or switch to non-log-linear objectives would invalidate the parameter-independence step and thereby block this family of attacks.
- The coherence measure that controls BMP-A recovery could be used as a dataset-quality metric to identify preference collections that are naturally harder to poison.
- Extending the lattice construction to include higher-order interactions among flips might yield tighter bounds when multiple labels affect overlapping gradient directions.
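As a rough illustration of the dataset-quality idea, the sketch below computes the standard mutual coherence of a dictionary, normalizing each atom by its own norm. The paper's coherence measure for non-normalized dictionaries may be defined differently; the function name `mutual_coherence` and the assumption that all atoms are nonzero are ours.

```python
import math

def mutual_coherence(atoms):
    """Largest absolute cosine similarity between two distinct atoms.
    Assumes every atom has nonzero norm."""
    norms = [math.sqrt(sum(a * a for a in atom)) for atom in atoms]
    mu = 0.0
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            inner = sum(a * b for a, b in zip(atoms[i], atoms[j]))
            mu = max(mu, abs(inner) / (norms[i] * norms[j]))
    return mu

orthogonal = mutual_coherence([[1.0, 0.0], [0.0, 3.0]])
correlated = mutual_coherence([[1.0, 0.0], [1.0, 1.0]])
```

Here `orthogonal` is 0 and `correlated` is 1/sqrt(2). Under the usual sparse-recovery intuition, lower values indicate atoms that are easier to tell apart.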
Load-bearing premise
The DPO objective must be strictly log-linear in the parameters so the gradient shift from any single label flip remains independent of the current parameter vector.
What would settle it
Observe that the gradient shift after one label flip changes with the current parameter values when the model is trained with a non-linear preference objective or with regularization terms that break log-linearity.
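One way to probe this is a toy experiment with a deliberately non-linear score. The sketch below assumes a hypothetical score tanh(<theta, phi>) in place of the linear one, and the function names are illustrative; it only shows that the flip-induced shift then varies with theta, and says nothing about any specific regularizer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score_grad(theta, phi):
    """Non-linear score tanh(<theta, phi>) and its gradient w.r.t. theta."""
    s = math.tanh(sum(t * p for t, p in zip(theta, phi)))
    return s, [(1.0 - s * s) * p for p in phi]

def flip_shift_nonlinear(theta, phi_w, phi_l, beta):
    """Gradient of the flipped-label loss minus gradient of the original
    loss, for the loss -log sigmoid(beta * (score_w - score_l))."""
    s_w, g_w = score_grad(theta, phi_w)
    s_l, g_l = score_grad(theta, phi_l)
    z = beta * (s_w - s_l)
    margin_grad = [beta * (a - b) for a, b in zip(g_w, g_l)]
    g_orig = [-sigmoid(-z) * m for m in margin_grad]
    g_flip = [sigmoid(z) * m for m in margin_grad]
    return [f - o for f, o in zip(g_flip, g_orig)]

phi_w, phi_l = [1.0, 0.0], [0.0, 1.0]  # hypothetical feature vectors
shift_a = flip_shift_nonlinear([0.0, 0.0], phi_w, phi_l, beta=1.0)
shift_b = flip_shift_nonlinear([2.0, 0.0], phi_w, phi_l, beta=1.0)
```

In this toy setting the first component of the shift is 1 at theta = (0, 0) and 1 - tanh(2)^2 at theta = (2, 0), so the shift is no longer a fixed dictionary atom.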
Original abstract
Offline Reinforcement Learning from Human Feedback (RLHF) pipelines such as Direct Preference Optimization (DPO) train on a pre-collected preference dataset, which makes them vulnerable to preference poisoning attacks. We study label-flip attacks against log-linear DPO. We first illustrate that flipping one preference label induces a parameter-independent shift in the DPO gradient. Using this key property, we can then convert the targeted poisoning problem into a structured binary sparse approximation problem. To solve this problem, we develop two attack methods: Binary-Aware Lattice Attack (BAL-A) and Binary Matching Pursuit Attack (BMP-A). BAL-A embeds the binary flip selection problem into a binary-aware lattice and applies Lenstra-Lenstra-Lovász reduction and Babai's nearest plane algorithm; we provide sufficient conditions that enforce binary coefficients and recover the minimum-flip objective. BMP-A adapts binary matching pursuit to our non-normalized gradient dictionary and yields coherence-based recovery guarantees and robustness (impossibility) certificates for $K$-flip budgets. Experiments on synthetic dictionaries and the Stanford Human Preferences dataset validate the theory and highlight how dictionary geometry governs attack success.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that for log-linear DPO (linear reward model under Bradley-Terry), flipping one preference label produces a parameter-independent shift in the DPO gradient equal to the fixed feature difference vector. This property converts the targeted poisoning attack into a structured binary sparse approximation problem over a gradient dictionary. The authors introduce two solvers: BAL-A, which embeds the problem in a binary-aware lattice and applies LLL reduction plus Babai's nearest-plane algorithm with sufficient conditions guaranteeing binary coefficients and minimum-flip recovery; and BMP-A, which adapts binary matching pursuit to the non-normalized dictionary and supplies coherence-based recovery guarantees plus K-flip robustness certificates. Experiments on synthetic dictionaries and the Stanford Human Preferences dataset are used to validate the theory and illustrate the role of dictionary geometry.
Significance. If the central reduction and guarantees hold, the work supplies a theoretically grounded, computationally efficient attack on offline RLHF that directly exploits the structure of the DPO loss rather than relying on black-box optimization. The explicit derivation of the constant gradient shift, the conversion to sparse approximation, the provision of sufficient conditions for BAL-A, and the coherence-based certificates for BMP-A constitute clear strengths. The empirical results on real preference data further demonstrate that dictionary geometry governs attack success, which is useful for both attack design and potential defenses.
minor comments (4)
- Abstract and introduction should explicitly restate the scope limitation to log-linear DPO at the outset so readers immediately understand that the parameter-independence result does not apply to non-linear reward models or additional regularizers.
- In the BAL-A section, the sufficient conditions for enforcing binary coefficients after lattice reduction should be illustrated with a small numerical example or a remark on how often they are satisfied in practice for typical preference feature dimensions.
- The BMP-A coherence bound and robustness certificates would benefit from a brief comparison table showing how the achieved coherence values on the Stanford dataset relate to the theoretical thresholds for exact recovery.
- Figure captions for the synthetic dictionary experiments should include the precise values of dimension, number of atoms, and sparsity level used, to allow direct reproduction of the reported success rates.
Simulated Author's Rebuttal
We thank the referee for the positive assessment, accurate summary of our contributions, and recommendation for minor revision. We are pleased that the central reduction of label-flip attacks to structured binary sparse approximation, the BAL-A and BMP-A solvers with their respective recovery guarantees, and the role of dictionary geometry are recognized as strengths.
Circularity Check
No significant circularity identified
The paper's core derivation begins from the explicit log-linear DPO loss under the Bradley-Terry model and directly computes the gradient difference induced by a single label flip, yielding a constant shift vector $\Delta = \phi_w - \phi_l$ that is independent of $\theta$ by algebraic cancellation in the sigmoid terms. This property is then used to recast the poisoning objective as a binary sparse approximation problem over the feature dictionary. The subsequent BAL-A and BMP-A algorithms apply standard lattice reduction (LLL + Babai) and matching pursuit, respectively, whose sufficient conditions and coherence guarantees are imported from the external literature on sparse approximation rather than being fitted or self-referenced within the paper. No load-bearing step equates a claimed result to its own inputs by construction, renames a fitted quantity as a prediction, or relies on self-citation chains; the analysis remains self-contained within the stated scope of log-linear DPO.
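The cancellation can be written out in a few lines. The derivation below is a sketch under an assumed parametrization: the margin is $z(\theta) = \beta\,\theta^\top \Delta + c$, where $\beta$ is the DPO temperature and $c$ is a parameter-independent reference offset. The scale factor $\beta$ and the sign convention are assumptions and may differ from the paper's notation.

```latex
\begin{align*}
\ell(\theta) &= -\log \sigma\bigl(z(\theta)\bigr),
  & \nabla_\theta \ell &= -\beta\,\sigma\bigl(-z(\theta)\bigr)\,\Delta,\\
\ell^{\mathrm{flip}}(\theta) &= -\log \sigma\bigl(-z(\theta)\bigr),
  & \nabla_\theta \ell^{\mathrm{flip}} &= \beta\,\sigma\bigl(z(\theta)\bigr)\,\Delta,\\
\nabla_\theta \ell^{\mathrm{flip}} - \nabla_\theta \ell
  &= \beta\,\bigl[\sigma(z) + \sigma(-z)\bigr]\,\Delta
  = \beta\,\Delta,
  &&\text{since } \sigma(z) + \sigma(-z) = 1.
\end{align*}
```

The $\theta$-dependent sigmoid weights sum to one, so only the fixed direction $\Delta$ (scaled by $\beta$ here) remains.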
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: DPO loss is log-linear in the model parameters
- domain assumption: Single label flip produces a parameter-independent gradient shift
Lean theorems connected to this paper
- IndisputableMonolith.Cost.FunctionalEquation washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "flipping one preference label induces a parameter-independent shift in the DPO gradient ... convert the targeted poisoning problem into a structured binary sparse approximation problem"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.