pith. machine review for the scientific record.

arxiv: 2605.02495 · v1 · submitted 2026-05-04 · 💻 cs.LG · cs.AI · stat.ML

Recognition: 3 theorem links


Efficient Preference Poisoning Attack on Offline RLHF

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 19:25 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML
keywords preference poisoning · offline RLHF · DPO · label flip attack · sparse approximation · lattice reduction · binary matching pursuit · gradient shift

The pith

Flipping one preference label in log-linear DPO creates a parameter-independent gradient shift that turns targeted poisoning into a binary sparse approximation problem.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that in offline RLHF using log-linear Direct Preference Optimization, flipping a single preference label produces a fixed shift in the gradient vector that does not depend on the current model parameters. This independence lets the authors reformulate the problem of selecting which labels to flip as a structured binary sparse approximation task over a dictionary of gradient directions. They introduce two solution methods: BAL-A, which embeds the selection into a binary-aware lattice and applies lattice reduction plus nearest-plane search, and BMP-A, which adapts binary matching pursuit to deliver coherence-based recovery guarantees and impossibility certificates for a given flip budget. A reader would care because the result demonstrates how minimal changes to a pre-collected preference dataset can steer the trained policy, with provable efficiency and explicit conditions for success.

Core claim

In log-linear DPO, flipping one preference label induces a parameter-independent shift in the DPO gradient. This converts the targeted poisoning problem into a structured binary sparse approximation problem, which BAL-A and BMP-A solve using lattice reduction and binary matching pursuit with sufficient conditions, coherence-based guarantees, and robustness certificates.

What carries the argument

The parameter-independent gradient shift induced by a single preference label flip, which reduces the poisoning attack to binary sparse approximation over a non-normalized gradient dictionary.
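
A minimal reconstruction of that cancellation, pieced together from the derivation fragments the source page spilled into its figure captions; oᵢ ∈ {±1} encodes the observed preference direction, β the DPO temperature, ψ the feature map, and θμ the reference parameters:

```latex
% Per-sample log-linear DPO loss, with feature difference
%   \Delta\psi_i := \psi(s_i, a_i) - \psi(s_i, a'_i):
\ell_i(\theta) = -\log \sigma\!\left( o_i\,\beta\, \Delta\psi_i^{\top} (\theta - \theta_\mu) \right)

% Writing u_i := o_i\,\beta\, \Delta\psi_i^{\top}(\theta - \theta_\mu), the gradient is
\nabla_\theta \ell_i(\theta) = -\sigma(-u_i)\, o_i\,\beta\, \Delta\psi_i

% Flipping the label sends o_i \mapsto -o_i (hence u_i \mapsto -u_i), so the shift is
\Delta g_i(\theta)
  = \sigma(u_i)\, o_i\,\beta\, \Delta\psi_i - \left(-\sigma(-u_i)\, o_i\,\beta\, \Delta\psi_i\right)
  = \underbrace{\left(\sigma(u_i) + \sigma(-u_i)\right)}_{=\,1}\, o_i\,\beta\, \Delta\psi_i
  = o_i\,\beta\, \Delta\psi_i =: \Delta g_i
```

Every θ-dependent factor cancels through σ(x) = 1 − σ(−x), which is exactly what makes the dictionary of shifts Δgᵢ computable before training.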

If this is right

  • BAL-A recovers the minimum number of flips when the lattice reduction and nearest-plane steps satisfy the stated sufficient conditions for binary coefficients.
  • BMP-A provides coherence-based recovery guarantees and impossibility certificates that bound attack success for any K-flip budget; a toy version of the pursuit step is sketched after this list.
  • Attack effectiveness is governed by the geometry of the gradient dictionary constructed from the preference data.
  • The same reduction applies to any log-linear preference optimization objective that admits an additive gradient contribution per sample.
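
To make the reduction concrete, here is a minimal sketch of the greedy pursuit referenced above: columns of V hold the per-sample shifts Δgᵢ and y is the desired gradient displacement. The function name, scoring rule, and stopping logic are illustrative stand-ins, not the paper's exact BMP-A or its certificates.

```python
import numpy as np

def binary_matching_pursuit(V, y, K, eps=1e-3):
    """Greedy binary pursuit over a non-normalized dictionary.

    V   : (d, n) array; column j is the gradient shift of flipping label j.
    y   : (d,) target gradient displacement.
    K   : flip budget.
    Returns a binary selection x and the final residual norm.
    """
    _, n = V.shape
    x = np.zeros(n)
    r = y.astype(float).copy()
    atom_norms2 = np.sum(V * V, axis=0)
    for _ in range(K):
        # Reduction in squared residual from adding atom j with coefficient 1:
        # ||r||^2 - ||r - v_j||^2 = 2 <r, v_j> - ||v_j||^2.
        gains = 2 * (V.T @ r) - atom_norms2
        gains[x == 1] = -np.inf            # each label can be flipped once
        j = int(np.argmax(gains))
        if gains[j] <= 0:                  # no remaining atom helps
            break
        x[j] = 1.0
        r -= V[:, j]
        if np.linalg.norm(r) <= eps:       # target reached within tolerance
            break
    return x, np.linalg.norm(r)
```

Flips are then applied to exactly the samples with xⱼ = 1; BAL-A chases the same target through a lattice embedding and nearest-plane search instead of greedy selection.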

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If preference datasets are collected from public or crowdsourced sources, an adversary could pre-compute the dictionary once and reuse the lattice or pursuit solver for multiple target policies.
  • Defenses that add small non-linear regularizers or switch to non-log-linear objectives would invalidate the parameter-independence step and thereby block this family of attacks.
  • The coherence measure that controls BMP-A recovery could be used as a dataset-quality metric to identify preference collections that are naturally harder to poison; a sketch of that computation follows this list.
  • Extending the lattice construction to include higher-order interactions among flips might yield tighter bounds when multiple labels affect overlapping gradient directions.
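
A sketch of the computation behind that metric: the standard mutual coherence of the normalized dictionary, matching the "pairwise normalized correlations" of Figure 5. The recovery condition quoted in the comment is the classical matching-pursuit bound from the sparse-approximation literature; the paper's BMP-A certificates for the binary, non-normalized setting may be sharper or different.

```python
import numpy as np

def mutual_coherence(V):
    """Largest pairwise normalized correlation among dictionary columns."""
    U = V / np.linalg.norm(V, axis=0, keepdims=True)  # unit-normalize atoms
    C = np.abs(U.T @ U)
    np.fill_diagonal(C, 0.0)                          # ignore self-correlation
    return C.max()

# Illustrative use on a random dictionary (not the paper's data).
rng = np.random.default_rng(0)
V = rng.normal(size=(64, 200))
mu = mutual_coherence(V)
# Classical bound: matching pursuit recovers any K-sparse combination
# exactly when K < (1 + 1/mu) / 2, so lower coherence certifies a
# larger attack (or defense) budget.
print(f"coherence = {mu:.3f}, classical bound: K < {(1 + 1/mu) / 2:.2f}")
```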

Load-bearing premise

The DPO objective must be strictly log-linear in the parameters so the gradient shift from any single label flip remains independent of the current parameter vector.

What would settle it

Observe that the gradient shift after one label flip changes with the current parameter values when the model is trained with a non-linear preference objective or with regularization terms that break log-linearity.
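
A numerical version of that test, as a sketch under toy assumptions (the tanh margin below is an illustrative stand-in for any objective that breaks log-linearity, not a model from the paper): evaluate the flip-induced gradient shift at several random parameter vectors and check whether it moves.

```python
import numpy as np

rng = np.random.default_rng(0)
d, beta = 8, 0.5
dpsi = rng.normal(size=d)            # feature difference for one sample
theta_mu = rng.normal(size=d)        # reference-policy parameters

def grad_loglinear(theta, o):
    """Per-sample gradient of -log sigma(o * beta * dpsi^T (theta - theta_mu))."""
    u = o * beta * dpsi @ (theta - theta_mu)
    s = 1.0 / (1.0 + np.exp(-u))
    return -(1.0 - s) * o * beta * dpsi

def grad_nonlinear(theta, o):
    """Same loss with a tanh margin, which breaks log-linearity."""
    m = np.tanh(dpsi @ (theta - theta_mu))
    u = o * beta * m
    s = 1.0 / (1.0 + np.exp(-u))
    return -(1.0 - s) * o * beta * (1.0 - m**2) * dpsi  # chain rule through tanh

for grad in (grad_loglinear, grad_nonlinear):
    shifts = [grad(theta, -1) - grad(theta, +1)          # flip o = +1 -> -1
              for theta in rng.normal(size=(3, d))]
    spread = max(np.linalg.norm(s - shifts[0]) for s in shifts)
    print(grad.__name__, "shift spread across theta:", spread)
# Expected: ~1e-16 for the log-linear model, clearly nonzero for tanh.
```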

Figures

Figures reproduced from arXiv: 2605.02495 by Chenye Yang, Lifeng Lai, Weiyu Xu.

Figure 1
Figure 1. TPR of BAL-A on synthetic V as a function of M. view at source ↗
Figure 2
Figure 2. TPR of BMP-A on synthetic V as a function of K⋆. view at source ↗
Figure 3
Figure 3. True positive rate of BAL-A on V from SHP as a function of M. view at source ↗
Figure 4
Figure 4. ℓ2 distance between learned parameters and ℓ1 distance between learned policies, comparing training on the BAL-A attacked subset D̃(x̂) versus training on the ground-truth attacked subset D̃(x⋆), as a function of M. view at source ↗
Figure 5
Figure 5. Histogram of pairwise normalized correlations for two subsets of SHP: a random subset and a low-coherence subset. Targets are generated as y = Vx⋆ and BMP-A is run with tolerance ε = 10⁻³ up to budget tmax = 15 over 200 trials; the low-coherence subset yields consistently higher TPR as the budget increases and drives the residual down faster, often reaching a near-zero residual around K⋆. view at source ↗
Figure 6
Figure 6. True positive rate and residual of BMP-A on V from different subsets of SHP as a function of budget K. view at source ↗
Figure 7
Figure 7. ℓ2 distance between learned parameters and ℓ1 distance between learned policies, comparing training on the BMP-A attacked D̃(x̂) versus training on the ground-truth attacked D̃(x⋆), as a function of budget K. view at source ↗
read the original abstract

Offline Reinforcement Learning from Human Feedback (RLHF) pipelines such as Direct Preference Optimization (DPO) train on a pre-collected preference dataset, which makes them vulnerable to preference poisoning attack. We study label flip attacks against log-linear DPO. We first illustrate that flipping one preference label induces a parameter-independent shift in the DPO gradient. Using this key property, we can then convert the targeted poisoning problem into a structured binary sparse approximation problem. To solve this problem, we develop two attack methods: Binary-Aware Lattice Attack (BAL-A) and Binary Matching Pursuit Attack (BMP-A). BAL-A embeds the binary flip selection problem into a binary-aware lattice and applies Lenstra-Lenstra-Lovász reduction and Babai's nearest plane algorithm; we provide sufficient conditions that enforce binary coefficients and recover the minimum-flip objective. BMP-A adapts binary matching pursuit to our non-normalized gradient dictionary and yields coherence-based recovery guarantees and robustness (impossibility) certificates for K-flip budgets. Experiments on synthetic dictionaries and the Stanford Human Preferences dataset validate the theory and highlight how dictionary geometry governs attack success.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 4 minor

Summary. The paper claims that for log-linear DPO (linear reward model under Bradley-Terry), flipping one preference label produces a parameter-independent shift in the DPO gradient equal to the fixed feature difference vector. This property converts the targeted poisoning attack into a structured binary sparse approximation problem over a gradient dictionary. The authors introduce two solvers: BAL-A, which embeds the problem in a binary-aware lattice and applies LLL reduction plus Babai's nearest-plane algorithm with sufficient conditions guaranteeing binary coefficients and minimum-flip recovery; and BMP-A, which adapts binary matching pursuit to the non-normalized dictionary and supplies coherence-based recovery guarantees plus K-flip robustness certificates. Experiments on synthetic dictionaries and the Stanford Human Preferences dataset are used to validate the theory and illustrate the role of dictionary geometry.

Significance. If the central reduction and guarantees hold, the work supplies a theoretically grounded, computationally efficient attack on offline RLHF that directly exploits the structure of the DPO loss rather than relying on black-box optimization. The explicit derivation of the constant gradient shift, the conversion to sparse approximation, the provision of sufficient conditions for BAL-A, and the coherence-based certificates for BMP-A constitute clear strengths. The empirical results on real preference data further demonstrate that dictionary geometry governs attack success, which is useful for both attack design and potential defenses.

minor comments (4)
  1. Abstract and introduction should explicitly restate the scope limitation to log-linear DPO at the outset so readers immediately understand that the parameter-independence result does not apply to non-linear reward models or additional regularizers.
  2. In the BAL-A section, the sufficient conditions for enforcing binary coefficients after lattice reduction should be illustrated with a small numerical example or a remark on how often they are satisfied in practice for typical preference feature dimensions.
  3. The BMP-A coherence bound and robustness certificates would benefit from a brief comparison table showing how the achieved coherence values on the Stanford dataset relate to the theoretical thresholds for exact recovery.
  4. Figure captions for the synthetic dictionary experiments should include the precise values of dimension, number of atoms, and sparsity level used, to allow direct reproduction of the reported success rates.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment, accurate summary of our contributions, and recommendation for minor revision. We are pleased that the central reduction of label-flip attacks to structured binary sparse approximation, the BAL-A and BMP-A solvers with their respective recovery guarantees, and the role of dictionary geometry are recognized as strengths.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper's core derivation begins from the explicit log-linear DPO loss under the Bradley-Terry model and directly computes the gradient difference induced by a single label flip, yielding a constant shift vector Δgᵢ = oᵢβΔψᵢ, where Δψᵢ is the feature difference between the preferred and rejected responses, that is independent of θ by algebraic cancellation in the sigmoid terms. This property is then used to recast the poisoning objective as a binary sparse approximation problem over the gradient dictionary. The subsequent BAL-A and BMP-A algorithms apply standard lattice reduction (LLL plus Babai's nearest plane) and matching pursuit, respectively, whose sufficient conditions and coherence guarantees are imported from the external literature on sparse approximation rather than being fitted or self-referenced within the paper. No load-bearing step equates a claimed result to its own inputs by construction, renames a fitted quantity as a prediction, or relies on self-citation chains; the analysis remains self-contained within the stated scope of log-linear DPO.
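
For concreteness, a textbook Babai nearest-plane routine over an already-reduced basis, the standard algorithm the audit refers to. This is not the paper's binary-aware embedding; the sufficient conditions that force the resulting coefficients to be binary are the paper's own contribution.

```python
import numpy as np

def babai_nearest_plane(B, t):
    """Babai's nearest-plane algorithm.

    B : (n, d) array whose rows are a linearly independent, ideally
        LLL-reduced, lattice basis.
    t : (d,) target vector.
    Returns integer coefficients c such that c @ B is a lattice point
    close to t.
    """
    n = B.shape[0]
    # Gram-Schmidt orthogonalization of the basis rows.
    Bstar = np.zeros(B.shape, dtype=float)
    for i in range(n):
        Bstar[i] = B[i] - sum(
            (B[i] @ Bstar[j]) / (Bstar[j] @ Bstar[j]) * Bstar[j]
            for j in range(i)
        )
    c = np.zeros(n, dtype=int)
    r = t.astype(float).copy()
    for i in range(n - 1, -1, -1):   # peel off one hyperplane at a time
        c[i] = int(np.round((r @ Bstar[i]) / (Bstar[i] @ Bstar[i])))
        r -= c[i] * B[i]
    return c
```

BAL-A's embedding additionally has to ensure the rounded coefficients land in {0, 1}, which is where the paper's sufficient conditions enter.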

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on the log-linear structure of DPO and the parameter-independence of single-flip gradient shifts; no new entities are postulated and no parameters are fitted inside the attack construction itself.

axioms (2)
  • domain assumption DPO loss is log-linear in the model parameters
    Explicitly stated as the setting for the attack analysis
  • domain assumption Single label flip produces a parameter-independent gradient shift
    Described as the key property that enables the reduction to sparse approximation

pith-pipeline@v0.9.0 · 5496 in / 1367 out tokens · 30220 ms · 2026-05-08T19:25:48.070349+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

141 extracted references · 33 canonical work pages · 5 internal anchors
