Density-Ratio Weighted Behavioral Cloning: Learning Control Policies from Corrupted Datasets
Pith reviewed 2026-05-21 20:48 UTC · model grok-4.3
The pith
Density-ratio weighted behavioral cloning recovers the clean expert policy from contaminated offline datasets using a small verified reference set.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The method weights each trajectory in the behavioral cloning loss by the estimated density ratio between the clean reference distribution and the observed (possibly corrupted) distribution, obtained via a binary classifier. This weighting ensures that the optimization prioritizes expert-like behavior and downweights corrupted samples. The authors prove that the resulting policy converges to the optimal clean expert policy with error bounds independent of the fraction of corrupted data.
What carries the argument
Trajectory-level density ratios estimated by a binary discriminator distinguishing the small clean reference set from the corrupted dataset, clipped and used as importance weights in the behavioral cloning objective.
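The pipeline described above (discriminator, ratio, clipping, weighted cloning) can be sketched on synthetic data. This is a minimal illustration, not the authors' implementation: the logistic-regression discriminator, the trajectory feature `traj_features`, the class balancing, the linear policy, and the sign-flip action poisoning are all assumptions made here for the sake of a runnable example.

```python
import numpy as np

def traj_features(S, A):
    """Hypothetical trajectory summary: mean state-action cross term."""
    return (S * A).mean(axis=0)  # S: (T, ds), A: (T, 1) -> (ds,)

def fit_discriminator(ref_x, data_x, lr=0.5, steps=2000):
    """Logistic regression, label 1 = clean reference, 0 = observed dataset.
    Classes are balanced so that d/(1-d) approximates the density ratio."""
    X = np.vstack([ref_x, data_x])
    X = np.hstack([X, np.ones((len(X), 1))])  # bias column
    y = np.r_[np.ones(len(ref_x)), np.zeros(len(data_x))]
    sw = np.where(y == 1, 0.5 / len(ref_x), 0.5 / len(data_x))
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ theta, -30, 30)))
        theta -= lr * (X.T @ (sw * (p - y)))  # gradient of balanced log-loss
    return theta

def clipped_ratio(theta, x, eps=1e-3, C=10.0):
    """w = clip(d/(1-d), eps, C); note d/(1-d) equals exp(logit)."""
    logit = np.clip(np.hstack([x, np.ones((len(x), 1))]) @ theta, -30, 30)
    return np.clip(np.exp(logit), eps, C)

def weighted_bc(trajs, w):
    """Weighted least-squares cloning of a linear policy a = s @ K."""
    sw = np.sqrt(np.concatenate([np.full(len(S), wi)
                                 for (S, _), wi in zip(trajs, w)]))
    S = np.vstack([S for S, _ in trajs])
    A = np.vstack([A for _, A in trajs])
    K, *_ = np.linalg.lstsq(S * sw[:, None], A * sw[:, None], rcond=None)
    return K.ravel()

def make_traj(rng, K, T=30, flip=False):
    """Expert acts as a = s @ K + noise; a poisoned trajectory flips the sign."""
    S = rng.normal(size=(T, len(K)))
    A = S @ K[:, None] * (-1.0 if flip else 1.0) + 0.1 * rng.normal(size=(T, 1))
    return S, A

rng = np.random.default_rng(0)
K_true = np.array([1.0, -0.5])
reference = [make_traj(rng, K_true) for _ in range(30)]       # verified clean set
dataset = [make_traj(rng, K_true, flip=(i < 120)) for i in range(200)]  # 60% poisoned

ref_x = np.array([traj_features(*t) for t in reference])
data_x = np.array([traj_features(*t) for t in dataset])
theta = fit_discriminator(ref_x, data_x)
w = clipped_ratio(theta, data_x)

K_weighted = weighted_bc(dataset, w)                  # density-ratio weighted BC
K_plain = weighted_bc(dataset, np.ones(len(dataset)))  # standard BC
print("weighted BC error:", np.linalg.norm(K_weighted - K_true))
print("plain BC error:   ", np.linalg.norm(K_plain - K_true))
```

In this toy setting the poisoned trajectories receive weights near the lower clip `eps`, so the weighted fit stays close to `K_true`, while the unweighted fit is pulled toward the flipped policy.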
If this is right
- The learned policy maintains near-optimal performance on continuous control benchmarks even at high contamination ratios.
- Finite-sample error bounds hold independently of the contamination fraction for any poisoning mechanism.
- The approach succeeds across reward, state, transition, and action poisoning protocols without prior knowledge of the corruption type.
- It outperforms standard behavioral cloning as well as BCQ and BRAC under the same contaminated data conditions.
Where Pith is reading between the lines
- The same density-ratio weighting could be inserted into other offline RL objectives such as Q-learning variants to improve their robustness.
- The method suggests a practical pathway for deploying imitation learning in safety-critical robotics where limited verified clean data can be collected separately.
- Adaptive or online selection of the clean reference set could further reduce the need for upfront verification.
Load-bearing premise
A small verified clean reference set exists whose distribution is close to the true clean expert trajectories, enabling reliable estimation of density ratios by the binary discriminator.
What would settle it
Observing that the weighted policy's performance degrades proportionally with increasing contamination rate, even when provided with the clean reference set, would falsify the independence of the bounds from contamination.
Original abstract
Offline reinforcement learning (RL) enables policy optimization from fixed datasets, making it suitable for safety-critical applications where online exploration is infeasible. However, these datasets are often contaminated by adversarial poisoning, system errors, or low-quality samples, leading to degraded policy performance in standard behavioral cloning (BC) and offline RL methods. This paper introduces Density-Ratio Weighted Behavioral Cloning (Weighted BC), a robust imitation learning approach that uses a small, verified clean reference set to estimate trajectory-level density ratios via a binary discriminator. These ratios are clipped and used as weights in the BC objective to prioritize clean expert behavior while down-weighting or discarding corrupted data, without requiring knowledge of the contamination mechanism. We establish theoretical guarantees showing convergence to the clean expert policy with finite-sample bounds that are independent of the contamination rate. A comprehensive evaluation framework is established, which incorporates various poisoning protocols (reward, state, transition, and action) on continuous control benchmarks. Experiments demonstrate that Weighted BC maintains near-optimal performance even at high contamination ratios, outperforming baselines such as traditional BC, batch-constrained Q-learning (BCQ), and behavior regularized actor-critic (BRAC).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Density-Ratio Weighted Behavioral Cloning (Weighted BC) for offline imitation learning from contaminated datasets. It trains a binary discriminator on a small verified clean reference set versus the corrupted dataset to estimate trajectory-level density ratios, clips these ratios, and uses them as weights in the behavioral cloning objective. The central claims are theoretical convergence to the clean expert policy together with finite-sample bounds independent of the contamination rate, supported by experiments on continuous control tasks under reward, state, transition, and action poisoning protocols that show robustness over standard BC, BCQ, and BRAC.
Significance. If the finite-sample bounds can be shown to hold with the stated independence, the work would provide a practically useful robustification of behavioral cloning that avoids explicit modeling of the contamination process. The empirical protocol covering multiple poisoning types and the use of a clean reference for density-ratio weighting are concrete strengths. The approach is a natural extension of importance weighting ideas to the imitation setting and could be impactful for safety-critical offline RL if the theory is tightened.
major comments (2)
- [Theoretical analysis] Abstract and theoretical analysis section: the claim of finite-sample bounds independent of contamination rate α is load-bearing for the main contribution. Standard concentration arguments on the weighted empirical risk would typically introduce a 1/sqrt((1-α)N) factor once low-weight samples are effectively discarded; the manuscript must exhibit the specific lemma or proof step that cancels this dependence (e.g., via a uniform bound on the discriminator error that itself does not degrade with α).
- [Method] Method and assumptions section: the convergence guarantee rests on the clean reference set being sufficiently close in distribution to the true expert trajectories so that the binary discriminator yields reliable weights. The paper should state an explicit quantitative assumption (e.g., total-variation or Hellinger distance bound between reference and clean expert) and show how it propagates into the finite-sample rate.
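The first major comment can be made concrete with Kish's effective sample size, ESS = (Σw)² / Σw². This quantity is a standard diagnostic for importance weights and is not taken from the paper. The sketch below assumes idealized weights: corrupted trajectories get the lower clip `eps` and clean ones get 1/(1−α). The helper names are hypothetical.

```python
import numpy as np

def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum w^2."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

def idealized_weights(n, alpha, eps=0.0):
    """Assumed ideal outcome: corrupted samples get eps, clean ones 1/(1-alpha)."""
    n_bad = int(round(alpha * n))
    return np.r_[np.full(n_bad, eps), np.full(n - n_bad, 1.0 / (1.0 - alpha))]

N = 1000
for alpha in (0.1, 0.5, 0.9):
    ess = effective_sample_size(idealized_weights(N, alpha))
    print(f"alpha={alpha}: ESS={ess:.1f}, (1-alpha)N={(1 - alpha) * N:.1f}")
```

With `eps = 0` the effective sample size equals (1−α)N, which is the quantity behind the referee's 1/sqrt((1−α)N) factor. Any argument for α-independence has to explain why this shrinkage does not enter the rate.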
minor comments (2)
- [Experiments] Clarify the precise clipping threshold used for the density ratios and whether it is chosen adaptively or fixed; report its sensitivity in the experiments.
- [Discussion] Add a limitations paragraph discussing the practical requirement of obtaining a verified clean reference set and the degradation that occurs when this set is too small or distributionally mismatched.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments help clarify the presentation of our theoretical results. We respond to each major comment below and indicate the revisions we will make.
- Referee: [Theoretical analysis] Abstract and theoretical analysis section: the claim of finite-sample bounds independent of contamination rate α is load-bearing for the main contribution. Standard concentration arguments on the weighted empirical risk would typically introduce a 1/sqrt((1-α)N) factor once low-weight samples are effectively discarded; the manuscript must exhibit the specific lemma or proof step that cancels this dependence (e.g., via a uniform bound on the discriminator error that itself does not degrade with α).
  Authors: We thank the referee for this observation. The independence from α follows from the fact that the binary discriminator is trained solely on the fixed-size clean reference set; its estimation error is controlled by a uniform convergence result (Lemma 4 in the appendix) whose sample complexity depends only on the reference set size and the Rademacher complexity of the discriminator class, with no dependence on α or the size of the corrupted dataset. The weights are then clipped to a constant range independent of α, so that the subsequent Hoeffding-type bound on the weighted behavioral cloning objective (Theorem 3) yields a rate of order 1/sqrt(N) that does not contain an extra 1/sqrt(1-α) factor. We will revise the theoretical analysis section to add an explicit pointer to Lemma 4 and a short paragraph walking through this cancellation.
  Revision: yes
- Referee: [Method] Method and assumptions section: the convergence guarantee rests on the clean reference set being sufficiently close in distribution to the true expert trajectories so that the binary discriminator yields reliable weights. The paper should state an explicit quantitative assumption (e.g., total-variation or Hellinger distance bound between reference and clean expert) and show how it propagates into the finite-sample rate.
  Authors: We agree that an explicit quantitative assumption would strengthen the statement of the results. The current manuscript implicitly treats the reference set as drawn from the clean expert distribution. We will add Assumption 3 stating that the total-variation distance between the reference distribution and the true expert trajectory distribution is bounded by a small constant δ. In the proof of Theorem 2 we will show that this δ appears as an additive O(δ) term in the finite-sample bound; the same term propagates into the rate of Theorem 3. These changes will appear in the revised Method and Theoretical Analysis sections.
  Revision: yes
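The two ingredients the responses appeal to can be written out in their textbook form. This is a sketch under stated assumptions and not a reproduction of the paper's Lemma 4 or Theorems 2 and 3: it covers a single fixed policy π (no uniform convergence over a policy class), treats the weight function w as fixed independently of the N sampled trajectories, and introduces the symbols B (a bound on the per-trajectory loss) and η (a failure probability), neither of which appears in the source.

```latex
% Assumed symbols (not from the source): B bounds the per-trajectory loss,
% \eta is the failure probability, p is the observed data distribution.
% Hoeffding's inequality for the clipped, weighted empirical risk:
\[
\Pr\!\left(
  \left| \frac{1}{N}\sum_{i=1}^{N} w(\tau_i)\,\ell_\pi(\tau_i)
         - \mathbb{E}_{\tau\sim p}\!\left[ w(\tau)\,\ell_\pi(\tau) \right] \right|
  \le C B \sqrt{\frac{\log(2/\eta)}{2N}}
\right) \ge 1-\eta ,
\qquad w(\tau)\in[\varepsilon, C],\ \ell_\pi(\tau)\in[0,B].
\]
% Reference-set mismatch measured in total variation:
\[
\left| \mathbb{E}_{p_{\mathrm{ref}}}[\ell_\pi] - \mathbb{E}_{p_E}[\ell_\pi] \right|
\le B \,\mathrm{TV}(p_{\mathrm{ref}}, p_E) \le B\,\delta,
\qquad \mathrm{TV}(P,Q) = \sup_{A} |P(A)-Q(A)|.
\]
```

The first line shows where a 1/sqrt(N) rate with constant C·B comes from once the weights are clipped to [ε, C]; the second shows why a total-variation bound δ between the reference and expert distributions enters as an additive O(δ) term.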
Circularity Check
No significant circularity detected
The paper presents a method that trains a binary discriminator on a small verified clean reference set versus the contaminated dataset to obtain trajectory-level density ratios, which are then clipped and used as weights in the behavioral cloning objective. Theoretical guarantees are stated for convergence to the clean expert policy with finite-sample bounds claimed to be independent of the contamination rate. No load-bearing step in the described derivation reduces by construction to its inputs: the bounds are not a fitted parameter renamed as a prediction, the discriminator is not defined in terms of the final policy, and no uniqueness theorem or ansatz is imported via self-citation. The central result is derived from standard weighted empirical risk analysis under explicit assumptions on the clean reference set and discriminator quality, making the chain self-contained rather than tautological.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: A small verified clean reference set is available whose trajectories are drawn from the same distribution as the clean expert behavior.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: w_i = clip(r(τ_i), ε, C) ... r(τ) = d_φ(τ)/(1 − d_φ(τ)) ... uniform clean-risk approximation bounds that are independent of contamination severity under appropriate clipping thresholds
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean: alpha_pin_under_high_calibration (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Theorem 1 (Uniform clean-risk approximation) ... E_clip := E_p[(w⋆ − C)_+ + (ε − w⋆)_+]
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Security Considerations for Multi-agent Systems: No existing AI security framework covers a majority of the 193 identified multi-agent system threats in any category, with OWASP Agentic Security Initiative achieving the highest overall coverage at 65.3%.