arxiv: 2510.01479 · v2 · pith:OZ2UPLYEnew · submitted 2025-10-01 · 💻 cs.LG · cs.SY · eess.SY

Density-Ratio Weighted Behavioral Cloning: Learning Control Policies from Corrupted Datasets

Pith reviewed 2026-05-21 20:48 UTC · model grok-4.3

classification 💻 cs.LG · cs.SY · eess.SY
keywords: imitation learning, behavioral cloning, offline reinforcement learning, density ratio estimation, robust policy learning, corrupted datasets, trajectory weighting

The pith

Density-ratio weighted behavioral cloning recovers the clean expert policy from contaminated offline datasets using a small verified reference set.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Density-Ratio Weighted Behavioral Cloning as a robust imitation learning approach for offline settings where datasets contain corrupted or adversarial samples. It trains a binary discriminator to separate a small clean reference set from the contaminated dataset, uses the discriminator to estimate trajectory-level density ratios, then clips these ratios and applies them as weights in the standard behavioral cloning objective. This prioritizes clean expert trajectories while down-weighting corrupted ones without requiring knowledge of the corruption process. Theoretical results establish convergence to the clean expert policy along with finite-sample bounds that remain independent of the contamination rate. Experiments across continuous control benchmarks under multiple poisoning protocols show that the method sustains near-optimal performance where standard behavioral cloning and other offline RL baselines degrade.

Core claim

The method weights each trajectory in the behavioral cloning loss by the estimated density ratio between the clean reference distribution and the observed (possibly corrupted) distribution, where the ratio is obtained from a binary classifier. This weighting ensures that the optimization prioritizes expert-like behavior and down-weights corrupted samples. The authors prove that the resulting policy converges to the optimal clean expert policy with error bounds independent of the fraction of corrupted data.

What carries the argument

Trajectory-level density ratios estimated by a binary discriminator distinguishing the small clean reference set from the corrupted dataset, clipped and used as importance weights in the behavioral cloning objective.
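
The weighting pipeline described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the default clipping threshold `clip_max=10.0`, and the choice to normalize the loss by the sum of weights are all assumptions made here. The only standard ingredient is the classifier-based ratio identity: if a discriminator outputs the probability `p_clean` that a trajectory came from the reference set, the density ratio is approximately `p_clean / (1 - p_clean)` times the class-size correction `n_obs / n_ref`.

```python
from typing import List, Sequence


def density_ratio(p_clean: float, n_ref: int, n_obs: int, eps: float = 1e-6) -> float:
    """Turn a discriminator probability into an estimated density ratio.

    p_clean: probability that the trajectory came from the clean reference set.
    n_ref, n_obs: sizes of the reference and observed (contaminated) sets;
    their ratio corrects for class imbalance in the discriminator's training data.
    """
    p = min(max(p_clean, eps), 1.0 - eps)  # keep the odds finite
    return (p / (1.0 - p)) * (n_obs / n_ref)


def clipped_weights(probs: Sequence[float], n_ref: int, n_obs: int,
                    clip_max: float = 10.0) -> List[float]:
    """Clip each estimated ratio from above (clip_max is a hypothetical value)."""
    return [min(density_ratio(p, n_ref, n_obs), clip_max) for p in probs]


def weighted_bc_loss(weights: Sequence[float], traj_losses: Sequence[float]) -> float:
    """Weighted average of per-trajectory BC losses (self-normalized, an assumption)."""
    if len(weights) != len(traj_losses):
        raise ValueError("weights and traj_losses must have the same length")
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("at least one weight must be positive")
    return sum(w * l for w, l in zip(weights, traj_losses)) / total


if __name__ == "__main__":
    # Made-up discriminator outputs and per-trajectory losses, for illustration only.
    probs = [0.90, 0.50, 0.01]
    losses = [0.2, 0.3, 5.0]
    w = clipped_weights(probs, n_ref=50, n_obs=500)
    print(w, weighted_bc_loss(w, losses))
```

In this sketch a trajectory that the discriminator considers almost certainly corrupted receives a weight close to zero, so its large imitation loss barely moves the objective. In practice the per-trajectory losses would come from the policy being trained.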

If this is right

  • The learned policy maintains near-optimal performance on continuous control benchmarks even at high contamination ratios.
  • Finite-sample error bounds hold independently of the contamination fraction for any poisoning mechanism.
  • The approach succeeds across reward, state, transition, and action poisoning protocols without prior knowledge of the corruption type.
  • It outperforms standard behavioral cloning as well as BCQ and BRAC under the same contaminated data conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same density-ratio weighting could be inserted into other offline RL objectives such as Q-learning variants to improve their robustness.
  • The method suggests a practical pathway for deploying imitation learning in safety-critical robotics where limited verified clean data can be collected separately.
  • Adaptive or online selection of the clean reference set could further reduce the need for upfront verification.

Load-bearing premise

A small verified clean reference set exists whose distribution is close to the true clean expert trajectories, enabling reliable estimation of density ratios by the binary discriminator.

What would settle it

Observing that the weighted policy's performance degrades proportionally with increasing contamination rate, even when provided with the clean reference set, would falsify the independence of the bounds from contamination.
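
One way to run this check on experimental results is to normalize returns by the clean baseline and fit a line against the contamination rate. The sketch below is illustrative only: the function names are hypothetical, and a slope near zero would be consistent with the claim but would not prove it, since the paper's bounds are upper bounds rather than exact rates.

```python
from typing import List, Sequence


def retention(returns: Sequence[float], clean_return: float) -> List[float]:
    """Return each average return as a fraction of the clean-data return."""
    if clean_return == 0:
        raise ValueError("clean_return must be non-zero")
    return [r / clean_return for r in returns]


def ols_slope(x: Sequence[float], y: Sequence[float]) -> float:
    """Ordinary least-squares slope of y regressed on x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx
```

Given contamination rates and the matching average returns from an experiment, `ols_slope(rates, retention(returns, clean_return))` gives the change in retained performance per unit of contamination. A clearly negative slope for the weighted policy, with the clean reference set supplied, is the outcome this section describes as falsifying.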

Figures

Figures reproduced from arXiv: 2510.01479 by Ali Baheri, Shriram Karpoora Sundara Pandian.

Figure 1: Average return as a function of contamination level.
Figure 2: Relative performance improvement of Weighted BC over the best baseline at each contamination level, shown as percentage gains. Green cells …
Figure 3: Performance retention normalized to clean baseline ( …
Original abstract

Offline reinforcement learning (RL) enables policy optimization from fixed datasets, making it suitable for safety-critical applications where online exploration is infeasible. However, these datasets are often contaminated by adversarial poisoning, system errors, or low-quality samples, leading to degraded policy performance in standard behavioral cloning (BC) and offline RL methods. This paper introduces Density-Ratio Weighted Behavioral Cloning (Weighted BC), a robust imitation learning approach that uses a small, verified clean reference set to estimate trajectory-level density ratios via a binary discriminator. These ratios are clipped and used as weights in the BC objective to prioritize clean expert behavior while down-weighting or discarding corrupted data, without requiring knowledge of the contamination mechanism. We establish theoretical guarantees showing convergence to the clean expert policy with finite-sample bounds that are independent of the contamination rate. A comprehensive evaluation framework is established, which incorporates various poisoning protocols (reward, state, transition, and action) on continuous control benchmarks. Experiments demonstrate that Weighted BC maintains near-optimal performance even at high contamination ratios outperforming baselines such as traditional BC, batch-constrained Q-learning (BCQ) and behavior regularized actor-critic (BRAC).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Density-Ratio Weighted Behavioral Cloning (Weighted BC) for offline imitation learning from contaminated datasets. It trains a binary discriminator on a small verified clean reference set versus the corrupted dataset to estimate trajectory-level density ratios, clips these ratios, and uses them as weights in the behavioral cloning objective. The central claims are theoretical convergence to the clean expert policy together with finite-sample bounds independent of the contamination rate, supported by experiments on continuous control tasks under reward, state, transition, and action poisoning protocols that show robustness over standard BC, BCQ, and BRAC.

Significance. If the finite-sample bounds can be shown to hold with the stated independence, the work would provide a practically useful robustification of behavioral cloning that avoids explicit modeling of the contamination process. The empirical protocol covering multiple poisoning types and the use of a clean reference for density-ratio weighting are concrete strengths. The approach is a natural extension of importance weighting ideas to the imitation setting and could be impactful for safety-critical offline RL if the theory is tightened.

major comments (2)
  1. [Theoretical analysis] Abstract and theoretical analysis section: the claim of finite-sample bounds independent of contamination rate α is load-bearing for the main contribution. Standard concentration arguments on the weighted empirical risk would typically introduce a 1/sqrt((1-α)N) factor once low-weight samples are effectively discarded; the manuscript must exhibit the specific lemma or proof step that cancels this dependence (e.g., via a uniform bound on the discriminator error that itself does not degrade with α).
  2. [Method] Method and assumptions section: the convergence guarantee rests on the clean reference set being sufficiently close in distribution to the true expert trajectories so that the binary discriminator yields reliable weights. The paper should state an explicit quantitative assumption (e.g., total-variation or Hellinger distance bound between reference and clean expert) and show how it propagates into the finite-sample rate.
minor comments (2)
  1. [Experiments] Clarify the precise clipping threshold used for the density ratios and whether it is chosen adaptively or fixed; report its sensitivity in the experiments.
  2. [Discussion] Add a limitations paragraph discussing the practical requirement of obtaining a verified clean reference set and the degradation that occurs when this set is too small or distributionally mismatched.
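
The concern in major comment 1 and the sensitivity request in minor comment 1 can both be probed with a simple diagnostic. The sketch below uses Kish's effective sample size, a standard heuristic for weighted samples; it is not taken from the paper, and the function name is an assumption. When the weights drive corrupted samples to zero and leave clean samples at a common value, the effective sample size equals the number of clean samples, which is where a (1-α)N term would normally come from.

```python
from typing import Sequence


def effective_sample_size(weights: Sequence[float]) -> float:
    """Kish effective sample size: (sum of w)^2 / (sum of w^2)."""
    sum_w = sum(weights)
    sum_w2 = sum(w * w for w in weights)
    if sum_w2 == 0:
        raise ValueError("at least one weight must be non-zero")
    return sum_w * sum_w / sum_w2


if __name__ == "__main__":
    # Idealized case: N = 1000 trajectories, contamination rate 0.4,
    # clean trajectories weighted 1 and corrupted trajectories weighted 0.
    weights = [1.0] * 600 + [0.0] * 400
    print(effective_sample_size(weights))
```

Reporting this quantity for several clipping thresholds would show how many trajectories effectively contribute to the objective at each setting.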

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments help clarify the presentation of our theoretical results. We respond to each major comment below and indicate the revisions we will make.

Point-by-point responses
  1. Referee: [Theoretical analysis] Abstract and theoretical analysis section: the claim of finite-sample bounds independent of contamination rate α is load-bearing for the main contribution. Standard concentration arguments on the weighted empirical risk would typically introduce a 1/sqrt((1-α)N) factor once low-weight samples are effectively discarded; the manuscript must exhibit the specific lemma or proof step that cancels this dependence (e.g., via a uniform bound on the discriminator error that itself does not degrade with α).

    Authors: We thank the referee for this observation. The independence from α follows from the fact that the binary discriminator is trained solely on the fixed-size clean reference set; its estimation error is controlled by a uniform convergence result (Lemma 4 in the appendix) whose sample complexity depends only on the reference set size and the Rademacher complexity of the discriminator class, with no dependence on α or the size of the corrupted dataset. The weights are then clipped to a constant range independent of α, so that the subsequent Hoeffding-type bound on the weighted behavioral cloning objective (Theorem 3) yields a rate of order 1/sqrt(N) that does not contain an extra 1/sqrt(1-α) factor. We will revise the theoretical analysis section to add an explicit pointer to Lemma 4 and a short paragraph walking through this cancellation. revision: yes

  2. Referee: [Method] Method and assumptions section: the convergence guarantee rests on the clean reference set being sufficiently close in distribution to the true expert trajectories so that the binary discriminator yields reliable weights. The paper should state an explicit quantitative assumption (e.g., total-variation or Hellinger distance bound between reference and clean expert) and show how it propagates into the finite-sample rate.

    Authors: We agree that an explicit quantitative assumption would strengthen the statement of the results. The current manuscript implicitly treats the reference set as drawn from the clean expert distribution. We will add Assumption 3 stating that the total-variation distance between the reference distribution and the true expert trajectory distribution is bounded by a small constant δ. In the proof of Theorem 2 we will show that this δ appears as an additive O(δ) term in the finite-sample bound; the same term propagates into the rate of Theorem 3. These changes will appear in the revised Method and Theoretical Analysis sections. revision: yes
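
For orientation, the two ingredients named in these responses can be written in generic textbook form. This is a sketch under assumptions that are not taken from the paper: the per-trajectory loss ℓ lies in [0, B], the weight function w is fixed independently of the N training trajectories and clipped to [0, c], and the trajectories τ_i are drawn i.i.d. from the observed distribution P_obs. It is not a reconstruction of the paper's Lemma 4, Theorem 2, or Theorem 3. The symbol η denotes the failure probability, to avoid a clash with the total-variation bound δ.

```latex
% Hoeffding bound for a fixed, clipped weight function (0 <= w <= c, 0 <= \ell <= B)
\Pr\!\left(
  \left| \frac{1}{N}\sum_{i=1}^{N} w(\tau_i)\,\ell(\tau_i)
  - \mathbb{E}_{\tau \sim P_{\mathrm{obs}}}\!\left[ w(\tau)\,\ell(\tau) \right] \right| \ge t
\right)
\le 2\exp\!\left( -\frac{2 N t^{2}}{c^{2} B^{2}} \right)

% Equivalently, with probability at least 1 - \eta:
\left| \frac{1}{N}\sum_{i=1}^{N} w(\tau_i)\,\ell(\tau_i)
  - \mathbb{E}_{\tau \sim P_{\mathrm{obs}}}\!\left[ w(\tau)\,\ell(\tau) \right] \right|
\le c\,B\,\sqrt{\frac{\log(2/\eta)}{2N}}

% Reference-set mismatch, with \mathrm{TV}(P,Q) = \sup_{A} |P(A) - Q(A)| \le \delta:
\left| \mathbb{E}_{\tau \sim P_{\mathrm{ref}}}\!\left[ \ell(\tau) \right]
  - \mathbb{E}_{\tau \sim P_{E}}\!\left[ \ell(\tau) \right] \right|
\le B \cdot \mathrm{TV}(P_{\mathrm{ref}}, P_{E}) \le B\,\delta
```

The first bound has no explicit α in its 1/sqrt(N) rate, which matches the shape of the argument in the first response. Whether α enters through the clipping constant c or through the discriminator's estimation error is the point the referee's first comment asks the authors to settle. The last line shows how a total-variation assumption produces an additive term of order δ.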

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper presents a method that trains a binary discriminator on a small verified clean reference set versus the contaminated dataset to obtain trajectory-level density ratios, which are then clipped and used as weights in the behavioral cloning objective. Theoretical guarantees are stated for convergence to the clean expert policy with finite-sample bounds claimed to be independent of the contamination rate. No load-bearing step in the described derivation reduces by construction to its inputs: the bounds are not a fitted parameter renamed as a prediction, the discriminator is not defined in terms of the final policy, and no uniqueness theorem or ansatz is imported via self-citation. The central result is derived from standard weighted empirical risk analysis under explicit assumptions on the clean reference set and discriminator quality, making the chain self-contained rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on the existence of a small verified clean reference set whose distribution matches the expert policy closely enough for the discriminator to produce useful weights; no free parameters are explicitly fitted in the abstract description, and no new physical entities are introduced.

axioms (1)
  • Domain assumption: A small verified clean reference set is available whose trajectories are drawn from the same distribution as the clean expert behavior.
    This premise is required to train the binary discriminator that produces the density-ratio weights; it is invoked when the abstract states that the ratios are estimated via a binary discriminator on the clean set.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Security Considerations for Multi-agent Systems

    cs.CR 2026-03 unverdicted novelty 6.0

    No existing AI security framework covers a majority of the 193 identified multi-agent system threats in any category, with OWASP Agentic Security Initiative achieving the highest overall coverage at 65.3%.
