Density-Ratio Weighted Behavioral Cloning: Learning Control Policies from Corrupted Datasets

Ali Baheri; Shriram Karpoora Sundara Pandian

arXiv: 2510.01479 · v2 · pith:OZ2UPLYEnew · submitted 2025-10-01 · 💻 cs.LG · cs.SY · eess.SY

Pith reviewed 2026-05-21 20:48 UTC · model grok-4.3

classification 💻 cs.LG · cs.SY · eess.SY

keywords imitation learning · behavioral cloning · offline reinforcement learning · density ratio estimation · robust policy learning · corrupted datasets · trajectory weighting

The pith

Density-ratio weighted behavioral cloning recovers the clean expert policy from contaminated offline datasets using a small verified reference set.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Density-Ratio Weighted Behavioral Cloning as a robust imitation learning approach for offline settings where datasets contain corrupted or adversarial samples. It trains a binary discriminator on a small clean reference set to estimate trajectory-level density ratios, then clips and applies these ratios as weights in the standard behavioral cloning objective. This prioritizes clean expert trajectories while down-weighting corrupted ones without requiring knowledge of the corruption process. Theoretical results establish convergence to the clean expert policy along with finite-sample bounds that remain independent of the contamination rate. Experiments across continuous control benchmarks under multiple poisoning protocols confirm that the method sustains near-optimal performance where standard behavioral cloning and other offline RL baselines degrade.

Core claim

The method weights each trajectory in the behavioral cloning loss by the estimated density ratio between the clean reference distribution and the observed (possibly corrupted) distribution, obtained via a binary classifier. This weighting ensures that the optimization prioritizes expert-like behavior and downweights corrupted samples. The authors prove that the resulting policy converges to the optimal clean expert policy with error bounds independent of the fraction of corrupted data.

What carries the argument

Trajectory-level density ratios estimated by a binary discriminator distinguishing the small clean reference set from the corrupted dataset, clipped and used as importance weights in the behavioral cloning objective.
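The weighting step described above can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: it assumes the discriminator output `d` is the estimated probability that a trajectory comes from the clean reference set, forms the ratio r = d / (1 - d), clips it to [eps, C], and averages per-trajectory BC losses with those weights. The function names, the values of `eps` and `C`, the normalization by the number of trajectories, and the example numbers are all assumptions made for illustration.

```python
def density_ratio(d, tiny=1e-6):
    """Ratio r = d / (1 - d) from a discriminator probability d (assumed: P(clean | trajectory))."""
    d = min(max(d, tiny), 1.0 - tiny)  # keep d away from 0 and 1 to avoid division by zero
    return d / (1.0 - d)


def clipped_weight(d, eps=0.05, C=5.0):
    """Clip the estimated ratio to [eps, C]; eps and C are illustrative values."""
    return min(max(density_ratio(d), eps), C)


def weighted_bc_loss(losses, weights):
    """Weighted average of per-trajectory BC losses (normalized by N, an assumption)."""
    return sum(w * l for w, l in zip(weights, losses)) / len(losses)


# Hypothetical discriminator outputs: likely clean, ambiguous, likely corrupted.
discriminator_outputs = [0.9, 0.5, 0.01]
weights = [clipped_weight(d) for d in discriminator_outputs]

# Hypothetical per-trajectory BC losses (e.g. negative log-likelihood of expert actions).
losses = [0.1, 0.2, 3.0]
loss = weighted_bc_loss(losses, weights)
print(weights, loss)
```

In this toy case the high-loss trajectory that the discriminator flags as corrupted contributes through the floor weight `eps` only, so it has little influence on the objective.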

If this is right

The learned policy maintains near-optimal performance on continuous control benchmarks even at high contamination ratios.
Finite-sample error bounds hold independently of the contamination fraction for any poisoning mechanism.
The approach succeeds across reward, state, transition, and action poisoning protocols without prior knowledge of the corruption type.
It outperforms standard behavioral cloning as well as BCQ and BRAC under the same contaminated data conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same density-ratio weighting could be inserted into other offline RL objectives such as Q-learning variants to improve their robustness.
The method suggests a practical pathway for deploying imitation learning in safety-critical robotics where limited verified clean data can be collected separately.
Adaptive or online selection of the clean reference set could further reduce the need for upfront verification.

Load-bearing premise

A small verified clean reference set exists whose distribution is close to the true clean expert trajectories, enabling reliable estimation of density ratios by the binary discriminator.

What would settle it

Observing that the weighted policy's performance degrades proportionally with increasing contamination rate, even when provided with the clean reference set, would falsify the independence of the bounds from contamination.

Figures

Figures reproduced from arXiv: 2510.01479 by Ali Baheri, Shriram Karpoora Sundara Pandian.

**Figure 2.** Relative performance improvement of Weighted BC over the best baseline at each contamination level, shown as percentage gains. Green cells …

**Figure 3.** Performance retention normalized to clean baseline (…

Original abstract

Offline reinforcement learning (RL) enables policy optimization from fixed datasets, making it suitable for safety-critical applications where online exploration is infeasible. However, these datasets are often contaminated by adversarial poisoning, system errors, or low-quality samples, leading to degraded policy performance in standard behavioral cloning (BC) and offline RL methods. This paper introduces Density-Ratio Weighted Behavioral Cloning (Weighted BC), a robust imitation learning approach that uses a small, verified clean reference set to estimate trajectory-level density ratios via a binary discriminator. These ratios are clipped and used as weights in the BC objective to prioritize clean expert behavior while down-weighting or discarding corrupted data, without requiring knowledge of the contamination mechanism. We establish theoretical guarantees showing convergence to the clean expert policy with finite-sample bounds that are independent of the contamination rate. A comprehensive evaluation framework is established, which incorporates various poisoning protocols (reward, state, transition, and action) on continuous control benchmarks. Experiments demonstrate that Weighted BC maintains near-optimal performance even at high contamination ratios, outperforming baselines such as traditional BC, batch-constrained Q-learning (BCQ), and behavior regularized actor-critic (BRAC).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper gives a usable way to weight BC trajectories via a clean reference discriminator, but the claimed contamination-independent finite-sample bounds rest on an unverified assumption about effective sample size after weighting.

The main thing to know is that this work trains a binary discriminator on a small verified clean set versus the corrupted dataset, then clips the resulting density ratios and plugs them into the behavioral cloning objective. Experiments on continuous control benchmarks show it holds performance under several poisoning types even at high corruption fractions, beating plain BC, BCQ, and BRAC. That practical robustness angle is the clearest contribution.

The combination of clean-set density-ratio weighting plus explicit claims of convergence independent of contamination rate does not appear in the cited baselines, so the core idea is new enough for the subfield. The evaluation framework with multiple poisoning protocols is also a plus and makes the empirical case concrete.

The soft spot sits in the theory. The abstract states finite-sample bounds that stay independent of the contamination rate, yet when most trajectories receive near-zero weights the effective number of useful samples shrinks roughly with (1-α)N, where α is the contamination rate and N the dataset size. Standard empirical-process arguments would then pick up a 1/sqrt((1-α)N) factor unless the proof uses a different cancellation that is not sketched here. The discriminator quality itself is not benchmarked separately, so any error in the weights could compound the issue. This is a real but contained concern rather than a fatal flaw; the empirical results still stand on their own.

Readers working on robust offline imitation learning who already have access to a modest clean reference set will find the most direct value. The work is coherent on its own terms and shows honest engagement with the problem of corrupted datasets. I would send it to peer review because the idea is straightforward, the experiments are relevant, and the theory can be clarified or bounded more carefully in revision.
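The letter's remark about the effective number of useful samples can be made concrete with Kish's effective sample size, (Σw)² / Σw², a standard diagnostic for weighted samples. It is not taken from the paper. The sketch below assumes an idealized discriminator that gives weight 1 to every clean trajectory and the floor weight `eps` to every corrupted one; `idealized_weights` and its parameters are hypothetical names introduced here.

```python
def effective_sample_size(weights):
    """Kish effective sample size: (sum of weights)^2 / (sum of squared weights)."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    if total_sq == 0.0:
        raise ValueError("all weights are zero")
    return total * total / total_sq


def idealized_weights(n, alpha, eps):
    """Idealized case (an assumption): clean trajectories get weight 1, corrupted ones get eps."""
    n_corrupt = round(alpha * n)
    return [1.0] * (n - n_corrupt) + [eps] * n_corrupt


# Compare the effective sample size with (1 - alpha) * N for a few contamination rates.
N = 1000
for alpha in (0.1, 0.5, 0.9):
    ess = effective_sample_size(idealized_weights(N, alpha, eps=0.0))
    print(alpha, ess, (1 - alpha) * N)
```

With `eps=0.0` the effective sample size equals the number of clean trajectories, which is the (1-α)N scaling the letter refers to. A positive floor changes the number but not the qualitative picture.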

Referee Report

2 major / 2 minor

Summary. The paper proposes Density-Ratio Weighted Behavioral Cloning (Weighted BC) for offline imitation learning from contaminated datasets. It trains a binary discriminator on a small verified clean reference set versus the corrupted dataset to estimate trajectory-level density ratios, clips these ratios, and uses them as weights in the behavioral cloning objective. The central claims are theoretical convergence to the clean expert policy together with finite-sample bounds independent of the contamination rate, supported by experiments on continuous control tasks under reward, state, transition, and action poisoning protocols that show robustness over standard BC, BCQ, and BRAC.

Significance. If the finite-sample bounds can be shown to hold with the stated independence, the work would provide a practically useful robustification of behavioral cloning that avoids explicit modeling of the contamination process. The empirical protocol covering multiple poisoning types and the use of a clean reference for density-ratio weighting are concrete strengths. The approach is a natural extension of importance weighting ideas to the imitation setting and could be impactful for safety-critical offline RL if the theory is tightened.

major comments (2)

[Theoretical analysis] Abstract and theoretical analysis section: the claim of finite-sample bounds independent of contamination rate α is load-bearing for the main contribution. Standard concentration arguments on the weighted empirical risk would typically introduce a 1/sqrt((1-α)N) factor once low-weight samples are effectively discarded; the manuscript must exhibit the specific lemma or proof step that cancels this dependence (e.g., via a uniform bound on the discriminator error that itself does not degrade with α).
[Method] Method and assumptions section: the convergence guarantee rests on the clean reference set being sufficiently close in distribution to the true expert trajectories so that the binary discriminator yields reliable weights. The paper should state an explicit quantitative assumption (e.g., total-variation or Hellinger distance bound between reference and clean expert) and show how it propagates into the finite-sample rate.

minor comments (2)

[Experiments] Clarify the precise clipping threshold used for the density ratios and whether it is chosen adaptively or fixed; report its sensitivity in the experiments.
[Discussion] Add a limitations paragraph discussing the practical requirement of obtaining a verified clean reference set and the degradation that occurs when this set is too small or distributionally mismatched.
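One simple way to report the sensitivity requested in the minor comment on clipping is to show, for each candidate threshold, how many estimated ratios actually hit the floor or the cap. The sketch below is a hypothetical diagnostic, not a procedure from the paper; `clip_report`, the grid of `C` values, and the example ratios are made up for illustration.

```python
def clip_report(ratios, eps, C):
    """Fraction of estimated density ratios that fall below eps or above C."""
    n = len(ratios)
    frac_floored = sum(r < eps for r in ratios) / n
    frac_capped = sum(r > C for r in ratios) / n
    return {"eps": eps, "C": C, "frac_floored": frac_floored, "frac_capped": frac_capped}


# Hypothetical estimated ratios for eight trajectories.
ratios = [0.01, 0.02, 0.5, 1.0, 2.0, 8.0, 20.0, 50.0]

# Sweep the upper clipping threshold while holding the floor fixed.
for C in (5.0, 10.0, 25.0):
    print(clip_report(ratios, eps=0.05, C=C))
```

A table of these fractions next to the final policy return for each threshold would show whether performance depends on a narrow choice of `C`.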

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments help clarify the presentation of our theoretical results. We respond to each major comment below and indicate the revisions we will make.

Referee: [Theoretical analysis] Abstract and theoretical analysis section: the claim of finite-sample bounds independent of contamination rate α is load-bearing for the main contribution. Standard concentration arguments on the weighted empirical risk would typically introduce a 1/sqrt((1-α)N) factor once low-weight samples are effectively discarded; the manuscript must exhibit the specific lemma or proof step that cancels this dependence (e.g., via a uniform bound on the discriminator error that itself does not degrade with α).

Authors: We thank the referee for this observation. The independence from α follows from the fact that the binary discriminator is trained solely on the fixed-size clean reference set; its estimation error is controlled by a uniform convergence result (Lemma 4 in the appendix) whose sample complexity depends only on the reference set size and the Rademacher complexity of the discriminator class, with no dependence on α or the size of the corrupted dataset. The weights are then clipped to a constant range independent of α, so that the subsequent Hoeffding-type bound on the weighted behavioral cloning objective (Theorem 3) yields a rate of order 1/sqrt(N) that does not contain an extra 1/sqrt(1-α) factor. We will revise the theoretical analysis section to add an explicit pointer to Lemma 4 and a short paragraph walking through this cancellation.

revision: yes

Referee: [Method] Method and assumptions section: the convergence guarantee rests on the clean reference set being sufficiently close in distribution to the true expert trajectories so that the binary discriminator yields reliable weights. The paper should state an explicit quantitative assumption (e.g., total-variation or Hellinger distance bound between reference and clean expert) and show how it propagates into the finite-sample rate.

Authors: We agree that an explicit quantitative assumption would strengthen the statement of the results. The current manuscript implicitly treats the reference set as drawn from the clean expert distribution. We will add Assumption 3 stating that the total-variation distance between the reference distribution and the true expert trajectory distribution is bounded by a small constant δ. In the proof of Theorem 2 we will show that this δ appears as an additive O(δ) term in the finite-sample bound; the same term propagates into the rate of Theorem 3. These changes will appear in the revised Method and Theoretical Analysis sections.

revision: yes
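For orientation, the 1/sqrt(N) rate mentioned in the first response can be illustrated for a single fixed policy with Hoeffding's inequality. The derivation below is a sketch under stated assumptions and is not the paper's Theorem 3: it assumes N independent trajectories, a per-trajectory loss bounded in [0, L_max], and a weight function fixed independently of the sample with values in [ε, C]. The symbols L_max and η (the failure probability) are introduced here and do not appear in the source.

```latex
% Assumptions (not from the paper): independent trajectories \tau_1, \dots, \tau_N,
% loss 0 \le \ell(\pi; \tau) \le L_{\max}, weights w_i = w(\tau_i) \in [\varepsilon, C]
% with w fixed independently of the sample.
\[
  \hat{R}_w(\pi) = \frac{1}{N} \sum_{i=1}^{N} w_i \, \ell(\pi; \tau_i),
  \qquad 0 \le w_i \, \ell(\pi; \tau_i) \le C L_{\max}.
\]
% Hoeffding's inequality for the average of N independent terms in [0, C L_{\max}]:
\[
  \Pr\!\left( \left| \hat{R}_w(\pi) - \mathbb{E}\, \hat{R}_w(\pi) \right| \ge t \right)
  \le 2 \exp\!\left( - \frac{2 N t^{2}}{C^{2} L_{\max}^{2}} \right).
\]
% Setting the right-hand side equal to \eta gives, with probability at least 1 - \eta,
\[
  \left| \hat{R}_w(\pi) - \mathbb{E}\, \hat{R}_w(\pi) \right|
  \le C L_{\max} \sqrt{\frac{\log(2/\eta)}{2N}}.
\]
```

This step controls only the deviation of the weighted empirical loss from its own expectation under the observed data distribution. How that expectation relates to the clean risk, and whether that relation is free of α, is the part the referee asks the authors to spell out.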

Circularity Check

0 steps flagged

No significant circularity detected

The paper presents a method that trains a binary discriminator on a small verified clean reference set versus the contaminated dataset to obtain trajectory-level density ratios, which are then clipped and used as weights in the behavioral cloning objective. Theoretical guarantees are stated for convergence to the clean expert policy with finite-sample bounds claimed to be independent of the contamination rate. No load-bearing step in the described derivation reduces by construction to its inputs: the bounds are not a fitted parameter renamed as a prediction, the discriminator is not defined in terms of the final policy, and no uniqueness theorem or ansatz is imported via self-citation. The central result is derived from standard weighted empirical risk analysis under explicit assumptions on the clean reference set and discriminator quality, making the chain self-contained rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on the existence of a small verified clean reference set whose distribution matches the expert policy closely enough for the discriminator to produce useful weights; no free parameters are explicitly fitted in the abstract description, and no new physical entities are introduced.

axioms (1)

domain assumption: A small verified clean reference set is available whose trajectories are drawn from the same distribution as the clean expert behavior.
This premise is required to train the binary discriminator that produces the density-ratio weights; it is invoked when the abstract states that the ratios are estimated via a binary discriminator on the clean set.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: $w_i = \mathrm{clip}(r(\tau_i), \varepsilon, C)$ ... $r(\tau) = d_\phi(\tau) / (1 - d_\phi(\tau))$ ... uniform clean-risk approximation bounds that are independent of contamination severity under appropriate clipping thresholds

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · alpha_pin_under_high_calibration · unclear

Paper passage: Theorem 1 (Uniform clean-risk approximation) ... $E_{\mathrm{clip}} := \mathbb{E}_p\left[(w^\star - C)_+ + (\varepsilon - w^\star)_+\right]$
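The clipping-error term quoted in the passage above can be estimated from samples as a plain average of positive parts. The sketch below is illustrative only: it reads w⋆ as the unclipped density ratio and replaces the expectation under p with a sample mean, since this page does not define p. The function names and example values are assumptions.

```python
def positive_part(x):
    """(x)_+ = max(x, 0)."""
    return max(x, 0.0)


def empirical_clip_error(ratios, eps, C):
    """Sample average of (w - C)_+ + (eps - w)_+ over unclipped ratios w."""
    return sum(positive_part(w - C) + positive_part(eps - w) for w in ratios) / len(ratios)


# Hypothetical unclipped ratios: one below the floor, two inside [eps, C], one above the cap.
example_ratios = [0.0, 1.0, 2.0, 12.0]
clip_error = empirical_clip_error(example_ratios, eps=0.5, C=10.0)
print(clip_error)
```

The term is zero whenever every ratio already lies inside [eps, C], and it grows with the mass that clipping moves.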

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Security Considerations for Multi-agent Systems
cs.CR 2026-03 unverdicted novelty 6.0

No existing AI security framework covers a majority of the 193 identified multi-agent system threats in any category, with OWASP Agentic Security Initiative achieving the highest overall coverage at 65.3%.


[1] [1]

Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems.arXiv preprint arXiv:2005.01643, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2005

[2] [2]

Reinforcement learning in robotics: A survey.The International Journal of Robotics Research, 32(11):1238–1274, 2013

Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey.The International Journal of Robotics Research, 32(11):1238–1274, 2013

work page 2013

[3] [3]

Poisoning attacks against support vector machines

Battista Biggio, Blaine Nelson, and Pavel Laskov. Poisoning attacks against support vector machines. InProceedings of the 29th Interna- tional Conference on Machine Learning (ICML), 2012

work page 2012

[4] [4]

Off-policy deep reinforcement learning without exploration

Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. InInternational conference on machine learning, pages 2052–2062. PMLR, 2019

work page 2052

[5] [5]

Behavior Regularized Offline Reinforcement Learning

Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning.arXiv preprint arXiv:1911.11361, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1911

[6] [6]

Conservative q-learning for offline reinforcement learning

Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning. InAdvances in Neural Information Processing Systems, volume 33, pages 1179– 1191, 2020

work page 2020

[7] [7]

Adversarial Attacks on Neural Network Policies

Sandy Huang, Nicolas Papernot, Ian Goodfellow, Yan Duan, and Pieter Abbeel. Adversarial attacks on neural network policies.arXiv preprint arXiv:1702.02284, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[8] [8]

Adversarial policies: Attacking deep reinforcement learning

Adam Gleave, Michael Dennis, Neel Kant, Cody Wild, Sergey Levine, and Stuart Russell. Adversarial policies: Attacking deep reinforcement learning. InInternational Conference on Learning Representations (ICLR), 2020

work page 2020

[9] [9]

Robust adversarial reinforcement learning

Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. Robust adversarial reinforcement learning. InInternational Conference on Machine Learning, pages 2817–2826, 2017

work page 2017

[10] [10]

Robust deep reinforcement learning against adversarial perturbations on state observations

Huan Zhang, Hongge Chen, Chaowei Xiao, Bo Li, Mingyan Liu, Duane Boning, and Cho-Jui Hsieh. Robust deep reinforcement learning against adversarial perturbations on state observations. InAdvances in Neural Information Processing Systems, volume 33, pages 21024–21037, 2020

work page 2020

[11] [11]

Discriminator-weighted offline imitation learning from suboptimal demonstrations

Haoran Xu, Xianyuan Zhan, Honglei Yin, and Huiling Qin. Discriminator-weighted offline imitation learning from suboptimal demonstrations. InInternational Conference on Machine Learning, pages 24725–24742, 2022

work page 2022

[12] Rafael F Prudencio, Marcos R O A Maximo, and Esther L Colombini. A survey on offline reinforcement learning: Taxonomy, review, and open problems. IEEE Transactions on Neural Networks and Learning Systems, 34(11):8043–8061, 2023

[13] Ali Baheri. Implicit constraint-aware off-policy correction for offline reinforcement learning. arXiv preprint arXiv:2506.14058, 2025

[14] Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit Q-learning. In International Conference on Learning Representations, 2022

[15] Scott Fujimoto and Shixiang Shane Gu. A minimalist approach to offline reinforcement learning. In Advances in Neural Information Processing Systems, volume 34, pages 20132–20145, 2021

[16] Xiaoyu Chen, Yufeng Zhang, and Tengyu Wang. Robust offline reinforcement learning with uncertainty quantification. Journal of Machine Learning Research, 25(3):1–42, 2024

[17] Jacob Steinhardt, Pang Wei W Koh, and Percy S Liang. Certified defenses for data poisoning attacks. In Advances in Neural Information Processing Systems, volume 30, 2017

[18] Xuezhou Zhang, Yuzhe Ma, Adish Singla, and Xiaojin Zhu. Adaptive reward poisoning attacks against reinforcement learning. In International Conference on Machine Learning, pages 11225–11234, 2020

[19] Amin Rakhsha, Goran Radanovic, Rati Devidze, Xiaojin Zhu, and Adish Singla. Policy poisoning in batch reinforcement learning and control. In Advances in Neural Information Processing Systems, volume 34, pages 14570–14581, 2021

[20] Wei Sun, Yuxuan Li, and Shaofeng Zhang. Byzantine-robust federated reinforcement learning with optimal statistical guarantees. IEEE Transactions on Information Theory, 70(2):1123–1140, 2024

[21] Fan Wu, Linyi Li, Zijian Huang, Yevgeniy Vorobeychik, and Ding Zhao. Certified adversarial robustness for deep reinforcement learning. In Conference on Robot Learning, pages 456–467, 2024

[22] Ahmed Hussein, Mohamed Medhat Gaber, Eyad Elyan, and Chrisina Jayne. Imitation learning: A survey of learning methods. ACM Computing Surveys, 50(2):1–35, 2017

[23] Takayuki Osa, Joni Pajarinen, Gerhard Neumann, J Andrew Bagnell, Pieter Abbeel, and Jan Peters. An algorithmic perspective on imitation learning. Foundations and Trends in Robotics, 7(1-2):1–179, 2018

[24] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, volume 29, 2016

[25] Ilya Kostrikov, Ofir Nachum, and Jonathan Tompson. Imitation learning via off-policy distribution matching. In International Conference on Learning Representations, 2020

[26] Daniel Spencer, Jessica Zhang, and Csaba Szepesvari. Expert confidence-aware imitation learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 15234–15242, 2024

[27] Lingxiao Wang, Zhuoran Zhou, and Jonathan Scarlett. Selective imitation learning from high-quality demonstrations. Machine Learning, 113(5):2871–2898, 2024

[28] Yang Liu, Abhishek Gupta, and Pieter Abbeel. Adaptive weighted imitation learning from multimodal demonstrations. In International Conference on Robotics and Automation, pages 8974–8981, 2024

[29] Ofir Nachum, Bo Dai, Ilya Kostrikov, Yinlam Chow, Lihong Li, and Dale Schuurmans. AlgaeDICE: Policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074, 2019

[30] Ruiyi Zhang, Bo Dai, Lihong Li, and Dale Schuurmans. GenDICE: Generalized offline estimation of stationary values. In International Conference on Learning Representations, 2020

[31] Benjamin Eysenbach, Tianjun Zhang, Sergey Levine, and Ruslan Salakhutdinov. Contrastive learning as goal-conditioned reinforcement learning. In Advances in Neural Information Processing Systems, volume 35, 2022

[32] Xiaolong Ma, Yinlam Chen, Lihong Li, and Zhaoran Zhou. Offline reinforcement learning with in-sample Q-learning. In International Conference on Machine Learning, pages 14650–14661, 2022

[33] Harshit Sikchi, Wenxuan Zheng, and Emma Brunskill. Dual importance sampling for off-policy evaluation and learning. Journal of Machine Learning Research, 25(67):1–48, 2024

[34] Minghui Chen, Qiang Liu, and Jun Wang. Kernel-based density ratio estimation for continuous control. In AAAI Conference on Artificial Intelligence, volume 39, pages 11234–11242, 2025

[35] Huan Zhang, Hongge Chen, Chaowei Xiao, Bo Li, Mingyan Liu, Duane Boning, and Cho-Jui Hsieh. Robust deep reinforcement learning against adversarial perturbations on state observations. In Advances in Neural Information Processing Systems, volume 33, pages 21024–21037, 2020

[36] Wei Pan, Jin Zhang, and Evangelos Theodorou. Adversarial robustness certification for neural network control systems. IEEE Transactions on Automatic Control, 69(3):1567–1582, 2024

[37] Kaiyue Li, Songtao Wang, and Mladen Kolar. Provably robust reinforcement learning via PAC-Bayes theory. In International Conference on Machine Learning, pages 19456–19467, 2024

[38] Lin Yang, Jiaming Zheng, Ming Li, and Jianfeng Feng. Safe reinforcement learning under adversarial corruption. IEEE Transactions on Neural Networks and Learning Systems, 36(1):234–248, 2025