Recognition: 2 theorem links
Good in Bad (GiB): Sifting Through End-user Demonstrations for Learning a Better Policy
Pith reviewed 2026-05-12 01:22 UTC · model grok-4.3
The pith
GiB identifies and discards erroneous subtasks in human demonstrations to enable more robust robot policies.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GiB trains a self-supervised model to learn latent features from demonstrations and assigns binary weights that label each demonstration as good or bad. It then models the latent-feature distribution of high-quality segments and uses the Mahalanobis distance to detect and remove poor-quality subtasks. Validated on Franka robot tasks in both simulation and the real world, the filtered data leads to improved policy performance.
What carries the argument
A self-supervised latent feature model combined with Mahalanobis distance on the distribution of high-quality demonstration segments.
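The filtering machinery can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the encoder, feature dimension, segmentation, ridge term, and threshold are all assumed for the example.

```python
import numpy as np

def fit_reference(good_feats):
    """Fit the mean and inverse covariance of latent features taken from
    segments labeled good. good_feats: (n, d) array; shape is illustrative,
    since the paper does not specify the encoder or feature dimension."""
    mu = good_feats.mean(axis=0)
    # A small ridge term keeps the covariance invertible with few segments.
    cov = np.cov(good_feats, rowvar=False) + 1e-6 * np.eye(good_feats.shape[1])
    return mu, np.linalg.inv(cov)

def mahalanobis_sq(x, mu, cov_inv):
    """Squared Mahalanobis distance of one segment's feature vector."""
    d = x - mu
    return float(d @ cov_inv @ d)

def filter_segments(segments, mu, cov_inv, threshold):
    """Keep only segments whose squared distance to the reference
    distribution falls under the (free-parameter) threshold."""
    return [s for s in segments if mahalanobis_sq(s, mu, cov_inv) <= threshold]
```

Segments far from the reference distribution of good data get large distances and are dropped; everything else is retained, which is what lets GiB salvage the good parts of a partly flawed demonstration.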
If this is right
- Any policy learning algorithm can use the filtered data to train more robust policies.
- Improved performance on multi-step tasks with the Franka robot in simulation and real world.
- Preserves valuable data by not discarding entire demonstrations due to occasional errors.
- Reduces the risk of unsafe policy behavior from learning erroneous actions.
Where Pith is reading between the lines
- Could lower the barrier for non-experts to provide usable training data for robots.
- May generalize to other sequential decision-making tasks beyond robotics.
- Testable by comparing policy success rates with and without GiB filtering on new tasks.
Load-bearing premise
The self-supervised model reliably learns latent features that allow Mahalanobis distance to accurately detect poor-quality subtasks without discarding useful data or introducing bias.
What would settle it
An experiment in which policies trained on GiB-filtered demonstrations perform no better than, or worse than, policies trained on unfiltered mixed-quality data, or in which useful subtasks are systematically removed by the filter.
read the original abstract
Imitation learning offers a promising framework for enabling robots to acquire diverse skills from human users. However, most imitation learning algorithms assume access to high-quality demonstrations, an unrealistic expectation when collecting data from non-expert users, whose demonstrations often contain inadvertent errors. Naively learning from such demonstrations can result in unsafe policy behavior, while discarding entire demonstrations due to occasional mistakes wastes valuable data, especially in low-data settings. In this work, we introduce GiB (Good-in-Bad), an algorithm that automatically identifies and discards erroneous subtasks within demonstrations while preserving high-quality subtasks. The filtered data can then be used by any policy learning algorithm to train more robust policies. GiB first trains a self-supervised model to learn latent features and assigns binary weights to label each demonstration as good or bad. It then models the latent feature distribution of high-quality segments and uses the Mahalanobis distance to detect and evaluate poor-quality subtasks. We validate GiB on the Franka robot in both simulated and real-world multi-step tasks, demonstrating improved policy performance when learning from mixed-quality human demonstrations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GiB (Good-in-Bad), an algorithm for imitation learning from mixed-quality human demonstrations on robots. It first trains a self-supervised model on the demonstrations to learn latent features, assigns binary weights to classify entire demonstrations as good or bad, fits a distribution over the latent features of high-quality segments from the good demonstrations, and then uses Mahalanobis distance to identify and discard erroneous subtasks while retaining high-quality ones. The filtered demonstrations are intended to be usable by any downstream policy learning algorithm. Validation is claimed on multi-step tasks with a Franka robot in both simulation and the real world, with reported improvements in the resulting policies.
Significance. If the filtering step reliably removes errors without discarding useful data or introducing bias, GiB could meaningfully improve the practicality of imitation learning from non-expert users by salvaging partial value from noisy demonstrations rather than requiring fully clean data or expert curation. The method's compatibility with arbitrary policy learners and its use of standard self-supervised and statistical techniques are strengths that could facilitate adoption if the core assumption about latent-feature alignment holds.
major comments (2)
- [Experimental validation (as summarized in the abstract and §4)] The central claim of improved policy performance rests on the experimental validation, yet the manuscript provides no quantitative results, baselines, error bars, statistical tests, or detailed experimental design (e.g., number of demonstrations, task success metrics, or comparison to naive imitation learning or full-demonstration filtering). This leaves the reported gains on Franka tasks only qualitatively asserted and prevents assessment of effect size or reliability.
- [Method (§3, particularly the self-supervised model and Mahalanobis detection steps)] The pipeline assumes that the self-supervised latent features capture subtask quality variation (e.g., action correctness or task progress) sufficiently for Mahalanobis distance on the reference distribution to separate poor subtasks. No ablation, correlation analysis, or ground-truth validation is provided to confirm this alignment; if the features instead reflect superficial appearance or dynamics, the reference set becomes contaminated and the distance metric misclassifies segments, undermining the filtering step.
minor comments (2)
- [Abstract] The abstract states that binary weights are assigned to label demonstrations as good or bad, but the precise procedure (e.g., threshold, loss, or clustering method) is not detailed in the high-level description.
- [Method (§3)] Notation for the latent space, reference distribution parameters, and Mahalanobis threshold is introduced without an explicit equation or table summarizing the symbols and their roles.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review of our manuscript. We address each major comment below and describe the revisions we will make to strengthen the paper.
read point-by-point responses
-
Referee: [Experimental validation (as summarized in the abstract and §4)] The central claim of improved policy performance rests on the experimental validation, yet the manuscript provides no quantitative results, baselines, error bars, statistical tests, or detailed experimental design (e.g., number of demonstrations, task success metrics, or comparison to naive imitation learning or full-demonstration filtering). This leaves the reported gains on Franka tasks only qualitatively asserted and prevents assessment of effect size or reliability.
Authors: We appreciate the referee's emphasis on rigorous experimental reporting. Section 4 of the manuscript does present quantitative success rates for the simulated and real-world Franka tasks along with comparisons to a naive imitation-learning baseline. However, we agree that the presentation lacks sufficient detail on experimental design, error bars, statistical tests, and explicit metrics. In the revised manuscript we will expand §4 with tables reporting mean success rates, standard deviations, number of demonstrations per task, and p-values for the observed improvements. We will also add an explicit comparison against filtering entire demonstrations. revision: yes
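The promised tables could report statistics computed along these lines. The choice of a two-proportion z-test is an assumption made for illustration; the manuscript does not name a specific test.

```python
import math

def success_stats(outcomes):
    """Mean success rate and its standard error for a list of 0/1 trial outcomes."""
    n = len(outcomes)
    p = sum(outcomes) / n
    se = math.sqrt(p * (1 - p) / n)
    return p, se

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference in task success rates between two
    conditions (e.g. GiB-filtered vs. unfiltered training data)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value
```

With, say, 18/20 successes under filtering versus 9/20 without, this test rejects equality at conventional levels; the point is that the revision's tables become checkable once counts and test are stated.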
-
Referee: [Method (§3, particularly the self-supervised model and Mahalanobis detection steps)] The pipeline assumes that the self-supervised latent features capture subtask quality variation (e.g., action correctness or task progress) sufficiently for Mahalanobis distance on the reference distribution to separate poor subtasks. No ablation, correlation analysis, or ground-truth validation is provided to confirm this alignment; if the features instead reflect superficial appearance or dynamics, the reference set becomes contaminated and the distance metric misclassifies segments, undermining the filtering step.
Authors: The referee correctly notes that the core assumption—that the self-supervised latent features align with subtask quality—is not directly validated in the current submission. While the end-to-end policy improvements provide indirect support, we acknowledge the absence of targeted ablations and correlation studies. In the revision we will add (i) an ablation replacing the self-supervised encoder with a random or appearance-only baseline, (ii) correlation analysis between Mahalanobis distances and human-annotated subtask quality labels on a held-out set, and (iii) visualization of the reference distribution versus erroneous segments to demonstrate separation. revision: yes
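The promised correlation analysis (ii) could be run along these lines. This is a sketch assuming annotators assign a scalar quality score to each held-out segment; the function names and data are hypothetical.

```python
import numpy as np

def ranks(a):
    """1-based ranks for Spearman correlation; ties are broken by sort
    order, which is adequate for a quick sanity check."""
    order = np.argsort(a)
    r = np.empty(len(a))
    r[order] = np.arange(1, len(a) + 1)
    return r

def spearman(dists, quality):
    """Spearman rank correlation between per-segment Mahalanobis distances
    and human-annotated quality scores. A strongly negative value would
    support the premise that distance tracks (lack of) subtask quality."""
    rx = ranks(np.asarray(dists, dtype=float))
    ry = ranks(np.asarray(quality, dtype=float))
    return float(np.corrcoef(rx, ry)[0, 1])
```

A near-zero correlation on the held-out set would be exactly the failure mode the referee describes: features reflecting appearance or dynamics rather than quality.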
Circularity Check
No circularity in GiB's self-supervised filtering pipeline
full rationale
The paper's core derivation trains a self-supervised model on the full set of mixed-quality demonstrations to obtain latent features, assigns binary weights to demonstrations (via an unsupervised process not defined in terms of the target filter), models the feature distribution of the resulting high-quality segments, and applies Mahalanobis distance to flag erroneous subtasks. This chain is self-contained: the features are learned from data, the reference distribution is constructed from the weighted subset, and the distance metric is a standard statistical tool. No equations reduce a prediction to its own fitted parameters by construction, no uniqueness theorems are imported from self-citations, and no ansatz is smuggled via prior work. External validation on simulated and real Franka tasks further separates the method from its inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- Mahalanobis distance threshold for poor subtask detection
axioms (1)
- Domain assumption: Self-supervised models can extract latent features from demonstration trajectories that separate high-quality from low-quality segments.
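One common way to set the single free parameter without hand-tuning is to take a high quantile of the reference segments' own squared distances. The quantile value below is an assumed hyperparameter, not a value from the paper.

```python
import numpy as np

def empirical_threshold(train_dists_sq, quantile=0.975):
    """Set the Mahalanobis cutoff from the squared distances of the
    reference (good) segments themselves; `quantile` is an assumed
    hyperparameter, not taken from the paper."""
    return float(np.quantile(np.asarray(train_dists_sq, dtype=float), quantile))
```

Under a Gaussian latent model, squared Mahalanobis distances of d-dimensional features follow a chi-squared distribution with d degrees of freedom, so this empirical quantile should roughly agree with the corresponding chi-squared quantile when the model fits.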
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
"GiB first trains a self-supervised model to learn latent features and assigns binary weights... models the latent feature distribution of high-quality segments and uses the Mahalanobis distance to detect and evaluate poor-quality subtasks."
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
"We introduce GiB... automatically identifies and discards erroneous subtasks within demonstrations while preserving high-quality subtasks."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] M. Zare, P. M. Kebria, A. Khosravi, and S. Nahavandi, "A survey of imitation learning: Algorithms, recent developments, and challenges," IEEE Transactions on Cybernetics, 2024.
- [2] S. Belkhale, Y. Cui, and D. Sadigh, "Data quality in imitation learning," Advances in Neural Information Processing Systems, vol. 36, 2024.
- [3] J. Hua, L. Zeng, G. Li, and Z. Ju, "Learning for a robot: Deep reinforcement learning, imitation learning, transfer learning," Sensors, vol. 21, no. 4, p. 1278, 2021.
- [4] S. Ross, G. Gordon, and D. Bagnell, "A reduction of imitation learning and structured prediction to no-regret online learning," in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, 2011, pp. 627–635.
- [5] A. Hussein, M. M. Gaber, E. Elyan, and C. Jayne, "Imitation learning: A survey of learning methods," ACM Computing Surveys (CSUR), vol. 50, no. 2, pp. 1–35, 2017.
- [6] N. Sojib and M. Begum, "Self supervised detection of incorrect human demonstrations: A path toward safe imitation learning by robots in the wild," in 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2024, pp. 2862–2869.
- [7] A. Mandlekar, D. Xu, J. Wong, S. Nasiriany, C. Wang, R. Kulkarni, L. Fei-Fei, S. Savarese, Y. Zhu, and R. Martín-Martín, "What matters in learning from offline human demonstrations for robot manipulation," in Conference on Robot Learning (CoRL), 2021.
- [8] A. Mandlekar, S. Nasiriany, B. Wen, I. Akinola, Y. Narang, L. Fan, Y. Zhu, and D. Fox, "MimicGen: A data generation system for scalable robot learning using human demonstrations," arXiv preprint arXiv:2310.17596, 2023.
- [9] J. Chen, H. Fang, H.-S. Fang, and C. Lu, "Towards effective utilization of mixed-quality demonstrations in robotic manipulation via segment-level selection and optimization," arXiv preprint arXiv:2409.19917, 2024.
- [10] J. Hejna, S. Mirchandani, A. Balakrishna, A. Xie, A. Wahid, J. Tompson, P. Sanketi, D. Shah, C. Devin, and D. Sadigh, "Robot data curation with mutual information estimators," arXiv preprint arXiv:2502.08623, 2025.
- [11] S. Yue, J. Liu, X. Hua, J. Ren, S. Lin, J. Zhang, and Y. Zhang, "How to leverage diverse demonstrations in offline imitation learning," arXiv preprint arXiv:2405.17476, 2024.
- [12] Y. Wang, M. Dong, Y. Zhao, B. Du, and C. Xu, "Imitation learning from purified demonstrations," arXiv preprint arXiv:2310.07143, 2023.
- [13] M. Hussein, B. Crowe, M. Clark-Turner, P. Gesel, M. Petrik, and M. Begum, "Robust behavior cloning with adversarial demonstration detection," in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2021, pp. 7858–7864.
- [14] Z. Li, T. Xu, Z. Qin, Y. Yu, and Z.-Q. Luo, "Imitation learning from imperfection: Theoretical justifications and algorithms," Advances in Neural Information Processing Systems, vol. 36, 2024.
- [15] G.-H. Kim, S. Seo, J. Lee, W. Jeon, H. Hwang, H. Yang, and K.-E. Kim, "DemoDICE: Offline imitation learning with supplementary imperfect demonstrations," in International Conference on Learning Representations, 2022.
- [16] F. Sasaki and R. Yamashina, "Behavioral cloning from noisy demonstrations," in International Conference on Learning Representations, 2020.
- [17] R. Hoque, A. Mandlekar, C. Garrett, K. Goldberg, and D. Fox, "IntervenGen: Interventional data generation for robust and data-efficient robot imitation learning," arXiv preprint arXiv:2405.01472, 2024.
- [18] K. Gandhi, S. Karamcheti, M. Liao, and D. Sadigh, "Eliciting compatible demonstrations for multi-human imitation learning," in Conference on Robot Learning, PMLR, 2023, pp. 1981–1991.
- [19] S. Ramaswamy, R. Rastogi, and K. Shim, "Efficient algorithms for mining outliers from large data sets," in Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, 2000, pp. 427–438.
- [20] L. Ruff, R. Vandermeulen, N. Goernitz, L. Deecke, S. A. Siddiqui, A. Binder, E. Müller, and M. Kloft, "Deep one-class classification," in International Conference on Machine Learning, PMLR, 2018, pp. 4393–4402.
- [21] F. T. Liu, K. M. Ting, and Z.-H. Zhou, "Isolation forest," in 2008 Eighth IEEE International Conference on Data Mining, IEEE, 2008, pp. 413–422.
- [22] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, "LOF: identifying density-based local outliers," in Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, 2000, pp. 93–104.
- [23] C. Wang, S. Erfani, T. Alpcan, and C. Leckie, "OIL-AD: An anomaly detection framework for sequential decision sequences," arXiv preprint arXiv:2402.04567, 2024.
- [24] H. Xu, X. Zhan, H. Yin, and H. Qin, "Discriminator-weighted offline imitation learning from suboptimal demonstrations," in International Conference on Machine Learning, PMLR, 2022, pp. 24725–24742.
- [25] C. Chi, Z. Xu, S. Feng, E. Cousineau, Y. Du, B. Burchfiel, R. Tedrake, and S. Song, "Diffusion policy: Visuomotor policy learning via action diffusion," The International Journal of Robotics Research, 2024.
discussion (0)