pith. machine review for the scientific record.

arxiv: 2605.01529 · v2 · submitted 2026-05-02 · 💻 cs.RO

Recognition: 2 theorem links

Good in Bad (GiB): Sifting Through End-user Demonstrations for Learning a Better Policy

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:22 UTC · model grok-4.3

classification 💻 cs.RO
keywords imitation learning · robot demonstration filtering · subtask quality detection · Mahalanobis distance · self-supervised learning · policy robustness · end-user data · Franka robot

The pith

GiB identifies and discards erroneous subtasks in human demonstrations to enable more robust robot policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GiB to handle imperfect demonstrations from non-expert users in imitation learning for robots. It automatically filters out bad subtasks while keeping good ones, instead of discarding entire demos or using all data naively. This allows any policy learning method to train safer and more effective behaviors. The approach matters because collecting perfect demos is unrealistic, especially in low-data settings, and errors can lead to unsafe robot actions.

Core claim

GiB trains a self-supervised model to learn latent features from demonstrations and assigns binary weights labeling each demonstration as good or bad. It then models the latent-feature distribution of high-quality segments and uses Mahalanobis distance to detect and remove poor-quality subtasks. Validated on Franka robot tasks, the filtered data leads to improved policy performance in both simulation and real-world settings.

What carries the argument

A self-supervised latent feature model combined with Mahalanobis distance on the distribution of high-quality demonstration segments.
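As a concrete reading of this machinery, the filtering step can be sketched as below. This is an illustrative reconstruction, not the authors' implementation: the encoder, the segmentation into subtasks, the covariance regularization, and the threshold are all unspecified at this level of detail, so every name here is hypothetical.

```python
import numpy as np

def mahalanobis_filter(good_embeddings, candidate_embeddings, threshold):
    """Flag candidate subtask embeddings that lie far, in Mahalanobis
    distance, from the reference distribution of high-quality segments.

    Hypothetical sketch: arrays have shape (n_segments, latent_dim); the
    paper does not specify these details.
    """
    mu = good_embeddings.mean(axis=0)
    # A small ridge keeps the covariance invertible with few good segments.
    cov = np.cov(good_embeddings, rowvar=False)
    cov += 1e-6 * np.eye(cov.shape[0])
    cov_inv = np.linalg.inv(cov)
    diffs = candidate_embeddings - mu
    # d_i = sqrt((x_i - mu)^T Sigma^{-1} (x_i - mu)), one value per segment.
    d = np.sqrt(np.einsum("ij,jk,ik->i", diffs, cov_inv, diffs))
    return d <= threshold, d
```

Segments flagged False would be dropped before policy training; everything hinges on the embeddings actually tracking quality, which is the load-bearing premise identified later on this page.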

If this is right

  • Any policy learning algorithm can use the filtered data to train more robust policies.
  • Performance improves on multi-step tasks with the Franka robot in simulation and the real world.
  • Preserves valuable data by not discarding entire demonstrations due to occasional errors.
  • Reduces the risk of unsafe policy behavior from learning erroneous actions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Could lower the barrier for non-experts to provide usable training data for robots.
  • May generalize to other sequential decision-making tasks beyond robotics.
  • Testable by comparing policy success rates with and without GiB filtering on new tasks.

Load-bearing premise

The self-supervised model reliably learns latent features that allow Mahalanobis distance to accurately detect poor-quality subtasks without discarding useful data or introducing bias.

What would settle it

An experiment in which policies trained on GiB-filtered demonstrations perform no better than, or worse than, policies trained on unfiltered mixed-quality data, or in which useful subtasks are incorrectly removed by the filter.

Figures

Figures reproduced from arXiv: 2605.01529 by Momotaz Begum, Noushad Sojib.

Figure 1: GiB subtask evaluation pipeline. (A) A self-supervised model learns latent embeddings …
Figure 2: Simulation task (Kitchen, left) and real-world Franka task (right).
Figure 3: Visualization of a coffee-making task where the first subtask is …
Figure 4: Multimodal kitchen environment. To demonstrate a multimodal way of performing the same task, we created another Kitchen environment by modifying the Kitchen environment in MimicGen. It is a two-step task with additional features added to the environment. The task requires picking up the pot from the stove and placing it on the serving region (the red square on the table). To increase diversity, we c…
Figure 5: Even though the task can be performed in two distinct modes, …
Figure 6: Success rate under different ρ values for Square and Mug tasks.
Original abstract

Imitation learning offers a promising framework for enabling robots to acquire diverse skills from human users. However, most imitation learning algorithms assume access to high-quality demonstrations, an unrealistic expectation when collecting data from non-expert users, whose demonstrations often contain inadvertent errors. Naively learning from such demonstrations can result in unsafe policy behavior, while discarding entire demonstrations due to occasional mistakes wastes valuable data, especially in low-data settings. In this work, we introduce GiB (Good-in-Bad), an algorithm that automatically identifies and discards erroneous subtasks within demonstrations while preserving high-quality subtasks. The filtered data can then be used by any policy learning algorithm to train more robust policies. GiB first trains a self-supervised model to learn latent features and assigns binary weights to label each demonstration as good or bad. It then models the latent feature distribution of high-quality segments and uses the Mahalanobis distance to detect and evaluate poor-quality subtasks. We validate GiB on the Franka robot in both simulated and real-world multi-step tasks, demonstrating improved policy performance when learning from mixed-quality human demonstrations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces GiB (Good-in-Bad), an algorithm for imitation learning from mixed-quality human demonstrations on robots. It first trains a self-supervised model on the demonstrations to learn latent features, assigns binary weights to classify entire demonstrations as good or bad, fits a distribution over the latent features of high-quality segments from the good demonstrations, and then uses Mahalanobis distance to identify and discard erroneous subtasks while retaining high-quality ones. The filtered demonstrations are intended to be usable by any downstream policy learning algorithm. Validation is claimed on multi-step tasks with a Franka robot in both simulation and the real world, with reported improvements in the resulting policies.

Significance. If the filtering step reliably removes errors without discarding useful data or introducing bias, GiB could meaningfully improve the practicality of imitation learning from non-expert users by salvaging partial value from noisy demonstrations rather than requiring fully clean data or expert curation. The method's compatibility with arbitrary policy learners and its use of standard self-supervised and statistical techniques are strengths that could facilitate adoption if the core assumption about latent-feature alignment holds.

major comments (2)
  1. [Experimental validation (as summarized in the abstract and §4)] The central claim of improved policy performance rests on the experimental validation, yet the manuscript provides no quantitative results, baselines, error bars, statistical tests, or detailed experimental design (e.g., number of demonstrations, task success metrics, or comparison to naive imitation learning or full-demonstration filtering). This leaves the reported gains on Franka tasks only qualitatively asserted and prevents assessment of effect size or reliability.
  2. [Method (§3, particularly the self-supervised model and Mahalanobis detection steps)] The pipeline assumes that the self-supervised latent features capture subtask quality variation (e.g., action correctness or task progress) sufficiently for Mahalanobis distance on the reference distribution to separate poor subtasks. No ablation, correlation analysis, or ground-truth validation is provided to confirm this alignment; if the features instead reflect superficial appearance or dynamics, the reference set becomes contaminated and the distance metric misclassifies segments, undermining the filtering step.
minor comments (2)
  1. [Abstract] The abstract states that binary weights are assigned to label demonstrations as good or bad, but the precise procedure (e.g., threshold, loss, or clustering method) is not detailed in the high-level description.
  2. [Method (§3)] Notation for the latent space, reference distribution parameters, and Mahalanobis threshold is introduced without an explicit equation or table summarizing the symbols and their roles.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review of our manuscript. We address each major comment below and describe the revisions we will make to strengthen the paper.

Point-by-point responses
  1. Referee: [Experimental validation (as summarized in the abstract and §4)] The central claim of improved policy performance rests on the experimental validation, yet the manuscript provides no quantitative results, baselines, error bars, statistical tests, or detailed experimental design (e.g., number of demonstrations, task success metrics, or comparison to naive imitation learning or full-demonstration filtering). This leaves the reported gains on Franka tasks only qualitatively asserted and prevents assessment of effect size or reliability.

    Authors: We appreciate the referee's emphasis on rigorous experimental reporting. Section 4 of the manuscript does present quantitative success rates for the simulated and real-world Franka tasks along with comparisons to a naive imitation-learning baseline. However, we agree that the presentation lacks sufficient detail on experimental design, error bars, statistical tests, and explicit metrics. In the revised manuscript we will expand §4 with tables reporting mean success rates, standard deviations, number of demonstrations per task, and p-values for the observed improvements. We will also add an explicit comparison against filtering entire demonstrations. revision: yes

  2. Referee: [Method (§3, particularly the self-supervised model and Mahalanobis detection steps)] The pipeline assumes that the self-supervised latent features capture subtask quality variation (e.g., action correctness or task progress) sufficiently for Mahalanobis distance on the reference distribution to separate poor subtasks. No ablation, correlation analysis, or ground-truth validation is provided to confirm this alignment; if the features instead reflect superficial appearance or dynamics, the reference set becomes contaminated and the distance metric misclassifies segments, undermining the filtering step.

    Authors: The referee correctly notes that the core assumption—that the self-supervised latent features align with subtask quality—is not directly validated in the current submission. While the end-to-end policy improvements provide indirect support, we acknowledge the absence of targeted ablations and correlation studies. In the revision we will add (i) an ablation replacing the self-supervised encoder with a random or appearance-only baseline, (ii) correlation analysis between Mahalanobis distances and human-annotated subtask quality labels on a held-out set, and (iii) visualization of the reference distribution versus erroneous segments to demonstrate separation. revision: yes
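A minimal version of the correlation analysis proposed in (ii) could look like the sketch below, assuming per-subtask Mahalanobis distances and human quality annotations are available. Spearman's rank correlation is one reasonable statistic here, not one the rebuttal commits to; ties (e.g. binary labels) are handled with average ranks.

```python
import numpy as np

def spearman_corr(distances, quality_scores):
    """Spearman rank correlation between per-subtask Mahalanobis distances
    and human quality annotations. A strongly negative value would support
    the premise that latent-feature distance tracks subtask quality."""
    def rank(values):
        values = np.asarray(values, dtype=float)
        order = np.argsort(values)
        r = np.empty(len(values))
        r[order] = np.arange(len(values), dtype=float)
        for v in np.unique(values):  # average ranks over tied values
            mask = values == v
            r[mask] = r[mask].mean()
        return r

    a, b = rank(distances), rank(quality_scores)
    a -= a.mean()
    b -= b.mean()
    return float((a * b).sum() / np.sqrt((a * a).sum() * (b * b).sum()))
```

If distances and quality are well aligned, higher distance should come with lower annotated quality, i.e. a correlation near -1.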

Circularity Check

0 steps flagged

No circularity in GiB's self-supervised filtering pipeline

Full rationale

The paper's core derivation trains a self-supervised model on the full set of mixed-quality demonstrations to obtain latent features, assigns binary weights to demonstrations (via an unsupervised process not defined in terms of the target filter), models the feature distribution of the resulting high-quality segments, and applies Mahalanobis distance to flag erroneous subtasks. This chain is self-contained: the features are learned from data, the reference distribution is constructed from the weighted subset, and the distance metric is a standard statistical tool. No equations reduce a prediction to its own fitted parameters by construction, no uniqueness theorems are imported from self-citations, and no ansatz is smuggled via prior work. External validation on simulated and real Franka tasks further separates the method from its inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on standard machine learning assumptions with minimal new postulates; one potential free parameter for distance thresholding is implied but not quantified.

free parameters (1)
  • Mahalanobis distance threshold for poor subtask detection
    Used to classify segments as erroneous; value not specified in abstract but required for the detection step.
axioms (1)
  • domain assumption: Self-supervised models can extract latent features from demonstration trajectories that separate high-quality from low-quality segments.
    Directly invoked as the first step of GiB.
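The ledger's one free parameter, the distance threshold, still needs a value. One conventional recipe (hypothetical here; the abstract does not specify the authors' choice) leans on the same Gaussian view the Mahalanobis distance already takes: squared distances of good segments then follow a chi-square law with `latent_dim` degrees of freedom, so a quantile of that law is a natural cutoff. The sketch below estimates the quantile by Monte Carlo to stay dependency-free.

```python
import numpy as np

def distance_threshold(latent_dim, quantile=0.99, n_samples=200_000, seed=0):
    """Monte Carlo threshold for Mahalanobis filtering: under a Gaussian
    model of good segments, squared distances are chi-square distributed
    with `latent_dim` degrees of freedom, so this cutoff retains roughly
    `quantile` of genuinely good segments. Illustrative, not the paper's.
    """
    rng = np.random.default_rng(seed)
    squared = (rng.normal(size=(n_samples, latent_dim)) ** 2).sum(axis=1)
    return float(np.sqrt(np.quantile(squared, quantile)))
```

A lower quantile trades recall of good segments for stricter error removal, which is exactly the bias-versus-coverage tension the ledger flags.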

pith-pipeline@v0.9.0 · 5495 in / 1258 out tokens · 43554 ms · 2026-05-12T01:22:08.068440+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
