pith. sign in

arxiv: 2606.10229 · v1 · pith:RGYVHZODnew · submitted 2026-06-08 · 💻 cs.RO · cs.LG

What Demonstration Curation Metrics Do to Your Policy

Pith reviewed 2026-06-27 16:00 UTC · model grok-4.3

classification 💻 cs.RO cs.LG
keywords demonstration curationbehavior cloningdefect detectionpolicy evaluationepisode lengthLIBERO benchmarkAUROCrobotics
0
0 comments X

The pith

Demonstration curation metrics with the highest defect-detection scores produce the worst downstream policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether metrics that accurately flag defective demonstration episodes also yield higher-quality behavior-cloning policies after filtering. On a LIBERO pick-and-place task that injects a controlled defect of early gripper release, detection AUROC and final policy success rate turn out to be sharply decoupled. The strongest detector reaches 0.804 AUROC yet delivers only 13.3 percent task success, while a weaker detector at 0.638 AUROC reaches 90 percent success, nearly matching an oracle trained on clean data. Five of seven metrics rely on episode length as a hidden proxy for the defect label, which inflates their reported AUROCs until length is controlled. The work concludes that curation methods must be judged by the policies they produce rather than by the defects they detect.

Core claim

On the LIBERO pick-and-place benchmark with a controlled structural defect of early gripper release during the carry phase, demonstration-curation metrics decouple from policy quality: the metric with the highest defect-detection AUROC of 0.804 produces the worst curated policy at 13.3 percent task success, while a metric with a lower AUROC of 0.638 produces a policy that nearly matches the oracle trained on ground-truth clean data at 90.0 percent versus 93.3 percent.

What carries the argument

The decoupling between a curation metric's defect-detection AUROC and the task success rate of the behavior-cloning policy trained on the filtered demonstrations.

If this is right

  • Five of the seven metrics exploit episode length as a trivial proxy for the defect label, inflating reported AUROCs to near-perfect values.
  • Any curation benchmark must control for episode length before reporting detection accuracy.
  • The contaminated baseline succeeds on only 3.3 percent of rollouts, and the two best curation methods close this gap to within 3 percentage points of the 93.3 percent oracle ceiling.
  • Curation methods should be evaluated by the policy they produce, not the defects they flag.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Metrics that avoid length proxies may still need direct testing on policy outcomes rather than detection scores alone.
  • The same decoupling could appear in other imitation-learning domains whenever data defects affect control differently than they affect length statistics.
  • Selecting curation metrics via policy-performance feedback loops on a small held-out set could be a practical next step.

Load-bearing premise

The controlled early-gripper-release defect on this pick-and-place task produces defective episodes whose detection properties generalize to the broader class of real-world training defects that curation methods are intended to handle.

What would settle it

Re-running the same metrics on a different task or defect type where episode length does not correlate with the defect label and checking whether the AUROC-to-policy-success reversal persists.

Figures

Figures reproduced from arXiv: 2606.10229 by Aarav Bedi.

Figure 1
Figure 1. Figure 1: Detection AUROC versus downstream task success for all seven [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
read the original abstract

We study whether demonstration-curation metrics that detect defective training episodes also improve the downstream behavior-cloning policy that trains on the curated data. On a contact-rich LIBERO pick-and-place benchmark with a controlled structural defect (early gripper release during the carry phase), we find that the two quantities are sharply decoupled. The metric with the highest defect-detection AUROC (0.804) produces the worst curated policy (13.3% task success), while a metric with a substantially lower AUROC (0.638) produces a policy that nearly matches the oracle trained on ground-truth clean data (90.0% vs. 93.3%). We further show that five of the seven metrics we evaluate exploit episode length as a trivial proxy for the defect label, a confound that inflates reported AUROCs to near-perfect values and disappears once episode length is controlled. Across all conditions, the contaminated baseline succeeds on only 3.3% of rollouts, and the two best curation methods close this to within 3 percentage points of the 93.3% oracle ceiling. Our results argue that curation methods should be evaluated by the policy they produce, not the defects they flag, and that any curation benchmark must control for episode length before reporting detection accuracy. We release the testbed, all metric implementations, and the evaluation pipeline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript empirically studies demonstration curation metrics on a contact-rich LIBERO pick-and-place task with a single controlled structural defect (early gripper release in the carry phase). It reports that defect-detection AUROC and downstream behavior-cloning policy success are decoupled: the metric with highest AUROC (0.804) yields the lowest curated policy success (13.3%), while a metric with lower AUROC (0.638) reaches 90.0% success (near the 93.3% oracle trained on clean data). Five of seven metrics are shown to exploit episode length as a proxy for the defect label, inflating AUROCs; controlling for length removes the effect. The contaminated baseline achieves only 3.3% success. The authors conclude that curation should be evaluated by resulting policy quality rather than detection AUROC and that benchmarks must control for episode length; they release the testbed, metric implementations, and evaluation pipeline.

Significance. If the reported decoupling is robust, the work would usefully redirect curation research in imitation learning toward direct policy evaluation over proxy detection metrics. The concrete numbers (AUROC-policy mismatch, near-oracle recovery by two methods, and the episode-length confound) are actionable, and the open release of code and testbed enables verification and extension. The result is internally consistent with the single-defect setup but its broader prescriptive claim depends on stability across defect classes.

major comments (1)
  1. [experimental results] The central prescriptive claim—that curation methods 'should be evaluated by the policy they produce' and that 'any curation benchmark must control for episode length'—rests on results from one controlled defect (early gripper release during carry) on one LIBERO task. The relative ranking of metrics by AUROC versus policy impact may not be stable for other defect signatures (sensor noise, perceptual failures, incomplete sub-trajectories). Section on experimental results (and abstract): additional defect classes are needed to support the general recommendation.
minor comments (2)
  1. [abstract and methods] The abstract states that five of seven metrics exploit episode length but does not name the metrics or show the controlled-length AUROC table; adding an explicit table or subsection listing all seven metrics and their length-controlled AUROCs would improve clarity.
  2. [results] The success-rate numbers (13.3%, 90.0%, 93.3%, 3.3%) are reported without error bars or statistical tests; including standard errors or significance tests across the N rollouts would strengthen the decoupling claim.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive review. We address the major comment point by point below.

read point-by-point responses
  1. Referee: [experimental results] The central prescriptive claim—that curation methods 'should be evaluated by the policy they produce' and that 'any curation benchmark must control for episode length'—rests on results from one controlled defect (early gripper release during carry) on one LIBERO task. The relative ranking of metrics by AUROC versus policy impact may not be stable for other defect signatures (sensor noise, perceptual failures, incomplete sub-trajectories). Section on experimental results (and abstract): additional defect classes are needed to support the general recommendation.

    Authors: We agree that the experiments use a single controlled defect type on one task, which limits the strength of the general prescriptive recommendation. The study deliberately employs this isolated structural defect to demonstrate the AUROC-policy decoupling and length confound in a reproducible way. We will revise the abstract and experimental results section to explicitly qualify the claims as applying to this defect class, state that the relative metric rankings may vary for other signatures, and add a limitations paragraph noting the need for future validation across additional defect types. This is a partial revision, as we cannot collect new data for additional defect classes within the revision timeline but can clarify scope and limitations in the text. revision: partial

Circularity Check

0 steps flagged

No circularity; purely empirical evaluation of metrics vs. policy outcomes

full rationale

The paper reports experimental results on a LIBERO pick-and-place task with one controlled defect type. It measures defect-detection AUROC for seven metrics and separately measures task success of behavior-cloning policies trained on data curated by each metric. These quantities are computed from distinct held-out rollouts against an oracle (ground-truth clean data) and a contaminated baseline. No equations, fitted parameters renamed as predictions, self-citations, or ansatzes appear in the provided text; the central claim of decoupling is an observed empirical pattern, not a reduction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmarking study; the abstract introduces no new theoretical constructs, fitted parameters, or postulated entities.

pith-pipeline@v0.9.1-grok · 5761 in / 1255 out tokens · 23146 ms · 2026-06-27T16:00:36.244115+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

11 extracted references · 2 canonical work pages · 2 internal anchors

  1. [1]

    A robust and sensitive metric for quantifying move- ment smoothness,

    S. Balasubramanian, A. Melendez-Calderon, and E. Bur- det, “A robust and sensitive metric for quantifying move- ment smoothness,”IEEE Trans. Biomed. Eng., vol. 59, no. 8, pp. 2126–2136, 2012

  2. [2]

    ALVINN: An autonomous land vehi- cle in a neural network,

    D. A. Pomerleau, “ALVINN: An autonomous land vehi- cle in a neural network,” inAdvances in Neural Infor- mation Processing Systems, 1989

  3. [3]

    A reduction of imitation learning and structured prediction to no-regret online learning,

    S. Ross, G. Gordon, and D. Bagnell, “A reduction of imitation learning and structured prediction to no-regret online learning,” inProc. AISTATS, 2011

  4. [4]

    Learning fine-grained bimanual manipulation with low-cost hard- ware,

    T. Z. Zhao, V . Kumar, S. Levine, and C. Finn, “Learning fine-grained bimanual manipulation with low-cost hard- ware,” inProc. RSS, 2023

  5. [5]

    Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Open X-Embodiment Collaboration, “Open X- Embodiment: Robotic learning datasets and RT-X models,” arXiv:2310.08864, 2023

  6. [6]

    DROID: A large-scale in-the-wild robot manipulation dataset,

    A. Khazatsky et al., “DROID: A large-scale in-the-wild robot manipulation dataset,” inProc. RSS, 2024

  7. [7]

    LeRobot: An open-source library for end-to-end robot learning,

    R. Cadene et al., “LeRobot: An open-source library for end-to-end robot learning,” inProc. ICLR, 2026

  8. [8]

    What matters in learning from offline human demonstrations for robot manipulation,

    A. Mandlekar et al., “What matters in learning from offline human demonstrations for robot manipulation,” inProc. CoRL, 2021

  9. [9]

    Isolation forest,

    F. T. Liu, K. M. Ting, and Z.-H. Zhou, “Isolation forest,” inProc. IEEE ICDM, 2008

  10. [10]

    LIBERO: Benchmarking knowledge trans- fer in lifelong robot learning,

    B. Liu et al., “LIBERO: Benchmarking knowledge trans- fer in lifelong robot learning,” inProc. NeurIPS, 2023

  11. [11]

    robosuite: A Modular Simulation Framework and Benchmark for Robot Learning

    Yuke Zhu, J. Wong, A. Mandlekar, R. Mart ´ın-Mart´ın, A. Joshi, K. Lin, A. Maddukuri, S. Nasiriany, and Yifeng Zhu, “robosuite: A modular simulation framework and benchmark for robot learning,” arXiv:2009.12293, 2020