pith. sign in

arxiv: 2606.05588 · v1 · pith:UYYA2O23new · submitted 2026-06-04 · 💻 cs.RO · cs.LG

Auditing Demonstration Curation Metrics: Action-Only Scorers Fail on the Structural Defects That Degrade Imitation Policies

Pith reviewed 2026-06-28 01:50 UTC · model grok-4.3

classification 💻 cs.RO cs.LG
keywords imitation learningdemonstration curationbehavior cloningstructural defectsaction-only metricsstate trajectorypolicy performance
0
0 comments X

The pith

Action-only metrics for curating imitation demonstrations miss structural errors like wrong actions at key moments, and some score the defective data higher, leaving policies no better than the uncurated baseline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a controlled testbed that injects known defect types into demonstration data and measures both how well each of seven curation metrics separates defective from clean data and whether the resulting filtered set produces a better behavior-cloning policy. Subtle defects such as correlated action noise, tremor, and truncation are caught by multivariate outlier scoring, and removing them recovers the full downstream performance gap. Structural defects, in which the demonstration executes a wrong action at a critical moment, remain invisible to every action-only metric tested; two of those metrics actually assign higher scores to the defective demonstrations. When those metrics are used for curation, the trained policy performs at or below the uncurated baseline. Metrics that inspect state trajectories detect the structural defects, yet the strongest of them closes only one third of the performance gap. Detection accuracy alone does not guarantee downstream policy improvement.

Core claim

In a testbed with known injected defects, structural errors where a demonstration takes a wrong action at a key moment remain invisible to action-only curation metrics, and two such metrics score the defective demonstrations as higher quality; when used to curate training data, these metrics produce policies no better than or worse than the uncurated baseline. Metrics that inspect state trajectories detect the structural errors, yet the best of them recovers only one third of the downstream performance gap. High accuracy at separating clean from defective data does not guarantee improvement in the imitation policy trained on the curated set.

What carries the argument

The controlled testbed that injects known defect types into demonstrations and evaluates curation metrics on both separation accuracy and downstream policy success after filtering.

If this is right

  • Subtle perturbations such as correlated action noise or truncation can be removed by multivariate outlier scoring to recover the full downstream performance gap.
  • Action-only metrics remain blind to structural errors and two of them assign higher quality scores to defective demonstrations.
  • Curation with the inverted action-only metrics leaves the trained policy at or below the uncurated baseline.
  • Metrics that examine state trajectories detect structural errors, but the strongest recovers only one third of the downstream gap.
  • High detection accuracy on defective demonstrations does not ensure downstream policy improvement.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If real human demonstrations contain structural errors at key moments, relying on action-only metrics for curation may systematically preserve or amplify harmful data.
  • Future metrics that combine action sequences with state trajectories could close more than one third of the performance gap left by current methods.
  • The released testbed provides a standard way to evaluate whether new curation scores improve actual policy success rather than only detection accuracy.

Load-bearing premise

The injected defects of the tested types are representative of the structural and subtle errors that occur in real human-collected demonstration data.

What would settle it

Apply the same set of metrics to a collection of real human demonstrations that contain independently verified structural errors at known moments, then measure whether the downstream policy success rates match the patterns observed in the controlled testbed.

Figures

Figures reproduced from arXiv: 2606.05588 by Aarav Bedi (University of California, Berkeley).

Figure 1
Figure 1. Figure 1: Defect-detection AUROC by metric and regime. No metric is reliable [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Downstream policy success in the structural-defect regime. Dashed [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
read the original abstract

Imitation-learning policies inherit the quality of the demonstrations they are trained on, and a growing set of curation metrics promise to score and filter low-quality demonstrations automatically. These metrics are each validated on different data with different protocols, so it is unclear which of them actually identify the demonstrations that harm a policy. We build a controlled testbed in which demonstration defects are injected with known type, and audit seven curation metrics along two axes: how well each separates defective from clean demonstrations, and whether training a behavior-cloning policy on each metric's curated subset improves task success. We study two defect regimes. Subtle perturbations (correlated action noise, tremor, truncation) are detectable by multivariate outlier scoring and, once removed, recover the full downstream gap. Structural errors, where the demonstration executes a wrong action at a key moment, are invisible to every action-only metric we test, and two of them are inverted: they score defective demonstrations as higher quality and, used for curation, tend to leave the policy at or below the uncurated baseline rather than above it. Only metrics that examine the state trajectory detect structural errors, and even the best of them recovers just a third of the downstream gap. High detection accuracy does not guarantee downstream improvement. We release the testbed and all curation implementations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper constructs a controlled testbed by injecting known defects (subtle perturbations like correlated noise/tremor/truncation, and structural errors like wrong actions at key moments) into demonstrations. It audits seven curation metrics along two axes: separation of defective vs. clean demos, and whether curation improves downstream behavior-cloning policy success. Findings: subtle defects are detectable by outlier scoring with full gap recovery; structural errors are invisible to all action-only metrics (with two inverted, yielding policies at or below baseline); state-trajectory metrics detect them but recover only ~1/3 of the gap. High detection accuracy does not guarantee downstream gains. The testbed and implementations are released.

Significance. If the results hold, the work is significant as an empirical audit exposing limitations of action-only curation metrics in imitation learning, showing they can be blind or counterproductive to structural defects that harm policies. The controlled testbed design with known defect types supports clear evaluation axes. Releasing the testbed and code is a clear strength for reproducibility and community follow-up. The observation that detection accuracy does not imply policy improvement is a useful cautionary finding.

major comments (2)
  1. [Testbed construction] Testbed construction (defect injection protocol): The central claims about action-only metrics failing on structural errors rest on the specific synthetic injection of 'wrong action at key moments,' but the manuscript provides no quantitative comparison (e.g., action histogram divergence or state-transition statistics) to real human demonstration errors with labeled defects; this is load-bearing for transfer to practical curation scenarios.
  2. [Evaluation results] Downstream evaluation results: The claim that even the best state-trajectory metric recovers only one-third of the downstream gap requires explicit reporting of the number of random seeds, variance, and statistical significance to substantiate the quantitative recovery fraction and the inversion results for the two action-only metrics.
minor comments (2)
  1. The abstract refers to seven metrics without naming or categorizing them (action-only vs. state-trajectory); adding a summary table would improve immediate clarity.
  2. Clarify in the methods how 'key moments' for structural error injection are identified or selected, as this affects reproducibility of the testbed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below.

read point-by-point responses
  1. Referee: [Testbed construction] Testbed construction (defect injection protocol): The central claims about action-only metrics failing on structural errors rest on the specific synthetic injection of 'wrong action at key moments,' but the manuscript provides no quantitative comparison (e.g., action histogram divergence or state-transition statistics) to real human demonstration errors with labeled defects; this is load-bearing for transfer to practical curation scenarios.

    Authors: The testbed uses controlled synthetic injection precisely to isolate known defect types and attribute metric/policy effects without confounding factors. We make no claim that the injected defects match the statistical distribution of real human errors (which would require labeled real data that does not exist in public benchmarks). The core result is that action-only metrics are blind or inverted on structural defects of this form; we will add an explicit limitations paragraph stating the synthetic scope and that transfer to unlabeled real curation remains an open question. revision: partial

  2. Referee: [Evaluation results] Downstream evaluation results: The claim that even the best state-trajectory metric recovers only one-third of the downstream gap requires explicit reporting of the number of random seeds, variance, and statistical significance to substantiate the quantitative recovery fraction and the inversion results for the two action-only metrics.

    Authors: We agree these details are necessary. The revised manuscript will report the exact number of random seeds (10 per condition), mean and standard deviation of success rates across seeds, and paired t-test p-values for the one-third recovery claim and the two inversion cases. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical audit with external testbed

full rationale

The paper is an empirical audit that constructs a controlled testbed, injects known defect types into demonstrations, and measures metric performance via separation accuracy and downstream behavior-cloning success rates. No derivations, equations, or first-principles predictions are present that could reduce to fitted inputs or self-referential definitions. Central claims rest on direct experimental comparisons against the externally constructed testbed rather than any self-citation chain or ansatz. This is the most common honest finding for purely empirical work that remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the representativeness of the synthetic defects and on standard assumptions of behavior cloning; no free parameters, invented entities, or non-standard axioms are introduced in the abstract.

axioms (1)
  • domain assumption Behavior cloning performance is a monotonic function of demonstration quality when defects are of the tested types.
    Implicit in the downstream policy success evaluation.

pith-pipeline@v0.9.1-grok · 5769 in / 1352 out tokens · 36372 ms · 2026-06-28T01:50:04.628243+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

10 extracted references · 3 canonical work pages · 2 internal anchors

  1. [1]

    ALVINN: An autonomous land vehicle in a neural network,

    D. A. Pomerleau, “ALVINN: An autonomous land vehicle in a neural network,” inAdvances in Neural Information Processing Systems, 1989

  2. [2]

    A reduction of imitation learning and structured prediction to no-regret online learning,

    S. Ross, G. Gordon, and D. Bagnell, “A reduction of imitation learning and structured prediction to no-regret online learning,” inProc. AISTATS, 2011

  3. [3]

    Learning fine-grained bimanual manipulation with low-cost hardware,

    T. Z. Zhao, V . Kumar, S. Levine, and C. Finn, “Learning fine-grained bimanual manipulation with low-cost hardware,” inProc. RSS, 2023

  4. [4]

    Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Open X-Embodiment Collaboration, “Open X-Embodiment: Robotic learning datasets and RT-X models,” arXiv:2310.08864, 2023

  5. [5]

    BridgeData V2: A dataset for robot learning at scale,

    H. Walkeet al., “BridgeData V2: A dataset for robot learning at scale,” inProc. CoRL, 2023

  6. [6]

    DROID: A large-scale in-the-wild robot manipu- lation dataset,

    A. Khazatskyet al., “DROID: A large-scale in-the-wild robot manipu- lation dataset,” inProc. RSS, 2024

  7. [7]

    LeRobot: An open-source library for end-to-end robot learning,

    R. Cadeneet al., “LeRobot: An open-source library for end-to-end robot learning,” inProc. ICLR, 2026, arXiv:2602.22818

  8. [8]

    What Matters in Learning from Offline Human Demonstrations for Robot Manipulation

    A. Mandlekaret al., “What matters in learning from offline human demonstrations for robot manipulation,” inProc. Conf. on Robot Learn- ing (CoRL), 2021, arXiv:2108.03298

  9. [9]

    Isolation forest,

    F. T. Liu, K. M. Ting, and Z.-H. Zhou, “Isolation forest,” inProc. IEEE Int. Conf. Data Mining (ICDM), 2008

  10. [10]

    A robust and sensitive metric for quantifying movement smoothness,

    S. Balasubramanian, A. Melendez-Calderon, and E. Burdet, “A robust and sensitive metric for quantifying movement smoothness,”IEEE Trans. Biomed. Eng., vol. 59, no. 8, pp. 2126–2136, 2012