
arxiv: 2505.04897 · v2 · submitted 2025-05-08 · 💻 cs.RO · cs.LG

CubeDAgger: Interactive Imitation Learning for Dynamic Systems with Efficient yet Low-risk Interaction

Pith reviewed 2026-05-22 16:59 UTC · model grok-4.3

classification 💻 cs.RO cs.LG
keywords interactive imitation learning · dynamic systems · robot stability · expert supervision · action consensus · colored noise exploration · scooping task

The pith

CubeDAgger replaces expert switching with action consensus to keep dynamic robots stable during imitation learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CubeDAgger as a way to make interactive imitation learning work on dynamic tasks where prior switching methods cause abrupt action jumps and loss of stability. It starts from the EnsembleDAgger baseline and adds three targeted changes: regularization that forces the supervision threshold to activate properly, replacement of the switch with an optimal consensus among several action candidates, and injection of autoregressive colored noise into the agent's actions. Simulations confirm that the resulting policies stay robust and keep the system dynamically stable throughout the interaction phase. Real-robot scooping trials with a human expert further show that a working policy can be obtained from scratch after only thirty minutes of supervised interaction.
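As a rough illustration of the supervision-timing decision that the first improvement targets, the sketch below computes the per-dimension variance of an ensemble's action candidates and compares it with a threshold. The function names, the use of the maximum variance, and the threshold value are assumptions made for illustration; the paper's regularization term and exact decision rule are not reproduced here.

```python
from statistics import fmean, pvariance


def ensemble_stats(candidate_actions):
    """Per-dimension mean and variance across ensemble members."""
    dims = list(zip(*candidate_actions))  # transpose: one tuple per action dimension
    mean = [fmean(d) for d in dims]
    var = [pvariance(d) for d in dims]
    return mean, var


def needs_expert(candidate_actions, variance_threshold):
    """Hypothetical rule: ask for supervision when any dimension disagrees too much."""
    _, var = ensemble_stats(candidate_actions)
    return max(var) > variance_threshold


# Toy candidates (made-up numbers): three ensemble members, two action dimensions.
agree = [[0.10, -0.20], [0.11, -0.19], [0.09, -0.21]]
disagree = [[0.10, -0.20], [0.60, 0.30], [-0.40, -0.70]]

ask_when_agreeing = needs_expert(agree, variance_threshold=0.01)
ask_when_disagreeing = needs_expert(disagree, variance_threshold=0.01)
```

If the ensemble's variance never approaches the threshold, a rule like this never fires, which is presumably why the paper regularizes the model so that the threshold actually activates.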

Core claim

CubeDAgger improves robustness while preserving dynamic stability in interactive imitation learning for dynamic systems by adding regularization to the supervision threshold, converting the expert-agent switch into an optimal consensus among multiple action candidates, and injecting autoregressive colored noise for time-consistent exploration; these changes allow policies to be trained from scratch with limited expert time, as verified in simulation and in thirty-minute real-robot scooping experiments.

What carries the argument

The optimal consensus system of multiple action candidates, which replaces direct expert-agent switching so that supervision timing no longer produces discontinuous control signals.
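One way to picture the difference between switching and consensus is a weighted L_p center of the action candidates, computed per action dimension. This is a minimal sketch under assumptions: the objective, the exponent p, the weights, and the bisection solver are illustrative choices and may differ from the paper's formulation.

```python
def consensus_1d(candidates, weights, p=2.0, iters=60):
    """Minimize sum_k w_k * |a - c_k|**p over a scalar a by bisection (requires p > 1)."""

    def grad(a):
        # Derivative of the objective, up to the constant factor p.
        total = 0.0
        for c, w in zip(candidates, weights):
            diff = a - c
            if diff != 0.0:
                sign = 1.0 if diff > 0.0 else -1.0
                total += w * abs(diff) ** (p - 1.0) * sign
        return total

    # The derivative is non-decreasing in a, so its root lies between the extremes.
    lo, hi = min(candidates), max(candidates)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if grad(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


# Hypothetical scalar actions proposed by the agent and the expert.
agent_action, expert_action = 0.0, 1.0

# Shifting weight from the agent to the expert moves the action gradually.
ramp = [
    consensus_1d([agent_action, expert_action], [1.0 - s, s])
    for s in (0.0, 0.25, 0.5, 0.75, 1.0)
]
```

With p = 2 the result is the weighted mean of the candidates, and a larger p pulls it toward the midpoint of the extreme candidates. A hard switch corresponds to jumping straight from the first entry of `ramp` to the last.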

If this is right

  • Trained policies remain dynamically stable even while receiving occasional expert corrections.
  • Robust control can be achieved for contact-rich tasks such as scooping after brief human supervision.
  • The same three improvements can be applied to other dynamic robot behaviors that previously failed under switching-based imitation learning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The consensus approach may generalize to other imitation settings where discontinuous actions destabilize underactuated or fast dynamics.
  • Shorter expert sessions could make imitation learning practical for tasks that currently demand long teleoperation.
  • Testing the colored-noise component in isolation would clarify how much of the stability gain comes from time-consistent exploration rather than the consensus step.
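The time-consistent exploration mentioned above can be illustrated with a first-order autoregressive (AR(1)) process, which generates correlated noise one step at a time. The defaults dt = 0.05 and time constant 3 mirror the values in the Figure 3 caption, but reading T as a time constant, the form alpha = exp(-dt / T), and the unit-variance scaling are assumptions of this sketch, not details confirmed by the paper.

```python
import math
import random


def ar1_noise(steps, dt=0.05, time_constant=3.0, seed=0):
    """Generate unit-variance AR(1) noise sequentially.

    Assumption: the correlation coefficient is alpha = exp(-dt / time_constant).
    """
    rng = random.Random(seed)
    alpha = math.exp(-dt / time_constant)
    scale = math.sqrt(1.0 - alpha * alpha)  # keeps the stationary variance at 1
    x = rng.gauss(0.0, 1.0)  # start from the stationary distribution
    samples = []
    for _ in range(steps):
        samples.append(x)
        x = alpha * x + scale * rng.gauss(0.0, 1.0)
    return samples


def lag1_autocorrelation(xs):
    """Sample correlation between consecutive values."""
    m = sum(xs) / len(xs)
    num = sum((a - m) * (b - m) for a, b in zip(xs, xs[1:]))
    den = sum((a - m) ** 2 for a in xs)
    return num / den


red = ar1_noise(5000)                          # slowly varying noise
white = ar1_noise(5000, time_constant=1e-3)    # alpha is near 0: close to white noise

red_corr = lag1_autocorrelation(red)
white_corr = lag1_autocorrelation(white)
```

Because each sample depends only on the previous one, the noise can be drawn online at whatever step the controller happens to run, rather than precomputing a whole sequence in advance.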

Load-bearing premise

Turning the expert switching system into an optimal consensus of action candidates will remove abrupt changes without creating new instability or slowing down learning.

What would settle it

A real-robot run in which the policy loses dynamic stability or requires far more than thirty minutes of expert interaction to reach comparable robustness.

Figures

Figures reproduced from arXiv: 2505.04897 by Taisuke Kobayashi.

Figure 1: Scooping three balls by a net attached on a quadruped
Figure 2: Geometries for designing p_i and w_k
Excerpt from the adjacent text: Note that the intermediate p_i has a mixture of these properties. As the consensus should be fair while anomalous opinions should be excluded, it is worth designing the proper p_i according to the distribution shape of candidates. The above properties are the same even when w_k is included, since it only varies the pseudo-number of candidates. Thus, given appropriate weights…
Figure 3: Time consistency of red noise (∆t = 0.05 and T = 3)
Excerpt from the adjacent text: Therefore, this study employs colored noise, which has been reported to accelerate exploration more than white noise, inspired by the literature [26]. However, previous studies have used an implementation that generates time-series noises with a specified time step in advance, perhaps in order to investigate general colored noise, which is not flexible …
Figure 4: Trajectories during data collection (dashed lines: the experts’ average scores)
Figure 5: Statistical evaluation with 21 random seeds for each
Figure 6: Evaluation of real-world scooping task
Excerpt from the adjacent text (V. Conclusion): This paper proposed a novel IIL method, named CubeDAgger, an improved version of EnsembleDAgger, for making it applicable to dynamic tasks. The improvements are threefold: i) control the output variance of the ensemble model to make the safety decision work better; ii) design an optimization problem to derive consensus from multiple action candidates; …
Original abstract

Interactive imitation learning makes an agent's control policy robust by stepwise supervisions from an expert. The recent algorithms mostly employ expert-agent switching systems to reduce the expert's burden by limitedly selecting the supervision timing. However, this approach is useful only for static tasks; in dynamic tasks, timing discrepancies cause abrupt changes in actions, losing the robot's dynamic stability. This paper therefore proposes a novel method, named CubeDAgger, which improves robustness with less dynamic stability violations even for dynamic tasks. The proposed method is designed on a baseline, EnsembleDAgger, with three improvements. The first adds a regularization to explicitly activate the threshold for deciding the supervision timing. The second transforms the expert-agent switching system to an optimal consensus system of multiple action candidates. Third, autoregressive colored noise is injected to the agent's actions for time-consistent exploration. These improvements are verified by simulations, showing that the trained policies are sufficiently robust while maintaining dynamic stability during interaction. Finally, real-robot scooping experiments with a human expert demonstrate that the proposed method can learn robust policies from scratch based on just 30 minutes of interaction. https://youtu.be/kBl3SCTnVEM

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CubeDAgger as an extension of EnsembleDAgger for interactive imitation learning in dynamic robotic systems. It adds three modifications to the baseline: (1) regularization that explicitly activates the supervision threshold, (2) replacement of binary expert-agent switching with an optimal consensus over multiple action candidates, and (3) injection of autoregressive colored noise into the agent's actions for time-consistent exploration. The central claim is that these changes yield robust policies for dynamic tasks while reducing dynamic stability violations, supported by simulation results and a real-robot scooping demonstration that learns from scratch using only 30 minutes of human-expert interaction.

Significance. If the empirical results and stability claims hold under closer scrutiny, the work would offer a practical advance in reducing expert burden for interactive imitation learning on systems with fast dynamics, where abrupt action switches are known to cause instability. The real-robot scooping result with limited interaction time would be a notable efficiency demonstration for contact-rich manipulation tasks.

major comments (2)
  1. [§3.2] §3.2 (Optimal Consensus System): The central stability claim rests on the assertion that replacing binary switching with an optimal consensus over action candidates eliminates abrupt changes without introducing new instability or latency. However, the manuscript provides no solver iteration counts, measured wall-clock latency on the target hardware, or Lyapunov-style argument showing that the blended actions remain compatible with the plant's natural frequencies. This analysis is load-bearing for dynamic tasks such as scooping, where 10-20 ms timing errors can destabilize the closed-loop trajectory.
  2. [Experiments] Experimental results (simulations and real-robot section): The reported success in maintaining dynamic stability is not accompanied by quantitative metrics such as success rates, number of stability violations per trial, or error bars across repeated runs. Without these, it is difficult to assess whether the consensus step truly preserves or improves robustness relative to the EnsembleDAgger baseline.
minor comments (2)
  1. [Abstract] The abstract states that policies are 'sufficiently robust' but does not define the quantitative criteria used for this judgment; a brief definition or reference to the evaluation metrics in the main text would improve clarity.
  2. [Method] Notation for the consensus weighting parameters and the autoregressive noise process should be introduced with explicit equations rather than descriptive text to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate additional analysis and quantitative results where feasible.

  1. Referee: [§3.2] §3.2 (Optimal Consensus System): The central stability claim rests on the assertion that replacing binary switching with an optimal consensus over action candidates eliminates abrupt changes without introducing new instability or latency. However, the manuscript provides no solver iteration counts, measured wall-clock latency on the target hardware, or Lyapunov-style argument showing that the blended actions remain compatible with the plant's natural frequencies. This analysis is load-bearing for dynamic tasks such as scooping, where 10-20 ms timing errors can destabilize the closed-loop trajectory.

    Authors: We agree that explicit solver details and latency measurements would strengthen the presentation. In the revision we will report typical iteration counts for the consensus optimization (a small convex program solved at each step) and wall-clock latency measured on the real-robot hardware. Regarding formal stability, the manuscript does not contain a Lyapunov argument; instead we rely on the design property that the consensus produces a convex combination of candidate actions, which empirically reduces abrupt switches and stability violations compared with binary switching. We will add a brief discussion of this smoothness property and its relation to the plant's natural frequencies, while acknowledging that a full Lyapunov analysis remains future work. revision: partial

  2. Referee: [Experiments] Experimental results (simulations and real-robot section): The reported success in maintaining dynamic stability is not accompanied by quantitative metrics such as success rates, number of stability violations per trial, or error bars across repeated runs. Without these, it is difficult to assess whether the consensus step truly preserves or improves robustness relative to the EnsembleDAgger baseline.

    Authors: We concur that the experimental section would benefit from more rigorous quantitative reporting. The revised manuscript will include success rates, mean number of stability violations per trial, and standard deviations (error bars) computed over repeated simulation trials and multiple real-robot scooping runs. These metrics will be presented for both CubeDAgger and the EnsembleDAgger baseline to enable direct comparison of robustness. revision: yes
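The kind of reporting requested here could be computed along the following lines. This is a sketch with hypothetical field names and made-up placeholder trials; none of the numbers come from the paper.

```python
from statistics import mean, stdev


def summarize_trials(trials):
    """Success rate plus mean and sample standard deviation of stability violations."""
    successes = [1.0 if t["success"] else 0.0 for t in trials]
    violations = [t["violations"] for t in trials]
    return {
        "success_rate": mean(successes),
        "violations_mean": mean(violations),
        # The sample standard deviation needs at least two trials.
        "violations_std": stdev(violations) if len(violations) > 1 else 0.0,
    }


# Placeholder trials for illustration only (not experimental results).
toy_trials = [
    {"success": True, "violations": 0},
    {"success": True, "violations": 2},
    {"success": False, "violations": 4},
    {"success": True, "violations": 2},
]

summary = summarize_trials(toy_trials)
```

Running the same summary for each method, for example once for CubeDAgger and once for the EnsembleDAgger baseline, would give the side-by-side comparison the referee asks for.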

Circularity Check

0 steps flagged

No significant circularity: the algorithmic design choices are verified empirically.


The paper proposes CubeDAgger as an extension of the EnsembleDAgger baseline with three independent algorithmic improvements: regularization to activate the supervision threshold, transformation of expert-agent switching into an optimal consensus over multiple action candidates, and injection of autoregressive colored noise for time-consistent exploration. These modifications are presented as design choices whose benefits for robustness and dynamic stability are verified through simulations and real-robot scooping experiments with a human expert, rather than through any closed mathematical derivation or equations. No load-bearing steps reduce claimed outcomes to self-definitions, fitted parameters renamed as predictions, or unverified self-citation chains; the results rest on external empirical testing against dynamic tasks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The paper introduces no new physical axioms or invented entities. It relies on standard assumptions from imitation learning (expert provides correct actions, dynamics are Markovian) and adds three algorithmic modifications whose parameters are presumably tuned on the reported tasks.

free parameters (2)
  • supervision threshold
    Regularized to decide when to query the expert; value not stated in abstract but central to the first improvement.
  • consensus weighting parameters
    Used to combine multiple action candidates; introduced with the second improvement.
axioms (1)
  • domain assumption: Expert actions are always available and correct when queried
    Implicit in all interactive imitation learning; invoked when describing the switching/consensus system.



Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages

  1. [1]

    Interactive imitation learning in robotics: A survey,

    C. Celemin, R. Pérez-Dattari, E. Chisari, G. Franzese, L. de Souza Rosa, R. Prakash, Z. Ajanović, M. Ferraz, A. Valada, J. Kober, et al., “Interactive imitation learning in robotics: A survey,” Foundations and Trends® in Robotics, vol. 10, no. 1-2, pp. 1–197, 2022

  2. [2]

    A reduction of imitation learning and structured prediction to no-regret online learning,

    S. Ross, G. Gordon, and D. Bagnell, “A reduction of imitation learning and structured prediction to no-regret online learning,” in International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings, 2011, pp. 627–635

  3. [3]

    Hg-dagger: Interactive imitation learning with human experts,

    M. Kelly, C. Sidrane, K. Driggs-Campbell, and M. J. Kochenderfer, “Hg-dagger: Interactive imitation learning with human experts,” in International Conference on Robotics and Automation. IEEE, 2019, pp. 8077–8083

  4. [4]

    Ensembledagger: A bayesian approach to safe imitation learning,

    K. Menda, K. Driggs-Campbell, and M. J. Kochenderfer, “Ensembledagger: A bayesian approach to safe imitation learning,” in IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2019, pp. 5041–5048

  5. [5]

    Uncertainty-aware data aggregation for deep imitation learning,

    Y. Cui, D. Isele, S. Niekum, and K. Fujimura, “Uncertainty-aware data aggregation for deep imitation learning,” in International Conference on Robotics and Automation. IEEE, 2019, pp. 761–767

  6. [6]

    Leveraging demonstrator-perceived precision for safe interactive imitation learning of clearance-limited tasks,

    H. Oh and T. Matsubara, “Leveraging demonstrator-perceived precision for safe interactive imitation learning of clearance-limited tasks,” IEEE Robotics and Automation Letters, vol. 9, no. 4, pp. 3387–3394, 2024

  7. [7]

    Evaluation of communication and human response latency for (human) teleoperation,

    D. G. Black, D. Andjelic, and S. E. Salcudean, “Evaluation of communication and human response latency for (human) teleoperation,” IEEE Transactions on Medical Robotics and Bionics, vol. 6, no. 1, pp. 53–63, 2024

  8. [8]

    Basic problems in stability and design of switched systems,

    D. Liberzon and A. S. Morse, “Basic problems in stability and design of switched systems,” IEEE control systems magazine, vol. 19, no. 5, pp. 59–70, 1999

  9. [9]

    Lazydagger: Reducing context switching in interactive imitation learning,

    R. Hoque, A. Balakrishna, C. Putterman, M. Luo, D. S. Brown, D. Seita, B. Thananjeyan, E. Novoseller, and K. Goldberg, “Lazydagger: Reducing context switching in interactive imitation learning,” in IEEE international conference on automation science and engineering. IEEE, 2021, pp. 502–509

  10. [10]

    Thriftydagger: Budget-aware novelty and risk gating for interactive imitation learning,

    R. Hoque, A. Balakrishna, E. Novoseller, A. Wilcox, D. S. Brown, and K. Goldberg, “Thriftydagger: Budget-aware novelty and risk gating for interactive imitation learning,” in Conference on Robot Learning. PMLR, 2022, pp. 598–608

  11. [11]

    Fleet-dagger: Interactive robot fleet learning with scalable human supervision,

    R. Hoque, L. Y. Chen, S. Sharma, K. Dharmarajan, B. Thananjeyan, P. Abbeel, and K. Goldberg, “Fleet-dagger: Interactive robot fleet learning with scalable human supervision,” in Conference on Robot Learning. PMLR, 2023, pp. 368–380

  12. [12]

    Dart: Noise injection for robust imitation learning,

    M. Laskey, J. Lee, R. Fox, A. Dragan, and K. Goldberg, “Dart: Noise injection for robust imitation learning,” in Conference on robot learning. PMLR, 2017, pp. 143–156

  13. [13]

    Bayesian disturbance injection: Robust imitation learning of flexible policies for robot manipulation,

    H. Oh, H. Sasaki, B. Michael, and T. Matsubara, “Bayesian disturbance injection: Robust imitation learning of flexible policies for robot manipulation,” Neural Networks, vol. 158, pp. 42–58, 2023

  14. [14]

    Better-than-demonstrator imitation learning via automatically-ranked demonstrations,

    D. S. Brown, W. Goo, and S. Niekum, “Better-than-demonstrator imitation learning via automatically-ranked demonstrations,” in Conference on robot learning. PMLR, 2020, pp. 330–359

  15. [15]

    R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 2018

  16. [16]

    Balancing exploration and exploitation with information and randomization,

    R. C. Wilson, E. Bonawitz, V. D. Costa, and R. B. Ebitz, “Balancing exploration and exploitation with information and randomization,” Current opinion in behavioral sciences, vol. 38, pp. 49–56, 2021

  17. [17]

    A survey of inverse reinforcement learning,

    S. Adams, T. Cody, and P. A. Beling, “A survey of inverse reinforcement learning,” Artificial Intelligence Review, vol. 55, no. 6, pp. 4307–4346, 2022

  18. [18]

    Guided reinforcement learning: A review and evaluation for efficient and effective real-world robotics [survey],

    J. Eßer, N. Bach, C. Jestel, O. Urbann, and S. Kerner, “Guided reinforcement learning: A review and evaluation for efficient and effective real-world robotics [survey],” IEEE Robotics & Automation Magazine, vol. 30, no. 2, pp. 67–85, 2022

  19. [19]

    Rlif: Interactive imitation learning as reinforcement learning,

    J. Luo, P. Dong, Y. Zhai, Y. Ma, and S. Levine, “Rlif: Interactive imitation learning as reinforcement learning,” in International Conference on Learning Representations, 2024

  20. [20]

    A framework for behavioural cloning

    M. Bain and C. Sammut, “A framework for behavioural cloning.” in Machine Intelligence 15, 1995, pp. 103–129

  21. [21]

    Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization,

    W. E. L. Ilboudo, T. Kobayashi, and T. Matsubara, “Adaterm: Adaptive t-distribution estimated robust moments for noise-robust stochastic gradient optimization,” Neurocomputing, vol. 557, p. 126692, 2023

  22. [22]

    Disagreement-regularized imitation learning,

    K. Brantley, W. Sun, and M. Henaff, “Disagreement-regularized imitation learning,” in International Conference on Learning Representations, 2020

  23. [23]

    Lira: Light-robust adversary for model-based reinforcement learning in real world,

    T. Kobayashi, “Lira: Light-robust adversary for model-based reinforcement learning in real world,” Robotics and Autonomous Systems, p. 105057, 2025

  24. [24]

    On central tendency and dispersion measures for intervals and hypercubes,

    M. Chavent and J. Saracco, “On central tendency and dispersion measures for intervals and hypercubes,” Communications in Statistics - Theory and Methods, vol. 37, no. 9, pp. 1471–1482, 2008

  25. [25]

    An enhancement of the bisection method average performance preserving minmax optimality,

    I. F. Oliveira and R. H. Takahashi, “An enhancement of the bisection method average performance preserving minmax optimality,” ACM Transactions on Mathematical Software, vol. 47, no. 1, pp. 1–24, 2021

  26. [26]

    Pink noise is all you need: Colored noise exploration in deep reinforcement learning,

    O. Eberhard, J. Hollenstein, C. Pinneri, and G. Martius, “Pink noise is all you need: Colored noise exploration in deep reinforcement learning,” in The Eleventh International Conference on Learning Representations, 2023

  27. [27]

    Revisiting experience replayable conditions,

    T. Kobayashi, “Revisiting experience replayable conditions,” Applied Intelligence, vol. 54, no. 19, pp. 9381–9394, 2024

  28. [28]

    Deep exploration via bootstrapped dqn,

    I. Osband, C. Blundell, A. Pritzel, and B. Van Roy, “Deep exploration via bootstrapped dqn,” Advances in neural information processing systems, vol. 29, 2016

  29. [29]

    Dinov2: Learning robust visual features without supervision,

    M. Oquab, T. Darcet, T. Moutakanni, H. V. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al., “Dinov2: Learning robust visual features without supervision,” Transactions on Machine Learning Research

  30. [30]

    Adaptive nonlinear system identification with echo state networks,

    H. Jaeger, “Adaptive nonlinear system identification with echo state networks,” Advances in neural information processing systems, vol. 15, 2002