arxiv: 2502.07645 · v3 · submitted 2025-02-11 · 💻 cs.RO

From Action Labels to Sets: Rethinking Action Supervision for Imitation Learning from Corrective Feedback

Pith reviewed 2026-05-23 03:57 UTC · model grok-4.3

classification 💻 cs.RO
keywords imitation learning · corrective feedback · behavior cloning · set-valued supervision · contrastive learning · human-in-the-loop · robotics

The pith

CLIC replaces the single action labels used in imitation learning with sets of desired actions built from human corrections.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Behavior cloning typically trains policies to match exact human action labels, but this breaks down when the labels are noisy, relative, or incomplete. The paper replaces point targets with sets of acceptable actions derived from interactive corrections. Policies are optimized to place probability mass across the entire set instead of matching one specific action. This handles absolute and relative corrections alike and supports multi-modal behaviors. Experiments indicate the method matches standard performance with clean data while showing greater robustness when feedback is imperfect.

Core claim

The paper establishes that imitation learning from corrective feedback can be reformulated by constructing sets of desired actions from human corrections and optimizing policies to distribute probability mass across these sets, rather than targeting single actions. This set-based supervision naturally accommodates imperfect feedback and enables representation of complex behaviors.
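The difference between a point target and a set target can be illustrated on a toy problem. The sketch below is not the paper's loss: it assumes a discretized 1D action grid, a softmax policy over negative energies, and a hypothetical desired set (`actions > 0.45`). It only shows that a policy spreading its mass across the set is penalized under a point-label objective but not under a set objective.

```python
import numpy as np

def policy_from_energy(energy, temperature=1.0):
    """Softmax over negative energies on a discrete action grid."""
    logits = -np.asarray(energy, dtype=float) / temperature
    logits = logits - logits.max()  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def point_target_loss(probs, target_index):
    """Negative log-likelihood of one labeled action (behavior-cloning style)."""
    return float(-np.log(probs[target_index]))

def set_target_loss(probs, in_set_mask):
    """Negative log of the total probability mass inside the desired set."""
    return float(-np.log(probs[in_set_mask].sum()))

# Hypothetical toy setup: 21 actions in [-1, 1], desired set = actions > 0.45.
actions = np.linspace(-1.0, 1.0, 21)
in_set = actions > 0.45
energy = np.where(in_set, 0.0, 5.0)   # low energy everywhere inside the set
probs = policy_from_energy(energy)

label_index = int(np.argmin(np.abs(actions - 0.6)))  # one (possibly noisy) label
point_loss = point_target_loss(probs, label_index)
set_loss = set_target_loss(probs, in_set)
print(f"point loss: {point_loss:.3f}, set loss: {set_loss:.3f}")
```

Here the set loss is close to zero because nearly all mass lies inside the set, while the point loss stays large because that mass is shared among six acceptable actions.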

What carries the argument

Set-valued action targets constructed from human corrections and optimized via contrastive policy learning to place probability mass over the sets.
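For an energy-based policy with π(a|s) ∝ exp(−E(s, a)/T), requiring a desired action to be at least as likely as an undesired one in the same state is equivalent to E(a⁺) ≤ E(a⁻), since the normalizer cancels. The sketch below uses a generic pairwise logistic (softplus) loss to push energies in that direction. This is a stand-in chosen for illustration, not the loss the paper derives, and the function names are hypothetical.

```python
import numpy as np

def pairwise_energy_loss(energy_pos, energy_neg, temperature=1.0):
    """Mean softplus((E+ - E-) / T); small when desired actions have lower energy."""
    margin = (np.asarray(energy_pos, dtype=float)
              - np.asarray(energy_neg, dtype=float)) / temperature
    return float(np.logaddexp(0.0, margin).mean())  # log(1 + exp(margin))

def satisfies_ordering(energy_pos, energy_neg):
    """True when every desired action has energy <= its paired undesired action."""
    return bool(np.all(np.asarray(energy_pos) <= np.asarray(energy_neg)))

good = pairwise_energy_loss([0.0], [5.0])   # desired action already preferred
bad = pairwise_energy_loss([5.0], [0.0])    # ordering violated
print(good, bad)
```

The loss never reaches exactly zero, so in practice it keeps widening the energy gap; a hinge on the margin would be one alternative if a hard stop is preferred.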

If this is right

  • The formulation accommodates both absolute and relative corrections.
  • It supports representation of complex multi-modal behaviors.
  • Performance remains competitive with state-of-the-art methods under accurate data.
  • Robustness increases substantially under noisy, relative, and partial feedback.
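One way to picture how absolute and relative corrections could both map to a set is a half-space of actions closer to the human action than to the robot action. That definition is an assumption made here for illustration, as are the function names and the `magnitude` step used to turn a relative direction into an implied action; the paper's own construction (including its α and ε parameters) may differ.

```python
import numpy as np

def relative_to_pair(a_robot, direction, magnitude):
    """Turn a relative correction (direction only) into an implied human action.

    `magnitude` is a hypothetical step size, not a quantity taken from the paper.
    """
    direction = np.asarray(direction, dtype=float)
    norm = np.linalg.norm(direction)
    if norm == 0.0:
        raise ValueError("correction direction must be non-zero")
    return np.asarray(a_robot, dtype=float) + magnitude * direction / norm

def half_space_from_pair(a_robot, a_human):
    """Return (normal, offset) with set = {a : normal @ a >= offset}.

    Assumed set: actions at least as close to a_human as to a_robot.
    """
    a_robot = np.asarray(a_robot, dtype=float)
    a_human = np.asarray(a_human, dtype=float)
    normal = a_human - a_robot
    midpoint = 0.5 * (a_human + a_robot)
    return normal, float(normal @ midpoint)

def in_half_space(action, normal, offset):
    return bool(normal @ np.asarray(action, dtype=float) >= offset)

# Absolute correction: the teacher supplies the action directly.
n_abs, b_abs = half_space_from_pair([0.0, 0.0], [1.0, 0.0])

# Relative correction: the teacher only indicates "more in +y".
a_implied = relative_to_pair([0.0, 0.0], direction=[0.0, 2.0], magnitude=0.2)
n_rel, b_rel = half_space_from_pair([0.0, 0.0], a_implied)
```

Under this assumption a relative correction yields a set that contains every action sufficiently far along the indicated direction, so the teacher does not need to supply an exact target.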

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This set-based view could allow effective learning from fewer total corrections by tolerating ambiguity in each one.
  • The same supervision change might improve other imitation methods that currently rely on point labels.
  • Policies trained this way could show better real-world generalization when human input varies across sessions.

Load-bearing premise

Sets of desired actions constructed from human corrections reliably capture the intended behavior without systematic bias or inconsistency.

What would settle it

A controlled experiment where the true set of optimal actions is known in advance and a policy trained on constructed sets from simulated corrections fails to assign high probability to that true set.
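Such a check could be scored by estimating how much of the policy's probability mass falls inside the known set. The sketch below is a minimal Monte Carlo version: it assumes action samples drawn from the trained policy are already available, and the membership function and the example set (`x >= 0.5`) are hypothetical. What counts as "high probability" is left to the experimenter.

```python
import numpy as np

def mass_in_set(samples, is_in_true_set):
    """Fraction of sampled actions that fall inside the known optimal set."""
    samples = np.asarray(samples, dtype=float)
    inside = np.array([bool(is_in_true_set(a)) for a in samples])
    return float(inside.mean())

# Hypothetical ground-truth set: first action coordinate at least 0.5.
def true_set(action):
    return action[0] >= 0.5

# Stand-in for actions sampled from a trained policy in one fixed state.
policy_samples = np.array([[0.9, 0.0], [0.6, 0.1], [0.2, 0.0], [-0.5, 0.0]])
estimated_mass = mass_in_set(policy_samples, true_set)
print(estimated_mass)
```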

Figures

Figures reproduced from arXiv: 2502.07645 by Cosimo Della Santina, Jens Kober, Robert Babuska, Rodrigo Pérez-Dattari, Zhaoting Li.

Figure 1: A: Our method operates in an Interactive Imitation Learning framework. Example rollouts of this framework are shown in …
Figure 2: Overview of CLIC. This figure illustrates three iterations of the CLIC framework in Fig. 1A. The gray-shaded area represents desired action spaces, while the contour map shows the value of the EBM. Taking the second iteration as an example: the teacher provides corrections (c1) on the robot’s action sampled from its policy (p1), resulting in a new desired action space and refining the overall desired acti…
Figure 3: 2D examples of desired action spaces, shown as gray-shaded regions. (a1) Desired half-space. (a2) Polytope desired action space with different α. (a3) Polytope desired space with different ε. (b) Circular desired action space with different ε.
… into two. For one observed action pair, multiple desired half-spaces can be defined. By intersecting all these half-spaces, we can obtain the polytope desired space.…
Figure 5: Illustration of the observation model $\Pr[a \in \hat{A}(a^r, a^h) \mid a, s]$ for all $a \in A$. In each figure, the state $s$, human action $a^h$, and robot action $a^r$ are fixed, while the action $a$ varies across the action space. The black dotted line denotes the boundary of $\hat{A}(a^r, a^h)$. (a) Desired half-space; (b) Polytope desired action space; (c) Circular desired action space.
… where $r(a^r, a^h) = (1 - \varepsilon) D(a^r, a^h)$ i…
Figure 8: Tasks for the simulation experiments. Each task is tested with various feedback types, including accurate demonstrations, noisy demonstrations, and relative corrective feedback. For the TwoArm-Lift task, partial feedback is also tested by applying feedback only to one of the robots.
… enforcing this simplified inequality (see Appendix E): $\pi_\theta(a_i^- \mid s) \le \pi_\theta(a_i^+ \mid s), \; i = 1, \dots, N_I$ (18), where ($a_i^-$, a…
Figure 7: 2D example of training an EBM with PolicyShaping in Algorithm 1. The batch size is 1, and the same observed action pair is used over three steps. Initially, the EBM has the most density outside the desired action space. After the first update step, the peak density shifts toward the desired action space but still retains significant density outside it. With two additional update steps, the EBM is mostly in…
Figure 9: Hyperparameter analysis of the directional certainty parameter α for CLIC-Half. The right figure visualizes how different values of α adjust the desired action space in 3D
Figure 10: Ablation study: (1) effects of the temperature parameter T. (2) Policy-weighted Bayes loss vs uniform Bayes loss.
… actions are better than human actions in the dataset. ADP also fails as all action dimensions are corrupted, violating its assumption. CLIC-Circular fails as its circular desired action space excludes the optimal action. In contrast, CLIC-Half and CLIC-Explicit construct polytope desired acti…
Figure 11: Learned EBM landscapes across different trials. The figure compares the energy landscapes learned by CLIC, PVP, and IBC after training in a 2D action space. Each row corresponds to the resulting EBMs of each trial. In the middle part, we visualize the process of how CLIC-Circular reduces to IBC as ε increases. CLIC-Circular (with ε = 0.5) effectively trains EBM across different trials, leading to consist…
Figure 12
Figure 13: Experiment results for the Insert-T task, categorized by difficulty levels (easy, medium, and hard). Each column shows the performance metrics for a given difficulty level, along with examples of initial states for that level. “CLIC-Half (offline)” denotes results for CLIC-Half trained offline.
… over the entire action space across ten trials. These metrics are computed by averaging the results over the 10 …
Figure 14 shows the experiment results of the ball-catching task, reporting the success rate of catching the ball within one, two, and three attempts. By the end of training, the robot achieves a 1.0 success rate for catching the ball within two attempts, and its first-attempt success rate continues to improve to 0.4. One post-training policy rollout of a successful first-attempt catch is shown in …
Figure 15: Experiment results for the water-pouring task.
… scratch, and it was easier to intervene in a 6D action space. As the policy improved, relative corrections made it easier to refine the policy in specific regions of the state space. The experimental data is shown in …
Figure 16: Illustration of polytope desired action space: (a) Generating contrastive action pairs from one observed action pair. Squares denote implicit information. (b) Examples of different α for the same $h^*$. With ε = 0, when α ≥ 2β (left), $a + e h^*$ is inside the desired action space. When α < 2β (right), $a + e h^*$ is outside the desired action space. (c) Example of the intersection of the desired half-spaces.
… Furt…
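The captions for Figures 3 and 16 describe a polytope desired action space obtained by intersecting several desired half-spaces. The membership test for such an intersection can be sketched as below. How the individual half-spaces are generated from one observed action pair is not recoverable from the excerpts here, so the `normals` and `offsets` in the example are hypothetical inputs and only the intersection step is shown.

```python
import numpy as np

def in_polytope(action, normals, offsets):
    """True when `action` satisfies normal_k @ action >= offset_k for every k."""
    normals = np.atleast_2d(np.asarray(normals, dtype=float))
    offsets = np.asarray(offsets, dtype=float)
    action = np.asarray(action, dtype=float)
    return bool(np.all(normals @ action >= offsets))

# Three hypothetical half-spaces in a 2D action space; their intersection
# is a wedge opening along +x that starts at x = 0.5.
normals = np.array([[1.0, 0.0],
                    [1.0, 1.0],
                    [1.0, -1.0]])
offsets = np.array([0.5, 0.0, 0.0])

print(in_polytope([1.0, 0.0], normals, offsets))
print(in_polytope([1.0, 2.0], normals, offsets))
```

Adding a half-space can only shrink the intersection, which matches the idea of refining the desired action space as more corrections arrive.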
Original abstract

Behavior cloning (BC) optimizes policies by treating human demonstrations as pointwise action labels. While effective with accurate action labels, this formulation is brittle in practice: when human-provided actions are imperfect, treating each label as an exact target can steer the policy away from the underlying desired behavior, particularly when expressive models are used (e.g., energy-based models). As a result, we propose a human-in-the-loop alternative that replaces pointwise supervision with set-valued action targets. We introduce Contrastive policy Learning from Interactive Corrections (CLIC). CLIC leverages human corrections to construct and refine sets of desired actions, and optimizes a policy to place probability mass over these sets rather than over a single action target. This formulation naturally accommodates both absolute and relative corrections and can represent complex multi-modal behaviors. Extensive simulation and real-robot experiments show that the proposed approach leads to effective policy learning across diverse settings: CLIC remains competitive with the state of the art under accurate data while being substantially more robust under noisy, relative, and partial feedback. Our implementation is publicly available at https://clic-webpage.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper proposes Contrastive policy Learning from Interactive Corrections (CLIC) as an alternative to standard behavior cloning. Instead of optimizing policies toward pointwise action labels from demonstrations, CLIC uses human corrections to construct and refine sets of desired actions and trains the policy to place probability mass over these sets. The approach is claimed to handle absolute and relative corrections, support multi-modal behaviors, and remain competitive with state-of-the-art methods on clean data while being more robust to noisy, relative, and partial feedback, as shown in simulation and real-robot experiments. The implementation is released publicly.

Significance. If the set-construction procedure from corrections proves reliable, the method could meaningfully improve robustness in human-in-the-loop imitation learning, particularly with expressive models prone to overfitting imperfect labels. The public code release and dual simulation/real-robot validation are positive attributes that support reproducibility and practical relevance.

minor comments (3)
  1. The abstract asserts competitive performance and robustness but does not name the specific baselines, metrics, or statistical tests used; §4 or §5 should include a concise table or paragraph summarizing these to allow immediate assessment of the experimental claims.
  2. The description of how correction sets are constructed and refined (mentioned in the abstract) would benefit from an explicit algorithmic listing or pseudocode early in §3 to clarify the mapping from absolute/relative/partial feedback to set elements.
  3. Figure captions and axis labels in the experimental results should explicitly state the number of trials or seeds used for each curve to support the robustness claims.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of our work, the recognition of its potential significance for robust human-in-the-loop imitation learning, and the recommendation for minor revision. The referee's description accurately reflects the core ideas of CLIC, including the shift from pointwise to set-valued supervision and the empirical validation across simulation and real-robot settings. No major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The provided abstract and context describe a new imitation learning formulation (CLIC) that replaces pointwise action labels with sets constructed from human corrections, optimizing policy probability mass over those sets. No equations, derivations, or self-citations are exhibited that reduce the central claim or any 'prediction' to a fitted input or prior result by construction. The set-construction step and loss are presented as independent of the target outcomes, with experiments asserted to validate performance on both clean and noisy data; the derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the core idea is a change in supervision target construction rather than new postulated objects.

pith-pipeline@v0.9.0 · 5740 in / 932 out tokens · 40885 ms · 2026-05-23T03:57:11.034207+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Wavelet Policy: Imitation Learning in the Scale Domain with World Prior Memory

    cs.RO 2025-04 unverdicted novelty 6.0

    Wavelet Policy combines world prior memory from background images with wavelet-domain multi-scale action modeling via a single-encoder multiple-decoder architecture to improve long-horizon robotic imitation learning.
