pith. machine review for the scientific record.

arxiv: 2605.12000 · v1 · submitted 2026-05-12 · 💻 cs.LG

Recognition: no theorem link

Split the Differences, Pool the Rest: Provably Efficient Multi-Objective Imitation

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 07:12 UTC · model grok-4.3

classification 💻 cs.LG
keywords multi-objective imitation learning · Pareto-optimal policies · behavioral cloning · MOMDP · minimax optimality · imitation learning

The pith

MA-BC partitions conflicting expert data and pools the rest to recover Pareto-optimal policies faster than separate learners in multi-objective imitation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multi-objective imitation learning requires recovering policies on the Pareto front when given trajectories from several experts, each optimal for a different trade-off in a MOMDP. Naively mixing all data produces dominated policies that lose on some objectives. MA-BC detects and splits state-action pairs where experts visibly disagree, then pools the remaining consistent pairs to train one policy. This yields a strictly faster statistical convergence rate to the Pareto front than any method that treats each expert dataset in isolation. A new lower bound shows the rate is minimax optimal.

Core claim

MA-BC systematically partitions divergent expert trajectories while pooling non-conflicting state-action pairs. It provably converges to Pareto-optimal policies at a faster statistical rate than any learner that handles each expert dataset independently, and it matches the minimax lower bound established for the multi-objective imitation setting.

What carries the argument

Multi-Output Augmented Behavioral Cloning (MA-BC), which identifies observable behavior conflicts in expert data and pools only the non-conflicting state-action pairs to produce a single policy on the Pareto front.
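
A minimal sketch of this split-and-pool step, reconstructed from the description above rather than from the authors' code; the tabular setting, deterministic experts, and all names (ma_bc_policies, expert_datasets) are our assumptions:

    # Illustrative reconstruction of MA-BC's split-and-pool step; not the
    # authors' implementation. Assumes tabular states/actions and experts
    # that act deterministically at the states they visit.
    from collections import Counter, defaultdict

    def ma_bc_policies(expert_datasets):
        """expert_datasets[l] is a list of (state, action) pairs from expert l."""
        # Record which actions each expert was seen taking at each state.
        seen = [defaultdict(set) for _ in expert_datasets]
        for ell, data in enumerate(expert_datasets):
            for s, a in data:
                seen[ell][s].add(a)

        # Split: a state is divergent if two experts visibly disagree there.
        all_states = set().union(*(d.keys() for d in seen))
        divergent = {
            s for s in all_states
            if len({frozenset(d[s]) for d in seen if s in d}) > 1
        }

        # Pool: aggregate every non-conflicting pair across all experts and
        # fit one most-frequent-action (maximum likelihood) policy on them.
        pooled = defaultdict(Counter)
        for data in expert_datasets:
            for s, a in data:
                if s not in divergent:
                    pooled[s][a] += 1
        shared = {s: c.most_common(1)[0][0] for s, c in pooled.items()}

        # Each expert keeps only its own actions on the divergent states.
        policies = []
        for data in expert_datasets:
            own = defaultdict(Counter)
            for s, a in data:
                if s in divergent:
                    own[s][a] += 1
            pi = dict(shared)
            pi.update({s: c.most_common(1)[0][0] for s, c in own.items()})
            policies.append(pi)
        return policies

Isolated behavioral cloning corresponds to skipping the pooling and fitting each policy from its own dataset alone; the point of MA-BC is that the shared component is estimated from every expert's data at once.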

If this is right

  • MA-BC reaches the same Pareto-front performance with fewer total samples than training a separate imitator on each expert dataset (see the schematic rates after this list).
  • The same partitioning-plus-pooling logic extends from discrete environments to continuous LQR control without changing the core guarantees.
  • Policies trained by MA-BC avoid the dominated behavior that results from simply aggregating all expert trajectories.
  • A matching lower bound confirms that no other algorithm can improve on the sample complexity MA-BC achieves.
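
To make the sample-count comparison concrete, here is one schematic way the two rates could look. This is a hedged illustration, not the paper's theorem statement; the symbols are ours: $L$ experts with $N$ trajectories each, divergent states $\mathcal{X}_{\mathrm{div}}$, pooled states $\mathcal{X}_{\mathrm{pool}}$, and horizon factors compressed into $(1-\gamma)^{-1}$.

    % Schematic only: isolated BC sees N trajectories everywhere, while MA-BC
    % sees roughly L*N effective samples on the pooled (non-divergent) states.
    \max_{i \in [d]} \big[ J_i(\pi^\ell) - J_i(\hat{\pi}^{\ell}_{\text{iso}}) \big]
        \lesssim \frac{|\mathcal{X}|}{(1-\gamma)\,N},
    \qquad
    \max_{i \in [d]} \big[ J_i(\pi^\ell) - J_i(\hat{\pi}^{\ell}_{\text{MA-BC}}) \big]
        \lesssim \frac{|\mathcal{X}_{\mathrm{div}}|}{(1-\gamma)\,N}
        + \frac{|\mathcal{X}_{\mathrm{pool}}|}{(1-\gamma)\,L\,N}.

The gain lives entirely on the pooled states, which is consistent with the hard-MDP construction of Figures 5–6 forcing errors on rarely visited states that cannot be pooled.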

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The partitioning step could be replaced by any reliable disagreement detector, opening the method to larger or partially observed state spaces.
  • In practice the approach may reduce total training cost when multiple competing objectives must be satisfied from limited expert data.
  • Similar split-and-pool logic might apply to multi-task imitation where tasks share some but not all optimal actions.

Load-bearing premise

The demonstrations come from Pareto-optimal experts, and observable conflicts in state-action pairs can be reliably partitioned without further structure on the dynamics or rewards.

What would settle it

An experiment or counterexample in a MOMDP where MA-BC's convergence rate to the Pareto front fails to beat independent per-expert learners, for instance when conflicts cannot be cleanly separated from the given trajectories.

Figures

Figures reproduced from arXiv: 2605.12000 by Claire Vernade, Luca Viano, Volkan Cevher, Ziyad Sheebaelhamd.

Figure 1: (Left) Geometry of the MOMDP Return Space. The space of expected returns $J$ is a convex polytope [26]. The Pareto front (bold blue line) consists of boundary faces whose outward normal vectors $w$ lie entirely within the positive orthant $\mathbb{R}^d_{\ge 0}$. Boundary faces with outward normal vectors containing negative components represent Pareto-dominated policies. All vertex policies are deterministic. (Right) Naive B…

Figure 2: Sample efficiency and suboptimality gaps across all three discrete multi-objective environments.

Figure 3: Evaluation of dataset concentrability on MA-BC performance.

Figure 4: Continuous state-action imitation results for the LQR drone control task. The plots report the …

Figure 5: Lower bound construction. For the hard MDP, see Figure 6.

Figure 6: Hard MDP from [38] with $n$ states. The learner must identify the expert action $a^\star$ in the rarely visited states $s_2, \ldots, s_n$; otherwise, it falls into the red absorbing state $x_{\mathrm{end}}$. Then, for all $i \in [d]$, $J_i(\pi^\star) - J_i(\hat{\pi}_1) \ge J_i(\pi_1) - J_i(\tilde{\pi}_1) = (1-\gamma)^{-1} \sum_{x \in \mathcal{X}} \nu^{\pi^\star}(x)\,\ldots$

Figure 7: Action Consistency Graph on $\mathcal{X}_{\mathrm{div}}$. The nodes represent the conflicting state-action pairs observed at the divergent states. Edges connect actions that were observed co-occurring within the same trajectory. The graph partitions the conflicting actions into consistent, expert-specific datasets ($\mathcal{D}_{\mathrm{div},\ell}$), successfully binding the latent experts' actions on the divergent set of states.

Figure 8: Overview of the discrete multi-objective environments used for empirical evaluation.

Figure 9: The DST Return Polytope. The exact continuous Pareto frontier is bounded by deterministic optimal policies (green vertices). Aliased policies (where multiple distinct optimal policies yield the exact same expected return) are highlighted with purple circles. Imitation learning (Failure II): We now turn our attention to the imitation learning setting. We select two extreme experts: a time-optimizing expert (…

Figure 10: The Geometric Failure of Naive BC (Failure II). We collect data from two extreme experts, highlighted using the blue and red stars. Naive BC aggregates all of this data into a single joint dataset. Because the experts fundamentally disagree at bottleneck states, the resulting maximum likelihood estimate policy diffuses its probability mass destructively, causing its expected return to sag deeply into the …

Figure 11: Failure II: Naive BC vs. Closest Optimal Mixture. (Left) The policy learned by naive BC is overly stochastic at conflicting states (red cells). (Right) The closest theoretically optimal policy on the Pareto front. This policy achieves continuous returns by randomizing at only a single critical state, properly trading off between two adjacent treasures without diffusing entropy throughout the trajectory. I…

Figure 12: Failure I: The Failure of Isolated BC ($N_1 = N_2 = 5$). We depict the time-optimizing policy learned via isolated BC (left) versus MA-BC (right) using only 5 trajectories. MA-BC aggressively pools non-divergent actions from the treasure-optimizing expert to fill in unvisited states. While largely helpful, MA-BC introduces temporary bias in the low-data regime (e.g., the actions above cells 16, 19, 20, and 22…

Figure 13: MA-BC Self-Correction ($N_1 = N_2 = 50$). As the sample size increases to 50 trajectories per expert, the bias previously introduced by MA-BC (…
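
The action consistency graph of Figure 7 is concrete enough to sketch: nodes are conflicting (state, action) pairs, edges link pairs observed in the same trajectory, and connected components recover the latent experts' divergent datasets. A minimal union-find sketch under our own naming, assuming trajectories are lists of (state, action) pairs and the divergent states are already known:

    # Nodes: conflicting (state, action) pairs; edges: co-occurrence within
    # one trajectory; components: expert-specific divergent datasets D_div,l.
    # Illustrative only; the function and variable names are ours.
    from collections import defaultdict

    def cluster_divergent_pairs(trajectories, divergent_states):
        parent = {}

        def find(u):
            while parent[u] != u:
                parent[u] = parent[parent[u]]  # path halving
                u = parent[u]
            return u

        def union(u, v):
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv

        # Link all conflicting pairs that co-occur in the same trajectory.
        for traj in trajectories:
            nodes = [(s, a) for s, a in traj if s in divergent_states]
            for node in nodes:
                parent.setdefault(node, node)
            for u, v in zip(nodes, nodes[1:]):
                union(u, v)

        # Group nodes by their component root.
        clusters = defaultdict(set)
        for node in parent:
            clusters[find(node)].add(node)
        return list(clusters.values())

The paper's appendix notes that the general clustering problem is NP-hard and restricts to pair-wise distinct expert actions; this sketch inherits that restriction.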
Original abstract

This work investigates multi-objective imitation learning: the problem of recovering policies that lie on the Pareto front given demonstrations from multiple Pareto-optimal experts in a Multi-Objective Markov Decision Process (MOMDP). Standard imitation approaches are ill-equipped for this regime, as naively aggregating conflicting expert trajectories can result in dominated policies. To address this, we introduce Multi-Output Augmented Behavioral Cloning (MA-BC), an algorithm that systematically partitions divergent expert data while pooling state-action pairs where no behavior conflict is observed. Theoretically, we prove that MA-BC converges to Pareto-optimal policies at a faster statistical rate than any learner that considers each expert dataset independently. Furthermore, we establish a novel lower bound for multi-objective imitation learning, demonstrating that MA-BC is minimax optimal. Finally, we empirically validate our algorithm across diverse discrete environments and, guided by our theoretical insights, extend and evaluate MA-BC on a continuous Linear Quadratic Regulator (LQR) control task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces Multi-Output Augmented Behavioral Cloning (MA-BC) for multi-objective imitation learning in MOMDPs. Given demonstrations from multiple Pareto-optimal experts, the algorithm partitions state-action pairs exhibiting behavioral conflicts while pooling non-conflicting pairs. It claims to prove that MA-BC recovers Pareto-optimal policies at a faster statistical rate than any method treating expert datasets independently, establishes a matching lower bound showing minimax optimality, and validates the approach empirically on discrete environments plus a continuous LQR task.

Significance. If the central claims hold, the work is significant for providing the first algorithm with provably faster rates and minimax optimality for multi-objective imitation learning, directly addressing the problem of dominated policies from naive aggregation of conflicting experts. The combination of a novel lower-bound construction, explicit rate improvement over independent learners, and extension to continuous control constitutes a substantive contribution to imitation learning and multi-objective RL.

major comments (1)
  1. [Theoretical analysis (main convergence theorem and lower-bound construction)] The faster rate and minimax optimality rest on the claim that observable conflicts can be reliably partitioned from samples while safely pooling the remainder. In a general MOMDP, conflicts between Pareto-optimal experts at the same state can depend on the unknown vector reward and transition kernel in ways not recoverable from empirical frequencies alone; the manuscript provides no explicit high-probability guarantee or additional structural assumption ensuring the empirical partition matches the true one. This assumption is load-bearing for the central convergence claim.
minor comments (2)
  1. [Abstract] The abstract refers to 'diverse discrete environments' without naming them; listing the specific domains (e.g., grid-world variants) would improve reproducibility and context.
  2. [Introduction / Preliminaries] Notation for the vector-valued reward and the Pareto front could be introduced earlier with a small illustrative example to aid readers unfamiliar with MOMDPs.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their positive assessment of the significance of our contribution and for the constructive comment on the theoretical analysis. We address the major comment below.

read point-by-point responses
  1. Referee: [Theoretical analysis (main convergence theorem and lower-bound construction)] The faster rate and minimax optimality rest on the claim that observable conflicts can be reliably partitioned from samples while safely pooling the remainder. In a general MOMDP, conflicts between Pareto-optimal experts at the same state can depend on the unknown vector reward and transition kernel in ways not recoverable from empirical frequencies alone; the manuscript provides no explicit high-probability guarantee or additional structural assumption ensuring the empirical partition matches the true one. This assumption is load-bearing for the central convergence claim.

    Authors: We thank the referee for identifying this important point. The current manuscript defines the partition based on observed empirical conflicts in the collected demonstrations and proceeds under the assumption that this empirical partition correctly identifies true behavioral conflicts. While standard concentration inequalities (e.g., Hoeffding bounds on empirical frequencies) imply that the partition is correct with high probability for sufficiently large sample sizes, we agree that the manuscript does not make this guarantee explicit in the statement or proof of the main convergence theorem. In the revised version, we will add a supporting lemma that provides an explicit high-probability bound (at least 1-δ) on the event that the empirical partition matches the true conflict set, using standard concentration arguments on the state-action visitation frequencies. This lemma will be integrated into the proof of the main theorem to ensure the faster statistical rate holds with high probability. No additional structural assumptions on the MOMDP are required beyond those already present in the paper. revision: yes
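
One schematic form such a lemma could take (our illustration of the concentration argument, not the authors' statement; here $n_\ell(s)$ counts expert $\ell$'s visits to state $s$ and $\varepsilon$ is half the smallest frequency gap separating conflicting from consistent actions):

    % Hoeffding on each empirical action frequency, then a union bound over
    % the L experts and |X| states, controls the partition-error event:
    \Pr\big[\widehat{\mathcal{X}}_{\mathrm{div}} \neq \mathcal{X}_{\mathrm{div}}\big]
        \le \sum_{\ell=1}^{L} \sum_{s \in \mathcal{X}}
            2\exp\!\big(-2\, n_\ell(s)\, \varepsilon^2\big) \le \delta
    \quad \text{whenever} \quad
    \min_{\ell,\, s}\, n_\ell(s) \ge \frac{1}{2\varepsilon^2}
        \log\frac{2L|\mathcal{X}|}{\delta}.

Plugged into the main theorem, the claimed rate would then hold on the intersection of this event with the standard behavioral-cloning concentration events.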

Circularity Check

0 steps flagged

No circularity: independent statistical rates and novel lower bound

full rationale

The paper defines the MA-BC algorithm explicitly as partitioning observed conflicts and pooling non-conflicting pairs, then derives convergence rates faster than independent learners plus a separate minimax lower bound. No quoted step reduces the claimed rates or optimality to a fitted parameter renamed as prediction, a self-definitional loop, or a load-bearing self-citation chain. The lower-bound construction is presented as novel and external to the algorithm definition. This is the normal case of a self-contained derivation against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard MDP assumptions and the existence of Pareto-optimal expert demonstrations; no free parameters or new physical entities are introduced.

axioms (2)
  • domain assumption Experts provide demonstrations from Pareto-optimal policies in a Multi-Objective Markov Decision Process
    Stated in the problem setup and used to define the target Pareto front.
  • standard math Standard concentration inequalities apply to the statistical rates of behavioral cloning
    Invoked to establish faster convergence than independent learners.

pith-pipeline@v0.9.0 · 5474 in / 1305 out tokens · 111081 ms · 2026-05-13T07:12:17.881078+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

66 extracted references · 1 internal anchor

[1] P. Abbeel, D. Dolgov, A. Y. Ng, and S. Thrun. Apprenticeship learning for motion planning with application to parking lot navigation. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2008.
[2] P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In International Conference on Machine Learning (ICML), 2004.
[3] Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In The Twelfth International Conference on Learning Representations, 2024.
[4] Shibbir Ahmed, Baijing Qiu, Chun-Wei Kong, Huang Xin, Fiaz Ahmad, and Jinlong Lin. A data-driven dynamic obstacle avoidance method for liquid-carrying plant protection UAVs. Agronomy, 12(4), 2022.
[5] Leon Barrett and Srini Narayanan. Learning all optimal policies with multiple criteria. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, pages 41–47, New York, NY, USA.
[6] Association for Computing Machinery.
[7] Abdeslam Boularias, Jens Kober, and Jan Peters. Relative entropy inverse reinforcement learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 182–189. JMLR Workshop and Conference Proceedings, 2011.
[8] Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion, 2024.
[9] Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Bingxiang He, Wei Zhu, Yuan Ni, Guotong Xie, Ruobing Xie, Yankai Lin, et al. UltraFeedback: Boosting language models with scaled AI feedback. arXiv preprint arXiv:2310.01377, 2023.
[10] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.
[11] Dylan J. Foster, Adam Block, and Dipendra Misra. Is behavior cloning all you need? Understanding horizon in imitation learning. arXiv preprint arXiv:2407.15007, 2024.
[12] Tim Franzmeyer, Edith Elkind, Philip Torr, Jakob Foerster, and Joao Henriques. Select to perfect: Imitating desired behavior from large multi-agent data. arXiv preprint arXiv:2405.03735, 2024.
[13] Till Freihaut, Luca Viano, Volkan Cevher, Matthieu Geist, and Giorgia Ramponi. Learning equilibria from data: Provably efficient multi-agent imitation learning, 2025.
[14] Till Freihaut, Luca Viano, Emanuele Nevali, Volkan Cevher, Matthieu Geist, and Giorgia Ramponi. Rate optimal learning of equilibria from data. arXiv preprint arXiv:2510.09325, 2025.
[15] Nagata Furukawa. Characterization of optimal policies in vector-valued Markovian decision processes. Mathematics of Operations Research, 5(2):271–279, 1980.
[16] J. Ho and S. Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems (NeurIPS), 2016.
[17] J. Ho, J. K. Gupta, and S. Ermon. Model-free imitation learning with policy optimization. In International Conference on Machine Learning (ICML), 2016.
[18] Ali Jadbabaie, Devavrat Shah, and Sean R. Sinclair. Multi-objective LQR with linear scalarization. arXiv preprint arXiv:2408.04488, 2024.
[19] Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. DROID: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024.
[20] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P. Foster, Pannag R. Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. OpenVLA: An open-source vision-language-action model. In 8th Annual Conference on Robot Learning, 2024.
[21] Woo Kyung Kim, Minjong Yoo, and Honguk Woo. Pareto inverse reinforcement learning for diverse expert policy generation. arXiv preprint arXiv:2408.12110, 2024.
[22] Tomasz Korbak, Kejian Shi, Angelica Chen, Rasika Vinayak Bhalerao, Christopher Buckley, Jason Phang, Samuel R. Bowman, and Ethan Perez. Pretraining language models with human preferences. In International Conference on Machine Learning, pages 17506–17533. PMLR, 2023.
[23] Hoang M. Le, Yisong Yue, Peter Carr, and Patrick Lucey. Coordinated multi-agent imitation learning. In International Conference on Machine Learning, pages 1995–2003. PMLR, 2017.
[24] Seungjae Lee, Yibin Wang, Haritheja Etukuru, H. Jin Kim, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. Behavior generation with latent actions. In Multi-modal Foundation Model meets Embodied AI Workshop @ ICML 2024, 2024.
[25] Shiqi Lei, Kanghoon Lee, Linjing Li, and Jinkyoo Park. Learning strategy representation for imitation learning in multi-agent games. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 18163–18171, 2025.
[26] Yining Li, Peizhong Ju, and Ness B. Shroff. How to find the exact Pareto front for multi-objective MDPs? arXiv preprint arXiv:2410.15557, 2024.
[27] Haoye Lu, Daniel Herman, and Yaoliang Yu. Multi-objective reinforcement learning: Convexity, stationarity and Pareto optimality. In The Eleventh International Conference on Learning Representations, 2023.
[28] H. Molter. Classic Graph Problems Made Temporal – A Parameterized Complexity Analysis. Foundations of Computing. Universitätsverlag der TU Berlin, 2020.
[29] Antoine Moulin, Gergely Neu, and Luca Viano. Inverse Q-learning done right: Offline imitation learning in $Q^\pi$-realizable MDPs, 2025.
[30] Antoine Moulin, Gergely Neu, and Luca Viano. Optimistically optimistic exploration for provably efficient infinite-horizon reinforcement and imitation learning. arXiv preprint arXiv:2502.13900, 2025.
[31] A. Y. Ng and S. J. Russell. Algorithms for inverse reinforcement learning. In International Conference on Machine Learning (ICML), 2000.
[32] T. Osa, J. Pajarinen, G. Neumann, J. A. Bagnell, P. Abbeel, and J. Peters. An algorithmic perspective on imitation learning. Foundations and Trends in Robotics, 2018.
[33] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, 2022.
[34] Abby O'Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open X-Embodiment: Robotic learning datasets and RT-X models. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024.
[35] D. A. Pomerleau. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3(1):88–97, 1991.
[36] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., USA, 1st edition, 1994.
[37] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018.
[38] Nived Rajaraman, Yanjun Han, Lin Yang, Jingbo Liu, Jiantao Jiao, and Kannan Ramchandran. On the value of interaction and function approximation in imitation learning. Advances in Neural Information Processing Systems, 34:1325–1336, 2021.
[39] Nived Rajaraman, Lin Yang, Jiantao Jiao, and Kannan Ramchandran. Toward the fundamental limits of imitation learning. Advances in Neural Information Processing Systems, 33:2914–2924, 2020.
[40] Diederik M. Roijers, Peter Vamplew, Shimon Whiteson, and Richard Dazeley. A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research, 48:67–113, 2013.
[41] Diederik Marijn Roijers. Multi-Objective Decision-Theoretic Planning. PhD thesis, University of Amsterdam, 2016.
[42] S. Ross, G. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.
[43] Stéphane Ross and Drew Bagnell. Efficient reductions for imitation learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.
[44] Stuart Russell. Learning agents for uncertain environments (extended abstract). In Annual Conference on Computational Learning Theory (COLT), 1998.
[45] Maarten Sap, Saadia Gabriel, Lianhui Qin, Dan Jurafsky, Noah A. Smith, and Yejin Choi. Social bias frames: Reasoning about social and power implications of language. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5477–5490, Online, July 2020. Association for Computational Linguistics.
[46] Lior Shani, Tom Zahavy, and Shie Mannor. Online apprenticeship learning. arXiv:2102.06924, 2021.
[47] Lloyd S. Shapley. Stochastic games. Proceedings of the National Academy of Sciences, 39(10):1095–1100, 1953.
[48] Ziyad Sheebaelhamd, Michael Tschannen, Michael Muehlebach, and Claire Vernade. Quantization-free autoregressive action transformer. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026.
[49] Jiaming Song, Hongyu Ren, Dorsa Sadigh, and Stefano Ermon. Multi-agent generative adversarial imitation learning. Advances in Neural Information Processing Systems, 31, 2018.
[50] Gokul Swamy, Sanjiban Choudhury, J. Andrew Bagnell, and Steven Wu. Of moments and matching: A game-theoretic framework for closing the imitation gap. In International Conference on Machine Learning, pages 10022–10032. PMLR, 2021.
[51] U. Syed, M. Bowling, and R. E. Schapire. Apprenticeship learning using linear programming. In International Conference on Machine Learning (ICML), 2008.
[52] U. Syed and R. E. Schapire. A game-theoretic approach to apprenticeship learning. In Advances in Neural Information Processing Systems (NeurIPS), 2007.
[53] Jingwu Tang, Gokul Swamy, Fei Fang, and Zhiwei Steven Wu. Multi-agent imitation learning: Value is easy, regret is hard. In Advances in Neural Information Processing Systems, volume 37, pages 27790–27816. Curran Associates, Inc., 2024.
[54] Luca Viano, Till Freihaut, Emanuele Nevali, Volkan Cevher, Matthieu Geist, and Giorgia Ramponi. Multi-agent imitation learning with function approximation: Linear Markov games and beyond. arXiv preprint arXiv:2602.22810, 2026.
[55] Luca Viano, Angeliki Kamoutsi, Gergely Neu, Igor Krawczuk, and Volkan Cevher. Proximal point imitation learning. Advances in Neural Information Processing Systems, 35:24309–24326, 2022.
[56] Luca Viano, Stratis Skoulakis, and Volkan Cevher. Imitation learning in discounted linear MDPs without exploration assumptions. In Forty-first International Conference on Machine Learning, 2024.
[57] Homer Rich Walke, Kevin Black, Tony Z. Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. BridgeData V2: A dataset for robot learning at scale. In Conference on Robot Learning, pages 1723–1736. PMLR, 2023.
[58] Joe Watson, Sandy H. Huang, and Nicolas Heess. Coherent soft imitation learning, 2023.
[59] D. J. White. Multi-objective infinite-horizon discounted Markov decision processes. Journal of Mathematical Analysis and Applications, 89(2):639–647, 1982.
[60] Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguistics, 2018.
[61] Markus Wulfmeier, Michael Bloesch, Nino Vieillard, Arun Ahuja, Jörg Bornschein, Sandy Huang, Artem Sokolov, Matt Barnes, Guillaume Desjardins, Alex Bewley, et al. Imitating language via scalable inverse reinforcement learning. Advances in Neural Information Processing Systems, 37:90714–90735, 2024.
[62] Tian Xu, Ziniu Li, Yang Yu, and Zhi-Quan Luo. Provably efficient adversarial imitation learning with unknown transitions. arXiv preprint arXiv:2306.06563, 2023.
[63] Runzhe Yang, Xingyuan Sun, and Karthik Narasimhan. A generalized algorithm for multi-objective reinforcement learning and policy adaptation. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
[64] Lantao Yu, Jiaming Song, and Stefano Ermon. Multi-agent adversarial inverse reinforcement learning. In International Conference on Machine Learning, pages 7194–7201. PMLR, 2019.
[65] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, 2023.
[66] B. D. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey. Maximum entropy inverse reinforcement learning. In National Conference on Artificial Intelligence (AAAI), 2008.
