Learning When to Stop: Selective Imitation Learning Under Arbitrary Dynamics Shift
Pith reviewed 2026-05-12 03:18 UTC · model grok-4.3
The pith
Selective imitation learning lets agents stop acting when dynamics shift makes expert demonstrations unreliable, using a small set of validator policies.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SeqRejectron builds a stopping rule from a small collection of validator policies whose size is independent of the horizon and the policy class. For deterministic policies under sparse costs, this delivers horizon-free sample complexity Õ(log|Π|/ε²), where Π is the policy class. For stochastic policies, it gives analogous guarantees via a cumulative Hellinger stopping time.
What carries the argument
SeqRejectron's validator policy set, a compact collection of policies used to determine when to reject an action and stop imitating.
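The role of the validator set can be illustrated with a minimal sketch for deterministic policies. The page does not spell out the exact stopping rule, so the rule used here (stop as soon as any validator disagrees with the base policy's action) is an assumption, and the names `make_selective_policy`, `rollout`, `base`, `validator`, and `step` are hypothetical, not taken from the paper.

```python
from typing import Callable, Hashable, List, Optional, Sequence, Tuple

State = Hashable
Action = Hashable
Policy = Callable[[State], Action]


def make_selective_policy(
    base: Policy, validators: Sequence[Policy]
) -> Callable[[State], Optional[Action]]:
    """Wrap `base` so that it returns None (stop) when any validator disagrees.

    Assumed rule for illustration only; the paper's actual rule may differ.
    """
    def selective(state: State) -> Optional[Action]:
        action = base(state)
        if any(v(state) != action for v in validators):
            return None  # stop imitating at this state
        return action

    return selective


def rollout(
    selective: Callable[[State], Optional[Action]],
    step: Callable[[State, Action], State],
    start: State,
    horizon: int,
) -> List[Tuple[State, Action]]:
    """Act until the selective policy stops or the horizon is reached."""
    trajectory: List[Tuple[State, Action]] = []
    state = start
    for _ in range(horizon):
        action = selective(state)
        if action is None:
            break
        trajectory.append((state, action))
        state = step(state, action)
    return trajectory


# Toy chain environment (hypothetical): states are integers, action 1 moves right.
base = lambda s: 1                       # base policy always moves right
validator = lambda s: 1 if s < 3 else 0  # validator disagrees from state 3 onward
step = lambda s, a: s + a

trajectory = rollout(make_selective_policy(base, [validator]), step, start=0, horizon=10)
print(trajectory)  # [(0, 1), (1, 1), (2, 1)]
```

In this toy run the selective policy acts on states 0 through 2 and stops at state 3, the first state where the validator and the base policy disagree. The cost of each check scales with the number of validators, not with the horizon.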
Load-bearing premise
Unlabeled state trajectories from the same expert are available in the test environment, and, for the deterministic case, costs are sparse.
What would settle it
An empirical test where the regret of the selective policy before stopping exceeds the predicted bound when dynamics shift arbitrarily, or when no test trajectories are provided.
Original abstract
Behavior cloning provides strong imitation learning guarantees when training and test environments share the same dynamics. However, in many deployment settings the test environment's transitions differ from training, and classical offline IL offers no recourse: the learner must commit to an action at every state, even when its demonstrations are uninformative and could lead to arbitrary degradation of performance. This motivates the study of selective imitation, where the learner may choose to stop when it cannot act reliably. We introduce a model for selective imitation under arbitrary dynamics shift: given labeled expert demonstrations from a training environment and unlabeled state trajectories from the same expert in a test environment, the learner outputs a selective policy that is complete (rarely stops in training) and sound (incurs low regret before stopping in test). Our algorithm, SeqRejectron, constructs a stopping rule using a small set of validator policies whose size is independent of the horizon or policy class. For deterministic policies, this yields horizon-free $\tilde{O}(\log|\Pi|/\epsilon^2)$ sample complexity, assuming sparse costs. For stochastic policies, we obtain analogous horizon-free guarantees using a cumulative Hellinger stopping time. We extend the framework to misspecified experts and different expert policies across train and test and obtain results that gracefully degrade with the amount of misspecification.
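As a rough illustration of what a cumulative Hellinger stopping time could look like for stochastic policies, the sketch below accumulates the squared Hellinger distance between two per-step action distributions and stops once the running total exceeds a threshold. The abstract does not define the quantity, so several things here are assumptions: that the distance is taken between a base policy and a single validator, that the squared distances are summed, and that a fixed `threshold` is used. The function names are hypothetical. The squared Hellinger distance follows the convention H²(p, q) = 1 − Σᵢ √(pᵢ qᵢ).

```python
import math
from typing import Optional, Sequence


def squared_hellinger(p: Sequence[float], q: Sequence[float]) -> float:
    """Squared Hellinger distance between two discrete distributions.

    Uses H^2(p, q) = 1 - sum_i sqrt(p_i * q_i), which lies in [0, 1].
    """
    bhattacharyya = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return max(0.0, 1.0 - bhattacharyya)  # clamp tiny negative rounding error


def hellinger_stopping_time(
    base_dists: Sequence[Sequence[float]],
    validator_dists: Sequence[Sequence[float]],
    threshold: float,
) -> Optional[int]:
    """Return the first step index at which the cumulative squared Hellinger
    distance exceeds `threshold`, or None if it never does.

    Assumed form for illustration; not the paper's definition.
    """
    total = 0.0
    for t, (p, q) in enumerate(zip(base_dists, validator_dists)):
        total += squared_hellinger(p, q)
        if total > threshold:
            return t
    return None


# Hypothetical two-action example: the policies agree for two steps, then diverge.
base_dists = [[0.5, 0.5], [0.5, 0.5], [1.0, 0.0], [1.0, 0.0]]
validator_dists = [[0.5, 0.5], [0.5, 0.5], [0.0, 1.0], [0.0, 1.0]]

stop_at = hellinger_stopping_time(base_dists, validator_dists, threshold=0.5)
print(stop_at)  # 2
```

Because the rule depends on the accumulated divergence, many small per-step disagreements and one large disagreement can both trigger a stop under this assumed form.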
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SeqRejectron for selective imitation learning under arbitrary dynamics shift. Given labeled expert trajectories from a training environment and unlabeled state trajectories from the same expert in a test environment, the algorithm outputs a selective policy that is complete (rarely stops on training data) and sound (low regret before stopping on test data). The core construction uses a small set of validator policies whose cardinality is independent of the horizon and policy class size. For deterministic policies this yields horizon-free sample complexity Õ(log|Π|/ε²) under a sparse-cost assumption; for stochastic policies an analogous bound is obtained via a cumulative Hellinger stopping time. The framework is extended to misspecified experts and differing train/test expert policies, with guarantees that degrade gracefully with the degree of misspecification.
Significance. If the stated bounds hold, the result is significant because it supplies the first horizon-free sample-complexity guarantees for imitation learning under arbitrary dynamics shift, achieved by a stopping rule whose computational cost does not grow with horizon. The validator-set construction (size independent of |Π| and T) is a technically clean device that avoids the usual dependence on horizon in disagreement-based or disagreement-coefficient arguments. The paper also supplies explicit, falsifiable assumptions (sparse costs, availability of unlabeled test trajectories) together with graceful-degradation results for misspecification, which strengthens the practical relevance of the theory.
minor comments (2)
- [Theorem 3.1 and Section 4] The sparse-cost assumption is stated only in the abstract and in the deterministic theorem; a single, self-contained definition (including the precise constant or support size) should appear in the main theorem statement and be referenced from the stochastic and misspecification extensions.
- [Sections 3 and 5] Notation for the validator set V and the stopping time τ is introduced in Section 3 but reused with slightly different indexing in the stochastic case (Section 5); a unified notation table or consistent subscript convention would reduce reader effort.
Simulated Author's Rebuttal
We thank the referee for the positive and insightful review, as well as the recommendation for minor revision. The summary accurately reflects the core contributions of SeqRejectron, including the horizon-free sample-complexity guarantees and the use of a small validator set whose size is independent of both the horizon and the policy class. We are pleased that the technical cleanliness of the validator construction and the graceful degradation under misspecification are highlighted as strengthening the practical relevance of the results.
Circularity Check
No significant circularity detected
The paper presents SeqRejectron as an explicit algorithmic construction of a stopping rule from a small validator policy set whose size is stated to be independent of horizon and policy class. The horizon-free sample complexity bounds are derived under explicitly listed assumptions (unlabeled test trajectories from the expert and sparse costs for the deterministic case) rather than by fitting parameters to the same data used for the final guarantee or by reducing the claimed result to a self-citation chain. No load-bearing step in the abstract or described framework reduces by construction to its own inputs, and the derivation remains self-contained once the stated assumptions are granted.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: Standard concentration inequalities suffice to obtain the stated log|Π|/ε² sample bounds.