State-Conditional Adversarial Learning: An Off-Policy Visual Domain Transfer Method for End-to-End Imitation Learning

Shengfan Cao; Yuxiang Liu

arxiv: 2512.05335 · v3 · pith:KMQOMTODnew · submitted 2025-12-05 · 💻 cs.RO


Pith reviewed 2026-05-21 18:17 UTC · model grok-4.3

classification 💻 cs.RO

keywords visual domain transfer · imitation learning · adversarial learning · domain adaptation · off-policy learning · end-to-end control · autonomous driving


The pith

The target-domain imitation loss is upper bounded by source loss plus state-conditional latent KL divergence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves that the imitation loss in a target visual domain can be upper bounded by the source domain loss combined with a state-conditional latent KL divergence between the observation models. This theoretical result motivates an adversarial training method that aligns the latent representations conditioned on the current system state. The approach works in challenging settings where target data is off-policy, lacks expert demonstrations, and is limited in quantity. Sympathetic readers would care because it offers a principled way to transfer end-to-end policies across visual domains with minimal target data, as shown in driving simulations.

Core claim

The target-domain imitation loss can be upper bounded by the source-domain loss plus a state-conditional latent KL divergence between source and target observation models. Guided by this bound, State-Conditional Adversarial Learning aligns the latent distributions using a discriminator-based estimator of the conditional KL term to enable effective off-policy transfer.
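
In symbols, the claim can be sketched as follows. This is a schematic restatement rather than the paper's exact statement: J_t and J_s denote the target- and source-domain imitation losses of a policy π_θ, x is the system state, z is the latent, and p_s(z | x), p_t(z | x) are the latent distributions induced by the source and target observation models. The weight λ, the order of the KL arguments, and the state distribution used in the expectation are assumptions made here for illustration.

```latex
J_t(\theta) \;\le\; J_s(\theta)
  + \lambda \, \mathbb{E}_{x \sim p_s(x \mid \pi_\theta)}
    \left[ D_{\mathrm{KL}}\!\left( p_s(z \mid x) \,\|\, p_t(z \mid x) \right) \right]
```

Under this reading, the conditional KL is the only term on the right-hand side that involves the target domain, which is why the method concentrates on shrinking it.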

What carries the argument

State-Conditional Adversarial Learning, which uses a discriminator to estimate and minimize the state-conditional KL divergence for aligning source and target latent observations.
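
A minimal sketch of how a discriminator can estimate a state-conditional KL term, under a toy setup that is not taken from the paper. The assumptions are: a one-dimensional state `x` with the same distribution in both domains, Gaussian latents whose mean is shifted by a hypothetical `delta` in the target domain, and a logistic-regression discriminator on (latent, state) in place of a neural network. With balanced classes, the optimal discriminator logit equals the log density ratio, so averaging it over source samples gives a KL estimate that can be compared with the closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
n, delta = 20_000, 1.0  # samples per domain; hypothetical latent shift

# Toy data (assumption): z | x ~ N(x, 1) in the source domain and
# z | x ~ N(x + delta, 1) in the target domain, with identical state marginals.
x_s = rng.uniform(-1.0, 1.0, n)
x_t = rng.uniform(-1.0, 1.0, n)
z_s = x_s + rng.normal(size=n)
z_t = x_t + delta + rng.normal(size=n)

# Discriminator input is (latent, state, bias); label 1 = source, 0 = target.
feats = np.column_stack([
    np.concatenate([z_s, z_t]),
    np.concatenate([x_s, x_t]),
    np.ones(2 * n),
])
labels = np.concatenate([np.ones(n), np.zeros(n)])

# Logistic-regression discriminator trained by full-batch gradient descent
# on the binary cross-entropy loss.
w = np.zeros(3)
learning_rate = 1.0
for _ in range(3000):
    p = 1.0 / (1.0 + np.exp(-(feats @ w)))
    w -= learning_rate * feats.T @ (p - labels) / len(labels)

# The learned logit approximates log p_s(z | x) - log p_t(z | x), so its
# mean over source samples estimates the state-conditional KL.
source_feats = np.column_stack([z_s, x_s, np.ones(n)])
kl_estimate = float(np.mean(source_feats @ w))
kl_true = delta**2 / 2  # closed form for two unit-variance Gaussians
print(f"estimated KL = {kl_estimate:.3f}, closed form = {kl_true:.3f}")
```

The toy works because the true log ratio is linear in (z, x) and the state marginals match. With scarce off-policy target data neither condition is guaranteed, which is the situation the referee comments below focus on.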

If this is right

The method permits imitation learning transfer without expert data or on-policy samples in the target domain.
It supports robust policy transfer and strong sample efficiency in visually diverse settings.
Experiments in autonomous driving environments confirm effective cross-domain performance.
The bound provides a concrete objective that the adversarial alignment directly minimizes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same bounding approach could extend to domain adaptation in other sequential control tasks.
Tighter analysis might incorporate additional terms for state dynamics mismatch.
Real-vehicle tests would check whether simulator results hold under physical sensor noise.

Load-bearing premise

That a discriminator-based estimator can reliably approximate the state-conditional KL divergence, and that aligning latent distributions conditioned on state is sufficient to control the bound for policy transfer.

What would settle it

A case where the discriminator reports low conditional KL but the measured target imitation loss stays high, which would show that the KL estimate is unreliable or that the bound is too loose to guide transfer in practice.

Figures

Figures reproduced from arXiv: 2512.05335 by Shengfan Cao, Yuxiang Liu.

**Figure 1.** PCA visualization of the latent space with (left) and without (right) SCAL. The latent vectors are sampled from exactly the same path-tracking trajectory.

**Figure 2.** Two example domains in our experiments with the same track shape but drastically different visual characteristics.

**Figure 4.** SCAL compared with a perfect baseline under different Bs distributions. x-axis: target-domain buffer size. y-axis: maximum trajectory length achieved in the target domain. SCAL trained with Bt distribution 1 (yellow), Bt distribution 2 (blue), and Bt distribution 3 (purple); perfect baseline (black). The shaded area represents variance.

**Figure 6.** Demonstration of low-speed-to-high-speed

Original abstract

We study visual domain transfer for end-to-end imitation learning in a realistic and challenging setting where target-domain data are strictly off-policy, expert-free, and scarce. We first provide a theoretical analysis showing that the target-domain imitation loss can be upper bounded by the source-domain loss plus a state-conditional latent KL divergence between source and target observation models. Guided by this result, we propose State-Conditional Adversarial Learning (SCAL), an off-policy adversarial framework that aligns latent distributions conditioned on system state using a discriminator-based estimator of the conditional KL term. Experiments on visually diverse autonomous driving environments built on the BARC-CARLA simulator demonstrate that SCAL achieves robust transfer and strong sample efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper derives a state-conditional KL bound to motivate an adversarial method for off-policy visual transfer in imitation learning, and the CARLA experiments show practical gains, but the discriminator estimator's reliability with scarce target data is a real soft spot.


The core contribution is a bound showing that target imitation loss is upper-bounded by source loss plus a state-conditional latent KL between observation models, then minimized via a discriminator that conditions on both latent and state. This gives a clear reason to align conditionally rather than globally, which fits the off-policy, expert-free, scarce-data setting they target for visual imitation learning in robotics. The experiments on visually varied BARC-CARLA driving environments report better transfer robustness and sample efficiency than baselines, which is the kind of result that matters for simulation-to-real work. The theoretical step is direct and the motivation tracks the problem they set up. The method is a specific extension of adversarial domain adaptation rather than a wholesale new framework, but the conditioning and the off-policy focus are the parts that feel fresh.

The stress-test concern about the estimator holds up on the details given. With limited target trajectories the conditional discriminator sees incomplete state coverage, especially in continuous spaces, so the KL approximation can be noisy. Joint end-to-end training of the encoder further loosens the fixed-observation-model assumption the bound starts from. The paper would be stronger with diagnostics showing how tightly the bound is actually controlled or ablations isolating the conditional term.

This is aimed at researchers doing visual domain adaptation for end-to-end policies in embodied settings like autonomous driving. A reader who wants a theoretically guided adversarial baseline with simulator results will find usable material here. The work shows honest engagement with the imitation and domain-adaptation literature and has enough substance to go to a serious referee, though the estimator reliability will likely draw revision requests.

Referee Report

2 major / 2 minor

Summary. The paper derives a theoretical upper bound showing that the target-domain imitation loss is at most the source-domain loss plus a state-conditional latent KL divergence between source and target observation models. Guided by this bound, it introduces State-Conditional Adversarial Learning (SCAL), an off-policy adversarial method that uses a discriminator taking latent features and state as input to estimate and minimize the conditional KL term. Experiments in visually diverse autonomous driving scenarios on the BARC-CARLA simulator report improved transfer performance and sample efficiency under scarce, expert-free, off-policy target data.

Significance. If the bound derivation is rigorous and the discriminator provides a sufficiently accurate estimator of the state-conditional KL despite limited target coverage, the work would supply a principled mechanism for visual domain transfer in end-to-end imitation learning. This addresses a practically important robotics setting where source and target visual domains differ and target expert data are unavailable, potentially improving robustness without requiring on-policy target collection.

major comments (2)

[Theoretical analysis] Theoretical analysis section (bound derivation): the upper bound is stated to hold for fixed source and target observation models, yet the method performs joint end-to-end optimization of the latent encoder together with the policy. It is unclear whether the bound remains valid once the latent representations are no longer treated as given, which is load-bearing for the claim that minimizing the estimated KL controls target imitation loss.
[Method and Experiments] Method and experimental sections (discriminator estimator): with scarce off-policy target data the state-conditional discriminator receives limited state coverage, raising the risk that the density-ratio or KL estimate is inaccurate or biased. The manuscript should supply concrete evidence (e.g., ablation on estimator quality, state-coverage diagnostics, or comparison against oracle KL) to show the approximation is reliable enough to keep the bound controlled in practice.

minor comments (2)

[Method] Notation for the discriminator input (latent vector concatenated with state) and the precise definition of the conditional KL term could be stated more explicitly to avoid ambiguity when readers reconstruct the estimator.
[Experiments] Figure captions describing the BARC-CARLA environments would benefit from explicit mention of the visual domain shifts (lighting, texture, camera parameters) to help readers assess the transfer difficulty.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments regarding the validity of the theoretical bound under joint optimization and the reliability of the state-conditional discriminator estimator. We address each major comment below and outline revisions to clarify and strengthen the presentation.


Referee: [Theoretical analysis] Theoretical analysis section (bound derivation): the upper bound is stated to hold for fixed source and target observation models, yet the method performs joint end-to-end optimization of the latent encoder together with the policy. It is unclear whether the bound remains valid once the latent representations are no longer treated as given, which is load-bearing for the claim that minimizing the estimated KL controls target imitation loss.

Authors: We thank the referee for this precise observation. The bound is formally derived for fixed observation models that induce the latent distributions. During joint optimization the encoder parameters evolve, so the bound applies instantaneously to the current latent representations at each training step. Minimizing the estimated state-conditional KL therefore continues to act on the right-hand side of the bound for the latents present at that iteration. We will revise the theoretical analysis section to explicitly discuss this dynamic interpretation and to state that the bound supplies a principled motivation whose practical utility is corroborated by the reported experiments. revision: partial
Referee: [Method and Experiments] Method and experimental sections (discriminator estimator): with scarce off-policy target data the state-conditional discriminator receives limited state coverage, raising the risk that the density-ratio or KL estimate is inaccurate or biased. The manuscript should supply concrete evidence (e.g., ablation on estimator quality, state-coverage diagnostics, or comparison against oracle KL) to show the approximation is reliable enough to keep the bound controlled in practice.

Authors: We agree that limited state coverage under scarce off-policy target data could in principle bias the conditional KL estimate. The current experiments already show consistent gains in transfer performance and sample efficiency, indicating that the estimator remains useful in the evaluated regimes. To supply the requested concrete evidence we will add (i) an ablation comparing the learned discriminator estimate against an oracle KL computed in simulation and (ii) state-coverage diagnostics (e.g., histograms of visited states in source versus target). These results will be included in the revised experimental section. revision: yes
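
The state-coverage diagnostic mentioned in the response could look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' procedure: the function name `state_coverage`, the one-dimensional state, the bin count, and the synthetic buffers are all hypothetical. It histograms source and target states over shared bins and reports the fraction of source-occupied bins in which the target buffer has at least one sample.

```python
import numpy as np

def state_coverage(source_states, target_states, bins=10, lo=-1.0, hi=1.0):
    """Return the fraction of source-occupied bins that also hold target samples."""
    edges = np.linspace(lo, hi, bins + 1)
    src_hist, _ = np.histogram(source_states, bins=edges)
    tgt_hist, _ = np.histogram(target_states, bins=edges)
    occupied = src_hist > 0                 # bins the source data visits
    covered = occupied & (tgt_hist > 0)     # of those, bins the target also visits
    return covered.sum() / occupied.sum(), src_hist, tgt_hist

rng = np.random.default_rng(0)
# Hypothetical 1-D state (for example a lateral offset), sampled broadly in the source.
source = rng.uniform(-1.0, 1.0, 5000)
# Scarce off-policy target buffer concentrated near zero.
target = rng.uniform(-0.2, 0.2, 50)

ratio, src_hist, tgt_hist = state_coverage(source, target)
print(f"target covers {ratio:.0%} of source-occupied state bins")
```

In this toy setup the target buffer only touches the bins around zero, so a conditional discriminator would have no target samples to learn from in the remaining state regions. A low ratio of this kind is the sort of warning sign the referee asks the authors to report.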

Circularity Check

0 steps flagged

Theoretical upper bound on target imitation loss is derived independently via standard divergence inequalities

full rationale

The paper states it provides a theoretical analysis deriving that target-domain imitation loss is upper-bounded by source-domain loss plus state-conditional latent KL between observation models. This is a standard application of change-of-measure or divergence bounding arguments to the imitation objective and does not reduce to any fitted parameter, discriminator output, or self-referential definition by construction. The subsequent SCAL method uses a discriminator to estimate and minimize the KL term as an algorithmic implementation, but the bound itself treats the observation models as given and remains a mathematical inequality independent of how the KL is approximated. No self-citations, ansatzes smuggled via prior work, or renaming of known results are indicated as load-bearing. The derivation chain is self-contained against external benchmarks such as existing domain-adaptation bounds.
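
For orientation, a generic change-of-measure step of the kind referred to above is sketched here. It is a textbook inequality (Pinsker's inequality applied to a bounded function), offered as background and not as the paper's proof; the symbols f, p, and q are placeholders introduced for this illustration, with D_TV(p, q) = sup_A |p(A) - q(A)|.

```latex
\left| \mathbb{E}_{p}[f] - \mathbb{E}_{q}[f] \right|
  \;\le\; 2 \, \|f\|_{\infty} \, D_{\mathrm{TV}}(p, q)
  \;\le\; \|f\|_{\infty} \sqrt{2 \, D_{\mathrm{KL}}(p \,\|\, q)}
```

Taking f to be a bounded per-step loss and p, q to be the source and target latent distributions at a fixed state gives a gap controlled by a conditional KL term, which is the general shape of argument the audit describes.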

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only; no explicit free parameters, axioms, or invented entities are detailed. The approach implicitly assumes that latent representations exist and that state information is available for conditioning.

pith-pipeline@v0.9.0 · 5645 in / 998 out tokens · 32738 ms · 2026-05-21T18:17:34.331244+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: target-domain imitation loss can be upper bounded by the source-domain loss plus a state-conditional latent KL divergence between source and target observation models

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

Paper passage: aligns latent distributions conditioned on system state using a discriminator-based estimator of the conditional KL term

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 10 internal anchors

[1] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. In Proceedings of the 34th International Conference on Machine Learning, 2017. URL https://arxiv.org/abs/1701.07875

[2] Sanjeev Arora, Yi Zhang, et al. Games of GAN: Game-theoretical models for generative adversarial networks. arXiv preprint arXiv:1802.05952, 2018

[3] Alex Bewley, Alexander Zempleni, Valerio Ortenzi, and Ingmar Posner. Learning to drive from simulation without real world labels. arXiv preprint arXiv:1812.03823, 2018

[4] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D. Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End to end learning for self-driving cars,

[5] URL https://arxiv.org/abs/1604.07316. arXiv:1604.07316

[6] Yaroslav Ganin and Victor Lempitsky. Domain-adversarial training of neural networks. In Journal of Machine Learning Research, volume 17, pages 1–35, 2016

[7] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014

[8] Nicklas Hansen, Rishabh Jangir, Yu Sun, Guillem Alenyà, Pieter Abbeel, Alexei A Efros, Lerrel Pinto, and Xiaolong Wang. Self-supervised policy adaptation during deployment. In Proceedings of the 9th International Conference on Learning Representations (ICLR), 2021. URL https://openreview.net/forum?id=o_V-MjyyGV_

[9] Irina Higgins, Arka Pal, Andrei A. Rusu, et al. DARLA: Improving zero-shot transfer in reinforcement learning. arXiv preprint arXiv:1707.08475, 2017

[10] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. Advances in Neural Information Processing Systems, 29, 2016

[11] Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In Proceedings of the Nineteenth International Conference on Machine Learning, pages 267–274, 2002

[12] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(1):1334–1373, 2016

[13] Bonnie Li, Vincent François-Lavet, Thang Doan, and Joelle Pineau. Domain adversarial reinforcement learning. arXiv preprint arXiv:2102.07097,

[14] URL https://arxiv.org/abs/2102.07097

[15] Alexander Liniger, Alexander Domahidi, and Manfred Morari. Optimization-based autonomous racing of 1:43 scale RC cars. Optimal Control Applications and Methods, 36(5):628–647, July 2014. ISSN 1099-1514. doi: 10.1002/oca.2123. URL http://dx.doi.org/10.1002/oca.2123

[16] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I. Jordan. Conditional adversarial domain adaptation. Advances in Neural Information Processing Systems, 31, 2018

[17] Bhairav Mehta, Manfred Diaz, Florian Golemo, Christopher J. Pal, and Liam Paull. Active domain randomization. In Leslie Pack Kaelbling, Danica Kragic, and Komei Sugiura, editors, Proceedings of the Conference on Robot Learning, volume 100 of Proceedings of Machine Learning Research, pages 1162–1176. PMLR, Oct 30–Nov 1 2020

[18] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. In arXiv preprint arXiv:1411.1784, 2014

[19] Takeru Miyato and Masanori Koyama. Conditional GANs with projection discriminator. arXiv preprint arXiv:1802.05637, 2018

[20] Christian Pfeiffer, Simon Wengeler, Antonio Loquercio, and Davide Scaramuzza. Visual attention prediction improves performance of autonomous drone racing agents. arXiv preprint arXiv:2201.02569, 2022

[21] Rouhollah Rahmatizadeh, Pooya Abolghasemi, Ladislau Bölöni, and Sergey Levine. Vision-based multi-task manipulation for inexpensive robots using end-to-end learning from demonstration. arXiv preprint arXiv:1707.02920, 2017

[22] Stéphane Ross and J. Andrew Bagnell. Reinforcement and imitation learning via interactive no-regret learning. CoRR, abs/1406.5979, 2014

[23] Stephane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Geoffrey Gordon, David Dunson, and Miroslav Dudík, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS), volume 15 of Proceedings of Machine Learn- i...

[24] Bradly C. Stadie, Pieter Abbeel, and Ilya Sutskever. Third-person imitation learning. In International Conference on Learning Representations (ICLR), 2017. URL https://arxiv.org/abs/1703.01703. Preprint

[25] Bradly C Stadie, Pieter Abbeel, Ilya Sutskever, et al. A framework for few-shot policy transfer through observation mapping and behavior cloning. arXiv preprint arXiv:1709.07857, 2017

[26] Wenxuan Sun, Bryan Lim, Matthew Taylor, and Gita Sukthankar. Domain adaptive imitation learning. arXiv preprint arXiv:1907.03683, 2019

[27] Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 23–30. IEEE, 2017

[28] Jiaxu Xing, Angel Romero, Leonard Bauersfeld, and Davide Scaramuzza. Bootstrapping reinforcement learning with imitation for vision-based agile flight. arXiv preprint arXiv:2403.12203, 2024

[29] Jiakai Zhang and Kyunghyun Cho. Query-efficient imitation learning for end-to-end simulated driving. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pages 2891–2897. AAAI Press, 2017

[30] Xingyao Zhou et al. Invariance through latent alignment. arXiv preprint arXiv:2106.10863, 2021

[31] Yifeng Zhu, Abhishek Joshi, Peter Stone, and Yuke Zhu. VIOLA: Imitation learning for vision-based manipulation with object proposal priors. In Proceedings of Conference on Robot Learning (CoRL), 2022.
