Environment Probing Interaction Policies

Abhinav Gupta; Lerrel Pinto; Wenxuan Zhou

arXiv: 1907.11740 · v1 · pith:X3H5TSIEnew · submitted 2019-07-26 · 💻 cs.RO · cs.AI · cs.LG

Pith reviewed 2026-05-24 15:37 UTC · model grok-4.3

keywords reinforcement learning · environment generalization · policy transfer · probing policies · transition predictability · robot learning · RL generalization

The pith

Policies that probe a new environment for its dynamics can then solve tasks in it more effectively than policies trained to ignore environment differences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reinforcement learning policies often fail when moved from training environments to slightly different test ones. Instead of seeking invariance to those differences, the paper proposes training an Environment-Probing Interaction policy that interacts with a new environment to gather information about its behavior. This policy is trained with a reward that increases when its trajectory makes future transitions more predictable, and the collected information is then supplied as extra input to a task-specific policy. A sympathetic reader would care because the work offers an alternative to standard generalization techniques that try to average away environment variations. Experiments indicate that the resulting conditioned task policies achieve higher success on novel environments than common baselines.

Core claim

The central claim is that an EPI policy trained to maximize transition predictability in its trajectories extracts implicit environment-specific information that, when provided as conditioning input, allows a task policy to outperform methods that learn invariant policies across novel testing environments.

What carries the argument

The Environment-Probing Interaction (EPI) policy, which uses a reward based on transition predictability to extract implicit environment behavior for conditioning a task policy.
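
One way to make the predictability reward concrete is sketched below. The abstract only states that the reward is higher when the probe trajectory helps predict transitions better. The specific form used here (the error of a baseline predictor minus the error of a predictor that also sees a summary of the probe trajectory) is an assumption made for illustration. The toy 1-D dynamics, the hidden `friction` parameter, and the names `summarize` and `predictability_reward` are likewise hypothetical, and a least-squares estimate stands in for a learned embedding.

```python
# Hypothetical sketch: the dynamics, names, and reward form are illustrative
# assumptions, not the paper's implementation.

def step(velocity, action, friction):
    """Toy 1-D dynamics: the hidden `friction` controls how much velocity is kept."""
    return (1.0 - friction) * velocity + action

def rollout(actions, friction, v0=1.0):
    """Return a list of (state, action, next_state) transitions."""
    transitions, v = [], v0
    for a in actions:
        v_next = step(v, a, friction)
        transitions.append((v, a, v_next))
        v = v_next
    return transitions

def summarize(probe):
    """Least-squares friction estimate from probe transitions.

    Stand-in for a learned trajectory embedding. Returns None when the
    probe never moved, because such a probe carries no information.
    """
    num = sum(s * (s_next - a) for s, a, s_next in probe)
    den = sum(s * s for s, _, _ in probe)
    if den == 0.0:
        return None
    return 1.0 - num / den

def mse(predict, transitions):
    """Mean squared one-step prediction error."""
    errors = [(predict(s, a) - s_next) ** 2 for s, a, s_next in transitions]
    return sum(errors) / len(errors)

def predictability_reward(probe, held_out, prior_friction=0.5):
    """Baseline prediction error minus error when conditioned on the probe."""
    baseline = mse(lambda s, a: step(s, a, prior_friction), held_out)
    estimate = summarize(probe)
    if estimate is None:
        estimate = prior_friction  # fall back to the uninformed prior
    conditioned = mse(lambda s, a: step(s, a, estimate), held_out)
    return baseline - conditioned

true_friction = 0.2
held_out = rollout([0.3, 0.0, -0.2], true_friction, v0=2.0)
active_probe = rollout([1.0, -1.0, 0.5], true_friction, v0=1.0)
idle_probe = rollout([0.0, 0.0, 0.0], true_friction, v0=0.0)

reward_active = predictability_reward(active_probe, held_out)
reward_idle = predictability_reward(idle_probe, held_out)
```

In this toy setting the active probe earns a positive reward because it reveals the hidden parameter, while the idle probe earns zero. The sketch also illustrates the premise discussed further down: nothing in this reward refers to a downstream task.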

If this is right

EPI-conditioned task-specific policies significantly outperform commonly used policy generalization methods on novel testing environments.
The probing step allows a policy to identify and exploit environment nuances instead of remaining invariant to them.
Better transfer occurs by conditioning actions on environment-specific information obtained through the predictability reward.
The separation of probing and task execution enables the task policy to perform environment-conditioned actions once the probe trajectory is available.
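
The two-phase structure described in the points above (probe first, then act with the probe's output as extra input) might look roughly like the following. This is a minimal sketch under stated assumptions: the 1-D dynamics, the hand-written policies, the assumed training range, and every function name are hypothetical and chosen only to show the data flow, not taken from the paper.

```python
# Hypothetical sketch; the dynamics, policies, and names are illustrative assumptions.

def step(velocity, action, friction):
    """Toy 1-D dynamics with a hidden `friction` parameter."""
    return (1.0 - friction) * velocity + action

def probe_environment(friction, probe_actions=(1.0, -0.5), v0=1.0):
    """Phase 1: interact briefly and return an environment estimate."""
    v, num, den = v0, 0.0, 0.0
    for a in probe_actions:
        v_next = step(v, a, friction)
        num += v * (v_next - a)
        den += v * v
        v = v_next
    return 1.0 - num / den  # least-squares friction estimate

def task_policy(velocity, target, env_estimate):
    """Phase 2: pick the action that reaches `target` if the estimate is exact."""
    return target - (1.0 - env_estimate) * velocity

def task_error(friction, env_estimate, velocity=2.0, target=0.0):
    """Distance from the target after one task step."""
    action = task_policy(velocity, target, env_estimate)
    return abs(step(velocity, action, friction) - target)

novel_friction = 0.9  # outside an assumed training range of 0.2 to 0.6

# An "invariant" policy uses one fixed guess for every environment.
invariant_error = task_error(novel_friction, env_estimate=0.4)

# A conditioned policy receives the probe's estimate as extra input.
conditioned_error = task_error(
    novel_friction, env_estimate=probe_environment(novel_friction)
)
```

Because the task policy only consumes the estimate, the probing phase can be swapped out or retrained without touching the task policy, which is the separation the last point refers to.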

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Probing data gathered under the predictability reward might be reusable across multiple tasks within the same environment without retraining the probe policy.
If predictability does not align with task needs, the method could be extended by making the reward partially task-aware while keeping probing separate.
The approach could be tested in settings with sparse task rewards, since the probing phase operates independently of task success.

Load-bearing premise

That a reward based only on improved transition predictability will lead the probing policy to collect information useful for the downstream task rather than irrelevant details.

What would settle it

An experiment showing no performance improvement on novel environments for the EPI-conditioned task policy compared with non-probing baselines, or where the predictability reward produces trajectories uncorrelated with task success.
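
The second half of that test, checking whether the predictability reward tracks task success, could be run as a simple correlation across environments. The sketch below is one possible way to do it, not a procedure from the paper. The function name `pearson` is hypothetical, and the numbers are made-up placeholders for illustration only, not experimental results.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences.

    Returns 0.0 when either sequence is constant, treating that case as
    "no linear relationship" rather than dividing by zero.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0.0 or spread_y == 0.0:
        return 0.0
    return cov / (spread_x * spread_y)

# Placeholder values, one entry per test environment (NOT real data).
toy_probe_rewards = [0.10, 0.25, 0.40, 0.55]
toy_task_returns = [12.0, 15.0, 21.0, 24.0]

correlation = pearson(toy_probe_rewards, toy_task_returns)
```

A correlation near zero on real evaluation data would point toward the failure mode described above, though a single linear statistic would not settle the question on its own.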

Figures

Figures reproduced from arXiv: 1907.11740 by Abhinav Gupta, Lerrel Pinto, Wenxuan Zhou.

**Figure 2.** Striker Environment: (a) Illustration of the environment (b) Training curves of the EPI

**Figure 3.** Hopper Environment: (a) Illustration of the environment (b) Training curves of the EPI

**Figure 4.** Generalizability of the EPI-policy on Hopper. The grey area represents the training range

Original abstract

A key challenge in reinforcement learning (RL) is environment generalization: a policy trained to solve a task in one environment often fails to solve the same task in a slightly different test environment. A common approach to improve inter-environment transfer is to learn policies that are invariant to the distribution of testing environments. However, we argue that instead of being invariant, the policy should identify the specific nuances of an environment and exploit them to achieve better performance. In this work, we propose the 'Environment-Probing' Interaction (EPI) policy, a policy that probes a new environment to extract an implicit understanding of that environment's behavior. Once this environment-specific information is obtained, it is used as an additional input to a task-specific policy that can now perform environment-conditioned actions to solve a task. To learn these EPI-policies, we present a reward function based on transition predictability. Specifically, a higher reward is given if the trajectory generated by the EPI-policy can be used to better predict transitions. We experimentally show that EPI-conditioned task-specific policies significantly outperform commonly used policy generalization methods on novel testing environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

EPI introduces a predictability-driven probing policy as an alternative to invariance for RL generalization, but the abstract gives no experimental details to check if the probing actually helps the task.

The main point is that the paper trains a separate EPI policy to probe a new environment using a reward based only on how well its trajectory lets you predict future transitions, then passes that information to a task policy so it can condition its actions on the specific environment. This is framed as better than trying to learn policies invariant across environments. The predictability reward is a straightforward mechanism and the separation of probing from task execution is a clean design choice. It builds on existing RL ideas about exploration and prediction but packages them as a distinct training procedure for adaptation.

The abstract claims the combined system beats common generalization methods on novel test environments. That framing is worth considering for anyone thinking about sim-to-real or distribution shift in control. The soft spot is exactly the one the stress-test flags: the reward has no task signal, so the EPI policy could settle on any consistent dynamics that improve predictability, including background features or sensor quirks that do not help the downstream task. The abstract supplies no mechanism to enforce task relevance and no description of baselines, environment distributions, or statistical controls. Without those, the outperformance claim cannot be evaluated.

This is for RL researchers focused on generalization and transfer. A reader looking for new procedural ideas rather than final results could extract the core proposal. It deserves peer review so the experiments can be checked for whether the probing actually delivers task-useful information or just incidental predictability gains.

Referee Report

2 major / 1 minor

Summary. The paper proposes the Environment-Probing Interaction (EPI) policy for RL environment generalization. Rather than learning invariant policies, an EPI policy is trained with a reward based solely on transition predictability to probe a new environment and extract implicit dynamics information; this information is then provided as additional input to a task-specific policy that performs environment-conditioned actions. The central claim is that EPI-conditioned task-specific policies significantly outperform commonly used policy generalization methods on novel testing environments.

Significance. If the experimental results are robust and the predictability reward is shown to yield task-relevant information rather than incidental correlations, the work could meaningfully challenge the dominance of invariance-based transfer methods in RL by demonstrating the value of active, environment-specific probing. This would be particularly relevant for robotics applications with varying dynamics.

major comments (2)

[Abstract] The claim that EPI-conditioned policies 'significantly outperform commonly used policy generalization methods' comes with no details on baselines, environment distributions, statistical significance, or controls. These details are load-bearing for the central experimental claim, and their absence prevents evaluation of soundness.
[Abstract] (method description) The reward is defined solely in terms of improved transition predictability, with no auxiliary loss, task signal, or alignment mechanism described. This leaves open whether the EPI policy collects information useful to the downstream task or merely any consistent dynamics (e.g., background features), which bears directly on the weakest assumption required by the central claim.

minor comments (1)

[Abstract] The abstract would benefit from one sentence naming the experimental domains or task types used to ground the outperformance claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments. We agree that the abstract needs to be expanded for clarity on the experimental claims and method. We have revised the abstract accordingly and respond point-by-point below.

Referee: [Abstract] The claim that EPI-conditioned policies 'significantly outperform commonly used policy generalization methods' comes with no details on baselines, environment distributions, statistical significance, or controls. These details are load-bearing for the central experimental claim, and their absence prevents evaluation of soundness.

Authors: We agree that the abstract's brevity makes the central claim difficult to evaluate. In the revised manuscript we have expanded the abstract to specify the baselines (domain randomization, invariant feature learning, and meta-learning approaches), the environment distribution (procedural variations in dynamics parameters such as mass, friction, and damping), and that reported improvements are statistically significant across multiple random seeds, with full details and controls provided in the experimental section. Revision: yes.

Referee: [Abstract] (method description) The reward is defined solely in terms of improved transition predictability, with no auxiliary loss, task signal, or alignment mechanism described. This leaves open whether the EPI policy collects information useful to the downstream task or merely any consistent dynamics (e.g., background features), which bears directly on the weakest assumption required by the central claim.

Authors: The method intentionally trains the EPI policy using only the transition-predictability reward, with no auxiliary task loss or explicit alignment term, so that probing occurs independently of the downstream task. The task relevance arises because the resulting environment representation is provided as input to the task policy, enabling conditioned actions; the paper's experiments show this yields clear task-performance gains on novel environments, indicating the captured dynamics are task-relevant rather than incidental. We have revised the abstract to state this conditioning step explicitly and to note the empirical support for relevance. Revision: yes.

Circularity Check

0 steps flagged

No circularity: the training procedure is self-contained, with no derived quantities or self-referential definitions.

The paper presents EPI as a new training procedure: an EPI policy is trained with a reward defined directly from transition predictability, then its output is fed as conditioning to a task policy. No equations derive a 'prediction' or first-principles result that reduces to fitted parameters by construction. The central claim is an empirical performance comparison on novel environments, not a mathematical derivation. No self-citations are load-bearing for any uniqueness theorem or ansatz. The method is therefore self-contained against external benchmarks and receives the default non-finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the core idea introduces the EPI policy construct and predictability reward without further decomposition.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel · echoes
echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Paper passage: "a reward function based on transition predictability. Specifically, a higher reward is given if the trajectory generated by the EPI-policy can be used to better predict transitions."

IndisputableMonolith/Foundation/RealityFromDistinction · reality_from_one_distinction · unclear
Paper passage: "the agent first performs 'environment-probing' interactions that extract information from an environment, then leverages this information to achieve the goal with a task-specific policy."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
