Bayesian Optimization in Variational Latent Spaces with Dynamic Compression

Akshara Rai; Danica Kragic; Rika Antonova; Tianyu Li

arxiv: 1907.04796 · v1 · pith:FECU2COXnew · submitted 2019-07-10 · 💻 cs.RO · cs.LG


Rika Antonova, Akshara Rai, Tianyu Li, Danica Kragic

Pith reviewed 2026-05-24 23:48 UTC · model grok-4.3


keywords Bayesian optimization · variational autoencoder · robot control · trajectory embedding · data-efficient optimization · latent space · unsupervised learning · dynamic compression


The pith

A sequential variational autoencoder embeds simulated trajectories into a latent space that supports Bayesian optimization with only 10-20 real robot trials.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that an unsupervised sequential variational autoencoder can turn simulated robot trajectories into a lower-dimensional latent space whose geometry works for defining kernels in Bayesian optimization. This would matter because many robot adaptation tasks allow only a handful of trials, making standard optimization too slow for high-dimensional controllers. The method adds dynamic compression to shrink exploration away from undesirable state-space regions without needing explicit parameter constraints. Hardware tests on a hexapod and a manipulator arm are used to check whether the resulting trajectory-based kernel reaches good performance faster than prior approaches.

Core claim

The authors claim that their model and architecture for a sequential variational autoencoder embeds the space of simulated trajectories into a lower-dimensional space of latent paths in an unsupervised way, and that combining this embedding with dynamic compression of the search space produces a trajectory-based kernel allowing ultra data-efficient Bayesian optimization for higher-dimensional robot controllers.

What carries the argument

The sequential variational autoencoder that maps simulated trajectories to latent paths, which is used to define the kernel for Bayesian optimization together with dynamic compression that reduces exploration in undesirable regions of the state space.
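To make the idea concrete, here is a minimal Python sketch of a kernel that measures similarity between controllers through an embedding of their trajectories rather than through raw parameters. It assumes a squared-exponential (SE) form; `toy_embed` is a hypothetical stand-in for the trained SVAE encoder, and the shrink-toward-the-origin rule in `compressed_embed` is an illustrative guess at what compressing undesirable regions could look like, not the paper's SVAE-DC construction.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def se_kernel_on_embeddings(
    embed: Callable[[Vector], Vector],
    x1: Vector,
    x2: Vector,
    lengthscale: float = 1.0,
    variance: float = 1.0,
) -> float:
    """Squared-exponential kernel computed on embeddings of controller parameters.

    Distances are measured between embed(x1) and embed(x2) instead of between
    the raw parameter vectors.
    """
    z1, z2 = embed(x1), embed(x2)
    sq_dist = sum((a - b) ** 2 for a, b in zip(z1, z2))
    return variance * math.exp(-0.5 * sq_dist / lengthscale ** 2)


def toy_embed(x: Vector) -> List[float]:
    # Hypothetical stand-in for a trained trajectory encoder: only the first two
    # controller parameters affect the "latent path", so the kernel ignores the rest.
    return [math.tanh(x[0]), math.tanh(x[1])]


def compressed_embed(x: Vector, undesirability: Callable[[Vector], float]) -> List[float]:
    # Hypothetical compression rule (not the paper's): shrink the embedding toward
    # the origin by (1 - y), where y in [0, 1] measures how undesirable the simulated
    # trajectory is, so strongly undesirable controllers look alike to the kernel.
    y = undesirability(x)
    return [(1.0 - y) * z for z in toy_embed(x)]
```

If `undesirability` returns 1.0 for two controllers, both map to the origin and the kernel returns `variance`, so the GP treats them as one region instead of spending scarce trials telling them apart.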

If this is right

Higher-dimensional controllers become feasible to optimize when the trial budget is limited to 10-20 attempts.
Robots can adapt controllers to new tasks using far fewer real-world interactions than standard methods require.
No explicit constraints on controller parameters or supervised labels are needed to focus the search.
The approach transfers from simulation-based embedding to hardware performance on both legged robots and manipulators.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same unsupervised trajectory embedding might be reusable for other simulation-driven search methods beyond Bayesian optimization.
Dynamic compression of undesirable regions could be adapted to kernel choices in non-robotics domains that rely on trajectory data.
If the latent space geometry proves robust, it could reduce the need for hand-designed features in related control problems.

Load-bearing premise

The geometry produced by the unsupervised variational embedding of trajectories is suitable for an effective Bayesian optimization kernel without any supervised feature extraction or task labels.

What would settle it

A set of hardware trials on the Daisy hexapod or ABB Yumi in which the proposed kernel requires substantially more than 20 evaluations to match the performance of standard Bayesian optimization baselines would show the claimed data efficiency does not hold.
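The criterion above is phrased as the number of trials needed to reach a baseline's performance. A small sketch of that bookkeeping, assuming rewards are maximized and that "matching" means the best-so-far reward reaching a chosen threshold (both assumptions, not definitions from the paper):

```python
from typing import List, Optional, Sequence


def best_so_far(rewards: Sequence[float]) -> List[float]:
    """Running maximum of the reward observed after each trial."""
    out: List[float] = []
    best = float("-inf")
    for r in rewards:
        best = max(best, r)
        out.append(best)
    return out


def trials_to_threshold(rewards: Sequence[float], threshold: float) -> Optional[int]:
    """1-based index of the first trial whose best-so-far reward reaches
    `threshold`, or None if the trial budget runs out first."""
    for i, b in enumerate(best_so_far(rewards), start=1):
        if b >= threshold:
            return i
    return None
```

Calling `trials_to_threshold` on the reward sequences of two methods with the same threshold gives directly comparable trial counts; a result of `None` for the proposed kernel within 20 trials is the kind of outcome the criterion describes.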

Figures

Figures reproduced from arXiv: 1907.04796 by Akshara Rai, Danica Kragic, Rika Antonova, Tianyu Li.

**Figure 2.** A sketch of the generative and inference model. Our goal is to learn p(τ, ψ|x). p(τ|x) is analogous to p(ξ|x), only the paths are encoded in a lower-dimensional latent space. This is useful for constructing kernels for efficient BO on hardware. As a measure of trajectory ‘quality’ we can keep track of how long each trajectory spends in undesirable regions (y). For the latent paths we learn the analogous not…
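The trajectory ‘quality’ signal y in the Figure 2 caption can be computed as the fraction of time steps a simulated trajectory spends in undesirable states. A minimal sketch, assuming discrete time steps and a user-supplied predicate; `tipped_over` and its tilt threshold are hypothetical examples, not the paper's definition of an undesirable region:

```python
from typing import Callable, Sequence

State = Sequence[float]


def undesirable_fraction(
    trajectory: Sequence[State],
    is_undesirable: Callable[[State], bool],
) -> float:
    """Fraction of time steps spent in undesirable states, so y lies in [0, 1]."""
    if not trajectory:
        raise ValueError("empty trajectory")
    hits = sum(1 for s in trajectory if is_undesirable(s))
    return hits / len(trajectory)


def tipped_over(state: State, max_tilt: float = 0.5) -> bool:
    # Hypothetical criterion: state[0] is an object's tilt angle in radians.
    return abs(state[0]) > max_tilt
```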

**Figure 3.** Daisy hexapod used in this work. Daisy Controllers: We used Central Pattern Generators (CPGs) from [40]. These are capable of generating a large number of locomotion gaits by changing the frequency, amplitude, and offset of each joint, as well as the relative phase differences between joints. Different CPG parameters can be restricted to obtain controllers with various dimensionalities. We experimented wi…
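A CPG-style controller is parameterized by per-joint frequency, amplitude, offset, and relative phase. The sketch below uses plain sinusoids as a simplified stand-in for the coupled oscillators of [40] (an assumption; the real CPG dynamics are not reproduced) and shows how tying parameters across joints changes the controller dimensionality. All names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class JointOscillator:
    amplitude: float  # rad
    offset: float     # rad
    phase: float      # rad, relative to the first joint


def joint_targets(frequency_hz: float, joints: List[JointOscillator], t: float) -> List[float]:
    """Open-loop sinusoidal joint targets at time t:
    theta_j(t) = offset_j + amplitude_j * sin(2*pi*f*t + phase_j).

    A simplified stand-in for a CPG: real CPGs couple the oscillators through
    differential equations, which this sketch omits.
    """
    return [
        j.offset + j.amplitude * math.sin(2.0 * math.pi * frequency_hz * t + j.phase)
        for j in joints
    ]


def n_params(n_joints: int, shared_frequency: bool = True) -> int:
    """Controller dimensionality for this parameterization: amplitude and offset
    per joint, n_joints - 1 relative phases, and either one shared frequency or
    one frequency per joint. Tying or fixing parameters lowers the dimensionality."""
    n_freq = 1 if shared_frequency else n_joints
    return 2 * n_joints + (n_joints - 1) + n_freq
```

For example, `n_params(6)` is 18 while `n_params(6, shared_frequency=False)` is 23, so a single choice about frequencies already moves the search space by five dimensions.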

**Figure 4.** BO on Daisy hardware. Mean over 5 runs, 90% CIs. We completed 5 runs of BO on the Daisy robot hardware, initializing with 2 random samples, followed by 10 trials of BO (…
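For readers unfamiliar with the BO loop itself, here is a generic, self-contained sketch of the 2 random + 10 BO trial budget from Figure 4: a zero-mean GP with an SE kernel on raw parameters, expected improvement (EI) maximized over random candidates, and a toy objective. None of these choices (length scale, noise, candidate count, objective) come from the paper; in the paper's setting, the `se` kernel is what a trajectory-based kernel would replace.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Point = List[float]


def se(a: Sequence[float], b: Sequence[float], ls: float = 0.3) -> float:
    """Squared-exponential kernel with unit signal variance."""
    return math.exp(-0.5 * sum((x - y) ** 2 for x, y in zip(a, b)) / ls ** 2)


def cholesky(K: List[List[float]]) -> List[List[float]]:
    """Lower-triangular L with L L^T = K (K must be positive definite)."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = K[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def solve_lower(L: List[List[float]], b: Sequence[float]) -> List[float]:
    """Forward substitution for L v = b."""
    v: List[float] = []
    for i in range(len(b)):
        v.append((b[i] - sum(L[i][k] * v[k] for k in range(i))) / L[i][i])
    return v


def fit_gp(X: List[Point], y: Sequence[float], noise: float = 1e-6):
    """Zero-mean GP: returns L = chol(K + noise*I) and w = L^{-1} y."""
    K = [[se(a, b) + (noise if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    L = cholesky(K)
    return L, solve_lower(L, y)


def predict(X: List[Point], L, w, x: Point) -> Tuple[float, float]:
    """Posterior mean k^T K^{-1} y and std sqrt(k(x,x) - k^T K^{-1} k)."""
    v = solve_lower(L, [se(a, x) for a in X])
    mean = sum(vi * wi for vi, wi in zip(v, w))
    var = max(se(x, x) - sum(vi * vi for vi in v), 1e-12)
    return mean, math.sqrt(var)


def expected_improvement(mu: float, sigma: float, best: float) -> float:
    """EI for maximization under a Gaussian predictive distribution."""
    z = (mu - best) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (mu - best) * cdf + sigma * pdf


def bayes_opt(f: Callable[[Point], float], dim: int, n_init: int = 2,
              n_iter: int = 10, n_candidates: int = 500, seed: int = 0):
    """Defaults mirror Figure 4: n_init random trials, then n_iter BO trials."""
    rng = random.Random(seed)
    X = [[rng.random() for _ in range(dim)] for _ in range(n_init)]
    y = [f(x) for x in X]
    for _ in range(n_iter):
        y_mean = sum(y) / len(y)
        yc = [v - y_mean for v in y]  # center targets for the zero-mean prior
        L, w = fit_gp(X, yc)
        best = max(yc)
        cands = [[rng.random() for _ in range(dim)] for _ in range(n_candidates)]
        x_next = max(cands, key=lambda c: expected_improvement(*predict(X, L, w, c), best))
        X.append(x_next)
        y.append(f(x_next))
    return X, y
```

For instance, `X, y = bayes_opt(lambda x: -sum((xi - 0.7) ** 2 for xi in x), dim=2)` runs 12 evaluations of a toy 2D objective, and `max(y)` is the best reward found.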

**Figure 5.** BO for Daisy in simulation. Means over 50 runs, 90% CIs.
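The curves in these figures report means with 90% confidence intervals over repeated runs. A sketch of one common way to compute such bands, assuming a normal approximation (half-width 1.645 · s / √n); the captions do not say which construction the authors used, and with only 5 hardware runs a Student-t quantile (about 2.13 for 4 degrees of freedom) would give noticeably wider intervals.

```python
import math
import statistics
from typing import List, Sequence, Tuple


def mean_and_ci90(values: Sequence[float]) -> Tuple[float, float, float]:
    """Mean and a normal-approximation 90% confidence interval for the mean.

    Assumption: half-width = 1.645 * sample_std / sqrt(n).
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two runs")
    m = statistics.mean(values)
    half = 1.645 * statistics.stdev(values) / math.sqrt(n)
    return m, m - half, m + half


def per_trial_summary(runs: Sequence[Sequence[float]]) -> List[Tuple[float, float, float]]:
    """Summarize several runs' best-so-far curves trial by trial.

    runs[r][t] is the value of run r after trial t; returns (mean, lo, hi) per t.
    """
    return [mean_and_ci90(column) for column in zip(*runs)]
```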

**Figure 6.** “Stable push” task with Yumi. Our manipulation task was to push two objects from one side of the table to another without tipping them over. For the Yumi environment the objects had mass and inertial properties similar to paper towel rolls (mass of 150g, 22cm height, 5cm radius); for Franka these had properties similar to wooden rolls (2kg, 22cm height, 8cm radius). Compared to the ‘push-to-target’ task, our task …

**Figure 9.** BO with various kernels on Franka Emika simulation. Left: SVAE trained with same parameters as …

**Figure 7.** BO on ABB Yumi hardware (mean of 5 runs, 90% CIs). BO with SVAE-DC kernel was still able to significantly outperform BO with SE (…

**Figure 8.** BO on ABB Yumi simulation (mean of 50 runs, 90% CIs). Furthermore, we analyze how increasing the size of the SVAE latent space and NNs impacts performance (middle). The larger latent space is 6 · 5 = 30D (vs 9D in other experiments); the hidden layer size of NNs is increased from 128 to 256. A larger latent space implies a larger search space for BO, which could impair data efficiency. Indeed, we see what BO with…
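The caption's concern that a larger latent space (6 · 5 = 30 dimensions versus 9) enlarges the BO search space can be illustrated with a quick Monte Carlo check that is not taken from the paper: with a fixed budget of 12 evaluated points (matching 2 random + 10 BO trials), how far does a random query in the unit hypercube lie from its nearest evaluated point? The unit-hypercube domain and uniform sampling are assumptions.

```python
import math
import random


def mean_nn_distance(dim: int, n_points: int = 12, n_queries: int = 200, seed: int = 0) -> float:
    """Average distance from a uniformly random query to its nearest of
    `n_points` uniformly sampled points in the unit hypercube [0, 1]^dim."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    total = 0.0
    for _ in range(n_queries):
        q = [rng.random() for _ in range(dim)]
        total += min(math.dist(q, p) for p in pts)
    return total / n_queries
```

Comparing `mean_nn_distance(9)` with `mean_nn_distance(30)` shows the same 12 evaluations leaving much larger unexplored gaps in 30D, which is the intuition behind the caption's remark.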

**Figure 10.** Backbone of our SVAE generative model and inference. In the above, $\phi = [\phi_\xi]$ denote parameters of the variational approximation, $w = [w_\tau, w_\xi]$ denote the parameters of the generative part of the model. In our work, $\phi, w$ are weights of deep neural networks. It is customary to drop subscripts indicating NN weight parameters and write $q, p$ for a shorthand notation. The derivation for the above is similar to […
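The generative/inference split in Figure 10 (inference weights φ, generative weights w) is trained by maximizing an evidence lower bound (ELBO): a reconstruction term minus a KL term. The sketch below is a non-sequential, single-latent simplification with a diagonal-Gaussian posterior, a standard-normal prior, and a unit-variance Gaussian likelihood; it only shows the two terms and is not the paper's SVAE objective over latent paths τ and ψ. `decode` is a hypothetical stand-in for the generative network.

```python
import math
import random
from typing import Callable, List, Sequence


def kl_diag_gaussian_to_std_normal(mu: Sequence[float], log_var: Sequence[float]) -> float:
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def gaussian_log_lik(x: Sequence[float], x_hat: Sequence[float], sigma: float = 1.0) -> float:
    """log N(x | x_hat, sigma^2 I)."""
    d = len(x)
    sq = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return -0.5 * sq / sigma ** 2 - d * math.log(sigma) - 0.5 * d * math.log(2 * math.pi)


def elbo_one_sample(
    x: Sequence[float],
    mu: Sequence[float],
    log_var: Sequence[float],
    decode: Callable[[List[float]], Sequence[float]],
    rng: random.Random,
) -> float:
    """Single-sample ELBO estimate: log p(x|z) - KL(q(z|x) || p(z)),
    with z drawn by the reparameterization z = mu + sigma * eps."""
    z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]
    return gaussian_log_lik(x, decode(z)) - kl_diag_gaussian_to_std_normal(mu, log_var)
```

Note that neither term involves a task cost, which is exactly the property the referee report below questions.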

**Figure 11.** SVAE-DC training progress on Daisy. See full description in Figure …

**Figure 12.** SVAE-DC training progress on Yumi (middle) and Franka Emika (bottom) environments. Obser…

**Figure 13.** BO in 30D when only 3 dimensions contribute significantly. In the context of BO, consider a 30D quadratic: $f(x)=\sum_i (x_i+1)^2$, $x\in\mathbb{R}^{30}$ with $x_i\in[0, 1]$. Even on this simple quadratic, BO with SE kernel gives only modest gains for the first 60 trials. Now consider $f$ such that a large number of dimensions do not contribute significantly: $f(x) = \sum_{i=1}^{3}(x_i + 1)^2 + 0.001\sum_{i=4}^{30} x_i$ …
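The point of Figure 13 is easy to check numerically: in the second function, 27 of the 30 coordinates barely matter. A sketch evaluating both functions from the caption, implementing only the part of the second function that survives before the caption is cut off (any further terms are unknown and omitted):

```python
from typing import Sequence


def f_all_active(x: Sequence[float]) -> float:
    """30D quadratic from the caption: f(x) = sum_i (x_i + 1)^2."""
    if len(x) != 30:
        raise ValueError("expected a 30-dimensional point")
    return sum((xi + 1.0) ** 2 for xi in x)


def f_three_active(x: Sequence[float]) -> float:
    """Visible part of the second caption function:
    sum_{i=1}^{3} (x_i + 1)^2 + 0.001 * sum_{i=4}^{30} x_i."""
    if len(x) != 30:
        raise ValueError("expected a 30-dimensional point")
    return sum((xi + 1.0) ** 2 for xi in x[:3]) + 0.001 * sum(x[3:])


# Over x_i in [0, 1], the last 27 coordinates move f_three_active by at most
# 0.001 * 27 = 0.027, while the first three span a range of 12 - 3 = 9.
lo, hi = [0.0] * 30, [1.0] * 30
print(f_all_active(lo), f_all_active(hi))      # 30.0 120.0
print(f_three_active(lo), f_three_active(hi))  # 3.0 and approximately 12.027
```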

Original abstract

Data-efficiency is crucial for autonomous robots to adapt to new tasks and environments. In this work we focus on robotics problems with a budget of only 10-20 trials. This is a very challenging setting even for data-efficient approaches like Bayesian optimization (BO), especially when optimizing higher-dimensional controllers. Simulated trajectories can be used to construct informed kernels for BO. However, previous work employed supervised ways of extracting low-dimensional features for these. We propose a model and architecture for a sequential variational autoencoder that embeds the space of simulated trajectories into a lower-dimensional space of latent paths in an unsupervised way. We further compress the search space for BO by reducing exploration in parts of the state space that are undesirable, without requiring explicit constraints on controller parameters. We validate our approach with hardware experiments on a Daisy hexapod robot and an ABB Yumi manipulator. We also present simulation experiments with further comparisons to several baselines on Daisy and two manipulators. Our experiments indicate the proposed trajectory-based kernel with dynamic compression can offer ultra data-efficient optimization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Desk Editor's Note private letter to a colleague

The paper's unsupervised sequential VAE plus dynamic compression for trajectory kernels in BO is a reasonable architectural step, but the abstract supplies no numbers or derivations, so the data-efficiency claim stays untested.


The new piece is an unsupervised sequential VAE that turns simulated trajectories into a latent space for defining a BO kernel, followed by a dynamic compression rule that shrinks exploration in undesirable state-space regions without explicit parameter constraints. This avoids the supervised feature extraction used in earlier trajectory-kernel work and adds a compression mechanism that looks specific to this pipeline. The hardware validation on the Daisy hexapod and ABB Yumi is mentioned, which is the right setting for the 10-20 trial budget they target. That combination is the actual contribution worth noting. The central assumption is that reconstruction-plus-KL training alone will keep the latent geometry aligned with task costs; nothing in the objective forces that, so distances in the embedding could easily ignore variations that matter for the objective. The abstract states validation occurred but shows no quantitative results, baselines, error bars, or kernel derivation, which leaves the ultra data-efficient claim impossible to check. Without those details the dynamic-compression step inherits the same uncertainty. The paper is for people working on sample-efficient robot adaptation who already know the VAE and BO literature. A reader who wants to see whether the unsupervised embedding actually helps in the low-trial regime would get value from the full experiments if they exist. It deserves a serious referee because the architecture is concrete and the robotics setting is relevant, even if the current evidence is thin.

Referee Report

2 major / 1 minor

Summary. The paper proposes a sequential variational autoencoder trained unsupervised on simulated robot trajectories to produce a latent space of paths, from which a trajectory-based kernel is defined for Bayesian optimization; a dynamic compression step further restricts exploration to desirable state-space regions without explicit parameter constraints. It claims this yields ultra data-efficient optimization (10-20 trials) for high-dimensional controllers and reports validation via hardware experiments on a Daisy hexapod and ABB Yumi manipulator plus simulation comparisons against baselines on Daisy and two manipulators.

Significance. If the quantitative results demonstrate consistent gains over baselines in the low-data regime, the unsupervised latent-space kernel plus dynamic compression would constitute a useful contribution to data-efficient BO for robotics by avoiding supervised feature extraction. The hardware component strengthens the practical relevance if the reported improvements are statistically supported.

major comments (2)

[Abstract] Abstract: the claim that the method 'can offer ultra data-efficient optimization' and was 'validated with hardware experiments' is unsupported by any numerical results, error bars, baseline specifications, or performance metrics, rendering the central empirical claim impossible to assess.
[Method (VAE)] VAE training description (methods): the sequential VAE objective consists only of reconstruction and KL terms with no task costs or labels; nothing in the training therefore guarantees that latent distances preserve variations relevant to the downstream objective, which directly undermines the claim that the resulting kernel enables effective BO in the 10-20-trial regime.
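One way to probe the second major comment empirically, not something the paper reports, is to test whether pairwise distances between latent embeddings track pairwise differences in task cost on held-out simulated trajectories. A minimal sketch using plain Pearson correlation (a rank correlation would be a reasonable alternative); the function names are hypothetical, and because pairwise values are not independent, the number is a descriptive diagnostic rather than a significance test.

```python
import math
from itertools import combinations
from typing import Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def distance_cost_alignment(latents: Sequence[Sequence[float]], costs: Sequence[float]) -> float:
    """Correlation between pairwise latent distances and pairwise |cost| gaps.

    A value near zero (or negative) suggests the embedding's geometry ignores
    what the task cost cares about; a clearly positive value is weak evidence
    that it does not.
    """
    dists, gaps = [], []
    for i, j in combinations(range(len(latents)), 2):
        dists.append(math.dist(latents[i], latents[j]))
        gaps.append(abs(costs[i] - costs[j]))
    return pearson(dists, gaps)
```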

minor comments (1)

[Abstract] The abstract refers to 'several baselines' and 'further comparisons' without naming them or indicating where the corresponding results appear.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions to improve clarity and support for the claims.


Referee: [Abstract] Abstract: the claim that the method 'can offer ultra data-efficient optimization' and was 'validated with hardware experiments' is unsupported by any numerical results, error bars, baseline specifications, or performance metrics, rendering the central empirical claim impossible to assess.

Authors: We agree that the abstract would be strengthened by including concrete quantitative indicators to support the central claims. In the revised manuscript we will update the abstract to reference specific performance metrics from the experiments section (e.g., trials-to-convergence and relative improvement over baselines) together with a brief note on the presence of error bars and hardware validation, while preserving the abstract's high-level character. revision: yes
Referee: [Method (VAE)] VAE training description (methods): the sequential VAE objective consists only of reconstruction and KL terms with no task costs or labels; nothing in the training therefore guarantees that latent distances preserve variations relevant to the downstream objective, which directly undermines the claim that the resulting kernel enables effective BO in the 10-20-trial regime.

Authors: The referee is correct that the sequential VAE is trained in a fully unsupervised manner using only reconstruction and KL terms. This design was chosen deliberately to exploit large quantities of unlabeled simulated trajectories without requiring task labels or costs. Because the trajectories are generated by sampling the same controller parameter space later used for BO, the latent paths encode variations that are relevant to the controller's behavior; the trajectory-based kernel then operates directly on these paths. While no theoretical guarantee exists that latent distances will align with the downstream objective, the empirical results across hardware and simulation experiments show that the kernel supports effective optimization within the 10-20-trial budget. We will revise the methods section to articulate this rationale more explicitly and add a short limitations paragraph acknowledging the unsupervised nature of the pre-training. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected


The paper presents an unsupervised sequential VAE trained solely on reconstruction and KL losses from simulated trajectories, followed by a separate dynamic compression step to define a kernel for Bayesian optimization. No equations, derivations, or method steps reduce a claimed prediction or result to the same fitted quantities or inputs by construction. The latent space geometry is produced independently of task-specific costs or labels, and experimental validation on hardware and simulation is used to support data-efficiency claims rather than any self-referential definition or self-citation chain. The approach is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; latent-space construction and dynamic compression rules are described at high level only.

pith-pipeline@v0.9.0 · 5711 in / 1003 out tokens · 22143 ms · 2026-05-24T23:48:23.397741+00:00 · methodology


Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 8 internal anchors

[1] OpenAI. Learning dexterous in-hand manipulation. arXiv:1808.00177, 2018.

[2] A. Wilson, A. Fern, and P. Tadepalli. Using Trajectory Data to Improve Bayesian Optimization for Reinforcement Learning. Journal of Machine Learning Research, 15(1):253–282, 2014.

[3] Max Balandat, Brian Karrer, Daniel Jiang, Ben Letham, Sam Daulton, Andrew Wilson, Eytan Bakshy. BoTorch. https://botorch.org/. Accessed: 2019-05.

[4] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv:1312.6114, 2013.

[5] R. Gómez-Bombarelli, J. N. Wei, D. Duvenaud, J. M. Hernández-Lobato, B. Sánchez-Lengeling, D. Sheberla, J. Aguilera-Iparraguirre, T. D. Hirzel, R. P. Adams, and A. Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. ACS central science, 4(2):268–276, 2018.

[6] T. Lesort, N. Díaz-Rodríguez, J.-F. Goudou, and D. Filliat. State representation learning for control: An overview. Neural Networks, 2018.

[7] L. Yingzhen and S. Mandt. Disentangled sequential autoencoder. In International Conference on Machine Learning, pages 5656–5665, 2018.

[8] M. Fraccaro, S. Kamronn, U. Paquet, and O. Winther. A disentangled recognition and nonlinear dynamics model for unsupervised learning. In Advances in Neural Information Processing Systems, pages 3601–3610, 2017.

[9] M. Deisenroth and C. E. Rasmussen. Pilco: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on machine learning (ICML-11), pages 465–472, 2011.

[10] R. Calandra, J. Peters, C. E. Rasmussen, and M. P. Deisenroth. Manifold gaussian processes for regression. In 2016 International Joint Conference on Neural Networks (IJCNN), pages 3338–3345. IEEE, 2016.

[11] A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing. Deep kernel learning. In Artificial Intelligence and Statistics, pages 370–378, 2016.

[12] N. Thatte, H. Duan, and H. Geyer. A method for online optimization of lower limb assistive devices with high dimensional parameter spaces. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1–6. IEEE, 2018.

[13] S. Feng, E. Whitman, X. Xinjilefu, and C. G. Atkeson. Optimization-based Full Body Control for the DARPA Robotics Challenge. Journal of Field Robotics, 32(2):293–312, 2015.

[14] Y. Gong, R. Hartley, X. Da, A. Hereid, O. Harib, J.-K. Huang, and J. Grizzle. Feedback control of a cassie bipedal robot: Walking, standing, and riding a segway. arXiv:1809.07279, 2018.

[15] R. Calandra. Bayesian Modeling for Optimization and Control in Robotics. PhD thesis, Darmstadt University of Technology, Germany, 2017.

[16] D. J. Lizotte, T. Wang, M. H. Bowling, and D. Schuurmans. Automatic Gait Optimization with Gaussian Process Regression. In International Joint Conference on Artificial Intelligence (IJCAI), volume 7, pages 944–949, 2007.

[17] M. Tesch, J. Schneider, and H. Choset. Using response surfaces and expected improvement to optimize snake robot gait parameters. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1069–1074. IEEE, 2011.

[18] A. Cully, J. Clune, D. Tarapore, and J.-B. Mouret. Robots that can adapt like animals. Nature, 521(7553):503–507, 2015.

[19] A. Rai, R. Antonova, F. Meier, and C. G. Atkeson. Using simulation to improve sample-efficiency of bayesian optimization for bipedal robots. Journal of machine learning research, 20(49):1–24, 2019.

[20] O. Kroemer, R. Detry, J. Piater, and J. Peters. Combining active learning and reactive control for robot grasping. Robotics and Autonomous systems, 58(9):1105–1116, 2010.

[21] L. Montesano and M. Lopes. Active learning of visual descriptors for grasping using non-parametric smoothed beta distributions. Robotics and Autonomous Systems, 60(3):452–462, 2012.

[22] R. Antonova, M. Kokic, J. A. Stork, and D. Kragic. Global search with bernoulli alternation kernel for task-oriented grasping informed by simulation. In Conference on Robot Learning, pages 641–650, 2018.

[23] Y. Chebotar, A. Handa, V. Makoviychuk, M. Macklin, J. Issac, N. Ratliff, and D. Fox. Closing the sim-to-real loop: Adapting simulation randomization with real world experience. arXiv:1810.05687, 2018.

[24] X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel. Sim-to-real transfer of robotic control with dynamics randomization. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1–8. IEEE, 2018.

[25] Z. He, R. Julian, E. Heiden, H. Zhang, S. Schaal, J. Lim, G. Sukhatme, and K. Hausman. Zero-shot skill composition and simulation-to-real transfer by learning task representations. arXiv:1810.02422, 2018.

[26] I. Arnekvist, D. Kragic, and J. A. Stork. VPE: Variational Policy Embedding for Transfer Reinforcement Learning. In 2019 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2019.

[27] J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, and V. Vanhoucke. Sim-to-real: Learning agile locomotion for quadruped robots. arXiv:1804.10332, 2018.

[28] T. Li, A. Rai, H. Geyer, and C. G. Atkeson. Using deep reinforcement learning to learn high-level policies on the atrias biped. arXiv:1809.10811, 2018.

[29] T. Haarnoja, S. Ha, A. Zhou, J. Tan, G. Tucker, and S. Levine. Learning to walk via deep reinforcement learning. In Robotics: Science and Systems (RSS), 2019.

[30] C. Louizos, K. Swersky, Y. Li, M. Welling, and R. Zemel. The variational fair autoencoder. International Conference on Learning Representations, 2016.

[31] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas. Taking the Human Out of the Loop: A Review of Bayesian Optimization. Proceedings of the IEEE, 104(1):148–175, 2016.

[32] T. Minka. Divergence measures and message passing. Technical report, Microsoft Research, 2005.

[33] C. M. Bishop. Pattern recognition and machine learning. springer, 2006.

[34] C. Riquelme, M. Johnson, and M. Hoffman. Failure modes of variational inference for decision making. Prediction and Generative Modeling in RL Workshop (AAMAS, ICML, IJCAI), 2018.

[35] S. Tschiatschek, K. Arulkumaran, J. Stühmer, and K. Hofmann. Variational inference for data-efficient model learning in pomdps. arXiv:1805.09281, 2018.

[36] S. Bai, J. Z. Kolter, and V. Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv:1803.01271, 2018.

[37] J. Bradbury, S. Merity, C. Xiong, and R. Socher. Quasi-recurrent neural networks. International Conference on Learning Representations, 2017.

[38] Hebi Robotics. http://docs.hebi.us. Accessed: 2019-06.

[39] Pybullet simulator. https://github.com/bulletphysics/bullet3. Accessed: 2019-06.

[40] A. Crespi and A. J. Ijspeert. Online optimization of swimming and crawling in an amphibious snake robot. IEEE Transactions on Robotics, 24(1):75–87, 2008.

[41] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling. Semi-supervised learning with deep generative models. In Advances in neural information processing systems, pages 3581–3589, 2014.

Appendix A: SVAE-DC Modeling Details. The backbone of our model is inspired by hierarchical constructions, like those developed in [30, 41]. However, these works consid...
