Co-training for Policy Learning

Jialin Song; Masahiro Ono; Ravi Lanka; Yisong Yue

arXiv: 1907.04484 · v1 · pith:47KLRIIMnew · submitted 2019-07-03 · 💻 cs.LG · cs.AI · stat.ML

Pith reviewed 2026-05-25 10:16 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML

keywords: co-training, policy learning, reinforcement learning, imitation learning, sequential decision making, multiple representations, combinatorial optimization

The pith

Sufficient conditions allow co-training from two state-action views to improve policies over single-view learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies sequential decision-making policies when multiple state-action representations exist, as occurs in planning with different integer programming formulations or combinatorial problems with both integer and graph views. It adapts the classical co-training idea from classification to derive sufficient conditions where two views produce better policies than either alone. A meta-algorithm is introduced that works with reinforcement learning and imitation learning, and experiments demonstrate gains on discrete and continuous control tasks plus combinatorial optimization problems.

Core claim

Under sufficient conditions, learning from two complementary state-action representations improves upon learning from a single representation alone, and this improvement is realized by a meta-algorithm for co-training that is compatible with both reinforcement learning and imitation learning.

What carries the argument

The co-training meta-algorithm that alternates training between two views to mutually refine the policy.
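As a rough illustration of what alternating between two views could look like, here is a minimal toy sketch. It assumes an exchange rule in which, on each problem instance, the view that achieves the higher return supplies its trajectory as a demonstration to the other view, and that actions map one-to-one between views. Both assumptions, along with the names `rollout`, `imitate`, `cotrain`, and the tabular `TARGET` task, are hypothetical and are not taken from the summary above.

```python
from typing import Dict, List, Tuple

Policy = Dict[int, int]   # hypothetical tabular policy: instance id -> action
Trajectory = List[int]    # sequence of actions (length 1 in this toy task)

# Toy ground truth: the best action for each problem instance.
TARGET: Dict[int, int] = {0: 2, 1: 0, 2: 1, 3: 2}


def rollout(policy: Policy, inst: int) -> Tuple[Trajectory, float]:
    """Run a policy on one instance and return (trajectory, return)."""
    action = policy.get(inst, 0)
    return [action], 1.0 if action == TARGET[inst] else 0.0


def imitate(policy: Policy, inst: int, demo: Trajectory) -> None:
    """Stand-in for an imitation update: copy the demonstrated action."""
    policy[inst] = demo[0]


def cotrain(policy_a: Policy, policy_b: Policy,
            instances: List[int], rounds: int = 1) -> None:
    """Assumed exchange rule: the better view on an instance teaches the other."""
    for _ in range(rounds):
        for inst in instances:
            traj_a, ret_a = rollout(policy_a, inst)
            traj_b, ret_b = rollout(policy_b, inst)
            if ret_a > ret_b:
                imitate(policy_b, inst, traj_a)
            elif ret_b > ret_a:
                imitate(policy_a, inst, traj_b)
            # On ties, neither view has evidence to teach the other.


# View A starts out correct on even instances, view B on odd ones.
policy_a: Policy = {0: 2, 1: 1, 2: 1, 3: 0}
policy_b: Policy = {0: 0, 1: 0, 2: 0, 3: 2}
cotrain(policy_a, policy_b, list(TARGET))

returns_a = [rollout(policy_a, i)[1] for i in TARGET]
returns_b = [rollout(policy_b, i)[1] for i in TARGET]
print(returns_a, returns_b)
```

In this toy, each view is strong where the other is weak, so a single pass lets both policies reach the best action everywhere. With views that fail on the same instances, the exchange would have nothing to transfer, which is the complementarity premise in miniature.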

If this is right

Policy quality improves when multiple formulations of the same decision problem are available and used together.
The meta-algorithm applies equally to reinforcement learning and imitation learning settings.
Gains are observed on both discrete/continuous control and combinatorial optimization tasks.
The approach directly extends co-training concepts from classification into sequential decision making.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Identifying pairs of complementary representations may be a practical bottleneck even when the theoretical conditions hold.
The framework could be extended to more than two views if the mutual-improvement conditions generalize.
Sample efficiency in reinforcement learning might increase when dual representations reduce the need for exploration in each view separately.

Load-bearing premise

The two representations supply complementary information that permits each view to improve the policy quality obtained from the other.

What would settle it

A controlled experiment in which two views satisfy the stated conditions yet produce no measurable improvement in policy quality over the single-view baseline.

Figures

Figures reproduced from arXiv: 1907.04484 by Jialin Song, Masahiro Ono, Ravi Lanka, Yisong Yue.

**Figure 1.** Two ways to encode minimum vertex cover (MVC) problems. Left: policies learn to operate directly on the graph view to find the minimal cover set [30]. Right: we express MVC as an integer linear program, then policies learn to traverse the resulting combinatorial search space, i.e., learn to branch-and-bound [23, 56].
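To make the two encodings in Figure 1 concrete, the sketch below represents one small MVC instance both as a graph (an edge list) and as an integer linear program (one constraint x_u + x_v >= 1 per edge, with binary x, minimizing the sum of x). The helper names and the brute-force solver are hypothetical and only suitable for tiny graphs; they stand in for the learned policies and are not the paper's method.

```python
from itertools import combinations
from typing import List, Sequence, Set, Tuple

Edge = Tuple[int, int]


def is_cover_graph_view(edges: Sequence[Edge], cover: Set[int]) -> bool:
    """Graph view: every edge has at least one endpoint in the cover."""
    return all(u in cover or v in cover for u, v in edges)


def ilp_constraints(n: int, edges: Sequence[Edge]) -> List[List[int]]:
    """ILP view: one row per edge (u, v), encoding x_u + x_v >= 1."""
    rows = []
    for u, v in edges:
        row = [0] * n
        row[u] = 1
        row[v] = 1
        rows.append(row)
    return rows


def is_feasible_ilp_view(rows: Sequence[Sequence[int]], x: Sequence[int]) -> bool:
    """Check that a 0/1 assignment satisfies every edge constraint."""
    return all(sum(a * xi for a, xi in zip(row, x)) >= 1 for row in rows)


def brute_force_mvc(n: int, edges: Sequence[Edge]) -> Set[int]:
    """Exhaustive search for a smallest cover; exponential, toy sizes only."""
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            if is_cover_graph_view(edges, set(subset)):
                return set(subset)
    return set(range(n))  # not reached: the full vertex set always covers


# Toy instance: the path graph 0-1-2-3.
n = 4
edges = [(0, 1), (1, 2), (2, 3)]

cover = brute_force_mvc(n, edges)              # solution in the graph view
x = [1 if v in cover else 0 for v in range(n)]  # same solution in the ILP view
rows = ilp_constraints(n, edges)

print(cover, x, is_feasible_ilp_view(rows, x))
```

The same solution is a vertex set in one view and a 0/1 vector in the other, and the translation between them is mechanical. That kind of translation is what allows information learned in one view to be handed to the other.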

**Figure 3.** Graphical model encodes the conditional inde…

**Figure 4.** Discrete & continuous control tasks. Experiment results are across 5 randomly seeded runs. Shaded area…

**Figure 5.** Comparison of CoPiEr with other learning-based baselines and a commercial solver, Gurobi. The y-axis measures relative gaps of various methods compared with CoPiEr Final. CoPiEr Final outperforms all the baselines. Notably, the gaps are significant because optimizing over large graphs is very challenging.

**Figure 6.** Comparison of CoPiEr with other learning-based baselines and a commercial solver, Gurobi. The y-axis measures relative gaps of various methods compared with CoPiEr Final. CoPiEr Final outperforms all the baselines. Notably, the scale of problems, as measured by the number of integer variables, far exceeds that of the previous state-of-the-art method [56].

**Figure 7.** Two views for Risk-Aware Path Planning. On…

Original abstract

We study the problem of learning sequential decision-making policies in settings with multiple state-action representations. Such settings naturally arise in many domains, such as planning (e.g., multiple integer programming formulations) and various combinatorial optimization problems (e.g., those with both integer programming and graph-based formulations). Inspired by the classical co-training framework for classification, we study the problem of co-training for policy learning. We present sufficient conditions under which learning from two views can improve upon learning from a single view alone. Motivated by these theoretical insights, we present a meta-algorithm for co-training for sequential decision making. Our framework is compatible with both reinforcement learning and imitation learning. We validate the effectiveness of our approach across a wide range of tasks, including discrete/continuous control and combinatorial optimization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Co-training for policies makes sense as an idea but the experiments never check whether the stated sufficient conditions actually hold in the tested domains.

The paper adapts the co-training setup from classification to policy learning when two different state-action representations are available. It states sufficient conditions under which the two views can improve policy quality over a single view, then gives a meta-algorithm that works for both reinforcement learning and imitation learning. The framing fits domains like planning and combinatorial optimization where multiple formulations (integer programs, graphs) naturally exist, and the experiments cover discrete/continuous control plus combinatorial tasks. That breadth is useful and shows the method is not limited to toy settings. The conditions themselves are the main new element; prior single-view results are not claimed to be recovered as special cases.

The soft spot is exactly the one the stress-test flags. The theory requires specific properties on the views (complementarity, error reduction, etc.), yet the paper reports no measurements of those quantities on the actual tasks. The empirical gains therefore stand on their own as evidence that some form of co-training can help, but they do not confirm that the gains occur because the derived conditions are met. That leaves the theoretical contribution somewhat decoupled from the results.

Readers working on multi-view or multi-representation RL, or on learning for combinatorial problems, would find the meta-algorithm worth trying. The work is coherent enough on its own terms to deserve referee time rather than a desk reject, though any review would likely press on closing the theory-experiment gap.

Referee Report

2 major / 2 minor

Summary. The paper studies co-training for sequential decision-making policies when multiple state-action representations are available. It states sufficient conditions under which two views yield mutual policy improvement over a single view, introduces a meta-algorithm compatible with both reinforcement learning and imitation learning, and reports empirical results on discrete/continuous control and combinatorial optimization tasks.

Significance. If the sufficient conditions are shown to hold in the evaluated domains and the reported gains are attributable to the co-training mechanism rather than other factors, the work would supply a principled extension of classical co-training to policy learning and a practical meta-algorithm for domains that admit multiple formulations.

major comments (2)

[§3 and experimental sections] §3 (sufficient conditions): the manuscript states conditions under which two views permit mutual improvement, yet the experimental sections provide no verification that any evaluated task satisfies these conditions (e.g., no reported measurement of view disagreement, conditional independence, or the relevant error-reduction quantities). Consequently the theory does not underwrite the empirical claims.
[Algorithm 1 and §4] Algorithm 1 and §4: the meta-algorithm is motivated by the theoretical conditions, but without evidence that the conditions hold on the control and combinatorial instances, it is unclear whether the observed improvements stem from the stated mechanism or from generic regularization or ensemble effects.
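One way the missing view-disagreement measurement could be carried out is sketched below. It assumes that both views expose action distributions over a shared action set on paired states, and it uses the Jensen-Shannon divergence, maximized over states, as an illustrative disagreement score. The function names and the toy distributions are hypothetical; the report does not specify which metric the authors would use.

```python
import math
from typing import Sequence


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL divergence in nats; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: average KL to the midpoint distribution."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def max_view_disagreement(policy_a: Sequence[Sequence[float]],
                          policy_b: Sequence[Sequence[float]]) -> float:
    """Largest JS divergence between the two views over paired states."""
    return max(js_divergence(pa, pb) for pa, pb in zip(policy_a, policy_b))


# Toy action distributions on two paired states (rows) and two actions.
policy_a = [[0.5, 0.5], [1.0, 0.0]]
policy_b = [[0.5, 0.5], [0.0, 1.0]]

disagreement = max_view_disagreement(policy_a, policy_b)
print(disagreement)
```

The score is zero when the two views agree on every state and reaches log 2 (in nats) when they put all mass on different actions in some state, so it is bounded and comparable across tasks.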

minor comments (2)

[§2 and experimental setup] Notation for the two views is introduced without an explicit table comparing the state-action representations used in each experimental domain.
[Abstract] The abstract claims validation 'across a wide range of tasks' but does not list the precise environments or problem sizes; a compact table would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on strengthening the link between the theoretical conditions and the empirical evaluation. We address each major comment below and will revise the manuscript to incorporate additional analysis and controls.

Referee: [§3 and experimental sections] §3 (sufficient conditions): the manuscript states conditions under which two views permit mutual improvement, yet the experimental sections provide no verification that any evaluated task satisfies these conditions (e.g., no reported measurement of view disagreement, conditional independence, or the relevant error-reduction quantities). Consequently the theory does not underwrite the empirical claims.

Authors: We agree this is a substantive gap. The sufficient conditions in §3 are presented as motivation for the meta-algorithm rather than as a claim that they hold exactly on every evaluated task. In the revision we will add a new subsection that reports empirical measurements of view disagreement and the relevant error-reduction quantities on the discrete/continuous control and combinatorial optimization tasks. This will allow readers to assess the degree to which the theoretical conditions are approximately satisfied and will clarify the extent to which the observed gains are consistent with the stated mechanism.

revision: yes

Referee: [Algorithm 1 and §4] Algorithm 1 and §4: the meta-algorithm is motivated by the theoretical conditions, but without evidence that the conditions hold on the control and combinatorial instances, it is unclear whether the observed improvements stem from the stated mechanism or from generic regularization or ensemble effects.

Authors: This concern is valid. To isolate the contribution of the co-training procedure, the revised version will include additional ablation studies that compare the full co-training meta-algorithm against (i) single-view baselines, (ii) simple ensembles of independently trained policies, and (iii) regularization-only variants that do not exchange information between views. We will also report the same disagreement and error-reduction metrics on these ablations so that any performance differences can be more directly attributed to the mutual-improvement mechanism described in §3.

revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper states sufficient conditions under which two-view co-training improves single-view policy learning and motivates a meta-algorithm from those conditions. No equations, fitted parameters, or predictions are exhibited that reduce to the inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems, no ansatzes are smuggled via prior work, and no renaming of known results occurs. The derivation chain is self-contained against external benchmarks because the conditions are presented as independent theoretical statements rather than tautological redefinitions of the algorithm's outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only; the central claim rests on the existence of complementary views and the validity of the (unspecified) sufficient conditions for improvement.

axioms (1)

domain assumption: Multiple distinct state-action representations exist and supply complementary information for the same underlying decision process.
Invoked by the co-training setup for sequential decision making.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages

[1] Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In International Conference on Machine Learning, 2004.
[2] Tobias Achterberg. SCIP: solving constraint integer programs. Mathematical Programming Computation, 2009.
[3] Maria-Florina Balcan, Avrim Blum, and Ke Yang. Co-training and expansion: Towards bridging theory and practice. In Neural information processing systems, 2005.
[4] Mislav Balunovic, Pavol Bielik, and Martin Vechev. Learning to solve smt formulas. In Neural Information Processing Systems, 2018.
[5] Avrim Blum and Yishay Mansour. Efficient co-training of linear separators under weak dependence. In Conference on Learning Theory, 2017.
[6] Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In Conference on Learning Theory, 1998.
[7] Endre Boros and Peter L Hammer. The max-cut problem and quadratic 0–1 optimization; polyhedral aspects, relaxations and bounds. Annals of Operations Research, 1991.
[8] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv, 2016.
[9] Chris Burges, Erin Renshaw, and Matt Deeds. Learning to rank using gradient descent. In International conference on Machine learning, 1998.
[10] Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daume, and John Langford. Learning to search better than your teacher. In International Conference on Machine Learning, 2015.
[11] Minmin Chen, Kilian Q Weinberger, and John Blitzer. Co-training for domain adaptation. In Neural information processing systems, 2011.
[12] Ching-An Cheng, Xinyan Yan, Nolan Wagener, and Byron Boots. Fast policy learning through imitation and reinforcement. In Conference on Uncertainty in Artificial Intelligence, 2018.
[13] Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.
[14] Hanjun Dai, Bo Dai, and Le Song. Discriminative Embeddings of Latent Variable Models for Structured Data. In International Conference on Machine Learning, pages 1–23, 2016.
[15] Hanjun Dai, Elias B Khalil, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. In Neural Information Processing Systems, 2017.
[16] Sanjoy Dasgupta, Michael L Littman, and David A McAllester. Pac generalization bounds for co-training. In Neural information processing systems, 2002.
[17] Hal Daumé, John Langford, and Daniel Marcu. Search-based structured prediction. Machine learning, 2009.
[18] Wenceslas Fernandez de la Vega and Claire Kenyon-Mathieu. Linear programming relaxations of maxcut. In ACM-SIAM symposium on Discrete algorithms, 2007.
[19] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, 2016.
[20] Paul Erdős and Alfréd Rényi. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci, 1960.
[21] Evan Greensmith, Peter L Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 2004.
[22] LLC Gurobi Optimization. Gurobi optimizer reference manual, 2018.
[23] He He, Hal Daume III, and Jason M Eisner. Learning to search in branch and bound algorithms. In Neural information processing systems, 2014.
[24] Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. In AAAI Conference on Artificial Intelligence, 2018.
[25] Todd Hester, Olivier Pietquin, Marc Lanctot, Tom Schaul, Dan Horgan, John Quan, Andrew Sendonaris, Ian Osband, Gabriel Dulac-arnold, John Agapiou, and Joel Z Leibo. Deep Q-Learning from Demonstrations. In AAAI Conference on Artificial Intelligence, 2018.
[26] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Neural Information Processing Systems, 2016.
[27] Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 2017.
[28] Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, 2002.
[29] Bingyi Kang, Zequn Jie, and Jiashi Feng. Policy optimization with demonstrations. In International Conference on Machine Learning, 2018.
[30] Elias Boutros Khalil, Pierre Le Bodic, Le Song, George L Nemhauser, and Bistra N Dilkina. Learning to branch in mixed integer programming. In AAAI Conference on Artificial Intelligence, 2016.
[31] Svetlana Kiritchenko and Stan Matwin. Email classification with co-training. In Conference of the Center for Advanced Studies on Collaborative Research, 2011.
[32] Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 2013.
[33] Abhishek Kumar and Hal Daumé. A co-training approach for multi-view spectral clustering. In International Conference on Machine Learning, 2011.
[34] Guillaume Lample and Devendra Singh Chaplot. Playing fps games with deep reinforcement learning. In AAAI Conference on Artificial Intelligence, 2017.
[35] Ailsa H Land and Alison G Doig. An automatic method for solving discrete programming problems. In 50 Years of Integer Programming 1958-2008, pages 105–
[36] Hoang Le, Nan Jiang, Alekh Agarwal, Miroslav Dudik, Yisong Yue, and Hal Daumé. Hierarchical imitation and reinforcement learning. In International Conference on Machine Learning, 2018.
[37] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 2016.
[38] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2016.
[39] Jeff Linderoth. A simplicial branch-and-bound algorithm for solving quadratically constrained quadratic programs. Mathematical programming, 2005.
[40] Jialu Liu, Chi Wang, Jing Gao, and Jiawei Han. Multi-view clustering via joint nonnegative matrix factorization. In SIAM International Conference on Data Mining, 2013.
[41] Azalia Mirhoseini, Hieu Pham, Quoc V Le, Benoit Steiner, Rasmus Larsen, Yuefeng Zhou, Naveen Kumar, Mohammad Norouzi, Samy Bengio, and Jeff Dean. Device placement optimization with reinforcement learning. In International Conference on Machine Learning, 2017.
[42] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv, 2013.
[43] Ashvin Nair, Bob McGrew, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Overcoming exploration in reinforcement learning with demonstrations. In International Conference on Robotics and Automation, 2018.
[44] Kamal Nigam and Rayid Ghani. Analyzing the effectiveness and applicability of co-training. In ACM Conference on Information and knowledge Management, 2000.
[45] Temel Öncan, İ Kuban Altınel, and Gilbert Laporte. A comparative analysis of several asymmetric traveling salesman problem formulations. Computers & Operations Research, 2009.
[46] Masahiro Ono and Brian C Williams. An efficient motion planning algorithm for stochastic dynamic systems with constraints on probability of failure. In AAAI Conference on Artificial Intelligence, 2008.
[47] AJ Orman and HP Williams. A survey of different integer programming formulations of the travelling salesman problem. In Optimisation, econometric and financial analysis. Springer, 2007.
[48] Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
[49] Stéphane Ross and Drew Bagnell. Efficient reductions for imitation learning. In International Conference on Artificial Intelligence and Statistics, 2010.
[50] Stephane Ross and J Andrew Bagnell. Reinforcement and imitation learning via interactive no-regret learning. arXiv, 2014.
[51] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics, 2011.
[52] John Schulman, Sergey Levine, Philipp Moritz, Michael Jordan, and Pieter Abbeel. Trust region policy optimization. In International Conference on Machine Learning, 2015.
[53] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv, 2017.
[54] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 2016.
[55] Vikas Sindhwani, Partha Niyogi, and Mikhail Belkin. A co-regularization approach to semi-supervised learning with multiple views. In ICML workshop on learning with multiple views, 2005.
[56] Jialin Song, Ravi Lanka, Albert Zhao, Aadyot Bhatnagar, Yisong Yue, and Masahiro Ono. Learning to search via retrospective imitation. arXiv, 2018.
[57] Bradly C Stadie, Pieter Abbeel, and Ilya Sutskever. Third-person imitation learning. arXiv, 2017.
[58] Wen Sun, Arun Venkatraman, Geoffrey J Gordon, Byron Boots, and J Andrew Bagnell. Deeply aggrevated: Differentiable imitation learning for sequential prediction. In International Conference on Machine Learning, 2017.
[59] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Neural information processing systems, 2000.
[60] Umar Syed, Michael Bowling, and Robert E Schapire. Apprenticeship learning using linear programming. In International Conference on Machine Learning, 2008.
[61] Umar Syed and Robert E Schapire. A game-theoretic approach to apprenticeship learning. In Neural information processing systems, 2008.
[62] Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems, 2012.
[63] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. In AAAI Conference on Artificial Intelligence, 2016.
[64] Xiaojun Wan. Co-training for cross-lingual sentiment classification. In Joint conference of ACL and IJCNLP. Association for Computational Linguistics, 2009.
[65] Wei Wang and Zhi-Hua Zhou. A new analysis of co-training. In International Conference on Machine Learning, 2010.
[66] Wei Wang and Zhi-Hua Zhou. Co-training with insufficient views. In Asian conference on machine learning, 2013.
[67] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Hasselt, Marc Lanctot, and Nando Freitas. Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning, 2016.
[68] Brian Ziebart, Andrew Maas, J Andrew Bagnell, and Anind Dey. Maximum entropy inverse reinforcement learning. In AAAI Conference on Artificial Intelligence, 2008.

8 APPENDIX · 8.1 PROOFS · Proof for Proposition 1: We show that $\max_s D_{JS}(\pi_B(s) \,\|\, \pi_A(s))$ is well-defined for an MDP $M$ with two representations $M_A$ and $M_B$. From Theorem 1, we know the distribution ...

[1] [1]

Apprenticeship learn- ing via inverse reinforcement learning

Pieter Abbeel and Andrew Y Ng. Apprenticeship learn- ing via inverse reinforcement learning. In International Conference on Machine Learning, 2004

work page 2004

[2] [2]

SCIP : solving constraint integer pro- grams

Tobias Achterberg. SCIP : solving constraint integer pro- grams. Mathematical Programming Computation, 2009

work page 2009

[3] [3]

Co- training and expansion: Towards bridging theory and practice

Maria-Florina Balcan, Avrim Blum, and Ke Yang. Co- training and expansion: Towards bridging theory and practice. In Neural information processing systems, 2005

work page 2005

[4] [4]

Learning to solve smt formulas

Mislav Balunovic, Pavol Bielik, and Martin Vechev. Learning to solve smt formulas. In Neural Information Processing Systems, 2018

work page 2018

[5] [5]

Efﬁcient co-training of linear separators under weak dependence

Avrim Blum and Yishay Mansour. Efﬁcient co-training of linear separators under weak dependence. In Conference on Learning Theory, 2017

work page 2017

[6] [6]

Combining labeled and unlabeled data with co-training

Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In Conference on Learn- ing Theory, 1998

work page 1998

[7] [7]

The max-cut prob- lem and quadratic 0–1 optimization; polyhedral aspects, relaxations and bounds

Endre Boros and Peter L Hammer. The max-cut prob- lem and quadratic 0–1 optimization; polyhedral aspects, relaxations and bounds. Annals of Operations Research, 1991

work page 1991

[8] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv, 2016.

[9] Chris Burges, Erin Renshaw, and Matt Deeds. Learning to rank using gradient descent. In International Conference on Machine Learning, 1998.

[10] Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daume, and John Langford. Learning to search better than your teacher. In International Conference on Machine Learning, 2015.

[11] Minmin Chen, Kilian Q Weinberger, and John Blitzer. Co-training for domain adaptation. In Neural Information Processing Systems, 2011.

[12] Ching-An Cheng, Xinyan Yan, Nolan Wagener, and Byron Boots. Fast policy learning through imitation and reinforcement. In Conference on Uncertainty in Artificial Intelligence, 2018.

[13] Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.

[14] Hanjun Dai, Bo Dai, and Le Song. Discriminative Embeddings of Latent Variable Models for Structured Data. In International Conference on Machine Learning, pages 1–23, 2016.

[15] Hanjun Dai, Elias B Khalil, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. In Neural Information Processing Systems, 2017.

[16] Sanjoy Dasgupta, Michael L Littman, and David A McAllester. PAC generalization bounds for co-training. In Neural Information Processing Systems, 2002.

[17] Hal Daumé, John Langford, and Daniel Marcu. Search-based structured prediction. Machine Learning, 2009.

[18] Wenceslas Fernandez de la Vega and Claire Kenyon-Mathieu. Linear programming relaxations of maxcut. In ACM-SIAM Symposium on Discrete Algorithms, 2007.

[19] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, 2016.

[20] Paul Erdős and Alfréd Rényi. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci, 1960.

[21] Evan Greensmith, Peter L Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 2004.

[22] Gurobi Optimization, LLC. Gurobi optimizer reference manual, 2018.

[23] He He, Hal Daume III, and Jason M Eisner. Learning to search in branch and bound algorithms. In Neural Information Processing Systems, 2014.

[24] Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. In AAAI Conference on Artificial Intelligence, 2018.

[25] Todd Hester, Olivier Pietquin, Marc Lanctot, Tom Schaul, Dan Horgan, John Quan, Andrew Sendonaris, Ian Osband, Gabriel Dulac-arnold, John Agapiou, and Joel Z Leibo. Deep Q-Learning from Demonstrations. In AAAI Conference on Artificial Intelligence, 2018.

[26] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Neural Information Processing Systems, 2016.

[27] Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 2017.

[28] Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, 2002.

[29] Bingyi Kang, Zequn Jie, and Jiashi Feng. Policy optimization with demonstrations. In International Conference on Machine Learning, 2018.

[30] Elias Boutros Khalil, Pierre Le Bodic, Le Song, George L Nemhauser, and Bistra N Dilkina. Learning to branch in mixed integer programming. In AAAI Conference on Artificial Intelligence, 2016.

[31] Svetlana Kiritchenko and Stan Matwin. Email classification with co-training. In Conference of the Center for Advanced Studies on Collaborative Research, 2011.

[32] Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 2013.

[33] Abhishek Kumar and Hal Daumé. A co-training approach for multi-view spectral clustering. In International Conference on Machine Learning, 2011.

[34] Guillaume Lample and Devendra Singh Chaplot. Playing FPS games with deep reinforcement learning. In AAAI Conference on Artificial Intelligence, 2017.

[35] Ailsa H Land and Alison G Doig. An automatic method for solving discrete programming problems. In 50 Years of Integer Programming 1958-2008, pages 105–

[36] Hoang Le, Nan Jiang, Alekh Agarwal, Miroslav Dudik, Yisong Yue, and Hal Daumé. Hierarchical imitation and reinforcement learning. In International Conference on Machine Learning, 2018.

[37] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 2016.

[38] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2016.

[39] Jeff Linderoth. A simplicial branch-and-bound algorithm for solving quadratically constrained quadratic programs. Mathematical Programming, 2005.

[40] Jialu Liu, Chi Wang, Jing Gao, and Jiawei Han. Multi-view clustering via joint nonnegative matrix factorization. In SIAM International Conference on Data Mining, 2013.

[41] Azalia Mirhoseini, Hieu Pham, Quoc V Le, Benoit Steiner, Rasmus Larsen, Yuefeng Zhou, Naveen Kumar, Mohammad Norouzi, Samy Bengio, and Jeff Dean. Device placement optimization with reinforcement learning. In International Conference on Machine Learning, 2017.

[42] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv, 2013.

[43] Ashvin Nair, Bob McGrew, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Overcoming exploration in reinforcement learning with demonstrations. In International Conference on Robotics and Automation, 2018.

[44] Kamal Nigam and Rayid Ghani. Analyzing the effectiveness and applicability of co-training. In ACM Conference on Information and Knowledge Management, 2000.

[45] Temel Öncan, İ Kuban Altınel, and Gilbert Laporte. A comparative analysis of several asymmetric traveling salesman problem formulations. Computers & Operations Research, 2009.

[46] Masahiro Ono and Brian C Williams. An efficient motion planning algorithm for stochastic dynamic systems with constraints on probability of failure. In AAAI Conference on Artificial Intelligence, 2008.

[47] AJ Orman and HP Williams. A survey of different integer programming formulations of the travelling salesman problem. In Optimisation, Econometric and Financial Analysis. Springer, 2007.

[48] Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.

[49] Stéphane Ross and Drew Bagnell. Efficient reductions for imitation learning. In International Conference on Artificial Intelligence and Statistics, 2010.

[50] Stéphane Ross and J Andrew Bagnell. Reinforcement and imitation learning via interactive no-regret learning. arXiv, 2014.

[51] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics, 2011.

[52] John Schulman, Sergey Levine, Philipp Moritz, Michael Jordan, and Pieter Abbeel. Trust region policy optimization. In International Conference on Machine Learning, 2015.

[53] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv, 2017.

[54] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.

[55] Vikas Sindhwani, Partha Niyogi, and Mikhail Belkin. A co-regularization approach to semi-supervised learning with multiple views. In ICML Workshop on Learning with Multiple Views, 2005.

[56] Jialin Song, Ravi Lanka, Albert Zhao, Aadyot Bhatnagar, Yisong Yue, and Masahiro Ono. Learning to search via retrospective imitation. arXiv, 2018.

[57] Bradly C Stadie, Pieter Abbeel, and Ilya Sutskever. Third-person imitation learning. arXiv, 2017.

[58] Wen Sun, Arun Venkatraman, Geoffrey J Gordon, Byron Boots, and J Andrew Bagnell. Deeply AggreVaTeD: Differentiable imitation learning for sequential prediction. In International Conference on Machine Learning, 2017.

[59] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Neural Information Processing Systems, 2000.

[60] Umar Syed, Michael Bowling, and Robert E Schapire. Apprenticeship learning using linear programming. In International Conference on Machine Learning, 2008.

[61] Umar Syed and Robert E Schapire. A game-theoretic approach to apprenticeship learning. In Neural Information Processing Systems, 2008.

[62] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems, 2012.

[63] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In AAAI Conference on Artificial Intelligence, 2016.

[64] Xiaojun Wan. Co-training for cross-lingual sentiment classification. In Joint Conference of ACL and IJCNLP. Association for Computational Linguistics, 2009.

[65] Wei Wang and Zhi-Hua Zhou. A new analysis of co-training. In International Conference on Machine Learning, 2010.

[66] Wei Wang and Zhi-Hua Zhou. Co-training with insufficient views. In Asian Conference on Machine Learning, 2013.

[67] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Hasselt, Marc Lanctot, and Nando Freitas. Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning, 2016.

[68] Brian Ziebart, Andrew Maas, J Andrew Bagnell, and Anind Dey. Maximum entropy inverse reinforcement learning. In AAAI Conference on Artificial Intelligence, 2008.

8 APPENDIX

8.1 PROOFS

Proof for Proposition 1:

Proof. We show that max_s D_JS(πB(s) ∥ πA(s)) is well-defined for an MDP M with two representations MA and MB. From Theorem 1, we know the distribution ...