arxiv: 1907.04484 · v1 · pith:47KLRIIMnew · submitted 2019-07-03 · 💻 cs.LG · cs.AI · stat.ML

Co-training for Policy Learning

Pith reviewed 2026-05-25 10:16 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML
keywords: co-training, policy learning, reinforcement learning, imitation learning, sequential decision making, multiple representations, combinatorial optimization

The pith

Sufficient conditions allow co-training from two state-action views to improve policies over single-view learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies sequential decision-making policies when multiple state-action representations exist, as occurs in planning with different integer programming formulations or combinatorial problems with both integer and graph views. It adapts the classical co-training idea from classification to derive sufficient conditions where two views produce better policies than either alone. A meta-algorithm is introduced that works with reinforcement learning and imitation learning, and experiments demonstrate gains on discrete and continuous control tasks plus combinatorial optimization problems.
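
As a concrete illustration of the "integer and graph views" mentioned above, the following minimal sketch encodes one small minimum vertex cover instance both ways. The helper names (`ilp_view`, `is_cover_graph_view`, `is_feasible_ilp_view`) and the 4-vertex path graph are illustrative assumptions, not taken from the paper; the ILP is the standard textbook formulation (minimize the number of selected vertices subject to every edge having at least one selected endpoint).

```python
def ilp_view(num_vertices, edges):
    """Return (c, A, b) for: minimize c.x subject to A x >= b, x binary."""
    c = [1] * num_vertices              # each selected vertex costs 1
    A = []
    for u, v in edges:
        row = [0] * num_vertices
        row[u] = 1                      # constraint x_u + x_v >= 1
        row[v] = 1
        A.append(row)
    b = [1] * len(edges)
    return c, A, b


def is_cover_graph_view(edges, cover):
    """Graph view: every edge touches at least one vertex in the cover."""
    return all(u in cover or v in cover for u, v in edges)


def is_feasible_ilp_view(A, b, x):
    """ILP view: every constraint row satisfies A x >= b."""
    return all(
        sum(a_i * x_i for a_i, x_i in zip(row, x)) >= rhs
        for row, rhs in zip(A, b)
    )


# Hypothetical instance: the path graph 0 - 1 - 2 - 3.
edges = [(0, 1), (1, 2), (2, 3)]
cover = {1, 2}
x = [1 if v in cover else 0 for v in range(4)]

c, A, b = ilp_view(4, edges)
graph_ok = is_cover_graph_view(edges, cover)
ilp_ok = is_feasible_ilp_view(A, b, x)
cost = sum(c_i * x_i for c_i, x_i in zip(c, x))
```

The same candidate solution is checked under both encodings, which is the sense in which the two views describe one underlying problem.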

Core claim

Under sufficient conditions, learning from two complementary state-action representations improves upon learning from a single representation alone, and this improvement is realized by a meta-algorithm for co-training that is compatible with both reinforcement learning and imitation learning.

What carries the argument

The co-training meta-algorithm that alternates training between two views to mutually refine the policy.
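
The alternation described above can be illustrated with a deliberately small toy. This is not the paper's Algorithm 1: it assumes single-step tasks with known rewards, tabular score-based policies, and an identity mapping between the two views' action spaces, and every name in it (`rollout`, `imitate`, `co_train`) is hypothetical. On each task both policies act, and whichever did better supplies a demonstration that the other imitates.

```python
def rollout(policy, task):
    """Greedy action under a tabular policy: argmax of per-action scores."""
    scores = policy[task]
    return max(range(len(scores)), key=lambda a: scores[a])


def imitate(policy, task, action, step=1.0):
    """Imitation-style update: raise the score of the demonstrated action."""
    policy[task][action] += step


def co_train(policy_a, policy_b, rewards, rounds=3):
    """Toy exchange loop: the better view demonstrates for the other."""
    for _ in range(rounds):
        for task, task_rewards in enumerate(rewards):
            act_a = rollout(policy_a, task)
            act_b = rollout(policy_b, task)
            if task_rewards[act_a] > task_rewards[act_b]:
                imitate(policy_b, task, act_a)   # A teaches B
            elif task_rewards[act_b] > task_rewards[act_a]:
                imitate(policy_a, task, act_b)   # B teaches A
    return policy_a, policy_b


# Hypothetical setup: two tasks, two actions; rewards[task][action].
rewards = [[1.0, 0.0], [0.0, 1.0]]
policy_a = [[1.0, 0.0], [1.0, 0.0]]   # view A: right on task 0 only
policy_b = [[0.0, 1.0], [0.0, 1.0]]   # view B: right on task 1 only

policy_a, policy_b = co_train(policy_a, policy_b, rewards)
actions_a = [rollout(policy_a, t) for t in range(2)]
actions_b = [rollout(policy_b, t) for t in range(2)]
```

In this toy, view A starts out correct only on task 0 and view B only on task 1; after a few rounds each has picked up the other's strength. Whether real view pairs behave this way is exactly what the paper's sufficient conditions are meant to address.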

If this is right

  • Policy quality improves when multiple formulations of the same decision problem are available and used together.
  • The meta-algorithm applies equally to reinforcement learning and imitation learning settings.
  • Gains are observed on both discrete/continuous control and combinatorial optimization tasks.
  • The approach directly extends co-training concepts from classification into sequential decision making.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Identifying pairs of complementary representations may be a practical bottleneck even when the theoretical conditions hold.
  • The framework could be extended to more than two views if the mutual-improvement conditions generalize.
  • Sample efficiency in reinforcement learning might increase when dual representations reduce the need for exploration in each view separately.

Load-bearing premise

The two representations supply complementary information that permits each view to improve the policy quality obtained from the other.

What would settle it

A controlled experiment in which two views satisfy the stated conditions yet produce no measurable improvement in policy quality over the single-view baseline.

Figures

Figures reproduced from arXiv: 1907.04484 by Jialin Song, Masahiro Ono, Ravi Lanka, Yisong Yue.

Figure 1
Figure 1: Two ways to encode minimum vertex cover (MVC) problems. Left: policies learn to operate directly on the graph view to find the minimal cover set [30]. Right: we express MVC as an integer linear program, then policies learn to traverse the resulting combinatorial search space, i.e., learn to branch-and-bound [23, 56].
Figure 3
Figure 3: Graphical model encodes the conditional inde…
Figure 4
Figure 4: Discrete & continuous control tasks. Experiment results are across 5 randomly seeded runs. Shaded area…
Figure 5
Figure 5: Comparison of CoPiEr with other learning-based baselines and a commercial solver, Gurobi. The y-axis measures relative gaps of various methods compared with CoPiEr Final. CoPiEr Final outperforms all the baselines. Notably, the gaps are significant because optimizing over large graphs is very challenging.
Figure 6
Figure 6: Comparison of CoPiEr with other learning-based baselines and a commercial solver, Gurobi. The y-axis measures relative gaps of various methods compared with CoPiEr Final. CoPiEr Final outperforms all the baselines. Notably, the scale of problems, as measured by the number of integer variables, far exceeds that of the previous state-of-the-art method [56].
Figure 7
Figure 7: Two views for Risk-Aware Path Planning. On…
Original abstract

We study the problem of learning sequential decision-making policies in settings with multiple state-action representations. Such settings naturally arise in many domains, such as planning (e.g., multiple integer programming formulations) and various combinatorial optimization problems (e.g., those with both integer programming and graph-based formulations). Inspired by the classical co-training framework for classification, we study the problem of co-training for policy learning. We present sufficient conditions under which learning from two views can improve upon learning from a single view alone. Motivated by these theoretical insights, we present a meta-algorithm for co-training for sequential decision making. Our framework is compatible with both reinforcement learning and imitation learning. We validate the effectiveness of our approach across a wide range of tasks, including discrete/continuous control and combinatorial optimization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper studies co-training for sequential decision-making policies when multiple state-action representations are available. It states sufficient conditions under which two views yield mutual policy improvement over a single view, introduces a meta-algorithm compatible with both reinforcement learning and imitation learning, and reports empirical results on discrete/continuous control and combinatorial optimization tasks.

Significance. If the sufficient conditions are shown to hold in the evaluated domains and the reported gains are attributable to the co-training mechanism rather than other factors, the work would supply a principled extension of classical co-training to policy learning and a practical meta-algorithm for domains that admit multiple formulations.

major comments (2)
  1. [§3 and experimental sections] §3 (sufficient conditions): the manuscript states conditions under which two views permit mutual improvement, yet the experimental sections provide no verification that any evaluated task satisfies these conditions (e.g., no reported measurement of view disagreement, conditional independence, or the relevant error-reduction quantities). Consequently the theory does not underwrite the empirical claims.
  2. [Algorithm 1 and §4] Algorithm 1 and §4: the meta-algorithm is motivated by the theoretical conditions, but without evidence that the conditions hold on the control and combinatorial instances, it is unclear whether the observed improvements stem from the stated mechanism or from generic regularization or ensemble effects.
minor comments (2)
  1. [§2 and experimental setup] Notation for the two views is introduced without an explicit table comparing the state-action representations used in each experimental domain.
  2. [Abstract] The abstract claims validation 'across a wide range of tasks' but does not list the precise environments or problem sizes; a compact table would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on strengthening the link between the theoretical conditions and the empirical evaluation. We address each major comment below and will revise the manuscript to incorporate additional analysis and controls.

  1. Referee: [§3 and experimental sections] §3 (sufficient conditions): the manuscript states conditions under which two views permit mutual improvement, yet the experimental sections provide no verification that any evaluated task satisfies these conditions (e.g., no reported measurement of view disagreement, conditional independence, or the relevant error-reduction quantities). Consequently the theory does not underwrite the empirical claims.

    Authors: We agree this is a substantive gap. The sufficient conditions in §3 are presented as motivation for the meta-algorithm rather than as a claim that they hold exactly on every evaluated task. In the revision we will add a new subsection that reports empirical measurements of view disagreement and the relevant error-reduction quantities on the discrete/continuous control and combinatorial optimization tasks. This will allow readers to assess the degree to which the theoretical conditions are approximately satisfied and will clarify the extent to which the observed gains are consistent with the stated mechanism. revision: yes

  2. Referee: [Algorithm 1 and §4] Algorithm 1 and §4: the meta-algorithm is motivated by the theoretical conditions, but without evidence that the conditions hold on the control and combinatorial instances, it is unclear whether the observed improvements stem from the stated mechanism or from generic regularization or ensemble effects.

    Authors: This concern is valid. To isolate the contribution of the co-training procedure, the revised version will include additional ablation studies that compare the full co-training meta-algorithm against (i) single-view baselines, (ii) simple ensembles of independently trained policies, and (iii) regularization-only variants that do not exchange information between views. We will also report the same disagreement and error-reduction metrics on these ablations so that any performance differences can be more directly attributed to the mutual-improvement mechanism described in §3. revision: yes
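
One way the promised view-disagreement measurement could be computed is sketched below, assuming both policies expose discrete action distributions over a shared (already mapped) action space for the same states. The function names and the two-state example are hypothetical; the paper's own metrics are not specified in the text here. The max over states mirrors the max_s D_JS quantity that appears in the appendix excerpt at the end of this page.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log), symmetric and at most log 2."""
    m = [(p_i + q_i) / 2 for p_i, q_i in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def max_disagreement(policy_a, policy_b):
    """Largest per-state JS divergence between two policies' action distributions."""
    return max(js_divergence(policy_a[s], policy_b[s]) for s in policy_a)


# Hypothetical policies: state -> action distribution.
policy_a = {"s0": [0.5, 0.5], "s1": [1.0, 0.0]}
policy_b = {"s0": [0.5, 0.5], "s1": [0.0, 1.0]}

worst = max_disagreement(policy_a, policy_b)
```

Here the two policies agree on `s0` and fully disagree on `s1`, so the maximum equals log 2, the upper bound of the divergence. Reporting such a number alongside each ablation would be one way to connect observed gains to the amount of disagreement between views.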

Circularity Check

0 steps flagged

No significant circularity detected


The paper states sufficient conditions under which two-view co-training improves single-view policy learning and motivates a meta-algorithm from those conditions. No equations, fitted parameters, or predictions are exhibited that reduce to the inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems, no ansatzes are smuggled via prior work, and no renaming of known results occurs. The derivation chain is self-contained against external benchmarks because the conditions are presented as independent theoretical statements rather than tautological redefinitions of the algorithm's outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only; the central claim rests on the existence of complementary views and the validity of the (unspecified) sufficient conditions for improvement.

axioms (1)
  • domain assumption: Multiple distinct state-action representations exist and supply complementary information for the same underlying decision process.
    Invoked by the co-training setup for sequential decision making.

pith-pipeline@v0.9.0 · 5659 in / 1083 out tokens · 19900 ms · 2026-05-25T10:16:28.603101+00:00 · methodology


Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages

  1. [1]

    Apprenticeship learning via inverse reinforcement learning

    Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In International Conference on Machine Learning, 2004

  2. [2]

    SCIP: solving constraint integer programs

    Tobias Achterberg. SCIP: solving constraint integer programs. Mathematical Programming Computation, 2009

  3. [3]

    Co-training and expansion: Towards bridging theory and practice

    Maria-Florina Balcan, Avrim Blum, and Ke Yang. Co-training and expansion: Towards bridging theory and practice. In Neural information processing systems, 2005

  4. [4]

    Learning to solve smt formulas

    Mislav Balunovic, Pavol Bielik, and Martin Vechev. Learning to solve smt formulas. In Neural Information Processing Systems, 2018

  5. [5]

    Efficient co-training of linear separators under weak dependence

    Avrim Blum and Yishay Mansour. Efficient co-training of linear separators under weak dependence. In Conference on Learning Theory, 2017

  6. [6]

    Combining labeled and unlabeled data with co-training

    Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In Conference on Learning Theory, 1998

  7. [7]

    The max-cut problem and quadratic 0–1 optimization; polyhedral aspects, relaxations and bounds

    Endre Boros and Peter L Hammer. The max-cut problem and quadratic 0–1 optimization; polyhedral aspects, relaxations and bounds. Annals of Operations Research, 1991

  8. [8]

    OpenAI Gym

    Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv, 2016

  9. [9]

    Learning to rank using gradient descent

    Chris Burges, Erin Renshaw, and Matt Deeds. Learning to rank using gradient descent. In International conference on Machine learning, 1998

  10. [10]

    Learning to search better than your teacher

    Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daume, and John Langford. Learning to search better than your teacher. In International Conference on Machine Learning, 2015

  11. [11]

    Co-training for domain adaptation

    Minmin Chen, Kilian Q Weinberger, and John Blitzer. Co-training for domain adaptation. In Neural information processing systems, 2011

  12. [12]

    Fast policy learning through imitation and reinforcement

    Ching-An Cheng, Xinyan Yan, Nolan Wagener, and Byron Boots. Fast policy learning through imitation and reinforcement. In Conference on Uncertainty in Artificial Intelligence, 2018

  13. [13]

    Elements of information theory

    Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012

  14. [14]

    Discriminative Embeddings of Latent Variable Models for Structured Data

    Hanjun Dai, Bo Dai, and Le Song. Discriminative Embeddings of Latent Variable Models for Structured Data. In International Conference on Machine Learning, pages 1–23, 2016

  15. [15]

    Learning combinatorial optimization algorithms over graphs

    Hanjun Dai, Elias B Khalil, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. In Neural Information Processing Systems, 2017

  16. [16]

    Pac generalization bounds for co-training

    Sanjoy Dasgupta, Michael L Littman, and David A McAllester. Pac generalization bounds for co-training. In Neural information processing systems, 2002

  17. [17]

    Search-based structured prediction

    Hal Daumé, John Langford, and Daniel Marcu. Search-based structured prediction. Machine learning, 2009

  18. [18]

    Linear programming relaxations of maxcut

    Wenceslas Fernandez de la Vega and Claire Kenyon-Mathieu. Linear programming relaxations of maxcut. In ACM-SIAM symposium on Discrete algorithms, 2007

  19. [19]

    Benchmarking deep reinforcement learning for continuous control

    Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, 2016

  20. [20]

    On the evolution of random graphs

    Paul Erdős and Alfréd Rényi. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci., 1960

  21. [21]

    Variance reduction techniques for gradient estimates in reinforcement learning

    Evan Greensmith, Peter L Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 2004

  22. [22]

    Gurobi optimizer reference manual, 2018

    LLC Gurobi Optimization. Gurobi optimizer reference manual, 2018

  23. [23]

    Learning to search in branch and bound algorithms

    He He, Hal Daume III, and Jason M Eisner. Learning to search in branch and bound algorithms. In Neural information processing systems, 2014

  24. [24]

    Deep reinforcement learning that matters

    Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. In AAAI Conference on Artificial Intelligence, 2018

  25. [25]

    Deep Q-Learning from Demonstrations

    Todd Hester, Olivier Pietquin, Marc Lanctot, Tom Schaul, Dan Horgan, John Quan, Andrew Sendonaris, Ian Osband, Gabriel Dulac-Arnold, John Agapiou, and Joel Z Leibo. Deep Q-Learning from Demonstrations. In AAAI Conference on Artificial Intelligence, 2018

  26. [26]

    Generative adversarial imitation learning

    Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Neural Information Processing Systems, 2016

  27. [27]

    Google's multilingual neural machine translation system: Enabling zero-shot translation

    Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 2017

  28. [28]

    Approximately optimal approximate reinforcement learning

    Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, 2002

  29. [29]

    Policy optimization with demonstrations

    Bingyi Kang, Zequn Jie, and Jiashi Feng. Policy optimization with demonstrations. In International Conference on Machine Learning, 2018

  30. [30]

    Learning to branch in mixed integer programming

    Elias Boutros Khalil, Pierre Le Bodic, Le Song, George L Nemhauser, and Bistra N Dilkina. Learning to branch in mixed integer programming. In AAAI Conference on Artificial Intelligence, 2016

  31. [31]

    Email classification with co-training

    Svetlana Kiritchenko and Stan Matwin. Email classification with co-training. In Conference of the Center for Advanced Studies on Collaborative Research, 2011

  32. [32]

    Reinforcement learning in robotics: A survey

    Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 2013

  33. [33]

    A co-training approach for multi-view spectral clustering

    Abhishek Kumar and Hal Daumé. A co-training approach for multi-view spectral clustering. In International Conference on Machine Learning, 2011

  34. [34]

    Playing FPS games with deep reinforcement learning

    Guillaume Lample and Devendra Singh Chaplot. Playing FPS games with deep reinforcement learning. In AAAI Conference on Artificial Intelligence, 2017

  35. [35]

    An automatic method for solving discrete programming problems

    Ailsa H Land and Alison G Doig. An automatic method for solving discrete programming problems. In 50 Years of Integer Programming 1958-2008, pages 105–

  36. [36]

    Hierarchical imitation and reinforcement learning

    Hoang Le, Nan Jiang, Alekh Agarwal, Miroslav Dudik, Yisong Yue, and Hal Daumé. Hierarchical imitation and reinforcement learning. In International Conference on Machine Learning, 2018

  37. [37]

    End-to-end training of deep visuomotor policies

    Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 2016

  38. [38]

    Continuous control with deep reinforcement learning

    Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2016

  39. [39]

    A simplicial branch-and-bound algorithm for solving quadratically constrained quadratic programs

    Jeff Linderoth. A simplicial branch-and-bound algorithm for solving quadratically constrained quadratic programs. Mathematical programming, 2005

  40. [40]

    Multi-view clustering via joint nonnegative matrix factorization

    Jialu Liu, Chi Wang, Jing Gao, and Jiawei Han. Multi-view clustering via joint nonnegative matrix factorization. In SIAM International Conference on Data Mining, 2013

  41. [41]

    Device placement optimization with reinforcement learning

    Azalia Mirhoseini, Hieu Pham, Quoc V Le, Benoit Steiner, Rasmus Larsen, Yuefeng Zhou, Naveen Kumar, Mohammad Norouzi, Samy Bengio, and Jeff Dean. Device placement optimization with reinforcement learning. In International Conference on Machine Learning, 2017

  42. [42]

    Playing atari with deep reinforcement learning

    Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv, 2013

  43. [43]

    Overcoming exploration in reinforcement learning with demonstrations

    Ashvin Nair, Bob McGrew, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Overcoming exploration in reinforcement learning with demonstrations. In International Conference on Robotics and Automation, 2018

  44. [44]

    Analyzing the effectiveness and applicability of co-training

    Kamal Nigam and Rayid Ghani. Analyzing the effectiveness and applicability of co-training. In ACM Conference on Information and knowledge Management, 2000

  45. [45]

    A comparative analysis of several asymmetric traveling salesman problem formulations

    Temel Öncan, İ Kuban Altınel, and Gilbert Laporte. A comparative analysis of several asymmetric traveling salesman problem formulations. Computers & Operations Research, 2009

  46. [46]

    An efficient motion planning algorithm for stochastic dynamic systems with constraints on probability of failure

    Masahiro Ono and Brian C Williams. An efficient motion planning algorithm for stochastic dynamic systems with constraints on probability of failure. In AAAI Conference on Artificial Intelligence, 2008

  47. [47]

    A survey of different integer programming formulations of the travelling salesman problem

    AJ Orman and HP Williams. A survey of different integer programming formulations of the travelling salesman problem. In Optimisation, econometric and financial analysis. Springer, 2007

  48. [48]

    Markov decision processes: discrete stochastic dynamic programming

    Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014

  49. [49]

    Efficient reductions for imitation learning

    Stéphane Ross and Drew Bagnell. Efficient reductions for imitation learning. In International Conference on Artificial Intelligence and Statistics, 2010

  50. [50]

    Reinforcement and imitation learning via interactive no-regret learning

    Stephane Ross and J Andrew Bagnell. Reinforcement and imitation learning via interactive no-regret learning. arXiv, 2014

  51. [51]

    A reduction of imitation learning and structured prediction to no-regret online learning

    Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics, 2011

  52. [52]

    Trust region policy optimization

    John Schulman, Sergey Levine, Philipp Moritz, Michael Jordan, and Pieter Abbeel. Trust region policy optimization. In International Conference on Machine Learning, 2015

  53. [53]

    Proximal policy optimization algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv, 2017

  54. [54]

    Mastering the game of go with deep neural networks and tree search

    David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 2016

  55. [55]

    A co-regularization approach to semi-supervised learning with multiple views

    Vikas Sindhwani, Partha Niyogi, and Mikhail Belkin. A co-regularization approach to semi-supervised learning with multiple views. In ICML workshop on learning with multiple views, 2005

  56. [56]

    Learning to search via retrospective imitation

    Jialin Song, Ravi Lanka, Albert Zhao, Aadyot Bhatnagar, Yisong Yue, and Masahiro Ono. Learning to search via retrospective imitation. arXiv, 2018

  57. [57]

    Third-person imitation learning

    Bradly C Stadie, Pieter Abbeel, and Ilya Sutskever. Third-person imitation learning. arXiv, 2017

  58. [58]

    Deeply aggrevated: Differentiable imitation learning for sequential prediction

    Wen Sun, Arun Venkatraman, Geoffrey J Gordon, Byron Boots, and J Andrew Bagnell. Deeply aggrevated: Differentiable imitation learning for sequential prediction. In International Conference on Machine Learning, 2017

  59. [59]

    Policy gradient methods for reinforcement learning with function approximation

    Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Neural information processing systems, 2000

  60. [60]

    Apprenticeship learning using linear programming

    Umar Syed, Michael Bowling, and Robert E Schapire. Apprenticeship learning using linear programming. In International Conference on Machine Learning, 2008

  61. [61]

    A game-theoretic approach to apprenticeship learning

    Umar Syed and Robert E Schapire. A game-theoretic approach to apprenticeship learning. In Neural information processing systems, 2008

  62. [62]

    MuJoCo: A physics engine for model-based control

    Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems, 2012

  63. [63]

    Deep reinforcement learning with double q-learning

    Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. In AAAI Conference on Artificial Intelligence, 2016

  64. [64]

    Co-training for cross-lingual sentiment classification

    Xiaojun Wan. Co-training for cross-lingual sentiment classification. In Joint conference of ACL and IJCNLP. Association for Computational Linguistics, 2009

  65. [65]

    A new analysis of co-training

    Wei Wang and Zhi-Hua Zhou. A new analysis of co-training. In International Conference on Machine Learning, 2010

  66. [66]

    Co-training with insufficient views

    Wei Wang and Zhi-Hua Zhou. Co-training with insufficient views. In Asian conference on machine learning, 2013

  67. [67]

    Dueling network architectures for deep reinforcement learning

    Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Hasselt, Marc Lanctot, and Nando Freitas. Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning, 2016

  68. [68]

    Maximum entropy inverse reinforcement learning

    Brian Ziebart, Andrew Maas, J Andrew Bagnell, and Anind Dey. Maximum entropy inverse reinforcement learning. In AAAI Conference on Artificial Intelligence, 2008

8 APPENDIX

8.1 PROOFS

Proof for Proposition 1: We show that max_s D_JS(πB(s) ∥ πA(s)) is well-defined for an MDP M with two representations MA and MB. From Theorem 1, we know the distribution ...