Co-training for Policy Learning
Pith reviewed 2026-05-25 10:16 UTC · model grok-4.3
The pith
Under stated sufficient conditions, co-training on two state-action views yields better policies than learning from a single view.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under sufficient conditions, learning from two complementary state-action representations improves upon learning from a single representation alone, and this improvement is realized by a meta-algorithm for co-training that is compatible with both reinforcement learning and imitation learning.
What carries the argument
The co-training meta-algorithm that alternates training between two views to mutually refine the policy.
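To make the alternating scheme concrete, here is a minimal toy sketch in Python. It is not the paper's Algorithm 1: the summary only says that training alternates between two views, so the exchange rule (the learner keeps whichever of its own action, an exploratory action, or the other view's action scores best), the tabular policies, the two state encodings, and the names `co_train` and `toy_reward` are all assumptions made for illustration.

```python
import random

def toy_reward(state, action):
    # Hypothetical reward: the best action for a state is state % 4.
    return -abs(action - state % 4)

def co_train(reward, n_states, n_actions, rounds=500, explore=0.3, seed=0):
    rng = random.Random(seed)
    # Each view indexes the same underlying state differently (assumed encodings).
    encode = {"A": lambda s: s, "B": lambda s: n_states - 1 - s}
    # One tabular policy per view: encoded state -> action.
    policy = {v: [rng.randrange(n_actions) for _ in range(n_states)] for v in "AB"}

    def avg(view):
        # Average reward of one view's policy over all underlying states.
        return sum(reward(s, policy[view][encode[view](s)])
                   for s in range(n_states)) / n_states

    history = [(avg("A"), avg("B"))]
    for t in range(rounds):
        # Alternate which view is trained; the other view acts as the teacher.
        learner, teacher = ("A", "B") if t % 2 == 0 else ("B", "A")
        s = rng.randrange(n_states)
        slot = encode[learner](s)
        candidates = [policy[learner][slot],                # learner's current action
                      policy[teacher][encode[teacher](s)]]  # teacher's demonstration
        if rng.random() < explore:
            candidates.append(rng.randrange(n_actions))     # learner's own exploration
        # Keep the best-scoring candidate; ties keep the current action.
        policy[learner][slot] = max(candidates, key=lambda a: reward(s, a))
        history.append((avg("A"), avg("B")))
    return policy, history

policy, history = co_train(toy_reward, n_states=6, n_actions=4)
print("initial avg reward (A, B):", history[0])
print("final avg reward (A, B):", history[-1])
```

Because the learner only ever replaces an action with one that scores at least as well, each view's average reward is non-decreasing in this toy. Real policy learners give no such guarantee, which is why the sufficient conditions matter.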
If this is right
- Policy quality improves when multiple formulations of the same decision problem are available and used together.
- The meta-algorithm applies equally to reinforcement learning and imitation learning settings.
- Gains are observed on both discrete/continuous control and combinatorial optimization tasks.
- The approach directly extends co-training concepts from classification into sequential decision making.
Where Pith is reading between the lines
- Identifying pairs of complementary representations may be a practical bottleneck even when the theoretical conditions hold.
- The framework could be extended to more than two views if the mutual-improvement conditions generalize.
- Sample efficiency in reinforcement learning might increase when dual representations reduce the need for exploration in each view separately.
Load-bearing premise
The two representations supply complementary information that permits each view to improve the policy quality obtained from the other.
What would settle it
A controlled experiment in which two views satisfy the stated conditions yet produce no measurable improvement in policy quality over the single-view baseline.
Original abstract
We study the problem of learning sequential decision-making policies in settings with multiple state-action representations. Such settings naturally arise in many domains, such as planning (e.g., multiple integer programming formulations) and various combinatorial optimization problems (e.g., those with both integer programming and graph-based formulations). Inspired by the classical co-training framework for classification, we study the problem of co-training for policy learning. We present sufficient conditions under which learning from two views can improve upon learning from a single view alone. Motivated by these theoretical insights, we present a meta-algorithm for co-training for sequential decision making. Our framework is compatible with both reinforcement learning and imitation learning. We validate the effectiveness of our approach across a wide range of tasks, including discrete/continuous control and combinatorial optimization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies co-training for sequential decision-making policies when multiple state-action representations are available. It states sufficient conditions under which two views yield mutual policy improvement over a single view, introduces a meta-algorithm compatible with both reinforcement learning and imitation learning, and reports empirical results on discrete/continuous control and combinatorial optimization tasks.
Significance. If the sufficient conditions are shown to hold in the evaluated domains and the reported gains are attributable to the co-training mechanism rather than other factors, the work would supply a principled extension of classical co-training to policy learning and a practical meta-algorithm for domains that admit multiple formulations.
major comments (2)
- [§3 and experimental sections] §3 (sufficient conditions): the manuscript states conditions under which two views permit mutual improvement, yet the experimental sections provide no verification that any evaluated task satisfies these conditions (e.g., no reported measurement of view disagreement, conditional independence, or the relevant error-reduction quantities). Consequently the theory does not underwrite the empirical claims.
- [Algorithm 1 and §4] Algorithm 1 and §4: the meta-algorithm is motivated by the theoretical conditions, but without evidence that the conditions hold on the control and combinatorial instances, it is unclear whether the observed improvements stem from the stated mechanism or from generic regularization or ensemble effects.
minor comments (2)
- [§2 and experimental setup] Notation for the two views is introduced without an explicit table comparing the state-action representations used in each experimental domain.
- [Abstract] The abstract claims validation 'across a wide range of tasks' but does not list the precise environments or problem sizes; a compact table would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on strengthening the link between the theoretical conditions and the empirical evaluation. We address each major comment below and will revise the manuscript to incorporate additional analysis and controls.
- Referee: [§3 and experimental sections] §3 (sufficient conditions): the manuscript states conditions under which two views permit mutual improvement, yet the experimental sections provide no verification that any evaluated task satisfies these conditions (e.g., no reported measurement of view disagreement, conditional independence, or the relevant error-reduction quantities). Consequently the theory does not underwrite the empirical claims.
  Authors: We agree this is a substantive gap. The sufficient conditions in §3 are presented as motivation for the meta-algorithm rather than as a claim that they hold exactly on every evaluated task. In the revision we will add a new subsection that reports empirical measurements of view disagreement and the relevant error-reduction quantities on the discrete/continuous control and combinatorial optimization tasks. This will allow readers to assess the degree to which the theoretical conditions are approximately satisfied and will clarify the extent to which the observed gains are consistent with the stated mechanism.
  Revision: yes
- Referee: [Algorithm 1 and §4] Algorithm 1 and §4: the meta-algorithm is motivated by the theoretical conditions, but without evidence that the conditions hold on the control and combinatorial instances, it is unclear whether the observed improvements stem from the stated mechanism or from generic regularization or ensemble effects.
  Authors: This concern is valid. To isolate the contribution of the co-training procedure, the revised version will include additional ablation studies that compare the full co-training meta-algorithm against (i) single-view baselines, (ii) simple ensembles of independently trained policies, and (iii) regularization-only variants that do not exchange information between views. We will also report the same disagreement and error-reduction metrics on these ablations so that any performance differences can be more directly attributed to the mutual-improvement mechanism described in §3.
  Revision: yes
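The view-disagreement measurements discussed in this exchange could be computed along the following lines. This is a minimal sketch, assuming both views' policies can be evaluated on a shared set of states and return action-probability vectors over a shared action set. The function names and the two metrics chosen here (argmax mismatch rate and worst-case Jensen-Shannon divergence over states) are illustrative assumptions, not quantities the paper is known to report.

```python
import math

def kl(p, q):
    # KL divergence in nats; terms with p_i == 0 contribute nothing.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    # Jensen-Shannon divergence: symmetric, bounded above by ln(2).
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def view_disagreement(policy_a, policy_b):
    """Each policy is a list, over shared states, of action-probability lists."""
    def greedy(p):
        # Index of the most probable action (first index on ties).
        return max(range(len(p)), key=p.__getitem__)

    pairs = list(zip(policy_a, policy_b))
    mismatch_rate = sum(greedy(pa) != greedy(pb) for pa, pb in pairs) / len(pairs)
    worst_js = max(js_divergence(pa, pb) for pa, pb in pairs)
    return mismatch_rate, worst_js

# Hypothetical action distributions for two states under each view.
view_a = [[1.0, 0.0], [0.5, 0.5]]
view_b = [[0.0, 1.0], [0.5, 0.5]]
print(view_disagreement(view_a, view_b))
```

Reporting both numbers for the full method and for each ablation would show whether the gains track disagreement between views, which the mutual-improvement account would suggest, or appear regardless of it.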
Circularity Check
No significant circularity detected
The paper states sufficient conditions under which two-view co-training improves single-view policy learning and motivates a meta-algorithm from those conditions. No equations, fitted parameters, or predictions are exhibited that reduce to the inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems, no ansatzes are smuggled via prior work, and no renaming of known results occurs. The derivation chain is self-contained against external benchmarks because the conditions are presented as independent theoretical statements rather than tautological redefinitions of the algorithm's outputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: multiple distinct state-action representations exist and supply complementary information for the same underlying decision process.