TreeDQN: Sample-Efficient Off-Policy Reinforcement Learning for Combinatorial Optimization
Pith reviewed 2026-05-24 08:31 UTC · model grok-4.3
The pith
TreeDQN trains an off-policy agent to learn branching heuristics for combinatorial optimization by optimizing the geometric mean of expected return in tree MDPs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TreeDQN is an off-policy reinforcement learning method for tree MDPs that optimizes the geometric mean of expected return; the method is justified by a proof that the corresponding Bellman operator is contractive, which yields stable learning and reduces the number of required training trajectories by up to a factor of ten relative to on-policy baselines while improving solution quality on both synthetic and practical combinatorial optimization tasks.
What carries the argument
TreeDQN, an off-policy Deep Q-Network trained on the geometric mean of expected return inside a tree-structured Markov Decision Process whose Bellman operator is shown to be contractive.
If this is right
- Up to tenfold reduction in training data compared with known on-policy methods on synthetic tasks.
- Faster wall-clock training than on-policy alternatives on the same problems.
- Higher solution quality than prior state-of-the-art techniques on the ML4CO competition benchmark.
- More stable training trajectories for learning branching heuristics in Branch-and-Bound.
Where Pith is reading between the lines
- The geometric-mean objective may transfer to other tree-search or hierarchical decision problems where variance in return is high.
- Off-policy replay in tree MDPs could reduce the cost of acquiring labeled combinatorial instances for supervised heuristic learning.
- If the contraction result generalizes, similar geometric-mean training might stabilize other off-policy algorithms that operate on recursive or tree-like state spaces.
Load-bearing premise
The contraction property of the Bellman operator for the geometric-mean objective is sufficient to produce stable and sample-efficient learning of branching policies.
What would settle it
A controlled comparison on the same synthetic task set in which TreeDQN requires at least as many trajectories as the on-policy baseline to reach the same policy quality, or fails to exceed the best reported ML4CO solver performance.
Figures
read the original abstract
A convenient approach to optimally solving combinatorial optimization tasks is the Branch-and-Bound method. Its branching heuristic can be learned to solve a large set of similar tasks. The promising results here are achieved by the recently appeared on-policy reinforcement learning method based on the tree Markov Decision Process. To overcome its main disadvantages, namely, very large training time and unstable training, we propose TreeDQN (Tree Deep Q-Network), a sample-efficient off-policy RL method trained by optimizing the geometric mean of expected return. To theoretically support the training procedure for our method, we prove the contraction property of the Bellman operator for the tree MDP. As a result, our method requires up to 10 times less training data and performs faster than known on-policy methods on synthetic tasks. Moreover, TreeDQN significantly outperforms the state-of-the-art techniques on a challenging practical task from the ML4CO competition.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TreeDQN, an off-policy RL algorithm for learning branching heuristics in Branch-and-Bound solvers for combinatorial optimization. It models the search as a tree MDP and trains a DQN by optimizing the geometric mean of expected return rather than the conventional additive objective; a contraction proof for the associated Bellman operator is offered to justify stable off-policy learning. Experiments claim up to 10× reduction in training data relative to on-policy baselines on synthetic instances and improved performance on an ML4CO competition task.
Significance. If the contraction result is shown to apply directly to the geometric-mean objective and the empirical gains are reproducible, the work would offer a concrete route to sample-efficient off-policy training for tree-search policies, addressing a known instability and data-hunger issue in the on-policy literature for CO. The explicit use of a geometric-mean objective together with a supporting fixed-point argument would be a distinctive technical contribution.
major comments (1)
- [§4] §4 (Theoretical analysis), the statement of the Bellman operator and its contraction: the proof is given for the standard additive expectation operator under the sup-norm, yet the training procedure optimizes the geometric mean of expected return. No re-derivation of the fixed-point equation, effective discount, or transformed operator (e.g., via logarithm) is supplied to show that the contraction still holds for the objective actually optimized; this gap directly undermines the stability guarantee claimed for the off-policy training loop.
minor comments (2)
- The geometric-mean objective is introduced without an explicit comparison to the more common log-return formulation; a short remark on numerical stability or equivalence would clarify the design choice.
- Figure 3 (learning curves) uses a log-scale y-axis without stating the base or the precise quantity plotted (mean vs. median geometric return); this makes direct comparison with the on-policy baselines harder to interpret.
Simulated Author's Rebuttal
We thank the referee for the careful reading and the constructive comment on the theoretical analysis. We address the point below.
read point-by-point responses
-
Referee: [§4] §4 (Theoretical analysis), the statement of the Bellman operator and its contraction: the proof is given for the standard additive expectation operator under the sup-norm, yet the training procedure optimizes the geometric mean of expected return. No re-derivation of the fixed-point equation, effective discount, or transformed operator (e.g., via logarithm) is supplied to show that the contraction still holds for the objective actually optimized; this gap directly undermines the stability guarantee claimed for the off-policy training loop.
Authors: We agree that the manuscript should explicitly connect the geometric-mean objective to the contraction result. The current proof establishes contraction of the standard additive Bellman operator on the tree MDP under the sup-norm. Because the geometric mean of expected return can be rewritten via a logarithm as an additive objective with a state-dependent effective discount, the same contraction argument applies after the transformation. In the revised manuscript we will add a short derivation of the transformed operator and fixed-point equation, together with the adjusted discount factor, to make this link rigorous and remove the gap. revision: yes
Circularity Check
No circularity; proof presented as independent support
full rationale
The paper introduces TreeDQN as a novel off-policy algorithm that optimizes the geometric mean of expected return and separately proves contraction of the Bellman operator on the tree MDP to justify stable training. No equation or claim reduces the proof to the training objective by construction, no parameters are fitted and then relabeled as predictions, and no load-bearing premise rests on self-citation. The contraction result is offered as new theoretical content that stands apart from the empirical performance claims, making the derivation self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Bellman operator for tree MDP is a contraction mapping
Reference graph
Works this paper leans on
-
[1]
A stochastic and dynamic vehicle routing problem in the euclidean plane
Dimitris J Bertsimas and Garrett Van Ryzin. A stochastic and dynamic vehicle routing problem in the euclidean plane. Operations Research, 39(4):601–615, 1991
work page 1991
-
[2]
Harry Markowitz. Portfolio selection. The Journal of Finance, 7(1):77, March 1952
work page 1952
-
[3]
An application of combinatorial optimization to statistical physics and circuit layout design
Francisco Barahona, Martin Grötschel, Michael Jünger, and Gerhard Reinelt. An application of combinatorial optimization to statistical physics and circuit layout design. Operations Research, 36(3):493–513, 1988
work page 1988
-
[4]
Integer and combinatorial optimization, vol- ume 55
Laurence A Wolsey and George L Nemhauser. Integer and combinatorial optimization, vol- ume 55. John Wiley & Sons, 1999
work page 1999
-
[5]
An automatic method for solving discrete programming problems
Ailsa H Land and Alison G Doig. An automatic method for solving discrete programming problems. Springer, 2010
work page 2010
-
[6]
On learning and branching: a survey
Andrea Lodi and Giulia Zarpellon. On learning and branching: a survey. Top, 25:207–236, 2017
work page 2017
-
[7]
Miplib 2017: data-driven compilation of the 6th mixed-integer programming library
Ambros Gleixner, Gregor Hendel, Gerald Gamrath, Tobias Achterberg, Michael Bastubbe, Timo Berthold, Philipp Christophel, Kati Jarck, Thorsten Koch, Jeff Linderoth, et al. Miplib 2017: data-driven compilation of the 6th mixed-integer programming library. Mathematical Programming Computation, 13(3):443–490, 2021
work page 2017
-
[8]
A general reinforcement learning algorithm that masters chess, shogi, and go through self-play
David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140–1144, 2018
work page 2018
-
[9]
Dota 2 with Large Scale Deep Reinforcement Learning
Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław D˛ ebiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1912
-
[10]
Interferobot: aligning an optical interferometer by a reinforcement learning agent
Dmitry Sorokin, Alexander Ulanov, Ekaterina Sazhina, and Alexander Lvovsky. Interferobot: aligning an optical interferometer by a reinforcement learning agent. In H. Larochelle, M. Ran- zato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 13238–13248. Curran Associates, Inc., 2020
work page 2020
-
[11]
Mag- netic control of tokamak plasmas through deep reinforcement learning
Jonas Degrave, Federico Felici, Jonas Buchli, Michael Neunert, Brendan Tracey, Francesco Carpanese, Timo Ewalds, Roland Hafner, Abbas Abdolmaleki, Diego de Las Casas, et al. Mag- netic control of tokamak plasmas through deep reinforcement learning. Nature, 602(7897):414– 419, 2022
work page 2022
-
[12]
Reinforcement learning enhanced quantum-inspired algorithm for combinatorial optimization
Dmitrii Beloborodov, A E Ulanov, Jakob N Foerster, Shimon Whiteson, and A I Lvovsky. Reinforcement learning enhanced quantum-inspired algorithm for combinatorial optimization. Machine Learning: Science and Technology, 2(2):025009, January 2021
work page 2021
-
[13]
OpenAI. Gpt-4 technical report. ArXiv, abs/2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[14]
Reinforce- ment learning for variable selection in a branch and bound algorithm
Marc Etheve, Zacharie Alès, Côme Bissuel, Olivier Juan, and Safia Kedad-Sidhoum. Reinforce- ment learning for variable selection in a branch and bound algorithm. In Integration of AI and OR Techniques in Constraint Programming, 2020
work page 2020
-
[15]
Learning to branch with tree mdps
Lara Scavuzzo, Feng Chen, Didier Chetelat, Maxime Gasse, Andrea Lodi, Neil Yorke-Smith, and Karen Aardal. Learning to branch with tree mdps. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 18514–18526. Curran Associates, Inc., 2022
work page 2022
-
[16]
Towards a universal test suite for combinatorial auction algorithms
Kevin Leyton-Brown, Mark Pearson, and Yoav Shoham. Towards a universal test suite for combinatorial auction algorithms. In Proceedings of the 2nd ACM conference on Electronic commerce. ACM, October 2000
work page 2000
-
[17]
Maher, Frederic Matter, Erik Mühmer, Benjamin Müller, Marc E
Ksenia Bestuzheva, Mathieu Besançon, Wei-Kun Chen, Antonia Chmiela, Tim Donkiewicz, Jasper van Doornmalen, Leon Eifler, Oliver Gaul, Gerald Gamrath, Ambros Gleixner, Leona Gottwald, Christoph Graczyk, Katrin Halbig, Alexander Hoen, Christopher Hojny, Rolf van der Hulst, Thorsten Koch, Marco Lübbecke, Stephen J. Maher, Frederic Matter, Erik Mühmer, Benjami...
work page 2021
-
[18]
IBM ILOG Cplex. V12. 1: User’s manual for cplex. International Business Machines Corpora- tion, 46(53):157, 2009
work page 2009
-
[19]
Constraint Integer Programming
Tobias Achterberg. Constraint Integer Programming. July 2007. Accepted: 2015-11- 20T17:32:33Z
work page 2007
-
[20]
Learning to branch in mixed integer programming
Elias Khalil, Pierre Le Bodic, Le Song, George Nemhauser, and Bistra Dilkina. Learning to branch in mixed integer programming. Proceedings of the AAAI Conference on Artificial Intelligence, 30(1), February 2016
work page 2016
-
[21]
Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20(3):273– 297, 1995
work page 1995
-
[22]
Learning combinatorial optimization algorithms over graphs
Elias Khalil, Hanjun Dai, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. In I. Guyon, U. V on Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017
work page 2017
-
[23]
Daniel Selsam, Matthew Lamm, Benedikt Bünz, Percy Liang, Leonardo de Moura, and David L. Dill. Learning a sat solver from single-bit supervision, 2018
work page 2018
-
[24]
Thomas N. Kipf and Max Welling. Semi-Supervised Classification with Graph Convolutional Networks. In Proceedings of the 5th International Conference on Learning Representations, ICLR ’17, 2017
work page 2017
-
[25]
Exact com- binatorial optimization with graph convolutional neural networks
Maxime Gasse, Didier Chetelat, Nicola Ferroni, Laurent Charlin, and Andrea Lodi. Exact com- binatorial optimization with graph convolutional neural networks. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019
work page 2019
-
[26]
Hybrid models for learning to branch
Prateek Gupta, Maxime Gasse, Elias Khalil, Pawan Mudigonda, Andrea Lodi, and Yoshua Bengio. Hybrid models for learning to branch. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 18087–18097. Curran Associates, Inc., 2020
work page 2020
-
[27]
Ecole: A gym-like library for machine learning in combinatorial optimization solvers
Antoine Prouvost, Justin Dumouchelle, Lara Scavuzzo, Maxime Gasse, Didier Chételat, and Andrea Lodi. Ecole: A gym-like library for machine learning in combinatorial optimization solvers. In Learning Meets Combinatorial Algorithms at NeurIPS2020, 2020
work page 2020
-
[28]
Egon Balas and Andrew Ho. Set covering algorithms using cutting planes, heuristics, and subgradient optimization: A computational study, pages 37–60. Springer Berlin Heidelberg, Berlin, Heidelberg, 1980
work page 1980
-
[29]
Cire, Willem-Jan van Hoeve, and John Hooker
David Bergman, Andre A. Cire, Willem-Jan van Hoeve, and John Hooker. Decision Diagrams for Optimization. Springer International Publishing, 2016
work page 2016
-
[30]
Convergence of stochastic iterative dynamic programming algorithms
Tommi Jaakkola, Michael Jordan, and Satinder Singh. Convergence of stochastic iterative dynamic programming algorithms. In J. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems, volume 6. Morgan-Kaufmann, 1993
work page 1993
- [31]
-
[32]
V olodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Pe- tersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcemen...
work page 2015
-
[33]
G. Cornuejols, R. Sridharan, and J.M. Thizy. A comparison of heuristics and relaxations for the capacitated plant location problem. European Journal of Operational Research, 50(3):280–297, feb 1991
work page 1991
-
[34]
Alex S. Fukunaga. A branch-and-bound algorithm for hard multiple knapsack problems. Annals of Operations Research, 184(1):97–119, November 2009. 11 A Appendix 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Comb.Auct. 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Set.Cover 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Max.Ind.Set. 0.0 0.2 0.4 0.6 0.8 ...
work page 2009
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.