pith. sign in

arxiv: 2306.05905 · v2 · pith:6U2RO6G6new · submitted 2023-06-09 · 💻 cs.LG · math.OC

TreeDQN: Sample-Efficient Off-Policy Reinforcement Learning for Combinatorial Optimization

Pith reviewed 2026-05-24 08:31 UTC · model grok-4.3

classification 💻 cs.LG math.OC
keywords reinforcement learningcombinatorial optimizationbranch and bounddeep Q-networkoff-policy learningtree MDPgeometric mean
0
0 comments X

The pith

TreeDQN trains an off-policy agent to learn branching heuristics for combinatorial optimization by optimizing the geometric mean of expected return in tree MDPs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes TreeDQN to address the high sample needs and instability of on-policy reinforcement learning for learning branching decisions inside Branch-and-Bound solvers. It replaces on-policy updates with an off-policy Deep Q-Network whose objective is the geometric mean of the expected return and proves that the associated Bellman operator remains a contraction mapping on the tree Markov Decision Process. This change yields training that uses up to ten times fewer samples, runs faster on synthetic problems, and produces stronger heuristics on a realistic instance drawn from the ML4CO competition.

Core claim

TreeDQN is an off-policy reinforcement learning method for tree MDPs that optimizes the geometric mean of expected return; the method is justified by a proof that the corresponding Bellman operator is contractive, which yields stable learning and reduces the number of required training trajectories by up to a factor of ten relative to on-policy baselines while improving solution quality on both synthetic and practical combinatorial optimization tasks.

What carries the argument

TreeDQN, an off-policy Deep Q-Network trained on the geometric mean of expected return inside a tree-structured Markov Decision Process whose Bellman operator is shown to be contractive.

If this is right

  • Up to tenfold reduction in training data compared with known on-policy methods on synthetic tasks.
  • Faster wall-clock training than on-policy alternatives on the same problems.
  • Higher solution quality than prior state-of-the-art techniques on the ML4CO competition benchmark.
  • More stable training trajectories for learning branching heuristics in Branch-and-Bound.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The geometric-mean objective may transfer to other tree-search or hierarchical decision problems where variance in return is high.
  • Off-policy replay in tree MDPs could reduce the cost of acquiring labeled combinatorial instances for supervised heuristic learning.
  • If the contraction result generalizes, similar geometric-mean training might stabilize other off-policy algorithms that operate on recursive or tree-like state spaces.

Load-bearing premise

The contraction property of the Bellman operator for the geometric-mean objective is sufficient to produce stable and sample-efficient learning of branching policies.

What would settle it

A controlled comparison on the same synthetic task set in which TreeDQN requires at least as many trajectories as the on-policy baseline to reach the same policy quality, or fails to exceed the best reported ML4CO solver performance.

Figures

Figures reproduced from arXiv: 2306.05905 by A. Kostin, A.V. Savchenko, D. Sorokin, G. Gusev, L. Savchenko.

Figure 1
Figure 1. Figure 1: Branch-and-Bound tree for Combinatorial Auction task [ [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Distributions of tree sizes for Combinatorial Auction [ [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: The geometric mean of tree size for 30 validation instances. Blue - Combinatorial Auction, [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: P-P plots of tree size distributions for test instances. Blue - Strong Branching, red - Imitation [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: P-P plots of tree size distributions for test instances. Blue - Strong Branching, red - Imitation [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗
read the original abstract

A convenient approach to optimally solving combinatorial optimization tasks is the Branch-and-Bound method. Its branching heuristic can be learned to solve a large set of similar tasks. The promising results here are achieved by the recently appeared on-policy reinforcement learning method based on the tree Markov Decision Process. To overcome its main disadvantages, namely, very large training time and unstable training, we propose TreeDQN (Tree Deep Q-Network), a sample-efficient off-policy RL method trained by optimizing the geometric mean of expected return. To theoretically support the training procedure for our method, we prove the contraction property of the Bellman operator for the tree MDP. As a result, our method requires up to 10 times less training data and performs faster than known on-policy methods on synthetic tasks. Moreover, TreeDQN significantly outperforms the state-of-the-art techniques on a challenging practical task from the ML4CO competition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces TreeDQN, an off-policy RL algorithm for learning branching heuristics in Branch-and-Bound solvers for combinatorial optimization. It models the search as a tree MDP and trains a DQN by optimizing the geometric mean of expected return rather than the conventional additive objective; a contraction proof for the associated Bellman operator is offered to justify stable off-policy learning. Experiments claim up to 10× reduction in training data relative to on-policy baselines on synthetic instances and improved performance on an ML4CO competition task.

Significance. If the contraction result is shown to apply directly to the geometric-mean objective and the empirical gains are reproducible, the work would offer a concrete route to sample-efficient off-policy training for tree-search policies, addressing a known instability and data-hunger issue in the on-policy literature for CO. The explicit use of a geometric-mean objective together with a supporting fixed-point argument would be a distinctive technical contribution.

major comments (1)
  1. [§4] §4 (Theoretical analysis), the statement of the Bellman operator and its contraction: the proof is given for the standard additive expectation operator under the sup-norm, yet the training procedure optimizes the geometric mean of expected return. No re-derivation of the fixed-point equation, effective discount, or transformed operator (e.g., via logarithm) is supplied to show that the contraction still holds for the objective actually optimized; this gap directly undermines the stability guarantee claimed for the off-policy training loop.
minor comments (2)
  1. The geometric-mean objective is introduced without an explicit comparison to the more common log-return formulation; a short remark on numerical stability or equivalence would clarify the design choice.
  2. Figure 3 (learning curves) uses a log-scale y-axis without stating the base or the precise quantity plotted (mean vs. median geometric return); this makes direct comparison with the on-policy baselines harder to interpret.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful reading and the constructive comment on the theoretical analysis. We address the point below.

read point-by-point responses
  1. Referee: [§4] §4 (Theoretical analysis), the statement of the Bellman operator and its contraction: the proof is given for the standard additive expectation operator under the sup-norm, yet the training procedure optimizes the geometric mean of expected return. No re-derivation of the fixed-point equation, effective discount, or transformed operator (e.g., via logarithm) is supplied to show that the contraction still holds for the objective actually optimized; this gap directly undermines the stability guarantee claimed for the off-policy training loop.

    Authors: We agree that the manuscript should explicitly connect the geometric-mean objective to the contraction result. The current proof establishes contraction of the standard additive Bellman operator on the tree MDP under the sup-norm. Because the geometric mean of expected return can be rewritten via a logarithm as an additive objective with a state-dependent effective discount, the same contraction argument applies after the transformation. In the revised manuscript we will add a short derivation of the transformed operator and fixed-point equation, together with the adjusted discount factor, to make this link rigorous and remove the gap. revision: yes

Circularity Check

0 steps flagged

No circularity; proof presented as independent support

full rationale

The paper introduces TreeDQN as a novel off-policy algorithm that optimizes the geometric mean of expected return and separately proves contraction of the Bellman operator on the tree MDP to justify stable training. No equation or claim reduces the proof to the training objective by construction, no parameters are fitted and then relabeled as predictions, and no load-bearing premise rests on self-citation. The contraction result is offered as new theoretical content that stands apart from the empirical performance claims, making the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the tree MDP from prior work and the new proof of contraction; no free parameters or invented entities mentioned in abstract.

axioms (1)
  • domain assumption Bellman operator for tree MDP is a contraction mapping
    Proved to support the training procedure for the off-policy method.

pith-pipeline@v0.9.0 · 5700 in / 1267 out tokens · 50021 ms · 2026-05-24T08:31:55.999201+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 2 internal anchors

  1. [1]

    A stochastic and dynamic vehicle routing problem in the euclidean plane

    Dimitris J Bertsimas and Garrett Van Ryzin. A stochastic and dynamic vehicle routing problem in the euclidean plane. Operations Research, 39(4):601–615, 1991

  2. [2]

    Portfolio selection

    Harry Markowitz. Portfolio selection. The Journal of Finance, 7(1):77, March 1952

  3. [3]

    An application of combinatorial optimization to statistical physics and circuit layout design

    Francisco Barahona, Martin Grötschel, Michael Jünger, and Gerhard Reinelt. An application of combinatorial optimization to statistical physics and circuit layout design. Operations Research, 36(3):493–513, 1988

  4. [4]

    Integer and combinatorial optimization, vol- ume 55

    Laurence A Wolsey and George L Nemhauser. Integer and combinatorial optimization, vol- ume 55. John Wiley & Sons, 1999

  5. [5]

    An automatic method for solving discrete programming problems

    Ailsa H Land and Alison G Doig. An automatic method for solving discrete programming problems. Springer, 2010

  6. [6]

    On learning and branching: a survey

    Andrea Lodi and Giulia Zarpellon. On learning and branching: a survey. Top, 25:207–236, 2017

  7. [7]

    Miplib 2017: data-driven compilation of the 6th mixed-integer programming library

    Ambros Gleixner, Gregor Hendel, Gerald Gamrath, Tobias Achterberg, Michael Bastubbe, Timo Berthold, Philipp Christophel, Kati Jarck, Thorsten Koch, Jeff Linderoth, et al. Miplib 2017: data-driven compilation of the 6th mixed-integer programming library. Mathematical Programming Computation, 13(3):443–490, 2021

  8. [8]

    A general reinforcement learning algorithm that masters chess, shogi, and go through self-play

    David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140–1144, 2018

  9. [9]

    Dota 2 with Large Scale Deep Reinforcement Learning

    Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław D˛ ebiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019

  10. [10]

    Interferobot: aligning an optical interferometer by a reinforcement learning agent

    Dmitry Sorokin, Alexander Ulanov, Ekaterina Sazhina, and Alexander Lvovsky. Interferobot: aligning an optical interferometer by a reinforcement learning agent. In H. Larochelle, M. Ran- zato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 13238–13248. Curran Associates, Inc., 2020

  11. [11]

    Mag- netic control of tokamak plasmas through deep reinforcement learning

    Jonas Degrave, Federico Felici, Jonas Buchli, Michael Neunert, Brendan Tracey, Francesco Carpanese, Timo Ewalds, Roland Hafner, Abbas Abdolmaleki, Diego de Las Casas, et al. Mag- netic control of tokamak plasmas through deep reinforcement learning. Nature, 602(7897):414– 419, 2022

  12. [12]

    Reinforcement learning enhanced quantum-inspired algorithm for combinatorial optimization

    Dmitrii Beloborodov, A E Ulanov, Jakob N Foerster, Shimon Whiteson, and A I Lvovsky. Reinforcement learning enhanced quantum-inspired algorithm for combinatorial optimization. Machine Learning: Science and Technology, 2(2):025009, January 2021

  13. [13]

    GPT-4 Technical Report

    OpenAI. Gpt-4 technical report. ArXiv, abs/2303.08774, 2023

  14. [14]

    Reinforce- ment learning for variable selection in a branch and bound algorithm

    Marc Etheve, Zacharie Alès, Côme Bissuel, Olivier Juan, and Safia Kedad-Sidhoum. Reinforce- ment learning for variable selection in a branch and bound algorithm. In Integration of AI and OR Techniques in Constraint Programming, 2020

  15. [15]

    Learning to branch with tree mdps

    Lara Scavuzzo, Feng Chen, Didier Chetelat, Maxime Gasse, Andrea Lodi, Neil Yorke-Smith, and Karen Aardal. Learning to branch with tree mdps. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 18514–18526. Curran Associates, Inc., 2022

  16. [16]

    Towards a universal test suite for combinatorial auction algorithms

    Kevin Leyton-Brown, Mark Pearson, and Yoav Shoham. Towards a universal test suite for combinatorial auction algorithms. In Proceedings of the 2nd ACM conference on Electronic commerce. ACM, October 2000

  17. [17]

    Maher, Frederic Matter, Erik Mühmer, Benjamin Müller, Marc E

    Ksenia Bestuzheva, Mathieu Besançon, Wei-Kun Chen, Antonia Chmiela, Tim Donkiewicz, Jasper van Doornmalen, Leon Eifler, Oliver Gaul, Gerald Gamrath, Ambros Gleixner, Leona Gottwald, Christoph Graczyk, Katrin Halbig, Alexander Hoen, Christopher Hojny, Rolf van der Hulst, Thorsten Koch, Marco Lübbecke, Stephen J. Maher, Frederic Matter, Erik Mühmer, Benjami...

  18. [18]

    IBM ILOG Cplex. V12. 1: User’s manual for cplex. International Business Machines Corpora- tion, 46(53):157, 2009

  19. [19]

    Constraint Integer Programming

    Tobias Achterberg. Constraint Integer Programming. July 2007. Accepted: 2015-11- 20T17:32:33Z

  20. [20]

    Learning to branch in mixed integer programming

    Elias Khalil, Pierre Le Bodic, Le Song, George Nemhauser, and Bistra Dilkina. Learning to branch in mixed integer programming. Proceedings of the AAAI Conference on Artificial Intelligence, 30(1), February 2016

  21. [21]

    Support-vector networks

    Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20(3):273– 297, 1995

  22. [22]

    Learning combinatorial optimization algorithms over graphs

    Elias Khalil, Hanjun Dai, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. In I. Guyon, U. V on Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017

  23. [23]

    Daniel Selsam, Matthew Lamm, Benedikt Bünz, Percy Liang, Leonardo de Moura, and David L. Dill. Learning a sat solver from single-bit supervision, 2018

  24. [24]

    Kipf and Max Welling

    Thomas N. Kipf and Max Welling. Semi-Supervised Classification with Graph Convolutional Networks. In Proceedings of the 5th International Conference on Learning Representations, ICLR ’17, 2017

  25. [25]

    Exact com- binatorial optimization with graph convolutional neural networks

    Maxime Gasse, Didier Chetelat, Nicola Ferroni, Laurent Charlin, and Andrea Lodi. Exact com- binatorial optimization with graph convolutional neural networks. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019

  26. [26]

    Hybrid models for learning to branch

    Prateek Gupta, Maxime Gasse, Elias Khalil, Pawan Mudigonda, Andrea Lodi, and Yoshua Bengio. Hybrid models for learning to branch. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 18087–18097. Curran Associates, Inc., 2020

  27. [27]

    Ecole: A gym-like library for machine learning in combinatorial optimization solvers

    Antoine Prouvost, Justin Dumouchelle, Lara Scavuzzo, Maxime Gasse, Didier Chételat, and Andrea Lodi. Ecole: A gym-like library for machine learning in combinatorial optimization solvers. In Learning Meets Combinatorial Algorithms at NeurIPS2020, 2020

  28. [28]

    Set covering algorithms using cutting planes, heuristics, and subgradient optimization: A computational study, pages 37–60

    Egon Balas and Andrew Ho. Set covering algorithms using cutting planes, heuristics, and subgradient optimization: A computational study, pages 37–60. Springer Berlin Heidelberg, Berlin, Heidelberg, 1980

  29. [29]

    Cire, Willem-Jan van Hoeve, and John Hooker

    David Bergman, Andre A. Cire, Willem-Jan van Hoeve, and John Hooker. Decision Diagrams for Optimization. Springer International Publishing, 2016

  30. [30]

    Convergence of stochastic iterative dynamic programming algorithms

    Tommi Jaakkola, Michael Jordan, and Satinder Singh. Convergence of stochastic iterative dynamic programming algorithms. In J. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems, volume 6. Morgan-Kaufmann, 1993

  31. [31]

    Borovkov

    Alexandr A. Borovkov. Probability Theory. Springer London, 2013

  32. [32]

    Rusu, Joel Veness, Marc G

    V olodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Pe- tersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcemen...

  33. [33]

    Cornuejols, R

    G. Cornuejols, R. Sridharan, and J.M. Thizy. A comparison of heuristics and relaxations for the capacitated plant location problem. European Journal of Operational Research, 50(3):280–297, feb 1991

  34. [34]

    Fukunaga

    Alex S. Fukunaga. A branch-and-bound algorithm for hard multiple knapsack problems. Annals of Operations Research, 184(1):97–119, November 2009. 11 A Appendix 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Comb.Auct. 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Set.Cover 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Max.Ind.Set. 0.0 0.2 0.4 0.6 0.8 ...