TreeDQN: Sample-Efficient Off-Policy Reinforcement Learning for Combinatorial Optimization

A. Kostin; A.V. Savchenko; D. Sorokin; G. Gusev; L. Savchenko

arxiv: 2306.05905 · v2 · pith:6U2RO6G6new · submitted 2023-06-09 · 💻 cs.LG · math.OC

TreeDQN: Sample-Efficient Off-Policy Reinforcement Learning for Combinatorial Optimization

D. Sorokin , A. Kostin , L. Savchenko , G. Gusev , A.V. Savchenko This is my paper

Pith reviewed 2026-05-24 08:31 UTC · model grok-4.3

classification 💻 cs.LG math.OC

keywords reinforcement learningcombinatorial optimizationbranch and bounddeep Q-networkoff-policy learningtree MDPgeometric mean

0 comments

The pith

TreeDQN trains an off-policy agent to learn branching heuristics for combinatorial optimization by optimizing the geometric mean of expected return in tree MDPs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes TreeDQN to address the high sample needs and instability of on-policy reinforcement learning for learning branching decisions inside Branch-and-Bound solvers. It replaces on-policy updates with an off-policy Deep Q-Network whose objective is the geometric mean of the expected return and proves that the associated Bellman operator remains a contraction mapping on the tree Markov Decision Process. This change yields training that uses up to ten times fewer samples, runs faster on synthetic problems, and produces stronger heuristics on a realistic instance drawn from the ML4CO competition.

Core claim

TreeDQN is an off-policy reinforcement learning method for tree MDPs that optimizes the geometric mean of expected return; the method is justified by a proof that the corresponding Bellman operator is contractive, which yields stable learning and reduces the number of required training trajectories by up to a factor of ten relative to on-policy baselines while improving solution quality on both synthetic and practical combinatorial optimization tasks.

What carries the argument

TreeDQN, an off-policy Deep Q-Network trained on the geometric mean of expected return inside a tree-structured Markov Decision Process whose Bellman operator is shown to be contractive.

If this is right

Up to tenfold reduction in training data compared with known on-policy methods on synthetic tasks.
Faster wall-clock training than on-policy alternatives on the same problems.
Higher solution quality than prior state-of-the-art techniques on the ML4CO competition benchmark.
More stable training trajectories for learning branching heuristics in Branch-and-Bound.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The geometric-mean objective may transfer to other tree-search or hierarchical decision problems where variance in return is high.
Off-policy replay in tree MDPs could reduce the cost of acquiring labeled combinatorial instances for supervised heuristic learning.
If the contraction result generalizes, similar geometric-mean training might stabilize other off-policy algorithms that operate on recursive or tree-like state spaces.

Load-bearing premise

The contraction property of the Bellman operator for the geometric-mean objective is sufficient to produce stable and sample-efficient learning of branching policies.

What would settle it

A controlled comparison on the same synthetic task set in which TreeDQN requires at least as many trajectories as the on-policy baseline to reach the same policy quality, or fails to exceed the best reported ML4CO solver performance.

Figures

Figures reproduced from arXiv: 2306.05905 by A. Kostin, A.V. Savchenko, D. Sorokin, G. Gusev, L. Savchenko.

**Figure 2.** Figure 2: Distributions of tree sizes for Combinatorial Auction [ [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: The geometric mean of tree size for 30 validation instances. Blue - Combinatorial Auction, [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: P-P plots of tree size distributions for test instances. Blue - Strong Branching, red - Imitation [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: P-P plots of tree size distributions for test instances. Blue - Strong Branching, red - Imitation [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗

read the original abstract

A convenient approach to optimally solving combinatorial optimization tasks is the Branch-and-Bound method. Its branching heuristic can be learned to solve a large set of similar tasks. The promising results here are achieved by the recently appeared on-policy reinforcement learning method based on the tree Markov Decision Process. To overcome its main disadvantages, namely, very large training time and unstable training, we propose TreeDQN (Tree Deep Q-Network), a sample-efficient off-policy RL method trained by optimizing the geometric mean of expected return. To theoretically support the training procedure for our method, we prove the contraction property of the Bellman operator for the tree MDP. As a result, our method requires up to 10 times less training data and performs faster than known on-policy methods on synthetic tasks. Moreover, TreeDQN significantly outperforms the state-of-the-art techniques on a challenging practical task from the ML4CO competition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TreeDQN shifts to off-policy training with a geometric-mean objective for tree MDPs in combinatorial optimization, claiming better sample efficiency and a supporting contraction proof.

read the letter

The paper's main contribution is TreeDQN, an off-policy DQN approach for learning branching policies in tree MDPs used for branch-and-bound solvers. It replaces the usual on-policy training with optimization of the geometric mean of expected returns and includes a proof that the Bellman operator contracts for the tree MDP. The authors report that this cuts required training data by up to 10x on synthetic tasks and improves results on an ML4CO competition instance compared with prior on-policy methods and other baselines.

Referee Report

1 major / 2 minor

Summary. The paper introduces TreeDQN, an off-policy RL algorithm for learning branching heuristics in Branch-and-Bound solvers for combinatorial optimization. It models the search as a tree MDP and trains a DQN by optimizing the geometric mean of expected return rather than the conventional additive objective; a contraction proof for the associated Bellman operator is offered to justify stable off-policy learning. Experiments claim up to 10× reduction in training data relative to on-policy baselines on synthetic instances and improved performance on an ML4CO competition task.

Significance. If the contraction result is shown to apply directly to the geometric-mean objective and the empirical gains are reproducible, the work would offer a concrete route to sample-efficient off-policy training for tree-search policies, addressing a known instability and data-hunger issue in the on-policy literature for CO. The explicit use of a geometric-mean objective together with a supporting fixed-point argument would be a distinctive technical contribution.

major comments (1)

[§4] §4 (Theoretical analysis), the statement of the Bellman operator and its contraction: the proof is given for the standard additive expectation operator under the sup-norm, yet the training procedure optimizes the geometric mean of expected return. No re-derivation of the fixed-point equation, effective discount, or transformed operator (e.g., via logarithm) is supplied to show that the contraction still holds for the objective actually optimized; this gap directly undermines the stability guarantee claimed for the off-policy training loop.

minor comments (2)

The geometric-mean objective is introduced without an explicit comparison to the more common log-return formulation; a short remark on numerical stability or equivalence would clarify the design choice.
Figure 3 (learning curves) uses a log-scale y-axis without stating the base or the precise quantity plotted (mean vs. median geometric return); this makes direct comparison with the on-policy baselines harder to interpret.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful reading and the constructive comment on the theoretical analysis. We address the point below.

read point-by-point responses

Referee: [§4] §4 (Theoretical analysis), the statement of the Bellman operator and its contraction: the proof is given for the standard additive expectation operator under the sup-norm, yet the training procedure optimizes the geometric mean of expected return. No re-derivation of the fixed-point equation, effective discount, or transformed operator (e.g., via logarithm) is supplied to show that the contraction still holds for the objective actually optimized; this gap directly undermines the stability guarantee claimed for the off-policy training loop.

Authors: We agree that the manuscript should explicitly connect the geometric-mean objective to the contraction result. The current proof establishes contraction of the standard additive Bellman operator on the tree MDP under the sup-norm. Because the geometric mean of expected return can be rewritten via a logarithm as an additive objective with a state-dependent effective discount, the same contraction argument applies after the transformation. In the revised manuscript we will add a short derivation of the transformed operator and fixed-point equation, together with the adjusted discount factor, to make this link rigorous and remove the gap. revision: yes

Circularity Check

0 steps flagged

No circularity; proof presented as independent support

full rationale

The paper introduces TreeDQN as a novel off-policy algorithm that optimizes the geometric mean of expected return and separately proves contraction of the Bellman operator on the tree MDP to justify stable training. No equation or claim reduces the proof to the training objective by construction, no parameters are fitted and then relabeled as predictions, and no load-bearing premise rests on self-citation. The contraction result is offered as new theoretical content that stands apart from the empirical performance claims, making the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the tree MDP from prior work and the new proof of contraction; no free parameters or invented entities mentioned in abstract.

axioms (1)

domain assumption Bellman operator for tree MDP is a contraction mapping
Proved to support the training procedure for the off-policy method.

pith-pipeline@v0.9.0 · 5700 in / 1267 out tokens · 50021 ms · 2026-05-24T08:31:55.999201+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 2 internal anchors

[1]

A stochastic and dynamic vehicle routing problem in the euclidean plane

Dimitris J Bertsimas and Garrett Van Ryzin. A stochastic and dynamic vehicle routing problem in the euclidean plane. Operations Research, 39(4):601–615, 1991

work page 1991
[2]

Portfolio selection

Harry Markowitz. Portfolio selection. The Journal of Finance, 7(1):77, March 1952

work page 1952
[3]

An application of combinatorial optimization to statistical physics and circuit layout design

Francisco Barahona, Martin Grötschel, Michael Jünger, and Gerhard Reinelt. An application of combinatorial optimization to statistical physics and circuit layout design. Operations Research, 36(3):493–513, 1988

work page 1988
[4]

Integer and combinatorial optimization, vol- ume 55

Laurence A Wolsey and George L Nemhauser. Integer and combinatorial optimization, vol- ume 55. John Wiley & Sons, 1999

work page 1999
[5]

An automatic method for solving discrete programming problems

Ailsa H Land and Alison G Doig. An automatic method for solving discrete programming problems. Springer, 2010

work page 2010
[6]

On learning and branching: a survey

Andrea Lodi and Giulia Zarpellon. On learning and branching: a survey. Top, 25:207–236, 2017

work page 2017
[7]

Miplib 2017: data-driven compilation of the 6th mixed-integer programming library

Ambros Gleixner, Gregor Hendel, Gerald Gamrath, Tobias Achterberg, Michael Bastubbe, Timo Berthold, Philipp Christophel, Kati Jarck, Thorsten Koch, Jeff Linderoth, et al. Miplib 2017: data-driven compilation of the 6th mixed-integer programming library. Mathematical Programming Computation, 13(3):443–490, 2021

work page 2017
[8]

A general reinforcement learning algorithm that masters chess, shogi, and go through self-play

David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140–1144, 2018

work page 2018
[9]

Dota 2 with Large Scale Deep Reinforcement Learning

Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław D˛ ebiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1912
[10]

Interferobot: aligning an optical interferometer by a reinforcement learning agent

Dmitry Sorokin, Alexander Ulanov, Ekaterina Sazhina, and Alexander Lvovsky. Interferobot: aligning an optical interferometer by a reinforcement learning agent. In H. Larochelle, M. Ran- zato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 13238–13248. Curran Associates, Inc., 2020

work page 2020
[11]

Mag- netic control of tokamak plasmas through deep reinforcement learning

Jonas Degrave, Federico Felici, Jonas Buchli, Michael Neunert, Brendan Tracey, Francesco Carpanese, Timo Ewalds, Roland Hafner, Abbas Abdolmaleki, Diego de Las Casas, et al. Mag- netic control of tokamak plasmas through deep reinforcement learning. Nature, 602(7897):414– 419, 2022

work page 2022
[12]

Reinforcement learning enhanced quantum-inspired algorithm for combinatorial optimization

Dmitrii Beloborodov, A E Ulanov, Jakob N Foerster, Shimon Whiteson, and A I Lvovsky. Reinforcement learning enhanced quantum-inspired algorithm for combinatorial optimization. Machine Learning: Science and Technology, 2(2):025009, January 2021

work page 2021
[13]

GPT-4 Technical Report

OpenAI. Gpt-4 technical report. ArXiv, abs/2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[14]

Reinforce- ment learning for variable selection in a branch and bound algorithm

Marc Etheve, Zacharie Alès, Côme Bissuel, Olivier Juan, and Safia Kedad-Sidhoum. Reinforce- ment learning for variable selection in a branch and bound algorithm. In Integration of AI and OR Techniques in Constraint Programming, 2020

work page 2020
[15]

Learning to branch with tree mdps

Lara Scavuzzo, Feng Chen, Didier Chetelat, Maxime Gasse, Andrea Lodi, Neil Yorke-Smith, and Karen Aardal. Learning to branch with tree mdps. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 18514–18526. Curran Associates, Inc., 2022

work page 2022
[16]

Towards a universal test suite for combinatorial auction algorithms

Kevin Leyton-Brown, Mark Pearson, and Yoav Shoham. Towards a universal test suite for combinatorial auction algorithms. In Proceedings of the 2nd ACM conference on Electronic commerce. ACM, October 2000

work page 2000
[17]

Maher, Frederic Matter, Erik Mühmer, Benjamin Müller, Marc E

Ksenia Bestuzheva, Mathieu Besançon, Wei-Kun Chen, Antonia Chmiela, Tim Donkiewicz, Jasper van Doornmalen, Leon Eifler, Oliver Gaul, Gerald Gamrath, Ambros Gleixner, Leona Gottwald, Christoph Graczyk, Katrin Halbig, Alexander Hoen, Christopher Hojny, Rolf van der Hulst, Thorsten Koch, Marco Lübbecke, Stephen J. Maher, Frederic Matter, Erik Mühmer, Benjami...

work page 2021
[18]

IBM ILOG Cplex. V12. 1: User’s manual for cplex. International Business Machines Corpora- tion, 46(53):157, 2009

work page 2009
[19]

Constraint Integer Programming

Tobias Achterberg. Constraint Integer Programming. July 2007. Accepted: 2015-11- 20T17:32:33Z

work page 2007
[20]

Learning to branch in mixed integer programming

Elias Khalil, Pierre Le Bodic, Le Song, George Nemhauser, and Bistra Dilkina. Learning to branch in mixed integer programming. Proceedings of the AAAI Conference on Artificial Intelligence, 30(1), February 2016

work page 2016
[21]

Support-vector networks

Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20(3):273– 297, 1995

work page 1995
[22]

Learning combinatorial optimization algorithms over graphs

Elias Khalil, Hanjun Dai, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. In I. Guyon, U. V on Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017

work page 2017
[23]

Daniel Selsam, Matthew Lamm, Benedikt Bünz, Percy Liang, Leonardo de Moura, and David L. Dill. Learning a sat solver from single-bit supervision, 2018

work page 2018
[24]

Kipf and Max Welling

Thomas N. Kipf and Max Welling. Semi-Supervised Classification with Graph Convolutional Networks. In Proceedings of the 5th International Conference on Learning Representations, ICLR ’17, 2017

work page 2017
[25]

Exact com- binatorial optimization with graph convolutional neural networks

Maxime Gasse, Didier Chetelat, Nicola Ferroni, Laurent Charlin, and Andrea Lodi. Exact com- binatorial optimization with graph convolutional neural networks. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019

work page 2019
[26]

Hybrid models for learning to branch

Prateek Gupta, Maxime Gasse, Elias Khalil, Pawan Mudigonda, Andrea Lodi, and Yoshua Bengio. Hybrid models for learning to branch. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 18087–18097. Curran Associates, Inc., 2020

work page 2020
[27]

Ecole: A gym-like library for machine learning in combinatorial optimization solvers

Antoine Prouvost, Justin Dumouchelle, Lara Scavuzzo, Maxime Gasse, Didier Chételat, and Andrea Lodi. Ecole: A gym-like library for machine learning in combinatorial optimization solvers. In Learning Meets Combinatorial Algorithms at NeurIPS2020, 2020

work page 2020
[28]

Set covering algorithms using cutting planes, heuristics, and subgradient optimization: A computational study, pages 37–60

Egon Balas and Andrew Ho. Set covering algorithms using cutting planes, heuristics, and subgradient optimization: A computational study, pages 37–60. Springer Berlin Heidelberg, Berlin, Heidelberg, 1980

work page 1980
[29]

Cire, Willem-Jan van Hoeve, and John Hooker

David Bergman, Andre A. Cire, Willem-Jan van Hoeve, and John Hooker. Decision Diagrams for Optimization. Springer International Publishing, 2016

work page 2016
[30]

Convergence of stochastic iterative dynamic programming algorithms

Tommi Jaakkola, Michael Jordan, and Satinder Singh. Convergence of stochastic iterative dynamic programming algorithms. In J. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems, volume 6. Morgan-Kaufmann, 1993

work page 1993
[31]

Borovkov

Alexandr A. Borovkov. Probability Theory. Springer London, 2013

work page 2013
[32]

Rusu, Joel Veness, Marc G

V olodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Pe- tersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcemen...

work page 2015
[33]

Cornuejols, R

G. Cornuejols, R. Sridharan, and J.M. Thizy. A comparison of heuristics and relaxations for the capacitated plant location problem. European Journal of Operational Research, 50(3):280–297, feb 1991

work page 1991
[34]

Fukunaga

Alex S. Fukunaga. A branch-and-bound algorithm for hard multiple knapsack problems. Annals of Operations Research, 184(1):97–119, November 2009. 11 A Appendix 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Comb.Auct. 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Set.Cover 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Max.Ind.Set. 0.0 0.2 0.4 0.6 0.8 ...

work page 2009

[1] [1]

A stochastic and dynamic vehicle routing problem in the euclidean plane

Dimitris J Bertsimas and Garrett Van Ryzin. A stochastic and dynamic vehicle routing problem in the euclidean plane. Operations Research, 39(4):601–615, 1991

work page 1991

[2] [2]

Portfolio selection

Harry Markowitz. Portfolio selection. The Journal of Finance, 7(1):77, March 1952

work page 1952

[3] [3]

An application of combinatorial optimization to statistical physics and circuit layout design

Francisco Barahona, Martin Grötschel, Michael Jünger, and Gerhard Reinelt. An application of combinatorial optimization to statistical physics and circuit layout design. Operations Research, 36(3):493–513, 1988

work page 1988

[4] [4]

Integer and combinatorial optimization, vol- ume 55

Laurence A Wolsey and George L Nemhauser. Integer and combinatorial optimization, vol- ume 55. John Wiley & Sons, 1999

work page 1999

[5] [5]

An automatic method for solving discrete programming problems

Ailsa H Land and Alison G Doig. An automatic method for solving discrete programming problems. Springer, 2010

work page 2010

[6] [6]

On learning and branching: a survey

Andrea Lodi and Giulia Zarpellon. On learning and branching: a survey. Top, 25:207–236, 2017

work page 2017

[7] [7]

Miplib 2017: data-driven compilation of the 6th mixed-integer programming library

Ambros Gleixner, Gregor Hendel, Gerald Gamrath, Tobias Achterberg, Michael Bastubbe, Timo Berthold, Philipp Christophel, Kati Jarck, Thorsten Koch, Jeff Linderoth, et al. Miplib 2017: data-driven compilation of the 6th mixed-integer programming library. Mathematical Programming Computation, 13(3):443–490, 2021

work page 2017

[8] [8]

A general reinforcement learning algorithm that masters chess, shogi, and go through self-play

David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140–1144, 2018

work page 2018

[9] [9]

Dota 2 with Large Scale Deep Reinforcement Learning

Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław D˛ ebiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1912

[10] [10]

Interferobot: aligning an optical interferometer by a reinforcement learning agent

Dmitry Sorokin, Alexander Ulanov, Ekaterina Sazhina, and Alexander Lvovsky. Interferobot: aligning an optical interferometer by a reinforcement learning agent. In H. Larochelle, M. Ran- zato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 13238–13248. Curran Associates, Inc., 2020

work page 2020

[11] [11]

Mag- netic control of tokamak plasmas through deep reinforcement learning

Jonas Degrave, Federico Felici, Jonas Buchli, Michael Neunert, Brendan Tracey, Francesco Carpanese, Timo Ewalds, Roland Hafner, Abbas Abdolmaleki, Diego de Las Casas, et al. Mag- netic control of tokamak plasmas through deep reinforcement learning. Nature, 602(7897):414– 419, 2022

work page 2022

[12] [12]

Reinforcement learning enhanced quantum-inspired algorithm for combinatorial optimization

Dmitrii Beloborodov, A E Ulanov, Jakob N Foerster, Shimon Whiteson, and A I Lvovsky. Reinforcement learning enhanced quantum-inspired algorithm for combinatorial optimization. Machine Learning: Science and Technology, 2(2):025009, January 2021

work page 2021

[13] [13]

GPT-4 Technical Report

OpenAI. Gpt-4 technical report. ArXiv, abs/2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[14] [14]

Reinforce- ment learning for variable selection in a branch and bound algorithm

Marc Etheve, Zacharie Alès, Côme Bissuel, Olivier Juan, and Safia Kedad-Sidhoum. Reinforce- ment learning for variable selection in a branch and bound algorithm. In Integration of AI and OR Techniques in Constraint Programming, 2020

work page 2020

[15] [15]

Learning to branch with tree mdps

Lara Scavuzzo, Feng Chen, Didier Chetelat, Maxime Gasse, Andrea Lodi, Neil Yorke-Smith, and Karen Aardal. Learning to branch with tree mdps. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 18514–18526. Curran Associates, Inc., 2022

work page 2022

[16] [16]

Towards a universal test suite for combinatorial auction algorithms

Kevin Leyton-Brown, Mark Pearson, and Yoav Shoham. Towards a universal test suite for combinatorial auction algorithms. In Proceedings of the 2nd ACM conference on Electronic commerce. ACM, October 2000

work page 2000

[17] [17]

Maher, Frederic Matter, Erik Mühmer, Benjamin Müller, Marc E

Ksenia Bestuzheva, Mathieu Besançon, Wei-Kun Chen, Antonia Chmiela, Tim Donkiewicz, Jasper van Doornmalen, Leon Eifler, Oliver Gaul, Gerald Gamrath, Ambros Gleixner, Leona Gottwald, Christoph Graczyk, Katrin Halbig, Alexander Hoen, Christopher Hojny, Rolf van der Hulst, Thorsten Koch, Marco Lübbecke, Stephen J. Maher, Frederic Matter, Erik Mühmer, Benjami...

work page 2021

[18] [18]

IBM ILOG Cplex. V12. 1: User’s manual for cplex. International Business Machines Corpora- tion, 46(53):157, 2009

work page 2009

[19] [19]

Constraint Integer Programming

Tobias Achterberg. Constraint Integer Programming. July 2007. Accepted: 2015-11- 20T17:32:33Z

work page 2007

[20] [20]

Learning to branch in mixed integer programming

Elias Khalil, Pierre Le Bodic, Le Song, George Nemhauser, and Bistra Dilkina. Learning to branch in mixed integer programming. Proceedings of the AAAI Conference on Artificial Intelligence, 30(1), February 2016

work page 2016

[21] [21]

Support-vector networks

Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20(3):273– 297, 1995

work page 1995

[22] [22]

Learning combinatorial optimization algorithms over graphs

Elias Khalil, Hanjun Dai, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. In I. Guyon, U. V on Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017

work page 2017

[23] [23]

Daniel Selsam, Matthew Lamm, Benedikt Bünz, Percy Liang, Leonardo de Moura, and David L. Dill. Learning a sat solver from single-bit supervision, 2018

work page 2018

[24] [24]

Kipf and Max Welling

Thomas N. Kipf and Max Welling. Semi-Supervised Classification with Graph Convolutional Networks. In Proceedings of the 5th International Conference on Learning Representations, ICLR ’17, 2017

work page 2017

[25] [25]

Exact com- binatorial optimization with graph convolutional neural networks

Maxime Gasse, Didier Chetelat, Nicola Ferroni, Laurent Charlin, and Andrea Lodi. Exact com- binatorial optimization with graph convolutional neural networks. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019

work page 2019

[26] [26]

Hybrid models for learning to branch

Prateek Gupta, Maxime Gasse, Elias Khalil, Pawan Mudigonda, Andrea Lodi, and Yoshua Bengio. Hybrid models for learning to branch. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 18087–18097. Curran Associates, Inc., 2020

work page 2020

[27] [27]

Ecole: A gym-like library for machine learning in combinatorial optimization solvers

Antoine Prouvost, Justin Dumouchelle, Lara Scavuzzo, Maxime Gasse, Didier Chételat, and Andrea Lodi. Ecole: A gym-like library for machine learning in combinatorial optimization solvers. In Learning Meets Combinatorial Algorithms at NeurIPS2020, 2020

work page 2020

[28] [28]

Set covering algorithms using cutting planes, heuristics, and subgradient optimization: A computational study, pages 37–60

Egon Balas and Andrew Ho. Set covering algorithms using cutting planes, heuristics, and subgradient optimization: A computational study, pages 37–60. Springer Berlin Heidelberg, Berlin, Heidelberg, 1980

work page 1980

[29] [29]

Cire, Willem-Jan van Hoeve, and John Hooker

David Bergman, Andre A. Cire, Willem-Jan van Hoeve, and John Hooker. Decision Diagrams for Optimization. Springer International Publishing, 2016

work page 2016

[30] [30]

Convergence of stochastic iterative dynamic programming algorithms

Tommi Jaakkola, Michael Jordan, and Satinder Singh. Convergence of stochastic iterative dynamic programming algorithms. In J. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems, volume 6. Morgan-Kaufmann, 1993

work page 1993

[31] [31]

Borovkov

Alexandr A. Borovkov. Probability Theory. Springer London, 2013

work page 2013

[32] [32]

Rusu, Joel Veness, Marc G

V olodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Pe- tersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcemen...

work page 2015

[33] [33]

Cornuejols, R

G. Cornuejols, R. Sridharan, and J.M. Thizy. A comparison of heuristics and relaxations for the capacitated plant location problem. European Journal of Operational Research, 50(3):280–297, feb 1991

work page 1991

[34] [34]

Fukunaga

Alex S. Fukunaga. A branch-and-bound algorithm for hard multiple knapsack problems. Annals of Operations Research, 184(1):97–119, November 2009. 11 A Appendix 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Comb.Auct. 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Set.Cover 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Max.Ind.Set. 0.0 0.2 0.4 0.6 0.8 ...

work page 2009