Convex Compositional Reasoning Models

Abduragim Shtanchaev; Albert Baichorov; Arip Asadulaev; Dmitry V. Dylov; Fakhri Karray; Maksim Bobrin; Martin Tak\'a\v{c}; Meir Roketlishvili; Semyon Semenov; Viktor Kovalchuk

arxiv: 2605.23395 · v1 · pith:RYUOZNNEnew · submitted 2026-05-22 · 💻 cs.LG

Convex Compositional Reasoning Models

Meir Roketlishvili , Semyon Semenov , Maksim Bobrin , Viktor Kovalchuk , Albert Baichorov , Abduragim Shtanchaev , Fakhri Karray , Dmitry V. Dylov

show 2 more authors

Martin Tak\'a\v{c} Arip Asadulaev

This is my paper

Pith reviewed 2026-05-25 05:09 UTC · model grok-4.3

classification 💻 cs.LG

keywords compositional reasoningenergy-based modelsinput-convex neural networksconvex relaxationcombinatorial optimizationgeneralization across sizesprojected gradient methods

0 comments

The pith

Convex compositional energy models trained on small problems solve larger combinatorial instances without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that the main obstacle for compositional energy-based models in reasoning is the non-convex shape of the learned energies rather than the act of composing factors. It proposes Convex Compositional Energy Minimization, which assigns an input-convex neural network to each factor and minimizes the sum over a tight convex relaxation of the feasible assignments. Because convexity is preserved when factors are added, the global objective stays convex and admits reliable first-order optimization. Training proceeds in two stages: local contrastive shaping of each factor followed by joint refinement through an unrolled solver. A reader would care if this route allows reuse of small learned pieces on instances too large to train directly.

Core claim

Compositional energy-based models can generalize to larger combinatorial reasoning problems by reusing a learned factor energy across many local constraints. The key bottleneck is the non-convex geometry of the learned energy landscape. CCEM parameterizes each factor with an input-convex neural network and optimizes the composed energy over a tight convex relaxation of the feasible set. Because convexity is preserved under summation, the global relaxed objective remains convex, enabling deterministic projected first-order optimization. The model is trained in two stages: factor-level contrastive learning to shape local energy basins, followed by end-to-end refinement through an unrolled proj

What carries the argument

Convex Compositional Energy Minimization (CCEM), which assigns an input-convex neural network to each factor so that the summed objective over the convex relaxation stays convex and admits deterministic optimization.

If this is right

Models trained on small subproblems or a single problem size can be applied directly to larger instances of the same combinatorial structure.
The global relaxed objective remains convex, so projected gradient methods converge to a unique minimizer without random restarts.
Factor energies learned by contrastive methods can be reused across different numbers of constraints because summation preserves convexity.
End-to-end refinement through the unrolled solver improves the factors beyond what local contrastive training alone achieves.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same convex-factor construction could be tried on other energy-based formulations of planning or scheduling where the number of variables grows with problem size.
If the convex relaxation proves tight on a given problem family, the method would yield exact global optima rather than merely good feasible points.
One could test whether the two-stage procedure can be replaced by a single joint objective that still keeps each factor convex by construction.

Load-bearing premise

That parameterizing each factor with an input-convex neural network will shape local energy basins effectively while preserving convexity under summation and yielding a sufficiently tight convex relaxation for the global objective.

What would settle it

A controlled test in which models trained only on small or fixed-size instances produce solutions on larger instances whose objective values or constraint violations are no better than those from non-convex baselines or random feasible points.

Figures

Figures reproduced from arXiv: 2605.23395 by Abduragim Shtanchaev, Albert Baichorov, Arip Asadulaev, Dmitry V. Dylov, Fakhri Karray, Maksim Bobrin, Martin Tak\'a\v{c}, Meir Roketlishvili, Semyon Semenov, Viktor Kovalchuk.

**Figure 2.** Figure 2: Energy map for 8-Queens. A correct solution is assigned low energy, while adding an extra queen produces higher local energy. We address four questions: (Q1) does enforcing convexity in the scope variable improve over the unconstrained MLP score network? (Q2) does projecting onto the exact convex hull of the feasible set replace the need for a long diffusion chain? (Q3) how do projected (multi-start) Adam… view at source ↗

**Figure 3.** Figure 3: Optimized samples across projected Adam timesteps for 8-Queens. Intermediate columns [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Inference-start ablation for N-Queens. Increasing the number of independent starts [PITH_FULL_IMAGE:figures/full_fig_p017_4.png] view at source ↗

**Figure 5.** Figure 5: Heatmap view of the inference-start ablation across board sizes. Larger boards require [PITH_FULL_IMAGE:figures/full_fig_p017_5.png] view at source ↗

read the original abstract

Compositional energy-based models can generalize to larger combinatorial reasoning problems by reusing a learned factor energy across many local constraints. In our paper, we show that a key bottleneck in compositional reasoning is not composition itself, but the non-convex geometry of the learned energy landscape. To solve this problem, we introduce Convex Compositional Energy Minimization (CCEM), a framework that parameterizes each factor with an input-convex neural network and optimizes the composed energy over a tight convex relaxation of the feasible set. Because convexity is preserved under summation, the global relaxed objective remains convex, enabling deterministic projected first-order optimization. CCEM is trained in two stages: factor-level contrastive learning to shape local energy basins, followed by end-to-end refinement through an unrolled projected solver. Our experiments show that our models trained on small subproblems or a single problem size transfer to larger instances without retraining.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces Convex Compositional Energy Minimization (CCEM), a framework for compositional energy-based models in combinatorial reasoning. Each local factor is parameterized by an input-convex neural network; because convexity is preserved under summation, the global objective over a convex relaxation of the feasible set remains convex and can be optimized deterministically with projected first-order methods. Training proceeds in two stages (factor-level contrastive learning followed by end-to-end unrolled refinement). The central claim is that models trained on small subproblems or a single problem size transfer zero-shot to larger instances.

Significance. If the transfer results are supported by rigorous quantitative evidence, the work would be significant for scaling combinatorial reasoning without retraining on large instances. The technical choice to enforce input-convexity per factor correctly exploits the closure of convex functions under summation, enabling reliable first-order optimization. The two-stage training procedure (contrastive shaping of local basins plus unrolled refinement) is a reasonable design. No machine-checked proofs or parameter-free derivations are presented, but the framework is falsifiable via the transfer experiments it describes.

major comments (2)

[Abstract] Abstract: the central transfer claim ('our models trained on small subproblems or a single problem size transfer to larger instances without retraining') is asserted without any quantitative metrics, baselines, dataset sizes, ablation studies, or success rates. Because the soundness of the transfer result cannot be assessed from the provided information, the load-bearing empirical support for the main contribution is missing.
[Abstract] Abstract (framework definition): while the paper correctly states that convexity is preserved under summation, it supplies no analysis or experiments showing that the convex relaxation remains sufficiently tight as the number of factors or domain size grows. If the integrality gap widens, projected solutions on larger instances may degrade even when local factors are well-shaped, directly undermining the zero-shot transfer claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback. We address the two major comments on the abstract below and will revise the manuscript accordingly to strengthen the presentation of our results.

read point-by-point responses

Referee: [Abstract] Abstract: the central transfer claim ('our models trained on small subproblems or a single problem size transfer to larger instances without retraining') is asserted without any quantitative metrics, baselines, dataset sizes, ablation studies, or success rates. Because the soundness of the transfer result cannot be assessed from the provided information, the load-bearing empirical support for the main contribution is missing.

Authors: We agree that the abstract as written does not include quantitative metrics or references to specific results. The full manuscript (Section 4) contains the supporting experiments with success rates, baselines, dataset sizes, and ablations demonstrating zero-shot transfer. We will revise the abstract to summarize these key quantitative findings. revision: yes
Referee: [Abstract] Abstract (framework definition): while the paper correctly states that convexity is preserved under summation, it supplies no analysis or experiments showing that the convex relaxation remains sufficiently tight as the number of factors or domain size grows. If the integrality gap widens, projected solutions on larger instances may degrade even when local factors are well-shaped, directly undermining the zero-shot transfer claim.

Authors: This observation is correct: the manuscript states that the relaxation is tight but provides neither a formal analysis of integrality gap growth nor dedicated experiments isolating gap size versus problem scale. The transfer experiments offer indirect empirical support, but they do not directly measure the gap. We will add a discussion of this limitation together with new experiments reporting the gap on increasing instance sizes. revision: yes

Circularity Check

0 steps flagged

No circularity: framework relies on standard convexity preservation and empirical transfer

full rationale

The derivation chain uses the known property that input-convex networks preserve convexity under summation (a mathematical fact independent of the paper), trains via contrastive learning plus unrolled optimization on small instances, and reports experimental zero-shot transfer to larger instances. No step reduces a claimed prediction or uniqueness result to a fitted parameter or self-citation by construction. The tightness of the relaxation is an empirical question addressed by experiments rather than assumed tautologically. No self-citations or ansatzes are invoked as load-bearing in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only the abstract is available; no specific free parameters, axioms, or invented entities can be extracted beyond the high-level claim that convexity is preserved under summation of input-convex functions.

axioms (1)

standard math Convexity is preserved under summation of input-convex functions
Invoked to ensure the global relaxed objective remains convex.

pith-pipeline@v0.9.0 · 5725 in / 1195 out tokens · 31675 ms · 2026-05-25T05:09:58.944228+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel; dAlembert_to_ODE_general_theorem matches

?

matches
MATCHES: this paper passage directly uses, restates, or depends on the cited Recognition theorem or module.

Because convexity is preserved under summation, the global relaxed objective remains convex... first-order optimality condition at the target solution is enough to certify global optimality
IndisputableMonolith/Foundation/BranchSelection.lean branch_selection; RCLCombiner_isCoupling_iff echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

nonnegative factor summation preserves convexity of the relaxed global objective... convex composition rules out spurious relaxed local minima

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

24 extracted references · 10 canonical work pages · 7 internal anchors

[1]

Input convex neural networks

Brandon Amos, Lei Xu, and J Zico Kolter. Input convex neural networks. InInternational conference on machine learning, pages 146–155. PMLR, 2017

2017
[2]

Connecting convex energy-based inference and optimal transport for domain adaptation

Arip Asadulaev. Connecting convex energy-based inference and optimal transport for domain adaptation. InEnergy Based Models Workshop-ICLR 2021

2021
[3]

A minimalist approach for domain adaptation with optimal transport

Arip Asadulaev, Vitaly Shutov, Alexander Korotin, Alexander Panfilov, Vladislava Kontsevaya, and Andrey Filchenkov. A minimalist approach for domain adaptation with optimal transport. In Sarath Chandar, Razvan Pascanu, Hanie Sedghi, and Doina Precup, editors,Proceedings of The 2nd Conference on Lifelong Learning Agents, volume 232 ofProceedings of Machine...

2023
[4]

Neural Combinatorial Optimization with Reinforcement Learning

Irwan Bello, Hieu Pham, Quoc V Le, Mohammad Norouzi, and Samy Bengio. Neural combina- torial optimization with reinforcement learning.arXiv preprint arXiv:1611.09940, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[5]

Optimal Control Via Neural Networks: A Convex Approach

Yize Chen, Yuanyuan Shi, and Baosen Zhang. Optimal control via neural networks: A convex approach.arXiv preprint arXiv:1805.11835, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[6]

Xlvin: executed latent value iteration nets.arXiv preprint arXiv:2010.13146, 2020

Andreea Deac, Petar Veliˇckovi´c, Ognjen Milinkovi´c, Pierre-Luc Bacon, Jian Tang, and Mladen Nikoli´c. Xlvin: executed latent value iteration nets.arXiv preprint arXiv:2010.13146, 2020

work page arXiv 2010
[7]

Neural Turing Machines

Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines.arXiv preprint arXiv:1410.5401, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[8]

An efficient graph convolutional network technique for the travelling salesman problem.arXiv preprint arXiv:1906.01227, 2019

Chaitanya K Joshi, Thomas Laurent, and Xavier Bresson. An efficient graph convolutional network technique for the travelling salesman problem.arXiv preprint arXiv:1906.01227, 2019

work page arXiv 1906
[9]

Neural GPUs Learn Algorithms

Łukasz Kaiser and Ilya Sutskever. Neural gpus learn algorithms.arXiv preprint arXiv:1511.08228, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[10]

Semi-Supervised Classification with Graph Convolutional Networks

Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks.arXiv preprint arXiv:1609.02907, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[11]

Do neural optimal transport solvers work? a continuous wasserstein-2 benchmark.Advances in neural information processing systems, 34:14593–14605, 2021

Alexander Korotin, Lingxiao Li, Aude Genevay, Justin M Solomon, Alexander Filippov, and Evgeny Burnaev. Do neural optimal transport solvers work? a continuous wasserstein-2 benchmark.Advances in neural information processing systems, 34:14593–14605, 2021

2021
[12]

A tutorial on energy-based learning.Predicting structured data, 1(0), 2006

Yann LeCun, Sumit Chopra, Raia Hadsell, M Ranzato, Fujie Huang, et al. A tutorial on energy-based learning.Predicting structured data, 1(0), 2006

2006
[13]

Graph colouring meets deep learning: Effective graph neural network models for combinatorial problems

Henrique Lemos, Marcelo Prates, Pedro Avelar, and Luis Lamb. Graph colouring meets deep learning: Effective graph neural network models for combinatorial problems. In2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI), pages 879–885. IEEE, 2019

2019
[14]

Fast t2t: Optimization consistency speeds up diffusion-based training-to-testing solving for combinatorial optimization

Yang Li, Jinpei Guo, Runzhong Wang, Hongyuan Zha, and Junchi Yan. Fast t2t: Optimization consistency speeds up diffusion-based training-to-testing solving for combinatorial optimization. Advances in Neural Information Processing Systems, 37:30179–30206, 2024

2024
[15]

Nsnet: A general neural probabilistic framework for satisfiability problems.Advances in Neural Information Processing Systems, 35:25573–25585, 2022

Zhaoyu Li and Xujie Si. Nsnet: A general neural probabilistic framework for satisfiability problems.Advances in Neural Information Processing Systems, 35:25573–25585, 2022

2022
[16]

Optimal transport mapping via input convex neural networks

Ashok Makkuva, Amirhossein Taghvaei, Sewoong Oh, and Jason Lee. Optimal transport mapping via input convex neural networks. InInternational Conference on Machine Learning, pages 6672–6681. PMLR, 2020

2020
[17]

Generalizable reasoning through compositional energy mini- mization.arXiv preprint arXiv:2510.20607, 2025

Alexandru Oarga and Yilun Du. Generalizable reasoning through compositional energy mini- mization.arXiv preprint arXiv:2510.20607, 2025

work page arXiv 2025
[18]

Can you learn an algorithm? generalizing from easy to hard problems with recurrent networks.Advances in Neural Information Processing Systems, 34:6695–6706, 2021

Avi Schwarzschild, Eitan Borgnia, Arjun Gupta, Furong Huang, Uzi Vishkin, Micah Goldblum, and Tom Goldstein. Can you learn an algorithm? generalizing from easy to hard problems with recurrent networks.Advances in Neural Information Processing Systems, 34:6695–6706, 2021. 11

2021
[19]

Learning a SAT Solver from Single-Bit Supervision

Daniel Selsam, Matthew Lamm, Benedikt Bünz, Percy Liang, Leonardo de Moura, and David L Dill. Learning a sat solver from single-bit supervision.arXiv preprint arXiv:1802.03685, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[20]

Difusco: Graph-based diffusion solvers for combinatorial optimization.Advances in neural information processing systems, 36:3706–3731, 2023

Zhiqing Sun and Yiming Yang. Difusco: Graph-based diffusion solvers for combinatorial optimization.Advances in neural information processing systems, 36:3706–3731, 2023

2023
[21]

Graph Attention Networks

Petar Veliˇckovi´c, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks.arXiv preprint arXiv:1710.10903, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[22]

Generative flow networks for discrete probabilistic modeling

Dinghuai Zhang, Nikolay Malkin, Zhen Liu, Alexandra V olokhova, Aaron Courville, and Yoshua Bengio. Generative flow networks for discrete probabilistic modeling. InInternational Conference on Machine Learning, pages 26412–26428. PMLR, 2022. 12 A Proofs Proposition A.1(Non-convex factor composition admits spurious stable minima).Let Y ⊂R d be a convex rela...

2022
[23]

The composed energy also contains a point¯y∈ Y, with¯y̸=y ⋆, satisfying ∇E(¯y) = 0,∇ 2E(¯y)≻0
[24]

Moreover, for gradient descent yt+1 =y t −η∇E(y t), there exists a stepsize η >0 and an open neighborhood U⊂ Y of ¯ysuch that every initialization y0 ∈U converges to ¯y

Therefore,¯yis a strict local minimum ofE, but not a global minimizer: E(¯y)> E(y⋆). Moreover, for gradient descent yt+1 =y t −η∇E(y t), there exists a stepsize η >0 and an open neighborhood U⊂ Y of ¯ysuch that every initialization y0 ∈U converges to ¯y. Hence failure from such initializations is caused by the geometry of the composed energy landscape, ra...

[1] [1]

Input convex neural networks

Brandon Amos, Lei Xu, and J Zico Kolter. Input convex neural networks. InInternational conference on machine learning, pages 146–155. PMLR, 2017

2017

[2] [2]

Connecting convex energy-based inference and optimal transport for domain adaptation

Arip Asadulaev. Connecting convex energy-based inference and optimal transport for domain adaptation. InEnergy Based Models Workshop-ICLR 2021

2021

[3] [3]

A minimalist approach for domain adaptation with optimal transport

Arip Asadulaev, Vitaly Shutov, Alexander Korotin, Alexander Panfilov, Vladislava Kontsevaya, and Andrey Filchenkov. A minimalist approach for domain adaptation with optimal transport. In Sarath Chandar, Razvan Pascanu, Hanie Sedghi, and Doina Precup, editors,Proceedings of The 2nd Conference on Lifelong Learning Agents, volume 232 ofProceedings of Machine...

2023

[4] [4]

Neural Combinatorial Optimization with Reinforcement Learning

Irwan Bello, Hieu Pham, Quoc V Le, Mohammad Norouzi, and Samy Bengio. Neural combina- torial optimization with reinforcement learning.arXiv preprint arXiv:1611.09940, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[5] [5]

Optimal Control Via Neural Networks: A Convex Approach

Yize Chen, Yuanyuan Shi, and Baosen Zhang. Optimal control via neural networks: A convex approach.arXiv preprint arXiv:1805.11835, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[6] [6]

Xlvin: executed latent value iteration nets.arXiv preprint arXiv:2010.13146, 2020

Andreea Deac, Petar Veliˇckovi´c, Ognjen Milinkovi´c, Pierre-Luc Bacon, Jian Tang, and Mladen Nikoli´c. Xlvin: executed latent value iteration nets.arXiv preprint arXiv:2010.13146, 2020

work page arXiv 2010

[7] [7]

Neural Turing Machines

Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines.arXiv preprint arXiv:1410.5401, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[8] [8]

An efficient graph convolutional network technique for the travelling salesman problem.arXiv preprint arXiv:1906.01227, 2019

Chaitanya K Joshi, Thomas Laurent, and Xavier Bresson. An efficient graph convolutional network technique for the travelling salesman problem.arXiv preprint arXiv:1906.01227, 2019

work page arXiv 1906

[9] [9]

Neural GPUs Learn Algorithms

Łukasz Kaiser and Ilya Sutskever. Neural gpus learn algorithms.arXiv preprint arXiv:1511.08228, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[10] [10]

Semi-Supervised Classification with Graph Convolutional Networks

Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks.arXiv preprint arXiv:1609.02907, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[11] [11]

Do neural optimal transport solvers work? a continuous wasserstein-2 benchmark.Advances in neural information processing systems, 34:14593–14605, 2021

Alexander Korotin, Lingxiao Li, Aude Genevay, Justin M Solomon, Alexander Filippov, and Evgeny Burnaev. Do neural optimal transport solvers work? a continuous wasserstein-2 benchmark.Advances in neural information processing systems, 34:14593–14605, 2021

2021

[12] [12]

A tutorial on energy-based learning.Predicting structured data, 1(0), 2006

Yann LeCun, Sumit Chopra, Raia Hadsell, M Ranzato, Fujie Huang, et al. A tutorial on energy-based learning.Predicting structured data, 1(0), 2006

2006

[13] [13]

Graph colouring meets deep learning: Effective graph neural network models for combinatorial problems

Henrique Lemos, Marcelo Prates, Pedro Avelar, and Luis Lamb. Graph colouring meets deep learning: Effective graph neural network models for combinatorial problems. In2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI), pages 879–885. IEEE, 2019

2019

[14] [14]

Fast t2t: Optimization consistency speeds up diffusion-based training-to-testing solving for combinatorial optimization

Yang Li, Jinpei Guo, Runzhong Wang, Hongyuan Zha, and Junchi Yan. Fast t2t: Optimization consistency speeds up diffusion-based training-to-testing solving for combinatorial optimization. Advances in Neural Information Processing Systems, 37:30179–30206, 2024

2024

[15] [15]

Nsnet: A general neural probabilistic framework for satisfiability problems.Advances in Neural Information Processing Systems, 35:25573–25585, 2022

Zhaoyu Li and Xujie Si. Nsnet: A general neural probabilistic framework for satisfiability problems.Advances in Neural Information Processing Systems, 35:25573–25585, 2022

2022

[16] [16]

Optimal transport mapping via input convex neural networks

Ashok Makkuva, Amirhossein Taghvaei, Sewoong Oh, and Jason Lee. Optimal transport mapping via input convex neural networks. InInternational Conference on Machine Learning, pages 6672–6681. PMLR, 2020

2020

[17] [17]

Generalizable reasoning through compositional energy mini- mization.arXiv preprint arXiv:2510.20607, 2025

Alexandru Oarga and Yilun Du. Generalizable reasoning through compositional energy mini- mization.arXiv preprint arXiv:2510.20607, 2025

work page arXiv 2025

[18] [18]

Can you learn an algorithm? generalizing from easy to hard problems with recurrent networks.Advances in Neural Information Processing Systems, 34:6695–6706, 2021

Avi Schwarzschild, Eitan Borgnia, Arjun Gupta, Furong Huang, Uzi Vishkin, Micah Goldblum, and Tom Goldstein. Can you learn an algorithm? generalizing from easy to hard problems with recurrent networks.Advances in Neural Information Processing Systems, 34:6695–6706, 2021. 11

2021

[19] [19]

Learning a SAT Solver from Single-Bit Supervision

Daniel Selsam, Matthew Lamm, Benedikt Bünz, Percy Liang, Leonardo de Moura, and David L Dill. Learning a sat solver from single-bit supervision.arXiv preprint arXiv:1802.03685, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[20] [20]

Difusco: Graph-based diffusion solvers for combinatorial optimization.Advances in neural information processing systems, 36:3706–3731, 2023

Zhiqing Sun and Yiming Yang. Difusco: Graph-based diffusion solvers for combinatorial optimization.Advances in neural information processing systems, 36:3706–3731, 2023

2023

[21] [21]

Graph Attention Networks

Petar Veliˇckovi´c, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks.arXiv preprint arXiv:1710.10903, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[22] [22]

Generative flow networks for discrete probabilistic modeling

Dinghuai Zhang, Nikolay Malkin, Zhen Liu, Alexandra V olokhova, Aaron Courville, and Yoshua Bengio. Generative flow networks for discrete probabilistic modeling. InInternational Conference on Machine Learning, pages 26412–26428. PMLR, 2022. 12 A Proofs Proposition A.1(Non-convex factor composition admits spurious stable minima).Let Y ⊂R d be a convex rela...

2022

[23] [23]

The composed energy also contains a point¯y∈ Y, with¯y̸=y ⋆, satisfying ∇E(¯y) = 0,∇ 2E(¯y)≻0

[24] [24]

Moreover, for gradient descent yt+1 =y t −η∇E(y t), there exists a stepsize η >0 and an open neighborhood U⊂ Y of ¯ysuch that every initialization y0 ∈U converges to ¯y

Therefore,¯yis a strict local minimum ofE, but not a global minimizer: E(¯y)> E(y⋆). Moreover, for gradient descent yt+1 =y t −η∇E(y t), there exists a stepsize η >0 and an open neighborhood U⊂ Y of ¯ysuch that every initialization y0 ∈U converges to ¯y. Hence failure from such initializations is caused by the geometry of the composed energy landscape, ra...