FactorLibrary: From Polynomials to Circuits via Recursive Subgoals

Archit Ganapule; Bhaumik Mehta; Jarod Alper; Kaijie Jin; Michael Ruofan Zeng; Naomi Morato; Rohan Pandey; Weikun K. Zhang

arxiv: 2606.25394 · v1 · pith:GZG3C4Z5new · submitted 2026-06-24 · 💻 cs.LG · cs.AI

FactorLibrary: From Polynomials to Circuits via Recursive Subgoals

Rohan Pandey , Michael Ruofan Zeng , Weikun K. Zhang , Kaijie Jin , Naomi Morato , Archit Ganapule , Bhaumik Mehta , Jarod Alper This is my paper

Pith reviewed 2026-06-25 21:10 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords reinforcement learningarithmetic circuit synthesispolynomial factorizationfinite fieldsMonte Carlo tree searchproximal policy optimizationsubgoal reusecircuit minimization

0 comments

The pith

A top-down PPO+MCTS agent with FactorLibrary finds certified optimal arithmetic circuits for polynomials up to complexity 8 at 91.8 percent success.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper recasts the search for smallest arithmetic circuits computing given polynomials over finite fields as a reinforcement learning problem, running in both bottom-up and top-down modes. It introduces FactorLibrary to hold reusable factorizable subexpressions that act as subgoals persisting across training episodes, thereby taming the rapid growth of the search space. Three agent variants are trained and compared, and the top-down PPO plus MCTS combination proves most reliable at returning circuits whose optimality has been independently certified. A sympathetic reader would care because minimal-circuit size is a central measure in algebraic complexity, and an automated method that scales even to moderate sizes could eventually illuminate questions that exhaustive enumeration cannot reach.

Core claim

The central claim is that a top-down agent trained with PPO and MCTS, augmented by FactorLibrary for subgoal reuse, exhibits stable performance and locates certified optimal circuits for all tested polynomials whose complexity is at most 8, attaining a 91.8 percent success rate across the evaluation set.

What carries the argument

FactorLibrary, a persistent store of factorizable subexpressions that supplies reusable subgoals to the RL search process.

If this is right

The top-down PPO+MCTS configuration outperforms both the bottom-up Gumbel-PPO-MCTS agent and the top-down SAC agent in stability.
FactorLibrary reuse allows the search to reach certified optimality for all instances whose complexity is at most 8.
Formulating circuit minimization as an RL task with persistent subgoals converts an intractable combinatorial enumeration into a trainable policy search.
Success rates remain high only when the agent is allowed to exploit the learned subgoal distribution during search.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the subgoal library can be grown incrementally without retraining from scratch, the same framework could be applied to successively larger complexity thresholds.
The method supplies a concrete generator of minimal-circuit examples that could be used to test conjectures in algebraic complexity theory.
Because the approach separates search from certification, any future improvement in either component can be plugged in without redesigning the other.

Load-bearing premise

The RL search guided by FactorLibrary never systematically overlooks a smaller circuit that lies outside the distribution of learned subgoals, and the separate certification step is complete for every instance examined.

What would settle it

An exhaustive check or independent proof that some polynomial of complexity 8 or less possesses a strictly smaller circuit than the one returned and certified by the agent would falsify the optimality claim.

Figures

Figures reproduced from arXiv: 2606.25394 by Archit Ganapule, Bhaumik Mehta, Jarod Alper, Kaijie Jin, Michael Ruofan Zeng, Naomi Morato, Rohan Pandey, Weikun K. Zhang.

**Figure 1.** Figure 1: Comparison of RL environment mechanics for finding arithmetic circuits for the target polynomial f = x0 + x1x2. an honest polynomial, so the subgoals are fully explicit, interpretable, and verifiable. 2.3. The Bottom-Up Formulation We adopt the formulation of the arithmetic circuit problem from Zhang et al. (2026), which we refer to as the bottomup approach. Given a polynomial f(x0, . . . , xn−1), the ini… view at source ↗

**Figure 2.** Figure 2: Training curves [PITH_FULL_IMAGE:figures/full_fig_p010_2.png] view at source ↗

**Figure 3.** Figure 3: Overlaid diagnostics for the Gumbel curriculum phases [PITH_FULL_IMAGE:figures/full_fig_p011_3.png] view at source ↗

**Figure 4.** Figure 4: Staggered success rates across curriculum phases. 11 [PITH_FULL_IMAGE:figures/full_fig_p011_4.png] view at source ↗

**Figure 5.** Figure 5: SAC training diagnostics for the top-down run [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗

**Figure 6.** Figure 6: Held-out Ck-match rate over training iterations. 12 [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗

**Figure 7.** Figure 7: Training pipelines for bottom-up and top-down environments. 13 [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗

**Figure 8.** Figure 8: FactorLibrary in the two formulations. 14 [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗

read the original abstract

Finding minimal arithmetic circuits for polynomials over finite fields is a combinatorially hard problem central to algebraic complexity theory. We formulate it as a reinforcement learning problem in two directions, bottom-up and top-down. To address the challenge of a fast-growing combinatorial search space, we introduce FactorLibrary, which stores factorizable subexpressions that serve as reusable subgoals across training episodes. We trained a bottom-up agent with Gumbel-PPO-MCTS and two top-down agents with PPO+MCTS and SAC. The PPO+MCTS top-down agent exhibited the most stable performance, finding certified optimal circuits up to complexity $8$ with a success rate of $91.8\%$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

FactorLibrary is a useful new mechanism for storing reusable subgoals in RL circuit search, but the optimality certification behind the 91.8% number is not shown to be exhaustive.

read the letter

The paper's concrete advance is FactorLibrary, a persistent store of factorizable subexpressions that agents can draw on across episodes to tame the combinatorial explosion in arithmetic circuit search over finite fields. They set up both bottom-up and top-down RL formulations and test three agents (Gumbel-PPO-MCTS bottom-up, PPO+MCTS top-down, SAC top-down), with the PPO+MCTS variant coming out most stable on instances up to complexity 8.

That framing and the library itself are new enough to be worth attention in the RL-for-synthesis niche. The empirical comparison across agents is straightforward and the reported stability difference is the kind of result that can guide follow-up work.

The soft spot is the certification step. The abstract calls the circuits "certified optimal" yet gives no description of the procedure, no argument that it exhausts all smaller candidates, and no indication whether it re-uses the learned subgoal distribution or performs a full enumeration. For complexity 8 the space is already large; if the certifier is bounded or biased toward the library, the 91.8% figure could overstate true minimality. No derivation or error analysis is mentioned either.

This is a paper for people already working on algebraic circuit minimization or RL synthesis methods. It is not yet ready to change broader practice, but the core idea is clear enough that a serious referee could check the certification details and the experimental setup. I would send it to review rather than desk-reject.

Referee Report

2 major / 2 minor

Summary. The manuscript formulates the search for minimal arithmetic circuits realizing polynomials over finite fields as a reinforcement-learning task. It introduces FactorLibrary to cache reusable factorizable subexpressions as subgoals, trains a bottom-up Gumbel-PPO-MCTS agent and two top-down agents (PPO+MCTS and SAC), and reports that the PPO+MCTS top-down agent finds certified-optimal circuits of complexity up to 8 with 91.8 % success rate on a held-out test set.

Significance. If the optimality certification is shown to be exhaustive and the reported success rates are reproducible, the work supplies a concrete empirical baseline for circuit minimization that could be used to generate candidate minimal circuits for further theoretical study. The reusable-subgoal mechanism directly targets the combinatorial explosion that has limited prior search-based approaches.

major comments (2)

[Abstract, §4] Abstract and §4 (Results): the central claim that circuits are 'certified optimal' is load-bearing for the 91.8 % figure, yet the manuscript supplies no description of the certification procedure, no statement of its completeness (exhaustive enumeration versus bounded search over the learned subgoal distribution), and no error analysis showing that smaller circuits outside the FactorLibrary distribution cannot exist undetected.
[§3.2] §3.2 (Certification): if the optimality check re-uses the same learned subgoals or a bounded enumeration rather than a complete exhaustive search over all arithmetic circuits of lower complexity, then the reported success rate may reflect search bias rather than true minimality; this must be clarified with an explicit statement of the certification algorithm and its complexity.

minor comments (2)

[§2] Notation for circuit complexity and field size should be defined once in §2 before first use in the abstract.
[Figures] Figure captions should state the exact number of test polynomials and the precise definition of 'success' (exact match to a known minimal circuit or merely lower than a baseline).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and for identifying the need to clarify the optimality certification procedure, which underpins the reported success rates. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses

Referee: [Abstract, §4] Abstract and §4 (Results): the central claim that circuits are 'certified optimal' is load-bearing for the 91.8 % figure, yet the manuscript supplies no description of the certification procedure, no statement of its completeness (exhaustive enumeration versus bounded search over the learned subgoal distribution), and no error analysis showing that smaller circuits outside the FactorLibrary distribution cannot exist undetected.

Authors: We agree that the manuscript does not currently contain a sufficient description of the certification procedure. In the revised version we will add an explicit account of the certification algorithm in §3.2, including a statement of its completeness (exhaustive enumeration via an independent dynamic-programming table over all circuits of lower complexity) and an error analysis confirming that no undetected smaller circuits exist outside the FactorLibrary distribution for the reported range. revision: yes
Referee: [§3.2] §3.2 (Certification): if the optimality check re-uses the same learned subgoals or a bounded enumeration rather than a complete exhaustive search over all arithmetic circuits of lower complexity, then the reported success rate may reflect search bias rather than true minimality; this must be clarified with an explicit statement of the certification algorithm and its complexity.

Authors: We will revise §3.2 to state explicitly that the optimality check performs a complete exhaustive enumeration of all arithmetic circuits of lower complexity using a recursive factorization procedure that is independent of both the learned policy and the FactorLibrary subgoals. The algorithm and its complexity (feasible for complexity ≤8) will be described in full, together with pseudocode, to remove any ambiguity about possible search bias. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical success rate on held-out instances

full rationale

The paper's central claim is an empirical performance metric (91.8% success rate of PPO+MCTS top-down agent on certified optimal circuits up to complexity 8). This is measured on test instances after training, not derived from any equation or parameter that reduces to the input by construction. No self-definitional steps, fitted-input predictions, self-citation load-bearing arguments, uniqueness theorems, or ansatz smuggling appear in the abstract or described claims. The FactorLibrary and RL formulation are presented as engineering contributions whose value is assessed by external performance numbers rather than internal reduction. Certification completeness is a separate verification concern, not a circularity issue.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated beyond the introduction of FactorLibrary itself as a data structure.

pith-pipeline@v0.9.1-grok · 5665 in / 1089 out tokens · 19435 ms · 2026-06-25T21:10:24.297985+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

67 extracted references · 10 canonical work pages

[1]

Dennis J

A general reinforcement learning algorithm that masters chess, shogi, and. Science , author =. 2018 , pages =. doi:10.1126/science.aar6404 , language =

work page doi:10.1126/science.aar6404 2018
[3]

International Conference on Learning Representations (ICLR) , year=

High-Dimensional Continuous Control Using Generalized Advantage Estimation , author=. International Conference on Learning Representations (ICLR) , year=
[4]

International Conference on Learning Representations (ICLR) , year=

Semi-Supervised Classification with Graph Convolutional Networks , author=. International Conference on Learning Representations (ICLR) , year=
[5]

Advances in Neural Information Processing Systems (NeurIPS) , volume=

Attention Is All You Need , author=. Advances in Neural Information Processing Systems (NeurIPS) , volume=
[6]

and Powley, Edward and Whitehouse, Daniel and Lucas, Simon M

Browne, Cameron B. and Powley, Edward and Whitehouse, Daniel and Lucas, Simon M. and Cowling, Peter I. and Rohlfshagen, Philipp and Tavener, Stephen and Perez, Diego and Samothrakis, Spyridon and Colton, Simon , journal=. A Survey of
[7]

Fawzi, Alhussein and Balog, Matej and Huang, Aja and Hubert, Thomas and Romera-Paredes, Bernardino and Barekatain, Mohammadamin and Novikov, Alexander and Ruiz, Francisco J. R. and Schrittwieser, Julian and Swirszcz, Grzegorz and Silver, David and Hassabis, Demis and Kohli, Pushmeet , title =. Nature, London , issn =. 2022 , language =. doi:10.1038/s41586...

work page doi:10.1038/s41586-022-05172-4 2022
[9]

2024 , eprint=

OpenTensor: Reproducing Faster Matrix Multiplication Discovering Algorithms , author=. 2024 , eprint=

2024
[10]

2018 , eprint=

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor , author=. 2018 , eprint=

2018
[11]

Algebraic complexity theory

B. Algebraic complexity theory. 1997 , publisher =

1997
[12]

Bandit Based Monte-Carlo Planning

Kocsis, Levente and Szepesv \'a ri, Csaba. Bandit Based Monte-Carlo Planning. Machine Learning: ECML 2006. 2006

2006
[13]

International Conference on Learning Representations , year=

Deep Learning For Symbolic Mathematics , author=. International Conference on Learning Representations , year=
[14]

2019 , eprint=

Soft Actor-Critic for Discrete Action Settings , author=. 2019 , eprint=

2019
[16]

2024 , eprint=

Completeness classes in algebraic complexity theory , author=. 2024 , eprint=

2024
[19]

Soft Actor-Critic (SAC) , year =
[20]

Hindsight Experience Replay , volume =

Andrychowicz, Marcin and Wolski, Filip and Ray, Alex and Schneider, Jonas and Fong, Rachel and Welinder, Peter and McGrew, Bob and Tobin, Josh and Pieter Abbeel, OpenAI and Zaremba, Wojciech , booktitle =. Hindsight Experience Replay , volume =
[21]

Policy Improvement by Planning with

Danihelka, Ivo and Guez, Arthur and Schrittwieser, Julian and Silver, David , booktitle =. Policy Improvement by Planning with. 2022 , url =

2022
[22]

and Precup, Doina and Singh, Satinder , journal =

Sutton, Richard S. and Precup, Doina and Singh, Satinder , journal =. Between. 1999 , doi =

1999
[23]

2017 , volume =

Vezhnevets, Alexander Sasha and Osindero, Simon and Schaul, Tom and Heess, Nicolas and Jaderberg, Max and Silver, David and Kavukcuoglu, Koray , booktitle =. 2017 , volume =

2017
[24]

1997 , isbn =

Lidl, Rudolf and Niederreiter, Harald , title =. 1997 , isbn =

1997
[25]

Macdonald, I. G. , title =. 1995 , isbn =

1995
[26]

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems , volume =

Hosangadi, Anup and Fallah, Farzan and Kastner, Ryan , title =. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems , volume =. 2006 , doi =

2006
[27]

and Narasimhan, Karthik and Saeedi, Ardavan and Tenenbaum, Joshua B

Kulkarni, Tejas D. and Narasimhan, Karthik and Saeedi, Ardavan and Tenenbaum, Joshua B. , title =. Advances in Neural Information Processing Systems , volume =
[28]

Proceedings of the 35th International Conference on Machine Learning , pages =

Florensa, Carlos and Held, David and Geng, Xinyang and Abbeel, Pieter , title =. Proceedings of the 35th International Conference on Machine Learning , pages =. 2018 , editor =

2018
[29]

Advances in Neural Information Processing Systems , volume =

Anthony, Thomas and Tian, Zheng and Barber, David , title =. Advances in Neural Information Processing Systems , volume =
[30]

and Harada, Daishi and Russell, Stuart J

Ng, Andrew Y. and Harada, Daishi and Russell, Stuart J. , title =. Proceedings of the Sixteenth International Conference on Machine Learning , pages =. 1999 , publisher =

1999
[31]

2026 , eprint=

CircuitBuilder: From Polynomials to Circuits via Reinforcement Learning , author=. 2026 , eprint=

2026
[32]

European Conference on Machine Learning , pages =

Bandit Based Monte-Carlo Planning , author =. European Conference on Machine Learning , pages =. 2006 , publisher =

2006
[33]

Science , volume =

A General Reinforcement Learning Algorithm that Masters Chess, Shogi, and Go through Self-Play , author =. Science , volume =
[35]

International Conference on Machine Learning , pages =

Policy Invariance under Reward Transformations: Theory and Application to Reward Shaping , author =. International Conference on Machine Learning , pages =
[37]

International Conference on Learning Representations , year=

LILO: Learning Interpretable Libraries by Compressing and Documenting Code , author=. International Conference on Learning Representations , year=
[38]

SAT Competition 2020 , journal =

Nils Froleyks and Marijn Heule and Ashlin Iser and Matti Järvisalo and Martin Suda , keywords =. SAT Competition 2020 , journal =. 2021 , issn =. doi:https://doi.org/10.1016/j.artint.2021.103572 , url =

work page doi:10.1016/j.artint.2021.103572 2020
[40]

Proceedings of the 34th International Conference on Neural Information Processing Systems , articleno =

Kurin, Vitaly and Godil, Saad and Whiteson, Shimon and Catanzaro, Bryan , title =. Proceedings of the 34th International Conference on Neural Information Processing Systems , articleno =. 2020 , isbn =

2020
[41]

Learning local search heuristics for boolean satisfiability , year =

Yolcu, Emre and P\'. Learning local search heuristics for boolean satisfiability , year =. Proceedings of the 33rd International Conference on Neural Information Processing Systems , articleno =
[42]

Moro, Lorenzo and Paris, Matteo G. A. and Restelli, Marcello and Prati, Enrico , title =. Communications Physics , volume =. 2021 , doi =

2021
[43]

2021 , eprint=

Quantum circuit optimization with deep reinforcement learning , author=. 2021 , eprint=

2021
[44]

Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

LeanDojo: Theorem Proving with Retrieval-Augmented Language Models , author=. Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=
[45]

The Thirteenth International Conference on Learning Representations , year=

DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search , author=. The Thirteenth International Conference on Learning Representations , year=
[46]

https://satcompetition.github.io/

SAT Competition . https://satcompetition.github.io/. Accessed 2026-06-19

2026
[47]

Hindsight experience replay

Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Pieter Abbeel, O., and Zaremba, W. Hindsight experience replay. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017

2017
[48]

Thinking fast and slow with deep learning and tree search

Anthony, T., Tian, Z., and Barber, D. Thinking fast and slow with deep learning and tree search. In Advances in Neural Information Processing Systems, volume 30, 2017

2017
[49]

B., Powley, E., Whitehouse, D., Lucas, S

Browne, C. B., Powley, E., Whitehouse, D., Lucas, S. M., Cowling, P. I., Rohlfshagen, P., Tavener, S., Perez, D., Samothrakis, S., and Colton, S. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4 0 (1): 0 1--43, 2012

2012
[50]

B \"u rgisser, P., Clausen, M., and Shokrollahi, M. A. Algebraic complexity theory. With the collaboration of Thomas Lickteig , volume 315 of Grundlehren Math. Wiss. Berlin: Springer, 1997. ISBN 3-540-60582-7

1997
[51]

Soft actor-critic for discrete action settings

Christodoulou, P. Soft actor-critic for discrete action settings. arXiv preprint arXiv:1910.07207, 2019

arXiv 1910
[52]

Policy improvement by planning with Gumbel

Danihelka, I., Guez, A., Schrittwieser, J., and Silver, D. Policy improvement by planning with Gumbel . In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=bERaNdoegnO

2022
[53]

B., Solar-Lezama, A., and Tenenbaum, J

Ellis, K., Wong, C., Nye, M., Sablé-Meyer, M., Cary, L., Morales, L., Hewitt, L. B., Solar-Lezama, A., and Tenenbaum, J. B. Dreamcoder: Growing generalizable, interpretable knowledge with wake-sleep bayesian program learning. arXiv preprint arXiv:2006.08381, 2020

arXiv 2006
[54]

Automatic goal generation for reinforcement learning agents

Florensa, C., Held, D., Geng, X., and Abbeel, P. Automatic goal generation for reinforcement learning agents. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp.\ 1515--1528. PMLR, 2018. URL https://proceedings.mlr.press/v80/florensa18a.html

2018
[55]

Y., Marquardt, F., and Li, L

Fösel, T., Niu, M. Y., Marquardt, F., and Li, L. Quantum circuit optimization with deep reinforcement learning, 2021. URL https://arxiv.org/abs/2103.07585

arXiv 2021
[56]

X., Liu, M., Tenenbaum, J

Grand, G., Wong, L., Bowers, M., Olausson, T. X., Liu, M., Tenenbaum, J. B., and Andreas, J. Lilo: Learning interpretable libraries by compressing and documenting code. In International Conference on Learning Representations, 2024

2024
[57]

Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, 2018

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, 2018. URL https://arxiv.org/abs/1801.01290

Pith/arXiv arXiv 2018
[58]

Horner, W. G. Xxi. a new method of solving numerical equations of all orders, by continuous approximation. Philosophical Transactions of the Royal Society of London, 109: 0 308--335, 12 1819. ISSN 0261-0523. doi:10.1098/rstl.1819.0023. URL https://doi.org/10.1098/rstl.1819.0023

work page doi:10.1098/rstl.1819.0023
[59]

Optimizing polynomial expressions by algebraic factorization and common subexpression elimination

Hosangadi, A., Fallah, F., and Kastner, R. Optimizing polynomial expressions by algebraic factorization and common subexpression elimination. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 25 0 (10): 0 2012--2022, 2006. doi:10.1109/TCAD.2006.875712

work page doi:10.1109/tcad.2006.875712 2012
[60]

Hubert, T., Mehta, R., Sartran, L., Horváth, M. Z., Žužić, G., Wieser, E., Huang, A., Schrittwieser, J., Schroecker, Y., Masoom, H., Bertolli, O., Zahavy, T., Mandhane, A., Yung, J., Beloshapka, I., Ibarz, B., Veeriah, V., Yu, L., Nash, O., Lezeau, P., Mercuri, S., Sönne, C., Mehta, B., Davies, A., Zheng, D., Pedregosa, F., Li, Y., von Glehn, I., Rowland,...

work page doi:10.1038/s41586-025-09833-y 2025
[61]

and Szepesv \'a ri, C

Kocsis, L. and Szepesv \'a ri, C. Bandit based monte-carlo planning. In European Conference on Machine Learning, pp.\ 282--293. Springer, 2006

2006
[62]

D., Narasimhan, K., Saeedi, A., and Tenenbaum, J

Kulkarni, T. D., Narasimhan, K., Saeedi, A., and Tenenbaum, J. B. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in Neural Information Processing Systems, volume 29, 2016

2016
[63]

Kurin, V., Godil, S., Whiteson, S., and Catanzaro, B. Can q-learning with graph networks learn a generalizable branching heuristic for a sat solver? In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS '20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546

2020
[64]

and Niederreiter, H

Lidl, R. and Niederreiter, H. Finite Fields, volume 20 of Encyclopedia of Mathematics and Its Applications. Cambridge University Press, Cambridge, 2 edition, 1997. ISBN 9780521392310

1997
[65]

Moro, L., Paris, M. G. A., Restelli, M., and Prati, E. Quantum compiling by deep reinforcement learning. Communications Physics, 4 0 (178), 2021. doi:10.1038/s42005-021-00684-3. URL https://doi.org/10.1038/s42005-021-00684-3

work page doi:10.1038/s42005-021-00684-3 2021
[66]

High-dimensional continuous control using generalized advantage estimation

Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. In International Conference on Learning Representations (ICLR), 2016

2016
[67]

Proximal policy optimization algorithms

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

Pith/arXiv arXiv 2017
[68]

and Yehudayoff, A

Shpilka, A. and Yehudayoff, A. Arithmetic circuits: A survey of recent results and open questions. Foundations and Trends in Theoretical Computer Science, 5 0 (3-4): 0 207--388, 12 2010. ISSN 1551-305X. doi:10.1561/0400000039. URL https://doi.org/10.1561/0400000039

work page doi:10.1561/0400000039 2010
[69]

A general reinforcement learning algorithm that masters chess, shogi, and go through self-play

Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K., and Hassabis, D. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362 0 (6419): 0 1140--1144, 2018

2018
[70]

The complexity of computing the permanent

Valiant, L. The complexity of computing the permanent. Theoretical Computer Science, 8 0 (2): 0 189--201, 1979 a . ISSN 0304-3975. doi:10.1016/0304-3975(79)90044-6

work page doi:10.1016/0304-3975(79)90044-6 1979
[71]

Valiant, L. G. Completeness classes in algebra. In Proceedings of the Eleventh Annual ACM Symposium on Theory of Computing, STOC '79, pp.\ 249–261, New York, NY, USA, 1979 b . Association for Computing Machinery. ISBN 9781450374385. doi:10.1145/800135.804419. URL https://doi.org/10.1145/800135.804419

work page doi:10.1145/800135.804419 1979
[72]

Deepseek-prover-v1.5: Harnessing proof assistant feedback for reinforcement learning and monte-carlo tree search

Xin, H., Ren, Z., Song, J., Shao, Z., Zhao, W., Wang, H., Liu, B., Zhang, L., Lu, X., Du, Q., Gao, W., Zhang, H., Zhu, Q., Yang, D., Gou, Z., Wu, Z., Luo, F., and Ruan, C. Deepseek-prover-v1.5: Harnessing proof assistant feedback for reinforcement learning and monte-carlo tree search. In The Thirteenth International Conference on Learning Representations,...

2025
[73]

M., Gu, A., Chalamala, R., Song, P., Yu, S., Godil, S., Prenger, R., and Anandkumar, A

Yang, K., Swope, A. M., Gu, A., Chalamala, R., Song, P., Yu, S., Godil, S., Prenger, R., and Anandkumar, A. Leandojo: Theorem proving with retrieval-augmented language models. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=g7OX2sOJtn

2023
[74]

and P\' o czos, B

Yolcu, E. and P\' o czos, B. Learning local search heuristics for boolean satisfiability. Curran Associates Inc., Red Hook, NY, USA, 2019

2019
[75]

K., Pandey, R., Mehta, B., Jin, K., Morato, N., Ganapule, A., Zeng, M

Zhang, W. K., Pandey, R., Mehta, B., Jin, K., Morato, N., Ganapule, A., Zeng, M. R., and Alper, J. Circuitbuilder: From polynomials to circuits via reinforcement learning, 2026. URL https://arxiv.org/abs/2603.17075

arXiv 2026

[1] [1]

Dennis J

A general reinforcement learning algorithm that masters chess, shogi, and. Science , author =. 2018 , pages =. doi:10.1126/science.aar6404 , language =

work page doi:10.1126/science.aar6404 2018

[2] [3]

International Conference on Learning Representations (ICLR) , year=

High-Dimensional Continuous Control Using Generalized Advantage Estimation , author=. International Conference on Learning Representations (ICLR) , year=

[3] [4]

International Conference on Learning Representations (ICLR) , year=

Semi-Supervised Classification with Graph Convolutional Networks , author=. International Conference on Learning Representations (ICLR) , year=

[4] [5]

Advances in Neural Information Processing Systems (NeurIPS) , volume=

Attention Is All You Need , author=. Advances in Neural Information Processing Systems (NeurIPS) , volume=

[5] [6]

and Powley, Edward and Whitehouse, Daniel and Lucas, Simon M

Browne, Cameron B. and Powley, Edward and Whitehouse, Daniel and Lucas, Simon M. and Cowling, Peter I. and Rohlfshagen, Philipp and Tavener, Stephen and Perez, Diego and Samothrakis, Spyridon and Colton, Simon , journal=. A Survey of

[6] [7]

Fawzi, Alhussein and Balog, Matej and Huang, Aja and Hubert, Thomas and Romera-Paredes, Bernardino and Barekatain, Mohammadamin and Novikov, Alexander and Ruiz, Francisco J. R. and Schrittwieser, Julian and Swirszcz, Grzegorz and Silver, David and Hassabis, Demis and Kohli, Pushmeet , title =. Nature, London , issn =. 2022 , language =. doi:10.1038/s41586...

work page doi:10.1038/s41586-022-05172-4 2022

[7] [9]

2024 , eprint=

OpenTensor: Reproducing Faster Matrix Multiplication Discovering Algorithms , author=. 2024 , eprint=

2024

[8] [10]

2018 , eprint=

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor , author=. 2018 , eprint=

2018

[9] [11]

Algebraic complexity theory

B. Algebraic complexity theory. 1997 , publisher =

1997

[10] [12]

Bandit Based Monte-Carlo Planning

Kocsis, Levente and Szepesv \'a ri, Csaba. Bandit Based Monte-Carlo Planning. Machine Learning: ECML 2006. 2006

2006

[11] [13]

International Conference on Learning Representations , year=

Deep Learning For Symbolic Mathematics , author=. International Conference on Learning Representations , year=

[12] [14]

2019 , eprint=

Soft Actor-Critic for Discrete Action Settings , author=. 2019 , eprint=

2019

[13] [16]

2024 , eprint=

Completeness classes in algebraic complexity theory , author=. 2024 , eprint=

2024

[14] [19]

Soft Actor-Critic (SAC) , year =

[15] [20]

Hindsight Experience Replay , volume =

Andrychowicz, Marcin and Wolski, Filip and Ray, Alex and Schneider, Jonas and Fong, Rachel and Welinder, Peter and McGrew, Bob and Tobin, Josh and Pieter Abbeel, OpenAI and Zaremba, Wojciech , booktitle =. Hindsight Experience Replay , volume =

[16] [21]

Policy Improvement by Planning with

Danihelka, Ivo and Guez, Arthur and Schrittwieser, Julian and Silver, David , booktitle =. Policy Improvement by Planning with. 2022 , url =

2022

[17] [22]

and Precup, Doina and Singh, Satinder , journal =

Sutton, Richard S. and Precup, Doina and Singh, Satinder , journal =. Between. 1999 , doi =

1999

[18] [23]

2017 , volume =

Vezhnevets, Alexander Sasha and Osindero, Simon and Schaul, Tom and Heess, Nicolas and Jaderberg, Max and Silver, David and Kavukcuoglu, Koray , booktitle =. 2017 , volume =

2017

[19] [24]

1997 , isbn =

Lidl, Rudolf and Niederreiter, Harald , title =. 1997 , isbn =

1997

[20] [25]

Macdonald, I. G. , title =. 1995 , isbn =

1995

[21] [26]

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems , volume =

Hosangadi, Anup and Fallah, Farzan and Kastner, Ryan , title =. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems , volume =. 2006 , doi =

2006

[22] [27]

and Narasimhan, Karthik and Saeedi, Ardavan and Tenenbaum, Joshua B

Kulkarni, Tejas D. and Narasimhan, Karthik and Saeedi, Ardavan and Tenenbaum, Joshua B. , title =. Advances in Neural Information Processing Systems , volume =

[23] [28]

Proceedings of the 35th International Conference on Machine Learning , pages =

Florensa, Carlos and Held, David and Geng, Xinyang and Abbeel, Pieter , title =. Proceedings of the 35th International Conference on Machine Learning , pages =. 2018 , editor =

2018

[24] [29]

Advances in Neural Information Processing Systems , volume =

Anthony, Thomas and Tian, Zheng and Barber, David , title =. Advances in Neural Information Processing Systems , volume =

[25] [30]

and Harada, Daishi and Russell, Stuart J

Ng, Andrew Y. and Harada, Daishi and Russell, Stuart J. , title =. Proceedings of the Sixteenth International Conference on Machine Learning , pages =. 1999 , publisher =

1999

[26] [31]

2026 , eprint=

CircuitBuilder: From Polynomials to Circuits via Reinforcement Learning , author=. 2026 , eprint=

2026

[27] [32]

European Conference on Machine Learning , pages =

Bandit Based Monte-Carlo Planning , author =. European Conference on Machine Learning , pages =. 2006 , publisher =

2006

[28] [33]

Science , volume =

A General Reinforcement Learning Algorithm that Masters Chess, Shogi, and Go through Self-Play , author =. Science , volume =

[29] [35]

International Conference on Machine Learning , pages =

Policy Invariance under Reward Transformations: Theory and Application to Reward Shaping , author =. International Conference on Machine Learning , pages =

[30] [37]

International Conference on Learning Representations , year=

LILO: Learning Interpretable Libraries by Compressing and Documenting Code , author=. International Conference on Learning Representations , year=

[31] [38]

SAT Competition 2020 , journal =

Nils Froleyks and Marijn Heule and Ashlin Iser and Matti Järvisalo and Martin Suda , keywords =. SAT Competition 2020 , journal =. 2021 , issn =. doi:https://doi.org/10.1016/j.artint.2021.103572 , url =

work page doi:10.1016/j.artint.2021.103572 2020

[32] [40]

Proceedings of the 34th International Conference on Neural Information Processing Systems , articleno =

Kurin, Vitaly and Godil, Saad and Whiteson, Shimon and Catanzaro, Bryan , title =. Proceedings of the 34th International Conference on Neural Information Processing Systems , articleno =. 2020 , isbn =

2020

[33] [41]

Learning local search heuristics for boolean satisfiability , year =

Yolcu, Emre and P\'. Learning local search heuristics for boolean satisfiability , year =. Proceedings of the 33rd International Conference on Neural Information Processing Systems , articleno =

[34] [42]

Moro, Lorenzo and Paris, Matteo G. A. and Restelli, Marcello and Prati, Enrico , title =. Communications Physics , volume =. 2021 , doi =

2021

[35] [43]

2021 , eprint=

Quantum circuit optimization with deep reinforcement learning , author=. 2021 , eprint=

2021

[36] [44]

Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

LeanDojo: Theorem Proving with Retrieval-Augmented Language Models , author=. Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

[37] [45]

The Thirteenth International Conference on Learning Representations , year=

DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search , author=. The Thirteenth International Conference on Learning Representations , year=

[38] [46]

https://satcompetition.github.io/

SAT Competition . https://satcompetition.github.io/. Accessed 2026-06-19

2026

[39] [47]

Hindsight experience replay

Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Pieter Abbeel, O., and Zaremba, W. Hindsight experience replay. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017

2017

[40] [48]

Thinking fast and slow with deep learning and tree search

Anthony, T., Tian, Z., and Barber, D. Thinking fast and slow with deep learning and tree search. In Advances in Neural Information Processing Systems, volume 30, 2017

2017

[41] [49]

B., Powley, E., Whitehouse, D., Lucas, S

Browne, C. B., Powley, E., Whitehouse, D., Lucas, S. M., Cowling, P. I., Rohlfshagen, P., Tavener, S., Perez, D., Samothrakis, S., and Colton, S. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4 0 (1): 0 1--43, 2012

2012

[42] [50]

B \"u rgisser, P., Clausen, M., and Shokrollahi, M. A. Algebraic complexity theory. With the collaboration of Thomas Lickteig , volume 315 of Grundlehren Math. Wiss. Berlin: Springer, 1997. ISBN 3-540-60582-7

1997

[43] [51]

Soft actor-critic for discrete action settings

Christodoulou, P. Soft actor-critic for discrete action settings. arXiv preprint arXiv:1910.07207, 2019

arXiv 1910

[44] [52]

Policy improvement by planning with Gumbel

Danihelka, I., Guez, A., Schrittwieser, J., and Silver, D. Policy improvement by planning with Gumbel . In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=bERaNdoegnO

2022

[45] [53]

B., Solar-Lezama, A., and Tenenbaum, J

Ellis, K., Wong, C., Nye, M., Sablé-Meyer, M., Cary, L., Morales, L., Hewitt, L. B., Solar-Lezama, A., and Tenenbaum, J. B. Dreamcoder: Growing generalizable, interpretable knowledge with wake-sleep bayesian program learning. arXiv preprint arXiv:2006.08381, 2020

arXiv 2006

[46] [54]

Automatic goal generation for reinforcement learning agents

Florensa, C., Held, D., Geng, X., and Abbeel, P. Automatic goal generation for reinforcement learning agents. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp.\ 1515--1528. PMLR, 2018. URL https://proceedings.mlr.press/v80/florensa18a.html

2018

[47] [55]

Y., Marquardt, F., and Li, L

Fösel, T., Niu, M. Y., Marquardt, F., and Li, L. Quantum circuit optimization with deep reinforcement learning, 2021. URL https://arxiv.org/abs/2103.07585

arXiv 2021

[48] [56]

X., Liu, M., Tenenbaum, J

Grand, G., Wong, L., Bowers, M., Olausson, T. X., Liu, M., Tenenbaum, J. B., and Andreas, J. Lilo: Learning interpretable libraries by compressing and documenting code. In International Conference on Learning Representations, 2024

2024

[49] [57]

Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, 2018

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, 2018. URL https://arxiv.org/abs/1801.01290

Pith/arXiv arXiv 2018

[50] [58]

Horner, W. G. Xxi. a new method of solving numerical equations of all orders, by continuous approximation. Philosophical Transactions of the Royal Society of London, 109: 0 308--335, 12 1819. ISSN 0261-0523. doi:10.1098/rstl.1819.0023. URL https://doi.org/10.1098/rstl.1819.0023

work page doi:10.1098/rstl.1819.0023

[51] [59]

Optimizing polynomial expressions by algebraic factorization and common subexpression elimination

Hosangadi, A., Fallah, F., and Kastner, R. Optimizing polynomial expressions by algebraic factorization and common subexpression elimination. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 25 0 (10): 0 2012--2022, 2006. doi:10.1109/TCAD.2006.875712

work page doi:10.1109/tcad.2006.875712 2012

[52] [60]

Hubert, T., Mehta, R., Sartran, L., Horváth, M. Z., Žužić, G., Wieser, E., Huang, A., Schrittwieser, J., Schroecker, Y., Masoom, H., Bertolli, O., Zahavy, T., Mandhane, A., Yung, J., Beloshapka, I., Ibarz, B., Veeriah, V., Yu, L., Nash, O., Lezeau, P., Mercuri, S., Sönne, C., Mehta, B., Davies, A., Zheng, D., Pedregosa, F., Li, Y., von Glehn, I., Rowland,...

work page doi:10.1038/s41586-025-09833-y 2025

[53] [61]

and Szepesv \'a ri, C

Kocsis, L. and Szepesv \'a ri, C. Bandit based monte-carlo planning. In European Conference on Machine Learning, pp.\ 282--293. Springer, 2006

2006

[54] [62]

D., Narasimhan, K., Saeedi, A., and Tenenbaum, J

Kulkarni, T. D., Narasimhan, K., Saeedi, A., and Tenenbaum, J. B. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in Neural Information Processing Systems, volume 29, 2016

2016

[55] [63]

Kurin, V., Godil, S., Whiteson, S., and Catanzaro, B. Can q-learning with graph networks learn a generalizable branching heuristic for a sat solver? In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS '20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546

2020

[56] [64]

and Niederreiter, H

Lidl, R. and Niederreiter, H. Finite Fields, volume 20 of Encyclopedia of Mathematics and Its Applications. Cambridge University Press, Cambridge, 2 edition, 1997. ISBN 9780521392310

1997

[57] [65]

Moro, L., Paris, M. G. A., Restelli, M., and Prati, E. Quantum compiling by deep reinforcement learning. Communications Physics, 4 0 (178), 2021. doi:10.1038/s42005-021-00684-3. URL https://doi.org/10.1038/s42005-021-00684-3

work page doi:10.1038/s42005-021-00684-3 2021

[58] [66]

High-dimensional continuous control using generalized advantage estimation

Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. In International Conference on Learning Representations (ICLR), 2016

2016

[59] [67]

Proximal policy optimization algorithms

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

Pith/arXiv arXiv 2017

[60] [68]

and Yehudayoff, A

Shpilka, A. and Yehudayoff, A. Arithmetic circuits: A survey of recent results and open questions. Foundations and Trends in Theoretical Computer Science, 5 0 (3-4): 0 207--388, 12 2010. ISSN 1551-305X. doi:10.1561/0400000039. URL https://doi.org/10.1561/0400000039

work page doi:10.1561/0400000039 2010

[61] [69]

A general reinforcement learning algorithm that masters chess, shogi, and go through self-play

Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K., and Hassabis, D. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362 0 (6419): 0 1140--1144, 2018

2018

[62] [70]

The complexity of computing the permanent

Valiant, L. The complexity of computing the permanent. Theoretical Computer Science, 8 0 (2): 0 189--201, 1979 a . ISSN 0304-3975. doi:10.1016/0304-3975(79)90044-6

work page doi:10.1016/0304-3975(79)90044-6 1979

[63] [71]

Valiant, L. G. Completeness classes in algebra. In Proceedings of the Eleventh Annual ACM Symposium on Theory of Computing, STOC '79, pp.\ 249–261, New York, NY, USA, 1979 b . Association for Computing Machinery. ISBN 9781450374385. doi:10.1145/800135.804419. URL https://doi.org/10.1145/800135.804419

work page doi:10.1145/800135.804419 1979

[64] [72]

Deepseek-prover-v1.5: Harnessing proof assistant feedback for reinforcement learning and monte-carlo tree search

Xin, H., Ren, Z., Song, J., Shao, Z., Zhao, W., Wang, H., Liu, B., Zhang, L., Lu, X., Du, Q., Gao, W., Zhang, H., Zhu, Q., Yang, D., Gou, Z., Wu, Z., Luo, F., and Ruan, C. Deepseek-prover-v1.5: Harnessing proof assistant feedback for reinforcement learning and monte-carlo tree search. In The Thirteenth International Conference on Learning Representations,...

2025

[65] [73]

M., Gu, A., Chalamala, R., Song, P., Yu, S., Godil, S., Prenger, R., and Anandkumar, A

Yang, K., Swope, A. M., Gu, A., Chalamala, R., Song, P., Yu, S., Godil, S., Prenger, R., and Anandkumar, A. Leandojo: Theorem proving with retrieval-augmented language models. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=g7OX2sOJtn

2023

[66] [74]

and P\' o czos, B

Yolcu, E. and P\' o czos, B. Learning local search heuristics for boolean satisfiability. Curran Associates Inc., Red Hook, NY, USA, 2019

2019

[67] [75]

K., Pandey, R., Mehta, B., Jin, K., Morato, N., Ganapule, A., Zeng, M

Zhang, W. K., Pandey, R., Mehta, B., Jin, K., Morato, N., Ganapule, A., Zeng, M. R., and Alper, J. Circuitbuilder: From polynomials to circuits via reinforcement learning, 2026. URL https://arxiv.org/abs/2603.17075

arXiv 2026