Reinforced Graph of Thoughts: RL-Driven Adaptive Prompting for LLMs

Manuel Noah Riesen; Peter Alfred von Niederhäusern

arxiv: 2605.22195 · v1 · pith:FGPIMARWnew · submitted 2026-05-21 · 💻 cs.LG


Pith reviewed 2026-05-22 07:25 UTC · model grok-4.3

classification 💻 cs.LG

keywords graph of thoughts · reinforcement learning · adaptive prompting · large language models · prompt engineering · automated reasoning · RL for LLMs


The pith

Reinforcement learning can automatically generate task-specific graphs of operations for large language model prompting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Reinforced Graph of Thoughts, which applies reinforcement learning to build graphs of operations adaptively from a fixed human-defined set instead of requiring manual construction for each problem. This targets the limitation that static graphs lack flexibility and demand expert insight into solution structures. A sympathetic reader would care because the method promises to make structured prompting more practical for problems whose complexity varies, reducing the need for repeated human design of the graph layout.

Core claim

Reinforced Graph of Thoughts (RGoT) is an automated approach to the Graph of Thoughts prompting paradigm that leverages reinforcement learning to adaptively generate a graph of operations from a human-defined set. Results indicate that, under certain constraints, it is possible to construct graphs of operations adaptively to the task's complexity in an automated way.

What carries the argument

Reinforced Graph of Thoughts (RGoT), an RL agent that selects and connects operations from the human-defined set to form a graph whose structure varies with the input task.
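
A minimal sketch of this selection loop, under loud assumptions: the operation names (`split`, `sum`, `merge`), their arities, the frontier-counting state, and the random stand-in policy are hypothetical illustrations, not the paper's environment or agent. It only shows how repeatedly choosing from a fixed operation set grows a graph whose shape depends on the choices made.

```python
import random
from dataclasses import dataclass, field

# Hypothetical operation set: name -> (n_inputs, n_outputs).
OPERATIONS = {"split": (1, 2), "sum": (1, 1), "merge": (2, 1)}


@dataclass
class GraphBuilder:
    """Grows a graph of operations one action at a time."""
    frontier: int = 1                          # open (not yet consumed) thoughts
    nodes: list = field(default_factory=list)  # operations chosen so far

    def valid_actions(self):
        # An operation is applicable only if enough open thoughts exist.
        return [name for name, (n_in, _) in OPERATIONS.items()
                if n_in <= self.frontier]

    def step(self, name):
        n_in, n_out = OPERATIONS[name]
        if n_in > self.frontier:
            raise ValueError(f"{name} needs {n_in} inputs, {self.frontier} open")
        self.nodes.append(name)
        self.frontier += n_out - n_in          # consume inputs, produce outputs
        return self.frontier


def build(policy, max_steps=6):
    """Roll out a policy: a callable that picks one of the valid actions."""
    builder = GraphBuilder()
    for _ in range(max_steps):
        builder.step(policy(builder.valid_actions()))
    return builder


# Random stand-in for a learned policy (assumption, not the paper's agent).
rng = random.Random(0)
graph = build(lambda actions: rng.choice(actions))
print(graph.nodes, "open thoughts:", graph.frontier)
```

A learned policy would replace the random choice; the masking in `valid_actions` is one possible way to keep the agent from selecting operations whose inputs do not exist yet.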

If this is right

Graphs of operations no longer require manual redesign for each new problem instance.
The generated graph structure can change automatically according to detected task complexity.
Elaborate LLM problem solving becomes feasible without in-depth knowledge of solution paths for every task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same RL selection process could be tested on other structured prompting formats such as trees or chains to check transferability.
Experiments that systematically vary the size and coverage of the operation set would clarify the boundary between workable and intractable constraint regimes.

Load-bearing premise

The human-defined set of operations must contain good solutions for the target tasks yet remain small enough for reinforcement learning to explore the space of possible graphs effectively.
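
The tension in this premise can be made concrete with a rough count. The sketch below assumes, purely for illustration, that a graph is built as a sequence of operation choices and ignores arity constraints, so it gives an upper bound on sequences rather than a count of distinct graphs. The set sizes and step limits are made-up inputs, not values from the paper.

```python
def sequence_count(n_operations: int, max_steps: int) -> int:
    """Upper bound: number of operation sequences of length 1..max_steps."""
    return sum(n_operations ** t for t in range(1, max_steps + 1))


# Illustrative set sizes and step limits (assumptions, not from the paper).
for k in (3, 5, 10):
    print(k, [sequence_count(k, t) for t in (5, 10, 20)])
```

Even under this crude model the count grows geometrically in the step limit, which is why a small operation set matters for exploration.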

What would settle it

On a high-complexity task solvable by a manually designed graph, measure whether the RL agent fails to discover any viable graph or produces solutions whose accuracy falls below that of the static baseline.
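
One way to run such a check is to compare solved rates per complexity level. The sketch below is an assumption about how the comparison could be organized; the function names are hypothetical and the sample records are placeholder data invented for the example, not results from the paper.

```python
from collections import defaultdict


def solved_rate(results):
    """results: iterable of (complexity, solved) pairs -> {complexity: rate}."""
    totals, wins = defaultdict(int), defaultdict(int)
    for complexity, solved in results:
        totals[complexity] += 1
        wins[complexity] += int(solved)
    return {c: wins[c] / totals[c] for c in sorted(totals)}


def falls_below(agent, baseline):
    """Complexities at which the agent's solved rate is below the baseline's."""
    a, b = solved_rate(agent), solved_rate(baseline)
    return [c for c in sorted(a.keys() & b.keys()) if a[c] < b[c]]


# Placeholder records, invented for illustration only.
agent_runs = [(8, True), (8, False), (16, False), (16, False)]
static_runs = [(8, True), (8, False), (16, True), (16, False)]
print(falls_below(agent_runs, static_runs))
```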

Figures

Figures reproduced from arXiv: 2605.22195 by Manuel Noah Riesen, Peter Alfred von Niederhäusern.

**Figure 2.** Example GoO, GoT, and their relationship for the task

**Figure 3.** Operation success probability P̂(c) as a function of complexity c for the operations of the tasks sum list and count keywords, evaluated with gpt-3.5-turbo-0125. Section 4.2 (Reinforced Graph of Thoughts Agents): We employ simple PPO agents with 2 layers of 64 units (for both actor and critic network)

**Figure 4.** Solved rate per complexity for all five tasks. RGoT agents (teal) are compared against the IO

**Figure 5.** Project Structure

**Figure 6.** Pure Graph of Thoughts Types. The related code can be found in the following repository: https://github.com/mriesen/pure-graph-of-thoughts

**Figure 7.** Operations of task sum list with input/output arities and prompt descriptions. Operation sum: For a given list of single-digit integers, the sum is to be calculated (Listing 8). The scoring is based on whether the sum is correct. op_sum = PromptOperation(name='sum', n_outputs=1, n_inputs=1, type=OperationType.GENERATE, output_complexity=absolute_complexity(1), prompt=Prompt(instructio…

**Figure 8.** Definition of Operation sum. Operation split: A list is to be split into two equally sized sublists. The scoring is based on whether the output contains two similarly sized lists, with some lenience (sublist ratio ≥ 0.5). Operation merge: Two given lists are to be merged into a single list. The result is scored based on whether the sum of the resulting list equals the sum of the input sublists and on the corr…

**Figure 9.** Operations of task sort list with input/output arities and prompt descriptions. Operation sort: A given list is to be sorted in ascending order. The number of sort errors is calculated and the (binary) scoring strictly allows no errors. Operation split: Identical to the Operation split of Task sum list. Operation merge: Identical to the Operation merge of Task sum list. Operation branch: A programmatic opera…

**Figure 10.** Operations of task intersect set with input/output arities and prompt descriptions. Operation intersect: The intersection of a set is to be created. The scoring is based on whether the intersection of the sets is correct. Operation split set: Given two sets as input (a single state), two states are created: The first set is split in half, the subsets are fed as first set into the respective output states, i…

**Figure 11.** Operations of task count keywords with input/output arities and prompt descriptions. Operation count: Given a text, the number of specific words (here: countries) is to be counted. The scoring is based on whether the words are counted correctly. Operation split: Given a text, it is to be split into two substrings of similar size. Here, the scoring is dependent on whether the text is split correctly and in t…

**Figure 12.** Operations of task merge docs with input/output arities and prompt descriptions. Operation merge: Several documents are to be merged into a single one, while minimizing redundancy and maximizing information retention. The scoring is based on the ROUGE-L (retention) and ROUGE-1 (redundancy) score and the resulting F1 score (Lin, 2004). To decide whether a result is valid, we used the threshold 0.6, which was…

**Figure 13.** Operation success probability P̂(c) as a function of complexity c for each task, evaluated with gpt-3.5-turbo-0125. Section G.1 (Initial LLM Evaluation): The operation sum of the task sum list was initially evaluated on different language models. The variants of gpt-3.5-turbo are quite on-par in solving the specified task. The model gpt-4-turbo-2024-04-09 maintains a higher success rate before degrading similarly (…

**Figure 14.** LLM Evaluation Results for Operation sum of Task sum list (with n = 100)

**Figure 15.** LLM Evaluation Results for Operation sum of Task sum list (with n = 100)

**Figure 16.** Number of operations per complexity for all five tasks. The vertical line marks the training front,
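
The scoring rules quoted in the Figure 7 and Figure 8 text for the sum list task can be sketched as follows. This is a hedged reading, not the repository's code: the function names are hypothetical, the "sublist ratio" is assumed to mean smaller length divided by larger length, and the merge score implements only the sum-preservation criterion because the second criterion is truncated in the caption.

```python
def score_sum(numbers, answer):
    """Binary score: 1 if the reported sum is correct, else 0."""
    return int(sum(numbers) == answer)


def score_split(left, right, min_ratio=0.5):
    """1 if the sublists are similarly sized (smaller/larger >= min_ratio)."""
    if not left or not right:
        return 0
    ratio = min(len(left), len(right)) / max(len(left), len(right))
    return int(ratio >= min_ratio)


def score_merge(left, right, merged):
    """1 if the merged list preserves the sum of its two input sublists."""
    return int(sum(merged) == sum(left) + sum(right))


print(score_split([1, 2], [3]), score_split([1, 2, 3], [4]))
```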

Original abstract

Graph of Thoughts (GoT), a generalized form of recent prompting paradigms for large language models (LLMs), has been shown to be useful for elaborate problem solving. By executing a graph of operations, thoughts of the LLM are structured as an arbitrary graph, forming the actual graph of thoughts. Originally, the graph of operations is defined manually, which requires in-depth knowledge about the solution of the problem to solve. Such a static graph of operations is rigid and therefore lacks adaptability. We propose Reinforced Graph of Thoughts (RGoT), an automated approach to the GoT prompting paradigm that leverages reinforcement learning (RL) to adaptively generate a graph of operations from a human-defined set. Results indicate that, under certain constraints, it is possible to construct graphs of operations adaptively to the task's complexity in an automated way.
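
For contrast with the adaptive approach, a hand-written static graph of operations for a list-summing task might look like the sketch below. Everything here is an illustrative assumption: the function names, the fixed split, sum, merge, sum layout, and the fact that `op_sum` computes the exact sum instead of calling an LLM. Executing the fixed graph of operations yields the intermediate thoughts, which together form the graph of thoughts.

```python
def op_split(thought):
    """Split one list into two halves."""
    mid = len(thought) // 2
    return thought[:mid], thought[mid:]


def op_sum(thought):
    """Stand-in for an LLM call; here the sum is computed exactly."""
    return [sum(thought)]


def op_merge(a, b):
    """Merge two lists into one."""
    return a + b


def run_static_graph(numbers):
    """Fixed layout: split -> sum each half -> merge -> sum."""
    thoughts = [numbers]                  # thoughts in creation order
    left, right = op_split(numbers)
    thoughts += [left, right]
    s_left, s_right = op_sum(left), op_sum(right)
    thoughts += [s_left, s_right]
    merged = op_merge(s_left, s_right)
    thoughts.append(merged)
    total = op_sum(merged)
    thoughts.append(total)
    return total[0], thoughts


result, thoughts = run_static_graph([1, 2, 3, 4])
print(result, len(thoughts))
```

The layout never changes with the input: a two-element list and a two-hundred-element list both get exactly one split, which is the rigidity the abstract describes.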

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RGoT tries to automate GoT graph construction with RL from a fixed operation set, but the abstract gives no numbers or setup details to judge if it works.


The one thing to know is that this paper proposes using reinforcement learning to automatically build the graph of operations for Graph of Thoughts prompting from a set of human-defined primitives. They take the existing GoT framework, which structures LLM thoughts as a graph, and replace the manual graph definition with an RL policy that composes operations adaptively. This targets the rigidity problem where static graphs don't adjust to task complexity.

What the paper does reasonably well is identify a clear limitation in current prompting methods and suggest a direction to address it. Automating the graph construction could make these techniques more accessible without requiring experts to design the structure each time.

The soft spots are more substantial. The abstract claims positive results under certain constraints but offers no data, no baselines like standard GoT or other adaptive methods, and no outline of the RL setup including states, actions, or rewards. This makes it difficult to evaluate if the approach delivers meaningful gains or just works in narrow cases.

The stress-test concern about the operation set holds up here. For RL to succeed, the set must be rich enough to include good solutions yet small enough to avoid combinatorial explosion. Without any characterization of the sets used or experiments varying their size, we can't tell if this balance is achieved or if the method is limited to specific tasks where it happens to work.

The thinking in the paper appears straightforward and engages honestly with the motivation from prior GoT work. There are no obvious contradictions or overclaims in the provided text.

This paper is mainly for researchers in LLM prompting, chain-of-thought variants, and automated optimization of reasoning structures. A reader working on practical LLM applications might find the idea useful if the experiments in the full paper are solid. I would recommend sending it for peer review. The idea has enough substance to warrant feedback on the implementation and results, though it will likely need more empirical support to be convincing.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Reinforced Graph of Thoughts (RGoT), which applies reinforcement learning to automatically construct adaptive graphs of operations for LLM prompting, starting from a human-defined set of operations. It claims that under certain constraints this yields graphs tailored to task complexity, overcoming the rigidity of manually specified static graphs in the Graph of Thoughts paradigm.

Significance. If the empirical demonstration holds with proper validation and baselines, the integration of RL for dynamic operation-graph construction could meaningfully reduce the expert knowledge required for effective LLM reasoning on complex tasks and extend structured prompting techniques. The approach directly addresses adaptability limitations in prior GoT work.

major comments (2)

[Abstract] The statement 'Results indicate that, under certain constraints, it is possible to construct graphs of operations adaptively to the task's complexity in an automated way' supplies no quantitative metrics, error bars, baseline comparisons (e.g., to static GoT or other RL-free methods), or description of the RL setup (policy, reward, state representation). This leaves the central empirical claim without visible support.
[Method] The human-defined operation set is asserted to enable effective RL exploration, yet no characterization of its cardinality, diversity, or task coverage appears; without this, it is impossible to verify that the set is simultaneously rich enough to contain high-quality solutions and compact enough to avoid prohibitive sample complexity, which is load-bearing for the 'certain constraints' qualifier.

minor comments (2)

[Method] Clarify the precise RL algorithm (e.g., PPO, REINFORCE) and how the graph-construction action space is represented to avoid ambiguity in reproducibility.
[Experiments] Add a table summarizing the evaluated tasks, operation-set sizes, and key performance deltas versus baselines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating where revisions will be incorporated to strengthen the presentation of our results and method.


Referee: [Abstract] The statement 'Results indicate that, under certain constraints, it is possible to construct graphs of operations adaptively to the task's complexity in an automated way' supplies no quantitative metrics, error bars, baseline comparisons (e.g., to static GoT or other RL-free methods), or description of the RL setup (policy, reward, state representation). This leaves the central empirical claim without visible support.

Authors: We agree that the abstract is currently high-level and does not include quantitative metrics, error bars, or explicit details on the RL setup. The full manuscript reports experimental results with baseline comparisons (including static GoT) and describes the RL components (policy gradient optimization with a reward based on task accuracy and graph efficiency) in the Experiments and Method sections. To better support the central claim, we will revise the abstract to include a concise reference to key performance gains and the RL formulation while maintaining length limits.
Revision: yes

Referee: [Method] The human-defined operation set is asserted to enable effective RL exploration, yet no characterization of its cardinality, diversity, or task coverage appears; without this, it is impossible to verify that the set is simultaneously rich enough to contain high-quality solutions and compact enough to avoid prohibitive sample complexity, which is load-bearing for the 'certain constraints' qualifier.

Authors: The referee correctly notes that the manuscript does not characterize the human-defined operation set in terms of cardinality, diversity, or task coverage. We will add this information in the revised Method section, including the number of operations, their coverage of reasoning primitives, and how the set size balances expressiveness against sample complexity to support the 'certain constraints' under which adaptive graph construction is feasible.
Revision: yes
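
The simulated response above mentions a reward based on task accuracy and graph efficiency without giving its form. One simple shape such a reward could take is sketched below. The function name, the additive form, and the `op_cost` value are all assumptions made for illustration; the text on this page does not specify the reward actually used.

```python
def episode_reward(solved: bool, n_operations: int, op_cost: float = 0.05) -> float:
    """Hypothetical reward: +1 for a solved task, minus a cost per operation."""
    return (1.0 if solved else 0.0) - op_cost * n_operations


# A short correct graph beats a long correct one; an unsolved task scores lowest.
print(episode_reward(True, 4), episode_reward(True, 12), episode_reward(False, 4))
```

Under this form the per-operation cost is the knob that trades accuracy against graph size: set too high, the agent is pushed toward graphs too small to solve harder inputs.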

Circularity Check

0 steps flagged

No circularity: empirical RL method proposal with no derivations or self-referential reductions


The paper introduces RGoT as an RL-based automation of the GoT prompting paradigm, where a human-defined operation set is an explicit external input and the central claim is an empirical observation that adaptive graph construction is possible under certain constraints. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described content. The method is presented as a practical extension relying on standard RL exploration rather than any closed-form reduction to its own inputs, rendering the derivation chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The approach assumes a human-provided operation set is both sufficient and tractable for RL search; no free parameters, axioms, or invented entities are described in the abstract.

pith-pipeline@v0.9.0 · 5670 in / 1052 out tokens · 34509 ms · 2026-05-22T07:25:23.272466+00:00 · methodology


Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 2 internal anchors

[23] Janice Ahn, Rishu Verma, Renze Lou, Di Liu, Rui Zhang, and Wenpeng Yin. Large language models for mathematical reasoning: Progresses and challenges. In Neele Falk, Sara Papi, and Mike Zhang (eds.), Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, pp. 225–237, St. Juli...

[24] Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Michał Podstawski, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. Graph of thoughts: Solving elaborate problems with large language models. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16):17682–17690, 2024. doi:10.1609/aaai.v38i16.29720 ...

[25] Maciej Besta, Lorenzo Paleari, Jia Jiang, Robert Gerstenberger, You Wu, Patrick Iff, Ales Kubicek, Piotr Nyczyk, Diana Khimey, Jón Hannesson, Grzegorz Kwasniewski, Marcin Copik, Hubert Niewiadomski, and Torsten Hoefler. Affordable AI assistants with knowledge graph of thoughts, 2025. arXiv:2504.02670.

[26] Gymnasium. Gymnasium documentation. https://gymnasium.farama.org/, 2026. Accessed: Mar. 12, 2026.

[27] Thien-Loc Ha, Trong-Bao Ho, Long Nguyen, and Dien Dinh. Auto graph of thoughts: A hands-free and cost effective method for using graph of thoughts. In Proceedings of the 2024 10th International Conference on Computer Technology Applications, ICCTA '24, pp. 116–121, New York, NY, USA, 2024. Association for Computing Machinery. ISBN 9798400716386. doi:10.1145/3674558.3674574 ...

[28] John T. Hancock and Taghi M. Khoshgoftaar. Survey on categorical data for neural networks. Journal of Big Data, 7(1):28, 2020. ISSN 2196-1115. doi:10.1186/s40537-020-00305-w. URL https://doi.org/10.1186/s40537-020-00305-w

[29] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proceedings of the 34th International Conference on Neural Information Processing Systems,...

[30] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pp. 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics. URL https://aclanthology.org/W04-1013/

[31] Makoto Matsumoto and Takuji Nishimura. Mersenne twister: A 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Transactions on Modeling and Computer Simulation, 8(1):3–30, January 1998. doi:10.1145/272991.272995

[32] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. Playing Atari with deep reinforcement learning. CoRR, abs/1312.5602, 2013. URL http://arxiv.org/abs/1312.5602

[33] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937. PMLR, 2016.

[34] Melissa E. O'Neill. PCG: A family of simple fast space-efficient statistically good algorithms for random number generation. Technical Report HMC-CS-2014-0905, Harvey Mudd College, Claremont, CA, September 2014.

[35] OpenAPI Models. Models - OpenAI API. https://platform.openai.com/docs/models/overview, 2024.

[36] Python. Welcome to python.org. https://www.python.org/, 2026. Accessed: Mar. 12, 2026.

[37] Antonin Raffin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, and Noah Dormann. Stable-Baselines3: Reliable reinforcement learning implementations. Journal of Machine Learning Research, 22(268):1–8, 2021. URL http://jmlr.org/papers/v22/20-1364.html

[38] John K. Salmon, Mark A. Moraes, Ron O. Dror, and David E. Shaw. Parallel random numbers: as easy as 1, 2, 3. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–12. ACM, 2011. ISBN 978-1-4503-0771-0. doi:10.1145/2063384.2063405. URL https://dl.acm.org/doi/10.1145/2063384.2063405

[39] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[40] Stable Baselines3. Developer guide - Stable Baselines3 2.3.2 documentation. https://stable-baselines3.readthedocs.io/en/master/guide/developer.html, 2026. Accessed: Mar. 15, 2026.

[41] Richard S Sutton and Andrew G Barto. Reinforcement Learning, Second Edition. The MIT Press, 2018. ISBN 978-0-262-36401-0.

[42] Christopher Watkins. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, 1989.

[43] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS '22, Red Hook, NY, USA, 2024. Curran Associates Inc. ISBN 9781713871088.

[44] Eric Wiewiora. Reward shaping. In Claude Sammut and Geoffrey I. Webb (eds.), Encyclopedia of Machine Learning, pp. 863–865. Springer US, 2010. ISBN 978-0-387-30164-8. doi:10.1007/978-0-387-30164-8_731. URL https://doi.org/10.1007/978-0-387-30164-8_731

[45] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36, 2024.

[1] [2]

and Le, Quoc V

Wei, Jason and Wang, Xuezhi and Schuurmans, Dale and Bosma, Maarten and Ichter, Brian and Xia, Fei and Chi, Ed H. and Le, Quoc V. and Zhou, Denny , title =. Proceedings of the 36th International Conference on Neural Information Processing Systems , articleno =. 2024 , isbn =

work page 2024

[2] [3]

Advances in Neural Information Processing Systems , author =

Tree of thoughts: Deliberate problem solving with large language models , volume =. Advances in Neural Information Processing Systems , author =

work page

[3] [5]

Watkins, Christopher , title =

work page

[4] [6]

Reinforcement Learning, Second Edition , isbn =

Sutton, Richard S and Barto, Andrew G , year =. Reinforcement Learning, Second Edition , isbn =

work page

[5] [7]

Large Language Models for Mathematical Reasoning: Progresses and Challenges

Ahn, Janice and Verma, Rishu and Lou, Renze and Liu, Di and Zhang, Rui and Yin, Wenpeng. Large Language Models for Mathematical Reasoning: Progresses and Challenges. Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop. 2024

work page 2024

[6] [8]

, year =

Mnih, Volodymyr and Kavukcuoglu, Koray and Silver, David and Graves, Alex and Antonoglou, Ioannis and Wierstra, Daan and Riedmiller, Martin A. , year =. Playing Atari with Deep Reinforcement Learning , volume =

work page

[7] [9]

Asynchronous methods for deep reinforcement learning , pages =

Mnih, Volodymyr and Badia, Adria Puigdomenech and Mirza, Mehdi and Graves, Alex and Lillicrap, Timothy and Harley, Tim and Silver, David and Kavukcuoglu, Koray , year =. Asynchronous methods for deep reinforcement learning , pages =. International conference on machine learning , publisher =

work page

[8] [10]

Proximal policy optimization algorithms , journal =

Schulman, John and Wolski, Filip and Dhariwal, Prafulla and Radford, Alec and Klimov, Oleg , year =. Proximal policy optimization algorithms , journal =

work page

[9] [11]

2024 , urldate =

Models -. 2024 , urldate =

work page 2024

[10] [12]

Retrieval-augmented generation for knowledge-intensive NLP tasks , year =

Lewis, Patrick and Perez, Ethan and Piktus, Aleksandra and Petroni, Fabio and Karpukhin, Vladimir and Goyal, Naman and K\". Retrieval-augmented generation for knowledge-intensive NLP tasks , year =. Proceedings of the 34th International Conference on Neural Information Processing Systems , articleno =

work page

[11] [13]

Reward Shaping , isbn =

Wiewiora, Eric , editor =. Reward Shaping , isbn =. Encyclopedia of Machine Learning , publisher =. 2010 , doi =

work page 2010

[12] [14]

1998 , month = jan, doi =

Matsumoto, Makoto and Nishimura, Takuji , title =. 1998 , month = jan, doi =

work page 1998

[13] [15]

, title =

O'Neill, Melissa E. , title =

work page

[14] [17]

Welcome to Python.org , howpublished =

work page

[15] [18]

Gymnasium Documentation , howpublished =

work page

[16] [19]

Journal of Machine Learning Research , author =

Stable-Baselines3: Reliable Reinforcement Learning Implementations , volume =. Journal of Machine Learning Research , author =

work page

[17] [21]

ROUGE : A Package for Automatic Evaluation of Summaries

Lin, Chin-Yew. ROUGE : A Package for Automatic Evaluation of Summaries. Text Summarization Branches Out. 2004

work page 2004

[18] [22]

2504.02670 , title =

Besta, Maciej and Paleari, Lorenzo and Jiang, Jia and Gerstenberger, Robert and Wu, You and Iff, Patrick and Kubicek, Ales and Nyczyk, Piotr and Khimey, Diana and Hannesson, Jón and Kwasniewski, Grzegorz and Copik, Marcin and Niewiadomski, Hubert and Hoefler, Torsten , year =. 2504.02670 , title =

work page arXiv

[19] [23]

Janice Ahn, Rishu Verma, Renze Lou, Di Liu, Rui Zhang, and Wenpeng Yin. Large language models for mathematical reasoning: Progresses and challenges. In Neele Falk, Sara Papi, and Mike Zhang (eds.), Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, pp. 225–237, St. Juli...

[20] [24]

Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Michał Podstawski, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. Graph of thoughts: Solving elaborate problems with large language models. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16):17682–17690, 2024. doi:10.1609/aaai.v38i16.29720

[21] [25]

Maciej Besta, Lorenzo Paleari, Jia Jiang, Robert Gerstenberger, You Wu, Patrick Iff, Ales Kubicek, Piotr Nyczyk, Diana Khimey, Jón Hannesson, Grzegorz Kwasniewski, Marcin Copik, Hubert Niewiadomski, and Torsten Hoefler. Affordable AI assistants with knowledge graph of thoughts, 2025. arXiv:2504.02670

[22] [26]

Gymnasium. Gymnasium documentation. https://gymnasium.farama.org/, 2026. Accessed: Mar. 12, 2026

[23] [27]

Thien-Loc Ha, Trong-Bao Ho, Long Nguyen, and Dien Dinh. Auto graph of thoughts: A hands-free and cost effective method for using graph of thoughts. In Proceedings of the 2024 10th International Conference on Computer Technology Applications, ICCTA '24, pp. 116–121, New York, NY, USA, 2024. Association for Computing Machinery. ISBN 9798400716386. doi:10.1145/3674558.3674574

[24] [28]

John T. Hancock and Taghi M. Khoshgoftaar. Survey on categorical data for neural networks. Journal of Big Data, 7(1):28, 2020. ISSN 2196-1115. doi:10.1186/s40537-020-00305-w. URL https://doi.org/10.1186/s40537-020-00305-w

[25] [29]

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proceedings of the 34th International Conference on Neural Information Processing Systems,...

[26] [30]

Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pp. 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics. URL https://aclanthology.org/W04-1013/

[27] [31]

Makoto Matsumoto and Takuji Nishimura. Mersenne Twister: A 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Transactions on Modeling and Computer Simulation, 8(1):3–30, January 1998. doi:10.1145/272991.272995

[28] [32]

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. Playing Atari with deep reinforcement learning. CoRR, abs/1312.5602, 2013. URL http://arxiv.org/abs/1312.5602

[29] [33]

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937. PMLR, 2016

[30] [34]

Melissa E. O'Neill. PCG: A family of simple fast space-efficient statistically good algorithms for random number generation. Technical Report HMC-CS-2014-0905, Harvey Mudd College, Claremont, CA, September 2014

[31] [35]

OpenAI. Models - OpenAI API. https://platform.openai.com/docs/models/overview, 2024

[32] [36]

Python. Welcome to python.org. https://www.python.org/, 2026. Accessed: Mar. 12, 2026

[33] [37]

Antonin Raffin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, and Noah Dormann. Stable-Baselines3: Reliable reinforcement learning implementations. Journal of Machine Learning Research, 22(268):1–8, 2021. URL http://jmlr.org/papers/v22/20-1364.html

[34] [38]

John K. Salmon, Mark A. Moraes, Ron O. Dror, and David E. Shaw. Parallel random numbers: as easy as 1, 2, 3. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–12. ACM, 2011. ISBN 978-1-4503-0771-0. doi:10.1145/2063384.2063405. URL https://dl.acm.org/doi/10.1145/2063384.2063405

[35] [39]

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

[36] [40]

Stable Baselines3. Developer guide --- Stable Baselines3 2.3.2 documentation. https://stable-baselines3.readthedocs.io/en/master/guide/developer.html, 2026. Accessed: Mar. 15, 2026

[37] [41]

Richard S Sutton and Andrew G Barto. Reinforcement Learning, Second Edition. The MIT Press, 2018. ISBN 978-0-262-36401-0

[38] [42]

Christopher Watkins. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, 1989

[39] [43]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS '22, Red Hook, NY, USA, 2024. Curran Associates Inc. ISBN 9781713871088

[40] [44]

Eric Wiewiora. Reward shaping. In Claude Sammut and Geoffrey I. Webb (eds.), Encyclopedia of Machine Learning, pp. 863–865. Springer US, 2010. ISBN 978-0-387-30164-8. doi:10.1007/978-0-387-30164-8_731. URL https://doi.org/10.1007/978-0-387-30164-8_731

[41] [45]

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36, 2024