Verifiable Process Rewards for Agentic Reasoning

Chao Yu; Huaijie Wang; Huining Yuan; Jiaxuan Gao; Xiangmin Yi; Xiao-Ping Zhang; Yi Wu; Yu Wang; Zelai Xu

Verifiable process rewards from oracles give dense turn-level signals that improve credit assignment in long-horizon LLM reasoning.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.

T0 review · grok-4.3

2026-06-30 22:41 UTC pith:QBNBTLIK

Load-bearing objection: VPR supplies a concrete framework for turning reliable oracles into dense process rewards for agentic LLM training, backed by theory on credit assignment and some transfer results, but only in domains where those oracles exist.

arxiv 2605.10325 v2 pith:QBNBTLIK submitted 2026-05-11 cs.AI

Huining Yuan, Zelai Xu, Huaijie Wang, Xiangmin Yi, Jiaxuan Gao, Xiao-Ping Zhang, Yu Wang, Chao Yu, Yi Wu

classification cs.AI

keywords: verifiable process rewards, agentic reasoning, reinforcement learning, credit assignment, large language models, process supervision, LLM agents, dense rewards

verification ladder: T0 review, T1 audit, T2 compute, T3 formal, T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a method to turn objective checks on intermediate actions into dense rewards for training LLM agents on reasoning tasks. This targets the problem of sparse outcome feedback, which makes it hard for models to learn which steps in a long sequence were correct or incorrect. By applying the method in three verification settings, the work shows gains over both outcome-only rewards and rollout-based process rewards, with the gains carrying over to wider reasoning benchmarks. The central idea is that reliable intermediate verification supplies more localized learning signals whose value scales with how accurate the verifier is.

Core claim

In agentic reasoning problems where intermediate actions can be checked by symbolic or algorithmic oracles, converting those oracles into turn-level process rewards for reinforcement learning produces more effective credit assignment than outcome-level rewards alone, and the resulting policies transfer to general and agentic reasoning benchmarks outside the original training environments.

What carries the argument

Verifiable Process Rewards (VPR) framework that converts oracles into dense turn-level supervision signals for reinforcement learning.
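To make the oracle-to-reward conversion concrete, here is a minimal sketch of how a per-turn verdict could be mapped to a dense reward vector and contrasted with a sparse outcome reward. The names `turn_level_rewards`, `outcome_rewards`, and `toy_oracle`, the state and action encodings, and the reward values of +1 and -1 are all assumptions made for illustration; the paper's actual reward shaping is not specified on this page.

```python
from typing import Callable, List, Sequence, Tuple

State = Tuple[int, ...]  # hypothetical state encoding
Action = int             # hypothetical action encoding


def turn_level_rewards(
    trajectory: Sequence[Tuple[State, Action]],
    oracle: Callable[[State, Action], bool],
    valid_reward: float = 1.0,     # assumed value, not taken from the paper
    invalid_reward: float = -1.0,  # assumed value, not taken from the paper
) -> List[float]:
    """Return one reward per turn, based on the oracle's verdict for that turn."""
    return [valid_reward if oracle(s, a) else invalid_reward for s, a in trajectory]


def outcome_rewards(n_turns: int, success: bool) -> List[float]:
    """Sparse baseline: zero on every turn except the last one."""
    if n_turns == 0:
        return []
    return [0.0] * (n_turns - 1) + [1.0 if success else 0.0]


def toy_oracle(state: State, action: Action) -> bool:
    """Hypothetical oracle: the only valid action is sum(state) mod 3."""
    return action == sum(state) % 3


trajectory = [((1, 2), 0), ((2, 2), 0), ((0, 1), 1)]
dense = turn_level_rewards(trajectory, toy_oracle)
sparse = outcome_rewards(len(trajectory), success=False)
print(dense)   # the second turn is the only one marked invalid
print(sparse)  # a failed episode carries no per-turn information
```

In this toy episode the dense vector singles out the one incorrect turn, whereas the sparse vector treats all three turns identically. That is the localization effect the review attributes to VPR.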

Load-bearing premise

Reliable oracles exist that can objectively verify whether each intermediate action is correct.

What would settle it

A controlled run in which the oracle is deliberately made noisy or incorrect on a known fraction of steps, after which VPR no longer outperforms outcome-level rewards.
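One way such a controlled run could be set up is to wrap a trusted oracle so that its verdict is flipped with a fixed probability. The sketch below is an assumption about how to build that wrapper, not a procedure from the paper; `make_noisy_oracle` and `flip_prob` are hypothetical names.

```python
import random
from typing import Any, Callable

Oracle = Callable[[Any, Any], bool]


def make_noisy_oracle(oracle: Oracle, flip_prob: float, seed: int = 0) -> Oracle:
    """Wrap `oracle` so each verdict is inverted with probability `flip_prob`."""
    if not 0.0 <= flip_prob <= 1.0:
        raise ValueError("flip_prob must lie in [0, 1]")
    rng = random.Random(seed)  # private generator keeps runs reproducible

    def noisy(state: Any, action: Any) -> bool:
        verdict = oracle(state, action)
        # rng.random() is in [0, 1), so flip_prob=0 never flips and 1 always flips.
        return (not verdict) if rng.random() < flip_prob else verdict

    return noisy


def exact(state: Any, action: Any) -> bool:
    """Hypothetical ground-truth oracle used only for this demonstration."""
    return action == 0


for p in (0.0, 0.1, 0.3, 0.5):
    noisy = make_noisy_oracle(exact, p, seed=42)
    agreement = sum(noisy(None, a % 2) == exact(None, a % 2) for a in range(1000)) / 1000
    print(f"flip_prob={p:.1f} agreement={agreement:.3f}")
```

Sweeping `flip_prob` and retraining at each level would give the reliability curve the falsifier asks for. At `flip_prob=0.5` the verdicts carry no information about the true label, so that setting is a natural point to compare against outcome-only rewards.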

If this is right

Dense verifier-grounded rewards localize learning signals and ease long-horizon credit assignment.
The size of the improvement scales with the reliability of the verifier.
Policies trained this way outperform both outcome-level reward and rollout-based process reward baselines in controlled environments.
The learned skills transfer to general reasoning and agentic reasoning benchmarks beyond the training settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same oracle-to-reward conversion could be applied to other sequential decision tasks that already have partial verification tools.
If approximate or learned verifiers can be substituted for perfect oracles, the approach might extend to less structured problems.
Training with VPR may change the distribution of errors an agent makes, which could affect how it combines with other alignment techniques.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

VPR supplies a concrete framework for turning reliable oracles into dense process rewards for agentic LLM training, backed by theory on credit assignment and some transfer results, but only in domains where those oracles exist.

The paper introduces Verifiable Process Rewards (VPR), which derive dense, turn-level supervision from symbolic or algorithmic oracles in long-horizon agentic reasoning tasks. This targets the credit assignment problem created by sparse outcome rewards, where good intermediate steps can be lost in a failed trajectory.

What stands out as new is the three oracle instantiations—search-based for deduction, constraint-based for logic, and posterior-based for inference—plus the theoretical analysis showing localized signals help when the verifier is reliable, and the reported transfer to broader benchmarks. The work does a decent job scoping itself to settings where intermediate verification is objective and feasible, and it compares against outcome-level and rollout-based baselines in controlled environments.

The soft spots are proportionate to the claims. Everything rests on having reliable oracles, which the authors acknowledge, so the method does not extend to open-ended or unstructured tasks yet. The empirical gains and transfer are presented as positive, but without full details on controls, error bars, or exact effect sizes it is hard to gauge robustness. The theory is conditional rather than absolute, which is honest but limits how far the practical payoff goes.

This is aimed at researchers working on process supervision and RL for LLM reasoning. It shows clear engagement with the credit assignment problem and supplies a usable framework with honest limits, so it deserves a serious referee rather than a desk reject.

Referee Report

0 major / 4 minor

Summary. The paper introduces Verifiable Process Rewards (VPR), a framework that converts symbolic or algorithmic oracles into dense turn-level rewards for RL training of LLM agents on agentic reasoning tasks. It instantiates the approach in search-based, constraint-based, and posterior-based verification settings, provides a theoretical analysis of improved long-horizon credit assignment conditional on verifier reliability, and reports empirical outperformance versus outcome-level and rollout-based baselines together with positive transfer to general and agentic reasoning benchmarks.

Significance. If the results hold, the work supplies a practical route to denser supervision signals precisely where reliable intermediate verification is available, directly addressing credit-assignment sparsity in long-horizon agentic reasoning. The conditional theoretical analysis and the transfer results are notable strengths; the explicit scoping to oracle-equipped domains and the acknowledgment of limitations for open-ended settings further increase the contribution's credibility.

minor comments (4)

[§1] The introduction would benefit from a concise table or paragraph summarizing the three verification settings and the corresponding oracles before the method section.
[§5.2] In the experimental section, the precise number of rollouts used for the rollout-based process-reward baseline and the exact definition of the 'reliability' metric for the verifiers should be stated explicitly to allow direct replication.
[Figure 3] The evaluation curves over training would be clearer if the y-axis scale were normalized across environments or if shaded regions explicitly indicated standard error rather than standard deviation.
[§6] A short discussion of how VPR's reward density scales with trajectory length would help readers assess applicability to longer-horizon tasks beyond the reported benchmarks.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our work, the accurate summary of the VPR framework, and the recommendation for minor revision. We are pleased that the conditional theoretical analysis, transfer results, and scoping to oracle-equipped domains were viewed as strengths.

Circularity Check

0 steps flagged

No significant circularity identified

The paper's derivation chain consists of a theoretical analysis showing credit-assignment benefits from dense verifier-grounded rewards (conditional on external oracle reliability) plus empirical outperformance in scoped environments with search-, constraint-, and posterior-based oracles. No equations, predictions, or central claims reduce by construction to fitted parameters from the same data, self-citations, or imported ansatzes; verifier reliability is treated as an independent external property rather than an internal fit, and limitations for open-ended settings are explicitly stated. The reported gains therefore remain independent of the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the existence of reliable oracles for intermediate verification and on the assumption that the three chosen verification styles are representative of densely-verifiable agentic problems; no explicit free parameters or invented physical entities are introduced in the abstract.

axioms (1)

Domain assumption: Intermediate actions in the studied problems can be objectively checked by symbolic, algorithmic, or posterior oracles.
Invoked when defining the class of densely-verifiable agentic reasoning problems and when converting oracles into turn-level rewards.
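As an illustration of what "objectively checked" can mean in the constraint-based case, the sketch below tests a candidate Sudoku digit against its row, column, and 3x3 box. It is a local consistency check only, which is weaker than a full constraint solver: a digit can pass it and still not belong to the puzzle's solution. The function name `digit_is_consistent`, the 0-indexed coordinates, and the use of 0 for empty cells are assumptions made for this example.

```python
from typing import List

Grid = List[List[int]]  # 9x9 board, 0 marks an empty cell (assumed encoding)


def digit_is_consistent(grid: Grid, row: int, col: int, digit: int) -> bool:
    """Return True if placing `digit` at (row, col) breaks no Sudoku constraint.

    Coordinates are 0-indexed here. This only checks for direct conflicts;
    it does not prove the digit is part of a valid completed grid.
    """
    if not 1 <= digit <= 9:
        return False
    if grid[row][col] != 0:  # cell already filled, nothing may be placed
        return False
    if digit in grid[row]:  # row conflict
        return False
    if any(grid[r][col] == digit for r in range(9)):  # column conflict
        return False
    box_row, box_col = 3 * (row // 3), 3 * (col // 3)
    for r in range(box_row, box_row + 3):  # 3x3 box conflict
        for c in range(box_col, box_col + 3):
            if grid[r][c] == digit:
                return False
    return True


grid = [[0] * 9 for _ in range(9)]
grid[0][0] = 5
print(digit_is_consistent(grid, 0, 8, 5))  # same row as the existing 5
print(digit_is_consistent(grid, 4, 4, 5))  # no shared row, column, or box
```

A verdict from a function like this could feed a turn-level reward directly. A stricter oracle would compare the candidate digit against a solved grid instead.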

invented entities (1)

Verifiable Process Rewards (VPR) framework (no independent evidence)
purpose: Converts oracles into dense turn-level supervision signals for RL
New named framework introduced to organize the three verification styles and the training procedure.

pith-pipeline@v0.9.1-grok · 5809 in / 1454 out tokens · 21017 ms · 2026-06-30T22:41:42.387635+00:00 · methodology

Original abstract

Reinforcement learning from verifiable rewards (RLVR) has improved the reasoning abilities of large language models (LLMs), but most existing approaches rely on sparse outcome-level feedback. This sparsity creates a credit assignment challenge in long-horizon agentic reasoning: a trajectory may fail despite containing many correct intermediate decisions, or succeed despite containing flawed ones. In this work, we study a class of densely-verifiable agentic reasoning problems, where intermediate actions can be objectively checked by symbolic or algorithmic oracles. We propose Verifiable Process Rewards (VPR), a framework that converts such oracles into dense turn-level supervision for reinforcement learning, and instantiate it in three representative settings: search-based verification for dynamic deduction, constraint-based verification for logical reasoning, and posterior-based verification for probabilistic inference. We further provide a theoretical analysis showing that dense verifier-grounded rewards can improve long-horizon credit assignment by providing more localized learning signals, with the benefit depending on the reliability of the verifier. Empirically, VPR outperforms outcome-level reward and rollout-based process reward baselines across controlled environments, and more importantly, transfers to both general and agentic reasoning benchmarks, suggesting that verifiable process supervision can foster general reasoning skills applicable beyond the training environments. Our results indicate that VPR is a promising approach for enhancing LLM agents whenever reliable intermediate verification is available, while also highlighting its dependence on oracle quality and the open challenge of extending VPR to less structured, open-ended environments.
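The search-based setting in the abstract can be illustrated on Tic-Tac-Toe, the environment named in the paper's Figure 2. The figure caption describes MCTS lookahead; the sketch below swaps in exhaustive minimax instead, which is feasible because the game tree is tiny. All names here (`value`, `oracle_valid_moves`, the tuple board encoding) are hypothetical and not taken from the paper.

```python
from functools import lru_cache
from typing import List, Optional, Tuple

Board = Tuple[str, ...]  # 9 cells in row-major order: 'X', 'O', or '.'

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def winner(board: Board) -> Optional[str]:
    """Return the mark that completed a line, or None."""
    for a, b, c in LINES:
        if board[a] != '.' and board[a] == board[b] == board[c]:
            return board[a]
    return None


def place(board: Board, idx: int, mark: str) -> Board:
    return board[:idx] + (mark,) + board[idx + 1:]


@lru_cache(maxsize=None)
def value(board: Board, to_move: str) -> int:
    """Minimax value for the player to move: +1 win, 0 draw, -1 loss."""
    w = winner(board)
    if w is not None:
        return 1 if w == to_move else -1
    if '.' not in board:
        return 0
    other = 'O' if to_move == 'X' else 'X'
    return max(-value(place(board, i, to_move), other)
               for i in range(9) if board[i] == '.')


def oracle_valid_moves(board: Board, to_move: str) -> List[int]:
    """Cells whose minimax value equals the best achievable value."""
    other = 'O' if to_move == 'X' else 'X'
    scores = {i: -value(place(board, i, to_move), other)
              for i in range(9) if board[i] == '.'}
    if not scores:
        return []
    best = max(scores.values())
    return [i for i, s in scores.items() if s == best]


board = tuple("XX.OO....")
print(oracle_valid_moves(board, 'X'))  # cell 2 completes the top row
```

Labelling every value-maximizing move as valid, rather than one arbitrary best move, avoids penalizing the agent for choosing among equally good actions. Whether the paper handles ties this way is not stated on this page.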

Figures

Figures reproduced from arXiv: 2605.10325 by Chao Yu, Huaijie Wang, Huining Yuan, Jiaxuan Gao, Xiangmin Yi, Xiao-Ping Zhang, Yi Wu, Yu Wang, Zelai Xu.

**Figure 2.** Three VPR instantiations. Search-based (Tic-Tac-Toe): MCTS lookahead labels the move with the highest value as oracle-valid. Constraint-based (Sudoku): a constraint solver verifies the candidate digit against the row, column, and the local box. Posterior-based (Minesweeper): posterior mine probabilities mark zero-probability cells as safe reveals and probability-one cells as flags.

**Figure 3.** Evaluation curves over GRPO training in the three in-domain environments.

**Figure 4.** Comparison of VPR and outcome reward (OR) on a representative Minesweeper trajectory.
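The posterior-based rule in the Figure 2 caption (zero-probability cells are safe reveals, probability-one cells are flags) can be sketched by brute-force enumeration on a small board. This assumes a uniform prior over all placements of a known number of mines, and it is an illustrative stand-in for whatever inference procedure the paper uses; `mine_posteriors` and `oracle_actions` are hypothetical names.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Cell = Tuple[int, int]


def neighbors(cell: Cell, rows: int, cols: int) -> List[Cell]:
    """The up to eight adjacent cells, diagonals included."""
    r, c = cell
    return [(r + dr, c + dc)
            for dr in (-1, 0, 1) for dc in (-1, 0, 1)
            if (dr, dc) != (0, 0) and 0 <= r + dr < rows and 0 <= c + dc < cols]


def mine_posteriors(rows: int, cols: int, n_mines: int,
                    revealed: Dict[Cell, int]) -> Dict[Cell, float]:
    """Posterior mine probability for each hidden cell, by exhaustive enumeration.

    `revealed` maps each revealed safe cell to its adjacent-mine count.
    """
    hidden = [(r, c) for r in range(rows) for c in range(cols)
              if (r, c) not in revealed]
    counts = {cell: 0 for cell in hidden}
    total = 0
    for mines in combinations(hidden, n_mines):
        mine_set = set(mines)
        consistent = all(
            sum(n in mine_set for n in neighbors(cell, rows, cols)) == k
            for cell, k in revealed.items()
        )
        if consistent:
            total += 1
            for m in mines:
                counts[m] += 1
    if total == 0:
        raise ValueError("no mine placement matches the revealed numbers")
    return {cell: counts[cell] / total for cell in hidden}


def oracle_actions(post: Dict[Cell, float]) -> Tuple[List[Cell], List[Cell]]:
    """Split hidden cells into certain-safe reveals and certain-mine flags."""
    safe = [c for c, p in post.items() if p == 0.0]
    flags = [c for c, p in post.items() if p == 1.0]
    return safe, flags


# 1x3 board with one mine; the revealed left cell reports one adjacent mine.
post = mine_posteriors(1, 3, 1, {(0, 0): 1})
print(oracle_actions(post))
```

Enumeration is exponential in general, but for the 5x5 board with 5 mines described in the paper's environment prompt there are at most 53,130 placements to check, which is cheap. Cells with intermediate probabilities receive no oracle verdict under this rule.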

[1] Baian Chen, Chang Shu, Ehsan Shareghi, Nigel Collier, Karthik Narasimhan, and Shunyu Yao. Fireact: Toward language agent fine-tuning. arXiv preprint arXiv:2310.05915, 2023.

[2] Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for LLM agent training. arXiv preprint arXiv:2505.10978, 2025.

[3] Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen. CRITIC: Large language models can self-correct with tool-interactive critiquing. In The Twelfth International Conference on Learning Representations, 2024.

[4] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

[5] Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. Large language models cannot self-correct reasoning yet. In The Twelfth International Conference on Learning Representations, 2024.

[6] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024.

[7] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In The Twelfth International Conference on Learning Representations, 2024.

[8] Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1):99–134, 1998.

[9] Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, Aaron Courville, and Nicolas Le Roux. VinePPO: Refining credit assignment in RL training of LLMs. In Forty-second International Conference on Machine Learning, 2025.

[10] Diederik P Kingma. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[11] Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In Johannes Fürnkranz, Tobias Scheffer, and Myra Spiliopoulou, editors, Machine Learning: ECML 2006, pages 282–293, Berlin, Heidelberg, 2006. Springer Berlin Heidelberg.

[12] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.

[13] Marc Lanctot, Edward Lockhart, Jean-Baptiste Lespiau, Vinicius Zambaldi, Satyaki Upadhyay, Julien Pérolat, Sriram Srinivasan, Finbarr Timbers, Karl Tuyls, Shayegan Omidshafiei, Daniel Hennes, Dustin Morrill, Paul Muller, Timo Ewalds, Ryan Faulkner, János Kramár, Bart De Vylder, Brennan Saeta, James Bradbury, David Ding, Sebastian Borgeaud, Matthew Lai, Ju... OpenSpiel: A Framework for Reinforcement Learning in Games.

[14] Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven Chu Hong Hoi. CodeRL: Mastering code generation through pretrained models and deep reinforcement learning. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 21314–21328. Curran Associates, ...

[15] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, 2024.

[16] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. AgentBench: Evaluating LLMs as agents. In The Twelfth International Conference on Learning R...

[17] Zichen Liu, Anya Sims, Keyu Duan, Changyu Chen, Simon Yu, Xiangxin Zhou, Haotian Xu, Shaopan Xiong, Bo Liu, Chenmien Tan, et al. GEM: A gym for agentic LLMs. arXiv preprint arXiv:2510.01051, 2025.

[18] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedbac...

[19] Liangming Pan, Alon Albalak, Xinyi Wang, and William Wang. Logic-LM: Empowering large language models with symbolic solvers for faithful logical reasoning. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 3806–3824, Singapore, December 2023. Association for Computational Linguistics.

[20] Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.

[21] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

[22] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 8634–8652. Curran Associates, Inc., 2023.

[23] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.

[24] Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. ALFWorld: Aligning text and embodied environments for interactive learning. arXiv preprint arXiv:2010.03768, 2020.

[25] ModelScope Team. EvalScope: Evaluation framework for large models, 2024.

[26] Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process- and outcome-based feedback. arXiv preprint arXiv:2211.14275, 2022.

[27] Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345, 2024.

[28] Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce LLMs step-by-step without human annotations. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), page...

[29] Weixun Wang, Shaopan Xiong, Gengru Chen, Wei Gao, Sheng Guo, Yancheng He, Ju Huang, Jiaheng Liu, Zhendong Li, Xiaoyang Li, et al. Reinforcement learning optimization for large-scale learning: An efficient and user-friendly scaling library. arXiv preprint arXiv:2506.06122, 2025.

[30] Peter West, Ximing Lu, Nouha Dziri, Faeze Brahman, Linjie Li, Jena D. Hwang, Liwei Jiang, Jillian Fisher, Abhilasha Ravichander, Khyathi Chandu, Benjamin Newman, Pang Wei Koh, Allyson Ettinger, and Yejin Choi. The generative AI paradox: “what it can create, it may not understand”. In The Twelfth International Conference on Learning Representations, 2024.

[31] Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, et al. The rise and potential of large language model based agents: A survey. Science China Information Sciences, 68(2):121101, 2025.

[32] Zelai Xu, Zhexuan Xu, Xiangmin Yi, Huining Yuan, Xinlei Chen, Yi Wu, Chao Yu, and Yu Wang. VS-Bench: Evaluating VLMs for strategic reasoning and decision-making in multi-agent environments. Coming soon, 2025.

[33] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

[34] Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. WebShop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems, 35:20744–20757, 2022.

[35] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 11809–11822. Curran Associates, Inc., 2023.

[36] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2023.

[37] Fei Yu, Anningzhe Gao, and Benyou Wang. OVM, outcome-supervised value models for planning in mathematical reasoning. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Findings of the Association for Computational Linguistics: NAACL 2024, pages 858–875, Mexico City, Mexico, June 2024. Association for Computational Linguistics.

[38] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Sys...

[39] Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. Language agent tree search unifies reasoning, acting, and planning in language models. In Forty-first International Conference on Machine Learning, 2024.

A Reproducibility Statement

To facilitate future research and ensure the reproducibility of our results, we have made al...

[40] Tic-tac-toe is a two-player board game played on a three-by-three grid. The grid is 0-indexed, where (0,0) is the top-left corner and (2,2) is the bottom-right corner.

[41] Two players take turns placing their marks X and O in empty cells of the grid.

[42] The player who first places three of their marks in a horizontal, vertical, or diagonal line wins.

[43] If all cells are filled and no player wins, the game ends in a draw.

PLAYER INFORMATION:

[44] Your mark is X. You are competing with another player controlling the mark O.

[45] In each of your turns:
a. The game state demonstrates the current board with a three-line text grid, where 'X' and 'O' are the marks of the two players, and '.' represents empty cells.
b. You need to choose an action to place your mark in an empty cell, based on the given game state and the history of your decisions.
c...
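The win and draw rules in the Tic-tac-toe prompt can be checked mechanically. The following is a minimal sketch, assuming the board is passed as three strings of 'X', 'O', and '.' characters; the `winner` function and `LINES` constant are hypothetical names introduced here for illustration and are not taken from the paper's code.

```python
from typing import List, Optional

# All eight winning lines on a 0-indexed 3x3 grid, as lists of (row, col) cells.
LINES = (
    [[(r, c) for c in range(3)] for r in range(3)]    # three rows
    + [[(r, c) for r in range(3)] for c in range(3)]  # three columns
    + [[(i, i) for i in range(3)], [(i, 2 - i) for i in range(3)]]  # two diagonals
)


def winner(board: List[str]) -> Optional[str]:
    """Return 'X' or 'O' for a completed line, 'draw' for a full board, else None.

    Assumption: `board` is three strings of length 3 using 'X', 'O', and '.'.
    """
    for line in LINES:
        marks = {board[r][c] for r, c in line}
        if len(marks) == 1 and marks != {"."}:
            return marks.pop()
    if all("." not in row for row in board):
        return "draw"
    return None
```

A check like this only decides terminal outcomes; it says nothing about whether an intermediate move was a good one.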

[46] Sudoku is played on a 9x9 grid. Rows and columns are 1-indexed (1 to 9).

[47] The goal is to fill the empty cells with digits from 1 to 9.

[48] Each row must contain all digits from 1 to 9 without repetition.

[49] Each column must contain all digits from 1 to 9 without repetition.

[50] Each of the nine 3x3 subgrids must contain all digits from 1 to 9 without repetition.

[51] You cannot overwrite pre-filled cells.

PLAYER INFORMATION:

[52] The current board state is displayed as a text grid.
- '.' represents an empty cell.
- Numbers represent filled cells.
- Rows are labeled R1, R2 ... and Columns C1, C2

[53] In each turn, you choose an action to fill an empty cell with a number.

[54] All legal actions are provided in the format `<fill({row},{col},{number})>`.

RESPONSE INSTRUCTIONS:

Always choose strictly one action and output `<answer>{your chosen action}</answer>` with no extra text after you finish the thinking process. For example, to fill row 1, column 1 with number 5, output `<answer><fi...
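To make the Sudoku action format and constraints concrete, here is a minimal sketch of parsing a response and checking a proposed fill against the row, column, and subgrid rules. It rests on several assumptions: the full response shape `<answer><fill(r,c,n)></answer>` is inferred from the two format strings in the prompt (the prompt's own example is truncated), the board is a 9x9 list of integers with 0 for empty cells, and `parse_fill` and `is_consistent` are hypothetical helper names. This is a local consistency check only; it does not verify that a fill agrees with the puzzle's unique solution, which a solver-backed oracle would need to do.

```python
import re
from typing import List, Optional, Tuple

# Assumed response shape, built from the prompt's action and answer formats.
FILL_RE = re.compile(
    r"<answer>\s*<fill\((\d+)\s*,\s*(\d+)\s*,\s*(\d+)\)>\s*</answer>"
)


def parse_fill(response: str) -> Optional[Tuple[int, ...]]:
    """Extract (row, col, number) from a response, or None if it is malformed."""
    match = FILL_RE.search(response)
    return tuple(int(g) for g in match.groups()) if match else None


def is_consistent(board: List[List[int]], row: int, col: int, number: int) -> bool:
    """Check a fill against the row, column, and 3x3 subgrid rules.

    Assumption: `board` is 9x9 with 0 for empty cells; row and col are 1-indexed.
    """
    if not (1 <= row <= 9 and 1 <= col <= 9 and 1 <= number <= 9):
        return False
    r, c = row - 1, col - 1
    if board[r][c] != 0:  # filled cells cannot be overwritten
        return False
    if number in board[r]:
        return False
    if any(board[i][c] == number for i in range(9)):
        return False
    top, left = 3 * (r // 3), 3 * (c // 3)
    return all(
        board[i][j] != number
        for i in range(top, top + 3)
        for j in range(left, left + 3)
    )
```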

[55] Minesweeper is played on a 5x5 grid of cells. The grid contains exactly 5 hidden mines. The grid is 0-indexed, where (0,0) is the top-left corner and (4,4) is the bottom-right corner.

[56] The goal is to reveal all cells that do not contain mines without revealing any mine.

[57] If you reveal a mine, you lose the game immediately.

[58] If you reveal a safe cell, it will show a number indicating how many mines are adjacent to it (neighbors include diagonals).

[59] You can also place a flag on a cell if you suspect it contains a mine, or remove a flag if you change your mind.

PLAYER INFORMATION:

[60] The current board state is displayed as a text grid, where:
- '.' represents an unrevealed cell.
- 'F' represents a flagged cell.
- A number (0-8) represents a revealed safe cell with that many adjacent mines.

[61] In each turn, you must choose an action to either reveal a cell or flag/unflag a cell.

[62] All legal actions are provided in the format `<reveal({row},{col})>` or `<flag({row},{col})>`. The 'flag' command acts as a toggle: play it on an unflagged cell to place a flag, or on a flagged cell to remove it.

RESPONSE INSTRUCTIONS:

Always choose strictly one action and output `<answer>{your chosen action}</...
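Two Minesweeper rules in the prompt are easy to misread: the adjacency count includes diagonal neighbors, and `flag` is a toggle. The sketch below illustrates both, assuming mines and flags are stored as sets of 0-indexed (row, col) pairs; `adjacent_mines` and `toggle_flag` are hypothetical names chosen for this example, not functions from the paper.

```python
from typing import Set, Tuple

Cell = Tuple[int, int]


def adjacent_mines(mines: Set[Cell], row: int, col: int) -> int:
    """Count mines in the up-to-eight neighbors of (row, col), diagonals included.

    Off-grid neighbors need no bounds check because they are never in `mines`.
    """
    return sum(
        (row + dr, col + dc) in mines
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )


def toggle_flag(flags: Set[Cell], row: int, col: int) -> None:
    """Place a flag on an unflagged cell, or remove it from a flagged one."""
    flags.symmetric_difference_update({(row, col)})
```

For example, with mines at (0,0), (0,1), and (1,1), revealing (1,0) would show 3, since all three mines touch that cell.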