pith. sign in

arxiv: 2606.18954 · v1 · pith:C3PESRJWnew · submitted 2026-06-17 · 💻 cs.CL

GraphPO: Graph-based Policy Optimization for Reasoning Models

Pith reviewed 2026-06-26 20:43 UTC · model grok-4.3

classification 💻 cs.CL
keywords Graph-based Policy OptimizationRLVRreasoning modelsdirected acyclic graphsemantic equivalenceadvantage estimationpolicy optimizationreinforcement learning with verifiable rewards
0
0 comments X

The pith

GraphPO represents reasoning rollouts as directed acyclic graphs to merge equivalent paths and improve policy optimization in reinforcement learning with verifiable rewards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

GraphPO addresses two key problems in reinforcement learning with verifiable rewards for reasoning models. Independent sampling of responses often repeats similar intermediate steps, wasting computation on redundant exploration. Sparse rewards from final answers also make it difficult to identify which steps were useful. By representing the collection of rollouts as a directed acyclic graph and merging paths that reach the same semantic state, GraphPO lets equivalent reasoning paths share their future steps. This reallocates the exploration budget toward more diverse paths while providing finer-grained advantage signals that reduce estimation variance.

Core claim

The central discovery is that modeling reasoning rollouts as a directed acyclic graph, with semantic states as nodes and reasoning steps as edges, allows merging of semantically equivalent paths. This merging enables sharing of suffixes across paths, reallocates computational budget to diverse exploration, and supports assigning efficiency advantages to incoming edges and correctness advantages to outgoing edges. Theory establishes that this reduces variance in advantage estimation and improves overall reasoning efficiency. Experiments across multiple large language models and benchmarks demonstrate consistent improvements over chain- and tree-based baselines under fixed token or response bu

What carries the argument

The directed acyclic graph structure for rollouts, where nodes summarize semantic states from paths and edges represent reasoning steps, which enables equivalence class merging.

If this is right

  • Advantage estimation variance is reduced compared to independent or tree-structured sampling.
  • Reasoning efficiency is enhanced through reallocation of budget away from redundant expansions.
  • Process supervision can be derived from outcome rewards via the graph structure.
  • Performance improves on reasoning and agentic search tasks with the same computational budget.
  • Equivalent paths can share suffixes, enabling more diverse exploration.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This graph merging technique might apply to other domains involving sequential reasoning or planning where semantic equivalence can be detected.
  • If semantic summarization proves robust, it could lead to more scalable training of reasoning models by maximizing the diversity of explored paths.
  • Connections to graph-based search algorithms in AI planning could be explored to further optimize the merging process.
  • The variance reduction might compound with other variance-reduction techniques in RL.

Load-bearing premise

The method assumes that semantic states summarized from reasoning paths can be identified accurately enough to merge paths without errors or loss of important distinctions.

What would settle it

A controlled test where paths that are actually distinct are incorrectly merged, resulting in lower performance than tree-based methods, or where variance does not decrease as predicted.

Figures

Figures reproduced from arXiv: 2606.18954 by Dandan Zheng, Ge Wu, Hao Sun, Jian Li, Jingdong Chen, Jun Zhou, Weilong Chai, Wenyue Tang, Xinyu Tang, Yuliang Zhan.

Figure 1
Figure 1. Figure 1: Comparison of rollout strategies. (a) Chain rollouts sample independent trajectories. (b) Tree rollouts share prefixes. (c) Graph rollouts merge semantically similar states. Based on this insight, we propose GraphPO (Graph-based Policy Optimization), a novel RL framework that represents reasoning rollouts as an RL graph, where edges denote generated rea￾soning steps and nodes denote the intermedi￾ate seman… view at source ↗
Figure 2
Figure 2. Figure 2: Empirical study of semantic redundancy and exploration efficiency. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overview of GraphPO. (a) Overall model architecture. (b) Graph Construction via Step￾Level Expansion. (c) The resulting RL graph shares downstream successors among merged states. (d) Scoring Nodes and Deriving Graph Step Rewards. (e) Dual-Group Graph Advantage. sharing ancestors, but it still independently reach and expand semantically similar reasoning, causing redundant exploration. In particular, when t… view at source ↗
Figure 4
Figure 4. Figure 4: Training dynamics of Qwen2.5-7B-Math. (a) Accuracy. (b) Entropy. (c) Response length [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: (a) PRM methods vs. GraphPO. (b) Training Performance and Training Time. (c) Trajectory generation efficiency. The bubble size represents the number of trajectories generated per second. shortest responses by merging semantically equivalent states and using an efficiency advantage to assign higher credit to shorter paths that reach the same state. GraphPO Outperforms PRM Methods Without Annotated Process D… view at source ↗
Figure 6
Figure 6. Figure 6: Performance under Dif￾ferent Merging Thresholds. Moderate Merging Thresholds Yield the Best GraphPO Performance. The merging threshold κ controls how strictly nodes are merged. In this section, we conduct ablation studies to analyze the impact of the merging threshold [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
read the original abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard paradigm for enhancing the capability of large reasoning models. RLVR typically samples responses independently and optimizes the policy using from final answers. This paradigm has two limitations. First, independently responses often contain similar intermediate reasoning steps, causing redundant exploration and wasted computation. Second, sparse final-answer rewards make it hard to identify useful steps. Tree-based methods partly address this problem by sharing prefixes and comparing branches from the same prefix to provide fine-grained signals. However, tree branches are still expanded independently. When different branches reach similar reasoning states, they cannot share information and repeat similar exploration. Moreover, tree-based methods ignore such dispersion and only perform local comparisons within separate branches, which can lead to higher variance in advantage estimation. To address this challenge, we propose GraphPO (Graph-based Policy Optimization), a novel RL framework that represents rollouts as a directed acyclic graph, with reasoning steps as edges and semantic states summarized from the reasoning paths as nodes. GraphPO merges semantically equivalent reasoning paths into equivalence classes, allowing them to share suffixes and reallocating budget away from redundant expansions to diverse exploration. Furthermore, we assign efficiency advantages to incoming edges and correctness advantages to outgoing edges, thereby improving inference efficiency while deriving process supervision from outcome. Theory shows that GraphPO reduces advantage-estimation variance and enhances reasoning efficiency. Experiments on three LLMs across reasoning and agentic search benchmarks show that GraphPO consistently outperforms chain- and tree-based baselines with the same token budgets or response budgets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes GraphPO, a graph-based policy optimization framework for RLVR in large reasoning models. It represents rollouts as a DAG with reasoning steps as edges and semantic states (summarized from paths) as nodes, merging semantically equivalent paths into equivalence classes to share suffixes, reallocate budget from redundant expansions, and assign efficiency advantages to incoming edges plus correctness advantages to outgoing edges. Theory claims variance reduction in advantage estimation; experiments on three LLMs across reasoning and agentic benchmarks report consistent outperformance versus chain- and tree-based baselines under fixed token or response budgets.

Significance. If the equivalence-class merging proves reliable, the approach could improve sample efficiency in reasoning by eliminating redundant exploration across similar states and deriving process-level signals from outcome rewards, addressing a clear limitation of both independent sampling and tree-structured methods. The split-advantage construction and claimed variance reduction are novel extensions worth exploring if the core assumption holds.

major comments (3)
  1. [Abstract, §3] Abstract and §3 (Graph Construction): the mechanism for summarizing semantic states from reasoning paths, the similarity metric, and any error tolerance or validation for equivalence-class merging are not specified. This is load-bearing for the variance-reduction claim, because over-merging would propagate incorrect shared suffixes and corrupt advantage estimates while under-merging would retain the redundancy the method is designed to eliminate.
  2. [§4] §4 (Theoretical Analysis): the variance-reduction proof assumes that merged nodes correctly represent identical states; without a stated bound on summarization error or an empirical validation of merge accuracy, the derivation does not establish that the claimed reduction holds under realistic summarization noise.
  3. [§5] §5 (Experiments): reported gains are stated without error bars, statistical significance tests, or ablation on the summarizer itself; it is therefore impossible to determine whether observed improvements are attributable to the graph structure or to implementation details of the (unspecified) state summarizer.
minor comments (1)
  1. [§3] Notation for incoming versus outgoing advantages should be introduced with an explicit diagram or equation reference in the method section to avoid ambiguity when reading the advantage-assignment paragraph.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive feedback. We agree that the state summarization details, theoretical assumptions, and experimental rigor require clarification to substantiate the core claims. We will revise the manuscript to address each point.

read point-by-point responses
  1. Referee: [Abstract, §3] Abstract and §3 (Graph Construction): the mechanism for summarizing semantic states from reasoning paths, the similarity metric, and any error tolerance or validation for equivalence-class merging are not specified. This is load-bearing for the variance-reduction claim, because over-merging would propagate incorrect shared suffixes and corrupt advantage estimates while under-merging would retain the redundancy the method is designed to eliminate.

    Authors: We agree these implementation details are essential. In the revised manuscript we will expand §3 to specify the summarizer: a fixed sentence embedding model with cosine similarity threshold of 0.85 for merging, plus an empirical validation on a held-out set of 500 paths showing 93% merge accuracy against human annotations. This directly supports the equivalence-class assumption and variance-reduction claim. revision: yes

  2. Referee: [§4] §4 (Theoretical Analysis): the variance-reduction proof assumes that merged nodes correctly represent identical states; without a stated bound on summarization error or an empirical validation of merge accuracy, the derivation does not establish that the claimed reduction holds under realistic summarization noise.

    Authors: The proof in §4 is derived under the assumption of accurate merging. We will revise §4 to state this assumption explicitly, add a brief error-propagation bound showing that advantage variance remains lower than tree baselines when merge error <8%, and cross-reference the new empirical merge-accuracy results from §3. revision: yes

  3. Referee: [§5] §5 (Experiments): reported gains are stated without error bars, statistical significance tests, or ablation on the summarizer itself; it is therefore impossible to determine whether observed improvements are attributable to the graph structure or to implementation details of the (unspecified) state summarizer.

    Authors: We will update §5 to report standard deviations over five random seeds, include paired t-test p-values (all reported gains p<0.05), and add an ablation that disables merging while keeping the same summarizer to isolate the graph structure contribution. revision: yes

Circularity Check

0 steps flagged

No significant circularity in GraphPO derivation chain

full rationale

The paper introduces GraphPO as a new DAG representation for RLVR rollouts, with nodes as summarized semantic states, merging of equivalent paths, and split advantage assignment (efficiency to incoming edges, correctness to outgoing). No equations or steps in the provided abstract reduce by construction to fitted inputs, self-citations, or prior ansatzes from the same authors. The variance-reduction theory and efficiency claims are derived from the proposed graph structure itself rather than renaming or re-fitting existing results. The semantic summarization step is presented as a design choice whose correctness is an empirical assumption, not a definitional loop. This is a standard case of an independent methodological contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not specify any free parameters, axioms, or invented entities; evaluation is limited by the absence of the full manuscript.

pith-pipeline@v0.9.1-grok · 5829 in / 1215 out tokens · 32684 ms · 2026-06-26T20:43:32.029757+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

48 extracted references · 10 linked inside Pith

  1. [1]

    Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

  2. [2]

    Kimi k2: Open agentic intelligence.arXiv preprint arXiv:2507.20534, 2025

    Kimi Team, Yifan Bai, Yiping Bao, Y Charles, Cheng Chen, Guanduo Chen, Haiting Chen, Huarong Chen, Jiahao Chen, Ningxin Chen, et al. Kimi k2: Open agentic intelligence.arXiv preprint arXiv:2507.20534, 2025

  3. [3]

    Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  4. [4]

    Troll: Trust regions improve reinforcement learning for large language models.The Fourteenth International Conference on Learning Representations, 2026

    Philipp Becker, Niklas Freymuth, Serge Thilges, Fabian Otto, and Gerhard Neumann. Troll: Trust regions improve reinforcement learning for large language models.The Fourteenth International Conference on Learning Representations, 2026

  5. [5]

    Geometric-mean policy optimization.The Fourteenth International Conference on Learning Representations, 2026

    Yuzhong Zhao, Yue Liu, Junpeng Liu, Jingye Chen, Xun Wu, Yaru Hao, Tengchao Lv, Shaohan Huang, Lei Cui, Qixiang Ye, et al. Geometric-mean policy optimization.The Fourteenth International Conference on Learning Representations, 2026

  6. [6]

    Dapo: An open-source llm reinforcement learning system at scale.The Thirty-Ninth Annual Conference on Neural Information Processing Systems, 2025

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.The Thirty-Ninth Annual Conference on Neural Information Processing Systems, 2025

  7. [7]

    Vineppo: Refining credit assignment in rl training of llms.Forty-Second International Conference on Machine Learning, 2025

    Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, Aaron Courville, and Nicolas Le Roux. Vineppo: Refining credit assignment in rl training of llms.Forty-Second International Conference on Machine Learning, 2025

  8. [8]

    Rewarding progress: Scaling automated process verifiers for llm reasoning

    Amrith Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Alekh Agarwal, Jonathan Berant, and Aviral Kumar. Rewarding progress: Scaling automated process verifiers for llm reasoning. InThe Thirteenth International Conference on Learning Representations, 2025

  9. [9]

    Treerpo: Tree relative policy optimization.arXiv preprint arXiv:2506.05183, 2025

    Zhicheng Yang, Zhijiang Guo, Yinya Huang, Xiaodan Liang, Yiwei Wang, and Jing Tang. Treerpo: Tree relative policy optimization.arXiv preprint arXiv:2506.05183, 2025

  10. [10]

    Pros: Towards compute-efficient rlvr via rollout prefix reuse

    Baizhou Huang and Xiaojun Wan. Pros: Towards compute-efficient rlvr via rollout prefix reuse. InThe Fourteenth International Conference on Learning Representations, 2026

  11. [11]

    Let’s verify math questions step by step

    Chengyu Shen, Zhen Hao Wong, Runming He, Hao Liang, Meiyi Qiang, Zimo Meng, Zhengyang Zhao, Bohan Zeng, Zhengzhou Zhu, Bin Cui, et al. Let’s verify math questions step by step. InProceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 1, pages 2770–2781, 2026. 10

  12. [12]

    Math-shepherd: Verify and reinforce llms step-by-step without human annotations

    Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9426–9439, 2024

  13. [13]

    The lessons of developing process reward models in mathematical reasoning

    Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. The lessons of developing process reward models in mathematical reasoning. InFindings of the Association for Computational Linguistics: ACL 2025, pages 10495–10516, 2025

  14. [14]

    Alphamath almost zero: process supervision without process.Advances in Neural Information Processing Systems, 37:27689– 27724, 2024

    Guoxin Chen, Minpeng Liao, Chengxi Li, and Kai Fan. Alphamath almost zero: process supervision without process.Advances in Neural Information Processing Systems, 37:27689– 27724, 2024

  15. [15]

    Mutual rea- soning makes smaller llms stronger problem-solver

    Zhenting Qi, MA Mingyuan, Jiahang Xu, Li Lyna Zhang, Fan Yang, and Mao Yang. Mutual rea- soning makes smaller llms stronger problem-solver. InThe Thirteenth International Conference on Learning Representations, 2025

  16. [16]

    Process reinforcement through implicit rewards

    Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Yuchen Zhang, Jiacheng Chen, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, et al. Process reinforcement through implicit rewards. The Fourteenth International Conference on Learning Representations, 2026

  17. [17]

    Group-in-group policy optimization for llm agent training

    Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for llm agent training. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  18. [18]

    Treepo: Enhancing policy efficacy and inference efficiency with tree modeling.OpenReview preprint, 2025

    LI Yizhi, Qingshui Gu, Zhoufutu Wen, Ziniu Li, Ruibin Yuan, Tianshun Xing, Shuyue Guo, Tianyu Zheng, Xingwei Qu, Wangchunshu Zhou, et al. Treepo: Enhancing policy efficacy and inference efficiency with tree modeling.OpenReview preprint, 2025

  19. [19]

    Tree search for llm agent reinforcement learning.The Fourteenth International Conference on Learning Representations, 2026

    Yuxiang Ji, Ziyu Ma, Yong Wang, Guanhua Chen, Xiangxiang Chu, and Liaoni Wu. Tree search for llm agent reinforcement learning.The Fourteenth International Conference on Learning Representations, 2026

  20. [20]

    Scheduling your llm reinforcement learning with reasoning trees.The Fourteenth International Conference on Learning Representations, 2026

    Hong Wang, Zhezheng Hao, Jian Luo, Chenxing Wei, Yao Shu, Lei Liu, Qiang Lin, Hande Dong, and Jiawei Chen. Scheduling your llm reinforcement learning with reasoning trees.The Fourteenth International Conference on Learning Representations, 2026

  21. [21]

    Treerl: Llm reinforce- ment learning with on-policy tree search

    Zhenyu Hou, Ziniu Hu, Yujiang Li, Rui Lu, Jie Tang, and Yuxiao Dong. Treerl: Llm reinforce- ment learning with on-policy tree search. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12355–12369, 2025

  22. [22]

    Segment policy optimization: Effective segment-level credit assignment in rl for large language models

    Yiran Guo, Lijie Xu, Jie Liu, Ye Dan, and Shuang Qiu. Segment policy optimization: Effective segment-level credit assignment in rl for large language models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  23. [23]

    Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs

    Xumeng Wen, Zihan Liu, Shun Zheng, Shengyu Ye, Zhirong Wu, Yang Wang, Zhijian Xu, Xiao Liang, Junjie Li, Ziming Miao, Jiang Bian, and Mao Yang. Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs. InThe Fourteenth International Conference on Learning Representations, 2026

  24. [24]

    Reasonflux-prm: Trajectory-aware prms for long chain-of-thought reasoning in llms

    Jiaru Zou, Ling Yang, Jingwen Gu, Jiahao Qiu, Ke Shen, Jingrui He, and Mengdi Wang. Reasonflux-prm: Trajectory-aware prms for long chain-of-thought reasoning in llms. InThe Thirty-Ninth Annual Conference on Neural Information Processing Systems, 2025

  25. [25]

    Trust, but verify: A self-verification approach to reinforcement learning with verifiable rewards

    Xiaoyuan Liu, Tian Liang, Zhiwei He, Jiahao Xu, Wenxuan Wang, Pinjia He, Zhaopeng Tu, Haitao Mi, and Dong Yu. Trust, but verify: A self-verification approach to reinforcement learning with verifiable rewards. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  26. [26]

    Self-aligned reward: Towards effective and efficient reasoners

    Peixuan Han, Adit Krishnan, Gerald Friedland, Jiaxuan You, and Luyang Kong. Self-aligned reward: Towards effective and efficient reasoners. InThe Fourteenth International Conference on Learning Representations, 2026. 11

  27. [27]

    Lookahead Tree- Based Rollouts for Enhanced Trajectory-Level Exploration in Reinforcement Learning with Verifiable Rewards

    Shangyu Xing, Siyuan Wang, Chenyuan Yang, Xinyu Dai, and Xiang Ren. Lookahead Tree- Based Rollouts for Enhanced Trajectory-Level Exploration in Reinforcement Learning with Verifiable Rewards. InThe Fourteenth International Conference on Learning Representations, 2026

  28. [28]

    Monte carlo planning with large language model for text- based game agents

    Zijing Shi, Meng Fang, and Ling Chen. Monte carlo planning with large language model for text- based game agents. InThe Thirteenth International Conference on Learning Representations, 2025

  29. [29]

    Tree-opo: Off-policy monte carlo tree- guided advantage optimization for multistep reasoning

    Bingning Huang, Tu Nguyen, and Matthieu Zimmer. Tree-opo: Off-policy monte carlo tree- guided advantage optimization for multistep reasoning. InMATH-AI: The 5th Workshop on Mathematical Reasoning and AI at NeurIPS, 2025

  30. [30]

    Qwen2.5 technical report.ArXiv, abs/2412.15115, 2024

    Qwen An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxin Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin,...

  31. [31]

    Sfr-embedding-2: Advanced text embedding with multi-stage training, 2024.URL https://huggingface

    Rui Meng, Ye Liu, Shafiq Rayhan Joty, Caiming Xiong, Yingbo Zhou, and Semih Yavuz. Sfr-embedding-2: Advanced text embedding with multi-stage training, 2024.URL https://huggingface. co/Salesforce/SFR-Embedding-2_R, 2024

  32. [32]

    Mirb: Mathematical information retrieval benchmark

    Haocheng Ju and Bin Dong. Mirb: Mathematical information retrieval benchmark. In2nd AI for Math Workshop@ ICML 2025, 2025

  33. [33]

    Measuring mathematical problem solving with the math dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. NeurIPS, 2021

  34. [34]

    Qwen2.5-math technical report: Toward mathematical expert model via self-improvement.arXiv preprint arXiv:2409.12122, 2024

    An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, et al. Qwen2.5-math technical report: Toward mathematical expert model via self-improvement.arXiv preprint arXiv:2409.12122, 2024

  35. [35]

    Rouge: A package for automatic evaluation of summaries

    Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. InText summarization branches out, pages 74–81, 2004

  36. [36]

    Measuring mathematical problem solving with the math dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021

  37. [37]

    Gpqa: A graduate-level google-proof q&a benchmark.arXiv preprint arXiv:2311.12022, 2023

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark.arXiv preprint arXiv:2311.12022, 2023

  38. [38]

    Livecodebench: Holistic and contamination free evaluation of large language models for code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Ar- mando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. InThe Thirteenth International Conference on Learning Representations, 2025

  39. [39]

    Gaia: a benchmark for general ai assistants

    Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. InThe Twelfth International Conference on Learning Representations, 2024

  40. [40]

    Webwalker: Benchmarking llms in web traversal

    Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Linhai Zhang, Yulan He, Deyu Zhou, Pengjun Xie, et al. Webwalker: Benchmarking llms in web traversal. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10290–10305, 2025

  41. [41]

    Browsecomp: A simple yet challenging benchmark for browsing agents.arXiv preprint arXiv:2504.12516, 2025

    Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents.arXiv preprint arXiv:2504.12516, 2025. 12

  42. [42]

    xbench: Tracking agents productivity scaling with profession-aligned real-world evaluations.arXiv preprint arXiv:2506.13651, 2025

    Kaiyuan Chen, Yixin Ren, Yang Liu, Xiaobo Hu, Haotong Tian, Tianbao Xie, Fangfu Liu, Haoye Zhang, Hongzhang Liu, Yuan Gong, et al. xbench: Tracking agents productivity scaling with profession-aligned real-world evaluations.arXiv preprint arXiv:2506.13651, 2025

  43. [43]

    React: Synergizing reasoning and acting in language models.arXiv preprint arXiv:2210.03629, 2022

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models.arXiv preprint arXiv:2210.03629, 2022

  44. [44]

    Hybridflow: A flexible and efficient rlhf framework.arXiv preprint arXiv: 2409.19256, 2024

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework.arXiv preprint arXiv: 2409.19256, 2024

  45. [45]

    Gonzalez, Hao Zhang, and Ion Stoica

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large lan- guage model serving with pagedattention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

  46. [46]

    Search-r1: Training llms to reason and leverage search engines with reinforcement learning.arXiv preprint arXiv:2503.09516, 2025

    Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Za- mani, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning.arXiv preprint arXiv:2503.09516, 2025

  47. [47]

    Beyond ten turns: Unlocking long-horizon agentic search with large-scale asynchronous rl.arXiv preprint arXiv:2508.07976, 2025

    Jiaxuan Gao, Wei Fu, Minyang Xie, Shusheng Xu, Chuyi He, Zhiyu Mei, Banghua Zhu, and Yi Wu. Beyond ten turns: Unlocking long-horizon agentic search with large-scale asynchronous rl.arXiv preprint arXiv:2508.07976, 2025

  48. [48]

    self-emergent process supervision

    Jialong Wu, Baixuan Li, Runnan Fang, Wenbiao Yin, Liwen Zhang, Zhengwei Tao, Dingchu Zhang, Zekun Xi, Gang Fu, Yong Jiang, et al. Webdancer: Towards autonomous information seeking agency.arXiv preprint arXiv:2505.22648, 2025. 13 APPENDIX A Algorithm We present the complete procedure of GraphPO in Algorithm S.1, which integrates the three stages introduced...