pith. machine review for the scientific record.

arxiv: 2412.09413 · v2 · pith:R5R6G44Qnew · submitted 2024-12-12 · 💻 cs.AI · cs.CL

Imitate, Explore, and Self-Improve: A Reproduction Report on Slow-thinking Reasoning Systems

Pith reviewed 2026-05-18 00:31 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords slow-thinking reasoning · reasoning reproduction · self-improvement · multiple rollouts · o1-like systems · AI reasoning benchmarks

The pith

Reproducing o1-like slow-thinking reasoning works by first imitating long thought traces, then exploring hard problems with multiple rollouts, and finally refining the training set iteratively with the model's own correct outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to show that a three-step public recipe can close much of the gap to closed industry reasoning models. First the base model is fine-tuned on distilled long-form reasoning traces so it learns to spend extra steps thinking. It is then prompted to generate many solutions to difficult questions, keeping only the correct ones as new training data. Finally the model retrains on this growing set and repeats the cycle. A sympathetic reader cares because the method is fully described and does not rely on secret data or proprietary training runs, so others can test and extend it.
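The three steps described above can be sketched as a toy loop. This is a minimal illustration, not the authors' implementation: the "model" is mocked by a single `skill` probability, "training" is a made-up rule that raises `skill` as the dataset grows, and the function names `generate_rollouts` and `self_improve` are hypothetical. New traces are simply appended to the dataset, which is an assumption about what "refining" means.

```python
import random


def generate_rollouts(skill, n_rollouts, rng):
    """Toy stand-in for sampling: each rollout is correct with probability `skill`."""
    return [rng.random() < skill for _ in range(n_rollouts)]


def self_improve(seed_traces, problems, n_rollouts=8, n_rounds=3, seed=0):
    """Run a mocked imitate -> explore -> self-improve loop.

    Returns a list of (round index, skill used, dataset size after the round).
    """
    rng = random.Random(seed)
    train_set = list(seed_traces)  # step 1: imitation data (distilled traces)
    history = []
    for round_idx in range(n_rounds):
        # Mocked "fine-tuning": skill grows with dataset size, capped at 0.9.
        # This rule is an assumption made only so the loop has something to do.
        skill = min(0.9, 0.1 + 0.01 * len(train_set))
        new_traces = []
        for problem in problems:
            rollouts = generate_rollouts(skill, n_rollouts, rng)
            # Step 2: keep only the rollouts that were marked correct.
            new_traces += [(problem, round_idx) for ok in rollouts if ok]
        # Step 3: grow the training set with the harvested traces.
        train_set += new_traces
        history.append((round_idx, skill, len(train_set)))
    return history


if __name__ == "__main__":
    for row in self_improve(["seed"] * 10, ["q1", "q2", "q3"]):
        print(row)
```

In a real pipeline the skill update would be replaced by actual fine-tuning, and the correctness flag by a verifier that checks each rollout's final answer.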

Core claim

By first fine-tuning on distilled long-form thought data to invoke a slow-thinking mode, then generating multiple rollouts on challenging problems to harvest high-quality correct trajectories, and finally using those trajectories to iteratively refine the training dataset, the resulting STILL-2 model reaches competitive accuracy on three hard reasoning benchmarks.

What carries the argument

The STILL-2 framework that sequences imitation of long thought traces, exploration via multiple rollouts, and self-improvement through iterative dataset refinement.

If this is right

  • The fine-tuned model learns to produce extended internal reasoning before giving a final answer.
  • Multiple rollouts on the same hard question yield an increasing fraction of correct solution paths.
  • Each self-improvement round raises performance on the chosen benchmarks.
  • The final system reaches accuracy competitive with industry reasoning models, whose techniques are undisclosed, on the tested tasks.
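The value of sampling multiple rollouts can be made concrete under a simplifying assumption that the paper does not state: rollouts are independent and each is correct with the same probability `p`. Under that assumption the chance of harvesting at least one correct trajectory from `k` rollouts is 1 - (1 - p)^k. The function names below are hypothetical.

```python
def prob_at_least_one_correct(p, k):
    """Probability that at least one of k independent rollouts is correct."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    return 1.0 - (1.0 - p) ** k


def expected_correct(p, k):
    """Expected number of correct rollouts among k independent attempts."""
    return p * k


if __name__ == "__main__":
    # Illustrative value of p only; not a number reported by the paper.
    for k in (1, 4, 16, 64):
        print(k, round(prob_at_least_one_correct(0.05, k), 3))
```

The sketch only explains why more rollouts help on hard questions. It says nothing about whether the per-rollout success rate itself rises across self-improvement rounds, which is the empirical claim.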

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the self-refinement loop stays stable, the method could be applied to new domains with only a small seed of high-quality traces.
  • The approach suggests that closed models may rely on similar internal search-and-filter steps that are now reproducible from public data.
  • Monitoring error accumulation across iterations would be a natural next measurement to decide how many rounds are safe.

Load-bearing premise

That generating many rollouts on hard problems will keep producing more correct trajectories and that retraining on the model's own outputs will steadily raise quality without accumulating mistakes.

What would settle it

Run the full imitate-explore-refine loop for three cycles and measure whether accuracy on the hardest benchmark plateaus below the reported industry baselines or regresses from one cycle to the next.
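One way to turn this test into a mechanical check is sketched below. The function name `settle_check`, the `min_gain` threshold, and the example numbers are all hypothetical; they are not taken from the paper.

```python
def settle_check(accuracy_by_cycle, baseline, min_gain=0.01):
    """Summarize whether per-cycle accuracy plateaued, regressed, or trails a baseline.

    accuracy_by_cycle: accuracies after each self-improvement cycle, oldest first.
    baseline: the industry accuracy to compare against.
    min_gain: smallest last-cycle improvement that still counts as progress.
    """
    if len(accuracy_by_cycle) < 2:
        raise ValueError("need at least two cycles to measure a trend")
    final = accuracy_by_cycle[-1]
    last_gain = final - accuracy_by_cycle[-2]
    return {
        "final_accuracy": final,
        "gap_to_baseline": baseline - final,
        "plateaued": 0 <= last_gain < min_gain,
        "regressed": last_gain < 0,
        "below_baseline": final < baseline,
    }


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    print(settle_check([0.40, 0.45, 0.455], baseline=0.50))
```

A result with `plateaued` and `below_baseline` both true after three cycles would count against the load-bearing premise.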

Original abstract

Recently, slow-thinking reasoning systems, such as o1, have demonstrated remarkable capabilities in solving complex reasoning tasks. These systems typically engage in an extended thinking process before responding to a query, allowing them to generate more thorough, accurate, and well-reasoned solutions. These systems are primarily developed and maintained by industry, with their core techniques not publicly disclosed. In response, an increasing number of studies from the research community aim to explore the technical foundations underlying these powerful reasoning systems. Building on these prior efforts, this paper presents a reproduction report on implementing o1-like reasoning systems. We introduce an "imitate, explore, and self-improve" framework, denoted as STILL-2, as our primary technical approach to train the reasoning model. In the initial phase, we use distilled long-form thought data to fine-tune the reasoning model, enabling it to invoke a slow-thinking mode. The model is then encouraged to explore challenging problems by generating multiple rollouts, which can result in increasingly more high-quality trajectories that lead to correct answers. Furthermore, the model undergoes self-improvement by iteratively refining its training dataset. To verify the effectiveness of this approach, we conduct extensive experiments on three challenging benchmarks. The experimental results demonstrate that our approach achieves competitive performance compared to industry-level reasoning systems on these benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents STILL-2, an 'imitate, explore, and self-improve' framework for reproducing o1-like slow-thinking reasoning systems. It begins with fine-tuning a base model on distilled long-form thought data to induce slow-thinking behavior, proceeds to an exploration phase that generates multiple rollouts on challenging problems to surface additional high-quality correct trajectories, and concludes with iterative self-improvement that refines the training dataset using the model's own outputs. The central claim is that this pipeline yields competitive performance on three challenging benchmarks relative to industry-level reasoning systems.

Significance. If the reported performance gains are robustly supported by detailed, reproducible experiments with proper controls, the work would be significant as one of the first public, end-to-end reproductions of extended chain-of-thought reasoning systems. It offers a concrete, open recipe that combines imitation learning with self-generated trajectories, which could lower barriers for community research on scaling reasoning capabilities beyond standard supervised fine-tuning.

major comments (2)
  1. [§4 (Experiments)] §4 (Experiments) and associated tables: the assertion of 'competitive performance' is not accompanied by concrete accuracy numbers, baseline comparisons (e.g., against standard CoT or prior open reproductions), error bars, or details on data sources and exclusion rules, leaving the headline claim without visible empirical grounding.
  2. [Self-improve stage (§3.3)] Self-improve stage description (around §3.3): the iterative refinement of the training dataset assumes that filtering or self-labeling multiple rollouts on hard problems will reliably increase the proportion of correct trajectories without compounding errors, yet no external verifier, held-out accuracy check, or success-rate bound on the rollout filter is provided; this assumption is load-bearing for attributing gains to STILL-2 rather than to the initial imitation phase.
minor comments (1)
  1. [Abstract and §3] The abstract and method sections use the term 'high-quality trajectories' without an explicit operational definition or filtering criterion, which could be clarified for reproducibility.
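The error bars requested in major comment 1 could be reported as a mean and standard error over repeated runs. The sketch below uses only the Python standard library; the function name `mean_and_stderr` is hypothetical, and whether the authors repeat runs with different seeds is not stated here.

```python
import math
import statistics


def mean_and_stderr(accuracies):
    """Return (mean, standard error of the mean) for accuracies from repeated runs."""
    if len(accuracies) < 2:
        raise ValueError("need at least two runs to estimate a standard error")
    mean = statistics.mean(accuracies)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    stderr = statistics.stdev(accuracies) / math.sqrt(len(accuracies))
    return mean, stderr


if __name__ == "__main__":
    # Made-up accuracies for illustration only.
    mean, stderr = mean_and_stderr([0.44, 0.47, 0.45])
    print(f"{mean:.3f} +/- {stderr:.3f}")
```

With only a handful of runs the standard error is itself noisy, so the number of runs should be reported alongside it.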

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our reproduction report. We address each major comment below and outline the revisions we will make to improve the clarity and rigor of the experimental results and the self-improvement methodology.

Point-by-point responses
  1. Referee: §4 (Experiments) and associated tables: the assertion of 'competitive performance' is not accompanied by concrete accuracy numbers, baseline comparisons (e.g., against standard CoT or prior open reproductions), error bars, or details on data sources and exclusion rules, leaving the headline claim without visible empirical grounding.

    Authors: We appreciate this observation. The manuscript reports concrete accuracy numbers for STILL-2 on the three benchmarks in the experimental tables, along with comparisons to industry systems such as o1-preview. To further ground the claim, we will revise §4 to add explicit baseline results against standard Chain-of-Thought and prior open reproductions, include error bars from repeated runs where computationally feasible, and expand the description of data sources and exclusion rules.

    Revision: yes

  2. Referee: Self-improve stage description (around §3.3): the iterative refinement of the training dataset assumes that filtering or self-labeling multiple rollouts on hard problems will reliably increase the proportion of correct trajectories without compounding errors, yet no external verifier, held-out accuracy check, or success-rate bound on the rollout filter is provided; this assumption is load-bearing for attributing gains to STILL-2 rather than to the initial imitation phase.

    Authors: The referee correctly notes that the self-improvement stage depends on the quality of filtered trajectories. In §3.3 we select rollouts that produce correct final answers using ground-truth verification on the training problems. We will revise the section to include a held-out accuracy analysis showing the increase in correct trajectories across iterations and report the empirical success rate of the rollout filter to quantify and bound potential error accumulation.

    Revision: yes
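A rollout filter of the kind described in the second response, together with its empirical success rate, might look like the sketch below. It assumes that correctness means an exact match between the normalized final answer and the ground truth; that matching rule, the tuple layout, and the names `normalize` and `filter_rollouts` are illustrative assumptions, not details taken from the paper.

```python
def normalize(answer):
    """Crude normalization: trim, lowercase, and drop internal spaces."""
    return answer.strip().lower().replace(" ", "")


def filter_rollouts(rollouts, ground_truth):
    """Keep rollouts whose final answer matches the ground truth.

    rollouts: list of (problem_id, trace, final_answer) tuples.
    ground_truth: dict mapping problem_id to the reference answer.
    Returns (kept rollouts, fraction of rollouts kept).
    """
    kept = [
        r for r in rollouts
        if normalize(r[2]) == normalize(ground_truth[r[0]])
    ]
    success_rate = len(kept) / len(rollouts) if rollouts else 0.0
    return kept, success_rate


if __name__ == "__main__":
    truth = {"p1": "42", "p2": "(3,pi/2)"}
    samples = [
        ("p1", "trace a", "42"),
        ("p1", "trace b", " 41"),
        ("p2", "trace c", "(3, pi/2)"),
        ("p2", "trace d", "(3,PI/2)"),
    ]
    kept, rate = filter_rollouts(samples, truth)
    print(len(kept), rate)
```

A filter that checks only the final answer can still admit traces whose reasoning is flawed but whose answer happens to be right, which is why the referee asks for a held-out check in addition to the filter's success rate.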

Circularity Check

0 steps flagged

No circularity: empirical reproduction uses external benchmarks and standard self-training without definitional reduction

full rationale

The paper describes an imitate-explore-self-improve pipeline (STILL-2) that begins with fine-tuning on externally distilled long-form thought data, proceeds to multiple rollouts on challenging problems to collect trajectories, and iterates dataset refinement. Performance is measured on three external benchmarks and compared to industry systems. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text that would make any claimed result equivalent to its inputs by construction. The central claims rest on experimental outcomes rather than internal redefinitions or load-bearing self-references, rendering the derivation chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard machine-learning assumptions about the benefits of supervised fine-tuning on long-form data and the value of self-generated trajectories for iterative improvement; no new entities or ad-hoc parameters are introduced in the abstract.

axioms (2)
  • domain assumption Distilled long-form thought data from external sources provides a sufficient starting point for invoking slow-thinking behavior in the base model.
    Invoked in the initial imitation phase described in the abstract.
  • domain assumption Multiple rollouts on challenging problems will yield progressively higher-quality correct trajectories suitable for further training.
    Central to the explore phase and self-improvement loop.

pith-pipeline@v0.9.0 · 5817 in / 1271 out tokens · 40496 ms · 2026-05-18T00:31:17.215529+00:00 · methodology



Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective

    cs.LG 2026-05 conditional novelty 7.0

    ConSPO improves RLVR training by aligning rollout scores with generation likelihoods via length-normalized log-probabilities and applying a group-wise InfoNCE contrastive loss with a scheduled margin, outperforming GR...

  2. Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning

    cs.CL 2026-05 unverdicted novelty 7.0

    RL improves LLM reasoning by sparse policy selection at high-entropy tokens rather than new capability learning, and a minimal RL-free method matches its gains at three orders of magnitude lower cost.

  3. Leveraging Pretrained Language Models as Energy Functions for Glauber Dynamics Text Diffusion

    cs.LG 2026-05 unverdicted novelty 7.0

    Pretrained language models are used as energy functions for Glauber dynamics in discrete text diffusion, improving generation quality over prior diffusion LMs and matching autoregressive models on benchmarks and reaso...

  4. Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning

    cs.CL 2026-04 unverdicted novelty 7.0

    CoT-PoT ensembling achieves self-consistency accuracy in LLMs with only two samples for 78.6% of tasks, reducing computation by 9.3x compared to standard methods.

  5. Reinforcement Learning for Reasoning in Large Language Models with One Training Example

    cs.LG 2025-04 accept novelty 7.0

    One training example via RLVR boosts LLM math reasoning from 17.6% to 35.7% average across six benchmarks.

  6. L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning

    cs.CL 2025-03 unverdicted novelty 7.0

    LCPO trains L1 reasoning models to adhere to prompt-specified CoT lengths, supporting accuracy-compute trade-offs and yielding short reasoning models that outperform larger baselines at matched lengths.

  7. Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning

    cs.CL 2026-05 unverdicted novelty 6.0

    RL for LLM reasoning acts as sparse policy selection at high-entropy tokens already present in the base model, enabling ReasonMaxxer—an efficient contrastive method that recovers most RL gains at three orders of magni...

  8. HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment

    cs.LG 2026-04 unverdicted novelty 6.0

    HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.

  9. InCoder-32B-Thinking: Industrial Code World Model for Thinking

    cs.AR 2026-04 unverdicted novelty 6.0

    InCoder-32B-Thinking uses error-feedback synthesized thinking traces and a code world model to reach top open-source scores on general and industrial code benchmarks including 81.3% on LiveCodeBench and 84.0% on CAD-Coder.

  10. TimelineReasoner: Advancing Timeline Summarization with Large Reasoning Models

    cs.CL 2026-04 unverdicted novelty 6.0

    TimelineReasoner applies large reasoning models in a Global Cognition plus Detail Exploration loop to produce more accurate, complete, and coherent timelines from news than prior LLM-based methods.

  11. CODA: Difficulty-Aware Compute Allocation for Adaptive Reasoning

    cs.CL 2026-03 unverdicted novelty 6.0

    CODA uses rollout-based difficulty signals to drive two gates that penalize verbosity on easy instances and promote deliberation on hard ones, cutting token use over 60% on simple tasks while maintaining accuracy.

  12. WebThinker: Empowering Large Reasoning Models with Deep Research Capability

    cs.CL 2025-04 unverdicted novelty 6.0

    WebThinker equips large reasoning models with autonomous web exploration and interleaved reasoning-drafting via a Deep Web Explorer and RL-based DPO training, yielding gains on GPQA, GAIA, and report-generation benchmarks.

  13. Search-o1: Agentic Search-Enhanced Large Reasoning Models

    cs.AI 2025-01 unverdicted novelty 6.0

    Search-o1 integrates agentic retrieval-augmented generation and a Reason-in-Documents module into large reasoning models to dynamically supply missing knowledge and improve performance on complex science, math, coding...

  14. Boosting Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance

    cs.LG 2026-05 unverdicted novelty 5.0

    FEST improves RLVR sample efficiency on math and coding benchmarks by combining supervised signals, on-policy signals, and decaying weights on just 128 randomly chosen demonstrations, matching full-dataset baselines.

  15. SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting

    cs.LG 2026-04 unverdicted novelty 5.0

    SCOPE routes LLM on-policy rollouts by correctness into teacher-perplexity-weighted KL for errors and student-perplexity-weighted MLE for successes, with group normalization, yielding 11.42% relative Avg@32 gain on re...

  16. A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems

    cs.AI 2025-08 unverdicted novelty 5.0

    A comprehensive review of self-evolving AI agents that improve themselves over time, organized via a framework of inputs, agent system, environment, and optimizers, with domain-specific and safety discussions.

  17. From System 1 to System 2: A Survey of Reasoning Large Language Models

    cs.AI 2025-02 accept novelty 3.0

    The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.

  18. Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

    cs.CV 2025-03 unverdicted novelty 2.0

    The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · cited by 17 Pith papers · 7 internal anchors

  1. [1]

    A Survey of Large Language Models

    Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. A survey of large language models. CoRR, abs/2303.18223, 2023

  2. [2]

    Thinking, fast and slow

    Daniel Kahneman. Thinking, fast and slow. 2017

  3. [3]

    A comparative study on reasoning patterns of openai’s o1 model

    Siwei Wu, Zhongyuan Peng, Xinrun Du, Tuney Zheng, Minghao Liu, Jialong Wu, Jiachen Ma, Yizhi Li, Jian Yang, Wangchunshu Zhou, Qunshu Lin, Junbo Zhao, Zhaoxiang Zhang, Wenhao Huang, Ge Zhang, Chenghua Lin, and Jiaheng Liu. A comparative study on reasoning patterns of openai’s o1 model. CoRR, abs/2410.13639, 2024

  4. [4]

    Evaluation of openai o1: Opportunities and challenges of AGI

    Tianyang Zhong, Zhengliang Liu, Yi Pan, Yutong Zhang, Yifan Zhou, Shizhe Liang, Zihao Wu, Yanjun Lyu, Peng Shu, Xiaowei Yu, Chao Cao, Hanqi Jiang, Hanxu Chen, Yiwei Li, Junhao Chen, Huawen Hu, Yihen Liu, Huaqin Zhao, Shaochen Xu, Haixing Dai, Lin Zhao, Ruidong Zhang, Wei Zhao, Zhenyuan Yang, Jingyuan Chen, Peilong Wang, Wei Ruan, Hui Wang, Huan Zhao, J...

  5. [5]

    Learning to reason with llms

    OpenAI. Learning to reason with llms, 2024

  6. [6]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022

  7. [7]

    Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

    An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, Keming Lu, Mingfeng Xue, Runji Lin, Tianyu Liu, Xingzhang Ren, and Zhenru Zhang. Qwen2.5-math technical report: Toward mathematical expert model via self-improvement. CoRR, abs/2409.12122, 2024

  8. [8]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. CoRR, abs/2402.03300, 2024

  9. [9]

    Enhancing llm reasoning with reward-guided tree search

    Jinhao Jiang, Zhipeng Chen, Yingqian Min, Jie Chen, Xiaoxue Cheng, Jiapeng Wang, Yiru Tang, Haoxiang Sun, Jia Deng, Wayne Xin Zhao, et al. Technical report: Enhancing llm reasoning with reward-guided tree search. CoRR, abs/2411.11694, 2024

  10. [10]

    Llama-berry: Pairwise optimization for o1-like olympiad-level mathematical reasoning

    Di Zhang, Jianbo Wu, Jingdi Lei, Tong Che, Jiatong Li, Tong Xie, Xiaoshui Huang, Shufei Zhang, Marco Pavone, Yuqiang Li, Wanli Ouyang, and Dongzhan Zhou. Llama-berry: Pairwise optimization for o1-like olympiad-level mathematical reasoning. CoRR, abs/2410.02884, 2024

  11. [11]

    o1-coder: an o1 replication for coding

    Yuxiang Zhang, Shangxi Wu, Yuqi Yang, Jiangming Shu, Jinlin Xiao, Chao Kong, and Jitao Sang. o1-coder: an o1 replication for coding. CoRR, abs/2412.00154, 2024

  12. [12]

    O1 replication journey: A strategic progress report – part 1

    Yiwei Qin, Xuefeng Li, Haoyang Zou, Yixiu Liu, Shijie Xia, Zhen Huang, Yixin Ye, Weizhe Yuan, Hector Liu, Yuanzhi Li, and Pengfei Liu. O1 replication journey: A strategic progress report – part 1. CoRR, 2024

  13. [13]

    Marco-o1: Towards open reasoning models for open-ended solutions

    Yu Zhao, Huifeng Yin, Bo Zeng, Hao Wang, Tianqi Shi, Chenyang Lyu, Longyue Wang, Weihua Luo, and Kaifu Zhang. Marco-o1: Towards open reasoning models for open-ended solutions. CoRR, abs/2411.14405, 2024

  14. [14]

    Skywork-o1 open series

    Skywork o1 Team. Skywork-o1 open series. https://huggingface.co/Skywork, November 2024

  15. [15]

    Deepseek-r1-lite-preview is now live: unleashing supercharged reasoning power!

    DeepSeek Team. Deepseek-r1-lite-preview is now live: unleashing supercharged reasoning power!, November 2024

  16. [16]

    Qwq: Reflect deeply on the boundaries of the unknown

    Qwen Team. Qwq: Reflect deeply on the boundaries of the unknown, November 2024

  17. [17]

    O1 replication journey–part 2: Surpassing o1-preview through simple distillation, big progress or bitter lesson?

    Zhen Huang, Haoyang Zou, Xuefeng Li, Yixiu Liu, Yuxiang Zheng, Ethan Chern, Shijie Xia, Yiwei Qin, Weizhe Yuan, and Pengfei Liu. O1 replication journey–part 2: Surpassing o1-preview through simple distillation, big progress or bitter lesson? arXiv preprint arXiv:2411.16489, 2024

  18. [18]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In ICLR. OpenReview.net, 2024

  19. [19]

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof q&a benchmark. CoRR, abs/2311.12022, 2023

  20. [20]

    Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking

    Eric Zelikman, Georges Harik, Yijia Shao, Varuna Jayasiri, Nick Haber, and Noah D. Goodman. Quiet-star: Language models can teach themselves to think before speaking. CoRR, abs/2403.09629, 2024

  21. [21]

    Thinking tokens for language modeling

    David Herel and Tomás Mikolov. Thinking tokens for language modeling. CoRR, abs/2405.08644, 2024

  22. [22]

    Self-consistency improves chain of thought reasoning in language models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In ICLR. OpenReview.net, 2023

  23. [23]

    Tree of thoughts: Deliberate problem solving with large language models

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In NeurIPS, 2023

  24. [24]

    Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions

    Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Huang, Kashif Rasul, Longhui Yu, Albert Q Jiang, Ziju Shen, et al. Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions. Hugging Face repository, 2024

  25. [25]

    Perplexed by perplexity: Perplexity-based data pruning with small reference models

    Zachary Ankner, Cody Blakeney, Kartik Sreenivasan, Max Marion, Matthew L. Leavitt, and Mansheej Paul. Perplexed by perplexity: Perplexity-based data pruning with small reference models. CoRR, abs/2405.20541, 2024

  26. [26]

    Scaling Relationship on Learning Mathematical Reasoning with Large Language Models

    Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Chuanqi Tan, and Chang Zhou. Scaling relationship on learning mathematical reasoning with large language models. CoRR, abs/2308.01825, 2023

  27. [27]

    Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D. Goodman. Star: Bootstrapping reasoning with reasoning. In NeurIPS, 2022

  28. [28]

    Qwen2.5: A party of foundation models

    Qwen Team. Qwen2.5: A party of foundation models, September 2024

  29. [29]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In NeurIPS, 2023

  30. [30]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017

  31. [31]

    Online iterative reinforcement learning from human feedback with general preference model

    Chenlu Ye, Wei Xiong, Yuheng Zhang, Nan Jiang, and Tong Zhang. Online iterative reinforcement learning from human feedback with general preference model. CoRR, abs/2402.07314, 2024

  32. [32]

    Measuring mathematical problem solving with the MATH dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Joaquin Vanschoren and Sai-Kit Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 202...

  33. [33]

    Hello gpt-4o

    OpenAI. Hello gpt-4o, 2024

  34. [34]

    Claude 3.5 sonnet

    Anthropic. Claude 3.5 sonnet, 2024. Problem: Convert the point (0, 3) in rectangular coordinates to polar coordinates. Enter your answer in the form (r, θ), where r > 0 and 0 ≤ θ < 2π. Output: <|begin_of_thought|> Alright, so I’ve got this problem here: I need to convert the point (0,3) from rectangular coordinates to polar coordinates. I know that i...

  35. [35]

    Calculate r: $r = \sqrt{x^2 + y^2} = \sqrt{0^2 + 3^2} = \sqrt{9} = 3$

  36. [36]

    Thus, the polar coordinates are: $(3, \frac{\pi}{2})$ <|end_of_solution|> Table 5: A sample solution to a problem in MATH-OAI

    Determine θ: Since the point (0, 3) lies on the positive y-axis, the angle θ is $\frac{\pi}{2}$ radians. Thus, the polar coordinates are: $(3, \frac{\pi}{2})$ <|end_of_solution|> Table 5: A sample solution to a problem in MATH-OAI. Problem: Every morning Aya goes for a 9-kilometer-long walk and stops at a coffee shop afterwards. When she walks at a constant speed of s kilome...