MindLoom: Composing Thought Modes for Frontier-Level Reasoning Data Synthesis

Chongyang Pan; Haiyang Shen; Jinsheng Huang; Ming Zhang; Mugeng Liu; Siqi Zhong; Sixiong Xie; Taian Guo; Weichen Bi; Wenchun Jing

arxiv: 2605.21630 · v1 · pith:36EKGWK2new · submitted 2026-05-20 · 💻 cs.AI

Haiyang Shen, Taian Guo, Xuanzhong Chen, Mugeng Liu, Weichen Bi, Wenchun Jing, Sixiong Xie, Zhuofan Shi, Yudong Han, Chongyang Pan, Siqi Zhong, Jinsheng Huang, Ming Zhang, Yun Ma

Pith reviewed 2026-05-22 09:20 UTC · model grok-4.3

classification 💻 cs.AI

keywords reasoning data synthesis · thought modes · compositional generation · LLM fine-tuning · difficulty control · STEM benchmarks · math reasoning · data augmentation

The pith

Composing thought modes from existing solutions lets models synthesize reasoning data with controllable difficulty and wide diversity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that reasoning difficulty comes from stacking atomic knowledge-reasoning transformations, which it calls thought modes. MindLoom decomposes verified hard-problem solutions into explicit chains of these modes, trains a retriever to pick suitable modes for any current problem state, and then builds fresh problems by repeatedly applying the retrieved modes to simple seed questions. Distribution-aligned sampling and a final rollout judge ensure the new problems vary in both coverage and difficulty level. Models fine-tuned on the resulting data outperform base models, distillation pipelines, and external-data baselines on nine benchmarks spanning five STEM fields and four math reasoning tasks.

Core claim

MindLoom decomposes verified solutions into thought-mode chains, trains a retrieval model that maps problem states to compatible modes, composes new questions by iteratively applying retrieved modes to seed questions under distribution-aligned sampling, and finally uses rollout-based judging to label difficulty and supply correct responses for supervised fine-tuning.
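
The composition step can be pictured with a toy loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the mode bank, the tag-overlap `retrieve` function (a stand-in for the trained retrieval model), the `TARGET` cluster shares, and the weighting formula used for distribution-aligned sampling are all hypothetical, and the string suffix stands in for an LLM rewrite of the question.

```python
import random
from collections import Counter

# Hypothetical mode bank: name -> (cluster, tags the problem state should share).
MODE_BANK = {
    "add_constraint": ("algebra", {"equation"}),
    "change_of_variable": ("algebra", {"equation", "function"}),
    "introduce_symmetry": ("geometry", {"triangle"}),
    "parametrize_curve": ("geometry", {"triangle", "function"}),
}
# Assumed target share of each cluster in the synthesized set.
TARGET = {"algebra": 0.5, "geometry": 0.5}


def retrieve(tags, exclude, k=3):
    """Toy retriever: score unused modes by tag overlap with the current state."""
    scored = [
        (len(tags & need), name)
        for name, (_, need) in MODE_BANK.items()
        if name not in exclude
    ]
    return sorted([(s, n) for s, n in scored if s > 0], reverse=True)[:k]


def compose(seed_text, seed_tags, steps, usage, rng):
    """Apply up to `steps` retrieved modes to a seed, favouring under-used clusters."""
    text, tags, chain = seed_text, set(seed_tags), []
    for _ in range(steps):
        candidates = retrieve(tags, chain)
        if not candidates:
            break
        # Assumed distribution-aligned weight: retrieval score, scaled by the
        # target share and damped by how often the cluster was already used.
        weights = [
            score * TARGET[MODE_BANK[name][0]] / (1 + usage[MODE_BANK[name][0]])
            for score, name in candidates
        ]
        _, name = rng.choices(candidates, weights=weights, k=1)[0]
        cluster, need = MODE_BANK[name]
        usage[cluster] += 1
        chain.append(name)
        text += f" [+{name}]"  # placeholder for an LLM rewrite of the question
        tags |= need           # the new state exposes the mode's tags
    return text, chain


usage = Counter()
question, chain = compose(
    "Solve the equation x + 2 = 5.", {"equation"}, 3, usage, random.Random(0)
)
```

Passing the shared `usage` counter across many seeds is what would let the sampler drift toward clusters that are under-represented so far; a real system would replace both the retriever and the weighting with the paper's own components.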

What carries the argument

Thought modes: atomic knowledge-reasoning transformations whose accumulation determines problem difficulty; these are decomposed from solutions and recomposed via retrieval to synthesize new instances.

If this is right

Fine-tuned models achieve favorable results over base models, distillation, and external-data baselines across nine benchmarks in STEM and math.
The framework supplies explicit structural visibility into factors that govern difficulty, unlike prior synthesis methods.
Ablation studies attribute performance gains to the decomposition, retrieval, and judging stages.
Generated problems cover a broad range of reasoning patterns while preserving useful difficulty gradation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same decomposition-and-recomposition loop could be tested on non-STEM reasoning tasks by first identifying domain-specific atomic transformations.
If thought-mode chains prove stable across different base models, the method might support iterative self-improvement loops without external data.
Measuring whether the synthesized problems transfer to model families or sizes not used in the original experiments would test the generality of the difficulty-control claim.

Load-bearing premise

Reasoning difficulty arises specifically from the accumulation of reliably decomposable atomic knowledge-reasoning transformations that can be recomposed to control both difficulty and diversity.

What would settle it

If models fine-tuned on MindLoom data show no consistent gains over strong distillation or external-data baselines when evaluated on the same nine benchmarks, the claim that compositional thought-mode synthesis produces superior training data would be falsified.

Figures

Figures reproduced from arXiv: 2605.21630 by Chongyang Pan, Haiyang Shen, Jinsheng Huang, Ming Zhang, Mugeng Liu, Siqi Zhong, Sixiong Xie, Taian Guo, Weichen Bi, Wenchun Jing, Xuanzhong Chen, Yudong Han, Yun Ma, Zhuofan Shi.

**Figure 1.** Overview of the MINDLOOM pipeline. Step 1: hard problems with verified solutions are reverse-engineered into thought mode chains, populating a thought mode bank B. Step 2: training pairs $(Q_{i-1}, T_i^{+})$ and mined hard negatives train a retrieval model via margin ranking loss L. Step 3: from a seed question $Q_0$, the pipeline iteratively matches compatible thought modes, applies distribution-aligned scoring wi…

**Figure 2.** Coverage over the twelve thought-mode clusters (Appendix H). Polar bars give the…

**Figure 3.** Per-cluster selection proportions for the three selector variants. Cluster labels follow…

**Figure 4.** Hyperparameter sensitivity sweeps on Qwen3-4B (left) and Qwen3.5-4B (right). Bottom…
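
The Figure 1 caption says the retrieval model is trained with a margin ranking loss over positive pairs and mined hard negatives. The sketch below shows the standard hinge form of such a loss on toy vectors. The dot-product scorer, the margin value of 0.2, and the averaging over negatives are assumptions made for illustration; this page does not give the paper's exact formulation.

```python
def dot(u, v):
    """Similarity score between a problem-state vector and a mode vector."""
    return sum(a * b for a, b in zip(u, v))


def margin_ranking_loss(pos_score, neg_scores, margin=0.2):
    """Mean of max(0, margin - pos + neg) over the mined negatives."""
    losses = [max(0.0, margin - pos_score + neg) for neg in neg_scores]
    return sum(losses) / len(losses)


# Toy embeddings (hypothetical values, not from the paper).
state = [1.0, 0.0]                     # problem state Q_{i-1}
positive = [0.9, 0.1]                  # compatible thought mode T_i^+
negatives = [[0.8, 0.3], [0.0, 1.0]]   # one hard negative, one easy negative

loss = margin_ranking_loss(
    dot(state, positive), [dot(state, n) for n in negatives]
)
```

Only the hard negative, whose score sits within the margin of the positive, contributes to the loss; the easy negative is already separated and adds zero.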

Original abstract

Although LLMs have made substantial progress in reasoning, systematically producing frontier-level reasoning data remains difficult. Existing synthesis methods often have limited visibility into the structural factors that govern problem difficulty, which can result in narrow diversity and unstable difficulty control. In this work, we view the difficulty of a reasoning problem as arising from the accumulation of atomic knowledge-reasoning transformations, which we term thought modes. Building on this perspective, we propose MindLoom, a framework for synthesizing frontier-level reasoning data through compositional thought mode engineering. Given a collection of hard problems with verified solutions, MindLoom first decomposes those solutions into thought mode chains that reveal each problem's construction logic. It then trains a retrieval model that matches problem states to compatible thought modes, providing guidance on which reasoning challenges to introduce during synthesis. New problems are composed by iteratively applying retrieved thought modes to seed questions, with distribution-aligned sampling to encourage diverse reasoning coverage. Finally, a rollout-based judging stage labels generated questions by difficulty and supplies judged-correct responses for supervised fine-tuning. We evaluate MindLoom on nine benchmarks covering five STEM disciplines and four mathematical reasoning tasks across multiple model families and sizes. Models fine-tuned on MindLoom-generated data achieves favorable performances over base models, distillation, and external-data baselines across the reported benchmarks. Ablation studies indicate the contribution of each component, and further analysis suggests that MindLoom covers a broad range of reasoning patterns while maintaining useful difficulty control. We have open-sourced our implementation at https://github.com/EachSheep/MindLoom.
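
The rollout-based judging stage described in the abstract can be sketched as follows. This is an illustration under assumptions, not the released implementation: the `is_correct` callable stands in for whatever judge the authors use, and the pass-rate thresholds and label names are hypothetical.

```python
def judge_rollouts(rollouts, is_correct, easy=0.8, hard=0.2):
    """Label a question by rollout pass rate and keep judged-correct responses."""
    correct = [r for r in rollouts if is_correct(r)]
    pass_rate = len(correct) / len(rollouts)
    if pass_rate >= easy:
        label = "easy"
    elif pass_rate > hard:
        label = "medium"
    elif pass_rate > 0:
        label = "hard"
    else:
        label = "unsolved"  # no judged-correct response, so nothing to train on
    return {
        "pass_rate": pass_rate,
        "label": label,
        "sft_responses": [r["response"] for r in correct],
    }


# Five hypothetical rollouts for one generated question.
rollouts = [
    {"response": "r1", "answer": "3"},
    {"response": "r2", "answer": "3"},
    {"response": "r3", "answer": "4"},
    {"response": "r4", "answer": "3"},
    {"response": "r5", "answer": "5"},
]
result = judge_rollouts(rollouts, is_correct=lambda r: r["answer"] == "3")
```

A question with a pass rate of zero yields no supervised target, which is one reason a pipeline like this might drop or regenerate such items rather than keep them.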

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MindLoom adds a structured decomposition step to synthetic reasoning data but the results do not yet show that the thought-mode mechanism itself produces the reported gains.

The main point for you is that this paper gives a concrete pipeline for breaking hard problems into reusable reasoning steps and recomposing them, yet the link between that structure and the benchmark improvements remains unproven on the evidence shown. The abstract and stress-test note both flag the same gap: no check that longer thought-mode chains actually increase difficulty or that the composed problems require the modes the method claims to insert. Without that, the edge over distillation and external baselines could come from the rollout judging or simply from generating more varied data rather than from the compositional control they emphasize.

That said, the work does something useful. It starts from verified solutions, decomposes them into chains of atomic knowledge-reasoning transformations, trains a retrieval model to guide which modes to apply next, and adds distribution-aligned sampling plus a judging stage. They evaluate across nine benchmarks spanning STEM and math, run ablations, and release the code. That combination of decomposition, retrieval, and judging is a step past basic self-instruct or plain CoT, and the open implementation lets others test the claims directly.

The soft spots are mostly around missing quantitative anchors. No numbers appear for dataset size, exact performance deltas, or error bars, and the central assumption that difficulty scales with accumulated modes gets no correlation test or human verification. The citation pattern is light on direct comparisons to earlier decomposition or synthetic-data papers, which makes it harder to judge how much is truly incremental.

Readers working on post-training data pipelines for reasoning models will get the most from this. The framework is practical enough to try, especially with the code available, and the multi-benchmark setup gives a reasonable starting point for replication.

It is coherent on its own terms and shows honest engagement with the problem of controlling difficulty and diversity, so it deserves a serious referee. I would send it out with requests for the missing difficulty-scaling checks and clearer baseline comparisons.

Referee Report

3 major / 2 minor

Summary. The paper introduces MindLoom, a framework for synthesizing frontier-level reasoning data. It decomposes verified solutions into chains of atomic 'thought modes' (knowledge-reasoning transformations), trains a retrieval model to match problem states to modes, generates new problems via iterative mode application with distribution-aligned sampling, and applies rollout-based judging to label difficulty and extract correct responses. Models fine-tuned on the resulting data are reported to achieve favorable performance over base models, distillation, and external-data baselines across nine benchmarks spanning five STEM disciplines and four mathematical reasoning tasks. Ablation studies are said to indicate each component's contribution, with further analysis suggesting broad reasoning-pattern coverage and useful difficulty control. The implementation is open-sourced.

Significance. If the empirical results and the link between thought-mode composition and difficulty hold, the work could advance systematic generation of diverse, controllable reasoning datasets beyond current distillation approaches, potentially improving LLM generalization on complex tasks. The open-sourced code supports reproducibility and extension.

major comments (3)

[Abstract] Abstract: the central claim of favorable benchmark gains attributable to compositional thought-mode engineering rests on unshown quantitative evidence. No numbers, error bars, dataset sizes, or exclusion criteria are supplied for the reported performances or ablations, preventing assessment of effect sizes or reliability of the difficulty-control claims.
[Method] Method description (decomposition, retrieval, and rollout judging): the core assumption that difficulty arises specifically from accumulation of thought modes and that recomposition monotonically increases effective difficulty lacks a direct test. No correlation between mode-chain length and rollout difficulty labels, nor human verification that composed problems require the claimed modes, is described; gains could therefore arise from judging quality or data volume rather than the proposed mechanism.
[Experiments] Experiments section: comparisons to distillation and external-data baselines are load-bearing for the superiority claim, yet details on matched data volumes, baseline implementations, or statistical significance of improvements are not referenced, leaving potential confounds unaddressed.
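
The check requested in the [Method] comment, a correlation between mode-chain length and rollout difficulty, could be run along these lines. This is a sketch only: the Spearman rank correlation is one reasonable choice rather than something the paper or the referee specifies, and the two data lists are made-up placeholders, not results from the paper.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder data: chain length per generated problem and rollout failure rate.
chain_lengths = [1, 2, 2, 3, 4, 5]
fail_rates = [0.1, 0.2, 0.4, 0.3, 0.7, 0.9]
rho = spearman(chain_lengths, fail_rates)
```

A coefficient near zero on the real generated set would support the referee's worry that difficulty is not driven by chain length; a clearly positive one would be consistent with, though not proof of, the accumulation hypothesis.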

minor comments (2)

[Abstract] Abstract contains a subject-verb agreement error ('Models ... achieves' should be 'achieve').
[Method] The term 'thought modes' would benefit from an explicit formal definition or pseudocode early in the method section to clarify atomicity and stability assumptions.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating where we agree revisions are warranted and providing our reasoning on the underlying claims.

Referee: [Abstract] Abstract: the central claim of favorable benchmark gains attributable to compositional thought-mode engineering rests on unshown quantitative evidence. No numbers, error bars, dataset sizes, or exclusion criteria are supplied for the reported performances or ablations, preventing assessment of effect sizes or reliability of the difficulty-control claims.

Authors: We agree that the abstract would benefit from greater specificity to allow readers to assess effect sizes immediately. In the revised manuscript we will incorporate concise quantitative highlights (e.g., average relative gains, training-set size, and reference to variance across runs) while retaining the high-level narrative. The detailed tables, error bars, and exclusion criteria already appear in the Experiments section; the abstract revision will simply surface the most salient numbers.
Revision: yes
Referee: [Method] Method description (decomposition, retrieval, and rollout judging): the core assumption that difficulty arises specifically from accumulation of thought modes and that recomposition monotonically increases effective difficulty lacks a direct test. No correlation between mode-chain length and rollout difficulty labels, nor human verification that composed problems require the claimed modes, is described; gains could therefore arise from judging quality or data volume rather than the proposed mechanism.

Authors: We acknowledge that an explicit correlation between mode-chain length and rollout-derived difficulty labels is not currently reported, and that human verification of mode necessity for the generated problems is absent. While the ablation studies isolate the contribution of the compositional stage, we accept that these do not constitute a direct mechanistic test. In revision we will add a post-hoc correlation analysis between chain length and difficulty labels on the generated set, together with a brief discussion of alternative explanations such as judging quality. A full human verification study is resource-intensive and may be noted as future work rather than added to the current revision.
Revision: partial
Referee: [Experiments] Experiments section: comparisons to distillation and external-data baselines are load-bearing for the superiority claim, yet details on matched data volumes, baseline implementations, or statistical significance of improvements are not referenced, leaving potential confounds unaddressed.

Authors: We agree that additional experimental details are necessary to rule out confounds. The revised Experiments section will explicitly state the data volumes used for each baseline, describe the precise distillation and external-data implementations (including any hyper-parameter matching), and report statistical significance or confidence intervals for the observed improvements. These clarifications will be added without altering the existing results.
Revision: yes
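
One common way to produce the confidence intervals discussed in the last exchange is a paired bootstrap over benchmark items. The sketch below assumes per-item 0/1 correctness is available for two models on the same items; the function name, the resample count, and the example scores are hypothetical and do not come from the paper.

```python
import random


def paired_bootstrap_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile CI for the mean per-item difference a[i] - b[i]."""
    rng = random.Random(seed)
    n = len(a)
    diffs = []
    for _ in range(n_resamples):
        # Resample item indices with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(a[i] - b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Placeholder per-item correctness on 100 shared benchmark items.
finetuned = [1] * 70 + [0] * 30   # 70% accuracy
baseline = [1] * 60 + [0] * 40    # 60% accuracy
lo, hi = paired_bootstrap_ci(finetuned, baseline)
```

If the interval excludes zero, the improvement is unlikely to be a resampling artifact on that benchmark; it says nothing about confounds such as unmatched data volume, which the referee raises separately.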

Circularity Check

0 steps flagged

No significant circularity in MindLoom's compositional synthesis framework

The paper presents an empirical pipeline for reasoning data synthesis: decompose verified solutions into thought-mode chains, train a retrieval model on problem states, iteratively compose new problems via retrieved modes with distribution-aligned sampling, and apply rollout judging for difficulty labeling and SFT targets. No equations, fitted parameters, or self-referential definitions appear in the abstract or described method that would make outputs equivalent to inputs by construction. Performance claims rest on external benchmark evaluations across multiple models and disciplines rather than tautological reductions. The framework is explicitly open-sourced, and the central mechanism (compositional control of difficulty via mode accumulation) is treated as a testable hypothesis supported by ablations, not imported via self-citation chains or uniqueness theorems. This qualifies as a self-contained engineering contribution with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the novel framing of difficulty as accumulated thought modes and the effectiveness of the retrieval-plus-composition pipeline; full details on any fitted parameters in the retrieval model or judging stage are unavailable from the abstract alone.

axioms (1)

domain assumption: Reasoning difficulty arises from the accumulation of atomic knowledge-reasoning transformations called thought modes.
Explicitly stated as the foundational perspective in the abstract.

invented entities (1)

thought modes (no independent evidence)
purpose: Atomic units representing knowledge-reasoning transformations used to model and control problem difficulty and diversity.
New term and concept introduced to structure the decomposition and composition process.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "we view the difficulty of a reasoning problem as arising from the accumulation of atomic knowledge-reasoning transformations, which we term thought modes"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery · unclear
Paper passage: "New problems are composed by iteratively applying retrieved thought modes to seed questions"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 1 internal anchor

[1]

Proofnet: Autoformalizing and formally proving undergraduate-level mathematics, 2023

Zhangir Azerbayev, Bartosz Piotrowski, Hailey Schoelkopf, Edward W Ayers, Dragomir Radev, and Jeremy Avigad. Proofnet: Autoformalizing and formally proving undergraduate-level mathematics, 2023

work page 2023
[2]

Math- arena: Evaluating LLMs on uncontaminated math competitions

Mislav Balunovic, Jasper Dekoninck, Ivo Petrov, Nikola Jovanovi´c, and Martin Vechev. Math- arena: Evaluating LLMs on uncontaminated math competitions. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2026. URLhttps://openreview.net/forum?id=y0zL9IZxZ7

work page 2026
[3]

Theoremqa: A theorem-driven question answering dataset

Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, and Tony Xia. Theoremqa: A theorem-driven question answering dataset. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7889–7901, 2023

work page 2023
[4]

Arc-agi-2: A new challenge for frontier ai reasoning systems, 2025

Francois Chollet, Mike Knoop, Gregory Kamradt, Bryan Landers, and Henry Pinkard. Arc-agi-2: A new challenge for frontier ai reasoning systems, 2025

work page 2025
[5]

Training verifiers to solve math word problems, 2021

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems, 2021

work page 2021
[6]

The faiss library.IEEE Transactions on Big Data, 2025

Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre- Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. The faiss library.IEEE Transactions on Big Data, 2025

work page 2025
[7]

SuperGPQA: Scaling LLM evaluation across 285 graduate disciplines

Xeron Du, Yifan Yao, Kaijing Ma, and Others. SuperGPQA: Scaling LLM evaluation across 285 graduate disciplines. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2026. URL https://openreview.net/forum? id=6WgflzYQpf

work page 2026
[8]

Megascience: Pushing the frontiers of post- training datasets for science reasoning, 2025

Run-Ze Fan, Zengzhi Wang, and Pengfei Liu. Megascience: Pushing the frontiers of post- training datasets for science reasoning, 2025

work page 2025
[9]

ai 1 Gatti Alice 1 Li Nathaniel 1 Khoja Adam 1 Kim Ryan 1 Ren Richard 1 Hausenloy Jason 1 Zhang Oliver 1 Mazeika Mantas 1 Hendrycks Dan dan@ safe

Center for AI Safety Phan Long agibenchmark@ safe. ai 1 Gatti Alice 1 Li Nathaniel 1 Khoja Adam 1 Kim Ryan 1 Ren Richard 1 Hausenloy Jason 1 Zhang Oliver 1 Mazeika Mantas 1 Hendrycks Dan dan@ safe. ai 1. A benchmark of expert-level academic questions to assess ai capabilities.Nature, 649(8099):1139–1146, 2026

work page 2026
[10]

Frontiermath: A benchmark for evaluating advanced mathematical reasoning in ai, 2024

Elliot Glazer, Ege Erdil, Tamay Besiroglu, Diego Chicharro, Evan Chen, Alex Gunning, Caroline Falkman Olsson, Jean-Stanislas Denain, Anson Ho, Emily de Oliveira Santos, et al. Frontiermath: A benchmark for evaluating advanced mathematical reasoning in ai, 2024

work page 2024
[11]

Data selection via optimal control for language models

Yuxian Gu, Li Dong, Hongning Wang, Yaru Hao, Qingxiu Dong, Furu Wei, and Minlie Huang. Data selection via optimal control for language models. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum? id=dhAL5fy8wS

work page 2025
[12]

Openthoughts: Data recipes for reasoning mod- els

Etash Kumar Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hri- tik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Rea Sprague, Ashima Suvarna, Benjamin Feuer, Leon Liangyu Chen, Zaid Khan, Eric Frankel, Sachin Grover, Caroline Choi, Niklas Muennighoff, Shiye Su, Wanjia Zhao, John Yang, Shreyas Pimpal- gaonkar, Kartik sharma, Cha...

work page 2026
[13]

Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

work page 2025
[14]

Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Paper...

work page 2024
[15]

Measuring mathematical problem solving with the math dataset, 2021

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset, 2021

work page 2021
[16]

AIME 2024

Maxwell Jia. AIME 2024. Hugging Face dataset, 2025. URL https://huggingface.co/ datasets/Maxwell-Jia/AIME_2024

work page 2024
[17]

What disease does this patient have? a large-scale open domain question answering dataset from medical exams.Applied Sciences, 11(14):6421, 2021

Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams.Applied Sciences, 11(14):6421, 2021

work page 2021
[18]

Efficient memory management for large language model serving with pagedattention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th symposium on operating systems principles, pages 611–626, 2023

work page 2023
[19]

Deepseek-v3

Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. Deepseek-v3. 2: Pushing the frontier of open large language models, 2025

work page 2025
[20]

Synlogic: Synthesizing verifiable reasoning data at scale for learning logical reasoning and beyond, 2025

Junteng Liu, Yuanxiang Fan, Zhuo Jiang, Han Ding, Yongyi Hu, Chi Zhang, Yiqi Shi, Shitong Weng, Aili Chen, Shiqi Chen, et al. Synlogic: Synthesizing verifiable reasoning data at scale for learning logical reasoning and beyond, 2025

work page 2025
[21]

Selectit: Selective instruction tuning for llms via uncertainty-aware self-reflection

Liangxin Liu, Xuebo Liu, Derek F Wong, Dongfang Li, Ziyi Wang, Baotian Hu, and Min Zhang. Selectit: Selective instruction tuning for llms via uncertainty-aware self-reflection. volume 37, pages 97800–97825, 2024

work page 2024
[22]

Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct

Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. 2023

work page 2023
[23]

Some methods of classification and analysis of multivariate observations

James B McQueen. Some methods of classification and analysis of multivariate observations. InProc. of 5th Berkeley Symposium on Math. Stat. and Prob., pages 281–297, 1967

work page 1967
[24]

Are large language models superhuman chemists?, 2024

Adrian Mirza, Nawaf Alampara, Sreekanth Kunchapu, Martiño Ríos-García, Benedict Emoek- abu, Aswanth Krishnan, Tanya Gupta, Mara Schilling-Wilhelmi, Macjonathan Okereke, Anagha Aneesh, et al. Are large language models superhuman chemists?, 2024

work page 2024
[25]

OpenR1-Math-220k: A Large-Scale Math Dataset for Reinforcement Learning

Open-R1 Team. OpenR1-Math-220k: A Large-Scale Math Dataset for Reinforcement Learning. https://huggingface.co/datasets/open-r1/OpenR1-Math-220k, 2025

work page 2025
[26]

GPT-4 Technical Report

OpenAI. GPT-4 Technical Report, 2024. URLhttps://arxiv.org/abs/2303.08774

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

Gpqa: A graduate-level google-proof q&a benchmark, 2023

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark, 2023

work page 2023
[28]

Ai-assisted generation of difficult math questions, 2024

Vedant Shah, Dingli Yu, Kaifeng Lyu, Simon Park, Jiatong Yu, Yinghui He, Nan Rosemary Ke, Michael Mozer, Yoshua Bengio, Sanjeev Arora, et al. Ai-assisted generation of difficult math questions, 2024. 11

work page 2024
[29]

CS-bench: A comprehensive benchmark for large lan- guage models towards computer science mastery

Xiaoshuai Song, Muxi Diao, Guanting Dong, Zhengyang Wang, Yujia Fu, Runqi Qiao, Zhexu Wang, Dayuan Fu, Huangxuan Wu, Bin Liang, Weihao Zeng, Yejie Wang, Zhuoma GongQue, Jianing Yu, Qiuna Tan, and Weiran Xu. CS-bench: A comprehensive benchmark for large lan- guage models towards computer science mastery. InThe Thirteenth International Conference on Learnin...

work page 2025
[30]

Qwen3.5: Towards Native Multimodal Agents

Qwen Team. Qwen3.5: Towards Native Multimodal Agents. https://qwen.ai/blog?id= qwen3.5, February 2026. Accessed: 2026-04-13

work page 2026
[31]

Loomba, Shichang Zhang, Yizhou Sun, and Wei Wang

Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R. Loomba, Shichang Zhang, Yizhou Sun, and Wei Wang. Scibench: evaluating college-level scientific problem-solving abilities of large language models. InProceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024

work page 2024
[32]

Mmlu-pro: A more robust and challenging multi-task language understanding benchmark

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. volume 37, pages 95266–95290, 2024

work page 2024
[33]

Qurating: Selecting high-quality data for training language models, 2024

Alexander Wettig, Aatmik Gupta, Saumya Malik, and Danqi Chen. Qurating: Selecting high-quality data for training language models, 2024

work page 2024
[34]

LESS: Selecting influential data for targeted instruction tuning

Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, and Danqi Chen. LESS: Selecting influential data for targeted instruction tuning. InInternational Conference on Machine Learning (ICML), 2024

work page 2024
[35]

Bennett, Junaid Ahmed, and Arnold Overwijk

Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul N. Bennett, Junaid Ahmed, and Arnold Overwijk. Approximate nearest neighbor negative contrastive learning for dense text retrieval. InInternational Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=zeFrfgyZln

work page 2021
[36]

WizardLM: Empowering large pre-trained language models to follow complex instructions

Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, Qing- wei Lin, and Daxin Jiang. WizardLM: Empowering large pre-trained language models to follow complex instructions. InThe Twelfth International Conference on Learning Representations,

work page
[37]

URLhttps://openreview.net/forum?id=CfXh93NDgH

work page
[38]

Nori, Rahul Sharma, Amit Sharma, and Javier Gonzalez

Xinnuo Xu, Rachel Lawrence, Kshitij Dubey, Atharva Pandey, Risa Ueno, Fabian Falck, Aditya V . Nori, Rahul Sharma, Amit Sharma, and Javier Gonzalez. RE-IMAGINE: Symbolic benchmark synthesis for reasoning evaluation. InForty-second International Conference on Machine Learning, 2025. URLhttps://openreview.net/forum?id=QJPl0DWajD

work page 2025
[39]

Qwen3 technical report, 2025

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report, 2025

work page 2025
[40]

Select2reason: Efficient instruction-tuning data selection for long-cot reasoning, 2025

Cehao Yang, Xueyuan Lin, Xiaojun Wu, Chengjin Xu, Xuhui Jiang, Honghao Liu, Hui Xiong, and Jian Guo. Select2reason: Efficient instruction-tuning data selection for long-cot reasoning, 2025

[41] Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, et al. MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models. In International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=N8N0hgNDRt

[42] Shaoxiong Zhan, Yanlin Lai, Ziyu Lu, Dahua Lin, Ziqing Yang, and Fei Tan. Mathsmith: Towards extremely hard mathematical reasoning by forging synthetic problems with a reinforced policy. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 34602–34610, 2026.

[43] Xuemiao Zhang, Can Ren, Chengying Tu, Rongxiang Weng, Shuo Wang, Hongfei Yan, Jingang Wang, and Xunliang Cai. Expanding reasoning potential in foundation model by learning diverse chains of thought patterns. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=3FQV4JHPpY

[44] Zhehao Zhang, Jiaao Chen, and Diyi Yang. Darg: Dynamic evaluation of large language models via adaptive reasoning graph. volume 37, pages 135904–135942, 2024.

[45] Yuze Zhao, Jintao Huang, Jinghan Hu, Xingjun Wang, Yunlin Mao, Daoze Zhang, Zeyinzi Jiang, Zhikai Wu, Baole Ai, Ang Wang, et al. Swift: a scalable lightweight infrastructure for fine-tuning, 2025.

[46] Kunhao Zheng, Jesse Michael Han, and Stanislas Polu. minif2f: a cross-system benchmark for formal olympiad-level mathematics. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=9ZPegFuFTFv

[48] Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. Agieval: A human-centric benchmark for evaluating foundation models. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 2299–2314, 2024.

[49] zwhe99. amc23. Hugging Face dataset, 2023. URL https://huggingface.co/datasets/zwhe99/amc23

A Limitations and Future Work

The primary limitation of MindLoom is that the diversity of synthesized problems is bounded by the thought-mode bank, which is itself derived from the collected reference corpus. If the corpus lacks certain types of advanced reasoning...

1. Identify all values and quantities that window $W_k$ uses but does not derive (i.e., values computed in earlier windows).
2. Convert each such value into an explicit given in the seed question.
3. Formulate a self-contained question whose complete solution requires exactly the steps in $W_k$.
4. Verify that the seed question is independently solvable without any external information.

The implementation uses the following prompt skeleton. The system message defines the role as a problem designer and reverse-engineering specialist, explains that the model should backtrack from the final solution window, and emphasizes dependency isolation. The user...
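The dependency-isolation idea behind the seed-question steps can be illustrated with a small sketch. It is not the paper's implementation, which prompts a language model; it only assumes that each solution window can be summarized by the quantities it reads and the quantities it computes. The `Window` class, the helper names, and the kinematics example are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Hypothetical summary of one solution window."""
    name: str
    uses: frozenset      # quantities the window reads
    derives: frozenset   # quantities the window computes


def seed_givens(windows):
    """Quantities the final window uses but does not derive itself."""
    final = windows[-1]
    return sorted(final.uses - final.derives)


def is_self_contained(givens, window):
    """True if every quantity the window reads is a given or derived in-window."""
    return window.uses <= (set(givens) | window.derives)


# Illustrative three-window solution (assumed example, not from the paper).
windows = [
    Window("W1", frozenset({"mass", "force"}), frozenset({"acceleration"})),
    Window("W2", frozenset({"acceleration", "time"}), frozenset({"velocity"})),
    Window("W3", frozenset({"velocity", "mass"}), frozenset({"kinetic_energy"})),
]

givens = seed_givens(windows)
solvable = is_self_contained(givens, windows[-1])
```

In this toy case the seed question would state `mass` and `velocity` explicitly and ask only for `kinetic_energy`, so solving it needs nothing beyond the last window.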

1. Identify which explicit given in $Q_{i-1}$ is the result of the computation in window $W_{k-i}$.
2. Remove that given from the problem statement and modify the question so that the solver must derive the value.
3. Ensure the evolved question $Q_i$ remains well-defined and solvable.
4. Extract the thought mode tuple $T_i = (S_{\text{sum}}, S_{\text{det}}, K_{\text{gen}}, K_{\text{spec}})$ that describes the added reasoning requirement.

The iterative evolution prompt receives the original target problem, the current intermediate question, the upstream solution steps, and the next solution window to absorb. It asks the model to remove one explicit dependency from the current q...
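The backward evolution loop can be sketched in the same symbolic style. This is a simplified stand-in, assuming each question is represented only by its set of givens; the actual rewriting and the extraction of the thought mode tuple are done by a language model and are omitted here. `Window`, `evolve`, and the example quantities are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Hypothetical summary of one solution window."""
    name: str
    uses: frozenset
    derives: frozenset


def evolve(givens, window):
    """Absorb one upstream window: drop the given it computes, expose its inputs."""
    removed = givens & window.derives
    if not removed:
        raise ValueError(f"{window.name} derives none of the current givens")
    return (givens - removed) | window.uses, removed


windows = [
    Window("W1", frozenset({"mass", "force"}), frozenset({"acceleration"})),
    Window("W2", frozenset({"acceleration", "time"}), frozenset({"velocity"})),
    Window("W3", frozenset({"velocity", "mass"}), frozenset({"kinetic_energy"})),
]

# Q_0: the seed question built from the final window W_k.
givens = frozenset(windows[-1].uses - windows[-1].derives)
history = [givens]

# Walk upstream: W_{k-1}, W_{k-2}, ... each step yields the next Q_i.
for window in reversed(windows[:-1]):
    givens, removed = evolve(givens, window)
    history.append(givens)
```

Each pass removes exactly the given that the absorbed window produces, so the solver of the evolved question has to carry out one more window of reasoning than before.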

Mining. For each training pair $(Q_i, T_i^{+})$, we query the FAISS index with the embedding of $Q_i$ and retrieve the top-$k$ most similar thought modes, excluding the positive $T_i^{+}$. These serve as hard negatives $\{T_j^{-}\}$.
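A minimal sketch of the mining step follows. To stay dependency-free it replaces the FAISS index with a brute-force cosine-similarity scan, which returns the same neighbors on small inputs but is not how the paper runs retrieval. The function names, mode ids, and two-dimensional embeddings are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mine_hard_negatives(query_vec, mode_vecs, positive_id, k):
    """Ids of the k thought modes most similar to the query, excluding the positive."""
    scored = [
        (cosine(query_vec, vec), mode_id)
        for mode_id, vec in mode_vecs.items()
        if mode_id != positive_id
    ]
    # Highest similarity first; ties broken by id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [mode_id for _, mode_id in scored[:k]]


# Toy embeddings for four thought modes (assumed values).
mode_vecs = {
    "t_pos": [1.0, 0.0],
    "t_near": [0.9, 0.1],
    "t_mid": [0.5, 0.5],
    "t_far": [0.0, 1.0],
}
negatives = mine_hard_negatives([1.0, 0.0], mode_vecs, positive_id="t_pos", k=2)
```

The negatives returned are the modes closest to the query that are not the labeled positive, which is what makes them hard for a contrastive objective.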

Refresh. After every $R$ training steps, we re-encode all thought modes using the updated model and rebuild the FAISS index. This ensures the hard negatives remain informative as training progresses.

D.4 Training Hyperparameters

The checkpoint at step 20 is selected based on validation-set performance during training. Training runs for up to 300 optimizer st...
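The refresh schedule used during hard-negative mining (re-encode and rebuild every $R$ steps) can be sketched as a plain training loop. The value of $R$ is not given in the text shown here, so `refresh_every` and the step count below are placeholder values, and `encode` is a hypothetical stand-in for the retriever's encoder. A dictionary plays the role of the index.

```python
def train_with_refresh(total_steps, refresh_every, encode, modes):
    """Run a dummy loop that rebuilds the mode index every `refresh_every` steps."""
    # Initial encoding before any optimizer step.
    index = {mode: encode(mode, 0) for mode in modes}
    rebuilds = 0
    for step in range(1, total_steps + 1):
        # One optimizer step would go here, with negatives mined from `index`.
        if step % refresh_every == 0:
            # Re-encode every thought mode with the current model state.
            index = {mode: encode(mode, step) for mode in modes}
            rebuilds += 1
    return index, rebuilds


def toy_encode(mode, step):
    """Placeholder encoder: tags each mode with the step it was encoded at."""
    return (mode, step)


index, rebuilds = train_with_refresh(
    total_steps=10, refresh_every=4, encode=toy_encode, modes=["a", "b"]
)
```

With these placeholder values the index is rebuilt after steps 4 and 8, so negatives mined in steps 9 and 10 reflect the model as of step 8 rather than the initial encoder.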
