arxiv: 2605.21630 · v1 · pith:36EKGWK2new · submitted 2026-05-20 · 💻 cs.AI

MindLoom: Composing Thought Modes for Frontier-Level Reasoning Data Synthesis

Pith reviewed 2026-05-22 09:20 UTC · model grok-4.3

classification 💻 cs.AI
keywords reasoning data synthesis · thought modes · compositional generation · LLM fine-tuning · difficulty control · STEM benchmarks · math reasoning · data augmentation

The pith

Composing thought modes from existing solutions lets models synthesize reasoning data with controllable difficulty and wide diversity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that reasoning difficulty comes from stacking atomic knowledge-reasoning transformations, which it calls thought modes. MindLoom decomposes verified hard-problem solutions into explicit chains of these modes, trains a retriever to pick suitable modes for any current problem state, and then builds fresh problems by repeatedly applying the retrieved modes to simple seed questions. Distribution-aligned sampling and a final rollout judge ensure the new problems vary in both coverage and difficulty level. Models fine-tuned on the resulting data outperform base models, distillation pipelines, and external-data baselines on nine benchmarks spanning five STEM fields and four math reasoning tasks.
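The summary above does not say how distribution-aligned sampling is computed. As a rough sketch of one plausible reading, the snippet below weights each candidate mode by its retrieval score times the shortfall of its cluster relative to a target share. The function names, the deficit-based weighting, and the toy numbers are all assumptions for illustration, not the paper's method.

```python
import random
from collections import Counter
from typing import Dict, List, Tuple

Candidate = Tuple[str, int, float]  # (mode name, cluster id, positive retrieval score)


def alignment_weights(
    candidates: List[Candidate],
    target: Dict[int, float],  # desired share per cluster (assumed to be given)
    used: Counter,             # how often each cluster has been selected so far
) -> List[float]:
    """Weight candidates so that under-represented clusters are favored."""
    total = sum(used.values())
    weights = []
    for _, cluster, score in candidates:
        observed = used[cluster] / total if total else 0.0
        # Clusters below their target share get a boost; others get a tiny floor.
        deficit = max(target.get(cluster, 0.0) - observed, 0.0)
        weights.append(score * (deficit + 1e-6))
    return weights


def sample_mode(
    candidates: List[Candidate],
    target: Dict[int, float],
    used: Counter,
    rng: random.Random,
) -> Candidate:
    """Draw one candidate according to the alignment weights and record it."""
    weights = alignment_weights(candidates, target, used)
    choice = rng.choices(candidates, weights=weights, k=1)[0]
    used[choice[1]] += 1
    return choice


# Toy data: cluster 0 is heavily over-represented so far.
rng = random.Random(0)
used = Counter({0: 9, 1: 1})
candidates = [("mode-a", 0, 1.0), ("mode-b", 1, 1.0)]
target = {0: 0.5, 1: 0.5}
weights = alignment_weights(candidates, target, used)
picked = sample_mode(candidates, target, used, rng)
```

With equal retrieval scores, the candidate from the under-sampled cluster receives almost all of the probability mass, which is the coverage-balancing behavior the summary attributes to this stage.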

Core claim

MindLoom decomposes verified solutions into thought-mode chains, trains a retrieval model that maps problem states to compatible modes, composes new questions by iteratively applying retrieved modes to seed questions under distribution-aligned sampling, and finally uses rollout-based judging to label difficulty and supply correct responses for supervised fine-tuning.
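The composition loop described in the core claim can be sketched as follows. This is a minimal illustration under stated assumptions: `ThoughtMode`, `ComposedProblem`, `compose`, and the toy retriever and rewriter are hypothetical names, and a real system would call a trained retriever and an LLM where the stubs are.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class ThoughtMode:
    """Hypothetical container for one atomic transformation."""
    name: str
    cluster: int


@dataclass
class ComposedProblem:
    question: str
    chain: List[ThoughtMode] = field(default_factory=list)


def compose(
    seed_question: str,
    retrieve: Callable[[str], Sequence[ThoughtMode]],
    select: Callable[[Sequence[ThoughtMode]], ThoughtMode],
    apply_mode: Callable[[str, ThoughtMode], str],
    max_steps: int = 3,
) -> ComposedProblem:
    """Evolve a seed question by repeatedly applying retrieved thought modes."""
    problem = ComposedProblem(question=seed_question)
    for _ in range(max_steps):
        candidates = retrieve(problem.question)
        if not candidates:  # nothing compatible with the current problem state
            break
        mode = select(candidates)
        problem.question = apply_mode(problem.question, mode)
        problem.chain.append(mode)
    return problem


# Toy stand-ins so the sketch runs end to end.
BANK = [ThoughtMode("unit-conversion", 0), ThoughtMode("conservation-law", 1)]


def toy_retrieve(question: str) -> List[ThoughtMode]:
    # Offer only modes whose tag is not yet present in the question text.
    return [m for m in BANK if f"[{m.name}]" not in question]


def toy_apply(question: str, mode: ThoughtMode) -> str:
    # Placeholder for an LLM rewrite that adds the new reasoning requirement.
    return f"{question} [{mode.name}]"


result = compose(
    "Find the speed of the cart.", toy_retrieve, lambda c: c[0], toy_apply
)
```

The recorded `chain` is what gives the structural visibility the paper emphasizes: each synthesized question carries the list of modes that were stacked onto its seed.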

What carries the argument

Thought modes: atomic knowledge-reasoning transformations whose accumulation determines problem difficulty; these are decomposed from solutions and recomposed via retrieval to synthesize new instances.

If this is right

  • Fine-tuned models achieve favorable results over base models, distillation, and external-data baselines across nine benchmarks in STEM and math.
  • The framework supplies explicit structural visibility into factors that govern difficulty, unlike prior synthesis methods.
  • Ablation studies attribute performance gains to the decomposition, retrieval, and judging stages.
  • Generated problems cover a broad range of reasoning patterns while preserving useful difficulty gradation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decomposition-and-recomposition loop could be tested on non-STEM reasoning tasks by first identifying domain-specific atomic transformations.
  • If thought-mode chains prove stable across different base models, the method might support iterative self-improvement loops without external data.
  • Measuring whether the synthesized problems transfer to model families or sizes not used in the original experiments would test the generality of the difficulty-control claim.

Load-bearing premise

Reasoning difficulty arises specifically from the accumulation of reliably decomposable atomic knowledge-reasoning transformations that can be recomposed to control both difficulty and diversity.

What would settle it

If models fine-tuned on MindLoom data show no consistent gains over strong distillation or external-data baselines when evaluated on the same nine benchmarks, the claim that compositional thought-mode synthesis produces superior training data would be falsified.

Figures

Figures reproduced from arXiv: 2605.21630 by Chongyang Pan, Haiyang Shen, Jinsheng Huang, Ming Zhang, Mugeng Liu, Siqi Zhong, Sixiong Xie, Taian Guo, Weichen Bi, Wenchun Jing, Xuanzhong Chen, Yudong Han, Yun Ma, Zhuofan Shi.

Figure 1: Overview of the MindLoom pipeline. Step 1: hard problems with verified solutions are reverse-engineered into thought mode chains, populating a thought mode bank B. Step 2: training pairs $(Q_{i-1}, T_i^+)$ and mined hard negatives train a retrieval model via margin ranking loss L. Step 3: from a seed question $Q_0$, the pipeline iteratively matches compatible thought modes, applies distribution-aligned scoring wi…
Figure 2: Coverage over the twelve thought-mode clusters (Appendix H). Polar bars give the …
Figure 3: Per-cluster selection proportions for the three selector variants. Cluster labels follow …
Figure 4: Hyperparameter sensitivity sweeps on Qwen3-4B (left) and Qwen3.5-4B (right). Bottom …
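The Figure 1 caption mentions a margin ranking loss over positive pairs and hard negatives without giving its form. A common hinge formulation is sketched below as an assumption: the dot-product scorer, the margin of 0.2, and the averaging over negatives are illustrative choices, not values taken from the paper.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Similarity score between two embedding vectors."""
    return sum(a * b for a, b in zip(u, v))


def margin_ranking_loss(
    query: Sequence[float],
    positive: Sequence[float],
    negatives: List[Sequence[float]],
    margin: float = 0.2,  # assumed value, for illustration only
) -> float:
    """Mean of max(0, margin - s(q, t+) + s(q, t-)) over all negatives."""
    pos_score = dot(query, positive)
    losses = [
        max(0.0, margin - pos_score + dot(query, neg)) for neg in negatives
    ]
    return sum(losses) / len(losses)


# Toy embeddings: one near-miss negative and one clearly unrelated negative.
q = [1.0, 0.0]
t_pos = [0.9, 0.1]
t_negs = [[0.8, 0.2], [0.0, 1.0]]
loss = margin_ranking_loss(q, t_pos, t_negs)
```

Only the near-miss negative contributes to the loss here, which is why mining hard negatives matters: easy negatives already satisfy the margin and produce no gradient.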
Original abstract

Although LLMs have made substantial progress in reasoning, systematically producing frontier-level reasoning data remains difficult. Existing synthesis methods often have limited visibility into the structural factors that govern problem difficulty, which can result in narrow diversity and unstable difficulty control. In this work, we view the difficulty of a reasoning problem as arising from the accumulation of atomic knowledge-reasoning transformations, which we term thought modes. Building on this perspective, we propose MindLoom, a framework for synthesizing frontier-level reasoning data through compositional thought mode engineering. Given a collection of hard problems with verified solutions, MindLoom first decomposes those solutions into thought mode chains that reveal each problem's construction logic. It then trains a retrieval model that matches problem states to compatible thought modes, providing guidance on which reasoning challenges to introduce during synthesis. New problems are composed by iteratively applying retrieved thought modes to seed questions, with distribution-aligned sampling to encourage diverse reasoning coverage. Finally, a rollout-based judging stage labels generated questions by difficulty and supplies judged-correct responses for supervised fine-tuning. We evaluate MindLoom on nine benchmarks covering five STEM disciplines and four mathematical reasoning tasks across multiple model families and sizes. Models fine-tuned on MindLoom-generated data achieves favorable performances over base models, distillation, and external-data baselines across the reported benchmarks. Ablation studies indicate the contribution of each component, and further analysis suggests that MindLoom covers a broad range of reasoning patterns while maintaining useful difficulty control. We have open-sourced our implementation at https://github.com/EachSheep/MindLoom.
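The rollout-based judging stage in the abstract can be pictured as sampling several responses per question, checking each one, and deriving a difficulty label from the pass rate. The sketch below assumes this reading. The thresholds, the label names, and the `sample_response` and `is_correct` callables are hypothetical, since the abstract does not specify them.

```python
from typing import Callable, Dict, List


def judge_by_rollouts(
    question: str,
    sample_response: Callable[[str], str],   # stand-in for a model rollout
    is_correct: Callable[[str, str], bool],  # stand-in for the judge
    k: int = 8,
) -> Dict[str, object]:
    """Label a question by pass rate and keep judged-correct responses."""
    responses: List[str] = [sample_response(question) for _ in range(k)]
    correct = [r for r in responses if is_correct(question, r)]
    pass_rate = len(correct) / k
    if pass_rate == 0.0:
        label = "unsolved"  # no usable fine-tuning target
    elif pass_rate < 0.5:   # assumed threshold
        label = "hard"
    else:
        label = "easy"
    return {"pass_rate": pass_rate, "label": label, "sft_responses": correct}


# Toy rollout source: four canned answers, one of which is right.
canned = iter(["4", "5", "3", "3"])
report = judge_by_rollouts(
    "What is 2 + 2?",
    sample_response=lambda q: next(canned),
    is_correct=lambda q, r: r == "4",
    k=4,
)
```

A single pass over the rollouts yields both outputs the abstract names: the difficulty label and the judged-correct responses for supervised fine-tuning.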

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces MindLoom, a framework for synthesizing frontier-level reasoning data. It decomposes verified solutions into chains of atomic 'thought modes' (knowledge-reasoning transformations), trains a retrieval model to match problem states to modes, generates new problems via iterative mode application with distribution-aligned sampling, and applies rollout-based judging to label difficulty and extract correct responses. Models fine-tuned on the resulting data are reported to achieve favorable performance over base models, distillation, and external-data baselines across nine benchmarks spanning five STEM disciplines and four mathematical reasoning tasks. Ablation studies are said to indicate each component's contribution, with further analysis suggesting broad reasoning-pattern coverage and useful difficulty control. The implementation is open-sourced.

Significance. If the empirical results and the link between thought-mode composition and difficulty hold, the work could advance systematic generation of diverse, controllable reasoning datasets beyond current distillation approaches, potentially improving LLM generalization on complex tasks. The open-sourced code supports reproducibility and extension.

major comments (3)
  1. [Abstract] Abstract: the central claim of favorable benchmark gains attributable to compositional thought-mode engineering rests on unshown quantitative evidence. No numbers, error bars, dataset sizes, or exclusion criteria are supplied for the reported performances or ablations, preventing assessment of effect sizes or reliability of the difficulty-control claims.
  2. [Method] Method description (decomposition, retrieval, and rollout judging): the core assumption that difficulty arises specifically from accumulation of thought modes and that recomposition monotonically increases effective difficulty lacks a direct test. No correlation between mode-chain length and rollout difficulty labels, nor human verification that composed problems require the claimed modes, is described; gains could therefore arise from judging quality or data volume rather than the proposed mechanism.
  3. [Experiments] Experiments section: comparisons to distillation and external-data baselines are load-bearing for the superiority claim, yet details on matched data volumes, baseline implementations, or statistical significance of improvements are not referenced, leaving potential confounds unaddressed.
minor comments (2)
  1. [Abstract] Abstract contains a subject-verb agreement error ('Models ... achieves' should be 'achieve').
  2. [Method] The term 'thought modes' would benefit from an explicit formal definition or pseudocode early in the method section to clarify atomicity and stability assumptions.
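Major comment 2 asks for a direct test of whether longer mode chains correspond to harder problems. One simple check is a Spearman rank correlation between chain length and rollout pass rate. The sketch below uses only the standard library, and the five data points are invented for illustration and do not come from the paper.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i : j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented example: pass rate falls as the chain grows.
chain_lengths = [1, 2, 3, 4, 5]
pass_rates = [0.9, 0.7, 0.6, 0.3, 0.1]
rho = spearman(chain_lengths, pass_rates)
```

A strongly negative value on real generated data would support the accumulation hypothesis, while a value near zero would suggest that the gains come from elsewhere, such as judging quality or data volume.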

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating where we agree revisions are warranted and providing our reasoning on the underlying claims.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim of favorable benchmark gains attributable to compositional thought-mode engineering rests on unshown quantitative evidence. No numbers, error bars, dataset sizes, or exclusion criteria are supplied for the reported performances or ablations, preventing assessment of effect sizes or reliability of the difficulty-control claims.

    Authors: We agree that the abstract would benefit from greater specificity to allow readers to assess effect sizes immediately. In the revised manuscript we will incorporate concise quantitative highlights (e.g., average relative gains, training-set size, and reference to variance across runs) while retaining the high-level narrative. The detailed tables, error bars, and exclusion criteria already appear in the Experiments section; the abstract revision will simply surface the most salient numbers. revision: yes

  2. Referee: [Method] Method description (decomposition, retrieval, and rollout judging): the core assumption that difficulty arises specifically from accumulation of thought modes and that recomposition monotonically increases effective difficulty lacks a direct test. No correlation between mode-chain length and rollout difficulty labels, nor human verification that composed problems require the claimed modes, is described; gains could therefore arise from judging quality or data volume rather than the proposed mechanism.

    Authors: We acknowledge that an explicit correlation between mode-chain length and rollout-derived difficulty labels is not currently reported, and that human verification of mode necessity for the generated problems is absent. While the ablation studies isolate the contribution of the compositional stage, we accept that these do not constitute a direct mechanistic test. In revision we will add a post-hoc correlation analysis between chain length and difficulty labels on the generated set, together with a brief discussion of alternative explanations such as judging quality. A full human verification study is resource-intensive and may be noted as future work rather than added to the current revision. revision: partial

  3. Referee: [Experiments] Experiments section: comparisons to distillation and external-data baselines are load-bearing for the superiority claim, yet details on matched data volumes, baseline implementations, or statistical significance of improvements are not referenced, leaving potential confounds unaddressed.

    Authors: We agree that additional experimental details are necessary to rule out confounds. The revised Experiments section will explicitly state the data volumes used for each baseline, describe the precise distillation and external-data implementations (including any hyper-parameter matching), and report statistical significance or confidence intervals for the observed improvements. These clarifications will be added without altering the existing results. revision: yes
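The third response promises confidence intervals for the improvements. One standard way to obtain them for paired per-question results is a percentile bootstrap over the per-question differences, sketched below. The function name, the resample count, and the toy correctness vectors are assumptions for illustration, and the authors may use a different procedure.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(
    a: List[int],  # 1 if system A answered question i correctly, else 0
    b: List[int],  # same questions, system B
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed accuracy gain, lower bound, upper bound)."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(a)
    diffs = [x - y for x, y in zip(a, b)]
    observed = sum(diffs) / n
    means = []
    for _ in range(n_resamples):
        # Resample questions with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return observed, lower, upper


# Toy per-question outcomes (invented, not from the paper).
fine_tuned = [1, 1, 1, 0, 1, 1, 0, 1]
baseline = [1, 0, 1, 0, 0, 1, 0, 1]
gain, lo, hi = paired_bootstrap_ci(fine_tuned, baseline)
```

If the interval excludes zero, the gain is unlikely to be a sampling artifact of the benchmark questions. This does not address the separate confound of unmatched data volumes raised in the same comment.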

Circularity Check

0 steps flagged

No significant circularity in MindLoom's compositional synthesis framework

full rationale

The paper presents an empirical pipeline for reasoning data synthesis: decompose verified solutions into thought-mode chains, train a retrieval model on problem states, iteratively compose new problems via retrieved modes with distribution-aligned sampling, and apply rollout judging for difficulty labeling and SFT targets. No equations, fitted parameters, or self-referential definitions appear in the abstract or described method that would make outputs equivalent to inputs by construction. Performance claims rest on external benchmark evaluations across multiple models and disciplines rather than tautological reductions. The framework is explicitly open-sourced, and the central mechanism (compositional control of difficulty via mode accumulation) is treated as a testable hypothesis supported by ablations, not imported via self-citation chains or uniqueness theorems. This qualifies as a self-contained engineering contribution with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the novel framing of difficulty as accumulated thought modes and the effectiveness of the retrieval-plus-composition pipeline; full details on any fitted parameters in the retrieval model or judging stage are unavailable from the abstract alone.

axioms (1)
  • domain assumption: Reasoning difficulty arises from the accumulation of atomic knowledge-reasoning transformations called thought modes.
    Explicitly stated as the foundational perspective in the abstract.
invented entities (1)
  • thought modes (no independent evidence)
    purpose: Atomic units representing knowledge-reasoning transformations used to model and control problem difficulty and diversity.
    New term and concept introduced to structure the decomposition and composition process.

pith-pipeline@v0.9.0 · 5856 in / 1362 out tokens · 46336 ms · 2026-05-22T09:20:49.076100+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

A Limitations and Future Work

The primary limitation of MindLoom is that the diversity of synthesized problems is bounded by the thought-mode bank, which is itself derived from the collected reference corpus. If the corpus lacks certain types of advanced reasoning...

  • Identify all values and quantities that window $W_k$ uses but does not derive (i.e., values computed in earlier windows).
  • Convert each such value into an explicit given in the seed question.
  • Formulate a self-contained question whose complete solution requires exactly the steps in $W_k$.
  • Verify that the seed question is independently solvable without any external information.

The implementation uses the following prompt skeleton. The system message defines the role as a problem designer and reverse-engineering specialist, explains that the model should backtrack from the final solution window, and emphasizes dependency isolation. The user...

  • Identify which explicit given in $Q_{i-1}$ is the result of the computation in window $W_{k-i}$.
  • Remove that given from the problem statement and modify the question so that the solver must derive the value.
  • Ensure the evolved question $Q_i$ remains well-defined and solvable.
  • Extract the thought mode tuple $T_i = (S_{\text{sum}}, S_{\text{det}}, K_{\text{gen}}, K_{\text{spec}})$ that describes the added reasoning requirement.

The iterative evolution prompt receives the original target problem, the current intermediate question, the upstream solution steps, and the next solution window to absorb. It asks the model to remove one explicit dependency from the current q...

Mining. For each training pair $(Q_i, T_i^+)$, we query the FAISS index with the embedding of $Q_i$ and retrieve the top-k most similar thought modes, excluding the positive $T_i^+$. These serve as hard negatives $\{T_j^-\}$.

Refresh. After every $R$ training steps, we re-encode all thought modes using the updated model and rebuild the FAISS index. This ensures the hard negatives remain informative as training progresses.

D.4 Training Hyperparameters

The checkpoint at step 20 is selected based on validation-set performance during training. Training runs for up to 300 optimizer st...
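The Mining and Refresh steps above can be sketched in a few lines. To stay dependency-free, this example replaces the FAISS index with a brute-force dot-product search over a small dictionary. The function names, the toy encoder, and the toy embeddings are assumptions for illustration only.

```python
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def mine_hard_negatives(
    query_vec: Vector,
    positive_id: str,
    index: Dict[str, Vector],  # mode id -> embedding (stand-in for a FAISS index)
    top_k: int = 2,
) -> List[str]:
    """Return the top-k most similar mode ids, excluding the positive."""
    ranked = sorted(
        index, key=lambda mode_id: dot(query_vec, index[mode_id]), reverse=True
    )
    return [mode_id for mode_id in ranked if mode_id != positive_id][:top_k]


def rebuild_index(
    mode_texts: Dict[str, str], encode: Callable[[str], Vector]
) -> Dict[str, Vector]:
    """Re-encode every thought mode with the current model (the refresh step)."""
    return {mode_id: encode(text) for mode_id, text in mode_texts.items()}


# Toy encoder: maps text to (count of 'a', count of 'b').
def toy_encode(text: str) -> Vector:
    return [float(text.count("a")), float(text.count("b"))]


mode_texts = {"m1": "aaa", "m2": "aab", "m3": "bbb"}
index = rebuild_index(mode_texts, toy_encode)
negatives = mine_hard_negatives(toy_encode("aaa"), positive_id="m1", index=index, top_k=1)
```

In a training loop, `rebuild_index` would be called every R steps so that the mined negatives track the updated encoder, which is the purpose the Refresh paragraph describes.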