pith. the verified trust layer for science. sign in

arxiv: 2507.21166 · v2 · submitted 2025-07-25 · 💻 cs.LG · cs.AI

The Ratchet Effect in Silico through Interaction-Driven Cumulative Intelligence in Large Language Models

Pith reviewed 2026-05-19 02:34 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords large language modelsmulti-agent systemscumulative intelligenceratchet effectpeer verificationmathematical reasoningparameter internalization
0
0 comments X p. Extension

The pith

Small language models improve reasoning by verifying each other's solutions and updating parameters with the validated results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that populations of small large language models can accumulate knowledge gains over successive rounds of interaction by generating candidate solutions, having peers check them, keeping only the correct ones in a shared memory, and then adjusting their own parameters to incorporate those successes. This process is presented as a computational version of the ratchet effect seen in human cumulative culture, where improvements are retained rather than lost to drift. If the mechanism holds, it would mean that structured social interaction among models offers a route to stronger performance that does not require simply adding more parameters or more training data. The experiments focus on mathematical reasoning tasks and show that the verification step is the main driver while internalization carries the gains forward across rounds.

Core claim

In the POLIS framework, heterogeneous agents generate solutions to problems, verify one another's outputs, store only the validated artifacts in shared cultural memory, and internalize those artifacts through parameter updates. Populations of 1--4B-parameter models using this process record average gains of 8.8--18.9 points on mathematical reasoning benchmarks and reduce the performance gap to single 70B+ models. Mechanistic tests isolate peer verification as the primary operator that sustains the accumulation and show that internalization prevents loss of the retained knowledge between rounds.

What carries the argument

The POLIS multi-agent loop in which peer verification filters outputs before they enter shared memory and drive parameter internalization.

If this is right

  • Models in the 1--4B range can reach performance levels closer to much larger single models solely through repeated cycles of generation, verification, retention, and internalization.
  • Peer verification functions as the main mechanism that prevents loss of improvements and enables the ratchet-like accumulation.
  • Internalization of validated artifacts into agent parameters sustains the gains from one round to the next rather than requiring constant external prompting.
  • The overall approach treats structured interaction as an independent scaling dimension alongside increases in parameter count.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar interaction structures could be tested on non-mathematical tasks such as code generation or scientific hypothesis refinement to check whether the accumulation effect generalizes.
  • Systems built this way might support continuous, open-ended refinement by maintaining an ongoing society of agents instead of periodic full retraining.
  • The design raises the practical question of how verification quality must scale with task difficulty to avoid gradual contamination of the shared memory.

Load-bearing premise

Peer agents can reliably separate correct solutions from incorrect ones and keep only the good ones in the shared memory that later shapes updates.

What would settle it

Replace the peer verification step with random acceptance of solutions or with a process that systematically admits errors; if the reported performance gains then disappear or reverse across rounds, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2507.21166 by Ren Zhuang.

Figure 1
Figure 1. Figure 1: Performance trajectory of the AGORA en￾semble on MATH500. The ensemble’s performance im￾proves over evolutionary steps, surpassing several larger, static benchmarks. The star denotes a strategic shift toward elite-peer learning, which breaks a performance plateau. learn from peer-generated solutions, dynamically alternat￾ing between teacher and student roles. This multidirectional knowledge flow contrasts … view at source ↗
Figure 2
Figure 2. Figure 2: The AGORA architecture. A four-module, dual-loop architecture that facilitates two core processes: Challenge Generation (I), Solution Formulation (II), Quality Evaluation (III), and Model Evolution (IV). The architecture facilitates group distillation for knowledge consolidation and uses an elite history of peer solutions to incentivize group emergence. Quality Recognition For solutions verified as correct… view at source ↗
Figure 3
Figure 3. Figure 3: Adaptive curriculum dynamics on MATH500. As the ensemble’s performance increases, the system au￾tomatically generates more challenging problems which re￾flected in longer response times, ensuring a state of continu￾ous improvement. powerful and generalizable enhancement, not tied to a spe￾cific model architecture. The strong performance on bench￾marks like GPQA and AIME24 is particularly noteworthy, confir… view at source ↗
Figure 5
Figure 5. Figure 5: Comparing AGORA ensembles on MATH500. While homogeneous groups of 1B and 4B models improve, the heterogeneous mixed group achieves the highest abso￾lute performance gain, highlighting the benefit of cognitive diversity. lishing correctness without an external oracle is the foun￾dational element of the collective’s success. Without it, the system is unable to prevent the propagation of plausible but flawed … view at source ↗
Figure 6
Figure 6. Figure 6 [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
read the original abstract

Human intelligence scales through cumulative cultural evolution (CCE), a ratchet process in which innovations are retained against entropic drift. Large language model training, by contrast, still depends primarily on static corpora and parameter growth, leaving little room for endogenous accumulation through interaction. We present POLIS (Population Orchestrated Learning and Inference Society), a framework in which heterogeneous agents generate solutions, verify one another's outputs, retain validated artifacts in shared cultural memory, and internalize them through parameter updates. On mathematical reasoning benchmarks, populations of 1--4B-parameter models achieved average gains of 8.8--18.9 points over base models and narrowed the gap to 70B+ monoliths. Mechanistic ablations identify peer verification as the main ratchet operator and show that internalization sustains accumulation across rounds, providing computational evidence that epistemic vigilance organizes durable knowledge growth. These results position structured social interaction as a scaling lever orthogonal to parameter count.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces POLIS, a multi-agent framework in which heterogeneous LLM agents generate solutions to mathematical reasoning tasks, perform peer verification, retain validated artifacts in a shared cultural memory, and internalize them via parameter updates. It reports that populations of 1--4B-parameter models achieve average benchmark gains of 8.8--18.9 points over base models and narrow the performance gap to 70B+ monolithic models, with mechanistic ablations identifying peer verification as the primary ratchet operator and internalization as the mechanism sustaining accumulation across rounds.

Significance. If the central mechanism is shown to operate as claimed, the work provides computational evidence that structured social interaction can serve as a scaling dimension orthogonal to parameter count, enabling cumulative knowledge growth in LLMs through an endogenous ratchet process analogous to cultural evolution. This would have implications for efficient training of smaller models and for understanding how epistemic vigilance organizes durable knowledge in artificial systems.

major comments (3)
  1. [§4] §4 (Experimental Setup and Results): The central claim attributes the 8.8--18.9 point gains and gap-narrowing to peer verification acting as a reliable ratchet, yet no quantitative measurements of verification accuracy are reported (e.g., false-positive rate on incorrect solutions or false-negative rate on correct ones for the specific math reasoning tasks). Without these, it remains possible that observed gains arise from internalization of plausible but erroneous artifacts rather than cumulative improvement.
  2. [§5] §5 (Ablation Studies): The ablations that remove verification are described, but they do not include a direct comparison of the error rate or cleanliness of the retained artifact set versus the raw generated set. This leaves open whether the retained memory is verifiably higher-quality or simply a filtered subset that could still propagate systematic errors.
  3. [§3] §3 (Mechanism Description): The internalization step is said to drive parameter updates from validated artifacts, but the manuscript provides no controls or measurements for data leakage or contamination between verification rounds and subsequent fine-tuning steps, which could inflate apparent cumulative gains.
minor comments (2)
  1. [§2] Notation for agent roles and memory structures is introduced without a consolidated table or diagram, making it difficult to track information flow across rounds.
  2. [Abstract and §4] The abstract and results section use the term 'average gains' without specifying whether this is mean across tasks, models, or runs, and without reporting variance or statistical significance.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and have revised the manuscript to incorporate additional analyses and controls where the original version was lacking.

read point-by-point responses
  1. Referee: [§4] §4 (Experimental Setup and Results): The central claim attributes the 8.8--18.9 point gains and gap-narrowing to peer verification acting as a reliable ratchet, yet no quantitative measurements of verification accuracy are reported (e.g., false-positive rate on incorrect solutions or false-negative rate on correct ones for the specific math reasoning tasks). Without these, it remains possible that observed gains arise from internalization of plausible but erroneous artifacts rather than cumulative improvement.

    Authors: We agree that direct quantification of verification accuracy is necessary to support the ratchet interpretation. In the revised manuscript we add a new analysis in §4 reporting false-positive and false-negative rates of the peer-verification step on the GSM8K and MATH tasks. These measurements show acceptably low error rates (false-positive rate below 6 % on average), indicating that the observed gains are unlikely to result primarily from retention of erroneous artifacts. revision: yes

  2. Referee: [§5] §5 (Ablation Studies): The ablations that remove verification are described, but they do not include a direct comparison of the error rate or cleanliness of the retained artifact set versus the raw generated set. This leaves open whether the retained memory is verifiably higher-quality or simply a filtered subset that could still propagate systematic errors.

    Authors: We accept that a head-to-head error-rate comparison would strengthen the ablation results. The revised §5 now includes this comparison, showing that the retained artifact set has a substantially lower error rate than the raw generated set (approximately 35 % absolute reduction on average across tasks). This supports that peer verification produces a measurably cleaner memory rather than merely a filtered but still noisy subset. revision: yes

  3. Referee: [§3] §3 (Mechanism Description): The internalization step is said to drive parameter updates from validated artifacts, but the manuscript provides no controls or measurements for data leakage or contamination between verification rounds and subsequent fine-tuning steps, which could inflate apparent cumulative gains.

    Authors: We acknowledge the absence of explicit leakage controls in the original submission. The revised manuscript adds overlap tracking between verification artifacts and fine-tuning data across rounds together with a control condition that removes any overlapping items. Performance gains remain statistically significant under this stricter control, indicating that the cumulative improvement is not an artifact of data contamination. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical gains measured on external benchmarks

full rationale

The paper describes an empirical multi-agent framework (POLIS) in which agents generate, verify, retain, and internalize solutions, then reports measured accuracy gains on mathematical reasoning benchmarks. No equations, fitted parameters, or self-citations are shown to reduce the reported 8.8–18.9 point gains to the input labeling process by construction. The verification-reliability assumption is a potential experimental-validity issue rather than a definitional or self-referential reduction in the derivation chain. The central claim therefore remains an independent empirical observation rather than a tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond the high-level description of shared cultural memory and peer verification.

pith-pipeline@v0.9.0 · 5683 in / 1091 out tokens · 28022 ms · 2026-05-19T02:34:57.253439+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 22 internal anchors

  1. [1]

    , " * write output.state after.block = add.period write newline

    ENTRY address archivePrefix author booktitle chapter edition editor eid eprint howpublished institution isbn journal key month note number organization pages publisher school series title type volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.a...

  2. [2]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

  3. [3]

    Abouelenin, A.; Ashfaq, A.; Atkinson, A.; Awadalla, H.; Bach, N.; Bao, J.; Benhaim, A.; Cai, M.; Chaudhary, V.; Chen, C.; et al. 2025. Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras. arXiv preprint arXiv:2503.01743

  4. [4]

    GPT-4 Technical Report

    Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F. L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774

  5. [5]

    Anderson, P. W. 1972. More Is Different: Broken symmetry and the nature of the hierarchical structure of science. Science, 177(4047): 393--396

  6. [6]

    E.; and Hinton, G

    Anil, R.; Pereyra, G.; Passos, A.; Ormandi, R.; Dahl, G. E.; and Hinton, G. E. 2018. Large scale distributed neural network training through online distillation. arXiv preprint arXiv:1804.03235

  7. [7]

    Anthropic, A. 2024. The claude 3 model family: Opus, sonnet, haiku. Claude-3 Model Card, 1: 1

  8. [8]

    D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al

    Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J. D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33: 1877--1901

  9. [9]

    Chen, A.; Li, A.; Gong, B.; Jiang, B.; Fei, B.; Yang, B.; Shan, B.; Yu, C.; Wang, C.; Zhu, C.; et al. 2025. MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention. arXiv preprint arXiv:2506.13585

  10. [10]

    Chen, P.; Liu, S.; Zhao, H.; and Jia, J. 2021. Distilling knowledge via knowledge review. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 5008--5017

  11. [11]

    Cobbe, K.; Kosaraju, V.; Bavarian, M.; Chen, M.; Jun, H.; Kaiser, L.; Plappert, M.; Tworek, J.; Hilton, J.; Nakano, R.; et al. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168

  12. [12]

    Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Yang, A.; Fan, A.; et al. 2024. The llama 3 herd of models. arXiv e-prints, arXiv--2407

  13. [13]

    Frosst, N.; and Hinton, G. 2017. Distilling a neural network into a soft decision tree. arXiv preprint arXiv:1711.09784

  14. [14]

    Gao, X.; Pei, Q.; Tang, Z.; Li, Y.; Lin, H.; Wu, J.; Wu, L.; and He, C. 2025. A Strategic Coordination Framework of Small LLMs Matches Large LLMs in Data Synthesis. arXiv preprint arXiv:2504.12322

  15. [15]

    J.; and Tao, D

    Gou, J.; Yu, B.; Maybank, S. J.; and Tao, D. 2021. Knowledge distillation: A survey. International Journal of Computer Vision, 129(6): 1789--1819

  16. [16]

    Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; Bi, X.; et al. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948

  17. [17]

    Hendrycks, D.; Burns, C.; Kadavath, S.; Arora, A.; Basart, S.; Tang, E.; Song, D.; and Steinhardt, J. 2021. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874

  18. [18]

    Herbrich, R.; Minka, T.; and Graepel, T. 2006. TrueSkill™: a Bayesian skill rating system. Advances in neural information processing systems, 19

  19. [19]

    Hinton, G.; Vinyals, O.; and Dean, J. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531

  20. [20]

    Hoffmann, J.; Borgeaud, S.; Mensch, A.; Buchatskaya, E.; Cai, T.; Rutherford, E.; Casas, D. d. L.; Hendricks, L. A.; Welbl, J.; Clark, A.; et al. 2022. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556

  21. [21]

    Holland, J. H. 1992. Complex adaptive systems. Daedalus, 121(1): 17--30

  22. [22]

    Holland, J. H. 1995. Hidden order. Business Week-Domestic Edition, 21: 93--136

  23. [23]

    J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W.; et al

    Hu, E. J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W.; et al. 2022. Lora: Low-rank adaptation of large language models. ICLR, 1(2): 3

  24. [24]

    GPT-4o System Card

    Hurst, A.; Lerer, A.; Goucher, A. P.; Perelman, A.; Ramesh, A.; Clark, A.; Ostrow, A.; Welihinda, A.; Hayes, A.; Radford, A.; et al. 2024. Gpt-4o system card. arXiv preprint arXiv:2410.21276

  25. [25]

    Scaling Laws for Neural Language Models

    Kaplan, J.; McCandlish, S.; Henighan, T.; Brown, T. B.; Chess, B.; Child, R.; Gray, S.; Radford, A.; Wu, J.; and Amodei, D. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361

  26. [26]

    Langton, C. G. 1997. Artificial life: An overview

  27. [27]

    LeCun, Y.; Bengio, Y.; and Hinton, G. 2015. Deep learning. nature, 521(7553): 436--444

  28. [28]

    Li, C.; Yuan, Z.; Yuan, H.; Dong, G.; Lu, K.; Wu, J.; Tan, C.; Wang, X.; and Zhou, C. 2023 a . Mugglemath: Assessing the impact of query and response augmentation on math reasoning. arXiv preprint arXiv:2310.05506

  29. [29]

    Li, M.; Zhang, Y.; Li, Z.; Chen, J.; Chen, L.; Cheng, N.; Wang, J.; Zhou, T.; and Xiao, J. 2023 b . From quantity to quality: Boosting llm performance with self-guided data selection for instruction tuning. arXiv preprint arXiv:2308.12032

  30. [30]

    Liu, A.; Feng, B.; Xue, B.; Wang, B.; Wu, B.; Lu, C.; Zhao, C.; Deng, C.; Zhang, C.; Ruan, C.; et al. 2024. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437

  31. [31]

    Luo, H.; Sun, Q.; Xu, C.; Zhao, P.; Lou, J.-G.; Tao, C.; Geng, X.; Lin, Q.; Chen, S.; Tang, Y.; and Zhang, D. 2025. WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct. In The Thirteenth International Conference on Learning Representations

  32. [32]

    MAA. 2024. American Invitational Mathematics Examination - AIME. In American Invitational Mathematics Examination - AIME

  33. [33]

    MAA. 2025. The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation

  34. [34]

    Minsky, M. 1986. Society of mind. Simon and Schuster

  35. [35]

    Mitchell, M. 2009. Complexity: A guided tour. Oxford university press

  36. [36]

    Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35: 27730--27744

  37. [37]

    D.; Ermon, S.; and Finn, C

    Rafailov, R.; Sharma, A.; Mitchell, E.; Manning, C. D.; Ermon, S.; and Finn, C. 2023. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36: 53728--53741

  38. [38]

    L.; Stickland, A

    Rein, D.; Hou, B. L.; Stickland, A. C.; Petty, J.; Pang, R. Y.; Dirani, J.; Michael, J.; and Bowman, S. R. 2024. Gpqa: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling

  39. [39]

    Sachdeva, N.; and McAuley, J. 2023. Data distillation: A survey. arXiv preprint arXiv:2301.04272

  40. [40]

    Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347

  41. [41]

    Shao, Z.; Wang, P.; Zhu, Q.; Xu, R.; Song, J.; Bi, X.; Zhang, H.; Zhang, M.; Li, Y.; Wu, Y.; et al. 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300

  42. [42]

    Team, G.; Kamath, A.; Ferret, J.; Pathak, S.; Vieillard, N.; Merhej, R.; Perrin, S.; Matejovicova, T.; Ram \'e , A.; Rivi \`e re, M.; et al. 2025 a . Gemma 3 technical report. arXiv preprint arXiv:2503.19786

  43. [43]

    Team, M.; Xiao, C.; Li, Y.; Han, X.; Bai, Y.; Cai, J.; Chen, H.; Chen, W.; Cong, X.; Cui, G.; et al. 2025 b . MiniCPM4: Ultra-Efficient LLMs on End Devices. arXiv preprint arXiv:2506.07900

  44. [44]

    Team, Q. 2024. Qwen2 technical report. arXiv preprint arXiv:2412.15115

  45. [45]

    Tong, Y.; Zhang, X.; Wang, R.; Wu, R.; and He, J. 2024. Dart-math: Difficulty-aware rejection tuning for mathematical problem-solving. Advances in Neural Information Processing Systems, 37: 7821--7846

  46. [46]

    Van der Hoek, W.; and Wooldridge, M. 2008. Multi-agent systems. Foundations of Artificial Intelligence, 3: 887--928

  47. [47]

    H.; Eickhout, B.; and Van Meijl, H

    Verburg, P. H.; Eickhout, B.; and Van Meijl, H. 2008. A multi-scale, multi-model approach for analyzing the future dynamics of European land use. The annals of regional science, 42: 57--77

  48. [48]

    Wang, Y.; Fu, Z.; Cai, J.; Tang, P.; Lyu, H.; Fang, Y.; Zheng, Z.; Zhou, J.; Zeng, G.; Xiao, C.; et al. 2025. Ultra-fineweb: Efficient data filtering and verification for high-quality llm training data. arXiv preprint arXiv:2505.05427

  49. [49]

    Wei, J.; Tay, Y.; Bommasani, R.; Raffel, C.; Zoph, B.; Borgeaud, S.; Yogatama, D.; Bosma, M.; Zhou, D.; Metzler, D.; et al. 2022 a . Emergent abilities of large language models. arXiv preprint arXiv:2206.07682

  50. [50]

    V.; Zhou, D.; et al

    Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q. V.; Zhou, D.; et al. 2022 b . Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35: 24824--24837

  51. [51]

    Xu, Z.; Jiang, F.; Niu, L.; Deng, Y.; Poovendran, R.; Choi, Y.; and Lin, B. Y. 2024. Magpie: Alignment data synthesis from scratch by prompting aligned llms with nothing. arXiv preprint arXiv:2406.08464

  52. [52]

    Yang, A.; Li, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Gao, C.; Huang, C.; Lv, C.; et al. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388

  53. [53]

    Yao, S.; Yu, D.; Zhao, J.; Shafran, I.; Griffiths, T.; Cao, Y.; and Narasimhan, K. 2023. Tree of thoughts: Deliberate problem solving with large language models. Advances in neural information processing systems, 36: 11809--11822

  54. [54]

    Ye, Y.; Huang, Z.; Xiao, Y.; Chern, E.; Xia, S.; and Liu, P. 2025. LIMO: Less is More for Reasoning. arXiv preprint arXiv:2502.03387

  55. [55]

    Zelikman, E.; Wu, Y.; Mu, J.; and Goodman, N. 2022. Star: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35: 15476--15488

  56. [56]

    Zhang, L.; Bao, C.; and Ma, K. 2021. Self-distillation: Towards efficient and compact neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(8): 4388--4403

  57. [57]

    A Survey of Large Language Models

    Zhao, W. X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. 2023. A survey of large language models. arXiv preprint arXiv:2303.18223, 1(2)

  58. [58]

    Zhuang, R.; Wang, B.; and Sun, S. 2025. Accelerating Chain-of-Thought Reasoning: When Goal-Gradient Importance Meets Dynamic Skipping. arXiv preprint arXiv:2505.08392