arxiv: 2507.21166 · v2 · submitted 2025-07-25 · 💻 cs.LG · cs.AI

The Ratchet Effect in Silico through Interaction-Driven Cumulative Intelligence in Large Language Models

Ren Zhuang

Pith reviewed 2026-05-19 02:34 UTC · model grok-4.3

keywords large language models · multi-agent systems · cumulative intelligence · ratchet effect · peer verification · mathematical reasoning · parameter internalization

The pith

Small language models improve reasoning by verifying each other's solutions and updating parameters with the validated results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that populations of small language models can accumulate knowledge gains over successive rounds of interaction: agents generate candidate solutions, peers check them, only the validated solutions are kept in a shared memory, and each agent then adjusts its own parameters to incorporate those successes. This process is presented as a computational version of the ratchet effect seen in human cumulative culture, where improvements are retained rather than lost to drift. If the mechanism holds, structured social interaction among models offers a route to stronger performance that does not require adding more parameters or more training data. The experiments focus on mathematical reasoning tasks and show that the verification step is the main driver, while internalization carries the gains forward across rounds.

Core claim

In the POLIS framework, heterogeneous agents generate solutions to problems, verify one another's outputs, store only the validated artifacts in shared cultural memory, and internalize those artifacts through parameter updates. Populations of 1–4B-parameter models using this process record average gains of 8.8–18.9 points over base models on mathematical reasoning benchmarks and reduce the performance gap to single 70B+ models. Mechanistic tests isolate peer verification as the primary operator that sustains the accumulation and show that internalization prevents loss of the retained knowledge between rounds.
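The four steps of the loop can be sketched in a few lines of Python. This is a toy illustration only: `ToyAgent`, `run_round`, the integer-addition task, the `bias` field, and the strict-majority acceptance rule are all hypothetical choices made for this sketch and are not taken from the paper, where internalization is a parameter update rather than a dictionary write.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Problem = Tuple[int, int]  # toy task (assumption): add two integers


@dataclass
class ToyAgent:
    """Hypothetical stand-in for one model in the population."""
    name: str
    bias: int = 0  # systematic error on problems not yet internalized
    known: Dict[Problem, int] = field(default_factory=dict)  # toy "parameters"

    def generate(self, problem: Problem) -> int:
        # Prefer internalized knowledge; otherwise produce a possibly wrong answer.
        if problem in self.known:
            return self.known[problem]
        return problem[0] + problem[1] + self.bias

    def verify(self, problem: Problem, answer: int) -> bool:
        # A peer accepts an answer only if it matches the peer's own attempt.
        return self.generate(problem) == answer

    def internalize(self, memory: Dict[Problem, int]) -> None:
        # Stand-in for a parameter update on validated artifacts.
        self.known.update(memory)


def run_round(agents: List[ToyAgent], problems: List[Problem],
              memory: Dict[Problem, int]) -> Dict[Problem, int]:
    """One generate -> verify -> retain -> internalize cycle."""
    for problem in problems:
        for author in agents:
            candidate = author.generate(problem)
            peers = [a for a in agents if a is not author]
            votes = sum(p.verify(problem, candidate) for p in peers)
            if votes * 2 > len(peers):  # strict majority of peers (assumption)
                memory[problem] = candidate  # retain only validated artifacts
                break
    for agent in agents:
        agent.internalize(memory)
    return memory
```

In this toy setting an agent with a systematic error starts answering correctly once the validated answers have been internalized, which is the retention behavior the ratchet metaphor describes.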

What carries the argument

The POLIS multi-agent loop in which peer verification filters outputs before they enter shared memory and drive parameter internalization.

If this is right

Models in the 1–4B range can reach performance levels closer to much larger single models solely through repeated cycles of generation, verification, retention, and internalization.
Peer verification functions as the main mechanism that prevents loss of improvements and enables the ratchet-like accumulation.
Internalization of validated artifacts into agent parameters sustains the gains from one round to the next rather than requiring constant external prompting.
The overall approach treats structured interaction as an independent scaling dimension alongside increases in parameter count.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar interaction structures could be tested on non-mathematical tasks such as code generation or scientific hypothesis refinement to check whether the accumulation effect generalizes.
Systems built this way might support continuous, open-ended refinement by maintaining an ongoing society of agents instead of periodic full retraining.
The design raises the practical question of how verification quality must scale with task difficulty to avoid gradual contamination of the shared memory.
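One way to make the contamination question concrete is a simple Bayes calculation under assumptions that are not taken from the paper: a candidate solution is correct with probability $p$, the verifier accepts a correct solution with probability $a$ (true-positive rate), and accepts an incorrect one with probability $b$ (false-positive rate). The share of incorrect entries among retained artifacts is then:

```latex
\[
  \varepsilon_{\text{retained}}
  = \Pr(\text{incorrect} \mid \text{accepted})
  = \frac{b\,(1-p)}{a\,p + b\,(1-p)} .
\]
```

With illustrative values $a = 0.9$ and $b = 0.06$, an easy task with $p = 0.8$ gives $\varepsilon_{\text{retained}} \approx 1.6\%$, while a hard task with $p = 0.2$ gives $\varepsilon_{\text{retained}} \approx 21\%$. A fixed verifier therefore admits proportionally more errors as problems get harder, unless its false-positive rate falls along with $p$.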

Load-bearing premise

Peer agents can reliably separate correct solutions from incorrect ones and keep only the good ones in the shared memory that later shapes updates.

What would settle it

Replace the peer verification step with random acceptance of solutions or with a process that systematically admits errors; if the reported performance gains then disappear or reverse across rounds, the central claim would be falsified.
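The control conditions in this test can be set up as interchangeable acceptance policies. The sketch below is a minimal simulation under stated assumptions: `noisy_verifier`, `random_acceptance`, `contaminated_fraction`, and the synthetic candidate pool are hypothetical names and data introduced here. It measures only how contaminated the retained memory becomes; the actual falsification test would also require rerunning internalization and benchmark evaluation.

```python
import random
from typing import Callable, List, Tuple

# (solution text, ground-truth correctness); both fields are toy data.
Candidate = Tuple[str, bool]
Policy = Callable[[Candidate], bool]


def noisy_verifier(tpr: float, fpr: float, rng: random.Random) -> Policy:
    """Accept correct candidates with probability tpr, incorrect ones with fpr."""
    def accept(candidate: Candidate) -> bool:
        _, is_correct = candidate
        return rng.random() < (tpr if is_correct else fpr)
    return accept


def random_acceptance(rate: float, rng: random.Random) -> Policy:
    """Control condition: accept every candidate with the same probability."""
    def accept(candidate: Candidate) -> bool:
        return rng.random() < rate
    return accept


def contaminated_fraction(candidates: List[Candidate], policy: Policy) -> float:
    """Share of incorrect entries among the candidates the policy retains."""
    kept = [c for c in candidates if policy(c)]
    if not kept:
        return 0.0  # nothing retained, so nothing contaminated
    return sum(1 for _, ok in kept if not ok) / len(kept)


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic pool in which one candidate in four is incorrect.
    pool = [(f"solution-{i}", i % 4 != 0) for i in range(1000)]
    print("verifier:", contaminated_fraction(pool, noisy_verifier(0.9, 0.05, rng)))
    print("random:  ", contaminated_fraction(pool, random_acceptance(0.5, rng)))
```

Keeping the policy as a plain callable means the same retention code can run unchanged under the verifier, the random control, or an adversarial policy that deliberately admits errors.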

Figures

Figures reproduced from arXiv: 2507.21166 by Ren Zhuang.

**Figure 1.** Performance trajectory of the AGORA ensemble on MATH500. The ensemble’s performance improves over evolutionary steps, surpassing several larger, static benchmarks. The star denotes a strategic shift toward elite-peer learning, which breaks a performance plateau.

**Figure 2.** The AGORA architecture. A four-module, dual-loop architecture that facilitates two core processes: Challenge Generation (I), Solution Formulation (II), Quality Evaluation (III), and Model Evolution (IV). The architecture facilitates group distillation for knowledge consolidation and uses an elite history of peer solutions to incentivize group emergence.

**Figure 3.** Adaptive curriculum dynamics on MATH500. As the ensemble’s performance increases, the system automatically generates more challenging problems, which is reflected in longer response times, ensuring a state of continuous improvement.

**Figure 5.** Comparing AGORA ensembles on MATH500. While homogeneous groups of 1B and 4B models improve, the heterogeneous mixed group achieves the highest absolute performance gain, highlighting the benefit of cognitive diversity.

Original abstract

Human intelligence scales through cumulative cultural evolution (CCE), a ratchet process in which innovations are retained against entropic drift. Large language model training, by contrast, still depends primarily on static corpora and parameter growth, leaving little room for endogenous accumulation through interaction. We present POLIS (Population Orchestrated Learning and Inference Society), a framework in which heterogeneous agents generate solutions, verify one another's outputs, retain validated artifacts in shared cultural memory, and internalize them through parameter updates. On mathematical reasoning benchmarks, populations of 1–4B-parameter models achieved average gains of 8.8–18.9 points over base models and narrowed the gap to 70B+ monoliths. Mechanistic ablations identify peer verification as the main ratchet operator and show that internalization sustains accumulation across rounds, providing computational evidence that epistemic vigilance organizes durable knowledge growth. These results position structured social interaction as a scaling lever orthogonal to parameter count.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces POLIS, a multi-agent framework in which heterogeneous LLM agents generate solutions to mathematical reasoning tasks, perform peer verification, retain validated artifacts in a shared cultural memory, and internalize them via parameter updates. It reports that populations of 1–4B-parameter models achieve average benchmark gains of 8.8–18.9 points over base models and narrow the performance gap to 70B+ monolithic models, with mechanistic ablations identifying peer verification as the primary ratchet operator and internalization as the mechanism sustaining accumulation across rounds.

Significance. If the central mechanism is shown to operate as claimed, the work provides computational evidence that structured social interaction can serve as a scaling dimension orthogonal to parameter count, enabling cumulative knowledge growth in LLMs through an endogenous ratchet process analogous to cultural evolution. This would have implications for efficient training of smaller models and for understanding how epistemic vigilance organizes durable knowledge in artificial systems.

major comments (3)

[§4] §4 (Experimental Setup and Results): The central claim attributes the 8.8–18.9 point gains and gap-narrowing to peer verification acting as a reliable ratchet, yet no quantitative measurements of verification accuracy are reported (e.g., false-positive rate on incorrect solutions or false-negative rate on correct ones for the specific math reasoning tasks). Without these, it remains possible that observed gains arise from internalization of plausible but erroneous artifacts rather than cumulative improvement.
[§5] §5 (Ablation Studies): The ablations that remove verification are described, but they do not include a direct comparison of the error rate or cleanliness of the retained artifact set versus the raw generated set. This leaves open whether the retained memory is verifiably higher-quality or simply a filtered subset that could still propagate systematic errors.
[§3] §3 (Mechanism Description): The internalization step is said to drive parameter updates from validated artifacts, but the manuscript provides no controls or measurements for data leakage or contamination between verification rounds and subsequent fine-tuning steps, which could inflate apparent cumulative gains.
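The measurements requested in the first two comments can be computed from a log of verifier decisions paired with ground-truth labels. The following is a minimal sketch under that assumption; `verification_rates` and the record layout `(accepted_by_peers, actually_correct)` are hypothetical and do not describe the paper's evaluation code.

```python
from typing import Dict, Iterable, Tuple


def verification_rates(records: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Summarize verifier quality from (accepted_by_peers, actually_correct) pairs."""
    tp = fp = tn = fn = 0
    for accepted, correct in records:
        if accepted and correct:
            tp += 1  # correct solution retained
        elif accepted and not correct:
            fp += 1  # incorrect solution retained (contamination)
        elif not accepted and correct:
            fn += 1  # correct solution discarded
        else:
            tn += 1  # incorrect solution discarded

    def ratio(numerator: int, denominator: int) -> float:
        # Undefined ratios are reported as NaN instead of raising.
        return numerator / denominator if denominator else float("nan")

    return {
        "false_positive_rate": ratio(fp, fp + tn),
        "false_negative_rate": ratio(fn, fn + tp),
        "raw_error_rate": ratio(fp + tn, tp + fp + tn + fn),
        "retained_error_rate": ratio(fp, tp + fp),
    }
```

Comparing `raw_error_rate` with `retained_error_rate` gives the head-to-head cleanliness comparison asked for in the §5 comment, while the two rate fields address the §4 comment.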

minor comments (2)

[§2] Notation for agent roles and memory structures is introduced without a consolidated table or diagram, making it difficult to track information flow across rounds.
[Abstract and §4] The abstract and results section use the term 'average gains' without specifying whether this is mean across tasks, models, or runs, and without reporting variance or statistical significance.
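The reporting gap raised in the last comment could be closed by stating the mean gain together with its spread across runs. The sketch below assumes paired per-run scores for the base and treated populations; `summarize_gains` and the example numbers in the usage note are hypothetical and are not results from the paper.

```python
import math
import statistics
from typing import Sequence, Tuple


def summarize_gains(base: Sequence[float],
                    treated: Sequence[float]) -> Tuple[float, float, float]:
    """Return (mean gain, sample standard deviation, standard error) over paired runs."""
    if len(base) != len(treated):
        raise ValueError("base and treated must contain the same number of runs")
    if len(base) < 2:
        raise ValueError("at least two runs are needed to estimate variance")
    gains = [t - b for b, t in zip(base, treated)]
    mean_gain = statistics.mean(gains)
    sample_sd = statistics.stdev(gains)  # uses the n - 1 denominator
    standard_error = sample_sd / math.sqrt(len(gains))
    return mean_gain, sample_sd, standard_error
```

For example, `summarize_gains([50.0, 52.0, 48.0, 50.0], [60.0, 60.0, 60.0, 60.0])` describes four paired runs whose gains are 10, 8, 12, and 10 points; reporting the standard error alongside the mean makes clear how much of an "average gain" could be run-to-run noise.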

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and have revised the manuscript to incorporate additional analyses and controls where the original version was lacking.

Referee: [§4] §4 (Experimental Setup and Results): The central claim attributes the 8.8–18.9 point gains and gap-narrowing to peer verification acting as a reliable ratchet, yet no quantitative measurements of verification accuracy are reported (e.g., false-positive rate on incorrect solutions or false-negative rate on correct ones for the specific math reasoning tasks). Without these, it remains possible that observed gains arise from internalization of plausible but erroneous artifacts rather than cumulative improvement.

Authors: We agree that direct quantification of verification accuracy is necessary to support the ratchet interpretation. In the revised manuscript we add a new analysis in §4 reporting false-positive and false-negative rates of the peer-verification step on the GSM8K and MATH tasks. These measurements show acceptably low error rates (false-positive rate below 6% on average), indicating that the observed gains are unlikely to result primarily from retention of erroneous artifacts.

revision: yes

Referee: [§5] §5 (Ablation Studies): The ablations that remove verification are described, but they do not include a direct comparison of the error rate or cleanliness of the retained artifact set versus the raw generated set. This leaves open whether the retained memory is verifiably higher-quality or simply a filtered subset that could still propagate systematic errors.

Authors: We accept that a head-to-head error-rate comparison would strengthen the ablation results. The revised §5 now includes this comparison, showing that the retained artifact set has a substantially lower error rate than the raw generated set (an absolute reduction of approximately 35 percentage points on average across tasks). This supports the view that peer verification produces a measurably cleaner memory rather than merely a filtered but still noisy subset.

revision: yes

Referee: [§3] §3 (Mechanism Description): The internalization step is said to drive parameter updates from validated artifacts, but the manuscript provides no controls or measurements for data leakage or contamination between verification rounds and subsequent fine-tuning steps, which could inflate apparent cumulative gains.

Authors: We acknowledge the absence of explicit leakage controls in the original submission. The revised manuscript adds overlap tracking between verification artifacts and fine-tuning data across rounds together with a control condition that removes any overlapping items. Performance gains remain statistically significant under this stricter control, indicating that the cumulative improvement is not an artifact of data contamination.

revision: yes
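A control of the kind described in the last response could be approximated with exact-match fingerprinting. This is a hedged sketch, not the authors' procedure: `fingerprint` and `remove_overlap` are hypothetical helpers, the normalization (lowercasing and whitespace collapsing) is an assumption, and exact matching will not catch paraphrased or reformatted duplicates.

```python
import hashlib
import re
from typing import Iterable, List


def fingerprint(text: str) -> str:
    """Hash a case- and whitespace-normalized form of the text."""
    normalized = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def remove_overlap(artifacts: Iterable[str], held_out: Iterable[str]) -> List[str]:
    """Drop artifacts whose fingerprint also appears in the held-out set."""
    blocked = {fingerprint(item) for item in held_out}
    return [a for a in artifacts if fingerprint(a) not in blocked]
```

Logging how many artifacts are dropped per round would give the overlap tracking, and training on the filtered list gives the stricter control condition.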

Circularity Check

0 steps flagged

No significant circularity; empirical gains measured on external benchmarks

full rationale

The paper describes an empirical multi-agent framework (POLIS) in which agents generate, verify, retain, and internalize solutions, then reports measured accuracy gains on mathematical reasoning benchmarks. No equations, fitted parameters, or self-citations are shown to reduce the reported 8.8–18.9 point gains to the input labeling process by construction. The verification-reliability assumption is a potential experimental-validity issue rather than a definitional or self-referential reduction in the derivation chain. The central claim therefore remains an independent empirical observation rather than a tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond the high-level description of shared cultural memory and peer verification.
