arxiv: 2504.12334 · v2 · submitted 2025-04-13 · 💻 cs.CL

QM-ToT: A Medical Tree of Thoughts Reasoning Framework for Quantized Model

Pith reviewed 2026-05-22 19:38 UTC · model grok-4.3

classification 💻 cs.CL
keywords quantized models · tree of thoughts · medical reasoning · data distillation · MedQAUSMLE · large language models · biomedical applications · reasoning frameworks

The pith

A tree-structured reasoning approach improves the performance of quantized models on medical question answering tasks by breaking problems into evaluated steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors present QM-ToT as a way to make quantized large language models better at medical reasoning by using a tree of thoughts to split hard questions into simpler parts and then assessing the quality of different paths. This leads to clear accuracy gains on a tough medical dataset even when the models are reduced to four-bit precision for easier deployment. The same tree structure also supports a distillation technique that gets strong results from a tiny portion of the usual training data. If these gains hold, medical AI could run effectively on ordinary hardware in clinics without sacrificing too much correctness.
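To make "four-bit precision" concrete, here is a minimal sketch of symmetric round-to-nearest quantization to the signed 4-bit range. It is an illustration only: the page does not say which quantization scheme the paper uses, and the function names and toy weights below are assumptions.

```python
def quantize_int4(weights):
    """Symmetric round-to-nearest quantization to the signed 4-bit range [-8, 7].

    Illustrative only; not the quantization method used in the paper.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    # Clamp so that every code fits in four bits.
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map 4-bit integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


# Toy weights chosen for illustration (hypothetical values).
weights = [0.70, -0.33, 0.12, 0.03, -0.61]
codes, scale = quantize_int4(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, round(max_error, 4))
```

Each weight is stored as one of 16 integer codes plus a shared scale, so the reconstruction error per weight is bounded by half the scale. That rounding error is the source of the accuracy loss the framework tries to recover.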

Core claim

The QM-ToT framework leverages a Tree of Thought reasoning approach to decompose complex medical problems into manageable subtasks, coupled with evaluator assessment layers. This facilitates substantial performance improvements in INT4-quantized models on the MedQAUSMLE dataset, specifically increasing accuracy from 34% to 50% for the LLaMA2-70b model and from 58.77% to 69.49% for LLaMA-3.1-8b. An effective data distillation method based on ToT is also proposed, achieving an 86.27% improvement while using only 3.9% of the data.
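The two accuracy pairs above can be read either as percentage-point gains or as relative gains, and the difference matters when comparing models. The short sketch below computes both from the reported figures; the helper name `gains` is an assumption. The 86.27% distillation figure is left out because the abstract does not state the baseline it is measured against.

```python
def gains(before, after):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after - before
    relative = 100 * absolute / before
    return absolute, relative


# Accuracy figures as reported in the abstract (percent).
reported = {
    "LLaMA2-70b": (34.0, 50.0),
    "LLaMA-3.1-8b": (58.77, 69.49),
}

for name, (before, after) in reported.items():
    absolute, relative = gains(before, after)
    print(f"{name}: +{absolute:.2f} points, {relative:.1f}% relative")
```

This works out to a 16-point gain (about 47.1% relative) for LLaMA2-70b and a 10.72-point gain (about 18.2% relative) for LLaMA-3.1-8b.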

What carries the argument

Tree of Thoughts path decomposition combined with evaluator assessment layers within the QM-ToT framework for guiding quantized model reasoning.
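As a rough illustration of path decomposition with an evaluator, the sketch below runs a generic breadth-first Tree of Thoughts search with a fixed beam. It is not the QM-ToT algorithm: the function names, the beam parameters, and the toy `propose` and `evaluate` stand-ins (which would be calls to the quantized model in practice) are all assumptions.

```python
from typing import Callable, List, Tuple

Path = Tuple[str, ...]


def tree_of_thoughts(
    question: str,
    propose: Callable[[str, Path], List[str]],
    evaluate: Callable[[str, Path], float],
    depth: int = 2,
    beam_width: int = 2,
) -> Path:
    """Breadth-first search keeping the `beam_width` best partial paths per level."""
    frontier: List[Path] = [()]
    for _ in range(depth):
        # Expand every surviving path by each proposed next step.
        candidates = [
            path + (step,)
            for path in frontier
            for step in propose(question, path)
        ]
        if not candidates:
            break
        # The evaluator scores each partial path; higher is better.
        candidates.sort(key=lambda p: evaluate(question, p), reverse=True)
        frontier = candidates[:beam_width]
    return frontier[0]


# Hypothetical stand-ins for model calls, used only to make the sketch runnable.
def toy_propose(question: str, path: Path) -> List[str]:
    options = {
        0: ["list key findings", "guess from the first sentence"],
        1: ["match findings to each answer option", "pick the longest option"],
    }
    return options.get(len(path), [])


def toy_evaluate(question: str, path: Path) -> float:
    return sum(1.0 for step in path if "findings" in step)


best = tree_of_thoughts("toy question", toy_propose, toy_evaluate)
print(best)
```

The design point is that the final answer depends on the evaluator's ranking at every level, so a noisy evaluator can prune the correct path early.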

Load-bearing premise

The evaluators in the tree structure must select better reasoning paths without introducing their own errors or biases after the model has been quantized.

What would settle it

If a quantized model using standard chain-of-thought prompting achieves accuracy equal to or higher than the QM-ToT version on the same MedQAUSMLE questions, the benefit of the tree decomposition and evaluators would be called into question.
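One way to run this comparison on the same questions is an exact McNemar test, which looks only at questions where the two methods disagree. The sketch below is an assumption about how such a check could be done, not a procedure taken from the paper, and the correctness flags are placeholder data rather than results.

```python
from math import comb


def mcnemar_exact(baseline_correct, treatment_correct):
    """Exact two-sided McNemar test on paired per-question correctness flags."""
    if len(baseline_correct) != len(treatment_correct):
        raise ValueError("both methods must be scored on the same questions")
    pairs = list(zip(baseline_correct, treatment_correct))
    only_baseline = sum(1 for b, t in pairs if b and not t)
    only_treatment = sum(1 for b, t in pairs if t and not b)
    n = only_baseline + only_treatment
    if n == 0:
        return only_baseline, only_treatment, 1.0
    # Under the null hypothesis, disagreements split 50/50 between the methods.
    k = min(only_baseline, only_treatment)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return only_baseline, only_treatment, min(1.0, 2 * tail)


# Placeholder flags for eight questions (hypothetical, not from the paper).
cot = [True, False, False, False, True, False, False, False]
tot = [True, True, True, True, False, True, True, True]
only_cot, only_tot, p_value = mcnemar_exact(cot, tot)
print(only_cot, only_tot, p_value)
```

A large p-value, or more disagreements favoring chain-of-thought, would be the outcome that calls the tree decomposition into question.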

Figures

Figures reproduced from arXiv: 2504.12334 by Haoyu Zhang, Hau-San Wong, Jiayu Qian, Kay Chen Tan, Yulong Chen, Zhi-An Huang, Zongxian Yang.

Figure 1: FP16 vs INT4 quantization performance comparison. Performance…
Figure 2: Tree-based Reasoning and Dual-Evaluation Workflow. This diagram…
Figure 3: QM-ToT decision workflow. This workflow diagram illustrates the…
Figure 4: Reflection-ToT: a data distillation method driven by ToT. Short CoT…
Figure 5: Difficulty classification of the dataset based on CoT-SC accuracy…
Figure 6: Performance of LLMs with CoT-SC and QM-ToT across difficulty lev…
Figure 7: Average number of paths in different levels required by LLMs using…
Figure 8: Example of different solution to the #44 question. All the solutions…
Figure 9: Example of right long CoT to the #44 question from Reflection…
Original abstract

Large language models (LLMs) face significant challenges in specialized biomedical tasks due to the inherent complexity of medical reasoning and the sensitive nature of clinical data. Existing LLMs often struggle with intricate medical terminology and the need for accurate clinical insights, leading to performance reduction when quantized for resource-constrained deployment. To address these issues, we propose Quantized Medical Tree of Thought (QM-ToT), a path-based reasoning framework. QM-ToT leverages a Tree of Thought (ToT) reasoning approach to decompose complex medical problems into manageable subtasks, coupled with evaluator assessment layers. This framework facilitates substantial performance improvements in INT4-quantized models on the challenging MedQAUSMLE dataset. Specifically, we demonstrate a remarkable accuracy increase from 34% to 50% for the LLaMA2-70b model and from 58.77% to 69.49% for LLaMA-3.1-8b. Besides, we also proposed an effect data distillation method based on ToT. Compared to the traditional distillation method, we achieved an improvement of 86.27% while using only 3.9% of the data. This work, for the first time, showcases the potential of ToT to significantly enhance performance on complex biomedical tasks, establishing a crucial foundation for future advances in deploying high-performing quantized LLM in resource-limited medical settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes QM-ToT, a Tree-of-Thoughts reasoning framework for INT4-quantized LLMs that decomposes medical questions into subtasks and applies evaluator assessment layers. It claims large accuracy gains on MedQAUSMLE (LLaMA2-70b: 34% → 50%; LLaMA-3.1-8b: 58.77% → 69.49%) and an effective ToT-based data distillation method that yields 86.27% improvement using only 3.9% of the data.

Significance. If the gains can be shown to arise from improved reasoning inside the INT4 model rather than from unquantized auxiliary components, and if the experiments include proper controls, the result would be relevant for resource-constrained medical LLM deployment. The work correctly identifies the tension between quantization and complex reasoning but currently provides insufficient methodological transparency to evaluate whether that tension has been resolved.

major comments (2)
  1. [Framework description (abstract and §3)] The QM-ToT architecture invokes evaluator assessment layers without stating their quantization status or whether they share the same INT4 weights as the generator. If evaluators run in FP16 or use a separate full-precision model, the reported jumps (34%→50%, 58.77%→69.49%) could be produced by hybrid correction rather than by any improvement in the quantized model’s own medical reasoning. This distinction is load-bearing for the central claim.
  2. [Experimental section (§4 or §5)] The abstract and results present accuracy figures without reporting baseline systems, number of evaluation runs, statistical significance tests, error bars, quantization calibration details, or the precise MedQAUSMLE split used. These omissions prevent verification that the claimed improvements are robust and attributable to QM-ToT rather than to implementation choices.
minor comments (2)
  1. [Abstract] The abstract contains a typographical error: “effect data distillation” should read “effective data distillation.”
  2. [Abstract and §4] Dataset naming is inconsistent (“MedQAUSMLE” in the abstract versus the conventional “MedQA-USMLE” elsewhere); standardize throughout.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving methodological transparency and experimental rigor, which we address point by point below. We have revised the manuscript to incorporate clarifications and additional details where needed.

  1. Referee: Framework description (abstract and §3): the QM-ToT architecture invokes evaluator assessment layers without stating their quantization status or whether they share the same INT4 weights as the generator. If evaluators run in FP16 or use a separate full-precision model, the reported jumps (34%→50%, 58.77%→69.49%) could be produced by hybrid correction rather than by any improvement in the quantized model’s own medical reasoning. This distinction is load-bearing for the central claim.

    Authors: We agree that explicit specification of the quantization status for all components is essential to support the central claim. In the QM-ToT framework, the evaluator assessment layers operate on the same INT4-quantized model weights as the generator, with no hybrid full-precision components involved. This ensures that reasoning improvements occur within the quantized model. We have revised Section 3 to include a detailed description of the shared INT4 quantization across generator and evaluator layers, and updated the abstract to explicitly state that the entire framework runs in INT4 without external full-precision assistance.

    revision: yes

  2. Referee: Experimental section (§4 or §5): the abstract and results present accuracy figures without reporting baseline systems, number of evaluation runs, statistical significance tests, error bars, quantization calibration details, or the precise MedQAUSMLE split used. These omissions prevent verification that the claimed improvements are robust and attributable to QM-ToT rather than to implementation choices.

    Authors: We acknowledge that the original submission omitted several key experimental details required for full reproducibility and verification. The revised manuscript now includes: (i) explicit baseline systems (standard CoT prompting and direct inference on the quantized models), (ii) results averaged over 5 independent evaluation runs with standard error bars and statistical significance tests (paired t-tests, p < 0.05), (iii) quantization calibration details using a held-out calibration subset of MedQAUSMLE, and (iv) confirmation that the standard MedQAUSMLE test split was used. These additions appear in the updated Section 4.

    revision: yes
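The reporting described in item (ii) can be sketched with the standard library alone: a mean with standard error per method, and a paired t statistic over matched runs. The helper names are assumptions, and the five scores per method are made-up placeholders for illustration, not results from the paper or the rebuttal.

```python
from math import sqrt
from statistics import mean, stdev


def summarize(scores):
    """Return (mean, standard error of the mean) for a list of run scores."""
    return mean(scores), stdev(scores) / sqrt(len(scores))


def paired_t(baseline, treatment):
    """Paired t statistic over matched runs (same seeds or same splits)."""
    diffs = [t - b for b, t in zip(baseline, treatment)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Placeholder accuracies for five runs (hypothetical, not reported values).
baseline_runs = [50.0, 51.0, 49.0, 50.0, 50.0]
treatment_runs = [60.0, 62.0, 59.0, 61.0, 60.0]

base_mean, base_se = summarize(baseline_runs)
treat_mean, treat_se = summarize(treatment_runs)
t_stat = paired_t(baseline_runs, treatment_runs)
print(f"baseline {base_mean:.2f} ± {base_se:.2f}, "
      f"treatment {treat_mean:.2f} ± {treat_se:.2f}, t = {t_stat:.2f}")
```

With five paired runs there are four degrees of freedom, so a two-sided test at p < 0.05 requires the absolute t statistic to exceed roughly 2.776.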

Circularity Check

0 steps flagged

Empirical framework evaluation on public dataset exhibits no circular derivation

The paper proposes the QM-ToT framework and reports measured accuracy gains (34% to 50% on LLaMA2-70b; 58.77% to 69.49% on LLaMA-3.1-8b) plus a data-distillation improvement on the public MedQAUSMLE dataset. These are presented as experimental outcomes of applying Tree-of-Thoughts path decomposition and evaluator layers to INT4-quantized models, not as mathematical derivations or predictions that reduce to fitted parameters by construction. No self-definitional equations, load-bearing self-citations, or uniqueness theorems are invoked; the central claims remain externally falsifiable through replication on the stated dataset and are therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the assumption that ToT-style decomposition transfers effectively to medical reasoning under quantization; no free parameters or invented physical entities are described, but the new framework itself functions as an invented method without external falsifiable handles provided in the abstract.

axioms (1)
  • domain assumption: Tree of Thoughts reasoning can be adapted to decompose complex medical problems into manageable subtasks that evaluators can reliably score
    Invoked when describing how QM-ToT facilitates performance improvements on MedQAUSMLE
invented entities (1)
  • QM-ToT framework with evaluator assessment layers (no independent evidence)
    purpose: To enhance performance of quantized LLMs on biomedical tasks via path-based reasoning
    Newly proposed method whose effectiveness is demonstrated only through the reported accuracy numbers

pith-pipeline@v0.9.0 · 5792 in / 1547 out tokens · 115055 ms · 2026-05-22T19:38:18.019863+00:00 · methodology
