QM-ToT: A Medical Tree of Thoughts Reasoning Framework for Quantized Model

Haoyu Zhang; Hau-San Wong; Jiayu Qian; Kay Chen Tan; Yulong Chen; Zhi-An Huang; Zongxian Yang

arxiv: 2504.12334 · v2 · submitted 2025-04-13 · 💻 cs.CL

Zongxian Yang, Jiayu Qian, Kay Chen Tan, Hau-San Wong, Yulong Chen, Haoyu Zhang, Zhi-An Huang

Pith reviewed 2026-05-22 19:38 UTC · model grok-4.3

classification 💻 cs.CL

keywords quantized models · tree of thoughts · medical reasoning · data distillation · MedQAUSMLE · large language models · biomedical applications · reasoning frameworks

The pith

A tree-structured reasoning approach improves the performance of quantized models on medical question answering tasks by breaking problems into evaluated steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors present QM-ToT as a way to make quantized large language models better at medical reasoning: a tree of thoughts splits hard questions into simpler parts, and evaluators assess the quality of the different reasoning paths. This leads to clear accuracy gains on the MedQAUSMLE dataset even when the models are reduced to four-bit precision for easier deployment. The same tree structure also supports a distillation technique that gets strong results from only 3.9% of the usual training data. If these gains hold, medical AI could run effectively on ordinary hardware in clinics without sacrificing too much correctness.

Core claim

The QM-ToT framework leverages a Tree of Thought reasoning approach to decompose complex medical problems into manageable subtasks, coupled with evaluator assessment layers. This facilitates substantial performance improvements in INT4-quantized models on the MedQAUSMLE dataset, specifically increasing accuracy from 34% to 50% for the LLaMA2-70b model and from 58.77% to 69.49% for LLaMA-3.1-8b. An effective data distillation method based on ToT is also proposed, achieving an 86.27% improvement while using only 3.9% of the data.
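The two accuracy pairs quoted above can be read either as absolute gains (percentage points) or as relative gains. The short Python sketch below computes both from the figures stated in the claim. The helper name `gains` is illustrative and does not come from the paper. The 86.27% distillation figure is left out because the source does not say what it is measured against.

```python
def gains(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return absolute, relative


# Accuracy figures (percent) exactly as quoted in the core claim.
reported = {
    "LLaMA2-70b": (34.0, 50.0),
    "LLaMA-3.1-8b": (58.77, 69.49),
}

for model, (before, after) in reported.items():
    absolute, relative = gains(before, after)
    print(f"{model}: +{absolute:.2f} points, +{relative:.1f}% relative")
```

Reporting both views matters here because the smaller absolute gain on LLaMA-3.1-8b starts from a much higher baseline.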

What carries the argument

Tree of Thoughts path decomposition combined with evaluator assessment layers within the QM-ToT framework for guiding quantized model reasoning.
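To make the mechanism concrete, here is a minimal Python sketch of tree search with an evaluator pruning step. It is not the authors' implementation: `propose` and `evaluate` are hypothetical stand-ins for calls to the quantized model, and the depth, beam width, and toy scoring rule are assumptions chosen only so the sketch runs.

```python
from typing import Callable, List, Tuple

Path = Tuple[str, ...]  # one reasoning path = a sequence of thought steps


def tree_search(
    question: str,
    propose: Callable[[str, Path], List[str]],
    evaluate: Callable[[str, Path], float],
    depth: int = 3,
    beam_width: int = 2,
) -> Path:
    """Breadth-first tree search that keeps the best-scored paths per level."""
    frontier: List[Path] = [()]
    for _ in range(depth):
        # Expand every surviving path with each proposed next step.
        candidates = [
            path + (step,)
            for path in frontier
            for step in propose(question, path)
        ]
        if not candidates:
            break
        # Evaluator step: score each candidate path and prune to the beam.
        candidates.sort(key=lambda p: evaluate(question, p), reverse=True)
        frontier = candidates[:beam_width]
    return frontier[0]


# Toy stand-ins (hypothetical): a real system would prompt the model here.
def toy_propose(question: str, path: Path) -> List[str]:
    return [f"step{len(path) + 1}-a", f"step{len(path) + 1}-b"]


def toy_evaluate(question: str, path: Path) -> float:
    # Arbitrary rule for the demo: prefer paths with more "-a" steps.
    return float(sum(step.endswith("-a") for step in path))


best = tree_search("Which drug is indicated?", toy_propose, toy_evaluate)
print(best)
```

Passing the generator and evaluator as separate callables keeps open the question raised in the editorial analysis: whether both roles are played by the same INT4 model or by different components.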

Load-bearing premise

The evaluators in the tree structure must select better reasoning paths without introducing their own errors or biases after the model has been quantized.

What would settle it

If a quantized model using standard chain-of-thought prompting achieves accuracy equal to or higher than the QM-ToT version on the same MedQAUSMLE questions, the benefit of the tree decomposition and evaluators would be called into question.

Figures

Figures reproduced from arXiv: 2504.12334 by Haoyu Zhang, Hau-San Wong, Jiayu Qian, Kay Chen Tan, Yulong Chen, Zhi-An Huang, Zongxian Yang.

**Figure 2.** Tree-based Reasoning and Dual-Evaluation Workflow. This diagram … [PITH_FULL_IMAGE:figures/full_fig_p002_2.png]

**Figure 3.** QM-ToT decision workflow. This workflow diagram illustrates the … [PITH_FULL_IMAGE:figures/full_fig_p002_3.png]

**Figure 4.** Reflection-ToT: a data distillation method driven by ToT. Short CoT … [PITH_FULL_IMAGE:figures/full_fig_p003_4.png]

**Figure 5.** Difficulty classification of the dataset based on CoT-SC accuracy … [PITH_FULL_IMAGE:figures/full_fig_p005_5.png]

**Figure 6.** Performance of LLMs with CoT-SC and QM-ToT across difficulty lev… [PITH_FULL_IMAGE:figures/full_fig_p005_6.png]

**Figure 7.** Average number of paths in different levels required by LLMs using … [PITH_FULL_IMAGE:figures/full_fig_p005_7.png]

**Figure 8.** Example of different solution to the #44 question. All the solutions … [PITH_FULL_IMAGE:figures/full_fig_p006_8.png]

**Figure 9.** Example of right long CoT to the #44 question from Reflection … [PITH_FULL_IMAGE:figures/full_fig_p007_9.png]

read the original abstract

Large language models (LLMs) face significant challenges in specialized biomedical tasks due to the inherent complexity of medical reasoning and the sensitive nature of clinical data. Existing LLMs often struggle with intricate medical terminology and the need for accurate clinical insights, leading to performance reduction when quantized for resource-constrained deployment. To address these issues, we propose Quantized Medical Tree of Thought (QM-ToT), a path-based reasoning framework. QM-ToT leverages a Tree of Thought (ToT) reasoning approach to decompose complex medical problems into manageable subtasks, coupled with evaluator assessment layers. This framework facilitates substantial performance improvements in INT4-quantized models on the challenging MedQAUSMLE dataset. Specifically, we demonstrate a remarkable accuracy increase from 34% to 50% for the LLaMA2-70b model and from 58.77% to 69.49% for LLaMA-3.1-8b. Besides, we also proposed an effect data distillation method based on ToT. Compared to the traditional distillation method, we achieved an improvement of 86.27% while using only 3.9% of the data. This work, for the first time, showcases the potential of ToT to significantly enhance performance on complex biomedical tasks, establishing a crucial foundation for future advances in deploying high-performing quantized LLM in resource-limited medical settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

QM-ToT applies Tree of Thoughts to INT4 medical LLMs and reports sizable accuracy lifts on MedQAUSMLE, but the gains may depend on unquantized evaluators rather than improved reasoning inside the quantized model itself.

read the letter

The main thing to know is that this paper takes the existing Tree of Thoughts method, adds evaluator layers, and applies the whole setup to INT4-quantized models on a medical QA task. It claims clear accuracy gains and a data-efficient distillation trick, but the experimental write-up leaves the source of those gains ambiguous.

The work is a straightforward domain extension rather than a new algorithm or theoretical result. It does a reasonable job framing the practical problem of running medical LLMs on limited hardware and shows how path-based reasoning might offset some quantization damage. The distillation claim, using only 3.9% of the data for an 86% relative improvement, is the part that feels most immediately useful if it holds up. The paper also cites the relevant ToT and quantization literature without obvious gaps.

The soft spot is the missing detail on whether the evaluator layers themselves stay in INT4 or run at higher precision. If the evaluators are full-precision or use a separate unquantized model, the reported jumps from 34% to 50% and 58.77% to 69.49% could simply reflect the evaluators correcting quantization errors rather than the quantized model doing better medical reasoning. The abstract and framework description do not state the quantization status of those layers, and the stress-test concern lands directly on the central claim. Without that clarification, plus the usual missing pieces like baseline tables, statistical tests, and error bars, it is hard to treat the numbers as strong evidence.

This paper is for people working on efficient deployment of medical LLMs who already know ToT and quantization basics. A reader looking for concrete implementation ideas or a starting point for follow-up experiments could get something out of it. It is coherent enough on its own terms to deserve a serious referee, even though the current version would need revisions on the experimental controls and component precision details before publication.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes QM-ToT, a Tree-of-Thoughts reasoning framework for INT4-quantized LLMs that decomposes medical questions into subtasks and applies evaluator assessment layers. It claims large accuracy gains on MedQAUSMLE (LLaMA2-70b: 34% → 50%; LLaMA-3.1-8b: 58.77% → 69.49%) and an effective ToT-based data distillation method that yields 86.27% improvement using only 3.9% of the data.

Significance. If the gains can be shown to arise from improved reasoning inside the INT4 model rather than from unquantized auxiliary components, and if the experiments include proper controls, the result would be relevant for resource-constrained medical LLM deployment. The work correctly identifies the tension between quantization and complex reasoning but currently provides insufficient methodological transparency to evaluate whether that tension has been resolved.

major comments (2)

[Framework description (abstract and §3)] The QM-ToT architecture invokes evaluator assessment layers without stating their quantization status or whether they share the same INT4 weights as the generator. If evaluators run in FP16 or use a separate full-precision model, the reported jumps (34%→50%, 58.77%→69.49%) could be produced by hybrid correction rather than by any improvement in the quantized model’s own medical reasoning. This distinction is load-bearing for the central claim.
[Experimental section (§4 or §5)] The abstract and results present accuracy figures without reporting baseline systems, number of evaluation runs, statistical significance tests, error bars, quantization calibration details, or the precise MedQAUSMLE split used. These omissions prevent verification that the claimed improvements are robust and attributable to QM-ToT rather than to implementation choices.

minor comments (2)

[Abstract] The abstract contains a typographical error: “effect data distillation” should read “effective data distillation.”
[Abstract and §4] Dataset naming is inconsistent (“MedQAUSMLE” in the abstract versus the conventional “MedQA-USMLE” elsewhere); standardize throughout.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving methodological transparency and experimental rigor, which we address point by point below. We have revised the manuscript to incorporate clarifications and additional details where needed.

read point-by-point responses

Referee: Framework description (abstract and §3): the QM-ToT architecture invokes evaluator assessment layers without stating their quantization status or whether they share the same INT4 weights as the generator. If evaluators run in FP16 or use a separate full-precision model, the reported jumps (34%→50%, 58.77%→69.49%) could be produced by hybrid correction rather than by any improvement in the quantized model’s own medical reasoning. This distinction is load-bearing for the central claim.

Authors: We agree that explicit specification of the quantization status for all components is essential to support the central claim. In the QM-ToT framework, the evaluator assessment layers operate on the same INT4-quantized model weights as the generator, with no hybrid full-precision components involved. This ensures that reasoning improvements occur within the quantized model. We have revised Section 3 to include a detailed description of the shared INT4 quantization across generator and evaluator layers, and updated the abstract to explicitly state that the entire framework runs in INT4 without external full-precision assistance.

revision: yes

Referee: Experimental section (§4 or §5): the abstract and results present accuracy figures without reporting baseline systems, number of evaluation runs, statistical significance tests, error bars, quantization calibration details, or the precise MedQAUSMLE split used. These omissions prevent verification that the claimed improvements are robust and attributable to QM-ToT rather than to implementation choices.

Authors: We acknowledge that the original submission omitted several key experimental details required for full reproducibility and verification. The revised manuscript now includes: (i) explicit baseline systems (standard CoT prompting and direct inference on the quantized models), (ii) results averaged over 5 independent evaluation runs with standard error bars and statistical significance tests (paired t-tests, p < 0.05), (iii) quantization calibration details using a held-out calibration subset of MedQAUSMLE, and (iv) confirmation that the standard MedQAUSMLE test split was used. These additions appear in the updated Section 4.

revision: yes
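The paired t-test over 5 runs mentioned in the rebuttal can be sketched with the Python standard library alone. The per-run accuracies below are made-up placeholders, not results from the paper, and the names `paired_t_statistic`, `cot_runs`, and `tot_runs` are illustrative. The constant 2.776 is the two-sided 5% critical value of Student's t for 4 degrees of freedom, which is what 5 paired runs give.

```python
import math
import statistics


def paired_t_statistic(baseline: list[float], treated: list[float]) -> float:
    """t statistic for the mean of per-run differences (treated - baseline)."""
    if len(baseline) != len(treated) or len(baseline) < 2:
        raise ValueError("need two equal-length samples with at least 2 runs")
    diffs = [t - b for b, t in zip(baseline, treated)]
    mean_diff = statistics.mean(diffs)
    # Standard error of the mean difference (sample standard deviation / sqrt(n)).
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff / std_err


# PLACEHOLDER accuracies (percent) for 5 runs; invented for illustration only.
cot_runs = [60.0, 61.0, 59.0, 62.0, 60.0]
tot_runs = [63.0, 65.0, 62.0, 64.0, 64.0]

t_stat = paired_t_statistic(cot_runs, tot_runs)
T_CRIT_DF4 = 2.776  # two-sided critical value at alpha = 0.05, df = 4
print(f"t = {t_stat:.2f}, significant at 5%: {abs(t_stat) > T_CRIT_DF4}")
```

Pairing only makes sense if each run of the baseline and of QM-ToT shares the same seed or question subset; otherwise an unpaired test would be the appropriate choice.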

Circularity Check

0 steps flagged

Empirical framework evaluation on public dataset exhibits no circular derivation

full rationale

The paper proposes the QM-ToT framework and reports measured accuracy gains (34% to 50% on LLaMA2-70b; 58.77% to 69.49% on LLaMA-3.1-8b) plus a data-distillation improvement on the public MedQAUSMLE dataset. These are presented as experimental outcomes of applying Tree-of-Thoughts path decomposition and evaluator layers to INT4-quantized models, not as mathematical derivations or predictions that reduce to fitted parameters by construction. No self-definitional equations, load-bearing self-citations, or uniqueness theorems are invoked; the central claims remain externally falsifiable through replication on the stated dataset and therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the assumption that ToT-style decomposition transfers effectively to medical reasoning under quantization; no free parameters or invented physical entities are described, but the new framework itself functions as an invented method without external falsifiable handles provided in the abstract.

axioms (1)

domain assumption: Tree of Thoughts reasoning can be adapted to decompose complex medical problems into manageable subtasks that evaluators can reliably score
Invoked when describing how QM-ToT facilitates performance improvements on MedQAUSMLE

invented entities (1)

QM-ToT framework with evaluator assessment layers (no independent evidence)
purpose: To enhance performance of quantized LLMs on biomedical tasks via path-based reasoning
Newly proposed method whose effectiveness is demonstrated only through the reported accuracy numbers

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

QM-ToT leverages a Tree of Thought (ToT) reasoning approach to decompose complex medical problems into manageable subtasks, coupled with evaluator assessment layers... fs = α · exp(r) + (1 − α) · exp(c)
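The scoring expression quoted in this passage is easy to evaluate numerically. The Python sketch below assumes, since the excerpt does not define them, that r and c are two scalar evaluator scores for a path and that α lies in [0, 1] and weights them. The function name `final_score` and the sample inputs are illustrative only.

```python
import math


def final_score(r: float, c: float, alpha: float) -> float:
    """Compute fs = alpha * exp(r) + (1 - alpha) * exp(c)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is assumed to lie in [0, 1]")
    return alpha * math.exp(r) + (1.0 - alpha) * math.exp(c)


# Arbitrary illustrative inputs; not values from the paper.
print(final_score(r=0.0, c=0.0, alpha=0.5))  # both scores zero
print(final_score(r=1.0, c=0.0, alpha=1.0))  # only r contributes
```

Because both terms pass through exp, the combined score grows faster than linearly in either input, so a path that scores high on one component can dominate the ranking even at a moderate α.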

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

We quantize medical problem-solving into discrete paths, forming a ToT structure where each node represents a path in the reasoning process.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 10 internal anchors

[1]

Identifying autism spectrum disorder from resting-state fmri using deep belief network,

Z.-A. Huang, Z. Zhu, C. H. Yau, and K. C. Tan, “Identifying autism spectrum disorder from resting-state fmri using deep belief network,” IEEE Transactions on Neural Networks and Learning Systems , vol. 32, no. 7, pp. 2847–2861, 2021

work page 2021
[2]

Mixed prototype correction for causal inference in medical image classification,

Y . Zhang, Z.-A. Huang, Z. Hong, S. Wu, J. Wu, and K. C. Tan, “Mixed prototype correction for causal inference in medical image classification,” in Proceedings of the 32nd ACM International Conference on Multimedia , ser. MM ’24. New York, NY , USA: Association for Computing Machinery, 2024, p. 4377–4386. [Online]. Available: https://doi.org/10.1145/36646...

work page doi:10.1145/3664647.3681395 2024
[3]

Heterogeneous structured federated learning with graph convolutional aggregation for mri-based mental disorder diagnosis,

Y . Hu, R. Liu, J. Zhang, Z.-A. Huang, L. Song, and K. C. Tan, “Heterogeneous structured federated learning with graph convolutional aggregation for mri-based mental disorder diagnosis,” in 2024 Interna- tional Joint Conference on Neural Networks (IJCNN) , 2024, pp. 1–8

work page 2024
[4]

Trialling a large language model (chatgpt) in general practice with the applied knowledge test: ob- servational study demonstrating opportunities and limitations in primary care,

A. J. Thirunavukarasu, R. Hassan, S. Mahmood, R. Sanghera, K. Barzangi, M. El Mukashfi, and S. Shah, “Trialling a large language model (chatgpt) in general practice with the applied knowledge test: ob- servational study demonstrating opportunities and limitations in primary care,” JMIR Medical Education , vol. 9, no. 1, p. e46599, 2023

work page 2023
[5]

A preliminary study of o1 in medicine: Are we closer to an ai doctor?

Y . Xie, J. Wu, H. Tu, S. Yang, B. Zhao, Y . Zong, Q. Jin, C. Xie, and Y . Zhou, “A preliminary study of o1 in medicine: Are we closer to an ai doctor?” arXiv preprint arXiv:2409.15277 , 2024

work page arXiv 2024
[6]

A survey on medical large language models: Technology, application, trustworthiness, and future directions,

L. Liu, X. Yang, J. Lei, X. Liu, Y . Shen, Z. Zhang, P. Wei, J. Gu, Z. Chu, Z. Qin et al. , “A survey on medical large language models: Technology, application, trustworthiness, and future directions,” arXiv preprint arXiv:2406.03712, 2024

work page arXiv 2024
[7]

An empirical analysis and resource footprint study of deploying large language models on edge devices,

N. Dhar, B. Deng, D. Lo, X. Wu, L. Zhao, and K. Suo, “An empirical analysis and resource footprint study of deploying large language models on edge devices,” in Proceedings of the 2024 ACM Southeast Conference, 2024, pp. 69–76

work page 2024
[8]

Precision or Peril: A PoC of Python Code Quality from Quantized Large Language Models

E. L. Melin, A. J. Torek, N. U. Eisty, and C. Kennington, “Precision or peril: Evaluating code quality from quantized large language models,” arXiv preprint arXiv:2411.10656 , 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

Li et al

S. Li, X. Ning, L. Wang, T. Liu, X. Shi, S. Yan, G. Dai, H. Yang, and Y . Wang, “Evaluating quantized large language models,”arXiv preprint arXiv:2402.18158, 2024

work page arXiv 2024
[10]

What disease does this patient have? a large-scale open domain question answering dataset from medical exams,

D. Jin, E. Pan, N. Oufattole, W.-H. Weng, H. Fang, and P. Szolovits, “What disease does this patient have? a large-scale open domain question answering dataset from medical exams,” Applied Sciences , vol. 11, no. 14, p. 6421, 2021

work page 2021
[11]

Tree of thoughts: Deliberate problem solving with large language models,

S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y . Cao, and K. Narasimhan, “Tree of thoughts: Deliberate problem solving with large language models,” Advances in Neural Information Processing Systems , vol. 36, 2024

work page 2024
[12]

Seed-cts: Unleashing the power of tree search for superior performance in competitive coding tasks,

H. Wang, B. Liu, Y . Zhang, and J. Chen, “Seed-cts: Unleashing the power of tree search for superior performance in competitive coding tasks,” arXiv preprint arXiv:2412.12544 , 2024

work page arXiv 2024
[13]

Heart size and mediastinal contours appear within normal limits

H. Zhou, F. Liu, B. Gu, X. Zou, J. Huang, J. Wu, Y . Li, S. S. Chen, P. Zhou, J. Liu et al., “A survey of large language models in medicine: Progress, application, and challenge,” arXiv preprint arXiv:2311.05112, 2023

work page arXiv 2023
[14]

Mining the associations between v(d)j gene segments and covid-19 disease characteristics,

Y . Zhao, Y . Zhang, Z.-A. Huang, F. Yang, L. Duan, and J. Yao, “Mining the associations between v(d)j gene segments and covid-19 disease characteristics,” in 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) , 2021, pp. 608–613

work page 2021
[15]

Federated multi-task learning for joint diagnosis of multiple mental disorders on mri scans,

Z.-A. Huang, Y . Hu, R. Liu, X. Xue, Z. Zhu, L. Song, and K. C. Tan, “Federated multi-task learning for joint diagnosis of multiple mental disorders on mri scans,” IEEE Transactions on Biomedical Engineering, vol. 70, no. 4, pp. 1137–1149, 2023

work page 2023
[16]

Multi-lstm networks for accurate classification of attention deficit hyperactivity disorder from resting-state fmri data,

R. Liu, Z.-a. Huang, M. Jiang, and K. C. Tan, “Multi-lstm networks for accurate classification of attention deficit hyperactivity disorder from resting-state fmri data,” in 2020 2nd International Conference on Industrial Artificial Intelligence (IAI) , 2020, pp. 1–6

work page 2020
[17]

Large language model- aided evolutionary search for constrained multiobjective optimization,

Z. Wang, S. Liu, J. Chen, and K. C. Tan, “Large language model- aided evolutionary search for constrained multiobjective optimization,” in International Conference on Intelligent Computing . Springer, 2024, pp. 218–230

work page 2024
[18]

Explainable molecular property prediction: Aligning chemical concepts with predic- tions via language models,

Z. Wang, Z. Lin, W. Lin, M. Yang, M. Zeng, and K. C. Tan, “Explainable molecular property prediction: Aligning chemical concepts with predic- tions via language models,” arXiv preprint arXiv:2405.16041 , 2024

work page arXiv 2024
[19]

Evaluating large language models on medical evidence summarization,

L. Tang, Z. Sun, B. Idnay, J. G. Nestor, A. Soroush, P. A. Elias, Z. Xu, Y . Ding, G. Durrett, J. F. Rousseau et al. , “Evaluating large language models on medical evidence summarization,” NPJ digital medicine , vol. 6, no. 1, p. 158, 2023

work page 2023
[20]

Clinical text summarization: adapting large language models can outperform human experts,

D. Van Veen, C. Van Uden, L. Blankemeier, J.-B. Delbrouck, A. Aali, C. Bluethgen, A. Pareek, M. Polacin, E. P. Reis, A. Seehofnerova et al. , “Clinical text summarization: adapting large language models can outperform human experts,” Research Square, 2023

work page 2023
[21]

Biogpt: generative pre-trained transformer for biomedical text genera- tion and mining,

R. Luo, L. Sun, Y . Xia, T. Qin, S. Zhang, H. Poon, and T.-Y . Liu, “Biogpt: generative pre-trained transformer for biomedical text genera- tion and mining,” Briefings in bioinformatics, vol. 23, no. 6, p. bbac409, 2022

work page 2022
[22]

A survey of automated methods for biomedical text simplification,

B. Ondov, K. Attal, and D. Demner-Fushman, “A survey of automated methods for biomedical text simplification,” Journal of the American Medical Informatics Association , vol. 29, no. 11, pp. 1976–1988, 2022

work page 1976
[23]

The promise of large language models in health care,

A. Arora and A. Arora, “The promise of large language models in health care,” The Lancet, vol. 401, no. 10377, p. 641, 2023

work page 2023
[24]

Transforming clinical trials: the emerging roles of large language models,

J.-L. Ghim and S. Ahn, “Transforming clinical trials: the emerging roles of large language models,” Translational and Clinical Pharmacology , vol. 31, no. 3, p. 131, 2023

work page 2023
[25]

Towards Expert-Level Medical Question Answering with Large Language Models

K. Singhal, T. Tu, J. Gottweis, R. Sayres, E. Wulczyn, L. Hou, K. Clark, S. Pfohl, H. Cole-Lewis, D. Neal et al. , “Towards expert- level medical question answering with large language models,” arXiv preprint arXiv:2305.09617, 2023

work page internal anchor Pith review arXiv 2023
[26]

Capabilities of Gemini Models in Medicine

K. Saab, T. Tu, W.-H. Weng, R. Tanno, D. Stutz, E. Wulczyn, F. Zhang, T. Strother, C. Park, E. Vedadi et al., “Capabilities of gemini models in medicine,” arXiv preprint arXiv:2404.18416 , 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

Report on a general problem solving program,

A. Newell, J. C. Shaw, and H. A. Simon, “Report on a general problem solving program,” in IFIP congress , vol. 256. Pittsburgh, PA, 1959, p. 64

work page 1959
[28]

Alphazero-like tree-search can guide large language model decoding and training,

X. Feng, Z. Wan, M. Wen, S. M. McAleer, Y . Wen, W. Zhang, and J. Wang, “Alphazero-like tree-search can guide large language model decoding and training,” arXiv preprint arXiv:2309.17179 , 2023

work page arXiv 2023
[29]

GPT-4 Technical Report

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al. , “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774 , 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[30]

Qwen2.5 Technical Report

A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei et al. , “Qwen2. 5 technical report,” arXiv preprint arXiv:2412.15115, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[31]

Causalbench: A comprehensive benchmark for causal learning capability of large language models,

Y . Zhou, X. Wu, B. Huang, J. Wu, L. Feng, and K. C. Tan, “Causalbench: A comprehensive benchmark for causal learning capability of large language models,” arXiv preprint arXiv:2404.06349 , 2024

work page arXiv 2024
[32]

Judging llm-as-a-judge with mt-bench and chatbot arena,

L. Zheng, W.-L. Chiang, Y . Sheng, S. Zhuang, Z. Wu, Y . Zhuang, Z. Lin, Z. Li, D. Li, E. Xing et al. , “Judging llm-as-a-judge with mt-bench and chatbot arena,” Advances in Neural Information Processing Systems, vol. 36, pp. 46 595–46 623, 2023

work page 2023
[33]

From generation to judg- ment: Opportunities and challenges of llm-as-a-judge,

D. Li, B. Jiang, L. Huang, A. Beigi, C. Zhao, Z. Tan, A. Bhat- tacharjee, Y . Jiang, C. Chen, T. Wu et al. , “From generation to judg- ment: Opportunities and challenges of llm-as-a-judge,” arXiv preprint arXiv:2411.16594, 2024

work page arXiv 2024
[34]

Constitutional AI: Harmlessness from AI Feedback

Y . Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon et al., “Constitutional ai: Harmlessness from ai feedback,” arXiv preprint arXiv:2212.08073 , 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[35]

Y.; Yun, S.; Lee, J.; Chacko, A.; Hou, B.; Duong-Tran, D.; Ding, Y.; et al

D. Li, S. Yang, Z. Tan, J. Y . Baik, S. Yun, J. Lee, A. Chacko, B. Hou, D. Duong-Tran, Y . Dinget al., “Dalk: Dynamic co-augmentation of llms and kg to answer alzheimer’s disease questions with scientific literature,” arXiv preprint arXiv:2405.04819 , 2024

work page arXiv 2024
[36]

Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate

T. Liang, Z. He, W. Jiao, X. Wang, Y . Wang, R. Wang, Y . Yang, S. Shi, and Z. Tu, “Encouraging divergent thinking in large language models through multi-agent debate,” arXiv preprint arXiv:2305.19118 , 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[37]

Salmon: Self-alignment with instructable reward models,

Z. Sun, Y . Shen, H. Zhang, Q. Zhou, Z. Chen, D. D. Cox, Y . Yang, and C. Gan, “Salmon: Self-alignment with instructable reward models,” in The Twelfth International Conference on Learning Representations , 2024

work page 2024
[38]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, “Gptq: Accurate post-training quantization for generative pre-trained transformers,” arXiv preprint arXiv:2210.17323, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[39]

Awq: Activation-aware weight quanti- zation for on-device llm compression and acceleration,

J. Lin, J. Tang, H. Tang, S. Yang, W.-M. Chen, W.-C. Wang, G. Xiao, X. Dang, C. Gan, and S. Han, “Awq: Activation-aware weight quanti- zation for on-device llm compression and acceleration,” Proceedings of Machine Learning and Systems , vol. 6, pp. 87–100, 2024

work page 2024
[40]

Unlocking the capabilities of thought: A reasoning boundary framework to quantify and optimize chain-of-thought,

Q. Chen, L. Qin, J. Wang, J. Zhou, and W. Che, “Unlocking the capabilities of thought: A reasoning boundary framework to quantify and optimize chain-of-thought,” arXiv preprint arXiv:2410.05695, 2024

work page arXiv 2024
[41]

Direct preference optimization: Your language model is secretly a reward model,

R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” Advances in Neural Information Processing Systems, vol. 36, 2024

work page 2024
[42]

LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

Y . Zheng, R. Zhang, J. Zhang, Y . Ye, Z. Luo, Z. Feng, and Y . Ma, “Llamafactory: Unified efficient fine-tuning of 100+ language models,” arXiv preprint arXiv:2403.13372 , 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[43]

LoRA: Low-Rank Adaptation of Large Language Models

E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models,” arXiv preprint arXiv:2106.09685 , 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[44]

A survey of monte carlo tree search methods,

C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton, “A survey of monte carlo tree search methods,” IEEE Transactions on Computational Intelligence and AI in Games , vol. 4, no. 1, pp. 1–43, 2012

work page 2012
[45]

Training language models to follow instructions with human feedback,

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe, “Training language models to follow instructions with human feedback,” in Advances in Neural Information Processing Systems, S. ...

work page 2022
[46]

Design principle transfer in neural architecture search via large language models,

X. Zhou, X. Wu, L. Feng, Z. Lu, and K. C. Tan, “Design principle transfer in neural architecture search via large language models,” 2024. [Online]. Available: https://arxiv.org/abs/2408.11330

work page arXiv 2024
[47]

Exploring the true potential: Evaluating the black-box optimization capability of large language models,

B. Huang, X. Wu, Y . Zhou, J. Wu, L. Feng, R. Cheng, and K. C. Tan, “Exploring the true potential: Evaluating the black-box optimization capability of large language models,” arXiv preprint arXiv:2404.06290, 2024

[48]

Evolutionary computation in the era of large language model: Survey and roadmap

X. Wu, S.-h. Wu, J. Wu, L. Feng, and K. C. Tan, “Evolutionary computation in the era of large language model: Survey and roadmap,” arXiv preprint arXiv:2401.10034, 2024

