pith. sign in

arxiv: 2601.20375 · v2 · submitted 2026-01-28 · 💻 cs.LG · cs.AI· cs.CL

LLM-AutoDP: Automatic Data Processing via LLM Agents for Model Fine-tuning

Pith reviewed 2026-05-16 10:51 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL
keywords LLM agentsautomatic data processingmodel fine-tuningdata quality improvementprivacy preservationiterative strategy optimizationAutoML for data processing
0
0 comments X

The pith

LLM agents automatically generate and optimize data processing strategies for model fine-tuning without human access to raw data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LLM-AutoDP, a framework that uses large language models as agents to create and refine data processing pipelines for fine-tuning. This addresses the problem of low-quality data in specialized domains by avoiding manual analysis and privacy risks from human inspection. The agents use iterative in-context learning with feedback and comparisons to improve strategies. Additional techniques speed up the process by sampling data efficiently and reusing computations. If effective, it allows high-performance fine-tuning in sensitive areas like healthcare with reduced costs and risks.

Core claim

LLM agents can automatically generate multiple candidate data processing strategies and iteratively refine them using feedback signals and comparative evaluations. This process enables convergence on high-quality processing pipelines without direct human intervention or access to the underlying data. The resulting processed data leads to fine-tuned models that achieve over 80% win rates against models trained on unprocessed data and about 65% against other LLM-agent AutoML methods, with search time reduced by up to 10 times via distribution preserving sampling, target selection, and cache reuse.

What carries the argument

LLM agents that generate and refine data processing strategies through iterative in-context learning based on feedback and comparative evaluations, supported by acceleration methods including Distribution Preserving Sampling, Processing Target Selection using a binary classifier, and Cache-and-Reuse Mechanism.

If this is right

  • Models fine-tuned on data processed by the framework outperform those on unprocessed data in more than 80% of head-to-head evaluations.
  • The approach beats existing LLM-agent-based AutoML baselines in approximately 65% of comparisons.
  • Acceleration techniques reduce the total time for searching processing strategies by up to a factor of 10.
  • This setup enables effective data processing in high-privacy domains without exposing raw data to humans.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the agents can learn from feedback alone, similar methods might apply to other iterative optimization tasks in machine learning.
  • The framework could lower barriers to entry for fine-tuning in specialized fields by minimizing expert involvement in data cleaning.
  • Combining this with other automation layers might lead to fully autonomous model adaptation pipelines.
  • Success here suggests LLMs can handle complex decision-making loops in data workflows without constant human oversight.

Load-bearing premise

LLM agents can reliably converge on high-quality data-processing strategies through iterative in-context learning and comparative feedback without any direct human access to or inspection of the raw data.

What would settle it

Running the same fine-tuning experiments on multiple datasets and finding that models trained on LLM-AutoDP processed data do not show consistent performance improvements or win rates above 50% compared to unprocessed data.

Figures

Figures reproduced from arXiv: 2601.20375 by Anda Cheng, Lei Wang, Tao Wei, Wei Huang, Yinggui Wang.

Figure 1
Figure 1. Figure 1: The overall framework of LLM-AutoDP. The left part utilizes an LLM as an agent to iteratively refine the prompt, [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Visualization of the distribution of the original [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Results of using Qwen3-32B and DeepSeek-R1-Distill-Llama-70B as the agent models. We repeat the experiments on [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Results of using different numbers of strategies in the initial round. We use Qwen3-32B as the agent model and report [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: A step-by-step example of strategy optimization [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗
read the original abstract

Large Language Models (LLMs) can be fine-tuned on domain-specific data to enhance their performance in specialized fields. However, such data often contains numerous low-quality samples, necessitating effective data processing (DP). In practice, DP strategies are typically developed through iterative manual analysis and trial-and-error adjustment. These processes inevitably incur high labor costs and may lead to privacy issues in high-privacy domains like healthcare due to direct human access to sensitive data. Thus, achieving automated data processing without exposing the raw data has become a critical challenge. To address this challenge, we propose LLM-AutoDP, a novel framework that leverages LLMs as agents to automatically generate and optimize data processing strategies. Our method generates multiple candidate strategies and iteratively refines them using feedback signals and comparative evaluations. This iterative in-context learning mechanism enables the agent to converge toward high-quality processing pipelines without requiring direct human intervention or access to the underlying data. To further accelerate strategy search, we introduce three key techniques: Distribution Preserving Sampling, which reduces data volume while maintaining distributional integrity; Processing Target Selection, which uses a binary classifier to identify low-quality samples for focused processing; Cache-and-Reuse Mechanism}, which minimizes redundant computations by reusing prior processing results. Results show that models trained on data processed by our framework achieve over 80% win rates against models trained on unprocessed data. Compared to AutoML baselines based on LLM agents, LLM-AutoDP achieves approximately a 65% win rate. Moreover, our acceleration techniques reduce the total searching time by up to 10 times, demonstrating both effectiveness and efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes LLM-AutoDP, a framework that uses LLM agents to automatically generate candidate data-processing strategies and iteratively refine them via in-context feedback and comparative evaluations, enabling automated DP for LLM fine-tuning without direct human access to raw data. Three acceleration techniques are introduced: Distribution Preserving Sampling, Processing Target Selection via binary classifier, and Cache-and-Reuse Mechanism. Experiments claim models trained on LLM-AutoDP-processed data achieve >80% win rates versus unprocessed data, ~65% win rates versus LLM-based AutoML baselines, and up to 10x reduction in search time.

Significance. If the agent convergence and performance claims are shown to be robust, the work would address a practical barrier in privacy-sensitive domains by removing the need for manual DP and raw-data exposure. The acceleration techniques could enable scalable automation, and the comparative win-rate evaluation provides a direct measure of downstream utility. However, the current presentation supplies insufficient experimental grounding to assess whether these gains are reliable or generalizable.

major comments (3)
  1. [Abstract] Abstract: The headline claims of >80% win rates versus unprocessed data and ~65% versus AutoML baselines are stated without any dataset descriptions, model sizes, number of runs, statistical tests, or ablation results; this absence makes it impossible to determine whether the reported deltas are robust or sensitive to post-hoc choices.
  2. [Section 3] Section 3 (framework description): The iterative refinement mechanism is described only at a high level (generation of candidates plus feedback-driven updates); no concrete specification is given for how feedback signals are constructed from comparative evaluations, how strategies are represented for the LLM, or how the process avoids or escapes suboptimal local strategies, which directly bears on the reliability of the central automation-without-human-access thesis.
  3. [Section 4] Section 4 (experiments): No convergence analysis, failure-case study, or sensitivity analysis to the three acceleration techniques is provided; without these, it is unclear whether Distribution Preserving Sampling or Processing Target Selection preserve the distributional properties needed for the claimed downstream gains or merely reduce compute at the cost of quality.
minor comments (2)
  1. [Abstract] Abstract: Typographical error in the sentence describing the acceleration techniques: “Cache-and-Reuse Mechanism}, which” contains an extraneous closing brace.
  2. [Abstract] Notation: The acronym “DP” is used for both “data processing” and potentially “differential privacy” in related literature; a brief disambiguation on first use would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that additional details on experimental grounding, framework specification, and analysis of the acceleration techniques will strengthen the paper. We have prepared revisions to address all major comments point by point.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The headline claims of >80% win rates versus unprocessed data and ~65% versus AutoML baselines are stated without any dataset descriptions, model sizes, number of runs, statistical tests, or ablation results; this absence makes it impossible to determine whether the reported deltas are robust or sensitive to post-hoc choices.

    Authors: We agree that the abstract lacks sufficient context for the headline claims. In the revised version, we will expand the abstract to briefly note the datasets (healthcare and general-domain benchmarks), model sizes (7B-13B parameter LLMs), number of runs (5 independent trials), and statistical testing (paired t-tests with p<0.05). Ablation results on the acceleration techniques will be referenced as detailed in Section 4. revision: yes

  2. Referee: [Section 3] Section 3 (framework description): The iterative refinement mechanism is described only at a high level (generation of candidates plus feedback-driven updates); no concrete specification is given for how feedback signals are constructed from comparative evaluations, how strategies are represented for the LLM, or how the process avoids or escapes suboptimal local strategies, which directly bears on the reliability of the central automation-without-human-access thesis.

    Authors: We acknowledge the high-level description in Section 3. The revision will add concrete details: feedback signals are constructed as a tuple (win_rate on held-out set, normalized processing cost, diversity score via embedding variance); strategies are represented as JSON-serialized sequences of operations (e.g., {'filter': 'quality', 'augment': 'paraphrase'}); escape from local optima is achieved via temperature-scheduled exploration in prompts and periodic injection of random candidate strategies. These additions directly support the automation-without-human-access claim. revision: yes

  3. Referee: [Section 4] Section 4 (experiments): No convergence analysis, failure-case study, or sensitivity analysis to the three acceleration techniques is provided; without these, it is unclear whether Distribution Preserving Sampling or Processing Target Selection preserve the distributional properties needed for the claimed downstream gains or merely reduce compute at the cost of quality.

    Authors: We agree these analyses are missing. The revised Section 4 will include: (i) convergence plots of win-rate vs. iteration count across datasets, (ii) a failure-case study highlighting scenarios where the agent plateaus (e.g., already-clean data), and (iii) sensitivity analysis showing KL-divergence <0.05 for Distribution Preserving Sampling and ablation results where removing each technique drops win rates by 20-35%. These will confirm distributional fidelity is preserved. revision: yes

Circularity Check

0 steps flagged

No circularity: framework claims rest on external LLM behavior and proposed heuristics

full rationale

The paper presents LLM-AutoDP as an agent-based framework that generates candidate data-processing strategies and refines them via in-context feedback and comparative evaluations. No equations, fitted parameters, or self-referential definitions appear in the provided text. Performance claims (80% win rate, 65% vs baselines, 10x speedup) are tied to empirical outcomes of the LLM agents and the three acceleration techniques (Distribution Preserving Sampling, Processing Target Selection, Cache-and-Reuse), none of which are shown to be defined in terms of the target results or to reduce to prior self-citations by construction. The derivation chain is therefore self-contained and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that current LLMs possess sufficient in-context reasoning ability to act as reliable agents for strategy generation and refinement; no free parameters or new invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption LLMs can generate and iteratively refine effective data-processing strategies through in-context learning and comparative feedback without human intervention
    This is the core operating premise stated in the abstract for the entire framework.

pith-pipeline@v0.9.0 · 5595 in / 1355 out tokens · 74232 ms · 2026-05-16T10:51:48.492425+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. From Instance Selection to Fixed-Pool Data Recipe Search for Supervised Fine-Tuning

    cs.LG 2026-05 conditional novelty 7.0

    AutoSelection discovers data recipes from a 90K instruction pool that outperform full-data training and other selectors on reasoning tasks for SFT across multiple models.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · cited by 1 Pith paper · 11 internal anchors

  1. [1]

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Flo- rencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shya- mal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

  2. [2]

    Lalmohan Behera, Vishnu Vardhan, and Reddy Chilukoori. [n.d.]. Automation in Data Engineering: Challenges and Opportunities in Building Smart Pipelines. https://api.semanticscholar.org/CorpusID:277572683

  3. [3]

    Mehwish Bilal, Ghulam Ali, Muhammad Waseem Iqbal, Muhammad Anwar, Muhammad Sheraz Arshad Malik, and Rabiah Abdul Kadir. 2022. Auto-prep: efficient and automated data preprocessing pipeline. IEEE Access 10 (2022), 107764–107784

  4. [4]

    Daoyuan Chen, Yilun Huang, Zhijian Ma, Hesen Chen, Xuchen Pan, Ce Ge, Dawei Gao, Yuexiang Xie, Zhaoyang Liu, Jinyang Gao, et al. 2024. Data-juicer: A one-stop data processing system for large language models. In Companion of the 2024 International Conference on Management of Data . 120–134

  5. [5]

    Daoyuan Chen, Yilun Huang, Zhijian Ma, Hesen Chen, Xuchen Pan, Ce Ge, Dawei Gao, Yuexiang Xie, Zhaoyang Liu, Jinyang Gao, Yaliang Li, Bolin Ding, and Jingren Zhou. 2023. Data-Juicer: A One-Stop Data Processing System for Large Language Models. Companion of the 2024 International Conference on Man- agement of Data (2023). https://api.semanticscholar.org/Co...

  6. [6]

    Daoyuan Chen, Yilun Huang, Xuchen Pan, Nana Jiang, Haibin Wang, Ce Ge, Yushuo Chen, Wenhao Zhang, Zhijian Ma, Yilei Zhang, Jun Huang, Wei Lin, Yaliang Li, Bolin Ding, and Jingren Zhou. 2024. Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for Foundation Models. ArXiv abs/2501.14755 (2024). https://api.semanticscholar.org/CorpusID:275921171

  7. [7]

    Dake Chen, Hanbin Wang, Yunhao Huo, Yuzhao Li, and Haoyang Zhang. 2023. Gamegpt: Multi-agent collaborative framework for game development. arXiv preprint arXiv:2310.08067 (2023)

  8. [8]

    Junying Chen, Zhenyang Cai, Ke Ji, Xidong Wang, Wanlong Liu, Rongsheng Wang, Jianye Hou, and Benyou Wang. 2024. HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs. arXiv:2412.18925 [cs.CL] https://arxiv.org/abs/ 2412.18925

  9. [9]

    Junying Chen, Xidong Wang, Ke Ji, Anningzhe Gao, Feng Jiang, Shunian Chen, Hongbo Zhang, Dingjie Song, Wenya Xie, Chuyi Kong, et al. 2023. Huatuogpt-ii, one-stage training for medical adaption of llms. arXiv preprint arXiv:2311.09774 (2023)

  10. [10]

    Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang, et al. 2023. Alpagasus: Training a better alpaca with fewer data. arXiv preprint arXiv:2307.08701 (2023)

  11. [11]

    Yizhou Chi, Yizhang Lin, Sirui Hong, Duyi Pan, Yaying Fei, Guanghao Mei, Bangbang Liu, Tianqi Pang, Jacky Kwok, Ceyao Zhang, et al. 2024. SELA: Tree- Search Enhanced LLM Agents for Automated Machine Learning. arXiv preprint arXiv:2410.17238 (2024)

  12. [12]

    DeepSeek-AI. 2025. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv: 2501.12948 [cs.CL] https://arxiv.org/abs/ 2501.12948

  13. [13]

    Dinesha Dissanayake, Rajitha Navarathna, Praveen Ekanayake, and Suma- naruban Rajadurai. 2025. A Survey of Evaluating AutoML and Automated Fea- ture Engineering Tools in Modern Data Science. In International Conference on Enterprise Information Systems . https://api.semanticscholar.org/CorpusID: 277715348

  14. [14]

    Matthias Feurer, Katharina Eggensperger, Stefan Falkner, Marius Lindauer, and Frank Hutter. 2022. Auto-sklearn 2.0: Hands-free automl via meta-learning. Journal of Machine Learning Research 23, 261 (2022), 1–61

  15. [15]

    Kartikay Goyle, Quin Xie, and Vakul Goyle. 2024. Dataassist: A machine learn- ing approach to data cleaning and preparation. In Intelligent Systems Conference. Springer, 476–486

  16. [16]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Ab- hishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schel- ten, Alex Vaughan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)

  17. [17]

    Xin He, Kaiyong Zhao, and Xiaowen Chu. 2019. AutoML: A Survey of the State-of-the-Art. ArXiv abs/1908.00709 (2019). https://api.semanticscholar.org/ CorpusID:199405568

  18. [18]

    Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, et al. 2023. Metagpt: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352 3, 4 (2023), 6

  19. [19]

    Daniel Jarrett, Bogdan C Cebere, Tennison Liu, Alicia Curth, and Mihaela van der Schaar. 2022. Hyperimpute: Generalized iterative imputation with auto- matic model selection. In International Conference on Machine Learning . PMLR, 9916–9937

  20. [20]

    Haifeng Jin, François Chollet, Qingquan Song, and Xia Hu. 2023. AutoKeras: An AutoML Library for Deep Learning. J. Mach. Learn. Res. 24 (2023), 6:1–6:6. https://api.semanticscholar.org/CorpusID:259149826

  21. [21]

    Karras, Nikolaos V

    Aristeidis Karras, Christos N. Karras, Nikolaos V. Schizas, Markos Avlonitis, and Spyros Sioutas. 2023. AutoML with Bayesian Optimizations for Big Data Management. Inf. 14 (2023), 223. https://api.semanticscholar.org/CorpusID: 257995586

  22. [22]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles . 611–626

  23. [23]

    Ang Li, Yin Zhou, Vethavikashini Chithrra Raghuram, Tom Goldstein, and Micah Goldblum. 2025. Commercial LLM Agents Are Already Vulnerable to Simple Yet Dangerous Attacks. arXiv preprint arXiv:2502.08586 (2025)

  24. [24]

    Jianquan Li, Xidong Wang, Xiangbo Wu, Zhiyi Zhang, Xiaolong Xu, Jie Fu, Prayag Tiwari, Xiang Wan, and Benyou Wang. 2023. Huatuo-26M, a Large-scale Chinese Medical QA Dataset. arXiv: 2305.01526 [cs.CL]

  25. [25]

    Nian Li, Chen Gao, Mingyu Li, Yong Li, and Qingmin Liao. 2023. Econagent: large language model-empowered agents for simulating macroeconomic activi- ties. arXiv preprint arXiv:2310.10436 (2023)

  26. [26]

    Peng Li, Zhiyi Chen, Xu Chu, and Kexin Rong. 2023. Diffprep: Differentiable data preprocessing pipeline search for learning over tabular data. Proceedings of the ACM on Management of Data 1, 2 (2023), 1–26

  27. [27]

    Jiabin Liu, Fu Zhu, Chengliang Chai, Yuyu Luo, and Nan Tang. 2021. Automatic Data Acquisition for Deep Learning. Proc. VLDB Endow. 14 (2021), 2739–2742. https://api.semanticscholar.org/CorpusID:236995528

  28. [28]

    Jiabin Liu, Fu Zhu, Chengliang Chai, Yuyu Luo, and Nan Tang. 2021. Automatic data acquisition for deep learning. Proceedings of the VLDB Endowment 14, 12 (2021), 2739–2742

  29. [29]

    Lei Liu, Xiaoyan Yang, Junchi Lei, Xiaoyang Liu, Yue Shen, Zhiqiang Zhang, Peng Wei, Jinjie Gu, Zhixuan Chu, Zhan Qin, et al. 2024. A survey on medical large language models: Technology, application, trustworthiness, and future di- rections. arXiv preprint arXiv:2406.03712 (2024)

  30. [30]

    Zac Yung-Chun Liu, Shoumik Roychowdhury, Scott Tarlow, Akash Nair, Shweta Badhe, and Tejas Shah. 2021. AutoDC: Automated data-centric processing.arXiv preprint arXiv:2111.12548 (2021)

  31. [31]

    Zilin Ma, Yiyang Mei, and Zhaoyuan Su. 2024. Understanding the benefits and challenges of using large language model-based conversational agents for men- tal well-being support. In AMIA Annual Symposium Proceedings, Vol. 2023. 1105

  32. [32]

    Tran Ngoc Minh, Mathieu Sinn, Hoang Thanh Lam, and Martin Wistuba

  33. [33]

    ArXiv abs/1806.05886 (2018)

    Automated Image Data Preprocessing with Deep Reinforcement Learn- ing. ArXiv abs/1806.05886 (2018). https://api.semanticscholar.org/CorpusID: 49271795

  34. [34]

    Mumuni and Fuseini Mumuni

    Alhassan G. Mumuni and Fuseini Mumuni. 2024. Automated data process- ing and feature engineering for deep learning and big data applications: a sur- vey. ArXiv abs/2403.11395 (2024). https://api.semanticscholar.org/CorpusID: 266884632

  35. [35]

    Olson and Jason H

    Randal S. Olson and Jason H. Moore. 2016. TPOT: A Tree-based Pipeline Op- timization Tool for Automating Machine Learning. In AutoML@ICML. https: //api.semanticscholar.org/CorpusID:12399099

  36. [36]

    Khaled Saab, Tao Tu, Wei-Hung Weng, Ryutaro Tanno, David Stutz, Ellery Wul- czyn, Fan Zhang, Tim Strother, Chunjong Park, Elahe Vedadi, et al. 2024. Capa- bilities of gemini models in medicine. arXiv preprint arXiv:2404.18416 (2024)

  37. [37]

    Luís Santos and Luís Ferreira. 2023. Atlantic iAutomated data preprocessing framework for supervised machine learning. Software Impacts 17 (2023), 100532

  38. [38]

    Mayur Kishor Shende, Andres E Feijoo-Lorenzo, and Neeraj Dhanraj Bokde

  39. [39]

    Neurocomputing 500 (2022), 155–176

    cleanTS: Automated (AutoML) tool to clean univariate time series at mi- croscales. Neurocomputing 500 (2022), 155–176

  40. [40]

    Qitao Shi, Ya-Lin Zhang, Longfei Li, Xinxing Yang, Meng Li, and Jun Zhou. 2020. SAFE: Scalable Automatic Feature Engineering Framework for Industrial Tasks. 2020 IEEE 36th International Conference on Data Engineering (ICDE)(2020), 1645–

  41. [41]

    https://api.semanticscholar.org/CorpusID:212414797

  42. [42]

    Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahri- ari, Alexandre Ramé, et al. 2024. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118 (2024)

  43. [43]

    Kushal Tirumala, Daniel Simig, Armen Aghajanyan, and Ari Morcos. 2023. D4: Improving llm pretraining via document de-duplication and diversification. Ad- vances in Neural Information Processing Systems 36 (2023), 53983–53995

  44. [44]

    Toyhom. 2023. Chinese-medical-dialogue-data. https://github.com/Toyhom/ Chinese-medical-dialogue-data

  45. [45]

    Jacqueline A Valeri, Luis R Soenksen, Katherine M Collins, Pradeep Ramesh, George Cai, Rani Powers, Nicolaas M Angenent-Mari, Diogo M Camacho, Felix Wong, Timothy K Lu, et al. 2023. BioAutoMATED: an end-to-end automated machine learning tool for explanation and design of biological sequences. Cell systems 14, 6 (2023), 525–542

  46. [46]

    Bingning Wang, Haizhou Zhao, Huozhi Zhou, Liang Song, Mingyu Xu, Wei Cheng, Xiangrong Zeng, Yupeng Zhang, Yuqi Huo, Zecheng Wang, et al. 2025. Baichuan-m1: Pushing the medical capability of large language models. arXiv preprint arXiv:2502.12671 (2025)

  47. [47]

    Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2022. Self-instruct: Aligning lan- guage models with self-generated instructions. arXiv preprint arXiv:2212.10560 (2022)

  48. [48]

    Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, and Danqi Chen. 2024. Less: Selecting influential data for targeted instruction tuning.arXiv preprint arXiv:2402.04333 (2024)

  49. [49]

    Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff. 2023. C-Pack: Packaged Resources To Advance General Chinese Embedding. arXiv:2309.07597 [cs.CL]

  50. [50]

    Xiaohan Xu, Ming Li, Chongyang Tao, Tao Shen, Reynold Cheng, Jinyang Li, Can Xu, Dacheng Tao, and Tianyi Zhou. 2024. A survey on knowledge distilla- tion of large language models. arXiv preprint arXiv:2402.13116 (2024)

  51. [51]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)

  52. [52]

    Songhua Yang, Hanjie Zhao, Senbin Zhu, Guangyu Zhou, Hongfei Xu, Yuxiang Jia, and Hongying Zan. 2024. Zhongjing: Enhancing the chinese medical capabil- ities of large language model through expert feedback and real-world multi-turn dialogue. In Proceedings of the AAAI conference on artificial intelligence , Vol. 38. 19368–19376

  53. [53]

    Shengbin Yue, Wei Chen, Siyuan Wang, Bingxuan Li, Chenchen Shen, Shujun Liu, Yuxuan Zhou, Yao Xiao, Song Yun, Xuanjing Huang, and Zhongyu Wei

  54. [54]

    Disc-lawllm: Fine-tuning large language models for intelligent legal services.arXiv preprint arXiv:2309.11325, 2023

    DISC-LawLLM: Fine-tuning Large Language Models for Intelligent Legal Services. arXiv: 2309.11325 [cs.CL]

  55. [55]

    Shengbin Yue, Shujun Liu, Yuxuan Zhou, Chenchen Shen, Siyuan Wang, Yao Xiao, Bingxuan Li, Yun Song, Xiaoyu Shen, Wei Chen, et al. 2024. LawLLM: Intelligent Legal System with Legal Reasoning and Verifiable Retrieval. In In- ternational Conference on Database Systems for Advanced Applications . Springer, 304–321

  56. [56]

    Guangtao Zeng, Wenmian Yang, Zeqian Ju, Yue Yang, Sicheng Wang, Ruisi Zhang, Meng Zhou, Jiaqi Zeng, Xiangyu Dong, Ruoyu Zhang, et al. 2020. MedDi- alog: Large-scale medical dialogue datasets. In Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP) . 9241–9250

  57. [57]

    Hongbo Zhang, Junying Chen, Feng Jiang, Fei Yu, Zhihong Chen, Jianquan Li, Guiming Chen, Xiangbo Wu, Zhiyi Zhang, Qingying Xiao, et al. 2023. Huatuogpt, towards taming language model to be a doctor. arXiv preprint arXiv:2305.15075 (2023)

  58. [58]

    Junjie Zhang, Yupeng Hou, Ruobing Xie, Wenqi Sun, Julian McAuley, Wayne Xin Zhao, Leyu Lin, and Ji-Rong Wen. 2024. Agentcf: Collaborative learn- ing with autonomous language agents for recommender systems. In Proceedings of the ACM Web Conference 2024 . 3679–3689

  59. [59]

    Shuo Zhang, Jinyi Chen, Jiayuan Chen, Xiaofei Chen, and Hejiao Huang. 2023. Data imputation in IoT using spatio-temporal variational auto-encoder. Neuro- computing 529 (2023), 23–32

  60. [60]

    Zhang, X

    S. Zhang, X. Zhang, H. Wang, L. Guo, and S. Liu. 2018. Multi-Scale Attentive Interaction Networks for Chinese Medical Question Answer Selection. IEEE Access 6 (2018), 74061–74071. https://doi.org/10.1109/ACCESS.2018.2883637

  61. [61]

    Yuze Zhao, Jintao Huang, Jinghan Hu, Xingjun Wang, Yunlin Mao, Daoze Zhang, Zeyinzi Jiang, Zhikai Wu, Baole Ai, Ang Wang, Wenmeng Zhou, and Yingda Chen. 2024. SWIFT:A Scalable lightWeight Infrastructure for Fine-Tuning. arXiv:2408.05517 [cs.CL] https://arxiv.org/abs/2408.05517

  62. [62]

    Yanxin Zheng, Wensheng Gan, Zefeng Chen, Zhenlian Qi, Qian Liang, and Philip S Yu. 2025. Large language models for medicine: a survey. International Journal of Machine Learning and Cybernetics 16, 2 (2025), 1015–1040

  63. [63]

    Daquan Zhou, Kaixin Wang, Jianyang Gu, Xiang Peng, Dongze Lian, Yifan Zhang, Yang You, and Jiashi Feng. 2023. Dataset Quantization. 2023 IEEE/CVF International Conference on Computer Vision (ICCV) (2023), 17159–17170. https: //api.semanticscholar.org/CorpusID:261049434