LLMs Are Not a Silver Bullet: A Case Study on Software Fairness
Pith reviewed 2026-05-10 15:21 UTC · model grok-4.3
The pith
Traditional machine learning methods outperform large language models at bias mitigation for software fairness on realistic imbalanced data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ML-based bias mitigation methods consistently outperform LLM-based methods in both fairness and predictive performance across multiple datasets and metrics, with even strong LLMs failing to surpass established ML baselines when test data follows realistic imbalanced distributions rather than artificial balance.
What carries the argument
The comparative evaluation of ML and LLM bias mitigation techniques, with an explicit contrast between balanced test sets and real-world imbalanced distributions, together with the distinction between in-context learning and full supervised training.
If this is right
- Software engineers facing fairness requirements on tabular data should default to established ML techniques rather than LLM prompting.
- LLM-based fairness work must move beyond in-context learning to use full training sets if it is to close the performance gap.
- Evaluation protocols for new fairness methods should require testing on naturally imbalanced data instead of balanced subsets (a sketch after this list illustrates the difference).
- Fine-tuning LLMs on the complete training data yields competitive but still limited gains, so hybrid or specialized approaches may be needed.
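To make the balanced-versus-imbalanced contrast concrete, here is a minimal sketch, assuming a synthetic tabular task and a generic scikit-learn classifier (none of this is the paper's code or data). It evaluates the same model once on the natural test distribution and once on an artificially balanced subset of it:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Synthetic tabular task with a realistic skew: roughly 10% positives.
n = 10_000
X = rng.normal(size=(n, 5))
y = (X[:, 0] + 0.3 * rng.normal(size=n) > 1.28).astype(int)

X_train, y_train = X[:8000], y[:8000]
X_test, y_test = X[8000:], y[8000:]

clf = LogisticRegression().fit(X_train, y_train)

# Evaluation 1: the natural, imbalanced test distribution.
acc_natural = accuracy_score(y_test, clf.predict(X_test))

# Evaluation 2: an artificially balanced subset (equal positives and
# negatives), the protocol the review says flattered prior LLM results.
pos = np.flatnonzero(y_test == 1)
neg = rng.choice(np.flatnonzero(y_test == 0), size=len(pos), replace=False)
idx = np.concatenate([pos, neg])
acc_balanced = accuracy_score(y_test[idx], clf.predict(X_test[idx]))

print(f"natural test set: acc={acc_natural:.3f}")
print(f"balanced subset:  acc={acc_balanced:.3f}")
```

The same model typically scores differently under the two protocols, which is why conclusions drawn only from balanced subsets need not transfer to deployment distributions.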
Where Pith is reading between the lines
- The findings imply that tabular fairness tasks may remain better served by domain-tuned ML until LLMs develop more efficient ways to absorb large labeled sets.
- If the same imbalance effect appears in other high-stakes domains, it would narrow the range of problems where current LLMs are the default choice.
- A direct next test would be to apply the same balanced-versus-imbalanced protocol to non-tabular fairness settings such as code or text.
Load-bearing premise
The selected ML and LLM methods, datasets, and fairness metrics represent the practical conditions engineers encounter in real software fairness work with imbalanced data.
What would settle it
A demonstration that any current LLM-based method achieves both better fairness scores and better accuracy than the strongest ML baselines on several standard imbalanced tabular software datasets using the same metrics would refute the central finding.
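To make "better fairness scores" concrete: the two group-fairness measures most common in this literature, statistical parity difference and equal opportunity difference (the latter following Hardt et al. [32]), reduce to a few lines of NumPy. This is a generic sketch of the usual binary-label, binary-protected-attribute setup, not the paper's evaluation code:

```python
import numpy as np

def statistical_parity_difference(y_pred, protected):
    """P(yhat=1 | unprivileged group) - P(yhat=1 | privileged group)."""
    return y_pred[protected == 0].mean() - y_pred[protected == 1].mean()

def equal_opportunity_difference(y_true, y_pred, protected):
    """True-positive-rate gap between unprivileged and privileged groups."""
    tpr = lambda g: y_pred[(protected == g) & (y_true == 1)].mean()
    return tpr(0) - tpr(1)

# Toy usage: predictions, ground truth, and a binary protected attribute.
y_true    = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred    = np.array([1, 0, 0, 1, 0, 1, 1, 0])
protected = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # 0 = unprivileged, 1 = privileged

print(statistical_parity_difference(y_pred, protected))        # closer to 0 is fairer
print(equal_opportunity_difference(y_true, y_pred, protected))
```

A refutation would need both metrics like these and predictive accuracy to favor an LLM-based method on the same imbalanced test sets.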
Original abstract
Fairness is a critical requirement for human-related, high-stakes software systems, motivating extensive research on bias mitigation. Prior work has largely focused on tabular data settings using traditional Machine Learning (ML) methods. With the rapid rise of Large Language Models (LLMs), recent studies have begun to explore their use for bias mitigation in the same setting. However, it remains unclear whether LLM-based methods offer advantages over traditional ML methods, leaving software engineers without clear guidance for practical adoption. To address this gap, we present a large-scale study comparing state-of-the-art ML- and LLM-based bias mitigation methods. We find that ML-based methods consistently outperform LLM-based methods in both fairness and predictive performance, with even strong LLMs failing to surpass established ML baselines. To understand why prior LLM-based studies report favorable results, we analyze their evaluation settings and show that these gains are largely driven by artificially balanced test data rather than realistic imbalanced distributions. We further observe that existing LLM-based methods primarily rely on in-context learning and thus fail to leverage all available training data. Motivated by this, we explore supervised fine-tuning on the full training set and find that, while it achieves competitive results, its advantages over traditional ML methods remain limited. These findings suggest that LLMs are not a silver bullet for software fairness.
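The in-context-learning limitation named in the abstract is structural: a prompt holds only a handful of demonstrations, while a supervised ML baseline fits on every labeled row. The sketch below contrasts the two; the toy table and prompt template are assumptions for illustration, not any prior study's protocol:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy training table, a stand-in for Adult/COMPAS-style tabular data.
train = pd.DataFrame({
    "age":   [25, 47, 38, 52, 29, 41, 33, 60],
    "hours": [40, 50, 45, 60, 20, 40, 35, 30],
    "label": [0, 1, 0, 1, 0, 1, 0, 1],
})

# ML baseline: supervised training consumes *all* labeled rows.
clf = RandomForestClassifier(random_state=0)
clf.fit(train[["age", "hours"]], train["label"])

# LLM in-context learning: only k demonstrations fit in the prompt,
# so most of the training set is never shown to the model.
k = 4
demos = train.sample(k, random_state=0)
prompt = "Classify the final row as 0 or 1.\n" + "\n".join(
    f"age={r.age}, hours={r.hours} -> {r.label}" for r in demos.itertuples()
) + "\nage=36, hours=42 -> "

print(prompt)  # this string would go to an LLM API; the call is omitted here
```

Supervised fine-tuning, the remedy the paper explores, closes this gap by optimizing the LLM on the full training set rather than on k sampled rows.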
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a large-scale empirical study comparing state-of-the-art ML-based and LLM-based bias mitigation methods for software fairness on tabular data. It claims that ML methods consistently outperform LLM-based approaches (including after supervised fine-tuning) in both fairness and predictive performance, and attributes prior positive LLM results to the use of artificially balanced test sets rather than realistic imbalanced distributions.
Significance. If the results hold, the work is significant for providing practical guidance to software engineers on bias mitigation, challenging LLM hype in this domain, and highlighting the need for realistic evaluation settings. A strength is the direct comparison grounded in external benchmarks and prior literature, with concrete identification of evaluation artifacts explaining discrepancies, offering falsifiable insights for the field.
Major comments (2)
- [Experiments and Results] The central claim that ML-based methods outperform LLM-based methods (including fine-tuned variants) under realistic imbalanced settings is load-bearing; however, the representativeness of the chosen SOTA ML/LLM methods, datasets, and fairness metrics for practical software fairness scenarios requires more explicit justification and sensitivity analysis, as this underpins the general conclusion that LLMs are not a silver bullet.
- [Analysis of Prior LLM Studies] The analysis attributing prior favorable LLM results to artificially balanced test data is key to reconciling with existing literature; more quantitative details on the imbalance levels in those prior test sets versus the current study's distributions would make the attribution more rigorous and verifiable.
Minor comments (2)
- [Abstract] The abstract summarizes the findings well but could state the specific number of datasets, methods compared, and fairness metrics to better convey the study's scale.
- [Results Tables] Tables reporting performance and fairness metrics should include statistical significance tests or confidence intervals to support the consistent outperformance claims.
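One standard way to satisfy this request is a non-parametric test over per-run scores, for example the Mann-Whitney U test the paper already cites [47]. A minimal sketch, with made-up per-run accuracies standing in for the real results:

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical per-run accuracies over 10 repeated train/test splits.
ml_scores  = np.array([0.84, 0.85, 0.83, 0.86, 0.84, 0.85, 0.83, 0.84, 0.86, 0.85])
llm_scores = np.array([0.80, 0.81, 0.79, 0.82, 0.80, 0.81, 0.78, 0.80, 0.82, 0.81])

# Two-sided test: do the two methods' score distributions differ?
stat, p = mannwhitneyu(ml_scores, llm_scores, alternative="two-sided")
print(f"U={stat:.1f}, p={p:.4g}")  # a small p would back the outperformance claim
```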
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments point by point below.
Point-by-point responses
- Referee: [Experiments and Results] The central claim that ML-based methods outperform LLM-based methods (including fine-tuned variants) under realistic imbalanced settings is load-bearing; however, the representativeness of the chosen SOTA ML/LLM methods, datasets, and fairness metrics for practical software fairness scenarios requires more explicit justification and sensitivity analysis, as this underpins the general conclusion that LLMs are not a silver bullet.
Authors: The methods, datasets, and metrics were chosen as direct representatives of current practice in the software fairness literature, following the most-cited benchmarks and surveys on tabular bias mitigation. We will revise the Experimental Setup section to make this rationale more explicit, including citations to the prevalence of these choices. We will also add a short sensitivity subsection reporting results under one additional fairness metric and one supplementary dataset to further support robustness of the central claim. revision: yes
- Referee: [Analysis of Prior LLM Studies] The analysis attributing prior favorable LLM results to artificially balanced test data is key to reconciling with existing literature; more quantitative details on the imbalance levels in those prior test sets versus the current study's distributions would make the attribution more rigorous and verifiable.
Authors: We agree that additional quantitative detail will strengthen the reconciliation with prior work. In the revision we will insert a table that reports the exact positive-class proportions (imbalance ratios) used in the test sets of each cited prior LLM study alongside the ratios in our realistic distributions. This will allow direct, verifiable comparison of the artificial balancing effect. revision: yes
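The promised table reduces to one statistic per test set: the positive-class proportion. A minimal sketch of the computation, with stand-in label columns in place of the actual datasets:

```python
import pandas as pd

def positive_rate(labels: pd.Series) -> float:
    """Positive-class proportion; 0.5 indicates an artificially balanced test set."""
    return (labels == 1).mean()

# Stand-in label columns: a naturally skewed test set vs. a balanced subsample.
natural  = pd.Series([1, 0, 0, 0, 0, 0, 0, 0, 1, 0])  # 0.20 positives
balanced = pd.Series([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])  # 0.50 positives

print(f"natural test set: {positive_rate(natural):.2f}")
print(f"balanced subset:  {positive_rate(balanced):.2f}")
```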
Circularity Check
No significant circularity detected
Full rationale
This is a direct empirical case study that reports experimental comparisons of ML and LLM bias mitigation methods on fairness tasks using established datasets, metrics, and baselines from prior literature. No mathematical derivations, fitted parameters presented as predictions, self-definitional constructs, or load-bearing self-citations appear in the argument structure. All claims rest on observable experimental outcomes and external benchmarks rather than reducing to the paper's own inputs by construction.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Standard fairness metrics and tabular datasets adequately represent bias issues in high-stakes software systems.
Reference graph
Works this paper leans on
- [1] 1994. The Credit Dataset. https://archive.ics.uci.edu/dataset/350/default+of+credit+card+clients
- [2] 2016. The Compas dataset. https://github.com/propublica/compas-analysis
- [3] 2017. The Adult Census Income dataset. https://archive.ics.uci.edu/ml/datasets/adult
- [4] 2024. GPT-4o mini. https://platform.openai.com/docs/models/gpt-4o-mini
- [5] 2024. IBM AIF360. https://ai-fairness-360.org/
- [6] 2024. Qwen2.5-7B-Instruct. https://huggingface.co/Qwen/Qwen2.5-7B-Instruct
- [7] 2025. DeepSeek-v3.2. https://huggingface.co/deepseek-ai/DeepSeek-V3.2
- [8] 2025. Gemini-2.5-Flash. https://ai.google.dev/gemini-api/docs/models/gemini-2.5-flash
- [9] 2025. GPT-5. https://developers.openai.com/api/docs/models/gpt-5-chat-latest
- [10] 2026. openai. https://developers.openai.com/api/reference/overview
- [11] 2026. openrouter. https://openrouter.ai/
- [12] 2026. Qwen3-Max-Thinking. https://openrouter.ai/qwen/qwen3-max-thinking
- [13] 2026. Replication package. https://doi.org/10.5281/zenodo.19244975
- [14] Razieh Alidoosti. 2021. Ethics-driven software architecture decision making. In 2021 IEEE 18th International Conference on Software Architecture Companion (ICSA-C). IEEE, 90–91.
- [15] Fatma Başak Aydemir and Fabiano Dalpiaz. 2018. A roadmap for ethics-aware software engineering. In Proceedings of the International Workshop on Software Fairness. 15–21.
- [16] Sumon Biswas and Hridesh Rajan. 2020. Do the machine learning models on a crowd sourced platform exhibit bias? An empirical study on model fairness. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 642–653.
- [17] Sumon Biswas and Hridesh Rajan. 2021. Fair preprocessing: towards understanding compositional fairness of data transformers in machine learning pipeline. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 981–993.
- [18] Yuriy Brun and Alexandra Meliou. 2018. Software fairness. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 754–759.
- [19] Joymallya Chakraborty, Suvodeep Majumder, and Tim Menzies. 2021. Bias in machine learning software: Why? How? What to do? In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 429–440.
- [20] Joymallya Chakraborty, Suvodeep Majumder, Zhe Yu, and Tim Menzies. 2020. Fairway: a way to build fair ML software. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 654–665.
- [21] Evan Chen, Run-Jun Zhan, Yan-Bai Lin, and Hung-Hsuan Chen. 2025. More Women, Same Stereotypes: Unpacking the Gender Bias Paradox in Large Language Models. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management. 4639–4643.
- [22] Zhenpeng Chen, Xinyue Li, Jie M. Zhang, Federica Sarro, and Yang Liu. 2025. Diversity Drives Fairness: Ensemble of Higher Order Mutants for Intersectional Fairness of Machine Learning Software. In Proceedings of the IEEE/ACM 47th International Conference on Software Engineering. 743–755.
- [23] Zhenpeng Chen, Xinyue Li, Jie M. Zhang, Weisong Sun, Ying Xiao, Tianlin Li, Yiling Lou, and Yang Liu. 2025. Software Fairness Dilemma: Is Bias Mitigation a Zero-Sum Game? Proceedings of the ACM on Software Engineering 2, FSE (2025), 1780–1801.
- [24] Zhenpeng Chen, Jie M. Zhang, Max Hort, Mark Harman, and Federica Sarro. 2024. Fairness testing: A comprehensive survey and analysis of trends. ACM Transactions on Software Engineering and Methodology 33, 5 (2024), 1–59.
- [26] Zhenpeng Chen, Jie M. Zhang, Federica Sarro, and Mark Harman. 2022. MAAT: a novel ensemble approach to addressing fairness and performance bugs for machine learning software. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1122–1134.
- [27] Zhenpeng Chen, Jie M. Zhang, Federica Sarro, and Mark Harman. 2023. A comprehensive empirical study of bias mitigation methods for machine learning classifiers. ACM Transactions on Software Engineering and Methodology 32, 4 (2023), 1–30.
- [28] Valeriia Cherepanova, Chia-Jung Lee, Nil-Jana Akpinar, Riccardo Fogliato, Martin Bertran Lopez, Michael Kearns, and James Zou. 2025. Improving LLM group fairness on tabular data via in-context learning. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, Vol. 8. 579–590.
- [29] Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael I. Jordan, Joseph E. Gonzalez, and Ion Stoica. 2024. Chatbot arena: An open platform for evaluating LLMs by human preference. In Proceedings of the Forty-first International Conference on Machine Learning, ICML 2024.
- [30] Michael E. Donohue. 2018. A replacement for Justitia's scales: Machine learning's role in sentencing. Harv. JL & Tech. 32 (2018), 657.
- [31] Atmika Gorti, Aman Chadha, and Manas Gaur. 2024. Unboxing occupational bias: Debiasing LLMs with US labor data. In Proceedings of the AAAI Symposium Series, Vol. 4. 48–55.
- [32] Moritz Hardt, Eric Price, and Nati Srebro. 2016. Equality of opportunity in supervised learning. Advances in Neural Information Processing Systems 29 (2016).
- [33] Naila Shafirni Hidayat, Muhammad Dehan Al Kautsar, Alfan Farizki Wicaksono, and Fajri Koto. 2025. Simulating training data leakage in multiple-choice benchmarks for LLM evaluation. In Proceedings of the 5th Workshop on Evaluation and Comparison of NLP Systems. 21–39.
- [34] Max Hort, Zhenpeng Chen, Jie M. Zhang, Mark Harman, and Federica Sarro. 2024. Bias mitigation for machine learning classifiers: A comprehensive survey. ACM Journal on Responsible Computing 1, 2 (2024), 1–52.
- [36] Jingyu Hu, Weiru Liu, and Mengnan Du. 2024. Strategic demonstration selection for improved fairness in LLM in-context learning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 7460–7475.
- [37] Faisal Kamiran and Toon Calders. 2012. Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems 33, 1 (2012), 1–33.
- [38] Faisal Kamiran, Asim Karim, and Xiangliang Zhang. 2012. Decision theory for discrimination-aware classification. In 2012 IEEE 12th International Conference on Data Mining. IEEE, 924–929.
- [39] Xinyue Li, Zhenpeng Chen, Jie M. Zhang, Ying Xiao, Tianlin Li, Weisong Sun, Yang Liu, Yiling Lou, and Xuanzhe Liu. 2026. Fairness Testing of Large Language Models in Role-Playing. In Proceedings of the 34th ACM International Conference on the Foundations of Software Engineering, FSE.
- [40] Yucheng Li, Yunhao Guo, Frank Guerin, and Chenghua Lin. 2024. An open-source data contamination report for large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024. 528–541.
- [41] Yanhui Li, Linghan Meng, Lin Chen, Li Yu, Di Wu, Yuming Zhou, and Baowen Xu. 2022. Training data debugging for the fairness of machine learning software. In Proceedings of the 44th International Conference on Software Engineering. 2215–2227.
- [42] Junwei Liu, Kaixin Wang, Yixuan Chen, Xin Peng, Zhenpeng Chen, Lingming Zhang, and Yiling Lou. 2026. Large Language Model-Based Agents for Software Engineering: A Survey. ACM Transactions on Software Engineering and Methodology (2026).
- [44] Yanchen Liu, Srishti Gautam, Jiaqi Ma, and Himabindu Lakkaraju. 2024. Confronting LLMs with traditional ML: Rethinking the fairness of large language models in tabular classifications. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). ...
- [45] Ali A. Mahmoud, Tahani AL Shawabkeh, Walid A. Salameh, and Ibrahim Al Amro. 2019. Performance predicting in hiring process and performance appraisals using machine learning. In 2019 10th International Conference on Information and Communication Systems (ICICS). IEEE, 110–115.
- [47] Henry B. Mann and Donald R. Whitney. 1947. On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics (1947), 50–60.
- [48] Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. 2021. A survey on bias and fairness in machine learning. ACM Computing Surveys (CSUR) 54, 6 (2021), 1–35.
- [49] Kewen Peng, Joymallya Chakraborty, and Tim Menzies. 2022. FairMask: Better fairness via model-based rebalancing of protected attributes. IEEE Transactions on Software Engineering 49, 4 (2022), 2426–2439.
- [50] Data Protection. 2018. General data protection regulation. Intersoft Consulting, Accessed in October 24, 1 (2018).
- [51] Omar Shaikh, Hongxin Zhang, William Held, Michael Bernstein, and Diyi Yang. 2023. On second thought, let's not think step by step! Bias and toxicity in zero-shot reasoning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 4454–4470.
- [53] Yining She, Sumon Biswas, Christian Kästner, and Eunsuk Kang. 2025. FairSense: Long-Term Fairness Analysis of ML-Enabled Systems. In Proceedings of the IEEE/ACM 47th International Conference on Software Engineering. 782–794.
- [54] Ezekiel Soremekun, Mike Papadakis, Maxime Cordy, and Yves Le Traon. 2025. Software fairness: An analysis and survey. Comput. Surveys 58, 3 (2025), 1–38.
- [55] Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, et al. 2023. DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models. (2023).
- [56] Michael Wick, Jean-Baptiste Tristan, et al. 2019. Unlocking fairness: a trade-off revisited. Advances in Neural Information Processing Systems 32 (2019).
- [57] Noam Wies, Yoav Levine, and Amnon Shashua. 2023. The learnability of in-context learning. Advances in Neural Information Processing Systems 36 (2023), 36637–36651.
- [58] Yonghao Wu, Zheng Li, Jie M. Zhang, and Yong Liu. 2024. ConDefects: A complementary dataset to address the data leakage concern for LLM-based fault localization and program repair. In Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering. 642–646.
- [60] Ying Xiao, Jie M. Zhang, Yepang Liu, Mohammad Reza Mousavi, Sicen Liu, and Dingyuan Xue. 2024. MirrorFair: Fixing fairness bugs in machine learning software via counterfactual predictions. Proceedings of the ACM on Software Engineering 1, FSE (2024), 2121–2143.
- [61] Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell. 2018. Mitigating unwanted biases with adversarial learning. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society. 335–340.
- [62] Jie M. Zhang and Mark Harman. 2021. "Ignorance and Prejudice" in Software Fairness. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 1436–1447.
- [63] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. 2023. A survey of large language models. CoRR abs/2303.18223 (2023).