LLMs Are Not a Silver Bullet: A Case Study on Software Fairness
Pith reviewed 2026-05-10 15:21 UTC · model grok-4.3
The pith
Traditional machine learning methods outperform large language models at bias mitigation for software fairness on realistic imbalanced data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ML-based bias mitigation methods consistently outperform LLM-based methods in both fairness and predictive performance across multiple datasets and metrics, with even strong LLMs failing to surpass established ML baselines when test data follows realistic imbalanced distributions rather than artificial balance.
What carries the argument
The comparative evaluation of ML and LLM bias mitigation techniques, with an explicit contrast between balanced test sets and real-world imbalanced distributions, together with the distinction between in-context learning and full supervised training.
If this is right
- Software engineers facing fairness requirements on tabular data should default to established ML techniques rather than LLM prompting.
- LLM-based fairness work must move beyond in-context learning to use full training sets if it is to close the performance gap.
- Evaluation protocols for new fairness methods should require testing on naturally imbalanced data instead of balanced subsets (a sketch after this list illustrates the difference).
- Fine-tuning LLMs on the complete training data yields competitive but still limited gains, so hybrid or specialized approaches may be needed.
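To make the balanced-versus-imbalanced contrast concrete, here is a minimal sketch, assuming a synthetic tabular task and a generic scikit-learn classifier (none of this is the paper's code or data). It evaluates the same model once on the natural test distribution and once on an artificially balanced subset of it:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Synthetic tabular task with a realistic skew: roughly 10% positives.
n = 10_000
X = rng.normal(size=(n, 5))
y = (X[:, 0] + 0.3 * rng.normal(size=n) > 1.28).astype(int)

X_train, y_train = X[:8000], y[:8000]
X_test, y_test = X[8000:], y[8000:]

clf = LogisticRegression().fit(X_train, y_train)

# Evaluation 1: the natural, imbalanced test distribution.
acc_natural = accuracy_score(y_test, clf.predict(X_test))

# Evaluation 2: an artificially balanced subset (equal positives and
# negatives), the protocol the review says flattered prior LLM results.
pos = np.flatnonzero(y_test == 1)
neg = rng.choice(np.flatnonzero(y_test == 0), size=len(pos), replace=False)
idx = np.concatenate([pos, neg])
acc_balanced = accuracy_score(y_test[idx], clf.predict(X_test[idx]))

print(f"natural test set: acc={acc_natural:.3f}")
print(f"balanced subset:  acc={acc_balanced:.3f}")
```

The same model typically scores differently under the two protocols, which is why conclusions drawn only from balanced subsets need not transfer to deployment distributions.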
Where Pith is reading between the lines
- The findings imply that tabular fairness tasks may remain better served by domain-tuned ML until LLMs develop more efficient ways to absorb large labeled sets.
- If the same imbalance effect appears in other high-stakes domains, it would narrow the range of problems where current LLMs are the default choice.
- A direct next test would be to apply the same balanced-versus-imbalanced protocol to non-tabular fairness settings such as code or text.
Load-bearing premise
The selected ML and LLM methods, datasets, and fairness metrics represent the practical conditions engineers encounter in real software fairness work with imbalanced data.
What would settle it
A demonstration that any current LLM-based method achieves both better fairness scores and better accuracy than the strongest ML baselines on several standard imbalanced tabular software datasets using the same metrics would refute the central finding.
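To make "better fairness scores" concrete: the two group-fairness measures most common in this literature, statistical parity difference and equal opportunity difference (the latter following Hardt et al. [32]), reduce to a few lines of NumPy. This is a generic sketch of the usual binary-label, binary-protected-attribute setup, not the paper's evaluation code:

```python
import numpy as np

def statistical_parity_difference(y_pred, protected):
    """P(yhat=1 | unprivileged group) - P(yhat=1 | privileged group)."""
    return y_pred[protected == 0].mean() - y_pred[protected == 1].mean()

def equal_opportunity_difference(y_true, y_pred, protected):
    """True-positive-rate gap between unprivileged and privileged groups."""
    tpr = lambda g: y_pred[(protected == g) & (y_true == 1)].mean()
    return tpr(0) - tpr(1)

# Toy usage: predictions, ground truth, and a binary protected attribute.
y_true    = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred    = np.array([1, 0, 0, 1, 0, 1, 1, 0])
protected = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # 0 = unprivileged, 1 = privileged

print(statistical_parity_difference(y_pred, protected))        # closer to 0 is fairer
print(equal_opportunity_difference(y_true, y_pred, protected))
```

A refutation would need both metrics like these and predictive accuracy to favor an LLM-based method on the same imbalanced test sets.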
Original abstract
Fairness is a critical requirement for human-related, high-stakes software systems, motivating extensive research on bias mitigation. Prior work has largely focused on tabular data settings using traditional Machine Learning (ML) methods. With the rapid rise of Large Language Models (LLMs), recent studies have begun to explore their use for bias mitigation in the same setting. However, it remains unclear whether LLM-based methods offer advantages over traditional ML methods, leaving software engineers without clear guidance for practical adoption. To address this gap, we present a large-scale study comparing state-of-the-art ML- and LLM-based bias mitigation methods. We find that ML-based methods consistently outperform LLM-based methods in both fairness and predictive performance, with even strong LLMs failing to surpass established ML baselines. To understand why prior LLM-based studies report favorable results, we analyze their evaluation settings and show that these gains are largely driven by artificially balanced test data rather than realistic imbalanced distributions. We further observe that existing LLM-based methods primarily rely on in-context learning and thus fail to leverage all available training data. Motivated by this, we explore supervised fine-tuning on the full training set and find that, while it achieves competitive results, its advantages over traditional ML methods remain limited. These findings suggest that LLMs are not a silver bullet for software fairness.
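The in-context-learning limitation named in the abstract is structural: a prompt holds only a handful of demonstrations, while a supervised ML baseline fits on every labeled row. The sketch below contrasts the two; the toy table and prompt template are assumptions for illustration, not any prior study's protocol:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy training table, a stand-in for Adult/COMPAS-style tabular data.
train = pd.DataFrame({
    "age":   [25, 47, 38, 52, 29, 41, 33, 60],
    "hours": [40, 50, 45, 60, 20, 40, 35, 30],
    "label": [0, 1, 0, 1, 0, 1, 0, 1],
})

# ML baseline: supervised training consumes *all* labeled rows.
clf = RandomForestClassifier(random_state=0)
clf.fit(train[["age", "hours"]], train["label"])

# LLM in-context learning: only k demonstrations fit in the prompt,
# so most of the training set is never shown to the model.
k = 4
demos = train.sample(k, random_state=0)
prompt = "Classify the final row as 0 or 1.\n" + "\n".join(
    f"age={r.age}, hours={r.hours} -> {r.label}" for r in demos.itertuples()
) + "\nage=36, hours=42 -> "

print(prompt)  # this string would go to an LLM API; the call is omitted here
```

Supervised fine-tuning, the remedy the paper explores, closes this gap by optimizing the LLM on the full training set rather than on k sampled rows.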
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a large-scale empirical study comparing state-of-the-art ML-based and LLM-based bias mitigation methods for software fairness on tabular data. It claims that ML methods consistently outperform LLM-based approaches (including after supervised fine-tuning) in both fairness and predictive performance, and attributes prior positive LLM results to the use of artificially balanced test sets rather than realistic imbalanced distributions.
Significance. If the results hold, the work is significant for providing practical guidance to software engineers on bias mitigation, challenging LLM hype in this domain, and highlighting the need for realistic evaluation settings. A strength is the direct comparison grounded in external benchmarks and prior literature, with concrete identification of evaluation artifacts explaining discrepancies, offering falsifiable insights for the field.
Major comments (2)
- [Experiments and Results] The central claim that ML-based methods outperform LLM-based methods (including fine-tuned variants) under realistic imbalanced settings is load-bearing; however, the representativeness of the chosen SOTA ML/LLM methods, datasets, and fairness metrics for practical software fairness scenarios requires more explicit justification and sensitivity analysis, as this underpins the general conclusion that LLMs are not a silver bullet.
- [Analysis of Prior LLM Studies] The analysis attributing prior favorable LLM results to artificially balanced test data is key to reconciling with existing literature; more quantitative details on the imbalance levels in those prior test sets versus the current study's distributions would make the attribution more rigorous and verifiable.
Minor comments (2)
- [Abstract] The abstract summarizes the findings well but could state the specific number of datasets, methods compared, and fairness metrics to better convey the study's scale.
- [Results Tables] Tables reporting performance and fairness metrics should include statistical significance tests or confidence intervals to support the consistent outperformance claims.
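One standard way to satisfy this request is a non-parametric test over per-run scores, for example the Mann-Whitney U test the paper already cites [47]. A minimal sketch, with made-up per-run accuracies standing in for the real results:

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical per-run accuracies over 10 repeated train/test splits.
ml_scores  = np.array([0.84, 0.85, 0.83, 0.86, 0.84, 0.85, 0.83, 0.84, 0.86, 0.85])
llm_scores = np.array([0.80, 0.81, 0.79, 0.82, 0.80, 0.81, 0.78, 0.80, 0.82, 0.81])

# Two-sided test: do the two methods' score distributions differ?
stat, p = mannwhitneyu(ml_scores, llm_scores, alternative="two-sided")
print(f"U={stat:.1f}, p={p:.4g}")  # a small p would back the outperformance claim
```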
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments point by point below.
Point-by-point responses
- Referee: [Experiments and Results] The central claim that ML-based methods outperform LLM-based methods (including fine-tuned variants) under realistic imbalanced settings is load-bearing; however, the representativeness of the chosen SOTA ML/LLM methods, datasets, and fairness metrics for practical software fairness scenarios requires more explicit justification and sensitivity analysis, as this underpins the general conclusion that LLMs are not a silver bullet.
Authors: The methods, datasets, and metrics were chosen as direct representatives of current practice in the software fairness literature, following the most-cited benchmarks and surveys on tabular bias mitigation. We will revise the Experimental Setup section to make this rationale more explicit, including citations to the prevalence of these choices. We will also add a short sensitivity subsection reporting results under one additional fairness metric and one supplementary dataset to further support robustness of the central claim. revision: yes
- Referee: [Analysis of Prior LLM Studies] The analysis attributing prior favorable LLM results to artificially balanced test data is key to reconciling with existing literature; more quantitative details on the imbalance levels in those prior test sets versus the current study's distributions would make the attribution more rigorous and verifiable.
Authors: We agree that additional quantitative detail will strengthen the reconciliation with prior work. In the revision we will insert a table that reports the exact positive-class proportions (imbalance ratios) used in the test sets of each cited prior LLM study alongside the ratios in our realistic distributions. This will allow direct, verifiable comparison of the artificial balancing effect. revision: yes
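The promised table reduces to one statistic per test set: the positive-class proportion. A minimal sketch of the computation, with stand-in label columns in place of the actual datasets:

```python
import pandas as pd

def positive_rate(labels: pd.Series) -> float:
    """Positive-class proportion; 0.5 indicates an artificially balanced test set."""
    return (labels == 1).mean()

# Stand-in label columns: a naturally skewed test set vs. a balanced subsample.
natural  = pd.Series([1, 0, 0, 0, 0, 0, 0, 0, 1, 0])  # 0.20 positives
balanced = pd.Series([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])  # 0.50 positives

print(f"natural test set: {positive_rate(natural):.2f}")
print(f"balanced subset:  {positive_rate(balanced):.2f}")
```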
Circularity Check
No significant circularity detected
Full rationale
This is a direct empirical case study that reports experimental comparisons of ML and LLM bias mitigation methods on fairness tasks using established datasets, metrics, and baselines from prior literature. No mathematical derivations, fitted parameters presented as predictions, self-definitional constructs, or load-bearing self-citations appear in the argument structure. All claims rest on observable experimental outcomes and external benchmarks rather than reducing to the paper's own inputs by construction.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Standard fairness metrics and tabular datasets adequately represent bias issues in high-stakes software systems.
Reference graph
Works this paper leans on
- [1] 1994. The Credit Dataset. https://archive.ics.uci.edu/dataset/350/default+of+credit+card+clients
- [2] 2016. The Compas dataset. https://github.com/propublica/compas-analysis
- [3] 2017. The Adult Census Income dataset. https://archive.ics.uci.edu/ml/datasets/adult
- [4] 2024. GPT-4o mini. https://platform.openai.com/docs/models/gpt-4o-mini
- [5] 2024. IBM AIF360. https://ai-fairness-360.org/
- [6] 2024. Qwen2.5-7B-Instruct. https://huggingface.co/Qwen/Qwen2.5-7B-Instruct
- [7] 2025. DeepSeek-v3.2. https://huggingface.co/deepseek-ai/DeepSeek-V3.2
- [8] 2025. Gemini-2.5-Flash. https://ai.google.dev/gemini-api/docs/models/gemini-2.5-flash
- [9] 2025. GPT-5. https://developers.openai.com/api/docs/models/gpt-5-chat-latest
- [10] 2026. openai. https://developers.openai.com/api/reference/overview
- [11] 2026. openrouter. https://openrouter.ai/
- [12] 2026. Qwen3-Max-Thinking. https://openrouter.ai/qwen/qwen3-max-thinking
- [13] 2026. Replication package. https://doi.org/10.5281/zenodo.19244975
- [14] Razieh Alidoosti. 2021. Ethics-driven software architecture decision making. In 2021 IEEE 18th International Conference on Software Architecture Companion (ICSA-C). IEEE, 90–91.
- [15] Fatma Başak Aydemir and Fabiano Dalpiaz. 2018. A roadmap for ethics-aware software engineering. In Proceedings of the International Workshop on Software Fairness. 15–21.
- [16] Sumon Biswas and Hridesh Rajan. 2020. Do the machine learning models on a crowd sourced platform exhibit bias? An empirical study on model fairness. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 642–653.
- [17] Sumon Biswas and Hridesh Rajan. 2021. Fair preprocessing: towards understanding compositional fairness of data transformers in machine learning pipeline. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 981–993.
- [18] Yuriy Brun and Alexandra Meliou. 2018. Software fairness. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 754–759.
- [19] Joymallya Chakraborty, Suvodeep Majumder, and Tim Menzies. 2021. Bias in machine learning software: Why? How? What to do? In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 429–440.
- [20] Joymallya Chakraborty, Suvodeep Majumder, Zhe Yu, and Tim Menzies. 2020. Fairway: a way to build fair ML software. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 654–665.
- [21] Evan Chen, Run-Jun Zhan, Yan-Bai Lin, and Hung-Hsuan Chen. 2025. More Women, Same Stereotypes: Unpacking the Gender Bias Paradox in Large Language Models. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management. 4639–4643.
- [22] Zhenpeng Chen, Xinyue Li, Jie M. Zhang, Federica Sarro, and Yang Liu. 2025. Diversity Drives Fairness: Ensemble of Higher Order Mutants for Intersectional Fairness of Machine Learning Software. In Proceedings of the IEEE/ACM 47th International Conference on Software Engineering. 743–755.
- [23] Zhenpeng Chen, Xinyue Li, Jie M. Zhang, Weisong Sun, Ying Xiao, Tianlin Li, Yiling Lou, and Yang Liu. 2025. Software Fairness Dilemma: Is Bias Mitigation a Zero-Sum Game? Proceedings of the ACM on Software Engineering 2, FSE (2025), 1780–1801.
- [24] Zhenpeng Chen, Jie M. Zhang, Max Hort, Mark Harman, and Federica Sarro. 2024. Fairness testing: A comprehensive survey and analysis of trends. ACM Transactions on Software Engineering and Methodology 33, 5 (2024), 1–59.
- [26] Zhenpeng Chen, Jie M. Zhang, Federica Sarro, and Mark Harman. 2022. MAAT: a novel ensemble approach to addressing fairness and performance bugs for machine learning software. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1122–1134.
- [27] Zhenpeng Chen, Jie M. Zhang, Federica Sarro, and Mark Harman. 2023. A comprehensive empirical study of bias mitigation methods for machine learning classifiers. ACM Transactions on Software Engineering and Methodology 32, 4 (2023), 1–30.
- [28] Valeriia Cherepanova, Chia-Jung Lee, Nil-Jana Akpinar, Riccardo Fogliato, Martin Bertran Lopez, Michael Kearns, and James Zou. 2025. Improving LLM group fairness on tabular data via in-context learning. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, Vol. 8. 579–590.
- [29] Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael I. Jordan, Joseph E. Gonzalez, and Ion Stoica. 2024. Chatbot arena: An open platform for evaluating LLMs by human preference. In Proceedings of the Forty-first International Conference on Machine Learning, ICML 2024.
- [30] Michael E. Donohue. 2018. A replacement for Justitia's scales: Machine learning's role in sentencing. Harv. JL & Tech. 32 (2018), 657.
- [31] Atmika Gorti, Aman Chadha, and Manas Gaur. 2024. Unboxing occupational bias: Debiasing LLMs with US labor data. In Proceedings of the AAAI Symposium Series, Vol. 4. 48–55.
- [32] Moritz Hardt, Eric Price, and Nati Srebro. 2016. Equality of opportunity in supervised learning. Advances in Neural Information Processing Systems 29 (2016).
- [33] Naila Shafirni Hidayat, Muhammad Dehan Al Kautsar, Alfan Farizki Wicaksono, and Fajri Koto. 2025. Simulating training data leakage in multiple-choice benchmarks for LLM evaluation. In Proceedings of the 5th Workshop on Evaluation and Comparison of NLP Systems. 21–39.
- [34] Max Hort, Zhenpeng Chen, Jie M. Zhang, Mark Harman, and Federica Sarro. 2024. Bias mitigation for machine learning classifiers: A comprehensive survey. ACM Journal on Responsible Computing 1, 2 (2024), 1–52.
- [36] Jingyu Hu, Weiru Liu, and Mengnan Du. 2024. Strategic demonstration selection for improved fairness in LLM in-context learning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 7460–7475.
- [37] Faisal Kamiran and Toon Calders. 2012. Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems 33, 1 (2012), 1–33.
- [38] Faisal Kamiran, Asim Karim, and Xiangliang Zhang. 2012. Decision theory for discrimination-aware classification. In 2012 IEEE 12th International Conference on Data Mining. IEEE, 924–929.
- [39] Xinyue Li, Zhenpeng Chen, Jie M. Zhang, Ying Xiao, Tianlin Li, Weisong Sun, Yang Liu, Yiling Lou, and Xuanzhe Liu. 2026. Fairness Testing of Large Language Models in Role-Playing. In Proceedings of the 34th ACM International Conference on the Foundations of Software Engineering, FSE.
- [40] Yucheng Li, Yunhao Guo, Frank Guerin, and Chenghua Lin. 2024. An open-source data contamination report for large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024. 528–541.
- [41] Yanhui Li, Linghan Meng, Lin Chen, Li Yu, Di Wu, Yuming Zhou, and Baowen Xu. 2022. Training data debugging for the fairness of machine learning software. In Proceedings of the 44th International Conference on Software Engineering. 2215–2227.
- [42] Junwei Liu, Kaixin Wang, Yixuan Chen, Xin Peng, Zhenpeng Chen, Lingming Zhang, and Yiling Lou. 2026. Large Language Model-Based Agents for Software Engineering: A Survey. ACM Transactions on Software Engineering and Methodology (2026).
- [44] Yanchen Liu, Srishti Gautam, Jiaqi Ma, and Himabindu Lakkaraju. 2024. Confronting LLMs with traditional ML: Rethinking the fairness of large language models in tabular classifications. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). ...
- [45] Ali A. Mahmoud, Tahani AL Shawabkeh, Walid A. Salameh, and Ibrahim Al Amro. 2019. Performance predicting in hiring process and performance appraisals using machine learning. In 2019 10th International Conference on Information and Communication Systems (ICICS). IEEE, 110–115.
- [47] Henry B. Mann and Donald R. Whitney. 1947. On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics (1947), 50–60.
- [48] Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. 2021. A survey on bias and fairness in machine learning. ACM Computing Surveys (CSUR) 54, 6 (2021), 1–35.
- [49] Kewen Peng, Joymallya Chakraborty, and Tim Menzies. 2022. FairMask: Better fairness via model-based rebalancing of protected attributes. IEEE Transactions on Software Engineering 49, 4 (2022), 2426–2439.
- [50] Data Protection. 2018. General data protection regulation. Intersoft Consulting, Accessed in October 24, 1 (2018).
- [51] Omar Shaikh, Hongxin Zhang, William Held, Michael Bernstein, and Diyi Yang. 2023. On second thought, let's not think step by step! Bias and toxicity in zero-shot reasoning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 4454–4470.
- [53] Yining She, Sumon Biswas, Christian Kästner, and Eunsuk Kang. 2025. FairSense: Long-Term Fairness Analysis of ML-Enabled Systems. In Proceedings of the IEEE/ACM 47th International Conference on Software Engineering. 782–794.
- [54] Ezekiel Soremekun, Mike Papadakis, Maxime Cordy, and Yves Le Traon. 2025. Software fairness: An analysis and survey. Comput. Surveys 58, 3 (2025), 1–38.
- [55] Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, et al. 2023. DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models. (2023).
- [56] Michael Wick, Jean-Baptiste Tristan, et al. 2019. Unlocking fairness: a trade-off revisited. Advances in Neural Information Processing Systems 32 (2019).
- [57] Noam Wies, Yoav Levine, and Amnon Shashua. 2023. The learnability of in-context learning. Advances in Neural Information Processing Systems 36 (2023), 36637–36651.
- [58] Yonghao Wu, Zheng Li, Jie M. Zhang, and Yong Liu. 2024. ConDefects: A complementary dataset to address the data leakage concern for LLM-based fault localization and program repair. In Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering. 642–646.
- [60] Ying Xiao, Jie M. Zhang, Yepang Liu, Mohammad Reza Mousavi, Sicen Liu, and Dingyuan Xue. 2024. MirrorFair: Fixing fairness bugs in machine learning software via counterfactual predictions. Proceedings of the ACM on Software Engineering 1, FSE (2024), 2121–2143.
- [61] Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell. 2018. Mitigating unwanted biases with adversarial learning. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society. 335–340.
- [62] Jie M. Zhang and Mark Harman. 2021. "Ignorance and Prejudice" in Software Fairness. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 1436–1447.
- [63] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. 2023. A survey of large language models. CoRR abs/2303.18223 (2023).