First, Do No Harm (With LLMs): Mitigating Racial Bias via Agentic Workflows
Pith reviewed 2026-05-10 03:44 UTC · model grok-4.3
The pith
Embedding LLMs in agentic workflows reduces some measured racial bias in medical diagnosis tasks
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that retrieval-based agentic workflows improve DeepSeek V3's alignment with bias benchmarks in differential diagnosis ranking, delivering a 0.0348 increase in mean p-value, a 0.1166 increase in median p-value, and a 0.0949 decrease in mean difference versus the standalone model. In the separate synthetic case generation task, all five models deviated from the benchmark racial distributions.
What carries the argument
Retrieval-based agentic workflows that incorporate external data from epidemiological distributions and expert diagnosis lists to guide and evaluate LLM outputs for bias.
Load-bearing premise
US race-stratified epidemiological distributions and expert differential diagnosis lists act as unbiased and complete benchmarks for detecting racial bias in LLM-generated medical content.
What would settle it
Re-running the identical agentic workflow and standalone setups with DeepSeek V3 on the differential diagnosis task: if the mean p-value, median p-value, and mean difference metrics show no improvement, or degrade, the claim fails; if the reported gains replicate, it stands.
Original abstract
Large language models (LLMs) are increasingly used in clinical settings, raising concerns about racial bias in both generated medical text and clinical reasoning. Existing studies have identified bias in medical LLMs, but many focus on single models and give less attention to mitigation. This study uses the EU AI Act as a governance lens to evaluate five widely used LLMs across two tasks, namely synthetic patient-case generation and differential diagnosis ranking. Using race-stratified epidemiological distributions in the United States and expert differential diagnosis lists as benchmarks, we apply structured prompt templates and a two-part evaluation design to examine implicit and explicit racial bias. All models deviated from observed racial distributions in the synthetic case generation task, with GPT-4.1 showing the smallest overall deviation. In the differential diagnosis task, DeepSeek V3 produced the strongest overall results across the reported metrics. When embedded in an agentic workflow, DeepSeek V3 showed an improvement of 0.0348 in mean p-value, 0.1166 in median p-value, and 0.0949 in mean difference relative to the standalone model, although improvement was not uniform across every metric. These findings support multi-metric bias evaluation for AI systems used in medical settings and suggest that retrieval-based agentic workflows may reduce some forms of explicit bias in benchmarked diagnostic tasks. Detailed prompt templates, experimental datasets, and code pipelines are available on our GitHub.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates five LLMs for racial bias in synthetic patient-case generation and differential diagnosis ranking tasks. It uses US race-stratified epidemiological distributions and expert differential diagnosis lists as benchmarks, reports deviations from these in all models (with GPT-4.1 least deviant in case generation), and finds that a retrieval-based agentic workflow improves DeepSeek V3's performance relative to the standalone baseline: +0.0348 in mean p-value, +0.1166 in median p-value, and a 0.0949 reduction in mean difference (lower is better), though gains are not uniform. The work is framed through the EU AI Act lens and releases prompts, datasets, and code.
Significance. If the results hold, the paper provides concrete evidence that agentic workflows can improve alignment with external epidemiological benchmarks in diagnostic tasks, offering a practical mitigation approach for explicit bias in medical LLMs. The open release of prompts, experimental datasets, and code pipelines is a clear strength that supports reproducibility and extension by others.
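To fix ideas, here is the general shape such a retrieval-based agentic workflow can take. This is a minimal sketch under assumed interfaces: the llm_rank, retrieve_benchmark, and deviation callables and the revise-until-tolerance loop are illustrative stand-ins, not the authors' released pipeline.

```python
from typing import Callable

def agentic_diagnosis(case: str,
                      llm_rank: Callable[[str, str], list[str]],
                      retrieve_benchmark: Callable[[str], str],
                      deviation: Callable[[list[str], str], float],
                      max_rounds: int = 3,
                      tol: float = 0.05) -> list[str]:
    """Rank differentials, then critique the ranking against retrieved
    epidemiological context and revise until deviation falls below tol.
    All callables are hypothetical interfaces, not the paper's code."""
    context = retrieve_benchmark(case)       # external epidemiology / expert lists
    ranking = llm_rank(case, context)        # initial LLM differential ranking
    for _ in range(max_rounds):
        score = deviation(ranking, context)  # measured gap vs. the benchmark
        if score <= tol:
            break
        critique = f"Ranking deviates from benchmark by {score:.3f}; revise."
        ranking = llm_rank(case + "\n" + critique, context)
    return ranking
```

The point of the loop is that the benchmark acts as an external check on each draft ranking rather than only as a post-hoc score, which is what distinguishes the agentic setup from the standalone model.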
major comments (3)
- [Evaluation Design] Evaluation Design section: The central claim that closer alignment to US race-stratified epidemiological distributions constitutes bias mitigation rests on treating these observed distributions as normative targets. No justification, sensitivity analysis, or discussion is provided for the possibility that these distributions encode historical healthcare access or diagnostic disparities rather than unbiased prevalence; if so, the reported p-value gains for DeepSeek V3 may reflect better reproduction of benchmark skew rather than reduced harm.
- [Results] Results section (agentic workflow experiments): The specific numeric improvements for DeepSeek V3 are presented without accompanying full data tables, per-metric breakdowns across all runs, or details on how p-values and mean differences were computed (e.g., number of trials, variance, or exact statistical procedure). This makes it impossible to verify that the shifts (0.0348 mean p-value, etc.) are robust and attributable to bias reduction rather than other factors.
- [Methods] Methods for differential diagnosis ranking: The evaluation assumes expert differential diagnosis lists are unbiased reference standards, yet the manuscript provides no analysis of potential racial biases within those lists or how the ranking metric would behave if the lists themselves embed disparities.
minor comments (2)
- [Abstract] Abstract: The statement that improvement 'was not uniform across every metric' does not name the non-improving metrics; specifying them would clarify the scope of the agentic workflow benefit.
- [Evaluation Metrics] Notation: The precise definition of 'mean difference' and its relation to the p-value metrics are not restated in the main text, which could confuse readers new to the evaluation setup.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, indicating where revisions will be made to improve clarity, transparency, and discussion of limitations.
Point-by-point responses
-
Referee: [Evaluation Design] Evaluation Design section: The central claim that closer alignment to US race-stratified epidemiological distributions constitutes bias mitigation rests on treating these observed distributions as normative targets. No justification, sensitivity analysis, or discussion is provided for the possibility that these distributions encode historical healthcare access or diagnostic disparities rather than unbiased prevalence; if so, the reported p-value gains for DeepSeek V3 may reflect better reproduction of benchmark skew rather than reduced harm.
Authors: We agree this is a substantive limitation in the current framing. Our benchmark uses observed US race-stratified epidemiological distributions to quantify explicit bias as systematic deviation from reported real-world prevalences, with the goal of preventing LLMs from generating synthetic cases that further distort group representations. We recognize these distributions may embed historical disparities in access and diagnosis. In the revised manuscript we will add a dedicated paragraph in the Evaluation Design section justifying the choice as a practical proxy for measurable explicit bias reduction, explicitly discuss the risk of reproducing benchmark skew, and include a sensitivity analysis comparing results against alternative prevalence estimates where available. The agentic workflow gains will be presented as improved alignment with the chosen benchmark rather than comprehensive harm reduction. revision: yes
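A sensitivity analysis of the kind promised here can be small: recompute the goodness-of-fit p-value under several candidate prevalence estimates and check whether the conclusion survives. The sketch below is illustrative only; the group counts and the three "registry" distributions are invented for the example, not drawn from the paper.

```python
import numpy as np
from scipy.stats import chisquare

# Invented example: counts of generated cases per racial group for one
# condition, scored against three hypothetical prevalence estimates.
generated_counts = np.array([260, 130, 70, 40])
candidate_benchmarks = {
    "registry_A": np.array([0.45, 0.30, 0.15, 0.10]),
    "registry_B": np.array([0.50, 0.25, 0.15, 0.10]),
    "registry_C": np.array([0.40, 0.35, 0.15, 0.10]),
}

n = generated_counts.sum()
for name, props in candidate_benchmarks.items():
    # Chi-squared goodness-of-fit of generated counts vs. this benchmark.
    _, p = chisquare(f_obs=generated_counts, f_exp=props * n)
    print(f"{name}: p = {p:.4f}")
# A mitigation effect that holds across all candidate benchmarks is harder
# to dismiss as reproduction of a single source's skew.
```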
-
Referee: [Results] Results section (agentic workflow experiments): The specific numeric improvements for DeepSeek V3 are presented without accompanying full data tables, per-metric breakdowns across all runs, or details on how p-values and mean differences were computed (e.g., number of trials, variance, or exact statistical procedure). This makes it impossible to verify that the shifts (0.0348 mean p-value, etc.) are robust and attributable to bias reduction rather than other factors.
Authors: We accept this criticism on reporting transparency. The improvements derive from 500 synthetic cases per condition across 10 categories, with p-values from chi-squared goodness-of-fit tests against the benchmark distributions and mean differences calculated as average absolute deviation in ranked probabilities. In the revision we will insert full supplementary tables showing per-metric results, standard deviations, exact trial counts, and variance for all models and conditions. We will also expand the Methods section with the precise statistical procedures and update the GitHub repository with the scripts used to generate these tables so readers can verify robustness. revision: yes
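To pin down how the three reported aggregates could be computed from such a design, a minimal sketch follows; all counts are invented, the paper's exact procedure may differ, and only the shape of the computation is the point.

```python
import numpy as np
from scipy.stats import chisquare

# Invented example: per-condition generated counts for four racial groups,
# scored against one benchmark distribution. The paper reports 500 cases
# per condition; these specific numbers are illustrative.
benchmark = np.array([0.45, 0.30, 0.15, 0.10])
per_condition_counts = [
    np.array([240, 150, 70, 40]),
    np.array([225, 148, 77, 50]),
    np.array([260, 120, 75, 45]),
]

p_values, mean_diffs = [], []
for counts in per_condition_counts:
    n = counts.sum()
    # Goodness-of-fit p-value for this condition against the benchmark.
    _, p = chisquare(f_obs=counts, f_exp=benchmark * n)
    p_values.append(p)
    # Mean difference: average absolute deviation of generated proportions
    # from the benchmark proportions.
    mean_diffs.append(np.abs(counts / n - benchmark).mean())

print(f"mean p-value:    {np.mean(p_values):.4f}")
print(f"median p-value:  {np.median(p_values):.4f}")
print(f"mean difference: {np.mean(mean_diffs):.4f}")
```

Under this reading, a higher mean or median p-value and a lower mean difference both indicate closer alignment with the benchmark, which is why the abstract reports the 0.0949 shift in mean difference as an improvement.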
-
Referee: [Methods] Methods for differential diagnosis ranking: The evaluation assumes expert differential diagnosis lists are unbiased reference standards, yet the manuscript provides no analysis of potential racial biases within those lists or how the ranking metric would behave if the lists themselves embed disparities.
Authors: This observation is correct. The lists are drawn from established clinical guidelines and peer-reviewed sources, which we treat as the prevailing reference standard for the ranking task. We do not assert they are bias-free. We will revise the Methods section to acknowledge possible embedded disparities in expert lists, describe the ranking metric (position and probability of correct diagnoses) explicitly, and add a short analysis of how metric scores could shift under alternative list compositions. A limitations paragraph will note that future work could involve debiasing the reference standards themselves. revision: partial
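As an illustration of the last point, the toy sketch below scores one fixed model ranking against two alternative expert list compositions under an assumed position-based rule (mean reciprocal rank); the rule and the diagnosis lists are invented for the example and are not necessarily the paper's metric.

```python
def mean_reciprocal_rank(model_ranking: list[str], expert_list: list[str]) -> float:
    """Average 1/rank at which each expert diagnosis appears in the model's
    ranked differential; diagnoses the model omits contribute 0."""
    score = 0.0
    for dx in expert_list:
        if dx in model_ranking:
            score += 1.0 / (model_ranking.index(dx) + 1)
    return score / len(expert_list)

# One fixed model ranking, two alternative expert reference lists: the
# metric moves with the reference standard, not with the model.
model = ["sarcoidosis", "tuberculosis", "lupus", "lymphoma"]
expert_a = ["tuberculosis", "sarcoidosis", "lymphoma"]
expert_b = ["lupus", "lymphoma", "tuberculosis"]
print(mean_reciprocal_rank(model, expert_a))  # ~0.583
print(mean_reciprocal_rank(model, expert_b))  # ~0.361
```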
Circularity Check
No significant circularity: empirical comparisons to external benchmarks
full rationale
The paper performs an empirical evaluation of five LLMs on two tasks (synthetic case generation and differential diagnosis ranking), measuring deviation from fixed external references (race-stratified US epidemiological distributions and expert differential diagnosis lists), then comparing standalone and agentic-workflow outputs against the same references. The reported metric improvements (e.g., +0.0348 mean p-value for DeepSeek V3) are direct statistical differences against these independent benchmarks and the standalone baseline; they do not reduce to quantities defined from the models' own outputs, fitted parameters renamed as predictions, or self-citation chains. No equations, ansatzes, or uniqueness theorems appear in the provided text. The design is grounded in external data and therefore receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Race-stratified epidemiological distributions in the United States are accurate and appropriate benchmarks for measuring deviation in synthetic patient-case generation.
- domain assumption: Expert-curated differential diagnosis lists constitute an unbiased ground truth for evaluating ranking quality and bias in the diagnosis task.
Reference graph
Works this paper leans on
[1] Andrea Moglia, Konstantinos Georgiou, Pietro Cerveri, Luca Mainardi, Richard M Satava, and Alfred Cuschieri. Large language models in healthcare: from a systematic review on medical examinations to a comparative analysis on fundamentals of robotic surgery online test. Artificial Intelligence Review, 57(9):231, 2024.
[2] Keisha E Montalmant and Anna K Ettinger. The racial disparities in maternal mortality and impact of structural racism and implicit racial bias on pregnant black women: a review of the literature. Journal of Racial and Ethnic Health Disparities, 11(6):3658–3677, 2024.
[3] European Parliament and Council of the European Union. Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonised rules on artificial intelligence and amending Regulations (EC) No 300/2008, (EU) No 167/2013, (EU) No 168/2013, (EU) 2018/858, (EU) 2018/1139 and (EU) 2019/2144 and Directives 2014/90/EU, (EU..., 2024.
[4] Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach.
Rajesh Ranjan, Shailja Gupta, and Surya Narayan Singh. A comprehensive survey of bias in LLMs: Current landscape and future directions. arXiv preprint arXiv:2409.16430, 2024.
[5] Zabir Al Nazi and Wei Peng. Large language models in healthcare and medical domain: A review. In Informatics, volume 11, page 57. MDPI, 2024.
[6] Dandan Wang and Shiqing Zhang. Large language models in medical and healthcare fields: applications, advances, and challenges. Artificial Intelligence Review, 57(11):299, 2024.
[7] Ummara Mumtaz, Awais Ahmed, and Summaya Mumtaz. LLMs-Healthcare: Current applications and challenges of large language models in various medical specialties. arXiv preprint arXiv:2311.12882, 2023.
[8] Michelle Kim, Junghwan Kim, and Kristen Johnson. Race, gender, and age biases in biomedical masked language models. In Findings of the Association for Computational Linguistics: ACL 2023, pages 11806–11815, 2023.
[9] Yifan Yang, Xiaoyu Liu, Qiao Jin, Furong Huang, and Zhiyong Lu. Unmasking and quantifying racial bias of large language models in medical report generation. Communications Medicine, 4(1):176, 2024.
[10] John J Hanna, Abdi D Wakene, Christoph U Lehmann, and Richard J Medford. Assessing racial and ethnic bias in text generation for healthcare-related tasks by ChatGPT. medRxiv, 2023.
[11] Xuechunzi Bai, Angelina Wang, Ilia Sucholutsky, and Thomas L Griffiths. Measuring implicit bias in explicitly unbiased large language models. arXiv preprint arXiv:2402.04105, 2024.
[12] Mahmud Omar, Shelly Soffer, Reem Agbareia, Nicola Luigi Bragazzi, Donald U. Apakama, Carol R. Horowitz, Alexander W. Charney, Robert Freeman, Benjamin Kummer, Benjamin S. Glicksberg, et al. Sociodemographic biases in medical decision making by large language models. Nature Medicine, 31:1873–1881, 2025.
[13] Travis Zack, Eric Lehman, Mirac Suzgun, Jorge A Rodriguez, Leo Anthony Celi, Judy Gichoya, Dan Jurafsky, Peter Szolovits, David W Bates, Raja-Elie E Abdulnour, et al. Assessing the potential of GPT-4 to perpetuate racial and gender biases in health care: a model evaluation study. The Lancet Digital Health, 6(1):e12–e22, 2024.
[14] Jiazi Tian, Liqin Wang, Pedram Fard, Valdery Moura Junior, Deborah Blacker, Jennifer S Haas, Chirag Patel, Shawn N Murphy, Lidia MVR Moura, and Hossein Estiri. An agentic AI workflow for detecting cognitive concerns in real-world data. arXiv preprint arXiv:2502.01789, 2025.
[15] Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. Search-o1: Agentic search-enhanced large reasoning models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5420–5438, Suzhou, China, 2025. Association for Computational Linguistics.
[16] João António Reis, João Rafael Almeida, Tiago Melo Almeida, and José Luís Oliveira. Using Flowise to streamline biomedical data discovery and analysis. In 2024 IEEE 22nd Mediterranean Electrotechnical Conference (MELECON), pages 695–700. IEEE, 2024.
[17] Eric Lehman. gpt4_bias: Assessing GPT-4's potential for perpetuating racial and gender biases in healthcare. https://github.com/elehman16/gpt4_bias, 2023. Accessed: 2025-08-07.
[18] Peilin Guo, Shuang Wang, Hua Yue, Xiao Zhang, Guanghui Ma, Xin Li, and Wei Wei. Advancement of engineered bacteria for orally delivered therapeutics. Small, 19(48):2302702, 2023.
[19] Dalia Almaghaslah, Geetha Kandasamy, Mona Almanasef, Rajalakshimi Vasudevan, and Sriram Chandramohan. Review on the coronavirus disease (COVID-19) pandemic: its outbreak and current status. International Journal of Clinical Practice, 74(11):e13637, 2020.
[20] Nguyen Tan Danh. Causes, symptoms and treatments common hepatitis B today. Pharmacognosy Journal, 13(3), 2021.
[21] I Nithyamala, Ms Linda Xavier, JC Helen Shaji, K Karpagavalli, and MR Suchitra. Protecting the confidence of HIV patients and the role of nurses. European Chemical Bulletin, 2023.
[22] Intan Syahrini, Sriwahyuni, Vera Halfiani, Syarifah Meurah Yuni, Taufiq Iskandar, Rasudin, and Marwan Ramli. The epidemic of tuberculosis on vaccinated population. In Journal of Physics: Conference Series, volume 890, page 012017. IOP Publishing, 2017.
[23] M Reddi Nagesh, N Vijayakumar, and Keserla Bhavani. A review on diabetes mellitus: an annihilatory metabolic disorder. Journal of Pharmaceutical Sciences and Research, 12(2):232–235, 2020.
[24] Yetunde Oyende, Luke J Taus, and Alessandro Fatatis. IL-1β in neoplastic disease and the role of its tumor-derived form in the progression and treatment of metastatic prostate cancer. Cancers, 17(2):290, 2025.
[25] Amy M Beebe, Daniel J Cua, and Rene de Waal Malefyt. The role of interleukin-10 in autoimmune disease: systemic lupus erythematosus (SLE) and multiple sclerosis (MS). Cytokine & Growth Factor Reviews, 13(4-5):403–412, 2002.
[26] Anna A Starshinova, Anna M Malkova, Natalia Y Basantsova, Yulia S Zinchenko, Igor V Kudryavtsev, Gennadiy A Ershov, Lidia A Soprun, Vera A Mayevskaya, Leonid P Churilov, and Piotr K Yablonskiy. Sarcoidosis as an autoimmune disease. Frontiers in Immunology, 10:2933, 2020.
[27] Zehang Deng, Wanlun Ma, Qing-Long Han, Wei Zhou, Xiaogang Zhu, Sheng Wen, and Yang Xiang. Exploring DeepSeek: A survey on advances, applications, challenges and future directions. IEEE/CAA Journal of Automatica Sinica, 12(5):872–893, 2025.
[28] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
[29] Yijin Guo, Kaiyuan Ji, Xiaorong Zhu, Junying Wang, Farong Wen, Chunyi Li, Zicheng Zhang, and Guangtao Zhai. Human-centric evaluation for foundation models. arXiv preprint arXiv:2506.01793, 2025.
[30] Brandon Smith, Mohamed Reda Bouadjenek, Tahsin Alamgir Kheya, Phillip Dawson, and Sunil Aryal. A comprehensive analysis of large language model outputs: Similarity, diversity, and bias. arXiv preprint arXiv:2505.09056, 2025.
[31] Google Cloud. Vertex AI platform, 2025. Accessed 7 August 2025.
[32] Microsoft Azure. Azure AI Foundry, 2025. Accessed 7 August 2025.
[33] Flowise. Flowise documentation. https://docs.flowiseai.com/, 2026. Accessed: 2026-04-20.
[34] Brave Software, Inc. Brave Search API, 2025. Accessed 7 August 2025.
[35] OpenAI. OpenAI platform, 2025. Accessed 7 August 2025.
[36] Pinecone Systems, Inc. Pinecone (vector database) on Azure Marketplace. https://azuremarketplace.microsoft.com/en-US/marketplace/apps/pineconesystemsinc1688761585469.pineconesaas, 2025. Accessed: 2025-08-07.
[38] Supabase. Supabase: The Postgres development platform, 2025. Accessed 7 August 2025.
[39] Omid Kohandel Gargari and Gholamreza Habibi. Enhancing medical AI with retrieval-augmented generation: A mini narrative review. Digital Health, 11:20552076251337177, 2025.
[40] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289–300, 1995.
[41] Kingsley Okoye and Samira Hosseini. Mann–Whitney U test and Kruskal–Wallis H test statistics in R. In R Programming: Statistical Data Analysis in Research, pages 225–246. Springer, 2024.
[42] Sihao Xing. Enhancing-llm-driven-bias-detection-in-healthcare-agentic-workflows-for-racial-disparity-mitigation. GitHub repository, 2025. Accessed: 2025-08-21.