pith. machine review for the scientific record.

arxiv: 2604.18038 · v1 · submitted 2026-04-20 · 💻 cs.CY · cs.AI

Recognition: unknown

First, Do No Harm (With LLMs): Mitigating Racial Bias via Agentic Workflows

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 03:44 UTC · model grok-4.3

classification 💻 cs.CY cs.AI
keywords racial bias · large language models · medical diagnosis · agentic workflows · bias evaluation · clinical AI · differential diagnosis · synthetic data generation

The pith

Embedding LLMs in agentic workflows reduces some measured racial bias in medical diagnosis tasks

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper assesses five large language models for racial bias in two clinical tasks, using structured prompts and benchmarks drawn from US epidemiological data and expert diagnosis lists. All models deviated from real racial distributions when generating synthetic patient cases, with GPT-4.1 showing the smallest deviation. DeepSeek V3 gave the strongest results on the differential diagnosis task, and embedding it in an agentic workflow produced better scores on several bias measures than the model used by itself. If correct, this suggests that workflow design can serve as a practical tool for lowering explicit bias in LLM clinical reasoning.

Core claim

The authors claim that retrieval-based agentic workflows improve DeepSeek V3's alignment with bias benchmarks in differential diagnosis ranking, reporting a 0.0348 increase in mean p-value, a 0.1166 increase in median p-value, and a 0.0949 decrease in mean difference versus the standalone model, while all five models exhibited deviations in the synthetic case generation task.
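A minimal sketch of how such summary deltas aggregate per-case results; the per-case values below are hypothetical and for illustration only, since the paper's full tables are not reproduced here:

```python
import statistics

def bias_metric_deltas(workflow_pvals, standalone_pvals,
                       workflow_diffs, standalone_diffs):
    """Aggregate per-case p-values and mean-difference scores into the
    three summary deltas (workflow minus standalone)."""
    return {
        "mean_p_delta": statistics.mean(workflow_pvals) - statistics.mean(standalone_pvals),
        "median_p_delta": statistics.median(workflow_pvals) - statistics.median(standalone_pvals),
        "mean_diff_delta": statistics.mean(workflow_diffs) - statistics.mean(standalone_diffs),
    }

# Hypothetical per-case values, not the paper's data.
deltas = bias_metric_deltas(
    workflow_pvals=[0.62, 0.48, 0.55],
    standalone_pvals=[0.51, 0.40, 0.53],
    workflow_diffs=[0.08, 0.05, 0.06],
    standalone_diffs=[0.15, 0.17, 0.16],
)
# Higher p-values (less detectable deviation from the benchmark) and a
# negative mean-difference delta would both count as improvement.
```

Under this reading, the reported 0.0348, 0.1166, and 0.0949 figures would be deltas of this kind computed over the paper's evaluation runs.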

What carries the argument

Retrieval-based agentic workflows that incorporate external data from epidemiological distributions and expert diagnosis lists to guide and evaluate LLM outputs for bias.

Load-bearing premise

US race-stratified epidemiological distributions and expert differential diagnosis lists act as unbiased and complete benchmarks for detecting racial bias in LLM-generated medical content.

What would settle it

Independently re-running the identical agentic-workflow and standalone setups for DeepSeek V3 on the differential diagnosis task: if the reported gains in mean p-value, median p-value, and mean difference fail to reproduce, the core claim does not hold.

Figures

Figures reproduced from arXiv: 2604.18038 by Sihao Xing, Zaur Gouliev.

Figure 1. Experimental pipeline. The response is the model's case description, including demographic details (with racial label), past history, chief complaint, and physical findings; the differential diagnosis dataset follows a similar structure. For each of the 10 selected NEJM Healer cases, four racial categories (Black, White, Hispanic, Asian) were inserted into the descriptions, with each case repeated 10 times.
Figure 2. Pipeline of the agentic workflow (Experiment II).
Figure 3. All p-values, with statistically significant values marked.
original abstract

Large language models (LLMs) are increasingly used in clinical settings, raising concerns about racial bias in both generated medical text and clinical reasoning. Existing studies have identified bias in medical LLMs, but many focus on single models and give less attention to mitigation. This study uses the EU AI Act as a governance lens to evaluate five widely used LLMs across two tasks, namely synthetic patient-case generation and differential diagnosis ranking. Using race-stratified epidemiological distributions in the United States and expert differential diagnosis lists as benchmarks, we apply structured prompt templates and a two-part evaluation design to examine implicit and explicit racial bias. All models deviated from observed racial distributions in the synthetic case generation task, with GPT-4.1 showing the smallest overall deviation. In the differential diagnosis task, DeepSeek V3 produced the strongest overall results across the reported metrics. When embedded in an agentic workflow, DeepSeek V3 showed an improvement of 0.0348 in mean p-value, 0.1166 in median p-value, and 0.0949 in mean difference relative to the standalone model, although improvement was not uniform across every metric. These findings support multi-metric bias evaluation for AI systems used in medical settings and suggest that retrieval-based agentic workflows may reduce some forms of explicit bias in benchmarked diagnostic tasks. Detailed prompt templates, experimental datasets, and code pipelines are available on our GitHub.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript evaluates five LLMs for racial bias in synthetic patient-case generation and differential diagnosis ranking tasks. It uses US race-stratified epidemiological distributions and expert differential diagnosis lists as benchmarks, reports deviations from these in all models (with GPT-4.1 least deviant in case generation), and finds that a retrieval-based agentic workflow improves DeepSeek V3 performance on mean p-value (+0.0348), median p-value (+0.1166), and mean difference (+0.0949) relative to the standalone baseline, though gains are not uniform. The work is framed through the EU AI Act lens and releases prompts, datasets, and code.

Significance. If the results hold, the paper provides concrete evidence that agentic workflows can improve alignment with external epidemiological benchmarks in diagnostic tasks, offering a practical mitigation approach for explicit bias in medical LLMs. The open release of prompts, experimental datasets, and code pipelines is a clear strength that supports reproducibility and extension by others.

major comments (3)
  1. [Evaluation Design] Evaluation Design section: The central claim that closer alignment to US race-stratified epidemiological distributions constitutes bias mitigation rests on treating these observed distributions as normative targets. No justification, sensitivity analysis, or discussion is provided for the possibility that these distributions encode historical healthcare access or diagnostic disparities rather than unbiased prevalence; if so, the reported p-value gains for DeepSeek V3 may reflect better reproduction of benchmark skew rather than reduced harm.
  2. [Results] Results section (agentic workflow experiments): The specific numeric improvements for DeepSeek V3 are presented without accompanying full data tables, per-metric breakdowns across all runs, or details on how p-values and mean differences were computed (e.g., number of trials, variance, or exact statistical procedure). This makes it impossible to verify that the shifts (0.0348 mean p-value, etc.) are robust and attributable to bias reduction rather than other factors.
  3. [Methods] Methods for differential diagnosis ranking: The evaluation assumes expert differential diagnosis lists are unbiased reference standards, yet the manuscript provides no analysis of potential racial biases within those lists or how the ranking metric would behave if the lists themselves embed disparities.
minor comments (2)
  1. [Abstract] Abstract: The statement that improvement 'was not uniform across every metric' is left unspecified; naming the non-improving metrics would clarify the scope of the agentic workflow benefit.
  2. [Evaluation Metrics] Notation: The precise definitions of 'mean difference' and how it relates to the p-value metrics are not restated in the main text, which could confuse readers new to the evaluation setup.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, indicating where revisions will be made to improve clarity, transparency, and discussion of limitations.

point-by-point responses
  1. Referee: [Evaluation Design] Evaluation Design section: The central claim that closer alignment to US race-stratified epidemiological distributions constitutes bias mitigation rests on treating these observed distributions as normative targets. No justification, sensitivity analysis, or discussion is provided for the possibility that these distributions encode historical healthcare access or diagnostic disparities rather than unbiased prevalence; if so, the reported p-value gains for DeepSeek V3 may reflect better reproduction of benchmark skew rather than reduced harm.

    Authors: We agree this is a substantive limitation in the current framing. Our benchmark uses observed US race-stratified epidemiological distributions to quantify explicit bias as systematic deviation from reported real-world prevalences, with the goal of preventing LLMs from generating synthetic cases that further distort group representations. We recognize these distributions may embed historical disparities in access and diagnosis. In the revised manuscript we will add a dedicated paragraph in the Evaluation Design section justifying the choice as a practical proxy for measurable explicit bias reduction, explicitly discuss the risk of reproducing benchmark skew, and include a sensitivity analysis comparing results against alternative prevalence estimates where available. The agentic workflow gains will be presented as improved alignment with the chosen benchmark rather than comprehensive harm reduction. revision: yes

  2. Referee: [Results] Results section (agentic workflow experiments): The specific numeric improvements for DeepSeek V3 are presented without accompanying full data tables, per-metric breakdowns across all runs, or details on how p-values and mean differences were computed (e.g., number of trials, variance, or exact statistical procedure). This makes it impossible to verify that the shifts (0.0348 mean p-value, etc.) are robust and attributable to bias reduction rather than other factors.

    Authors: We accept this criticism on reporting transparency. The improvements derive from 500 synthetic cases per condition across 10 categories, with p-values from chi-squared goodness-of-fit tests against the benchmark distributions and mean differences calculated as average absolute deviation in ranked probabilities. In the revision we will insert full supplementary tables showing per-metric results, standard deviations, exact trial counts, and variance for all models and conditions. We will also expand the Methods section with the precise statistical procedures and update the GitHub repository with the scripts used to generate these tables so readers can verify robustness. revision: yes
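The procedure the authors describe can be sketched in stdlib Python; the counts below are hypothetical, and the released pipeline may differ in detail (for instance, it reports p-values from the chi-squared distribution rather than comparing against a fixed critical value):

```python
def chi2_gof_stat(observed_counts, benchmark_props):
    """Chi-squared goodness-of-fit statistic comparing generated counts
    per racial category to benchmark proportions (which sum to 1)."""
    total = sum(observed_counts)
    return sum((o - p * total) ** 2 / (p * total)
               for o, p in zip(observed_counts, benchmark_props))

def mean_abs_diff(observed_counts, benchmark_props):
    """Average absolute deviation of observed proportions from the
    benchmark; the paper applies the analogous measure to ranked
    diagnosis probabilities."""
    total = sum(observed_counts)
    return sum(abs(o / total - p)
               for o, p in zip(observed_counts, benchmark_props)) / len(benchmark_props)

# Hypothetical counts over four racial categories for one condition.
counts = [130, 260, 60, 50]           # 500 generated cases
benchmark = [0.25, 0.50, 0.15, 0.10]  # assumed epidemiological proportions

stat = chi2_gof_stat(counts, benchmark)        # ~3.6 for these inputs
significant = stat > 7.815  # chi-squared critical value, df = 3, alpha = 0.05
deviation = mean_abs_diff(counts, benchmark)   # ~0.015 for these inputs
```

A non-significant statistic means the generated distribution is statistically indistinguishable from the benchmark at the chosen level, which is the direction the workflow gains point in.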

  3. Referee: [Methods] Methods for differential diagnosis ranking: The evaluation assumes expert differential diagnosis lists are unbiased reference standards, yet the manuscript provides no analysis of potential racial biases within those lists or how the ranking metric would behave if the lists themselves embed disparities.

    Authors: This observation is correct. The lists are drawn from established clinical guidelines and peer-reviewed sources, which we treat as the prevailing reference standard for the ranking task. We do not assert they are bias-free. We will revise the Methods section to acknowledge possible embedded disparities in expert lists, describe the ranking metric (position and probability of correct diagnoses) explicitly, and add a short analysis of how metric scores could shift under alternative list compositions. A limitations paragraph will note that future work could involve debiasing the reference standards themselves. revision: partial

Circularity Check

0 steps flagged

No significant circularity: empirical comparisons to external benchmarks

full rationale

The paper performs an empirical evaluation of five LLMs on two tasks (synthetic case generation and differential diagnosis ranking), measuring deviation from fixed external references—race-stratified US epidemiological distributions and expert differential diagnosis lists—then comparing standalone vs. agentic-workflow outputs on the same references. The reported metric improvements (e.g., +0.0348 mean p-value for DeepSeek V3) are direct statistical differences against these independent benchmarks and the standalone baseline; they do not reduce to quantities defined from the models' own outputs, fitted parameters renamed as predictions, or self-citation chains. No equations, ansatzes, or uniqueness theorems appear in the provided text. The design is self-contained against external data and therefore receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on two domain assumptions about benchmark validity and on the empirical measurements themselves; no free parameters are fitted inside the reported results and no new entities are postulated.

axioms (2)
  • domain assumption Race-stratified epidemiological distributions in the United States are accurate and appropriate benchmarks for measuring deviation in synthetic patient-case generation.
    Invoked to quantify how much each model's generated cases deviate from observed racial patterns.
  • domain assumption Expert-curated differential diagnosis lists constitute an unbiased ground truth for evaluating ranking quality and bias in the diagnosis task.
    Used as the reference against which model rankings are scored.

pith-pipeline@v0.9.0 · 5557 in / 1487 out tokens · 36797 ms · 2026-05-10T03:44:14.981659+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

42 extracted references · 7 canonical work pages · 2 internal anchors

  1. [1]

    Andrea Moglia, Konstantinos Georgiou, Pietro Cerveri, Luca Mainardi, Richard M Satava, and Alfred Cuschieri. Large language models in healthcare: from a systematic review on medical examinations to a comparative analysis on fundamentals of robotic surgery online test.Artificial Intelligence Review, 57(9):231, 2024

  2. [2]

    Keisha E Montalmant and Anna K Ettinger. The racial disparities in maternal mortality and impact of structural racism and implicit racial bias on pregnant black women: a review of the literature.Journal of racial and ethnic health disparities, 11(6):3658–3677, 2024

  3. [3]

    European Parliament and Council of the European Union. Regulation (eu) 2024/1689 of the european parliament and of the council of 13 june 2024 laying down harmonised rules on artificial intelligence and amending regulations (ec) no 300/2008, (eu) no 167/2013, (eu) no 168/2013, (eu) 2018/858, (eu) 2018/1139 and (eu) 2019/2144 and directives 2014/90/eu, (eu...

  4. [4]

    A comprehensive survey of bias in LLMs: Current landscape and future directions

    Rajesh Ranjan, Shailja Gupta, and Surya Narayan Singh. A comprehensive survey of bias in llms: Current landscape and future directions. arXiv preprint arXiv:2409.16430, 2024

  5. [5]

    Large language models in healthcare and medical domain: A review

    Zabir Al Nazi and Wei Peng. Large language models in healthcare and medical domain: A review. InInformatics, volume 11, page 57. MDPI, 2024

  6. [6]

    Large language models in medical and healthcare fields: applications, advances, and challenges.Artificial intelligence review, 57(11):299, 2024

    Dandan Wang and Shiqing Zhang. Large language models in medical and healthcare fields: applications, advances, and challenges.Artificial intelligence review, 57(11):299, 2024

  7. [7]

    LLMs-Healthcare : Current Applications and Challenges of Large Language Models in various Medical Specialties

    Ummara Mumtaz, Awais Ahmed, and Summaya Mumtaz. Llms-healthcare: Current applications and challenges of large language models in various medical specialties.arXiv preprint arXiv:2311.12882, 2023

  8. [8]

    Race, gender, and age biases in biomedical masked language models

    Michelle Kim, Junghwan Kim, and Kristen Johnson. Race, gender, and age biases in biomedical masked language models. InFindings of the Association for Computational Linguistics: ACL 2023, pages 11806–11815, 2023

  9. [9]

    Unmasking and quantifying racial bias of large language models in medical report generation.Communications medicine, 4(1):176, 2024

    Yifan Yang, Xiaoyu Liu, Qiao Jin, Furong Huang, and Zhiyong Lu. Unmasking and quantifying racial bias of large language models in medical report generation.Communications medicine, 4(1):176, 2024

  10. [10]

    Assessing racial and ethnic bias in text generation for healthcare-related tasks by chatgpt1.MedRxiv, 2023

    John J Hanna, Abdi D Wakene, Christoph U Lehmann, and Richard J Medford. Assessing racial and ethnic bias in text generation for healthcare-related tasks by chatgpt1.MedRxiv, 2023

  11. [11]

    Measuring implicit bias in explicitly unbiased large language models.arXiv preprint arXiv:2402.04105, 2024

    Xuechunzi Bai, Angelina Wang, Ilia Sucholutsky, and Thomas L Griffiths. Measuring implicit bias in explicitly unbiased large language models.arXiv preprint arXiv:2402.04105, 2024

  12. [12]

    Sociodemographic biases in medical decision making by large language models

    Mahmud Omar, Shelly Soffer, Reem Agbareia, Nicola Luigi Bragazzi, Donald U. Apakama, Carol R. Horowitz, Alexander W. Charney, Robert Freeman, Benjamin Kummer, Benjamin S. Glicksberg, et al. Sociodemographic biases in medical decision making by large language models. Nature Medicine, 31:1873–1881, 2025

  13. [13]

    Assessing the potential of gpt-4 to perpetuate racial and gender biases in health care: a model evaluation study.The Lancet Digital Health, 6(1):e12–e22, 2024

    Travis Zack, Eric Lehman, Mirac Suzgun, Jorge A Rodriguez, Leo Anthony Celi, Judy Gichoya, Dan Jurafsky, Peter Szolovits, David W Bates, Raja-Elie E Abdulnour, et al. Assessing the potential of gpt-4 to perpetuate racial and gender biases in health care: a model evaluation study.The Lancet Digital Health, 6(1):e12–e22, 2024

  14. [14]

    An agentic ai workflow for detecting cognitive concerns in real-world data.arXiv preprint arXiv:2502.01789, 2025

    Jiazi Tian, Liqin Wang, Pedram Fard, Valdery Moura Junior, Deborah Blacker, Jennifer S Haas, Chirag Patel, Shawn N Murphy, Lidia MVR Moura, and Hossein Estiri. An agentic ai workflow for detecting cognitive concerns in real-world data.arXiv preprint arXiv:2502.01789, 2025

  15. [15]

    Search-o1: Agentic search-enhanced large reasoning models

    Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. Search-o1: Agentic search-enhanced large reasoning models. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5420–5438, Suzhou, China, 2025. Association for Computational Linguistics

  16. [16]

    Using flowise to streamline biomedical data discovery and analysis

    João António Reis, João Rafael Almeida, Tiago Melo Almeida, and José Luís Oliveira. Using flowise to streamline biomedical data discovery and analysis. In 2024 IEEE 22nd Mediterranean Electrotechnical Conference (MELECON), pages 695–700. IEEE, 2024

  17. [17]

    gpt4_bias: Assessing gpt-4’s potential for perpetuating racial and gender biases in healthcare

    Eric Lehman. gpt4_bias: Assessing gpt-4’s potential for perpetuating racial and gender biases in healthcare. https://github.com/elehman16/gpt4_bias, 2023. Accessed: 2025-08-07

  18. [18]

    Advancement of engineered bacteria for orally delivered therapeutics.Small, 19(48):2302702, 2023

    Peilin Guo, Shuang Wang, Hua Yue, Xiao Zhang, Guanghui Ma, Xin Li, and Wei Wei. Advancement of engineered bacteria for orally delivered therapeutics.Small, 19(48):2302702, 2023

  19. [19]

    Review on the coronavirus disease (covid-19) pandemic: its outbreak and current status.International journal of clinical practice, 74(11):e13637, 2020

    Dalia Almaghaslah, Geetha Kandasamy, Mona Almanasef, Rajalakshimi Vasudevan, and Sriram Chandramohan. Review on the coronavirus disease (covid-19) pandemic: its outbreak and current status.International journal of clinical practice, 74(11):e13637, 2020

  20. [20]

    Causes, symptoms and treatments common hepatitis b today.Pharmacognosy Journal, 13(3), 2021

    Nguyen Tan Danh. Causes, symptoms and treatments common hepatitis b today.Pharmacognosy Journal, 13(3), 2021

  21. [21]

    Protecting the confidence of hiv patients and the role of nurses.European Chemical Bulletin, 2023

    I Nithyamala, Ms Linda Xavier, JC Helen Shaji, K Karpagavalli, and MR Suchitra. Protecting the confidence of hiv patients and the role of nurses.European Chemical Bulletin, 2023

  22. [22]

    The epidemic of tuberculosis on vaccinated population

    Intan Syahrini, Sriwahyuni, Vera Halfiani, Syarifah Meurah Yuni, Taufiq Iskandar, Rasudin, and Marwan Ramli. The epidemic of tuberculosis on vaccinated population. InJournal of Physics: Conference Series, volume 890, page 012017. IOP Publishing, 2017

  23. [23]

    A review on diabetes mellitus-an annihilatory metabolic disorder.Journal of Pharmaceutical Sciences and Research, 12(2):232–235, 2020

    M Reddi Nagesh, N Vijayakumar, and Keserla Bhavani. A review on diabetes mellitus-an annihilatory metabolic disorder.Journal of Pharmaceutical Sciences and Research, 12(2):232–235, 2020

  24. [24]

    Il-1β in neoplastic disease and the role of its tumor-derived form in the progression and treatment of metastatic prostate cancer.Cancers, 17(2):290, 2025

    Yetunde Oyende, Luke J Taus, and Alessandro Fatatis. Il-1β in neoplastic disease and the role of its tumor-derived form in the progression and treatment of metastatic prostate cancer.Cancers, 17(2):290, 2025

  25. [25]

    The role of interleukin-10 in autoimmune disease: systemic lupus erythematosus (sle) and multiple sclerosis (ms). Cytokine & growth factor reviews, 13(4-5):403–412, 2002

    Amy M Beebe, Daniel J Cua, and Rene de Waal Malefyt. The role of interleukin-10 in autoimmune disease: systemic lupus erythematosus (sle) and multiple sclerosis (ms). Cytokine & growth factor reviews, 13(4-5):403–412, 2002

  26. [26]

    Sarcoidosis as an autoimmune disease.Frontiers in immunology, 10:2933, 2020

    Anna A Starshinova, Anna M Malkova, Natalia Y Basantsova, Yulia S Zinchenko, Igor V Kudryavtsev, Gennadiy A Ershov, Lidia A Soprun, Vera A Mayevskaya, Leonid P Churilov, and Piotr K Yablonskiy. Sarcoidosis as an autoimmune disease.Frontiers in immunology, 10:2933, 2020

  27. [27]

    Exploring deepseek: A survey on advances, applications, challenges and future directions.IEEE/CAA Journal of Automatica Sinica, 12(5):872–893, 2025

    Zehang Deng, Wanlun Ma, Qing-Long Han, Wei Zhou, Xiaogang Zhu, Sheng Wen, and Yang Xiang. Exploring deepseek: A survey on advances, applications, challenges and future directions.IEEE/CAA Journal of Automatica Sinica, 12(5):872–893, 2025

  28. [28]

    On the Opportunities and Risks of Foundation Models

    Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models.arXiv preprint arXiv:2108.07258, 2021

  29. [29]

    Human-centric evaluation for foundation models.arXiv preprint arXiv:2506.01793, 2025

    Yijin Guo, Kaiyuan Ji, Xiaorong Zhu, Junying Wang, Farong Wen, Chunyi Li, Zicheng Zhang, and Guangtao Zhai. Human-centric evaluation for foundation models.arXiv preprint arXiv:2506.01793, 2025

  30. [30]

    A comprehensive analysis of Large Language Model outputs: Similarity, diversity, and bias.arXiv preprint arXiv:2505.09056, 2025

    Brandon Smith, Mohamed Reda Bouadjenek, Tahsin Alamgir Kheya, Phillip Dawson, and Sunil Aryal. A comprehensive analysis of large language model outputs: Similarity, diversity, and bias. arXiv preprint arXiv:2505.09056, 2025

  31. [31]

    Vertex ai platform, 2025

    Google Cloud. Vertex ai platform, 2025. Accessed 7 August 2025

  32. [32]

    Azure ai foundry, 2025

    Microsoft Azure. Azure ai foundry, 2025. Accessed 7 August 2025

  33. [33]

    Flowise documentation. https://docs.flowiseai.com/, 2026

    Flowise. Flowise documentation. https://docs.flowiseai.com/, 2026. Accessed: 2026-04-20

  34. [34]

    Brave search api, 2025

    Brave Software, Inc. Brave search api, 2025. Accessed 7 August 2025

  35. [35]

    Openai platform, 2025

    OpenAI. Openai platform, 2025. Accessed 7 August 2025

  36. [36]

    Pinecone (vector database) on azure marketplace

    Pinecone Systems, Inc. Pinecone (vector database) on azure marketplace. https://azuremarketplace.microsoft.com/en-US/marketplace/apps/pineconesystemsinc1688761585469.pineconesaas. Accessed: 2025-08-07

  38. [38]

    Supabase: The postgres development platform, 2025

    Supabase. Supabase: The postgres development platform, 2025. Accessed 7 August 2025

  39. [39]

    Enhancing medical ai with retrieval-augmented generation: A mini narrative review.Digital health, 11:20552076251337177, 2025

    Omid Kohandel Gargari and Gholamreza Habibi. Enhancing medical ai with retrieval-augmented generation: A mini narrative review.Digital health, 11:20552076251337177, 2025

  40. [40]

    Controlling the false discovery rate: A practical and powerful approach to multiple testing.Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289–300, 1995

    Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing.Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289–300, 1995

  41. [41]

    Mann–whitney u test and kruskal–wallis h test statistics in r

    Kingsley Okoye and Samira Hosseini. Mann–whitney u test and kruskal–wallis h test statistics in r. InR programming: Statistical data analysis in research, pages 225–246. Springer, 2024

  42. [42]

    Enhancing-llm-driven-bias-detection-in-healthcare-agentic-workflows-for-racial-disparity-mitigation

    Sihao Xing. Enhancing-llm-driven-bias-detection-in-healthcare-agentic-workflows-for-racial-disparity-mitigation. GitHub repository, 2025. Accessed: 2025-08-21