Can LLMs Help Allocate Public Health Resources? A Case Study on Childhood Lead Testing
Pith reviewed 2026-05-17 06:55 UTC · model grok-4.3
The pith
LLMs overlook highest-risk neighborhoods when allocating childhood lead testing resources, averaging 0.46 accuracy
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When asked to allocate 1,000 childhood lead test kits per city across neighborhoods using structured vulnerability data, LLMs produced allocations whose agreement with the Priority Score ranking averaged 0.46 and reached at most 0.66 for one advanced model. The models repeatedly assigned fewer kits to areas with the highest lead prevalence and the largest proportions of untested children, while giving excess kits to lower-priority neighborhoods.
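The text reviewed here does not spell out how the accuracy figure is computed. One plausible reading, offered only as a sketch, is an overlap measure: the share of kits that a proposed allocation places in the same neighborhoods as a reference allocation derived from the Priority Score. The function name `allocation_overlap` and the overlap definition itself are assumptions, not the paper's stated metric.

```python
def allocation_overlap(proposed, reference):
    """Share of kits that `proposed` places where `reference` also places them.

    Both arguments map neighborhood name -> number of kits.
    This overlap definition is an illustrative assumption, not the paper's metric.
    """
    total = sum(reference.values())
    if total <= 0:
        raise ValueError("reference allocation must contain at least one kit")
    if sum(proposed.values()) != total:
        raise ValueError("both allocations must distribute the same number of kits")
    names = set(proposed) | set(reference)
    # For each neighborhood, only the kits present in both allocations count.
    shared = sum(min(proposed.get(n, 0), reference.get(n, 0)) for n in names)
    return shared / total


# Placeholder neighborhoods and numbers, not data from the paper.
reference = {"A": 500, "B": 300, "C": 200}
proposed = {"A": 200, "B": 300, "C": 500}
print(allocation_overlap(proposed, reference))
```

Under this definition a score of 1.0 means the two allocations are identical, and moving kits away from a high-priority neighborhood lowers the score in proportion to the number of kits moved.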
What carries the argument
The Priority Score, a composite ranking that weights the proportion of untested children, prevalence of elevated blood lead levels, and local public health coverage patterns to guide test-kit distribution.
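To make the idea of a composite ranking concrete, the following Python sketch combines three indicators into a single score. Everything specific in it is an assumption for illustration: the equal weights, the min-max normalization, the field names, and the choice to treat low coverage as raising priority. The paper's actual formula and weights are not given in the text reviewed here.

```python
from dataclasses import dataclass


@dataclass
class Neighborhood:
    name: str
    untested_share: float   # proportion of children not yet tested, 0..1
    ebll_prevalence: float  # proportion with elevated blood lead levels, 0..1
    coverage: float         # public health coverage rate, 0..1 (hypothetical field)


# Hypothetical equal weights; the paper's real weights are not stated here.
WEIGHTS = {"untested": 1 / 3, "ebll": 1 / 3, "coverage_gap": 1 / 3}


def _min_max(values):
    """Rescale values to 0..1 so the indicators are comparable."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def priority_scores(neighborhoods):
    """Return {name: score}; a higher score means higher assumed priority."""
    untested = _min_max([n.untested_share for n in neighborhoods])
    ebll = _min_max([n.ebll_prevalence for n in neighborhoods])
    # Assumption: lower coverage implies a larger gap and higher priority.
    gap = _min_max([1.0 - n.coverage for n in neighborhoods])
    return {
        n.name: WEIGHTS["untested"] * u + WEIGHTS["ebll"] * e + WEIGHTS["coverage_gap"] * g
        for n, u, e, g in zip(neighborhoods, untested, ebll, gap)
    }


def rank_by_priority(neighborhoods):
    """Neighborhood names ordered from highest to lowest score."""
    scores = priority_scores(neighborhoods)
    return sorted(scores, key=scores.get, reverse=True)
```

Because the weights are free parameters, any ranking produced this way depends on when and how they were chosen, which is the concern the referee report raises about pre-specification.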
If this is right
- Public health agencies would need human oversight or additional verification steps when using LLMs to set outreach priorities for lead testing.
- LLMs would require stronger mechanisms for retrieving and weighing current local health statistics before they could replace existing allocation methods.
- Tasks that combine multiple quantitative indicators expose a recurring gap between marketed LLM capabilities and actual performance on evidence-based decisions.
Where Pith is reading between the lines
- Similar weaknesses could appear in LLM use for other resource decisions such as vaccine distribution or lead abatement funding.
- Hybrid systems that let LLMs propose allocations and then run them through an optimization check against the Priority Score might reduce the observed errors.
- Updating the underlying data sources in real time or adding explicit instructions to ignore non-quantitative neighborhood descriptions could improve results on this specific task.
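The hybrid idea in the list above can be sketched in a few lines of Python. This is a simple deviation check rather than a full optimization: it derives a benchmark allocation proportional to the scores, then flags neighborhoods where an LLM's proposal strays too far from it. The proportional rule, the 25% tolerance, and the function names are all assumptions made for illustration.

```python
def proportional_allocation(scores, total_kits=1000):
    """Split total_kits in proportion to scores using largest-remainder rounding."""
    score_sum = sum(scores.values())
    if score_sum <= 0:
        raise ValueError("scores must sum to a positive value")
    exact = {n: total_kits * s / score_sum for n, s in scores.items()}
    alloc = {n: int(x) for n, x in exact.items()}
    leftover = total_kits - sum(alloc.values())
    # Hand the kits lost to rounding to the largest fractional parts.
    by_remainder = sorted(exact, key=lambda n: exact[n] - alloc[n], reverse=True)
    for n in by_remainder[:leftover]:
        alloc[n] += 1
    return alloc


def flag_deviations(llm_alloc, scores, total_kits=1000, tolerance=0.25):
    """Return {name: (llm_kits, benchmark_kits)} for large relative deviations.

    The tolerance is a hypothetical threshold, not a value from the paper.
    """
    benchmark = proportional_allocation(scores, total_kits)
    flagged = {}
    for name, target in benchmark.items():
        got = llm_alloc.get(name, 0)
        if target == 0:
            if got > 0:
                flagged[name] = (got, target)
        elif abs(got - target) / target > tolerance:
            flagged[name] = (got, target)
    return flagged


# Placeholder scores and proposal, not data from the paper.
scores = {"A": 5, "B": 3, "C": 2}
llm_proposal = {"A": 200, "B": 300, "C": 500}
print(flag_deviations(llm_proposal, scores))
```

Flagged neighborhoods could then be routed to a human reviewer, which matches the oversight step suggested earlier rather than replacing it.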
Load-bearing premise
The Priority Score gives a complete and accurate ordering of which neighborhoods most need lead testing resources.
What would settle it
Re-running the allocation task with an alternative ranking based only on recent confirmed lead-poisoning cases or on census poverty and housing-age data and finding that LLMs match that ranking much more closely than they match the original Priority Score.
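One way to run such a comparison, sketched here under stated assumptions, is to compute a rank correlation between the kits an LLM assigns and each candidate benchmark, then see which benchmark it tracks more closely. The choice of Spearman's rank correlation is an assumption; the paper does not name a statistic for this test. The implementation below is pure Python and handles ties with average ranks.

```python
def average_ranks(values):
    """Rank values from largest (rank 1) to smallest, averaging ranks on ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


# Placeholder values, not data from the paper.
llm_kits = [200, 300, 500]
priority_score = [0.9, 0.5, 0.1]
alternative_score = [0.1, 0.4, 0.8]
print(spearman(llm_kits, priority_score), spearman(llm_kits, alternative_score))
```

A markedly higher correlation with the alternative ranking than with the Priority Score would suggest the low accuracy reflects the benchmark's construction rather than the models alone.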
Original abstract
Public health agencies face critical challenges in identifying high-risk neighborhoods for childhood lead exposure with limited resources for outreach and intervention programs. To address this, we develop a Priority Score integrating untested children proportions, elevated blood lead prevalence, and public health coverage patterns to support optimized resource allocation decisions across 136 neighborhoods in Chicago, New York City, and Washington, D.C. We leverage these allocation tasks, which require integrating multiple vulnerability indicators and interpreting empirical evidence, to evaluate whether large language models (LLMs) with agentic reasoning and deep research capabilities can effectively allocate public health resources when presented with structured allocation scenarios. LLMs were tasked with distributing 1,000 test kits within each city based on neighborhood vulnerability indicators. Results reveal significant limitations: LLMs frequently overlooked neighborhoods with highest lead prevalence and largest proportions of untested children, such as West Englewood in Chicago, while allocating disproportionate resources to lower-priority areas like Hunts Point in New York City. Overall accuracy averaged 0.46, reaching a maximum of 0.66 with ChatGPT 5 Deep Research. Despite their marketed deep research capabilities, LLMs struggled with fundamental limitations in information retrieval and evidence-based reasoning, frequently citing outdated data and allowing non-empirical narratives about neighborhood conditions to override quantitative vulnerability indicators.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates LLMs' ability to allocate limited public health resources for childhood lead testing across 136 neighborhoods in Chicago, New York City, and Washington, D.C. It defines a composite Priority Score from untested-child proportions, elevated blood-lead prevalence, and coverage patterns, then prompts LLMs to distribute 1,000 test kits per city and measures agreement with the Priority Score. The paper reports average accuracy of 0.46 (maximum 0.66 with ChatGPT 5 Deep Research), highlights specific failures such as under-allocation to West Englewood (Chicago) and over-allocation to Hunts Point (NYC), and concludes that LLMs struggle with evidence-based reasoning and information retrieval.
Significance. If the Priority Score is accepted as a faithful proxy for vulnerability, the empirical results provide a concrete demonstration of current LLM limitations in multi-factor, high-stakes allocation tasks. The work contributes a reproducible case study with neighborhood-level data to the literature on AI reliability for public-policy decisions.
major comments (2)
- [Abstract / Priority Score definition] Abstract and the section defining the Priority Score: the central claim that LLMs 'frequently overlooked' highest-risk neighborhoods rests on treating the authors' Priority Score as ground truth, yet no correlation is reported with independent outcomes such as CDC or city health-department lead-poisoning incidence, confirmed cases, or post-intervention uptake rates across the 136 neighborhoods.
- [Methods] Methods section (implied by the accuracy numbers in the abstract): it is not stated whether the weights combining untested proportions, BLL prevalence, and coverage were pre-specified before LLM prompting or adjusted after observing results; without this, the reported accuracies (0.46 average, 0.66 max) cannot be interpreted as an unbiased performance measurement.
minor comments (3)
- [Methods] Clarify the exact prompt templates and whether they were held fixed across all LLMs and cities.
- [Results] Add a table or figure showing per-neighborhood allocations for at least one city to allow direct inspection of the reported discrepancies.
- [Abstract] The abstract cites 'ChatGPT 5 Deep Research'; confirm the exact model name and version used.
Simulated Author's Rebuttal
Thank you for the opportunity to respond to the referee report. We appreciate the constructive comments on the framing of the Priority Score and the transparency of our methods. We address each major comment below and indicate planned revisions.
- Referee: [Abstract / Priority Score definition] Abstract and the section defining the Priority Score: the central claim that LLMs 'frequently overlooked' highest-risk neighborhoods rests on treating the authors' Priority Score as ground truth, yet no correlation is reported with independent outcomes such as CDC or city health-department lead-poisoning incidence, confirmed cases, or post-intervention uptake rates across the 136 neighborhoods.
  Authors: We agree that the Priority Score functions as a constructed proxy rather than independently validated ground truth. It was designed to integrate three publicly available vulnerability indicators drawn from established public-health literature on lead exposure. The study evaluates LLM performance relative to this transparent benchmark in a realistic multi-factor allocation task; it does not seek to establish the Score as definitive. We will revise the abstract and introduction and add a dedicated limitations paragraph that explicitly describes the Score as a proxy, notes the absence of neighborhood-level correlation with incidence or uptake data, and discusses why such granular outcome data are often unavailable across the three cities. revision: yes
- Referee: [Methods] Methods section (implied by the accuracy numbers in the abstract): it is not stated whether the weights combining untested proportions, BLL prevalence, and coverage were pre-specified before LLM prompting or adjusted after observing results; without this, the reported accuracies (0.46 average, 0.66 max) cannot be interpreted as an unbiased performance measurement.
  Authors: The component weights were pre-specified on the basis of prior epidemiological studies of childhood lead exposure before any LLM prompting or result inspection occurred. We regret that this pre-specification was not stated explicitly in the submitted manuscript. We will expand the Methods section to document the a priori weight selection process, the literature sources used, and the exact formula, thereby allowing readers to assess the performance metric as an unbiased evaluation against a fixed benchmark. revision: yes
Circularity Check
No significant circularity: empirical evaluation against author-defined benchmark
The paper constructs a Priority Score from untested-child proportions, elevated BLL prevalence, and coverage patterns, then measures LLM allocation accuracy (avg 0.46) by closeness to allocations implied by that score. This is a direct empirical comparison, not a derivation that reduces to its own inputs by construction. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claim (LLMs overlook high-risk areas) is a performance observation against the benchmark and does not loop back to itself. Validity of the Priority Score as a proxy is a separate correctness concern, not circularity per the analysis rules.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The Priority Score formed from untested children proportions, elevated blood lead prevalence, and public health coverage patterns correctly ranks neighborhoods for lead-testing resource allocation.