Geographical Distribution of Biomedical Research in the USA and China
Pith reviewed 2026-05-24 22:14 UTC · model grok-4.3
The pith
Nearly 20 million PubMed articles place biomedical research in the USA and China around a few stable geographic centroids.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
K-means clustering on geocoded PubMed author affiliations shows that the average published paper lies within a relatively short distance of a few centroids. These centroids have shifted very little over the past 30 years and the distribution of distances to them has remained stable. Overall country centroids have moved south by about 0.2 degrees in the USA and 1.7 degrees in China, while longitude has stayed nearly constant. The pattern indicates that a handful of large scientific hubs dominate output and that the typical investigator sits within geographical reach of one such hub.
What carries the argument
K-means clustering on geocoded PubMed article affiliations to locate research centroids and measure distances from papers to those centroids.
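The procedure named above can be illustrated with a minimal sketch. It is not the authors' code. It assumes plain Lloyd-style K-means run directly on (latitude, longitude) pairs in degrees, with great-circle (haversine) distance used only to report how far each point lies from its nearest centroid. The function names, the toy coordinates, and the Earth-radius constant are illustrative assumptions.

```python
import math
import random

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def mean_point(points):
    """Arithmetic mean of (lat, lon) pairs."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def sq_dist(p, q):
    """Squared Euclidean distance in degree space (assumption: no projection)."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def kmeans(points, k, iters=50, seed=0):
    """Lloyd's algorithm; initial centroids are k distinct input points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            groups[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        updated = [mean_point(g) if g else centroids[i]
                   for i, g in enumerate(groups)]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def mean_distance_to_nearest(points, centroids):
    """Average haversine distance (km) from each point to its closest centroid."""
    total = sum(min(haversine_km(p, c) for c in centroids) for p in points)
    return total / len(points)


# Hypothetical city coordinates: three near Boston, three near San Francisco.
toy_points = [(42.0, -71.0), (42.4, -71.2), (42.2, -70.8),
              (37.6, -122.4), (37.8, -122.2), (38.0, -122.6)]
toy_centroids = kmeans(toy_points, k=2)
print(toy_centroids, mean_distance_to_nearest(toy_points, toy_centroids))
```

Clustering in raw degree space distorts east-west distances at higher latitudes, so a real analysis might project the coordinates first or cluster on the sphere. The sketch keeps the simpler form for readability.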
If this is right
- Typical investigators operate within geographical reach of a major biomedical research hub.
- The locations and number of dominant hubs have remained stable for thirty years.
- National centroids have shifted modestly south with negligible longitudinal change.
- The observed pattern supplies a baseline for measuring changes in centralization of biomedical research at national and regional scales.
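The national-centroid shift mentioned in this list can be made concrete with a small sketch. It assumes the overall centroid for a year is the arithmetic mean of the latitudes and longitudes of that year's geocoded points, which may differ from the paper's exact definition. The record layout `(year, lat, lon)`, the function names, and the sample values are hypothetical.

```python
from collections import defaultdict


def yearly_centroids(records):
    """Map each year to the mean (lat, lon) of its records.

    records: iterable of (year, lat, lon) tuples (assumed layout).
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for year, lat, lon in records:
        acc = sums[year]
        acc[0] += lat
        acc[1] += lon
        acc[2] += 1
    return {year: (acc[0] / acc[2], acc[1] / acc[2])
            for year, acc in sorted(sums.items())}


def centroid_shift(centroids):
    """Return (delta_lat, delta_lon) between the earliest and latest year.

    A negative delta_lat means the centroid moved south.
    """
    years = sorted(centroids)
    first, last = centroids[years[0]], centroids[years[-1]]
    return (last[0] - first[0], last[1] - first[1])


# Made-up records for illustration only, not data from the paper.
sample = [(1988, 40.0, -90.0), (1988, 38.0, -88.0),
          (2016, 39.6, -90.0), (2016, 38.0, -88.0)]
shift = centroid_shift(yearly_centroids(sample))
print(shift)
```

Comparing only the first and last year is the simplest choice. Fitting a trend across all years would be less sensitive to a single unusual year.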
Where Pith is reading between the lines
- If the same clustering approach were applied to other countries it could reveal whether biomedical research is more or less concentrated elsewhere.
- Stable hubs may shape collaboration patterns by lowering travel costs for investigators near those centers.
- Funding decisions aimed at creating new hubs could be evaluated against how much they would actually reduce average distances for researchers.
Load-bearing premise
The geocoded PubMed articles with author affiliations accurately capture the true geographical distribution of biomedical research without major biases from incomplete addresses, multiple affiliations, or geocoding errors.
What would settle it
Re-running the same K-means procedure on a dataset that adds addresses for the large fraction of PubMed records currently missing usable affiliations and finding substantially more centroids or much larger average distances would falsify the central claim.
Original abstract
We analyze nearly 20 million geocoded PubMed articles with author affiliations. Using K-means clustering for the lower 48 US states and mainland China, we find that the average published paper is within a relatively short distance of a few centroids. These centroids have shifted very little over the past 30 years, and the distribution of distances to these centroids has not changed much either. The overall country centroids have gradually shifted south (about 0.2° for the USA and 1.7° for China), while the longitude has not moved significantly. These findings indicate that there are few large scientific hubs in the USA and China and the typical investigator is within geographical reach of one such hub. This sets the stage to study centralization of biomedical research at national and regional levels across the globe, and over time.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript analyzes nearly 20 million geocoded PubMed articles with author affiliations in the USA and China. Using K-means clustering on latitude/longitude pairs for the lower 48 US states and mainland China, it identifies a small number of centroids as scientific hubs. These centroids are reported to have shifted very little over 30 years, with the overall country centroids moving south by about 0.2° (USA) and 1.7° (China) while longitude remains stable; distance distributions to the centroids are also largely unchanged. The authors conclude there are few large hubs and that the typical investigator is within geographical reach of one.
Significance. If the geocoded sample is representative, the work supplies a large-scale descriptive account of geographical concentration in biomedical research for two major producers. The observation of long-term centroid stability and modest latitudinal drift offers a concrete baseline for studies of research centralization, and the authors explicitly frame the analysis as preparatory for global and temporal extensions.
major comments (3)
- [Data] Data section: the manuscript provides no information on the fraction of PubMed records successfully geocoded versus those excluded for missing, unparseable, or invalid addresses, nor on the rule used to resolve multiple affiliations per paper. Because the K-means centroids and all distance-to-hub statistics are computed directly from the retained point set, any systematic exclusion of smaller or non-English institutions would bias the reported hubs and compress the distance distributions.
- [Methods] Methods section: the number of clusters, the criterion used to select it, and any validation (e.g., within-cluster sum of squares or stability across random seeds) are not reported. The central claim that “there are few large scientific hubs” is therefore not anchored to a reproducible choice of k and cannot be assessed for sensitivity.
- [Results] Results section (stability and shift paragraphs): the conclusions that centroids have moved little and that distance distributions are stable rest on the assumption that geocoding completeness and precision are constant across the 30-year window. No audit of geocoding accuracy (street-level vs. city-level) or temporal changes in address quality is supplied, so the reported 0.2°/1.7° southward shifts and the invariance of distance histograms could be artifacts of evolving data coverage.
minor comments (2)
- [Abstract] The exact number of articles retained after geocoding should be stated in the main text rather than only the rounded figure “nearly 20 million” given in the abstract.
- [Figures] Figure captions should explicitly state the value of k used for each map and the time periods covered by each panel.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below, indicating revisions where appropriate.
- Referee: [Data] Data section: the manuscript provides no information on the fraction of PubMed records successfully geocoded versus those excluded for missing, unparseable, or invalid addresses, nor on the rule used to resolve multiple affiliations per paper. Because the K-means centroids and all distance-to-hub statistics are computed directly from the retained point set, any systematic exclusion of smaller or non-English institutions would bias the reported hubs and compress the distance distributions.
  Authors: We agree this information should be provided. In the revised manuscript we will report the overall geocoding success rate for the USA and China samples, describe the rule applied to papers with multiple affiliations, and add a short discussion of possible selection biases from excluded records.
  Revision: yes
- Referee: [Methods] Methods section: the number of clusters, the criterion used to select it, and any validation (e.g., within-cluster sum of squares or stability across random seeds) are not reported. The central claim that “there are few large scientific hubs” is therefore not anchored to a reproducible choice of k and cannot be assessed for sensitivity.
  Authors: We will add the chosen value of k, the selection criterion (elbow method on within-cluster sum of squares), and stability checks across random seeds to the Methods section so that the number of hubs is reproducible and sensitivity can be evaluated.
  Revision: yes
- Referee: [Results] Results section (stability and shift paragraphs): the conclusions that centroids have moved little and that distance distributions are stable rest on the assumption that geocoding completeness and precision are constant across the 30-year window. No audit of geocoding accuracy (street-level vs. city-level) or temporal changes in address quality is supplied, so the reported 0.2°/1.7° southward shifts and the invariance of distance histograms could be artifacts of evolving data coverage.
  Authors: This is a legitimate concern. We will insert a limitations paragraph noting that changes in PubMed affiliation completeness over time could affect the observed stability, while emphasizing that the same geocoding pipeline was applied throughout. A full temporal audit of address precision is not feasible with the current data sources.
  Revision: partial
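The k-selection and seed-stability checks discussed in the Methods exchange could look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' pipeline: it runs a basic Lloyd-style K-means on (lat, lon) pairs in degree space, records the within-cluster sum of squares (WCSS) for each k and each random seed, and leaves the reading of the "elbow" to the analyst. All function names and the toy points are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two (lat, lon) pairs in degrees."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm seeded with k distinct input points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            groups[nearest].append(p)
        updated = []
        for i, g in enumerate(groups):
            if g:
                updated.append((sum(p[0] for p in g) / len(g),
                                sum(p[1] for p in g) / len(g)))
            else:
                updated.append(centroids[i])  # keep an empty cluster in place
        if updated == centroids:
            break
        centroids = updated
    return centroids


def wcss(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


def wcss_by_k(points, ks, seeds):
    """Return {k: [WCSS for each seed]} for elbow and stability inspection."""
    return {k: [wcss(points, kmeans(points, k, seed=s)) for s in seeds]
            for k in ks}


# Hypothetical coordinates forming two tight groups.
toy_points = [(42.0, -71.0), (42.4, -71.2), (42.2, -70.8),
              (37.6, -122.4), (37.8, -122.2), (38.0, -122.6)]
for k, values in wcss_by_k(toy_points, ks=[1, 2, 3], seeds=range(5)).items():
    print(k, min(values), max(values))
```

A large drop in WCSS followed by a plateau suggests a candidate k, and a narrow min-max range across seeds at that k suggests the solution is stable. On real data with millions of points, a library implementation with better initialization would be the more practical choice.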
Circularity Check
No circularity: purely empirical descriptive analysis of external PubMed data
The paper applies standard K-means clustering directly to latitude/longitude pairs from ~20M geocoded PubMed records, computes empirical distances to centroids, and reports observed shifts over time. No equations, fitted parameters, predictions, or derivations are present; all statistics are direct outputs of the input point set. No self-citations or uniqueness claims are invoked. The analysis is self-contained against external benchmarks and does not reduce any result to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: geocoded affiliations from PubMed accurately represent author locations.
Reference graph
Works this paper leans on
- [2] It covers a significant portion of the affiliations missing from PubMed. These were harvested from external sources including PubMed Central, Microsoft Academic Graph (MAG), Astrophysics Data System (ADS), and NIH grants. The geographical data of each article is identified by MapAffil, which maps an author’s affiliation to its city and the corresponding city-cen...
- [3] Among 37 million affiliations, there are about 11 million in the lower 48 US states and about 3 million in mainland China during the period 1988-2016. For each publication, each city is counted once when multiple coauthors are from the same city. As shown in Figure 2, the number of publications has been growing rapidly over time. Note that the y-ax...
- [4] Therefore, the availability of geospatial data surges in 1988 and again in
- [5] Number of papers 1988-2016. Fig
- [6] Spatial distribution of the USA and its territories. (Page header: Geographical Distribution of Biomedical Research in the USA … WOSP 2017, June 19, 2017, Toronto, ON, Canada) 3 METHODS. 3.1 Centroids and distances. 3.1.1 Geographical centroid. For every affiliation in the corpus, the longitude and latitude of its city have been identified and recorded. Given the assump...
- [7] Density map of the lower 48 US states 1988-2016. Fig
- [8] Density map of mainland China 1988-2016. (Page header: WOSP 2017, June 19, 2017, Toronto, ON, Canada. Y. Guan et al.) 4.2 Overall centroids and their movements over time. Figure 6 shows the overall centroids when all papers during 1988-2016 are pooled. The US centroid (-89.2, 38.7) is located in southern Illinois, while the centroid in China (116.2, 34.7) is located in ...
- [9] Overall centroid movement for the USA (left panels) and China (right panels) over time from 1988 to
- [10] Regional clustering: the average distance to the closest centroid decreases rapidly as the number of centroids (k) increases for both the USA and China. Fig
- [11] Regional clusters for different number of centroids (k) for the USA (top panels), and China (bottom panels). Another interesting temporal observation is that, although the quantity of publications has increased dramatically, the regional clustering and average distances have remained almost the same. In other words, the quantity within each region has be...
- [12] The regional clustering in 1998 (top panel) and 2016 (bottom panel) in the USA are nearly identical. ACKNOWLEDGMENTS: Research reported in this publication was supported in part by NIH National Institute on Aging P01AG039347. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
- [13] Cities and the geographical deconcentration of scientific activity: A multilevel analysis of publications (1987–2007). Urban Studies, 51(10), 2219-2234
- [14] Which cities produce more excellent papers than can be expected? A new mapping approach, using Google Maps, based on statistical significance testing. Journal of the Association for Information Science and Technology, 62(10), 1954-1962
- [15] Spatial distribution of scientific activities: An exploratory analysis of Brazil, 2000–10. Science and Public Policy, 41(5), 625-640
- [16] A bibliometric analysis of geographic publication variations in the Journal of Cardiothoracic and Vascular Anesthesia from 1990 to