Foundation Models to Unlock Real-World Evidence from Nationwide Medical Claims
Recognition: 3 Lean theorem links
Pith reviewed 2026-05-08 18:33 UTC · model grok-4.3
The pith
A generative transformer trained on 43.8 billion medical claims events outperforms specialized models on over 1,000 disease prediction tasks while improving real-world evidence analyses.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ReClaim establishes administrative claims as a scalable substrate for healthcare foundation models by training a generative transformer on 43.8 billion events to capture diagnoses, procedures, medications, and expenditure. The model achieves 75.6 percent mean AUC across over 1,000 tasks, substantially outperforming disease-specific LightGBM at 66.3 percent and Delphi at 69.4 percent, with the largest gains for rare diseases. Performance improves with model scale, and post-training adds 13.8 percentage points over pre-training alone. The model also raises explained variance in healthcare expenditure forecasting from 0.28 to 0.37 and reduces systematic bias in a target trial emulation by 72 percent.
What carries the argument
ReClaim, the generative transformer trained to model sequences of medical events and predict future outcomes from longitudinal claims trajectories.
If this is right
- Performance improves monotonically as model size increases from 140 million to 1.7 billion parameters.
- Gains are largest for rare diseases compared to common ones.
- The model improves forecasting of healthcare expenditure.
- It reduces systematic bias in real-world evidence analyses that emulate target trials.
- Representations generalize across time periods and independent datasets.
Where Pith is reading between the lines
- Such models could support real-time disease surveillance by using existing claims streams without custom retraining for each condition.
- Combining ReClaim-style pretraining with other data modalities might further boost accuracy on complex outcomes.
- The approach suggests a path to reduce reliance on labeled datasets in healthcare machine learning by leveraging unlabeled claims at scale.
Load-bearing premise
The claim depends on the training and test splits blocking any future information from reaching past predictions, plus the assumption that performance on MarketScan data will transfer to other claims databases without major shifts in data distribution.
What would settle it
Demonstrating no performance advantage for ReClaim over the baselines when evaluated on a fresh claims database collected after 2022 or from a different national source would falsify the generalization result.
original abstract
Evidence derived from large-scale real-world data (RWD) is increasingly informing regulatory evaluation and healthcare decision-making. Administrative claims provide population-scale, longitudinal records of healthcare utilization, expenditure, and detailed coding of diagnoses, procedures, and medications, yet their potential as a substrate for healthcare foundation models remains largely unexplored. Here we present ReClaim, a generative transformer trained from scratch on 43.8 billion medical events from more than 200 million enrollees in the MarketScan claims data spanning 2008-2022. ReClaim models longitudinal trajectories across diagnoses, procedures, medications, and expenditure, and was scaled to 140 million, 700 million, and 1.7 billion parameters. Across over 1,000 disease-onset prediction tasks, ReClaim achieved a mean AUC of 75.6%, substantially outperforming disease-specific LightGBM (66.3%) and the transformer-based Delphi model (69.4%), with the largest gains for rare diseases. These advantages held across retrospective and prospective evaluations and in external validation on two independent datasets. Performance improved monotonically with scale, and post-training added 13.8 percentage points over pre-training alone. Beyond disease prediction, ReClaim captured financial outcomes and improved real-world evidence (RWE) analyses: for healthcare expenditure forecasting it increased explained variance from 0.28 to 0.37 relative to LightGBM, and in a target trial emulation it reduced systematic bias by 72% on average relative to Delphi. Together, these results establish administrative claims as a scalable substrate for healthcare foundation models and show that learned representations generalize across time periods and data sources, supporting disease surveillance, expenditure forecasting, and RWE generation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents ReClaim, a generative transformer pretrained from scratch on 43.8 billion medical events from >200 million MarketScan enrollees (2008-2022). It reports that models scaled to 140M/700M/1.7B parameters achieve a mean AUC of 75.6% across >1,000 disease-onset prediction tasks, outperforming disease-specific LightGBM (66.3%) and the Delphi transformer (69.4%), with largest gains on rare diseases. Advantages are claimed to persist in retrospective/prospective settings and external validation on two independent datasets; performance scales monotonically, post-training adds 13.8 points over pre-training alone, and the model improves expenditure forecasting (R² 0.28→0.37) and reduces bias by 72% in a target-trial emulation.
Significance. If the temporal-split and generalization claims hold, the work would establish administrative claims as a viable substrate for healthcare foundation models, with particular value for rare-disease prediction and downstream RWE tasks. The scale (1.7B parameters, 43.8B events), breadth (>1,000 tasks), external validation, and explicit scaling results are concrete strengths that could shift practice in disease surveillance and real-world evidence generation.
major comments (3)
- [Methods] Prospective evaluation protocol: The headline claim that advantages 'held across retrospective and prospective evaluations' (Abstract) is load-bearing for the 75.6% AUC and the comparison to LightGBM/Delphi, yet the manuscript supplies no explicit description of the calendar cutoff, patient-level vs. event-level hold-out, handling of enrollment gaps, or confirmation that pretraining on the full 2008-2022 corpus does not embed future claims into representations used for earlier predictions. This omission directly affects the validity of the prospective results, especially the reported gains on rare diseases.
- [Results] Scaling and post-training: The statements that performance 'improved monotonically with scale' and that post-training 'added 13.8 percentage points over pre-training alone' lack ablations or controls for hyperparameter search, total compute, or data volume. Without these, it is impossible to attribute the gains to model scale or the post-training stage rather than confounding factors, weakening the scaling-law claim that is central to the foundation-model narrative.
- [External validation] The claim of generalization 'in external validation on two independent datasets' (Abstract) is presented without characterizing the datasets, quantifying distribution shift from MarketScan, or detailing the exact evaluation protocol. Because the MarketScan population is the sole training source, this information is required to assess whether the reported outperformance transfers beyond the training distribution.
minor comments (2)
- [Abstract] The mean AUC of 75.6% should specify whether it is a macro-average across the 1,000+ tasks and whether tasks are weighted by prevalence or number of cases.
- [Abstract] Notation: RWD and RWE are introduced without first-use definitions in the abstract, although they are expanded later; a single clarifying sentence would improve readability.
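The macro- versus weighted-averaging distinction raised in the first minor comment is easy to make concrete. A minimal sketch with invented task names, labels, and scores (not the paper's evaluation code):

```python
# Illustrative contrast between macro-averaged and case-weighted mean AUC.
# All task names and numbers below are hypothetical.

def auc(labels, scores):
    """AUC via the Mann-Whitney U statistic (ties get half credit)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

tasks = {  # task -> (labels, scores)
    "common_disease": ([1, 1, 0, 0, 0, 0], [0.9, 0.4, 0.8, 0.3, 0.2, 0.1]),
    "rare_disease":   ([1, 0, 0, 0, 0, 0], [0.7, 0.6, 0.2, 0.1, 0.1, 0.0]),
}

aucs = {t: auc(y, s) for t, (y, s) in tasks.items()}
n_cases = {t: sum(y) for t, (y, _) in tasks.items()}

macro = sum(aucs.values()) / len(aucs)               # every task counts equally
weighted = (sum(aucs[t] * n_cases[t] for t in tasks) # tasks weighted by cases
            / sum(n_cases.values()))
print(macro, weighted)
```

With rare diseases dominating the task count, the two averages can diverge substantially, which is why the comment asks the authors to state which one 75.6% is.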
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which identifies key areas where additional clarity will strengthen the manuscript. We address each major comment below and commit to revisions that directly respond to the concerns raised.
point-by-point responses
-
Referee: [Methods] Prospective evaluation protocol: The headline claim that advantages 'held across retrospective and prospective evaluations' (Abstract) is load-bearing for the 75.6% AUC and the comparison to LightGBM/Delphi, yet the manuscript supplies no explicit description of the calendar cutoff, patient-level vs. event-level hold-out, handling of enrollment gaps, or confirmation that pretraining on the full 2008-2022 corpus does not embed future claims into representations used for earlier predictions. This omission directly affects the validity of the prospective results, especially the reported gains on rare diseases.
Authors: We agree that the prospective evaluation protocol requires a more explicit description to support the headline claims. In the revised manuscript we will add a dedicated subsection in Methods that specifies: the exact calendar cutoff (training on events through December 2018 and evaluating prospectively on 2019–2022), patient-level rather than event-level hold-out, the requirement of at least 12 months of continuous enrollment prior to each prediction date, and explicit confirmation that all pretraining and representation learning for prospective tasks used only data available up to the cutoff. These additions will eliminate any ambiguity regarding information leakage and directly address the validity of the rare-disease gains. revision: yes
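The leakage-blocking protocol the authors commit to (calendar cutoff, patient-level hold-out, enrollment requirement) can be sketched as a filter over claims records. Field names, dates, and record shapes here are illustrative assumptions, not the paper's code:

```python
from datetime import date

CUTOFF = date(2018, 12, 31)  # illustrative calendar cutoff from the rebuttal

def split_patients(patients, holdout_ids):
    """Patient-level split: every event of a patient lands on exactly one side,
    so no individual contributes to both pretraining and evaluation."""
    train, test = {}, {}
    for pid, events in patients.items():
        (test if pid in holdout_ids else train)[pid] = events
    return train, test

def visible_history(events, prediction_date):
    """Only events strictly before both the prediction date and the pretraining
    cutoff may feed the representation, blocking future-information leakage."""
    limit = min(prediction_date, CUTOFF)
    return [e for e in events if e["date"] < limit]

# Toy records: pid -> list of {"date": ..., "code": ...} (codes are made up).
patients = {
    "p1": [{"date": date(2017, 3, 1), "code": "E11"},
           {"date": date(2020, 5, 1), "code": "I10"}],
    "p2": [{"date": date(2019, 1, 15), "code": "J45"}],
}
train, test = split_patients(patients, holdout_ids={"p2"})
hist = visible_history(patients["p1"], prediction_date=date(2021, 1, 1))
print(len(train), len(test), [e["code"] for e in hist])
```

Note how p1's 2020 event is excluded from the visible history even though it precedes the 2021 prediction date, because it falls after the pretraining cutoff; this is the confirmation the referee asked for.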
-
Referee: [Results] Scaling and post-training: The statements that performance 'improved monotonically with scale' and that post-training 'added 13.8 percentage points over pre-training alone' lack ablations or controls for hyperparameter search, total compute, or data volume. Without these, it is impossible to attribute the gains to model scale or the post-training stage rather than confounding factors, weakening the scaling-law claim that is central to the foundation-model narrative.
Authors: We acknowledge that stronger controls would improve attribution of the observed gains. All three model sizes were trained on the identical full dataset with the same core hyperparameters and optimizer settings; post-training was performed after pretraining on the same data. Due to the prohibitive cost of retraining at 1.7 B scale, we did not run exhaustive per-scale hyperparameter searches. In revision we will add an appendix table listing exact hyperparameters, approximate FLOPs and GPU-hours per scale, and data volume. We will also include a brief discussion noting that, while full ablations are computationally infeasible, the consistent monotonic trend across three orders of magnitude in parameter count still supports the scaling observation. This constitutes a partial revision given the practical constraints. revision: partial
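Short of the full ablations the authors call infeasible, a cheap sanity check on the scaling observation is to verify monotonicity and fit a log-linear trend across the three reported scales. The AUC values below are invented placeholders, not the paper's per-scale numbers:

```python
import math

# (parameter count, mean AUC) pairs; AUC values are hypothetical placeholders.
points = [(140e6, 0.72), (700e6, 0.74), (1.7e9, 0.756)]

# Check monotonicity, then fit auc ≈ a + b * log10(params) by least squares.
aucs = [a for _, a in points]
assert all(x < y for x, y in zip(aucs, aucs[1:])), "not monotone in scale"

xs = [math.log10(n) for n, _ in points]
n = len(points)
xbar, ybar = sum(xs) / n, sum(aucs) / n
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, aucs)) \
    / sum((x - xbar) ** 2 for x in xs)
a = ybar - b * xbar
print(f"slope: {b:.4f} AUC per decade of parameters")
```

Three points cannot establish a scaling law, which is the referee's point; the fit only summarizes the reported trend that the promised appendix would need to back with compute and hyperparameter controls.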
-
Referee: [External validation] The claim of generalization 'in external validation on two independent datasets' (Abstract) is presented without characterizing the datasets, quantifying distribution shift from MarketScan, or detailing the exact evaluation protocol. Because the MarketScan population is the sole training source, this information is required to assess whether the reported outperformance transfers beyond the training distribution.
Authors: We agree that the external-validation description is currently insufficient. We will expand the External validation section to: (i) fully characterize both independent datasets (source, size, time period, and population demographics), (ii) quantify distribution shift via side-by-side statistics on age, sex, enrollment duration, and prevalence of high-frequency conditions, and (iii) specify the precise evaluation protocol (zero-shot inference versus any fine-tuning, task construction, and metric computation). These additions will allow readers to evaluate how well performance transfers beyond the MarketScan training distribution. revision: yes
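One standard way to quantify the distribution shift the authors promise to report is the standardized mean difference (SMD) between cohorts on each covariate; values above roughly 0.1 are conventionally flagged as imbalance. A self-contained sketch with made-up summary statistics:

```python
import math

def smd(mean_a, sd_a, mean_b, sd_b):
    """Standardized mean difference for a continuous covariate."""
    pooled = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return abs(mean_a - mean_b) / pooled

# Hypothetical MarketScan vs. external-dataset summaries (age in years,
# enrollment duration in months); all numbers are illustrative only.
covariates = {
    "age":        ((41.2, 14.8), (47.9, 16.1)),
    "enrollment": ((28.3, 20.5), (30.1, 21.2)),
}
results = {}
for name, ((m_a, s_a), (m_b, s_b)) in covariates.items():
    results[name] = smd(m_a, s_a, m_b, s_b)
    flag = "shift" if results[name] > 0.1 else "ok"
    print(f"{name}: SMD={results[name]:.3f} ({flag})")
```

A side-by-side SMD table of this kind would let readers judge how far the external datasets sit from the MarketScan training distribution, which is exactly what item (ii) of the rebuttal commits to.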
Circularity Check
No significant circularity in derivation or evaluation chain
full rationale
The paper's central claims consist of empirical AUC, explained-variance, and bias-reduction metrics obtained by training a generative transformer on 43.8B events and evaluating on held-out disease-onset tasks, expenditure forecasting, and target-trial emulation. No equations, self-definitional loops, or fitted-parameter renamings are present that would make reported performance numbers equivalent to the training inputs by construction. Scaling behavior, post-training gains, and comparisons to LightGBM/Delphi are presented as experimental outcomes rather than algebraic identities. External validation on independent datasets and prospective splits further separate the reported results from any self-referential reduction. Minor self-citations, if present, are not load-bearing for the performance numbers.
Axiom & Free-Parameter Ledger
free parameters (2)
- model scale
- training hyperparameters
axioms (2)
- Domain assumption: Administrative claims codes accurately capture diagnoses, procedures, and medications without systematic under- or over-coding.
- Domain assumption: Temporal splits in retrospective and prospective evaluations prevent future information leakage.
Lean theorems connected to this paper
-
Foundation (overall framework) · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem is ambiguous.
"ReClaim, a generative transformer trained from scratch on 43.8 billion medical events ... scaled to 140 million, 700 million, and 1.7 billion parameters."
-
Cost.FunctionalEquation · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem is ambiguous.
"we add L_z = λ·E[(log Z_t)²], λ = 10⁻⁴, and the total pre-training loss is L = L_CE + L_z."
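The quoted regularizer is the standard "z-loss" penalty on the softmax normalizer Z_t. A minimal numpy sketch of the quoted formula, not the paper's implementation:

```python
import numpy as np

def z_loss(logits, lam=1e-4):
    """L_z = lam * E[(log Z_t)^2], where Z_t = sum(exp(logits)) is the
    softmax normalizer at each sequence position."""
    log_z = np.log(np.exp(logits).sum(axis=-1))  # log Z_t per position
    return lam * np.mean(log_z ** 2)

def cross_entropy(logits, targets):
    """Standard token-level cross entropy L_CE."""
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(targets)), targets])

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))   # 4 positions, vocabulary of 10
targets = np.array([1, 3, 5, 7])
total = cross_entropy(logits, targets) + z_loss(logits)  # L = L_CE + L_z
print(total)
```

The penalty pulls log Z_t toward zero, keeping logits in a numerically stable range; a production implementation would compute log Z_t with a stable logsumexp rather than exponentiating directly.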
-
Foundation.AlphaCoordinateFixation · alpha_pin_under_high_calibration · unclear
Relation between the paper passage and the cited Recognition theorem is ambiguous.
"Performance improved monotonically with scale, and post-training added 13.8 percentage points over pre-training alone."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Rachel E. Sherman, Steven A. Anderson, Gerald J. Dal Pan, Gerry W. Gray, Thomas Gross, Nina L. Hunter, Lisa LaVange, Danica Marinac-Dabic, Peter W. Marks, Melissa A. Robb, Jeffrey Shuren, Robert Temple, Janet Woodcock, Lilly Q. Yue, and Robert M. Califf. Real-world evidence—what is it and what can it tell us? New England Journal of Medicine, 375(23):2293–2…, 2016.
- [2] Fang Liu and Demosthenes Panagiotakos. Real-world data: a brief review of the methods, applications, challenges and opportunities. BMC Medical Research Methodology, 22(1):287, 2022.
- [3] Miguel A. Hernán and James M. Robins. Using big data to emulate a target trial when a randomized trial is not available. American Journal of Epidemiology, 183(8):758–764, 2016.
- [4] John Concato and Jacqueline Corrigan-Curay. Real-world evidence—where are we now? New England Journal of Medicine, 386(18):1680–1682, 2022.
- [5] U.S. Food and Drug Administration. Real-world data: Assessing electronic health records and medical claims data to support regulatory decision-making for drug and biological products. Guidance for industry, issued July 2024.
- [6] Yikuan Li, Shishir Rao, Jose Roberto Ayala Solares, Abdelaali Hassaine, Dexter Canoy, Yuan Zhu, Kazem Rahimi, and Gholamreza Salimi-Khorshidi. BEHRT: Transformer for electronic health records. Scientific Reports, 10(1):7155, 2020.
- [7] Laila Rasmy, Yang Xiang, Ziqian Xie, Cui Tao, and Degui Zhi. Med-BERT: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. npj Digital Medicine, 4(1):86, 2021.
- [8] Ethan Steinberg, Ken Jung, Jason A. Fries, Conor K. Corbin, Stephen R. Pfohl, and Nigam H. Shah. Language models are an effective representation learning technique for electronic health record data. Journal of Biomedical Informatics, 113:103637, 2021.
- [9] Zhichao Yang, Avijit Mitra, Weisong Liu, Dan Berlowitz, and Hong Yu. TransformEHR: transformer-based encoder-decoder generative model to enhance prediction of disease outcomes using electronic health records. Nature Communications, 14:7857, 2023.
- [10]
- [11] Chao Pang, Xinzhuo Jiang, Nishanth Parameshwar Pavinkurve, Krishna S. Kalluri, Elise L. Minto, Jason Patterson, Linying Zhang, George Hripcsak, Gamze Gürsoy, Noémie Elhadad, and Karthik Natarajan. CEHR-GPT: Generating electronic health records with chronological patient timelines. arXiv preprint arXiv:2402.04400, 2024.
- [12]
- [13] Michael Wornow, Rahul Thapa, Ethan Steinberg, Jason Fries, and Nigam Shah. EHRSHOT: An EHR benchmark for few-shot evaluation of foundation models. In Advances in Neural Information Processing Systems, volume 36, pages 67125–67137, 2023. Datasets and Benchmarks Track.
- [14] Zeljko Kraljevic, Dan Bean, Anthony Shek, Rebecca Bendayan, Harry Hemingway, Joshua Au Yeung, Alexander Deng, Alfred Balston, Jack Ross, Esther Idowu, James T. Teo, and Richard J. B. Dobson. Foresight—a generative pretrained transformer for modelling of patient timelines using electronic health records: a retrospective modelling study. The Lancet Digital Health, 2024.
- [15] Pawel Renc, Yugang Jia, Anthony E. Samir, Jaroslaw Was, Quanzheng Li, David W. Bates, and Arkadiusz Sitek. Zero shot health trajectory prediction using transformer. npj Digital Medicine, 7(1):256, 2024.
- [16] Artem Shmatko, Alexander Wolfgang Jung, Kumar Gaurav, Søren Brunak, Laust Hvas Mortensen, Ewan Birney, Tom Fitzgerald, and Moritz Gerstung. Learning the natural history of human disease with generative transformers. Nature, 647(8088):248–256, 2025.
- [17] Shane Waxler, Paul Blazek, Davis White, Daniel Sneider, Kevin Chung, Mani Nagarathnam, Patrick Williams, Hank Voeller, Karen Wong, Matthew Swanhorst, et al. Generative medical event models improve with scale. arXiv preprint arXiv:2508.12104, 2025.
- [18] Sheng Zhang, Qin Liu, Naoto Usuyama, Cliff Wong, Tristan Naumann, and Hoifung Poon. Exploring scaling laws for EHR foundation models. arXiv preprint arXiv:2505.22964, 2025.
- [19] Michael Wornow, Yizhe Xu, Rahul Thapa, Birju Patel, Ethan Steinberg, Scott Fleming, Michael A. Pfeffer, Jason Fries, and Nigam H. Shah. The shaky foundations of large language models and foundation models for electronic health records. npj Digital Medicine, 6:135, 2023.
- [20] Xianlong Zeng, Simon L. Linwood, and Chang Liu. Pretrained transformer framework on pediatric claims data for population specific tasks. Scientific Reports, 12:3651, 2022.
- [21] Ricky Sahu, Eric Marriott, Ethan Siegel, David Wagner, Flore Uzan, Troy Yang, and Asim Javed. Introducing the large medical model: State of the art healthcare cost and risk prediction with transformers trained on patient event sequences. arXiv preprint arXiv:2409.13000, 2024.
- [22] Chao Pang, Vincent Jeanselme, Young Sang Choi, Xinzhuo Jiang, Zilin Jing, Aparajita Kashyap, Yuta Kobayashi, Yanwei Li, Florent Pollet, Karthik Natarajan, and Shalmali Joshi. FoMoH: A clinically meaningful foundation model evaluation for structured electronic health records. arXiv preprint arXiv:2505.16941, 2025.
- [23] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems 30, 2017.
- [24] Stéphanie Nguengang Wakap, Deborah M. Lambert, Annie Olry, Charlotte Rodwell, Charlotte Gueydan, Valérie Lanneau, Daniel Murphy, Yann Le Cam, and Ana Rath. Estimating cumulative point prevalence of rare diseases: analysis of the Orphanet database. European Journal of Human Genetics, 28:165–173, 2020.
- [25] Wei-Qi Wei, Lisa A. Bastarache, Robert J. Carroll, Joy E. Marlo, Travis J. Osterman, Eric R. Gamazon, Nancy J. Cox, Dan M. Roden, and Joshua C. Denny. Evaluating phecodes, clinical classification software, and ICD-9-CM codes for phenome-wide association studies in the electronic health record. PLOS ONE, 12(7):e0175508, 2017.
- [26] Itsuki Osawa, Tadahiro Goto, Yuji Yamamoto, and Yusuke Tsugawa. Machine-learning-based prediction models for high-need high-cost patients using nationwide clinical and claims data. npj Digital Medicine, 3(1):148, 2020.
- [27] Adriana Hernandez-Viver and Emily M. Mitchell. Concentration of healthcare expenditures and selected characteristics of persons with high expenses, U.S. civilian noninstitutionalized population, 2018–2022. Statistical Brief 560, Agency for Healthcare Research and Quality, Rockville, MD, 2025. AHRQ Publication No. 25-0017-3-EF.
- [28] Marc Lipsitch, Eric Tchetgen Tchetgen, and Ted Cohen. Negative controls: a tool for detecting confounding and bias in observational studies. Epidemiology, 21(3):383–388, 2010.
- [29] Martijn J. Schuemie, Patrick B. Ryan, William DuMouchel, Marc A. Suchard, and David Madigan. Interpreting observational studies: why empirical calibration is needed to correct p-values. Statistics in Medicine, 33(2):209–218, 2014.
- [30] Mitchell M. Conover, Patrick B. Ryan, Yong Chen, Marc A. Suchard, George Hripcsak, and Martijn J. Schuemie. Objective study validity diagnostics: a framework requiring pre-specified, empirical verification to increase trust in the reliability of real-world evidence. Journal of the American Medical Informatics Association, 32(3):518–525, 2025.
- [31] Huilin Tang, Yiwen Lu, Bingyu Zhang, Dazheng Zhang, Ting Zhou, Jiajie Chen, Ying Lu, Tianchu Lyu, Kai Zheng, and Yong Chen. Association of GLP-1 receptor agonist use with psychiatric outcomes in adults with type 2 diabetes: a target trial emulation. Diabetes Research and Clinical Practice, page 113038, 2025.
- [32] Qwen Team. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.