Foundation Models to Unlock Real-World Evidence from Nationwide Medical Claims
Recognition: 3 Lean theorem links
Pith reviewed 2026-05-08 18:33 UTC · model grok-4.3
The pith
A generative transformer trained on 43.8 billion medical claims events outperforms specialized models on over 1,000 disease prediction tasks while improving real-world evidence analyses.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ReClaim establishes administrative claims as a scalable substrate for healthcare foundation models by training a generative transformer on 43.8 billion events to capture diagnoses, procedures, medications, and expenditure. The model achieves 75.6 percent mean AUC across over 1,000 tasks, substantially outperforming disease-specific LightGBM at 66.3 percent and Delphi at 69.4 percent, with the largest gains for rare diseases. Performance improves with model scale, and post-training adds 13.8 percentage points over pre-training alone. The model also raises explained variance in healthcare expenditure forecasting from 0.28 to 0.37 and reduces systematic bias in a target trial emulation by 72 percent.
What carries the argument
ReClaim, the generative transformer trained to model sequences of medical events and predict future outcomes from longitudinal claims trajectories.
If this is right
- Performance improves monotonically as model size increases from 140 million to 1.7 billion parameters.
- Gains are largest for rare diseases compared to common ones.
- The model improves forecasting of healthcare expenditure.
- It reduces systematic bias in real-world evidence analyses that emulate target trials.
- Representations generalize across time periods and independent datasets.
Where Pith is reading between the lines
- Such models could support real-time disease surveillance by using existing claims streams without custom retraining for each condition.
- Combining ReClaim-style pretraining with other data modalities might further boost accuracy on complex outcomes.
- The approach suggests a path to reduce reliance on labeled datasets in healthcare machine learning by leveraging unlabeled claims at scale.
Load-bearing premise
The claim depends on the training and test splits blocking any future information from reaching past predictions, plus the assumption that performance on MarketScan data will transfer to other claims databases without major shifts in data distribution.
What would settle it
Demonstrating no performance advantage for ReClaim over the baselines when evaluated on a fresh claims database collected after 2022 or from a different national source would falsify the generalization result.
original abstract
Evidence derived from large-scale real-world data (RWD) is increasingly informing regulatory evaluation and healthcare decision-making. Administrative claims provide population-scale, longitudinal records of healthcare utilization, expenditure, and detailed coding of diagnoses, procedures, and medications, yet their potential as a substrate for healthcare foundation models remains largely unexplored. Here we present ReClaim, a generative transformer trained from scratch on 43.8 billion medical events from more than 200 million enrollees in the MarketScan claims data spanning 2008-2022. ReClaim models longitudinal trajectories across diagnoses, procedures, medications, and expenditure, and was scaled to 140 million, 700 million, and 1.7 billion parameters. Across over 1,000 disease-onset prediction tasks, ReClaim achieved a mean AUC of 75.6%, substantially outperforming disease-specific LightGBM (66.3%) and the transformer-based Delphi model (69.4%), with the largest gains for rare diseases. These advantages held across retrospective and prospective evaluations and in external validation on two independent datasets. Performance improved monotonically with scale, and post-training added 13.8 percentage points over pre-training alone. Beyond disease prediction, ReClaim captured financial outcomes and improved real-world evidence (RWE) analyses: for healthcare expenditure forecasting it increased explained variance from 0.28 to 0.37 relative to LightGBM, and in a target trial emulation it reduced systematic bias by 72% on average relative to Delphi. Together, these results establish administrative claims as a scalable substrate for healthcare foundation models and show that learned representations generalize across time periods and data sources, supporting disease surveillance, expenditure forecasting, and RWE generation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents ReClaim, a generative transformer pretrained from scratch on 43.8 billion medical events from >200 million MarketScan enrollees (2008-2022). It reports that models scaled to 140M/700M/1.7B parameters achieve a mean AUC of 75.6% across >1,000 disease-onset prediction tasks, outperforming disease-specific LightGBM (66.3%) and the Delphi transformer (69.4%), with largest gains on rare diseases. Advantages are claimed to persist in retrospective/prospective settings and external validation on two independent datasets; performance scales monotonically, post-training adds 13.8 points over pre-training alone, and the model improves expenditure forecasting (R² 0.28→0.37) and reduces bias by 72% in a target-trial emulation.
Significance. If the temporal-split and generalization claims hold, the work would establish administrative claims as a viable substrate for healthcare foundation models, with particular value for rare-disease prediction and downstream RWE tasks. The scale (1.7B parameters, 43.8B events), breadth (>1,000 tasks), external validation, and explicit scaling results are concrete strengths that could shift practice in disease surveillance and real-world evidence generation.
major comments (3)
- [Methods] Prospective evaluation protocol: The headline claim that advantages 'held across retrospective and prospective evaluations' (Abstract) is load-bearing for the 75.6% AUC and the comparison to LightGBM/Delphi, yet the manuscript supplies no explicit description of the calendar cutoff, patient-level vs. event-level hold-out, handling of enrollment gaps, or confirmation that pretraining on the full 2008-2022 corpus does not embed future claims into representations used for earlier predictions. This omission directly affects the validity of the prospective results, especially the reported gains on rare diseases.
- [Results] Scaling and post-training: The statements that performance 'improved monotonically with scale' and that post-training 'added 13.8 percentage points over pre-training alone' lack ablations or controls for hyperparameter search, total compute, or data volume. Without these, it is impossible to attribute the gains to model scale or the post-training stage rather than confounding factors, weakening the scaling-law claim that is central to the foundation-model narrative.
- [External validation] The claim of generalization 'in external validation on two independent datasets' (Abstract) is presented without characterizing the datasets, quantifying distribution shift from MarketScan, or detailing the exact evaluation protocol. Because the MarketScan population is the sole training source, this information is required to assess whether the reported outperformance transfers beyond the training distribution.
minor comments (2)
- [Abstract] The mean AUC of 75.6% should specify whether it is a macro-average across the 1,000+ tasks and whether tasks are weighted by prevalence or number of cases.
- [Abstract] Notation: RWD and RWE are introduced without first-use definitions in the abstract, although they are expanded later; a single clarifying sentence would improve readability.
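The macro- versus weighted-averaging distinction raised in the first minor comment is easy to make concrete. A minimal sketch with invented task names, labels, and scores (not the paper's evaluation code):

```python
# Illustrative contrast between macro-averaged and case-weighted mean AUC.
# All task names and numbers below are hypothetical.

def auc(labels, scores):
    """AUC via the Mann-Whitney U statistic (ties get half credit)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

tasks = {  # task -> (labels, scores)
    "common_disease": ([1, 1, 0, 0, 0, 0], [0.9, 0.4, 0.8, 0.3, 0.2, 0.1]),
    "rare_disease":   ([1, 0, 0, 0, 0, 0], [0.7, 0.6, 0.2, 0.1, 0.1, 0.0]),
}

aucs = {t: auc(y, s) for t, (y, s) in tasks.items()}
n_cases = {t: sum(y) for t, (y, _) in tasks.items()}

macro = sum(aucs.values()) / len(aucs)               # every task counts equally
weighted = (sum(aucs[t] * n_cases[t] for t in tasks) # tasks weighted by cases
            / sum(n_cases.values()))
print(macro, weighted)
```

With rare diseases dominating the task count, the two averages can diverge substantially, which is why the comment asks the authors to state which one 75.6% is.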
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which identifies key areas where additional clarity will strengthen the manuscript. We address each major comment below and commit to revisions that directly respond to the concerns raised.
point-by-point responses
-
Referee: [Methods] Prospective evaluation protocol: The headline claim that advantages 'held across retrospective and prospective evaluations' (Abstract) is load-bearing for the 75.6% AUC and the comparison to LightGBM/Delphi, yet the manuscript supplies no explicit description of the calendar cutoff, patient-level vs. event-level hold-out, handling of enrollment gaps, or confirmation that pretraining on the full 2008-2022 corpus does not embed future claims into representations used for earlier predictions. This omission directly affects the validity of the prospective results, especially the reported gains on rare diseases.
Authors: We agree that the prospective evaluation protocol requires a more explicit description to support the headline claims. In the revised manuscript we will add a dedicated subsection in Methods that specifies: the exact calendar cutoff (training on events through December 2018 and evaluating prospectively on 2019–2022), patient-level rather than event-level hold-out, the requirement of at least 12 months of continuous enrollment prior to each prediction date, and explicit confirmation that all pretraining and representation learning for prospective tasks used only data available up to the cutoff. These additions will eliminate any ambiguity regarding information leakage and directly address the validity of the rare-disease gains. revision: yes
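The leakage-blocking protocol the authors commit to (calendar cutoff, patient-level hold-out, enrollment requirement) can be sketched as a filter over claims records. Field names, dates, and record shapes here are illustrative assumptions, not the paper's code:

```python
from datetime import date

CUTOFF = date(2018, 12, 31)  # illustrative calendar cutoff from the rebuttal

def split_patients(patients, holdout_ids):
    """Patient-level split: every event of a patient lands on exactly one side,
    so no individual contributes to both pretraining and evaluation."""
    train, test = {}, {}
    for pid, events in patients.items():
        (test if pid in holdout_ids else train)[pid] = events
    return train, test

def visible_history(events, prediction_date):
    """Only events strictly before both the prediction date and the pretraining
    cutoff may feed the representation, blocking future-information leakage."""
    limit = min(prediction_date, CUTOFF)
    return [e for e in events if e["date"] < limit]

# Toy records: pid -> list of {"date": ..., "code": ...} (codes are made up).
patients = {
    "p1": [{"date": date(2017, 3, 1), "code": "E11"},
           {"date": date(2020, 5, 1), "code": "I10"}],
    "p2": [{"date": date(2019, 1, 15), "code": "J45"}],
}
train, test = split_patients(patients, holdout_ids={"p2"})
hist = visible_history(patients["p1"], prediction_date=date(2021, 1, 1))
print(len(train), len(test), [e["code"] for e in hist])
```

Note how p1's 2020 event is excluded from the visible history even though it precedes the 2021 prediction date, because it falls after the pretraining cutoff; this is the confirmation the referee asked for.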
-
Referee: [Results] Scaling and post-training: The statements that performance 'improved monotonically with scale' and that post-training 'added 13.8 percentage points over pre-training alone' lack ablations or controls for hyperparameter search, total compute, or data volume. Without these, it is impossible to attribute the gains to model scale or the post-training stage rather than confounding factors, weakening the scaling-law claim that is central to the foundation-model narrative.
Authors: We acknowledge that stronger controls would improve attribution of the observed gains. All three model sizes were trained on the identical full dataset with the same core hyperparameters and optimizer settings; post-training was performed after pretraining on the same data. Due to the prohibitive cost of retraining at 1.7 B scale, we did not run exhaustive per-scale hyperparameter searches. In revision we will add an appendix table listing exact hyperparameters, approximate FLOPs and GPU-hours per scale, and data volume. We will also include a brief discussion noting that, while full ablations are computationally infeasible, the consistent monotonic trend across three orders of magnitude in parameter count still supports the scaling observation. This constitutes a partial revision given the practical constraints. revision: partial
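Short of the full ablations the authors call infeasible, a cheap sanity check on the scaling observation is to verify monotonicity and fit a log-linear trend across the three reported scales. The AUC values below are invented placeholders, not the paper's per-scale numbers:

```python
import math

# (parameter count, mean AUC) pairs; AUC values are hypothetical placeholders.
points = [(140e6, 0.72), (700e6, 0.74), (1.7e9, 0.756)]

# Check monotonicity, then fit auc ≈ a + b * log10(params) by least squares.
aucs = [a for _, a in points]
assert all(x < y for x, y in zip(aucs, aucs[1:])), "not monotone in scale"

xs = [math.log10(n) for n, _ in points]
n = len(points)
xbar, ybar = sum(xs) / n, sum(aucs) / n
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, aucs)) \
    / sum((x - xbar) ** 2 for x in xs)
a = ybar - b * xbar
print(f"slope: {b:.4f} AUC per decade of parameters")
```

Three points cannot establish a scaling law, which is the referee's point; the fit only summarizes the reported trend that the promised appendix would need to back with compute and hyperparameter controls.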
-
Referee: [External validation] The claim of generalization 'in external validation on two independent datasets' (Abstract) is presented without characterizing the datasets, quantifying distribution shift from MarketScan, or detailing the exact evaluation protocol. Because the MarketScan population is the sole training source, this information is required to assess whether the reported outperformance transfers beyond the training distribution.
Authors: We agree that the external-validation description is currently insufficient. We will expand the External validation section to: (i) fully characterize both independent datasets (source, size, time period, and population demographics), (ii) quantify distribution shift via side-by-side statistics on age, sex, enrollment duration, and prevalence of high-frequency conditions, and (iii) specify the precise evaluation protocol (zero-shot inference versus any fine-tuning, task construction, and metric computation). These additions will allow readers to evaluate how well performance transfers beyond the MarketScan training distribution. revision: yes
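One standard way to quantify the distribution shift the authors promise to report is the standardized mean difference (SMD) between cohorts on each covariate; values above roughly 0.1 are conventionally flagged as imbalance. A self-contained sketch with made-up summary statistics:

```python
import math

def smd(mean_a, sd_a, mean_b, sd_b):
    """Standardized mean difference for a continuous covariate."""
    pooled = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return abs(mean_a - mean_b) / pooled

# Hypothetical MarketScan vs. external-dataset summaries (age in years,
# enrollment duration in months); all numbers are illustrative only.
covariates = {
    "age":        ((41.2, 14.8), (47.9, 16.1)),
    "enrollment": ((28.3, 20.5), (30.1, 21.2)),
}
results = {}
for name, ((m_a, s_a), (m_b, s_b)) in covariates.items():
    results[name] = smd(m_a, s_a, m_b, s_b)
    flag = "shift" if results[name] > 0.1 else "ok"
    print(f"{name}: SMD={results[name]:.3f} ({flag})")
```

A side-by-side SMD table of this kind would let readers judge how far the external datasets sit from the MarketScan training distribution, which is exactly what item (ii) of the rebuttal commits to.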
Circularity Check
No significant circularity in derivation or evaluation chain
full rationale
The paper's central claims consist of empirical AUC, explained-variance, and bias-reduction metrics obtained by training a generative transformer on 43.8B events and evaluating on held-out disease-onset tasks, expenditure forecasting, and target-trial emulation. No equations, self-definitional loops, or fitted-parameter renamings are present that would make reported performance numbers equivalent to the training inputs by construction. Scaling behavior, post-training gains, and comparisons to LightGBM/Delphi are presented as experimental outcomes rather than algebraic identities. External validation on independent datasets and prospective splits further separate the reported results from any self-referential reduction. Minor self-citations, if present, are not load-bearing for the performance numbers.
Axiom & Free-Parameter Ledger
free parameters (2)
- model scale
- training hyperparameters
axioms (2)
- Domain assumption: Administrative claims codes accurately capture diagnoses, procedures, and medications without systematic under- or over-coding.
- Domain assumption: Temporal splits in retrospective and prospective evaluations prevent future information leakage.
Lean theorems connected to this paper
-
Foundation (overall framework) · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem is ambiguous.
"ReClaim, a generative transformer trained from scratch on 43.8 billion medical events ... scaled to 140 million, 700 million, and 1.7 billion parameters."
-
Cost.FunctionalEquation · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem is ambiguous.
"we add L_z = λ·E[(log Z_t)²], λ = 10⁻⁴, and the total pre-training loss is L = L_CE + L_z."
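The quoted regularizer is the standard "z-loss" penalty on the softmax normalizer Z_t. A minimal numpy sketch of the quoted formula, not the paper's implementation:

```python
import numpy as np

def z_loss(logits, lam=1e-4):
    """L_z = lam * E[(log Z_t)^2], where Z_t = sum(exp(logits)) is the
    softmax normalizer at each sequence position."""
    log_z = np.log(np.exp(logits).sum(axis=-1))  # log Z_t per position
    return lam * np.mean(log_z ** 2)

def cross_entropy(logits, targets):
    """Standard token-level cross entropy L_CE."""
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(targets)), targets])

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))   # 4 positions, vocabulary of 10
targets = np.array([1, 3, 5, 7])
total = cross_entropy(logits, targets) + z_loss(logits)  # L = L_CE + L_z
print(total)
```

The penalty pulls log Z_t toward zero, keeping logits in a numerically stable range; a production implementation would compute log Z_t with a stable logsumexp rather than exponentiating directly.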
-
Foundation.AlphaCoordinateFixation · alpha_pin_under_high_calibration · unclear
Relation between the paper passage and the cited Recognition theorem is ambiguous.
"Performance improved monotonically with scale, and post-training added 13.8 percentage points over pre-training alone."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Rachel E. Sherman, Steven A. Anderson, Gerald J. Dal Pan, Gerry W. Gray, Thomas Gross, Nina L. Hunter, Lisa LaVange, Danica Marinac-Dabic, Peter W. Marks, Melissa A. Robb, Jeffrey Shuren, Robert Temple, Janet Woodcock, Lilly Q. Yue, and Robert M. Califf. Real-world evidence—what is it and what can it tell us? New England Journal of Medicine, 375(23):2293–2…, 2016.
- [2] Fang Liu and Demosthenes Panagiotakos. Real-world data: a brief review of the methods, applications, challenges and opportunities. BMC Medical Research Methodology, 22(1):287, 2022.
- [3] Miguel A. Hernán and James M. Robins. Using big data to emulate a target trial when a randomized trial is not available. American Journal of Epidemiology, 183(8):758–764, 2016.
- [4] John Concato and Jacqueline Corrigan-Curay. Real-world evidence—where are we now? New England Journal of Medicine, 386(18):1680–1682, 2022.
- [5] U.S. Food and Drug Administration. Real-world data: Assessing electronic health records and medical claims data to support regulatory decision-making for drug and biological products. Guidance for industry, issued July 2024.
- [6] Yikuan Li, Shishir Rao, Jose Roberto Ayala Solares, Abdelaali Hassaine, Dexter Canoy, Yuan Zhu, Kazem Rahimi, and Gholamreza Salimi-Khorshidi. BEHRT: Transformer for electronic health records. Scientific Reports, 10(1):7155, 2020.
- [7] Laila Rasmy, Yang Xiang, Ziqian Xie, Cui Tao, and Degui Zhi. Med-BERT: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. npj Digital Medicine, 4(1):86, 2021.
- [8] Ethan Steinberg, Ken Jung, Jason A. Fries, Conor K. Corbin, Stephen R. Pfohl, and Nigam H. Shah. Language models are an effective representation learning technique for electronic health record data. Journal of Biomedical Informatics, 113:103637, 2021.
- [9] Zhichao Yang, Avijit Mitra, Weisong Liu, Dan Berlowitz, and Hong Yu. TransformEHR: transformer-based encoder-decoder generative model to enhance prediction of disease outcomes using electronic health records. Nature Communications, 14:7857, 2023.
- [10]
- [11] Chao Pang, Xinzhuo Jiang, Nishanth Parameshwar Pavinkurve, Krishna S. Kalluri, Elise L. Minto, Jason Patterson, Linying Zhang, George Hripcsak, Gamze Gürsoy, Noémie Elhadad, and Karthik Natarajan. CEHR-GPT: Generating electronic health records with chronological patient timelines. arXiv preprint arXiv:2402.04400, 2024.
- [12]
- [13] Michael Wornow, Rahul Thapa, Ethan Steinberg, Jason Fries, and Nigam Shah. EHRSHOT: An EHR benchmark for few-shot evaluation of foundation models. In Advances in Neural Information Processing Systems, volume 36, pages 67125–67137, 2023. Datasets and Benchmarks Track.
- [14] Zeljko Kraljevic, Dan Bean, Anthony Shek, Rebecca Bendayan, Harry Hemingway, Joshua Au Yeung, Alexander Deng, Alfred Balston, Jack Ross, Esther Idowu, James T. Teo, and Richard J. B. Dobson. Foresight—a generative pretrained transformer for modelling of patient timelines using electronic health records: a retrospective modelling study. The Lancet Digital Health, 2024.
- [15] Pawel Renc, Yugang Jia, Anthony E. Samir, Jaroslaw Was, Quanzheng Li, David W. Bates, and Arkadiusz Sitek. Zero shot health trajectory prediction using transformer. npj Digital Medicine, 7(1):256, 2024.
- [16] Artem Shmatko, Alexander Wolfgang Jung, Kumar Gaurav, Søren Brunak, Laust Hvas Mortensen, Ewan Birney, Tom Fitzgerald, and Moritz Gerstung. Learning the natural history of human disease with generative transformers. Nature, 647(8088):248–256, 2025.
- [17] Shane Waxler, Paul Blazek, Davis White, Daniel Sneider, Kevin Chung, Mani Nagarathnam, Patrick Williams, Hank Voeller, Karen Wong, Matthew Swanhorst, et al. Generative medical event models improve with scale. arXiv preprint arXiv:2508.12104, 2025.
- [18] Sheng Zhang, Qin Liu, Naoto Usuyama, Cliff Wong, Tristan Naumann, and Hoifung Poon. Exploring scaling laws for EHR foundation models. arXiv preprint arXiv:2505.22964, 2025.
- [19] Michael Wornow, Yizhe Xu, Rahul Thapa, Birju Patel, Ethan Steinberg, Scott Fleming, Michael A. Pfeffer, Jason Fries, and Nigam H. Shah. The shaky foundations of large language models and foundation models for electronic health records. npj Digital Medicine, 6:135, 2023.
- [20] Xianlong Zeng, Simon L. Linwood, and Chang Liu. Pretrained transformer framework on pediatric claims data for population specific tasks. Scientific Reports, 12:3651, 2022.
- [21] Ricky Sahu, Eric Marriott, Ethan Siegel, David Wagner, Flore Uzan, Troy Yang, and Asim Javed. Introducing the large medical model: State of the art healthcare cost and risk prediction with transformers trained on patient event sequences. arXiv preprint arXiv:2409.13000, 2024.
- [22] Chao Pang, Vincent Jeanselme, Young Sang Choi, Xinzhuo Jiang, Zilin Jing, Aparajita Kashyap, Yuta Kobayashi, Yanwei Li, Florent Pollet, Karthik Natarajan, and Shalmali Joshi. FoMoH: A clinically meaningful foundation model evaluation for structured electronic health records. arXiv preprint arXiv:2505.16941, 2025.
- [23] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems 30, 2017.
- [24] Stéphanie Nguengang Wakap, Deborah M. Lambert, Annie Olry, Charlotte Rodwell, Charlotte Gueydan, Valérie Lanneau, Daniel Murphy, Yann Le Cam, and Ana Rath. Estimating cumulative point prevalence of rare diseases: analysis of the Orphanet database. European Journal of Human Genetics, 28:165–173, 2020.
- [25] Wei-Qi Wei, Lisa A. Bastarache, Robert J. Carroll, Joy E. Marlo, Travis J. Osterman, Eric R. Gamazon, Nancy J. Cox, Dan M. Roden, and Joshua C. Denny. Evaluating phecodes, clinical classification software, and ICD-9-CM codes for phenome-wide association studies in the electronic health record. PLOS ONE, 12(7):e0175508, 2017.
- [26] Itsuki Osawa, Tadahiro Goto, Yuji Yamamoto, and Yusuke Tsugawa. Machine-learning-based prediction models for high-need high-cost patients using nationwide clinical and claims data. npj Digital Medicine, 3(1):148, 2020.
- [27] Adriana Hernandez-Viver and Emily M. Mitchell. Concentration of healthcare expenditures and selected characteristics of persons with high expenses, U.S. civilian noninstitutionalized population, 2018–2022. Statistical Brief 560, Agency for Healthcare Research and Quality, Rockville, MD, 2025. AHRQ Publication No. 25-0017-3-EF.
- [28] Marc Lipsitch, Eric Tchetgen Tchetgen, and Ted Cohen. Negative controls: a tool for detecting confounding and bias in observational studies. Epidemiology, 21(3):383–388, 2010.
- [29] Martijn J. Schuemie, Patrick B. Ryan, William DuMouchel, Marc A. Suchard, and David Madigan. Interpreting observational studies: why empirical calibration is needed to correct p-values. Statistics in Medicine, 33(2):209–218, 2014.
- [30] Mitchell M. Conover, Patrick B. Ryan, Yong Chen, Marc A. Suchard, George Hripcsak, and Martijn J. Schuemie. Objective study validity diagnostics: a framework requiring pre-specified, empirical verification to increase trust in the reliability of real-world evidence. Journal of the American Medical Informatics Association, 32(3):518–525, 2025.
- [31] Huilin Tang, Yiwen Lu, Bingyu Zhang, Dazheng Zhang, Ting Zhou, Jiajie Chen, Ying Lu, Tianchu Lyu, Kai Zheng, and Yong Chen. Association of GLP-1 receptor agonist use with psychiatric outcomes in adults with type 2 diabetes: a target trial emulation. Diabetes Research and Clinical Practice, page 113038, 2025.
- [32] Qwen Team. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.