A Nationwide Japanese Medical Claims Foundation Model: Balancing Model Scaling and Task-Specific Computational Efficiency
Pith reviewed 2026-05-08 12:25 UTC · model grok-4.3
The pith
Medical claims foundation models reach peak downstream performance at task-dependent sizes rather than always scaling larger.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Encoder-only Transformer models pretrained on nationwide Japanese medical claims data from 2.3 million patients show task-dependent saturation in downstream performance. Disease prediction benefits from the larger models (32 million to 101 million parameters), whereas medication prediction saturates at 11 million parameters, which reduces pretraining time by 178 hours. On every task, the best-performing model still exceeds a Light Gradient Boosting Machine baseline in area under the precision-recall curve.
What carries the argument
Five-scale pretraining of encoder-only Transformers (2.2M to 101M parameters) on longitudinal structured medical claims, followed by task-specific fine-tuning and evaluation on disease incidence and medication prediction.
Load-bearing premise
The observed differences in saturation thresholds are caused by the inherent characteristics of the two prediction tasks on structured claims data rather than by sampling choices, sparsity patterns, or hyperparameter decisions specific to this Japanese database.
What would settle it
Re-running the five-scale pretraining and evaluation experiment on an independent non-Japanese claims dataset and finding either no saturation or identical saturation thresholds across both tasks would falsify the claim of task-dependent optimal sizes.
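The falsification test above turns on how a "saturation threshold" is read off a scaling curve. One possible rule, sketched below, picks the smallest model whose downstream score lies within a tolerance of the best score for that task. The function name, the tolerance, and the scores are assumptions made for illustration; the scores are toy values, not results from the paper, and only the five scale labels (in millions of parameters) come from the source.

```python
def saturation_scale(scores_by_scale, tolerance=0.005):
    """Return the smallest scale whose score is within `tolerance` of the best.

    `scores_by_scale` maps model size (millions of parameters) to a
    downstream score such as AUPRC. The rule is a hypothetical choice,
    not the criterion used by the paper.
    """
    if not scores_by_scale:
        raise ValueError("no scores given")
    best = max(scores_by_scale.values())
    eligible = [
        scale
        for scale, score in scores_by_scale.items()
        if best - score <= tolerance
    ]
    return min(eligible)


# Toy curves (NOT from the paper): one keeps improving, one plateaus early.
toy_task_a = {2.2: 0.30, 4.7: 0.32, 11: 0.34, 32: 0.36, 101: 0.37}
toy_task_b = {2.2: 0.50, 4.7: 0.53, 11: 0.55, 32: 0.551, 101: 0.552}

threshold_a = saturation_scale(toy_task_a)
threshold_b = saturation_scale(toy_task_b)

# Under this rule, the replication would count against the claim if the
# two thresholds coincide or if neither curve plateaus below the largest scale.
thresholds_differ = threshold_a != threshold_b
```

The tolerance matters: a looser value moves every threshold toward smaller models, so any replication would need to fix it before looking at the results.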
Original abstract
Clinical risk prediction using longitudinal medical data supports individualized care. Self-supervised foundation models have emerged as a promising approach for leveraging large-scale unlabeled healthcare records. In natural language processing, scaling laws suggest that larger models achieve predictably lower pretraining losses, supporting the foundation model paradigm. However, for structured medical data, characterized by a limited vocabulary and sparse observations, whether increasing model size consistently improves downstream predictions is unclear, as most studies evaluate only a single model scale. In this study, we evaluated the relationship between model scale and downstream task performance for structured medical foundation models. Using a random sample (2.3 million patients, 32 hospitals) from a nationwide 519-hospital Japanese claims database, we pretrained encoder-only Transformers at five scales (2.2M-101M parameters) for disease incidence and medication prediction. Downstream performance saturated at task-dependent thresholds: disease prediction benefited from larger models (32M-101M), whereas medication prediction saturated at 11M, reducing pretraining time by 178 h. Across all tasks, the best-performing model consistently outperformed a Light Gradient Boosting Machine baseline in the area under the precision-recall curve. These findings indicate that, unlike the monotonically decreasing pretraining loss, the optimal model size varied depending on task characteristics. This task-dependent saturation provides practical guidance for balancing predictive performance and computational cost in structured medical foundation models.
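The abstract compares models by area under the precision-recall curve (AUPRC). As a minimal sketch, the function below computes average precision, a common step-wise estimator of AUPRC; the abstract does not say which estimator the authors used, so this choice is an assumption, as are the function name and the toy labels and scores.

```python
def average_precision(y_true, y_score):
    """Step-wise AUPRC estimate: sum of (recall increase) * precision.

    Tied scores are processed as one group so the result does not depend
    on the order of tied examples.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positives")
    pairs = sorted(zip(y_score, y_true), key=lambda p: p[0], reverse=True)

    ap = 0.0
    true_pos = 0
    prev_recall = 0.0
    i = 0
    while i < len(pairs):
        j = i
        # Consume every example that shares the current score.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            true_pos += pairs[j][1]
            j += 1
        recall = true_pos / n_pos
        precision = true_pos / j  # j examples are ranked at or above this score
        ap += (recall - prev_recall) * precision
        prev_recall = recall
        i = j
    return ap


# Toy example (not data from the paper): two positives among four patients.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
toy_ap = average_precision(toy_labels, toy_scores)
```

A model that scores every patient identically gets an average precision equal to the outcome prevalence, which is why AUPRC values for rare outcomes are best read against that prevalence rather than against 0.5.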
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates scaling of encoder-only Transformer foundation models on structured Japanese medical claims data (2.3M patients from a 519-hospital database). Five model scales (2.2M–101M parameters) are pretrained self-supervised and evaluated on disease incidence and medication prediction tasks. The central claim is that downstream AUPRC saturates at task-dependent thresholds (medication prediction at 11M parameters; disease prediction benefits up to 32M–101M), enabling a 178-hour pretraining reduction, while the best models outperform a LightGBM baseline across tasks. This indicates that the optimal scale for structured medical data depends on the task, rather than larger always being better as NLP scaling laws would suggest.
Significance. If the saturation points are robust, the result supplies concrete, actionable guidance for balancing predictive performance against compute in healthcare foundation models, where data sparsity and limited vocabulary differ from text. The use of real nationwide claims data, multi-scale empirical comparison, and direct LightGBM baseline evaluation are strengths that make the work practically relevant even if the precise thresholds are dataset-specific.
major comments (2)
- [Methods (pretraining details)] The pretraining setup (Methods section) provides no information on whether hyperparameters (learning rate, batch size, warmup, dropout) were held fixed or scaled with model size. Because the central saturation claim for medication prediction at 11M rests on the assumption that larger models were not under-trained, this omission is load-bearing; fixed schedules would artifactually produce the observed plateau even if the data distribution supports further gains.
- [Results (downstream evaluation)] Results reporting of downstream AUPRC values lacks data-split details (e.g., patient-level or temporal partitioning of the 2.3M cohort), statistical testing, or error bars. Without these, the task-dependent saturation thresholds and the claim that larger models improve disease prediction cannot be distinguished from sampling variability or selection bias in the random 32-hospital subsample.
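The second major comment asks whether the cohort was partitioned at the patient level. A minimal sketch of one way to do that follows: each patient ID is hashed to a stable number, so every record of a patient lands in the same partition and the assignment is reproducible. This is not the authors' procedure, which the page does not describe; the function name, salt, and 80/10/10 fractions are assumptions.

```python
import hashlib


def assign_partition(
    patient_id,
    fractions=(("train", 0.8), ("valid", 0.1), ("test", 0.1)),
    salt="split-v1",
):
    """Map a patient ID to a partition deterministically.

    Hashing the ID (rather than sampling rows) keeps all of a patient's
    claims in one partition, which avoids leakage across splits.
    The salt and fractions are hypothetical defaults.
    """
    digest = hashlib.sha256(f"{salt}:{patient_id}".encode("utf-8")).hexdigest()
    # First 8 hex digits give a 32-bit integer; scale it into [0, 1).
    u = int(digest[:8], 16) / 2**32
    cumulative = 0.0
    for name, fraction in fractions:
        cumulative += fraction
        if u < cumulative:
            return name
    return fractions[-1][0]  # guard against floating-point rounding


# Hypothetical patient IDs, used only to illustrate the call.
partitions = [assign_partition(f"patient-{i}") for i in range(5000)]
train_share = partitions.count("train") / len(partitions)
```

A temporal split would instead assign records by date; the two can be combined, but the hashing step alone only addresses patient-level leakage.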
minor comments (2)
- [Abstract] The abstract states a 178-hour pretraining reduction but does not specify hardware, wall-clock vs. total FLOPs, or whether the time includes only the 11M model or the full scaling curve.
- [Introduction / Methods] Notation for model scales (2.2M–101M) and task definitions (disease incidence vs. medication prediction) should be defined explicitly in a table or early section to aid reproducibility.
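On the request to define the model scales explicitly: a parameter count follows from a handful of architectural choices, so a table of those choices would pin down each scale. The sketch below counts the parameters of a generic encoder-only Transformer. It assumes a standard block (four attention projections with biases, a two-layer feed-forward network four times the hidden width, two layer norms) plus learned token and position embeddings, and it ignores any output heads. The paper's actual architecture, vocabulary size, and sequence length are not given on this page, so every argument value here is hypothetical.

```python
def encoder_param_count(n_layers, d_model, vocab_size, max_positions, ffn_mult=4):
    """Approximate parameter count of a generic encoder-only Transformer.

    Assumes a textbook block layout; real implementations may differ
    (for example tied embeddings, no biases, or extra segment embeddings).
    """
    d_ff = ffn_mult * d_model
    # Query, key, value, and output projections, each with a bias vector.
    attention = 4 * (d_model * d_model + d_model)
    # Two linear layers: d_model -> d_ff and d_ff -> d_model, with biases.
    feed_forward = d_model * d_ff + d_ff + d_ff * d_model + d_model
    # Two layer norms per block, each with a scale and a shift vector.
    layer_norms = 2 * 2 * d_model
    per_layer = attention + feed_forward + layer_norms
    embeddings = (vocab_size + max_positions) * d_model
    return n_layers * per_layer + embeddings


# Hypothetical configuration, chosen only to show the call.
small = encoder_param_count(n_layers=1, d_model=4, vocab_size=10, max_positions=0)
wider = encoder_param_count(n_layers=1, d_model=8, vocab_size=10, max_positions=0)
```

With the default feed-forward multiplier, each block contributes 12·d² + 13·d parameters, so the count grows quadratically in the hidden size and only linearly in depth and vocabulary.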
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments, which have helped us identify areas where the manuscript can be strengthened for clarity and rigor. We provide point-by-point responses to the major comments below, along with our plans for revision.
- Referee: [Methods (pretraining details)] The pretraining setup (Methods section) provides no information on whether hyperparameters (learning rate, batch size, warmup, dropout) were held fixed or scaled with model size. Because the central saturation claim for medication prediction at 11M rests on the assumption that larger models were not under-trained, this omission is load-bearing; fixed schedules would artifactually produce the observed plateau even if the data distribution supports further gains.
Authors: We appreciate the referee's emphasis on this key aspect of the experimental design. The hyperparameters were held fixed across model scales, with only the architectural dimensions (layers and hidden size) adjusted to achieve the reported parameter counts. This fixed-hyperparameter approach is intentional in scaling studies to isolate the effect of model capacity. Pretraining was monitored via validation loss to confirm convergence for all scales. We will revise the Methods section to include an explicit table of all hyperparameter values and a statement confirming the fixed schedule, thereby directly addressing the concern that the observed saturation for medication prediction could be an artifact of under-training.
revision: yes
- Referee: [Results (downstream evaluation)] Results reporting of downstream AUPRC values lacks data-split details (e.g., patient-level or temporal partitioning of the 2.3M cohort), statistical testing, or error bars. Without these, the task-dependent saturation thresholds and the claim that larger models improve disease prediction cannot be distinguished from sampling variability or selection bias in the random 32-hospital subsample.
Authors: We agree that these details are necessary to support the robustness of the saturation claims. We will revise the Methods and Results sections to specify the patient-level partitioning approach used for the 2.3M cohort and to clarify the random selection process for the 32-hospital subsample from the full database. We will also add uncertainty quantification (such as bootstrap confidence intervals) for the AUPRC values and include statistical comparisons between scales where feasible. These additions will allow readers to better assess whether the task-dependent patterns exceed sampling variability.
revision: yes
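The rebuttal mentions bootstrap confidence intervals as one form of uncertainty quantification. A minimal sketch of a percentile bootstrap is given below, assuming one label and one score per patient so that resampling rows resamples patients. The function names, the default of 1,000 resamples, and the toy data are assumptions; the metric is passed in as a callable so that an AUPRC estimator can be supplied, and a simple accuracy metric stands in for it here.

```python
import random


def bootstrap_ci(y_true, y_score, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `metric(y_true, y_score)`.

    Resamples patients with replacement. Resamples containing a single
    class are skipped, since ranking metrics are undefined on them.
    """
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        labels = [y_true[i] for i in idx]
        if sum(labels) in (0, n):
            continue
        scores = [y_score[i] for i in idx]
        stats.append(metric(labels, scores))
    if not stats:
        raise ValueError("every resample was degenerate")
    stats.sort()
    lower = stats[int((alpha / 2) * (len(stats) - 1))]
    upper = stats[int((1 - alpha / 2) * (len(stats) - 1))]
    return lower, upper


def accuracy_at_half(y_true, y_score):
    """Stand-in metric: fraction classified correctly at a 0.5 cutoff."""
    hits = sum((s >= 0.5) == bool(t) for t, s in zip(y_true, y_score))
    return hits / len(y_true)


# Toy data (not from the paper): every patient is classified correctly.
toy_y = [1, 0, 1, 0, 1, 0]
toy_s = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3]
toy_interval = bootstrap_ci(toy_y, toy_s, accuracy_at_half)
```

To compare two model scales, the same resampled indices would be applied to both models' scores and the interval taken over the difference in the metric, which accounts for the models being evaluated on the same patients.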
Circularity Check
No circularity: purely empirical scaling evaluation on held-out tasks
The paper reports direct experimental results from pretraining five encoder-only Transformer scales (2.2M–101M parameters) on a sampled Japanese claims database and measuring downstream AUPRC on disease incidence and medication prediction tasks. No mathematical derivations, equations, or predictions are presented that reduce to fitted inputs by construction. Central claims rest on observed task-dependent saturation thresholds and consistent outperformance versus a LightGBM baseline; these are independent empirical observations on held-out data rather than self-definitional, self-cited, or ansatz-smuggled quantities. The study contains no load-bearing self-citations, uniqueness theorems, or renamings of known results.
Axiom & Free-Parameter Ledger
free parameters (1)
- model parameter scales
axioms (2)
- domain assumption: Encoder-only Transformer is suitable for modeling longitudinal structured medical claims
- domain assumption: Random sample of 2.3M patients from 32 hospitals is representative for scaling evaluation
Supplementary Materials
A. Summary Statistics of the Full Analysis Cohort (N = 2,294,687)
Table S1 summarizes the full analysis cohort (N = 2,294,687). Fig. S1a–S1d show distributions of sequence leng...
[Figure panels: Chronic Kidney Disease, pretrained vs. from-scratch models at 2.2M, 4.7M, 11M, 32M, and 101M parameters, seeds 42, 123, and 456; a Pregabalin panel with axes Number of Patients, AUROC, and AUPRC]
Fig. S2. Comparison of test AUROC (left column) and test AUPRC (right column) between pretrained and from-scratch models across all five model sizes (2.2M = blue, 4.7M = orange, 11M = green, 32M = red, and 101M = purple)...