OncoTraj: a public benchmark for longitudinal resistance prediction in EGFR-mutant non-small-cell lung cancer on osimertinib
Pith reviewed 2026-06-27 13:44 UTC · model grok-4.3
The pith
Single-timepoint tissue NGS data fails to predict osimertinib resistance above chance in EGFR-mutant NSCLC.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
With v1's single-timepoint snapshot features, no task clears chance on clean within-source evaluation: the uniformity of this ceiling across every model class localizes the limit to the input modality (single-snapshot tissue NGS rather than serial ctDNA), not the algorithm. The benchmark does recover a reproducible literature-consistent association: TP53 co-mutation raises the 12-month progression rate from 29% to 59% cohort-wide. OncoTraj establishes a reproducible, leakage-audited baseline and converts the modality limit into concrete design requirements for a serial-ctDNA-enriched v2.
What carries the argument
OncoTraj benchmark of harmonized multi-source patient records with three locked tasks and audited no-leakage train/validation/test splits.
If this is right
- Single-timepoint tissue NGS inputs will keep all models at chance on the three resistance tasks.
- Serial ctDNA enrichment is required to move beyond the current performance ceiling in v2.
- The TP53 co-mutation association serves as a positive control confirming the harmonized data preserves known biology.
- The released splits and evaluation harness create a fixed standard for comparing new algorithms.
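The TP53 positive control in the list above can be restated as standard effect sizes. The following is a minimal sketch that uses only the two cohort-wide rates quoted in the abstract (29% and 59%); the function name `effect_sizes` is hypothetical, and because per-group patient counts are not given in the text, no confidence intervals are attempted.

```python
def effect_sizes(p_without: float, p_with: float) -> dict:
    """Effect sizes for a binary outcome, given the event rate in two groups."""
    risk_difference = p_with - p_without
    risk_ratio = p_with / p_without
    odds_ratio = (p_with / (1 - p_with)) / (p_without / (1 - p_without))
    return {
        "risk_difference": risk_difference,
        "risk_ratio": risk_ratio,
        "odds_ratio": odds_ratio,
    }


# 12-month progression rates quoted in the abstract:
# 29% without TP53 co-mutation, 59% with it.
tp53 = effect_sizes(p_without=0.29, p_with=0.59)
for name, value in tp53.items():
    print(f"{name}: {value:.2f}")
```

On these two rates the absolute difference is 30 percentage points and the risk ratio is roughly 2, which is the size of signal the harmonized data would need to preserve for the positive control to be meaningful.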
Where Pith is reading between the lines
- Clinical tools for guiding osimertinib treatment may need repeated liquid-biopsy sampling rather than one-time tissue sequencing.
- The same benchmarking approach could be applied to other targeted therapies that face predictable clonal evolution.
- Adding imaging or routine clinical variables to the feature set might surface additional signals even before serial ctDNA arrives.
Load-bearing premise
The three source datasets can be accurately harmonized into patient-level records with audited no-leakage train/validation/test splits and correctly labeled resistance mechanisms.
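One part of this premise, the no-leakage property, is mechanically checkable. The sketch below shows what a minimal patient-level overlap check could look like; it is not the authors' audit procedure, and the function name, split names, and patient IDs are all hypothetical. It only detects identical IDs appearing in more than one split, so it would not catch the same patient recorded under different identifiers in two sources.

```python
from itertools import combinations


def audit_patient_leakage(splits: dict) -> dict:
    """Return patient IDs shared by each pair of splits.

    An empty result means no ID appears in more than one split.
    """
    leaks = {}
    for (name_a, ids_a), (name_b, ids_b) in combinations(splits.items(), 2):
        shared = set(ids_a) & set(ids_b)
        if shared:
            leaks[(name_a, name_b)] = shared
    return leaks


# Hypothetical patient IDs, for illustration only.
clean_splits = {
    "train": ["P001", "P002", "P003"],
    "validation": ["P004"],
    "test": ["P005"],
}
leaky_splits = {
    "train": ["P001", "P002"],
    "validation": ["P002"],  # P002 also appears in train
    "test": ["P003"],
}

print(audit_patient_leakage(clean_splits))
print(audit_patient_leakage(leaky_splits))
```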
What would settle it
A model that exceeds chance performance on any of the three within-source test sets when restricted to the v1 single-timepoint features would falsify the claim that the performance ceiling is set by the input modality.
Original abstract
Resistance to first-line osimertinib in EGFR-mutant non-small-cell lung cancer (NSCLC) is the canonical example of predictable clonal evolution under therapeutic pressure, yet no public benchmark exists for training or evaluating computational models on the corresponding longitudinal patient trajectories. We introduce OncoTraj, a public benchmark of 813 EGFR-mutant NSCLC patients receiving first-line osimertinib, harmonized from three real-world clinical-genomic sources: MSK-CHORD (672 patients), AACR Project GENIE BPC NSCLC (34 patients), and the FLAURA molecular-resistance supplement (107 patients). OncoTraj defines three locked tasks: (A) binary classification of progression by a fixed 12-month landmark, (B) regression of time-to-first-progression in days, and (C) six-class classification of the dominant resistance mechanism. We release the harmonized dataset, patient-level train/validation/test splits with an audited no-leakage guarantee, an open-source evaluation harness, and six reference baselines spanning a majority-class predictor, logistic regression, random forest, XGBoost, an LSTM, and a multi-task transformer. With v1's single-timepoint snapshot features, no task clears chance on clean within-source evaluation: the uniformity of this ceiling across every model class localizes the limit to the input modality (single-snapshot tissue NGS rather than serial ctDNA), not the algorithm. The benchmark does recover a reproducible literature-consistent association: TP53 co-mutation raises the 12-month progression rate from 29% to 59% cohort-wide. OncoTraj establishes a reproducible, leakage-audited baseline and converts the modality limit into concrete design requirements for a serial-ctDNA-enriched v2.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces OncoTraj, a public benchmark of 813 EGFR-mutant NSCLC patients on first-line osimertinib, harmonized from MSK-CHORD (672), AACR GENIE BPC (34), and FLAURA (107). It defines three locked tasks: (A) 12-month progression binary classification, (B) time-to-first-progression regression, and (C) six-class resistance mechanism classification. It releases the harmonized data, audited no-leakage patient-level splits, and an open evaluation harness, and reports six baselines (majority, LR, RF, XGBoost, LSTM, multi-task transformer). With v1 single-timepoint snapshot features, no task exceeds chance on within-source evaluation; this uniform ceiling is attributed to the input modality rather than the algorithms. The benchmark recovers a literature-consistent TP53 co-mutation association (12-month progression rate rising from 29% to 59%).
Significance. If the harmonization and splits hold, OncoTraj supplies a valuable, reproducible public resource that converts an empirical modality limit into concrete design requirements for serial-ctDNA v2. The explicit release of the dataset, splits, and harness is a clear strength that enables community follow-up.
major comments (2)
- [Abstract and data harmonization description] The central claim, that uniform failure across all model classes localizes the performance ceiling to single-snapshot tissue NGS rather than to the algorithm, depends on the correctness of the resistance-mechanism labels and on the no-leakage property of the splits. The manuscript asserts an 'audited no-leakage guarantee' and names the three sources, but the provided text supplies no concrete mapping rules, no conflict-resolution procedure for resistance labels, and no audit evidence.
- [Results] The claim that 'no task clears chance' is presented without error bars, exact chance baselines per task, or within-source versus cross-source breakdowns, all of which are required to substantiate the modality-limit interpretation.
minor comments (1)
- [Abstract] The abstract states the patient counts per source but does not tabulate the final per-task label distributions or missingness rates after harmonization; a supplementary table would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive review and for recognizing the potential value of OncoTraj as a public resource. We address each major comment below and will revise the manuscript accordingly.
- Referee: [Abstract and data harmonization description] The central claim, that uniform failure across all model classes localizes the performance ceiling to single-snapshot tissue NGS rather than to the algorithm, depends on the correctness of the resistance-mechanism labels and on the no-leakage property of the splits. The manuscript asserts an 'audited no-leakage guarantee' and names the three sources, but the provided text supplies no concrete mapping rules, no conflict-resolution procedure for resistance labels, and no audit evidence.
  Authors: We agree that the main text does not currently contain the requested concrete details. The full mapping rules, conflict-resolution procedures for resistance labels, and audit documentation are present in the supplementary materials and the public data-release repository. In revision we will expand the Data harmonization section to include explicit mapping rules, representative examples of label conflicts and their resolution, and a concise summary of the audit steps performed, thereby placing the supporting evidence directly in the manuscript.
  Revision: yes
- Referee: [Results] The claim that 'no task clears chance' is presented without error bars, exact chance baselines per task, or within-source versus cross-source breakdowns, all of which are required to substantiate the modality-limit interpretation.
  Authors: We acknowledge the omission. The revised manuscript will add 95% confidence intervals (via bootstrapping) to all reported metrics, explicit chance-level baselines computed per task (majority-class accuracy for the two classification tasks and mean-value prediction for regression), and within-source versus cross-source performance tables. These additions will be placed in the Results section and associated supplementary tables to strengthen the modality-limit interpretation.
  Revision: yes
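The additions promised in the second response, a majority-class chance baseline and bootstrapped 95% confidence intervals, can be sketched as follows. This is an illustration under stated assumptions, not the authors' evaluation harness: it uses a percentile bootstrap that resamples patients with replacement, the function names are hypothetical, and the labels are toy data rather than OncoTraj records.

```python
import random
from collections import Counter


def majority_class_accuracy(labels):
    """Chance baseline: accuracy of always predicting the most frequent label."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def bootstrap_ci(y_true, y_pred, metric, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for metric(y_true, y_pred).

    Patients are resampled with replacement; the interval is read off the
    sorted bootstrap statistics at the alpha/2 and 1 - alpha/2 positions.
    """
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy labels (not OncoTraj data): 60 non-progressors, 40 progressors.
y_true = [0] * 60 + [1] * 40
y_pred = [0] * 100  # a model that has collapsed to the majority class

chance = majority_class_accuracy(y_true)
lower, upper = bootstrap_ci(y_true, y_pred, accuracy)
print(f"chance={chance:.2f}, accuracy 95% CI=({lower:.2f}, {upper:.2f})")
```

Under this reading, a task "clears chance" only when the lower bound of the model's interval sits above the per-task chance baseline. For the regression task the analogous baseline would be the error of predicting the training-set mean for every patient.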
Circularity Check
Empirical benchmark release with no derivation chain or self-referential reductions
The paper presents a harmonized dataset, locked tasks, and baseline evaluations on released splits. All claims rest on empirical performance measurements rather than any derivation, equation, or parameter fit that reduces to its own inputs by construction. No self-citations, ansatzes, or uniqueness theorems are invoked as load-bearing steps. The uniformity of baseline failure is reported as an observation on the provided data, not a mathematical necessity derived from the inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Data from MSK-CHORD, AACR Project GENIE BPC NSCLC, and FLAURA can be harmonized into consistent patient-level records with accurate resistance mechanism labels.