The Shrinking Lifespan of LLMs in Science
Pith reviewed 2026-05-10 17:10 UTC · model grok-4.3
The pith
Release timing now predicts how long language models remain relevant in science better than size or architecture.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Scientific adoption of language models follows an inverted-U trajectory: usage rises after release, peaks, and declines as newer models appear. This curve compresses over time, with each additional release year associated with a 27 percent reduction in time-to-peak adoption. Release timing explains both time-to-peak and scientific lifespan more strongly than architecture, openness, or scale, although model size and access modality retain modest predictive power for total adoption volume.
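A quick consistency check on the headline number: read as a log-linear effect of release year (an assumed functional form; the abstract reports only the 27 percent figure), the compression implies geometric decay of time-to-peak.

```latex
% Assumed log-linear form; beta is the release-year coefficient.
\log T_{\mathrm{peak}} = \alpha + \beta\,(\mathrm{year} - 2018) + \varepsilon,
\qquad e^{\beta} = 1 - 0.27 = 0.73 \;\Rightarrow\; \beta = \ln 0.73 \approx -0.315 .
% A model released k years later then peaks in 0.73^k of the time:
% 0.73^3 \approx 0.39, i.e. a roughly 61 percent shorter time-to-peak after three years.
```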
What carries the argument
the scientific adoption curve: the inverted-U trajectory of active adoption (versus background reference) tracked across citing papers over time
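A minimal sketch of how such a trajectory could be assembled, assuming citations arrive pre-labeled as active versus background (the classification step itself is not shown, and the monthly binning and helper names are illustrative rather than the authors'):

```python
from collections import Counter
from datetime import date

def adoption_trajectory(citations, release: date):
    """Monthly counts of active-adoption citations for one model.

    `citations` is an iterable of (citation_date, label) pairs, where label
    is "active" or "background" per the text-based classification.
    """
    counts = Counter()
    for when, label in citations:
        if label != "active":
            continue  # background references do not enter the curve
        month = (when.year - release.year) * 12 + (when.month - release.month)
        counts[month] += 1
    horizon = max(counts) + 1 if counts else 0
    return [counts.get(m, 0) for m in range(horizon)]

def time_to_peak(trajectory):
    """Months from release to the curve's maximum (None if empty)."""
    if not trajectory:
        return None
    return max(range(len(trajectory)), key=trajectory.__getitem__)
```

Lifespan, under the rebuttal's later definition, would then be the last month whose count clears a minimum threshold.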
If this is right
- Models released in later years reach peak scientific adoption more quickly than earlier models.
- Release year is the dominant predictor of both time-to-peak and total scientific lifespan.
- Model size and access modality still affect the total number of papers that adopt a model, even though release timing dominates the lifecycle's timing dynamics.
- Rapid capability gains coincide with faster replacement of older models in research use.
Where Pith is reading between the lines
- Researchers may have to refresh their methods and tool dependencies on shorter cycles than before.
- Faster turnover could fragment cumulative knowledge if later papers cite older models less often.
- The same compression pattern could appear in adoption of other AI systems such as vision models or reinforcement-learning agents.
- Policy or funding decisions that assume long model lifespans may need adjustment if the observed trend continues.
Load-bearing premise
Every citation can be classified as active adoption or background reference using only the text of the citing paper.
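The rebuttal later sketches this as a cue-based procedure (usage verbs versus purely referential mentions). A toy version, with placeholder patterns since the actual rule set is unpublished:

```python
import re

# Placeholder cue patterns in the spirit of the rebuttal's description;
# the paper's actual rule set is not published.
USAGE_CUES = re.compile(
    r"\b(we|our)\b.*\b(use[sd]?|fine-?tun\w+|train\w+|prompt\w+|evaluat\w+)\b",
    re.IGNORECASE,
)

def classify_citation(context: str) -> str:
    """Label a citing sentence as active adoption or background reference."""
    return "active" if USAGE_CUES.search(context) else "background"

assert classify_citation("We fine-tuned BERT on our corpus.") == "active"
assert classify_citation("Language models have been widely studied.") == "background"
```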
What would settle it
Reclassifying the same citations with direct access to the original authors' code or data and finding that adoption curves no longer shorten with later release years would falsify the compression result.
Original abstract
Scaling laws describe how language model capabilities grow with compute and data, but say nothing about how long a model matters once released. We provide the first large-scale empirical account of how scientists adopt and abandon language models over time. We track 62 LLMs across over 108k citing papers (2018-2025), each with at least three years of post-release data, and classify every citation as active adoption or background reference to construct per-model adoption trajectories that raw citation counts cannot resolve. We find three regularities. First, scientific adoption follows an inverted-U trajectory: usage rises after release, peaks, and declines as newer models appear, a pattern we term the 'scientific adoption curve'. Second, this curve is compressing: each additional release year is associated with a 27% reduction in time-to-peak adoption (p < 0.001), robust to minimum-age thresholds and controls for model size. Third, release timing dominates model-level attributes as a predictor of lifecycle dynamics. Release year explains both time-to-peak and scientific lifespan more strongly than architecture, openness, or scale, though model size and access modality retain modest predictive power for total adoption volume. Together, these findings complement scaling laws with adoption-side regularities and suggest that the forces driving rapid capability progress may be the same forces compressing scientific relevance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper provides the first large-scale empirical analysis of LLM adoption and abandonment in science by tracking 62 models across 108k+ citing papers (2018-2025). It classifies each citation as active adoption versus background reference to construct per-model trajectories, revealing an inverted-U 'scientific adoption curve.' Key findings are a 27% compression in time-to-peak per release year (p<0.001, robust to age thresholds and size controls) and that release timing dominates architecture, openness, and scale as a predictor of lifecycle metrics, though size and access retain some power for total volume.
Significance. If the core measurements hold, the work supplies adoption-side regularities that usefully complement scaling laws, documenting rapid compression of scientific relevance for LLMs. The scale (62 models, 108k citations, multi-year post-release windows) and focus on resolved trajectories rather than raw counts are strengths; the observational design yields falsifiable regularities about release-year effects that future studies can test directly.
Major comments (3)
- [Methods] Methods (citation classification procedure): every trajectory, the 27% compression result, and the regression dominance claims rest on labeling 108k citations as active adoption vs. background reference using only citing-paper text. No inter-rater reliability, accuracy against ground-truth (code/data usage), or validation set is reported; if error rates correlate with model age or release year, the inverted-U shape, time-to-peak metric, and release-year coefficients will be biased. The abstract's robustness checks do not address this.
- [Results] Results (regression specifications): the claim that 'release year explains both time-to-peak and scientific lifespan more strongly than architecture, openness, or scale' requires the exact model (variables, functional form, handling of multicollinearity between release year and size, and reported effect sizes or partial R²). Without these, it is impossible to assess whether the dominance is robust or an artifact of omitted controls.
- [Data] Data and reproducibility: the manuscript states it uses 'over 108k citing papers' but provides no details on data access, exact query used to retrieve citations, or how the 62-model sample was constructed. This blocks independent verification of the trajectories and the p<0.001 result.
Minor comments (2)
- [Methods] Clarify the exact definition of 'scientific lifespan' and 'time-to-peak' (e.g., are these measured in months from release, or normalized?) and report the distribution of these quantities across the 62 models.
- [Results] The abstract mentions controls for model size; add a table or appendix showing the full regression table with all coefficients and standard errors.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments correctly identify areas where greater methodological transparency, validation, and reproducibility details will improve the manuscript. We respond point-by-point below and will make the corresponding revisions.
Point-by-point responses
- Referee: [Methods] Methods (citation classification procedure): every trajectory, the 27% compression result, and the regression dominance claims rest on labeling 108k citations as active adoption vs. background reference using only citing-paper text. No inter-rater reliability, accuracy against ground-truth (code/data usage), or validation set is reported; if error rates correlate with model age or release year, the inverted-U shape, time-to-peak metric, and release-year coefficients will be biased. The abstract's robustness checks do not address this.
Authors: We acknowledge that the current manuscript does not report validation metrics for the citation classification. The procedure used contextual cues in the citing paper (e.g., verbs indicating usage such as 'trained', 'fine-tuned', or 'evaluated on' versus purely referential mentions) applied uniformly across all papers. In the revised version we will expand the Methods section with the complete rule set. We will also add a validation subsection reporting results from a stratified random sample of 500 citations (balanced by release year) that were independently annotated by two raters; we will report Cohen's kappa, accuracy against the automated labels, and a direct test of whether error rates vary systematically with model release year or age. Any detected bias will be addressed via sensitivity analyses or re-estimation of the key 27% compression coefficient. revision: yes
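One plausible implementation of that validation subsection, sketched on synthetic stand-ins (the 500-citation sample, two raters, and the release-year error test come from the response above; the library calls and effect sizes are illustrative):

```python
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import cohen_kappa_score, accuracy_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the stratified 500-citation validation sample;
# real inputs would be the two human annotations and the automated labels.
n = 500
release_year = rng.integers(2018, 2023, size=n)
rater_a = rng.integers(0, 2, size=n)            # 1 = active adoption
rater_b = np.where(rng.random(n) < 0.90, rater_a, 1 - rater_a)
auto = np.where(rng.random(n) < 0.85, rater_a, 1 - rater_a)

kappa = cohen_kappa_score(rater_a, rater_b)     # inter-rater reliability
acc = accuracy_score(rater_a, auto)             # automated vs. human labels

# Test whether classification error drifts with release year: logistic
# regression of the error indicator on centered release year. A significant
# slope would signal the age-correlated bias the referee worries about.
error = (auto != rater_a).astype(int)
logit = sm.Logit(error, sm.add_constant(release_year - 2018)).fit(disp=0)

print(f"kappa={kappa:.2f}  accuracy={acc:.2f}")
print(logit.params, logit.pvalues)
```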
- Referee: [Results] Results (regression specifications): the claim that 'release year explains both time-to-peak and scientific lifespan more strongly than architecture, openness, or scale' requires the exact model (variables, functional form, handling of multicollinearity between release year and size, and reported effect sizes or partial R²). Without these, it is impossible to assess whether the dominance is robust or an artifact of omitted controls.
Authors: We agree the regression details must be fully specified. The models are OLS regressions with time-to-peak (months from release to peak) and lifespan (months from release to last citation above a minimum threshold) as outcomes. Predictors include continuous release year, log(parameters), categorical architecture, binary openness, and access modality. Multicollinearity was checked with VIFs (all <5). In the revision we will present the complete regression tables (coefficients, SEs, p-values) together with partial R² values for each predictor to document the relative explanatory power of release year. We will also add an appendix with alternative specifications (e.g., excluding size, adding interactions, and using log-linear forms) and confirm that the dominance result is robust. revision: yes
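A sketch of that specification on synthetic data, using statsmodels formula syntax (column names are hypothetical; the log outcome reflects the 27-percent-per-year reading rather than a form stated in the abstract):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
n = 62  # one row per model

# Synthetic stand-in for the model-level table described in the response.
df = pd.DataFrame({
    "release_year": rng.integers(2018, 2023, size=n),
    "params": rng.lognormal(mean=2.0, sigma=1.0, size=n) * 1e9,
    "architecture": rng.choice(["decoder", "encoder", "enc-dec"], size=n),
    "open_weights": rng.integers(0, 2, size=n),
    "access": rng.choice(["api", "weights"], size=n),
})
df["time_to_peak"] = np.exp(
    3.0 - 0.315 * (df["release_year"] - 2018) + 0.1 * rng.normal(size=n)
)

fit = smf.ols(
    "np.log(time_to_peak) ~ release_year + np.log(params)"
    " + C(architecture) + open_weights + C(access)",
    data=df,
).fit()
print(fit.summary())  # exp(release_year coef) near 0.73 mirrors the 27% claim

# VIF check on the fitted design matrix (the response reports all VIFs < 5).
X = fit.model.exog
print({name: variance_inflation_factor(X, i)
       for i, name in enumerate(fit.model.exog_names)})
```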
- Referee: [Data] Data and reproducibility: the manuscript states it uses 'over 108k citing papers' but provides no details on data access, exact query used to retrieve citations, or how the 62-model sample was constructed. This blocks independent verification of the trajectories and the p<0.001 result.
Authors: We will add a dedicated Data subsection. Citations were retrieved via the Semantic Scholar API using model-name queries (canonical name plus common aliases) limited to papers published after each model's release date. The 62-model sample consists of all LLMs released 2018–2022 that had at least three full years of post-release observation by the end of 2025, drawn from public model registries and release announcements. The revised manuscript will include the precise query templates, inclusion/exclusion criteria, and aggregate counts. We will also release the processed model list, citation trajectories, and analysis scripts upon acceptance to allow direct replication of the reported p<0.001 result and trajectories. revision: yes
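The retrieval maps onto the public Semantic Scholar Graph API. A sketch under the assumption that each model has already been resolved to its introducing paper's ID (the exact query templates are promised for the revision, so the filtering below is illustrative):

```python
import requests

API = "https://api.semanticscholar.org/graph/v1"

def fetch_citations(paper_id: str, release_year: int):
    """Page through citations of a model's introducing paper, keeping
    citing papers published in or after the model's release year.

    The `contexts` field returns the citing sentences, i.e. the text an
    adoption-vs-background classifier would consume. Endpoint and field
    names follow the public Graph API.
    """
    citations, offset, limit = [], 0, 100
    while True:
        resp = requests.get(
            f"{API}/paper/{paper_id}/citations",
            params={"fields": "title,year,contexts",
                    "offset": offset, "limit": limit},
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json().get("data", [])
        for item in batch:
            citing = item.get("citingPaper") or {}
            if citing.get("year") and citing["year"] >= release_year:
                citations.append({
                    "title": citing.get("title"),
                    "year": citing["year"],
                    "contexts": item.get("contexts", []),
                })
        if len(batch) < limit:
            return citations
        offset += limit
```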
Circularity Check
No significant circularity: purely observational empirical analysis
Full rationale
The paper performs an observational study: it gathers external citation data (108k papers), applies text-based classification to label active vs. background citations, constructs per-model trajectories, computes metrics such as time-to-peak and lifespan, and runs regressions to assess predictors. No derivation chain, functional form, or prediction reduces to its own inputs by construction. The reported regularities (inverted-U curves, 27% compression, release-year dominance) are direct outputs of the data processing and statistical analysis on independent citation records, not tautological re-expressions of fitted parameters or self-citations. Self-citation load-bearing and ansatz smuggling are absent; the work contains no uniqueness theorems or mathematical derivations.
Reference graph
Works this paper leans on
- [1] Lukas Blecher, Guillem Cucurull, Thomas Scialom, and Robert Stojnic. Nougat: Neural optical understanding for academic documents. arXiv preprint arXiv:2308.13418, 2023.
- [2] Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
- [3] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- [4] Epoch AI. Notable AI models. URL https://epochai.org/data/notable-ai-models. Accessed: 2024-09-05.
- [5] Lizhou Fan, Lingyao Li, Zihui Ma, Sanggyu Lee, Huizi Yu, and Libby Hemphill. A bibliometric review of large language models research from 2017 to 2023. ACM Trans. Intell. Syst. Technol., 15(5), October 2024. doi: 10.1145/3664930.
- [6] Faiza Farhat, Emmanuel Sirimal Silva, Hossein Hassani, Dag Øivind Madsen, Shahab Saquib Sohail, Yassine Himeur, M. Afshar Alam, and Aasim Zafar. The scholarly footprint of ChatGPT: a bibliometric analysis of the early outbreak phase. Frontiers in Artificial Intelligence, 6, 2023. doi: 10.3389/frai.2023.1270749.
- [7] J. Gao and D. Wang. Quantifying the use and potential benefits of artificial intelligence in scientific research. Nature Human Behaviour, 8:2281–2292, 2024. doi: 10.1038/s41562-024-02020-5.
- [8] Stefan Hajkowicz, Conrad Sanderson, Sarvnaz Karimi, Alexandra Bratanova, and Claire Naughtin. Artificial intelligence adoption in the physical sciences, natural sciences, life sciences, social sciences and the arts and humanities: A bibliometric analysis of research publications from 1960–2021. Technology in Society, 74:102260, 2023. doi: 10.1016/j.techsoc.2023.102260.
- [9] Qianyue Hao, Fengli Xu, Yong Li, and James Evans. Artificial intelligence tools expand scientists' impact but contract science's focus. Nature, pp. 1–7.
- [10] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
- [11] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
- [12] Rodney Kinney, Chloe Anastasiades, Russell Authur, Iz Beltagy, Jonathan Bragg, Alexandra Buraczynski, Isabel Cachola, Stefan Candra, Yoganand Chandrasekhar, Arman Cohan, et al. The Semantic Scholar open data platform. arXiv preprint arXiv:2301.10140, 2023.
- [13] Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Dan S. Weld. S2ORC: The Semantic Scholar open research corpus. arXiv preprint arXiv:1911.02782, 2019.
- [14] OpenAI. GPT-4.1. URL https://openai.com/index/gpt-4-1/. Accessed: 2025-04-14.
- [15] Cailean Osborne, Jennifer Ding, and Hannah Rose Kirk. The AI community building the future? A quantitative analysis of development activity on Hugging Face Hub. Journal of Computational Social Science, 7(2):2067–2105, 2024.
- [16] pyfixest. URL https://github.com/py-econometrics/pyfixest.
- [17] Petter Törnberg. ChatGPT-4 outperforms experts and crowd workers in annotating political Twitter messages with zero-shot learning. arXiv preprint arXiv:2304.06588, 2023.
- [18] Ana Trišović, Alex Fogelson, Janakan Sivaloganathan, and Neil Thompson. The rapid growth of AI foundation model usage in science. arXiv preprint arXiv:2511.21739.
- [19] Alex D. Wade. The Semantic Scholar academic graph (S2AG). In Companion Proceedings of the Web Conference 2022, pp. 739, 2022.
- [20] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022.
Extended Methods excerpts carried over with the reference graph:
- A.1 Data Sources: "We draw our initial list of language models from the Epoch AI Index (Epoch AI, 2024; Sevilla et al., 2022), filtering to retain only models released as reusable artifacts (excluding purely architectural contributions such as the original Transformer (Vaswani et al., 2017)). We manually supplement the dataset with model size (total trainable parameters) and availability (downloadable weights, open source software, or API-only)."
- Model disambiguation: "... using a three-sentence context window, which outperformed other approaches we tested (Table 1). Some papers introduce multiple models under a single Semantic Scholar ID (e.g., Llama 7B and 70B). For citations to such papers, we disambiguate the specific model variant using Llama-3.1-8B (Dubey et al., 2024), prompted with the model ..."
- False-positive corrections: "... weighting by their frequencies in the dataset to obtain P_n(FP_a(p) | ê + û = 1). Values for intermediate n are linearly interpolated. By design, all parameter choices yield upper bounds on false positive rates, ensuring our corrections are conservative. Adjustments to Counts. Let f(n) be the fraction of papers in a given subset with n citation sente..."