pith. machine review for the scientific record.

arxiv: 2604.07530 · v1 · submitted 2026-04-08 · 💻 cs.DL · cs.AI · cs.CY · cs.SI

Recognition: unknown

The Shrinking Lifespan of LLMs in Science

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:10 UTC · model grok-4.3

classification 💻 cs.DL · cs.AI · cs.CY · cs.SI

keywords language models · scientific adoption · citation analysis · adoption curves · release timing · LLM lifespan · inverted-U trajectories

The pith

Release timing now predicts how long language models remain relevant in science better than size or architecture.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The study follows 62 language models through more than 108,000 citing papers to separate active scientific use from background mentions. Each model shows an inverted-U pattern: citations rise after release, reach a peak, and then fall as newer models arrive. This pattern has tightened over successive release years, with time-to-peak dropping 27 percent for each later year even after accounting for model size. Release year explains both the speed of peak adoption and the total scientific lifespan more strongly than architecture, openness, or scale, while size and access type still shape the overall volume of use. The findings show that the same pace of model improvement that raises capabilities also shortens how long any single model shapes research practice.

Core claim

Scientific adoption of language models follows an inverted-U trajectory: use rises after release, peaks, and declines as newer models appear. This curve compresses over time, with each additional release year associated with a 27 percent reduction in time-to-peak adoption. Release timing explains both time-to-peak and scientific lifespan more strongly than architecture, openness, or scale, although model size and access modality retain modest predictive power for total adoption volume.
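As a rough numerical sketch (not the paper's code), a constant 27 percent per-year reduction in time-to-peak implies geometric decay. The baseline of 4.0 years for the 2019 cohort is taken from Figure 1; the fitted curve below is an illustration, so its predictions need not match the reported per-cohort values exactly.

```python
# Hypothetical illustration of the paper's 27%-per-year compression rate.
# Baseline (4.0 years for the 2019 cohort) comes from Figure 1; the
# geometric-decay form is a sketch, not the authors' actual regression.

BASELINE_YEAR = 2019
BASELINE_TIME_TO_PEAK = 4.0   # years, per Figure 1
COMPRESSION = 0.27            # 27% reduction per later release year

def predicted_time_to_peak(release_year: int) -> float:
    """Time-to-peak implied by a constant per-year compression rate."""
    return BASELINE_TIME_TO_PEAK * (1 - COMPRESSION) ** (release_year - BASELINE_YEAR)

for year in range(2019, 2023):
    print(year, round(predicted_time_to_peak(year), 2))
# 2019 4.0, 2020 2.92, 2021 2.13, 2022 1.56
```

Under this sketch the 2022 cohort peaks after about 1.6 years, in the same direction as, though somewhat below, the roughly 2.0-year value Figure 1 reports for that cohort.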

What carries the argument

the scientific adoption curve: the inverted-U trajectory of active adoption (versus background reference) tracked across citing papers over time

If this is right

  • Models released in later years reach peak scientific adoption more quickly than earlier models.
  • Release year is the dominant predictor of both time-to-peak and total scientific lifespan.
  • Model size and access modality still affect the total number of papers that adopt a model, though they matter less than release timing for lifecycle dynamics.
  • Rapid capability gains coincide with faster replacement of older models in research use.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Researchers may have to refresh their methods and tool dependencies on shorter cycles than before.
  • Faster turnover could fragment cumulative knowledge if later papers cite older models less often.
  • The same compression pattern could appear in adoption of other AI systems such as vision models or reinforcement-learning agents.
  • Policy or funding decisions that assume long model lifespans may need adjustment if the observed trend continues.

Load-bearing premise

Every citation can be classified as active adoption or background reference using only the text of the citing paper.

What would settle it

Reclassifying the same citations with direct access to the original authors' code or data and finding that adoption curves no longer shorten with later release years would falsify the compression result.

Figures

Figures reproduced from arXiv: 2604.07530 by Ana Trišović.

Figure 1. LLM adoption in science follows an inverted-U curve that compresses across cohorts. (A) Aggregate trajectory for models released in 2019–2022. (B–E) Per-cohort trajectories. The peak shifts earlier with each cohort (4.0 → 2.0 years) and curvature steepens, indicating lifecycle compression.

Figure 2. Peak adoption age and lifespan per model release year.

Figure 3. Compression rate by model characteristic.

Figure 4. Adoption shapes for each model were manually classified into inverted-U, rising, …

Figure 5. Sensitivity analysis. Compression of peak adoption age by release year, split by model characteristic: architecture (decoder −31%/year, encoder-decoder −30%/year), training type (base −28%/year), open weights (open −27%/year), model size (<10B −26%/year, ≥10B −33%/year), and institution type (academic −25%/year, both −26%/year, …); all rates marked highly significant (***) in the source.

Figure 6. Compression rate (time to peak) by model characteristic.

Figure 7. Compression rate (lifespan) by model characteristic.
read the original abstract

Scaling laws describe how language model capabilities grow with compute and data, but say nothing about how long a model matters once released. We provide the first large-scale empirical account of how scientists adopt and abandon language models over time. We track 62 LLMs across over 108k citing papers (2018-2025), each with at least three years of post-release data, and classify every citation as active adoption or background reference to construct per-model adoption trajectories that raw citation counts cannot resolve. We find three regularities. First, scientific adoption follows an inverted-U trajectory: usage rises after release, peaks, and declines as newer models appear, a pattern we term the scientific adoption curve. Second, this curve is compressing: each additional release year is associated with a 27% reduction in time-to-peak adoption (p < 0.001), robust to minimum-age thresholds and controls for model size. Third, release timing dominates model-level attributes as a predictor of lifecycle dynamics. Release year explains both time-to-peak and scientific lifespan more strongly than architecture, openness, or scale, though model size and access modality retain modest predictive power for total adoption volume. Together, these findings complement scaling laws with adoption-side regularities and suggest that the forces driving rapid capability progress may be the same forces compressing scientific relevance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper provides the first large-scale empirical analysis of LLM adoption and abandonment in science by tracking 62 models across 108k+ citing papers (2018-2025). It classifies each citation as active adoption versus background reference to construct per-model trajectories, revealing an inverted-U 'scientific adoption curve.' Key findings are a 27% compression in time-to-peak per release year (p<0.001, robust to age thresholds and size controls) and that release timing dominates architecture, openness, and scale as a predictor of lifecycle metrics, though size and access retain some power for total volume.

Significance. If the core measurements hold, the work supplies adoption-side regularities that usefully complement scaling laws, documenting rapid compression of scientific relevance for LLMs. The scale (62 models, 108k citations, multi-year post-release windows) and focus on resolved trajectories rather than raw counts are strengths; the observational design yields falsifiable regularities about release-year effects that future studies can test directly.

major comments (3)
  1. [Methods] Methods (citation classification procedure): every trajectory, the 27% compression result, and the regression dominance claims rest on labeling 108k citations as active adoption vs. background reference using only citing-paper text. No inter-rater reliability, accuracy against ground-truth (code/data usage), or validation set is reported; if error rates correlate with model age or release year, the inverted-U shape, time-to-peak metric, and release-year coefficients will be biased. The abstract's robustness checks do not address this.
  2. [Results] Results (regression specifications): the claim that 'release year explains both time-to-peak and scientific lifespan more strongly than architecture, openness, or scale' requires the exact model (variables, functional form, handling of multicollinearity between release year and size, and reported effect sizes or partial R²). Without these, it is impossible to assess whether the dominance is robust or an artifact of omitted controls.
  3. [Data] Data and reproducibility: the manuscript states it uses 'over 108k citing papers' but provides no details on data access, exact query used to retrieve citations, or how the 62-model sample was constructed. This blocks independent verification of the trajectories and the p<0.001 result.
minor comments (2)
  1. [Methods] Clarify the exact definition of 'scientific lifespan' and 'time-to-peak' (e.g., are these measured in months from release, or normalized?) and report the distribution of these quantities across the 62 models.
  2. [Results] The abstract mentions controls for model size; add a table or appendix showing the full regression table with all coefficients and standard errors.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments correctly identify areas where greater methodological transparency, validation, and reproducibility details will improve the manuscript. We respond point-by-point below and will make the corresponding revisions.

read point-by-point responses
  1. Referee: [Methods] Methods (citation classification procedure): every trajectory, the 27% compression result, and the regression dominance claims rest on labeling 108k citations as active adoption vs. background reference using only citing-paper text. No inter-rater reliability, accuracy against ground-truth (code/data usage), or validation set is reported; if error rates correlate with model age or release year, the inverted-U shape, time-to-peak metric, and release-year coefficients will be biased. The abstract's robustness checks do not address this.

    Authors: We acknowledge that the current manuscript does not report validation metrics for the citation classification. The procedure used contextual cues in the citing paper (e.g., verbs indicating usage such as 'trained', 'fine-tuned', or 'evaluated on' versus purely referential mentions) applied uniformly across all papers. In the revised version we will expand the Methods section with the complete rule set. We will also add a validation subsection reporting results from a stratified random sample of 500 citations (balanced by release year) that were independently annotated by two raters; we will report Cohen's kappa, accuracy against the automated labels, and a direct test of whether error rates vary systematically with model release year or age. Any detected bias will be addressed via sensitivity analyses or re-estimation of the key 27% compression coefficient. revision: yes

  2. Referee: [Results] Results (regression specifications): the claim that 'release year explains both time-to-peak and scientific lifespan more strongly than architecture, openness, or scale' requires the exact model (variables, functional form, handling of multicollinearity between release year and size, and reported effect sizes or partial R²). Without these, it is impossible to assess whether the dominance is robust or an artifact of omitted controls.

    Authors: We agree the regression details must be fully specified. The models are OLS regressions with time-to-peak (months from release to peak) and lifespan (months from release to last citation above a minimum threshold) as outcomes. Predictors include continuous release year, log(parameters), categorical architecture, binary openness, and access modality. Multicollinearity was checked with VIFs (all <5). In the revision we will present the complete regression tables (coefficients, SEs, p-values) together with partial R² values for each predictor to document the relative explanatory power of release year. We will also add an appendix with alternative specifications (e.g., excluding size, adding interactions, and using log-linear forms) and confirm that the dominance result is robust. revision: yes

  3. Referee: [Data] Data and reproducibility: the manuscript states it uses 'over 108k citing papers' but provides no details on data access, exact query used to retrieve citations, or how the 62-model sample was constructed. This blocks independent verification of the trajectories and the p<0.001 result.

    Authors: We will add a dedicated Data subsection. Citations were retrieved via the Semantic Scholar API using model-name queries (canonical name plus common aliases) limited to papers published after each model's release date. The 62-model sample consists of all LLMs released 2018–2022 that had at least three full years of post-release observation by the end of 2025, drawn from public model registries and release announcements. The revised manuscript will include the precise query templates, inclusion/exclusion criteria, and aggregate counts. We will also release the processed model list, citation trajectories, and analysis scripts upon acceptance to allow direct replication of the reported p<0.001 result and trajectories. revision: yes
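The inter-rater validation proposed in response 1 centers on Cohen's kappa, which can be sketched in a few lines of standard Python. The labels below are invented placeholders, not the paper's data:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items (nominal labels)."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items on which the raters match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if the raters labeled independently with the
    # same marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    p_e = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

# Toy example: 'use' = active adoption, 'ref' = background reference.
a = ["use", "use", "ref", "ref"]
b = ["use", "ref", "ref", "ref"]
print(cohens_kappa(a, b))  # 0.5
```

The same function applied to the proposed 500-citation validation sample, stratified by release year, would directly test whether agreement degrades for older models.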
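The VIF check described in response 2 can be illustrated for the two-predictor case, where the variance inflation factor reduces to 1 / (1 − r²) for the Pearson correlation r between the two predictors. The data below are invented, not the paper's:

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def vif_two_predictors(x, y):
    """With exactly two predictors, VIF = 1 / (1 - r^2) for both."""
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r ** 2)

# Invented predictors: release year and a toy size variable chosen to be
# uncorrelated with year, so the VIF is exactly 1 (no collinearity).
release_year = [2019, 2020, 2021, 2022]
log_params   = [1.0, 0.0, 0.0, 1.0]
print(vif_two_predictors(release_year, log_params))  # 1.0
```

In the paper's actual setting, with more than two predictors, each VIF comes from regressing one predictor on all the others; the rebuttal's reported threshold (all VIFs below 5) is the conventional rule of thumb for tolerable collinearity.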
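The retrieval described in response 3 can be sketched as a query-template builder for the Semantic Scholar Graph API's paper-search endpoint. The endpoint and its `query` and `year` parameters exist in that API, but the alias handling, field selection, and limit here are hypothetical choices, not the authors' actual templates:

```python
from urllib.parse import urlencode

# Semantic Scholar Graph API paper-search endpoint (public).
S2_SEARCH = "https://api.semanticscholar.org/graph/v1/paper/search"

def build_citation_queries(model_name, aliases, release_year):
    """Build one search URL per model name/alias, restricted to papers
    published in or after the model's release year (open-ended range)."""
    urls = []
    for name in [model_name, *aliases]:
        params = {
            "query": name,
            "year": f"{release_year}-",        # e.g. "2023-" = 2023 onward
            "fields": "title,year,abstract",   # hypothetical field choice
            "limit": 100,
        }
        urls.append(f"{S2_SEARCH}?{urlencode(params)}")
    return urls

urls = build_citation_queries("LLaMA", ["Llama 2"], 2023)
print(urls[0])
```

Releasing templates like these alongside the inclusion/exclusion criteria would let readers regenerate the citation pool and check the reported trajectories directly.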

Circularity Check

0 steps flagged

No significant circularity: purely observational empirical analysis

full rationale

The paper performs an observational study: it gathers external citation data (108k papers), applies text-based classification to label active vs. background citations, constructs per-model trajectories, computes metrics such as time-to-peak and lifespan, and runs regressions to assess predictors. No derivation chain, functional form, or prediction reduces to its own inputs by construction. The reported regularities (inverted-U curves, 27% compression, release-year dominance) are direct outputs of the data processing and statistical analysis on independent citation records, not tautological re-expressions of fitted parameters or self-citations. Self-citation load-bearing and ansatz smuggling are absent; the work contains no uniqueness theorems or mathematical derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Claims rest on the untested accuracy of manual or automated citation classification and on the representativeness of the 62-model sample; no new physical or mathematical axioms, free parameters, or invented entities are introduced beyond standard regression controls.

pith-pipeline@v0.9.0 · 5538 in / 1135 out tokens · 54867 ms · 2026-05-10T17:10:11.999895+00:00 · methodology

discussion (0)

