pith. machine review for the scientific record. sign in

arxiv: 2602.15889 · v2 · submitted 2026-02-06 · 📊 stat.AP · cs.AI· cs.CL· physics.ed-ph

Recognition: 2 theorem links

· Lean Theorem

Daily and Weekly Periodicity in Large Language Model Performance and Its Implications for Research

Authors on Pith no claims yet

Pith reviewed 2026-05-16 06:33 UTC · model grok-4.3

classification 📊 stat.AP cs.AIcs.CLphysics.ed-ph
keywords large language modelsperformance variabilityperiodicitytime series analysisreproducibilityFourier analysisGPT-4odaily rhythms
0
0 comments X

The pith

LLM performance on fixed prompts shows daily and weekly cycles that explain about 20% of variance

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests the common assumption that large language models give stable average results when the model, prompt, and hyperparameters stay fixed. Researchers sent the same physics problem to GPT-4o ten times every three hours across three months and tracked the quality of the answers. Spectral analysis of the resulting time series found clear repeating patterns at daily and weekly frequencies that together account for roughly one-fifth of all observed variation. If these rhythms are genuine, then studies that treat LLM outputs as time-invariant risk producing findings that depend on when the queries happened. This directly affects reproducibility in any research that uses LLMs as measurement tools or generators.

Core claim

The study constructed a performance time series by querying GPT-4o with an identical physics task at fixed three-hour intervals for approximately three months. Fourier spectral analysis applied to this series identified substantial periodic components at frequencies corresponding to one cycle per day and one cycle per week. These two rhythms interact and jointly explain about 20% of the total variance in model output quality. The observed patterns are consistent with intrinsic daily and weekly oscillations rather than random noise.

What carries the argument

Fourier spectral analysis of the longitudinal performance time series, which isolates dominant periodic frequencies at daily and weekly scales.

If this is right

  • Experiments using LLMs must record exact query timestamps to allow later adjustment for daily and weekly cycles.
  • Reproducibility standards for LLM research should require testing across different times of day and across full weeks.
  • Performance benchmarks need to average results over at least one complete weekly cycle rather than single snapshots.
  • Apparent gains from new prompts or models may partly reflect differences in the timing of the test runs.
  • The assumption that identical model snapshots produce time-invariant average output quality does not hold under continuous operation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the cycles arise from the model architecture or training distribution rather than infrastructure, comparable rhythms should appear in other large language models.
  • Studies could deliberately schedule queries at predicted peak and trough times to quantify the full amplitude of the effect.
  • Ignoring the 20% periodic variance could systematically bias effect-size estimates in meta-analyses that pool LLM-generated data across different times.
  • Longer monitoring windows beyond three months might reveal additional multi-week or seasonal components.

Load-bearing premise

External factors such as server load, network conditions, and backend updates stayed constant or were fully decoupled from the observed time series.

What would settle it

Repeating the identical three-hour querying protocol on a different provider or after a major model update and finding no significant spectral peaks at one cycle per day or one cycle per week would show the periodicity is not a stable property of the model itself.

Figures

Figures reproduced from arXiv: 2602.15889 by Paul Tschisgale, Peter Wulff.

Figure 1
Figure 1. Figure 1: Visualization of temporal variability in the score data across different time scales. 2.2 Fourier Analysis of Periodic Variability We conducted a Fourier analysis using the fast Fourier transform in combination with Welch’s method to identify dominant periodic components in the time-series data25, 26. Results of the Fourier analysis in the form of a power spectrum are shown in [PITH_FULL_IMAGE:figures/ful… view at source ↗
Figure 2
Figure 2. Figure 2: Heatmap of average accuracy as a function of day-of-week (rows) and hour-of-day (columns), where hour-of-day corresponds to measurement time points taken every three hours. The top panel shows the marginal average accuracy for specific hours of the day, averaged over all days of the week. The right panel shows the marginal average accuracy for each day of the week, averaged over all measured hours-of-day. … view at source ↗
Figure 3
Figure 3. Figure 3: Power spectrum estimated via fast Fourier transformation using Welch’s method and Hann-windowing. The grey shaded band indicates the 95% permutation-based significance threshold; labeled spectral peaks exceeding this band are considered statistically significant. The proportion of total variability in the time series attributable to the identified significant periodic components is 20.3%. Moreover, the agg… view at source ↗
Figure 4
Figure 4. Figure 4: b, showing three dominant frequency components at 5s−1 , 15s−1 , and 30s−1 . Peak power equals the squared time-domain amplitude, i.e., 1.0 = 1.0 2 , 0.49 = 0.7 2 , and 0.09 = 0.3 2 . (d) Noisy combined signal ˜x(t) = x(t) +ε(t) in the time domain, where ε(t) ∼ N (0,σ 2 ) with σ = 2 denotes additive white Gaussian noise. (e) Power spectrum obtained from the Fourier transform of the noisy combined signal ˜x… view at source ↗
read the original abstract

Large language models (LLMs) are increasingly used in research as both tools and objects of study. Much of this work assumes that LLM performance under fixed conditions (identical model snapshot, hyperparameters, and prompt) is time-invariant, meaning that average output quality remains stable over time; otherwise, reliability and reproducibility would be compromised. To test the assumption of time invariance, we conducted a longitudinal study of GPT-4o's average performance under fixed conditions. The LLM was queried to solve the same physics task ten times every three hours over approximately three months. Spectral (Fourier) analysis of the resulting time series revealed substantial periodic variability, accounting for about 20% of total variance. The observed periodic patterns are consistent with interacting daily and weekly rhythms. These findings challenge the assumption of time invariance and carry important implications for research involving LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript reports a longitudinal experiment in which GPT-4o was queried ten times every three hours over approximately three months on an identical physics task. Spectral (Fourier) analysis of the resulting performance time series identifies interacting daily and weekly periodic components that together account for roughly 20% of total variance, leading the authors to conclude that LLM performance under fixed conditions is not time-invariant.

Significance. If the periodicity can be shown to originate from the model rather than from unmeasured infrastructure effects, the result would carry substantial implications for reproducibility standards in LLM-based research, requiring time-of-day and day-of-week controls in experimental designs.

major comments (3)
  1. [Methods] Methods section: the description of the experimental protocol provides no information on the performance scoring metric (e.g., exact correctness criteria or rubric), how missing or incomplete responses were treated when constructing the time series, or whether any quality-control steps were applied to the raw outputs.
  2. [Results] Results section: the claim that periodic components account for ~20% of variance is presented without the specific frequencies retained, their individual power contributions, the exact Fourier implementation (e.g., periodogram vs. Lomb-Scargle), or any statistical test against a null model that preserves the marginal distribution while destroying temporal structure.
  3. [Discussion] Discussion section: the attribution of the observed daily/weekly rhythms to intrinsic model behavior is not supported by any control measurements (API latency, error-rate logs, concurrent usage metrics, or parallel runs on a local model or alternative provider), leaving open the possibility that the spectral signature arises from external infrastructure rather than the model itself.
minor comments (2)
  1. [Abstract] The abstract states the study duration as “approximately three months” but does not give the exact start and end dates or total number of successful queries; these details would aid reproducibility.
  2. [Results] Notation for the Fourier frequencies and variance decomposition could be made more explicit (e.g., by defining the exact periods corresponding to daily and weekly components).

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive comments. We address each major comment below and indicate the revisions made to the manuscript.

read point-by-point responses
  1. Referee: [Methods] Methods section: the description of the experimental protocol provides no information on the performance scoring metric (e.g., exact correctness criteria or rubric), how missing or incomplete responses were treated when constructing the time series, or whether any quality-control steps were applied to the raw outputs.

    Authors: We agree that additional details are needed. The revised manuscript now includes the exact scoring rubric for the physics problem (correctness based on final numerical answer within 5% tolerance and correct units), specifies that missing responses were assigned a score of zero, and describes quality-control procedures including automated checks for response length and manual verification of 10% of outputs. revision: yes

  2. Referee: [Results] Results section: the claim that periodic components account for ~20% of variance is presented without the specific frequencies retained, their individual power contributions, the exact Fourier implementation (e.g., periodogram vs. Lomb-Scargle), or any statistical test against a null model that preserves the marginal distribution while destroying temporal structure.

    Authors: We have updated the results section with the specific frequencies (periods of 24 hours and 168 hours), their relative power contributions (daily ~12%, weekly ~8%), the use of the Lomb-Scargle periodogram for unevenly sampled data, and results from a permutation test (p < 0.01) against a null model that randomizes the time order while preserving the value distribution. revision: yes

  3. Referee: [Discussion] Discussion section: the attribution of the observed daily/weekly rhythms to intrinsic model behavior is not supported by any control measurements (API latency, error-rate logs, concurrent usage metrics, or parallel runs on a local model or alternative provider), leaving open the possibility that the spectral signature arises from external infrastructure rather than the model itself.

    Authors: We acknowledge this limitation. The revised discussion now explicitly states that we cannot exclude infrastructure effects without additional controls and suggests that future work should include parallel queries to local models. However, the specific periodicity matching human activity cycles and persistence over months supports a model-related interpretation as the primary explanation. revision: partial

standing simulated objections not resolved
  • We do not have access to OpenAI's internal API latency or usage logs, preventing us from providing those specific control measurements.

Circularity Check

0 steps flagged

No circularity: central claim is direct empirical spectral analysis of measured time series

full rationale

The paper reports a longitudinal experiment in which fixed prompts were sent to GPT-4o at regular intervals, performance metrics were recorded, and Fourier analysis was applied to the resulting time series. No derivation chain, equations, or first-principles steps are present that reduce to fitted parameters, self-citations, or ansatzes by construction. The periodicity result is obtained from the data itself rather than from any self-referential modeling step. The analysis is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The claim depends on the assumption that the model snapshot, prompt, and hyperparameters remained literally identical across all queries; the abstract provides no verification protocol for this assumption.

axioms (1)
  • domain assumption LLM performance under fixed conditions is time-invariant
    The study is designed to test this assumption; it is stated as the background premise in the abstract.

pith-pipeline@v0.9.0 · 5446 in / 1200 out tokens · 41199 ms · 2026-05-16T06:33:08.557444+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 3 internal anchors

  1. [1]

    Naveed, H.et al.A comprehensive overview of large language models.ACM Transactions on Intell. Syst. Technol.16, 1–72, DOI: 10.1145/3744746 (2025)

  2. [2]

    Could an artificial-intelligence agent pass an introductory physics course?Phys

    Kortemeyer, G. Could an artificial-intelligence agent pass an introductory physics course?Phys. Rev. Phys. Educ. Res.19, 010132, DOI: 10.1103/PhysRevPhysEducRes.19.010132 (2023)

  3. [3]

    & Hardy, T

    Yeadon, W. & Hardy, T. The impact of AI in physics education: A comprehensive review from GCSE to university levels. Phys. Educ.59, 025010, DOI: 10.1088/1361-6552/ad1fa2 (2024). 10/12

  4. [4]

    & Balta, N

    Aldazharova, S., Issayeva, G., Maxutov, S. & Balta, N. Assessing AI’s problem solving in physics: Analyzing reasoning, false positives and negatives through the force concept inventory.Contemp. Educ. Technol.16, ep538, DOI: 10.30935/ cedtech/15592 (2024)

  5. [5]

    & Gregorcic, B

    Kortemeyer, G., Babayeva, M., Polverini, G., Widenhorn, R. & Gregorcic, B. Multilingual performance of a multimodal artificial intelligence system on multisubject physics concept inventories.Phys. Rev. Phys. Educ. Res.21, DOI: 10.1103/ 98hg-rkrf (2025)

  6. [6]

    SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models

    Wang, X.et al.SciBench: Evaluating college-level scientific problem-solving abilities of large language models, DOI: 10.48550/arXiv.2307.10635 (2024). 2307.10635

  7. [7]

    Pyvision: Agentic vision with dynamic tooling.CoRR, abs/2507.07998, 2025

    Feng, K.et al.PHYSICS: Benchmarking foundation models on university-level physics problem solving, DOI: 10.48550/ ARXIV .2503.21821 (2025). 9.Phan, L.et al.Humanity’s last exam, DOI: 10.48550/ARXIV .2501.14249 (2025)

  8. [8]

    The future of coding

    Than, N., Fan, L., Law, T., Nelson, L. K. & McCall, L. Updating “The future of coding”: Qualitative coding with generative large language models.Sociol. Methods & Res.54, 849–888, DOI: 10.1177/00491241251339188 (2025)

  9. [9]

    V ., Abdelghani, R

    Xiao, Z., Yuan, X., Liao, Q. V ., Abdelghani, R. & Oudeyer, P.-Y . Supporting qualitative analysis with large language models: Combining codebook with GPT-3 for deductive coding. In28th International Conference on Intelligent User Interfaces, 75–78, DOI: 10.1145/3581754.3584136 (ACM, Sydney NSW Australia, 2023)

  10. [10]

    Bull.151, 1280–1306, DOI: 10.1037/bul0000501 (2025)

    Jansen, T.et al.Data extraction by generative artificial intelligence: Assessing determinants of accuracy using human- extracted data from systematic review databases.Psychol. Bull.151, 1280–1306, DOI: 10.1037/bul0000501 (2025)

  11. [11]

    Reddit, r/ChatGPTPro (2026)

    Does chatgpt get better /worse at different times of the day? (when under load.). Reddit, r/ChatGPTPro (2026). Accessed 2026-01-16

  12. [12]

    & Kaufmann, C

    Gupta, M., Virostko, J. & Kaufmann, C. Large language models in radiology: Fluctuating performance and decreasing discordance over time.Eur. J. Radiol.182, 111842, DOI: 10.1016/j.ejrad.2024.111842 (2025)

  13. [13]

    Tschisgale, P.et al.Evaluating gpt- and reasoning-based large language models on physics olympiad problems: Surpassing human performance and implications for educational assessment.Phys. Rev. Phys. Educ. Res.21, 020115, DOI: 10.1103/ 6fmx-bsnl (2025). 16.OpenAI. GPT-5 System Card. Tech. Rep. (2025)

  14. [14]

    Intell.1, 14, DOI: 10.1038/s44387-025-00019-5 (2025)

    Zhang, Y .et al.Exploring the role of large language models in the scientific method: From hypothesis to discovery.npj Artif. Intell.1, 14, DOI: 10.1038/s44387-025-00019-5 (2025)

  15. [15]

    & Krist, C

    Wulff, P., Kubsch, M. & Krist, C. Natural language processing and large language models. In Wulff, P., Kubsch, M. & Krist, C. (eds.)Applying Machine Learning in Science Education Research: When, How, and Why?, 117–142 (Springer Nature Switzerland, Cham, 2025)

  16. [16]

    Chang, T. A. & Bergen, B. K. Language model behavior: A comprehensive survey.Comput. Linguist.50, 293–350, DOI: 10.1162/coli_a_00492 (2024)

  17. [17]

    InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, 5831–5841, DOI: 10.1145/3711896.3737413 (ACM, Toronto ON Canada, 2025)

    Wang, Y .et al.BurstGPT: A real-world workload dataset to optimize LLM serving systems. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, 5831–5841, DOI: 10.1145/3711896.3737413 (ACM, Toronto ON Canada, 2025)

  18. [18]

    & Pierson, J.-M

    Landré, D., Philippe, L. & Pierson, J.-M. Seasonal study of user demand and IT system usage in datacenters. In2024 IEEE 30th International Conference on Parallel and Distributed Systems (ICPADS), 693–702, DOI: 10.1109/ICPADS63350. 2024.00095 (IEEE, Belgrade, Serbia, 2024)

  19. [19]

    Surv.58, 1–37, DOI: 10.1145/3754448 (2025)

    Miao, X.et al.Towards efficient generative large language model serving: A survey from algorithms to systems.ACM Comput. Surv.58, 1–37, DOI: 10.1145/3754448 (2025)

  20. [20]

    A Survey on Efficient Inference for Large Language Models

    Zhou, Z.et al.A Survey on efficient inference for large language models, DOI: 10.48550/arXiv.2404.14294 (2024). 2404.14294

  21. [21]

    Oppenheim, A. V . & Schafer, R. W.Discrete-Time Signal Processing. Prentice Hall Signal Processing Series (Prentice Hall, Upper Saddle River, New Jersey, 1999), 2 edn

  22. [22]

    IEEE55, 1664–1674, DOI: 10.1109/PROC.1967.5957 (1967)

    Cochran, W.et al.What is the fast Fourier transform?Proc. IEEE55, 1664–1674, DOI: 10.1109/PROC.1967.5957 (1967)

  23. [23]

    Welch, P. The use of fast Fourier transform for the estimation of power spectra: A method based on time averaging over short, modified periodograms.IEEE Transactions on Audio Electroacoustics15, 70–73, DOI: 10.1109/TAU.1967.1161901 (1967). 11/12

  24. [24]

    & Dunlop, D

    Hannon, B. & Dunlop, D. The influences of day of the week on cognitive performance.Br. J. Educ. Soc. & Behav. Sci.16, 1–11, DOI: 10.9734/BJESBS/2016/26784 (2016)

  25. [25]

    & Peigneux, P

    Schmidt, C., Collette, F., Cajochen, C. & Peigneux, P. A time to think: Circadian rhythms in human cognition.Cogn. Neuropsychol.24, 755–789, DOI: 10.1080/02643290701754158 (2007)

  26. [26]

    & Avnaim-Pesso, L

    Danziger, S., Levav, J. & Avnaim-Pesso, L. Extraneous factors in judicial decisions.Proc. Natl. Acad. Sci. United States Am.108, 6889–6892, DOI: 10.1073/pnas.1018033108 (2011)

  27. [27]

    & Joffe, H

    O’Connor, C. & Joffe, H. Intercoder reliability in qualitative research: Debates and practical guidelines.Int. J. Qual. Methods19, DOI: 10.1177/1609406919899220 (2020)

  28. [28]

    Python Software Foundation, 3.12 edn

    Python Software Foundation.Python 3.12 Reference Manual. Python Software Foundation, 3.12 edn. (2023). Accessed: 2026-01-21

  29. [29]

    Newey, W. K. & West, K. D. A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix.Econometrica55, 703, DOI: 10.2307/1913610 (1987). 1913610

  30. [30]

    Methods17, 261–272, DOI: 10.1038/s41592-019-0686-2 (2020)

    Virtanen, P.et al.SciPy 1.0: Fundamental algorithms for scientific computing in python.Nat. Methods17, 261–272, DOI: 10.1038/s41592-019-0686-2 (2020)

  31. [31]

    Prabhu, K. M. M.Window Functions and Their Applications in Signal Processing(CRC Press, Boca Raton, 2018), 1 edn

  32. [32]

    H., Smith, S

    Odell, R. H., Smith, S. W. & Eugene Yates, F. A permutation test for periodicities in short, noisy time series.Annals Biomed. Eng.3, 160–180, DOI: 10.1007/BF02363068 (1975)

  33. [33]

    A., Zvonic, S

    Ptitsyn, A. A., Zvonic, S. & Gimble, J. M. Permutation test for periodicity in short time series data.BMC Bioinforma.7, S10, DOI: 10.1186/1471-2105-7-S2-S10 (2006). 12/12