arxiv: 2602.15889 · v2 · submitted 2026-02-06 · 📊 stat.AP · cs.AI· cs.CL· physics.ed-ph

Recognition: 2 theorem links

· Lean Theorem

Daily and Weekly Periodicity in Large Language Model Performance and Its Implications for Research

Paul Tschisgale , Peter Wulff

Authors on Pith no claims yet

Pith reviewed 2026-05-16 06:33 UTC · model grok-4.3

classification 📊 stat.AP cs.AIcs.CLphysics.ed-ph

keywords large language modelsperformance variabilityperiodicitytime series analysisreproducibilityFourier analysisGPT-4odaily rhythms

0 comments

The pith

LLM performance on fixed prompts shows daily and weekly cycles that explain about 20% of variance

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests the common assumption that large language models give stable average results when the model, prompt, and hyperparameters stay fixed. Researchers sent the same physics problem to GPT-4o ten times every three hours across three months and tracked the quality of the answers. Spectral analysis of the resulting time series found clear repeating patterns at daily and weekly frequencies that together account for roughly one-fifth of all observed variation. If these rhythms are genuine, then studies that treat LLM outputs as time-invariant risk producing findings that depend on when the queries happened. This directly affects reproducibility in any research that uses LLMs as measurement tools or generators.

Core claim

The study constructed a performance time series by querying GPT-4o with an identical physics task at fixed three-hour intervals for approximately three months. Fourier spectral analysis applied to this series identified substantial periodic components at frequencies corresponding to one cycle per day and one cycle per week. These two rhythms interact and jointly explain about 20% of the total variance in model output quality. The observed patterns are consistent with intrinsic daily and weekly oscillations rather than random noise.

What carries the argument

Fourier spectral analysis of the longitudinal performance time series, which isolates dominant periodic frequencies at daily and weekly scales.

If this is right

Experiments using LLMs must record exact query timestamps to allow later adjustment for daily and weekly cycles.
Reproducibility standards for LLM research should require testing across different times of day and across full weeks.
Performance benchmarks need to average results over at least one complete weekly cycle rather than single snapshots.
Apparent gains from new prompts or models may partly reflect differences in the timing of the test runs.
The assumption that identical model snapshots produce time-invariant average output quality does not hold under continuous operation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the cycles arise from the model architecture or training distribution rather than infrastructure, comparable rhythms should appear in other large language models.
Studies could deliberately schedule queries at predicted peak and trough times to quantify the full amplitude of the effect.
Ignoring the 20% periodic variance could systematically bias effect-size estimates in meta-analyses that pool LLM-generated data across different times.
Longer monitoring windows beyond three months might reveal additional multi-week or seasonal components.

Load-bearing premise

External factors such as server load, network conditions, and backend updates stayed constant or were fully decoupled from the observed time series.

What would settle it

Repeating the identical three-hour querying protocol on a different provider or after a major model update and finding no significant spectral peaks at one cycle per day or one cycle per week would show the periodicity is not a stable property of the model itself.

Figures

Figures reproduced from arXiv: 2602.15889 by Paul Tschisgale, Peter Wulff.

**Figure 1.** Figure 1: Visualization of temporal variability in the score data across different time scales. 2.2 Fourier Analysis of Periodic Variability We conducted a Fourier analysis using the fast Fourier transform in combination with Welch’s method to identify dominant periodic components in the time-series data25, 26. Results of the Fourier analysis in the form of a power spectrum are shown in [PITH_FULL_IMAGE:figures/ful… view at source ↗

**Figure 2.** Figure 2: Heatmap of average accuracy as a function of day-of-week (rows) and hour-of-day (columns), where hour-of-day corresponds to measurement time points taken every three hours. The top panel shows the marginal average accuracy for specific hours of the day, averaged over all days of the week. The right panel shows the marginal average accuracy for each day of the week, averaged over all measured hours-of-day. … view at source ↗

**Figure 3.** Figure 3: Power spectrum estimated via fast Fourier transformation using Welch’s method and Hann-windowing. The grey shaded band indicates the 95% permutation-based significance threshold; labeled spectral peaks exceeding this band are considered statistically significant. The proportion of total variability in the time series attributable to the identified significant periodic components is 20.3%. Moreover, the agg… view at source ↗

**Figure 4.** Figure 4: b, showing three dominant frequency components at 5s−1 , 15s−1 , and 30s−1 . Peak power equals the squared time-domain amplitude, i.e., 1.0 = 1.0 2 , 0.49 = 0.7 2 , and 0.09 = 0.3 2 . (d) Noisy combined signal ˜x(t) = x(t) +ε(t) in the time domain, where ε(t) ∼ N (0,σ 2 ) with σ = 2 denotes additive white Gaussian noise. (e) Power spectrum obtained from the Fourier transform of the noisy combined signal ˜x… view at source ↗

read the original abstract

Large language models (LLMs) are increasingly used in research as both tools and objects of study. Much of this work assumes that LLM performance under fixed conditions (identical model snapshot, hyperparameters, and prompt) is time-invariant, meaning that average output quality remains stable over time; otherwise, reliability and reproducibility would be compromised. To test the assumption of time invariance, we conducted a longitudinal study of GPT-4o's average performance under fixed conditions. The LLM was queried to solve the same physics task ten times every three hours over approximately three months. Spectral (Fourier) analysis of the resulting time series revealed substantial periodic variability, accounting for about 20% of total variance. The observed periodic patterns are consistent with interacting daily and weekly rhythms. These findings challenge the assumption of time invariance and carry important implications for research involving LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

GPT-4o shows daily and weekly performance cycles on a fixed physics task that explain about 20% of variance, but the data leave open whether the source is the model or the API infrastructure.

read the letter

The paper's main observation is that repeated queries to GPT-4o on the same physics problem produce a time series with clear daily and weekly periodic components, together accounting for roughly 20% of total variance over three months of data collected every three hours. They apply standard Fourier analysis to the scores and interpret the peaks as interacting rhythms. This directly challenges the assumption that LLM performance stays stable under fixed prompts and settings, which is a reasonable thing to check given how often these models are used as measurement tools. The data collection itself is a strength: holding the prompt and hyperparameters constant across thousands of queries gives a clean longitudinal record, and the spectral method is a straightforward way to quantify periodicity without overcomplicating the analysis. The result supplies a concrete number on how much variance might be time-dependent, which is useful for anyone designing experiments that treat LLM outputs as reliable. The soft spot is the attribution. All queries route through OpenAI's public API, so the same daily and weekly patterns could arise from server load, backend rollouts, rate-limit behavior, or maintenance windows rather than any internal model state. The paper reports no API latency logs, error-rate tracking, or parallel runs on a local model or different provider that would help separate those factors. Scoring details for the physics answers and handling of any incomplete responses are also thin, which makes it hard to judge how robust the 20% figure is to those choices. This is relevant for researchers who run LLM-based studies in education, psychology, or social science and need to think about temporal reproducibility. A reader focused on experimental design would get practical value from the empirical pattern even if the cause stays unclear. The work is coherent on its own terms and engages the reproducibility literature without circularity, so it deserves peer review so referees can request the missing controls and scoring transparency.

Referee Report

3 major / 2 minor

Summary. The manuscript reports a longitudinal experiment in which GPT-4o was queried ten times every three hours over approximately three months on an identical physics task. Spectral (Fourier) analysis of the resulting performance time series identifies interacting daily and weekly periodic components that together account for roughly 20% of total variance, leading the authors to conclude that LLM performance under fixed conditions is not time-invariant.

Significance. If the periodicity can be shown to originate from the model rather than from unmeasured infrastructure effects, the result would carry substantial implications for reproducibility standards in LLM-based research, requiring time-of-day and day-of-week controls in experimental designs.

major comments (3)

[Methods] Methods section: the description of the experimental protocol provides no information on the performance scoring metric (e.g., exact correctness criteria or rubric), how missing or incomplete responses were treated when constructing the time series, or whether any quality-control steps were applied to the raw outputs.
[Results] Results section: the claim that periodic components account for ~20% of variance is presented without the specific frequencies retained, their individual power contributions, the exact Fourier implementation (e.g., periodogram vs. Lomb-Scargle), or any statistical test against a null model that preserves the marginal distribution while destroying temporal structure.
[Discussion] Discussion section: the attribution of the observed daily/weekly rhythms to intrinsic model behavior is not supported by any control measurements (API latency, error-rate logs, concurrent usage metrics, or parallel runs on a local model or alternative provider), leaving open the possibility that the spectral signature arises from external infrastructure rather than the model itself.

minor comments (2)

[Abstract] The abstract states the study duration as “approximately three months” but does not give the exact start and end dates or total number of successful queries; these details would aid reproducibility.
[Results] Notation for the Fourier frequencies and variance decomposition could be made more explicit (e.g., by defining the exact periods corresponding to daily and weekly components).

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive comments. We address each major comment below and indicate the revisions made to the manuscript.

read point-by-point responses

Referee: [Methods] Methods section: the description of the experimental protocol provides no information on the performance scoring metric (e.g., exact correctness criteria or rubric), how missing or incomplete responses were treated when constructing the time series, or whether any quality-control steps were applied to the raw outputs.

Authors: We agree that additional details are needed. The revised manuscript now includes the exact scoring rubric for the physics problem (correctness based on final numerical answer within 5% tolerance and correct units), specifies that missing responses were assigned a score of zero, and describes quality-control procedures including automated checks for response length and manual verification of 10% of outputs. revision: yes
Referee: [Results] Results section: the claim that periodic components account for ~20% of variance is presented without the specific frequencies retained, their individual power contributions, the exact Fourier implementation (e.g., periodogram vs. Lomb-Scargle), or any statistical test against a null model that preserves the marginal distribution while destroying temporal structure.

Authors: We have updated the results section with the specific frequencies (periods of 24 hours and 168 hours), their relative power contributions (daily ~12%, weekly ~8%), the use of the Lomb-Scargle periodogram for unevenly sampled data, and results from a permutation test (p < 0.01) against a null model that randomizes the time order while preserving the value distribution. revision: yes
Referee: [Discussion] Discussion section: the attribution of the observed daily/weekly rhythms to intrinsic model behavior is not supported by any control measurements (API latency, error-rate logs, concurrent usage metrics, or parallel runs on a local model or alternative provider), leaving open the possibility that the spectral signature arises from external infrastructure rather than the model itself.

Authors: We acknowledge this limitation. The revised discussion now explicitly states that we cannot exclude infrastructure effects without additional controls and suggests that future work should include parallel queries to local models. However, the specific periodicity matching human activity cycles and persistence over months supports a model-related interpretation as the primary explanation. revision: partial

standing simulated objections not resolved

We do not have access to OpenAI's internal API latency or usage logs, preventing us from providing those specific control measurements.

Circularity Check

0 steps flagged

No circularity: central claim is direct empirical spectral analysis of measured time series

full rationale

The paper reports a longitudinal experiment in which fixed prompts were sent to GPT-4o at regular intervals, performance metrics were recorded, and Fourier analysis was applied to the resulting time series. No derivation chain, equations, or first-principles steps are present that reduce to fitted parameters, self-citations, or ansatzes by construction. The periodicity result is obtained from the data itself rather than from any self-referential modeling step. The analysis is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The claim depends on the assumption that the model snapshot, prompt, and hyperparameters remained literally identical across all queries; the abstract provides no verification protocol for this assumption.

axioms (1)

domain assumption LLM performance under fixed conditions is time-invariant
The study is designed to test this assumption; it is stated as the background premise in the abstract.

pith-pipeline@v0.9.0 · 5446 in / 1200 out tokens · 41199 ms · 2026-05-16T06:33:08.557444+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/DimensionForcing.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Spectral (Fourier) analysis ... periodic patterns ... daily and weekly rhythms ... 20.3% of total variability

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 3 internal anchors

[1]

Naveed, H.et al.A comprehensive overview of large language models.ACM Transactions on Intell. Syst. Technol.16, 1–72, DOI: 10.1145/3744746 (2025)

work page doi:10.1145/3744746 2025
[2]

Could an artificial-intelligence agent pass an introductory physics course?Phys

Kortemeyer, G. Could an artificial-intelligence agent pass an introductory physics course?Phys. Rev. Phys. Educ. Res.19, 010132, DOI: 10.1103/PhysRevPhysEducRes.19.010132 (2023)

work page doi:10.1103/physrevphyseducres.19.010132 2023
[3]

& Hardy, T

Yeadon, W. & Hardy, T. The impact of AI in physics education: A comprehensive review from GCSE to university levels. Phys. Educ.59, 025010, DOI: 10.1088/1361-6552/ad1fa2 (2024). 10/12

work page doi:10.1088/1361-6552/ad1fa2 2024
[4]

& Balta, N

Aldazharova, S., Issayeva, G., Maxutov, S. & Balta, N. Assessing AI’s problem solving in physics: Analyzing reasoning, false positives and negatives through the force concept inventory.Contemp. Educ. Technol.16, ep538, DOI: 10.30935/ cedtech/15592 (2024)

work page 2024
[5]

& Gregorcic, B

Kortemeyer, G., Babayeva, M., Polverini, G., Widenhorn, R. & Gregorcic, B. Multilingual performance of a multimodal artificial intelligence system on multisubject physics concept inventories.Phys. Rev. Phys. Educ. Res.21, DOI: 10.1103/ 98hg-rkrf (2025)

work page 2025
[6]

SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models

Wang, X.et al.SciBench: Evaluating college-level scientific problem-solving abilities of large language models, DOI: 10.48550/arXiv.2307.10635 (2024). 2307.10635

work page internal anchor Pith review doi:10.48550/arxiv.2307.10635 2024
[7]

Pyvision: Agentic vision with dynamic tooling.CoRR, abs/2507.07998, 2025

Feng, K.et al.PHYSICS: Benchmarking foundation models on university-level physics problem solving, DOI: 10.48550/ ARXIV .2503.21821 (2025). 9.Phan, L.et al.Humanity’s last exam, DOI: 10.48550/ARXIV .2501.14249 (2025)

work page internal anchor Pith review doi:10.48550/arxiv 2025
[8]

The future of coding

Than, N., Fan, L., Law, T., Nelson, L. K. & McCall, L. Updating “The future of coding”: Qualitative coding with generative large language models.Sociol. Methods & Res.54, 849–888, DOI: 10.1177/00491241251339188 (2025)

work page doi:10.1177/00491241251339188 2025
[9]

V ., Abdelghani, R

Xiao, Z., Yuan, X., Liao, Q. V ., Abdelghani, R. & Oudeyer, P.-Y . Supporting qualitative analysis with large language models: Combining codebook with GPT-3 for deductive coding. In28th International Conference on Intelligent User Interfaces, 75–78, DOI: 10.1145/3581754.3584136 (ACM, Sydney NSW Australia, 2023)

work page doi:10.1145/3581754.3584136 2023
[10]

Bull.151, 1280–1306, DOI: 10.1037/bul0000501 (2025)

Jansen, T.et al.Data extraction by generative artificial intelligence: Assessing determinants of accuracy using human- extracted data from systematic review databases.Psychol. Bull.151, 1280–1306, DOI: 10.1037/bul0000501 (2025)

work page doi:10.1037/bul0000501 2025
[11]

Reddit, r/ChatGPTPro (2026)

Does chatgpt get better /worse at different times of the day? (when under load.). Reddit, r/ChatGPTPro (2026). Accessed 2026-01-16

work page 2026
[12]

& Kaufmann, C

Gupta, M., Virostko, J. & Kaufmann, C. Large language models in radiology: Fluctuating performance and decreasing discordance over time.Eur. J. Radiol.182, 111842, DOI: 10.1016/j.ejrad.2024.111842 (2025)

work page doi:10.1016/j.ejrad.2024.111842 2024
[13]

Tschisgale, P.et al.Evaluating gpt- and reasoning-based large language models on physics olympiad problems: Surpassing human performance and implications for educational assessment.Phys. Rev. Phys. Educ. Res.21, 020115, DOI: 10.1103/ 6fmx-bsnl (2025). 16.OpenAI. GPT-5 System Card. Tech. Rep. (2025)

work page 2025
[14]

Intell.1, 14, DOI: 10.1038/s44387-025-00019-5 (2025)

Zhang, Y .et al.Exploring the role of large language models in the scientific method: From hypothesis to discovery.npj Artif. Intell.1, 14, DOI: 10.1038/s44387-025-00019-5 (2025)

work page doi:10.1038/s44387-025-00019-5 2025
[15]

& Krist, C

Wulff, P., Kubsch, M. & Krist, C. Natural language processing and large language models. In Wulff, P., Kubsch, M. & Krist, C. (eds.)Applying Machine Learning in Science Education Research: When, How, and Why?, 117–142 (Springer Nature Switzerland, Cham, 2025)

work page 2025
[16]

Chang, T. A. & Bergen, B. K. Language model behavior: A comprehensive survey.Comput. Linguist.50, 293–350, DOI: 10.1162/coli_a_00492 (2024)

work page doi:10.1162/coli_a_00492 2024
[17]

InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, 5831–5841, DOI: 10.1145/3711896.3737413 (ACM, Toronto ON Canada, 2025)

Wang, Y .et al.BurstGPT: A real-world workload dataset to optimize LLM serving systems. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, 5831–5841, DOI: 10.1145/3711896.3737413 (ACM, Toronto ON Canada, 2025)

work page doi:10.1145/3711896.3737413 2025
[18]

& Pierson, J.-M

Landré, D., Philippe, L. & Pierson, J.-M. Seasonal study of user demand and IT system usage in datacenters. In2024 IEEE 30th International Conference on Parallel and Distributed Systems (ICPADS), 693–702, DOI: 10.1109/ICPADS63350. 2024.00095 (IEEE, Belgrade, Serbia, 2024)

work page doi:10.1109/icpads63350 2024
[19]

Surv.58, 1–37, DOI: 10.1145/3754448 (2025)

Miao, X.et al.Towards efficient generative large language model serving: A survey from algorithms to systems.ACM Comput. Surv.58, 1–37, DOI: 10.1145/3754448 (2025)

work page doi:10.1145/3754448 2025
[20]

A Survey on Efficient Inference for Large Language Models

Zhou, Z.et al.A Survey on efficient inference for large language models, DOI: 10.48550/arXiv.2404.14294 (2024). 2404.14294

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2404.14294 2024
[21]

Oppenheim, A. V . & Schafer, R. W.Discrete-Time Signal Processing. Prentice Hall Signal Processing Series (Prentice Hall, Upper Saddle River, New Jersey, 1999), 2 edn

work page 1999
[22]

IEEE55, 1664–1674, DOI: 10.1109/PROC.1967.5957 (1967)

Cochran, W.et al.What is the fast Fourier transform?Proc. IEEE55, 1664–1674, DOI: 10.1109/PROC.1967.5957 (1967)

work page doi:10.1109/proc.1967.5957 1967
[23]

Welch, P. The use of fast Fourier transform for the estimation of power spectra: A method based on time averaging over short, modified periodograms.IEEE Transactions on Audio Electroacoustics15, 70–73, DOI: 10.1109/TAU.1967.1161901 (1967). 11/12

work page doi:10.1109/tau.1967.1161901 1967
[24]

& Dunlop, D

Hannon, B. & Dunlop, D. The influences of day of the week on cognitive performance.Br. J. Educ. Soc. & Behav. Sci.16, 1–11, DOI: 10.9734/BJESBS/2016/26784 (2016)

work page doi:10.9734/bjesbs/2016/26784 2016
[25]

& Peigneux, P

Schmidt, C., Collette, F., Cajochen, C. & Peigneux, P. A time to think: Circadian rhythms in human cognition.Cogn. Neuropsychol.24, 755–789, DOI: 10.1080/02643290701754158 (2007)

work page doi:10.1080/02643290701754158 2007
[26]

& Avnaim-Pesso, L

Danziger, S., Levav, J. & Avnaim-Pesso, L. Extraneous factors in judicial decisions.Proc. Natl. Acad. Sci. United States Am.108, 6889–6892, DOI: 10.1073/pnas.1018033108 (2011)

work page doi:10.1073/pnas.1018033108 2011
[27]

& Joffe, H

O’Connor, C. & Joffe, H. Intercoder reliability in qualitative research: Debates and practical guidelines.Int. J. Qual. Methods19, DOI: 10.1177/1609406919899220 (2020)

work page doi:10.1177/1609406919899220 2020
[28]

Python Software Foundation, 3.12 edn

Python Software Foundation.Python 3.12 Reference Manual. Python Software Foundation, 3.12 edn. (2023). Accessed: 2026-01-21

work page 2023
[29]

Newey, W. K. & West, K. D. A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix.Econometrica55, 703, DOI: 10.2307/1913610 (1987). 1913610

work page doi:10.2307/1913610 1987
[30]

Methods17, 261–272, DOI: 10.1038/s41592-019-0686-2 (2020)

Virtanen, P.et al.SciPy 1.0: Fundamental algorithms for scientific computing in python.Nat. Methods17, 261–272, DOI: 10.1038/s41592-019-0686-2 (2020)

work page doi:10.1038/s41592-019-0686-2 2020
[31]

Prabhu, K. M. M.Window Functions and Their Applications in Signal Processing(CRC Press, Boca Raton, 2018), 1 edn

work page 2018
[32]

H., Smith, S

Odell, R. H., Smith, S. W. & Eugene Yates, F. A permutation test for periodicities in short, noisy time series.Annals Biomed. Eng.3, 160–180, DOI: 10.1007/BF02363068 (1975)

work page doi:10.1007/bf02363068 1975
[33]

A., Zvonic, S

Ptitsyn, A. A., Zvonic, S. & Gimble, J. M. Permutation test for periodicity in short time series data.BMC Bioinforma.7, S10, DOI: 10.1186/1471-2105-7-S2-S10 (2006). 12/12

work page doi:10.1186/1471-2105-7-s2-s10 2006