
arxiv: 2605.17979 · v1 · pith:SXPCKVMCnew · submitted 2026-05-18 · 💰 econ.EM

Comment on Scientific production in the era of large language models

Pith reviewed 2026-05-20 00:19 UTC · model grok-4.3

classification 💰 econ.EM
keywords: LLM adoption, event study, mechanical bias, preprint output, detection threshold, replication, treatment timing, scientific productivity

The pith

The rule dating LLM adoption to the first month with a flagged abstract mechanically selects high-output months and generates spurious positive event-study results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that dating the start of large language model use by the first month in which at least one abstract exceeds a detection threshold creates a built-in connection to submission volume. Months with more papers have a higher chance of containing at least one flagged entry, so the identified adoption dates fall disproportionately in high-activity periods. As a result, an event study around these dates tends to show rising output afterward even if the models themselves have no effect, because earlier months are chosen from periods with no prior flag. Simulations that assume independent productivity and zero causal impact still recover the positive post-event path, and the same pattern appears when the original stacked event study is replicated with several placebo flagging rules.

Core claim

The treatment-timing rule is mechanically related to output. The probability that at least one paper is flagged in a month increases with the number of papers submitted in that month, so detected-adoption months are disproportionately high-output months. An event study centered on first detection can therefore display positive post-event dynamics even when the flagging rule contains no information about true LLM adoption, because the omitted pre-treatment period is selected from months with no prior detection. This pattern appears in a simulation with i.i.d. productivity and no causal effect, and the stacked event study replicates the same positive post-treatment path under random paper flags.
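A minimal numerical sketch of the volume dependence described above. It assumes each paper is flagged independently with the same probability; the value `P_FLAG = 0.05` and the function name are illustrative choices, not values taken from the paper.

```python
def prob_at_least_one_flag(n_papers: int, p: float) -> float:
    """P(at least one of n independent papers is flagged), each with probability p."""
    return 1.0 - (1.0 - p) ** n_papers

# Hypothetical per-paper flag probability, chosen only for illustration.
P_FLAG = 0.05

# Detection probability for months with 0 to 5 submitted papers.
flag_probs = {n: prob_at_least_one_flag(n, P_FLAG) for n in range(0, 6)}
for n, q in flag_probs.items():
    print(f"papers={n}  P(at least one flag)={q:.4f}")
```

Under this assumption the detection probability is zero for an empty month and rises with every additional paper, which is the channel through which first-detection months end up being high-output months.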

What carries the argument

First-detection timing rule that marks adoption in the month an abstract first exceeds the LLM-detection threshold

If this is right

  • Event studies that use first flagging as the adoption date will show positive post-treatment output trends even without any real LLM effect.
  • Placebo versions that assign flags at random or use neutral keywords produce similarly positive post-treatment coefficients.
  • The i.i.d. simulation isolates the volume-dependent detection probability as sufficient to generate the observed dynamics.
  • The stacked event-study estimates remain positive across multiple neutral flagging constructions.
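The selection mechanism in these bullets can be sketched with a small null simulation. This is not the paper's design: the paper reports stacked-PPML coefficients, whereas the sketch below only compares raw mean output by event time. The Poisson output process and all parameter values (`lam`, `p`, `n_authors`, `window`) are assumptions made for illustration.

```python
import numpy as np

def simulate_event_time_means(n_authors=20_000, n_months=30, lam=1.0,
                              p=0.2, window=3, seed=0):
    """Mean output by event time k around first detection, under a pure null."""
    rng = np.random.default_rng(seed)
    # i.i.d. monthly output; the Poisson form and lam are assumptions.
    papers = rng.poisson(lam, size=(n_authors, n_months))
    # Each paper is flagged independently with probability p (no causal effect).
    flagged = rng.binomial(papers, p) > 0
    ever_flagged = flagged.any(axis=1)
    first = flagged.argmax(axis=1)  # index of the first flagged month, if any
    # Keep authors whose full event window lies inside the panel.
    keep = ever_flagged & (first >= window) & (first < n_months - window)
    rows = np.flatnonzero(keep)
    return {k: float(papers[rows, first[rows] + k].mean())
            for k in range(-window, window + 1)}

means = simulate_event_time_means()
for k, m in means.items():
    print(f"k={k:+d}  mean papers={m:.3f}")
```

With these illustrative settings, months before first detection are conditioned on containing no flag, so their mean output sits below the unconditional mean, the detection month itself is the highest, and later months return to the unconditional mean. Measured against the k = -1 reference month, post-detection output therefore looks elevated even though the simulated flags carry no causal content.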

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Any study that dates technology adoption by the first crossing of a detection threshold may inherit similar selection bias when the crossing probability rises with activity volume.
  • Alternative timing rules based on direct surveys or usage logs could be compared against the detection rule to separate mechanical selection from genuine productivity changes.
  • The same volume-dependent selection issue could appear in other count-based settings where treatment begins at the first rare event.

Load-bearing premise

Monthly paper submissions are independent draws whose only connection to the detection indicator runs through the higher chance of at least one flag when volume is larger.

What would settle it

A direct check showing that the probability of at least one flagged abstract is unrelated to the total number of submissions in a month would disprove the mechanical link.
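One way such a check could be tabulated is sketched below. The input layout (one `(n_papers, any_flagged)` pair per author-month) and the function name are hypothetical, and the toy records are made up for illustration; they are not data from the paper or its replication package.

```python
from collections import defaultdict

def flag_share_by_volume(author_months):
    """Share of author-months with at least one flag, grouped by paper count.

    author_months: iterable of (n_papers, any_flagged) pairs (hypothetical layout).
    """
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for n_papers, any_flagged in author_months:
        totals[n_papers] += 1
        flagged[n_papers] += int(any_flagged)
    return {n: flagged[n] / totals[n] for n in sorted(totals)}

# Toy records for illustration only.
toy = [(1, False), (1, False), (1, True), (1, False),
       (2, True), (2, False), (3, True), (3, True)]
shares = flag_share_by_volume(toy)
print(shares)
```

A profile that stays flat across volume bins would count against the mechanical link, while a share that rises with the number of papers is what the argument predicts.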

Figures

Figures reproduced from arXiv: 2605.17979 by Antonin Bergeaud, Clément Bosquet, Thomas Renault.

Figure 1: Reproduction of Figure 1A from the replication package of Kusumegi et al. (2025), with the coefficient for treatment month k = 0 plotted rather than omitted. The visible jump at k = 0 shows directly that treatment timing is associated with unusually high output in the treatment month itself.
Figure 2: Simulation of the identification problem. Average stacked-PPML event-study coefficients across 1,000 simulations with 100,000 authors and i.i.d. productivity. Orange points reflect first-detection timing, whereas purple points are based on timing assigned randomly and independently of output. Error bars denote Monte Carlo 95% confidence intervals for the mean coefficient at each event time, but are too sma…
Figure 3: Replication and placebo event studies. All panels report stacked-PPML event-study coefficients $\hat{\gamma}_k$, with 95% confidence intervals. Panel A is the baseline replication; panels B–D are placebo exercises that replace the LLM detector with random paper-level flags, neutral keyword flags, and an earlier observation window, respectively.

Discussion. We do not argue that LLMs cannot affect scientific productivity,…
Original abstract

Kusumegi et al. (2025) study whether researchers' preprint output rises after adopting large language models (LLMs), dating adoption as the first month in which at least one submitted abstract exceeds an LLM-detection threshold. We show that this treatment-timing rule is mechanically related to output. The probability that at least one paper is flagged in a month is increasing in the number of papers submitted in that month, so detected-adoption months are disproportionately high-output months. An event study centered on first detection can therefore display positive post-event dynamics even when the flagging rule contains no information about true LLM adoption, because the omitted pre-treatment period is selected from months with no prior detection. We demonstrate this in a simulation: with i.i.d. productivity and no causal effect, first-detection timing generates a spurious positive post-treatment path. We also replicate the stacked event study of Kusumegi et al. (2025) and show that three placebo exercises (random paper-level assignment, neutral keyword flags, and a pre-ChatGPT observation window) each produce a similarly positive post-treatment pattern.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The manuscript claims that Kusumegi et al. (2025) date LLM adoption to the first month in which at least one preprint abstract exceeds an LLM-detection threshold, but this rule is mechanically tied to monthly submission volume. Because the probability of at least one flag equals 1-(1-p)^n and rises with n, detected-adoption months are disproportionately high-output months. Consequently, an event study centered on first detection can produce positive post-treatment output dynamics even when the flagging rule carries no information about true adoption and there is no causal effect. The authors demonstrate the artifact in an i.i.d. productivity simulation and replicate the original stacked event study with three placebo constructions (random paper-level flags, neutral-keyword flags, and a pre-ChatGPT window), each recovering a similar positive post-treatment path.

Significance. If the central mechanical claim holds, the paper supplies a clear, replicable caution for any study that uses threshold-based detection to time technology adoption or similar events whose detection probability scales with activity volume. The simulation isolates the selection channel under explicitly stated i.i.d. assumptions, and the placebo replications show that the artifact emerges without requiring the full data-generating process of the original study. This strengthens the critique and offers a template for diagnosing analogous timing biases in productivity research.

minor comments (1)
  1. The simulation section would benefit from a brief statement of the exact number of Monte Carlo replications and the precise value of p used to generate the flags.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their careful reading of the manuscript and for recommending acceptance. The referee's summary accurately captures our central argument and the design of the simulation and placebo exercises.

Circularity Check

0 steps flagged

No significant circularity


The paper's core demonstration follows from the general probabilistic identity that P(at least one detection) = 1 - (1-p)^n is strictly increasing in monthly submissions n for any fixed 0 < p < 1, combined with an explicitly i.i.d. simulation under a null of no causal effect and three placebo replications that recover the same artifact. These steps rest on stated assumptions that are independent of the target paper's fitted values or data-generating process, and they do not reduce any claimed result to a self-referential fit or self-citation chain.
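The monotonicity invoked here can be written out in one line. The notation q(n) for the detection probability is introduced only for this sketch, and it assumes papers are flagged independently with a common probability p strictly between 0 and 1.

```latex
% q(n): probability that at least one of n papers is flagged (notation assumed here)
q(n) = 1 - (1-p)^{n}, \qquad 0 < p < 1,
\\[4pt]
q(n+1) - q(n) = (1-p)^{n} - (1-p)^{n+1} = p\,(1-p)^{n} > 0 .
```

The increment is strictly positive for every n, so each additional paper raises the chance that the month is recorded as a detection month; the increments shrink geometrically, so the effect is strongest at low volumes.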

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper relies on standard probabilistic reasoning about independent flagging and simulation assumptions rather than new fitted parameters or postulated entities.

axioms (1)
  • standard math: The probability of at least one abstract exceeding the LLM-detection threshold increases with the number of submissions in a month under standard independence assumptions.
    Invoked to establish that detected-adoption months are disproportionately high-output months.




Reference graph

Works this paper leans on

5 extracted references · 5 canonical work pages

  1. [1]

    Scientific production in the era of large language models

    Kusumegi, Keigo et al. (2025). “Scientific production in the era of large language models”. In: Science 390, pp. 1240–1243. doi:10.1126/science.adw3000. de Chaisemartin, Clément and Xavier D’Haultfœuille (2023). “Two-way fixed effects and differences-in-differences with heterogeneous treatment effects: a survey”. In: The Econometrics Journal 26.3, pp. C1–C...

  2. [2]

    (2025, SM S1.1)

    We exclude papers in core AI subfields (cs.CV, cs.LG, cs.AI, cs.IR, cs.CL) following Kusumegi et al. (2025, SM S1.1). The active-author sample for the main analysis requires at least four publications between 2018 and 2021; the observation window is January 2022 to June 2024 (30 months). For the pre-ChatGPT placebo, the activity filter requires at least f...

  3. [3]

    Placebo 1 (random assignment). Each paper published after December 2022 receives an independent Bernoulli(p) flag, drawn without reference to abstract content

    Placebo tests. All three placebo tests use the same stacked DiD specification as the main analysis; only the treatment indicator and, where applicable, the observation window differ. Placebo 1 (random assignment). Each paper published after December 2022 receives an independent Bernoulli(p) flag, drawn without reference to abstract content. A researcher’s 9...

  4. [4]

    Stacked DiD. We follow the stacked DiD design of Kusumegi et al. (2025). Each cohort is defined by the calendar month of first detection under the relevant rule. Never-treated authors are assigned pseudo-treatment months drawn uniformly from the post-introduction window: January 2023 to June 2024 in the main, random, and keyword analyses, and January 2021 ...

  5. [5]

    $k = -1$ (reference period). The month just before the first detection, by construction, contains no flagged paper

    Months in which a detection is observed are, on average, months with above-average output. • $k = -1$ (reference period). The month just before the first detection, by construction, contains no flagged paper. Its conditional distribution is weighted by $1 - q(y) = (1-p)^{y}$, which is decreasing in $y$. Therefore $E[y_{i,t_i^*-1}] = \frac{E[y(1-p)^{y}]}{E[(1-p)^{y}]} = \mu - \mathrm{Cov}(y, q(y))$ ...
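The reference-period ratio E[y(1-p)^y] / E[(1-p)^y] quoted in this excerpt can be evaluated numerically once a distribution for monthly output y is chosen. The sketch below assumes Poisson output, which the excerpt does not state; the values `LAM = 2.0` and `P_FLAG = 0.1` and the function name are likewise illustrative.

```python
import math

def mean_output_given_no_flag(lam: float, p: float, y_max: int = 200) -> float:
    """E[y (1-p)^y] / E[(1-p)^y] for y ~ Poisson(lam), by truncated summation."""
    num = 0.0
    den = 0.0
    pmf = math.exp(-lam)  # Poisson probability of y = 0
    for y in range(y_max + 1):
        weight = pmf * (1.0 - p) ** y  # P(y) times P(no flag | y papers)
        num += y * weight
        den += weight
        pmf *= lam / (y + 1)  # Poisson recurrence: P(y+1) = P(y) * lam / (y+1)
    return num / den

# Hypothetical parameters for illustration.
LAM, P_FLAG = 2.0, 0.1
ref_mean = mean_output_given_no_flag(LAM, P_FLAG)
print(f"unconditional mean={LAM:.3f}  mean given no flag={ref_mean:.3f}")
```

Under the Poisson assumption the ratio has the closed form lam * (1 - p), so the reference month's expected output falls below the unconditional mean by the factor (1 - p). That gap is what inflates post-detection coefficients measured against k = -1.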