Segmenting Human-LLM Co-authored Text via Change Point Detection
Pith reviewed 2026-05-07 16:31 UTC · model grok-4.3
The pith
Treating sequences of LLM detector scores as time series allows change point detection to identify segments written by humans versus large language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Segmenting human-LLM co-authored text reduces to locating change points in a sequence of detection scores; the proposed weighted and generalized change point algorithms accommodate heterogeneous variability in those scores and achieve minimax optimality.
What carries the argument
Change point detection adapted to LLM detection scores, including weighted and generalized algorithms for score variability.
Load-bearing premise
That LLM detector scores form a time series where authorship changes create detectable statistical shifts that can be isolated by change point methods despite varying score noise levels.
What would settle it
A test document with known human-LLM switch points on which the change point algorithms localize the true boundaries no better than chance or than existing methods.
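The settling experiment can be sketched with a minimal (unweighted) CUSUM localizer on synthetic detector scores. Everything here is hypothetical and illustrative: the segment means, noise level, and switch point are invented, and this is plain single-change CUSUM, not the paper's weighted or generalized algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-sentence detector scores: 60 human-authored sentences
# with a low mean score, then 40 LLM-authored sentences with a high mean.
# tau = 60 is the true (known) change point.
tau, n = 60, 100
scores = np.concatenate([
    rng.normal(0.2, 0.1, tau),      # human-authored segment
    rng.normal(0.8, 0.1, n - tau),  # LLM-authored segment
])

def cusum_changepoint(y):
    """Return the split index maximizing the standard CUSUM statistic."""
    n = len(y)
    best_b, best_stat = None, -np.inf
    for b in range(1, n):  # candidate split: y[:b] vs. y[b:]
        left, right = y[:b], y[b:]
        stat = np.sqrt(b * (n - b) / n) * abs(left.mean() - right.mean())
        if stat > best_stat:
            best_b, best_stat = b, stat
    return best_b

print(cusum_changepoint(scores))  # close to the true change point 60
```

With a jump this large relative to the noise, the estimate should land at or very near the true boundary; the claim would be in trouble if, on real co-authored documents, the localization error were no better than chance.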
read the original abstract
The rise of large language models (LLMs) has created an urgent need to distinguish between human-written and LLM-generated text to ensure authenticity and societal trust. Existing detectors typically provide a binary classification for an entire passage; however, this is insufficient for human-LLM co-authored text, where the objective is to localize specific segments authored by humans or LLMs. To bridge this gap, we propose algorithms to segment text into human- and LLM-authored pieces. Our key observation is that such a segmentation task is conceptually similar to classical change point detection in time-series analysis. Leveraging this analogy, we adapt change point detection to LLM-generated text detection, develop a weighted algorithm and a generalized algorithm to accommodate heterogeneous detection score variability, and establish the minimax optimality of our procedure. Empirically, we demonstrate the strong performance of our approach against a wide range of existing baselines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that segmenting human-LLM co-authored text can be achieved by adapting classical change point detection (CPD) to per-token or per-sentence LLM detector scores treated as a time series. It introduces a weighted algorithm and a generalized algorithm to handle heterogeneous score variability, derives a minimax optimality result for the procedure, and reports strong empirical performance relative to existing baselines.
Significance. If the modeling assumptions and derivations hold, the work supplies a theoretically grounded method for localizing authorship boundaries in mixed documents, filling a gap left by binary whole-passage detectors. The minimax optimality result and the explicit handling of score heterogeneity via weighted/generalized estimators are notable strengths that could guide future detector design. The empirical comparisons, if robust, would demonstrate practical value for applications such as academic integrity checks.
major comments (2)
- [Theoretical analysis (minimax lower bound derivation)] The minimax optimality claim rests on modeling detector scores as a piecewise i.i.d. sequence whose only distributional shifts occur at authorship boundaries (see the change-point model setup and the derivation of the lower bound). This assumption is load-bearing: if real detector outputs exhibit autocorrelation or content-dependent variance (e.g., higher variance on technical versus narrative text), both the optimality guarantee and the claimed advantage of the weighted algorithm over vanilla CPD become void. The manuscript should add a diagnostic experiment quantifying autocorrelation and variance stationarity on held-out detector outputs.
- [Experimental evaluation] The abstract asserts superior empirical performance, yet the experimental protocol, data exclusion rules, and exact baseline implementations are not fully specified. This makes it impossible to verify whether post-hoc choices inflate the reported gains. Tables comparing methods should report confidence intervals or statistical tests rather than point estimates alone.
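The diagnostic requested in the first major comment could look like the following sketch: two summary statistics, lag-1 autocorrelation and a half-sample variance ratio, computed on a detector score sequence. The function names, thresholds, and synthetic i.i.d. scores are illustrative assumptions, not from the paper; on real held-out detector outputs, values far from 0 and 1 respectively would flag violations of the piecewise i.i.d. model.

```python
import numpy as np

def lag1_autocorr(x):
    """Sample lag-1 autocorrelation of a score sequence (~0 under i.i.d.)."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    return float(np.dot(xc[:-1], xc[1:]) / np.dot(xc, xc))

def variance_ratio(x):
    """Second-half over first-half sample variance (~1 if variance is stationary)."""
    x = np.asarray(x, dtype=float)
    h = len(x) // 2
    return float(x[h:].var(ddof=1) / x[:h].var(ddof=1))

rng = np.random.default_rng(1)
iid_scores = rng.normal(0.5, 0.1, 500)  # synthetic scores satisfying the model
print(round(lag1_autocorr(iid_scores), 3), round(variance_ratio(iid_scores), 3))
```

Running the same two statistics within putatively single-author segments of real documents would quantify how far actual detector outputs depart from the assumptions underlying the lower bound.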
minor comments (2)
- [Abstract] The abstract refers to 'a wide range of existing baselines' without naming them; listing the primary competitors (e.g., by citation) would improve clarity.
- [Method (weighted algorithm)] Notation for the weighted estimator should explicitly define how per-segment variance estimates are obtained and plugged into the weights to avoid ambiguity for readers implementing the method.
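To make the ambiguity concrete, here is one plausible reading of a variance-weighted CUSUM with plug-in weights w_i = 1/sigma_i^2; this is a sketch under assumptions, not the paper's exact estimator. Treating sigma_i as known is itself a simplification of exactly the kind the comment flags: in practice the per-segment variances must be estimated and plugged into the weights.

```python
import numpy as np

def weighted_cusum(y, sigma):
    """Weighted CUSUM split search with plug-in weights w_i = 1/sigma_i^2."""
    y = np.asarray(y, dtype=float)
    w = 1.0 / np.asarray(sigma, dtype=float) ** 2
    n = len(y)
    best_b, best_stat = None, -np.inf
    for b in range(1, n):
        s_l, s_r = w[:b].sum(), w[b:].sum()
        mu_l = np.dot(w[:b], y[:b]) / s_l   # weighted mean, left segment
        mu_r = np.dot(w[b:], y[b:]) / s_r   # weighted mean, right segment
        stat = np.sqrt(s_l * s_r / (s_l + s_r)) * abs(mu_l - mu_r)
        if stat > best_stat:
            best_b, best_stat = b, stat
    return best_b

rng = np.random.default_rng(2)
# Heteroscedastic toy data: 50 noisy "human" scores, then 50 cleaner
# "LLM" scores; the means, sigmas, and split are hypothetical.
sigma = np.concatenate([np.full(50, 0.3), np.full(50, 0.1)])
means = np.concatenate([np.full(50, 0.3), np.full(50, 0.7)])
y = rng.normal(means, sigma)
print(weighted_cusum(y, sigma))
```

Down-weighting the noisy segment is what distinguishes this from vanilla CUSUM; the revision should state explicitly how the sigma_i estimates feeding these weights are computed.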
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the theoretical assumptions and improve the transparency of our experimental results. We address each major comment below and indicate the revisions we will incorporate.
read point-by-point responses
-
Referee: The minimax optimality claim rests on modeling detector scores as a piecewise i.i.d. sequence whose only distributional shifts occur at authorship boundaries (see the change-point model setup and the derivation of the lower bound). This assumption is load-bearing: if real detector outputs exhibit autocorrelation or content-dependent variance (e.g., higher variance on technical versus narrative text), both the optimality guarantee and the claimed advantage of the weighted algorithm over vanilla CPD become void. The manuscript should add a diagnostic experiment quantifying autocorrelation and variance stationarity on held-out detector outputs.
Authors: We agree that the piecewise i.i.d. assumption is foundational to the minimax lower bound derivation, as is standard in change-point detection literature. Our weighted and generalized algorithms explicitly address heterogeneous variance, which mitigates some content-dependent effects. Nevertheless, we acknowledge that real detector outputs may exhibit autocorrelation or non-stationary variance. In the revision we will add a dedicated diagnostic subsection that computes autocorrelation functions and tests for variance stationarity on held-out detector score sequences from multiple datasets, thereby quantifying the extent to which the modeling assumptions hold in practice and discussing any resulting limitations. revision: yes
-
Referee: The abstract asserts superior empirical performance, yet the experimental protocol, data exclusion rules, and exact baseline implementations are not fully specified. This makes it impossible to verify whether post-hoc choices inflate the reported gains. Tables comparing methods should report confidence intervals or statistical tests rather than point estimates alone.
Authors: We appreciate the referee highlighting the need for greater experimental transparency. In the revised manuscript we will expand the experimental section to fully specify the protocol, including precise data exclusion criteria, preprocessing steps, and exact hyper-parameter choices for all baselines. We will also replace point estimates in the comparison tables with 95% confidence intervals obtained via bootstrapping and include statistical significance tests (paired t-tests or Wilcoxon signed-rank tests with appropriate multiple-comparison correction) to substantiate the reported performance differences. revision: yes
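The promised bootstrap intervals can be sketched as follows. The per-document localization errors are hypothetical placeholders, and the percentile method shown is one of several valid bootstrap CI constructions (the authors might equally use BCa or a basic bootstrap).

```python
import numpy as np

def bootstrap_ci(errors, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-document errors."""
    rng = np.random.default_rng(seed)
    errors = np.asarray(errors, dtype=float)
    means = np.array([
        rng.choice(errors, size=len(errors), replace=True).mean()
        for _ in range(n_boot)
    ])
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

# Hypothetical per-document boundary-localization errors for one method.
errs = [1, 0, 2, 1, 3, 0, 1, 2, 1, 0, 4, 1]
lo, hi = bootstrap_ci(errs)
print(round(lo, 2), round(hi, 2))
```

Reporting such intervals per method, plus a paired test on per-document differences, is what would let readers judge whether the gains over baselines exceed sampling noise.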
Circularity Check
No circularity; standard adaptation of CPD theory to new domain
full rationale
The derivation applies classical change point detection (CPD) to detector scores treated as a time series, with weighted/generalized variants to handle score heterogeneity and a minimax optimality result drawn from established CPD theory. No step reduces by construction to a fitted parameter defined on the same data, no self-citation chain bears the central claim, and the modeling assumptions (piecewise i.i.d. scores with authorship-induced shifts) are stated explicitly rather than smuggled in via prior self-work. Empirical comparisons to baselines provide independent validation outside the derivation.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: detection scores from existing LLM detectors form a sequence whose change points align with authorship transitions.