Segmenting Human-LLM Co-authored Text via Change Point Detection
Pith reviewed 2026-05-07 16:31 UTC · model grok-4.3
The pith
Treating sequences of LLM detector scores as time series allows change point detection to identify segments written by humans versus large language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Segmenting human-LLM co-authored text reduces to locating change points in a sequence of detection scores; the proposed weighted and generalized change point algorithms accommodate heterogeneous variability in those scores and achieve minimax optimality.
What carries the argument
Change point detection adapted to LLM detection scores, including weighted and generalized algorithms for score variability.
Load-bearing premise
That LLM detector scores form a time series where authorship changes create detectable statistical shifts that can be isolated by change point methods despite varying score noise levels.
What would settle it
A test document with known human-LLM switch points on which the change point algorithms localize the true boundaries no better than chance or than existing methods.
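The settling experiment can be sketched with a minimal (unweighted) CUSUM localizer on synthetic detector scores. Everything here is hypothetical and illustrative: the segment means, noise level, and switch point are invented, and this is plain single-change CUSUM, not the paper's weighted or generalized algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-sentence detector scores: 60 human-authored sentences
# with a low mean score, then 40 LLM-authored sentences with a high mean.
# tau = 60 is the true (known) change point.
tau, n = 60, 100
scores = np.concatenate([
    rng.normal(0.2, 0.1, tau),      # human-authored segment
    rng.normal(0.8, 0.1, n - tau),  # LLM-authored segment
])

def cusum_changepoint(y):
    """Return the split index maximizing the standard CUSUM statistic."""
    n = len(y)
    best_b, best_stat = None, -np.inf
    for b in range(1, n):  # candidate split: y[:b] vs. y[b:]
        left, right = y[:b], y[b:]
        stat = np.sqrt(b * (n - b) / n) * abs(left.mean() - right.mean())
        if stat > best_stat:
            best_b, best_stat = b, stat
    return best_b

print(cusum_changepoint(scores))  # close to the true change point 60
```

With a jump this large relative to the noise, the estimate should land at or very near the true boundary; the claim would be in trouble if, on real co-authored documents, the localization error were no better than chance.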
read the original abstract
The rise of large language models (LLMs) has created an urgent need to distinguish between human-written and LLM-generated text to ensure authenticity and societal trust. Existing detectors typically provide a binary classification for an entire passage; however, this is insufficient for human-LLM co-authored text, where the objective is to localize specific segments authored by humans or LLMs. To bridge this gap, we propose algorithms to segment text into human- and LLM-authored pieces. Our key observation is that such a segmentation task is conceptually similar to classical change point detection in time-series analysis. Leveraging this analogy, we adapt change point detection to LLM-generated text detection, develop a weighted algorithm and a generalized algorithm to accommodate heterogeneous detection score variability, and establish the minimax optimality of our procedure. Empirically, we demonstrate the strong performance of our approach against a wide range of existing baselines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that segmenting human-LLM co-authored text can be achieved by adapting classical change point detection (CPD) to per-token or per-sentence LLM detector scores treated as a time series. It introduces a weighted algorithm and a generalized algorithm to handle heterogeneous score variability, derives a minimax optimality result for the procedure, and reports strong empirical performance relative to existing baselines.
Significance. If the modeling assumptions and derivations hold, the work supplies a theoretically grounded method for localizing authorship boundaries in mixed documents, filling a gap left by binary whole-passage detectors. The minimax optimality result and the explicit handling of score heterogeneity via weighted/generalized estimators are notable strengths that could guide future detector design. The empirical comparisons, if robust, would demonstrate practical value for applications such as academic integrity checks.
major comments (2)
- [Theoretical analysis (minimax lower bound derivation)] The minimax optimality claim rests on modeling detector scores as a piecewise i.i.d. sequence whose only distributional shifts occur at authorship boundaries (see the change-point model setup and the derivation of the lower bound). This assumption is load-bearing: if real detector outputs exhibit autocorrelation or content-dependent variance (e.g., higher variance on technical versus narrative text), both the optimality guarantee and the claimed advantage of the weighted algorithm over vanilla CPD become void. The manuscript should add a diagnostic experiment quantifying autocorrelation and variance stationarity on held-out detector outputs.
- [Experimental evaluation] The abstract asserts superior empirical performance, yet the experimental protocol, data exclusion rules, and exact baseline implementations are not fully specified. This makes it impossible to verify whether post-hoc choices inflate the reported gains. Tables comparing methods should report confidence intervals or statistical tests rather than point estimates alone.
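The diagnostic requested in the first major comment could look like the following sketch: two summary statistics, lag-1 autocorrelation and a half-sample variance ratio, computed on a detector score sequence. The function names, thresholds, and synthetic i.i.d. scores are illustrative assumptions, not from the paper; on real held-out detector outputs, values far from 0 and 1 respectively would flag violations of the piecewise i.i.d. model.

```python
import numpy as np

def lag1_autocorr(x):
    """Sample lag-1 autocorrelation of a score sequence (~0 under i.i.d.)."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    return float(np.dot(xc[:-1], xc[1:]) / np.dot(xc, xc))

def variance_ratio(x):
    """Second-half over first-half sample variance (~1 if variance is stationary)."""
    x = np.asarray(x, dtype=float)
    h = len(x) // 2
    return float(x[h:].var(ddof=1) / x[:h].var(ddof=1))

rng = np.random.default_rng(1)
iid_scores = rng.normal(0.5, 0.1, 500)  # synthetic scores satisfying the model
print(round(lag1_autocorr(iid_scores), 3), round(variance_ratio(iid_scores), 3))
```

Running the same two statistics within putatively single-author segments of real documents would quantify how far actual detector outputs depart from the assumptions underlying the lower bound.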
minor comments (2)
- [Abstract] The abstract refers to 'a wide range of existing baselines' without naming them; listing the primary competitors (e.g., by citation) would improve clarity.
- [Method (weighted algorithm)] Notation for the weighted estimator should explicitly define how per-segment variance estimates are obtained and plugged into the weights to avoid ambiguity for readers implementing the method.
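To make the ambiguity concrete, here is one plausible reading of a variance-weighted CUSUM with plug-in weights w_i = 1/sigma_i^2; this is a sketch under assumptions, not the paper's exact estimator. Treating sigma_i as known is itself a simplification of exactly the kind the comment flags: in practice the per-segment variances must be estimated and plugged into the weights.

```python
import numpy as np

def weighted_cusum(y, sigma):
    """Weighted CUSUM split search with plug-in weights w_i = 1/sigma_i^2."""
    y = np.asarray(y, dtype=float)
    w = 1.0 / np.asarray(sigma, dtype=float) ** 2
    n = len(y)
    best_b, best_stat = None, -np.inf
    for b in range(1, n):
        s_l, s_r = w[:b].sum(), w[b:].sum()
        mu_l = np.dot(w[:b], y[:b]) / s_l   # weighted mean, left segment
        mu_r = np.dot(w[b:], y[b:]) / s_r   # weighted mean, right segment
        stat = np.sqrt(s_l * s_r / (s_l + s_r)) * abs(mu_l - mu_r)
        if stat > best_stat:
            best_b, best_stat = b, stat
    return best_b

rng = np.random.default_rng(2)
# Heteroscedastic toy data: 50 noisy "human" scores, then 50 cleaner
# "LLM" scores; the means, sigmas, and split are hypothetical.
sigma = np.concatenate([np.full(50, 0.3), np.full(50, 0.1)])
means = np.concatenate([np.full(50, 0.3), np.full(50, 0.7)])
y = rng.normal(means, sigma)
print(weighted_cusum(y, sigma))
```

Down-weighting the noisy segment is what distinguishes this from vanilla CUSUM; the revision should state explicitly how the sigma_i estimates feeding these weights are computed.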
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the theoretical assumptions and improve the transparency of our experimental results. We address each major comment below and indicate the revisions we will incorporate.
read point-by-point responses
-
Referee: The minimax optimality claim rests on modeling detector scores as a piecewise i.i.d. sequence whose only distributional shifts occur at authorship boundaries (see the change-point model setup and the derivation of the lower bound). This assumption is load-bearing: if real detector outputs exhibit autocorrelation or content-dependent variance (e.g., higher variance on technical versus narrative text), both the optimality guarantee and the claimed advantage of the weighted algorithm over vanilla CPD become void. The manuscript should add a diagnostic experiment quantifying autocorrelation and variance stationarity on held-out detector outputs.
Authors: We agree that the piecewise i.i.d. assumption is foundational to the minimax lower bound derivation, as is standard in change-point detection literature. Our weighted and generalized algorithms explicitly address heterogeneous variance, which mitigates some content-dependent effects. Nevertheless, we acknowledge that real detector outputs may exhibit autocorrelation or non-stationary variance. In the revision we will add a dedicated diagnostic subsection that computes autocorrelation functions and tests for variance stationarity on held-out detector score sequences from multiple datasets, thereby quantifying the extent to which the modeling assumptions hold in practice and discussing any resulting limitations. revision: yes
-
Referee: The abstract asserts superior empirical performance, yet the experimental protocol, data exclusion rules, and exact baseline implementations are not fully specified. This makes it impossible to verify whether post-hoc choices inflate the reported gains. Tables comparing methods should report confidence intervals or statistical tests rather than point estimates alone.
Authors: We appreciate the referee highlighting the need for greater experimental transparency. In the revised manuscript we will expand the experimental section to fully specify the protocol, including precise data exclusion criteria, preprocessing steps, and exact hyper-parameter choices for all baselines. We will also replace point estimates in the comparison tables with 95% confidence intervals obtained via bootstrapping and include statistical significance tests (paired t-tests or Wilcoxon signed-rank tests with appropriate multiple-comparison correction) to substantiate the reported performance differences. revision: yes
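The promised bootstrap intervals can be sketched as follows. The per-document localization errors are hypothetical placeholders, and the percentile method shown is one of several valid bootstrap CI constructions (the authors might equally use BCa or a basic bootstrap).

```python
import numpy as np

def bootstrap_ci(errors, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-document errors."""
    rng = np.random.default_rng(seed)
    errors = np.asarray(errors, dtype=float)
    means = np.array([
        rng.choice(errors, size=len(errors), replace=True).mean()
        for _ in range(n_boot)
    ])
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

# Hypothetical per-document boundary-localization errors for one method.
errs = [1, 0, 2, 1, 3, 0, 1, 2, 1, 0, 4, 1]
lo, hi = bootstrap_ci(errs)
print(round(lo, 2), round(hi, 2))
```

Reporting such intervals per method, plus a paired test on per-document differences, is what would let readers judge whether the gains over baselines exceed sampling noise.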
Circularity Check
No circularity; standard adaptation of CPD theory to new domain
full rationale
The derivation applies classical change point detection (CPD) to detector scores treated as a time series, with weighted/generalized variants to handle score heterogeneity and a minimax optimality result drawn from established CPD theory. No step reduces by construction to a fitted parameter defined on the same data, no self-citation chain bears the central claim, and the modeling assumptions (piecewise i.i.d. scores with authorship-induced shifts) are stated explicitly rather than smuggled in via prior self-work. Empirical comparisons to baselines provide independent validation outside the derivation.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: detection scores from existing LLM detectors form a sequence whose change points align with authorship transitions.