arxiv: 2604.13061 · v2 · submitted 2026-03-18 · 💻 cs.CL · cs.AI

Recognition: 1 theorem link

· Lean Theorem

Token Statistics Reveal Conversational Drift in Multi-turn LLM Interaction

Wael Hafez , Amir Nazeri

Authors on Pith no claims yet

Pith reviewed 2026-05-15 09:16 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords conversational driftLLM interactionstoken statisticsstructural consistencyBipredictabilityInformation Digital Twinmulti-turn evaluationdrift detection

0 comments

The pith

Token frequency statistics can monitor conversational structural consistency in multi-turn LLM interactions without embeddings or model internals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that structural consistency in conversations with large language models can be tracked using only token frequency statistics. It introduces Bipredictability as a measure of shared predictability between context, response, and next prompt relative to overall uncertainty. Implemented in the Information Digital Twin, this approach was tested on over 4,500 turns across various conditions and models. It aligned with structural consistency in 85 percent of cases and detected all contradictions, shifts, and non-sequiturs with complete sensitivity, while aligning with semantic quality in only 44 percent. This indicates that interaction reliability requires separate monitoring of structure beyond individual response quality.

Core claim

Conversational structural consistency can be monitored directly from token frequency statistics, without embeddings, auxiliary evaluators or access to model internals. We formalize this signal as Bipredictability, P, which measures shared predictability across the context, response, next prompt loop relative to the turn total uncertainty, and implement it in a lightweight auxiliary architecture, the Information Digital Twin, IDT. Across 4,574 conversational turns spanning 34 conditions, one student model and three frontier teacher models, P established a stable runtime baseline, aligned with structural consistency in 85 percent of conditions but with semantic quality in only 44 percent, and

What carries the argument

Bipredictability, a ratio of shared predictability in the conversation loop to total turn uncertainty, implemented through the Information Digital Twin auxiliary architecture to detect drift from token statistics.

If this is right

Structural consistency in LLM conversations cannot be reduced to semantic response quality alone.
Token stream monitoring provides a lightweight complement to semantic evaluation in deployment.
The Information Digital Twin detects contradictions, topic shifts, and non-sequiturs with 100 percent sensitivity.
Consistency monitoring should be integrated into extended LLM interactions for better reliability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Simple statistical methods like this could reduce reliance on complex embedding-based evaluators for consistency checks.
Real-time application in chat interfaces might allow automatic flagging of drifting conversations.
Further testing on more diverse models and longer conversations would help confirm generalization.

Load-bearing premise

Token frequency statistics alone are sufficient to capture structural consistency independently of the semantic content of the conversation.

What would settle it

A multi-turn conversation containing a clear contradiction or topic shift in which the Bipredictability measure remains at its stable baseline level, indicating no drift was detected.

Figures

Figures reproduced from arXiv: 2604.13061 by Amir Nazeri, Wael Hafez.

**Figure 3.** Figure 3: Information Digital Twin (IDT) architecture for real-time interaction monitoring. One IDT instance is maintained per active conversation. Operating on the observable interaction stream (𝑆 , 𝐴, 𝑆 ′ ), the IDT estimates Bipredictability 𝑃 and predictive asymmetry Δ𝐻, compares them with a learned baseline, and flags significant deviations in coupling. Dashed components indicate architecturally specified but e… view at source ↗

read the original abstract

Large language models, LLMs, are increasingly deployed in multiturn settings where earlier responses shape later ones, making reliability dependent on whether a conversation remains consistent over time. When this consistency degrades undetected, downstream decisions lose their grounding in the exchange that produced them. Yet current evaluation methods assess isolated outputs rather than the interaction producing them. Here we show that conversational structural consistency can be monitored directly from token frequency statistics, without embeddings, auxiliary evaluators or access to model internals. We formalize this signal as Bipredictability, P, which measures shared predictability across the context, response, next prompt loop relative to the turn total uncertainty, and implement it in a lightweight auxiliary architecture, the Information Digital Twin, IDT. Across 4,574 conversational turns spanning 34 conditions, one student model and three frontier teacher models, P established a stable runtime baseline, aligned with structural consistency in 85 percent of conditions but with semantic quality in only 44 percent, and the IDT detected all tested contradictions, topic shifts and non-sequiturs with 100 percent sensitivity. These results show that reliability in extended LLM interaction cannot be reduced to response quality alone, and that structural monitoring from the observable token stream can complement semantic evaluation in deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Token frequency stats can flag structural drift in LLM chats via Bipredictability, but the formula and generalization tests are missing so the 100% sensitivity claim stays narrow.

read the letter

The main thing here is that token frequency statistics alone can monitor structural consistency in multi-turn LLM interactions through a measure called Bipredictability, implemented in their Information Digital Twin. It aligned with structural consistency in 85% of conditions and caught every tested contradiction, topic shift, and non-sequitur with 100% sensitivity across their runs, while aligning with semantic quality only 44% of the time. That separation is the useful signal they are highlighting. What is new is the specific formalization of Bipredictability as shared predictability in the context-response-next prompt loop relative to total turn uncertainty. The paper does well by staying strictly model-agnostic and using only observable tokens, with no embeddings or model internals required. They tested it on 4,574 turns spanning 34 conditions and four models, including one student model and three frontier ones, which gives the empirical part some reach. The soft spots are clear and proportionate. The description supplies no derivation or explicit formula for turning token frequencies into the P measure, no error bars on the alignment numbers, and no cross-validation or hold-out results. Generalization beyond those exact conditions and tokenizers is not shown, so the perfect sensitivity may not hold when vocabulary or drift patterns shift. The circularity concern is reasonable until the math is laid out. This paper is for engineers and researchers working on reliable multi-turn conversational systems who want lightweight, observable monitoring rather than heavy semantic evaluators. Readers focused on practical deployment checks will find the core idea worth considering. It deserves a serious referee because the problem is real, the approach is straightforward, and the empirical hook is there even if the method needs more detail and broader validation. I would recommend sending it to peer review with requests for the full derivation of Bipredictability and additional tests on new models and conditions.

Referee Report

3 major / 1 minor

Summary. The paper claims that token frequency statistics alone suffice to monitor conversational structural consistency in multi-turn LLM interactions via a new measure called Bipredictability (P), which quantifies shared predictability across the context-response-next-prompt loop relative to total turn uncertainty; this is implemented in a lightweight Information Digital Twin (IDT) architecture. Across 4,574 turns in 34 conditions with one student and three frontier models, P aligns with structural consistency in 85% of cases (but only 44% with semantic quality) and the IDT detects all tested contradictions, topic shifts, and non-sequiturs with 100% sensitivity.

Significance. If the central empirical claims hold after formalization, the work would demonstrate a lightweight, embedding-free, and model-internal-free signal for detecting structural drift in extended LLM conversations, providing a practical complement to semantic evaluation methods in deployment settings.

major comments (3)

[Abstract] Abstract: Bipredictability P is introduced only descriptively as measuring 'shared predictability across the context, response, next prompt loop relative to the turn total uncertainty,' with no explicit formula, derivation from token frequencies, or conversion procedure supplied; this absence directly undermines evaluation of the reported 85% alignment and 100% sensitivity figures and leaves open the circularity concern that P may be partly determined by the same data used for validation.
[Results] Empirical evaluation (results section): the 100% sensitivity and 85% structural alignment claims rest on 34 conditions and four models with no cross-validation, hold-out conditions, error bars, or tests on new tokenizers/vocabularies; without these, the generalization of P beyond the specific 4,574 turns cannot be assessed and the central claim that token statistics suffice independently of semantic content remains unsupported.
[Methods] Methods: no exclusion criteria, statistical tests, or definition of how raw token frequencies are aggregated into the P measure are provided, so it is impossible to determine whether the IDT's perfect detection of contradictions/topic shifts/non-sequiturs is independent of post-hoc choices or data-dependent normalization.

minor comments (1)

[Abstract] Abstract: the phrase 'one student model and three frontier teacher models' is used without naming the specific models or tokenizers, reducing reproducibility.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below, indicating revisions made to strengthen the manuscript while maintaining the original scope and claims.

read point-by-point responses

Referee: [Abstract] Abstract: Bipredictability P is introduced only descriptively as measuring 'shared predictability across the context, response, next prompt loop relative to the turn total uncertainty,' with no explicit formula, derivation from token frequencies, or conversion procedure supplied; this absence directly undermines evaluation of the reported 85% alignment and 100% sensitivity figures and leaves open the circularity concern that P may be partly determined by the same data used for validation.

Authors: The abstract is intentionally concise, but the full manuscript derives P explicitly in the Methods from token frequency counts: raw frequencies are converted to empirical probabilities per segment, mutual information terms are estimated via co-occurrence counts across the context-response-next-prompt loop, and P is the normalized ratio of summed predictability to total turn entropy. We have revised the abstract to include this compact formula and a reference to the derivation. Structural labels for validation were assigned by independent human review of conversation consistency and were not used in computing P, eliminating circularity. revision: yes
Referee: [Results] Empirical evaluation (results section): the 100% sensitivity and 85% structural alignment claims rest on 34 conditions and four models with no cross-validation, hold-out conditions, error bars, or tests on new tokenizers/vocabularies; without these, the generalization of P beyond the specific 4,574 turns cannot be assessed and the central claim that token statistics suffice independently of semantic content remains unsupported.

Authors: The 44% alignment with semantic quality versus 85% with structure provides direct evidence that P captures structural properties independently of semantics. We have added bootstrap-derived error bars to the reported alignment percentages and a limitations paragraph noting the exploratory design across the 34 conditions. Formal cross-validation or hold-out testing on new vocabularies lies outside the current controlled experimental scope. revision: partial
Referee: [Methods] Methods: no exclusion criteria, statistical tests, or definition of how raw token frequencies are aggregated into the P measure are provided, so it is impossible to determine whether the IDT's perfect detection of contradictions/topic shifts/non-sequiturs is independent of post-hoc choices or data-dependent normalization.

Authors: The revised Methods section now specifies: token frequencies are tallied separately for context, response, and next-prompt segments; probabilities are obtained by length normalization; P is computed as the average pointwise predictability across the loop divided by turn entropy. Turns with fewer than 10 tokens were excluded to ensure stable frequency estimates. Proportion tests for the alignment rates and sensitivity are now reported, confirming the detection results are not driven by post-hoc normalization choices. revision: yes

standing simulated objections not resolved

Absence of cross-validation or evaluation on entirely new tokenizers and vocabularies outside the four models tested; expanding to such cases would require new data collection beyond the scope of this revision.

Circularity Check

1 steps flagged

Bipredictability P defined relative to turn total uncertainty reduces detection to a normalized statistic on the same data

specific steps

self definitional [Abstract (formalization of Bipredictability)]
"We formalize this signal as Bipredictability, P, which measures shared predictability across the context, response, next prompt loop relative to the turn total uncertainty, and implement it in a lightweight auxiliary architecture, the Information Digital Twin, IDT. ... the IDT detected all tested contradictions, topic shifts and non-sequiturs with 100 percent sensitivity."

P is defined as shared predictability relative to the turn total uncertainty computed from the identical token statistics of the tested turns. The subsequent claim that this P (via IDT) detects all inconsistencies with 100% sensitivity therefore reduces to a property of the relative normalization on the same data, rather than an independent derivation or prediction.

full rationale

The paper's core derivation formalizes Bipredictability P directly from token frequency statistics as a relative measure to turn total uncertainty, then reports 100% sensitivity on the identical 4,574 turns and 34 conditions used to compute those statistics. This creates a self-definitional loop where the 'prediction' of structural consistency is constructed from the same inputs that define P, with no independent hold-out or external benchmark separating the measure from its validation set. The 85% alignment claim therefore follows from the normalization rather than an independent test of the signal.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The claim rests on the unproven domain assumption that token frequencies encode structural consistency separately from semantics, plus two newly introduced entities whose independent evidence is absent from the abstract.

axioms (1)

domain assumption Token frequency statistics capture structural consistency independently of semantic content.
Invoked to justify monitoring without embeddings or model internals.

invented entities (2)

Bipredictability (P) no independent evidence
purpose: Quantify shared predictability across context-response-next-prompt loop relative to total uncertainty.
New formal signal introduced to operationalize the monitoring claim.
Information Digital Twin (IDT) no independent evidence
purpose: Lightweight auxiliary architecture implementing Bipredictability detection.
New implementation vehicle for the proposed signal.

pith-pipeline@v0.9.0 · 5514 in / 1443 out tokens · 46407 ms · 2026-05-15T09:16:20.449335+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Bipredictability (P) ... P = MI(S, A; S') / {H(S) + H(A) + H(S')}

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 6 internal anchors

[1]

Science 384, 842–845 (2024)

Bengio, Y., Hinton, G., Yao, A., Song, D., Abbeel, P., et al., Managing extreme AI risks amid rapid progress. Science 384, 842–845 (2024)

work page 2024
[2]

On the Opportunities and Risks of Foundation Models

Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arber, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E. et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[3]

& Glass, J

Chuang, Y.-S., Qiu, L., Hsieh, C.-Y., Krishna, R., Kim, Y. & Glass, J. Lookback lens: detecting and mitigating contextual hallucinations in large language models using only attention maps. In Proc. Conference on Empirical Methods in Natural Language Processing (EMNLP) 1419–1436 (2024)

work page 2024
[4]

Conant, R. C. & Ashby, W. R. Every good regulator of a system must be a model of that system. International Journal of Systems Science 1, 89–97 (1970). 12

work page 1970
[5]

Cover, T. M. & Thomas, J. A. Elements of Information Theory 2nd edn (Wiley-Interscience, 2006)

work page 2006
[6]

& Gal, Y

Farquhar, S., Kossen, J., Kuhn, L. & Gal, Y . Detecting hallucinations in large language models using semantic entropy. Nature 630, 625–630 (2024)

work page 2024
[7]

Gu, J., Ye, W., Zheng, Z., Wang, Y. & Li, Y. A survey on LLM-as-a-judge. arXiv preprint arXiv:2411.15594 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[8]

& Reid, C

Hafez, W., Wei, C., Felipe, R., Nazeri, A. & Reid, C. A mathematical theory of agency and intelligence. arXiv preprint arXiv:2602.22519 (2026)

work page arXiv 2026
[9]

& Steinhardt, J

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. & Steinhardt, J. Measuring massive multitask language understanding. In Proc. International Conference on Learning Representations (ICLR) (2021)

work page 2021
[10]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Inan, H., Upasani, K., Chi, J., Rungta, R., Iyer, K., Mao, Y., Tontchev, M., Hu, Q., Fuller, B., Testuggine, D. & Khabsa, M. Llama Guard: LLM-based input-output safeguard for human-AI conversations. arXiv preprint arXiv:2312.06674 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

J., Madotto, A

Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y. J., Madotto, A. & Fung, P. Survey of hallucination in natural language generation. ACM Computing Surveys 55, 1–38 (2023)

work page 2023
[12]

Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., Schiefer, N., Hatfield-Dodds, Z., DaSilva, N., Elhage, N. et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[13]

& Farquhar, S

Kuhn, L., Gal, Y . & Farquhar, S. Semantic uncertainty: linguistic invariances for uncertainty estimation in natural language generation. In Proc. International Conference on Learning Representations (ICLR) (2023)

work page 2023
[14]

LLMs Get Lost In Multi-Turn Conversation

Laban, P., Hayashi, H., Zhou, Y. & Neville, J. LLMs get lost in multi-turn conversation. arXiv preprint arXiv:2505.06120 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

Li, Q., Dou, S., Shao, K., Chen, C. & Hu, H. Evaluating scoring bias in LLM-as-a-judge. arXiv preprint arXiv:2506.22316 (2025a)

work page arXiv
[16]

Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models

Li, Y., Shen, X., Yao, X., Ding, X., Miao, Y., Krishnan, R. & Padman, R. Beyond single-turn: a survey on multi- turn interactions with large language models. arXiv preprint arXiv:2504.04717 (2025b)

work page internal anchor Pith review Pith/arXiv arXiv
[17]

Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y ., Narayanan, D., Wu, Y ., Kumar, A. et al. Holistic evaluation of language models. Transactions on Machine Learning Research (2023)

work page 2023
[18]

& Gales, M

Malinin, A. & Gales, M. Uncertainty estimation in autoregressive structured prediction. In Proc. International Conference on Learning Representations (ICLR) (2021)

work page 2021
[19]

& Gales, M

Manakul, P., Liusie, A. & Gales, M. SelfCheckGPT: zero-resource black-box hallucination detection for generative large language models. In Proc. Conference on Empirical Methods in Natural Language Processing (EMNLP) 9004–9017 (2023)

work page 2023
[20]

& Cotterell, R

Meister, C. & Cotterell, R. Language model evaluation beyond perplexity. In Proc. Association for Computational Linguistics (ACL) 5328–5339 (2021)

work page 2021
[21]

& Topol, E

Meskó, B. & Topol, E. J. The imperative for regulatory oversight of large language models (or generative AI) in healthcare. npj Digital Medicine 6, 120 (2023)

work page 2023
[22]

Artificial intelligence risk management framework: generative artificial intelligence profile

National Institute of Standards and Technology. Artificial intelligence risk management framework: generative artificial intelligence profile. NIST AI 600-1 (2024)

work page 2024
[23]

& Cohen, J

Rebedea, T., Dinu, R., Sreedhar, M., Parisien, C. & Cohen, J. NeMo Guardrails: a toolkit for controllable and safe LLM applications with programmable rails. arXiv preprint arXiv:2310.10501 (2023)

work page arXiv 2023
[24]

& Gurevych, I

Reimers, N. & Gurevych, I. Sentence-BERT: sentence embeddings using Siamese BERT-networks. In Proc. Conference on Empirical Methods in Natural Language Processing (EMNLP) 3982–3992 (2019)

work page 2019
[25]

Shannon, C. E. A mathematical theory of communication. Bell System Technical Journal 27, 379–423 (1948)

work page 1948
[26]

N., Kaiser, Ł

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł. & Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems 30 (2017)

work page 2017
[27]

Cybernetics: or Control and Communication in the Animal and the Machine (MIT Press, 1948)

Wiener, N. Cybernetics: or Control and Communication in the Animal and the Machine (MIT Press, 1948). 13

work page 1948
[28]

& Wang, W

Xiao, Y . & Wang, W. Y . On hallucination and predictive uncertainty in conditional language generation. arXiv preprint arXiv:2103.15025 (2021)

work page arXiv 2021
[29]

In Advances in Neural Information Processing Systems (NeurIPS) 36 (2023)

Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, et al., Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems (NeurIPS) 36 (2023)

work page 2023