When Self-Reference Fails to Close: Matrix-Level Dynamics in Large Language Models

Ji Ho Bae

arXiv: 2604.12128 · v1 · submitted 2026-04-13 · cs.CL

Pith reviewed 2026-05-10 14:54 UTC · model grok-4.3

keywords: self-reference · non-closing truth recursion · attention effective rank · transformer matrix dynamics · prompt hierarchy · LLM instability · singular value decomposition

The pith

Non-closing truth recursion disrupts attention matrices in large language models more than stable self-reference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines internal matrix changes in transformers when prompts involve self-reference of different kinds. It finds that self-reference by itself does not produce instability, but prompts that set up truth-value questions with no finite resolution do. These non-closing cases drive higher attention effective rank and variance shifts across multiple models, with the changes visible at every layer examined. The pattern also correlates with more contradictory generated text. The work links the observations to classical questions about matrix semigroups and suggests that the recursion depth limit itself pushes the dynamics into unstable regimes.

Core claim

Prompts inducing non-closing truth recursion produce anomalously elevated attention effective rank and variance kurtosis, reaching Cohen's d of 3.14 to 3.52 versus stable self-reference in the 70B model, with 281 of 397 metric-model combinations showing significant separation after correction and per-layer SVD confirming disruption at all sampled depths.
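
The effect sizes quoted above are Cohen's d values. As a rough illustration of what such a number measures, the sketch below computes d with a pooled standard deviation. The pooled-variance form is an assumption, since this page does not say which variant the paper uses, and the sample values are invented for illustration only.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d using a pooled standard deviation (sample variances, n - 1)."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Hypothetical per-prompt effective-rank values (not taken from the paper).
nctr_rank = [12.1, 12.8, 13.0, 12.5, 13.4]
stable_rank = [10.2, 10.9, 10.5, 11.0, 10.4]
d = cohens_d(nctr_rank, stable_rank)
print(f"Cohen's d = {d:.2f}")
```

A d above 3 means the two group means sit more than three pooled standard deviations apart, so the two distributions barely overlap.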

What carries the argument

The 14-level prompt hierarchy that isolates non-closing truth recursion (NCTR) as truth computations without finite-depth closure, measured through 106 scalar metrics on attention and other matrices together with per-layer singular value decomposition.
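
This page does not reproduce the paper's formula for attention effective rank. A minimal sketch, assuming the common entropy-based definition (the exponential of the Shannon entropy of the normalized singular values), is shown below. The function name and the assumption that singular values have already been obtained from an SVD of an attention matrix are illustrative, not the author's code.

```python
import math


def effective_rank(singular_values):
    """Entropy-based effective rank: exp(H(p)) with p_i = s_i / sum(s)."""
    positive = [s for s in singular_values if s > 0]
    if not positive:
        raise ValueError("need at least one positive singular value")
    total = sum(positive)
    probs = [s / total for s in positive]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


# Hypothetical spectra: one concentrated, one dispersed.
concentrated = [9.0, 0.5, 0.3, 0.2]
dispersed = [2.6, 2.5, 2.5, 2.4]
print(effective_rank(concentrated), effective_rank(dispersed))
```

Under this definition a flat spectrum gives an effective rank equal to the number of singular values, while a spectrum dominated by one value gives a result close to 1. That matches the page's reading of elevated effective rank as global dispersion rather than concentration.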

If this is right

NCTR prompts increase contradictory output by 34 to 56 percentage points relative to controls.
Disruption registers at every sampled layer with effect sizes above 1.0 in the models tested.
43 of the 106 metrics replicate the NCTR distinction across all four models from three architecture families.
A simple classifier trained on the internal metrics reaches AUC values between 0.81 and 0.90.
Minimal prompt pairs differing only in recursion closure still separate on dozens of metrics.
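
For readers unfamiliar with the AUC figure in the list above, the sketch below computes it directly from its ranking interpretation: the probability that a randomly chosen positive example (here, an NCTR prompt) scores higher than a randomly chosen negative one. The scores and labels are hypothetical, and the paper's classifier is not described on this page.

```python
def roc_auc(scores, labels):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores; label 1 marks an NCTR prompt.
scores = [0.9, 0.2, 0.8, 0.1]
labels = [1, 1, 0, 0]
print(roc_auc(scores, labels))
```

An AUC of 0.5 corresponds to chance ranking and 1.0 to perfect separation, so values of 0.81 to 0.90 indicate useful but imperfect discrimination.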

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The reported link to matrix-semigroup problems implies that algebraic results on semigroup convergence might predict which recursion patterns will trigger collapse.
Architectures that explicitly track recursion depth during inference could avoid the elevated rank regimes observed here.
Training corpora containing many unresolved self-referential statements may embed similar dynamical sensitivities that surface at inference time.

Load-bearing premise

The 14-level hierarchy and NCTR definition correctly isolate the driver of the observed matrix changes without confounds from prompt wording or metric aggregation.

What would settle it

Finding no elevation in attention effective rank or loss of differentiation between NCTR and stable self-reference when the same metrics are applied to new models or alternative layer-wise decompositions.

Figures

Figures reproduced from arXiv: 2604.12128 by Ji Ho Bae.

**Figure 2.** Four-cluster comparison across four models. Each column shows one key metric; each row is one ...

**Figure 3.** NCTR effect size (C4 vs. C1) by model scale. Top 20 metrics by 70B ...

**Figure 4.** Per-layer Cohen’s d for attention effective rank. Red: C4 vs. C1; green: C4 vs. C2; orange: C4 vs. C3. NCTR elevates effective rank at every sampled layer (d > 1.0), while C4 vs. C3 remains below d = 0.7.

| Model | C4 (NCTR) | C1 (Control) | Difference |
|---|---|---|---|
| Qwen 8B | 62.5% | 15.0% | +47.5 pp |
| Gemma 9B | 58.8% | 2.5% | +56.3 pp |
| Llama 11B | 36.2% | 2.5% | +33.7 pp |
| Llama 70B | 51.2% | 2.5% | +48.7 pp |

Original abstract

We investigate how self-referential inputs alter the internal matrix dynamics of large language models. Measuring 106 scalar metrics across up to 7 analysis passes on four models from three architecture families -- Qwen3-VL-8B, Llama-3.2-11B, Llama-3.3-70B, and Gemma-2-9B -- over 300 prompts in a 14-level hierarchy at three temperatures ($T \in \{0.0, 0.3, 0.7\}$), we find that self-reference alone is not destabilizing: grounded self-referential statements and meta-cognitive prompts are markedly more stable than paradoxical self-reference on key collapse-related metrics, and on several such metrics can be as stable as factual controls. Instability concentrates in prompts inducing non-closing truth recursion (NCTR) -- truth-value computations with no finite-depth resolution. NCTR prompts produce anomalously elevated attention effective rank -- indicating attention reorganization with global dispersion rather than simple concentration collapse -- and key metrics reach Cohen's $d = 3.14$ (attention effective rank) to $3.52$ (variance kurtosis) vs. stable self-reference in the 70B model; 281/397 metric-model combinations differentiate NCTR from stable self-reference after FDR correction ($q < 0.05$), 198 with $|d| > 0.8$. Per-layer SVD confirms disruption at every sampled layer ($d > +1.0$ in all three models analyzed), ruling out aggregation artifacts. A classifier achieves AUC $0.81$-$0.90$; 30 minimal pairs yield 42/387 significant combinations; 43/106 metrics replicate across all four models. We connect these observations to three classical matrix-semigroup problems and propose, as a conjecture, that NCTR forces finite-depth transformers toward dynamical regimes where these problems concentrate. NCTR prompts also produce elevated contradictory output ($+34$-$56$ percentage points vs. controls), suggesting practical relevance for understanding self-referential failure modes.
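
The abstract reports significance after FDR correction at q < 0.05 but this page does not spell out the procedure. Assuming it is the Benjamini-Hochberg step-up method (Benjamini and Hochberg appear in the reference list), a minimal sketch looks like this. The function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans, True where the null is rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest 1-based rank k with p_(k) <= (k / m) * q.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_k = rank
    # Reject every hypothesis whose sorted rank is at most max_k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected


# Hypothetical p-values for four metric-model combinations.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```

Because the rule is step-up, a p-value that misses its own threshold can still be rejected when a larger p-value further down the sorted list meets its threshold.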

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

NCTR prompts produce distinct matrix instability in LLMs with large effects and replication, though prompt surface features may still confound the isolation.

Hi, the main thing here is that self-reference by itself does not destabilize these models much. The instability concentrates in prompts that create non-closing truth recursion, showing up as higher attention effective rank, shifts in variance metrics, and more contradictory outputs. This pattern appears across four models with large effect sizes and holds after FDR correction in a majority of the metric combinations they tested.

The per-layer SVD checks and classifier performance add some weight to the claim that the changes are real and not just aggregation noise. They also ran minimal pairs and saw decent cross-model replication in 43 metrics. That systematic comparison across a 14-level hierarchy and multiple temperatures is the freshest part of the work. The connection to classical matrix-semigroup problems is presented as a conjecture, which keeps it honest.

The soft spot is still the prompt construction. Even with the hierarchy and pairs, NCTR prompts could differ in token length, nesting, or lexical choices that affect attention rank independently. The abstract does not spell out explicit surface-feature matching, so that remains a possible confound despite the large d values. The analysis stays observational with no derivation from the model parameters themselves.

This is useful for people doing LLM interpretability and reliability work who want concrete internal signatures for self-reference failures. The empirical patterns and replication checks are solid enough that it deserves peer review so the methods and exact prompts can be examined in detail.

Referee Report

2 major / 3 minor

Summary. The manuscript investigates how self-referential inputs affect internal matrix dynamics in LLMs by computing 106 scalar metrics (including attention effective rank and variance statistics) across four models (Qwen3-VL-8B, Llama-3.2-11B, Llama-3.3-70B, Gemma-2-9B), a 14-level prompt hierarchy, and three temperatures. It reports that instability is concentrated in non-closing truth recursion (NCTR) prompts, which produce elevated attention effective rank and other collapse-related metrics with large effect sizes (Cohen's d = 3.14 to 3.52 vs. stable self-reference), 281/397 significant metric-model differences after FDR correction, per-layer SVD disruption at every sampled layer, a classifier with AUC 0.81-0.90, and elevated contradictory outputs (+34-56 pp). Stable self-reference and meta-cognitive prompts are more stable, comparable to factual controls. The work connects the observations to matrix-semigroup problems via an explicit conjecture.

Significance. If the NCTR isolation is robust, the scale of the analysis (300 prompts, 106 metrics, multi-model replication with 43/106 metrics consistent across all four models, 30 minimal pairs, and per-layer verification) provides a strong observational foundation for linking specific prompt structures to matrix-level reorganization in transformers. The large effect sizes and FDR-controlled results across architectures strengthen the empirical contribution, while the conjecture offers a potential bridge to classical linear-algebra problems even if left speculative.

major comments (2)

[Prompt hierarchy and NCTR definition (abstract and §3)] The central attribution of matrix instability to NCTR (non-closing truth recursion) rather than prompt surface features requires that the 14-level hierarchy and minimal-pair design isolate the recursive property. The manuscript does not report explicit statistical matching or regression controls for token length, syntactic nesting depth, or lexical properties between NCTR and stable self-reference conditions. Although 30 minimal pairs and cross-model replication (43/106 metrics) are presented, these do not substitute for surface-feature controls; the reported d > 3 effects could still contain contributions from such confounds. This is load-bearing for the causal interpretation of the NCTR contrast.
[Methods and metric definitions (§4)] The per-layer SVD analysis rules out simple aggregation artifacts, but the manuscript provides no explicit description of how the 106 scalar metrics are derived from the attention and key/value matrices (e.g., exact formulas for effective rank and kurtosis) or whether metric selection was pre-specified versus post-hoc. Without this or access to the metric definitions and raw per-layer matrices, it is difficult to evaluate whether the 281/397 significant combinations reflect genuine dynamical differences or metric-construction choices.
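
To make the second comment concrete, here is what an explicit metric definition might look like for kurtosis. This is a sketch assuming population excess kurtosis (fourth standardized moment minus 3); the paper's actual "variance kurtosis" metric is not defined on this page, and the input values are hypothetical.

```python
import statistics


def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2**2 - 3, where m_k are central moments."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        raise ValueError("kurtosis is undefined for constant input")
    return m4 / (m2 ** 2) - 3.0


# Hypothetical per-head variance values with one heavy outlier.
print(excess_kurtosis([0.0, 0.0, 0.0, 0.0, 10.0]))
```

Whether the moments are population or sample estimates, and what the values are taken over (heads, layers, or tokens), are exactly the choices the referee asks the manuscript to state.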

minor comments (3)

[Abstract] The abstract is information-dense; consider moving some quantitative details (e.g., exact counts of significant combinations) to a results table or bullet list for readability.
[Introduction] Ensure the first use of 'NCTR' and 'attention effective rank' includes a concise parenthetical definition even if expanded later.
[Discussion] The conjecture linking NCTR to matrix-semigroup problems is clearly labeled as such; a short appendix sketching the three classical problems and the observed metric mappings would help readers assess its plausibility without altering the empirical claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation of our empirical findings on NCTR-induced matrix instability. We respond to each major comment below.

Referee: [Prompt hierarchy and NCTR definition (abstract and §3)] The central attribution of matrix instability to NCTR (non-closing truth recursion) rather than prompt surface features requires that the 14-level hierarchy and minimal-pair design isolate the recursive property. The manuscript does not report explicit statistical matching or regression controls for token length, syntactic nesting depth, or lexical properties between NCTR and stable self-reference conditions. Although 30 minimal pairs and cross-model replication (43/106 metrics) are presented, these do not substitute for surface-feature controls; the reported d > 3 effects could still contain contributions from such confounds. This is load-bearing for the causal interpretation of the NCTR contrast.

Authors: The 30 minimal pairs were constructed to isolate the non-closing recursion property while matching token length, syntactic structure, and lexical content as closely as possible within the constraints of the prompt hierarchy. We also report cross-model replication for 43 metrics and per-layer consistency. However, we acknowledge that we did not perform explicit regression-based controls for residual surface features in the submitted version. In the revision, we will add statistical comparisons of token lengths and nesting depths across conditions, along with a linear regression analysis controlling for these variables when predicting the key metrics. This will directly test whether the large effect sizes persist after accounting for potential confounds, strengthening the causal attribution to NCTR.
Revision: yes
Referee: [Methods and metric definitions (§4)] The per-layer SVD analysis rules out simple aggregation artifacts, but the manuscript provides no explicit description of how the 106 scalar metrics are derived from the attention and key/value matrices (e.g., exact formulas for effective rank and kurtosis) or whether metric selection was pre-specified versus post-hoc. Without this or access to the metric definitions and raw per-layer matrices, it is difficult to evaluate whether the 281/397 significant combinations reflect genuine dynamical differences or metric-construction choices.

Authors: We agree that greater transparency in metric derivation is necessary. The 106 metrics were pre-specified based on established measures of matrix stability in the linear algebra and neural network literature prior to data collection. In the revised manuscript, we will expand §4 with a table and formulas detailing each metric, including the precise definition of effective rank (normalized sum of singular values) and kurtosis of the attention weight distributions. We will also make the full extraction code and a subset of the raw per-layer attention matrices available as supplementary material to enable independent verification. This addresses the concern about potential metric-construction artifacts.
Revision: yes
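
The regression control described in the first response could be sketched as follows: regress a key metric on an NCTR indicator plus token length and nesting depth, then check whether the NCTR coefficient survives. This is a minimal illustration using ordinary least squares via the normal equations. The helper `ols`, the variable names, and all data are hypothetical, and the simulated rebuttal does not specify the actual model.

```python
def ols(X, y):
    """Least-squares coefficients from the normal equations (X^T X) b = X^T y."""
    k = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(k)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Hypothetical prompts: (is_nctr, token_length, nesting_depth).
rows = [(0, 20, 1), (0, 30, 2), (0, 25, 1), (1, 22, 2), (1, 35, 3), (1, 28, 1)]
# Invented metric built so that the true NCTR effect is exactly 2.0.
metric = [1 + 2 * c + 0.1 * t + 0.5 * d for c, t, d in rows]
X = [[1.0, c, t, d] for c, t, d in rows]
intercept, nctr_effect, per_token, per_level = ols(X, metric)
print(round(nctr_effect, 6))
```

If the NCTR coefficient shrinks toward zero once length and depth enter the model, the surface-feature confound raised by the referee would explain much of the raw effect; if it stays large, the attribution to NCTR is strengthened.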

Circularity Check

0 steps flagged

No circularity: purely observational empirical measurements with independent metrics

The paper performs statistical comparisons of 106 scalar metrics (including attention effective rank, variance kurtosis, per-layer SVD) across categorized prompt hierarchies (NCTR vs. stable self-reference vs. controls) on multiple models. No derivation chain exists that reduces any reported difference or conjecture to quantities defined by the inputs themselves; the NCTR label is a prompt categorization criterion, not a fitted parameter, and the matrix-semigroup link is explicitly presented as an unproven conjecture rather than a derived result. All claims rest on direct computation and FDR-corrected tests, with no self-definitional loops, fitted-input predictions, or load-bearing self-citations.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The central claim rests on the validity of the prompt taxonomy and the sufficiency of the chosen scalar metrics to detect dynamical instability; no new mathematical entities are postulated beyond the NCTR label.

free parameters (2)

Temperature set: discrete values T in {0.0, 0.3, 0.7} chosen for the experiment.
Prompt hierarchy depth: 14-level hierarchy constructed to isolate NCTR.

axioms (2)

[standard math] Cohen's d and FDR-corrected tests are appropriate for comparing the 106 metrics across conditions. Used to declare 281 significant combinations.
[domain assumption] The 106 scalar metrics plus SVD capture the relevant matrix-level dynamics induced by prompt type. Foundation for all reported differences and per-layer confirmation.

invented entities (1)

NCTR (non-closing truth recursion) [no independent evidence]
Purpose: category of prompts whose truth-value computation has no finite-depth resolution.
Introduced to distinguish the unstable class from other self-referential prompts.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages

[1] Anthropic. Emergent introspective awareness in large language models. Technical report, October 2025. https://www.anthropic.com/research/introspection

[2] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Royal Statist. Soc. B, 57(1):289–300, 1995.

[3] C. Berg, D. de Lucena, and J. Rosenblatt. Large language models report subjective experience under self-referential processing. arXiv:2510.24797, 2025.

[4] M. A. Berger and Y. Wang. Bounded semigroups of matrices. Lin. Alg. Appl., 166:21–27, 1992.

[5] F. J. Binder, J. Chua, T. Korbak, H. Sleight, J. Hughes, R. Long, E. Perez, M. Turpin, and O. Evans. Looking inward: Language models can learn about themselves by introspection. In Proc. ICLR, 2025. arXiv:2410.13787

[6] V. D. Blondel and J. N. Tsitsiklis. The boundedness of all products of a pair of matrices is undecidable. Syst. & Control Lett., 41(2):135–140, 2000.

[7] Z. P. Dadfar. When models examine themselves: Vocabulary-activation correspondence in self-referential processing. arXiv:2602.11358, 2026.

[8] V. Dwarka and A. Blom. Not all who wander are lost: Hallucinations as neutral dynamics in residual transformers. OpenReview, submitted to ICLR 2026, 2025. https://openreview.net/forum?id=fDfctZ8Fhg

[9] W. Merrill and A. Sabharwal. The parallelism tradeoff: Limitations of log-precision transformers. Trans. ACL, 11:531–545, 2023.

[10] A. Naphade, S. Bhargav, S. Lim, and M. Shah. Me, myself, and : Evaluating and explaining LLM introspection. arXiv:2603.20276, 2026.

[11] nostalgebraist. interpreting GPT: the logit lens. Alignment Forum, 2020.

[12] J. Ouaknine and J. Worrell. Decision problems for linear recurrence sequences. In RP 2012, LNCS, pp. 21–28. Springer, 2012.

[13] J. Ouaknine and J. Worrell. Ultimate positivity is decidable for simple linear recurrence sequences. In ICALP 2014, LNCS, pp. 330–341. Springer, 2014.

[14] M. S. Paterson. Unsolvability in 3 × 3 matrices. Stud. Appl. Math., 49(1):105–107, 1970.

[15] J. Queipo-de-Llano, N. Arroyo, F. Barbero, Y. Dong, M. Bronstein, Y. LeCun, and R. Shwartz-Ziv. Attention sinks and compression valleys in LLMs are two sides of the same coin. In Proc. ICLR, 2026. arXiv:2510.06477

[16] P. Suresh, J. Stanley, S. Joseph, L. Scimeca, and D. Bzdok. From noise to narrative: Tracing the origins of hallucinations in transformers. In NeurIPS, 2025. arXiv:2509.06938

[17] A. Tarski. The concept of truth in formalized languages. 1933. English translation in Logic, Semantics, Metamathematics, Clarendon, 1956.

[18] T. Thrush et al. I am a strange dataset: Metalinguistic tests for language models. In Proc. ACL, 2024.

[19] F. Wilcoxon. Individual comparisons by ranking methods. Biometrics Bull., 1(6):80–83, 1945.
